Canvas Data FAQ

Overview
General Canvas Data Questions
Canvas Data Implementation Questions
Canvas Data Portal UI Troubleshooting Questions
Redshift Troubleshooting Questions
CLI Tool Troubleshooting Questions
Specific Data Questions

Overview

Canvas Data is a service from Canvas that will provide schools with optimized access to their data for reporting and queries.

This document is meant to serve as a place to find answers to common questions you may have about Canvas Data. Instructure will attempt to keep this document updated as trends in questions arise that are not addressed within this FAQ. Please feel free to add any additional questions as a comment, and we will do our best to answer them and add to this document.

General Canvas Data Questions

Question	Answer
How do I add another admin to the Canvas Data Portal?	How do I manage Canvas Data admin users?
Where do I find the data schema?	Canvas Data API Documentation A JSON version can also be downloaded from the API with the endpoint: GET /api/schema/latest
What are 'dim' and 'fact'?	Canvas Data uses Kimball methodology to create a star schema. "dim" stands for dimension and "fact" stands for fact. More info on star schemas and Kimball methodology can be found at https://en.wikipedia.org/wiki/Dimensional_modeling Essentially, facts provide more general detail about the item. For example, you may use a fact to count enrollments. Meanwhile, the dimensions provide insight into the data about an item, meaning a dimension could be used to show how many enrollments are "Teacher" enrollments.
Is historical data available for the requests table?	Yes—Historical Data is loaded the second Wednesday of every month following the activation of Canvas Data in Canvas. Historical requests (page view) data needs to be uploaded separately after Canvas Data has been enabled. Schools that are enabled with Canvas Data will get their historical data loaded in batch sometime during the first month of activation. Once historical data has been loaded, it doesn't need to be updated again and again. Note: If you have missed your historical load, please contact your CSM to have the data reloaded.
How far back does the historical data go?	With the exception of the requests table, all historical data since the beginning of the customer's subscription is included. For requests data, historical data is loaded starting with 2014-03-01 or the beginning of the customer's subscription, whichever is later.
What time are the daily files/Redshift/API available for download?	Typically, the dump completes around 2:00 AM, Mountain Time. This data is the same across the flat files available in the Canvas Data Portal UI, the data used for the population of Redshift, and files available via the Canvas Data API endpoints. Note: This time is not guaranteed as many external factors may cause the load to be later in the morning.
Does each day's flat file contain data only for that day, or does it contain historical data with the new data added?	Excluding the requests table, each file will continue to grow as we continuously append the previous day's data to the table. We do not provide deltas for these files. Due to the nature and potential size of the requests table, we only provide the previous day's data.
What is the data model for managing transactions?	There are no transactions. Flat files are a complete refresh except for requests which is append-only. Redshift is read-only.
The data exports have a ".gz" extension on them when downloaded through the UI, why?	The files are in GZIP format. There are several open-source and commercial tools to unpack these files on both Windows and Mac OS. A popular free tool is 7zip (https://www.7-zip.org/). We do not publish the files in any other formats.
How do I open flat files?	The flat files are tab-delimited. They can be opened with Excel, a text editor, Tableau, or any other program that can open ".txt" files. Once you open the raw .txt file, you will need to reference our schema documentation to add headers. This can be avoided by using the API to download the data into a data warehouse. Instructure has also built an open-source command line tool capable of adding these headers. Use the link below for instructions on installation and usage. You will need to download your data with the CLI, and then use the "unpack" command. GitHub - instructure/canvas-data-cli
Why can't headers be generated for the columns in the Canvas Data CSV export?	The primary reason is that most of the tables have more than one file. If we put headers at the top of one of the files (or all of them), it makes it more cumbersome to use simple command line tools like cat, awk, grep, wc, etc, to manipulate the data. Some customers wanted this and some did not. The choice was made to not include them. The Canvas Data API can be used to get a JSON-based schema that can be used to generate headers. The CLI can also be used for those customers wanting headers in their regular flat files.
Is there a way I can download all the data files at once?	Yes, this is what the API is meant for (Canvas Data Portal API). A CLI tool is also available: https://www.npmjs.com/package/canvas-data-cli
Where is the API documentation?	Canvas Data Portal
What exactly is "Hosted Data Services"?	Hosted Data Service is a service that allows Instructure to manage Canvas Data for you by automating the loading of Canvas Data into an Amazon Redshift instance. While using this service, Instructure ensures that your data stays up to date, handles any schema/process changes, and handles the management of the large data set for you. The service also allows your team to focus on querying the data. This can be done by using tools that allow for ODBC connections. For more information and pricing, please contact your CSM.
Is there any sort of orientation discussion for Canvas Data?	If you need an orientation to Canvas Data, please reach out to your CSM to schedule one.
Is there consulting for Canvas Data?	Yes—Canvas Data consulting is available for assistance in understanding the Canvas Data schema and creating reports based on the Canvas Data schema. For more information about Canvas Data consulting and pricing, please contact your CSM.
What data is currently not in Canvas Data?	The data available in Canvas Data will be a fraction of the data available in the main Canvas API https://canvas.instructure.com/doc/api/all_resources.html endpoints. Rather than list all of the items not available in Canvas Data, it's best to review the Canvas Data schema to see the data that is available. These items are not in Canvas Data and are asked about most frequently: "Total Activity", as seen in the "Users" tab within a Canvas course; Syllabus (Calendar portion); Assignment Rubrics; Quiz Question Answer Submissions; Calendar Events & Scheduler; ePortfolios.
Some of the tables listed in the Canvas Data schema are not in my daily data dumps. Why is that?	We do not provide data for empty tables. If you are sure that there should be data in your missing tables, please reach out to our support staff (canvasdatahelp@instructure.com).
Why do I sometimes see duplicate files for the same date & time in my Canvas Data Portal?	This is a known issue in Canvas Data that will happen from time to time. One of our jobs gets a false negative health check once in a while. As a result, we start a new / duplicate job to eliminate any possibilities for missing files. It is a very rare occasion and should resolve on its own with the next run. Note: While the rows are in a different order, the content is the same.
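
For readers who want headers without using the CLI's "unpack" command, the answers above about GZIP files, tab-delimited flat files, and the JSON schema can be combined into a minimal Python sketch. It rests on assumptions that are not confirmed in this FAQ: that the schema JSON has already been downloaded and looks like {"schema": {<key>: {"tableName": ..., "columns": [{"name": ...}]}}}, and that the files are UTF-8 encoded. The function names table_headers and read_flat_files are hypothetical.

```python
import csv
import gzip


def table_headers(schema, table_name):
    """Return the column names for table_name from a schema dict.

    ASSUMPTION: the schema layout used here is an illustration,
    not a layout confirmed by this FAQ.
    """
    for entry in schema["schema"].values():
        if entry["tableName"] == table_name:
            return [column["name"] for column in entry["columns"]]
    raise KeyError(table_name)


def read_flat_files(paths, headers):
    """Yield one dict per row from gzipped, tab-delimited part files.

    A table is often split across several files, so every part is
    read in turn and the same headers are applied to each row.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                yield dict(zip(headers, row))
```

Because the headers are applied in memory, the downloaded files stay header-free and remain easy to concatenate or load into a database.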

Canvas Data Implementation Questions

Question / Issue

Possible Solution

What is required to implement the Canvas Data Integration? How long does it take?

Canvas Data is a service that provides each client access to key Canvas data points, delivered in a purposefully optimized form for queries and reports. If you are interested, contact your Customer Success Manager or Implementation Consultant and they can schedule time for the implementation.

What does the Canvas Data integration process look like?

Canvas' implementation team will lead you through the Canvas Data integration process after you have made the request. The process for enabling a customer to use Canvas Data involves the following steps:

Notify your Customer Success Manager or Implementation Consultant that you would like to enable Canvas Data for your instance.
Your Implementation Consultant will meet with you and find out what admin user you want to designate as your primary Canvas Data administrator.
The Implementation Consultant will add the primary Canvas Data administrator to the Canvas Data system and install the External App into your Canvas account.

What are the considerations for Canvas Data admin?

You must designate a Canvas Data Administrator to manage who is given access to the Canvas Data dataset, which is the entire dataset in Canvas including personal information about users. They will also manage information about which IP address ranges can access the database. The Canvas Data Administrator must be an Account Admin. The individual should be a Canvas account admin who understands the data governance procedures and policies for the organization and also has enough technical proficiency to understand IP address ranges and database connection strings. The Canvas Data Administrator will receive access to the documentation to generate Canvas Data files and downloads.

Canvas Data Portal UI Troubleshooting Questions

Question / Issue

Possible Solution

Error when trying to access the Canvas Data LTI: Insufficient Access to use LTI Tool

The user must also be an account admin at the root level. If you are an account admin and do not have access, you will need to contact another account admin who has access to the Canvas Data Portal to add you.

I see a user I did not add inside the Canvas Data Portal LTI.

Any user capable of viewing the main Admin account page within Canvas will also have access to click on the "Canvas Data Portal" link. When a user clicks on this link, that user is automatically added as a user without any permissions to the Canvas Data Portal.

These users remain in the user list without any permissions until an admin removes them.

Redshift Troubleshooting Questions

Question / Issue	Possible Solution
I don't see any tables in Redshift.	Ensure that you are connected to the correct database name. The name will be the same as your Canvas instance. Example: Your Canvas URL is "someschool.instructure.com", your database name will be "someschool"
What is the Redshift database name?	The database name is the same as your Canvas instance name. You can also derive what it is by looking at your Redshift hostname through the Canvas Data Portal. Example: Your Canvas URL is "someschool.instructure.com", your database name will be "someschool" Example: If the hostname is "xyzu-redshift.prod.inshosteddata.com", then the database name would be "xyzu"
I cannot connect to Redshift.	This is most likely an IP whitelisting issue. Please ensure that you have added your computer's IP address to the whitelist area within the Canvas Data Portal under "Credentials". If you want to eliminate whitelisting as a possible source of the problem, you can add an entry to the whitelist to enable all IP addresses with 0.0.0.0/0. If the connection goes through after this, it was a whitelisting issue. If not, the next most likely cause is an ODBC/JDBC driver issue.
My Redshift username and password are not working.	Try regenerating your credentials. If this still does not work, please reach out to your CSM.
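
The database-name rule described above can be sketched in a few lines of Python. This is only an illustration of the two examples given in the table: it takes the first label of the hostname and strips a trailing "-redshift". The function name database_name is hypothetical, and hostnames that follow a different pattern are outside what this sketch handles.

```python
from urllib.parse import urlparse

REDSHIFT_SUFFIX = "-redshift"


def database_name(host_or_url):
    """Derive the Redshift database name from a Canvas URL or Redshift hostname."""
    # Accept both bare hostnames and full URLs by forcing a network location.
    target = host_or_url if "//" in host_or_url else "//" + host_or_url
    hostname = urlparse(target).hostname
    if not hostname:
        raise ValueError(f"no hostname found in {host_or_url!r}")
    first_label = hostname.split(".")[0]
    # "xyzu-redshift" -> "xyzu"; a plain Canvas label is returned unchanged.
    if first_label.endswith(REDSHIFT_SUFFIX):
        first_label = first_label[: -len(REDSHIFT_SUFFIX)]
    return first_label


print(database_name("someschool.instructure.com"))            # someschool
print(database_name("xyzu-redshift.prod.inshosteddata.com"))  # xyzu
```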

CLI Tool Troubleshooting Questions

Question / Issue	Possible Solution
The CLI tool is failing, and I don't know why.	We recommend generating log files and filing a support case. To generate logs, simply run the CLI again with the extra argument of "-l debug" (minus the quotes). The output will be your log file(s).
How do I know what commands I can run?	All commands from the CLI tools can be listed by simply running "canvasDataCli --help". If you need help with a specific command, you can add it to the help command. E.g. "canvasDataCli sync --help".
How can I update the CLI?	You can update the CLI by running: "npm update -g canvas-data-cli".
How do I know what version of the CLI I'm on?	You can check your version of the CLI by running: "canvasDataCli --version".

Specific Data Questions

Question	Answer
How do I update the gender / birthdate / country code in Canvas?	These fields are no longer used. They will likely be deprecated in future versions of Canvas Data.
Can I get the first name, last name separated out?	In the user_dim, both "name" and "sortable_name" are available and can be "split" in a manner that is most efficient for your reporting needs. The sortable_name would help manage users with multiple last names as it is in the format of "last name, first name".
Why are some of the user_ids negative?	This is normal. We have obscured the user IDs so that join keys can be shared without sharing actual Canvas IDs with users. The same user_id is used across the other tables where the user_dim's "id" is referenced. Some database systems do not support unsigned 64-bit integers, so we went with signed integers.
Where are student grades located?	Student grades are located in the course_score_fact table (previously in enrollment_fact, but the fields within enrollment_fact were deprecated).
Can Canvas Data help with how faculty are using Canvas? 1) Whether faculty are using Canvas for their courses or not 2) The tools and features faculty use in Canvas in their first year 3) Growth in usage and use of functionality in year two.	One way to do this would be to determine the following: courses with published assignments by enrollment term; courses with published discussions by enrollment term; courses with published quizzes by enrollment term; external tool activations by course and enrollment term. This would give an indication of the extent to which faculty are using courses by determining how many assignments, discussions, quizzes, and external tool activations exist for each course. The numbers for year one and year two could then be compared to find the difference in usage by looking at the enrollment_term_dim.
I'm looking for a "data extract date" or "data as of date" so that we know when the data was loaded.	The best we can say is that the date associated with the latest dump is the data extract date.
What data is in the course_ui_navigation tables?	These tables represent the navigation settings that have been chosen by instructors for different courses.
What exactly does "pseudonym" mean in Canvas?	In Canvas, users can have one or more logins. The table with information about logins and user SIS IDs is called pseudonyms in the underlying Canvas database.
Can I obtain login information for every user?	Using the pseudonym_fact and pseudonym_dim tables, you will be able to obtain login information and SIS IDs (if SIS IDs have been added) for each user.
Do the assignment tables include assignments, quizzes, and discussions?	Yes. Using the "submission_types" value in the assignment_dim will allow you to filter data based on the type of "assignment" it is. Assignment = online_text_entry, online_url, media_recording, online_upload, external_tool, on_paper, none, not_graded. Quiz = online_quiz. Discussion = discussion_topic.
Is any kind of record kept of the communication that occurs in conversations?	This would be within conversations. The associated tables to use are conversation_message_participant_fact, conversation_dim, and conversation_message_dim. Please review the schema documentation to see the data available in those tables.
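
The "submission_types" mapping above can be expressed as a small Python helper. This is a sketch under one assumption that the FAQ does not state: that a "submission_types" value may hold several types separated by commas. The function name classify_assignment and the "Unknown" fallback are hypothetical.

```python
# Mapping taken from the "submission_types" answer in the table above.
QUIZ_TYPES = {"online_quiz"}
DISCUSSION_TYPES = {"discussion_topic"}
ASSIGNMENT_TYPES = {
    "online_text_entry", "online_url", "media_recording", "online_upload",
    "external_tool", "on_paper", "none", "not_graded",
}


def classify_assignment(submission_types):
    """Return "Quiz", "Discussion", "Assignment", or "Unknown".

    ASSUMPTION: submission_types is a comma-separated string.
    """
    types = {part.strip() for part in submission_types.split(",") if part.strip()}
    if types & QUIZ_TYPES:
        return "Quiz"
    if types & DISCUSSION_TYPES:
        return "Discussion"
    if types & ASSIGNMENT_TYPES:
        return "Assignment"
    return "Unknown"


print(classify_assignment("online_quiz"))                      # Quiz
print(classify_assignment("online_text_entry,online_upload"))  # Assignment
```

Quiz and discussion types are checked first so that a row is never counted as a plain assignment when one of those markers is present.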

jago_brown · ‎03-13-2017

Dear CanvasData team,

Re: Access to a complete CanvasData dataset - from GoLive date

What triggers CanvasData or the files being pruned/deleted? is it total size of all files/tables, number of files, a date?

e.g. for Requests files, what triggers some of these being made un-available via the Canvas Data API ?

This kind of information can help customers verify they are not missing data and validate their analytics

Kind regards

Jago

smccann · ‎03-15-2017

Hi Jago,

What triggers CanvasData or the files being pruned/deleted? is it total size of all files/tables, number of files, a date?

While the file links are available forever to download, we remove the data from the files after 2 months. This should only be affecting your request files since those are a daily delta. All other files in Canvas data are a growing historical record.

Every other month, on the second Friday we provide a historical requests data dump that will include ALL of your requests since your instance started, or March 1st, 2014.

Does this answer your question?

jago_brown · ‎03-15-2017

Hi Sydney,

Yes this mostly answers my question. So at this refresh point every 2 months, the Requests files should start growing again from empty files with new data generated from this refresh point? This is what I had previously understood, but it looks as though the CLI Tool has been able to download 8+ months of Requests, when I expected only 2 months - which confused me.

...So maybe I am not clear on the differences if any?, between data displayed in the web Canvas Data portal - and data pulled from the CanvasData API with the CLI Tool:

# How does the CLI Tool know how to disregard the historical Request data dump when it syncs? (Does it compare the row ID of all Requests it has previously downloaded?)

# Requests files seem to have an ordinal number count in the middle of the file name e.g.: "requests-00000-2fxxxxxx.gz" - "requests-00001-86xxxxxx.gz" - "requests-00002-2bxxxxxx.gz". In the web Canvas Data portal for our Canvas instance there are typically only 3 files numbered 0-2 in each daily dump, but the CLI Tool has downloaded many more files numbered 0-19+. Is this because the CanvasAPI and web Canvas data portal provide different (independent) sets of files from the CanvasData Requests table? A schematic diagram providing an overview of how all this works, would help me at least.

Kind regards

Jago

a1222252 · ‎03-16-2017

Hi Sydney,

A very useful document. A few comments if I may:

The data exports have a .gz extension on them when downloaded through the UI, why?

You can also use canvasDataCli -unpack to unzip the files.

Why can't headers be generated for the columns in the Canvas Data CSV export?

As you say, canvasDataCli -unpack unzips the gzip file and adds field headers. In the case of data delivered as multiple gzip files, it also concatenates the unzipped files. However, the header insertion causes problems:

1. With canvasDataCli version 0.4.1 the header record is added at the beginning of the file, but a blank line is also added for each file that is concatenated. These can be removed once the data has been extracted into a database.

2. With canvasDataCli version 0.5.2 the header record is added at the beginning of the file but it does not include a carriage return so that the first record is the header concatenated with the first data record. Depending on the extract tool, this means that the first data record can be lost when extracted into the database. However, this version does suppress the blank lines between concatenated files.

I suspect that the easiest way to resolve this is to include a switch in canvasDataCli -unpack to allow headers to be generated or not.

What data is currently not in Canvas Data?

Modules data became available in 1.14.0.

Where are student grades?

Score fact and dim became available in 1.15.0 and the release notes indicate that grades are no longer in enrollment fact.

I'm looking for a "data extract date" or "data as of date" so that we know when the data was loaded

The data dump is performed daily at midday UTC and typically becomes available at around 22:00 UTC on the same day. With the exception of requests data this includes data to about 03:20 UTC on the morning before the midday dump.

Requests data includes data to 23:59:59 UTC two days prior to the midday dump.

So for example the dump performed at midday UTC on 16th March will include data to about 03:20 UTC on 16th March and requests data to 23:59:59 UTC on 14th March.

To confirm this once the data has been extracted into a database, max(timestamp) from the requests table and max(last_request_at) from the pseudonym dim table will provide the required information.

Can I obtain login information from every user?

Pseudonym fact provides the number of logins and failed logins, pseudonym dim provides current and last login details.

To obtain full details of all user logins, query the requests table where the url contains 'login'.

Hope this helps.

Stuart.

ccoan · ‎03-16-2017

Hey Stuart,

I know I'm not Sydney, but I'd be happy to help step in here.

- You can also use canvasDataCli -unpack to unzip the files.

Yes you can, although since it does other things we don't always want to suggest it as a starting point.

- With canvasDataCli version 0.4.1 the header record is added at the beginning of the file, but a blank line is also added for each file that is concatenated. These can be removed once the data has been extracted into a database.

We don't recommend running out of date cli versions. In fact we actively push for people to absolutely not do this. If there's a problem with a newer version of the CLI we should fix it. This issue was brought up and should be fixed in the latest version.

- With canvasDataCli version 0.5.2 the header record is added at the beginning of the file but it does not include a carriage return so that the first record is the header concatenated with the first data record. Depending on the extract tool, this means that the first data record can be lost when extracted into the database. However, this version does suppress the blank lines between concatenated files.

- I haven't heard of this, and don't see it on a quick glance through of my data files on the latest CLI. Please submit a support case to canvasdatahelp@instructure.com if this is happening on the latest version, and if not, please upgrade.

- The unpack command is not recommended for loading into a database since it adds in headers. This is one of the reasons the files by default don't come with headers. We expect the common person will need to load them into a database. Using gzip/7zip/<insert compression program here> from the CLI is recommended over unpack since it doesn't have those headers. The unpack command was added specifically to help people who needed those header rows (e.g. people using Excel, tableau, etc).

- Modules data became available in 1.14.0.

I'll make sure we get that updated, thanks!

- Score fact and dim became available in 1.15.0 and the release notes indicate that grades are no longer in enrollment fact.

Actually if you look at our announcement: HERE. We will be backfilling the data back into this table. We screwed up by moving this data and breaking a lot of people's reports with no warning. This was a huge problem with our process, and something we're devoted to fixing. As such, the first thing we're going to do is backfill the data into these tables as a migration plan for schools who have reports running on these tables.

- The data dump is performed daily at midday UTC and typically becomes available at around 22:00 UTC on the same day. With the exception of requests data this includes data to about 03:20 UTC on the morning before the midday dump.
Requests data includes data to 23:59:59 UTC two days prior to the midday dump.
So for example the dump performed at midday UTC on 16th March will include data to about 03:20 UTC on 16th March and requests data to 23:59:59 UTC on 14th March.
To confirm this once the data has been extracted into a database, max(timestamp) from the requests table and max(last_request_at) from the pseudonym dim table will provide the required information.

While this information is generally true, we don't want to commit to the data always becoming available at around 22:00 UTC. Sometimes (although it's rare) there will be a problem, which is why we have the "24 to 36 hour disclaimer". Also, with requests, sometimes it's a historical load and actually contains more than the two days prior. We always recommend checking for min/max timestamps yourself, but we don't want to commit to one time until we're sure we can absolutely hit it 100% of the time.

- Pseudonym fact provides the number of logins and failed logins, pseudonym dim provides current and last login details. To obtain full details of all user logins, query the requests table where the url contains 'login'.

Yes you can always use the requests table to dig in more, but this statement is still true that the pseudonym fact/dim contain a majority of the information here.

Thank you very much for your feedback. Hope I was able to further clarify/confirm your suggestions/questions.

a1222252 · ‎03-16-2017

Hi Eric,

Thanks for the feedback.

We use Oracle ODI to extract the text files into an Oracle database. Having headers is great for this because the ODI reverse engineering process picks up the header details from the file rather than having to type them in.

I had to revert from canvasDataCli 0.5.2 to 0.4.1 due to the issue with concatenation of the header record with the first data record. I’ll have a look to see what the latest version is and if the issue still exists I’ll raise a support request. I did do this at the time but didn’t hear back.

In relation to grade data, that sounds good. Probably should add this to the release notes.

In relation to dump timing, I think it’s valuable to understand the normal behaviour. Late dumps typically coincide with schema changes, but not always. Late dumps are readily visible from within the application. I generally check that manually ahead of our automated download / extract process.

Regards,

Stuart.

tue58800 · ‎02-27-2019

Some content in this document contradicts recent release notes. Namely, whether Outcomes are or are not included in Canvas Data.
See https://community.canvaslms.com/docs/DOC-8513#jive_content_id_What_data_is_currently_not_in_Canvas_D...

and

https://community.canvaslms.com/docs/DOC-15973-canvas-data-release-notes-2018-12-11

Was it removed after being added, or was this document not edited accordingly?

Thanks.

awwolfe · ‎02-07-2020

Hi. From the above chart: "The Canvas Data API can be used to get a JSON-based schema that can be used to generate headers." Has anybody done this? It is not clear from the Canvas documentation how to get this JSON schema data.
All I'm looking to get is a file with table name and column names.

Andrew Wolfe
