Canvas Data 2 Early Access

Instructure
Instructure
4 28 1,905

We recently shared some exciting news about our new product addition in the Canvas Data User Group: Canvas Data 2. Canvas Data 2 will enable customers to easily find, filter, and understand the variety of Canvas data in a timely manner. The tool will share raw data in multiple formats grouped in logical datasets. 

Early Access

Canvas Data 2 Early Access is a small scale preview mode of Canvas Data 2 in Instructure sandbox settings. Customers are able to run product utilities such as cli with sandbox data. 

Goals

The goal of Canvas Data 2 Early Access is to allow customers to provide early feedback. Additionally, current Canvas Data customers can build their proof of concept Canvas Data migration logic with sandbox data. 

Data Availability

The sandbox data is available to customers from all regions. The sandbox is hosted in the cloud production Canvas environment canvasdata.instructure.com, which is specifically set up for testing Canvas Data 2. Customers do not have direct access to this Canvas URL, but they can request data using the Canvas Data 2 cli tool. The Canvas Data 2 cli outputs tables where data is available, and customers interested in viewing data for specific tables can request data to be added.

Features

Command Line Interface Utility (cli) 

Information about the CLI can be found in data-access-platform-cli.

Credentials 

Please use the following file as is when configuring your cli: config.json.

Please see the instructions on how to use static credentials to run cli in the cli documentation. 

Schema

Canvas Data 2 schema has been finalized and is available in the publicly shared documentation.

Schema versioning is done at the dataset level, where each dataset has a schema version ID.

 

 

 

{
  "account_users": {
    "columns": {
      "account_id": {
        "description": "The unique id of an account. The foreign key for the accounts dataset",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "created_at": {
        "description": "Timestamp of when an account_users record was created",
        "hive_type": "bigint",
        "type": "timestamp without time zone"
      },
      "id": {
        "description": "The unique identifier for the users account association record",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "role_id": {
        "description": "The unique id of a role. The foreign key for the role dataset",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "root_account_id": {
        "description": "The unique id of main Canvas account",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "sis_batch_id": {
        "description": "The unique id of a sis batch import. Used for debugging sis imports.",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "updated_at": {
        "description": "Timestamp of when an account_users record was updated",
        "hive_type": "bigint",
        "type": "timestamp without time zone"
      },
      "user_id": {
        "description": "The unique id of a user. The foreign key for the users dataset",
        "hive_type": "bigint",
        "type": "bigint"
      },
      "workflow_state": {
        "description": "The current state of the account user record. Possible Values: active, deleted",
        "hive_type": "string",
        "type": "character varying"
      }
    },
    "description": "Join table for accounts, users and roles",
    "keys": {
      "foreign": {
        "account_id": [
          "accounts.id"
        ],
        "role_id": [
          "roles.id"
        ],
        "root_account_id": [
          "accounts.id"
        ],
        "sis_batch_id": [
          "sis_batches.id"
        ],
        "user_id": [
          "users.id"
        ]
      },
      "primary": "id"
    },
    "version": 2
  }
}

 

 

 

 

Data

Dummy Data 

All tables listed in the public schema document are available in the Canvas Data 2 sandbox. However, some tables don’t have any data because of lack of feature usage in the sandbox environment. Customers can request data for specific tables using the Comment field at the end of this post.

We enabled a handful of automated tasks in the Canvas Data Canvas account to populate tables with fresh data on a weekly basis. The following four workflows are executed periodically in the account.


Use Case 1: Content Creation

Course : https://canvasdata.instructure.com/courses/1

Early American History

  • Example Teacher creates an assignment
  • Example Teacher creates a graded discussion
  • Example Teacher creates an auto-graded quiz


Use Case 2: Student Online Activity 

Course : https://canvasdata.instructure.com/courses/1

Early American History

  • 10 random students this course will:
    • Submit an assignment
    • Create a discussion entry
    • Submit a quiz
    • Visit course wiki page


Use Case 3: Grading 

Course : https://canvasdata.instructure.com/courses/1

Early American History

  • Example Teacher grades—10 student assignment and discussion submissions
  • Example Teacher comments—10 student submissions


Use Case 4: Communication 

Course : https://canvasdata.instructure.com/courses/1

Early American History

  1. Example Teacher sends group message to 10 students in a course 
  2. Example Teacher sends individual messages to 10 students in a course

Supported Files

In Canvas Data 2 Early Access we support csv data file format for Updates and Snapshots. 

Each file consists of public database table fields as well as Canvas Data 2 database record metadata. 

  1. The duplication of records in tables is a possibility; we recommend not using unique constraint in data import validation logic to avoid errors.
  2. Dataset schema has PK and FK indicators, which are unique in the Canvas database table PK and FK values.
  3. workflow_state field is used to indicate the record state in Canvas database table. When a record is soft-deleted from Canvas, the workflow_state will be set to the `deleted`; status field. The Canvas Data 2 metadata field indicates whether a record has been hard-deleted from database. Only a handful of processes in Canvas permanently purge data records from database, and the status will be set to `deleted` when a record is hard-deleted from Canvas database.
  4. orderId is the Canvas Data 2 metadata value used to indicate the order of the change when it took place in Canvas database. orderId is lexicographically sortable and is available in snapshot and update files. All records in update files are pre-sorted from earliest to latest.

Snapshot Files

Snapshots are collections of account table data; they include all records captured in Canvas LMS for that dataset from the moment the account was established to the moment data was requested . Snapshots are used to seed data in customer data warehouses when Canvas Data 2 is used the first time, after data tables are established the differential or updates files should be used to populate data going forward. 

Differential [Updates] Files

Updates files contain an audit log of all changes that occurred in the selected database table from the moment the client stipulates. By default, the files generate updates from the last 24 hours. 

Globalization ID

Global vs Local record ID 

Canvas Data legacy product we globalize majority of our IDs by running compose_global logic in our ETL: 

 

 

 

 import spark.sqlContext.udf
 udf.register("compose_global", (key: JLong, shardId: Long) => composeGlobal(key, shardId))
 udf.register(
"compose_versioned_global",..

 

 

 

To re-create global Ids, the following logic could be applied to generate global record Ids using Canvas Data 2 records:

 

 

 

shard_id * 10000000000000 + PK[ID] as global_id

 

 

 

shard_id is included in every data file

PK[ID] is the primary key ID of the record in the table

 

 

 

SHARD_OFFSET = 10000000000000L

 

 

 

Globalization of Ids is heavily discouraged by Instructure development. Customer databases frequently undergo database shard splitting, which inadvertently impacts global IDs by changing the original shard_id.

Canvas Data 2 does generate global IDs but only for the user_id of a user that doesn’t reside within a customer database shard. In this case, the user_id will have a foreign shard_id. Cross- shard data sharing is a common use case in Canvas database; users of consortia or large district accounts take courses in multiple Canvas accounts that will lead to their user enrollment records to contain global user_ids. All cross-shard user related records will contain global user_id. To retrieve cross-shard user data, Canvas Data 2 should be installed on all consortia shards and data are synced into one data warehouse.

Map: Canvas Data Legacy to Canvas Data 2 

Canvas Data Legacy supports star data schema where facts and dimensions represent a handful of Canvas database fields. Canvas Data 2 supports Canvas proprietary lightly obstructed database schema, majority of Canvas Data Legacy tables could be re-created by leveraging Canvas Data 2 schema . Please note Canvas Data Legacy relies on a series of custom ETL jobs to construct star schema , the primary logic in the transformation code relies on temporary tables built in memory during data extraction, those logical tables could not be re-created unless the client is running the same ETL code downstream. The mapping document makes an attempt to link all available data points in Canvas Data 2 to Canvas Legacy tables thru a list of basic pseudo sql queries , the code found in the mapping document has not been tested for performance optimization and needs to be evaluated in the clients data context. The general use case for the mapping document is to provide an idea during the Canvas Data Legacy to Canvas Data 2 migration process how one could get data that once was available in the old product. 

Mapping Document

A mapping document is a google sheet type document that has one master sheet that provides a list of Canvas Data Legacy facts and dimensions mapped to Canvas Data 2 list of tables ordered alphabetically.

Each sheet in the document represents one combination of Canvas Data Legacy fact and dimension.

Each fact and dimension fields are mapped to corresponding Canvas Data 2 fields.

The comments for each field provide pseudo sql statements. Not all tables are built with sql code—some use spark or scala to assemble cascading joins, parse and explode yaml table fields. We attempted to document logical steps for the complex rollup tables in code snippets in the comments section. The code statements are partial and will not work as is. 

Map: Canvas Data Legacy to Canvas Data 2

Canvas Data Legacy Sandbox Data Dumps

Canvas Data 2 Early Access Omissions

The following features are not available in Early Access preview mode of Canvas Data 2:

  1. Access to Canvas Data 2 API 
  2. Request table in Canvas Data 2 schema
  3. Catalog Data
  4. New Quizzes Data
  5. Ability to request updates by Order ID
  6. User role based access [ based on a user role in Canvas, user gets access to CD2 data] 

Some tables do not have data in the sandbox environment, don’t hesitate to ask us via community blog post comment to add more data. 

Please address all your questions related to the offering in the blog post comments so we could share knowledge across all users.

 

28 Comments
Surveyor

Hi, where is the config.example.json as mentioned in https://www.npmjs.com/package/data-access-platform-cli? I dont see any link to this json

Instructure
Instructure

Hi @Tr101_1 

Please place the config.json included the blog post into the https://www.npmjs.com/package/data-access-platform-cli

Thanks,

Oxana 

Surveyor

Hi @oxana , Im referring to this below - not sure where this config.example.json is? There is not link to this file. Or am I missing something?

 

Tr101_1_0-1601331934235.png

 

Surveyor

The CLI currently accepts a since date-time when requesting updates to specific table(s).  Is there a way an ‘up-to’ date-time as well?  This would greatly help in automating scheduled tasks that grab updates to tables.

 

Example.

I can fetch updates for the following 4 tables using this command:

  • dap updates users courses accounts submissions --concurrency 4 --since '2020-09-24T00:00:00Z'

This results in 4 following 4 CSV files being created in the updates folder:

  • accounts_2020-09-23-20-00-00_2020-09-29-10-44-11.csv
  • courses_2020-09-23-20-00-00_2020-09-29-10-44-29.csv
  • submissions_2020-09-23-20-00-00_2020-09-29-10-44-32.csv
  • users_2020-09-23-20-00-00_2020-09-29-10-41-15.csv

You will notice that each file has the same since date-time (converted to local TZ) which I specified in my CLI command but each file has a unique 'up-to' date-time which corresponds to the time the file was created.  This means that my next fetch for these files has to be performed by reading the file name, retrieving the unique up-to date-time, converting this local date-time to UTC, and then issuing individual calls to the updates like so:

Example for Users Table Update Fetch

  • retrieve time of 2020-09-29-10-41-15 by parsing file name
  • convert time to UTC -> 2020-09-29-14-41-15Z
  • Issues Fetch -> dap updates users --since '2020-09-29T14:41:15Z'

 I don't think the ability to perform a concurrent group fetch is not possible?

Thanks!

Surveyor

@Tr101_1 , After installation of the DAP node package you will find the config.example.json file inside the  node_modules/data_access_platform_cli/ folder.  You can just copy the whole file to the same folder with the name config.json and then change the credentials inside it.

Note that I installed the DAP NPM package globally (with the -g option) which made things easier. 

 

Instructure
Instructure

@Will1537 

Good suggesting—we will capture it as one of the features to improve usability.
 
Thank you for your feedback!
 
Oxana
Surveyor

@Will1537thanks for your response. Is it possible to run 2 diff versions of the cli client app. We currently using an old version to load the current canvas data - I tried running the install of the latest version however got an error. Do we need to uninstall the current or is the install an in-place upgrade?

Surveyor

@Will1537  - great. I used the -g switch and the install was successful. I am now able to run 2x er of cli concurrently.

Surveyor

Hi Oxana,

The schemas do not match the data found in the snapshots and updates in that the schema does not include the meta-fields.  This complicates the ETL process since the schemas cannot be used to create table definitions (we use AWS GLUE for this) that directly refer to the CSV files that were generated from snapshots and updates.  In order to do this I have to ETL the CSV files and remove all the meta fields that do not show up in the related schema.

Is there a way to retrieve snapshots and updates that do not include meta fields? 

Also, the order of fields in the schema does not match the order of fields in the CSVs.  Is there a way they could be matched up or a sort order field could be added to the schema?

Thanks!

Rob

Surveyor

Hi Oxana,

Once Canvas Data 2 is in production. At any time, we can get the Snapshot files (which contains historical data since Canvas Data 2 went to production) for all the Canvas Data 2 tables right?

How about request table (aka weblogs) can we get the snapshot files for request table since Canvas Data 2 went to production. (i.e) at any point can we get the historical data for request table since Canvas Data 2 went to production. I know we can’t get the request table historical data from Canvas Data legacy (pre-Canvas Data 2) once we switched to Canvas Data 2.

Like Canvas Data Legacy is the table schema and data type will change frequently in Canvas Data 2?

Thanks!

Instructure
Instructure

Hi @pitt 

A snapshot for any Canvas Data 2 tables could be requested at any time. We might introduce some partitioning for large tables so you don't have to pull so much data at once.

Request logs will have a limited time data retention policy—it is tbd at the moment. I envision an ability for someone to request logs based on a specific time period like a month, a day, an hour or use—since and last conditions similar to what is offered for other datasets. I do not believe we will be enabling the download of the entire history of requests via a snapshot file. We will also not support any historical request files we typically run in Canvas Data legacy. As soon as CD2 is enabled for your account we will start collecting requests from your account. All historical requests could be retrieved from Canvas Data legacy till it is no longer available.

Canvas Data 2 follows Canvas source data schema pretty closely so anytime Canvas source database deprecates or adds a new field it has a potential to impact Canvas Data 2 schema. In Canvas Data 2 any time we want to deprecate a field, we will mark it as deprecated in CD2 schema and remove it 90 days after we issue deprecation notice. Any data additions will be announced in the release notes.

Oxana

Instructure
Instructure

@Will1537 

We will be updating our schema with additional metadata fields to match what you see in the files. And no, there isn't a way to get a file that doesn't include metadata fields.
 
Thanks,
Oxana
Learner II

Is it possible to add lti_resource_links data to the early access dataset?

It would be very valuable to see this alongside the context_external_tools table in order to track LTI tool usage for things like assignments.

Thanks!

Instructure
Instructure

Hi, everyone,

We recently discovered the cli doesn’t support node v14+. Please use any versions below to run cli; we will resolve this situation in our CD2 Beta offering.

Thanks,

Oxana

Surveyor

Hi @oxana 

What is this "cli doesn’t support node v14+". Are you talking about the CLI tool version or something else? Dont quite understand what v14 is about.

 

regards

Adventurer

@Tr101_1  --

There is a video that details the install and setup of the CLI tool and this includes the installation of Node, which is a a JavaScript runtime built on Chrome's V8 JavaScript engine. 

The v14+ references the version of Node.  

 

Adventurer

First, @oxana thank you and your team for all of your hard work. 

Secondly, I had a question in https://docs.google.com/spreadsheets/d/1Uft22XqM-pWxOwPEq8i2xvCFhJ7OcOWq1cNAMoa2DwQ/edit#gid=0&range...
there is a comment that refers to canvasData3 is this a typo or is there additional information about the iteration changes between 2&&3

Thirdly, we write everything from scratch whenever there are fundamental data changes like this one. And one thing that would help us configure prior to official launch is a map.(like the one attached)

jsimon3_0-1602699840645.png

I do not mind creating another one but I would like to know if there is a document showing the keys and 1:1,1:many,many:many and so on.
- As I develop these things I will share them with the community especially when it comes to CD2 => tSQL conversion process. 

 

Instructure
Instructure

@jsimon3 

The comment in the map document you are referring to `Not supported in Canvas Data 3` had a typo , there is no Canvas Data version 3 planned in our current roadmap. I corrected the typo.

Your suggestion to provide a map for the migration purposes: do you need more information added to the existing map, or are you just requesting us to create similar map for any new versions of Canvas Data in the future?

Thank you,

Oxana

Surveyor

Hi, Just to clarify - Im using this current format for my schema.json to load each day. Do we have to manually construct the new schema.json using the details as per the google sheet and then apply the new format as shown above as an example? I'm a little confused. Could someone possibly post their schema.json so that I can better understand what is required. Thanks

 

Tr101_1_0-1602721079431.png

 

Adventurer

@oxana I apologize for the confusion in my question. I was looking for the schema documentation for CD2, so that I can create the visual maps based on the keys provided.I did not notice the large glaring heading Schema with the publicly shared documentation link... I was searching the https://institution.instructure.com/doc/api/ because I thought that was the new home for CD2 docs... Sorry about that ...

Surveyor

@oxana, I'm getting a 504 Error whenever I run snapshot or updates on courses and users; I presume this will be an issue with any relatively large dataset. However, I was able to successfully run a snapshot on the accounts table/dataset, which is a small file. Is there anything I can do on my end to prevent this timeout error? Is there a way for me to request a subset of the data (table), e.g., get the first 100 results of a table?

Attached is an example of the error that I'm getting in the console/terminal.

Canvas Data 2 - CLI - 504 Error on Users.png

  

Thank you.

Instructure
Instructure

@amoore-uview 

Looks like the cli is experiencing some intermittent errors today; we are actively troubleshooting them. I'll update everyone as soon as it's fixed. Thank you for your patience!
 
Oxana
 
Surveyor

Thank you, @oxana. It appears to be working again. Another question; I've noticed, of the tables I've sampled, that there is a UUID column. Is this the new preferred/suggested column to use when joining tables? And if so, given that UUIDs aren't in the old schema (e.g., course_dim, assignment_dim, etc.), will they be assigned to each row of a given table retroactively?

 

Thanks again.

Surveyor

Hi Oxana,

Can you provide instructions on how to fetch updates based on an orderId?  I can only see methods to fetch based on a relative age (like --last 20d) or a since date (--since <date>) and do not see a way in which I can fetch all updates that took place after a specific orderId.

Also, can you simulate some 'modifications' and 'deletes' for some tables so we can pull the updates and get examples of what updates in deletes will look like in the files.  I am only seeing 'adds' right now.

Thanks!

Rob

Adventurer

what is the abstract_courses table?
I don't see any info in the '_desc' or '_fields'  documentation

```

access_tokens
account_users
accounts
assessment_question_banks
assessment_questions
asset_user_accesses
assignment_groups
assignment_override_students
assignment_overrides
assignments
attachment_associations
attachments
calendar_events
communication_channels
content_migrations
content_participation_counts
content_participations
content_shares
content_tags
context_external_tools
context_module_progressions
context_modules
conversation_message_participants
conversation_messages
conversation_participants
conversations
course_account_associations
course_sections
courses
custom_gradebook_column_data
custom_gradebook_columns
dataset_name
developer_key_account_bindings
developer_keys
discussion_entries
discussion_entry_participants
discussion_topic_participants
discussion_topics
enrollment_dates_overrides
enrollment_states
enrollment_terms
enrollments
favorites
folders
grading_period_groups
grading_periods
grading_standards
group_categories
group_memberships
groups
late_policies
learning_outcome_groups
learning_outcome_question_results
learning_outcome_results
learning_outcomes
lti_line_items
lti_resource_links
lti_results
master_courses_child_content_tags
master_courses_child_subscriptions
master_courses_master_content_tags
master_courses_master_migrations
master_courses_master_templates
master_courses_migration_results
originality_reports
outcome_proficiencies
outcome_proficiency_ratings
post_policies
pseudonyms
quiz_groups
quiz_questions
quiz_submissions
quizzes
role_overrides
roles
rubric_assessments
rubric_associations
rubrics
score_statistics
scores
submission_comments
submission_versions
submissions
user_account_associations
user_notes
users
web_conference_participants
web_conferences
wiki_pages
wikis

```

Instructure
Instructure
 
It's one of the Canvas tables referenced in our courses table; it is not part of CD2 Beta dataset. The abstract_courses table will unlikely be added to CD2. I will remove the fields to make it less confusing. Thank you for pointing it out. 
 
Thanks,
 
Oxana
Community Team
Community Team

@Will1537 

Hi Rob,
 
Question 1 : 
 
The command for you to retrieve updates by OrderId isn't supported yet, we are planning on making it available in the Beta release. You could code around it at the moment and convert the OrderID [ ulid] into the timestamp nd request updates using dap updates [table name] --since command for now
 
Question 2 : 
 
at the moment the hard deleted records are not supported in the Early Access, they will be supported in the Beta release. Fortunately, hard deletes don't happen often in  the tables released in Canvas  Data 2, majority of them rely on a soft-delete process where we update workflow_state to deleted rather than removing a record from the database. These are the fields that will not contain NULL value when a record is hard deleted in the database: 
 
ID [ PK]
root_account_id
metadata.orderId
metadata.shard_id
metadata.root_account_uuid
 
 
The rest of the field will be NULL .
 
Thanks,
 
Oxana
Surveyor

Hi, currently getting the following error when running C:\Windows\system32> dap snapshot users accounts courses. Do I need to specifically map to a different path?

Tr101_1_0-1603330235240.png

 

Labels