Hi, I have been tasked with creating a data lake (snapshot) of all Canvas Data from the API. I have already seen the AWS data lake blog post, but what we are trying to do differs in a few ways.
Also, I am aware that Canvas Data API 2 is around the corner and will have a different workflow. We still need to accomplish this on version 1.
We are essentially using BigQuery on Google Cloud Platform to ingest each file into its respective table.
From the snapshot sync algorithm I see that there is this:
- After all files have been processed, delete any local file that isn't in the list of files from the API
I am not sure whether this means that data in some tables will be overwritten, or whether it would be sufficient for all tables to just keep appending the files that have not been ingested yet. Is there a situation where we would have to go back and remove certain files from an ingest (for example, records from assignment_group_score_fact-00001-8ddbe09c.gz would have to be removed from the table in the data lake)? If so, it seems it would make sense not to check for previously uploaded files at all, and instead to replace each table every time with its full set of files (except for requests, of course). But I don't know whether this is the case, and I was hoping someone might be able to clarify the best approach.
I'm still fairly new to Canvas Data myself, but I'm going to try to answer this if only to help collect my own understanding.
The /file/sync endpoint and algorithm specifically refer to the TSV files that exist in the Canvas Data platform. The sync list contains all files that are needed for a complete snapshot of the data. The last step is to delete any previously downloaded files that are not in the current sync list, because all of that data now exists within a new file.
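To make that last step concrete, here is a minimal sketch of how the sync list could be compared with what has already been downloaded. It assumes each entry in the sync list is a dict with a "filename" key; the function name plan_sync and the shape of its result are my own for illustration, not part of the Canvas Data API.

```python
from typing import Iterable


def plan_sync(sync_files: Iterable[dict], local_filenames: Iterable[str]) -> dict:
    """Compare the API's sync list with the files already downloaded.

    sync_files: entries from the sync list, each assumed to carry a "filename" key.
    local_filenames: names of the files currently held locally.
    """
    remote = {f["filename"] for f in sync_files}
    local = set(local_filenames)
    return {
        "download": sorted(remote - local),  # in the sync list, not yet downloaded
        "keep": sorted(remote & local),      # already downloaded and still part of the snapshot
        "delete": sorted(local - remote),    # no longer listed; their data lives in a newer file
    }
```

Anything that lands in "delete" is a file whose rows have been superseded, which is why simply appending new files to a table would leave stale or duplicate rows behind.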
The new file will need to be ingested into your data lake to insert new rows and update existing rows. Since comparing each row can be expensive, it may be simpler to truncate the table and reload it with the new file(s).
You may also see files in the sync list that you have already downloaded. This indicates that the data within that file is unchanged, but the file is still needed to get a complete snapshot of the data.
If you are truncating your tables you will need to reload these files to get the complete set of data. However, if you are comparing rows you may be able to skip these files entirely because nothing has changed.
You will not need to keep or ingest any files that are not in the current sync list, unless you want to keep some sort of change log of the data over time.
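If you go the truncate-and-reload route, one way to avoid reloading every table on every run is to reload only the tables whose set of files has changed. This is a rough sketch under my own assumptions: each sync entry is a dict with "table" and "filename" keys, and loaded_by_table is a hypothetical record you keep yourself of which files were last loaded into each table.

```python
from collections import defaultdict


def tables_to_reload(sync_files, loaded_by_table):
    """Return {table: [filenames]} for tables whose file set differs from the last load.

    sync_files: entries from the current sync list ("table" and "filename" keys assumed).
    loaded_by_table: hypothetical bookkeeping, {table: filenames loaded last time}.
    """
    current = defaultdict(set)
    for f in sync_files:
        current[f["table"]].add(f["filename"])

    reload = {}
    for table, filenames in current.items():
        # Any added or removed file means the table no longer matches the snapshot,
        # so truncate it and load the full current set of files.
        if filenames != set(loaded_by_table.get(table, ())):
            reload[table] = sorted(filenames)
    return reload
```

A table that comes back from this function would be truncated and loaded with every file listed for it, including files you had already downloaded; a table that does not come back can be left alone for that run.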