Our institution wants to take a look at Canvas Data to see what kind of information we can pull out of it. I have looked around the Canvas Community quite a bit to learn more about the data portal and what it offers. I have seen how to download and sync data with the Canvas Data CLI tool, and it seems straightforward. I do have a few questions, though, to help us get moving forward.
Thank you for your help!
Hi @erik_gustafson ,
There are no delta or differential data sets: you will redownload the same data, plus whatever is new since the last time you downloaded. With the Canvas Data CLI tool, you can use the sync command to keep everything up to date, while the fetch command can be used to download everything for a specific table. The requests table is the exception: it is partial/incremental, with its own historical dump roughly every two months.
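As a rough sketch, the two CLI modes described above look like this (the config path and the `users` table are placeholders; this assumes the `canvas-data-cli` npm package is installed and configured):

```shell
# Sketch of the two canvas-data-cli modes described above. Guarded so it
# is a harmless no-op on a machine without the tool installed.
if command -v canvasDataCli >/dev/null 2>&1; then
  # sync: downloads everything the first time, then only new files on reruns
  canvasDataCli sync -c ./config.js
  # fetch: redownloads the full current dump for a single table
  canvasDataCli fetch -c ./config.js -t users
  ran="cli"
else
  echo "canvasDataCli not installed; commands shown for reference only"
  ran="skipped"
fi
```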
SSIS is an option, but I would have had issues maintaining it locally, and I didn't like the time constraints if our data department managed it via SSIS. I believe Talend and some other licensed products are available too.
I opted for a Linux VM and some open-source tooling. I currently use and support Managing Canvas Data with Embulk to load Canvas Data into MS SQL Server.
Thank you for the quick reply. So, if I am understanding you correctly, the sync command will keep the tables updated with the latest data, correct?
It sounds like you use Embulk to load data into MS SQL Server instead of SSIS. Is this an automated task?
Correct-ish. It will download all the gzipped files and, on rerun, will download any new gzipped files it doesn't have. I actually don't like this command for our daily needs; I use the fetch command to get a completely new copy of the data every day.

The sync command has an issue, and I honestly don't understand how others tolerate it: any records previously downloaded will not get updated, only records that are new, so you essentially have the original state of many records, not their current state. For my daily import I rely on the fetch command to redownload and import all 100 tables. I also incrementally load the requests table daily using the fetch command, deleting the previous sequences beforehand, since fetch for the requests table only downloads a short window into the past. Some details on these steps are here: Canvas Data CLI - What am I missing?
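The daily requests routine described above could be sketched roughly like this. The `saveLocation` directory layout and file names are assumptions, not something from the thread, so treat this as an illustration rather than the exact script:

```shell
# Hedged sketch: clear the previously downloaded requests sequences so
# overlapping fetch windows are not double-imported, then fetch the
# latest window. The directory layout is an assumption.
daily_requests_fetch() {
  local save_dir="${1:-./dataFiles}"     # assumed saveLocation from config.js
  # delete yesterday's partial requests files before refetching
  rm -f "$save_dir"/requests/*.gz
  canvasDataCli fetch -c ./config.js -t requests
}
```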
Embulk is just the data loader. I have documented some useful tips and tricks (in the repo), as well as the known schema for four databases; the repo will help you import each set of tables with a single command. My actual workflow is a bash script that goes through a text file and downloads each desired table, then another bash script that loads each table into a staging database, with a considerable amount of fault tolerance and logging built in. I then use the staged data to load to production with a slightly different purpose. Automation is finished off with a scheduled cron job.
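The download half of that workflow might look something like the sketch below. The `tables.txt` list, config path, and error handling are all placeholders for illustration, not the actual script from the repo:

```shell
# Hedged sketch of the "loop over a text file of tables" step described
# above. Each line of the list file names one Canvas Data table to fetch;
# failures are logged and the loop keeps going.
fetch_tables() {
  local list="${1:-tables.txt}"
  while IFS= read -r table; do
    [ -z "$table" ] && continue                    # skip blank lines
    canvasDataCli fetch -c ./config.js -t "$table" \
      || echo "fetch failed for $table" >&2        # tolerate per-table failures
  done < "$list"
}
```

Scheduling it nightly would then be a single crontab entry pointing at the wrapper script, e.g. a line such as `0 2 * * * /path/to/fetch_and_load.sh` (the path and time are placeholders).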
At Belfast Metropolitan College we use the Canvas Data Viewer to download the data and import it into SQL Server. We then use Power BI to visualise the data.
Good to know about Canvas Data Services! Most of our ETL runs through Python API calls, and we were able to get a nightly import set up in a day using this Python library: https://github.com/Harvard-University-iCommons/canvas-data-sdk
Nigel - any chance you have written views that join the star schema tables, and would be interested in sharing them?