Community Member

Canvas data dumps: full history or delta?

Hi,

I'm using the Python canvas-data-sdk package to download Canvas Data via its API. I've been running the following commands:

```
canvas-data -c config.yml list-dumps         # list the daily dumps
canvas-data -c config.yml unpack-dump-files  # download the latest dump files, unpack, and parse them into table files
```

My question is: do the daily dump files contain the full history, or do they contain deltas (daily changes)? On some days there are 100 files in the dump; on others, 11. Sometimes a dump has 1 file, sometimes 130.

Could someone help me understand how I should interpret the data in each of the dumps?  Thank you.

Vinh

Learner II

Hi Vinh!

That's a great question; my understanding is that the daily dumps _should_ contain the full contents of all of the tables _except_ for the requests table. Since the requests table is so big, each day's dump contains only a delta for it. However, there do seem to be occasions where a dump has only a small number of files. And once in a while (maybe every 6 months or so?) there will be a "mega-dump" that contains all data from the day your Canvas instance was born to the present.
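If you want to check this for yourself, the per-table metadata in a dump's file listing tells you whether each table is a full snapshot or just a delta. Here's a rough Python sketch; it assumes you've already fetched the listing for one dump as JSON, and the `artifactsByTable` / `partial` field names are my recollection of the Canvas Data API response, so verify them against a real listing from your account:

```python
# Rough sketch: report which tables in a single dump are complete snapshots
# and which are only deltas. `dump_listing` is assumed to be the parsed JSON
# file listing for one dump; the field names below are assumptions and
# should be checked against an actual API response.
def summarize_dump(dump_listing):
    for table, artifact in dump_listing.get("artifactsByTable", {}).items():
        kind = "delta (partial)" if artifact.get("partial") else "full snapshot"
        n_files = len(artifact.get("files", []))
        print(f"{table}: {kind}, {n_files} file(s)")
```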

Full disclosure: I wrote this Canvas Data SDK, but I've since moved to a different way of maintaining a local copy of all of the data. Back when I was using this library, I was typically downloading just a single table or a small handful of tables at a time. I also ran into confusing situations where there were occasionally multiple dumps of various sizes on a single day, and it wasn't always clear which files I should download in order to reconstitute a particular table.

The other approach (not supported by this SDK, sadly) is to use the 'sync' API endpoint: that will give you the full list of all of the files you need to reconstitute a full set of data in all of the tables. The idea is that you download all of the files in the list and store them. The next time you want to refresh your data, you get the list of files from the 'sync' endpoint again, download anything listed that you don't already have, and delete anything locally that's no longer listed. This is the approach suggested in the Canvas Data API documentation, and it's the approach that I currently use. Though it requires having a place to store a large amount of data (there will be thousands of files making up the requests table), it's relatively easy to keep up to date. 
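To make that concrete, here's a rough Python sketch of the sync-and-prune loop. The `get_sync_file_list` helper and the JSON field names (`files`, `filename`, `url`) are assumptions standing in for a real call to the sync endpoint with proper authentication, so adjust them to match the actual response:

```python
# Minimal sketch of the sync-and-prune approach, assuming a helper that
# returns the parsed JSON from the Canvas Data 'sync' endpoint. The field
# names used here are assumptions and should be verified against a real
# response from your account.
import os
import requests

DATA_DIR = "canvas-data-files"  # local directory holding the downloaded files


def sync_local_files(get_sync_file_list):
    os.makedirs(DATA_DIR, exist_ok=True)
    listing = get_sync_file_list()  # e.g. {"files": [{"filename": ..., "url": ...}, ...]}
    wanted = {f["filename"]: f["url"] for f in listing["files"]}

    # Download anything in the sync list that we don't already have locally.
    for filename, url in wanted.items():
        path = os.path.join(DATA_DIR, filename)
        if not os.path.exists(path):
            resp = requests.get(url, stream=True)
            resp.raise_for_status()
            with open(path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)

    # Delete local files that are no longer part of the current snapshot.
    for filename in os.listdir(DATA_DIR):
        if filename not in wanted:
            os.remove(os.path.join(DATA_DIR, filename))
```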

If you're interested in the details of how I maintain my Canvas Data warehouse on AWS, I wrote a blog post here with instructions. 

I hope this is helpful!

--Colin

Community Member

Thank you for your response, Colin, and thank you for creating the Canvas Data SDK in Python.

From your description, it sounds like there is no reliable logic to what is included in each day's dump, so I shouldn't rely on it if I want all of the Canvas data.

I had heard of the sync API, but was trying to rely on the Canvas Data SDK because it is Python-based and hence should be more portable. It's great that you, the creator of the Python package, acknowledge that you no longer use it and use the sync API instead. I'll look into that approach now too.

Thanks again!

Learner II

FWIW, it shouldn't be too much work to add support for the sync endpoint to the Canvas Data SDK.  If you're up for contributing to the project, let me know!

--Colin

Community Member

Colin, you're using this Node.js package from Canvas to sync your data, correct? Is that the recommended approach?

Learner II

Hi Vinh,

I'm actually not using that package, but I've essentially implemented the same logic in Python. Since I'm hosting my data warehouse in AWS, I'm using a couple of Lambda functions to do the sync/fetch operations.  The code is all in this repo, and there's a link to a tutorial that walks through all of the setup steps: 

GitHub - Harvard-University-iCommons/canvas-data-aws: Build a Canvas Data warehouse on AWS 

--Colin
