Community help

dogrdon · ‎03-27-2023

Hi, I have been tasked with creating a data lake (snapshot) of all Canvas Data from the API. I have already seen the AWS data lake blog, but what we are trying to do differs in a few ways.

Also I am aware that Canvas Data API 2 is around the corner and will have a different workflow. This is still necessary to accomplish on version 1 for us.

We are essentially using BigQuery on Google Cloud Platform to ingest each file into its respective table.

From the snapshot sync algorithm I see that there is this:

- After all files have been processed, delete any local file that isn't in the list of files from the API

I am not sure whether this means that data in some tables will be overwritten, or whether for all tables it is sufficient to keep appending the files that have not been ingested yet. Is there a situation where we'd have to go back and remove certain files from an ingest (e.g., records from assignment_group_score_fact-00001-8ddbe09c.gz would have to be removed from the table in the data lake)? If so, it seems like it would make sense not to check for previously uploaded files at all and just replace each table every time with its full set of files (except for requests, of course). But I don't know whether this is the case and was hoping someone could clarify the best approach.

bliszewski · ‎03-28-2023

I'm still fairly new to Canvas Data myself, but I'm going to try to answer this if only to help collect my own understanding.

The /file/sync endpoint and algorithm specifically refer to the TSV files that exist in the Canvas Data platform. The sync list contains all files that are needed for a complete snapshot of the data. The last step is to delete any previously downloaded files that are not in the current sync list, because all of that data now exists within a new file.
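To make that comparison concrete, here is a rough sketch of how I picture it. It assumes each entry in the sync list is a dict with a `filename` key; the function name `plan_sync` and the sample file names (other than the one from the question) are made up for illustration, so treat this as a sketch rather than the official algorithm.

```python
def plan_sync(sync_files, local_filenames):
    """Compare the current sync list against files already downloaded.

    sync_files: list of dicts, each assumed to carry a "filename" key.
    local_filenames: iterable of file names already on disk.
    """
    wanted = {f["filename"] for f in sync_files}
    local = set(local_filenames)
    return {
        "download": sorted(wanted - local),  # in the sync list, not yet local
        "keep": sorted(wanted & local),      # unchanged, still part of the snapshot
        "delete": sorted(local - wanted),    # superseded, no longer in the sync list
    }


# Hypothetical sync list and local directory contents.
sync_files = [
    {"filename": "assignment_group_score_fact-00002-aaaa1111.gz"},
    {"filename": "user_dim-00000-bbbb2222.gz"},
]
local_files = [
    "assignment_group_score_fact-00001-8ddbe09c.gz",
    "user_dim-00000-bbbb2222.gz",
]

plan = plan_sync(sync_files, local_files)
for action, names in plan.items():
    print(action, names)
```

In this example the older assignment_group_score_fact file lands in "delete" because its rows are expected to be covered by the newer file in the sync list.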

The new file will need to be ingested into your data lake to insert new rows and update existing rows. Since comparing each row can be expensive, it may be simpler to truncate the table and reload it with the new file(s).

You may also see files in the sync list that you have already downloaded. This indicates that the data within those files is unchanged, but they are still needed to get a complete snapshot of the data.

If you are truncating your tables you will need to reload these files to get the complete set of data. However, if you are comparing rows you may be able to skip these files entirely because nothing has changed.
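If you go the truncate-and-reload route, a per-table plan might look something like the sketch below. It assumes each sync list entry has `table` and `filename` keys, and that requests is the only table you append to; `build_reload_plan`, `already_loaded`, and the sample file names are hypothetical.

```python
from collections import defaultdict


def build_reload_plan(sync_files, already_loaded, append_only=("requests",)):
    """Group sync-list files by table and pick a load mode for each table.

    Tables in append_only get only the files not loaded before; every other
    table is truncated and reloaded from all of its files in the sync list.
    """
    by_table = defaultdict(list)
    for entry in sync_files:
        by_table[entry["table"]].append(entry["filename"])

    plan = {}
    for table, names in by_table.items():
        if table in append_only:
            # Append mode: skip files that were ingested on a previous run.
            files = sorted(n for n in names if n not in already_loaded)
            plan[table] = {"mode": "append", "files": files}
        else:
            # Truncate mode: reload everything, including unchanged files.
            plan[table] = {"mode": "truncate", "files": sorted(names)}
    return plan


# Hypothetical sync list and record of previously ingested files.
sync_files = [
    {"table": "user_dim", "filename": "user_dim-00000-bbbb2222.gz"},
    {"table": "requests", "filename": "requests-00000-cccc3333.gz"},
    {"table": "requests", "filename": "requests-00001-dddd4444.gz"},
]
already_loaded = {"user_dim-00000-bbbb2222.gz", "requests-00000-cccc3333.gz"}

reload_plan = build_reload_plan(sync_files, already_loaded)
for table, step in reload_plan.items():
    print(table, step["mode"], step["files"])
```

In BigQuery the mode would presumably map onto the load job's write disposition (WRITE_TRUNCATE versus WRITE_APPEND), but I have not tested that end to end.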

You will not need to keep or ingest any files that are not in the current sync list, unless you want to keep some sort of change log of the data over time.

dogrdon · ‎03-29-2023

Hey, thanks. I think it's clearer to me now that -- given our setup -- it's probably easier just to blow away each non-requests table each time and resync with the full snapshot for any given day.

I could start adding logic for checking against previous uploads and replacing records, but that seems like more trouble than it's worth.

Thanks very much.

Not clear to me how to update the full sync data snapshot across all tables (Canvas Data API 1 )

API

Canvas Data

data lake

python
