Not clear to me how to update the full sync data snapshot across all tables (Canvas Data API 1)


Hi, I have been tasked with creating a data lake (snapshot) of all Canvas data from the API. I have already seen the AWS data lake blog post, but what we are trying to do differs in a few ways.

I am also aware that Canvas Data API 2 is around the corner and will have a different workflow. We still need to accomplish this on version 1.

We are essentially using BigQuery on Google Cloud Platform, ingesting each file into its respective table.

The snapshot sync algorithm includes this step:

- After all files have been processed, delete any local file that isn't in the list of files from the API
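
For reference, this is roughly how I read that step when it is applied to a local folder of downloaded files. It is only a sketch of my understanding: `prune_stale_files`, `local_dir`, and `api_filenames` are names I made up, and I am assuming the file list from the API has already been reduced to a set of filenames.

```python
from __future__ import annotations

from pathlib import Path


def prune_stale_files(local_dir: Path, api_filenames: set[str]) -> list[str]:
    """Delete local files whose names are not in the API's file list.

    Returns the names of the files that were deleted, in sorted order.
    Both arguments are hypothetical: a folder of previously downloaded
    files and the set of filenames the sync listing currently reports.
    """
    removed = []
    for path in sorted(local_dir.iterdir()):
        # Only plain files are considered; subdirectories are left alone.
        if path.is_file() and path.name not in api_filenames:
            path.unlink()
            removed.append(path.name)
    return removed
```

With local files it is easy to delete one file. My question is about what the equivalent is once the rows from that file are already sitting in a BigQuery table.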

I am not sure what this implies for the tables. Is it sufficient, for all tables, to keep appending the files that have not been ingested yet? Or is there a situation where we would have to go back and remove a previously ingested file's records (e.g., those from assignment_group_score_fact-00001-8ddbe09c.gz) from the table in the data lake? If so, it seems it would make sense to skip the check for previously uploaded files and simply replace each table every time with its full set of files (except for requests, of course). I don't know whether this is the case and was hoping someone could clarify the best approach.
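
To make the two options concrete, here is a rough sketch of the per-table decision I have in mind. Everything in it is hypothetical: `plan_table_load` is my own name, `ingested` is the set of filenames I have already loaded for one table, and `api_files` is the set the sync listing currently reports for that table. I don't know whether previously listed files actually drop out of the listing in practice, which is the part I am asking about.

```python
from __future__ import annotations


def plan_table_load(ingested: set[str], api_files: set[str]) -> tuple[str, list[str]]:
    """Decide whether one table can be appended to or must be rebuilt.

    If any file that was already ingested is no longer listed by the API,
    its rows would be stale, so the whole table is reloaded from the
    current list. Otherwise only the not-yet-ingested files are appended.
    """
    dropped = ingested - api_files
    if dropped:
        return "replace", sorted(api_files)
    return "append", sorted(api_files - ingested)


# Nothing dropped out of the listing: append only the new file.
print(plan_table_load({"a.gz"}, {"a.gz", "b.gz"}))
# A previously ingested file is gone: reload the full current set.
print(plan_table_load({"a.gz", "old.gz"}, {"a.gz", "b.gz"}))
```

If the "replace" branch can never trigger for the non-requests tables, then plain appending would be enough. If it triggers regularly, always replacing the table seems simpler than tracking files.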
