I'm working on building out an integration with the Canvas Data API and pulling this information in to a local database. I'm having a few design issues with my update program I was hoping others could provide some insight with - this isn't language specific.
First, I'm seeing that all but one of the tables (requests), in the most recent schema version (1.13.1) are NOT partial. From what I've seen, comparing data between dumps, is that when a table says that it is not partial, it is indeed not partial - even if that means a dump of 42k lines each night. I'm trying to reduce IO as much as possible, so my first thought was to simply compare the two files, but this will progressively take longer and longer. I thought about using a diff utility and parsing the results, but in my testing with the assignment_dim table the rows aren't even in order, which would require me to first load and sort the rows. Is there any way to only get a list of additions and changes without having to retain the previous day's table file and compare it?
Otherwise, if every update is a full copy of the table, wouldn't this eventually grow to unreasonable sizes and time? We've only been tracking ~230 days worth of information, but surely in a few years we could expect some tables/files to reach sizes that are exceedingly large?
Second, how is everyone automating and handling the possible changes in Schema? I wrote a few simple methods that allow me to compare two versions, and I see that only one was something removed (back in the first few schema changes). Beyond that, it seems to be changes in descriptions, additions of tables or columns in a table.
I'm not a big data person myself, so I'm not sure if I'm missing something obvious here. Either way, I have no idea what the best way to approach this would be (aside from forking over the money for the Redshift option).