We're pulling the flat files from the Canvas Data API to load into our own AWS data warehouse, ultimately ending up in Redshift. We're a little behind, and have just hit the first dump that uses version 1.6.0 of the schema, which changes the type of the id field in requests to a guid. Right now we're storing the id field as a BigInt in Redshift, but I wanted to make sure that we don't lose precision in case the GUIDs in the data set are wider than 8 bytes. What Redshift type are you using to store the ID?
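For context on the precision concern, here's a quick Python sketch (with a made-up GUID value, not real Canvas data) showing why an 8-byte BigInt can't hold a full 16-byte GUID:

```python
# Hypothetical example: a 16-byte (128-bit) GUID encoded as ASCII hex.
guid_hex = "3f2a9c0d4b1e4f6a8d7c5b3a19e8d0c4"  # 32 hex chars = 16 bytes

# Parse the hex string as an integer to inspect its width.
value = int(guid_hex, 16)

print(value.bit_length())   # well over the 64 bits a BIGINT can hold
print(value > 2**63 - 1)    # True: exceeds Redshift BIGINT's maximum
```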
Thanks for the response. A couple of follow-ups:
- Since previous files for the request table had a numerical value for the ID, we'll need to convert them to fit into a varchar column. Did you just replace the numeric value with a string representation of that number?
- My understanding was that if a change was made to the schema of an incremental table, we would expect a full dump of that table with the updated data. I can easily understand why you wouldn't want to do that unnecessarily given the size of the requests table, but is that still something that we should plan on seeing in the future? Our data pipeline has a lot of special-case code for the "megadumps" of the full request table, but if that's something that's less likely to happen in the future it'd be useful to know for our capacity planning.
A couple things:
* The underlying data for the requests table has not changed. It has been an ASCII hex-encoded guid since the Canvas Data release. We changed the datatype in the schema from varchar to guid because it is more descriptive and allows databases with native 16-byte types to take better advantage of the field. Whatever you were doing before to convert the hex string to a number should still work.
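For anyone following along, the hex-to-number conversion described above can be sketched in Python (the sample value below is made up for illustration, not a real Canvas requests ID):

```python
# Hypothetical sample: a requests.id value stored as an ASCII hex string.
request_id_hex = "bc614e"  # illustrative value only

# Python's int() parses a hex string directly when given base 16.
as_number = int(request_id_hex, 16)

print(as_number)  # 12345678
```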
* Since the data did not change, we did not see the need for a "megadump". However, megadumps are still a reality and will happen from time to time. In fact, there will likely be one this week, as we are adding a few new fields to this table.
Thanks for the response. You're right - I'd been looking at the number in the database and assuming it was a bigint - it turns out that my code was smarter than I was, and was treating it as a text field all along, so that resolves the conversion issue.
Thanks also for the clarification on megadumps, and it's a big help for us to get a heads-up that one's likely to be coming up soon.
My institution plans to build a similar data warehouse. We would really appreciate it if you could share your approach with us:
1) Do you pull the Canvas Data files down to a local machine first and then load them into AWS Redshift?
2) Do you need to re-create all the tables and re-load the data into Redshift during each run, except for the requests table?
3) Would you mind sharing your code? We are new to AWS and want to learn from you.
Many thanks for your efforts, and we look forward to your reply.
This question seems like it's more of a general "how do I use Canvas Data" question rather than a follow-up to Phil's original question, and would probably be better off in a new thread. In any case, Phil is a former colleague of mine but he's moved on to another position outside of our organization, so posting your questions to the larger Canvas Data group will likely be more productive.
As an aside: your username/profile is a bit misleading -- people may get the impression that you work for Instructure, and I think you'll be better off changing it to be more personal. Most people use real names in this community, FWIW.