We are using different technology at my institution, but I can still share some things that I've picked up. Each morning, we grab and store the raw Canvas data in an ODS (Operational Data Store) schema on our data warehouse. Later, we run a series of ETL processes that transform the raw Canvas data, join information from our SIS, and store the results in a different set of tables. These new tables are then used for reporting and analytics. We’re still in the early stages of using Canvas data, so that part is rapidly changing for us as new needs are identified, but we're using dimensional modeling practices and ultimately creating star schemas with the data. We haven't yet got to the point of a unified and complete data model, but we're working towards it.
From a technical perspective, I found that using a lot of intermediate staging tables improved performance and made it a lot easier to design/test the data flow. I usually prefer simple ETL flows, where I might transform raw data and join SIS data in a single step. The quantity of data in some Canvas tables, along with the need to perform more complicated transformations to extract SIS keys from Canvas fields, made it much easier to split this out into several mini steps. In some cases, I also made use of table partitioning to avoid re-processing older data that was no longer changing.
When it comes to design, my biggest piece of advice is to assume that you're going to get some wonky data and actively design with that assumption in mind. Our faculty have a lot of latitude in designing their courses, and I found a lot of things in the data I wasn't initially expecting. If an instructor manually enrolls a student into their Canvas course that is not officially enrolled in the SIS, do you report that student's data along with the rest of the class or use your SIS enrollments as the source of truth? Last week, a faculty member (seemingly out of nowhere) updated the enrollment term of their class from several semesters ago and I needed to update one of my joins to SIS data in order to correct the mismatch. A big one (for us) was that we were initially reporting students who had not been logging into their courses, but we found that some of the LTIs being used allowed students to access the material directly (i.e., through the textbook publisher's website) so they weren't going through Canvas to access their assignments - so now we use the last access time OR the latest assignment submission time to determine the student's most recent class activity.
I hope that gave you at least something useful to think about, but let me know if you have any questions!
Best,
Andy