Having moved off of Redshift this summer, I just got my hands on a SQL Server instance to import all of our Data files. I'm more comfortable with MySQL so I had a little learning curve getting things set up and imported. I saved the Requests table for last since I could work on other visualizations while it was importing.
Our uncompressed Requests text file is 286GB and has been importing for the last 7 days. We are up to 46.9M rows (and counting at the time I write this) and I have no idea how much further we have to go.
Imagine the feeling when, yesterday at 6 days in, I realized that I messed something up and the ID field is importing NULL for every record. I'm letting it run to see how many rows I'm dealing with but ultimately will have to re-import it all over again.
What super awesome valuable reports you guys have gotten out of the requests table?
I've been able to create some nice (first pass) PowerBI reports to profile Canvas adoption using the other tables. Between import fatigue and a mild case of data-blindness, I'm wondering if I should bother with the requests file at all. Querying that thing at all is going to be a resource hog in and of itself, not to mention periodically updating it.
All ideas, advice, and condolences welcome.
Although I won't really be able to say from a perspective of a school. What queries I've run, I don't want to drop a little note in about: Requests Table that thread. There's lots of talk about optimizing the requests table itself. (Adding indexes to the table, and only importing part of it) Hopefully you can gleam some piece of info that works for you, and makes it a bit easier, even though it's mainly MySQL focused.
Doing full loads of a request table I've found is generally not worth it unless you absolutely need it for a query. I found it much easier to keep the fully downloaded file somewhere waiting to be imported, and import the latest 30 days worth of data. (then every day delete 31 days ago, and load in the new day).
I also figure I'll drop a note about some queries I'm either working on, or have finished:
- Building a "Path" through courses. Based on grades. Looking at how does a student who gets good grades go through a course, compared to bad grades. (Do they visit discussions more? Do they take longer quizzes? etc.)
- I do my own (as I know others do), do detection of app based users/web based users. Brenden Goetz made an open tableau book (which contains SQL Queries): Device Usage in Canvas Data with Tableau there.
Thanks Eric, I appreciate the feedback.
I'm definitely going to devise a partial loading strategy. The tricky part is knowing which of the gzip files to keep in the directory to unpack since it's not easy to determine what files match up with any particular time frame. I'm thinking several tables broken up by year (if I can figure out where that line is).
One month's data might serve well for designing a model but once we have it locked in, we'd probably need the full historical for any predictive analytics we come up with.
I really do not mess with the requests table all that much. At the request of course designers, I have compared page designs to see which yields higher page requests and also used it in cases where the question of specific student actions come up. At present I have one report which identifies if instructors have set up certain parts of the class before the start date by identifying if their user_Id has records with specific pages..
In all of those cases I never download the entire request table. I typically query the Redshift db directly with very specific "where" parameters. In the the future, if I have to download say a month's worth of data I am going to use the "import-csv" command in powershell and basically write a script to get what I need out of it.
Only other thing I will mention is using the CLI tool gets everything in real quick and small. You'd still have to unpack it though.
I don’t download the requests table but it’s available to me in a “data lake”. I access it using Alteryx and recently used this to obtain lists of unique activity dates per student. From this I can calculate the percentage of days a student was active within a course and also the average gaps between active days.
These scatter charts show the final mark in three different courses against Activity Score (%age of active days)…
…and the Merged Activity Score (%age of active days-Average Gap+100/2).
By grouping students into 4 buckets using Merged Activity Scores, this 100% Stacked Chart shows the varying ratios of final grades.
At the very beginning we also imported all the data from the request tables. Pretty soon it became really painful to query that table, and we do not have unlimited hardware resource for that as well. Meanwhile, I figured a lot of information that I am looking for actually can be retrieved via Canvas Analytics API. Therefore, we decided to skip requests table in the cron job. Of course, in some cases, we do need to use the data in the requests table, then we extract the requests table data based on course_ids into another table. Since that table is much smaller, we finally find a balance for our instance.
I've found two approaches to identifying which requests*.gz files are which.
1. Perform a sync operation, which will download files not already present in the dataFiles directory. The newly downloaded files can then be separated by timestamp, so by relocating the older files temporarily, an unpack operation will produce a text file containing only newly downloaded data. Remember to copy the old files back in before the next sync operation.
2. Use the canvasDataCli list and grab operations. The list operation lists all available daily dumps, so will go back more than a month. The requests*.gz files downloaded using the grab operation have a numerical prefix which identifies the dump they are from. The problem here is that the grab operation doesn't save the files into the appropriate directories under dataFiles, so that has to be done manually before the unpack operation.
We use an Oracle database, so the quantity of data is not an issue. We load requests data daily into a table partitioned on timestamp_month. Including the partitioning column in the query therefore only uses the appropriate partition(s). This also makes pruning of obsolete data simple.