We would like some advice about loading the "Requests" table in the Canvas data database. My understanding is that for this table the Canvas Data CLI downloads a daily incremental file, which is then unzipped and appended to the cumulative "Requests" file during the unpack process. The cumulative file becomes quite large over time. Currently we perform a full reload of this large file every day.
Is there a way to configure Canvas Data CLI to unpack only the current day's increment, so that we append the latest activity rather than performing a complete refresh each day?
I'm just curious if you ever received an answer to this, Kevin. I am currently in the same situation: I would also like to know if there is a specific command argument that would download and then unpack only the previous day's worth of data.
Just figured I'd drop in here, and leave a tidbit of knowledge. We have an internal tracker for this improvement since we've heard this feature request a lot. We don't yet have an ETA for it, but nonetheless it is assigned, and being worked on!
If you'd like you can submit a support case linking to this thread, asking to be added to the tracker so you can get automatically notified when it's released. (Although I'll try my best to remember to post here too!)
Hi Deactivated user
I just wanted to quickly follow up on this thread while we wait for the fix to be applied to the CLI tool.
I'm not quite sure how to submit a support case and wanted to check on the status of the fix. Is it possible to provide a status check since we are still struggling to get the delta data from the requests?
To provide some more information on our current situation: we bring in the daily requests with fetch and unpack commands. The dataFiles for requests are fetched and stored daily, but during the unpack we are unable to unpack only the previous day's worth of data.
Our unpack command currently looks like this:
call canvasDataCli unpack -c C:\CanvasData\config.js -f requests
If there were a way to add a parameter to this command so that it unpacks only the previous day's worth of data, identified by, say, the filename, that would be a solution. Something using a date or filename, such as the following, may work:
call canvasDataCli unpack -c C:\CanvasData\config.js -f -requests-00001-33c1c161
Thank you and please let me know if more info is needed or if there is a solution currently in place that we are unaware of.
You can actually email email@example.com (this will go straight to the Canvas Data Support Queue, SLA of 24 hours generally). We can then go ahead and attach it to the jira.
Although we haven't rolled out the update for this yet, we did roll out some major changes to how unpack works internally that should prevent out-of-memory errors in Node, as well as make the process faster in general. So this is the next change in line for the unpack command specifically.
Unfortunately though I can't really provide an ETA if that's what you're looking for. Sorry!
That being said I'll try to put some extra pressure on this since it's causing problems for you guys.
The main problem with data downloads of requests data is that all the gzip files have a timestamp of the download time rather than retaining the original creation timestamp. This makes it difficult to isolate incremental data.
Here's the approach I use daily:
1. Create an archive directory for requests gzip files.
2. Run canvasDataCli sync to download the latest requests gzip files into the dataFiles/requests directory.
3. Move any gzip files with a timestamp older than 1 hour to the archive directory.
4. Run canvasDataCli unpack to produce the incremental unpacked file.
5. Copy the gzip files from the archive directory back into the dataFiles/requests directory so that the next sync only downloads newer files.
6. Ready for the next sync.
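The file-shuffling part of the steps above could be automated roughly as follows. This is only a sketch under assumptions: the function names, the directory arguments, and the one-hour default are illustrative, and the sync and unpack steps are left as comments because they are run through the CLI itself.

```python
import shutil
import time
from pathlib import Path


def archive_old_gzips(download_dir, archive_dir, max_age_seconds=3600, now=None):
    """Move *.gz files older than max_age_seconds from download_dir to archive_dir.

    Returns the sorted names of the files that were moved.
    """
    download_dir, archive_dir = Path(download_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    now = time.time() if now is None else now
    moved = []
    for gz in sorted(download_dir.glob("*.gz")):
        # The modification time reflects the download time, as noted above.
        if now - gz.stat().st_mtime > max_age_seconds:
            shutil.move(str(gz), str(archive_dir / gz.name))
            moved.append(gz.name)
    return moved


def restore_archived_gzips(archive_dir, download_dir):
    """Copy archived *.gz files back so the next sync treats them as already downloaded."""
    archive_dir, download_dir = Path(archive_dir), Path(download_dir)
    restored = []
    for gz in sorted(archive_dir.glob("*.gz")):
        # copy2 keeps the original timestamp, so restored files still count as old.
        shutil.copy2(str(gz), str(download_dir / gz.name))
        restored.append(gz.name)
    return restored


# Hypothetical daily run (directory names are assumptions):
#   1. run canvasDataCli sync
#   2. archive_old_gzips("dataFiles/requests", "requests_archive")
#   3. run canvasDataCli unpack for the requests table
#   4. restore_archived_gzips("requests_archive", "dataFiles/requests")
```

Passing the directories as arguments keeps the sketch independent of any particular config.js layout.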
I think you're convincing me not to update to the latest version of the CLI tool. I'm running version 0.2.2 from May 2016 that doesn't delete my requests file after downloading.
As far as the incremental nature goes, I only load the incremental files since the last time I loaded them. But I generally only load about a week's worth of requests table information as I want something small that I can develop with rather than loading the entire table. Sometimes I pre-process the requests table to remove the ID column and sometimes the user_agent string depending on what I'm working on at the time.
I don't use the CLI Tool for anything other than to download the files. I've got my own BASH script that queries the MySQL installation and only installs files newer than the current version in there. The MySQL script I use to create the tables also creates a version table to hold the last version loaded.
My last download was up to version 482 of the requests table (filename: 482_requests-00000-2a7b048c.gz), so if I wanted to install a week's worth I would just set the last version loaded to an appropriate value before that. If I want to skip the requests table completely and just update everything else (useful after schema changes when I'm not using the requests at that particular time), I just set the version to 1000 (something bigger than what I have) so it skips the requests table. For the requests table, it doesn't create one huge file first, it only loads the files one at a time, which makes it a lot easier to do incremental updates and doesn't waste the space needed to keep the uncompressed flat file around.
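To illustrate the version-based selection described above, here is a minimal Python sketch. It is not the import.sh script itself. It assumes the leading number in a filename such as 482_requests-00000-2a7b048c.gz is the version, and the function names and the extra filenames in the example are hypothetical.

```python
import re
from pathlib import Path

# Matches names like "482_requests-00000-2a7b048c.gz" (the pattern quoted above).
REQUESTS_FILE = re.compile(r"^(\d+)_requests-\d+-[0-9a-f]+\.gz$")


def version_of(filename):
    """Return the leading version number of a requests file, or None if it does not match."""
    match = REQUESTS_FILE.match(Path(filename).name)
    return int(match.group(1)) if match else None


def files_to_load(filenames, last_loaded_version):
    """Return requests files newer than last_loaded_version, oldest version first."""
    versioned = [(version_of(f), f) for f in filenames]
    newer = [(v, f) for v, f in versioned if v is not None and v > last_loaded_version]
    return [f for _, f in sorted(newer)]


# Hypothetical example: only the last filename comes from the post above.
files = [
    "480_requests-00000-aaaa1111.gz",
    "481_requests-00000-bbbb2222.gz",
    "482_requests-00000-2a7b048c.gz",
]
print(files_to_load(files, 480))   # the 481 and 482 files
print(files_to_load(files, 1000))  # [] because nothing is newer, so requests is skipped
```

Setting the stored version lower reloads more history, and setting it higher than any downloaded version skips the table, which mirrors the two tricks described above.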
I'm not saying that's the best way to do it, just what I've worked out. The MySQL schema and import.sh files are available at canvancement/canvas-data/mysql on GitHub. I would consider those starter scripts for people to get their Canvas Data into something quickly so they can start playing around, but then they'll want to customize it for their installation at some point. I have found that adding some extra indexes can really speed things up.
I would not even consider doing a full reload of the requests table every day -- we were given an old retired server with only 4GB RAM to work with and added some new hard drives. A full-update of everything but requests plus an incremental update of a week's worth of the requests table just took 38:15 to run. 2:49 of that was for 173MB of requests table files. That's just 0.87% of the total requests file were I to do a full load. If you assume a similar rate for the rest, that would be about 5.4 hours to import just the requests table and it would only be growing each day. For a reference, we have 20.8 GB of Canvas Data flat files, 19.8 GB of which is the requests table. We're not a large school, but we have been using Canvas since Fall 2012.
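The 5.4-hour estimate above can be reproduced with a few lines of Python. It rests on the same assumption stated in the post, namely that load time scales linearly with the amount of requests data.

```python
incremental_seconds = 2 * 60 + 49   # 2:49 to load the week's worth of requests files
fraction_of_total = 0.0087          # that increment is 0.87% of the full requests data

# Scale the measured time up to 100% of the data and convert to hours.
full_load_hours = incremental_seconds / fraction_of_total / 3600
print(f"Estimated full load: {full_load_hours:.1f} hours")  # Estimated full load: 5.4 hours
```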
The canvasDataCli sync operation is intended to maintain a local copy of the gzip files for all tables synchronised with the source. This means that for requests a sync operation will download any new requests*.gz files since the last sync operation by comparing the source files with the files stored in dataFiles/requests. The exception to this is that following periodic historical requests dumps the more recent small requests*.gz files are repackaged into larger files and renamed. The first sync operation after this will download the repackaged files and delete the old ones from the local store.
For all other tables the sync operation downloads a new set of gzip files each time it's run.
As I described earlier, we work around the issue by temporarily moving requests*.gz files older than a few hours out of the dataFiles/requests directory. The unpack operation then produces a requests.txt containing only data since the last sync operation (i.e. incremental data). The older files are then moved back, ready for the next sync operation.
If you need requests data for a specific day, use the list / grab operations to download an individual day's dump, then move the *requests*.gz files to dataFiles/requests and use the unpack operation to produce a requests.txt file for the day. Note that the gzip files downloaded in this way are named differently from those downloaded by the sync operation and are therefore not compatible.
If it helps I'm happy to share the shell script we have developed to automate this process. It also checks for error conditions, e.g. a schema version change, historical requests dump, sync failure etc. and prompts for manual intervention.
Thanks, Stuart, for sharing your way of using the CLI to pull Canvas data.
We run the sync operation daily to get the gzip files and are trying to use the unpack operation to pull only the daily requests data. It would be really helpful if you could share the automated shell script that you have developed.