Community Help

kdoherty · ‎09-02-2016

We would like some advice about loading the "Requests" table in the Canvas data database. My understanding is that for this table the Canvas Data CLI downloads a daily incremental file, which is then unzipped and appended to the cumulative "Requests" file during the unpack process. The cumulative file becomes quite large over time. Currently we perform a full reload of this large file every day.

Is there a way to configure the Canvas Data CLI to unpack only the current day's increment, so that we can append the latest activity rather than performing a complete refresh each day?

Thank you.

Kevin Doherty

george_markaria · ‎11-22-2016

I'm just curious whether you ever received an answer to this, Kevin. I am currently in the same situation, and I would also like to know if there is a specific command argument that would download and then unpack only the previous day's worth of data.

ccoan · ‎12-03-2016

Hello,

Just figured I'd drop in here, and leave a tidbit of knowledge. We have an internal tracker for this improvement since we've heard this feature request a lot. We don't yet have an ETA for it, but nonetheless it is assigned, and being worked on!

If you'd like you can submit a support case linking to this thread, asking to be added to the tracker so you can get automatically notified when it's released. (Although I'll try my best to remember to post here too!)

Thanks,

Eric

george_markaria · ‎01-23-2017

Hi Eric,

I just wanted to quickly follow up on this thread while we wait for the fix to be applied to the CLI tool.

I'm not quite sure how to submit a support case and wanted to check on the status of the fix. Is it possible to provide a status check since we are still struggling to get the delta data from the requests?

To give more detail on our current situation: we bring in the daily requests data with a fetch and unpack command. The dataFiles for requests are fetched and stored daily, but during the unpack we are unable to unpack only the previous day's worth of data.

Our unpack command currently looks like this:

call canvasDataCli unpack -c C:\CanvasData\config.js -f requests

If there were a way to add a parameter to this command so that it picks up only the previous day's worth of data using, say, the filename, that would be a solution. Something using a date or filename, such as the following, might work:

call canvasDataCli unpack -c C:\CanvasData\config.js -f -requests-00001-33c1c161

Thank you and please let me know if more info is needed or if there is a solution currently in place that we are unaware of.

ccoan · ‎01-23-2017

Hey George,

You can actually email canvasdatahelp@instructure.com (this will go straight to the Canvas Data Support Queue, SLA of 24 hours generally). We can then go ahead and attach it to the jira.

Although we haven't rolled out the update for this yet, we did roll out some major changes to how unpack works internally that should prevent out-of-memory errors in node, as well as make the process faster in general. So this is the next change in line for the unpack command specifically.

Unfortunately though I can't really provide an ETA if that's what you're looking for. Sorry!

That being said I'll try to put some extra pressure on this since it's causing problems for you guys.

subhashreebalu · ‎04-19-2019

Is there any update yet to the unpack command, specifically for the requests file?

a1222252 · ‎02-16-2017

Hi,

The main problem with data downloads of requests data is that all the gzip files have a timestamp of the download time rather than retaining the original creation timestamp. This makes it difficult to isolate incremental data.

Here's the approach I use daily:

1. Create an archive directory for requests gzip files.

2. Run canvasDataCli sync to download the latest requests gzip files into the dataFiles/requests directory.

3. Move any gzip files with a timestamp older than 1 hour to the archive directory.

4. Run canvasDataCli unpack to produce the incremental unpacked file.

5. Copy the gzip files from the archive directory back into the dataFiles/requests directory so that the next sync only downloads newer files.

6. Ready for the next sync.
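
Steps 3 and 5 above could be sketched roughly as follows. This is only an illustration under assumptions: the function names and the directory arguments are hypothetical, and `-mmin` and `-maxdepth` require GNU or BSD find.

```shell
#!/usr/bin/env bash
# Sketch of steps 3 and 5: park older requests gzip files, then restore them.
# Function names and directory arguments are hypothetical, not part of the CLI.

park_old_files() {
  # Move requests*.gz files last modified more than $3 minutes ago
  # from directory $1 into archive directory $2.
  local data_dir="$1" archive_dir="$2" max_age_min="$3"
  mkdir -p "$archive_dir"
  find "$data_dir" -maxdepth 1 -name 'requests*.gz' -mmin "+$max_age_min" \
    -exec mv {} "$archive_dir"/ \;
}

restore_files() {
  # Copy archived files back, preserving timestamps, so that the next
  # sync finds them in place and only downloads newer files.
  local data_dir="$1" archive_dir="$2"
  find "$archive_dir" -maxdepth 1 -name 'requests*.gz' \
    -exec cp -p {} "$data_dir"/ \;
}
```

In use, one would call `park_old_files` after the sync, run `canvasDataCli unpack -c config.js -f requests`, and then call `restore_files` with the same two directories.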

Regards,

Stuart.

James · ‎02-16-2017

I think you're convincing me not to update to the latest version of the CLI tool. I'm running version 0.2.2 from May 2016 that doesn't delete my requests file after downloading.

As far as the incremental nature goes, I only load the incremental files since the last time I loaded them. But I generally only load about a week's worth of requests table information as I want something small that I can develop with rather than loading the entire table. Sometimes I pre-process the requests table to remove the ID column and sometimes the user_agent string depending on what I'm working on at the time.

I don't use the CLI Tool for anything other than to download the files. I've got my own BASH script that queries the MySQL installation and only installs files newer than the current version in there. The MySQL script I use to create the tables also creates a version table to hold the last version loaded.

My last download was up to version 482 of the requests table (filename: 482_requests-00000-2a7b048c.gz), so if I wanted to install a week's worth I would just set the last version loaded to an appropriate value before that. If I want to skip the requests table completely and just update everything else (useful after schema changes when I'm not using the requests at that particular time), I just set the version to 1000 (something bigger than what I have) so it skips the requests table. For the requests table, it doesn't create one huge file first, it only loads the files one at a time, which makes it a lot easier to do incremental updates and doesn't waste the space needed to keep the uncompressed flat file around.
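
The version-based selection described above might look something like this minimal sketch. It is not the canvancement import.sh script; the function name is hypothetical, and it assumes file names shaped like the 482_requests-00000-2a7b048c.gz example, with the version number before the first underscore.

```shell
#!/usr/bin/env bash
# Sketch: list downloaded requests files whose numeric version prefix is
# greater than the last version loaded. Hypothetical helper, assuming names
# like 482_requests-00000-2a7b048c.gz.

files_after_version() {
  # $1 = directory of .gz files, $2 = last version already loaded
  local dir="$1" last="$2" path name version
  for path in "$dir"/*_requests-*.gz; do
    [ -e "$path" ] || continue        # glob matched nothing
    name=${path##*/}                  # strip the directory part
    version=${name%%_*}               # text before the first underscore
    case "$version" in
      ''|*[!0-9]*) continue ;;        # skip non-numeric prefixes
    esac
    [ "$version" -gt "$last" ] && printf '%s\t%s\n' "$version" "$path"
  done | sort -n | cut -f 2-          # order by version, print paths only
}
```

Passing a last version larger than anything downloaded (such as 1000 in the post) yields no files, which is how the requests table gets skipped.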

I'm not saying that's the best way to do it, just what I've worked out. The MySQL schema and import.sh files are available at canvancement/canvas-data/mysql on GitHub. I would consider those starter scripts for people to get their Canvas Data into something quickly so they can start playing around, but then they'll want to customize it for their installation at some point. I have found that adding some extra indexes can really speed things up.

I would not even consider doing a full reload of the requests table every day -- we were given an old retired server with only 4GB RAM to work with and added some new hard drives. A full-update of everything but requests plus an incremental update of a week's worth of the requests table just took 38:15 to run. 2:49 of that was for 173MB of requests table files. That's just 0.87% of the total requests file were I to do a full load. If you assume a similar rate for the rest, that would be about 5.4 hours to import just the requests table and it would only be growing each day. For a reference, we have 20.8 GB of Canvas Data flat files, 19.8 GB of which is the requests table. We're not a large school, but we have been using Canvas since Fall 2012.
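
The estimate above can be reproduced from the quoted figures. This sketch assumes 1 GB = 1000 MB and a constant import rate, as the post does; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extrapolate a full requests load from a timed partial load.
# Assumes a constant import rate and 1 GB = 1000 MB.

estimate_full_load() {
  # $1 = sample size in MB, $2 = sample load time in seconds,
  # $3 = total size in MB; prints "<percent of total> <estimated hours>"
  awk -v mb="$1" -v secs="$2" -v total="$3" 'BEGIN {
    printf "%.2f %.1f\n", 100 * mb / total, secs * total / mb / 3600
  }'
}

# 173 MB loaded in 2:49 (169 seconds) out of 19.8 GB (19800 MB)
estimate_full_load 173 169 19800   # prints "0.87 5.4"
```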

a1222252 · ‎04-20-2019

Hi Kevin,

The canvasDataCli sync operation is intended to maintain a local copy of the gzip files for all tables synchronised with the source. This means that for requests a sync operation will download any new requests*.gz files since the last sync operation by comparing the source files with the files stored in dataFiles/requests. The exception to this is that following periodic historical requests dumps the more recent small requests*.gz files are repackaged into larger files and renamed. The first sync operation after this will download the repackaged files and delete the old ones from the local store.

For all other tables the sync operation downloads a new set of gzip files each time it's run.

As I described in my earlier post, we work around the issue by temporarily moving requests*.gz files older than a few hours out of the dataFiles/requests directory. The unpack operation then produces a requests.txt containing only data since the last sync operation (i.e. incremental data). The older files are then moved back, ready for the next sync operation.

If you need requests data for a specific day, use the list / grab operations to download an individual day's dump, then move the *requests*.gz files to dataFiles/requests and use the unpack operation to produce a requests.txt file for the day. Note that the gzip files downloaded in this way are named differently from those downloaded by the sync operation and are therefore not compatible.

If it helps I'm happy to share the shell script we have developed to automate this process. It also checks for error conditions, e.g. a schema version change, historical requests dump, sync failure etc. and prompts for manual intervention.

Regards,

Stuart.

subhashreebalu · ‎04-22-2019

Thanks, Stuart, for sharing your way of using the CLI to pull Canvas data.
We use the sync operation daily to get the gzip files and are trying to use the unpack operation to pull the daily requests data. It would be really helpful if you could share the automated shell script that you have developed.

a1222252 · ‎04-22-2019

Hi,

Here are the important parts of the script. It's configured to remove requests*.gz files older than 6 hours from the dataFiles/requests directory, so if you run this every 24 hours or more you'll get a requests.txt file containing only incremental data.

. /home/odisrcint/.bash_profile
export LOGFILE=${HOME}/canvas_download.log
export SCHEMA_REF_FILE=${HOME}/schema.json.4.2.3
export SCHEMA_FILE=/odicanvas/dataFiles/schema.json
export SCHEMA_TEST_FILE=canvas_schema_test_file.txt
export SYNC_LOG=${HOME}/canvas_sync.log
#
echo "Canvas data download started at: `date`" | tee -a ${LOGFILE}
echo "Checking dump for historical requests dump..." | tee -a ${LOGFILE}
canvasDataCli list -c config.js | grep 'Number of Files' | head -3 | cut -d '[' -f 2 | cut -d ']' -f 1 | tee filecount.txt
echo "Number of files in the three most recent dumps: `cat filecount.txt | awk '{print}' ORS=':'`" | tee -a ${LOGFILE}
if [ `cat filecount.txt | sed -n '1 p' | cut -d '[' -f 2 | cut -d ']' -f 1` -lt 80 ]
then
echo "Historical requests dump detected. Local data files will not be synchronised." | tee -a ${LOGFILE}
/usr/sbin/sendmail -t < hist_dump.txt
# /bin/mailx -s "The most recent Canvas data dump is an historical requests dump. Canvas data not synchronised." dadmin02@adelaide.edu.au < /dev/null
echo "Canvas data download completed with errors at: `date`" | tee -a ${LOGFILE}
echo "---------------------------------------------------------------------------------------------------" | tee -a ${LOGFILE}
mv canvas_download.sh canvas_download.sh.donotrun
exit
elif [ `cat filecount.txt | sed -n '2 p' | cut -d '[' -f 2 | cut -d ']' -f 1` -lt 80 ]
then
echo "Historical requests dump detected. Local data files will not be synchronised." | tee -a ${LOGFILE}
/usr/sbin/sendmail -t < hist_dump.txt
# /bin/mailx -s "The second most recent Canvas data dump is an historical requests dump. Canvas data not synchronised." dadmin02@adelaide.edu.au < /dev/null
echo "Canvas data download completed with errors at: `date`" | tee -a ${LOGFILE}
echo "---------------------------------------------------------------------------------------------------" | tee -a ${LOGFILE}
mv canvas_download.sh canvas_download.sh.donotrun
exit
else
echo "Synchronising local datafiles..." | tee -a ${LOGFILE}
# canvasDataCli sync -c config.js | tee -a ${SYNC_LOG}
canvasDataCli sync -c config.js -l debug 2>&1 | tee -a ${SYNC_LOG}
echo "Local datafiles synchronised." | tee -a ${LOGFILE}
fi
#
echo "Checking for canvasDataCli sync errors and schema changes..." | tee -a ${LOGFILE}
export errchk=`tail -n 1 ${SYNC_LOG}`
echo "Last line of canvasDataCli sync log: ${errchk}" | tee -a ${LOGFILE}
echo "Expected schema version:" `grep -E '"version": "[0-9]\.[0-9]\.[0-9]",' ${SCHEMA_REF_FILE}` | tee -a ${LOGFILE}
echo "Actual schema version: " `grep -E '"version": "[0-9]\.[0-9]\.[0-9]",' ${SCHEMA_FILE}` | tee -a ${LOGFILE}
diff ${SCHEMA_REF_FILE} ${SCHEMA_FILE} > ${SCHEMA_TEST_FILE}
if [ "${errchk}" != "sync command completed successfully" ]
then
echo "canvasDataCli sync error. Local data files not synchronised." | tee -a ${LOGFILE}
mv ${SCHEMA_TEST_FILE} ${HOME}/schema/${SCHEMA_TEST_FILE}.`date +%Y%m%d%H%M%S`
/usr/sbin/sendmail -t < sync_error.txt
# /bin/mailx -s "Canvas data download sync error on odi-canvas VM. Canvas data not synchronised." dadmin02@adelaide.edu.au < /dev/null
echo "Canvas data download completed with errors at: `date`" | tee -a ${LOGFILE}
echo "---------------------------------------------------------------------------------------------------" | tee -a ${LOGFILE}
exit
elif [[ -s ${SCHEMA_TEST_FILE} ]]
then
echo "Schema version mismatch. FACT and DIM files will not be unpacked." | tee -a ${LOGFILE}
mv ${SCHEMA_TEST_FILE} ${HOME}/schema/${SCHEMA_TEST_FILE}.`date +%Y%m%d%H%M%S`
/usr/sbin/sendmail -t < schema_mismatch.txt
# /bin/mailx -s "Canvas data schema version mismatch. FACT and DIM files not unpacked." dadmin02@adelaide.edu.au < /dev/null
echo "Canvas data download completed with errors at: `date`" | tee -a ${LOGFILE}
echo "---------------------------------------------------------------------------------------------------" | tee -a ${LOGFILE}
exit
else
mv ${SYNC_LOG} ${HOME}/logs/canvas_sync.log.`date +%Y%m%d%H%M%S`
cp -p ${SCHEMA_FILE} ${HOME}/schema/schema.json.`date +%Y%m%d%H%M%S`
mv ${SCHEMA_TEST_FILE} ${HOME}/schema/${SCHEMA_TEST_FILE}.`date +%Y%m%d%H%M%S`
#
echo "No sync errors detected and schema version unchanged, unpacking FACT and DIM datafiles..." | tee -a ${LOGFILE}
chmod 644 /odicanvas/unpackedFiles/*.txt
canvasDataCli unpack -c config.js -f account_dim
canvasDataCli unpack -c config.js -f assignment_dim
canvasDataCli unpack -c config.js -f assignment_group_dim

~

canvasDataCli unpack -c config.js -f submission_file_fact
canvasDataCli unpack -c config.js -f wiki_fact
canvasDataCli unpack -c config.js -f wiki_page_fact
echo "FACT and DIM datafiles unpacked at: `date`" | tee -a ${LOGFILE}
#
echo "Isolating downloaded requests incremental datafiles..." | tee -a ${LOGFILE}
chmod 600 /odicanvas/dataFiles/requests/requests*.gz
# delete all requests*.gz files older than 6 hours...
find /odicanvas/dataFiles/requests -name "requests*.gz" -mmin +360 -exec rm {} \;
cp -p /odicanvas/unpackedFiles/requests.txt /odicanvas/requests_archive/requests.txt.`date +%Y%m%d%H%M%S`
cp -p /odicanvas/dataFiles/requests/requests*.gz /odicanvas/requests_gz_archive
gzip /odicanvas/requests_archive/requests.txt.*
chmod 400 /odicanvas/requests_archive/requests.txt.*
chmod 400 /odicanvas/requests_gz_archive/requests*.gz
#
echo "Unpacking requests incremental datafiles at: `date`" | tee -a ${LOGFILE}
canvasDataCli unpack -c config.js -f requests
echo "Copying requests datafiles back at: `date`" | tee -a ${LOGFILE}
cp -p /odicanvas/requests_gz_archive/requests*.gz /odicanvas/dataFiles/requests
chmod 444 /odicanvas/unpackedFiles/*.txt
echo "Requests incremental datafiles unpacked at: `date`" | tee -a ${LOGFILE}
#
echo "Number of unpacked dim files : `ls -l /odicanvas/unpackedFiles/*dim.txt|wc -l`" | tee -a ${LOGFILE}
echo "Number of unpacked fact files : `ls -l /odicanvas/unpackedFiles/*fact.txt|wc -l`" | tee -a ${LOGFILE}
echo "Total number of unpacked files: `ls -l /odicanvas/unpackedFiles/*.txt|wc -l`" | tee -a ${LOGFILE}
echo "Canvas data download completed at: `date`" | tee -a ${LOGFILE}
/usr/sbin/sendmail -t < success.txt
# /bin/mailx -s "Canvas data download successfully completed." dadmin02@adelaide.edu.au < /dev/null
echo "---------------------------------------------------------------------------------------------------" | tee -a ${LOGFILE}
fi
exit
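
The historical-dump check at the top of the script could be factored into two small functions, roughly as sketched here. The function names are hypothetical, and the sketch assumes the list output contains lines shaped like "Number of Files: [92]", which is inferred from the grep and cut pipeline in the script rather than from the CLI documentation. The threshold of 80 files is specific to that site.

```shell
#!/usr/bin/env bash
# Sketch: flag a historical requests dump from `canvasDataCli list` output.
# Assumes lines shaped like "Number of Files: [92]" (inferred, not documented).

file_counts() {
  # Read list output on stdin; print the bracketed count for the first N dumps.
  local n="${1:-3}"
  grep 'Number of Files' | head -n "$n" | cut -d '[' -f 2 | cut -d ']' -f 1
}

has_historical_dump() {
  # Read counts on stdin (one per line); succeed if any is below threshold $1.
  local threshold="$1" count
  while read -r count; do
    [ "$count" -lt "$threshold" ] && return 0
  done
  return 1
}
```

A call such as `canvasDataCli list -c config.js | file_counts 2 | has_historical_dump 80` would then cover the two most recent dumps, matching the two branches in the script.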

Canvas Data CLI: Incremental load of Requests table
