Canvas Data CLI - What am I missing?
CD-CLI (each mode sketched below):
Can sync by downloading and keeping every file on the local volume
- consumes a ridiculous amount of disk space
Can fetch individual tables
- downloads all files for that table
- but for the requests table, it downloads every requests file that hasn't expired
Can download a specific dump by id
- every file
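For reference, roughly what those modes look like on the command line (the grab subcommand and its -d flag are my reading of the CLI readme, so treat this as a sketch rather than gospel):
# sync: mirror every file for every table locally
canvasDataCli sync -c config.js
# fetch: every non-expired file for one table
canvasDataCli fetch -c config.js -t requests
# grab: every file from one specific dump by id
canvasDataCli grab -c config.js -d <dumpId>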
There doesn't seem to be a way to:
- Specify a list of tables I need to use and get all files for those tables
- Get the latest requests files without downloading previous increments or every other table in the dump
Are these assumptions correct? Is there another way?
I'm coming at this a little biased. I currently use @James' canvas data PHP code to load our data. His API is simple and fantastic, and I can pass an array of tables and skip everything else I don't want. We don't have the space to store every table in the database, and it's silly to store every table file on a local volume just to handle a sync operation. I'm trying to move away from PHP on our task server; many alternatives are better suited for this type of thing. I like the ease of the CLI and the responsive error messages, but it feels incomplete. I might try James' Perl script too; just tinkering with options at the moment.
I also read through Canvas Data CLI: Incremental load of Requests table
I've been working my way around this today with a little bash scripting...
- fetch-tables.txt is just a file with the tables I want to download, one per line (example below)
- download all files for each table
- delete the requests files that aren't from the current dump sequence
- unpack
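For example, fetch-tables.txt might look like this (table names here are just sample Canvas Data tables; swap in whatever you load):
user_dim
course_dim
enrollment_fact
requests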
#!/bin/bash
# robert carroll, ccsd-k12-obl
DOWNLOAD_DIR='/canvas/data/files-downloaded'
# clear old files (the glob must sit outside the quotes to expand)
rm -rf "$DOWNLOAD_DIR"/*
# get the latest schema for unpacking
wget "https://portal.inshosteddata.com/api/schema/latest" -O "$DOWNLOAD_DIR/schema.json"
# read table list into array, one table name per line
mapfile -t TABLES < fetch-tables.txt
# loop through tables array
for i in "${TABLES[@]}"
do
  # fetch all files for the table, tagging each output line with the table name
  canvasDataCli fetch -c config.js -t "$i" | sed "s/$/: $i/g"
  if [ "$i" == "requests" ]; then
    # get the sequence id of the most recent dump (list -j returns newest first)
    sequence_id=$(canvasDataCli list -c config.js -j | python3 -c 'import json,sys; print(json.load(sys.stdin)[0]["sequence"])')
    # delete all requests files not from the latest dump
    find "$DOWNLOAD_DIR/$i" -type f ! -name "$sequence_id-*.gz" -delete
  fi
done
# unpack files (table names are space-separated, not comma-separated)
echo 'unpacking files'
canvasDataCli unpack -c config.js -f "${TABLES[@]}"
# eof
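As an aside, if jq is on the box, the Python one-liner in the script can be swapped for this (same assumption as above, that list -j returns dumps newest first):
sequence_id=$(canvasDataCli list -c config.js -j | jq -r '.[0].sequence')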
I was having issues unpacking all the files at the end because the documentation shows comma separation for the table names... but it actually needs spaces (see canvasDataCli unpacking & adding headers).
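In other words (table names here are just examples):
# works: space-separated
canvasDataCli unpack -c config.js -f user_dim course_dim requests
# fails: comma-separated, as the docs show it
canvasDataCli unpack -c config.js -f user_dim,course_dim,requests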
The CLI readme also says you can only unpack after a sync operation... which I found is only because unpack needs the schema.json file, which I download with wget near the top of the script.
Next I'd be using James' bash import.sh, which I currently use, but with MSSQL swapped in via BCP.
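For anyone curious about the BCP side, here's a minimal sketch of that load loop, assuming unpack drops tab-delimited .txt files with a header row into an unpackedFiles directory (hence -F 2 to skip row 1); the server, database, and credentials are placeholders:
#!/bin/bash
# assumption: unpack output directory; match it to your config.js
UNPACK_DIR='/canvas/data/unpackedFiles'
for f in "$UNPACK_DIR"/*.txt; do
  table=$(basename "$f" .txt)
  # character mode (-c), skip the header row (-F 2)
  bcp "canvas_data.dbo.$table" in "$f" -S db-server -U canvas_loader -P "$BCP_PASS" -c -F 2
done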
I'd love to know how anyone else is dealing with this or what suggestions you might have.