
Canvas Data CLI -  What am I missing?

Question asked by Robert Carroll Champion on Nov 20, 2018
Latest reply on Sep 24, 2019 by Brilzen Varghese


Can sync by downloading and keeping every file on the local volume

   - consumes a ridiculous amount of disk space

Can fetch individual tables

   - downloads all files for that table

   - but for the requests table, downloads every requests file that hasn't expired

Can download a specific dump by id

   - every file


There doesn't seem to be a way to

- Specify a list of tables I need to use and get all files for those tables

- Get the latest Requests files, without downloading previous increments, or every other table in the dump


Are these assumptions correct? Is there another way?


I'm coming at this a little biased. I currently use James Jones' Canvas Data PHP code to load our data. His API is simple and fantastic, and I can pass an array of tables and skip over everything else I don't want. We don't have the space to store every table in the database, and it's silly to store every table file on a local volume just to handle a sync operation. I'm trying to move away from PHP on our task server; many alternatives are better suited to this type of thing. I like the ease of the CLI and its responsive error messages, but it feels incomplete. I might try James' Perl script too; I'm just tinkering with options at the moment.


I also read through Canvas Data CLI: Incremental load of Requests table


I've been working my way around this today with a little bash scripting...

   - fetch-tables.txt is just a file with the tables I want to download, 1 per line.

   - download all files for each table

   - delete the requests files that aren't from the current (latest) dump sequence

   - unpack

# robert carroll, ccsd-k12-obl

# clear old files (the glob must be outside the quotes to expand)
rm -rf "$DOWNLOAD_DIR"/*
# get the latest schema for unpacking
wget "" -O "$DOWNLOAD_DIR/schema.json"

# read table list into array
mapfile -t TABLES < fetch-tables.txt
# loop through tables array
for i in "${TABLES[@]}"; do
  # fetch table files, tagging each output line with the table name
  canvasDataCli fetch -c config.js -t "$i" | sed "s/$/: $i/g"
  if [ "$i" == "requests" ]; then
    # get the sequence id of the most recent dump
    sequence_id=$(canvasDataCli list -c config.js -j | python -c 'import json,sys;print(json.load(sys.stdin)[0]["sequence"])')
    # delete all requests files not from the latest dump
    find "$DOWNLOAD_DIR/$i" -type f ! -name "$sequence_id-*.gz" -delete
  fi
done

# unpack files
echo 'unpacking files'
# join the array with spaces; unpack wants a space-separated table list
UNPACK="${TABLES[*]}"
# $UNPACK is intentionally unquoted so it splits into separate arguments
canvasDataCli unpack -c config.js -f $UNPACK

# eof
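To sanity-check the sequence-id extraction above without hitting the API, you can feed the same one-liner a canned response. The JSON shape here (an array of dumps, newest first, each carrying a "sequence" field) is my assumption about what `canvasDataCli list -j` returns, based on what the script reads out of it:

```shell
# Assumed shape of `canvasDataCli list -c config.js -j` output:
# a JSON array of dumps, newest first, each with a "sequence" field.
sample='[{"dumpId": "abc", "sequence": 512}, {"dumpId": "def", "sequence": 511}]'

# Same extraction as in the script (python3 here; the script calls plain python):
sequence_id=$(echo "$sample" | python3 -c 'import json,sys;print(json.load(sys.stdin)[0]["sequence"])')
echo "$sequence_id"
```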

I was having issues unpacking all the files at the end, because the documentation shows the tables comma-separated... but the command actually needs spaces. See canvasDataCli unpacking & adding headers
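For reference, this is how the space-separated table list gets built from the bash array; the table names below are just examples:

```shell
TABLES=(user_dim course_dim requests)
# "${TABLES[*]}" joins the elements with the first character of IFS,
# a space by default -- which is what unpack expects, not commas
UNPACK="${TABLES[*]}"
echo "canvasDataCli unpack -c config.js -f $UNPACK"
```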

The CLI readme also says you can only unpack after a sync operation... which I found is only because unpack needs the schema.json file, which the script downloads up front with wget.


Next I'd be using James' bash script, which I currently use but with MSSQL swapped in via MS/BCP.


I'd love to know how anyone else is dealing with this or what suggestions you might have.