
Canvas Data CLI -  What am I missing?

Question asked by Robert Carroll on Nov 20, 2018
Latest reply on Nov 21, 2018 by Robert Carroll

CD-CLI:

Can sync by downloading and keeping every file on the local volume

   - encumbers a ridiculous amount of disk space

Can fetch individual tables

   - downloads all files for that table

   - but for the requests table, it downloads every requests file that hasn't expired

Can download a specific dump by id

   - every file

 

There doesn't seem to be a way to

- Specify a list of tables I need to use and get all files for those tables

- Get the latest Requests files, without downloading previous increments, or every other table in the dump

 

Are these assumptions correct? Is there another way?

 

I'm coming at this a little biased. I currently use James Jones' Canvas Data PHP code to load our data. His API is simple and fantastic, and I can pass an array of tables and skip everything else I don't want. We don't have the space to store every table in the database, and it's silly to store every table file on a local volume just to handle a sync operation. I'm trying to move away from PHP* on our task server; many alternatives are better suited for this type of thing. I like the ease of the CLI and its responsive error messages, but it feels incomplete. I might try James' Perl script too; just tinkering with options at the moment.

 

I also read through the thread "Canvas Data CLI: Incremental load of Requests table".

 

I've been working my way around this today with a little bash scripting...

   - fetch-tables.txt is just a file with the tables I want to download, one per line.

   - download all files for each table

   - delete the requests files that aren't from the current dump sequence

   - unpack
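
For reference, a fetch-tables.txt for the steps above might look like this (the table names here are just examples; list whichever Canvas Data tables you actually need):

```
user_dim
course_dim
requests
```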

#!/bin/bash
# robert carroll, ccsd-k12-obl

DOWNLOAD_DIR='/canvas/data/files-downloaded'
# clear old files
rm -rf "$DOWNLOAD_DIR"/*
# get the latest schema for unpacking
wget "https://portal.inshosteddata.com/api/schema/latest" -O "$DOWNLOAD_DIR/schema.json"

# read table list into array
mapfile -t TABLES < fetch-tables.txt
# loop through tables array
for i in "${TABLES[@]}"
do
  # fetch table files
  canvasDataCli fetch -c config.js -t "$i" | sed "s/$/: $i/g";
  if [ "$i" == "requests" ]; then
    # get the most recent sequence id, latest dump
    sequence_id=$(canvasDataCli list -c config.js -j | python -c 'import json,sys;obj=json.load(sys.stdin);print(obj[0]["sequence"])');
    # delete all request files not from the latest dump
    find "$DOWNLOAD_DIR/$i" -type f ! -name "$sequence_id-*.gz" -delete;
  fi
done

# unpack files
echo 'unpacking files'
UNPACK="${TABLES[*]}";  # join table names with spaces (default IFS)
canvasDataCli unpack -c config.js -f $UNPACK;  # unquoted on purpose: word splitting is intended

# eof
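
Side note on the inline Python above: the original used the Python 2 print statement. A Python 3 friendly version of the same extraction looks like this — the sample payload shape is an assumption on my part, based on what `canvasDataCli list -j` appears to return (a JSON array of dumps, newest first):

```python
import json

# Simulated output of `canvasDataCli list -c config.js -j`;
# the exact fields here are assumed for illustration.
sample = '[{"sequence": 212, "dumpId": "abc"}, {"sequence": 211, "dumpId": "def"}]'

dumps = json.loads(sample)
sequence_id = dumps[0]["sequence"]  # latest dump's sequence number
print(sequence_id)
```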

I was having issues unpacking all the files at the end, because the documentation shows the tables comma-separated... but they need to be space-separated. See the thread "canvasDataCli unpacking & adding headers".
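
In other words, joining the table array with spaces is what unpack wants. A quick sketch of the join (just echoing the command here rather than actually running the CLI):

```shell
#!/bin/bash
TABLES=(user_dim course_dim requests)
# "${TABLES[*]}" joins array elements with the first character of IFS,
# which is a space by default -- exactly the separator unpack expects
UNPACK="${TABLES[*]}"
echo canvasDataCli unpack -c config.js -f "$UNPACK"
```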

The CLI readme also says you can only unpack after a sync operation... which, I found, is only because unpack needs the schema.json file — the one I download with wget near the top of the script.

 

Next I'd use James' bash import.sh, which I currently use today, but with MSSQL swapped in via BCP.

 

I'd love to know how anyone else is dealing with this, or what suggestions you might have.
