[ARCHIVED] Canvas Data CLI - What am I missing?
CD-CLI:
Can sync by downloading and keeping every file on the local volume
- consumes a ridiculous amount of disk space
Can fetch individual tables
- downloads all files for that table
- but for the requests table, that means every requests file that hasn't expired
Can download a specific dump by id
- every file
There doesn't seem to be a way to
- Specify a list of tables I need to use and get all files for those tables
- Get the latest Requests files, without downloading previous increments, or every other table in the dump
Are these assumptions correct? Is there another way?
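For reference, here's roughly how those three download modes look from the shell. The table name and dump id are placeholders, and I believe the dump-by-id command is grab, but check the readme for your version:
# sync: download and keep every file locally
canvasDataCli sync -c config.js
# fetch: every file for a single table
canvasDataCli fetch -c config.js -t user_dim
# grab: every file in a specific dump, by id
canvasDataCli grab -c config.js -d <dumpId>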
I'm coming at this a little biased. I currently use @James' canvas data PHP code to load our data. His API is simple and fantastic, and I can pass it an array of tables and skip everything else I don't want. We don't have the space to store every table in the database, and it's silly to store every table file on a local volume just to support a sync operation. I'm also trying to move away from PHP on our task server; many alternatives are better suited to this kind of work. I like the ease of the CLI and the responsive error messages, but it feels incomplete. I might try James' Perl script too; I'm just tinkering with options at the moment.
I also read through Canvas Data CLI: Incremental load of Requests table
I've been working my way around this today with a little bash scripting...
- fetch-tables.txt is just a file with the tables I want to download, one per line (example after this list)
- download all files for each table
- delete any requests files that aren't from the current dump sequence
- unpack
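For example, fetch-tables.txt might look like this (swap in whatever tables you actually need):
user_dim
course_dim
enrollment_dim
requests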
#!/bin/bash
# robert carroll, ccsd-k12-obl
DOWNLOAD_DIR='/canvas/data/files-downloaded'
# clear old files (the glob must sit outside the quotes to expand; :? guards against an unset var)
rm -rf "${DOWNLOAD_DIR:?}"/*
# get the latest schema for unpacking
wget "https://portal.inshosteddata.com/api/schema/latest" -O "$DOWNLOAD_DIR/schema.json"
# read table list into array
mapfile -t TABLES < fetch-tables.txt
# loop through tables array
for i in "${TABLES[@]}"
do
# fetch all files for this table; tag each line of output with the table name
canvasDataCli fetch -c config.js -t "$i" | sed "s/$/: $i/"
if [ "$i" == "requests" ]; then
# get the sequence id of the latest dump (the print() form works under python 2 or 3)
sequence_id=$(canvasDataCli list -c config.js -j | python -c 'import json,sys; print(json.load(sys.stdin)[0]["sequence"])')
# delete all request files not from the latest dump
find "$DOWNLOAD_DIR/$i" -type f ! -name "$sequence_id-*.gz" -delete;
fi
done
# unpack files
echo 'unpacking files'
# expand the table list as separate, space-separated arguments
canvasDataCli unpack -c config.js -f "${TABLES[@]}"
# eof
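Side note: if jq happens to be on the box, that python one-liner for the sequence id could be swapped for something shorter (assuming the same JSON shape from canvasDataCli list -j):
sequence_id=$(canvasDataCli list -c config.js -j | jq -r '.[0].sequence')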
I was having issues unpacking all the files at the end, because the documentation shows the table list comma-separated... but it needs spaces. See canvasDataCli unpacking & adding headers
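In other words, this is the invocation that actually works:
canvasDataCli unpack -c config.js -f user_dim course_dim requests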
The CLI readme also says you can only unpack after a sync operation... which I found is only because unpack needs the schema.json file, which I download with wget near the top of the script.
Next I'd run James' bash import.sh, which I currently use, but with MSSQL swapped in via BCP.
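For anyone curious about the MSSQL side, here's a minimal sketch of the bcp call I mean. The server, database, login, and unpacked file path are all placeholders; the target table has to exist already, and -F 2 skips the header row that unpack adds:
# load one unpacked, tab-delimited table file into MSSQL with bcp
bcp CanvasData.dbo.user_dim in /canvas/data/unpackedFiles/user_dim.txt -S sqlhost -U loader -P "$SQL_PASS" -c -t '\t' -F 2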
I'd love to know how anyone else is dealing with this, or what suggestions you might have.