Embulk is an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.embulk.orggithubcontributors
Simply put, Embulk makes importing gzipped CSV files into any RDBMS* and managing the data and workflow necessary for Canvas Data using command line tools easy, really easy, specifically solving issues we experience working with Canvas Data without fancier tools.
Insert, Insert Direct, Replace, Merge, Truncate and Truncate Insert
TimeZone conversion from UTC for date time columns
before_load and after_load, config options to run queries before (truncate) and after import (indexes)
Embulk uses YAML config files for each task, for Canvas Data this means each input source (table files) and it's output destination (db table) is 1 file. This includes differences between staging, test and production destinations. I imagine your workflow and setup will be different than mine and many others. You may only need a few tables, or only have one database, or you could simply use Embulk to manage, manipulate, filter and possibly join CSV files to examine with Tableau if that's your thing. For this reason, I have only shared each set of config files for MySQL, MSSQL, Oracle, and PostgreSQL. I have not worked with RedShift.
Our old workflow, requires that we attempt to maintain the newest data from Canvas Data for reporting, attendance, API services and automation, and LTIs. One of our biggest issues is the size of the daily batch without deltas and the growing use of Canvas within our schools and how long importing everything can take, how slow and unnecessary it is to hold 6 years worth of data for this semester, tried different things in SQL and bash to limit the data quickly for the current school year in production, never implement. LTI queries for attendance and submissions are really slow. Then some days the downloaded files are 0 bytes, we must have lost internet, or there was duplicates and the table didn't load, and it takes until 2pm to get everything loaded. Sometimes there's new columns in the table and I forgot to read the release notes and we've truncated the table before importing, and it takes hours to import. And so on.
Some of these are human, some of these are manageable.
Our new workflow uses Embulk
Download with Canvas Data CLI, some of that documented here
Import all CD tables using CSV in SQL out to staging environment with Replace mode, this creates temporary tables for the import, if it fails, the previous version is still intact. After successful import, Embulk will drop the old table and run the after_load queries, I use this for enumerable constraints and indexes. I left a lot of examples in the configs.
The Requests table config uses Insert mode to append the new rows.
I use staging for Tableau reporting. For production, I only need to load the tables necessary for our LTIs and API services. Some of these configs are straight copies of the staging imports, except they point to production. Some of the configs create new tables using SQL in SQL out and importing filtered or composite tables from query results using https://github.com/embulk/embulk-input-jdbc
I used to develop web applications and content management systems; since 2013 I've been cobbling together Canvas integrations and features for K12 Online & Blended Learning...KOBL. From LAMP stack to Canvas stack(?), and trying to make it portable one project at a time.