Managing Canvas Data with Embulk

robotcars · 07-05-2019

Embulk is an open-source bulk data loader that helps transfer data between various databases, storage systems, file formats, and cloud services. (embulk.org, GitHub, contributors)

Simply put, Embulk makes it easy, really easy, to import gzipped CSV files into any RDBMS* and to manage the data and workflow that Canvas Data requires, all from the command line. It specifically solves issues we experience working with Canvas Data without fancier tools.

with support for

Linux, OSX, Windows https://github.com/embulk/embulk#quick-start

MySQL, MS SQL Server, Oracle, PostgreSQL, RedShift https://github.com/embulk/embulk-output-jdbc

* Embulk goes beyond SQL; see the List of Embulk Plugins by Category.

and features useful for Canvas Data

Decode gzipped files
The ability to intelligently guess the format and data types of CSV files
Parallel execution of tasks, multi-threading per CPU core, and a task for each batch file
Input CSV Plugin as default Input for Embulk
Filter data with Filter Plugins, https://plugins.embulk.org/#filter
- Add and remove columns, for example to drop deprecated columns that will always be NULL
- Filter rows with SQL like syntax https://github.com/sonots/embulk-filter-row
- Unique, Distinct, JOIN (CSV files)
Output Data to SQL
- Insert, Insert Direct, Replace, Merge, Truncate and Truncate Insert
- Timestamp formatting
- TimeZone conversion from UTC for date time columns
- before_load and after_load, config options to run queries before (truncate) and after import (indexes)
- and more
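
Several of these features come down to a few lines of config. The fragment below is a minimal sketch, not taken from the shared configs. The thread count, the `workflow_state` condition, and the dropped column name are illustrative assumptions, and the two filters assume the embulk-filter-row and embulk-filter-column plugins are installed.

```yaml
# Fragment only: the in: and out: sections are omitted here.
exec:
  max_threads: 8              # assumed value; caps how many task threads run at once

filters:
  # embulk-filter-row: keep only rows matching an SQL-like condition
  - type: row
    where: |-
      workflow_state != 'deleted'
  # embulk-filter-column: drop a column before it reaches the database
  - type: column
    drop_columns:
      - {name: deprecated_column}   # hypothetical column name
```

Filters run in the order listed, between the input and output plugins, so the database only ever sees the rows and columns that survive them.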

Embulk uses YAML config files for each task. For Canvas Data this means each input source (table files) and its output destination (db table) is one file. This includes differences between staging, test and production destinations. I imagine your workflow and setup will be different than mine and many others. You may only need a few tables, or only have one database, or you could simply use Embulk to manage, manipulate, filter and possibly join CSV files to examine with Tableau if that's your thing. For this reason, I have only shared each set of config files for MySQL, MSSQL, Oracle, and PostgreSQL. I have not worked with RedShift.
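
As a rough illustration of what one such file can contain, here is a minimal sketch of a single table's config for a MySQL staging database. It is not copied from the shared configs. The paths, credentials, table and column names, index, and time zone are placeholders, and the tab delimiter, `\N` null marker, and timestamp format are assumptions about how the Canvas Data files are written. In practice `embulk guess` or the published schema would supply the real column list.

```yaml
in:
  type: file
  path_prefix: /data/canvas/course_dim/    # placeholder; matches every file under this prefix
  decoders:
    - {type: gzip}                          # decompress the .gz batch files
  parser:
    type: csv
    charset: UTF-8
    newline: LF
    delimiter: "\t"                         # assumption: tab-separated files
    null_string: '\N'                       # assumption: NULLs are written as \N
    skip_header_lines: 0
    columns:                                # illustrative subset, not the real schema
      - {name: id, type: long}
      - {name: name, type: string}
      - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
      - {name: workflow_state, type: string}

out:
  type: mysql                               # embulk-output-mysql, from embulk-output-jdbc
  host: localhost                           # placeholder connection details
  user: embulk
  password: change-me
  database: canvas_data_staging
  table: course_dim
  mode: replace                             # load into a new table, then swap it in
  default_timezone: America/Los_Angeles     # placeholder; used when formatting timestamps
  after_load: "ALTER TABLE course_dim ADD INDEX idx_workflow_state (workflow_state)"
```

With this layout, a test or production variant of the same table would differ only in the `out:` section.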

Our old workflow requires that we attempt to maintain the newest data from Canvas Data for reporting, attendance, API services and automation, and LTIs. One of our biggest issues is the size of the daily batch without deltas, combined with the growing use of Canvas within our schools: importing everything takes a long time, and it is slow and unnecessary to hold 6 years' worth of data when we only need this semester. I tried different things in SQL and bash to quickly limit production data to the current school year, but never implemented any of them. LTI queries for attendance and submissions are really slow. Then some days the downloaded files are 0 bytes (we must have lost internet), or there were duplicates and the table didn't load, and it takes until 2pm to get everything loaded. Sometimes there are new columns in a table, I forgot to read the release notes, we've already truncated the table before importing, and it takes hours to import. And so on.

Some of these problems are human, some are manageable.

Our new workflow uses Embulk

Download with the Canvas Data CLI; some of that is documented here.
Import all CD tables using CSV in, SQL out to the staging environment with Replace mode. This creates temporary tables for the import; if it fails, the previous version is still intact. After a successful import, Embulk drops the old table and runs the after_load queries, which I use for enumerable constraints and indexes. I left a lot of examples in the configs.

The Requests table config uses Insert mode to append the new rows.
I use staging for Tableau reporting. For production, I only need to load the tables necessary for our LTIs and API services. Some of these configs are straight copies of the staging imports, except they point to production. Other configs create new tables using SQL in, SQL out, importing filtered or composite tables from query results with https://github.com/embulk/embulk-input-jdbc

Here's an example: https://github.com/ccsd/canvas-data-embulk-configs/wiki/SQL-in-SQL-out
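
For readers who want the shape of the idea without following the link, this is a minimal sketch of a SQL in, SQL out config, assuming MySQL on both sides. It is not the config from the wiki page. The hosts, credentials, query, and the `active_enrollments` table name are hypothetical, and the columns selected are only illustrative.

```yaml
in:
  type: mysql                        # embulk-input-mysql, from embulk-input-jdbc
  host: staging-db.example.org       # placeholder connection details
  user: embulk
  password: change-me
  database: canvas_data_staging
  # The query result becomes the input; only these rows and columns are transferred.
  query: |
    SELECT e.id, e.course_id, e.user_id, e.type
    FROM enrollment_dim e
    WHERE e.workflow_state = 'active'

out:
  type: mysql                        # embulk-output-mysql, from embulk-output-jdbc
  host: production-db.example.org    # placeholder connection details
  user: embulk
  password: change-me
  database: canvas_data_production
  table: active_enrollments          # hypothetical filtered table built from the query
  mode: replace                      # previous copy stays intact if the load fails
```

Because the filtering happens in the query on staging, production only receives the smaller result set.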

On RHEL 7 with 6 CPUs (12 cores) and 16 GB of RAM, Embulk imports 7.9 GB of CSVs into >1 TB of SQL (no requests) in less than 4.5 hours, depending on which indexes you keep in the configs.