Your Community is getting an upgrade!
Read about our partnership with Higher Logic and how we will build the next generation of the Instructure Community.
Embulk is an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services. embulk.org github contributors
Simply put, Embulk makes importing gzipped CSV files into any RDBMS* and managing the data and workflow necessary for Canvas Data using command line tools easy, really easy, specifically solving issues we experience working with Canvas Data without fancier tools.
with support for
Linux, OSX, Windows https://github.com/embulk/embulk#quick-start
MySQL, MS SQL Server, Oracle, PostgreSQL, RedShift https://github.com/embulk/embulk-output-jdbc
* Embulk goes beyond SQL, List of Embulk Plugins by Category
and features useful for Canvas Data
Embulk uses YAML config files for each task, for Canvas Data this means each input source (table files) and it's output destination (db table) is 1 file. This includes differences between staging, test and production destinations. I imagine your workflow and setup will be different than mine and many others. You may only need a few tables, or only have one database, or you could simply use Embulk to manage, manipulate, filter and possibly join CSV files to examine with Tableau if that's your thing. For this reason, I have only shared each set of config files for MySQL, MSSQL, Oracle, and PostgreSQL. I have not worked with RedShift.
Our old workflow, requires that we attempt to maintain the newest data from Canvas Data for reporting, attendance, API services and automation, and LTIs. One of our biggest issues is the size of the daily batch without deltas and the growing use of Canvas within our schools and how long importing everything can take, how slow and unnecessary it is to hold 6 years worth of data for this semester, tried different things in SQL and bash to limit the data quickly for the current school year in production, never implement. LTI queries for attendance and submissions are really slow. Then some days the downloaded files are 0 bytes, we must have lost internet, or there was duplicates and the table didn't load, and it takes until 2pm to get everything loaded. Sometimes there's new columns in the table and I forgot to read the release notes and we've truncated the table before importing, and it takes hours to import. And so on.
Some of these are human, some of these are manageable.
Our new workflow uses Embulk
Using RHEL7, 6 CPUs with 12 cores, and 16GB Ram, Embulk imports 7.9GB of CSVs into >1TB of SQL (no requests) in less than 4.5 hours, depending on which indexes you keep in the configs.
GitHub - ccsd/canvas-data-embulk-configs: YAML configs for importing Canvas Data with Embulk
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Constantly curious about Canvas, customization and Community collaboration.
To interact with Panda Bot, our automated chatbot, you need to sign up or log in:
Sign InTo interact with Panda Bot, our automated chatbot, you need to sign up or log in:
Sign In