venkat5121
Community Member

Quicker ways to Download Canvas Data


We have created a JSON script to download Canvas Data through the API. When we first started in 2017, it downloaded all 72 files in less than 5 minutes; now it takes anywhere between 30 and 80 minutes. After that, we load these files individually into an Oracle database, which takes the whole night.

I am trying to see if there is a better design I should opt for to get this down to at least a 3-4 hour window.
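For concreteness, a parallel version of the download step might look something like the sketch below (hypothetical paths; it assumes the list of (filename, signed URL) pairs for the latest dump has already been retrieved from the Canvas Data API):

```python
# Minimal sketch, not our production script: download the dump files in parallel
# instead of one at a time. Standard library only.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen
import shutil

DOWNLOAD_DIR = Path("/data/canvas/dataFiles")      # hypothetical target directory

def download(item):
    filename, url = item                           # url is the signed URL from the API
    target = DOWNLOAD_DIR / filename
    with urlopen(url) as resp, target.open("wb") as out:
        shutil.copyfileobj(resp, out)
    return filename

def download_all(files, workers=8):
    """files: iterable of (filename, signed_url) pairs for the latest dump."""
    DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name in pool.map(download, files):
            print("downloaded", name)
```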

I have tried the direct-to-MySQL scripts in this community and the Canvas Data Viewer tool, but they have their own limitations.

Any expert opinions or suggestions are most welcome.

1 Solution

Accepted Solutions
robotcars
Community Champion

Since CD is constantly growing from the start of your instance, the download and load time is always going to increase until there is a delta data option... but we've come up with some faster solutions; check these out.

Build a Canvas Data Warehouse on AWS in 30 minutes!

low cost solution with AWS CloudFormation Stack

or my setup

Managing Canvas Data with Embulk

YAML configs for importing Canvas Data with Embulk, for any RDBMS and more

Canvas Data: CLI

Instructure's command line tool to connect to and download files from Canvas Data

If you're using Oracle and choose Embulk, @afarnsworth was working on implementing the OCI drivers vs. the JDBC defaults for quicker imports.
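For reference, one of those Embulk YAML configs boils down to a file input plus an Oracle output; a minimal sketch for a single table might look like this (table, columns, paths and credentials are all hypothetical, and the exact output options should be checked against the embulk-output-oracle docs):

```yaml
# Illustrative only: not the real Canvas Data schema or anyone's actual credentials.
in:
  type: file
  path_prefix: /data/canvas/dataFiles/user_dim   # hypothetical download location
  decoders:
    - {type: gzip}                               # Canvas Data files arrive gzipped
  parser:
    type: csv
    delimiter: "\t"                              # tab-delimited, no header row
    skip_header_lines: 0
    null_string: "\\N"
    charset: UTF-8
    columns:
      - {name: id,        type: long}
      - {name: canvas_id, type: long}
      - {name: name,      type: string}
      # ...remaining columns per the Canvas Data schema docs
out:
  type: oracle                                   # embulk-output-oracle plugin
  host: db.example.edu
  port: 1521
  database: CANVAS
  user: canvas_load
  password: "********"
  table: USER_DIM
  mode: truncate_insert                          # reload the whole table each run
  # insert_method: oci                           # the OCI path mentioned above
```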


4 Replies

a1222252
Community Participant

Hi,

The canvasDataCli tool does do incremental downloads. Once the base set of files is in the local datastore, the next time a sync operation is run only changed files are downloaded. For us, the nightly sync and unpack operations on all tables take about 25 minutes on a small Linux host (a virtual machine). Uploading the text files into a local Oracle database using Oracle Data Integrator then takes about 45 minutes.
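For anyone scripting this, the nightly job is basically two canvasDataCli calls followed by the database load; a rough sketch (hypothetical paths and a trimmed table list, with the command flags as I recall them from the canvas-data-cli README, so double-check against your installed version):

```python
# Rough sketch of a nightly driver around canvasDataCli; paths and table list are
# hypothetical, and the flags should be verified against `canvasDataCli --help`.
import subprocess

CONFIG = "/etc/canvas/config.js"                       # canvasDataCli config file
TABLES = ["user_dim", "course_dim", "enrollment_dim"]  # illustrative subset

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)                    # fail loudly if a step breaks

# 1. Incremental sync: only files changed since the last run are downloaded.
run(["canvasDataCli", "sync", "-c", CONFIG])

# 2. Unpack the tables we load (gunzip, concatenate parts, add a header row).
run(["canvasDataCli", "unpack", "-c", CONFIG, "-f", *TABLES])

# 3. Hand off to the database load (ODI in our case) from here.
```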

The only table of any real size is requests. To get just the delta for this, we separate the gz files by timestamp and unpack only the new files. The only problem then is detecting and dealing with the regular historical dumps, where the files downloaded since the previous historical dump are repackaged into fewer, larger gz files. We handle this manually at present.

For context, we have been using Canvas since mid-2016 and our requests table is approaching 2.5 billion records and is over 1TB in size. This table is partitioned by month for performance reasons.

Regards,

Stuart.

Thank you, @stuart_smith.

I will try this approach.

Are you unpacking all the files and loading them into the Oracle database daily, or unpacking only the newly changed files?

Is there a logic that you follow to identify such newly changed files (apart from the updated timestamp in the unpack folder)?

Also, does your ODI process do update/insert or truncate/insert?

I will deal with requests later; I am not using it for now. 🙂

a1222252
Community Participant

Hi,

The gz files for all tables except requests change daily, so we sync and unpack them all. The ODI mappings are all truncate / insert. I'm not aware of any way to segregate incremental data; I think the whole data set for these tables is repackaged into new gz files daily. Having said that, the data set is not large, so we haven't had a need to look into it.

There are a number of fields which contain data larger than the 4,000-byte limit of the varchar2 datatype; these need to be converted to CLOBs. Also, we have found that the data contains multibyte characters, so we use varchar2(4000 char) for most columns and a Unicode character set for the database.
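For illustration, on the Oracle side that pattern comes out roughly like this (a hypothetical table, not the actual Canvas Data schema):

```sql
-- Hypothetical example only, not the real Canvas Data schema.
CREATE TABLE example_dim (
  id       NUMBER(19)           NOT NULL,
  title    VARCHAR2(4000 CHAR),   -- character length semantics for multibyte data
  message  CLOB                   -- fields that can exceed the 4,000-byte varchar2 limit
);
```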

The requests gz files are different. New files are added to the data set each day, with the previous files remaining unchanged. This means that the timestamps on the files in the download target directory will tell you when they were downloaded. By moving the files which have already been uploaded to the database to a temporary location, we can unpack just the most recent files to create a requests.txt file with incremental data. The temporary files then have to be moved back so that the next sync operation doesn't find that they are missing and download them again.

Every couple of months there is a historical dump of requests data. This replaces the gz files since the previous historical dump with a smaller number of larger files. This is tricky to manage, but let me know when you are ready to use requests data and I can let you know how we handle it.
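To make that concrete, here is a rough sketch of the incremental unpack idea (hypothetical paths; it records a high-water-mark timestamp instead of moving the already-loaded gz files aside and back, and it does not handle the historical repackaging):

```python
# Rough sketch: unpack only the requests gz files that arrived since the last load.
# Paths are hypothetical; uses a recorded timestamp rather than moving files around.
import gzip
import json
import shutil
from pathlib import Path

SAVE_DIR = Path("/data/canvas/dataFiles/requests")      # canvasDataCli download area
STATE = Path("/data/canvas/requests_last_load.json")    # remembers the last load point
OUT = Path("/data/canvas/unpacked/requests_delta.txt")  # incremental data for the DB load

last_load = json.loads(STATE.read_text())["mtime"] if STATE.exists() else 0.0

# Only gz files that appeared after the previous load are new.
new_files = sorted(p for p in SAVE_DIR.glob("*.gz") if p.stat().st_mtime > last_load)

OUT.parent.mkdir(parents=True, exist_ok=True)
with OUT.open("wb") as out:
    for gz in new_files:
        with gzip.open(gz, "rb") as f:
            shutil.copyfileobj(f, out)                  # concatenate tab-delimited rows

# Record the new high-water mark so the next run skips these files.
if new_files:
    STATE.write_text(json.dumps({"mtime": max(p.stat().st_mtime for p in new_files)}))
```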

Regards,

Stuart.