cancel
Showing results for 
Search instead for 
Did you mean: 

Build a Canvas Data Warehouse on AWS in 30 minutes!

ColinMurtaugh
Community Champion
29 15 6,629

Introduction

Canvas Data provides a wealth of information that can be used in many interesting ways, but there are a few hurdles that can make it hard to even get started:

  • The Canvas Data API uses a different authentication mechanism than the one that you're probably already used to using with the Canvas API.
  • Data is provided in compressed tab-delimited files. To be useful, they typically need to be loaded into some kind of database.
  • For larger tables, data is split across multiple files, each of which must be downloaded and loaded into your database.
  • The size of the data can be unwieldy for use with locally-running tools such as Excel.
  • Maintaining a local database and keeping up with schema changes can be tedious.

This tutorial will show you how to build a data warehouse for your Canvas Data using Amazon Web Services. Besides solving all of the problems above, this cloud-based solution has several advantages compared to maintaining a local database:

  • You can easily keep your warehouse up to date by scheduling the synchronization process to run daily.
  • You can store large amounts of data very cheaply.
  • There are no machines to maintain, operating systems to patch, or software to upgrade.
  • You can easily share your Canvas Data Warehouse with colleagues and collaborators.

Before we get started

Before you begin this tutorial, you should make sure that you've got an AWS account available, and that you have administrator access. You'll also need an API key and secret for the Canvas Data API.

Experience with relational databases and writing SQL will be necessary in order to query your data. Experience with AWS and the AWS console will be helpful.

Be aware that you'll be creating AWS resources in your account that are not free, but the total cost to run this data warehouse should be under about $10/month (based on my current snapshot size of 380GB). There's also a cost associated with the queries that you run against this data, but typical queries will only cost pennies.

All of the code used in this tutorial can be found in GitHub:
GitHub - Harvard-University-iCommons/canvas-data-aws: Build a Canvas Data warehouse on AWS 

AWS services we'll use

We'll use several different AWS services to build the warehouse:

  • S3: we'll store all of the raw data files in an S3 bucket. Since S3 buckets are unlimited in size and extremely durable, we won't need to worry about running out of space or having a hard drive fail.
  • Lambda: we'll use serverless Lambda functions to synchronize files to the S3 bucket. Since we can launch hundreds or even thousands of Lambda functions in parallel, downloading all of our files is very fast.
  • SNS: we'll use the Simple Notification Service to let us know when the synchronization process runs.
  • Glue: we'll create a data catalog that describes the contents of our raw files. This creates a "virtual database" of tables and columns.
  • Athena: we'll use this analytics tool along with the Glue data catalog to query the data files directly without having to load them into a database first
  • CloudFormation: we'll use AWS' infrastructure automation service to set up all of the pieces above in a few easy steps!

Let's build a warehouse!

  1. Log into the AWS console and access the CloudFormation service.
  2. Click on the Create Stack button
  3. On the next screen, leave the Template is ready and Amazon S3 URL options selected. Below, Enter this S3 URL:
    https://huit-at-public-build-artifacts.s3.amazonaws.com/canvas-data-aws/canvas_data_aws.yaml
    Click Next. 315881_pastedImage_1.png
  4. On the stack details screen, first enter a name for this stack. Something like "canvas-data-warehouse" is fine. Enter your Canvas Data API key and secret in the fields provided. Enter your email address (so that you can receive updates when the synchronization process runs). You can leave the default values for the other parameters. Click Next.
  5. On the stack options screen, leave all of the default values and click Next.
  6. On the review screen, scroll to the bottom and check the box to acknowledge that the template will create IAM resources (roles, in this case). Click the Create stack button, and watch as the process begins!

It'll take several minutes for all of the resources defined in the CloudFormation template to be created. You can follow the progress on the Events tab. Once the stack is complete, check your email -- you should have received a message from SNS asking you to confirm your subscription. Click on the link in the email and you'll be all set to receive updates from the data-synchronization process.

Now we're ready to load some data!

Loading data into the warehouse

Instructure's documentation for the Canvas Data API describes an algorithm for maintaining a snapshot of your current data:

  1. Make a request to the "sync" API endpoint, and for every file returned:
    • If the filename has been downloaded previously, do not download it
    • If the filename has not yet been downloaded, download it
  2. After all files have been processed:
    • Delete any local file that isn't in the list of files from the API

The CloudFormation stack that you just created includes an implementation of this algorithm using Lambda functions. A scheduled job will run the synchronization process every day at 10am UTC, but right now we don't want to wait -- let's manually kick off the synchronization process and watch the initial set of data get loaded into our warehouse.

To do that, we just need to manually invoke the sync-canvas-data-files function. Back in the AWS console, access the Lambda service. You'll see the two functions that are used by our warehouse listed -- click on the sync-canvas-data-files function.

On this screen you can see the details about the Lambda function. We can use the AWS Lambda Console's test feature to invoke the function. Click on the Configure test events button, enter a name for your test event (like "manual"), and click Create. Now click on the Test button, and your Lambda function will be executed. The console will show an indication that the function is running, and when it's complete you'll see the results. You'll also receive the results in your email box.

Querying your data

When the Lambda function above ran, in addition to downloading all of the raw data files, it created tables in our Glue data catalog making them queryable in AWS Athena. In the AWS console, navigate to the Athena service.  You should see something similar to the screenshot below:

315882_pastedImage_1.png

You can now write SQL to query your data just as if it had been loaded into a relational database. You'll need to understand the schema, and Instructure provides documentation explaining what each table contains: https://portal.inshosteddata.com/docs

Some example queries:

  • Get the number of courses in each workflow state:
    SELECT workflow_state, count(*) FROM course_dim GROUP BY workflow_state;
  • Get the average number of published assignments per course in your active courses:
    SELECT AVG(assignments) FROM (SELECT COUNT(*) AS assignments 
    FROM course_dim c, assignment_dim a
    WHERE c.id = a.course_id
    AND c.workflow_state = 'available'
    AND a.workflow_state = 'published'
    GROUP BY c.id);

Cleaning up

If you don't want to keep your data warehouse, cleaning up is easy: just delete the "raw_files" folder from your S3 bucket, and then delete the stack in the CloudFormation console. All of the resources that were created will be removed, and you'll incur no further costs. 

Good luck, and please let me know if you run into any trouble with any of the steps above!

15 Comments
robotcars
Community Champion

50581462

It's amazing how much of a struggle, both technically and financially, this task is for different groups, schools, and users to tackle. It's also difficult to understand, estimate, plan, and test something like this within AWS without a considerable amount of legwork and hesitation to see high volumes and costs.

You have done this community a tremendous favor.

Thank you

dgrobani
Community Champion

50581462‌, thank you so very much for making this available! I've never done anything with Canvas Data, have a sudden, urgent need to do so, and discovered your amazing tool in carroll-ccsd‌'s fantastic Awesome CanvasLMS.

Today I started working my way through your tutorial. When I got to the Lambda part, I discovered there were no functions and then saw that the status on my CloudFormation stack was "ROLLBACK_COMPLETE." I then saw that an event whose type "AWS::Lambda::Function" and logical ID "FetchLambdaFunction" had this status reason:

"Error occurred while GetObject. S3 Error Code: PermanentRedirect. S3 Error Message: The bucket is in this region: us-east-1. Please use this region to retry the request (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: b355e71d-b79f-4af0-8bed-a25dce3ad673)"

The next event was "ROLLBACK_IN_PROGRESS," followed by "ROLLBACK_COMPLETE."

I browsed and searched my way through the template but didn't see anything I could identify as relevant. I'd be immensely grateful for any guidance you can offer.

Many thanks,

Daniel

dgrobani
Community Champion

I'm celebrating the creation of my stack! I resolved the issue by taking the naive step of changing the location in my top AWS menu bar to US East (N. Virginia) and then creating the stack again. Is there a better approach?

ColinMurtaugh
Community Champion

Hey Daniel!

I'd forgotten that some AWS resources are region-specific; I should have mentioned that I used us-east when I created all of this.  Glad you got it working! 

--Colin

dgrobani
Community Champion

I'm still amazed at how quick and easy it was to go from nothing to querying our data!

We're fine with a snapshot and don't need it updated daily--how do I disable the sync?

Again, a million thanks!

ColinMurtaugh
Community Champion

Hi Daniel,

You should be able to log into the AWS console and go into CloudWatch > Events > Rules.  There should be a rule there that runs the sync process, and you can select it and choose "Disable" from the Actions menu.  If you ever want to refresh your local data again, you can re-enable the rule.

--Colin

dgrobani
Community Champion

Excellent--thanks again!

dtod
Community Contributor

I realize this is a generic Amazon question, but how do you keep an eye on costs? What kinds of costs are you looking at?

Alexandre_Sch
Community Contributor

Hey Colin! 

Thank you so much for this brilliant tutorial. I have a question and would be very grateful if you could ask: Is there a way to not extract the request table? i Wont use any request information in my reports so, i'm trying to prevent the request table to be imported but i really don't know how i can achieve it. Maybe changing the template u gave: https://huit-at-public-build-artifacts.s3.amazonaws.com/canvas-data-aws/canvas_data_aws.yaml  would do the trick, but i'm still trying to understand wich any sentence are doing before editing it. 

Regards,

ALexandre S.

ColinMurtaugh
Community Champion

Hi Alexandre, 

There are two Lambda functions that are used by the synchronization process: one that is executed once per synchronization and determines which files need to be fetched or removed, and one that is executed once per file that needs to be fetched. You could modify the code for the first function -- after it gets the list of files needed to perform a synchronization, you could just skip any files with "requests" in the name, somewhere near here: canvas-data-aws/sync-canvas-data-files.py at 5cdd41e8e40cafc53a9e258c608f654cce2c0446 · Harvard-Univ... 

If you do modify the Lambda function code, you'd need to package it up in a new zip file and upload it to an S3 bucket in your AWS account. Then, when you create the CloudFormation stack using the template, you'd specify the location of that zip file using the 'LambdaFunctionBucketParameter' and 'LambdaFunctionZipFileKeyParameter' parameters. 

To be honest, though, since storing this data in S3 is so cheap, I wouldn't bother to go to this extra effort in order to exclude the requests data. Someday it might come in handy! 🙂 

--Colin

James
Community Champion

To go along with your note about fetching the requests table because you might need it some day, Canvas is now suggesting people fetch the requests table from Canvas Data before Canvas Data 2 comes out: https://community.canvaslms.com/community/answers/data/blog/2020/07/01/canvas-data-2-features-and-ti... 

Alexandre_Sch
Community Contributor

Hey James, thanks for sharing that. I wasn't aware of a Canvas Data 2, very glad to hear about it. 

ColinMurtaugh
Community Champion

Thanks for the heads-up on that, James!

--Colin

dnimri
Community Participant

Will the above steps work for Canvas Data 2 as well once it's released?

Thank you @ColinMurtaugh for sharing!

ColinMurtaugh
Community Champion

@dnimri -- This is going to need to be re-written to work with Canvas Data 2. We haven't begun planning that work yet, but I hope to be able to share that here someday too. 

Thanks!

--Colin