I've got Canvas Data coming into my AWS environment by following this guide:
Build a Canvas Data Warehouse on AWS in 30 minutes... - Canvas Community (canvaslms.com)
However, I'd like to do something similar in Azure. Without building it from scratch, are there any institutions out there that are sending Canvas Data into Azure like this?
No. Our AWS infrastructure is years more mature than our Azure infrastructure, so I'm doubling down on AWS.
Were you able to get anywhere with using Azure?
No, I never pursued it further than what you see above.
I'm working on this now, but we're pretty new to Azure, so there's a lot of discovery happening along the way. I'm not sure yet what the final result will be as our institution works out the Azure tools and processes they want to support.
So far I've got an Azure Timer Function dumping (a selection of) object downloads to Blob Storage, and I was going to start looking into Data Factory to see whether it can handle the CD2 metadata or whether we need something more custom to process it first. It currently requires a database to store query history, as I'm not sure whether there's a better way to do that.
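For anyone following along, the shape of that Timer Function is roughly the sketch below. It assumes the Azure Functions Python v2 programming model and the azure-storage-blob SDK; the table list, container name, and fetch_table_export() helper are placeholders for whatever DAP logic you use, not anything provided by Instructure.
```python
# function_app.py - a minimal sketch, assuming the Azure Functions Python v2
# programming model and the azure-storage-blob SDK. TABLES, the container name,
# and fetch_table_export() are placeholders for your own DAP code.
import os
from datetime import datetime, timezone

import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

TABLES = ["users", "courses", "enrollments"]  # the subset of CD2 tables we pull


def fetch_table_export(table: str) -> bytes:
    """Placeholder: start a DAP query for `table`, wait for the job to finish,
    and return the downloaded file contents."""
    raise NotImplementedError


@app.schedule(schedule="0 0 3 * * *", arg_name="timer", run_on_startup=False)
def export_cd2(timer: func.TimerRequest) -> None:
    # Connection string comes from an app setting rather than being hard-coded.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]
    )
    container = blob_service.get_container_client("cd2-exports")

    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for table in TABLES:
        data = fetch_table_export(table)
        # One blob per table per run, e.g. users/20250101T030000Z.jsonl.gz
        container.upload_blob(name=f"{table}/{stamp}.jsonl.gz", data=data)
```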
For anyone interested in this, what kinds of things are you looking for? Are there other Azure services/tools you like to use or think would be useful in the process? Is your endgame a (fully-structured?) database, a data lake, or just dumping the files so you can figure out what you want to do with them from there?
There's a lot of refinement I still need to do with regard to how objects are queried and downloaded, but, especially as I'm pretty new to Azure, I'd appreciate some feedback about how to make this a more robust and reusable resource that I'd be happy to share.
I am working on doing this, and we are fairly new to Azure as well, but would be happy to collaborate as we go along! I have experimented with a couple of different routes so far, including an Azure Database for PostgreSQL flexible server, Synapse pipelines, etc. I have run into a couple of intermittent issues getting one last table to sync to the PSQL server for some reason...
I'd be happy to talk about what we're doing and what's working or not.
I've been reworking my Function into a "Durable Function" and am supposed to be meeting with someone from Microsoft about managing the Function for multiple tables, especially since we may want to run some at different rates or not at all.
But I'm still struggling with what to do with the data once we export it.
It shouldn't be too hard to shove it into a database, but that really only makes sense for the smaller, relational data like Users, Courses, Enrollments, etc. As we get into the bigger tables like assignment submissions, and especially the log tables (which are valuable data to us), it's not clear that approach is going to be very effective.
So I've been looking for a more generic way to store it so that we can explore the raw data a little better as we figure out what to do with it.
I've also heard about OpenEDU Analytics, as mentioned in another thread, but I haven't had much chance to dig into it to figure out how it expects the data to be transformed and stored.
Anyway, I know my email is in my profile, but I'm not yet familiar enough with this community to have figured out the best way to actually share it.
@bliszewski , I have a quick comment on one of your paragraphs above, in which you mention that you are looking for a more generic way to store the data that lets you explore the raw version a little better. My institution has used Splunk for several years now for logging and security-related purposes, but my group started using it a few years back to also store Canvas Data (1 and 2), in addition to Canvas Live Events. This gives us a good way to explore the data in the way you describe, without having to do any conversions before ingesting it. If Splunk happens to be available at your institution, this could be an option for you. I'd be open to discussing it if you're interested.
I would be super interested in seeing how you are using Splunk for this! We have been using it on our campus for logging and security as well, but I hadn't even thought of using it this way. It's genius!
Sure @Chrisleej! Feel free to connect offline to chat. If you happen to be attending Educause this week, I'm going to be chatting with others about this at a Splunk-sponsored lunch (BTW, I am with Northwestern University).
We are also just starting work with Azure Data Factory.
Wondering if there's any opportunity to catch up with someone (anyone?) who's had success doing this? We are very much starting from scratch.
We are stuck on some very basic questions about how to query CD2 and what the overall process should look like.
We would be very keen to hear from anyone who has gone before about getting data into (and refreshed in) Data Factory. At the moment we're just looking to start with daily snapshots... but if there is any more nuanced advice, we'd be very appreciative.
Hi Pete,
Understanding Data Factory is a big holdup for me at the moment. We've talked to Microsoft for some advice, but are still waiting on a response.
To the best of my knowledge, Azure Data Factory doesn't currently have any built-in or readily available support for Instructure's DAP/CD2. It may be possible to use Data Factory's existing HTTP Connector if you only intend to use snapshots anyway, since you wouldn't need more complicated processing to track previous jobs and modify the request for incremental queries. Writing a custom Data Factory Connector may also be an option, but I'm not familiar enough with Data Factory to even know where to start with that.
Instead, I've written an Azure Function using the durable monitoring pattern (https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=in-p...) to start a table query, monitor for completion, and download the resulting data files to Azure Blob Storage.
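In case it helps to see the shape of it, here's a rough sketch of that monitor pattern using the azure-functions-durable Python v2 programming model. The activity names (start_dap_query, check_dap_job, download_to_blob) and the job-status values are placeholders for our own DAP calls, not anything defined by Instructure.
```python
from datetime import timedelta

import azure.functions as func
import azure.durable_functions as df

app = df.DFApp(http_auth_level=func.AuthLevel.FUNCTION)


@app.orchestration_trigger(context_name="context")
def export_table(context: df.DurableOrchestrationContext):
    table = context.get_input()  # e.g. "enrollments"
    job_id = yield context.call_activity("start_dap_query", table)

    # Monitor pattern: poll the DAP job on a durable timer until it finishes.
    status = "running"
    while status not in ("complete", "failed"):
        status = yield context.call_activity("check_dap_job", job_id)
        if status not in ("complete", "failed"):
            yield context.create_timer(
                context.current_utc_datetime + timedelta(minutes=1)
            )

    if status == "complete":
        yield context.call_activity("download_to_blob", job_id)
    return {"table": table, "status": status}


@app.activity_trigger(input_name="table")
def start_dap_query(table: str) -> str:
    # POST the table query to the DAP API here and return its job id (omitted),
    # along with the check_dap_job and download_to_blob activities.
    ...
```
A separate client function (timer- or HTTP-triggered) would kick off one orchestration per table.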
Presumably, the Function can be called from a Data Factory pipeline, which can then read the data from Blob Storage. Using a Function gives extra flexibility to save the job history and pass the last successful timestamp, for a given table, into an incremental query. But this is where I've stalled out for lack of familiarity with Data Factory.
I have a Python script that can run the whole process and dump the data into a database, but I'm struggling to map this to Data Factory components. My plan for now is just to run my script on our own hardware against an Azure-hosted database until I can figure out the Azure pieces.
For context, we're using an Azure SQL Managed Instance database (Microsoft SQL Server) and many of our non-Azure systems are still running Python 3.6, so I haven't been able to use Instructure's DAP client and have instead been building my own. This has given me a bit more flexibility working in Azure, but it lacks some of the conveniences offered by their implementation.
In case it's still of use... this Fabric example has provided an additional possible solution that bypasses the whole Data Factory pipeline approach. (We initially started with ADF to get the JSON files, but I've now got Databricks grabbing the table list for each namespace and then processing the parquet files.) I'm still evaluating whether a flexible Postgres server might be the most cost-effective and easiest-to-maintain solution for us, but the Fabric/Databricks solution does work... the biggest issue is that I've had to add a sleep instruction to stop hitting the rate limit on the API.
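For what it's worth, the throttling boils down to something like the loop below (list_tables() and process_table() stand in for the actual DAP/Databricks calls, and the pause length is just a value to tune against the API limits):
```python
import time

NAMESPACES = ["canvas", "canvas_logs"]  # the CD2 namespaces we pull from
PAUSE_SECONDS = 5  # arbitrary; tune against the DAP API rate limit


def list_tables(namespace: str) -> list[str]:
    """Placeholder: return the table names for a namespace via the DAP API."""
    raise NotImplementedError


def process_table(namespace: str, table: str) -> None:
    """Placeholder: query the table, download the parquet files, and load them."""
    raise NotImplementedError


for namespace in NAMESPACES:
    for table in list_tables(namespace):
        process_table(namespace, table)
        time.sleep(PAUSE_SECONDS)  # back off between requests so we stay under the limit
```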
Hey Pete,
Thanks for sharing that. I haven't worked with Fabric as a whole yet, but I had a chance a few months ago to throw some DAP data into OneLake and play with PySpark.
We didn't do a full integration with it; rather, I just manually uploaded a bunch of DAP exports we already had and threw together some notebooks demonstrating reading them into a dataframe, applying incrementals over the snapshot data, doing some joining and filtering, pushing to a temporary table for some SQL, etc., to show what we could do with it.
It was fast and, being somewhat familiar with dataframes from toying with pandas on-and-off, pretty slick and easy to work with. Personally I like it more than our current solution, but the powers that be weren't ready to move forward with OneLake or Fabric yet.
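For anyone curious, the "applying incrementals over the snapshot data" part looked roughly like the sketch below. The paths and column names (id, dap_ts) are made up for illustration; real CD2 exports nest fields under key/value/meta, so adjust to however you flatten them.
```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

snapshot = spark.read.parquet("/lakehouse/cd2/enrollments/snapshot/")
increments = spark.read.parquet("/lakehouse/cd2/enrollments/incremental/")

# Union snapshot and incremental rows, then keep only the newest row per key.
combined = snapshot.unionByName(increments, allowMissingColumns=True)
latest = (
    combined
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("id").orderBy(F.col("dap_ts").desc())))
    .filter("rn = 1")
    .drop("rn")
)
# (Incremental rows flagged as deletes would also need to be dropped here.)

# Expose the result to SQL, as in the notebooks mentioned above.
latest.createOrReplaceTempView("enrollments_current")
spark.sql("SELECT COUNT(*) FROM enrollments_current").show()
```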
It's a neat idea to just run the DAP query and downloads in the pyspark script. Otherwise these scripts are actually pretty similar to the process I ended up building to manage the database tables.
I'm not clear how to deal with scheduling in this method, though. E.g. our current process uses cron to fetch tables on a schedule (which allows us to configure different tables on different schedules as needed) to keep fresh data available.
I'm guessing we would still need something else, like an Azure Function, to run things on a schedule? Or just start every process with a query to fetch new data?
If running a Function, you can trigger it from ADF. With Databricks or Fabric, there's an inbuilt scheduler, plus you can configure the compute resource to auto shut down after 5 minutes of inactivity. The pricing structure for Fabric is what is keeping us with Databricks, for now.
As I've mentioned, I would actually like to go further and send the data to Azure Postgres. But finding the time to work on timed Functions is elusive for me: no time, and a lack of familiarity. (I have no idea whether you can execute shell commands within the Python code? If you could, it would make CD2 much easier to build, maintain, and read.) Likewise, I've found it difficult to confirm whether Databricks can run dap via its magic shell commands and whether it can write to an outside system (I think it's sandboxed), but it's potentially the easiest approach.
Thanks for the tips. Costs are a concern for us as well, but I recently heard we got a grant to explore Fabric, and getting our data there is bringing our current solution into question anyway. Sounds like I may need to take a closer look at Databricks as something that works with Fabric but is not specifically tied to it, so it could be more sustainable after that money goes away.
My hope was to use a Function to just dump the files into some storage somewhere where they could be picked up by Fabric/OneLake/Data Factory/wherever we wanted to use them. But in recent conversations with Microsoft they just keep pushing us to use Data Factory, and I've been getting a bit discouraged about the whole thing.
"I have no idea whether you can execute shell commands within the Python code? If you could, it would make CD2 much easier to build, maintain, and read. Likewise, I've found it difficult to confirm whether Databricks can run dap via its magic shell commands and whether it can write to an outside system (I think it's sandboxed), but it's potentially the easiest approach."
A possibly better approach is to use the dap package as a Python library instead of the CLI, e.g. https://data-access-platform-api.s3.amazonaws.com/client/README.html#code-examples
I'll admit I haven't actually tried this yet because we ended up writing our own DAP client to work with Python 3.6. So I don't know if using it as a library will be as easy as the CLI, but if you can find their __main__ you should be able to see how the CLI works and use it in the same or a similar way.
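Going from memory of their README (so treat the exact class and method names as something to verify against the examples linked above for the version you install), using it as a library looks roughly like this:
```python
import asyncio
import os

from dap.api import DAPClient
from dap.dap_types import Credentials, Format, SnapshotQuery

# Same environment variables the dap CLI reads.
base_url = os.environ["DAP_API_URL"]
credentials = Credentials.create(
    client_id=os.environ["DAP_CLIENT_ID"],
    client_secret=os.environ["DAP_CLIENT_SECRET"],
)


async def main() -> None:
    async with DAPClient(base_url, credentials) as session:
        # Snapshot of one table, written as (decompressed) JSONL files.
        query = SnapshotQuery(format=Format.JSONL, mode=None)
        await session.download_table_data(
            "canvas", "accounts", query, "downloads", decompress=True
        )


asyncio.run(main())
```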
At this point I haven't exactly wrapped my head around what Databricks is and how it's different from Fabric, or OneLake, or PySpark, and everything else. It could be the case that Databricks itself is a sandbox and not intended to talk to external databases, but unless the network is also sandboxed you should be able to use PySpark or straight Python (or whatever language you prefer) to connect to another database. See possibly: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
It looks like there may also be a specific PySpark connector for Azure/MSSQL databases, https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=azuresqldb-mi-current, but that may not be of much use if you're looking for PostgreSQL instead.
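As a rough sketch of what the generic JDBC route could look like from a notebook (the server, database, table, and credentials here are placeholders, and the PostgreSQL JDBC driver would need to be available on the cluster):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/mnt/cd2/enrollments/current/")  # whatever you've prepared

(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://your-server.postgres.database.azure.com:5432/canvas_data")
    .option("dbtable", "cd2.enrollments")
    .option("user", "cd2_loader")      # pull these from a secret scope in practice
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)
```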
We have CD2 in Azure now, and are looking to add Live Events next.
Are you using OpenEDU Analytics (https://github.com/microsoft/OpenEduAnalytics) or did you build out yourself?
Hey jbowers3,
I've been (trying to) look into OpenEDU Analytics because there's a lot of interest in better analytics here, but not yet a lot of direction for actually managing the data to make this possible. Can I ask what makes it work with OpenEDU Analytics?
Is it just conforming to OpenEDU Analytics data models, or does the data need to be built a particular way or go somewhere in particular (e.g. a database vs blob storage)?
https://github.com/uvadev/PullCanvasData2/pkgs/nuget/PullCanvasData2
This will give you a sense of how to use it; just configure the table writing to point to Azure:
https://github.com/uvadev/CD2-Cron-or-DB-Source