New dataset coming: weblogs aka requests

Edina_Tipter · ‎06-22-2023

As we are approaching our next significant release, Canvas Data 2 will include weblogs (currently requests table in CD1). Important note: the target release date has changed from 21 June to 5 July and the deploy notes were updated accordingly. Here we’ll provide some details about what’s new and how this differs from Canvas Data 1.

First, I’d like to share key design considerations:

In the weblogs, data is append-only: every activity is represented as a new row in the table. As a result, the table can grow very quickly, which was a pain point for most customers when managing large amounts of data or opening the daily files of differentials (deltas) generated by the CD1 ETL process.
Most customers want to gain insight into and track user activity. However, by nature of the data, logs contain not only useful data but other information the user may not be looking for, such as internal bounces, non-user related actions, page refreshes, etc.
The requests table itself is not equivalent to the raw weblogs emitted by the server. It is a pre-processed stream of data, enriched with a few additional columns that make it easier to join with other datasets.

With those considerations in mind, to streamline the transition from CD1 to CD2 and to provide a better user experience, we put some measures in place to reduce the size of the files.

Reduced file size by 30%: Canvas Data 2 allows customers to adjust the frequency of queries, leading to smaller chunks to download and process. Furthermore, we implemented the user_agent_id column and moved the user agent values into a separate user_agent table in order to save space. With these changes, as well as updating some field types from string to enum, we managed to reduce the size of the web_logs table (aka requests in CD1) by approximately 30%.
Schema changes: We aimed to keep the schema as close as possible to the CD1 version; however, some unavoidable changes were triggered by the relational nature of the CD2 table schemas and the streaming architecture under the hood:
A new namespace, canvas_logs, will be added. The web_logs table will become the “new” requests table, preparing the ground for the upcoming mobile_logs.

Example query: POST {{BASE_URL}}/dap/query/canvas_logs/table/web_logs/data
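As a minimal sketch of how the example query above could be assembled in Python: the endpoint path comes from this post, while the base URL, the request body, and the helper name build_query_request are placeholders assumed for illustration. Authentication is deliberately left out and the request is only built, not sent; please check the API specification for the actual body fields and headers.

```python
import json
from urllib.request import Request


def build_query_request(base_url: str, namespace: str, table: str, body: dict) -> Request:
    """Build (but do not send) a POST request for a table query.

    Hypothetical helper: only the URL pattern is taken from the post.
    """
    url = f"{base_url.rstrip('/')}/dap/query/{namespace}/table/{table}/data"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder base URL and body, for illustration only (assumptions).
request = build_query_request(
    "https://example.invalid", "canvas_logs", "web_logs", {"format": "jsonl"}
)
print(request.get_method(), request.full_url)
```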

To better align data with the Canvas API and Live Events payloads, IDs won’t be globalized but will match the IDs in the CD2 relational tables.
We kept most of the helper columns, such as course_id, quiz_id, etc., but removed course.account.id, as that information can be found via a simple join with the accounts table.
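To illustrate the joins described above, here is a small sketch using an in-memory SQLite database. Only the names web_logs, user_agent, accounts, course_id and user_agent_id come from this post. The courses table, the other column names, and the sample rows are assumptions made up for the example, so the real CD2 schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, heavily simplified schema and sample rows (assumptions).
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses (id INTEGER PRIMARY KEY, account_id INTEGER);
    CREATE TABLE user_agent (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE web_logs (id TEXT PRIMARY KEY, course_id INTEGER, user_agent_id INTEGER);

    INSERT INTO accounts VALUES (10, 'Science');
    INSERT INTO courses VALUES (100, 10), (101, 10);
    INSERT INTO user_agent VALUES (1, 'ExampleBrowser/1.0');
    INSERT INTO web_logs VALUES ('req-1', 100, 1), ('req-2', 101, 1);
    """
)

# Recover the account of each request through its course, and resolve
# user_agent_id back to the user agent string.
rows = conn.execute(
    """
    SELECT w.id, c.account_id, a.name, ua.value
    FROM web_logs AS w
    JOIN courses AS c ON c.id = w.course_id
    JOIN accounts AS a ON a.id = c.account_id
    JOIN user_agent AS ua ON ua.id = w.user_agent_id
    ORDER BY w.id
    """
).fetchall()

for row in rows:
    print(row)
```

Because IDs are no longer globalized, the join keys in web_logs can be compared directly with the IDs in the other CD2 tables.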

As part of the documentation, the API specification will capture the mentioned changes. In addition, we provide a mapping between CD1 and CD2 fields and call out the differences. This auxiliary documentation will be published to the Community Admin guides.

I am happy to share that our CLI reference implementation for loading Canvas LMS tables in a Postgres database is prepared to ingest this additional web_logs table as well.

In terms of data retention and log history, this upcoming release will contain weblogs from the time the stream was enabled, which is July 5. These logs will be kept in the data lake and will be available for customers to query for a rolling 30-day window at minimum. Logs older than the retention limit will be purged.
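For customers planning how often to pull logs, the rolling window can be reasoned about with a small check like the one below. This is only a sketch: the 30-day figure is the stated minimum, while the function name and the exact boundary handling are assumptions, and the service decides when logs are actually purged.

```python
from datetime import datetime, timedelta, timezone

# Minimum rolling retention window stated in the post.
RETENTION = timedelta(days=30)


def is_within_retention(log_time: datetime, now: datetime) -> bool:
    """Return True if a log entry is at most RETENTION old (hypothetical helper)."""
    return now - log_time <= RETENTION


stream_start = datetime(2023, 7, 5, tzinfo=timezone.utc)

# 27 days after the stream was enabled: still inside the minimum window.
print(is_within_retention(stream_start, datetime(2023, 8, 1, tzinfo=timezone.utc)))
# 36 days after: outside the minimum window, so the entry may have been purged.
print(is_within_retention(stream_start, datetime(2023, 8, 10, tzinfo=timezone.utc)))
```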

Some customers might need or already have a longer log history which has to be pulled from the CD1 service. If you already have historic logs from CD1 that you need to merge with CD2 weblogs, the mapping will highlight the necessary transformations.

Availability: this feature will be enabled for all customers who have already been onboarded to Canvas Data 2. Access to the weblogs dataset goes hand in hand with customer onboarding for the Canvas Data 2 solution.

Other improvements

First things first, in case you missed the newest changes to the API and CLI, please visit the Canvas deploy notes and check what improvements and fixes were made.

Customers have expressed interest in supporting several database engines in the Python Client Library. Starting at version 0.4.0, the Client Library will adopt an extensible plugin architecture, and integrations with the various database engines will become plugins. This move opens up the opportunity to contribute integrations for other database engines in the future, e.g. to add Oracle, MSSQL, MySQL or SQLite support in addition to the PostgreSQL support that exists today. With the version 0.3.9 release, PostgreSQL support has been rewritten as a plugin but remains bundled with the Client Library package. While the Python class interfaces may be subject to change, early adopters are welcome to explore the solution and leave feedback as we solidify the plugin framework.

We are continuously learning how to serve your needs better, so we have updated our Technical FAQ to share some more best practices, tips and tricks for our CD2 champs. 😉

I hope you found these updates useful. If you did and feel like sharing which dataset that is not yet available in CD2 is most important to you, please fill in this 5-minute SURVEY.

Let’s build Instructure’s data journey together because your voice matters.