New dataset coming: weblogs aka requests

Edina_Tipter · ‎06-22-2023

As we approach our next significant release, Canvas Data 2 will include weblogs (currently the requests table in CD1). Important note: the target release date has changed from 21 June to 5 July, and the deploy notes were updated accordingly. Here we’ll provide some details about what’s new and how this differs from Canvas Data 1.

First, I’d like to share key design considerations:

In the weblogs, data is appended, meaning that every activity is represented as a new row in the table. The table can therefore grow very quickly, which was a pain point for most customers when managing large amounts of data or opening daily files containing the differentials (deltas) generated by the CD1 ETL process.
Most customers want to gain insight into and track user activity. However, by nature of the data, logs contain not only useful data but other information the user may not be looking for, such as internal bounces, non-user related actions, page refreshes, etc.
The requests table itself is not equivalent to the raw weblogs emitted by the server. It is a pre-processed stream of data, enriched with a few additional columns that make it easier to join with other datasets.

With those considerations in mind, to streamline the transition from CD1 to CD2 and to provide a better user experience, we put some measures in place to reduce the size of the files.

Reduced file size by 30%: Canvas Data 2 allows customers to adjust the frequency of queries, leading to smaller chunks to download and process. Furthermore, we implemented the user_agent_id column and moved the values into a separate user_agent table in order to save space. With these changes, as well as updating some field types from string to enum, we managed to reduce the size of the web_logs table (aka requests in CD1) by approximately 30%.
Schema changes: We aimed to keep the schema as close as possible to the CD1 version. However, some unavoidable changes were triggered by the relational nature of the CD2 table schemas and the streaming architecture under the hood:
A new namespace, canvas_logs, will be added. The web_logs table will become the “new” requests table, preparing the ground for the upcoming mobile_logs.

Example query: POST {{BASE_URL}}/dap/query/canvas_logs/table/web_logs/data

To better align data with the Canvas API and Live Events payloads, IDs won’t be globalized but will match with the CD2 relational tables.
We kept most of the helper columns, such as course_id, quiz_id, etc. but removed course.account.id as that information can be found via a simple join with the accounts table.
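
As a minimal sketch, the example query above could be assembled in Python as follows. Only the URL path comes from this post; the placeholder base URL and the "format" and "since" body fields are assumptions for illustration and should be checked against the API specification. The helper name build_web_logs_query is hypothetical.

```python
import json
from typing import Optional


def build_web_logs_query(base_url: str, since: Optional[str] = None) -> tuple[str, str]:
    """Return (url, json_body) for a query against canvas_logs/web_logs."""
    namespace = "canvas_logs"
    table = "web_logs"
    url = f"{base_url.rstrip('/')}/dap/query/{namespace}/table/{table}/data"
    # Assumed body fields: "format" and "since" are illustrative only,
    # they are not confirmed by the post.
    body = {"format": "jsonl"}
    if since is not None:
        body["since"] = since
    return url, json.dumps(body)


# Placeholder base URL and timestamp, for demonstration only.
url, body = build_web_logs_query("https://example.invalid", since="2023-07-05T00:00:00Z")
print("POST", url)
print(body)
```

The sketch only builds the request; sending it and authenticating are left out because the post does not describe them.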

As part of the documentation, the API specification will capture the mentioned changes. In addition, we provide a mapping between CD1 and CD2 fields and call out the differences. This auxiliary documentation will be published to the Community Admin guides.

I am happy to share that our CLI reference implementation for loading Canvas LMS tables in a Postgres database is prepared to ingest this additional web_logs table as well.

In terms of data retention and log history, this upcoming release will contain weblogs from the time the stream was enabled, which is July 5. These logs will be kept in the data lake and available for querying by customers for a rolling window of at least 30 days. Logs older than the retention limit will be purged.
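
To make the rolling window concrete, here is a small sketch that checks whether a log timestamp still falls inside a 30-day window. The function name and the sample dates are hypothetical, and because 30 days is stated as a minimum, the actual purge may happen later than this check suggests.

```python
from datetime import datetime, timedelta, timezone

# Minimum rolling window stated in the post; the real limit may be longer.
RETENTION = timedelta(days=30)


def is_queryable(log_time: datetime, now: datetime) -> bool:
    """True if a log row is still inside the rolling retention window."""
    return now - log_time <= RETENTION


# Sample dates chosen for illustration only.
now = datetime(2023, 9, 1, tzinfo=timezone.utc)
print(is_queryable(datetime(2023, 8, 15, tzinfo=timezone.utc), now))
print(is_queryable(datetime(2023, 7, 5, tzinfo=timezone.utc), now))
```

The first row is 17 days old and passes the check; the second is 58 days old and does not.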

Some customers might need or already have a longer log history which has to be pulled from the CD1 service. If you already have historic logs from CD1 that you need to merge with CD2 weblogs, the mapping will highlight the necessary transformations.

Availability: this feature will be enabled for all customers who have already been onboarded to Canvas Data 2. Access to the weblogs dataset goes hand in hand with customer onboarding for the Canvas Data 2 solution.

Other improvements

First things first, in case you missed the newest changes to the API and CLI, please visit the Canvas deploy notes and check what improvements and fixes were made.

Customers have expressed interest in supporting several database engines in the Python Client Library. Starting at version 0.4.0, the Client Library will adopt an extensible plugin architecture. Integrations to various database engines would become plugins. This move opens up the opportunity to contribute integrations for other database engines in the future, e.g. add Oracle, MSSQL, MySQL or SQLite support in addition to PostgreSQL support that exists today. With the version 0.3.9 release, PostgreSQL support has been re-written as a plugin but remains bundled with the Client Library package. While the Python class interfaces may be subject to change, early adopters are welcome to explore the solution and leave feedback as we solidify the plugin framework.

We are continuously learning how to serve your needs better, so we have updated our Technical FAQ to share some more best practices, tips, and tricks for our CD2 champs. 😉

I hope you found these updates useful. If you did, and you would like to tell us which dataset not yet available in CD2 is most important to you, please fill in this 5-minute survey.

Let’s build Instructure’s data journey together because your voice matters.

marco_divittori · ‎06-22-2023

Thanks for the update @Edina_Tipter. This point about a new data retention policy stands out for me as CD1 logs are currently retained by Instructure indefinitely (we can retrieve the full history of request data). Can you confirm that this will no longer be the case for CD2 weblogs?

"In terms of data retention and log history, this upcoming release will contain weblogs from the time the stream was enabled which is July 5. These logs will be kept in the data lake and available for querying by customers for a rolling 30 days window at minimum. Logs older than the retention limit will be purged."

Edina_Tipter · ‎06-23-2023

@marco_divittori Yes, I can confirm that in the upcoming release historic logs will not be served. Nevertheless, you have the option to pull the historic ones (older than 30 days) from CD1 for the time being.

dtod · ‎06-30-2023

How long is "the time being"?

Edina_Tipter · ‎07-11-2023

@dtod Until the end of 2023.

ryucali · ‎07-18-2023

@Edina_Tipter Re: "you have the option to pull the historic ones (older than 30 days) from CD1 for the time being", we cannot easily get the requests logs from a specific course in CD1. During the transition from CD1 to CD2, is there any way to pull weblogs from a past course?

Edina_Tipter · ‎07-20-2023

@ryucali Just to make sure I understand the pain point, can you please elaborate on what you mean by "cannot easily get"?

As I see it, in order to find a past course in the CD1 requests table, you need to scan through the data and search for the particular course you are interested in (given that you have downloaded the daily diffs and/or the historic dump). Is there any other aspect that I am not seeing?

Right now, CD2 has weblogs accrued starting from the release time which is 7 July. These are the "oldest" logs you can query.
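
For illustration, scanning downloaded files for one course could look like the sketch below. It assumes the files are gzipped and tab-separated, and the position of the course ID column (course_col) is a hypothetical parameter that you would take from the CD1 schema; the sample rows are made up for the demo.

```python
import csv
import gzip
import tempfile
from pathlib import Path
from typing import Iterator


def rows_for_course(path: Path, course_id: str, course_col: int) -> Iterator[list[str]]:
    """Yield rows of one gzipped, tab-separated file whose course column matches."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            # Skip short rows instead of failing on them.
            if len(row) > course_col and row[course_col] == course_id:
                yield row


# Demo on a tiny made-up file; real files and column positions will differ.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "requests-sample.gz"
    with gzip.open(sample, "wt", encoding="utf-8", newline="") as out:
        out.write("r1\t101\t/courses/101\nr2\t202\t/courses/202\n")
    matches = list(rows_for_course(sample, "101", course_col=1))
    print(matches)
```

Because the function is a generator, it reads one row at a time, so it can be run over many daily files without loading them into memory.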

jcsorenson79 · ‎07-31-2023

Is this working? I have spent the better part of 4 days trying to get web_logs and user_agents to initialize in my Postgres installation.

Edina_Tipter · ‎08-01-2023

@jcsorenson79 Can you share the repro steps and (error) message the CLI client returns?

dtod · ‎08-01-2023

Will Instructure support have access to older logs?

Will user pageviews still have older data?

One scenario we run into is documenting specific user behavior from way more than 30 days ago (e.g. did a student log in to Canvas and attempt an assignment in January), or alternatively, trying to track down who did what (e.g. who deleted something) last fall.

jcsorenson79 · ‎08-01-2023

@Edina_Tipter ,
When I run the dap client initdb through PowerShell, the web_logs portion stalls at file 4/7 and never completes (last ran for 14 hours). When I run that same command for user_agents, I get an error on the table within the Postgres instance:

  line 69, in <lambda>
    return lambda record_json: record_json["value"][column_name]
                               ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'http_user_agent'

I have been successful using the initdb and syncdb commands on the Canvas data, but no luck with the canvas_logs data.

I have forwarded my errors to the Support desk and have not received an answer.

Edina_Tipter · ‎08-02-2023

@jcsorenson79 We have started the investigation and will report back with what we find.

mcarruth · ‎08-30-2023

Version 0.3.11 of the Python Client Library and CLI, which adds support for the MySQL database dialect, was released on 2023-08-23. We downloaded and installed the client, but ran into issues when we ran the initdb() function. The error occurs because the COMMENT for some of the fields is longer than 1024 characters, and MySQL limits comments to 1024 characters. The CLI client fails because of this issue.
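
One possible workaround, sketched here under assumptions, is to shorten over-long comments before they reach MySQL. The helper fit_comment is hypothetical and is not part of the CLI; it only illustrates trimming text to the 1024-character limit described above.

```python
# Limit in characters, as described in the comment above.
MYSQL_COLUMN_COMMENT_LIMIT = 1024


def fit_comment(comment: str, limit: int = MYSQL_COLUMN_COMMENT_LIMIT) -> str:
    """Shorten a column comment so it fits the limit, marking the cut with '...'."""
    if len(comment) <= limit:
        return comment
    marker = "..."
    return comment[: limit - len(marker)] + marker


# A made-up 2000-character comment, for demonstration only.
long_comment = "x" * 2000
print(len(fit_comment(long_comment)))
```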