It's not designed for what we're trying to use it for
- almost anyone, including the Canvas Data team
The Canvas Data Portal - Requests documentation states:
Pageview requests. Disclaimer: The data in the requests table is a 'best effort' attempt, and is not guaranteed to be complete or wholly accurate. This data is meant to be used for rollups and analysis in the aggregate, _not_ in isolation for auditing, or other high-stakes analysis involving examining single users or small samples. As this data is generated from the Canvas logs files, not a transactional database, there are many places along the way data can be lost and/or duplicated (though uncommon). Additionally, given the size of this data, our processes are often done on monthly cycles for many parts of the requests tables, so as errors occur they can only be rectified monthly.
When we started looking into this data's use, we provided a similar disclaimer.
Then Neal Shebeck would consistently add, "This is a conversation starter, not a smoking gun."
The requests table is unlike any other table in Canvas Data. With the exception of Pageviews, all other tables record events that are triggered when a teacher or student saves something in Canvas, such as when a teacher creates a page or assignment, or when a student submits an assignment or takes a quiz.
Pageviews are different: a user might click links on a website from the moment the page loads until they stop clicking. Each of these events is useful.
It might tell us
- how the user navigated the site, the order in which they clicked to get to content
- how long before the user changed pages, which can give us some insight into how long the content was accessed
- whether the user found the content they wanted, or left quickly (wrong place, or a step in between)
And much more
Except the requests table doesn't just contain clicks; it contains logs. Anyone who has ever seen web server logs knows that any transaction or request over HTTP is logged in whatever detail the engineers decide suits their needs. To give you some scope, here is a list of HTTP response status codes - HTTP | MDN. You would expect a line for each request made to the server that returns any of these, along with the URL, timestamp, IP address, user agent, and more.
Some of it is useful; most of it is completely undocumented. I once tried compiling a spreadsheet to catalog my best effort at understanding the various web_application_/controller/action/context_type values. Most of them appear to be routes. Canvas is nice and links up the user, the course if the path is under /courses, and some other useful information.
Part of the problem with the requests table is the beauty of the web and the Canvas LMS REST API. The API that allows Canvas Developers to integrate their institution, extend Canvas, or create tools is the same API that Canvas itself is built on. This means that any requests to the server made by Canvas are also logged, not just clicks or transactions made by the user.
Here's the best way I can demonstrate. Open Canvas and go to the Dashboard.
Right click the page and choose Inspect or Inspect Element, to open the Developer Tools - for the Canvas User
Click on the Network tab, then the Clear button. Cleared? Good.
Now start moving your mouse over the interface. Here are some targets.
Now look at the Network tab again. Most of the requests came from your clicks, or even from hovering.
Look closely, do you see the unread_count? This is not something you performed, this was Canvas checking for new messages in your inbox to update the flag in the navigation.
The problem with the requests table is also what makes Canvas a great LMS: "Born in the Cloud".
This, and LTIs. LTIs and more are hosted outside of Canvas in the cloud, creating a lot of noise in the table: rows containing requests that were triggered by neither the user nor Canvas, and that we might not need for these purposes.
I found this because, like many of you, we have full time ** students. One of my early questions was geared toward understanding whether all our users were local or whether they roamed. Can we make instructors aware of when students are traveling? Can we be empathetic to timezone differences? To answer this, I used some of the many geolocation APIs on the web to collect location data for the remote_ip values in the requests table. At first I was extremely impressed with how many students we had traveling. Then I counted... there were too many students.
To check, I used Pseudonym Dim - Canvas Data Portal, which contains the user's last_login_ip and current_login_ip.
Here's an overlay of logins vs. requests in Tableau. Student Logins Noise
I generated this map to share at Hack Night. Before that, we generated a map for NVLA with just student logins.
As you can see, the physical location of a user is different from the location of some of their requests. If you understand the cloud, you can also see that a traveling student starts triggering cloud services in the regions they travel to. It's also possible that a student sitting at home using Canvas on their laptop while also using their phone has a mobile IP address from another state. Solved: IP address in another state? - Verizon Fios Community
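The login-versus-request comparison can be sketched in a few lines of Python. This is a minimal illustration, not the workflow I ran in Tableau: the function name, the dictionary shapes, and the sample rows are all hypothetical, and the IPs come from ranges reserved for documentation. It only assumes that each request row carries a user_id and remote_ip, and that you have collected a set of login IPs per user (for example from last_login_ip and current_login_ip).

```python
def split_by_login_ip(requests, login_ips):
    """Split request rows into those sent from a known login IP and the rest.

    requests:  iterable of dicts with "user_id" and "remote_ip" keys (assumed shape)
    login_ips: dict mapping user_id -> set of IPs seen at login (assumed shape)
    """
    matched, unmatched = [], []
    for row in requests:
        known = login_ips.get(row["user_id"], set())
        if row["remote_ip"] in known:
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched


# Hypothetical sample data
login_ips = {101: {"192.0.2.10", "192.0.2.11"}}
requests = [
    {"user_id": 101, "remote_ip": "192.0.2.10", "url": "/courses/1/modules"},
    {"user_id": 101, "remote_ip": "198.51.100.7", "url": "/api/v1/courses/1/files"},
]

matched, unmatched = split_by_login_ip(requests, login_ips)
print(len(matched), "from a login IP;", len(unmatched), "from somewhere else")
```

An unmatched row is not automatically noise. As noted above, it could be the same student's phone on a mobile network, so treat the split as a starting point for filtering.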
Along with the web_application_* fields, URL paths like /api, /ping, /pageviews, and others make it difficult to filter the massive amount of data that grows in the requests table. If you want to try anyway, check out Requests Table and the discussion about how to host and handle the large table, and filter or delete rows.
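As a rough illustration of path-based filtering, here is a small Python sketch. The prefix list is just the three paths named above, the helper name is hypothetical, and the sample URLs are made up. It is deliberately coarse: some /api requests may well be the direct result of a click, so dropping them all trades accuracy for simplicity.

```python
# Path prefixes treated as background traffic (taken from the paths named above).
BACKGROUND_PREFIXES = ("/api", "/ping", "/pageviews")


def is_background(url):
    """Return True if the URL path is, or sits under, one of the prefixes."""
    path = url.split("?", 1)[0]  # ignore any query string
    return any(path == p or path.startswith(p + "/") for p in BACKGROUND_PREFIXES)


# Hypothetical sample URLs
urls = [
    "/courses/1/pages/home",
    "/api/v1/conversations/unread_count",
    "/ping",
    "/courses/1/assignments/5?module_item_id=9",
]

likely_clicks = [u for u in urls if not is_background(u)]
print(likely_clicks)
```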
OK, Let's Try
Here's a scenario. A common question in the Community.
Daily User Activity in a Course
where
- course_id is not null and course_id = #
- user_id is not null and user_id = #
grouping by
- course_id
- user_id
- timestamp: the complete date time, helps narrow down sessions
- timestamp_day: quickly group by day - redundant, but really happy they provide this
- session_id: helpful for trying to separate windows of user activity, this helps reduce idle time from our collection
- remote_ip: identifies the user on the internet/location, this can change throughout the day, it also helps separate sessions
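The outline above can be tried against a toy table with Python's built-in sqlite3 module. This is a sketch under assumptions, not the production query: the table holds only the columns named above, the rows are invented, and I aggregate timestamp with MIN/MAX instead of grouping by it, since grouping by the full date time would put every request in its own group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE requests (
        user_id INTEGER, course_id INTEGER,
        timestamp TEXT, timestamp_day TEXT,
        session_id TEXT, remote_ip TEXT
    )
""")

# Hypothetical rows: one user with two sessions, another user, and a non-course request.
rows = [
    (101, 1, "2018-03-01 09:00:00", "2018-03-01", "s1", "192.0.2.10"),
    (101, 1, "2018-03-01 09:05:00", "2018-03-01", "s1", "192.0.2.10"),
    (101, 1, "2018-03-01 20:00:00", "2018-03-01", "s2", "198.51.100.7"),
    (102, 1, "2018-03-01 10:00:00", "2018-03-01", "s3", "192.0.2.20"),
    (101, None, "2018-03-01 09:01:00", "2018-03-01", "s1", "192.0.2.10"),
]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)", rows)

QUERY = """
    SELECT course_id, user_id, timestamp_day, session_id, remote_ip,
           MIN(timestamp) AS first_request,
           MAX(timestamp) AS last_request,
           COUNT(*)       AS request_count
    FROM requests
    WHERE course_id IS NOT NULL AND course_id = ?
      AND user_id   IS NOT NULL AND user_id   = ?
    GROUP BY course_id, user_id, timestamp_day, session_id, remote_ip
    ORDER BY first_request
"""

result = conn.execute(QUERY, (1, 101)).fetchall()
for row in result:
    print(row)
```

Each output row is one window of activity for the user in the course on a given day, which is the shape the charts below are built from.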
If a user walks away from the screen while Canvas is open, Canvas will run /ping requests that keep things alive. We can use session_id and remote_ip to attempt to filter the data down to active sessions. If you don't filter out inactive requests, you will likely end up with data that shows user activity for hours, all day, sometimes multiple days.
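One way to approximate active time is sketched below in Python. It skips /ping rows, groups the rest by session_id and remote_ip, and only counts the gap between two consecutive requests when that gap is short. The 10 minute cutoff, the function name, the row shape (including a url field), and the sample rows are all assumptions made for illustration; pick a cutoff that suits your own courses.

```python
from collections import defaultdict
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(minutes=10)  # assumed cutoff; longer gaps count as idle


def active_minutes(rows, idle_limit=IDLE_LIMIT):
    """Estimate active minutes per (session_id, remote_ip), ignoring /ping rows."""
    by_session = defaultdict(list)
    for row in rows:
        if row["url"].startswith("/ping"):
            continue  # keep-alive traffic, not user activity
        key = (row["session_id"], row["remote_ip"])
        by_session[key].append(datetime.fromisoformat(row["timestamp"]))

    totals = {}
    for key, stamps in by_session.items():
        stamps.sort()
        active = timedelta()
        for earlier, later in zip(stamps, stamps[1:]):
            gap = later - earlier
            if gap <= idle_limit:
                active += gap
        totals[key] = active.total_seconds() / 60
    return totals


# Hypothetical sample rows
sample = [
    {"session_id": "s1", "remote_ip": "192.0.2.10", "timestamp": "2018-03-01 09:00:00", "url": "/courses/1"},
    {"session_id": "s1", "remote_ip": "192.0.2.10", "timestamp": "2018-03-01 09:04:00", "url": "/courses/1/modules"},
    {"session_id": "s1", "remote_ip": "192.0.2.10", "timestamp": "2018-03-01 09:30:00", "url": "/ping"},
    {"session_id": "s1", "remote_ip": "192.0.2.10", "timestamp": "2018-03-01 11:00:00", "url": "/courses/1/grades"},
    {"session_id": "s2", "remote_ip": "198.51.100.7", "timestamp": "2018-03-01 20:00:00", "url": "/courses/1"},
]

totals = active_minutes(sample)
print(totals)
```

Without the cutoff, the first session would appear to last two hours. With it, only the four minutes between the first two requests are counted.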
Breaking it up into sessions, with a stacked bar chart (minutes on the y-axis)
compared to a student with less activity
The plot line, showing average student activity for the course
Let's zoom out, to all students in the course
This at least measures all users equally. Whether it's fully accurate is questionable, and what about mobile?
Back to 'conversation starters, not smoking guns'.
Here's a query - How Do I Determine Time Spent on Site #comment-97617
Nevada Learning Academy at CCSD uses this data, along with course activity by hour and submission times, to identify when students are active and to schedule Live Sessions for the most popular time of day or weekday. A teacher with full time and part time students can schedule sessions when the most students will be available, or go further and split days and hours to be available for different groups. Teachers can flex their time to make these accommodations.
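To make the scheduling idea concrete, here is a small Python sketch that counts distinct students per weekday and hour from request timestamps. It is only an illustration: the function name, row shape, and sample rows are hypothetical, and it looks at request timestamps alone, not submission times.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def busiest_slots(rows, top=3):
    """Return the (weekday, hour) slots with the most distinct students."""
    seen = set()
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        # A student counts once per slot, however many requests they made in it.
        seen.add((WEEKDAYS[ts.weekday()], ts.hour, row["user_id"]))
    counts = Counter((day, hour) for day, hour, _ in seen)
    return counts.most_common(top)


# Hypothetical sample rows
activity = [
    {"user_id": 101, "timestamp": "2018-03-01 09:00:00"},
    {"user_id": 101, "timestamp": "2018-03-01 09:05:00"},
    {"user_id": 102, "timestamp": "2018-03-01 09:30:00"},
    {"user_id": 101, "timestamp": "2018-03-02 14:00:00"},
]

print(busiest_slots(activity, top=1))
```

Counting distinct students per slot, not raw requests, keeps one very active student from dominating the result.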
From here, you can expand queries and join tables to do a decent number of user analytics.
Here are some examples. I will try to update, add, and curate.
Where does that bring us? We can keep trying to filter and define the data, helping make it more manageable, and a little more accurate for these purposes. But maybe there's another way?
I will share more in a future post, but this is relevant now: Live Events, an experimental beta feature from Canvas that sends messages to AWS Simple Queue Service. The messages are events and transactions, which are consumable in real time. During Hack Night, a member of the Canvas Data Team stated the paraphrased quote at the top of this post, adding that Live Events is a better way of dealing with requests and events. What about both?
I also had an opportunity at Hack Night to discuss this issue with some Canvas Engineers. While I have some other use cases for this, which I will share at a later time, my only request was to add the IP address of the user to the login event. I have been after the IP of each login for about 2 years now. With the IP of the login, we can filter out or specifically collect just the requests of the user's computer* instead of the noise, getting us closer to user activity and clicks.
* You might ask, why not just use the last_login_ip and current_login_ip from the pseudonym table?
- Canvas Data compiles once a day, 1 row per user. If the user logs into 3 or more devices, something is lost.
I invite any questions, comments, or contributions below; adapt the queries, post results, maybe a visualization.
What questions does this table help answer?
CCSD Canvas Team
** I have tried adding 'o-n-l-i-n-e' to this sentence a dozen times. Jive keeps removing it!
It is also getting removed before LMS. What gives, Jive?