We currently leverage the Canvas Analytics API to query page views as a measure for engagement.
The page views are, typically, less than the counts per user from the requests table - which makes logical sense. However, the magnitude of this difference bears investigation.
Some additional context from @rubyn :
I am looking at page_views count. The results from Redshift doesn't match the Analytics API for the page_views value, nor what I see on the Course Analytics page in Canvas. Should I count a different thing, or add more filters to get the expected results?
select u.canvas_id as "user id",
c.canvas_id as "course id",
from requests r
inner join user_dim u on r.user_id = u.id
inner join course_dim c on r.course_id = c.id
inner join enrollment_fact ef on r.user_id = ef.user_id and r.course_id = ef.course_id
inner join enrollment_dim ed on ef.enrollment_id = ed.id
where c.canvas_id in ('565')
and ed.type = 'StudentEnrollment'
and ed.workflow_state <> 'deleted'
group by u.canvas_id, c.canvas_id
order by u.canvas_id
compared to Analytics API: https://umich.instructure.com//api/v1/courses/565/analytics/student_summaries
The requests table actually shows much more than analytics, I wouldn't be surprised if it was literally easily double, or triple the amount of requests from pageviews/analytics. There isn't a calculation method per say for Page Views, it's more page views don't get created as often. To put this in perspective, a page view is generated by multiple methods. But this is whenever the user manually goes to a page, clicks a link, etc. I.E. they physically go to an item, or click to go to another page. This is most noticeable with two types of traffic that are often missing:
1. Mobile Apps (API Requests).
2. AJAX (asynchronous) Requests.
If you've been with Canvas for a while that mobile apps don't always generate page views for every action that the PC does, this is due to the fact that it's using the API. It's not counting as the user actually going to a specific page because in reality they aren't, they're just getting data, and that's being scaffolded into a view on the mobile app. Same thing with backend async requests your browser kicks off. This is say loading more than 100 students, each 100 students is actually a different network request (due to pagination), but these only show up once in Page Views. Because it isn't the user manually clicking something, or manually going to a page. They went to the page manually, that's one page view. (Sometime pagination can kick off multiple page views, but these are very rare, and for the purpose of the example lets just say they do one). All the loading in the background doesn't register a page request as that would fill up page requests really quickly. With a lot of data that wouldn't be useful to analytics, except on a massive scale (Canvas Data). For quick analytics (like the ones built into Canvas) they would muddy things up.
However the Requests Table generates an item for every single request. Regardless if it's through the api, or an asynchronous ajax request. Every single request that gets processed gets added in the requests table. Since the Canvas Apps rely on the API, and The Browser makes heavy use of Asynchronous background requests your requests table should be much bigger than your page view requests.
You can see how it could heavily add up. Since this is the way requests work they allow for excellent tracking (the only thing that you don't have is if the app server errors out for some reason, before it gets to writing the requests table (extremely rare) which we'll have internally), and other sorts of what exact steps someone is taking through the program.
As for getting the same stats I'd recommend filtering out Mobile traffic that's done over the api completely. For example apis that contain 'api/v1'. That would get rid of almost all of the Mobile Apps API requests, and the AJAX Requests (as those usually use the API), Next I'd filter out the user agent to anything that isn't a recognized browser for example IE, Chrome, FireFox, Opera, Safari, etc.
This would probably give you the closest estimate with the most minimal amount of work. If you'd like you could test which URLs appear in page views, and let those through, etc. Though that would be considerably harder for minimal benefit.
Hi Eric, I have come across this discussion and see that your post was back in 2016. But I am struggling with Steve's original three questions now in 2019. Maybe things have changed?
In my case, my Canvas administrator's first pass at trying to give me "page views" from tables resulted in many few hits than what gets reported to me within my course. My admin mentioned that she might not be catching API/ajax hits, and is working on revising her query.
Not being an api or ajax expert, I am not sure if you are saying that if a student uses the mobile app, that their page views are lost? Or are you saying that the tabled page views can grow to exorbitant numbers of records?
Might you have the SQL that will produce an exact match with what Canvas Analytics reports?
We have also been looking at this issue for some time.
We implemented Canvas in mid-2016 and our requests table is now approaching 2 billion records. Early on Instructure advised that this volume could be cut down by excluding api calls, (i.e. the url starts with '/api/v1/'). We used this, together with excluding records that didn't have a valid user_id and course_id to generate a cut-down requests table, currently about 330 million records. If you then use the http_method column to separate views, (GET), from actions, (POST), you have a starting point. The web_application_controller and web_application_action columns can then be used to eliminate non user-generated records, for example taking a quiz generates a lot of system-generated records. This approach has allowed us to get reasonably close to the numbers that appear in the Canvas Analytics page.
We are in the process of developing a more sophisticated custom table that pulls in all element ids that are not included in the native requests table by parsing the url column and joining to the canvas_id column of the relevant tables, (e.g. file_id, wiki_page_id, conference_id, module_item_id, group_id). The data structure is not consistent, in some cases the value in the url column is the canvas_id of the element table, (e.g. file_id), in others the url value is the canvas_id of the module_item which in turn is used to find the relevant id, (e.g. wiki_page_id).
This approach, however, apparently excludes mobile app data, which can be identified using the user_agent column, (e.g. user_agent starts with Canvas, CanvasKit, iCanvas, candroid etc). We intend to approach this using a separate custom table because parsing the user_agent column is a challenge and the list of values for Canvas mobile apps is likely to evolve.
I doubt that the numbers would ever agree 100% with the analytics numbers, but close enough to be useful.
Hope this helps.
Thanks, Stuart, for your post. You are much "deeper" into this than I.
From my simple-minded approach, I assume that when I go to Analytics for a particular student that Canvas, in the background, must run an SQL query in order to calculate the Page Views, which it shows to me (the teacher.) I cannot imagine that Canvas would keep a "counter" that accumulates Page Views, and then throws away the interaction data.
So, why can't one simply run the same query that Canvas runs? What am I missing? Is the Canvas query a big secret? I don't understand why we can't run a query that will generate exactly the same results. Isn't 1+1 equal always equal to 2? How are we able to trust the Canvas Page View counts as correct?
I guess that the answer probably comes down to the fact that the analytics page is generated directly from the Canvas database. That's how it can display current data. On the other hand, the requests data available from Canvas Data is derived from the web server access log. This is why it's typically 24+ hrs old.
To run the same sql you would need direct access to the database.
Hi Stuart, thanks for your continued thoughts about this.
Let me mention several things (now that I have slept on this.)
1) My course is over, so I am not worried about real-time accuracy. Yes, I do understand that the table data might be 24 hours old. My analytics person has not been able to recreate the Page View counts via a SQL query. Can you? If so, can you share your query?
2) As I rethink your post, I am wondering if you are saying that you are getting more "Page Views" from your table query than what shows within the course? Our problem, so far, is that the table query is producing less "Page Views" than what Analytics from within the course is showing. Can you confirm what you are seeing, tables > Analytics versus tables < Analytics?
3) It appears that we are trying to reverse-engineer the query that produces the Page View count within a course. I wonder why? Why can't someone from Instructure tell us, or give us, the query that Canvas is using? I don't understand this? Is this meant to be a big secret?
Use of direct SQL within Canvas is limited. Most of the time, it is Ruby code that accesses and writes to the database. The code that they use is mostly available, since Canvas is open source, but sometimes it's difficult to track things down. Anyway, it is probably not there in a "here is the SQL" block anyone could give you. And it wouldn't help since you don't have access to the data.
The number of page view counts is in a different table than requests. The analytics and access reports use the summary data, not the requests table, which would take too long to calculate on the fly. The number of participations is available through the Get user-in-a-course-level participation data endpoint of the Analytics API, but not in Canvas Data. It is broken into buckets of 1 hour time intervals. Instead of reverse engineering it, you can use the API call. However, that's one API call for each enrollment, but at least it doesn't appear (in my limited testing with a sample size of 1) involve pagination. For an entire school, that would be a lot of calls. For one course, even a large one, it's not so bad. People want to use Canvas Data because it's already collected and in one place, but it does not duplicate all of the functionality of the API, and in other places has data that the API doesn't.
Some reasons for the mismatch might be:
There have been issues, I don't know if they're completely resolved, with mobile apps and data not showing up in the page counts or the requests table.
The Canvas Data requests table documentation starts off with a disclaimer:
Disclaimer: The data in the requests table is a 'best effort' attempt, and is not guaranteed to be complete or wholly accurate. This data is meant to be used for rollups and analysis in the aggregate, _not_ in isolation for auditing, or other high-stakes analysis involving examining single users or small samples. As this data is generated from the Canvas logs files, not a transactional database, there are many places along the way data can be lost and/or duplicated (though uncommon). Additionally, given the size of this data, our processes are often done on monthly cycles for many parts of the requests tables, so as errors occur they can only be rectified monthly.
The exact questions you asked back in February in this thread: Analytics, Page Views, Weird! I have a student who has 529 page views through the analytics page (this matches the API results). If I go to the Access Report for that student, I get 495 "times viewed", but I know that one takes one view away for each quiz attempt since it's also counted as an interaction. I have 28 reading quizzes (we're now up to 523), and there might be six other places I'm missing, but essentially those are the same numbers. Those page views include basic activity from a GET, including accesses to the home page.
The analytics API gives the summary, including the hourly breakdown, but not the individual items. The Access Report (not available through the API) gives the breakdown of assets, but not the times (except for first and last). Canvas Data may be missing things and shouldn't be used at the individual student level.
Hi James, thanks for your post and information.
I am going to take a while to digest everything that you have said. I'll also direct my Canvas analytics administrator to your post.
I am not interested in real-time stats. My course is over, and I just what to know which pages were viewed by students, and when.
Where is the "Total Activity" statistic held? Is it also in the Summary report area?
It's been a long day of giving finals, can you clarify what you mean by "Total Activity" ? I don't see it mentioned earlier in this thread. There is a "total activity time" (in seconds) that shows up on the People (roster) page and is available in the Enrollments API.
If you're looking for the total of the number of something (like the 529 page views in my example), then you sum the number of page views or you use the student summary (mentioned below).
If by "Total Activity" you mean the number of times they actually did something with a POST or PUT, then you would look at the participations object of the analytics API call I mentioned earlier. It has a created_at and url for each item. If you want to know what they did, you would need to parse the url and get the ID for whatever it was.
Another option is to use the Get course-level student summary data endpoint of the Analytics API. It contains much of the information that you get on the overview page of the View Course Analytics page. It has an array, one object per student, with a summary of their activity. This is the same student as before, but they did some more work today and are up to 536 page_views.
The max are for anyone in the course so you can compare.
Much of what you get after you you click on a students name comes from the Get user-in-a-course-level participation data endpoint.
The page_views and activity are pretty much what shows up on the access report. A participation is when they create or update content (take a quiz, submit an assignment, post to a discussion). In the Admin Page Views, it shows up as a checkmark in the participated column.
If you look in the Canvas Data requests table, it will show up as a method of PUT or POST.
The page_view happens whenever they access information (not just content pages -- this course has only one content page in it). In the requests table, it has a method of GET.
Still, not everything that goes into the requests table goes in as a page view or a participation. For an idea of what does get logged, look for log_asset_access in the source code or look at the documentation for the access report.