Canvas Data Requests count vs. Canvas Analytics Page Views
We currently leverage the Canvas Analytics API to query page views as a measure for engagement.
The page views are, typically, less than the counts per user from the requests table - which makes logical sense. However, the magnitude of this difference bears investigation.
- Exactly how is page views calculated?
- How can that calculation be replicated in the requests table?
- What additional "views" information is included in requests table that is "valid" user activity and how much is "noise"? (e.g., does one user action in an assignments page equate to 5 lines in the requests table for some reason)
Some additional context from @rubyn :
I am looking at page_views count. The results from Redshift doesn't match the Analytics API for the page_views value, nor what I see on the Course Analytics page in Canvas. Should I count a different thing, or add more filters to get the expected results?
select u.canvas_id as "user id",
c.canvas_id as "course id",
from requests r
inner join user_dim u on r.user_id = u.id
inner join course_dim c on r.course_id = c.id
inner join enrollment_fact ef on r.user_id = ef.user_id and r.course_id = ef.course_id
inner join enrollment_dim ed on ef.enrollment_id = ed.id
where c.canvas_id in ('565')
and ed.type = 'StudentEnrollment'
and ed.workflow_state <> 'deleted'
group by u.canvas_id, c.canvas_id
order by u.canvas_id
compared to Analytics API: https://umich.instructure.com//api/v1/courses/565/analytics/student_summaries
The requests table actually shows much more than analytics, I wouldn't be surprised if it was literally easily double, or triple the amount of requests from pageviews/analytics. There isn't a calculation method per say for Page Views, it's more page views don't get created as often. To put this in perspective, a page view is generated by multiple methods. But this is whenever the user manually goes to a page, clicks a link, etc. I.E. they physically go to an item, or click to go to another page. This is most noticeable with two types of traffic that are often missing:
1. Mobile Apps (API Requests).
2. AJAX (asynchronous) Requests.
If you've been with Canvas for a while that mobile apps don't always generate page views for every action that the PC does, this is due to the fact that it's using the API. It's not counting as the user actually going to a specific page because in reality they aren't, they're just getting data, and that's being scaffolded into a view on the mobile app. Same thing with backend async requests your browser kicks off. This is say loading more than 100 students, each 100 students is actually a different network request (due to pagination), but these only show up once in Page Views. Because it isn't the user manually clicking something, or manually going to a page. They went to the page manually, that's one page view. (Sometime pagination can kick off multiple page views, but these are very rare, and for the purpose of the example lets just say they do one). All the loading in the background doesn't register a page request as that would fill up page requests really quickly. With a lot of data that wouldn't be useful to analytics, except on a massive scale (Canvas Data). For quick analytics (like the ones built into Canvas) they would muddy things up.
However the Requests Table generates an item for every single request. Regardless if it's through the api, or an asynchronous ajax request. Every single request that gets processed gets added in the requests table. Since the Canvas Apps rely on the API, and The Browser makes heavy use of Asynchronous background requests your requests table should be much bigger than your page view requests.
You can see how it could heavily add up. Since this is the way requests work they allow for excellent tracking (the only thing that you don't have is if the app server errors out for some reason, before it gets to writing the requests table (extremely rare) which we'll have internally), and other sorts of what exact steps someone is taking through the program.
As for getting the same stats I'd recommend filtering out Mobile traffic that's done over the api completely. For example apis that contain 'api/v1'. That would get rid of almost all of the Mobile Apps API requests, and the AJAX Requests (as those usually use the API), Next I'd filter out the user agent to anything that isn't a recognized browser for example IE, Chrome, FireFox, Opera, Safari, etc.
This would probably give you the closest estimate with the most minimal amount of work. If you'd like you could test which URLs appear in page views, and let those through, etc. Though that would be considerably harder for minimal benefit.
Hi Eric, I have come across this discussion and see that your post was back in 2016. But I am struggling with Steve's original three questions now in 2019. Maybe things have changed?
In my case, my Canvas administrator's first pass at trying to give me "page views" from tables resulted in many few hits than what gets reported to me within my course. My admin mentioned that she might not be catching API/ajax hits, and is working on revising her query.
Not being an api or ajax expert, I am not sure if you are saying that if a student uses the mobile app, that their page views are lost? Or are you saying that the tabled page views can grow to exorbitant numbers of records?
Might you have the SQL that will produce an exact match with what Canvas Analytics reports?
We have also been looking at this issue for some time.
We implemented Canvas in mid-2016 and our requests table is now approaching 2 billion records. Early on Instructure advised that this volume could be cut down by excluding api calls, (i.e. the url starts with '/api/v1/'). We used this, together with excluding records that didn't have a valid user_id and course_id to generate a cut-down requests table, currently about 330 million records. If you then use the http_method column to separate views, (GET), from actions, (POST), you have a starting point. The web_application_controller and web_application_action columns can then be used to eliminate non user-generated records, for example taking a quiz generates a lot of system-generated records. This approach has allowed us to get reasonably close to the numbers that appear in the Canvas Analytics page.
We are in the process of developing a more sophisticated custom table that pulls in all element ids that are not included in the native requests table by parsing the url column and joining to the canvas_id column of the relevant tables, (e.g. file_id, wiki_page_id, conference_id, module_item_id, group_id). The data structure is not consistent, in some cases the value in the url column is the canvas_id of the element table, (e.g. file_id), in others the url value is the canvas_id of the module_item which in turn is used to find the relevant id, (e.g. wiki_page_id).
This approach, however, apparently excludes mobile app data, which can be identified using the user_agent column, (e.g. user_agent starts with Canvas, CanvasKit, iCanvas, candroid etc). We intend to approach this using a separate custom table because parsing the user_agent column is a challenge and the list of values for Canvas mobile apps is likely to evolve.
I doubt that the numbers would ever agree 100% with the analytics numbers, but close enough to be useful.
Hope this helps.
Thanks, Stuart, for your post. You are much "deeper" into this than I.
From my simple-minded approach, I assume that when I go to Analytics for a particular student that Canvas, in the background, must run an SQL query in order to calculate the Page Views, which it shows to me (the teacher.) I cannot imagine that Canvas would keep a "counter" that accumulates Page Views, and then throws away the interaction data.
So, why can't one simply run the same query that Canvas runs? What am I missing? Is the Canvas query a big secret? I don't understand why we can't run a query that will generate exactly the same results. Isn't 1+1 equal always equal to 2? How are we able to trust the Canvas Page View counts as correct?
I guess that the answer probably comes down to the fact that the analytics page is generated directly from the Canvas database. That's how it can display current data. On the other hand, the requests data available from Canvas Data is derived from the web server access log. This is why it's typically 24+ hrs old.
To run the same sql you would need direct access to the database.
Hi Stuart, thanks for your continued thoughts about this.
Let me mention several things (now that I have slept on this.)
1) My course is over, so I am not worried about real-time accuracy. Yes, I do understand that the table data might be 24 hours old. My analytics person has not been able to recreate the Page View counts via a SQL query. Can you? If so, can you share your query?
2) As I rethink your post, I am wondering if you are saying that you are getting more "Page Views" from your table query than what shows within the course? Our problem, so far, is that the table query is producing less "Page Views" than what Analytics from within the course is showing. Can you confirm what you are seeing, tables > Analytics versus tables < Analytics?
3) It appears that we are trying to reverse-engineer the query that produces the Page View count within a course. I wonder why? Why can't someone from Instructure tell us, or give us, the query that Canvas is using? I don't understand this? Is this meant to be a big secret?
Use of direct SQL within Canvas is limited. Most of the time, it is Ruby code that accesses and writes to the database. The code that they use is mostly available, since Canvas is open source, but sometimes it's difficult to track things down. Anyway, it is probably not there in a "here is the SQL" block anyone could give you. And it wouldn't help since you don't have access to the data.
The number of page view counts is in a different table than requests. The analytics and access reports use the summary data, not the requests table, which would take too long to calculate on the fly. The number of participations is available through the Get user-in-a-course-level participation data endpoint of the Analytics API, but not in Canvas Data. It is broken into buckets of 1 hour time intervals. Instead of reverse engineering it, you can use the API call. However, that's one API call for each enrollment, but at least it doesn't appear (in my limited testing with a sample size of 1) involve pagination. For an entire school, that would be a lot of calls. For one course, even a large one, it's not so bad. People want to use Canvas Data because it's already collected and in one place, but it does not duplicate all of the functionality of the API, and in other places has data that the API doesn't.
Some reasons for the mismatch might be:
There have been issues, I don't know if they're completely resolved, with mobile apps and data not showing up in the page counts or the requests table.
The Canvas Data requests table documentation starts off with a disclaimer:
Disclaimer: The data in the requests table is a 'best effort' attempt, and is not guaranteed to be complete or wholly accurate. This data is meant to be used for rollups and analysis in the aggregate, _not_ in isolation for auditing, or other high-stakes analysis involving examining single users or small samples. As this data is generated from the Canvas logs files, not a transactional database, there are many places along the way data can be lost and/or duplicated (though uncommon). Additionally, given the size of this data, our processes are often done on monthly cycles for many parts of the requests tables, so as errors occur they can only be rectified monthly.
The exact questions you asked back in February in this thread: Analytics, Page Views, Weird! I have a student who has 529 page views through the analytics page (this matches the API results). If I go to the Access Report for that student, I get 495 "times viewed", but I know that one takes one view away for each quiz attempt since it's also counted as an interaction. I have 28 reading quizzes (we're now up to 523), and there might be six other places I'm missing, but essentially those are the same numbers. Those page views include basic activity from a GET, including accesses to the home page.
The analytics API gives the summary, including the hourly breakdown, but not the individual items. The Access Report (not available through the API) gives the breakdown of assets, but not the times (except for first and last). Canvas Data may be missing things and shouldn't be used at the individual student level.
Hi James, thanks for your post and information.
I am going to take a while to digest everything that you have said. I'll also direct my Canvas analytics administrator to your post.
I am not interested in real-time stats. My course is over, and I just what to know which pages were viewed by students, and when.
Where is the "Total Activity" statistic held? Is it also in the Summary report area?
It's been a long day of giving finals, can you clarify what you mean by "Total Activity" ? I don't see it mentioned earlier in this thread. There is a "total activity time" (in seconds) that shows up on the People (roster) page and is available in the Enrollments API.
If you're looking for the total of the number of something (like the 529 page views in my example), then you sum the number of page views or you use the student summary (mentioned below).
If by "Total Activity" you mean the number of times they actually did something with a POST or PUT, then you would look at the participations object of the analytics API call I mentioned earlier. It has a created_at and url for each item. If you want to know what they did, you would need to parse the url and get the ID for whatever it was.
Another option is to use the Get course-level student summary data endpoint of the Analytics API. It contains much of the information that you get on the overview page of the View Course Analytics page. It has an array, one object per student, with a summary of their activity. This is the same student as before, but they did some more work today and are up to 536 page_views.
The max are for anyone in the course so you can compare.
Much of what you get after you you click on a students name comes from the Get user-in-a-course-level participation data endpoint.
The page_views and activity are pretty much what shows up on the access report. A participation is when they create or update content (take a quiz, submit an assignment, post to a discussion). In the Admin Page Views, it shows up as a checkmark in the participated column.
If you look in the Canvas Data requests table, it will show up as a method of PUT or POST.
The page_view happens whenever they access information (not just content pages -- this course has only one content page in it). In the requests table, it has a method of GET.
Still, not everything that goes into the requests table goes in as a page view or a participation. For an idea of what does get logged, look for log_asset_access in the source code or look at the documentation for the access report.
I still need to get caught up with some of the discussion that has happened since last Friday. James, you asked about "Total Activity." I made this short video showing the two statistics, Total Activity, and Page Views, that I seek to verify. When I say "verify" I mean that I should be able to get to the same data that Canvas is using to create these statistics, do a query (or run a program), and come up with the exact, not "close", values.
Thanks for providing the video, @richard-jerz .
The data IS bogus, suspect, inaccurate, or whatever word you want to use. Other people have discussed that at length in the Community and that there isn't a good way to truly know how much time people spent on a page and worse yet, whether they actually did anything while they were on that page or whether it sat there while they were multitasking and came back to it so it looks like it spent a bunch of time there.
However, that data is available through the List enrollments endpoint of the Enrollments API. It returns a last_activity_at timestamp in ISO8601 format and a total_activity_time in seconds. You can make one API call and get this for all of the students in a course.
I have not seen it in Canvas Data. It definitely is not available in the requests table, which is more like the log file for a web server. When you make a web page request, you do not know how long that person will spend on the page, but the log file entry has to be written right then, not held open waiting for some additional activity.
The closest you would get to total activity time through the requests table would be to incorporate the session ID and look at when the first request was made and when the last request was made. Still, they may be doing things in other courses during that time, sessions might go on for a really long time with large portions of inactivity, and I'm not sure how the mobile stuff works, but I've seen requests come from mobile devices where the student isn't actively using it at the time.
Basically, you do not want to pull this from the requests table for individual students, which is why they gave the warning.
You are not going to be able to recreate the values exactly from the requests table. If that's a condition for using it, then stop and give up now.
Realize that page views are not the same as content page views. Page views in the analytics are closer to web page views. Content page views are a subset of the requests views. It should be a proper subset, but I cannot confirm that. What Canvas is treating as a page view is actually an asset access, which is what the Access Report also gives.
There's a note in the source code at the log_asset_access routine:
Asset accesses are used for generating usage statistics. This is how we say, "the user just downloaded this file" or "the user just viewed this wiki page". We can then after-the-fact build statistics and reports from these accesses. This is currently being used to generate access reports per student per course.
Looking at the database creation commands, it appears that there is just one entry for each asset / user combination. It contains created_at and last_access, which give you the first and last time the student viewed. It does not appear to save every time that the student viewed something.
Going back to this data being suspect, when I look at the source code for the asset_user_access.rb model, I see interaction seconds are given a default of 30 seconds when there isn't a recorded value for it. It has a lot of guess work in involved.
It just doesn't seem to me that it should be this difficult for faculty or IDs to get basic interaction level stats for students in an online course. I wish there were a way to .... click some boxes and say "I want to see the stats for these things" and have Canvas return a report that can be easily read.
Well, Cynthia, I am glad to see another person seeing exactly what I see, it shouldn't be this difficult! There are probably many ways that this can be done. For example, I should be able to click on the "bar" that shows 8 Page Views, and magically, I should be able to get to the details of these 8 Page Views. It should be as simple as that.
As it stands right now, Canvas is stretching it a bit to call "overall counted data" "analytics" when timestamp and specific items are missing from the data, and when data cannot be verified. At this point, Canvas should call this data "gobbledegook."
There are analytics and there is analytics beta (analytics 2.0) in development. GraphQL is the closest thing to allowing a one-stop shop and it's being developed. The student course information cards that you get from the People (roster) page and the new analytics beta both use it. There is a graphical interface, but it's not the kind you're wanting. It's an editor that allows you to create and execute GraphQL statements (kind of like a toolbox to help you write SQL but you still have to know how everything fits together). I haven't made time to figure it out yet.
My course is over, and I just what to know which pages were viewed by students, and when.
The page_views (and it's counterpart the Canvas Data requests table) will be the only places that you'll get the "when" for anything accessed more than twice. The access report only contains the first and last view, as well as the total number, but not the details on each one. It doesn't sound like it, but if you can live with first view time, last view time, and total number of times, the https://community.canvaslms.com/docs/DOC-6061-obtaining-and-using-access-report-data-for-an-entire-c... script I wrote allows you to get that for an entire course on demand.
Hi @richard-jerz :
I've asked for the total_activity field to be added to the enrollment tables in Canvas Data.
However, you'll notice that we are using it to flag potential inactivity vs. show actual time spent in a course. It's not an accurate number by any means since you can leave your computer logged in. However, if it only shows 30 minutes for the entire course, you can assume that a user wasn't doing much during the term.
Okay, Joni. Seems like a good idea. I gave you my vote.
You might notice that Kevin, who is Director of Product Data Analytics at Canvas, is addressing the three (initial) questions about student analytics. At this point (yesterday) he tackled question #1. But I think that he will get to the other two questions. Question #3 that I posted is about the "Total Activity" statistic, and I hope Kevin describes exactly how it is calculated and provides the logic for my internal analytics team so that they can verify this number from the data.
If you are correct, that "it's not an accurate number," then there is not any reason to show it. For example, if your bank said "Your account has $1,500 in it, but this number is inaccurate" how would you react?
Yeah, that number is pretty bad, but it's always been included as part of LMS reporting in any sorta-modern LMS. I don't think it should be included in any UI. I think it is useful for looking for blatent non-participation by an instructor but not much else. With students, at least you have the data points of assignments submitted and grading. I'm not sure how mobile activity calculates into that number either.
If anyone else has better ways of looking at instructor activity for the obvious problems, I'd love to hear about it. This was just the easiest solution I have come up with that took into account that not all courses use the same tools.
Joni, I don't think so. I know that Moodle doesn't report it because Moodle realizes that it is a bogus number. I don't ever recall seeing it in Desire2Learn, nor do I recall Blackboard showing it. However, I could be wrong. Maybe you or someone else can provide screenshots of these other LMSs showing "Total Activity" time.
Yes, of course there is a better way. One way would be to use "Page Views." However, right now, this too is a bogus statistic because it doesn't tell what the student looked at. Or, if you are interested in student activity, why not look at their grade book. If the course is in week 3, for example, and a student does not have any grade for the assignments, wouldn't this indicate that the student is absent? Or why not just look at the "Last Activity" Canvas statistic (assuming that it might be accurate, but due to other inaccurate statistics that Canvas provides, I have some natural doubts that Last Activity is accurate.)
Joni, I am not trying to "pick" on you. You haven't answered my "bank balance" question. Really, what would you do? If your bank shows you bogus data, do you say "Well, I guess that I must have some money in my account?" Although some might see my analogy as silly, I don't. Shouldn't data shown by Canvas be as accurate as data shown by your bank?
So if you query the requests data for a specific course, restrict it to view, (GET), records and eliminate system-generated api calls, you should get more records than the analytics page.
select count(*) from requests_table where course_id = nnnnnnnnnnnnnnnnn and http_method = 'GET' and DBMS_LOB.SUBSTR(lmrr.url, 8, 1) <> '/api/v1/';
This is Oracle syntax, the way CLOBS are handled on other platforms is probably different.
Back to a previous question, and some of James' input, the reason we went with Canvas Data is twofold. When we migrated to Canvas we were informed that access to apis was limited by volume. Secondly, Canvas data provides a full data set over time, so if a requirement comes up requiring a comparison of data across multiple admission years, we have the data available.
In relation to another of James' comments regarding performance, this is why we are developing a custom requests table. The required queries take too long to be useful on the native requests data, however if you are populating the custom table with daily incremental data, the volume is readily manageable. Query performance on the custom table is then adequate using indexing etc.
We also partition the requests tables by the timestamp_month column to improve performance of queries relating to a specific period.
Lastly, we convert the timestamp columns to local time in the custom table so we can easily see when a particular action occurred.
I'm sorry you were given two inaccurate statements when starting out. It sounds like you were able to work around it well.
APIs are limited by the a threshold, which depends on the complexity of the call and the number of concurrent calls that are being made. That may in turn limit the number of API calls you can make to the point that it becomes unfeasible to download the entire set of data on a timely basis. So, it may be limited by volume, but it's not as simple as "You can only download X gigabytes per day" kind of volume. When warehousing our data through the API, I average about 10 API calls per second (last night's run made 2116 calls in 189 seconds ~ 11.2 calls/sec). The threshold is 700 and I only got down to 353 with my concurrent requests. I average 10 because I make all of one type finish before I start fetching another type. But with some API calls, I have up to 50 going at a time.
Canvas Data does have the data that it has over the years (or since it was enabled in Canvas -- they don't create data that wasn't collected), but it doesn't have all of the data that is in Canvas. For example, you can get responses to individual quiz questions through the API, but not through Canvas Data.
Another data source is Live Events. I haven't used it yet, but it sounds like it has a lot of potential for people needing near real-time information or who want don't want the 36 hour delay with Canvas Data.
Still, there is no single source that has everything in it. Some questions can only be answered with API, some only with Canvas Data, some with both, and some not at all.
Thanks for the clarification. It may be that the rules have changed since 2016.
Agreed, Canvas data does not include the full data set. More tables have been added on a regular basis since we started. In December 2016 65 tables were available in schema 1.13.2, now there are 117 tables in schema 4.2.3.
We are using the canvasDataCli tool for sync / unpack. It would be good to have a look at the newer data loader, but it is not yet available for Oracle.
I haven't had a look at live events yet. As you say, there isn't a single data source, and my understanding is that this extends to live events where different data is available in the two available formats, raw and Caliper.
In relation to quiz question responses, are these not available in the TEXT column of the quiz_question_answer_dim table?
The available responses are available in quiz_question_answer_dim table, but not the responses that the students actually chose when they took the quiz. Sorry for not being clearer.
I'm still using the canvasDataCLI as well. My MySQL script to handle the import is woefully inaccurate because it relies on the schema file to be accurate and it isn't, especially when it comes to enumerated fields. Robert Carroll has looked into embulk and it looks like it has some potential.
Thanks for that. I haven't looked at quiz responses, so that's useful information. Seems to be be a bit of an oversight, because linking the quiz_question_answer_dim with the submission would appear to be straightforward. Even in the "record_answer" requests url records shows the link to the next question but not to the answer given.
I've generated Oracle tables manually to work around object name length and reserved word issues. We use Oracle Data Integrator to load from the text files into the raw staging tables, then to the cleansed "target" tables. I only use the schema.json to detect when the schema has changed, though for the past year or so the Instructure team has been sending notifications of impending schema changes.
The purpose of that table isn't to give the student responses. There is a need for what they have, but the actual responses would need to be another table altogether. Otherwise, you would only ever know about what choices were selected, not which ones were available.
Any table of student responses would be much, much, much larger. Think about a 100 question multiple-choice quiz with 4 results times 100 students = 40,000 rows just for one quiz; now think about all the quizzes in all the courses at an institution and some of them allowing multiple submissions. You get the point. I'm not saving it when I get it through the API, but I would probably save it as a JSON type and worry about parsing it later if I needed it. Then I would just have the 100 rows for the 100 students who took the quiz (I arbitrarily picked some numbers, we're small, but some schools have 1000's in a course).
The record_answer in the requests table is a POST and the data isn't sent as part of the URL, which explains why the answer isn't given there. They also scrub some of the information from the URL for privacy, but I do not know if that applies to anything beyond the authorization query parameter.
For completeness (in case someone stumbles across this thread), the actual responses are available in the get a submission/multiple submissions endpoints of the submissions API when you include the submission_history. I don't remember seeing it documented, I just remember stumbling across it when I needed the history for something else last October.
A lot of good discussions, James and Stuart. I will provide my thoughts down here.
First, I did post a video showing what I seek (see my post above.)
Stuart, you say "you should get more records than the analytics page." I don't want more records, I simply want to verify the Page View counts that Canvas shows to me. I should somehow be able to exactly match these values, otherwise "course analytics" as presented is invalid.
I see discussion about the quantity of records, and the speed of getting to this data. Neither of these is a concern in my course. This course is over, so even if the "query" took a day to run, no problem. I also am trying to just get to the pages that are viewed, and in my course, I only have a half-dozen of these. I am not concerned about quizzes and other elements. Furthermore, what is a big quantity? A million records?
Also, I see discussion about how to get to the data, via api's, etc.? I don't really care about the best way to get to the data. I assume that the data is in a table, or a file. Where ever the data is, Canvas is using it to generate these "Page Views" data. We should be able to do the same, and include "what was viewed" and "when it was viewed." This doesn't seem to be rocket science. But, if Canvas throws away this data, now we have a different problem!
As the instructor, I don't have access to the data. But I do have an internal Canvas Analytics person who does have some access to the data. I am just trying to help her help me.
Since I do not have access to the real data, I can only look at the data dictionary information. Yes, it seems to me that the place to start is with the Requests table. It seems to include timestamp information, the user_id, the course_id, and the URL that was requested. Yes, it also contains quiz, discussion, and other items that are NULL when not meaningful. In SQL form, I see something like this (for one student within one course.)
SELECT timestamp, user_id, real_user_id, course_id, url
WHERE course_id = "12345"
and user_id = "6789"
Of course, I would probably begin with the following query until I understand what the other columns contain.
WHERE course_id = "12345"
and user_id = "6789"
Furthermore, what is a big quantity? A million records?
Much larger than a million, but it depends on the size of your institution. We have been using Canvas since Fall 2012 and have 40 GB of compressed request data (435,345,868 records). We are a small community college so others will have much, much, larger tables. We have 4.5 million records in the requests table just between April 9 and April 26 of this year (all that I currently have loaded). Finding that out took my Canvas Data server 5.76 seconds and that's on a brand new server with solid state drives and 64 GB of RAM. It took the server 13 minutes to count the number of records we had in the downloaded files.
This is where Stuart was saying that filtering the data and re-packaging it can make the requests table more useful.
The requests table doesn't include isn't every web request that's made, otherwise we wouldn't have had the issue with mobile stuff not getting recorded. Indiscriminately throwing out all the /api/v1/ stuff when you load the requests table isn't a good thing to do because the mobile apps load things through the API.
Hi James (and others), let's see if I can summarize what we have learned.
I began probing about Page View and Analytics data being bogus over in this discussion. I was told by Stephanie to contact my internal Canvas support at my university, which I did. My internal support people, so far, can't figure it out. They pointing me to this discussion, and people here have pointed me back to the other discussion. So we have gone full circle.
I decided to concentrate on "Page Views" since Total Activity is a derived value. Yep James, we know how bad trying to calculate "time" is, yet Canvas does it on the People web page. I have already proposed in another discussion that the Total Activity column be eliminated from the People web page, since it is completely bogus data. Nope, Canvas keeps it, so it is fair to try to verify it. It's a number that Canvas provides, where is it created? (If the List Enrollments reports total_activity_time, where is it getting this value and how was it calculated?)
Back to Page Views, which should be the result of student clicks and should not be bogus. Yet, we cannot find any way (yet) to verify it as accurate. James, you seem to say that there will be no way to recreate this data, yet Canvas creates it? I doubt that they use a random number generator.
James, no, there are not more than a million records for one student in one course.
No, there are not "analytics and there are analytics." There are values that Canvas reports that cannot be verified. If someone takes a quiz and gets 80 percent, you can look at the questions and verify this 80%. What would it be like if a student got a 75% on a quiz be could not be told which questions they missed? We should be able to do the same when Canvas reports anything to us.
I don't understand why more people aren't asking to verify "analytics" from Canvas.
I have already proposed in another discussion that the Total Activity column be eliminated from the People web page, since it is completely bogus data. Nope, Canvas keeps it, so it is fair to try to verify it.
Too many people want it for them to get rid of it.
You can try to verify the numbers, but it will be futile as you do not have the full picture. Run your own server and log every thing or convince Canvas to share all of the data and not just some of it, and you might have a chance to duplicate their numbers and verify their calculations. But you're not going to be able to do it with what's available right now as they don't give you all of the information that you need to verify it.
But study the source code and see what they're doing and see what counts and what doesn't count and you might have a better chance of getting closer. I still think it's going to be a nearly impossible task, but perhaps that is because I don't understand Ruby enough to know what they're doing.
James, you seem to say that there will be no way to recreate this data, yet Canvas creates it?
Recreating is not the same as creating. If I give you summary statistics, you're not going to be able to recreate the original data, although I can easily calculate any appropriate statistics because I have access to the raw data.
Now, let's say that I want to find the mean and standard deviation of a set of numbers. I don't have to save the numbers in order to do that. All I have to do is keep the number of values, the sum of the values, and the sum of the squares of the values. If I don't save all of the information, only what is needed to calculate what I'm sharing, then the person on the other end isn't going to be able to recreate the information, even though my results are probably as accurate as they can be. Even if I save all of the information, but don't give it to you, you're not going to be able to recreate my results.
Transferring the discussion back to Canvas ... Canvas doesn't give us all of the information that they used to create it in the first place, so it is impossible for us to recreate it with the information we're supplied. They don't even have to hang on to all of the data to generate the statistics.
This is my point. They don't give us access to the raw data that is used so it is impossible for us to recreate their numbers exactly.
The requests table is not the raw data that is used to create their numbers in the first place. They have specific code within the controllers that determine what gets logged as a page_view and what gets counted as an asset_access. It's not exactly the same as every web request. You may have several web requests
There is (or at least there used to be) a ping sent from Canvas to it that you were still on a page. There is code to explicitly not return any page views that come from that ping particular ping, but there is still ping information in the requests table from another source. There are other commands that don't get returned. When you look through the code, there is a comment about page_views in the application controller.
We only record page_views for html page requests coming from within the app, or if coming from a developer api request and specified as a page_view.
That statement is ambiguous unless you study the code. Is it that they don't record anything but page_views for html requests coming from within the app, or is that they don't recording anything but page_views for html requests coming from within the app?
James, no, there are not more than a million records for one student in one course.
You didn't say anything about single student in a single course. You were talking about the speed in getting access to the information. Canvas doesn't pull much from the page views table except when the admin goes in and asks for a users page_views. The reason is because it is huge and takes too long to do any processing. Canvas is a web application and needs to return nearly instantaneous results. People don't want to wait more than a couple of seconds, let alone minutes. When you requests the page views through the admin interface, or through the API, you can only get a maximum of 100 at a time.
If you want to subset the data so that all you have is the data that comes from your course, then you'll have a much smaller number of records and it will be easier to work with.
I don't understand why more people aren't asking to verify "analytics" from Canvas.
Perhaps because they realize that the numbers themselves aren't important in isolation. Does it really matter whether someone spent 2 hours or 50 hours in the course? Not for some courses. Good students may not need to spend much time, while other students are doing well because they did spend a lot of time in there. I know this last semester that many of my students who had low activity time, when compared to others, were the ones in trouble in the course. But that still doesn't give a threshold, it gives a relative number.
By the way, Canvas does that was well, lumping people into categories by participation and page_views relative to the other people in the class. The information I record each day from the student analytics includes the number of page_views, the maximum number of page_views in the class, and the level of page_views; it does the same thing with participations. It has an overview of tardiness for assignments.
If you are willing to accept the first time and last time someone viewed an asset, then that's available. If you want to look at page_views, then the count by hour is the best they make available. For most people, that's enough, it answers the question that they were busy between 11 and midnight, for example. If you want to know when people participated, that's available for every participation in the course.
If you want to know when every page_view happened, then start gathering the data every day while the course is going on and archive it over time. That still only gets you down to a the nearest day, not the exact moment. You can tell that if the count was 12 one day and it's 15 the next, that they viewed that asset 3 times. The problem is that the access report, which details the information, isn't available through the API. I think I had to write a script that would log into Canvas and fetch it daily when I was tracking things.
No, there are not "analytics and there are analytics."
There is an existing view course analytics and there is a version that Canvas is working on. You may not have participated in the beta testing of Analytics 2.0. My point was that Canvas realizes that people weren't happy with the existing analytics and Canvas is working to improve it.
But analytics isn't about getting the raw data. It's about finding meaningful patterns. Most people using Canvas would not know how to take a requests table, even for a single user, and come up with meaningful results from it. So Canvas does that for them. Most users do not need to see the nail prints in the hand to believe what Canvas says, they're perfectly happy to accept on blind faith that the numbers are correct.
Canvas doesn't design for the needs of the power user or for the few. Those of us in mathematics and the sciences have been dealing with that in quizzes for a long time. If you're not in the middle of the user base, you may just have to make do with what you have available because there's not enough return on their investment to warrant creating something more powerful. You can make a feature request, but it probably won't do any good.
P.S., I would love to be wrong because I would like this information out of Canvas Data as well. Other people know a lot more about Canvas Data than I do and have hobbled together things that approximate this. I seem to remember them sharing it in the community, too. They may be able to shed some ability to approximate things.
Firstly, when I said that we eliminate api calls from the custom requests table, this is because it dramatically reduces the volume of requests data though we know that it excludes data generated by mobile apps.. The intention is to create a separate custom requests table to hold requests generated by mobile phone apps, this will be derived from the USER_AGENT column. My understanding is that the analytics page views also excludes mobile app activity.
Secondly, as part of the population of our custom table, we derive the resource ids which aren't available in the native requests data then derive the title of the resource being viewed. This is fairly convoluted since the titles of different resource types, (files, quizzes, wiki pages etc.) are stored in different ways. Another example is that we have seen situations where a quiz can also be an assignment, so we need to make sure the view is not counted twice.
Once I have this under control, I'll be looking at other anomalies. For example, I came across a set of requests records for one user where an identical record was being generated every 30 seconds or so. Clearly this isn't user activity.
In collaboration with a colleague, Daniel Barry, we have tested several examples for one user in one course and have seen better than 90% agreement with analytics numbers.
Once I've finalised the view that populates the custom table, I'm happy to share it.
It sounds like Analytics might include some items from mobile apps (as of April 2019) and not others: Mobile App Usage Reported in Total Activity. It also sounds like what Peyton Craighill talked about in March 2018 might not have fully matured yet: Canvas Mobile Update. I'm glad to hear you're not indiscriminately throwing them out, but filtering through them and putting them into their own tables.
Daniel Barry is one of those people I think of when I wrote "Other people know a lot more about Canvas Data than I do and have hobbled together things that approximate this. I seem to remember them sharing it in the community, too. They may be able to shed some ability to approximate things." I remember discussing the difference between the access report in the GUI and the information returned by the non-API AJAX call. As I was looking through the code for this discussion, it sounds like they may have solidified that information in other places instead of just when it's displayed.
Stuart, thanks for your continued thoughts. From this post, I find the following:
1) You are suggesting that the current Page View counts, as shown by Canvas, are inaccurate because they do not include Canvas Mobile App activity. If so, what good are they? This is a very serious flaw, if you are correct.
2) I am simply trying to verify the Canvas Page Views. Sure, in the future, one might try to correct the flaws (as in #1).
3) Regardless of the difficulties that you have had, Canvas doesn't have these difficulties. Canvas simply generates its own Page Views. As I have said many times, I just want to verify the Canvas generated Page Views.
(I see that Stephanie, who I believe is associated with Canvas, keeps tagging many of the posts as "helpful." Yes, all of this discussion is helpful, but nothing has answered the basic questions that I have raised. All that I want to be able to do is verify the Page View (and eventually Total Activity) statistics that Canvas provides by exploring the stored data. Stephanie, might you dig into your own resources and help us?)
Yes, James, it appears that too many people prefer bad data to no data.
I really want to be able to "verify" data, but this leads us to recreating data. Without the ability to verify data as correct, we really do not have "analytics?" Do we? Getting close is not good enough.
As Stephanie has been watching and tagging posts as "helpful" (and I believe that Stephanie is a Canvas team member), maybe Stephanie can respond to the apparent position that "Page Views does not include Canvas Mobile App student interactions?" Stephanie???
I do understand your example of keeping summary statistics. But is this what Canvas is doing? Is Canvas throwing out the raw data and just keeping accumulated summary statistics? Again, maybe Stephanie can verify this. Stephanie??? (In our era of large databases, I would be surprised if Canvas actually throws away data, but this could be happening. I can understand generating intermediate summary statistic, for speed, but I have a harder time accepting that the raw data is deleted.) Let's see what Stephanie says.
Yes, let's begin with one student in one course. A much smaller set of data. Once we verify the Page View data for one student, then we can sample a few more students until we are confident in the process.
We seem to be on the same wavelength about "student activity time," that it is a futile process trying to produce this kind of statistic. However, Canvas does! And I think it is done incorrectly. I simply want to verify what Canvas is doing. Without verification, do we really have "analytics?"
I don't consider trying to verify Canvas Page Views as what "power users" do. It is a simple administrative issue. If Canvas reports a student grade as 94%, we should be able to verify this.
No I don't think that omitting traffic generated by mobile apps is the cause for the count differences. From what I've seen mobile app traffic is fairly minor. As long as you know it's not included this makes no difference to the reconciliation process. The difficulty in reconciling the two sources is the definition of which requests records correspond to a Canvas page view. Some are straightforward, for example if a record has a non-null discussion_id and a valid user_id and course_id, then it's countable and the title can be derived directly from the discussion_topic_dim table using the id. On the other hand if you look at the sequence of records generated by a student taking a quiz you see the student going to the quiz splash page, clicking on take the course, then a sequence of records for each question including backup records, a submission record which is a POST etc. We need to decide how many of these records should be counted as a view. This appears to be a combination of http_method, web_application_controller and web_application_action. By reconciling actual page views from Canvas with requests records, (by comparing timestamp), the rules can be refined.
The problem is that this can't be done across the full requests data set, it's simply too large. This is why we are generating the custom table which derives the missing element ids from the url, derives element titles etc. Because this process can be performed on incremental data it is not onerous. This takes most of the complexity out of the final process to select only the records from the custom table that the rules derived above indicate should be counted as a page view.
As discussed previously, you can't expect it to be easy to reconcile structured data directly from the database with clickstream data which represents web server transactions that cause the structured data to be generated.
I simply want to be able to verify the Page View counts in a course from the database, to have the Page Views timestamped, and to know what was viewed. If Canvas doesn’t use the database for this purpose, what do they use?
I had not originally thought about Mobile App page views. The question, for Stephanie and other Canvas support employees, is “Are Mobile Apps Page Views included in the Canvas Page View analytics?”
Later, we might care about how we would define a Page View, and that it be done correctly. Right now, I simply want to verify the Page View counts, as defined by Canvas.
Later, I might want to expand my students and courses, but for now, I only want a method to verify the Page View counts for one student. In SQL form, the clause WHERE course_id = 12345 and student_id = 67890 will not produce massive results of a file that is too large to handle.
I am not saying that it will be easy. Does or does not the Canvas database keep track of student clicks and what was clicked, does Canvas throw away “click” data, and doesn’t anyone care that Canvas analytics cannot be verified as correct? Someone (at Canvas) needs to explain how Page View statistics are derived. We should then be able to verify their method.
I will provide another perspective on the question that I ask. If historical data can’t be found, we can shift our focus to current data. I have another course that just began this week. How do I verify the Page View statistics for what a student did two days ago on Monday (today is Wednesday?)
Hi, @richard-jerz , I have indeed been following this conversation with great interest (I do love marking helpful answers as Helpful), and am happy to see that you have been getting robust and expert help from Canvas users—people in the field—who are exceptionally well versed in the use of Canvas Data and API as part of their daily activities managing Canvas accounts. That is the power of the Canvas Community.
@James mentioned the Analytics Beta upthread, and given your interest in this area, you might want to participate in the User Group: Analytics Beta to see how this compares with the experience you've had with the data. Are you one of the Canvas admins for your school? If you are, you can coordinate with your school's Customer Success Manager (CSM) to opt in to the Analytics Beta—and if you are not in an admin role, and you do not see Analytics Beta is in the Settings | Feature Options area of your course, your local Canvas admin can choose to opt in at the account level. You (if you're the local admin) or your Canvas admin can also coordinate with the CSM to identify the specific data needs your school might have, and the CSM will be happy to pursue solutions.
In the meantime, this is a great conversation! Please continue. 🙂
Hi Stephanie, thanks for your reply.
No, I am not a Canvas administrator, but I am working with one who specializes in analytics, and she is watching this discussion.
As you can see, I have come the full circle. In my (our) other discussion, you mentioned that I should get with my Cavans administrator for help about analytics. I did. This person directed me to this discussion, and now I am being directed back to my Canvas administrators. You can see my head spinning.
So, is there no one in Canvas who can precisely describe how Page Views, as shown to the instructor, are created? And can't this person direct us to the records which will provide us timestamp data and what page was viewed? I am not trying to trouble you, but you seem to be the only Canvas person reading this discussion.
I think you missed a very pointed question from Rick: "So, is there no one in Canvas who can precisely describe how Page Views, as shown to the instructor, are created?"
This isn’t an unreasonable question – it’s a question we regularly receive at our institution when we’re discussing Canvas Analytics. Instructors want to know what a “page view” means. I don’t think it’s fair to refer Rick to the community to answer a question that should be easily answered by Instructure. Can you engage someone internally to document this?