To Our Amazing Educators Everywhere,
Happy Teacher Appreciation Week!
Found this content helpful? Log in or sign up to leave a like!
I've been working toward "validating" my query results from our Canvas Data 2 instance as compared to Admin Analytics and CD1. Looking at the current term (Fall, 2023), the AA enrollment numbers are:
Students with activity: 49,444
Students without activity: 1,850
Student enrollment: 51,294 (adding the previous 2 numbers)
CD1 student enrollment query result (active/invited): 51,292
However, in CD2, my "distinct" enrollment query returns (active/invited):
Student enrollment: 24,956
If I expand this to include "any" enrollment workflow_state:
Student enrollment: 36,472
My CD1 query returns results almost exactly matching the AA numbers. This makes me think that my CD2 query is correct, but that my CD2 instance may be lacking data and that maybe the process I'm using is incorrect. I'm performing a pretty vanilla DAP "syncdb" twice a day which always completes successfully. Thoughts?
@reynlds , at my institution we are onboarding with CD2 and the database team has also seen latency/freshness issue with CD2. Not all data, but some rows are behind or older than expected. We have a support ticket in with Canvas, and Canvas is working hard to resolve the problem.
I appreciate the response and context. However, this goes beyond latency as the queries are thousands of records in discrepancy, and I'm only looking at data before "today".
My query matches the AA counts close enough when considering latency issues, and the fact that I am only comparing CD2 data that is a few days old.
Given the CD2 numbers are lower, and you are looking at counting distinct users, have you perhaps filtered too many out when joining all the tables you need to achieve this - depending on whether or not you are looking at a complete instance, or just for a particular sub-account, you need to join courses, terms, roles, enrollments and possibly accounts.
The other possibility is that if you are querying for a particular sub-account, have you accounted for any courses in child sub-accounts (as AA does)?
I don't see that being the issue. Here is a generic enrollment query for a single term (this Fall) for each Canvas Data instance (only schema and table names were changed):
CD1 returning 239,072 enrollments:
SELECT count(*)
FROM canvasdata_prod.enrollment_dim ed
JOIN canvasdata_prod.course_dim cd on cd.id = ed.course_id
JOIN canvasdata_prod.enrollment_term_dim etd on etd.id = cd.enrollment_term_id
JOIN canvasdata_prod.pseudonym_dim pd on pd.user_id = ed.user_id
WHERE etd.sis_source_id = '120238'
AND ed.workflow_state in ('active', 'invited')
AND ed.type not in('StudentViewEnrollment')
CD2 returning 26,188 enrollments:
SELECT count(*)
FROM canvas.enrollments ed
JOIN canvas.courses cd on cd.id = ed.course_id
JOIN canvas.enrollment_terms etd on etd.id = cd.enrollment_term_id
JOIN canvas.pseudonyms pd on pd.user_id = ed.user_id
WHERE etd.sis_source_id = '120238'
AND ed.workflow_state in ('active', 'invited')
AND ed.type not in ('StudentViewEnrollment')
I am assuming you do not have a consortia implementation and can get away with the simplistic joining and the assumptions of the identifiers that go along with that.
I would not be joining the pseudonyms - that table has a many to one relationship to users, and there could be multiple different pseudonyms for each user. The user table would be more suitable to join.
Why are you using the sis batch identifier of the term as a magic number to select the term? The actual id of the term table would be more suitable. If you modified other terms in that batch, even if they are now deleted, they will be multiplying out the count. I don't know if CD1 contains all the deleted objects, but CD2 definitely does.
And another point - without a check on the workflow_state of the pseudonym, you could be getting multiple records for each user with deleted login records
- Unsure what you mean by "consortia implementation", unless you're referring to Unizen, which we do not have.
- I can get away with "simplistic joining" because I'm not a DBA and have no intention to optimize this query beyond what's been produced to give me a result. Also, there are no assumptions.
- All our users, enrollments, terms, etc., have an "SIS Identifier" which is authoritative (for us) over the assigned Canvas ID's. That's why I'm using the SIS ID on the term...it's how we refer to it in Banner.
- I joined the enrollments.user_id to the "users" table with the exact same results as the "pseudonyms" table. I only pull from pseudonyms if I need to return actual usernames for our institution which are only stored there.
- The check on the pseudonyms workflow_state is irrelevant as I'm checking the enrollment workflow_state.
I've got a case open on this with Instructure: 10070619
Adding @Edina_Tipter and @LeventeHunyadi to this in case they have any ideas. We're close to the EOL for CD1 and this needs to be rock solid!
FYI...following update to DAP client (now at latest published (0.3.11) the "syncdb" isn't even completing. It just hangs after the "wikis" table (unless it's sitting there doing something like combining parts):
2023-09-07 11:11:42,771 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-09-07 11:11:48,070 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-09-07 11:11:53,472 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-09-07 11:11:58,671 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-09-07 11:12:03,998 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-09-07 11:12:09,632 - INFO - Data has been successfully retrieved:
{"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823", "status": "complete", "expires_at": "2023-09-08T15:49:25Z", "objects": [{"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00000-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00003-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00009-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00023-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00034-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00039-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00081-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00083-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00111-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00118-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00124-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00125-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00129-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00130-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00142-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00158-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00202-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00204-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00232-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00244-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00251-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00265-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00266-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00268-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00271-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00277-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00279-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00280-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00319-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00338-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00354-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00364-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}, {"id": "77dc9ce0-bee2-4064-8f22-07e8b9b49823/part-00367-5568a1a4-a14c-4d15-b6d4-840691da387c-c000.json.gz"}], "schema_version": 3, "since": "2023-09-06T14:53:03Z", "until": "2023-09-07T15:39:10Z"}
^[^[[D
Hi, I would like to let you know that we are looking into the issue reported about the potentially missing records in the enrollments table. Please, bear with us, we will update you as soon as we have something.
@Gabor_Endrodi Thanks for the response but this goes well beyond enrollments. Do a simple count on courses for enrollment_term_id "144" (the Canvas ID for our Fall, 2023, term) and compare the result from CD1 and CD2. The difference is remarkable:
CD1: 3,844
CD2: 437
I started looking at previous terms in the interest of proving that this is isolated to the Fall, 2023, term. The CD2 numbers I ran for Spring of this year match what I see in CD1. So, at this point I dropped and initialized the "courses" table and ran some simple queries, comparing the numbers exactly to CD1. This prompted me to do a FULL drop and initialize on all tables, which I'm currently waiting on to complete. I anticipate that it'll be fine after this.
However, it makes me wonder exactly "how" to monitor all this for errors or omissions so that I don't go down this path again. The whole thing runs as a single BASH shell script, relying on the Instructure DAP client to do the heavy lifting. Look forward to comments and suggestions on this.
@reynlds , were you ever able to explain all the data discrepancies you were describing above? I ask because I realized that the different values for the 'workflow_state' field in CD1 vs CD2 could explain in some cases the discrepancies in the size of the query result sets in CD1 vs. CD2. (see my post on this here: https://community.canvaslms.com/t5/Data-and-Analytics-Group/Need-Instructure-s-guidance-on-field-val... ). Would love to hear if you have some comments/insights on this.
To participate in the Instructure Community, you need to sign up or log in:
Sign In