Register for InstructureCon25 • Passes include access to all sessions, the expo hall, entertainment and networking events, meals, and extraterrestrial encounters.
Found this content helpful? Log in or sign up to leave a like!
Confession: I am not a developer or an SQL expert, but I can view certain kinds of code, sort of learn it, and make adjustments as needed. Otherwise, I rely heavily on ChatGPT, particularly when it comes to SQL. 😁
That said, I was wondering if someone more experienced with SQL than I might be able to please look over the SQL query down below that our team has come up with and let me know if we’re on the right track?
We are trying to verify that this query accurately reports on:
WITH LatestSubmissions AS (
SELECT
SPLIT_PART(u.name, ' ', 2) AS last_name,
SPLIT_PART(u.name, ' ', 1) AS first_name,
p.sis_user_id AS user_id,
et.name AS term_name,
TO_CHAR(et.start_at - INTERVAL '4 hours', 'MM-DD-YYYY HH:MI:SS AM "ET"') AS term_start_date, -- Subtract 4 hours and format as requested
c.course_code,
c.name AS course_name,
MAX(teachers.teacher_name) AS teacher_name, -- Aggregate the teacher's name
MAX(s.submission_type) AS submission_type,
TO_CHAR(s.submitted_at - INTERVAL '4 hours', 'MM-DD-YYYY HH:MI:SS AM "ET"') AS submitted_at, -- Subtract 4 hours and format as requested
MAX(a.title) AS assignment_title,
e.workflow_state AS course_enrollment_status, -- Rename the column
ROW_NUMBER() OVER (PARTITION BY u.id, c.id ORDER BY COALESCE(s.submitted_at, '0001-01-01'::timestamp) DESC) AS row_num
FROM
public.users u
JOIN
public.enrollments e ON u.id = e.user_id
JOIN
public.courses c ON e.course_id = c.id
JOIN
public.enrollment_terms et ON c.enrollment_term_id = et.id
JOIN
(SELECT enroll.course_id, users.name AS teacher_name
FROM public.enrollments AS enroll
JOIN public.users AS users ON enroll.user_id = users.id
WHERE enroll.type = 'TeacherEnrollment') AS teachers ON c.id = teachers.course_id
JOIN
public.pseudonyms p ON u.id = p.user_id
JOIN
public.submissions s ON e.course_id = s.course_id AND u.id = s.user_id
JOIN
public.assignments a ON s.assignment_id = a.id
WHERE
e.type = 'StudentEnrollment'
AND et.sis_source_id = 'SP_2024'
AND s.submission_type IS NOT NULL -- Exclude null submission types
AND s.submitted_at IS NOT NULL -- Exclude null submission times
GROUP BY
u.id, c.id, p.sis_user_id, et.name, et.start_at, s.submitted_at, e.workflow_state -- Include enrollment status in the GROUP BY clause
)
SELECT
last_name,
first_name,
user_id,
term_name,
term_start_date,
course_code,
course_name,
teacher_name,
submission_type,
submitted_at,
assignment_title,
course_enrollment_status -- Include enrollment status with the new name
FROM
LatestSubmissions
WHERE
row_num = 1
ORDER BY
last_name, course_code ASC;
Please let me know if any other information is needed.
Thank you!
Taking a quick look at this, it does look like you will achieve what you are looking for. When you run it does it produce what you were expecting?
It does produce what we're expecting, but I just wanted to be extra sure since our Financial Aid department will be using this report as a metric to determine student course activity in the event of a withdrawal. Thank you for taking a look, Kyle!
If you're wanting to use this for Financial aid reporting (anything that has significant consequences), be careful. I recently tried to recreate a local database used in the SIS imports using the data in Canvas Data 2 (CD2). It turned out there were missing items in Canvas Data 2. I had about 350 users who had completed an assignment (I know this because they had to complete the assignment to get out of the orientation course and into their regular classes) but CD2 had no record of them. Then I tried to get a list of email addresses and it turns out that I have none in CD2, even though querying the API shows I do.
Now granted, most of the missing data seems to be from a long time ago (my Canvas account was created in 2012 and most of the people missing the orientation assignment we from 2014). It seemed to do a better job with more recent information.
There were a couple of other places that I had to download the provisional reports and re-create the information from those. I wanted CD2 since my tables had created_at and updated_at columns and those don't come in the provisioning reports.
Hi @NickChevalierUT ! This an interesting query and reporting goal. I've noted a some thoughts.
@stimme We just wanted to thank you for these amazing recommendations -- I'm working together with @NickChevalierUT on this beautiful report he created, and as much as we love his SQL, we wanted more eyes on it to be absolutely sure. We've taken the items you've listed as actual tasks to check. As you are probably aware, follow-up work is slow, and we are still working on it. There's radio silence here because it takes time to check. That said, when we're done, we'll follow up here, but we did want to let you know how much we appreciate this response. You not only explained what, you explained why. This is just invaluable.
Thank you again.
--Sharon Austin
Yes, @stimme, thank you for the valuable feedback! As Sharon mentioned, we're working on those tweaks and will let you know how it goes.
There are other technical issues with the fact that lots of those tables are not just one-one relationships. The query may work for the particular data, but is not guaranteed by any means.
If the requirement here is to return the most recent submission in each course for each student (and return all students that were ever enrolled in a course in the given term), then this query is not going to achieve it or give accurate results for what is does return.
Rather than using aggregations arbitrarily, if you have sufficiently small data sets, you are better off qualifying your sub-selects to only return the rows you are after. For example, for teachers change the select to something like
(SELECT enroll.course_id, users.name AS teacher_name,
ROW_NUMBER() OVER (PARTITION BY enroll.course_id, users.name ORDER BY enroll.created_at DESC) AS row_num
FROM public.enrollments AS enroll
JOIN public.users AS users ON enroll.user_id = users.id
WHERE enroll.type = 'TeacherEnrollment' and enroll.workflow_state != 'deleted')
AS teachers ON c.id = teachers.course_id and teachers.row_num = 1
That would give you the most recently added teacher enrolment that was not deleted - but only one for each course. You should be left outer joining this, as some courses may not have a non-deleted teacher enrolment, or indeed any teacher enrolment at all.
Likewise, instead of the current approach to submissions, where you are aggregating, try something more along the lines of
(select submiss.submission_type AS submission_type,
submiss.submitted_at,
ROW_NUMBER() OVER (PARTITION BY submiss.user_id, submiss.course.id ORDER BY
COALESCE(s.submitted_at, '0001-01-01'::timestamp) DESC) AS row_num_sub
from
public.submissions submiss
where submiss.submission_type IS NOT NULL -- Exclude null submission types
AND susmiss.submitted_at IS NOT NULL) -- Exclude null submission times
as s ON e.course_id = s.course_id AND u.id = s.user_id and s.row_num = 1
This would allow a direct join of the assignment (left outer join as there may be no submission to start with) to get the correct assignment detail (this subquery should be left outer joined as well, to ensure you do not miss students with no submissions).
You would also need to filter down the enrolments similarly to get a single enrolment per student in a course (as you only need the user id it doesn't really matter which one, except that you are after the enrolment status, so you would need to order to get the "correct" first one. If all you want is active in preference to deleted, but don't mind which other status you get, it is simple, but if you want anything other than deleted in preference to deleted, the query gets much trickier).
@KeithSmith_au Your response, as in the case of others from this community, is absolutely invaluable to us. The nuts and bolts of getting not only results, but interpreting the results, is beyond gold.
To the point,
Yes, we want to preserve getting multiple records for a user. There are very rare instances in which we have multiple sis_id's for a user, usually given by accident, but for which we want a record of any transaction should the records not be merged yet. It is a safety net.
To the point,
You are obviously aware that there can be multiple teacher enrolments in a course, including multiple enrolments for the same teacher if they are added to multiple sections in the course. Using MAX to aggregate on the name is not a reliable or safe way to get a single teacher name. At the very least, the workflow state of teacher enrolments should be restricted to those that are active. If a single name (only) is required, then selecting either the oldest or newest active teacher enrolment would be more reliable and consistent.
We agree that MAX aggregate is not a reliable or safe way to get a teacher name. We will be removing that component of the code; thank you for the reality check and confirmation. That said, we want to include the multiple teachers, as often laboratories represent team-teaching. At this point, I am unsure too, if we should restrict to active state because when instructors must leave the institution for any reason, we want to know what was assigned to the students left behind. Sometimes, faculty do leave suddenly for health or other reasons.
In particular, this suggestion of yours is particularly invaluable as it relates to finding submissions
This would allow a direct join of the assignment (left outer join as there may be no submission to start with) to get the correct assignment detail (this subquery should be left outer joined as well, to ensure you do not miss students with no submissions).
You made so many great points, we cannot address them all here, but please know that we are taking each point you've made into consideration, and as in the case above with @stimme exact SQL recommendations, are incorporating them as revisions, and testing. The testing is so, so slow, as you must know. But you have to know how appreciative we are of the targeted responses. Thank you so, so much. We'll look forward to any additional insights you may have as we continue to revise and test.
Yes, we want to preserve getting multiple records for a user. There are very rare instances in which we have multiple sis_id's for a user, usually given by accident, but for which we want a record of any transaction should the records not be merged yet. It is a safety net.
Depending on how you manage merging, you may get various results. If the issue is that an additional user is created by mistake (with a different sis_id) - i.e. treated as 2 different users, and then when the mistake is realised, the users are merged - you don't really want 2 records. If in this case, the erroneous user is made inactive in the Student Records system, and a merge is done in Canvas, only 1 actual user record exists - it just has 2 sis_ids attached (due to the merge). In this case, you need to select only the one that remains active.
We agree that MAX aggregate is not a reliable or safe way to get a teacher name. We will be removing that component of the code; thank you for the reality check and confirmation. That said, we want to include the multiple teachers, as often laboratories represent team-teaching. At this point, I am unsure too, if we should restrict to active state because when instructors must leave the institution for any reason, we want to know what was assigned to the students left behind. Sometimes, faculty do leave suddenly for health or other reasons.
What you should / could get will depend a little on the process that occurs when faculty leave - as to whether the enrolment is deleted, or just made inactive (removing access, but retaining history). We would never automatically delete enrolments (for staff or students) but simply make them inactive, preserving the history (even more important for students, as deleting an enrolment deletes all the submissions - forever).
If you want a single result with the names of all non-deleted teachers, it will depend on what SQL engine you are using - as those sorts or aggregation functions tend to be dialect specific. For Microsoft SQL server, left outer joining something like the following in place of your teacher selection will give you a comma separated list of all the teacher names for the course (assuming it fits with in 8,000 characters - otherwise you will need to convert to a nvarchar(max) to allow it to work.,
(SELECT enroll.course_id,
string_agg(users.name, ', ') AS teacher_name,
FROM public.enrollments AS enroll
JOIN public.users AS users ON enroll.user_id = users.id
WHERE enroll.type = 'TeacherEnrollment' and enroll.workflow_state != 'deleted'
group by enroll.course_id)
AS teachers ON c.id = teachers.course_id
@stimme Good afternoon, I was in the process of incorporating recommendations by this amazing community, and was looking a bit hard at the following that you recommended:
The way students' enrollment workflow_state is included could be confusing. It's possible for a student to have both active and deleted enrollments in one course (whether connected to different student roles or different sections). I think the WHERE conditions should include either e.workflow_state = 'active' or e.workflow_state != 'deleted' (the latter is a little less strict).
I have a question of ignorance. Users can have two roles in Canvas, something I was surprised to learn. In fact, I could be wrong -- and PLEASE correct me if I am wrong on this, but to be a member of a group, someone must have a role of student.
I certainly understand that it is possible for a student to have both active and deleted roles in one course, but when the student can have TWO roles in a course, for example, a teacher role AND a student role, and either of those roles can be in different workflows, do you think that the workflow state you are asking us to consider would "mask" either of the roles if someone were to have two roles in a course?
Here is an example of myself as both a teacher and a student in one course. -- and thank you again so, so much for all this incredible feedback you are providing.
@saustin You are correct that group memberships can only be created for users enrolled as Student (or some custom role based on Student). Teachers and TAs can see every course group's tools, homepage, etc. and may receive notifications for certain types of activity occurring in any course group (e.g. announcements).
The maximum number of enrollments one user can have in one course is the product of sections x roles in your Canvas instance. The e.type = 'StudentEnrollment' WHERE clause excludes the enrollments of other types (Teacher, TA, etc.) from the result set. The e.workflow_state = 'active' WHERE clause excludes enrollments with other states. Specifying the type and the workflow_state narrows the result set to focus on active student enrollments (which could exist for each section in the course).
The question about what non-student and non-active enrollments an active student has in a course couldn't be answer by the results of the query that excludes them. I was imagining that those questions would be answered by a different sort of query. Now I think I may not have understood the meaning of reporting the "enrollment status of the student in question in the course in question" in @NickChevalierUT 's original post. What does "enrollment status" mean across administrative units in your institution? Does it matter to them if a student has a second role in Canvas?
@stimme wrote:
maximum number of enrollments one user can have in one course is the product of sections x roles in your Canvas instance. The e.type = 'StudentEnrollment' WHERE clause excludes the enrollments of other types (Teacher, TA, etc.) .
...
in @NickChevalierUT 's original post. What does "enrollment status" mean across administrative units in your institution? Does it matter to them if a student has a second role in Canvas?
This is correct. the e,type in the enrollment record is generic - and refers to the base type the actual role was based on - to my knowledge there are 6 possible values (documented in the Canvas Data 2 type for class enrollments__type - being
As there can be any number of custom roles created using those base types, then there can potentially be many "StudentEnrollment" records for a single user in a single section.
If I understand what your reporting is intended for, you are really only interested in the "state" of a particular person in a particular course overall - not the individual enrolments. You will probably have to assign a hierarchy of preference for workflow_state of the enroilments, and take the ""most appropriate" state.
Possible workflow_states for enrollments are (roughly in the order in which you probably want to consider them):
Some of these are a little vague - my understanding is that creation_pending is roughly equivalent to invited, but is used for student invitations when the course is unpublished.
To select the "top" from these, you cannot easily just sort - you would have to assign a hierarchy order using a case statement, then partition over the course / user combination, and order by your hierarchy, selecting the first applicable.
@James We definitely appreciate the second set of eyes your experience brings, and absolutely welcome any additional insights you may have from your experience. In a sense, we are lucky, because we have only come online with Canvas 2 Data. Our frustration is that, as we learn to navigate this database, much of the historical "how-to" applies to Canvas 1 Data.
Further, we have a T&L subscription, so we only have access to the Public Table, which may actually be a blessing in disguise, as we are supplementing anything we cannot get from the data base with API calls. This forces us to check, and cross-check, everything we do.
Even such small details as the lack of availability of emails in the database, we found discovered early, but obviously wondered if there were something else we should be doing. It is the confirmation from the community members such as yourself that give us the confidence that we should be using API calls to supplement the report.
Thank you so much for your input, and we always look forward to more!
Further, we have a T&L subscription, so we only have access to the Public Table, which may actually be a blessing in disguise, as we are supplementing anything we cannot get from the data base with API calls. This forces us to check, and cross-check, everything we do.
Even such small details as the lack of availability of emails in the database,
I am not familiar with what a "T&L subscription" means, and what ""Public Table" infers - you obviously have access to many tables. I was under the impression that Canvas Data 2 when available gave access to all the tables. The only things that are not accessible are things that require versions (such as prior versions of submissions) that have not been made accessible via Canvas Data 2. Is this something your institution has put between the actual Canvas Data 2 and your access?
In that case, you will find the emails in the communication_channels table - linked to pseudonyms, Emails have a path_type value of ''email' and the actual email address is held in the path column.
Good evening to this amazing community. @stimme @KeithSmith_au @James @kyle_cole We have taken all your suggestions, and incorporated them, where possible. (It was not always possible, mainly due to time constraints to check them out fully).
I think we may not yet be capturing "nuance" that is possible with Canvas Data (especially as it applies to roles) but I do think the simple report we are generating here gives them (the administrators) what they are asking for. We understand that we need to work further with administrators to better understand and explain the data. (For example, we can get a last date of submission, but I found in spot-checking, that students may have made a submission without receiving a grade for it -- from a teacher's point of view, the submission may not have been "blessed".)
Our team cannot thank you enough for your input. SQL, the database, and understanding Canvas combine together to make the report. What this group brought forward to us was powerful in the extreme, and again, we can't thank you enough for it. Thank you, thank you, thank you.
Below is the latest summary, which our team will now check with API calls and individual course investigation. Again, thank you, and if you see red flags, please feel free to say something.
To interact with Panda Bot in the Instructure Community, you need to sign up or log in:
Sign In