Has anyone implemented some form of data profiling, quality assurance checks, or similar mechanisms to identify significant changes and/or problems in the data you're bringing in from DAP? If so, would you be willing to share what you're doing as examples?
For context, Instructure delivered a change this summer that resulted in a new record in the pseudonyms table for each user. AFAIK, this wasn't announced in any of the release notes, so I found out about it when an end user of the data found issues in a query. My school's configuration previously meant each user had only one pseudonym, so this effectively changed the granularity of the table for us and required updates to our data model.
I'm now trying to brainstorm methods to find out about something like this before my end users do. I'm considering, for example, recording how many new rows are inserted into a table during each run and alerting if the table's growth exceeds X%. I'd love to hear other suggestions or ideas if anyone is doing something similar. I'm already handling schema version changes, but that didn't apply to the pseudonyms change since the schema itself wasn't updated.
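As a rough sketch of what I'm picturing for the growth check (the load_audit table, column names, and the 10% threshold are just placeholders, and the sqlite connection stands in for whatever warehouse client you actually use):

```python
import sqlite3

GROWTH_THRESHOLD = 0.10  # flag a table if a single run grows it by more than 10%

def record_and_check_growth(conn, table_name, rows_inserted):
    """Log this run's insert count and warn if the table grew unusually fast."""
    total_rows = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    previous_rows = total_rows - rows_inserted

    # Keep a history of loads so later runs have something to compare against.
    conn.execute(
        "INSERT INTO load_audit (table_name, rows_inserted, total_rows) "
        "VALUES (?, ?, ?)",
        (table_name, rows_inserted, total_rows),
    )
    conn.commit()

    if previous_rows <= 0:
        return  # first load of this table; nothing to compare against

    growth = rows_inserted / previous_rows
    if growth > GROWTH_THRESHOLD:
        print(
            f"ALERT: {table_name} grew {growth:.1%} this run "
            f"({rows_inserted} new rows on a base of {previous_rows})."
        )

# Example of how I'd call it after each DAP load (numbers are made up):
conn = sqlite3.connect("dap_audit.db")
record_and_check_growth(conn, "pseudonyms", rows_inserted=1200)
```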
Thanks in advance!
Andy
Hello!
I understand this has caused problems for you, and I agree that the data could have been handled better. We are actively working on improving our processes to help prevent issues like this in the future.
I’d also like to highlight what our documentation says about the pseudonyms table:
Pseudonyms are login profiles associated with users. Contains user / account relationship (may contain multiple records per user_id if that user_id is associated with multiple accounts); note: not all users can be found in the pseudonyms table.
By this definition, the additional records that were included — and caused the issue for you — could still be considered valid entries in this table.
@akhelms Nice find on the pseudonyms change info. We were wondering about that ourselves.
We pull daily data into a staging table before pushing it to the final table (a workaround, since Redshift doesn't enforce uniqueness constraints and will gladly load duplicate records), and as part of that we've created an "observability" table that tracks each table's total record count and the number of records loaded in each run. Using that, the system generates an email listing the tables that are most discrepant compared to their average over the past 30 days. To tell the truth, though, things change so much with the flow of the semester that we haven't really dialed in a good metric for what counts as a "meaningful" change, so we don't do much with that info except retroactively, when someone raises an issue.
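Stripped way down, the comparison logic looks something like this (the load_history table and column names here are illustrative only, and in practice it runs against our Redshift warehouse rather than a generic DB-API connection):

```python
from datetime import date, timedelta

def find_discrepant_tables(conn, threshold=0.25, window_days=30):
    """Return tables whose latest load count strays from their recent average."""
    cutoff = (date.today() - timedelta(days=window_days)).isoformat()
    rows = conn.execute(
        """
        SELECT table_name, run_date, rows_loaded
        FROM load_history
        WHERE run_date >= ?
        ORDER BY table_name, run_date
        """,
        (cutoff,),
    ).fetchall()

    # Group the per-run counts by table so we can compare latest vs. history.
    history = {}
    for table_name, run_date, rows_loaded in rows:
        history.setdefault(table_name, []).append(rows_loaded)

    outliers = []
    for table_name, counts in history.items():
        if len(counts) < 2:
            continue  # not enough history for a meaningful comparison
        latest = counts[-1]
        baseline = sum(counts[:-1]) / len(counts[:-1])
        if baseline == 0:
            continue
        pct_change = (latest - baseline) / baseline
        if abs(pct_change) > threshold:
            outliers.append((table_name, latest, baseline, pct_change))

    # Most discrepant tables first, so the email surfaces the worst offenders.
    return sorted(outliers, key=lambda r: abs(r[3]), reverse=True)
```

The hard part, as mentioned, is picking a threshold that tolerates the normal semester ebb and flow without hiding a real change.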
Ironically, it was easier in CD1 to identify potential issues, since we could look for dim and fact tables with mismatched record counts, tables with fewer records than before, and so on.