Has anyone implemented some form of data profiling, quality assurance checks, or similar mechanisms to identify significant changes and/or problems in the data you're bringing in from DAP? If so, would you be willing to share what you're doing as examples?
For context, Instructure delivered a change this summer that resulted in a new record in the pseudonyms table for each user. AFAIK, this wasn't announced in any of the release notes, so I found out about it when an end user of the data found issues in a query. My school's configuration previously meant each user had only one pseudonym, so this effectively changed the granularity of the table for us and required updates to our data model.
I'm now trying to brainstorm methods to find out about something like this before my end users do. I'm considering, for example, recording how many new rows are inserted into a table during each run and alerting if the table's growth exceeds X%. I'd love to hear other suggestions or ideas if anyone is doing something similar. I'm already handling schema version changes, but that didn't apply to the pseudonyms change since the schema itself wasn't updated.
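As a rough sketch of what I'm picturing for the growth check (the load_audit table, column names, and the 10% threshold are just placeholders, and the sqlite connection stands in for whatever warehouse client you actually use):

```python
import sqlite3

GROWTH_THRESHOLD = 0.10  # flag a table if a single run grows it by more than 10%

def record_and_check_growth(conn, table_name, rows_inserted):
    """Log this run's insert count and warn if the table grew unusually fast."""
    total_rows = conn.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]
    previous_rows = total_rows - rows_inserted

    # Keep a history of loads so later runs have something to compare against.
    conn.execute(
        "INSERT INTO load_audit (table_name, rows_inserted, total_rows) "
        "VALUES (?, ?, ?)",
        (table_name, rows_inserted, total_rows),
    )
    conn.commit()

    if previous_rows <= 0:
        return  # first load of this table; nothing to compare against

    growth = rows_inserted / previous_rows
    if growth > GROWTH_THRESHOLD:
        print(
            f"ALERT: {table_name} grew {growth:.1%} this run "
            f"({rows_inserted} new rows on a base of {previous_rows})."
        )

# Example of how I'd call it after each DAP load (numbers are made up):
conn = sqlite3.connect("dap_audit.db")
record_and_check_growth(conn, "pseudonyms", rows_inserted=1200)
```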
Thanks in advance!
Andy
Hello!
I understand this has caused problems for you, and I agree that the data could have been handled better. We are actively working on improving our processes to help prevent issues like this in the future.
I’d also like to highlight what our documentation says about the pseudonyms table:
Pseudonyms are login profiles associated with users. Contains user / account relationship (may contain multiple records per user_id if that user_id is associated with multiple accounts); note: not all users can be found in the pseudonyms table.
By this definition, the additional records that were included — and caused the issue for you — could still be considered valid entries in this table.
@akhelms Nice find on the pseudonyms change info. We were wondering about that ourselves.
We pull daily data into a staging table before pushing it to the final table (a workaround, since Redshift doesn't enforce uniqueness constraints and will gladly load duplicate records), and as part of that we've created an "observability" table that tracks each table's total record count and the number of records loaded in each run. Using that, the system generates an email listing the tables that are most discrepant compared to their average over the past 30 days. To tell the truth, though, things change so much with the flow of the semester that we haven't really dialed in a good metric for what counts as a "meaningful" change, so we don't do much with that info except retroactively, when someone raises an issue.
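Stripped way down, the comparison logic looks something like this (the load_history table and column names here are illustrative only, and in practice it runs against our Redshift warehouse rather than a generic DB-API connection):

```python
from datetime import date, timedelta

def find_discrepant_tables(conn, threshold=0.25, window_days=30):
    """Return tables whose latest load count strays from their recent average."""
    cutoff = (date.today() - timedelta(days=window_days)).isoformat()
    rows = conn.execute(
        """
        SELECT table_name, run_date, rows_loaded
        FROM load_history
        WHERE run_date >= ?
        ORDER BY table_name, run_date
        """,
        (cutoff,),
    ).fetchall()

    # Group the per-run counts by table so we can compare latest vs. history.
    history = {}
    for table_name, run_date, rows_loaded in rows:
        history.setdefault(table_name, []).append(rows_loaded)

    outliers = []
    for table_name, counts in history.items():
        if len(counts) < 2:
            continue  # not enough history for a meaningful comparison
        latest = counts[-1]
        baseline = sum(counts[:-1]) / len(counts[:-1])
        if baseline == 0:
            continue
        pct_change = (latest - baseline) / baseline
        if abs(pct_change) > threshold:
            outliers.append((table_name, latest, baseline, pct_change))

    # Most discrepant tables first, so the email surfaces the worst offenders.
    return sorted(outliers, key=lambda r: abs(r[3]), reverse=True)
```

The hard part, as mentioned, is picking a threshold that tolerates the normal semester ebb and flow without hiding a real change.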
Ironically, it was easier in CD1 to identify potential issues, since we could look for dim and fact tables with mismatched record counts, tables with fewer records than before, and so on.