Hi - I'm using DAP client 1.1 and Postgres 16.
With an incremental query I've discovered we seem to have started losing pseudonym ids. For example, users.id 2720 has pseudonyms.id 2689, but in the pseudonyms table 2689 has been skipped, as has 2693. I've confirmed that this relationship does exist in the live API. There are no errors in the DAP output (see below), and a snapshot query of pseudonyms run outside Postgres shows the record is in that table.
There are no error messages in the logs. The issue was identified in a query today and only seems to have affected today's logs.
Wondering if anyone else is experiencing this? And how can these errors be detected?
I figure I'm going to have to drop that table... or recreate from scratch, but that also means recreating views, etc.
Fresh snapshot query for pseudonyms contains:
{"key":{"id":2689},"value":{"user_id":2720,"created_at":"2024-04-12T03:11:16.385Z","updated_at":"2024-04-15T00:21:18.582Z","workflow_state":"active",...}
Logs from the Postgres DAP query:
INFO:pysqlsync.postgres:PostgreSQL version 16.0.2 final
2024-04-15 11:27:03,060 - INFO - Query started with job ID: a45e6b97-...
...
2024-04-15 11:28:12,709 - INFO - Query job still in status: running. Checking again in 5 seconds...
2024-04-15 11:28:19,132 - INFO - Data has been successfully retrieved:
{"id": "a45e6b97-...", "status": "complete", ...}
INFO:pysqlsync:267 rows have been inserted or updated into "canvas"."pseudonyms"
WARNING:pysqlsync:no rows to delete
Welcome to the world of Canvas Data 2 inconsistency.
I am making some assumptions here, as your post is not completely clear to me.
During the beta phase of Canvas Data 2, I had long discussions about the lack of referential integrity in Canvas Data 2, where situations like this could occur: users existed, but the pseudonyms did not. Instructure has classified the system as providing "eventual consistency", in that eventually the records should be present and consistent, but I beg to differ and consider it to be "continually inconsistent".
I currently have tickets outstanding for missing data from CD2. That is probably not caused by this issue, but by Canvas errors during updates that leave some records in inconsistent states, such that the streaming logs that CD2 relies on do not reflect the actual data - that is a completely different world of pain.
What I think is happening here is a result of the way data is streamed to CD2 and the way the queries work.
In order to save space, and to make the various queries more efficient within the constraints of the design choices made, only the most recent version of every record is stored. If a record is updated multiple times in a short space of time, only the most recent state will be retained (or ever present) in CD2. Each record is timestamped with the time that record is committed into CD2, which is not the time of the actual update. There is no guaranteed timing of these streamed updates, and different tables can be updated at different times - so you may have newer values in users while the pseudonyms have not yet been updated.
Because there is only one version of each record, and the timestamp only reflects when it was persisted into CD2, it is not possible to generate a set of data that is consistent across tables at any given point in time. The only thing the timestamps can be used for is to incrementally fetch data such that all records eventually get updated.
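To illustrate the "one version per record" model in SQL terms (a sketch of the semantics only - updates_stream is a made-up staging table, not how CD2 actually stores data):

-- Collapse a stream of updates to one row per key, keeping only the latest
-- state; intermediate states are simply never seen.
select distinct on (id) id, user_id, workflow_state, updated_at
from updates_stream
order by id, commit_ts desc;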
I don't have any knowledge of how locking on update or read is performed when accessing Canvas Data 2. I am guessing (and this is only a guess) that the updating of data into CD2 is done on a transactional basis per table, and that all updates for a given timestamp are committed at the same time; but this may not be the case, and updates may be committed on a more frequent basis, potentially record by record. I am also guessing that the reads (be it snapshot or incremental) do not lock at all, and are effectively dirty reads of whatever data is available. To provide consistent sets, be it for snapshot or incremental, the queries return the "until" parameter, which gives the starting point for the next incremental update.
What I am guessing is that, to provide the best approximation of consistency, the maximum current timestamp is determined at the start of a query and used as the returned "until". All records for the table are then scanned, and any greater than the "from" (for incremental queries; there is no lower bound for snapshots) and less than or equal to the "until" are returned. I believe this is a realistic assumption, as the query supports supplying the "until" parameter (which is not recommended, but can be used to effectively paginate updates), so the underlying processing obviously uses it.
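In rough SQL terms, the windowing I am guessing at would look like this (hypothetical table and parameter names, purely to illustrate the guess):

-- Resolve the window once, at the start of the query; this becomes the
-- "until" returned to the client.
select max(commit_ts) as until_ts
from pseudonyms_stream;

-- Then return everything inside the window; snapshots simply omit the
-- lower bound.
select *
from pseudonyms_stream
where commit_ts > :from_ts
  and commit_ts <= :until_ts;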
If there is no full locking on updates and reads (and I highly doubt there is), then the possibility exists - and this is what I suspect happened here - that the following scenario is in play:
If this is true, a subsequent incremental query will obtain the updated record - where the prior creation was not obtained when expected due to the subsequent update.
The timestamps in the information you have provided are not conclusive: the updated_at timestamp from the snapshot shows a Zulu timezone, and the records in your screenshot from the database, which have an updated_at time slightly later than that, are without a timezone - so I am not sure whether the Zulu time from the extract has been shifted to a local timezone in the database.
Today, DAP (a.k.a. Canvas Data 2) provides eventual consistency. We are looking into providing referential integrity across tables, and have identified approaches for making this possible in a scalable manner. (This is more challenging than it might at first seem.) Unfortunately, we don't yet have an official timeline for when referential integrity across tables will be supported.
DAP is implemented as a transactional data lake; dirty reads are not possible. There are no locks (which are not supported by the object storage that acts as the persistence layer) but updates are atomic. Only records from committed source transactions appear in DAP. However, source transactions that impact multiple tables are not necessarily applied together, which contributes to the weaker consistency that DAP provides as of today. Furthermore, consecutive updates to the same record may coalesce into a single update. As such, successive incremental queries cannot be used as an audit trail because not all record state transitions are returned. The purpose of an incremental query is to bring a previous state (identified by the commit time) to the latest state.
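In Postgres terms, the effect of applying an incremental batch is closer to an upsert than to an append-only log. A sketch of the semantics (incoming_batch is a hypothetical staging table; this is not the actual pysqlsync implementation):

-- Merge an incremental batch into the target table: only the latest state
-- lands; upstream-coalesced intermediate states never appear.
insert into canvas.pseudonyms (id, user_id, workflow_state, created_at, updated_at)
select id, user_id, workflow_state, created_at, updated_at
from incoming_batch
on conflict (id) do update
set user_id        = excluded.user_id,
    workflow_state = excluded.workflow_state,
    updated_at     = excluded.updated_at;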
@LeventeHunyadi wrote:
We are looking into providing referential integrity across tables, and have identified approaches how we can make this possible in a scalable manner.
I hope this investigation includes providing referential integrity across consortia implementations where objects reside in tables in non-local instances.
Thanks both - all timezones were Z.
I started the first incremental update at about 11am local time.
I noticed the data gaps when I ran a query left joining pseudonyms to users, where the output was to be used as a sis_import (and discovered the users' sis_user_id/unique_id were empty) - roughly the sketch below.
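A simplified sketch of that query (the real one selects more columns for the sis_import file):

select u.id as canvas_user_id,
       p.sis_user_id,
       p.unique_id
from canvas.users u
left join canvas.pseudonyms p
    on p.user_id = u.id;  -- missing pseudonym rows surface as null sis fields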
The user was created three calendar days earlier.
I then executed a further incremental update (using the DAP CLI linked to Postgres) - about 4 hours had elapsed since the original incremental update. There was additional data imported into the pseudonyms table from new users created that morning (plus changes to other records, last_logins etc.), but records 2689 and 2693 were still missing.
That's when I queried the Live API and ran a local snapshot, which pulled jsonl showing the records existed and were in CD2.
I've dropped the canvas namespace, deleted source_namespace = canvas from the instructure_dap.table_sync table and run a full snapshot refresh on the canvas namespace!
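For reference, the reset amounted to something like this (a sketch - adjust names to your setup, and note that CASCADE takes the dependent views with it):

-- Drop the synced schema and everything that depends on it (views etc.).
drop schema canvas cascade;

-- Clear the DAP CLI's sync bookkeeping for the namespace so the next
-- snapshot starts from scratch.
delete from instructure_dap.table_sync
where source_namespace = 'canvas';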
I've had this problem and I had to drop and recreate the table with dap init.
I found users that have no pseudonym with this query:
-- Users in canvas.users with no matching row in canvas.pseudonyms.
select
    u.*
from canvas.users u
left join canvas.pseudonyms p
    on p.user_id = u.id
where p.user_id is null;
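A complementary check is to look for gaps in the id sequence itself. Treat this only as a heuristic - Canvas ids are not guaranteed to be contiguous, so hits are candidates to verify against the live API:

-- Candidate gaps in canvas.pseudonyms ids (gaps can also occur naturally).
select id + 1 as gap_start,
       next_id - 1 as gap_end
from (
    select id, lead(id) over (order by id) as next_id
    from canvas.pseudonyms
) s
where next_id > id + 1;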
Here's my post from a while back when I had the issue:
Jason
In the dim recesses of my mind, I can actually recall reading that thread! I've pretty much done exactly the same troubleshooting (so something stuck)!
@LeventeHunyadi does it seem unusual that nothing was pulled by the two incremental queries given the user was created three calendar days earlier?
Being unable to join pseudonyms in CD2 to select sis_user_id and/or unique_id is problematic!
Canvas Live API and DAP API (a.k.a. CD 2) query different subsystems. DAP replicates the Canvas PostgreSQL database directly; all records that have been inserted/updated/deleted in Canvas should be reflected in DAP within 4 hours at most. If this is not the case, I would recommend that you open a support ticket such that the team can investigate. For example, they can take a look in the database directly to check if the record is there based on the record primary key. Records may occasionally end up in a quarantine if they violate the schema, and our team can check if this is the case if you share specifics. Thanks in advance!
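For example, a quick local check by primary key (using the ids from this thread) will confirm whether the records are still absent after a fresh sync:

-- Check the specific records locally; if a fresh snapshot still lacks them,
-- include these ids in the support ticket.
select id, user_id, workflow_state, updated_at
from canvas.pseudonyms
where id in (2689, 2693);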