Hi Everyone,
Our organisation recently got access to Canvas Data 2 (CD2). I have been comparing data pulled from CD2 to Canvas' internal reports and have noticed some large discrepancies.
For example, when looking at the enrollments table from CD2, I get around 10% of the rows expected. The same is true for users and quizzes; basically all the data pulled from CD2 has been incomplete. I do know there is a 4-hour freshness interval, but that would not account for discrepancies this large. When I use the internal reporting tool, all the data is there and the numbers make sense.
I pulled data from CD2 using both the CLI tool and Postman and got the same result.
Here is the Python code I have:
import os
from dap.api import DAPClient
from dap.dap_types import Credentials, SnapshotQuery, Format
import asyncio

base_url = "https://api-gateway.instructure.com"
client_id = "clientid"
client_secret = "secret"
credentials = Credentials.create(client_id=client_id, client_secret=client_secret)
output_directory = os.getcwd()


async def download_data():
    async with DAPClient(base_url=base_url, credentials=credentials) as session:
        query = SnapshotQuery(format=Format.JSONL, mode=None)
        await session.download_table_data(
            "canvas", "enrollments", query, output_directory, decompress=True
        )


if __name__ == "__main__":
    asyncio.run(download_data())
I have tried reaching out to Canvas Data Help and haven't had a response, so I was hoping that someone here might have an idea or have experienced a similar problem.
Many thanks,
Matt
I just recently posted about a discrepancy I noticed between the CD2 web_logs table and CD1 requests. I wonder if by chance you notice a discrepancy on those tables too? For us the divergence between CD2 and the Canvas GUI & CD1 seems to have started a couple months ago.
Are you noticing this discrepancy with a fresh snapshot of the tables you have mentioned, or just over time on a collection of incremental loads? In my system we ran snapshots when CD2 was released and have been loading incremental data ever since. As far as we can tell, nothing is missing with regard to courses, enrollments, and users: we run numerous reports involving those and haven't seen any departures from the expected values. If this problem is more recent, then it's possible that not enough has changed in our system for the problem to be noticeable.
In my case it was just a fresh initial snapshot that I took. I also had not used CD1 before and was just comparing what CD2 gave me to the Canvas internal reports. I got in touch with Canvas Support and they confirmed the issue on their side. I would imagine your issue is similar to mine.
They are now working on fixing the issue; I'm not sure how long it will take. Bit of a weird one.
With that amount of data missing, the most obvious question is: are you downloading and processing ALL the dataset objects returned, or only the first one? The queries will spawn any number of actual worker jobs, all returning parts of the set. The actual number (and naming) can change from run to run, but in general, as the CD2 warehouse gets busier (for the whole AWS region), more jobs are spawned per request.
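One quick way to check this is to count rows across every part file that landed in the output directory, rather than only the first. The sketch below is a minimal, standard-library-only example and makes some assumptions: the helper name count_jsonl_rows is made up for illustration, and it assumes the parts end in .jsonl (or .jsonl.gz if they were not decompressed) with one record per line. The real file names from CD2 may differ, so adjust the suffix check to whatever your download produced.

```python
import gzip
from pathlib import Path


def count_jsonl_rows(directory: str) -> dict[str, int]:
    """Return {relative file path: row count} for every JSONL part under directory.

    Hypothetical helper: assumes parts are named *.jsonl or *.jsonl.gz
    and hold one JSON record per non-empty line.
    """
    counts: dict[str, int] = {}
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith(".jsonl.gz"):
            opener = gzip.open  # still-compressed part
        elif path.name.endswith(".jsonl"):
            opener = open  # already-decompressed part
        else:
            continue  # ignore anything that is not a JSONL part
        with opener(path, "rt", encoding="utf-8") as handle:
            rows = sum(1 for line in handle if line.strip())
        counts[str(path.relative_to(directory))] = rows
    return counts
```

Calling count_jsonl_rows on the download directory and comparing len(result) (number of parts) and sum(result.values()) (total rows) against the figure from the Canvas report should show whether only a fraction of the parts is being read.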
Hey Keith,
Yeah, I wondered that as well, so I set up a PostgreSQL database, initialised it, and took an initial snapshot that way. I ended up with the same missing rows as before; nothing was different.
I have gotten in touch with Canvas Data Support, and they were able to replicate the issue on their side as well, so they are now working on the problem with no firm deadline.
Hi Matt,
Did they ever get back to you? I'm running into a similar issue.
What kind of data quality issue have you found?
The engineering team spent significant effort last summer on improving the quality of the data served through CD2, and there are automated data quality checks running in the background.
All in all, I would be really interested to hear about any remaining data quality issues.
Hi @sgergely ,
I'm using Python as well. I tried a full load of the CD2 data into our database, and the results seem incomplete—some tables didn’t return any rows. I re-ran the program, and the row counts were completely different. What’s interesting is that some tables that had no rows the first time showed data in the second run, while others that had rows before either had different counts or came back empty.
I’m still checking to see if the issue is on our side, but I came across this post and wanted to see if there have been any updates.
Thanks!
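For anyone trying to pin down which tables change between two full loads, a small comparison of per-table row counts can help when reporting the problem. This is only a sketch: diff_row_counts is a hypothetical helper, and it assumes you have already collected the counts for each run into plain dictionaries keyed by table name (for example from SELECT COUNT(*) on each loaded table).

```python
def diff_row_counts(
    first: dict[str, int], second: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {table: (count in first run, count in second run)} where they differ.

    A table missing from one run is treated as having 0 rows in that run.
    """
    changed: dict[str, tuple[int, int]] = {}
    for table in sorted(set(first) | set(second)):
        before = first.get(table, 0)
        after = second.get(table, 0)
        if before != after:
            changed[table] = (before, after)
    return changed
```

Tables that come back empty in one run and populated in the next show up as (0, n) or (n, 0), which makes the pattern described above easy to share with support.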