I was working on switching over a process that previously used CD1 requests to show the breakdown of the operating systems and browsers used by our faculty and students. When comparing the data, I noticed there was substantially less data in the web_logs table than in requests. Luckily, we have been running both processes, so I was able to compare the same date ranges, and the difference between the two has become alarming.
I have yet to identify a pattern in which events make it and which do not. We zoomed in on a specific session where we knew what the traffic was for this particular user, and random events are completely missing from CD2 web_logs. In CD1 requests we see the expected events, and we also see them in page views in the GUI. To be clear, the volume of data I am referring to far exceeds what could be considered transactional losses or any other small losses that can occur in big data ETL/ELT pipelines.
When CD2 web_logs was first released, this was something we checked regularly, and the differences were within an acceptable margin. Over the last few months the margin has become alarmingly large.
* I excluded July since web_logs did not release on the 1st and I would be comparing a full month to a partial.
Unfortunately, September was about the time we stopped comparing the data between the two and began trusting that they were roughly equivalent.
We are still running through all of the possible places on our side of the fence to see if there is a mistake on our end. Our collection process has not changed since the July release of Web Logs. I was wondering if perhaps anyone else out there has both CD1 and CD2 running and is able to see if there is a divergence in the requests and web_logs data in their system?
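For anyone who wants to run the same check, here is a minimal sketch of a month-by-month comparison. It assumes you can export the event timestamps from each source as ISO-8601 strings; the function names `monthly_counts` and `divergence` are made up for illustration and are not part of any Canvas Data tooling.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(timestamps):
    """Count events per calendar month from ISO-8601 timestamp strings.

    Assumes strings like '2023-09-01T10:00:00' (a trailing 'Z' is only
    accepted by datetime.fromisoformat on Python 3.11 and later).
    """
    counts = Counter()
    for ts in timestamps:
        counts[datetime.fromisoformat(ts).strftime("%Y-%m")] += 1
    return counts


def divergence(cd1_counts, cd2_counts):
    """Return {month: (cd1, cd2, pct_missing)} for every month in either source.

    pct_missing is the share of CD1 events with no counterpart in the CD2
    count, or None when CD1 has no events for that month.
    """
    report = {}
    for month in sorted(set(cd1_counts) | set(cd2_counts)):
        cd1 = cd1_counts.get(month, 0)
        cd2 = cd2_counts.get(month, 0)
        pct = round(100.0 * (cd1 - cd2) / cd1, 1) if cd1 else None
        report[month] = (cd1, cd2, pct)
    return report
```

A steadily growing `pct_missing` across months would point to the same kind of divergence described above, while a flat, small value would suggest the two sources still agree.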
How are you downloading the web_logs for CD2? Are you processing all the files returned for a request, or are you assuming there is only one?
From the look of your stats, I suspect that early on, you may have been receiving everything in one file, but later, there are multiple files and they are not all being processed.
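As a rough illustration of processing every file rather than only the first, the sketch below reads all part files in a download directory and reports how many records each one held. It assumes the files were saved locally as gzip-compressed JSON Lines; the directory layout, the `*.json.gz` pattern, and the function name `load_all_parts` are assumptions for the example, not something the CD2 API guarantees.

```python
import gzip
import json
from pathlib import Path


def load_all_parts(directory, pattern="*.json.gz"):
    """Read every part file in a directory; return (records, per_file_counts).

    Assumes each file is gzip-compressed JSON Lines (one JSON object per line).
    """
    records = []
    per_file = {}
    for path in sorted(Path(directory).glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            rows = [json.loads(line) for line in fh if line.strip()]
        per_file[path.name] = len(rows)
        records.extend(rows)
    return records, per_file
```

Logging the per-file counts for each request makes it easy to spot a day where several files were returned but only one was loaded.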
Some of the minor differences might be explained by use of the mobile apps. These are inconsistent in when things do or don't appear in logs, page views etc., and could be the cause of some minor discrepancies.
We assume multiple files. In CD1 we used to get about 10-12 files per day during a major semester, since they had ~1M records per file. Mobile traffic does seem to be part of it: in one of the months I checked we had 11.7M iPhone events in CD1, while the same month had 22k in CD2. The missing events are random. We zoomed in on a particular session where we knew all of the page visits, since we were looking for what certain events looked like. In that session we saw some of the events, then a few pages are missing, then we see some pages that were visited later in the session.
We run a separate process where we run 24-hour increment requests for web_logs and send them directly to S3, mostly to allow for a complete table rebuild since the API has the 30 day limitation. Our next step is to see if the data that we have backed up in S3 is the same as the data in our tables. In the event our S3 backups match our tables that would mean 2 independent systems using separate processes ended up at the same place.
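A comparison like the one described could be sketched as follows. It assumes each record carries a unique identifier and an ISO-8601 timestamp; the field names `id` and `timestamp` are placeholders, so substitute whatever your web_logs rows actually use.

```python
from collections import defaultdict


def ids_by_day(rows, id_field="id", ts_field="timestamp"):
    """Group record ids by calendar day (first 10 chars of an ISO timestamp)."""
    days = defaultdict(set)
    for row in rows:
        days[row[ts_field][:10]].add(row[id_field])
    return days


def compare_backup_to_table(backup_rows, table_rows):
    """Return, per day, the ids present in one source but not the other.

    Days where both sources hold exactly the same ids are omitted.
    """
    backup = ids_by_day(backup_rows)
    table = ids_by_day(table_rows)
    report = {}
    for day in sorted(set(backup) | set(table)):
        missing_from_table = backup.get(day, set()) - table.get(day, set())
        missing_from_backup = table.get(day, set()) - backup.get(day, set())
        if missing_from_table or missing_from_backup:
            report[day] = {
                "missing_from_table": sorted(missing_from_table),
                "missing_from_backup": sorted(missing_from_backup),
            }
    return report
```

An empty report means both pipelines hold the same records, which would suggest the gap originates upstream of both collection processes.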
I'll provide another reply once we have loaded our backups and have run comparisons.
@MikeRichards I wonder if you have any update on this. We lost access to CD1 requests a while ago (and never trusted the requests table in the first place), but we also noticed some discrepancies (e.g., 1M more records in CD1 requests than in CD2 web_logs for the same day). That said, we were getting a lot of "junk" in CD1 requests towards the end (duplicates, random records from previous months, etc.), and our usage is always up and down over the break, which is when CD1 was nearing the end of its life for us.
We did do a snapshot of web_logs and compared that to our incremental process. Things weren't exact, but they were not off by much when there were discrepancies. We don't have CD1 data to compare it to, though, so we could be missing things and not know it.
Are you in the us-east-1 region by any chance? There could be something going on that is region-specific.
My institution is in us-east-1. CD1 was by no means perfect; we experienced multi-day gaps where data was missing. Our impression, at least, was that it was all or none when we got data from it: the amount of traffic seemed consistent whenever we collected it, unless it fell into one of those gaps.
If you have historical CD1 data, you might be able to see if the recent months in CD2 match up with historical trends. In the case of my institution the traffic follows a lot of predictable patterns that line up with the semester and holiday schedules.
The other way to check possibly would be to have some users follow a predetermined series of page views and then when the data for that time range is available see if it matches or not. Assuming no adjustments have already been made to address the issue, you would likely see random page_views missing from some of those sessions.
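The scripted-session check could be sketched like this. It assumes you have the list of URLs the test user was asked to visit, in order, and the URLs recorded for that session sorted by timestamp; the function name `missing_steps` and the example paths are illustrative only.

```python
def missing_steps(expected, observed):
    """Return (step_number, url) for each expected page view not found in order.

    Walks the observed list from left to right, so extra events in the logs
    are ignored and a page only counts if it appears after the previous match.
    Assumes `observed` is sorted by timestamp.
    """
    missing = []
    pos = 0
    for step, url in enumerate(expected, start=1):
        try:
            pos = observed.index(url, pos) + 1
        except ValueError:
            missing.append((step, url))
    return missing


expected = [
    "/courses/1",
    "/courses/1/modules",
    "/courses/1/assignments",
    "/courses/1/grades",
]
observed = ["/courses/1", "/courses/1/grades"]
print(missing_steps(expected, observed))
```

Running the same script against several test sessions should show whether the dropped page views cluster around particular pages or appear at random.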
It seems like they are aware of the issue and are working on it. When I created my post, I was still uncertain whether it was a problem with our pipeline or with the source data.
Thanks for the reply, @MikeRichards . We submitted a ticket last week after comparing the general trends between spring semester 2023 (CD1) and what we are seeing this year. We're also in us-east-1.