Canvas weblogs operation issues (North Virginia)

Edina_Tipter
Instructure Alumni
Over the past two days, processing of our weblogs data (including the user_agents table) ran into problems completing query jobs for customers in the North Virginia region. To prevent a wider incident and to protect the core functionality and uptime of Canvas Data 2, we temporarily turned off the ability to query Weblogs (including user agents) for customers in North America (us-east-1 AWS Region, or IAD). As a result, any query targeting Weblogs returns Error 503: "The namespace 'canvas_logs' specified in the request is currently not available for querying due to internal errors." Queries against tables in the Canvas or Catalog namespaces complete successfully. The issue is contained to the North Virginia region and affects fewer than 40 customers.
 
The team is currently investigating the impact of queries targeting Weblogs on the overall performance of CD2, which might take a few days.
 
Please bear with us; I will keep you posted.

 

--------------------------------------------------------

Update on weblogs (5 Oct, 2023): the fix is on its way, and we are targeting a release next week, after it has gone through a testing phase.

--------------------------------------------------------

Update on weblogs (13 Oct, 2023): the fix we wanted to release did not pass testing last weekend. We have revised it over the course of this week, and today we will begin further testing, with continuous monitoring throughout the weekend. If it meets the expected standards, the release will be scheduled for Monday.

--------------------------------------------------------

Update on weblogs (17 Oct, 2023): We wish to inform all consumers of CD2 weblogs in the Northern Virginia region that both the weblogs and user_agents tables are available for querying again. However, we have identified a gap in the data for Sunday, October 8th. We are currently assessing options for restoring the data for this day, although we expect this to take longer because of the development effort required on Instructure’s part. In the interim, anyone who needs immediate mitigation for this data gap can use the CD1 requests table as a temporary workaround. We apologize for any inconvenience this event may have caused.
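
For readers who want to check their own copies for a gap like the one reported for October 8th, here is a minimal sketch. It assumes you can already produce the list of calendar days for which you hold web_logs records; the function name `missing_days` and the sample dates are illustrative and not part of any Instructure tool.

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_days(days_with_data: Iterable[date], start: date, end: date) -> List[date]:
    """Return every day in the inclusive range [start, end] that has no data."""
    present = set(days_with_data)
    span = (end - start).days
    return [
        start + timedelta(days=i)
        for i in range(span + 1)
        if start + timedelta(days=i) not in present
    ]


# Hypothetical local inventory: records exist for 7 and 9 October, not 8 October.
seen = [date(2023, 10, 7), date(2023, 10, 9)]
print(missing_days(seen, date(2023, 10, 7), date(2023, 10, 9)))
# prints [datetime.date(2023, 10, 8)]
```

Any day this reports is a candidate for the CD1 requests table workaround mentioned in the update.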

16 Comments
amcdona
Community Participant

following

c_carrillo1
Community Participant

following

JamesSekcienski
Community Coach

@Edina_Tipter 

Is this functionality still disabled?  I don't see any updates here and the status page is unclear.

According to the Instructure Status page, Canvas Data 2 is "Operational".  However, when looking back at the Incident Report for this issue, it says this is "Resolved", but at the same time the details say the mentioned functionality is disabled for the time of the investigation. 

If the feature is disabled, then the issue isn't Resolved, and I'm unclear why it was labeled as "Resolved" if it is still being investigated.  Needing to review Incident Report history and read the details to know if an issue with Canvas Data 2 is truly "Resolved" or still being worked on does little to help clearly communicate when there is an active issue.  If the issue has been fully resolved and the functionality has been restored, I would recommend updating the details on this Incident Report to reflect that.

AlexShin
Community Explorer

@JamesSekcienski 

I'm not sure the Canvas Data 2 section of the status page is very reliable as an indicator of service at this point in time. We were told by our CSM that "We will report outages if they are P1 incidents. (When the CD2 API is down or returns only errors.)"

It has been indicating that CD2 is "operational" for essentially the entire time that there have been outages, or that web_logs has been disabled, over the past few weeks. I assume this is because they do not consider the issue described here a complete outage, since it only affects 2 tables, or because "The issue is contained to the North Virginia region and impacts less than 40 customers."

Edina_Tipter
Instructure Alumni
Author

Update on weblogs: the fix is on its way, and we are targeting a release next week, after it has gone through a testing phase.

@AlexShin @JamesSekcienski Thank you for your feedback. I will take away and discuss with our support team how to improve our practices on sharing this type of information.

reynlds
Community Participant

I agree with @AlexShin that traditional uptime monitoring is not sufficient for this service. While simple checks may show the overall service as "available", it really requires a "functional" evaluation to determine that it's responding correctly and in a timely manner. Ok, the host is up...but if it's throwing errors on actual use, it's no good.
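
The distinction drawn here between "available" and "functional" can be sketched in code. The example below is only an illustration under assumptions: `probe` stands for any callable that performs a small real request against the service (for example, a tiny query), and `functional_check` and the latency budget are hypothetical names and values, not part of the Instructure status tooling.

```python
import time
from typing import Callable, Tuple


def functional_check(probe: Callable[[], object], max_seconds: float) -> Tuple[bool, str]:
    """Report healthy only if a real request succeeds within the time budget."""
    start = time.monotonic()
    try:
        probe()  # perform an actual operation, not just a reachability ping
    except Exception as exc:
        # The host may be up, but the operation itself failed.
        return False, f"error: {exc}"
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        # A correct answer that arrives too late still counts as unhealthy.
        return False, f"slow: {elapsed:.2f}s"
    return True, "ok"
```

A plain uptime check would only cover the "host is up" case; this version also fails when the request raises an error or exceeds the budget.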

J-J-Jason
Community Participant

Agreed with the other posts! A little communication would go a long way.

stimme
Community Coach

@Edina_Tipter Could you provide an update and a solid date for releasing the fix for this issue?

Given the 30-day retention window for web_logs, we CD2 users in US-East-1 will lose records before too long. The last day for which my institution retrieved them was Sunday 9/24. The small number of business days between now and October 24 is making me nervous.

Thanks!
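
The date arithmetic behind this concern can be reproduced with a short sketch. It assumes that records for a given day age out exactly 30 days later, which is one reading of the retention window and not a confirmed rule; the function names are illustrative, and the business-day count ignores public holidays.

```python
from datetime import date, timedelta

RETENTION_DAYS = 30  # retention window quoted in this thread


def expiry_date(record_day: date, retention_days: int = RETENTION_DAYS) -> date:
    """Day on which records from record_day leave the window (assumed rule)."""
    return record_day + timedelta(days=retention_days)


def business_days_between(start: date, end: date) -> int:
    """Count Monday-Friday days from start (inclusive) to end (exclusive)."""
    days = (end - start).days
    return sum(1 for i in range(days) if (start + timedelta(days=i)).weekday() < 5)


last_retrieved = date(2023, 9, 24)
print(expiry_date(last_retrieved))  # prints 2023-10-24
```

Calling `business_days_between(today, expiry_date(last_retrieved))` with the current date then gives the number of working days left before records start to age out.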

Edina_Tipter
Instructure Alumni
Author

@stimme @J-J-Jason 

I have just provided an update in my post.

With regard to the 30-day retention period, I acknowledge your concern. Based on the updates posted so far, it appears that we will not run past this timeframe. However, if that situation does arise, we take responsibility for offering an alternative.

JamesSekcienski
Community Coach

@Edina_Tipter 

Will the hard deadline to transition to Canvas Data 2 be re-evaluated? 

While I understand it is challenging and expensive to support both Canvas Data 1 and Canvas Data 2, there are still issues being reported with Canvas Data 2, and it is less than 1 year past its official launch. Forcing institutions to make the switch to Canvas Data 2 in the middle of an academic school year, while they also need to figure out how to update all dependent systems/reports and handle new errors and issues, is a lot to manage.

The deadline for New Quizzes was eliminated because Instructure agreed that the "reasons to switch to New Quizzes should be compelling, not compulsory." It would be nice if the same or a similar statement could be made for Canvas Data 2. While I'm sure you would need a hard deadline eventually, the current deadline has not given customers enough time to see that Canvas Data 2 is stable, so that institutions trust making the switch and have sufficient time to get the resources needed to make a successful transition and update all dependencies smoothly.

reynlds
Community Participant

Agreed...I'd like to have at least a month or two with NO ERRORS OR OMISSIONS before turning CD1 off. It only makes sense as I can't get through a week without a failure of some sort.

JamesSekcienski
Community Coach

I'm glad to see the update that the functionality has been restored and hopefully it is working for users again to pull this data.

@Edina_Tipter In the future, can you provide updates as a comment too?  I didn't see a notification that the original post was edited to provide an update even though I am subscribed to this thread.

Edina_Tipter
Instructure Alumni
Author

@JamesSekcienski Of course I can do that. Thanks for raising this. 

JamesSekcienski
Community Coach

@Edina_Tipter Thank you!

reynlds
Community Participant

@Edina_Tipter Still not working for me using the latest DAP client.

$ dap initdb --namespace canvas_logs --table web_logs
...
2023-10-19 08:30:21,827 - INFO - Query job still in status: running. Checking again in 5 seconds...
2023-10-19 08:30:27,645 - WARNING - Received error in response: Middleware error
2023-10-19 08:30:27,669 - ERROR - Middleware error
Traceback (most recent call last):
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/__main__.py", line 133, in console_entry
    main()
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/__main__.py", line 125, in main
    asyncio.run(dapCommand.execute(args))
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/commands/commands.py", line 31, in execute
    executed = await super().execute(args)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/commands/base.py", line 49, in execute
    if await subcommand.execute(args):
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/commands/base.py", line 45, in execute
    await self._execute_impl(args)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/commands/initdb_command.py", line 31, in _execute_impl
    await init_db(
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/actions/init_db.py", line 16, in init_db
    await SQLReplicator(session, db_connection).initialize(
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/replicator/sql.py", line 37, in initialize
    client = await SnapshotClientFactory(
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/downloader.py", line 58, in get_client
    table_data = await self._session.get_table_data(
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 608, in get_table_data
    job = await self.execute_job(namespace, table, query)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 546, in execute_job
    job = await self.await_job(job)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 522, in await_job
    job = await self.get_job(job.id)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 425, in get_job
    job = await self._get(f"/dap/job/{job_id}", Job) # type: ignore
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 211, in _get
    return await self._process(response, response_type)
  File "/var/lib/canvas-mgmt/python39-venv/lib64/python3.9/site-packages/dap/api.py", line 324, in _process
    raise error_object
dap.dap_error.ServerError: Middleware error

AlexShin
Community Explorer

@reynlds 

I've been trying to get our backup process caught up since it was indicated that the web_logs table is now accessible again. We are not making use of the DAP client. 

For pretty much all of yesterday (10/19/2023), we were running into errors where the API took so long to respond that the token expired, causing the job to fail.

All this morning I've been trying to take daily snapshots for our backups, and I've been hitting the same middleware error very consistently. While trying to get a daily snapshot for 10/14, I encountered the middleware error 5 or 6 times before finally getting a good response.

The Instructure status page has indicated CD2 is green for the entire time we've been experiencing the failures.
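
For anyone scripting around the two failure modes described in this comment (tokens expiring during slow responses, and intermittent middleware errors), one common pattern is to retry with a fresh token and an increasing delay. The sketch below is generic and rests on assumptions: `get_token` and `run_job` are placeholders for your own code, `TransientError` is a hypothetical exception you would raise when you see a retryable failure, and none of these names come from the DAP client or the CD2 API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a middleware error."""


def retry_with_fresh_token(
    get_token: Callable[[], str],
    run_job: Callable[[str], T],
    attempts: int = 6,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run run_job with a newly fetched token, retrying transient failures."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        token = get_token()  # fetch a fresh token so a slow earlier try cannot expire it
        try:
            return run_job(token)
        except TransientError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Fetching the token inside the loop is the design choice that addresses the expiry problem; the backoff only limits how hard a struggling service is hit.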