September 2022 Canvas Service Disruptions

sptownsend · 09-27-2022

This post has been edited to add information for a 9/30/22 incident.

TLDR: Seven Canvas service disruptions in September caused problems for many users. We’re learning and making improvements to ensure this doesn’t happen again. This includes training, adjustments to our processes, and improvements to our technology.

Between September 7th and September 26th, Canvas operations were disrupted, in whole or in part, seven times. We recognize these incidents have added stress for you, and for that, we apologize.

As the leader of Instructure’s engineering organization, my priority is to deliver products and build teams that consistently get better together. It’s critical to maintain the confidence of educators in our ability to deliver a reliable service that enables them to teach. Although our annual global uptime remains above 99.9%, we expect far better from critical infrastructure software.

Here is what we’re doing to learn from these instances and improve:

Every incident is a learning opportunity for us to assess our people, process and technology. As such, we recently completed an investigation into each of these incidents to understand their causes and prevent a recurrence (see below for additional detail). Based on these findings, we are implementing changes to prevent disruptions and minimize customer impact in the future. We’re instituting changes to our code, implementing additional automated tests, updating our deployment process, and updating our monitoring infrastructure. We are also reinforcing our incident response training since in one case we failed to learn from a previous incident.

My team and I are committed to being open and transparent about the steps we’re taking to deliver the software services you rely on. If you have any input or questions about our engineering operations, please reach out to your CSM or feel free to contact me directly.

Kindest Regards,
Steve Townsend
SVP, Engineering
steve.townsend@instructure.com

Incident Detail

Current service status is available at status.instructure.com, and a rolling history of service availability is available at statushistory.instructure.com. Your CSM can provide more detailed information about specific incidents.

Two incidents were caused by underlying AWS infrastructure failure that Canvas did not recover from as expected. No data was lost during this service failover.
- New Quizzes
  - 9/14, 56 minutes of degraded service for some users in the Americas
- SIS Integration
  - 9/7, 3 hours of delayed enrollment changes for some users in the Americas

Two incidents were caused by an incomplete resource configuration during a code deploy
- Document Viewing (Speedgrader, Student Collaborations)
  - 9/13, 25 minutes of no service for some users in the Americas
  - 9/15, 30 minutes of no service for some users in the Americas

One incident was caused by a failure to scale up under load
- SCORM content viewing
  - 9/14, 5 hours of intermittent unavailability for some users in the Americas

One incident was caused by a bad code deploy
- Canvas LMS
  - 9/21, 46 minutes of degraded service and no service in all regions

One incident was caused by a cache configuration change that prevented scale under heavy load
- Canvas LMS
  - 9/26, 35 minutes of no service and degraded service for some users in the Americas

One incident was caused by a bug associated with a code deploy
- Canvas LMS
  - 9/30, 28 minutes of worldwide disruption on pages that allow users to upload attachments
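
For context on the 99.9% annual uptime figure mentioned earlier in this post, the short sketch below shows how an availability percentage translates into a yearly downtime allowance. It is an illustration only: the post does not describe how Instructure measures uptime, so the function name `downtime_budget_minutes`, the 365-day year, and the assumption that uptime is a simple fraction of total minutes are all hypothetical.

```python
# Illustrative only: converts an availability target into a downtime allowance.
# Assumes uptime is measured as a plain fraction of total minutes in the period,
# which may not match how any real service computes its uptime figure.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Return the minutes of downtime allowed over `days` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return days * MINUTES_PER_DAY * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common availability targets over one 365-day year.
    for target in (99.9, 99.95, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.1f} minutes/year ({budget / 60:.2f} hours)")
```

Under these assumptions, 99.9% over a year corresponds to roughly 525.6 minutes (a little under 9 hours) of allowable downtime. The incident durations listed above affected specific products and subsets of users, so they should not be summed and compared directly against this number.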