September 2022 Canvas Service Disruptions


This post has been edited to add information for a 9/30/22 incident.


TLDR: Seven Canvas service disruptions in September caused problems for many users. We’re learning and making improvements to ensure this doesn’t happen again. This includes training, adjustments to our processes, and improvements to our technology.

Between September 7th and September 26th, Canvas operations were disrupted, in whole or in part, seven times. We recognize that these incidents added stress for you, and for that, we apologize.

As the leader of Instructure’s engineering organization, my priority is to deliver products and build teams that consistently get better together. It’s critical to maintain the confidence of educators in our ability to deliver a reliable service that enables them to teach. Although our annual global uptime remains above 99.9%, we expect far better from critical infrastructure software.

Here is what we’re doing to learn from these instances and improve:

Every incident is an opportunity for us to assess our people, process, and technology. We recently completed an investigation into each of these incidents to understand their causes and prevent a recurrence (see below for additional detail). Based on these findings, we are changing our code, implementing additional automated tests, updating our deployment process, and updating our monitoring infrastructure to prevent disruptions and minimize customer impact in the future. We are also reinforcing our incident response training because, in one case, we failed to learn from a previous incident.

My team and I are committed to being open and transparent about the steps we’re taking to deliver the software services you rely on. If you have any input or questions about our engineering operations, please reach out to your CSM or feel free to contact me directly.

Kindest Regards,
Steve Townsend
SVP, Engineering
steve.townsend@instructure.com

 

Incident Detail

Current service status can be found at status.instructure.com, and a rolling history of service availability can be found at statushistory.instructure.com. Your CSM can provide more detailed information about specific incidents.

  • Two incidents were caused by an underlying AWS infrastructure failure from which Canvas did not recover as expected. No data was lost during this service failover.
    • 9/14, New Quizzes: 56 minutes of degraded service for some users in the Americas
    • 9/7, SIS Integration: 3 hours of delayed enrollment changes for some users in the Americas
  • Two incidents were caused by an incomplete resource configuration during a code deploy
    • 9/13, Document Viewing (Speedgrader, Student Collaborations): 25 minutes of no service for some users in the Americas
    • 9/15, Document Viewing (Speedgrader, Student Collaborations): 30 minutes of no service for some users in the Americas
  • One incident was caused by a failure to scale up under load
    • 9/14, SCORM content viewing: 5 hours of intermittent unavailability for some users in the Americas
  • One incident was caused by a bad code deploy
    • 9/21, Canvas LMS: 46 minutes of degraded service and no service in all regions
  • One incident was caused by a cache configuration change that prevented scaling under heavy load
    • 9/26, Canvas LMS: 35 minutes of no service and degraded service for some users in the Americas
  • One incident was caused by a bug associated with a code deploy
    • 9/30, Canvas LMS: 28 minutes of worldwide disruption on pages that allow users to upload attachments