September 2022 Canvas Service Disruptions

sptownsend · ‎09-27-2022

This post has been edited to add information for a 9/30/22 incident.

TLDR: Seven Canvas service disruptions in September caused problems for many users. We’re learning and making improvements to ensure this doesn’t happen again. This includes training, adjustments to our processes, and improvements to our technology.

Between September 7th and September 26th, Canvas operations in whole or in part were disrupted seven times. We recognize these incidents have added stress for you, and for that, we apologize.

As the leader of Instructure’s engineering organization, my priority is to deliver products and build teams that consistently get better together. It’s critical to maintain the confidence of educators in our ability to deliver a reliable service that enables them to teach. Although our annual global uptime remains above 99.9%, we expect far better from critical infrastructure software.

Here is what we’re doing to learn from these instances and improve:

Every incident is a learning opportunity for us to assess our people, process and technology. As such, we recently completed an investigation into each of these incidents to understand their causes and prevent a recurrence (see below for additional detail). Based on these findings, we are implementing changes to prevent disruptions and minimize customer impact in the future. We’re instituting changes to our code, implementing additional automated tests, updating our deployment process, and updating our monitoring infrastructure. We are also reinforcing our incident response training since in one case we failed to learn from a previous incident.

My team and I are committed to being open and transparent about the steps we’re taking to deliver the software services you rely on. If you have any input or questions about our engineering operations, please reach out to your CSM or feel free to contact me directly.

Kindest Regards,
Steve Townsend
SVP, Engineering
steve.townsend@instructure.com

Incident Detail

Current service status can be found at status.instructure.com, and a rolling history of service availability is at statushistory.instructure.com. Your CSM can provide more detailed information about specific incidents.

Two incidents were caused by underlying AWS infrastructure failure that Canvas did not recover from as expected. No data was lost during this service failover.
- New Quizzes
  - 9/14, 56 minutes of degraded service for some users in the Americas
- SIS Integration
  - 9/7, 3 hours of delayed enrollment changes for some users in the Americas

Two incidents were caused by an incomplete resource configuration during a code deploy
- Document Viewing (Speedgrader, Student Collaborations)
  - 9/13, 25 minutes of no service for some users in the Americas
  - 9/15, 30 minutes of no service for some users in the Americas

One incident was caused by a failure to scale up under load
- SCORM content viewing
  - 9/14, 5 hours of intermittent unavailability for some users in the Americas

One incident was caused by a bad code deploy
- Canvas LMS
  - 9/21, 46 minutes of degraded service and no service in all regions

One incident was caused by a cache configuration change that prevented scale under heavy load
- Canvas LMS
  - 9/26, 35 minutes of no service and degraded service for some users in the Americas

One incident was caused by a bug associated with a code deploy
- Canvas LMS
  - 9/30, 28 minutes of worldwide disruption on pages that allow users to upload attachments

Mikee · ‎10-02-2022

Hi Steve

As a significantly sized user in the rest of the world (we don't use freedom units to measure things, and stick to metric), your honesty about last month is most welcome.

September looks like it wasn't a great month for your platform, and a few other changes made without notice have knocked our implementation around; we've raised those with our CSM.

I've been asking your team to allow us to use our native team platform to receive status updates for some time and have been rebuffed. Given your striving here for openness and honesty, we'd find it beneficial should you allow us to subscribe to your status notifications through Slack.

Thanks! 🙂

dgrobani · ‎10-03-2022

@Mikee,

I believe there's an app that enables you to subscribe to RSS feeds in Slack, which would then enable you to receive updates from https://status.instructure.com/history.rss.
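For anyone who would rather script it, here is a minimal sketch of what an RSS-to-Slack bridge does. It assumes the feed is ordinary RSS 2.0 and that you have created a Slack incoming webhook for your channel. The function names and the sample feed are made up for illustration, and I have not checked the real feed's exact fields.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Made-up sample in standard RSS 2.0 shape; NOT real status-page data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Status</title>
    <item>
      <title>Example incident</title>
      <link>https://example.com/incidents/1</link>
      <pubDate>Mon, 03 Oct 2022 12:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one dict per RSS <item> with its title, link, and pubDate."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


def unseen(items, seen_links):
    """Keep only items whose link has not been posted before."""
    return [item for item in items if item["link"] not in seen_links]


def to_slack_message(item):
    """Build the JSON payload for a Slack incoming webhook (a 'text' field)."""
    return {"text": f"{item['title']}\n{item['link']}"}


def fetch_feed(url):
    """Download the feed; the caller supplies the URL (an assumption here)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def post_to_slack(webhook_url, message):
    """POST the payload to a webhook URL that you created in Slack."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

To use it, you would call `fetch_feed` on the feed URL on a schedule, pass the result through `parse_feed` and `unseen`, and send each new item with `post_to_slack`, remembering the links you have already posted.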

Mikee · ‎10-03-2022

Thanks @dgrobani - many of our other vendors permit their status pages to be subscribed natively into Slack. Don't see the benefit of enacting a kludge when the functionality is there, but being withheld.

DanBurgess · ‎10-05-2022

Thanks for acknowledging these incidents and providing insight into what steps you are taking to prevent future occurrences. It is unfortunate that a clustering of issues within a relatively short time frame (and particularly painful during the start of the academic year) leaves folks with the impression that the system has questionable reliability. To paraphrase a familiar quote, it can take years to develop a reputation and minutes to ruin it.

We are all rooting for you and the team in meeting the goal of delivering Canvas to our users with as little downtime as possible.

cdoherty · ‎10-05-2022

@sptownsend Please update this page with the incident from Sept 30. It seems that was deployment-related as well, but it would be good to have more details.

Mikee · ‎10-10-2022

@sptownsend - posting with a promise of accountability & transparency then not acknowledging folks on a site designed for communication & engagement isn't a great look. 😉

Our CSM has been fabulous through this, but keeps not being given the technical nuts & bolts we (as admins used to deploying features and with our own in-house dev team) need and want to play nicely with canvas.

ChrisMedina · ‎10-12-2022

This whole incident reminded me of

https://youtu.be/3fGHaVn5rGo

sptownsend · ‎10-13-2022

@DanBurgess: Thank you, Dan, for your encouragement. We recognize how fragile trust is and how critical LMS stability is to every customer. We, and I especially, expect far better from critical infrastructure software.

@cdoherty: We've added the 9/30 incident to the original post. Our incident process owner is exploring a method to post incidents and incident reports in a more visible place, possibly in the community, rather than updating an old post. 🙂

@Mikee : Hi Mikee. You’re right, an important part of building trust through transparency is listening to and responding to suggestions like the one you provided about Slack. I should have responded to your comment. Our support team is currently investigating options for allowing a Slack integration with our individual instance status page updates. We will look into adding this functionality to the status page and keep you updated on our progress.

We’ve updated our processes internally to ensure we don’t miss valuable customer feedback like this. While I can’t promise a response to every suggestion, I’ll do my best to listen to our customers and communicate. Understanding the challenges you face in your job is a crucial part of delivering the best product experience possible. With regard to the recent incidents, by now you should have received a detailed incident report. If you have not received one, reach out to your CSM again and they should be able to provide you with a copy.

Mikee · ‎10-13-2022

😍Perfect. Thank you @sptownsend

cdoherty · ‎10-18-2022

@sptownsend Perhaps it's time to start an October 2022 Service Disruptions page to include the fact that Instructure decided to take down our test instance for the entire day WITHOUT WARNING OR NOTIFICATION of any kind. It doesn't seem like Instructure is learning from past mistakes. This kind of decision demonstrates the continuing lack of transparency. This was not an accidental outage, but a decision to ruin the productivity of many institutions without warning.

Renee_Carney · ‎10-21-2022

Greetings @cdoherty

Thank you for your comment. We are still exploring options for posting incident reports and will keep the Community posted as solutions emerge.

Also, thank you for acknowledging this was not an accidental incident, which is what this thread is focused on. I appreciate the clarity for other Community members. I do want you to know that your concerns are heard and we are conversing internally on how notification and communication of work on test (and beta) can be improved. Please follow up with your CSM for further conversation around your experience.

Mikee · ‎10-26-2022

While @cdoherty only mentioned their non-production site availability, it looks like the bumpy ride is continuing this month.

I'm hoping the spate of service notices is because of an increased focus on acknowledging things, in line with the commitment to accountability and transparency.