Amazon Web Services / Canvas Outage -- Most Timely Update

jschreier
Community Participant

Hi everyone. I am a Canvas admin at my college. On Tuesday, Oct. 1 at 10 PM, I got an email from Canvas Support informing me that an AWS instance had failed on the night of Sunday, Sept. 29.

I forwarded that information to my faculty first thing when I came into the office on Wednesday, Oct. 2. Since then, a few faculty members have emailed asking whether they could find out about these outages in a more timely fashion. I sympathize with the professors: many have assignments due on Sunday night, and they have to answer students who claim that they were unable to submit.

So, my question is this: Is there a way to find out about these outages in a more timely fashion?

I know there is a status page. Can this be followed? Is it standard practice for Canvas admins to follow the status page? 

Thank you! 

From: Canvas Support <support@instructure-contact.com>
Sent: Tuesday, October 1, 2019 10:07 PM
To: Jesse Schreier <jschreier@massasoit.mass.edu>
Subject: Update from Canvas: Some users received page errors while uploading files on 9/29
 

SUMMARY

An AWS (Amazon Web Services) instance failed between 7:59 PM MDT and 11:30 PM MDT, September 29, 2019, which resulted in the page crashing and “500” page errors for some users when attempting to upload files to Canvas.

DETAILS

On September 29, 2019, some users experienced errors while uploading files to Canvas between 7:59 PM MDT and 8:10 PM MDT. This was caused by an AWS instance failing due to its metadata credentials expiring. The health check configuration for AWS does not currently include verifying metadata credentials, so none of the systems triggered an "unhealthy" alert.

Amazon's self-healing infrastructure resolved the failure within 10 minutes. After this process, a small number of users intermittently received page errors between 8:10 PM MDT and 11:30 PM MDT until we manually restarted the system.

MITIGATION

We have filed a request with AWS to add a check for invalid metadata credentials so that we can be properly alerted in the event this ever happens again.

CONCLUSION

We sincerely apologize for the inconvenience and interruption to your service. Although this situation was unforeseen, we will make certain that future mechanisms are put in place to ensure that user experience is not compromised.
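
As an aside to the email above: the gap it describes, a health check that never looks at metadata credentials, might be closed with a check along the lines below. This is only a minimal sketch, not Instructure's or AWS's actual configuration. The function name `credentials_healthy` and the five-minute margin are my own assumptions, and it assumes the credentials arrive as a dict with an ISO 8601 "Expiration" string, which is the shape EC2 instance metadata uses for role credentials.

```python
from datetime import datetime, timedelta, timezone


def credentials_healthy(creds, now=None, min_remaining=timedelta(minutes=5)):
    """Return True only if credentials exist and are not expired or about to expire.

    `creds` is a dict with an "Expiration" string such as "2019-09-30T06:00:00Z".
    The function name and the default margin are hypothetical choices.
    """
    now = now or datetime.now(timezone.utc)
    expiration = (creds or {}).get("Expiration")
    if not expiration:
        return False  # missing credentials count as unhealthy
    try:
        # Parse a UTC timestamp of the form YYYY-MM-DDTHH:MM:SSZ.
        expires_at = datetime.strptime(expiration, "%Y-%m-%dT%H:%M:%SZ")
    except (TypeError, ValueError):
        return False  # an unreadable expiry is treated as unhealthy
    expires_at = expires_at.replace(tzinfo=timezone.utc)
    # Healthy only while at least `min_remaining` is left before expiry.
    return expires_at - now >= min_remaining
```

If something like this were wired into an instance's health endpoint, the instance could in principle report itself unhealthy before uploads start failing, which is the alert the email says never fired.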
