matt6
Community Participant

Disaster Recovery Testing

Hi all,

Our internal audits group requires disaster recovery testing on all enterprise-level apps. When we hosted our LMS in house this was fairly routine: we would be observed going through an exercise where we recovered the system to bare-metal hardware and did a rebuild. Now that we have moved to Canvas, the group still insists on an observable DR exercise for the LMS. Has anyone done anything similar, or do you stick to the position that Canvas runs on AWS and is therefore well covered by their own internal disaster recovery policies? Just trying to get community consensus, as paying for a cloud solution with a certified DR procedure and still doing internal DR testing feels like overkill.

5 Replies
kmeeusen
Community Coach

I don't know about "community consensus", but that is one of the huge advantages of a cloud-based LMS - all of that is someone else's responsibility.

You could point them to Canvas Status 

Even when AWS had a major disaster a while back, we were down barely five hours; and again, that was their problem. We saw at InstCon that their average uptime is greater than 99%. And just to focus on reality: what could your local IT folks do about Canvas if it did go down? Are they going to catch a plane to the server farm and help the Amazon techs sort it out? Just saying.

What your disaster management folks should be looking at should include:

  • Management of Canvas downtime on your campus,
  • How to use Canvas to support instructional continuity during local disasters/campus closings,
  • How to use Canvas to support campus operations during local disasters/campus closings (snow days and the like)

These are the things our disaster management committee works on in regards to Canvas.

I hope this helps,

Kelley

nr2522
Community Champion

Hi @matt6.

Would your internal audits group be appeased if Instructure provided some paperwork, perhaps the Cloud Security Alliance's Consensus Assessments Initiative Questionnaire (https://cloudsecurityalliance.org/group/consensus-assessments/#_overview), perhaps even just the "Business Continuity Management & Operational Resilience" section? 

matt6
Community Participant

I would hope Instructure has this document or something like it. I would be interested to see what our internal audits group would do with it.

bdalton_sales
Instructure

There are a couple of things to note. If they want to see observable DR testing in action, the Test and Beta instances have the database backups from Production restored into them every three weeks and every week, respectively. In other words, the recovery process is constantly being exercised, which confirms both that backups are occurring and that the restore process works. That's pretty observable. The process for restaging an instance in another AWS Region is the same as this process. There is also a document covering the DR processes that you can ask your CSM for.
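
If the auditors want something observable on your side as well, a small check run after each refresh can confirm the Test instance really does contain recent Production data. Below is a minimal sketch in Python; the hostnames, token variable, and course ID are placeholders for illustration, not anything Instructure supplies.

import os
import requests  # third-party HTTP library

# Placeholders -- substitute your own instance URLs, an API token, and a
# course you know existed in Production before the refresh window.
PROD = "https://yourinstitution.instructure.com"
TEST = "https://yourinstitution.test.instructure.com"
TOKEN = os.environ["CANVAS_API_TOKEN"]
COURSE_ID = 12345

def get_course(base_url, course_id):
    """Fetch one course via the Canvas REST API and return its JSON."""
    resp = requests.get(
        f"{base_url}/api/v1/courses/{course_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

prod_course = get_course(PROD, COURSE_ID)
test_course = get_course(TEST, COURSE_ID)

# After a successful refresh the restored copy should match Production.
print("Prod:", prod_course["name"], prod_course.get("created_at"))
print("Test:", test_course["name"], test_course.get("created_at"))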

stuart_ryan
Community Coach

Good call Deactivated user, I was going to suggest the same document. @matt6, we also have very stringent DR processes and yearly testing at UTS. The Disaster Recovery Plan and Procedures document that Brett mentioned, and that your CSM can provide, really helped us satisfy our requirements, and led us to the further documents and processes we needed.

The Instructure DR Plan covers a range of topics that a DR committee would be interested in signing off on, including:

  • Policy and Practices
  • Definition of Disaster
  • Declaration of Disaster
  • Key Organizational Resources
  • Disaster Recovery Team
  • Notification
  • Notifying Staff
  • Notifying Clients and Business Partners
  • Testing
  • Disaster Recovery Solution
  • Current Operating Infrastructure
  • Objectives
  • Backup and Recovery Practices
  • Sample Disaster Scenarios
  • Complete Loss of a Master Database
  • Simultaneous Complete Loss of Master and Slave Databases
  • Database Destruction by Hacker
  • Complete Loss of Primary Hosting Facility

We took this and considered two parts to cover off DR for our Canvas LMS. The first addresses that DR of the Canvas LMS is the responsibility of Instructure (and part of what we have contracted them for).


The second part is something that (in my mind) is equally important: any internal dependent systems. These are any self-hosted or other Software-as-a-Service (SaaS) applications you may rely on for full functionality of Canvas; they are specific to your institution and should be considered for DR. They may include:

  • Your authentication systems such as SAML Servers, LDAP, CAS etc.
  • Your Domain Name Servers (DNS) if you have a vanity URL such as yourcanvas.yourinstitution.edu
  • Secondary systems and LTI tools such as TurnItIn, or self-hosted LTI tools.

Having a comprehensive overview that covers all the interdependencies can be very useful if one of those dependent or related systems goes into DR, so that Canvas can be rapidly identified by the DR committee as being impacted.
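
One way to keep that overview in a usable form is a small inventory recording each dependent system, who owns its DR plan, and an endpoint that can be probed when one of them goes into DR. A minimal sketch in Python follows; the system names, hosts, and owners are invented placeholders for the example, not taken from Instructure's documentation.

import socket

# Hypothetical inventory of systems our Canvas instance depends on.
DEPENDENCIES = [
    {"name": "SAML IdP",   "owner": "Identity team", "host": "idp.yourinstitution.edu"},
    {"name": "Vanity DNS", "owner": "Network team",  "host": "yourcanvas.yourinstitution.edu"},
    {"name": "LTI tool",   "owner": "Vendor",        "host": "lti-tool.example.com"},
]

def reachable(host, port=443, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Print the overview plus a quick reachability check for each dependency.
for dep in DEPENDENCIES:
    status = "up" if reachable(dep["host"]) else "DOWN"
    print(f"{dep['name']:<12} owner: {dep['owner']:<14} {dep['host']} -> {status}")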

Depending on whether your institution conducts organisation-wide DR scenarios (for example, total loss of service, in addition to individual system testing), you can take this a step further. When we run one of these DR scenarios, we participate by hooking our Test Canvas instance into the DR-recovered authentication and DNS systems, for example.

That ensures we have documented and tested any setting changes that would be required, gives us an opportunity to test what is currently only theory, and confirms that all the systems talk correctly to each other, connect through the DR firewalls, and that users can log on successfully.
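
The "users can log on" step of such an exercise can also be scripted so the result is repeatable and easy to show an auditor. Another minimal sketch, with placeholder hostnames and an assumed IdP metadata path (yours will depend on your identity provider):

import socket
import urllib.request

# Placeholder endpoints for the DR exercise -- substitute your own.
VANITY_HOST = "yourcanvas.yourinstitution.edu"
IDP_METADATA = "https://dr-idp.yourinstitution.edu/idp/shibboleth"  # hypothetical DR IdP URL
CANVAS_LOGIN = "https://yourinstitution.test.instructure.com/login"

def http_ok(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:
        return False

# 1. Does the vanity name resolve against the DR DNS we are now using?
try:
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(VANITY_HOST, 443)})
    print(f"{VANITY_HOST} resolves to {addrs}")
except socket.gaierror:
    print(f"{VANITY_HOST} does not resolve")

# 2. Is the DR identity provider serving metadata?
print("IdP metadata reachable:", http_ok(IDP_METADATA))

# 3. Does the Canvas Test instance serve its login page?
print("Canvas login page reachable:", http_ok(CANVAS_LOGIN))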

Hope that helps and gives you some ideas about how you could still document the DR processes to satisfy the internal requirements. A really good question, too! Even while writing this I have identified some areas I need to document better (for example, formalising the second part I mentioned above).