Cloud SLA (Government / Defense)

I teach a cloud computing course for Thomas Erl @ Arcitura from time to time, and the question that always comes up is the one about service level agreements (SLAs). It is a complicated subject that does not get enough attention in our industry. Here I am presenting some of the discussions we have had in class and some of the questions asked, all related to cloud computing and mostly aligned to the defense industry. This is part of a series of blog posts on SLAs, meant to get people thinking about their requirements and to understand what an SLA can or cannot do for a business.

Today I will share a short story.

It was 5:00 PM on a Monday afternoon when a routine procedure on a patient in London turned into a catastrophic challenge requiring expertise from across the pond. Dr. Jack Ash is a leading gastroenterologist living in Ontario, Canada. Dr. Ash received a call asking if he could help the patient in London. He has extensive experience with remote surgery, a relatively new practice taking shape around cloud-oriented services. Dr. Ash uses IT services provided by his hospital in Ontario. The services in London belong to a different hospital with a different IT service provider. The configuration of the services looks very similar to this (see figure 2).

The SLA looks something like this: IT_Service_Level_Agreement_Publish_To_Customers. Dr. Ash didn’t have much time; he needed to act fast. He ran down to the office and coordinated all of the requirements with the remote team in London. This situation had been discussed before, and the staff had practiced with dummies and cadavers. Fortunately for Dr. Ash, he routinely performs procedures like this between hospitals in Ontario.

The stage was set: the anesthesiologist had the patient stable, and all of the supporting staff acted as planned. Because of the situation and some setbacks, time had gone by, and the procedure took place closer to 8:00 PM.

Dr. Ash started the robotic services and invoked various commands over the network. At first the procedure went well; there were some minor fluctuations in bandwidth, but nothing major. To make sure that network connectivity stayed consistent, Dr. Ash asked a staff member to call his local IT and make sure QoS was turned on. The helpdesk reported back quickly that all available resources and network traffic managing the robotic services had priority on the network. The procedure continued, but Dr. Ash began noticing severe latency in his video stream. He had to act fast to finish the procedure, and just as he was completing the final maneuvers, the robotic arms lost connection. Luckily, the original staff working on patient x were still in the room, and Dr. Ash was able to talk them through the final moments.

What happened?  Why did the arm lose connectivity?  How did the SLA help?  How could the SLA have helped?

Although this story is not real, the technology is, and medical practitioners are doing something like this today. Why did the arm lose connectivity? I’ll give you a hint: there was a soccer game at Chelsea, and a lot of people in London were very interested in that game.

There are various reasons for the potential failure of service. The problem is that most people focus on what technology can do, as opposed to understanding where it makes sense to use services like this and where it does not. There are plenty of business cases that could have created a more stable environment, but more often than not businesses choose to go headfirst into situations like this. It seems like a great idea, and they even had an SLA! I wonder what that would have done for patient x had this person died.

These are some of the concepts that we will need to explore further. I will put together some generic, defense-oriented scenarios for thought.