Back from Disaster


This article was originally published on EdTechMagazine.

After nearly losing its payroll data permanently in a data center outage, Beaverton School District in Oregon worked with CDW to devise and implement a comprehensive disaster recovery plan.

Just a few days before the start of a recent school year, Beaverton School District in Oregon experienced what officials there now simply call “The Event.” 
When IT staffers arrived at work on the Friday before classes were scheduled to begin, every server in the district’s data center was offline, and they couldn’t identify why. When they tried to reboot, about half of the hard drives failed, including those dedicated to the district’s student information system, email, finance system and a number of other applications. At first, staffers thought the issue might be related to the data center’s uninterruptible power supply, but they couldn’t find anything wrong. With the district staring down the prospect of starting the school year without a number of its critical applications, IT staffers canceled their end-of-summer vacations and dug into the problem. 
Eventually, officials discovered that the fire suppression monitoring system protecting the district’s IT infrastructure had malfunctioned. A nozzle in the system, which had been recalled years earlier, caused vibrations in the room. For hard drives spinning at 15,000 rpm, it was like a sonic boom. 
Gas canisters discharged throughout the data center to put out a fire that did not exist. An alert was sent to the school’s alarm company, but the fire department wasn’t notified: as a cost-saving measure, the district’s office of public safety had instructed the company to call the fire department only if two alarms went off. Any system of protection that could be compromised had been compromised.
IT staffers worked around the clock to bring systems back up on the remaining hard drives, restoring applications from tape. By Sunday evening, just two days before the start of school, they finally had the student information system and most other critical applications up and running. 
“We thought that was the end of the crisis, and that we’d pretty much mitigated the problem,” recalls CIO Steven Langford. “Everyone was exhausted, and I told everybody, ‘Go home and sleep.’ What we didn’t know was that the problem hadn’t even started yet.” 

18%: The portion of disaster recovery decision-makers who say their organizations are “very prepared” to recover their data center in the event of a site failure or disaster event

Source: “The State of Disaster Recovery Preparedness,” Forrester Research, February 2017 

The next morning, Langford was decompressing with a bike ride when the call came: An IT staffer had misconfigured the district’s payroll system backups a year earlier, and the district had only incremental backup data. 
“We had no finance data for 5,000 employees,” Langford says. “We had 17 days to pay 5,000 people. We didn't know how much money they made. We didn't know their withholdings for taxes. We didn't know their vacation/sick leave balances. We didn’t even know who had stopped working for us. And this was the day before school started, with 40,000 kids coming back.” 
Beaverton’s schools had a number of disaster recovery tools in place, but a cascade of problems set the district back on its heels. The fire suppression system that malfunctioned had been recalled two years earlier, but district IT officials didn’t know that. When the gas canisters were set off, alarms weren’t sent to the local fire department or district IT officials. And the failure to back up payroll data went unnoticed for an entire year. 
Beaverton was on the verge of giving up and rebuilding its human resources finance database from scratch when an employee (the same one who caused the initial error) found a company that was able to recover the payroll data from a heat-damaged backup tape. But even then, the district continued to feel the impact of the incident throughout the school year as it raced to build back its reports. “The Event” had opened officials’ eyes to the importance of a disaster recovery strategy that would ensure continuity of operations. 
“We learned that we really didn’t have systems set up — both in the technology vein, but also the human systems around our data center — to be successful,” Langford says. 
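The payroll failure described above, a backup job that produced only incremental data for a year without anyone noticing, is the kind of gap a routine automated check can surface. What follows is a minimal sketch, not the district's actual tooling: the `BackupRecord` type, the `systems_missing_recent_full` function, the seven-day window and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupRecord:
    """One entry from a backup catalog (hypothetical structure)."""
    system: str            # e.g. "payroll"
    kind: str              # "full" or "incremental"
    finished_at: datetime


def systems_missing_recent_full(records, systems, now, max_age=timedelta(days=7)):
    """Return the systems that have no full backup newer than max_age.

    The seven-day default is an assumption for this sketch, not a
    recommendation taken from the article.
    """
    latest_full = {}
    for record in records:
        if record.kind != "full":
            continue  # incrementals alone cannot restore a system
        previous = latest_full.get(record.system)
        if previous is None or record.finished_at > previous:
            latest_full[record.system] = record.finished_at

    return sorted(
        name
        for name in systems
        if name not in latest_full or now - latest_full[name] > max_age
    )


if __name__ == "__main__":
    now = datetime(2020, 1, 15)  # arbitrary illustrative date
    catalog = [
        BackupRecord("student information system", "full", now - timedelta(days=2)),
        BackupRecord("payroll", "incremental", now - timedelta(days=1)),
    ]
    flagged = systems_missing_recent_full(
        catalog, ["student information system", "payroll"], now
    )
    print("No recent full backup:", flagged)
```

Run on a schedule and wired to an alert, a check along these lines would report a system that has only incremental backups on the first day the gap appears, instead of during a restore.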


A Shifting Mindset

Shortly after the worst of the crisis had passed, Langford met with CDW field representative Angela Gadient, who is now a field sales manager for K-12 West. “We started looking at the systems that we had around protecting our data and data center, and we realized we needed a lot of help,” Langford says. “That’s not our expertise.” 
Gadient arranged for CDW experts to hold a one-week workshop for Beaverton officials and employees. As the CDW team listened to stakeholders describe their existing operations and their concerns about data security, it became clear that the district needed to stop thinking solely in terms of IT solutions and start treating disaster recovery as a broader problem. “Instead of saying, ‘Hey, why don’t we help you with some servers and storage,’ it was clear that it was a much bigger systemic issue,” says Gadient. “We started having conversations around risk, and how organizations can address this kind of risk. It isn’t just a matter of servers and storage. It’s a matter of policies and procedures, as well.” 
Langford says that district officials were impressed with CDW’s broad view of the situation. While other vendors and resellers tended to talk in terms of only the data center, he says, CDW’s experts focused on ensuring continuity of operations for the entire school district. “It really exposed us to the business continuity planning piece of things, which is larger than disaster recovery. And it showed us where the data center planning would fit into a holistic organizationwide plan.”
Gary Arnce, who now works as a regional manager of data center solutions in the Western U.S. for CDW, was a data center architect at the time, and worked closely with Beaverton officials as the district worked through a months-long engagement aimed at identifying, prioritizing and protecting its critical applications. 
“Very few organizations, especially school districts, know where their data lives, or have service-level agreements (SLAs) outlined for every application,” says Arnce. “It’s a different mindset from how most K–12 districts are used to approaching disaster recovery.”  
In part because of how spectacularly their existing processes failed, Langford says, school district officials were receptive to a new way of thinking. “We had been investing in our data center, in terms of backup technologies and database snapshot tools,” he says. “So, we had those systems. Our challenge was that everything was in the same room, except for the tapes. So, it really made us start thinking about what a better approach would be.” 

DR in the Cloud

The public cloud has proven to be a popular choice for disaster recovery systems. In “A Guide to Disaster Recovery in the Cloud,” VMware offers these top three use cases for hosted DR:
Site Flexibility: For organizations that are unable to invest in a secondary data center operated by core IT staff, cloud-based DR provides an offsite option for ensuring continuity of operations.
Replace or Enhance Traditional DR: Even for organizations that already operate a secondary data center, the cloud offers an extra layer of disaster recovery for mission-critical applications, communications and data storage.
DR for Remote Offices: Cloud-based DR allows organizations to protect remote office sites without additional capital investments.

Redundancy and Responsibilities

Beaverton’s disaster recovery and continuity of operations engagement with CDW resulted in a plan that ensured multiple levels of backup outside of the district’s primary data center. As part of a school construction bond, Beaverton built a secondary data center in an existing district facility around 1.5 miles from the primary site. The new facility ensures that the district can keep critical applications online if its primary data center faces an isolated outage (such as the malfunctioning fire suppression system that crippled the district before). However, it is too close to the primary site to protect against a larger disaster, such as a major earthquake. Partly because of this proximity, Beaverton schools are also working to implement a cloud disaster recovery solution. 
“We have redundancy at the data center level locally, and then we’re looking at the cloud as our disaster recovery and business continuity,” Langford says. 
Perhaps just as importantly, the district has codified its practices for backing up data and maintaining those backups, and has written service-level agreements outlining which applications will come back online in certain time frames after a disaster. The engagement with CDW•G also resulted in a runbook that provides guidance for a number of different scenarios (such as how to issue paychecks if a disaster prevents human resources employees from coming to work). The runbook incorporates details down to the level of vendor phone numbers, ensuring that district officials have all of the information they need to navigate a crisis. 
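Service-level agreements of the kind described above can be kept as structured data, so the restore order and the results of a recovery drill can be checked mechanically. The sketch below is illustrative only: the `AppSla` type, the tier numbers and the hour targets are hypothetical and are not Beaverton's actual agreements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppSla:
    """Recovery target for one application (hypothetical structure)."""
    name: str
    tier: int           # 1 = restore first
    rto_hours: float    # target time to be back online after a disaster


def recovery_order(slas):
    """List application names in the order they should be restored."""
    ordered = sorted(slas, key=lambda a: (a.tier, a.rto_hours, a.name))
    return [a.name for a in ordered]


def missed_targets(slas, measured_hours):
    """Return applications whose drill result exceeded the target.

    An application with no measurement counts as missed, since an
    untested recovery cannot be assumed to meet its target.
    """
    return sorted(
        a.name
        for a in slas
        if measured_hours.get(a.name, float("inf")) > a.rto_hours
    )


if __name__ == "__main__":
    # Tier and hour values are invented for illustration.
    slas = [
        AppSla("email", tier=2, rto_hours=24),
        AppSla("payroll", tier=1, rto_hours=8),
        AppSla("student information system", tier=1, rto_hours=4),
    ]
    print("Restore order:", recovery_order(slas))

    drill = {"payroll": 6, "student information system": 5}
    print("Missed targets:", missed_targets(slas, drill))
```

Treating an unmeasured application as a missed target is a deliberate choice in this sketch: it pushes every application on the list into a drill instead of letting untested systems pass by default.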
Gadient notes that the school district’s initial crisis was triggered by an employee mistake, and says that the district’s new plans are designed to mitigate the impact of human error. “Beaverton thought they had disaster recovery in place already,” she says. “But what they learned is, because they didn’t follow best practices, they were harmed. Now, they’ve vetted their responsibilities and priorities. The next piece is practicing the plan with tabletop exercises. Those are the things that are going to be critical to their success, and those are the things that school districts often don’t think about.” 
“If there is a data center failure today, everybody in the district knows what needs to happen,” Arnce says. “Before, they were scrambling. Now, they know those tier-one applications are going to come up in that secondary data center, and they’ll be up and running.” 
Langford says the district is updating its disaster recovery and continuity of operations plan annually, and that staffers now feel prepared to execute it if necessary. 
“Hopefully, we’re never going to need it,” Langford says. “But the planning work was very valuable in increasing staff confidence that we’re going to be ready should something happen. If this happens again, I know I’m going to have my systems up, I know how fast we can do it, and I know what data we have that we need to recover.”
