Small Leak Teaches Big Lessons
- Managing Editor
Published June 1, 1991 | June 1991 issue
- Lesson 1: Standardized, high-tech operations outperform others that combine various modes.
- Lesson 2: While certain bank operations may be designated as "critical" in a disaster recovery plan, over time, all operations are critical, and the important thing is to resume "business as usual."
- Lesson 3: You are never as prepared as you think you are.
In the exhilarating, yet sobering, aftermath of the Minneapolis Fed's April 8 flood crisis, those three lessons loom large.
When a water pipe burst above the bank's third-floor computer site on that early Monday morning, the ensuing downpour set in motion a disaster recovery operation unparalleled in the Federal Reserve System. Within minutes of the deluge, the decision was made to transfer computer operations to the Federal Reserve System's backup site in Culpeper, Va.; and by 3 a.m., employees began arriving at the bank with their bags packed for an early-morning charter flight. Other employees quickly moved to establish local off-site operations at the Postal Data Center in Bloomington.
At the start of business that Monday, just a few hours after the bank's computer mainframe was deemed inoperable, about 50 employees were stationed at the Postal Data Center, six were in Culpeper to work with the backup computer, and other departments affected by the flood were restationed throughout the bank. By noon that same day, 10 hours ahead of the disaster-plan schedule, electronic wire service was fully restored. And that was the simple part.
The hard work of disaster recovery (for example, establishing efficient communication links with financial institutions and addressing the pressing needs of "non-critical" bank functions) would be completed in the ensuing days.
High-Tech Gets High Marks
"Getting the mainframe up and running according to plan was probably the easiest part of the recovery," says Colleen Strand, Minneapolis Fed chief financial officer and senior vice president in charge of disaster recovery. And, in the beginning, it was also the most important part. In this age of electronics, when the financial services industry is increasingly reliant on electronic data processing, a Fed bank's computer mainframe is the heart of the institution.
Every day the Minneapolis Fed moves about $10 billion electronically through its wire transfer and automated clearing house (ACH) system, which enables companies to validate transactions, and allows automatic deposits and bill-paying for consumers. Also, banks use electronic services to manage their reserve funds and to balance their daily interbank accounts.
As it happened, the timing of the flood could not have been much better. During the early morning hours of Monday, the mainframe computer completes ACH work from the previous Friday and has not yet begun the heavy load of a new business day. Had the accident occurred at 2 p.m. Monday, for example, there would have been more problems, according to Thomas Kleinschmit, assistant vice president for Electronic Payments and Network Services.
Still, those ACH files from April 5 that were "in the hopper" when the computer shut down posed particular problems that took about a week to remedy. Some files were lost because there was no time to back up the transactions; ACH staff then had to methodically replay the transactions from April 5 with each financial institution to ensure that every item was properly accounted for.
And, as Kleinschmit says: "ACH is a delicate balance in normal times, let alone when something like this happens. ACH is complex, very complex." For example, each ACH file can contain thousands of payment transactions that must be individually processed; the Minneapolis Fed handles over 12 million such items every month.
"ACH is really resource intensive, computer intensive," Kleinschmit says, a point that was greatly emphasized during the flood recovery. And that reliance on computer technology goes beyond the Minneapolis Fed and extends to the financial institutions that use the bank's electronic network services. While some institutions receive their financial information from the Fed on paper copies or magnetic tape, many rely on electronic services. Aside from the lost files of April 5, the maintenance of computer links with financial institutions proved to be the most enduring problem during those first days of the recovery.
In order for some financial institutions to use the Fed's wire transfer service during the initial stage of the recovery, they had to dial in on specially leased phone lines that allowed just one caller at a time; this meant that institutions had to form daily queues.
Also, according to Strand, some transmission troubles existed because of the Fed's policy of allowing financial institutions to use a variety of computer and peripheral equipment, like modems and printers, to communicate with the Fed. That meant that the Fed had to scramble to establish special links with individual institutions, which was a time-consuming proposition for the already over-worked technical staff. In other words: those financial institutions with the most up-to-date and standard equipment fared much better during the recovery period.
Some Federal Reserve banks only allow their financial institutions to use one standard set of equipment, and they require those institutions to test the equipment on a regular basis, according to Susan Mendesh-Fitzgerald, disaster recovery planning manager. Those who don't have the optimal equipment and who don't test are given a low priority in the disaster recovery plan.
"We've never been that aggressive here," Strand says.
"We want to please our customers and they don't want to have to buy new equipment and new technology, so we've accommodated them. In a disaster situation, however, you find out that that policy can cause problems. As long as you don't have a disaster, your customers are happy."
Strand says the current policy may be reevaluated as the bank reviews the recent events and fine-tunes its disaster recovery plans. She is also quick to say that not all financial institutions experienced problems during the recovery period. In fact, with the fast start-up time of the Culpeper mainframe and the efficient links with some institutions, Strand says that many institutions experienced no disruption of service and only became aware of the flood after they were directly informed by the Fed.
An Intricate Web of Computer Reliance ...
While the emphasis of disaster recovery has traditionally been on the immediate resumption of the data center and the critical functions like Electronic Network Services and certain Accounting areas, the current crisis stressed the need to also prepare for the resumption of other bank functions. When the main computer of a large corporation goes down, there are many jobs that are affected. For example, data for certain Research publications was unavailable during the initial stage of recovery, and the bank's Supervision Department had to borrow the computer resources of another Fed bank in order to establish linkage with the computers of the Federal Reserve Board in Washington, D.C.
Ironically enough, at the time of the flood, the bank had just begun work on a more comprehensive disaster recovery plan. As Strand explains, there is a distinct difference between recovering a data center and recovering the ability to conduct day-to-day business. "The difficult thing about disaster recovery is resuming your business, making sure your customers are connected, that information is flowing, that what are considered non-critical functions are up and running. That's where disaster recovery literature is beginning to focus: on resuming business."
Strand says the Minneapolis Fed quickly realized the need for an adequate business recovery plan. For example, while the Culpeper mainframe was quickly engaged on the first day, by the second day some financial institutions were calling for records of their recent transactions, and those records weren't immediately available; by the fifth day, those institutions were still calling.
... Stresses the Need for Thorough Planning
Last year the Minneapolis Fed distributed a booklet to its customers, or financial institutions that use Fed services, that spelled out the steps they should take in the event of an emergency that disrupted connections to the Fed's critical electronic services. Following the flood of April 8, very few institutions used the booklet as a resource or had any sort of plan to deal with such a contingency. In the early days of the recovery the resultant phone calls from confused financial institutions swamped much of the Fed staff.
In review, Strand says, the Fed's disaster preparedness planning fell short: "What we had tested, to be very blunt about it, is whether we could get the data center up. What we hadn't tested was whether we could resume and sustain operations in a disaster mode for several days. We took our testing seriously but we now know that it never went far enough. Of course, you can test all you want and people still run on instinct much of the time. As it happened, we did well despite the limited testing."
Doug Fleming, vice president at the Kansas City Fed and district recovery manager for the Federal Reserve System, agrees that the bank did well. "All in all, my assessment is very positive," he says. "They handled the recovery very well."
Every Federal Reserve District bank has its own recovery plan, Fleming says, but each plan is approved by a Systemwide group of first vice presidents. And even though each bank has its own plan, the entire system will learn from Minneapolis' experience, he says. Since the Federal Reserve System converted its computer site in Culpeper, Va., into a backup system in 1984 (it formerly housed a System communications network), the Minneapolis Fed is the first bank to use its services on anything but a test basis.
Fleming says that officials from all Fed district banks have been taking notes on Minneapolis' situation and that, eventually, the flood crisis will serve as a learning model for the entire System. "We can practice, test and plan all we want, but when we actually have to use a recovery plan we learn the most," he says.
'Staff is Everything'
From the moment the first gush of water poured through the third-floor ceiling and four computer workers responded by quickly covering the equipment with plastic, until weeks later when the last question from a perplexed customer was answered, employee response to the flood has been critical to the bank's recovery, bank officials say.
"Staff is everything," Kleinschmit says. "You may have all the computers in the world, but your staff is your major card. And this staff rose to the occasion."
Employees in many departments worked extended hours during the first few weeks of recovery (as many as 18 to 20 hours per day during the initial stages), and they got together to help solve child-care and transportation problems. Also, employees who were not directly affected by the crisis got involved by volunteering to make phone calls to Ninth District financial institutions to provide periodic updates.
All of which, perhaps, suggests a final important lesson:
- Lesson 4: People are the key.