In a shining example of integrity, transparency and customer service, 365 Main, a data center company that promises extremely high levels of availability, has published details of a serious power failure that took out service to over 40% of its San Francisco colocation clients for as much as 45 minutes. The diary of events describes the frantic investigative engineering work required to analyze and resolve a problem in the backup power systems, finally traced to a timing issue (one of the nastiest forms of software bug!) in a PLC (Programmable Logic Controller) subsystem - a type of Supervisory Control and Data Acquisition (SCADA) component - that failed to clear its memory reliably when the diesel generator control units reset. Although I'm not a SCADA security expert, the fact that the failure occurred only after a number of set/reset events sounds like a memory leak or buffer overflow problem to me, but then I'm reading another textbook about software security testing at the moment so it's on my mind.
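To illustrate the kind of thing I mean (and this is purely a toy sketch in Python - I have no inside knowledge of how Hitec's PLC actually works, and every name, number and behaviour here is my own invention), consider a controller that accumulates a fault counter across surge events and only clears it if a reset completes cleanly:

# Toy sketch of a 'stale state survives reset' bug - all names, thresholds and
# behaviour are hypothetical, not taken from the 365 Main / Hitec incident.
class GeneratorController:
    LOCKOUT_THRESHOLD = 3            # assumed trip level, for illustration only

    def __init__(self):
        self.fault_count = 0
        self.locked_out = False

    def surge_detected(self):
        self.fault_count += 1
        if self.fault_count >= self.LOCKOUT_THRESHOLD:
            self.locked_out = True   # unit refuses to start on demand

    def reset(self, completed_cleanly=True):
        # The bug: stale memory is only cleared if the reset completes within
        # its timing window; a hurried or interrupted reset leaves it intact.
        if completed_cleanly:
            self.fault_count = 0
        self.locked_out = False

ctrl = GeneratorController()
for _ in range(3):                        # rapid sequence of surges...
    ctrl.surge_detected()
    ctrl.reset(completed_cleanly=False)   # ...each followed by a rushed reset
ctrl.surge_detected()
print(ctrl.locked_out)                    # True: locked out despite the resets

Each individual surge is survivable, but because the counter quietly survives the resets, the unit falls over only when several surges arrive in quick succession - exactly the sort of 'unique event' that routine testing is unlikely to reproduce.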
In the course of explaining the failure, the company outlines the design of its "N+2" standby power system using ten 2.1MW diesel generators, two of which are backups in case of maintenance or failure of the remaining eight. This level of investment in the power systems is evidently sufficient to deliver 99.99% availability ("four nines") in an area subject to "dozens of surges and utility failures" during the last five years, although it is patently insufficient to reach five nines. Close but no cigar.
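A quick back-of-the-envelope calculation (my arithmetic, not theirs) shows why: four nines leaves a budget of roughly 53 minutes of downtime per year, five nines barely 5 minutes, so a single 45-minute outage all but consumes the four-nines allowance on its own.

# Downtime budgets implied by the availability figures (simple arithmetic)
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in [("four nines", 0.9999), ("five nines", 0.99999)]:
    allowed = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: about {allowed:.0f} minutes of downtime per year")

# four nines: about 53 minutes of downtime per year
# five nines: about 5 minutes of downtime per year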
Describing the rapid sequence of five power surges as a "unique event" implies that they had not previously tested the power systems under the specific conditions that led to the failure. This is known as Sod's Law or Murphy's Law, I'm not sure which. The preventive maintenance and testing regime looks reasonable by most standards, i.e. "preventative maintenance logs on the Hitec generators are currently available for customer review. All generators in San Francisco pass weekly start tests and monthly load tests where diesels are started and run at full load for 2 hours. Both of these tests simulate a loss of utility and the auto start function is accurately tested." That said, if I were advising them [which I am not!], I would probably suggest running occasional on-load tests for much longer - perhaps 24 to 48 hours or more - to confirm that the diesel tanks, pumps/valves and pipes are clear, to prove their capacity for exceptionally long outages, and to refresh the diesel in the tanks. One of our clients experienced a backup generator on-load failure due to a blockage between the diesel header tank and the main diesel tank: the header capacity was sufficient for short on-load tests but not for a multi-hour power failure.
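Some rough sums make the point (the figures are my own ballpark assumptions, not anything from 365 Main, Hitec or our client):

# Illustrative only - assumed figures, not from the incident report
fuel_burn_l_per_hour = 525     # plausible full-load burn for a ~2.1MW diesel
header_tank_litres = 2000      # assumed header/day tank capacity

hours_on_header_alone = header_tank_litres / fuel_burn_l_per_hour
print(f"Header tank alone lasts about {hours_on_header_alone:.1f} hours at full load")
# about 3.8 hours: a 2-hour load test passes comfortably even if the transfer
# line from the main tank is blocked, but a longer outage would expose it

In other words, a test that never drains the header tank tells you nothing about whether fuel can actually be transferred from the main tank when it matters.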
Reading between the lines of the diary a little, it looks as if the company had 'full and frank exchanges' with senior management at Hitec, the supplier of the no-break diesel generators and controls. The fact that they name the supplier is perhaps indicative of a chill in the business relationship, but could equally imply their confidence in the way the supplier responded to the incident.
Anyway, this is all fascinating and will probably form the basis of a case study in our forthcoming awareness module on physical security and environmental services for IT, due for release in October, or perhaps in a later, as-yet-unplanned module on application security. As with this month's case study based on the ongoing Ferrari-McLaren spying incident, real-world cases often make more convincing classroom assignments. The trick is to summarize and crystallize the key factors into a format suitable for discussion.