I understand the outage was caused by a technical issue in the network - something to do with the BGP configuration. I'm not particularly interested in, and probably wouldn't even understand, the details. The selfsame issue locked Facebook's IT administrators out of their own systems, leaving them cut off and unable to address/reverse/fix the issue for several hours, causing mild panic and a little outrage among Facebook's users, customers and other stakeholders. The same issue took down related websites too. Doubtless the admins were stressed out, possibly frantic, while their managers were unimpressed.
I'm bringing it up here to point out a lesson for all other organisations, not just those reliant on remote system admin.
If the network access is broken and unavailable, for whatever reason, remote admin is also broken and unavailable. That's screamingly obvious to all of us now with 20/20 hindsight thanks to the Farcebook Fiasco, and clearly an issue worth addressing by organisations that use and rely on remote system/network/app/IT admin, of which I'm sure there are many. I'm told that cloud is in, and the Interwebs are quite useful.
Less obviously, the incident is a neat reminder that foresight is even more valuable - more specifically, information risk management. Regardless of the nature of the technical issue and the preceding activities that sparked the outage, single points of failure are a class of vulnerability well worth identifying and addressing, especially for anything important. The solution is known as defence-in-depth, an approach employed universally by living organisms - except, it seems, Facebook IT people.
As to how they might have mitigated the risks, there are several possible means of administering network systems aside from remote access through the same network. I'm not even going to attempt to list them. Go ahead, Google if you care.
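That said, the fallback principle itself fits in a few lines. Here's a toy sketch (the channel names are made up purely for illustration, not a claim about Facebook's actual setup): try each independent admin access path in turn, so that losing the in-band network path alone doesn't lock you out.

```python
# Toy illustration of defence-in-depth for admin access: several
# independent channels, tried in order, so no single failure cuts
# you off. Channel names are hypothetical.

def connect_via(channel):
    """Stand-in for a real connection attempt; returns True on success.

    Here we simulate the outage scenario: the in-band network path is
    down, but the independent channels still work.
    """
    working = {
        "in_band_network": False,    # down during the outage
        "out_of_band_serial": True,  # separate console server / modem
        "onsite_console": True,      # someone physically at the rack
    }
    return working.get(channel, False)

def get_admin_access(channels):
    """Try each independent access path in order; first success wins."""
    for channel in channels:
        if connect_via(channel):
            return channel
    return None  # every layer failed - a genuine single point of failure

path = get_admin_access(["in_band_network", "out_of_band_serial", "onsite_console"])
print(path)  # the in-band path fails, so an independent path is used
```

The point of the sketch is simply that the fallback channels must not depend on the thing that just broke: an out-of-band path that rides over the same network is no fallback at all.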
There are myriad ways that information services may be interrupted - some deliberate, many accidental, inadvertent or due to natural causes. It's simply impracticable to attempt to identify and deal with them all individually, hence the value of a much more generalised approach to specifying, achieving, maintaining and being confident in the required availability. It's called resilience, a natural complement to contingency planning, both of which are parts of the nebulous approach called business continuity management.
That's more than enough waffle from me. If you get it, great. If not, well I just hope I'm not reliant on you for anything important.
Thanks, Facebook, for demonstrating How Not To Do IT.