Efforts at service reliability in today’s data centers (or in the cloud) are all about economics. What reliability can I get for what price? Or, more explicitly, what does it cost to make “service X” more reliable, more fault tolerant, and so on?
The good news is that with the advent of cloud (and the utter lack of SLAs from cloud service providers, arguably the single best thing AWS did for the industry) we are finally recognizing that we need to actually think about and accommodate failure in systems. I’d even go so far as to assert that most of Netflix’s open source street cred came from popularizing standard mechanisms to trace, monitor, and recover from issues. They didn’t invent patterns like circuit breaker, but they wrote solid code, socialized the patterns, and continue to be a group to follow as thought leaders in providing a low-cost, reliable solution built from broad, distributed, and unreliable components.
Most enterprises are years behind what Netflix is advocating. Tremendous numbers of systems that run in today’s enterprise data centers have little or no tolerance for failure, and frequently were architected assuming low-latency, reliable network connectivity and, at most, very rare hardware failures. Solutions have grown organically, and are very often subject to cascading failures.
When an enterprise does start to attack this problem (all of them are; they’re just at various stages of maturity), they’re usually starting from a dedicated operational organization that is rewarded based on adherence to SLAs, citing terms like “percentage uptime” and “mean time between failures” (MTBF). A few are stepping forward to where they need to go: analysis of what it costs to run the service.
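To make those SLA terms concrete, here is a minimal sketch of the arithmetic behind them. The function names are made up for illustration, and the mean time to repair (MTTR) is an added assumption, since MTBF on its own does not determine uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes per year a service can be down and still meet an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time spent working.

    mttr_hours (mean time to repair) is an assumption added for this sketch.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.0f} minutes of downtime per year")
```

A 99.9% target, for example, leaves a budget of a little under nine hours of downtime per year. The economic question is what each additional nine costs to buy.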
Nothing is 100% reliable, and we learned a while ago how to make reliable things out of unreliable parts: redundancy. We know how to add reliability through redundancy, from the transmission of data (Shannon’s noisy-channel coding theorem) to disks (RAID). When you add redundancy, you add cost in some form; hence the economics.
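As a rough illustration of that trade, the sketch below computes the availability of a service that stays up as long as at least one of its replicas is up, alongside what those replicas cost. It assumes replica failures are independent, which is optimistic whenever replicas share a host, rack, or power feed. The $3000 unit cost and the 99% single-copy availability are illustrative numbers, not measurements.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is up.

    Assumes failures are independent (an idealization).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - single) ** replicas


def redundancy_table(single: float, unit_cost: float, max_replicas: int):
    """Return (replicas, availability, total cost) rows for comparison."""
    return [
        (n, combined_availability(single, n), n * unit_cost)
        for n in range(1, max_replicas + 1)
    ]


if __name__ == "__main__":
    # Illustrative inputs: a 99%-available copy on a $3000 server.
    for n, avail, cost in redundancy_table(single=0.99, unit_cost=3000, max_replicas=3):
        print(f"{n} replica(s): availability {avail:.6f}, hardware cost ${cost:,.0f}")
```

Under these assumptions, cost grows linearly with the number of replicas while the remaining unavailability shrinks geometrically, so each extra copy buys fewer minutes of uptime per dollar than the one before it.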
In enterprise data centers today, we go to great lengths to make the most efficient use of our hardware. The efficiency gain, the “capex cost reduction”, is the win that VMware capitalized on so effectively. With virtualization technologies, we can slice that hardware up into smaller usable pieces, so instead of using a $3000 server for a single thing, we can host dozens, or even hundreds (if we’re willing to play the “overbook” game), of workloads on it. The same game is being played with containers, except with much lower overhead than you typically see with a hypervisor. While all this is great for the cost side of the equation, it aggregates the things that can fail together, and that can make reliability worse: when I accidentally bump the power cord out, a lot more things are impacted by that failure.
The common pattern for dealing with this is right back to redundancy. In general, you try to run redundant copies of a service across fault domains (the sets of things that fail together, such as a host, a rack, or a power circuit), not within them, and that’s where you get back into the mechanics of cost.
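Here is a minimal sketch of what “across, not within” means in practice. It places replicas round-robin over racks and then checks how many survive the loss of the worst-hit rack. The rack and host names are hypothetical, and a real scheduler would also weigh capacity, affinity, and cost.

```python
from collections import Counter


def spread_replicas(replica_count: int, hosts_by_domain: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Place replicas round-robin across fault domains, one replica per host."""
    domains = sorted(hosts_by_domain)
    total_hosts = sum(len(hosts) for hosts in hosts_by_domain.values())
    if replica_count > total_hosts:
        raise ValueError("not enough hosts for one replica per host")

    placement: list[tuple[str, str]] = []
    next_host = {domain: 0 for domain in domains}  # next unused host index per domain
    turn = 0
    while len(placement) < replica_count:
        domain = domains[turn % len(domains)]
        turn += 1
        index = next_host[domain]
        if index < len(hosts_by_domain[domain]):  # skip domains that are already full
            placement.append((domain, hosts_by_domain[domain][index]))
            next_host[domain] = index + 1
    return placement


def worst_case_survivors(placement: list[tuple[str, str]]) -> int:
    """Replicas still running after the most heavily loaded fault domain fails."""
    per_domain = Counter(domain for domain, _ in placement)
    return len(placement) - max(per_domain.values(), default=0)


if __name__ == "__main__":
    # Hypothetical inventory: three racks with three hosts each.
    racks = {
        "rack-a": ["a1", "a2", "a3"],
        "rack-b": ["b1", "b2", "b3"],
        "rack-c": ["c1", "c2", "c3"],
    }
    spread = spread_replicas(3, racks)
    packed = [("rack-a", host) for host in racks["rack-a"]]
    print("spread:", spread, "survivors:", worst_case_survivors(spread))
    print("packed:", packed, "survivors:", worst_case_survivors(packed))
```

Both layouts use three hosts, so the hardware cost is the same. The spread layout keeps two replicas running when any one rack fails, while the packed layout loses all three.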
If you dug up some company’s data center diagrams, they would look something like this:
A bunch of servers are wired up to a switch, and those switches connect in a hierarchy to another layer of switches. Typically the tribe of network administrators goes to great lengths to use redundancy in switches, but somewhere you draw the line. When you’re working with a dozen or more racks of machines, you’re starting to get enough scale that maybe it’s not worthwhile to have redundant network paths all the way down to the server. If you can spread your services across the fault zones defined by a switch, or by the circuit powering a rack and its switch, then the value of having that extra switch starts to fall away.
If you view those servers not as individual things to be managed, but as a resource pool for combined use, and you spread your services across that pool, you can get reliable services at lower cost.
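One way to see why pooling lowers the cost of reliability is to look at spare capacity. The sketch below assumes a pool sized to survive the loss of a fixed number of hosts, and assumes any spare can take over any workload, which is exactly what treating servers as a pool buys you. The function names and the $3000 unit cost are illustrative.

```python
def spare_overhead(hosts_needed: int, spares: int = 1) -> float:
    """Extra hardware, as a fraction of the working set, bought only to tolerate failures."""
    if hosts_needed < 1:
        raise ValueError("hosts_needed must be at least 1")
    return spares / hosts_needed


def cost_per_usable_host(unit_cost: float, hosts_needed: int, spares: int = 1) -> float:
    """Total hardware spend divided by the hosts that do useful work."""
    return unit_cost * (hosts_needed + spares) / hosts_needed


if __name__ == "__main__":
    for size in (2, 10, 100):
        overhead = spare_overhead(size)
        cost = cost_per_usable_host(3000, size)
        print(f"{size:>3} hosts + 1 spare: {overhead:.0%} overhead, ${cost:,.0f} per usable host")
```

With one spare, a two-host pool carries 50% overhead, a ten-host pool 10%, and a hundred-host pool 1%. The same protection against a single host failure gets cheaper per unit of useful capacity as the pool grows.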
What this leads to is wins from economies of scale, and frankly choices to be made by enterprises about how to size and manage resource pools. A smaller infrastructure is going to have a harder time providing reliable services cheaply than a large one. And as you drive the cost down for each “unit of computing”, what you’ll find is that there’s more cost in the humans managing the system than in the systems themselves.
Those people are critical because our technology is still immature at providing “fair use” of resources, which ends up being most of the overhead (and the source of demands for isolation and dedicated resources) in large enterprises. It’s far too easy to have one business unit’s usage of a system directly impact another’s. The most common case of this has been network contention (I can tell you quite the series of stories about seeing what March Madness and ESPN did to the networks when working at Disney), and with the shift to make compute and storage more common pools, those same issues will (and do) exist for compute, memory, and persistent storage.