Distributed service platforms largest challenge: network interoperability

The biggest challenge to introducing technologies to an organization today is all about making them available to the end users. Underneath the layers of software, and even hardware, it’s the network – and how technologies plug into that network is a constant challenge. I doubt it’s a surprise to anyone deeply involved in this space and deploying things like OpenStack, Docker Swarms, or Hadoop/Storm clusters.

The dark reality is that network interoperability is really crappy. Seriously, stunningly bad. Even with standards, the reality of vendor interoperability is troublesome and competing standards makes it even worse. Just within the space of container orchestration and exposing that across multiple/many physical servers, what you choose for networking is hugely impactful. Calico, Weave, Flannel, or bridging, possibly with OpenVSwitch, are all pushing for attention and adoption, with the goal of becoming a defacto standard. Individually these are all good solutions – each has their bonuses and draw backs – but the boundary to something else, the interface, is ultimately the problem.

Implementing one of these against an existing environment is the trouble point. It’s not just container platforms that drive this – many modern distributed system solutions have this same issue: OpenStack, Hadoop, Storm, in addition to the Docker, Marathon, and Kubernetes platform sets. Even pre-engineered solutions that are at the multiple-machine or rack scale struggle with this. There is not a tried and true method, and due to the ongoing complexities with just basic vendor network interoperability, it’s likely going to be a pain point for a number of more years.

The most common patterns I’ve seen to date try and make a router or layer-3 switch the demarkation point for a service. Define a boundary and (depending on the scale) a failure zone to make faults explicit, leveraging scale to reduce the impact. The critical difficulty comes with providing redundant and reliable access in the face of failures.

Configurations may include VLANS routed internally, and static IP address schemes – maybe based on convention, or maybe just assigned. Alternatively (and quite popularly), internal services are set up with an expectation of their own, private controlled network and the problem scope reduced to boundary points where egress and ingress are handled to at least restrict the “problem space”.

When this is done with appliances, it is typically resolved to reference configurations and pre-set tweaks for the most common network vendors. Redundancy choices move with the vendor solutions to handle the specifics needed to validate full interoperability – from MLAG setups with redundant switches to (this one always gets hated on, with good reasons) spanning tree to more recent work like Calico which interact directly with router protocols like BGP.

Watching this pain point for the last 5-6 years, I don’t see any clear and effectively interoperable solutions on the horizon. It is going to be a lot of heavy lifting to vet specific scenarios, against specific vendor implementations. I’m hopeful that the effort going into container runtime networking will help drive better interoperability. If nothing else, the push of private clouds, and now containers, are driving a lot of attention to this problem space. Maybe it will lie in the evolving overlay networks, the BGP protocols interop (like Calico), or the equivalents in the IPv6 space which only seem to be taking hold at the largest enterprises or government setups.