The Future of Service Mesh Is Networking

Key Takeaways

  • Networking functionality, as seen with TCP/IP and now eBPF, moves from userspace into the kernel and will continue to do so.
  • Service mesh began as a way to manage layer 7 traffic between microservices but can be missing context from other layers.
  • People want the best of both worlds: “traditional networking” providing observability, context, and control at Layers 3-4, and service mesh handling API communication at Layer 7.
  • Service mesh becoming just another part of networking brings significant advantages through standardization, growing integrations and ecosystems, and easier operations.
  • The future of service mesh is as a networking feature, not a product category.

Service mesh has long been hyped as the future of cloud native, enabling new features like canary deployments, failover, and mTLS alongside “traditional” networking features like traffic routing, observability, error handling, and troubleshooting.

Service mesh has promised to help turn your network security, service discovery, and progressive delivery practices (e.g., blue/green and canary deployments) into a self-service interface for developers. However, when we move away from the marketing hype to the actual implementation, we find a different story.

In the most recent CNCF service mesh survey, the most significant hurdles to adoption are a shortage of engineering expertise and experience, architectural and technical complexity, and a lack of guidance, blueprints, and best practices.

To help understand new technology, it is often easier to relate the new paradigm to existing features and functionalities to create a common vocabulary and starting point. In the case of service mesh, that is networking.

If we think of service mesh as the networking layer of distributed computing, we can learn a lot from how traditional networking was built, implemented, and adopted.

With that as a jumping-off point, I’ll dive into our experience building Cilium Service Mesh and what it has taught us about how the networking stack will evolve as service mesh becomes embedded in it.

On this journey, we will discover that, to quote David Mooter, “The future of service mesh is as a networking feature, not a product category, as far out of sight and mind from developers as possible—and that is a good thing.”

Figure 1: Service Mesh will become just another part of the networking stack

TCP/IP Working and Winning from Implementation to Kernel Space

Before we dive into the future, let’s first jump into the past to understand how distributed networking came to be. The largest and most famous example of distributed networking is TCP/IP. While we may take it for granted today, that was not always the case. 

Before it became the backbone of the internet, TCP/IP began as a side project at DARPA. During the “Protocol Wars” (TCP/IP vs. OSI), engineers, companies, and even nations debated which protocol suite should connect computer networks. OSI was “standardized” by many organizations, including the DoD, but as more companies began interconnecting networks, TCP/IP quickly became the emergent standard because it was already implemented (including an open source implementation in BSD) and could be used immediately. People prioritized technology that worked, choosing what they could deploy today over the “chosen” solution.

These competing userspace implementations eventually gave way to an implementation of the TCP/IP stack in the kernel. Nowadays, you would be crazy to give most operations teams a kernel without TCP/IP included because it is seen as a basic part of networking that everyone can rely upon. Because it runs across billions of devices, the Linux kernel implementation has seen and fixed the edge cases you would hit if you rewrote TCP/IP in userspace. Once again, people choose technology that works and that they understand.
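To make the point concrete, here is a minimal sketch of why nobody writes their own TCP anymore: an ordinary socket program gets the kernel’s entire TCP implementation, including handshakes, retransmission, and congestion control, simply by asking for a stream socket. Everything application-specific fits in a few lines.

```python
import socket
import threading

def echo_once(server: socket.socket) -> None:
    """Accept one connection and echo its payload back."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# Opening a SOCK_STREAM socket hands the hard parts (handshakes,
# retransmission, congestion control) to the kernel's TCP stack.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=echo_once, args=(server,)).start()

client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"ping")
reply = client.recv(1024)  # the kernel handled all TCP mechanics for us
client.close()
server.close()
```

Neither side implements sequencing, acknowledgments, or retries; the decades of hardening in the kernel stack come for free.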

Even for specialized use cases, like latency- or performance-sensitive workloads, implementing your own TCP/IP stack has lost its steam. You can see this progression in Cloudflare’s blog as they go from writing about kernel bypass to “Why we use the Linux kernel’s TCP stack”: “The Linux TCP stack has many critical features and very good debugging capabilities. It will take years to compete with this rich ecosystem. For these reasons, it’s unlikely userspace networking will become mainstream.”

People, especially for critical applications like networking, consistently want reliable technology that works and works everywhere.

Figure 2: TCP/IP wins because it is implemented

eBPF Superpowers for Kernel Networking

Cloudflare didn’t stop with just using the Linux TCP/IP stack. “eBPF is eating the software” seems to be the new motto. And they aren’t alone. Since 2017, all traffic going into a Facebook data center has been going through eBPF, and the majority of traffic at Google does too. In fact, a whole host of companies are using eBPF in production today. If we just stated that people want reliable technology, why would they turn to eBPF for their networking needs?

Because of the broad adoption of the Linux kernel across billions of devices, making changes, especially to core functionalities like networking, is not taken lightly. Once a change is made, it can still take years for vendors to test and adopt new kernel versions and for end users to put them into production. However, traditional networking technologies like iptables don’t scale to meet the needs of dynamic cloud native environments.

Google, Facebook, Cloudflare, and others turned to eBPF because it helped solve their networking problems at scale. From increasing packet throughput to DDoS mitigation and continuous profiling, eBPF allowed them to add the functionality they needed to kernel networking in almost real time. Instead of hand-crafting server names and iptables rules, they could now serve billions of users simultaneously and mitigate a 26 million request-per-second DDoS attack. eBPF helped solve the challenges of scaling kernel networking, and it is now being broadly adopted, with Cilium, an eBPF-based Container Network Interface (CNI), becoming the default CNI for major cloud vendors.

eBPF has become the new standard for kernel networking because it brings superpowers that allow the network to dynamically add new features and scale out to meet users’ needs.

Figure 3: Networking moves to the kernel

Service Mesh Today

So far, we have shown that people adopt technologies because they solve an immediate need, not because of a top-down mandate. This naturally leads to the question: what problem is service mesh solving?

Very simplified, a service mesh is the equivalent of a dynamic linker, but for distributed computing. In traditional programming, using another module means importing a library into your IDE, and when the program is deployed, the OS’s dynamic linker connects it with the library at runtime, handling library discovery, security validation, and connection establishment.

In cloud native environments, your “library” is now a network hop to another microservice. Finding the “library” and establishing a secure connection is the job of the service mesh. And just as the dynamic linker keeps a single copy of each library per machine, the service mesh needs only one Envoy or eBPF program per node. Development and operations teams never have to think about the dynamic linker, so why should they have to care about a complicated service mesh?
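The analogy can be sketched in code. Where a linker maps a library name to a file on disk, a mesh maps a logical service name to a network endpoint. The sketch below shows only the discovery step, using plain DNS (in a real mesh this would be DNS or an xDS API, plus mTLS identity checks and load balancing, which are elided here); the service name is whatever your platform assigns.

```python
import socket

def discover(service: str, port: int):
    """Resolve a logical service name to a concrete endpoint, much as a
    dynamic linker locates a shared library by its name."""
    # In a mesh, this lookup is DNS or an xDS API; plain DNS here.
    family, _, _, _, sockaddr = socket.getaddrinfo(
        service, port, type=socket.SOCK_STREAM)[0]
    return sockaddr

# The mesh's remaining jobs, identity verification (mTLS) and load
# balancing across replicas, are deliberately left out of this sketch.
endpoint = discover("localhost", 8080)
```

The point of the analogy holds: the caller names what it needs, and the infrastructure, not the developer, figures out where it lives and how to bind to it.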

Service meshes became first-class infrastructure because they helped solve many initial complications with dynamic distributed computing environments on the application layer. Still, they aren’t always ready for Day 2. 

Even without revisiting the problems laid out in the survey above, service meshes have additional issues that can only be solved in the networking layers. Since services usually speak application-layer (OSI Layer 7) protocols like HTTP, observability data is generally restricted to that layer. However, the root cause may be in another layer, one that is typically opaque to Layer 7 observability tools. The service mesh you rely on to connect your infrastructure suddenly can’t see what is happening. This must change.

Embedding Service Mesh into Networking

Figure 4: Moving service mesh to the kernel

Service mesh was created to address the networking gaps in the cloud native world, but as discussed above, it also has its own gaps. What we want now is the best of both worlds: “traditional networking” giving us observability, context, and control at Layers 3-4, and service mesh handling API communication at Layer 7.

Luckily, we are already seeing these two layers come together, and with eBPF, this trend is only accelerating. Connecting service mesh to networking will give us full context and control across all networking layers, improving setup, operations, and troubleshooting. Let’s dive into real-world examples of how this is happening with Istio and Cilium Service Mesh.

In Istio

Istio is working its way from service mesh down into networking. With Merbridge and Ambient Mesh, Istio is adding networking capabilities and reducing its reliance on sidecars.

Merbridge accelerates the networking of Istio by replacing iptables with eBPF. “Using eBPF can greatly simplify the kernel’s processing of traffic and make inter-service communication more efficient.” Rather than relying on the sidecar’s networking stack, Merbridge uses eBPF to pass packets from sidecar to pod and from sidecar to sidecar without traversing the full networking stack. Merbridge accelerates networking by moving it from the service mesh to the kernel.

Istio also recently launched Ambient Mesh, a sidecarless service mesh that aims to eliminate the operational complexity, resource underutilization, and traffic disruption that sidecars introduce. Ambient Mesh moves L7 processing to per-namespace “waypoint proxies” and uses Layer 4 processing whenever possible. This pushes “service mesh functionality” down into lower networking layers, relying on the actual service mesh proxy only when there is no alternative. Folding service mesh into networking reduces resource and operational overhead.

In Cilium Service Mesh

We saw this same trend of service mesh features moving into networking while working on Cilium. Cilium’s “first service mesh” was just an integration with Istio. Cilium did the CNI Layer 3-4 networking and network policy, while Istio handled Layer 7 functionality. From this experience, we realized that Cilium has more context about what is actually happening in the network and better primitives for controlling it, since it runs in the kernel and sees everything happening on the machine.

At the same time, end-user requests drove Cilium to add features usually associated with a service mesh, like multi-cluster mesh and egress gateways, because networking teams saw them simply as ways to connect applications. When we looked around, we realized that Cilium had already built 80% of a “service mesh”; these were just normal parts of networking that Cilium already included.

The only pieces missing from the kernel were more advanced Layer 7 functionalities, like HTTP parsing and mTLS, which are currently delegated to userspace. However, eBPF can move even more of these features into the kernel. One of the next releases of Cilium will include mTLS, and there are multiple implementations of HTTP parsing with eBPF in the wild. Not everything can be implemented in eBPF, and Cilium will continue to ship with Envoy for a subset of Layer 7 use cases, but with eBPF, many service mesh features have now become table stakes for networking everywhere.
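To illustrate how L7 features surface alongside ordinary networking in Cilium, here is a sketch of a CiliumNetworkPolicy that combines L3/L4 rules (which pods, which port) with an HTTP-aware rule in one object. The policy, label, and path names are hypothetical; the structure follows the CiliumNetworkPolicy format.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-api        # hypothetical policy name
spec:
  endpointSelector:
    matchLabels:
      app: backend                # L3/L4: which pods this policy protects
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend         # only the frontend may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:                 # L7: enforced via the proxy Cilium manages
              - method: "GET"
                path: "/api/v1/.*"
```

The team writing this policy never has to think about where the L4 filtering ends and the L7 parsing begins; that split is an implementation detail of the networking layer.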

Figure 5: eBPF Service Mesh from Layer 1 – Layer 7

How the Networking Stack Will Evolve with Service Mesh Embedded

With service mesh now becoming a part of networking, the combined story brings three key advantages: standardization, integration, and easier operations.

First, by bringing service mesh and networking together, we can standardize and propagate context across layers. We are already beginning to see this in Kubernetes with the Gateway API. Similar to OCI, which standardized container runtimes and made them interchangeable, the Gateway API provides a standard way to do networking in Kubernetes. Standardization lets consumers choose the best implementation for themselves and, in this case, makes it easier to combine networking and service mesh into one connectivity layer.
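As a sketch of what that standard surface looks like, here is a Gateway API HTTPRoute that expresses a canary traffic split, one of the service mesh features mentioned at the start of this article, in a vendor-neutral way. The route, gateway, and service names are hypothetical; any conformant implementation, mesh- or eBPF-based, could execute it.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: checkout-canary           # hypothetical route name
spec:
  parentRefs:
    - name: main-gateway          # hypothetical Gateway this route attaches to
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-v1
          port: 8080
          weight: 90              # 90% of traffic to the stable version
        - name: checkout-v2
          port: 8080
          weight: 10              # 10% to the canary
```

Because the resource is standard, swapping the underlying implementation does not change how teams describe their traffic.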

Second, once standardization occurs, an ecosystem of integrations can begin to grow. Ecosystems are great for adding new or additional functionality because they provide consistent integration points. More importantly, standardization facilitates integration into legacy workload environments, especially with networking. From our experience working with customers, enterprises need enterprise networking. That means additional protocols and complex networking topologies beyond just simple HTTP requests. Being able to integrate into these environments with varying requirements is what will make service mesh continue to be relevant for the future of networking.

Finally, nobody likes debugging networking. Why would you split it into two separate layers with both of them—and the interface between them—needing to be debugged when things go wrong? Combining service mesh and networking will make running, operating, and debugging easier because all of the context and issues will be in one place rather than leaking across abstractions. I think we can safely say that the management and operations of platform connectivity will fall squarely with the platform and networking teams, and they will want something that just works. Being able to tell a complete connectivity story at every layer of the network will be critical for the combined future of networking and service mesh.

While the best way to combine service mesh and networking is still up for debate, many exciting opportunities lie ahead once it happens. Once we standardized containers with OCI, we could create new runtimes, publish container images to registries for discoverability, and orchestrate them in production. Running a container today isn’t seen as unique; it is just another part of computing.

I’m looking forward to the same thing happening with service mesh. It will just be another part of the network, allowing us to have a complete connectivity story from Layer 1 to Layer 7 for microservices, applications, and workloads—and it will just work. Service mesh has a bright future as a part of the network, as far out of sight and mind from developers as possible—and that is a good thing.