- eBPF is a tool meant to improve performance by (carefully) allowing some user code to run in the kernel.
- The layer 7 processing needed for service meshes is unlikely to be feasible in eBPF for the foreseeable future, which means that meshes will still need proxies.
- Compared to sidecar proxies, per-host proxies add operational complexity and reduce security.
- Typical performance complaints about sidecar proxies can be addressed by smaller, faster sidecar proxies.
- For now the sidecar model continues to make the most sense for service mesh.
Stories about eBPF have been flooding the cloud-native world for a bit, sometimes presenting it as the greatest thing since sliced bread and sometimes deriding it as a useless distraction from the real world. The reality, of course, is considerably more nuanced, so taking a closer look at what eBPF can and can’t do definitely seems warranted – technologies are just tools after all, and we should fit the tool we use to the task at hand.
One particular task that’s been coming up a lot recently is the complex layer 7 processing needed for a service mesh. Handing that off to eBPF could potentially be a huge win for service meshes, so let’s take a closer look at that possible role for eBPF.
What is eBPF, anyway?
Let’s get the name out of the way first: “eBPF” was originally the “extended Berkeley Packet Filter”, though these days it doesn’t stand for anything at all. The Berkeley packet filter goes back nearly 30 years: it’s technology that allows user applications to run certain code – very closely vetted and highly constrained code, to be sure – directly in the operating system kernel itself. BPF was limited to the network stack, but it still made some amazing things possible:
- The classic example is that it could make it dramatically easier to experiment with things like new kinds of firewalls. Instead of constantly recompiling kernel modules, just make some edits to your eBPF code and reload it.
- Likewise, it can open the door to easily develop some very powerful kinds of network analysis, including things that you really wouldn’t want to run in the kernel. For example, if you want to do classification of incoming packets using machine learning, you could grab packets of interest with BPF and hand them out to a user application running the ML model.
There are other examples: these are just two really obvious things that BPF made possible[1], and eBPF took the same concept and extended it to areas beyond just networking. But all this discussion raises the question of why exactly this kind of thing requires special attention in the first place.
The short answer is “isolation”.
Computing – especially cloud-native computing – relies heavily on the hardware’s ability to simultaneously do multiple things for multiple entities, even when some of the entities are hostile to others. This is contended multitenancy, which we typically manage with hardware that can mediate access to memory itself. In Linux, for example, the operating system creates one memory space for itself (kernel space), and a separate space for each user program (user space collectively, although each program has its own). The operating system then uses the hardware to prevent any cross-space access[2].
Maintaining this isolation between parts of the system is absolutely critical both for security and reliability: basically everything in computing security relies on it, and in fact the cloud-native world relies on it even more heavily by requiring the kernel to maintain isolation between containers, as well. As such, kernel developers have collectively spent thousands of person-years scrutinizing every interaction around this isolation and making sure that the kernel handles everything correctly. It is tricky, subtle, painstaking work that, sadly, often goes unnoticed until a bug is found, and it is a huge part of what the operating system actually does[3].
Part of why this work is so tricky and subtle is that the kernel and the user programs can’t be completely isolated: user programs clearly need access to some operating system functions. Historically, this was the realm of system calls.
System calls, or syscalls, are the original way that the operating system kernel exposed an API to user code. Glossing over a vast amount of detail, the user code packs up a request and hands it to the kernel. The kernel carefully checks to make sure that all its rules are being followed, and – if everything looks good – the kernel will execute the system call on behalf of the user, copying data between kernel space and user space as needed. The critical bits about syscalls are:
- The kernel is in control of everything. User code gets to make requests, not demands.
- The checking, copying of data, etc., take time. This makes a system call slower than running normal code, whether that’s user code or kernel code: it’s the act of crossing the boundary that slows you down. Things have gotten much faster over time, but it’s just not feasible to, say, do a syscall for every network packet on a busy system.
This is where eBPF shines: instead of doing a syscall for every network packet (or trace point, or whatever), just drop some user code directly into the kernel! Then the kernel can run it all at full speed, handing data out to user space only when it’s really necessary. (There’s been a fair amount of this kind of rethinking of the user/kernel interaction in Linux recently, often to great effect. io_uring is another example of great work in this area.)
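To make the cost of boundary crossings concrete, here is a toy Python model. It is not kernel code and not real eBPF: the `Packet` type, both function names, and the port numbers are invented for illustration, and each "crossing" simply stands in for one kernel/user transition.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst_port: int
    payload: bytes


def syscall_per_packet(packets, wanted_port):
    """User space pulls every packet across the boundary, then filters locally."""
    crossings = 0
    matched = []
    for pkt in packets:
        crossings += 1  # one kernel/user crossing per packet, wanted or not
        if pkt.dst_port == wanted_port:
            matched.append(pkt)
    return matched, crossings


def in_kernel_filter(packets, wanted_port):
    """The filter runs kernel-side; only matching packets cross to user space."""
    crossings = 0
    matched = []
    for pkt in packets:
        if pkt.dst_port == wanted_port:  # the check happens before any crossing
            crossings += 1
            matched.append(pkt)
    return matched, crossings


# Hypothetical traffic: one packet in ten is for port 80, the rest for 443.
packets = [Packet(80 if i % 10 == 0 else 443, b"x") for i in range(1000)]
slow_matches, slow_crossings = syscall_per_packet(packets, 80)
fast_matches, fast_crossings = in_kernel_filter(packets, 80)
print(slow_crossings, fast_crossings)
```

Both approaches return the same packets. With this made-up traffic mix, the per-packet approach crosses the boundary 1,000 times and the in-kernel filter only 100 times.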
Of course, running user code in the kernel is really dangerous, so the kernel spends an awful lot of effort on verifying what that user code is actually meant to be doing.
When a user process starts up, the kernel basically starts it running with the perspective that it’s probably OK. The kernel puts guardrails around it, and will summarily kill any user process that tries to break the rules, but – to anthropomorphize a bit – the user code is fundamentally assumed to have a right to execute.
No such courtesy can be afforded to eBPF code. In the kernel itself, the protective guardrails are basically nonexistent, and blindly running user code in the hopes that it’s safe would be throwing the gates wide open to every security exploit there is (as well as allowing bugs to crash the whole machine). Instead, eBPF code gets to run only if the kernel can decisively prove that it’s safe.
Proving that a program is safe is incredibly hard[4]. In order to make it even sort of tractable, the kernel dramatically constrains what eBPF programs are allowed to do. Some examples:
- eBPF programs are not allowed to block.
- They’re not allowed to have unbounded loops (in fact, they weren’t allowed to have loops at all until fairly recently).
- They’re not allowed to exceed a certain maximum size.
- The verifier must be able to evaluate all possible paths of execution.
The verifier is utterly Draconian and its decision is final: it has to be, in order to maintain the isolation guarantees that our entire cloud-native world relies on. It also has to err on the side of declaring the program unsafe: if it’s not completely certain that the program is safe, it is rejected. Unfortunately, there are eBPF programs that are safe, but that the verifier just isn’t smart enough to pass. If you’re in that position, you’ll need to either rewrite the program until the verifier is OK with it, or patch the verifier and build your own kernel[5].
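The "reject unless provably safe" stance can be sketched with a toy verifier in Python. Everything here is an assumption made for illustration: the four-instruction language, the `MAX_INSNS` limit, and the `verify` function are all invented. The real kernel verifier does far more than this.

```python
MAX_INSNS = 16  # hypothetical size limit for this toy language


def verify(program):
    """Return (ok, reason). Anything not provably safe is rejected."""
    if not program:
        return False, "empty program"
    if len(program) > MAX_INSNS:
        return False, "program too large"

    for pc, insn in enumerate(program):
        op = insn[0]
        if op in ("jmp", "jeq"):
            target = insn[1]
            if not 0 <= target < len(program):
                return False, f"jump out of range at {pc}"
            if target <= pc:
                # Conservative: every backward jump is treated as a possible
                # unbounded loop, even if it would really run a fixed number of times.
                return False, f"backward jump at {pc} (possible unbounded loop)"
        elif op not in ("mov", "exit"):
            return False, f"unknown instruction at {pc}"

    # Only forward jumps remain, so walking every path is guaranteed to finish.
    stack, seen = [0], set()
    while stack:
        pc = stack.pop()
        if pc in seen:
            continue
        seen.add(pc)
        if pc >= len(program):
            return False, "path falls off the end without exit"
        op = program[pc][0]
        if op == "exit":
            continue
        if op == "jmp":
            stack.append(program[pc][1])
        elif op == "jeq":
            stack.extend([program[pc][1], pc + 1])  # taken and not-taken branches
        else:
            stack.append(pc + 1)
    return True, "ok"


straight_line = [("mov",), ("jeq", 3), ("mov",), ("exit",)]
counted_loop = [("mov",), ("jeq", 0), ("exit",)]  # may be bounded; the toy cannot tell
print(verify(straight_line))
print(verify(counted_loop))
```

The second program is rejected whether or not it would terminate in practice. That is the same trade-off described above: the verifier refuses a possibly safe program because it cannot prove safety.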
The end result is that eBPF is a very highly constrained language. This means that while things like running simple checks for every incoming network packet are easy, seemingly-straightforward things like buffering data across multiple packets are hard. Implementing HTTP/2 or terminating TLS is simply not possible in eBPF: both are too complex.
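A small Python illustration of why cross-packet buffering matters for layer 7: an HTTP header can straddle two TCP segments, so a check that looks at each segment on its own misses it. The function names and the example request are hypothetical, and this only shows the statefulness involved; it is not a claim about what any particular eBPF program does.

```python
def per_packet_match(segments, needle):
    """Stateless check: inspect each segment in isolation."""
    return any(needle in seg for seg in segments)


def buffered_match(segments, needle):
    """Stateful check: accumulate bytes across segments, as a user-space proxy would."""
    buf = b""
    for seg in segments:
        buf += seg  # the buffer grows with the input
        if needle in buf:
            return True
    return False


# One HTTP/1.1 request whose Host header is split across two segments.
segments = [b"GET /orders HTTP/1.1\r\nHo", b"st: shop.example\r\n\r\n"]
needle = b"Host: shop.example"

print(per_packet_match(segments, needle))
print(buffered_match(segments, needle))
```

The stateless check does not find the header, while the buffered one does. The buffer is state that grows with the input and outlives any single packet, which is awkward to express in a program that must be small, non-blocking, and provably bounded.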
And all of that, finally, brings us to the question of what it would look like to apply eBPF’s networking capabilities to a service mesh.
eBPF and Service Mesh
Service meshes have to handle all of the complexity of cloud-native networking. For example, they typically must originate and terminate mTLS; retry requests that fail; transparently upgrade connections from HTTP/1 to HTTP/2; enforce access policy based on workload identity; send traffic across cluster boundaries; and much more. There’s a lot going on in the cloud-native world.
Most service meshes use the sidecar model to manage everything. The mesh attaches a proxy, running in its own container, to every application pod, and the proxy intercepts network traffic to and from the application pod, doing whatever is necessary for mesh functionality. This means that the mesh can work with any workload and requires no application changes, which is a pretty dramatic win for developers. It’s also a win for the platform side: they no longer need to rely on the app developers to implement mTLS, retries, golden metrics[6], etc., as the mesh provides all this and more across the entire cluster.
On the other hand, it wasn’t very long ago that the idea of deploying all these proxies would have been utter insanity, and people still worry about the burden of running the extra containers. But Kubernetes makes deployment easy, and as long as you keep the proxy lightweight and fast enough, it works very well indeed. (“Lightweight and fast” is, of course, subjective. Many meshes use the general-purpose Envoy proxy as a sidecar; Linkerd seems to be the only one using a purpose-built lightweight proxy.)
An obvious question, then, is whether we can push functionality from the sidecars down into eBPF, and whether doing so will help. At OSI layers 3 and 4 – IP, TCP, and UDP – we already see several clear wins for eBPF. For example, eBPF can make complex, dynamic IP routing fairly simple. It can do very intelligent packet filtering, or do sophisticated monitoring, and it can do all of that quickly and inexpensively. Where meshes need to interact with functionality at these layers, eBPF seems like it could definitely help with mesh implementation.
However, things are different at OSI layer 7. eBPF’s execution environment is so constrained that protocols at the level of HTTP and mTLS are far outside its abilities, at least today. Given that eBPF is constantly evolving, perhaps some future version could manage these protocols, but it’s worth remembering that writing eBPF is very difficult, and debugging it can be even more so. Many layer 7 protocols are complex beasts that are bad enough to get right in the relatively forgiving environment of user space; it’s not clear that rewriting them for eBPF’s constrained world would be practical, even if it became possible.
What we could do, of course, would be to pair eBPF with a proxy: put the core low-level functionality in eBPF, then pair that with user space code to manage the complex bits. That way we could potentially get the win of eBPF’s performance at lower levels, while leaving the really nasty stuff in user space. This is actually what every extant “eBPF service mesh” today does, though it’s often not widely advertised.
This raises some questions about where, exactly, such a proxy should go.
Per-Host Proxies vs Sidecars
Rather than deploying a proxy at every application pod, as we do in the sidecar model, we could instead look at deploying a single proxy per host (or, in Kubernetes-speak, per Node). It adds a little bit of complexity to how you manage IP routing, but at first blush seems to offer some good economies of scale since you need fewer proxies.
However, sidecars turn out to have some significant benefits over per-host proxies. This is because sidecars get to act like part of the application, rather than standing apart from it:
- Sidecar resource usage scales in proportion to application load, so if the application isn’t doing much, the sidecar’s resource usage will stay low[7]. When the application is taking a lot of load, all of Kubernetes’ existing mechanisms (resource requests and limits, the OOMKiller, etc.) keep working exactly as you’re used to.
- If a sidecar fails, it affects exactly one pod, and once again existing Kubernetes mechanisms for responding to the pod failing work fine.
- Sidecar operations are basically the same as application pod operations. For example, you upgrade to a new version of the sidecar with a normal Kubernetes rolling restart.
- The sidecar has exactly the same security boundary as its pod: same security context, same IP addresses, etc. For example, it needs to do mTLS only for its pod, which means that it only needs key material for that single pod. If there’s a bug in the proxy, it can leak only that single key.
All of these things go away for per-host proxies. Remember that in Kubernetes, the cluster scheduler decides which pods get scheduled onto a given node, which means that every node effectively gets a random set of pods. That means that a given proxy will be completely decoupled from the application, which is a big deal:
- It’s effectively impossible to reason about an individual proxy’s resource usage, since it will be driven by a random subset of traffic to a random subset of application pods. In turn, that means that eventually the proxy will fail for some hard-to-understand reason, and the mesh team will take the blame.
- Applications are suddenly more susceptible to the noisy neighbor problem, since traffic to every pod scheduled on a given host has to flow through a single proxy. A high-traffic pod could completely consume all the proxy resources for the node, leaving all the other pods to starve. The proxy could try to ensure fairness, but that will fail too if the high-traffic pod is also consuming all the node’s CPU.
- If a proxy fails, it affects a random subset of application pods — and that subset will be constantly changing. Likewise, trying to upgrade a proxy will affect a similarly random, constantly-changing subset of application pods. Any failure or maintenance task suddenly has unpredictable side effects.
- The proxy now has to span the security boundary of every application pod that’s been scheduled on the Node, which is far more complex than just being coupled to a single pod. For example, mTLS requires holding keys for every scheduled pod, while not mixing up which key goes with which pod. Any bug in the proxy is a much scarier affair.
Basically, the sidecar uses the container model to its advantage: the kernel and Kubernetes put in the effort to enforce isolation and fairness at the level of the container, and everything just works. The per-host proxies step outside of that model, which means that they have to solve all the problems of contended multitenancy on their own.
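The difference in security boundary can be sketched as a "how many keys does one proxy hold" calculation. This is a simplified Python model under stated assumptions: one mTLS key per pod, a made-up placement of pods on nodes, and hypothetical function names. It says nothing about how any particular mesh stores key material.

```python
def sidecar_exposure(pods_by_node):
    """Each pod has its own proxy, so each proxy holds exactly one key."""
    return {pod: {pod} for pods in pods_by_node.values() for pod in pods}


def per_host_exposure(pods_by_node):
    """One proxy per node holds the keys of every pod scheduled on that node."""
    return {node: set(pods) for node, pods in pods_by_node.items()}


def worst_case_leak(exposure):
    """Largest number of keys a single compromised proxy could leak."""
    return max(len(keys) for keys in exposure.values())


# Hypothetical placement chosen by the scheduler.
pods_by_node = {
    "node-a": ["web", "api", "db"],
    "node-b": ["cache"],
}

print(worst_case_leak(sidecar_exposure(pods_by_node)))
print(worst_case_leak(per_host_exposure(pods_by_node)))
```

In the sidecar model the worst case stays at one key no matter where pods are scheduled. In the per-host model it is whatever the scheduler happened to place on the busiest node, which is three keys in this example.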
Per-host proxies do have advantages. First, in the sidecar world, going from one pod to another is always two passes through the proxy; in the per-host world, sometimes it’s only one pass[8], which can reduce latency a bit. Also, you can end up running fewer proxies, which could save on resource consumption if your proxy has a high resource usage at idle. However, these improvements are fairly minor compared to the costs of the operational and security issues, and they’re largely things that can be mitigated by using smaller, faster, simpler proxies.
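The "one pass versus two" point can be written down as a tiny Python model. It rests on an assumption not spelled out in the text: the single-pass case is taken to be the one where both pods share a node and therefore share a proxy. The pod placement and function names are hypothetical.

```python
from itertools import permutations


def proxy_passes(src_node, dst_node, model):
    """Number of proxy traversals for one pod-to-pod request."""
    if model == "sidecar":
        return 2  # source sidecar, then destination sidecar
    if model == "per-host":
        # Assumption: pods on the same node share one proxy, so one pass.
        return 1 if src_node == dst_node else 2
    raise ValueError(f"unknown model: {model}")


def average_passes(node_of, model):
    """Average passes over every ordered pair of distinct pods."""
    pairs = list(permutations(node_of, 2))
    total = sum(proxy_passes(node_of[a], node_of[b], model) for a, b in pairs)
    return total / len(pairs)


# Hypothetical placement: two pods share node-a, one sits alone on node-b.
node_of = {"web": "node-a", "api": "node-a", "db": "node-b"}

print(average_passes(node_of, "sidecar"))
print(average_passes(node_of, "per-host"))
```

Under this assumption the saving depends entirely on how often communicating pods land on the same node. The scheduler decides that, not the mesh.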
Could we also mitigate these issues by improving the proxy to better handle contended multitenancy? Maybe. There are two main problems with that approach:
- Contended multitenancy is a security concern, and security matters are best handled with smaller, simpler code that’s easier to reason about. Adding a lot of code to better handle contended multitenancy is basically diametrically opposed to security best practices.
- Even if the security issues could be addressed completely, the operational issues would remain. Any time we choose to have more complex operations, we should be asking why, and who benefits.
Overall, these sorts of proxy changes would likely be an enormous amount of work[9], which raises real questions about the value of doing that work.
Bringing everything full circle, let’s look back at our original question: what would it look like to push service mesh functionality down to eBPF? We know that we need a proxy to maintain the layer 7 functionality we need, and we further know that sidecar proxies get to work within the isolation guarantees of the operating system, where per-host proxies would have to manage everything themselves. This is not a minor difference: the potential performance benefits of the per-host proxy simply don’t outweigh the extra security concerns and operational complexity, leaving us with the sidecar as the most viable option whether or not eBPF is involved.
To state the obvious, the first priority of any service mesh must be the users’ operational experience. Where we can use eBPF for greater performance and lower resource usage, great! But we need to be careful that we don’t sacrifice the user experience in the process.
Will eBPF eventually become able to take on the full scope of a service mesh? Unlikely. As discussed above, it’s very much unclear that actually implementing all the needed layer 7 processing in eBPF would be practical, even if it does become possible at some point. Likewise, there could be some other mechanism for moving these L7 capabilities into the kernel — historically, though, there’s not been a big push for this, and it’s not clear what would really make that compelling. (Remember that moving functionality into the kernel means removing the guardrails we rely on for safety in user space.)
For the foreseeable future, then, the best course forward for service meshes seems to be to actively look for places where it does make sense to lean on eBPF for performance, but to accept that there’s going to be a need for a user-space sidecar proxy, and to redouble efforts to make the proxy as small, fast, and simple as possible.
1. Or, at least, dramatically easier.
2. At least, not without prearrangement between the programs. That’s outside the scope of this article.
3. Much of the rest is scheduling.
4. In fact, it’s impossible in the general case. If you want to dust off your CS coursework, it starts with the halting problem.
5. One of these things is probably easier than the other. Especially if you want to get your verifier patches accepted upstream!
6. Traffic, latency, errors, and saturation.
7. Assuming, again, a sufficiently lightweight sidecar.
8. Sometimes it’s still two, though, so this is a bit of a mixed blessing.
9. There’s an interesting Twitter thread about how hard it would be to do this for Envoy, for example.