
How Not to Use the DORA Metrics to Measure DevOps Performance

Key Takeaways

  • With metrics, teams must remember Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
  • Low-performing teams take a hit on stability when they try to increase their deployment frequency simply by working harder.
  • Driving improvements in a metric may lead to shortcuts with testing that cause buggy code, or to fast but brittle approaches to writing software.
  • A high change failure rate may reduce the effectiveness of the other metrics in terms of measuring progress toward continuous delivery of value to your customers.

Since 2014, Google’s DevOps Research and Assessment (DORA) team has been at the forefront of DevOps research. This group combines behavioural science, seven years of research, and data from over 32,000 professionals to describe the most effective and efficient ways to deliver software. They have identified technology practices and capabilities proven to drive organisational outcomes and published four key metrics that teams can use to measure their progress. These metrics are (see the measurement sketch after the list):

  1. Deployment Frequency
  2. Lead Time for Changes
  3. Mean Time to Recover
  4. Change Failure Rate
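
As a rough illustration of what measuring these four metrics can look like in practice, here is a minimal Python sketch that computes them from hypothetical deployment and incident records; the record shapes, field names, and time window are assumptions for illustration, not a prescribed implementation.

```python
from datetime import datetime
from statistics import mean

# Hypothetical records; real data would come from your CI/CD and incident tooling.
deployments = [
    {"commit_at": datetime(2023, 4, 2, 15), "deployed_at": datetime(2023, 4, 3, 10), "failed": False},
    {"commit_at": datetime(2023, 4, 3, 9),  "deployed_at": datetime(2023, 4, 4, 11), "failed": True},
    {"commit_at": datetime(2023, 4, 5, 10), "deployed_at": datetime(2023, 4, 5, 16), "failed": False},
]
incidents = [
    {"started_at": datetime(2023, 4, 4, 11, 30), "resolved_at": datetime(2023, 4, 4, 13)},
]
period_days = 7  # assumed reporting window

# 1. Deployment Frequency: deployments per day over the period.
deployment_frequency = len(deployments) / period_days

# 2. Lead Time for Changes: commit-to-production time, averaged, in hours.
lead_time = mean((d["deployed_at"] - d["commit_at"]).total_seconds() / 3600 for d in deployments)

# 3. Mean Time to Recover: average time from incident start to resolution, in hours.
mttr = mean((i["resolved_at"] - i["started_at"]).total_seconds() / 3600 for i in incidents)

# 4. Change Failure Rate: share of deployments that required remediation.
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"Deployments/day: {deployment_frequency:.2f}")
print(f"Lead time (avg hours): {lead_time:.1f}")
print(f"MTTR (avg hours): {mttr:.1f}")
print(f"Change failure rate: {change_failure_rate:.0%}")
```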

In today’s world of digital transformation, companies need to pivot and iterate quickly to meet changing customer requirements while delivering a reliable service to their customers. The DORA reports identify a range of important factors which companies must address if they want to achieve this agility, including cultural (autonomy, empowerment, feedback, learning), product (lean engineering, fire drills, lightweight approvals), technical (continuous delivery, cloud infrastructure, version control) and monitoring (observability, WIP limits) factors. 

While an extensive list of “capabilities” is great, software teams need a tangible, objective yardstick to measure their progress if they are to continually improve their processes to meet customer demands. Thanks to books like Accelerate: The Science of Lean Software and DevOps (Forsgren et al., 2018) and Software Architecture Metrics (Ciceri et al., 2022), the DORA metrics are now the de facto measure of DevOps success, and there is a consensus that they represent a great way to assess performance for most software teams.

But when handling metrics, teams must always be careful to remember Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” The danger is that metrics become an end in themselves rather than a means to an end.

Let’s explore what this might look like in terms of the DORA metrics — and how you can avoid pulling the wool over your own eyes.

Deployment Frequency

For the primary application or service you work on, how often does your organisation deploy code to production or release it to end users?

At the heart of DevOps is an ambition that teams never put off a release simply because they want to avoid the process. By addressing any pain points, deployments cease to be a big deal, and your team can release more often. As a result, value is delivered sooner, more incrementally, allowing for continuous feedback from end users, who then shape the direction of travel for ongoing development work.

For teams that are currently only able to release at the end of a two-week sprint, or even less often, the deployment frequency metric hopefully tracks your progress toward deploying once a week, multiple times a week, daily, and then multiple times a day for elite performers. That progression is good, but it also matters how the improvements are achieved.

What does this metric really measure? Firstly, whether the deployment process is continuously improving, with obstacles being identified and removed. Secondly, whether your team is successfully breaking up projects into changes that can be delivered incrementally. 

As you celebrate the latest increase in deployment frequency, ask yourself: are our users seeing the benefit of more frequent deployments? Studies have shown that low-performing teams take a big hit on stability when they try to increase their deployment frequency simply by working harder (Forsgren, Humble, and Kim, 2018). Have we only managed to shift the dial on this metric by cracking the whip to increase our tempo?

Lead Time for Changes

For the primary application or service you work on, what is your lead time for changes (that is, how long does it take to go from code committed to code successfully running in production)?

While there are a few ways of measuring lead times (which may be equivalent to or distinct from “cycle times,” depending on who you ask), the DORA definition is the time from code being committed to that code running successfully in production, in the hands of users.

By reducing lead times, your development team will improve business agility. End users don’t wait long to see the requested features being delivered. The wider business can be more responsive to challenges and opportunities. All this helps improve engagement and interplay between your development team, the business, and end users.

Of course, reduced lead times go hand in hand with deployment frequency. More frequent releases make it possible to accelerate project delivery. Importantly, they ensure completed work doesn’t sit around waiting to be released.

How can this metric drive the wrong behaviour? If your engineering team works towards the metric rather than the actual value the metric is supposed to measure, they may end up taking shortcuts with testing and releasing buggy code, or coding themselves into a corner with fast but brittle approaches to writing software.

These behaviours produce a short-term appearance of progress, but a long-term hit to productivity. Reductions in lead times should come from a better approach to product management and improved deployment frequency, not a more lax approach to release quality where existing checks are skipped and process improvements are avoided.

Mean Time to Recover

For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?

Part of the beauty of DevOps is that it doesn’t pit velocity and resilience against each other but makes them mutually beneficial. For example, frequent small releases with incremental improvements can more easily be rolled back if there’s an error. Or, if a bug is easy to identify and fix, your team can roll forward and remediate it quickly. 

Yet again, we can see that the DORA metrics are complementary; success in one area typically correlates with success across others. However, driving success with this metric can be an anti-pattern – it can unhelpfully conceal other problems. For example, if your strategy to recover a service is always to roll back, then you’ll be taking value from your latest release away from your users, even those that don’t encounter your new-found issue. While your mean time to recover will be low, your lead time figure may now be skewed and not account for this rollback strategy, giving you a false sense of agility. Perhaps looking at what it would take to always be able to roll forward is the next step on your journey to refine your software delivery process. 

It’s possible to see improvements in your mean time to recovery (MTTR) that are wholly driven by increased deployment frequency and reduced lead times. Alternatively, maybe your mean time to recovery is low because of a lack of monitoring to detect those issues in the first place. Would improving your monitoring initially cause this figure to increase, but for the benefit of your fault-finding and resolution processes? Measuring the mean time to recovery can be a great proxy for how well your team monitors for issues and then prioritises solving them. 

With continuous monitoring and increasingly relevant alerting, you should be able to discover problems sooner. In addition, there’s the question of culture and process: does your team keep up-to-date runbooks? Do they rehearse fire drills? Intentional practice and sufficient documentation are key to avoiding a false sense of security when the time to recover is improving due to other DevOps improvements.

Change Failure Rate

For the primary application or service you work on, what percentage of changes to production or releases to users result in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, patch)?

Change failure rate measures the percentage of releases that cause a failure, bug, or error: this metric tracks release quality and highlights where testing processes are falling short. A sophisticated release process should afford plenty of opportunities for various tests, reducing the likelihood of releasing a bug or breaking change.

Change failure rate acts as a good control on the other DORA metrics, which tend to push teams to accelerate delivery with no guarantee of concern for release quality. If your data for the other three metrics show a positive trend, but the change failure rate is soaring, you have the balance wrong. With a high change failure rate, those other metrics probably aren’t giving you an accurate assessment of progress in terms of your real goal: continuous delivery of value to your customers.

As with the mean time to recover, change failure rate can, and indeed should, be positively impacted by deployment frequency. If you make the same number of errors but spread the work across a greater number of deployments, the percentage of deployments with errors will be reduced. That’s good, but it can give a misleading sense of improvement from a partial picture: the number of errors hasn’t actually reduced. Perhaps some teams might even be tempted to artificially reduce their change failure rate by these means!
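
To make that arithmetic concrete, here is a tiny worked example with made-up numbers:

```python
# Hypothetical numbers: the same 4 failing changes, batched differently.
failing_changes = 4

big_batches = 10    # few, large deployments
small_batches = 40  # the same work split into many small deployments

print(f"CFR with large batches: {failing_changes / big_batches:.0%}")   # 40%
print(f"CFR with small batches: {failing_changes / small_batches:.0%}") # 10%
# The rate falls from 40% to 10%, yet the absolute number of failures is unchanged.
```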

Change failure rate should assess whether your team is continuously improving regarding testing. For example, are you managing to ‘shift left’ and find errors earlier in the release cycle? Are your testing environments close replicas of production to effectively weed out edge cases? It’s always important to ask why your change failure rate is reducing and consider what further improvements can be made.

The Big Picture Benefits of DevOps

Rightly, DORA metrics are recognized as one of the DevOps industry standards for measuring maturity. However, if you think back to Goodhart’s Law and start to treat them as targets rather than measures, you may end up with a misleading sense of project headway, an imbalance between goals and culture, and releases that fall short of your team’s true potential.

It’s difficult to talk about DORA metrics without having the notion of targets in your head; that bias can slowly creep in and before long you’re unknowingly talking about them in terms of absolute targets. To proactively avoid this slippery slope, focus on the trends in your metrics – when tweaking your team’s process or practices, relative changes in your metrics over time give you much more useful feedback than a fixed point-in-time target ever will; let them be a measure of your progress.
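
As one hedged illustration of focusing on trends rather than fixed targets, the sketch below compares the latest weekly lead-time figure against the average of the preceding weeks; the sample data and window size are invented.

```python
# Weekly median lead times in hours (invented sample data).
weekly_lead_times = [52, 49, 47, 44, 45, 41, 38, 36]

def trend(values, window=4):
    """Percent change of the latest value vs. the average of the preceding window."""
    baseline = sum(values[-window - 1:-1]) / window
    return (values[-1] - baseline) / baseline * 100

print(f"Lead time trend vs. previous 4 weeks: {trend(weekly_lead_times):+.1f}%")
# A steadily negative trend is the useful signal; whether the absolute value sits
# above or below some fixed target tells you much less.
```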

If you find yourself in a team where targets are preventing you from changing your process, driving unhelpful behaviours, or are so unrealistic that they demoralise your team, ask yourself what context is missing that makes them unhelpful. Go back and question what problem you’re trying to solve – are your targets driving behaviours that merely treat symptoms rather than addressing an underlying cause? Have you fallen foul of setting targets too soon? Remember to measure first, and try not to guess.

When used properly, the DORA metrics are a brilliant way to demonstrate your team’s progress, and they provide evidence you can use to explain the business value of DevOps. Together, these metrics point to the big-picture benefits of DevOps: continuous improvements in the velocity, agility, and resilience of a development and release process that brings together developers, business stakeholders, and end users. By observing and tracking trends with DORA metrics, you will make better decisions that support your teams and drive more value back to your customers.

Agility and Architecture: Balancing Minimum Viable Product and Minimum Viable Architecture

Key Takeaways

  • No matter what you do, you will end up with an architecture. Whether it is good or bad depends on your decisions and their timing.
  • In an agile approach, architecture work has to be done as a continuous stream of decisions and experiments that validate them. The challenge is how to do this under extreme time pressure.
  • Developing a system in short intervals is challenging by itself, and adding “extra” architectural work adds complexity that teams can struggle to overcome. Using decisions about the MVP to guide architectural decisions can help teams decide what parts need to be built with the future in mind, and what parts can be considered “throw-aways”.
  • The concept of a “last responsible moment” to make a decision is not really helpful: teams rarely know when this moment is until it is too late. Using the evolution of the MVP to examine when decisions need to be made provides more concrete guidance.
  • With an agile approach, there is no point at which “minimum viability” as a concept is abandoned.

The Agile movement, initiated with the publication of the Agile Manifesto over 20 years ago, has become a well-established approach to software development.

Unfortunately, software architecture and agility are often portrayed as incompatible by some Agile practitioners and even by some software architects.

In reality, they are mutually reinforcing – a sound architecture helps teams build better solutions in a series of short intervals, and gradually evolving a system’s architecture helps by validating and improving it over time.

The source of the apparent antagonism between agility and architecture is the old flawed model of how software architecture is created: largely as an up-front, purely intellectual exercise whose output is a set of diagrams and designs based on assumptions and poorly documented Quality Attribute Requirements (QARs).

In this model, architectural decisions are made by people who, in theory, have the experience to make the important technical choices that will enable the system to meet its future challenges in a timely and cost-effective way; in practice, those decisions are often based on flawed assumptions.

Decisions are the essence of software architecture

In an earlier article, we asserted that software architecture is all about decisions. This view is shared by others as well:

we do not view a software architecture as a set of components and connectors, but rather as the composition of a set of architectural design decisions – Jan Bosch and Anton Jansen (IEEE 2005) 

So which decisions are architectural? In another article, we introduced the concept of a cost threshold, which is the maximum amount that an organization would be willing to pay to achieve a particular outcome or set of outcomes, expressed in present value, and based on the organization’s desired, and usually risk-adjusted, rate of return.  

Architectural decisions, then, are those that might cause the cost threshold for the product to be exceeded. In other words,

Software architecture is the set of design decisions which, if made incorrectly, may cause your project [or product] to be canceled. – Eoin Woods (SEI, 2010)
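
As a back-of-the-envelope illustration of the cost threshold idea, the sketch below discounts a hypothetical future outcome at a risk-adjusted rate to get a present value and compares it with the estimated cost of a decision; all figures, and the single-cash-flow simplification, are assumptions.

```python
def present_value(future_value: float, rate: float, years: float) -> float:
    """Discount a single future cash flow at a (risk-adjusted) annual rate of return."""
    return future_value / (1 + rate) ** years

# Invented figures: an outcome expected to be worth 1.2M in three years,
# discounted at a 15% risk-adjusted rate of return.
cost_threshold = present_value(1_200_000, 0.15, 3)  # roughly 789,000

estimated_cost_of_decision = 850_000
print(f"Cost threshold (present value): {cost_threshold:,.0f}")
# True here: this decision might push the product past its cost threshold,
# which is exactly what makes it an architectural decision.
print(f"Decision exceeds threshold: {estimated_cost_of_decision > cost_threshold}")
```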

No matter what you do, you will have an architecture… 

Whether it is any good depends on your decisions.  Since teams are making decisions constantly, they need to constantly ask themselves, “Will this decision we are about to make cause the cost threshold, over the lifetime of the product, to be exceeded?” For many decisions, the answer is “no”; they are free to make the decision they feel is best.

Architectural decisions tend to be focused on certain kinds of questions:

  • Scalability: will the product perform acceptably when workloads increase?
  • Security: will it be acceptably secure?
  • Responsiveness: will it provide acceptable responsiveness to user-initiated events? Will it provide acceptable responsiveness to externally-generated events?
  • Persistency: what are the throughput and structure (or lack thereof) of data that must be stored and retrieved?
  • Monitoring: how will the product be instrumented so that the people who support the product can understand when it starts to fail to meet QARs and prevent critical system issues?
  • Platform: how will it meet QARs related to system resource constraints such as memory, storage, event signaling, etc? For example, real-time and embedded products (such as a digital watch, or an automatic braking system) have quite different constraints than cloud-based information systems.
  • User interface: how will it communicate with users? For example, virtual reality interfaces have quite different QARs than 2-dimensional graphical user interfaces, which have quite different QARs than command-line interfaces. 
  • Distribution: Can applications or services be relocated, or must they run in a particular environment? Can data be relocated dynamically, or does it have to reside in a particular data store location?
  • Sustainability: Will the product be supportable in a cost-effective way? Can the team evaluate the quality of their architectural decisions as they develop the MVA, by adding sustainability criteria to their “Definition of Done”?

When the architecture of a system “emerges” in an unconscious way, without explicitly considering these questions, it tends to be brittle and costly to change, if change is even possible. 

Even a non-decision is a decision

Postponing decisions can still have an impact, because deciding not to address something now often means you are continuing to work under your existing assumptions. If these turn out to be wrong, you may have to undo the work that is based on those assumptions. The relevant question teams need to ask is, “are we about to do work that, at some point, may need to be undone and redone?” If the answer is yes, you should be more mindful about making a conscious decision on the issue at hand. Rework is okay so long as you understand the potential cost and it doesn’t push you beyond your cost threshold.

In a series of prior articles, we introduced the concept of a Minimum Viable Architecture (MVA), which supports the architectural foundations of the Minimum Viable Product (MVP).

If not making an architectural decision will affect the viability of the MVP, it needs to be addressed immediately. For example, if the MVP has to support 1,000 concurrent users to be financially viable, then decisions related to scalability and concurrency to support this QAR need to be made. However, no decision should be made beyond what is strictly required for the success proposition of the MVP.

Investing in more than is needed to support the necessary level of success of the MVP would be wasteful since that level of investment (i.e. solving those problems) may never be needed; if the MVP fails, no more investment is needed in the MVA.

Agility and software architecture

Some people believe that the architecture of a software system will emerge rather naturally as a by-product of the development work. In their view, there is no need to design an architecture as it is mostly structural, and a good structure can be developed by thoughtfully refactoring the code. Refactoring code and developing a good modular code structure is important, but well-structured code does not solve all the problems that a good architecture must solve.

Unfortunately, you only know whether something is incorrect through empiricism, which means that every architectural decision needs to be validated by running controlled experiments. This is where an agile approach helps: every development timebox (e.g. Sprint, increment, etc.) provides an opportunity to test assumptions and validate decisions. Testing and validating (or rejecting) decisions limits the amount of rework caused by decisions that turn out to be incorrect. And there will always be some decisions that turn out to be incorrect.

By providing a way to empirically validate architectural decisions, an agile approach helps the team to improve the architecture of their product over time. 

Agility requires making architectural decisions under extreme time pressure

In an agile approach, the development team evolves the MVP in very short cycles, measured in days or weeks rather than months, quarters, and years, as often happens in traditional development approaches. In order to ensure they can sustain the product over its desired lifespan, they must also evolve the architecture of the release (the MVA). In other words, their forecast for their work must include the work they need to do to evolve the MVA.

As implied in the previous section, this involves asking two questions:

  • Will the changes to the MVP change any of the architectural decisions made by the team? 
  • Have they learned anything since the previous delivery of their product that will change the architectural decisions they have made?

The answer to these questions, at least initially, is usually “we don’t know yet”, so the development team needs to do some work to answer them more confidently.

One way to do this is for the team to include revisiting architectural decisions as part of their Definition of Done for a release. Leaving time to make this happen usually means reducing the functional scope of a release to account for the additional architectural work.

So what does a release need to contain? First and foremost, it has to be valuable to customers by improving the outcomes they experience. Second, providing those outcomes includes an implicit commitment to support the outcomes for the life of the product. As a result, the product’s architecture must evolve along with the new outcomes it provides.

Because the team is delivering in very short cycles, it is constantly making trade-offs. Sometimes these trade-offs are between different product capabilities, and sometimes between product capabilities and architectural capabilities. These trade-offs mean that a good architecture is never perfect; at best, it is just barely adequate.

Various forces make these trade-offs more challenging, including:

  • Taking on too much scope for the increment; teams need to resist doing this and de-scope to make time for architecture work.
  • Being subject to external mandates with fixed scope, e.g. new regulations; teams are tempted to sacrifice the MVA in the short run, and they may need to (since regulators rarely consider achievability when they make mandates), but they need to be aware of the potential consequences and factor the deferred work into later releases. Examples of this include intentionally incurring technical debt.
  • Adopting unproven, unstable, or unmastered technology; teams doing this are committing, in the vernacular of sports, unforced errors; they need to build evaluation experiments into their release work to prove out the technology and (this is the most important part) have a backup and back-out plan if they need to revert to more proven technologies. 
  • Being subject to urgent business needs driven by competition, or market opportunity; they need to keep in mind that short-term “wins” that can’t be sustained because the product can’t be supported represent a kind of Pyrrhic victory that can kill a business.

As with many things, timing is everything

When making architectural decisions, teams balance two different constraints:

  • If the work they do is based on assumptions that later turn out to be wrong, they will have more work to do: the work needed to undo the prior work, and the new work related to the new decision.
  • They need to build things and deliver them to customers in order to test their assumptions, not just about the architecture, but also about the problems that customers experience and the suitability of different solutions to solve those problems.

No matter what, teams will have to do some rework. Minimizing rework while maximizing feedback is the central concern of the agile team. The challenge they face in each release is that they need to run experiments and validate both their understanding of what customers need and the viability of their evolving answer to those needs. If they spend too much time focused just on the customer needs, they may find their solution is not sustainable, but if they spend too much time assessing the sustainability of the solution they may lose customers who lose patience waiting for their needs to be met.

Teams are sometimes tempted to delay decisions because they think that they will know more in the future. This isn’t automatically true, since teams usually need to run intentional experiments to gather the information they need to make decisions. Continually delaying these experiments can have a tremendous cost when work needs to be undone because prior decisions have to be reversed. 

Delaying decisions until the last responsible moment is like playing chicken with the facts

Principles such as “delay decisions until the last responsible moment” seem to provide useful guidance to teams by giving them permission to delay decisions as long as possible. In this way of thinking, a team should delay a decision on a specific architectural design to implement scalability requirements until those requirements are reasonably well known and quantified.

While the spirit of delaying design decisions until the “last responsible moment” is well-intentioned, seeking to reduce waste by reducing the amount of work done to solve problems that may never occur, the concept is challenging to implement in practice. In short, it’s almost impossible to know what is the “last responsible moment” until after it has passed. For example, scalability requirements may not be accurately quantified until after the software system has been in use for a period of time. In this instance, the architectural design may need to be refactored, perhaps several times, regardless of when decisions were made.

To address this, a team needs to give itself some options by running experiments. It may need to experiment with the limits of the scalability of its current design by pushing an early version of the system or a prototype until it fails with simulated workloads, to know where its current boundaries lie. It can then ask itself if those limits are likely to be exceeded in the real world. If the answer is, “possibly, if …” then the team will probably need to make some different decisions to expand this limit. Drastically shortening the time required to run and analyze experiments, or even eliminating them, results in decisions based on guesses, not facts. It is also unclear what “delaying decisions until the last responsible moment” would mean in this case or how it would help.
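
Returning to the scalability experiment: below is a minimal sketch of ramping simulated load against an early build or prototype until response times or error rates degrade. The endpoint URL, thresholds, and load levels are placeholders, not a recommended tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlopen

ENDPOINT = "http://localhost:8080/health"  # placeholder for an early build or prototype

def probe(_):
    """Issue one request and record its latency and whether it succeeded."""
    start = time.monotonic()
    try:
        with urlopen(ENDPOINT, timeout=2):
            return time.monotonic() - start, True
    except (URLError, OSError):
        return time.monotonic() - start, False

def run_step(concurrency: int, requests: int = 200):
    """Fire a batch of requests at a given concurrency and summarise p95 latency and error rate."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(probe, range(requests)))
    latencies = sorted(latency for latency, _ in results)
    p95 = latencies[int(len(latencies) * 0.95)]
    error_rate = sum(1 for _, ok in results if not ok) / len(results)
    return p95, error_rate

# Ramp the simulated workload until the system visibly degrades.
for concurrency in (5, 10, 20, 40, 80):
    p95, error_rate = run_step(concurrency)
    print(f"{concurrency:>3} workers: p95={p95:.2f}s errors={error_rate:.0%}")
    if p95 > 1.0 or error_rate > 0.05:  # arbitrary degradation thresholds
        print(f"Current design degrades around {concurrency} concurrent clients.")
        break
```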

Also, most decisions are biased by the inertia of what the team already knows. For example, when selecting a programming language for delivering an MVP, a team may lean toward technologies that most team members are already familiar with, especially when delivery timeframes are tight. In these situations, avoiding delays caused by the learning curve associated with a new technology may take precedence over other selection criteria, and the tool selection decision would be made as early as possible – probably not “at the last responsible moment”.

In the end, principles that encourage us to make decisions “just in time” aren’t very helpful. Every architectural decision is a compromise. Every architectural decision will turn out to be wrong at some point and will need to be changed or reversed. Usually, teams don’t know when the “last responsible moment” is until it has already passed. Without some way of determining what the last responsible moment is, the admonition really doesn’t mean anything.

For every evolution of the MVP, re-examine the MVA 

A more concrete approach helps teams know what architectural decisions they must make, and when they must make them. Developing and delivering a product in a series of releases, each of which builds upon the previous one and expands the capabilities of the product, can be thought of as a series of incremental changes to the MVP. 

Various forces influence investments in MVPs versus MVAs. MVPs grow because of the need to respond to large user satisfaction gaps, while MVAs grow due to increases in the technical complexity of the solution needed to support the MVP. Teams must balance these two sets of forces in order to develop successful and sustainable solutions (see Figure 1).

Figure 1: Balancing MVP and MVA releases is critical for Agile Architecture

This is a slightly different way of looking at the MVP. Many people think of the MVP as only the first release, but we think of each release as an evolution of the MVP, an incremental change in its capabilities, and one that still focuses on a minimal extension of the current MVP. With an agile approach, there is no point at which “minimum viability” as a concept is abandoned. 

With each MVP delivery, there are several possible outcomes:

  1. The MVP is successful and the MVA doesn’t need to change;
  2. The MVP is successful but the MVA isn’t sustainable;
  3. The MVP is partially but not wholly successful, but it can be fixed; 
  4. The MVA is also partially but not wholly successful and needs improvement;
  5. The MVP isn’t successful and so the MVA doesn’t matter

Of these scenarios, the first is what everyone dreams of but it’s rarely encountered. Scenarios 2-4 are more likely – partial success, but significant work remains. The final scenario means going back to the drawing board, usually because the organization’s understanding of customer needs was significantly lacking.

In the most likely scenarios, as the MVP is expanded, and as more capabilities are added, the MVA also needs to be evolved. The MVA may need to change because of new capabilities that the MVP takes on: some additional architectural decisions may need to be made, or some previous decisions may need to be reversed. The team’s work on each release is a balance between MVP and MVA work.

How can we achieve this balance? Both the MVP and the MVA include trade-offs to solving different kinds of problems. For the MVP, the problem is delivering valuable customer outcomes, while for the MVA, the problem is delivering those outcomes sustainably. 

Teams have to resist two temptations regarding the MVA: the first is ignoring its long-term value altogether and focusing only on quickly delivering functional product capabilities using a “throw-away” MVA. The second temptation they must resist is over-architecting the MVA to solve problems they may never encounter. This latter temptation bedevils traditional teams, who are under the illusion that they have the luxury of time, while the first temptation usually afflicts agile teams, who never seem to have enough time for the work they take on. 

Dealing with rework through anticipation

Rework is, at its core, caused by having to undo decisions made in the past. It is inevitable, but because it is more expensive than “simple” work – it includes undoing previous work – it is worth taking time to reduce it as much as possible. This means, where possible, anticipating future rework and taking measures to make it easier when it does happen.

Modular code helps reduce rework by localizing changes and preventing them from rippling throughout the code base. Localizing dependencies on particular technologies and commercial products makes it easier to replace them later. Designing to make everything easily replaceable, as much as possible, is one way to limit rework. As with over-architecting to solve problems that never occur, developers need to look at the likelihood that a particular technology or product may need to be replaced. For example, would it be possible to port the MVA to a different commercial cloud or even move it back in-house if the system costs outweigh the benefits of the MVP? When in doubt, isolate.
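
One way to make that isolation concrete is to hide a technology choice behind a small interface the team owns, so the dependency can be swapped at a single point rather than rippling through the code base. The sketch below is illustrative; the DocumentStore interface and both back ends are hypothetical.

```python
from typing import Protocol

class DocumentStore(Protocol):
    """The narrow interface the rest of the code base depends on."""
    def save(self, key: str, data: bytes) -> None: ...
    def load(self, key: str) -> bytes: ...

class CloudDocumentStore:
    """Adapter for a commercial cloud object store (SDK calls elided)."""
    def save(self, key: str, data: bytes) -> None:
        raise NotImplementedError("call the cloud provider's SDK here")

    def load(self, key: str) -> bytes:
        raise NotImplementedError("call the cloud provider's SDK here")

class InMemoryDocumentStore:
    """In-house alternative, useful if the MVA ever has to move off the cloud provider."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def load(self, key: str) -> bytes:
        return self._data[key]

def archive_invoice(store: DocumentStore, invoice_id: str, pdf: bytes) -> None:
    # Application code sees only DocumentStore; swapping the provider is a change
    # at the composition root, not a rewrite of every call site.
    store.save(f"invoices/{invoice_id}", pdf)

archive_invoice(InMemoryDocumentStore(), "INV-42", b"%PDF-...")
```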

Rework is triggered by learning new things that invalidate prior decisions. As a result, when reviewing the results from an MVP release, teams can ask themselves if they learned anything that tells them that they need to revisit some prior decision. If so, their next question should be whether that issue needs to be addressed in the next release because it is somehow related to the next release’s MVP. When that’s the case, the team will need to make room for the rework when they scope the rest of the work.  

Sometimes this means that the MVP needs to be less aggressive in order to make room in the release cycle (e.g. Sprint or increment) to deal with architectural issues, which may not be an easy sell to some of the stakeholders. To that point, keeping all stakeholders apprised of the trade-offs and their rationale at every stage of the design and delivery of the MVP helps to facilitate important conversations when the team feels they need to spend time reducing technical debt or improving the MVA instead of delivering a “quick and dirty” MVP.

The same thing is true for dealing with new architectural issues: As the team looks at the goals for the release of the MVP, they need to ask themselves what architectural decisions need to be made to ensure that the product will remain sustainable over time. 

Conclusion

No matter what you do, you will end up with an architecture. Whether it is good or bad depends on your architectural decisions and their timing. For us, a “good” architecture is one that meets its QARs over its lifetime. Because systems are constantly changing as they evolve, this “architectural goodness” also constantly changes and evolves. 

In an agile approach, teams develop and grow the architecture of their products continuously. Each release involves making decisions and trade-offs, informed by experiments that the team conducts to test the limits of their decisions. To frame evolving the architecture as an either-or trade-off with developing functional aspects of the product is a false dichotomy; teams must do both, but achieving a balance between the two is more art than science.

There really is no such thing as a “last responsible moment” for architectural decisions, as that moment can only be determined ex post facto. Using the evolution of the MVP to examine when decisions need to be made provides more concrete guidance. In this context, the goal of the MVP is to determine whether its benefits result in improved outcomes for the customers or users of the product, while the goal of the MVA is to ensure that the product will be able to sustain those benefits over its full lifespan.

A big challenge teams face is how to do this under extreme time pressure. Developing a system in short intervals is challenging by itself, and adding “extra” architectural work adds complexity that teams can struggle to overcome. Using decisions about the MVP to guide architectural decisions can help teams decide what parts need to be built with the future in mind, and what parts can be considered “throw-aways”.

Finally, we should assume that every architectural decision will need to be undone at some point. Make options for yourself and design for replaceability. Isolate decisions by modularizing code and using abstraction and encapsulation to reduce the cost of undoing decisions.

Dark Side of DevOps – the Price of Shifting Left and Ways to Make it Affordable

Key Takeaways

  • DevOps adoption and the Shifting Left movement empower developers, but they also create additional cognitive load for them
  • A set of pre-selected best practices, packaged as a Paved Path, can alleviate some of the cognitive load without creating unnecessary barriers
  • However, as companies evolve, Paved Path(s) have to evolve to keep up with changes in technology and business needs – one Paved Path may not be enough
  • Eventually a company needs to separate responsibilities between experts (responsible for defining Paved Paths) and developers (who can use Paved Paths to remove routine and concentrate on solving business problems)
  • It is important to evaluate which stage of the DevOps journey your company is at, so no effort is wasted on solving a “cool” problem instead of the problem you actually have

Topics like “you build it, you run it” and “shifting testing/security/data governance left” are popular: moving things to the earlier stages of software development, empowering engineers, and shifting control bring proven benefits.

Yet, what is the cost? What does it mean for the developers who are involved?

The benefits for developers are clear: you get more control, you can address issues earlier in the development cycle, and you shorten the feedback loop. However, your responsibilities grow beyond your code – they now include security, infrastructure, and other things that have been “shifted left”. That’s especially significant since the best practices in those areas are constantly evolving – the demands of keeping up are high (and so is the cost!).

What are the solutions that can help you keep the benefits of DevOps and Shifting Left? What can we do to break the grip of the dark side? Let’s find out!

The Impact on Developers of Shifting Left Activities

When we shift some of the software development lifecycle activities left, that is, when we move them to an earlier stage of our software development process, we can empower developers. According to the State of DevOps report, companies with the best DevOps practices show more than a 50% reduction in change failure rates, despite having a higher deployment frequency (multiple deployments a day for top performers vs. deploying changes once every 6+ months for bottom performers).

The reason is obvious – the developers in top-performing companies do not have to suffer from a long feedback cycle. For example, when we shift testing from the “Deploy & Release” stage to the “Develop & Build” stage, developers won’t have to wait days or even weeks for QA to verify their changes; they can catch bugs earlier. If we shift testing even further left – to the “Plan & Design” stage – developers won’t have to spend their time building code that is defective by design.

However, it is not a silver bullet – Shifting Left also means that developers have to learn things like testing methodologies and tools (e.g., TDD, JUnit, Spock, build orchestration tools like GitHub Actions).  

On top of that, Shifting Left doesn’t stop with testing – more and more things are shifted left, for example security or data governance. All this adds to the developers’ cognitive load. Developers have to learn these tools, adopt best practices, and keep their code and infrastructure up-to-date as those best practices change.

Growth of Responsibilities

On the one hand, not having a gatekeeper feels great. Developers don’t have to wait for somebody’s approval – they can iterate faster and write better code because their feedback loop is shorter, and it is easier to catch and fix bugs.

On the other hand, the added cognitive load is measurable – all the tools and techniques that developers have to learn now require time and mental effort. Some developers don’t want that – they just want to concentrate on writing their own code, on solving business problems.

For example, one developer may be happy to experiment with deployment tools and migrate a deployment pipeline from Jenkins to Spinnaker to get native out-of-the-box canary support. However, other developers may not be so excited about having to deal with those tools, especially if they already have a lot on their plate.

Steps in a DevOps Journey – Ad-Hoc DevOps Adoption

These additional responsibilities don’t come for free. The actual cost, though, depends on the size of the company and on the complexity of its computing infrastructure. For smaller companies, where a handful of people work together on just a few services/components, the cognitive load is minimal. After all, everybody knows everybody else and what they are working on; exchanging context doesn’t take a lot of effort. Under these conditions, removing artificial barriers and empowering developers by shifting some of the SDLC activities left can bring immediate benefits.

Even in large companies like Netflix there is room for manually managed solutions. For example, if a particular application has caches that have to be warmed up before it can start accepting production traffic, then a team that owns that application can create a manually managed deployment pipeline to make sure that a custom warm-up period is observed before the application can start taking traffic (so a redeployment won’t cause a performance degradation). 

Steps in a DevOps Journey – Paved Path

However, as companies grow, so does the complexity of their IT infrastructure. Maintaining dozens of interconnected services is no longer a trivial task. Even locating their respective owners is not so easy. At this point, companies face a choice – either reintroduce gatekeeping practices that negatively affect productivity, or provide a paved path: a set of predefined solutions that codifies best practices and takes away mental toil, allowing developers to concentrate on solving business problems.

Creating and maintaining a paved path requires investment – someone has to identify the pain points, organise tooling for the paved path that developers can interact with, write documentation, and invest in developers’ education. Someone also needs to monitor the outliers – if applications that are not using the paved path solution perform better, maybe it is worth borrowing whatever they are doing and incorporating it into the paved path?

The investment in the paved path is balanced by the decreased cognitive load for developers. Developers can stop worrying about all the things that are being shifted left and start concentrating on delivering value, on solving core business problems. For example, they can rely on CLI tools (such as Netflix’s newt) or an internal developer portal (like Spotify’s Backstage) to bootstrap all required infrastructure in one click – no more manual setup of GitHub repositories, CI/CD pipelines, and cloud resources like AWS security groups and scaling policies. All this can – and should – be provided out-of-the-box.
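
As a rough sketch of the shape such a one-click bootstrap might take, here is a hypothetical internal CLI; every function, resource, and default in it is invented and merely stands in for whatever your platform team actually exposes (newt and Backstage have their own, different interfaces).

```python
import argparse

# Hypothetical building blocks a platform team might expose; each would wrap the
# real APIs (source control, CI/CD, cloud) behind the paved-path defaults.
def create_repo(name: str) -> None:
    print(f"[repo] created {name} with default branch protection")

def create_pipeline(name: str) -> None:
    print(f"[ci/cd] created pipeline for {name} with test and canary stages")

def create_cloud_resources(name: str, env: str) -> None:
    print(f"[cloud] created security group and scaling policy for {name} in {env}")

def bootstrap(name: str, env: str) -> None:
    """One command instead of several manual setup steps."""
    create_repo(name)
    create_pipeline(name)
    create_cloud_resources(name, env)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Bootstrap a new service on the paved path")
    parser.add_argument("name")
    parser.add_argument("--env", default="test")
    args = parser.parse_args()
    bootstrap(args.name, args.env)
```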

On top of that, if the experts take care of migrations, developers don’t have to deal with the cognitive load caused by those migrations. For example, when streaming backend services at Netflix had to be updated to pick up a new version of the AWS instance metadata API to improve security, the existing paved path was changed transparently for the developers. Encapsulating AWS settings as code allowed rolling out the change for the services that used the paved path with zero additional cognitive load and zero interruptions.

Steps in a DevOps Journey – Multiple Composable Paved Paths

Finally, as companies keep growing, one paved path cannot encompass all the diversity of best practices for the services and components being developed. Multiple paved paths have to be created and maintained, and common parts between these paths have to be identified and reused, requiring additional investment.

For example, Netflix streaming services have to serve millions of users, so their latency and availability requirements are drastically different from internally facing tools that are used by Netflix employees. Both groups will have different best practices – the streaming apps will benefit from canaries and red-black deployments, which don’t make much sense for the internal tools (since they may not have enough traffic to provide a reliable signal for canaries). On the other hand, these two groups may have common practices, like continuous testing and integration, and it makes sense to codify those practices as separate building blocks to be reused across different paved paths.

Next Steps in YOUR DevOps Journey

The first step is identifying the actual problem caused by shifting left. If you have a small cross-functional team, the hardest problem may be identifying what you need to shift left. For example, your team may be suffering from deployment failures – in this case you may need to invest in testing and CI/CD pipelines.

Even if you don’t find the right solution right away, migrating from one solution to another does not require that much time and effort. The worst that can happen is choosing the wrong pain point – the wrong problem to solve. However, if you find the right leverage point, then even an imperfect solution can improve things. For example, if you realise that Jenkins-based CI pipelines are hard to manage and it would be better to migrate to GitHub Actions, this kind of migration is not prohibitive when you deal with just a handful of services.

The actual developer experience plays a crucial role here – it always makes sense to listen to developers, to observe their work so you can find out what the main problem is.

Without that, you may end up chasing the latest trend instead of problems that you actually have. I’ve seen people deciding to copy practices that are popular in large companies, and sometimes it is the right call, but often you don’t even have the same kind of problem that other companies do. Sometimes things that are antipatterns for other companies can be a right fit for you. For example, you may not need to invest in microservices orchestration when your problem is simple enough to be solved with a monolith (even though very often a monolith is a genuine anti-pattern).

Another thing to consider is a build-vs-buy problem. Many companies suffer from Not Invented Here syndrome, coming up with custom solutions instead of choosing a third party tool. On the other hand, even third party tools have to be integrated with each other and existing internal tools to provide a seamless developer experience. (Personally, I think if the problem is either unique to your company or close to your company’s core business, it may be worth investing in custom solutions.)

Finally, to make sure you’re heading in the right direction, it makes sense to keep track of metrics like Deployment Frequency (how often do you deploy to production? Once a month? Once a week? Daily?) and Change Failure Rate (how often do your deployments break?). Another set of metrics to monitor is time to bootstrap a service and time to reconfigure a service (or a fleet of services).

Together, those metrics can help you estimate your ability to deliver quality code as well as the time developers spend on non-coding activities – and hopefully, they can help you make the next step in your DevOps journey.

Adopting an API Maturity Model to Accelerate Innovation

Key Takeaways

  • A common side effect of digital transformation is having to address the problem of API maturity
  • With widespread API acceptance, you begin to get API sprawl. API sprawl results when you have an unplanned and unmanaged proliferation of APIs to address day-to-day business issues.
  • Managing APIs at scale requires top-down oversight.
  • When considering the lifecycles and maturity of APIs, there are two phases: API maturity and API program maturity.
  • The ideal API program improvement cycle consists of five stages: Assess and Explore,  Design and Recommend, Build and Implement, Test and Monitor, and Operate the New API Program. 

Digital transformation can impact every aspect of an organization when it’s done correctly. Unfortunately, a common side effect of digital transformation is having to address the problem of API maturity. APIs tend to become the bridges that drive business growth, but with widespread API acceptance, you can begin to get API sprawl. API sprawl results when you have an unplanned and unmanaged proliferation of APIs to address day-to-day business issues. It describes both the exponentially large number of APIs being created and the physical spread of the distributed infrastructure locations where the APIs are deployed.

Companies are seeing their APIs spread out across the globe at an unprecedented rate. This API sprawl presents a unique challenge for organizations wishing to maintain consistency in quality and experience among distributed infrastructure locations.

Managing APIs at scale requires oversight. It also requires a pragmatic approach that should start with an API program initiative that unifies APIs based on logical groupings. The program should package APIs as a product or service to drive adoption and facilitate management for their entire lifecycle. The challenge is that creating a viable program to manage API maturity is a slow process.

This article will offer a framework for building a mature API initiative. The framework uses a four-level API program maturity model that results in the evolution of a holistic API-driven business.

What is an API Maturity Model?

When considering the lifecycles and maturity of APIs, there are two phases: API maturity and API program maturity.

API maturity is specific to design and development and follows a process consistent with software development maturity. API maturity ensures that the APIs conform to recognized API standards and architectural styles, such as REST. When discussing API maturity, you are talking about a set of APIs created for a specific application or purpose.

API program maturity takes priority when considering APIs on a companywide scale, i.e., the myriad of APIs a company amasses over time to meet various business objectives. With API program maturity, bundling APIs as unified services is necessary. An API program maturity model offers a benchmark to streamline APIs to promote business innovation.

The API Program Maturity Model

API program maturity assesses the non-functional metrics of APIs from the perspective of technology and business. The technical API metrics include performance, security, experience, and scalability. The business API metrics relate to improvements in processes and productivity that indirectly affect time and costs.

Like all well-thought-out business processes, API programs should start small and grow gradually. API programs must be structured to follow a continuous improvement cycle. Metrics should improve as the API program moves through a series of transitions from lower to higher maturity levels.

Before starting your journey through the API maturity model, you must start by perceiving APIs as tools. You will then progress through the model, perceiving APIs as components, platforms, and ecosystems as you reach higher maturity levels. Each level is characterized by how APIs enable everyday business processes.

The Four Levels of API Program Maturity

When you consider API program maturity as part of a holistic approach to corporate digital transformation, API programs can be characterized by four maturity levels:

Level 1: “The API Dark Age” – APIs as Tools for Data Acquisition

Historically, APIs have been built to facilitate data acquisition. The early APIs from Salesforce and Amazon are prime examples. Those types of APIs were designed to standardize data sharing across multiple business applications.

The first level of API program maturity is creating a standardized data access interface for data acquisition that offers a single source of truth. These types of APIs are categorized into different business functions. For example, you have separate APIs to access financials, sales, employee, and customer data.

Your organization achieves API Program Maturity Level 1 when you have established best practices for API design and architecture. Some examples of best practices include (see the sketch after this list):

  • Designing APIs with ease of integration and reusability in mind
  • Creating a consistent interface across all APIs
  • Incorporating versioning in the design to support multiple clients simultaneously
  • Ensuring scalability of the APIs to accommodate changing user needs
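
Here is a minimal sketch of the versioning and consistent-interface practices, using Flask purely as an example framework; the resource, routes, and payload shapes are invented.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# v1 keeps serving existing clients unchanged...
@app.route("/api/v1/customers/<customer_id>")
def get_customer_v1(customer_id):
    return jsonify({"id": customer_id, "name": "Ada Lovelace"})

# ...while v2 can evolve the contract for newer clients without breaking v1.
@app.route("/api/v2/customers/<customer_id>")
def get_customer_v2(customer_id):
    return jsonify({
        "id": customer_id,
        "name": {"given": "Ada", "family": "Lovelace"},  # richer, restructured field
    })

if __name__ == "__main__":
    app.run()
```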

However, these APIs are relatively simple and don’t require advanced programmable capabilities. Level 1 is also defined by a relatively immature, hand-cranked approach to API deployment. Manual deployment of individual APIs does not support closely-knit API lifecycle management. The technical focus is on building better APIs as standalone tools.

Level 2: “The API Renaissance” – APIs as Components for Process Integration

When reviewing the history of API development, APIs started to see a renaissance in the 2000s when they started to be leveraged as connectors to integrate different systems. Single sign-on (SSO) is a prime example. SSO is widely used as an API integration tool to authenticate users for secure access to multiple applications and third-party services.

When your organization reaches API Program Maturity Level 2, your API program will use a component-based approach. The component-based approach involves breaking down an application into its separate components. This means that each component can be developed and tested independently from the other parts of the application and then integrated to form a complete application. This approach reduces complexity, simplifies maintenance, and improves scalability.

APIs will be bundled as components that integrate different business and domain-specific processes. These API bundles streamline operations and workflows and connect multiple departments. They may even extend to integrate workflows and interactions with external partners.

Your organization takes its first steps toward utilizing APIs for business when you reach Level 2. By approaching APIs as components, Level 2 maturity gives you a catalog of APIs that are standardized and reusable. Level 2 also advances API development and lifecycle management by improving development cycles, focusing on standardization and streamlined automation through CI/CD (continuous integration/continuous delivery) pipelines.

Level 3: “The Age of API Enlightenment” – APIs as Platforms for a Unified Experience

APIs are treated as components during the API renaissance to simplify integration and reusability. Level 3 is the API Enlightenment Age and extends development further to make APIs more user-friendly and valuable.

When you reach Level 3, APIs are no longer considered components or discrete tools that improve business workflows. The focus is now on building API suites that drive better workflows by creating a connected experience. Recall that API components enable API providers to break down applications while designing and building. API suites refer to how API providers group their functionality so that API consumers can integrate with them for a better experience.

For example, a logistics company relies on a fleet of trucks and delivery vans for business continuity. It will use an API suite to monitor and manage all aspects of its fleet. At Level 3, you expect a well-conceived API suite that incorporates multiple APIs to handle everything from monitoring individual trucks to mapping routes and providing analytics for fleet performance.

At Level 3, APIs are pivotal in defining the user experience (UX). The API suite becomes the backbone of user-facing applications. In our truck fleet example, the front-end software the company uses for fleet management relies on APIs to drive the end-user experience, so the API suite becomes the backend platform that provides the interface for the entire software package.
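
To make the idea of a suite concrete, here is a hedged sketch of a facade that groups several hypothetical fleet APIs behind one consumer-facing surface; all class and method names are invented.

```python
# Invented stand-ins for the suite's underlying APIs.
class TelemetryApi:
    def latest(self, vehicle_id):
        return {"vehicle": vehicle_id, "speed_kph": 72, "fuel_pct": 64}

class RoutingApi:
    def optimize(self, vehicle_id, destination):
        return ["depot", "ring road", destination]

class AnalyticsApi:
    def summary(self, period):
        return {"period": period, "on_time_pct": 93.5}

class FleetApiSuite:
    """One consumer-facing surface grouping the individual fleet APIs."""
    def __init__(self):
        self.telemetry = TelemetryApi()
        self.routing = RoutingApi()
        self.analytics = AnalyticsApi()

suite = FleetApiSuite()
print(suite.telemetry.latest("truck-17"))             # monitoring an individual truck
print(suite.routing.optimize("truck-17", "warehouse-3"))  # mapping a route
print(suite.analytics.summary("2023-Q1"))             # fleet-wide performance analytics
```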

When you reach Level 3, the API program now plays an essential role since the API suite is elevated to a mission-critical service. At this stage, API consumers become heavily invested, and API reliability and maturity are highly important. Any API program operating at Level 3 attains a degree of technical maturity, including:

  • Deployment: You use the batched deployment of the API suite, and it’s closely coupled with API lifecycle stages and version control. 
  • Performance: APIs support a cloud-native environment for better scalability and elastic workloads to handle data traffic.
  • Security: Multi-layered security is enabled to ensure strict authentication and authorization procedures.
  • Automation: The CI/CD pipeline is fully automated, including rigorous API testing.
  • Experience: A self-service API portal is also in place for speedier developer onboarding.

Level 4: “The Age of API Liberalization” – APIs as Ecosystems for Business Transformation

When your organization reaches API program maturity Level 4, you will have fully externalized APIs as products. This final stage of API evolution is driven more by business needs than technology. You may already have a well-oiled technology stack at this level and are driving API adoption among internal and partner stakeholders because APIs generate a lot of business value. The next logical progression is to externalize this value by monetizing it.

With Level 4, you are adopting a new approach, API-as-a-Product. At this level, APIs may be offered to customers using an as-a-service (aaS) subscription model. Depending on the nature of your company’s business, API-as-a-Product can be provided as a standalone or complementary service. Either way, the APIs are tightly integrated into your product, marketing, and sales organizations, so everyone can collaborate to boost this newfound revenue stream.

At Level 4 program maturity, the API program becomes the business growth engine. Some indicators that you have achieved Level 4 include:

API Governance

You have a dedicated API product management group. This group ensures that all APIs are developed based on a predefined set of rules. It also defines API lifecycle progression policies and ensures APIs adhere to architectural and security compliances.

API Observability

Your team goes beyond standard monitoring of APIs, capturing the internal state of API business logic to gather actionable intelligence data about performance.

API Ecosystem

You have also built an API community for developers and consumers to exchange views and seek support. API advocacy forums further augment your API ecosystem to bolster the adoption of APIs.

The API Program Improvement Cycle

No API program is ever perfect. Any API governance framework must have a provision for periodic audits to determine the current maturity level of any API program.

Regardless of the API maturity level, it is necessary to adopt a DevOps approach and continuously improve API maturity in small sprints. Applying a DevOps approach also requires building an organization-wide consensus for a more agile, faster improvement cycle with small increments.

The ideal API program improvement cycle consists of five stages:

  1. Assess and Explore

The first stage is to assess the current state of the API program at both a technology and business level and explore possibilities to improve it. However, technological maturity precedes business maturity and should be the core focus of maturity Levels 1 and 2 above. 

It is also essential to set small goals as sub-levels when exploring areas to improve rather than trying to leapfrog from one level to the next. You can define these sub-levels internally to improve one specific aspect of the API program, such as deployment automation, security, or scalability.

  2. Design and Recommend

This second stage is the most crucial decision point in the improvement cycle. You collate the technical specifications and business objectives from different stakeholders at this stage. Then you can recommend changes in the underlying API management tech stack that should be part of the current improvement cycle.

  3. Build and Implement

Stage three is the implementation stage of the improvement cycle. This stage encompasses development and configuration enhancements based on the proposed recommendations.

  4. Test and Monitor

In stage four, the rubber meets the road, and you test-drive the API ride. This stage is when you monitor vital performance and improvement metrics to gauge the overall effectiveness of the API improvement cycle. This stage also tends to be prolonged since you must transition back and forth with stage three until the metrics show measurable improvement.

  5. Operate New API Program

Once the test and monitor stage is complete and you see real improvement, the final stage is production deployment, where you productionize the new API program and get it up and running.

Level Up Your API Program Maturity Today

The different levels of API program maturity presented here should provide a clear pathway with logical milestones to help your organization transition from a low to a high level of API implementation. However, there is a more significant challenge you will need to tackle.

Your API program symbolizes the organization-wide ethos of adopting APIs. It’s an ideal vision that positions API evolution as one of the primary engines driving business growth. However, for any API program to succeed, it must be established as a horizontal function that cuts across departments and teams.

In most enterprises, each department lacks clarity and visibility into the work of other departments. This is one of the reasons why enforcing governance and standardization takes so much work. It also increases the chance of creating duplicate APIs.

API team silos can present several challenges, such as a lack of communication, understanding, and visibility. When teams are siloed, creating an integrated strategy for API development can be challenging, and each team may have different priorities, leading to delays and errors throughout the process. Siloed teams also have limited opportunities for collaboration and knowledge sharing, which could otherwise improve the quality of the APIs being developed.

A horizontal API program function cuts across this interdepartmental siloing and helps to ensure consistent governance and standardization of APIs.

Here are a few overarching rules you can apply to counter any challenges to the continuous improvement cycles:

Outside-in Consensus Building

An outside-in approach requires analyzing business workflows to devise the right digital experiences around them. Rather than adopting an inside-out approach (“build it, and they will come”), an outside-in approach is much more effective at capturing the expectations of the various stakeholders.

Top-down Cultural Shift

Finding the best way to drive a companywide cultural shift is a highly debatable topic. Since horizontal alignment is required for a successful API program, a top-down rather than bottom-up approach is less likely to create unfettered API sprawl.

The top-down approach to API development offers several advantages, including improved time-to-market, shorter development cycles, and easier maintenance. It also allows for a greater degree of clarity when it comes to the architecture of the API. This makes it easier for developers to know what they have to work with and where their responsibilities lie in the overall project. Additionally, a top-down approach can reduce the effort needed to ensure the APIs are secure, reliable, and well-documented.

Strategic Point of View

The initial levels of API program maturity consider APIs as another part of the technical toolkit. Remember that this is a short-term, tactical perspective. As your API program continues to evolve, it is essential to continuously strive towards building a strategic vision. That way, the API program starts delivering value that can be measured in business-level KPIs.

Working your way to the highest API program maturity level will take time and effort while simultaneously managing stakeholder expectations. However, developing a mature API program will unlock new opportunities for accelerated business innovation, leading to growth.

The Journey from Underrepresented IC to CTO: How Open Source Helped

Key Takeaways

  • It is hard for women to make the leap from IC to CTO when they don’t see themselves represented.
  • You need to examine the pros and cons when thinking about your career from a woman’s perspective.
  • When taking on an open source initiative you need to consider how to develop it from zero.
  • Apache ShardingSphere is an example of a healthy and sustainable open source community.
  • Know the boundary between commercialization and open source, and consider the feasibility of open-source commercialization.

Background

“How do you feel about your job as a CTO?” As a woman striving to make a name for herself and redefine expectations in the tech and open-source fields, I have often been asked this question in interviews and conferences. At first, I didn’t have a specific answer because I assumed that everything happened as it should. My goal was clear: I visualized where I wanted to be, set my goals, planned my roadmap, and then worked hard, hoping luck would also be on my side.

Recently, a question I saw on Twitter, “What’s the future for a DBA?” reminded me of my career transition and the discomfort I felt during my last performance evaluation when I was up for a promotion at my previous employer. I spent almost twice as much time as the other male candidates answering the interviewers’ aggressive questions and demonstrating my competence for the role. I felt frustrated, wronged, and unfairly treated at the time. It was the first time I had noticed this kind of gender disparity.

At the same time, I reflected on my growth up to that point and the kindness I found throughout my open-source and tech journey. These mixed emotions prompted me to reconsider the question, “How do you feel about your job as a CTO, especially as a woman?”

Taking a step back and reflecting on everything that happened, I think it’s a good time to share valuable points and experiences that made me who I am today:

  • How to view your tech career and make career shifts
  • How to leverage open-source to advance your career prospects
  • The distinction between IC (Individual Contributor) and CTO
  • How to deal with open-source monetization
  • The true story of an Asian woman in technology

Overall, I believe that these topics can be helpful to anyone, regardless of gender, who is seeking to build a successful career in the tech industry.

Overview

The following diagram depicts the timeline of my journey, with different focuses and topics at each stage. If you’re reading this, I suspect that you’re going through a similar career transition or at least considering one. I will use this timeline as a thread to share my thoughts and experiences based on these points.

From a tech role to becoming an Individual Contributor

Transitioning from a tech role to becoming an Individual Contributor (IC) was a pivotal moment in my career journey. I began as a Database Administrator (DBA), and while I was grateful for the experience, I knew I needed to set myself apart and differentiate myself from the average DBA to advance in my career. I realized that simply working hard and being a “reactive” employee would not be enough to achieve my goals. So, I turned to open source and programming to find a solution.

After a few years as a DBA, I made the decision to become a programmer (DevOps). This move was met with resistance from my director and colleagues, but I persisted and eventually found success in open-source and DevOps. I was able to leverage my programming skills to create new tools, platforms, and software to solve companies’ and people’s infrastructure technology challenges. This role transition allowed me to become a “proactive” value creator rather than a “reactive” firefighter.

Why open-source?

One of the reasons why I turned to open source was that I saw the value in it for my professional development. While some people view open-source as having no relevance to their job or career, I believe it can be a powerful tool. Open-source allows you to solve problems and contribute to projects, giving you the opportunity to grow your skills and reputation in the industry.

If you are invited to contribute to an open-source project, you may wonder what you can gain from it. As an open-source contributor or maintainer, you can benefit in several ways. For instance:

  • You can improve your programming skills and knowledge by working on real-world projects with experienced developers.
  • You can build a portfolio of work that showcases your abilities and demonstrates your contributions to the community.
  • You can gain recognition and respect from your peers and potential employers by being an active and valuable member of an open-source community.
  • You can potentially earn income through consulting or support services related to open-source projects.

Overall, transitioning from a tech role to becoming an IC was a challenging yet rewarding experience. By embracing open-source, I was able to take my career to the next level and become a proactive value creator in the tech industry.

Becoming an open-source contributor or maintainer can have many benefits, including enhancing your skillsets, improving your understanding of the tech field, networking and collaboration opportunities, professional opportunities, and career development. By participating in open-source communities, you can learn new skills, gain practical experience, make important professional connections, and potentially advance your career. If you’re interested in getting started in open source, look for well-known and welcoming communities, consider participating in hackathons or other events, and explore opportunities to work for commercial open-source companies.

Don’t let others define who you are or limit your options

If I were to share my experiences and insights about women in technology and open source, I’d say that it’s unfortunate that there is still a significant gender gap in STEM fields, and it’s important that we continue to work towards creating a more inclusive and diverse environment for everyone.

As for the career options in open-source, I can only provide some suggestions for individuals looking to advance their careers in this field. Each option comes with its own set of challenges and opportunities, and it’s important for individuals to consider their skills, interests, and priorities when making a decision. It’s also important to note that the difficulty level may vary for each person, depending on their experience and background.

In any case, it’s encouraging to see that there are many opportunities available for individuals who are passionate about open-source and are looking to make a career out of it. As open-source continues to grow and evolve, I believe that there will be even more opportunities and challenges that arise, and it’s important for individuals to stay adaptable and keep learning in order to stay competitive in the field.

How to choose your career

Choosing the right career path can be a challenging task. With numerous options available, it can be difficult to make the best choice. You may also wonder if it’s the right time for you to make a career change. Ultimately, the responsibility of choosing your career lies with you. However, the following methods and ideas can offer some useful insights to aid your decision-making process.

To begin, it’s important to analyze yourself. Consider your interests, skills, strengths, and weaknesses. Reflect on what motivates you, what you enjoy doing, and what you’re passionate about. This introspective process can help you identify career options that align with your personal preferences and goals.

Next, research various industries and professions that match your interests and skills. Look into the current and future job market trends, salary ranges, and career growth opportunities. It’s essential to have a good understanding of the industry before committing to a career path.

Furthermore, consider seeking advice from professionals in the field you’re interested in. Conduct informational interviews with people who have experience in your desired career path. This will provide you with valuable insights and help you understand the day-to-day realities of the profession.

Once you have identified potential career paths, it’s important to gain relevant experience and skills. Consider taking courses or training programs to develop your expertise. Volunteering or interning in your desired field can also provide valuable hands-on experience.

Finally, it’s important to regularly reassess your career path. As you gain experience and knowledge, your interests and goals may change. Stay open to new opportunities and be willing to make adjustments to your career plans as necessary.

In summary, choosing the right career path requires introspection, research, and gaining relevant experience. Regularly reassess your career path and stay open to new opportunities. By following these methods and ideas, you can make an informed decision about your career and work towards achieving your goals.

Interests

When it comes to choosing a career, there are many factors to consider. However, I believe that interests should be prioritized. While compensation and company brand are important, job satisfaction is essential for long-term success.

If a job doesn’t align with your interests and passions, it can feel like torture to show up every day. This can lead to decreased motivation, productivity, and competitiveness. On the other hand, if your job is fulfilling and enjoyable, you’ll look forward to going to work each day. You’ll be eager to learn and grow, which can lead to increased creativity and positive energy.

I personally experienced the benefits of pursuing my interests in my career. In 2021, I spent my holidays working on GitHub, which demonstrated my passion for programming and the ShardingSphere community. I prioritized this work over other leisure activities, which ultimately led to my promotion. However, I ultimately decided to resign to pursue my startup venture, which aligned even more closely with my interests.

Of course, it’s important to balance your interests with practical considerations such as salary and job security. However, prioritizing your passions can lead to greater fulfillment and success in the long run. When you’re passionate about your work, you’re more likely to be motivated, productive, and innovative. So when considering your career path, don’t forget to prioritize your interests and passions.

Circumstance

When considering your circumstances, there are two important factors to keep in mind:

  1. Industry Outlook: Choosing a promising industry is key to securing a job with a high starting salary and future opportunities for growth. It’s important to strive in the right direction and move up, rather than run in the opposite direction. Even if you’re a veteran in a shrinking industry, my advice is to pick a growing industry and jump into it. This can lead to a bright new career path and greater financial stability.

  2. Work Environment: The environment you work in can shape your mindset and character, which can ultimately impact your future earnings and growth. It’s not just about the salary figure, but the people and environment that you work with. When I graduated with my Master’s degree, I chose to work in the Internet industry over a government department because I was drawn to the open and flourishing environment. As a result, I became more extroverted, open, and logical over time. The people and environment you interact with can have complex effects on your personal and professional growth.

In summary, when considering your circumstances, it’s important to choose a promising industry that aligns with your interests and jump into a work environment that fosters personal and professional growth. By taking these factors into account, you can set yourself up for success in the long run.

Capabilities

Capabilities are composed of both hard and soft skills, which are equally important in my opinion. When searching for a soul mate, their looks may not be the only deciding factor, as their qualities, personalities, and other traits also play a significant role in our decision-making. Similarly, in choosing a career, it’s important to not only focus on your expertise, but also actively practice your communication, presentation, and other soft skills.

I remember when I gave my first offline talk, I was so nervous that I forgot my department name and only managed to pronounce ‘a … en …’. Despite spending a week writing and reciting every word of my slides, my nerves got the better of me. However, with practice and persistence, I was able to improve and can now give fluent presentations within a day or two. This experience serves as a reminder that it’s possible to learn new skills if you allow yourself to try and practice them.

In summary, developing both hard and soft skills is essential to achieving success in your career. Don’t be afraid to step out of your comfort zone and practice new skills – it’s the key to personal and professional growth.

Breaking Stereotypes: Empowering Women to Pursue Their Passions and Goals in Evolving Career Landscapes

As a woman, I want to address a topic that’s dear to me. There’s a prevalent belief that men are better suited for certain careers or disciplines, while women are better suited for others. However, I reject this notion. These ideas are based on outdated stereotypes and biased surveys. If we cling to these beliefs, we’ll never see progress. The world is constantly evolving, and people’s mindsets are changing too. For instance, some men choose to stay at home and care for their children, while some women aspire to become pilots, not flight attendants. Don’t let others define who you are or limit your options. It’s up to you to make choices that align with your passions and goals. I understand how difficult it can be, as I’ve struggled with this myself.

I believe that sharing personal stories can help others, particularly women, who may feel isolated in their struggles. It’s important to know that you’re not alone and that there are countless success stories out there. However, I also want to caution against assuming that you’re the only one facing challenges. There are many men and women who face various obstacles and are working hard to overcome them. You’re not the only one who’s experienced difficulties. For example, I used to believe that forced marriage only affected women until my gym coach confided in me about his recent pressure to marry. It’s a reminder that these issues can affect anyone.

Open source startup

If you’re up for a serious challenge in your open-source career, consider starting an open-source startup. While I won’t go into the details of open-source monetization, from my personal experience, the key paths to success for an open-source startup are through open-core, support and service, and SaaS models.

However, despite this knowledge, it’s still incredibly difficult to make it work and achieve a successful exit. According to statistics, the failure rate for startups is between 80% and 90%. As someone who has explored the world of open-source businesses, I would like to share my own journey transitioning from a developer to a CTO and co-founder of SphereEx.

Open source business vs Proprietary technology business

Evaluation models for standard software companies and open-source business companies differ significantly. In an open-source business, the focus of valuation shifts depending on the stage of the company. At the start, venture capitalists are interested in the maintenance team’s skills and basic statistics of the open-source project, such as the number of contributors and GitHub launches. As the company develops, the product’s market fit, the number of key account customers, and user adoption become important indicators to evaluate. Finally, revenue and its growth curve become the primary key performance indicators.

Compared to proprietary technology, open-source provides several advantages for startup companies, including trust, credibility, a good reputation, and a large user base. These factors offer significant value, such as aiding customer acquisition and talent hunting. Furthermore, open-source is changing the product development paradigm by enabling shorter development cycles to reduce time to market and shortening the distance between developers and customers or users.

However, both standard software and open-source companies aim for market share and revenue, which are inherent properties of any business.

Community vs Company

It’s important to recognize the difference between being a maintainer in an open-source community and a C-level executive in a startup company. It’s true that the two roles require different skill sets, and transitioning from one to the other can be a challenge. However, there are also transferable capabilities that can help you succeed in both roles.

Communication, presentation, and cooperation skills are all essential for building and managing a successful open-source community, as well as for leading a startup company. Additionally, the experience of running an open-source community can provide valuable insights into building a successful business, such as understanding user needs, fostering collaboration, and promoting projects.

Ultimately, the key is to recognize the differences between the two roles, while also building on the skills and experiences that can help you succeed in both.

A Woman’s Perspective

During a C-level meeting, I found myself as the only woman in the room, and the initial atmosphere was uncomfortable for both myself and others present. Despite my qualifications, some clients with traditional views would express suspicion toward me, and I would often have to rely on my male colleagues to move the conversation forward.

This experience left me feeling frustrated and questioning whether looks and age take precedence over skills and experience. However, these challenges ultimately provided me with opportunities to strengthen my self-confidence and reconnect with my ambitions and patience.

It’s important to recognize and utilize our strengths and weaknesses, as this is essential for our personal growth and survival. It’s also important to acknowledge that gender should not imply guilt or blame. As a child, I had internalized negative attitudes towards my gender due to the disapproval of girls in my family. Learning to accept and take care of oneself is a lengthy but necessary process to gain the approval of others.

Wrap-up

In conclusion, while I recognize that this article may contain some abstract or emotional elements, I also understand the importance of providing practical advice and spiritual support to aid us in our journeys. I hope that you have gained some valuable insights from this piece, such as the potential of open-source to propel your career, enhance your startup experience, and inspire your career path.

The internet is a powerful tool that has opened up a vast space for people to connect, share stories, and learn from one another. While it is not without its drawbacks, I believe that the diverse array of experiences and perspectives that are shared online can help drive positive change in society. As a woman living in the 21st century, I consider myself fortunate to have access to opportunities that were not available to women in previous decades.

Unleash the Power of Open Source Java Profilers: Comparing VisualVM, JMC, and async-profiler

Key Takeaways

  • Analyzing the performance of programs is important: open-source tools for profiling have you covered
  • There are two major types of profilers: Sampling and instrumenting profilers; understanding their differences will help you to choose the right type
  • There are three major open-source profilers with different pros and cons: a simple profiler (VisualVM), a hackable profiler with lots of features (async-profiler), and a built-in profiler which obtains lots of additional information (JMC)
  • All of these profilers are sampling profilers that approximate the results, which makes them faster and less intrusive than instrumenting profilers, but also requires support from the Java runtime
  • Using profilers doesn’t come without risks and might sometimes cause performance degradations and rare crashes

I want to convey the foundational concepts and different types of Open Source Java profilers in this article. The article should allow you to choose the best-suited profiler for your needs and comprehend how these tools work in principle.

It is the accompanying post to my “Your Java Application Is Slow? Check Out These Open-Source Profilers” talk at QCon London 2023 in which I dive deeper into the topic and also cover the different profile viewers.

The aim of a profiler is to obtain information on the program execution so that a developer can see how much time was spent in each method during a given period.

But how does a profiler do this? There are two ways to obtain a profile: instrumenting the program and sampling.

Instrumenting Profilers

One way to obtain a profile is to log the entering and exiting of every method that is interesting for the developer.

This instrumentation is what many developers already do when they want to know how long a specific part of their program took.

So the following method:

void methodA() {
      // … // do the work
}

is modified to record the relevant information:

void methodA() {
      long start = System.currentTimeMillis();
      // … // do the work
      long duration = System.currentTimeMillis() - start;
      System.out.println("methodA took " + duration + "ms");
}

This modification is possible for basic time measurements. Still, it gives little information when nesting measured methods, as it’s also interesting to know the relationship between methods, e.g. that methodB() was executed by methodA(). We, therefore, need to log every entry and exit into the relevant methods. These logs are associated with a timestamp and the current thread.

The idea of an instrumenting profiler is to automate this code modification: it inserts a call to the logEntry() and a logExit() methods into the bytecode of the methods. The methods are part of the profiler runtime library. This insertion is usually done at runtime, when the class is loaded, using an instrumentation agent. The profiler then modifies our methodA() from before to:

void methodA() {
      logEntry("methodA");
      // … // do the work
      logExit("methodA");
}
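
The logEntry() and logExit() methods belong to the profiler’s runtime library. As a rough illustration only (not the implementation of any particular profiler), such a runtime could look like the following, recording one event per entry or exit together with a timestamp and the current thread:

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of an instrumenting profiler's runtime library: it keeps
// a per-thread list of entry/exit events, each tagged with a timestamp and
// the thread name, for later post-processing into a profile.
public final class ProfilerRuntime {

    record Event(String method, boolean entry, long nanos, String thread) { }

    // One event list per thread avoids contention between profiled threads.
    private static final ThreadLocal<List<Event>> EVENTS =
            ThreadLocal.withInitial(ArrayList::new);

    public static void logEntry(String method) {
        EVENTS.get().add(new Event(method, true, System.nanoTime(),
                Thread.currentThread().getName()));
    }

    public static void logExit(String method) {
        EVENTS.get().add(new Event(method, false, System.nanoTime(),
                Thread.currentThread().getName()));
    }
}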

Instrumenting profilers have the advantage that they work with all JVMs, as they can be implemented in pure Java. But they have the disadvantage that the inserted method calls incur a significant performance penalty and skew the results heavily. The popularity of purely instrumenting profilers, therefore, has faded in recent decades. Modern profilers nowadays are mostly sampling profilers.

Sampling Profilers

The other type of profiler is the sampling profiler, which takes samples from the execution of the profiled program. These profilers ask the JVM at regular intervals, typically every 10ms to 20ms, for the stack of the currently running program. The profiler can then use this information to approximate the profile. But this leads us to the major disadvantage: shorter-running methods might be invisible in the profile.

The main advantage of sampling profilers is that they profile the unmodified program with low overhead without skewing the results significantly.

Modern sampling profilers typically work by running the following in a loop every 10 to 20ms:

A sampling profiler obtains the list of currently available (Java) threads for every iteration. It then chooses a random subset of threads to sample. The size of this subset is usually between 5 and 8, as sampling too many threads in every iteration would increase the performance impact of running the profiler. Be aware of this fact when profiling an application with a large number of threads.

The profiler then sends each selected thread a signal, which causes it to stop and call a signal handler. This signal handler obtains and stores the stack trace for its thread. All stack traces are collected and post-processed at the end of every iteration.

There are other ways to implement sampling profilers, but I’ve shown you the most widely used technique that offers the best precision.
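
To make the sampling loop more tangible, here is a deliberately naive, pure-Java sketch. It uses the ThreadMXBean management API (the kind of interface VisualVM relies on) rather than the signal-based technique described above, and it samples every thread instead of a subset, so treat it as an illustration of the loop, not as a production profiler:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Naive in-process sampler: every 20ms it dumps all thread stacks and counts
// how often each top frame is observed, approximating where time is spent.
public final class NaiveSampler {

    private static final Map<String, Long> topFrameCounts = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    topFrameCounts.merge(frame, 1L, Long::sum);
                }
            }
        }, 0, 20, TimeUnit.MILLISECONDS);

        Thread.sleep(2_000);                      // sample for two seconds
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
        topFrameCounts.forEach((frame, count) -> System.out.println(count + "\t" + frame));
    }
}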

Different Open Source Profilers

Three prominent open-source profilers currently exist: VisualVM, async-profiler, and JDK Flight Recorder (JFR). These profilers are in active development and usable for various applications. All of them are sampling profilers. VisualVM is the only profiler that also supports instrumentation profiling.

We can distinguish between “external” and “built-in” profilers: external profilers are not directly implemented into the JVM but use APIs to collect the stack traces for specific threads. Profilers using only APIs can target different JVM versions and vendors (like OpenJDK and OpenJ9) with the same profiler version.

The two most prominent external profilers are VisualVM and async-profiler; their main distinguishing element is the API they use. VisualVM uses the official Java Management Extensions (JMX) to obtain the stack traces of threads. Async-profiler, on the other hand, uses the unofficial AsyncGetCallTrace API. Both have advantages and disadvantages, but the JMX and related APIs are commonly considered safer and AsyncGetCallTrace more precise.

The single built-in profiler for the OpenJDK and GraalVM is the Java Flight Recorder (JFR); it works roughly the same as the async-profiler and is as precise but slightly more stable.

I’ll cover the different profilers and their history in the following section.

VisualVM

This tool is the stand-alone version of the NetBeans profiler. From Oracle JDK 6 in 2006 until JDK 8, every JDK included the Java VisualVM tool, which was open-sourced in 2008. The profiler later changed its name to VisualVM, and Oracle did not include it in JDK 9. According to a recent JetBrains survey, VisualVM is the most used open-source profiler. You can obtain it from the VisualVM download website.

Its usage is quite simple: just select the JVM that runs the program you want to profile in the GUI and trigger the profiling.

You can then directly view the profile in a simple tree visualization. There is also the possibility to start and stop the sample profiler from the command line using:

visualvm --start-cpu-sampler 
visualvm --stop-sampler 

VisualVM is a profiler with a simple UI that is easy to use, with the caveat of using less specific JVM APIs.

Async-Profiler

One of the most commonly used profilers is async-profiler, not least because it’s embedded into many other tools like the IntelliJ Ultimate Profiler and Application Performance Monitors. You can download async-profiler from the project’s GitHub page. It is not supported on Windows and consists of platform-specific binaries. I created the ap-loader project, which wraps all async-profiler binaries in a multi-platform binary, making embedding and using the profiler easier.

You can use async-profiler through the many tools that embed it or directly as a native Java agent. Assuming you downloaded the platform-specific libasyncProfiler.so, you can profile your Java application by just adding the following options to your call of the Java binary:

java 
-agentpath:libasyncProfiler.so=start,event=cpu,file=flame.html,flamegraph …

This call will tell the async-profiler to produce a flame graph, a popular visualization.
You can also create JFR files with it:

java 
-agentpath:libasyncProfiler.so=start,event=cpu,file=profile.jfr,jfr …

This call allows you to view the profile in a multitude of viewers.

For the curious, here is a bit of the history of async-profiler:

In November 2002, Sun (later bought by Oracle) added the AsyncGetStackTrace API to the JDK, according to the JVM(TM) Tool Interface specification. The new API made obtaining precise stack traces from an external profiler possible. Sun introduced this API to add a full Java profiler to their Sun Development Studio. Then two months later, they removed the API for publicly unknown reasons. But the API remained in the JDK as AsyncGetCallTrace, and is there to this day, just not exported, so it is harder to use.

A few years later, people stumbled upon this API as a great way to implement profilers. The first public mention of AsyncGetCallTrace as a base for Java profilers is by Jeremy Manson in his 2007 blog post titled Profiling with JVMTI/JVMPI, SIGPROF and AsyncGetCallTrace. Since then, many open-source and closed-source profilers have started using it. Notable examples are YourKit, JProfiler, and honest-profiler. The development of async-profiler started in 2016; it is currently the dominant open-source profiler using AsyncGetCallTrace.

The problem with async-profiler is that it’s based on an unofficial internal API. This API is not well-tested in the official OpenJDK test suite and might break at any point. Although the wide usage of the API leads to a quasi-standardization, this is still a risk. To alleviate these risks, I am currently working on a JDK Enhancement Proposal that adds an official AsyncGetCallTrace version to the OpenJDK; see JEP Candidate 435.

The advantages of async-profiler are its many features (like heap sampling), its embeddability, its support for other JVMs like OpenJ9, and its small code base, which makes it easy to adapt. You can learn more about using async-profiler in the async-profiler README, the async-profiler wiki, and the Async-profiler – manual by use cases by Krzysztof Ślusarski.

JDK Flight Recorder (JFR)

JRockit first developed its runtime analyzer for internal use, but it also grew in popularity with application developers. After Oracle bought the developing company, the features were integrated into the Oracle JDK. Oracle eventually open-sourced the tool with JDK 11, and it has since been the built-in profiling tool for OpenJDK, with no support in other JVMs like OpenJ9.

It works comparably to async-profiler, with the main distinction that it uses the internal JVM APIs directly. The profiler is simple to use: either add the following options to your call to the Java binary:

$ java 
  -XX:+UnlockDiagnosticVMOptions 
  -XX:+DebugNonSafepoints   # improves precision
  -XX:+FlightRecorder 
  -XX:StartFlightRecording=filename=file.jfr 
  arguments

Or by starting and stopping it using the JDK command tool, jcmd:

$ jcmd PID JFR.start
$ jcmd PID JFR.dump filename=file.jfr
$ jcmd PID JFR.stop

JFR captures many profiling events, from sampled stack traces to Garbage Collection and Class Loading statistics. See the JFR Events website for a list of all events. There is even the possibility to add custom events.
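
To illustrate custom events, here is a minimal sketch using the jdk.jfr API; the event name and field are made up for this example:

import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Minimal custom JFR event: it is recorded alongside the built-in events
// and can be inspected later in JDK Mission Control or any JFR viewer.
@Name("com.example.OrderProcessed")   // illustrative event name
@Label("Order Processed")
class OrderProcessedEvent extends Event {
    @Label("Order Id")
    String orderId;
}

public class JfrCustomEventDemo {
    public static void main(String[] args) {
        OrderProcessedEvent event = new OrderProcessedEvent();
        event.orderId = "42";
        event.begin();                 // start timing the work
        // ... the actual work being measured would go here ...
        event.commit();                // stop timing and write the event
    }
}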

You can learn more about this tool in blog posts like JDK Flight Recorder, The Programmatic Way by BellSoft.

The main advantage of JFR over async-profiler is that it is included in the OpenJDK on all platforms, even on Windows. JFR is also considered slightly more stable and records far more events and information. There is a GUI for JFR called JDK Mission Control which allows you to profile JVMs and view the resulting JFR profiles.

Correctness and Stability

Please keep the following in mind when using profilers like the ones I’ve covered: they are just software themselves, interwoven with a reasonably large project, the OpenJDK (or OpenJ9, for that matter), and thus suffer from the same typical problems as the applications they are used to profile:

  • Tests could be more plentiful, especially for the underlying API, which could be tested better; there is currently only a single test (I’m working on it).
  • Tests could be better: the existing test did not even fully test that the API worked for the small sample. It just checked the top frame, but missed that the returned trace was too short. I found this issue and fixed the test case.
  • Lack of automated regression testing: A lack of tests also means that changes in seemingly unrelated parts of the enclosing project can adversely affect the profiling without anyone noticing.

Therefore, take the profiles generated by profilers with a grain of salt. Several blog posts and talks cover the accuracy problems of profilers.

Furthermore, profiling your application might also cause your JVM to crash in rare instances. OpenJDK developers, like Jaroslav Bachorik and I, are working on fixing all stability problems as far as possible in the underlying profiling APIs. In practice, using one of the mentioned profilers is safe, and crashes are rare. If you encounter a problem, please contact the profiler developers or open a GitHub issue at the respective repository.

Conclusion

Modern sampling-based profilers for Java make it possible to investigate performance problems with open-source tools. You can choose between:

  • a slightly imprecise but easy-to-use tool with a simple UI (VisualVM)
  • a built-in tool with information on GC and more (JFR)
  • a tool that has lots of options and can show information on C/C++ code (async-profiler)

Try them out to know what to use when encountering your next performance problem.

A Simpler Testing Pyramid: Getting the Most out of Your Tests

Key Takeaways

  • The benefits of a test should outweigh the costs of writing, running, and maintaining it.
  • Slow tests tend to contribute the most to the cost of a test suite over time.
  • Even a small decrease in test suite duration can offer large time savings over time.
  • Refactor test code with the same care that you refactor production code.
  • Keep your test suite fast; fail the build if your test suite takes too long to run.

Test Labels

Developers use many different labels to describe their automated tests (unit, integration, acceptance, component, service, end-to-end, UI, database, system, functional, or API). Each of these labels has a different semantic meaning, either describing the scope of the test, the types of actions that the test takes, the subject of the test, or the subject’s collaborators. We usually don’t agree on what each of these labels means, and the discussions about their definition tend to be futile. 

Rather than arguing over which labels to use and how to define them, I’ve found it more helpful to use one of two adjectives to label each test: slow or fast. These labels can be just as useful when deciding the makeup of a test suite while allowing developers to objectively classify tests without unproductive arguments.

The choice of test labels is an important influence on the makeup of a test suite. Developers use them to know when to write a test for a given behavior, to know which type of test to write, and to assess the balance of the test suite as a whole. When we get this wrong, we end up with a test suite that either doesn’t provide accurate coverage or provides coverage at an unacceptable cost.

When to Write a Test

When should you write a test for a given piece of production code? Developers who, like me, practice Extreme Programming (XP) or Test Driven Development (TDD) often answer this question with “always”. However, not every piece of code should automatically be tested. For each proposed test, first, weigh the costs of writing the test against the benefits.

I’m not advocating against writing tests. Indeed, for most tests, this is a quick check and the answer is yes. However, this check is useful, especially if a test is slow to run, slow to write, or difficult to maintain. In these cases ask yourself a few questions.

Is the test costly because of a design decision? Can the code be refactored to better accommodate testing? Your tests are the first consumers of your production code. Making code easier to test often makes it easier to consume, improving the quality of your codebase.

Is the test costly because of the testing approach? Would a different testing approach make this test easier to write? Consider using test doubles like fakes or mocks in place of collaborators, as in the sketch below. If your tests need a complicated setup, extract it into a test scenario that can be reused between tests.
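
As a hypothetical illustration (the class and method names are invented for this example, and it assumes JUnit 5 on the classpath), a hand-written fake can stand in for a slow, networked collaborator:

import static org.junit.jupiter.api.Assertions.assertFalse;

import org.junit.jupiter.api.Test;

// A fake payment gateway replaces the real (slow, networked) collaborator,
// keeping the test fast and deterministic.
class CheckoutTest {

    interface PaymentGateway {
        boolean charge(String accountId, long amountInCents);
    }

    // Fake that always declines, letting us exercise the failure path quickly.
    static class DecliningPaymentGateway implements PaymentGateway {
        @Override
        public boolean charge(String accountId, long amountInCents) {
            return false;
        }
    }

    static class Checkout {
        private final PaymentGateway gateway;

        Checkout(PaymentGateway gateway) {
            this.gateway = gateway;
        }

        boolean placeOrder(String accountId, long amountInCents) {
            return gateway.charge(accountId, amountInCents);
        }
    }

    @Test
    void orderIsRejectedWhenPaymentIsDeclined() {
        Checkout checkout = new Checkout(new DecliningPaymentGateway());
        assertFalse(checkout.placeOrder("account-1", 999));
    }
}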

Be careful not to overuse test doubles, as they don’t provide as much confidence as real collaborators. Sometimes this drop in confidence is worth the ease of setup, decrease in test duration, or increase in reliability. However, too much reliance on test doubles may couple your tests to your implementation, resulting in a test suite that provides low confidence and that inhibits refactoring.

Is the test costly because the behavior is inherently difficult to test? If so, consider the importance of the feature you’re testing. If it’s a critical feature involving processing payments then the test might be worth the cost. If it’s a quirky edge case in your display logic then you should reconsider whether or not to write the test.

Is the test costly because it fails unpredictably? If so you must remove it, rewrite it to be more reliable, or separate it from the rest of your test suite. For a test suite to provide useful feedback, you must be confident that test failures represent undesired behavior. If you find that a test is necessary and cannot be made predictable, move it to another test suite that is run less frequently.

The Testing Pyramid

To help with the decision of when to write a test, and what type of test to write, developers often place test labels on a testing pyramid in order to communicate the importance of having more of one type of test than another.

Given the many different labels used to describe tests, every testing pyramid looks a bit different from the others. Try running an image search for “testing pyramid” and you will find only a few duplicate pyramids on the first page of the results. Each pyramid typically has low-cost unit tests at the bottom, high-cost system tests at the top, and several layers of medium-cost tests in the middle.

Before a team can benefit from the testing pyramid, the team must decide on which labels to include in the testing pyramid, what the definition of each label is, and in what order to include the labels on the pyramid.

This is often a contentious decision, as each developer in a team tends to use a different set of labels to describe tests, and there is not wide agreement on what each label means. Indeed, almost every testing pyramid includes unit tests at the bottom, but there is wide disagreement on what the word “unit” refers to. This disagreement reduces the usefulness of the testing pyramid since discussions tend to revolve around the labels rather than reducing the cost of the test suite.

Focus on Speed

Speed tends to be the highest contributor to the cost of a test suite. To get rapid feedback developers should run the test suite multiple times per hour, so even a small increase in the time it takes to run the suite can add up to lots of waiting over time.

Time spent waiting for the tests to run is unproductive time. When a test suite is very slow (taking longer than five minutes to run) developers often work on other tasks while the test is running. This task switching is harmful, as it decreases focus and results in the developer losing context. Once the slow test suite is finished the developer must take additional time to regain context before continuing with their original task.

A Better Pyramid

Focusing on test speed, a simpler testing pyramid emerges.

This pyramid sends a clear message that a test suite should have as many fast tests as possible and just enough slow tests to provide full coverage of desired behavior. It communicates the same message as the more common (and complicated) testing pyramids, but is far easier for developers to understand and agree upon.

While different developers might not agree upon where to place a certain test in a common testing pyramid, it’s easy to know where a given test fits in the pyramid above. Teams only have to agree upon what is a fast test, and what is a slow test. While the threshold may be different depending on the business domain, language, or framework, the speed of tests can be measured objectively.

A Fast Test Suite

Test suites always start out fast, but rarely stay that way. More tests get added over time and developer tolerance for a slow test suite increases. Many developers don’t realize that a fast test suite is a possibility because they have never worked in a codebase where the test suite stays fast.

Keeping a test suite fast takes discipline. Developers must scrutinize any time they add to a test suite, and realize the large benefit gained from even a small decrease in length. For example, if a member of a team of 6 developers spends 4 hours to speed up the tests by 10 seconds, that investment will pay off in just six weeks (assuming developers run tests once per hour during a working day).

Set a Limit

When left unchecked, the length of a test suite increases exponentially over time. That is, the length increases proportionally to the current duration. When the suite runs in 10 seconds a developer might agonize over adding just one second to the build, but once the test suite grows to 3 minutes they might not even notice.

One method to prevent exponential growth is to set a hard limit on your test suite length: fail the build if your test suite takes longer than, for example, one minute to run. If a test run takes too long the build will fail and the developer must take some time to speed up tests before continuing. Don’t fix the build by simply increasing this limit. Rather, take the time to understand why the tests are slow and how you can make them faster.
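
How you enforce the limit depends on your build tool. As one hedged sketch (the package name and one-minute threshold are assumptions, and it relies on the JUnit Platform Launcher API), a small runner can execute the suite and return a non-zero exit code when the run is too slow:

import static org.junit.platform.engine.discovery.DiscoverySelectors.selectPackage;

import java.time.Duration;
import java.time.Instant;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;

// Runs the test suite programmatically and fails the build (non-zero exit)
// if the whole run takes longer than the agreed limit.
public class TimeBoxedTestRun {

    private static final Duration LIMIT = Duration.ofMinutes(1); // agreed team limit

    public static void main(String[] args) {
        LauncherDiscoveryRequest request = LauncherDiscoveryRequestBuilder.request()
                .selectors(selectPackage("com.example"))   // assumed root test package
                .build();

        Launcher launcher = LauncherFactory.create();
        SummaryGeneratingListener summary = new SummaryGeneratingListener();

        Instant start = Instant.now();
        launcher.execute(request, summary);
        Duration elapsed = Duration.between(start, Instant.now());

        boolean testsFailed = summary.getSummary().getTotalFailureCount() > 0;
        boolean tooSlow = elapsed.compareTo(LIMIT) > 0;
        if (tooSlow) {
            System.err.println("Test suite took " + elapsed.toSeconds()
                    + "s, which exceeds the " + LIMIT.toSeconds() + "s limit");
        }
        System.exit(testsFailed || tooSlow ? 1 : 0);
    }
}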

Refactor

Test code must be treated with the same care and scrutiny as production code. Refactor continuously to keep your test code well-structured and fast, therefore minimizing the cost of maintaining and running your test suite. Keep in mind that refactoring tests should not modify the behavior of either the test code or the production code. Rather, it should change your code to be more readable, more maintainable, and faster to run.

If you can’t avoid having a few slow tests, add them to a separate test suite. This slow test suite isn’t meant to be run as often as your main test suite but is there to provide some additional coverage. It should not block the build process but should be run periodically to ensure the behavior it tests is still functioning correctly.

Existing Test Suites

It’s not too late to change your approach if you’ve used a different testing pyramid to shape your current test suite. If you’ve followed a more complicated testing pyramid, it’s likely that many of your test names contain your testing pyramid’s labels.

As a first step, take some time to rename your tests. The new test names should reflect the behavior under test rather than the test label. For example, you might rename the UserIntegrationTest to the UserAuthenticationTest or the RegistrationApiTest to the AddPaidUserTest.

During this process, you’ll likely find some collisions among the new names. These collisions are a warning that you may have multiple tests that cover the same behavior. Take some time to move, combine, rename, or remove these tests to address the duplication.

Once your tests are renamed, reorganize the test directory structure to group tests according to behavior. This organization will keep tests that change at the same time close to each other in your codebase and will help you to catch new tests that cover duplicate behavior.

Slow Tests Suites

A slow test suite must be addressed right away. Immediately set a limit on the test suite duration so it doesn’t get any slower. Next, add some instrumentation to help you find the slowest tests by listing the execution time for each test or group of tests. You’ll likely find some tests during this process that are easy to speed up.

Once you fix these you’ll be left with another group of slow tests that are more difficult to improve. Separate your fast tests so you can run them separately from the remaining slow tests. This will give you an immediate speed bump for some test runs, which will buy you more time to make improvements.

Dedicate time to speeding up your tests on a regular basis. Investigate whether the behavior covered by these slow tests could be covered (or is already covered) by faster tests. A common example of this is covering many edge cases with tests that drive a browser. Using a browser to run tests is time intensive, and the behaviors can often be covered by lower-level tests which tend to run faster.

In Practice

Before your next discussion over whether to write, for example, a system test or an integration test, take a minute to think. You’re likely to find that the distinction between the two matters little. If your goal is to provide high confidence while minimizing cost, then your argument is really about how you can test the desired behavior with the lowest cost test possible. Steer the discussion in this direction and you’ll have a more productive outcome.

Rather than focusing on test labels, focus on what’s important: Write fast tests. If your test is slow, make it faster. If you can’t, try to provide the same coverage with a few tests with a narrower scope. If that fails, ask yourself if the benefit that the test provides is worth the substantial cost of a slow test. If it is worth it, consider moving your slow tests to a separate test suite that doesn’t block the build.

Follow this new testing pyramid and focus on test speed to keep your test suite fast and your confidence high.

Software Architecture and Design InfoQ Trends Report – April 2023

Key Takeaways

  • Design for Portability is gaining adoption, as frameworks like Dapr focus on a cloud-native abstraction model and allow architects to separate business logic from implementation details.
  • Large language models are going to have a significant impact, from helping understand architectural trade-offs to empowering a new generation of low-code and no-code developers.
  • Sustainability of software will be a major design consideration in the coming years. Work is being done to better measure and then reduce the carbon footprint of software systems.
  • Decentralized apps are taking blockchain beyond cryptocurrency and NFTs, but a lack of consumer demand will keep this as a niche pattern.
  • Architects are always looking for improvements on how to document, communicate, and understand decisions. This may be another area where large language models will play a role in the future, acting as forensic archeologists to comb through ADRs and git history.

The InfoQ Trends Reports provide InfoQ readers a high-level overview of the topics to pay attention to, and also help the InfoQ editorial team focus on innovative technologies. In addition to this report and the trends graph, an accompanying podcast features some of the editors discussing these trends.

More details follow later in the report, but first it is helpful to summarize the changes from last year’s trends graph. 

Three new items were added to the graph this year. Large language models and software supply chain security are new innovator trends, and “architecture as a team sport” was added under early adopters.

Trends which gained adoption, and therefore moved to the right, included “design for portability,” data-driven architecture, and serverless. eBPF was removed as it has niche applications, and is not likely to be a major driver in architectural decisions.

A few trends were renamed and/or combined. We consider Dapr as an implementation of the “design for portability” concept, so it was removed as a separate trend. Data-driven architecture is the combination of “data + architecture” and data mesh. Blockchain was replaced with the broader idea of decentralized apps, or dApps. WebAssembly now notes both server-side and client-side, as these are related but separate ideas and may evolve independently in the future.

The portability aspect of “design for portability” is not about being able to pick up your code and move it. Rather, it creates a clean abstraction from the infrastructure. As InfoQ editor Vasco Veloso says, “whoever is designing and building the system can focus on what brings value, instead of having to worry too much with the platform details that they are going to be running on.”

This design philosophy is being enabled by frameworks such as Dapr. Daniel Bryant, InfoQ news manager, sees the benefit of the CNCF project as providing a clearly defined abstraction layer and API for building cloud-native services. Bryant said, “[with integration] it’s all about the APIs and [Dapr] provides abstractions without doing the lowest common denominator”.

A recent article by Bilgin Ibryam described the evolution of cloud-native applications into cloud-bound applications. Instead of designing a system with logical components for application logic and compute infrastructure, cloud-bound applications focus on the integration bindings. These bindings include external APIs as well as operational needs such as workflow orchestration and observability telemetry.

Another technology that supports designing for portability is WebAssembly, specifically server-side WebAssembly. Often WebAssembly is thought of as a client-side capability, for optimizing code running in the browser. But using WebAssembly has significant benefits for server-side code. InfoQ Editor Eran Stiller described the process for creating WebAssembly-based containers. 

Instead of compiling it to a Docker container and then needing to spin up an entire system inside that container on your orchestrator, you compile it to WebAssembly and that allows the container to be much more lightweight. It has security baked in because it’s meant to run in the browser. And it can run anywhere–in any cloud, or on any CPU, for that matter. – Eran Stiller

More information about Dapr and WebAssembly can be found by following those topics on InfoQ.

The news around AI, specifically large language models such as GPT-3 and GPT-4, has been impossible to ignore. This is not simply a tool for software professionals, as adoption by everyday people and coverage across all forms of media have demonstrated. But what does it mean to software architects? In some ways, it is too early to know what will happen.

With ChatGPT and Bing, we’re just beginning to see what is possible with large language models like GPT-3. This is the definition of an innovator trend. I don’t know what will come of it, but it will be significant, and something I look forward to seeing evolve in the next few years. – Thomas Betts

While the future is uncertain, we have optimism that these AI models will generally have a positive benefit on the software we build and how we build it. The code-generation capabilities of ChatGPT, Bing chat, and GitHub Copilot are useful for writing code and tests and allowing developers to work faster. Architects are also using the chatbots to discuss design options and analyze trade-offs.

While these improvements in efficiency are useful, care must be taken to understand the limitations of AI models. They all have built-in biases which may not be obvious. They also may not understand your business domain, despite sounding confident in their responses.

This will definitely be a major trend to watch in 2023, as new products are built on large language models and companies find ways to integrate them into existing systems.

Last year, we discussed the idea of “data + architecture” as a way to capture how architects are considering data differently when designing systems. This year we are combining that idea with Data Mesh under the heading of “data-driven architecture.” 

The structure, storage, and processing of data are up-front concerns, rather than details to be handled during implementation. Blanca Garcia-Gil, a member of the QCon London programming committee, said, “when designing cloud architectures there is a need to think from the start about data collection, storage, and security, so that later on we can derive value from it, including the use of AI/ML.” Garcia-Gil also pointed out that data observability is still an innovator trend, at least compared to the state of observability of other portions of a system.

Data Mesh was a paradigm shift, with teams aligned around the ownership of data products. This fits the idea of data-driven architecture, as well as incorporating Conway’s Law into the overall design of a system.

While there has been more adoption in designing for sustainability, we chose to leave it as an innovator trend because the industry is just starting to really embrace sustainable systems and designing for a low carbon footprint. We need to consider sustainability as a primary feature, not something we achieve secondarily when trying to reduce costs. Veloso said, “I have noticed that there is more talk about sustainability these days. Let’s be honest that probably half of it is because energy is just more expensive and everybody wants to reduce OPEX.”

One of the biggest challenges is the difficulty in measuring the carbon footprint of a system. Until now, cost has been used as a stand-in for environmental impact, because there is a correlation between how much compute you use and how much carbon you emit. But this technique has many limitations.

The Green Software Foundation is one initiative trying to help create tools to measure the carbon consumed. At QCon London, Adrian Cockcroft gave an overview of where the three major cloud vendors (AWS, Azure, GCP) currently stand in providing carbon measurements.

As the tooling improves, developers will be able to add the carbon usage to other observability metrics of a system. Once those values are visible, the system can be designed and modified to reduce them.

This also ties into the ideas around portability and cloud-native frameworks. If our systems are more portable, we will more easily be able to adapt them to run in the most environmentally friendly ways. This could mean moving resources to data centers that use green energy, or processing workloads during times when the available energy is greener. We can no longer assume that running at night, when the servers are less busy, is the best option, as solar power could mean the middle of the day is the greenest time.

Blockchain and distributed ledgers are the technology behind decentralized apps. Mostly due to changes at Twitter, Mastodon emerged as an alternative, decentralized social network. However, blockchain remains a technology that solves a problem most people do not see as a problem. Because of this niche applicability, it remains classified as an innovator trend.

Architects no longer work alone, and architects can no longer think only about technical issues. The role of an architect varies greatly across the industry, and some companies have eliminated the title entirely, favoring “principal engineers” as the role primarily responsible for architectural decisions. This corresponds to a more collaborative approach, where architects work closely with the engineers who are building a system to continually refine the system design.

Architects have been working collaboratively with software teams to come up with and iterate designs. I continue to see different roles here (especially in larger organizations), but communication and working together through proof of concepts to try out designs if needed is key. – Blanca Garcia-Gil

Architecture Decision Records (ADRs) are now commonly recognized as a way to document and communicate design decisions. They are also being used as a collaboration tool to help engineers learn to make technical decisions and consider trade-offs.

Listen to the Trends Report Discussion on the InfoQ Podcast

The Architecture & Design editorial team met remotely to discuss these trends and we recorded our discussion as a podcast.  You can listen to the discussion and get a feel for the thinking behind these trends.

The Silent Platform Revolution: How eBPF Is Fundamentally Transforming Cloud-Native Platforms

Key Takeaways

  • eBPF is already used by many projects and products around the cloud-native ecosystem “under the hood” because it makes the kernel ready for cloud-native computing by enabling enrichment with cloud-native context. eBPF has created a silent infrastructure movement that is already everywhere and has enabled many new use cases that weren’t possible before. 
  • eBPF has been in production and production-proven for more than half a decade at the Internet scale, running 24/7 on millions of servers and devices worldwide.
  • eBPF has enabled new abstractions in the OS layer, which gives platform teams advanced capabilities for cloud-native networking, security, and observability to safely customize the OS to their workload’s needs.
  • Extending the OS kernel was a hard and lengthy process that could take years until a change could be used, but with eBPF, this developer-consumer feedback loop is now almost instant, where changes can be rolled out into production on the fly without having to restart or change the application or its configuration.
  • The next decade of infrastructure software will be defined by platform engineers who can use eBPF and the projects that leverage it to create the right abstractions for higher-level platforms. Open-source projects such as Cilium for eBPF-based networking, observability, and security have pioneered and brought this infrastructure movement to Kubernetes and cloud-native.

Kubernetes and cloud native have been around for nearly a decade. In that time, we’ve seen a Cambrian explosion of projects and innovation around infrastructure software. Through trial and late nights, we have also learned what works and what doesn’t when running these systems at scale in production. With these fundamental projects and crucial experience, platform teams are now pushing innovation up the stack, but can the stack keep up with them?

With the change of application design to API-driven microservices and the rise of Kubernetes-based platform engineering, networking and security teams have struggled to keep up, because Kubernetes breaks traditional networking and security models. We saw a similar technology sea change at least once before, with the transition to the cloud: the rules of data center infrastructure and developer workflow were completely rewritten as Linux boxes “in the cloud” began running the world’s most popular services. We are in a similar spot today, with a lot of churn around cloud native infrastructure pieces and not everyone knowing where it is headed; just look at the CNCF landscape. We have services communicating with each other over distributed networks atop a Linux kernel, many of whose features and subsystems were never designed for cloud native in the first place.

The next decade of infrastructure software will be defined by platform engineers who can take these infrastructure building blocks and use them to create the right abstractions for higher-level platforms. Like a construction engineer uses water, electricity, and construction materials to build buildings that people can use, platform engineers take hardware and infrastructure software to build platforms that developers can safely and reliably deploy software on to make high-impact changes frequently and predictably with minimal toil at scale. For the next act in the cloud native era, platform engineering teams must be able to provision, connect, observe, and secure scalable, dynamic, available, and high-performance environments so developers can focus on coding business logic. Many of the Linux kernel building blocks supporting these workloads are decades old. They need a new abstraction to keep up with the demands of the cloud native world. Luckily, it is already here and has been production-proven at the largest scale for years.

eBPF is creating the cloud native abstractions and new building blocks required for the cloud native world by allowing us to dynamically program the kernel in a safe, performant, and scalable way. It is used to safely and efficiently extend the cloud native and other capabilities of the kernel without requiring changes to kernel source code or loading kernel modules, unlocking innovation by moving the kernel itself from a monolith toward a more modular architecture enriched with cloud native context. These capabilities enable us to safely abstract the Linux kernel, iterate and innovate at this layer in a tight feedback loop, and become ready for the cloud native world. With these new superpowers for the Linux kernel, platform teams are ready for Day 2 of cloud native—and they might already be leveraging projects using eBPF without even knowing. There is a silent eBPF revolution reshaping platforms and the cloud native world in its image, and this is its story.

Extending a Packet Filter for Fun and for Profit

eBPF is a decades-old technology beginning its life as the BSD Packet Filter (BPF) in 1992. At the time, Van Jacobson wanted to troubleshoot network issues, but existing network filters were too slow. His lab designed and created libpcap, tcpdump, and BPF as a backend to provide the required functionality. BPF was designed to be fast, efficient, and easily verifiable so that it could be run inside the kernel, but its functionality was limited to read-only filtering based on simple packet header fields such as IP addresses and port numbers. Over time, as networking technology evolved, the limitations of this “classic” BPF (cBPF) became more apparent. In particular, it was stateless, which made it too limiting for complex packet operations and difficult to extend for developers.

Despite these constraints, the high-level concept behind cBPF, a minimal, verifiable instruction set that makes it feasible for the kernel to prove the safety of user-provided programs and then run them inside the kernel, provided an inspiration and a platform for future innovation. In 2014, a new technology was merged into the Linux kernel that significantly extended the BPF (hence, “eBPF”) instruction set to create a more flexible and powerful version. Initially, replacing the cBPF engine in the kernel was not the goal, since eBPF is a generic concept and can be applied in many places outside of networking. However, at that time, it was a feasible path to merge this new technology into the mainline kernel. Here is an interesting quote from Linus Torvalds:

So I can work with crazy people, that’s not the problem. They just need to sell their crazy stuff to me using non-crazy arguments and in small and well-defined pieces. When I ask for killer features, I want them to lull me into a safe and cozy world where the stuff they are pushing is actually useful to mainline people first. In other words, every new crazy feature should be hidden in a nice solid “Trojan Horse” gift: something that looks obviously good at first sight.

This, in short, describes the “organic” nature of the Linux kernel development model and matches perfectly to how eBPF got merged into the kernel. To perform incremental improvements, the natural fit was first to replace the cBPF infrastructure in the kernel, which improved its performance, then, step by step, expose and improve the new eBPF technology on top of this foundation. From there, the early days of eBPF evolved in two directions in parallel, networking and tracing. Every new feature around eBPF merged into the kernel solved a concrete production need around these use cases; this requirement still holds true today. Projects like bcc, bpftrace, and Cilium helped to shape the core building blocks of eBPF infrastructure long before its ecosystem took off and became mainstream. Today, eBPF is a generic technology that can run sandboxed programs in a privileged context such as the kernel and has little in common with “BSD,” “Packets,” or “Filters” anymore—eBPF is simply a pseudo-acronym referring to a technological revolution in the operating system kernel to safely extend and tailor it to the user’s needs.

With the ability to run complex yet safe programs, eBPF became a much more powerful platform for enriching the Linux kernel with cloud native context from higher up the stack to execute better policy decisions, process data more efficiently, move operations closer to their source, and iterate and innovate more quickly. In short, instead of patching, rebuilding, and rolling out a new kernel change, the feedback loop with infrastructure engineers has been reduced to the extent that an eBPF program can be updated on the fly without having to restart services and without interrupting data processing. eBPF’s versatility also led to its adoption in other areas outside of networking, such as security, observability, and tracing, where it can be used to detect and analyze system events in real time.

Accelerating Kernel Experiments and Evolution

Moving from cBPF to eBPF has drastically changed what is possible—and what we will build next. By moving beyond just a packet filter to a general-purpose sandboxed runtime, eBPF opened many new use cases around networking, observability, security, tracing, and profiling. eBPF is now a general-purpose compute engine within the Linux kernel that allows you to hook into, observe, and act upon anything happening in the kernel, like a plug-in for your web browser. A few key design features have enabled eBPF to accelerate innovation and create more performant and customizable systems for the cloud native world.

First, eBPF hooks anywhere in the kernel to modify functionality and customize its behavior without changing the kernel’s source. By not modifying the source code, eBPF reduces the time from a user needing a new feature to implementing it from years to days. Because of the broad adoption of the Linux kernel across billions of devices, making changes upstream is not taken lightly. For example, suppose you want a new way to observe your application and need to be able to pull that metric from the kernel. In that case, you have to first convince the entire kernel community that it is a good idea—and a good idea for everyone running Linux—then it can be implemented and finally make it to users in a few years. With eBPF, you can go from coding to observation without even having to reboot your machine and tailor the kernel to your specific workload needs without affecting others. “eBPF has been very useful, and the real power of it is how it allows people to do specialized code that isn’t enabled until asked for,” said Linus Torvalds.

Second, because the verifier checks that programs are safe to execute, eBPF developers can continue to innovate without worrying about the kernel crashing or other instabilities. This allows them and their end users to be confident that they are shipping stable code that can be leveraged in production. For platform teams and SREs, this is also crucial for using eBPF to safely troubleshoot issues they encounter in production.

When applications are ready to go to production, eBPF programs can be added at runtime without workload disruption or node reboot. This is a huge benefit when working at a large scale because it massively decreases the toil required to keep the platform up to date and reduces the risk of workload disruption from a rollout gone wrong. eBPF programs are JIT compiled for near native execution speed, and by shifting the context from user space to kernel space, they allow users to bypass or skip parts of the kernel that aren’t needed or used, thus enhancing performance. However, unlike complete kernel bypasses in user space, eBPF can still leverage all the kernel infrastructure and building blocks it wants without reinventing the wheel. eBPF can pick and choose the best pieces of the kernel and mix them with custom business logic to solve a specific problem. Finally, being able to modify kernel behavior at run time and bypass parts of the stack creates an extremely short feedback loop for developers. It has finally allowed experimentation in areas like network congestion control and process scheduling in the kernel.

Growing out of the classic packet filter and taking a major leap beyond the traditional use case unlocked many new possibilities in the kernel, from optimizing resource usage to adding customized business logic. eBPF allows us to speed up kernel innovation, create new abstractions, and dramatically increase performance. eBPF not only reduces the time, risk, and overhead it takes to add new features to production workloads, but in some cases, it even makes it possible in the first place.

Every Packet, Every Day: eBPF at Google, Meta, and Netflix

So many benefits beg the question of whether eBPF can deliver in the real world—and the answer has been a resounding yes. Meta and Google have some of the world’s largest data center footprints; Netflix accounts for about 15% of the Internet’s traffic. Each of these companies has been using eBPF under the hood for years in production, and the results speak for themselves.

Meta was the first company to put eBPF into production at scale with its load balancer project Katran. Since 2017, every packet going into a Meta data center has been processed with eBPF—that’s a lot of cat pictures. Meta has also used eBPF for many more advanced use cases, most recently improving scheduler efficiency, which increased throughput by 15%, a massive boost and resource saving at their scale. Google also processes most of its data center traffic through eBPF, using it for runtime security and observability, and defaults its Google Cloud customers to using an eBPF-based dataplane for networking. In the Android operating system, which powers over 70% of mobile devices and has more than 2.5 billion active users spanning over 190 countries, almost every networking packet hits eBPF. Finally, Netflix relies extensively on eBPF for performance monitoring and analysis of their fleet, and Netflix engineers pioneered eBPF tooling, such as bpftrace, to make major leaps in visibility for troubleshooting production servers and built eBPF-based collectors for On-CPU and Off-CPU flame graphs.

eBPF clearly works and has provided extensive benefits for “Internet-scale” companies for the better part of a decade, but those benefits also need to be translated to the rest of us.

eBPF (R)evolution: Making Cloud Native Speed and Scale Possible

At the beginning of the cloud native era, GIFEE (Google Infrastructure for Everyone Else) was a popular phrase, but largely fell out of favor because not everyone is Google or needs Google infrastructure. Instead, people want simple solutions that solve their problems, which begs the question of why eBPF is different. Cloud native environments are meant to “run scalable applications in modern, dynamic environments.” Scalable and dynamic are key to understanding why eBPF is the evolution of the kernel that the cloud native revolution needs.

The Linux kernel, as usual, is the foundation for building cloud native platforms. Applications now use sockets as data sources and sinks, and the network as a communication bus. But cloud native needs newer abstractions than those currently available in the Linux kernel, because many of these building blocks, like cgroups (CPU, memory handling), namespaces (net, mount, pid), SELinux, seccomp, netfilter, netlink, AppArmor, auditd, and perf, were designed decades before cloud even had a name. They don’t always talk to each other, and some are inflexible, allowing only for global policies and not per-container or per-service ones. Rather than leveraging new cloud native primitives, they lack awareness of Pods or any higher-level service abstractions and rely on iptables for networking.

As a platform team, if you want to provide developer tools for a cloud native environment, you can still be stuck in this box where cloud native environments can’t be expressed efficiently. Platform teams can find themselves in a future they are not ready to handle without the right tools. eBPF now allows tools to rebuild the abstractions in the Linux kernel from the ground up. These new abstractions are unlocking the next wave of cloud native innovation and will set the course for the cloud native revolution.

For example, in traditional networking, packets are processed by the kernel, and several layers of the network stack inspect each packet before it reaches its destination. This can result in high overhead and slow processing times, especially in large-scale cloud environments with many network packets to be processed. eBPF instead allows inserting custom code into the kernel that is executed for each packet as it passes through the network stack. This allows for more efficient and targeted network traffic processing, reducing the overhead and improving performance. Benchmarks from Cilium showed that switching from iptables to eBPF increased throughput 6x, and moving from IPVS-based load balancing to an eBPF-based one allowed Seznam.cz to double throughput while also reducing CPU usage by 72x. Instead of providing marginal improvements on an old abstraction, eBPF enables order-of-magnitude enhancements.

eBPF doesn’t just stop at networking like its predecessor; it also extends to areas like observability and security and many more because it is a general-purpose computing environment and can hook anywhere in the kernel. “I think the future of cloud native security will be based on eBPF technology because it’s a new and powerful way to get visibility into the kernel, which was very difficult before,” said Chris Aniszczyk, CTO of Cloud Native Computing Foundation. “At the intersection of application and infrastructure monitoring, and security monitoring, this can provide a holistic approach for teams to detect, mitigate, and resolve issues faster.” 

eBPF provides ways to connect, observe, and secure applications at cloud native speed and scale. “As applications shift toward being a collection of API-driven services driven by cloud native paradigms, the security, reliability, observability, and performance of all applications become fundamentally dependent on a new connectivity layer driven by eBPF,” said Dan Wendlandt, CEO and co-founder of Isovalent. “It’s going to be a critical layer in the new cloud native infrastructure stack.”

The eBPF revolution is changing cloud native; the best part is that it is already here.

The Silent eBPF Revolution Is Already a Part of Your Platform

While the benefits of eBPF are clear, it is so low level that platform teams, without the luxury of Linux kernel development experience, need a friendlier interface. This is the magic of eBPF—it is already inside many of the tools running the cloud native platforms of today, and you may already be leveraging it without even knowing. If you spin up a Kubernetes cluster on any major cloud provider, you are leveraging eBPF through Cilium. Using Pixie for observability or Parca for continuous profiling, also eBPF. 

eBPF is a powerful force that is transforming the software industry. Marc Andreessen’s famous quote on “software is eating the world” has been semi-jokingly recoined by Cloudflare as “eBPF is eating the world.” However, success for eBPF is not when all developers know about it but when developers start demanding faster networking, effortless monitoring and observability, and easier-to-use security solutions. Less than 1% of developers may ever program something in eBPF, but the other 99% will benefit from it. eBPF will have completely taken over when there’s a variety of projects and products providing massive developer experience improvement over upstreaming code to the Linux kernel or writing Linux kernel modules. We are already well on our way to that reality.

eBPF has revolutionized the way infrastructure platforms are and will be built and has enabled many new cloud native use cases that were previously difficult or impossible to implement. With eBPF, platform engineers can safely and efficiently extend the capabilities of the Linux kernel, allowing them to innovate quickly. This allows for creating new abstractions and building blocks tailored to the demands of the cloud native world, making it easier for developers to deploy software at scale.

eBPF has been in production for over half a decade at the largest scale and has proven to be a safe, performant, and scalable way to dynamically program the kernel. The silent eBPF revolution has taken hold and is already used in projects and products around the cloud native ecosystem and beyond. With eBPF, platform teams are now ready for the next act in the cloud native era, where they can provision, connect, observe, and secure scalable, dynamic, available, and high-performance environments so developers can focus on just coding business logic.

Migrate an RMI-Based Legacy Application to WebSocket

Key Takeaways

  • Technical debt, especially in enterprise software, is a relevant problem that we recurrently have to face. This article provides a use case describing how I dealt with removing technical debt from a large enterprise application based on the old-fashioned Remote Method Invocation (RMI) protocol, migrating it toward modern cloud-aware communication technologies.
  • The use case covers the experience of selecting a suitable open source project, forking it, and extending it to fit the purpose. The ability to select, use, and extend open source software is strategic in modern Application Lifecycle Management.
  • The reactive programming paradigm emerging in the cloud era, based on a functional approach, seems a better fit for today’s software development challenges. In the Java world, Reactive Streams is one of the attempts to standardize reactive approaches to Java development. One of the most important parts of the migration was switching from a classical synchronous request/response model to a reactive one.
  • Jakarta EE is the heir of Java EE and the reference point for enterprise applications. One of the most important goals of the migration is to make the final assets deployable within a Jakarta EE container (e.g., Tomcat, WildFly).

What is Remote Method Invocation (RMI)?

RMI was first introduced back in JDK 1.1 to provide an all-in-one Java-based solution for application interoperability across networks. Its fundamentals, illustrated by the minimal sketch after the list below, were:

  • Remote Procedure Call (RPC) based on Client-Server model
  • Synchronous by design
  • TCP Socket as transport protocol
  • Java binary (built-in) serialization as application protocol
  • Bidirectional communication protocol
    • clients call servers and vice versa (i.e., callbacks)
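
To make these fundamentals concrete, here is a minimal sketch of classic RMI (the ClockService and ClockServer names are hypothetical, not taken from the application discussed later): a remote interface, an implementation exported through UnicastRemoteObject, and a registry binding.

// Remote interface (own source file): every method declares RemoteException
import java.rmi.Remote;
import java.rmi.RemoteException;

public interface ClockService extends Remote {
    long currentTimeMillis() throws RemoteException;
}

// Server (own source file): exports the implementation and binds it in the RMI registry
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class ClockServer {

    static class ClockServiceImpl implements ClockService {
        @Override
        public long currentTimeMillis() {
            return System.currentTimeMillis();
        }
    }

    public static void main(String[] args) throws Exception {
        // Export the implementation over a TCP socket using built-in Java serialization
        ClockService stub = (ClockService) UnicastRemoteObject.exportObject(new ClockServiceImpl(), 0);

        // Publish the stub under a well-known name
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("clock", stub);
    }
}

// Client: look up the stub and perform a synchronous remote call
// ClockService clock = (ClockService) LocateRegistry.getRegistry("localhost", 1099).lookup("clock");
// long serverTime = clock.currentTimeMillis();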

At that time, this was quite a revolution in the Java world because interoperability became easier to achieve in a period where the internet was rising in the IT landscape.

This implied that a lot of client-server applications were developed based on this new and compelling technology.

Obviously, looking at RMI now, in the modern Internet era, such technology seems old and out of scope, considering that in modern web-based architectures the front end is mostly based on internet browsers and protocols are based on open standards that must be platform- and technology-independent.

Target Readers

This article is targeted at Java developers and architects who, for whatever reason, are dealing with the modernization of large legacy RMI applications.

I’ve had such a challenge and I’d like to share my experience in facing it.

Migration vs. Rewrite

First of all, there is one thing that I would like to emphasize for the reader. By “migrate,” I absolutely do not mean “rewrite,” so if you are interested in such a “migration,” this could be a new hope to bring a legacy Java application one step closer to its modernization.

To give you a comprehensive view of which tasks are needed and the implications behind such migration, I’ll present to you a real use case that I’ve tackled in one of my works.

Use Case – Evolve an Old Full-Stack Java Application Based on RMI

I had an old, but perfectly working, large client/server Java application ported to JDK/JRE 8, with a front end based on Swing/AWT, using RMI as the underlying communication protocol.

Requirement

  1. Move the application to the Cloud using Docker/Kubernetes
  2. Move toward modern web-compliant transport protocols like HTTP/WebSocket/gRPC

Challenge: RMI Tunneling Over HTTP

The first option evaluated was RMI HTTP tunneling, but right from the start it seemed a bit too complicated, and since the application heavily uses RMI callbacks, a one-way protocol like HTTP was not suitable for the purpose.

In that sense, other than raw sockets, WebSocket seemed the best-fit protocol, but even though I spent considerable effort trying to understand how to plug WebSocket in as the underlying protocol of RMI, the results were a waste of time :(.

Challenge: Evaluate an Open Source Alternative RMI Implementation

So another possible solution was to evaluate alternative RMI implementations. I searched for one, trying to identify an open source, semi-finished product that is easy to understand, with a flexible and adaptable architecture allowing a new protocol to be plugged in.

Evolving the Selected Open Source Project: LipeRMI

During my Internet surfing, I landed on a GitHub-hosted project named LipeRMI, described as a lightweight alternative RMI implementation for Java. It seemed to me that LipeRMI met the expected requirements. I tested it with a simple but complete application and it worked. Amazingly, it also supported RMI callbacks very well.

Even though the original implementation was based on sockets, its architecture was flexible enough to make me confident that I could extend and enhance it to meet my needs, and so my journey began.

Understanding LipeRMI: “The Original”

In the picture below, there is the High Level Architecture as presented in the original project.

Original High Level Architecture

As you can see, it is pretty simple. The main component is the CallHandler, which knows the application interfaces and their implementations. Both client and server use the CallHandler and directly use sockets to establish a connection session between them.

Evolving LipeRMI – “The Fork”

As a first step, I forked the project and converted it into a Maven multimodule project to allow better management by simplifying both the extension model and the tests.

As a result of this refactoring, I got the following modules:

  • core – the core implementation
  • socket – core extension implementing the synchronous socket protocol
  • websocket – core reactive extension implementing the asynchronous websocket protocol
  • rmi-emul – core extension to emulate the RMI API
  • examples – various examples
  • cheerpj – WebAssembly front end based upon CheerpJ (EXPERIMENTAL)

In this article, I’m going to focus on core, socket, and websocket, where core + socket should be considered a modular reinterpretation of the original project, while websocket is a completely new implementation that takes advantage of the reactive protocol abstraction introduced in the core, using Reactive Streams.

Core Module

Protocol Abstraction

In the core module, I’ve placed the greater part of the original project’s code. Looking at the original architecture, one of the main goals was decoupling/abstracting the underlying protocol, so I introduced some interfaces like IServer, IClient, and IRemoteCaller to achieve this; as a consequence, the core module does not contain any specific protocol implementation.

In the picture below, there is an overview of the new architecture allowing protocol abstraction.

Class Diagram with protocol abstraction
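
As a purely illustrative sketch (these names and signatures are hypothetical, not the actual LipeRMI interfaces), the protocol abstraction boils down to contracts along these lines, with the socket and websocket modules providing the concrete transports:

// Hypothetical sketch of the protocol abstraction, not the real LipeRMI API
public interface IServer {
    // Start listening and hand every new connection over to the core dispatcher
    void start(int port) throws Exception;
    void close() throws Exception;
}

public interface IClient extends AutoCloseable {
    // Obtain a dynamic proxy for a remotable interface registered on the server
    <T> T getGlobal(Class<T> remoteInterface);
}

public interface IRemoteCaller {
    // Serialize a call, send it over the transport, and return the result
    Object call(Class<?> remoteInterface, String method, Object[] args) throws Exception;
}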

Socket Module

In the socket module, I’ve simply implemented all the synchronous abstractions provided by the core module, essentially reusing code from the original project but fitting it into the new architecture.

Class Diagram using socket implementation

Code Examples

To give you an idea of the complexity of using the new LipeRMI implementation, consider the code snippets below, which I’ve extracted from working examples.

Remotable Interfaces

// Remotable Interface
public interface IAnotherObject extends Serializable {
    int getNumber();    
}

// Remotable Interface
public interface ITestService extends Serializable {

    public String letsDoIt();
    
    public IAnotherObject getAnotherObject();
    
    public void throwAExceptionPlease();
}

Server

// Server
public class TestSocketServer implements Constants {

    // Remotable Interface Implementation
    static class TestServiceImpl implements ITestService {
        final CallHandler callHandler;
        int anotherNumber = 0;

        public TestServiceImpl(CallHandler callHandler) {
            this.callHandler = callHandler;
        }

        @Override
        public String letsDoIt() {
            log.info("letsDoIt() done.");
            return "server saying hi";
        }

        @Override
        public IAnotherObject getAnotherObject() {
            log.info("building AnotherObject with anotherNumber= {}", anotherNumber);
            
            IAnotherObject ao = new AnotherObjectImpl(anotherNumber++);

            callHandler.exportObject(IAnotherObject.class, ao);

            return ao;
        }

        @Override
        public void throwAExceptionPlease() {
            throw new AssertionError("take it easy!");
        }
    }

    public TestSocketServer() throws Exception {

        log.info("Creating Server");
        SocketServer server = new SocketServer();

        final ITestService service = new TestServiceImpl(server.getCallHandler());

        log.info("Registering implementation");
        server.getCallHandler().registerGlobal(ITestService.class, service);

        server.start(PORT, GZIPProtocolFilter.Shared);

        log.info("Server listening");
    
    }

    public static void main(String[] args) throws Exception {
        new TestSocketServer();
    }
}

Client

// Client
public class TestSocketClient implements Constants {
    
    public static void main(String... args) {

        log.info("Creating Client");
        try( final SocketClient client = new SocketClient("localhost", PORT, GZIPProtocolFilter.Shared)) {

            log.info("Getting proxy");
            final ITestService myServiceCaller = client.getGlobal(ITestService.class);

            log.info("Calling the method letsDoIt(): {}", myServiceCaller.letsDoIt());

            try {
                log.info("Calling the method throwAExceptionPlease():");
                myServiceCaller.throwAExceptionPlease();
            }
            catch (AssertionError e) {
                log.info("Catch! {}", e.getMessage());
            }

            final IAnotherObject ao = myServiceCaller.getAnotherObject();

            log.info("AnotherObject::getNumber(): {}", ao.getNumber());
                            
        }
        
    }
    
}

Evolving LipeRMI: Adding Reactivity to the Framework with Reactive Streams

Unfortunately, sockets promote a synchronous programming model that does not fit very well with the asynchronous one promoted by WebSocket, so I decided to move the framework toward a reactive approach using the Reactive Streams standard.

Design Guideline

The basic idea was simply to decouple the request and the response using events: the request goes out through a publisher, the response comes back through a subscriber, and the entire request/response lifecycle is managed by a CompletableFuture (essentially the Java equivalent of the Promise design pattern).
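
As a minimal, self-contained sketch of that pattern (the PendingCalls name and its methods are hypothetical, not taken from the LipeRMI code base): each outgoing call registers a CompletableFuture keyed by a call id, and when the matching response event arrives the future is completed, waking up whoever blocked on get() or chained a callback.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the promise-based request/response decoupling
public class PendingCalls {

    private final AtomicLong ids = new AtomicLong();
    private final Map<Long, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

    public long nextCallId() {
        return ids.incrementAndGet();
    }

    // Publisher side: register the outgoing call and return its future result
    public CompletableFuture<Object> register(long callId) {
        CompletableFuture<Object> future = new CompletableFuture<>();
        pending.put(callId, future);
        return future;
    }

    // Subscriber side: called when the response event arrives over the wire
    public void complete(long callId, Object result, Throwable error) {
        CompletableFuture<Object> future = pending.remove(callId);
        if (future == null) {
            return; // unknown or already completed call
        }
        if (error != null) {
            future.completeExceptionally(error);
        } else {
            future.complete(result);
        }
    }
}

A synchronous proxy can then simply call future.get() to preserve the blocking RMI semantics on top of an asynchronous transport.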

Reactive Protocol Abstraction (Asynchronous)

As previously mentioned, I’ve introduced Reactive Streams in the core module. Reactive Streams is a standard for asynchronous stream processing with non-blocking back pressure, encompassing efforts aimed at runtime environments as well as network protocols.

Class Diagram of reactive stream

The Reactive Streams specification defines four interfaces:

  • Processor – a processing stage which is both a Subscriber and a Publisher and obeys the contracts of both.
  • Publisher – a provider of a potentially unbounded number of sequenced elements, publishing them according to the demand received from its Subscriber(s).
  • Subscriber – receives a call to the Subscriber.onSubscribe(Subscription) method once, after an instance of the Subscriber is passed to the Publisher.subscribe(Subscriber) method.
  • Subscription – represents a one-to-one lifecycle of a Subscriber subscribing to a Publisher.
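
The java.util.concurrent.Flow API shipped with JDK 9+ mirrors these four interfaces one-to-one. The small subscriber below (illustrative only, not LipeRMI code) shows how non-blocking back pressure works by requesting one element at a time:

import java.util.concurrent.Flow;

// Illustrative subscriber: processes one element at a time, signalling demand
// with Subscription.request(n) instead of blocking the publisher
public class OneAtATimeSubscriber<T> implements Flow.Subscriber<T> {

    private Flow.Subscription subscription;

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(1); // ask for the first element
    }

    @Override
    public void onNext(T item) {
        System.out.println("received: " + item);
        subscription.request(1); // ask for the next one only when ready
    }

    @Override
    public void onError(Throwable throwable) {
        throwable.printStackTrace();
    }

    @Override
    public void onComplete() {
        System.out.println("done");
    }
}

In tests, a SubmissionPublisher<T> from the same package can be used to feed such a subscriber.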

Below is the new core architecture that includes the ReactiveClient abstraction:

Class Diagram with Reactive protocol abstraction

The implementation of the reactive client is contained in the abstract ReactiveClient class. It is based on the RemoteCallProcessor class, an implementation of the reactive-stream Processor, which acts both as a Publisher, by publishing the event that triggers the remote call, and as a Subscriber, by receiving the event containing the result of that remote call. Finally, the event interactions are coordinated by the ReactiveRemoteCaller.

Finally: Implementing the Websocket Module

After introducing a reactive-stream implementation, switching from socket to websocket was a simple and rewarding coding exercise.

To quickly verify the approach and provide a proof of concept, I decided to use a simple open source micro-framework, Java-WebSocket, which provides a websocket implementation that is both simple and effective. My real target, however, is to plug it into Jakarta EE using its WebSocket specification. The final part of this article is dedicated to enabling any Jakarta EE-compatible product to speak the RMI protocol while guaranteeing, at the same time, application evolution and a smooth migration.

Class Diagram using WebSocket implementation

As you can see from the class diagram above, there are two new handler classes, WSClientConnectionHandler and WSServerConnectionHandler, which manage the inbound and outbound events for the client and the server respectively, while at the same time keeping them consistent across each call.

Code Examples

Amazingly, the code examples presented above also work in essentially the same way for the websocket version. It is enough to switch from SocketClient to LipeRMIWebSocketClient on the client and from SocketServer to LipeRMIWebSocketServer on the server. That’s all!

// Client
try( LipeRMIWebSocketClient client = new LipeRMIWebSocketClient(new URI(format( "ws://localhost:%d", PORT)), GZIPProtocolFilter.Shared)) 
{ 
    // use client
}

// Server
final LipeRMIWebSocketServer server = new LipeRMIWebSocketServer();

Plug LipeRMI Websocket into Jakarta EE Using the TomEE Runtime

Jakarta EE is currently the de facto standard for developing enterprise applications, which makes LipeRMI integration with it strategic.

Jakarta EE is essentially a specification for building Java application servers; this implies that we have to choose an application server compliant with that specification. To experiment with LipeRMI integration, I’ve chosen Apache TomEE, a certified Jakarta EE Web Profile implementation.

Since RMI provides a built-in service broker, it was designed to work without an application server; with LipeRMI, however, we have decoupled remote method invocation from service brokering, and this separation allows us to plug RMI processing into a Jakarta EE container.

ServerEndPoint

Let’s start to integrate the Jakarta WebSocket specification by creating a ServerEndpoint that will manage all websocket sessions opened by clients.

@ServerEndpoint( value = "/lipermi" )
public class WSServerConnectionHandler {

}

The most important behavior to implement is handling the binary message, which can be of two different types, RemoteCall or RemoteReturn, depending on whether the client is requesting a method invocation (RemoteCall) or the incoming message is a callback invocation result (RemoteReturn).

A simplified sequence diagram that shows the main tasks performed on an incoming websocket message is shown below:

This is an extract of the original code to give you an idea of the complexity of handling requests and responses over websocket sessions.

@ServerEndpoint( value = "/lipermi" )
public class WSServerSessionHandler {

 @OnMessage
 public void onMessage(ByteBuffer buffer, Session webSocket) {
   try (final ByteArrayInputStream bais = new ByteArrayInputStream(buffer.array());
        final ObjectInputStream input = new ObjectInputStream(bais))
   {

       final Object objFromStream = input.readUnshared();
       final IRemoteMessage remoteMessage = filter.readObject(objFromStream);

       if (remoteMessage instanceof RemoteCall) {
           this.handleRemoteCall( webSocket, (RemoteCall)remoteMessage );
       } else if (remoteMessage instanceof RemoteReturn) {
           remoteReturnManager.handleRemoteReturn( (RemoteReturn) remoteMessage);
       } else {
           log.warn("Unknown IRemoteMessage type");
       }
   }
   catch( Exception ex ) {
       log.warn("error reading message", ex );
   }

 }
}

ClientEndpoint

On the client side, Jakarta EE makes available a ContainerProvider to acquire a WebSocketContainer, which allows connecting to a websocket server and obtaining a new session.

WebSocketContainer container = ContainerProvider.getWebSocketContainer();
session = container.connectToServer(this, serverUri);
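
For context, here is a hedged sketch of what the surrounding client endpoint might look like; the class name and helper methods are hypothetical, and it assumes the jakarta.websocket API (older containers expose the same types under javax.websocket):

import java.io.IOException;
import java.net.URI;
import java.nio.ByteBuffer;

import jakarta.websocket.ClientEndpoint;
import jakarta.websocket.ContainerProvider;
import jakarta.websocket.OnMessage;
import jakarta.websocket.OnOpen;
import jakarta.websocket.Session;
import jakarta.websocket.WebSocketContainer;

// Hypothetical client endpoint: connects to the /lipermi server endpoint and
// forwards incoming binary frames (serialized IRemoteMessage payloads) to the core
@ClientEndpoint
public class WSClientEndpointSketch {

    private Session session;

    public void connect(URI serverUri) throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        session = container.connectToServer(this, serverUri);
    }

    @OnOpen
    public void onOpen(Session session) {
        this.session = session;
    }

    @OnMessage
    public void onMessage(ByteBuffer buffer, Session session) {
        // deserialize the IRemoteMessage and dispatch it, as in the server handler above
    }

    public void sendBinary(ByteBuffer payload) throws IOException {
        session.getBasicRemote().sendBinary(payload);
    }
}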

A sequence diagram representing a complete remote method invocation flow, highlighting the reactive-stream implementation and the tasks involving the Jakarta EE websocket client API, is shown below:

The diagram above can be split into three macro tasks:

  1. Acquire a remote service proxy and invoke a method, which starts a CompletableFuture that will manage the asynchronous request and response
  2. Send a request to the remote call processor and create a subscription to the remote return publisher for the management of the result
  3. Send the binary request over a websocket session and wait for result through the OnMessage websocket handler

Experiments and Evolution

Once we have moved the RMI implementation over websocket and made it compliant with a Jakarta EE container, we can imagine evolving the original RMI architecture in a more modern direction.

RMI Design Limit

RMI itself was designed on the assumption that both client and server are developed in Java, especially because the RMI application data protocol relies on built-in Java serialization. Nowadays, modern applications favor web clients, and with RMI technology this does not seem achievable, also because Java Applet technology has been deprecated. But a new cutting-edge technology that could help us is WebAssembly.

WebAssembly to the Rescue

With WebAssembly, the technological barriers in the browser have been removed. Not only can JavaScript run within the browser context, but so can any programming language whose compiler is able to produce WebAssembly-compliant bytecode (wasm).

Currently, one of the most famous programming languages that generate WebAssembly is Rust, but other more mature languages, such as C# and Swift, are providing WebAssembly generation. But what about Java?

Java to WebAssembly

One of the most interesting projects allowing you to compile Java to WebAssembly is CheerpJ. It also supports Java Swing/AWT and serialization. I’ve experimented with it, and the results are very promising. In fact, I’ve successfully developed a simple chat application using LipeRMI and deployed the Java client directly inside the browser through CheerpJ.

However, going in depth on CheerpJ and WebAssembly is out of the scope of this article, but it is probably very interesting material for a future one.

Conclusion

I’ve started to migrate a legacy project and it is going fine. Just keep in mind that it requires a lot of effort, but the results are very promising. Moreover, switching over to the websocket protocol opens new unexpected and exciting scenarios.

My idea is to keep working on the LipeRMI fork to use JSON-based serialization instead of the proprietary Java one. Once the application migration is accomplished, this will allow you to develop clients with other technologies such as JavaScript/React.

I hope this article can be of use to someone facing the same challenge I did. In the meantime, happy programming!

References