Domain-Driven Cloud: Aligning your Cloud Architecture to your Business Model

Key Takeaways

  • Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on the bounded contexts of your business model. DDC extends the principles of Domain-Driven Design (DDD) beyond traditional software systems to create a unifying architecture approach across business domains, software systems and cloud infrastructure.
  • DDC creates a cloud architecture that evolves as your business changes, improves team autonomy and promotes low coupling between distributed workloads. DDC simplifies security, governance and cost management in a way that promotes transparency within your organization.
  • In practice, DDC aligns your bounded contexts with AWS Organizational Units (OU’s) and Azure Management Groups (MG’s). Bounded contexts are categorized as domain contexts based on your business model and supporting technical contexts. DDC gives you freedom to implement different AWS Account or Azure Subscription taxonomies while still aligning to your business model.
  • DDC uses inheritance to enforce policies and controls downward while reporting costs and compliance upwards. With DDC, it is automatically transparent how your cloud costs align to your business model, without complex custom reports or error-prone tagging requirements.
  • DDC aligns with established AWS and Azure well-architected best practices. You can implement DDC in five basic steps, whether you are starting a new migration (greenfield) or upgrading your existing cloud architecture (brownfield).

Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on your business model. DDC uses the bounded contexts of your business model as inputs and outputs a flexible cloud architecture to support all of the workloads in your organization and evolve as your business changes. DDC promotes team autonomy by giving teams the ability to innovate within guardrails. Operationally, DDC simplifies security, governance, integration and cost management in a way that promotes transparency for IT and business stakeholders alike.

Based on Domain-Driven Design (DDD) and the architecture principle of high cohesion and low coupling, this article introduces DDC including the technical and human benefits of aligning your cloud architecture to the bounded contexts in your business model. You will learn how DDC can be implemented in cloud platforms including Amazon Web Services (AWS) and Microsoft Azure while aligning with their well-architected frameworks. Using illustrative examples from one of our real customers, you will learn the 5 steps to implementing DDC in your organization.

What is Domain-Driven Cloud (DDC)?

DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure.  

Our customers perpetually strive to align “people, process and technology” together so they can work in harmony to deliver business outcomes. However, in practice, this often falls down as the Business (Biz), IT Development (Dev) and IT Operations (Ops) all go to their separate corners to design solutions for complex problems that actually span all three.

What emerges are business process redesigns, enterprise architectures and cloud platform architectures, all designed and implemented by different groups using different approaches and localized languages.

What’s missing is a unified architecture approach using a shared language that integrates BizDevOps. This is where DDC steps in, with a specific focus on aligning your cloud architecture, and the software systems that run on it, to the bounded contexts of your business model, identified using DDD. Figure 1 illustrates how DDC extends the principles of DDD to include cloud infrastructure architecture and, in doing so, creates a unified architecture that aligns BizDevOps.

[Figure 1]

In DDC, the most important cloud services are AWS Organizational Units (OU’s) that contain Accounts and Azure Management Groups (MG’s) that contain Subscriptions. Because 100% of the cloud resources you secure, use and pay for are connected to Accounts and Subscriptions, these are the natural cost and security containers. By enabling management and security at the higher OU/MG level and anchoring these on the bounded contexts of your business model, you can now create a unifying architecture spanning Biz, Dev and Ops. You can do this while giving your teams flexibility in how they use Accounts and Subscriptions to meet specific requirements.

Why align your Cloud Architecture with your Business Model?

The benefits of aligning your cloud architecture to your organization’s business model include:

  • Evolves with your Business – Businesses are not static and neither is your cloud architecture. As markets change and your business evolves, new contexts may emerge and others may consolidate or fade away. Some contexts that historically were strategic differentiators may drive less business value today. The direct alignment of your cloud management, security and costs to bounded contexts means your cloud architecture evolves with your business.
  • Improves Team Autonomy – While some cloud management tasks must be centralized, DDC recommends giving teams autonomy within their domain contexts for things like provisioning infrastructure and deploying applications. This enables innovation within guardrails so your agile teams can go faster and be more responsive to changes as your business grows. It also ensures dependencies between workloads in different contexts are explicit with the goal of promoting a loosely-coupled architecture aligned to empowered teams.
  • Promotes High Cohesion and Low Coupling – Aligning your networks to bounded contexts enables you to explicitly allow or deny network connectivity between all contexts. This is extraordinarily powerful, especially for enforcing low coupling across your cloud platform and avoiding a modern architecture that looks like a bowl of spaghetti. Within a context, teams and workloads ideally have high cohesion with respect to security, network integration and alignment on supporting a specific part of your business. You also have freedom to make availability and resiliency decisions at both the bounded context and workload levels.
  • Increases Cost Transparency – By aligning your bounded contexts to OU’s and MG’s, all cloud resource usage, budgets and costs are precisely tracked at a granular level. They are then automatically summarized at the bounded-context level, without custom reports or nagging all your engineers to tag everything! With DDC you can look at your monthly cloud bill and know the exact cloud spend for each of your bounded contexts, enabling you to assess whether these costs are commensurate with each context’s business value. Cloud budgets and alarms can be delegated to context-aligned teams, enabling them to monitor and optimize their spend while your organization has a clear top-down view of overall cloud costs.
  • Domain-Aligned Security – Security policies, controls, identity and access management all line up nicely with bounded contexts. Some policies and controls can be deployed across-the-board to all contexts to create a strong security baseline. From here, selected controls can be safely delegated to teams for self-management while still enforcing enterprise security standards.
  • Repeatable with Code Templates – Both AWS and Azure provide ways to provision new Accounts or Subscriptions consistently from a code-based blueprint. In DDC, we recommend defining one template for all domain contexts, then using this template (plus configurable input parameters) to provision and configure new OU’s and Accounts or MG’s and Subscriptions as needed. These management constructs are free (you only pay for the actual resources used within them), enabling you to build out your cloud architecture incrementally yet towards a defined future-state, without incurring additional cloud costs along the way.
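
As a minimal illustration of the code-template idea in the last bullet, the sketch below uses the AWS SDK for Java v2 to provision an OU named after a bounded context plus Prod and NonProd Accounts beneath it. The root ID, context name and e-mail addresses are placeholder assumptions, and in practice such a template would more often live in an infrastructure-as-code tool (CloudFormation, Terraform, Bicep) than in imperative SDK calls.

```java
import software.amazon.awssdk.services.organizations.OrganizationsClient;
import software.amazon.awssdk.services.organizations.model.CreateAccountRequest;
import software.amazon.awssdk.services.organizations.model.CreateOrganizationalUnitRequest;

public class DomainContextTemplate {

    // Provisions one OU per bounded context with Prod and NonProd Accounts.
    // rootId and the e-mail domain are illustrative placeholders.
    public static void provision(OrganizationsClient org, String rootId, String boundedContext) {
        String ouId = org.createOrganizationalUnit(CreateOrganizationalUnitRequest.builder()
                .parentId(rootId)
                .name(boundedContext)                      // e.g. "Orders"
                .build())
            .organizationalUnit().id();

        for (String env : new String[] {"Prod", "NonProd"}) {
            org.createAccount(CreateAccountRequest.builder()
                .accountName(boundedContext + "-" + env)   // e.g. "Orders-Prod"
                .email(boundedContext.toLowerCase() + "-" + env.toLowerCase() + "@example.com")
                .build());
            // createAccount is asynchronous; a real template would poll
            // describeCreateAccountStatus and then move the new Account into ouId.
        }
    }
}
```

The same parameterized routine can be run for every domain context, which is what keeps the taxonomy consistent as new contexts appear.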

DDC may not be the best approach in all situations. Alternatives such as organizing your cloud architecture by tenant/customer (SaaS) or legal entity are viable options, too.

Unfortunately, we often see customers default to organizing their cloud architecture by their current org structure, following Conway’s Law from the 1960’s. We think this is a mistake and that DDC is a better alternative for one simple reason: your business model is more stable than your org structure.

One of the core tenets of good architecture is that we don’t have more stable components depending on less stable components (aka the Stable Dependencies Principle). Organizations, especially large ones, like to reorganize often, making their org structure less stable than their business model. Basing your cloud architecture on your org structure means that every time you reorganize, your cloud architecture is directly impacted, which may affect all the workloads running in your cloud environment. Why do this? Basing your cloud architecture on your organization’s business model enables it to evolve naturally as your business strategy evolves, as seen in Figure 2.

[Figure 2]

We recognize that, as Ruth Malan states, “If the architecture of the system and the architecture of the organization are at odds, the architecture of the organization wins”. We also acknowledge there is work to do with how OU’s/MG’s and all the workloads within them best align to team boundaries and responsibilities. We think ideas like Team Topologies may help here.

We are seeing today’s organizations move away from siloed departmental projects within formal communications structures to cross-functional teams creating products and services that span organizational boundaries. These modern solutions run in the cloud, so we feel the time is right for evolving your enterprise architecture in a way that unifies Biz, Dev and Ops using a shared language and architecture approach.

What about Well-Architected frameworks?

Both AWS’s Well-Architected framework and Azure’s Well-Architected framework provide a curated set of design principles and best practices for designing and operating systems in your cloud environments. DDC fully embraces these frameworks and at SingleStone we use these with our customers. While these frameworks provide specific recommendations and benefits for organizing your workloads into multiple Accounts or Subscriptions, managed with OU’s and MG’s, they leave it to you to figure out the best taxonomy for your organization.

DDC is opinionated on basing your cloud architecture on your bounded contexts, while being 100% compatible with models like AWS’s Separated AEO/IEO and design principles like “Perform operations as code” and “Automatically recover from failure”. You can adopt DDC and apply these best practices, too. Tools such as AWS Landing Zone and Azure Landing Zones can accelerate the setup of your cloud architecture while also being domain-driven.

5 Steps for Implementing Domain-Driven Cloud

Do you think a unified architecture using a shared language across BizDevOps might benefit your organization? While a comprehensive list of all tasks is beyond the scope of this article, here are the five basic steps you can follow, with illustrations from one of our customers who recently migrated to Azure.

Step 1: Start with Bounded Contexts

The starting point for implementing DDC is a set of bounded contexts that describes your business model. The steps to identify your bounded contexts are not covered here, but the process described in Domain-Driven Discovery is one approach.

Once you identify your bounded contexts, organize them into two groups:

  • Domain contexts are directly aligned to your business model.
  • Technical contexts support all domain contexts with shared infrastructure and services

To illustrate, let’s look at our customer, a medical supply company. Their domain and technical contexts are shown in Figure 3.

[Figure 3]

Your organization’s domain contexts would be different, of course.

For technical contexts, the number will depend on factors including your organization’s industry, complexity, regulatory and security requirements. A Fortune 100 financial services firm will have more technical contexts than a new media start-up. With that said, as a starting point DDC recommends six technical contexts for supporting all your systems and data.

  • Cloud Management – Context for the configuration and management of your cloud platform including OU/MG’s, Accounts/Subscriptions, cloud budgets and cloud controls.
  • Security – Context for identity and access management, secrets management and other shared security services used by any workload.
  • Network – Context for all centralized networking services including subnets, firewalls, traffic management and on-premise network connectivity.
  • Compliance – Context for any compliance-related services and data storage that supports regulatory, audit and forensic activities.
  • Platform Services – Context for common development and operations services including CI/CD, package management, observability, logging, compute and storage.
  • Analytics – Context for enterprise data warehouses, governance, reporting and dashboards.

You don’t have to create all of these up front; start with Cloud Management and build out the others as needed.

Step 2: Build a Solid Foundation

With your bounded contexts defined, it’s now time to build a secure cloud foundation for supporting your organization’s workloads today and in the future. In our experience, we have found it is helpful to organize your cloud capabilities into three layers based on how they support your workloads. For our medical supply customer, Figure 4 shows their contexts aligned to the Application, Platform and Foundation layers of their cloud architecture.

[Figure 4]

With DDC, you align AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) to bounded contexts. By align, we mean you name them after your bounded contexts. These are the highest levels of management and through the use of inheritance they give you the ability to standardize controls and settings across your entire cloud architecture.

DDC gives you flexibility in how best to organize your Accounts and Subscription taxonomy, from coarse-grained to fine-grained, as seen in Figure 5.

DDC recommends starting with one OU/MG and at least two Accounts/Subscriptions per bounded context. If your organization has higher workload isolation requirements, DDC can support this too, as seen in Figure 5.

[Figure 5]

For our customer who had a small cloud team new to Azure, separate Subscriptions for Prod and NonProd for each context made sense as a starting point, as shown in Figure 6.

[Figure 6]

Figure 7 shows what this would look like in AWS.

[Figure 7]

For our customer, additional environments like Dev, Test and Stage could be created within their respective Prod and NonProd Subscriptions. This provides them isolation between environments with the ability to configure environment-specific settings at the Subscription or lower levels. They also decided to build just the Prod Subscriptions for the six technical contexts to keep things simple to start. Again, if your organization wanted to create separate Accounts or Subscriptions for every workload environment, this can be done too while still aligning with DDC.

From a governance perspective, in DDC we recommend domain contexts inherit security controls and configurations from technical contexts. Creating a strong security posture in your technical contexts enables all your workloads that run in domain contexts to inherit this security by default. Domain contexts can then override selected controls and settings on a case-by-case basis balancing team autonomy and flexibility with required security guardrails.
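
To make this inheritance concrete, here is a hedged sketch, again using the AWS SDK for Java v2, that attaches a Service Control Policy at a context’s OU so that every Account in that context (and any child OU) inherits it. The policy content and OU ID are illustrative assumptions, not a recommended baseline; in Azure the analogous move would be assigning an Azure Policy at a Management Group scope.

```java
import software.amazon.awssdk.services.organizations.OrganizationsClient;
import software.amazon.awssdk.services.organizations.model.AttachPolicyRequest;
import software.amazon.awssdk.services.organizations.model.CreatePolicyRequest;
import software.amazon.awssdk.services.organizations.model.PolicyType;

public class BaselineGuardrails {

    // Example guardrail: deny leaving the organization. The JSON and the
    // target OU ID are illustrative placeholders.
    private static final String DENY_LEAVE_ORG = """
        {"Version":"2012-10-17","Statement":[{"Effect":"Deny",
         "Action":"organizations:LeaveOrganization","Resource":"*"}]}""";

    public static void applyTo(OrganizationsClient org, String contextOuId) {
        String policyId = org.createPolicy(CreatePolicyRequest.builder()
                .name("baseline-guardrails")
                .description("Security baseline inherited by all Accounts in this context")
                .type(PolicyType.SERVICE_CONTROL_POLICY)
                .content(DENY_LEAVE_ORG)
                .build())
            .policy().policySummary().id();

        // Attaching at the OU level enforces the control downward onto every
        // Account (and child OU) in the bounded context.
        org.attachPolicy(AttachPolicyRequest.builder()
                .policyId(policyId)
                .targetId(contextOuId)
                .build());
    }
}
```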

Using DDC, your organization can grant autonomy to teams to enable innovation within guardrails. Leveraging key concepts from Team Topologies, stream-aligned teams can be self-sufficient within domain contexts when creating cloud infrastructure, deploying releases and monitoring their workloads. Platform teams, primarily working in technical contexts, can focus on designing and running highly-available services used by the stream-aligned teams. These teams work together to create the right balance between centralization and decentralization of cloud controls to meet your organization’s security and risk requirements, as shown in Figure 8.

[Figure 8]

As this figure shows, policies and controls defined at higher level OU’s/MG’s are enforced downwards while costs and compliance are reported upwards. For our medical supply customer, this means their monthly Azure bill is automatically itemized by their bounded contexts with summarized cloud costs for Orders, Distributors and Payers to name a few.

This makes it easy for their CTO to share cloud costs with their business counterparts and establish realistic budgets that can be monitored over time. Just like costs, policy compliance across all contexts can be reported upwards with evidence stored in the Compliance technical context for auditing or forensic purposes. Services such as Azure Policy and AWS Audit Manager are helpful for continually maintaining compliance across your cloud environments by organizing your policies and controls in one place for management.

Step 3: Align Workloads to Bounded Contexts

With a solid foundation and our bounded contexts identified, the next step is to align your workloads to the bounded contexts. Identifying all the workloads that will run in your cloud environment is often done during a cloud migration discovery, aided in part by a configuration management database (CMDB) that contains your organization’s portfolio of applications.

When aligning workloads to bounded contexts we prefer a workshop approach that promotes discussion and collaboration. In our experience this makes DDC understandable and relatable to the teams involved in the migration. Because teams must develop and support these workloads, the workshop also highlights where organizational structures may align (or not) to bounded contexts. This workshop (or a follow-up one) can also identify which applications should be independently deployable and how each team’s ownership boundaries map to bounded contexts.

For our medical supply customer, this workshop revealed that a shared CI/CD tool in the Shared Services context needed permissions to deploy a new version of their Order Management system in the Orders context. This drove a discussion about how secrets and permissions would be managed across contexts, identifying new secrets management capabilities that were prioritized during the cloud migration. By creating a reusable solution that worked for all future workloads in domain contexts, the cloud team created a new capability that improved the speed of future migrations.

Figure 9 summarizes how our customer aligned their workloads to bounded contexts, which are aligned to their Azure Management Groups.

[Figure 9]

Within the Orders context, our customer used Azure Resource Groups for independently deployable applications or services that contain Azure Resources, as shown in Figure 10.

[Figure 10]

This design served as a starting point for their initial migration of applications running in a data center to Azure. Over the next few years their goal was to re-factor these applications into multiple independent micro-services. When this time came, they could do this incrementally, one application at a time, by creating additional Resource Groups for each service.

If our customer were using AWS, Figure 10 would look very similar but use Organizational Units, Accounts and AWS CloudFormation stacks for organizing independently deployable applications or services that contain resources. One difference between the cloud providers is that AWS allows nested stacks (stacks within stacks), whereas Azure Resource Groups cannot be nested.

For networking, in order for workloads running in domain contexts to access shared services in technical contexts, their networks must be connected or permissions explicitly enabled to allow access. While the Network technical context contains centralized networking services, by default each Account or Subscription aligned to a domain context will have its own private network containing subnets that are independently created, maintained and used by the workloads running inside them.

Depending on the total number of Accounts or Subscriptions, this may be desired, or it may be too many separate networks to manage (each potentially has its own IP range). Alternatively, core networks can be defined in the Network context and shared with specific domain or technical contexts, thereby avoiding every context having its own private network. The details of cloud networking are beyond the scope of this article, but DDC enables multiple networking options while still aligning your cloud architecture to your business model. Bottom line: you don’t have to sacrifice network security to adopt DDC.
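
As one possible way to share a core network across contexts on AWS, the sketch below uses AWS Resource Access Manager to share a subnet owned by the Network context with a domain context’s OU. The subnet and OU ARNs are placeholder assumptions, and this is only one option among several (VPC peering, Transit Gateway, hub-and-spoke VNets on Azure, and so on).

```java
import software.amazon.awssdk.services.ram.RamClient;
import software.amazon.awssdk.services.ram.model.CreateResourceShareRequest;

public class SharedNetwork {

    // Shares a subnet owned by the Network context with the OU of a domain
    // context, so workloads there can use it without owning their own VPC.
    // The ARNs passed in are illustrative placeholders.
    public static void shareSubnetWithContext(RamClient ram, String subnetArn, String contextOuArn) {
        ram.createResourceShare(CreateResourceShareRequest.builder()
                .name("core-network-share")
                .resourceArns(subnetArn)       // e.g. arn:aws:ec2:...:subnet/subnet-123
                .principals(contextOuArn)      // e.g. the Orders OU ARN
                .allowExternalPrincipals(false)
                .build());
    }
}
```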

Step 4: Migrate Workloads

Once we identified where each workload would run, it was time to begin moving workloads into the right Account or Subscription. While this was a new migration for our customer (greenfield), for your organization this may involve re-architecting your existing cloud platform (brownfield). Migrating a portfolio of workloads to AWS or Azure and the steps for architecting your cloud platform are beyond the scope of this article, but with respect to DDC, here is a checklist of the key things to keep in mind:

  • Name your AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) after your bounded contexts.
  • Organize your contexts into domain and technical groupings, with:
    • Technical contexts as the foundation and platform layers of your cloud architecture.
    • Domain contexts as the application layer of your cloud architecture.
  • Centralize common controls in technical contexts for a strong security posture.
  • Decentralize selected controls in domain contexts to promote team autonomy, speed and agility.
  • Use inheritance within OU’s or MG’s for enforcing policies and controls downward while reporting cost and compliance upwards.
  • Decide on your Account / Subscription taxonomy within the OU’s / MG’s, balancing workload isolation with management complexity.
  • Decide how your networks will map to domain and technical contexts, balancing centralization versus decentralization.
  • Create domain context templates for consistency and use these when provisioning new Accounts / Subscriptions.

For brownfield deployments of DDC that are starting with an existing cloud architecture, the basic recipe is:

  1. Create new OU’s / MG’s named after your bounded contexts. For a period of time these will live side-by-side with your existing OU’s / MG’s and should have no impact on current operations.
  2. Implement policies and controls within the new OU’s / MG’s for your technical contexts, using inheritance as appropriate.
  3. Create a common code template for all domain contexts that inherits policies and controls from your technical contexts. Use parameters for anything that’s different between contexts.
  4. Based on the output of your workloads mapping workshop, for each workload either:
    • a.  Create a new Account / Subscription using the common template, aligned with your desired account taxonomy, for holding the workload or
    • b.  Migrate an existing Account / Subscription, including all workloads and resources within it, to the new OU / MG (see the sketch after this list). When migrating, pay careful attention to controls from the originating OU / MG to ensure they are also enabled in the target OU / MG.
  5. The order in which you move workloads will be driven by the dependencies between them, so understand these dependencies before beginning. The same goes for shared services that workloads depend on.
  6. Depending on the number of workloads to migrate, this may take weeks or months (but hopefully not years). Work methodically as you migrate workloads, verifying that controls, costs and compliance are working correctly for each context.
  7. Once done, decommission the old OU / MG structure and any Accounts / Subscriptions no longer in use.
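
For step 4b above, here is a minimal sketch (AWS SDK for Java v2; account and OU IDs are placeholder assumptions) of moving an existing Account into a context-aligned OU and then listing the Service Control Policies now in effect, so they can be compared against the controls of the originating OU before it is decommissioned:

```java
import software.amazon.awssdk.services.organizations.OrganizationsClient;
import software.amazon.awssdk.services.organizations.model.ListPoliciesForTargetRequest;
import software.amazon.awssdk.services.organizations.model.MoveAccountRequest;
import software.amazon.awssdk.services.organizations.model.PolicyType;

public class BrownfieldMigration {

    // Moves an existing Account from a legacy OU into the new context-aligned
    // OU, then prints the SCPs the Account now inherits for verification.
    public static void moveIntoContext(OrganizationsClient org, String accountId,
                                       String legacyOuId, String contextOuId) {
        org.moveAccount(MoveAccountRequest.builder()
                .accountId(accountId)
                .sourceParentId(legacyOuId)
                .destinationParentId(contextOuId)
                .build());

        org.listPoliciesForTarget(ListPoliciesForTargetRequest.builder()
                .targetId(accountId)
                .filter(PolicyType.SERVICE_CONTROL_POLICY)
                .build())
            .policies()
            .forEach(p -> System.out.println("Now inherited: " + p.name()));
    }
}
```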

Step 5: Inspect and Adapt

Your cloud architecture is not a static artifact; the design will continue to evolve over time as your business changes and new technologies emerge. New bounded contexts will appear that require changes to your cloud platform. Ideally much of this work is codified and automated, but in all likelihood you will still have some manual steps involved as your bounded contexts evolve.

Your Account / Subscription taxonomy may change over time too, starting with fewer to simplify initial management and growing as your teams and processes mature. The responsibility boundaries of teams and how these align to bounded contexts will also mature over time. Methods like GitOps work nicely alongside DDC to keep your cloud infrastructure flexible and extensible over time and continually aligned with your business model.

Conclusion

DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure (BizDevOps). DDC is based on the software architecture principle of high cohesion and low coupling that is used when designing complex distributed systems, like your AWS and Azure environments. Employing the transparency and shared language benefits of DDD when creating your organization’s cloud architecture results in a secure-yet-flexible platform that naturally evolves as your business changes over time.

Special thanks to John Chapin, Casey Lee, Brandon Linton and Nick Tune for feedback on early drafts of this article and Abby Franks for the images.

Evolving the Federated GraphQL Platform at Netflix

Key Takeaways

  • Federated GraphQL distributes the ownership of the graph across several teams. This requires all teams to adopt and learn federated GraphQL and can be accomplished by providing a well-rounded ecosystem for developer and schema workflows.
  • Before you start building custom tools, it helps to use existing tools and resources adopted by the community and gradually work with initial adopters to identify gaps.
  • The Domain Graph Services (DGS) Framework is a Spring Boot-based Java framework that allows developers to easily build GraphQL services that can then be part of the federated graph. It provides many out-of-the-box integrations with the Netflix ecosystem, but the core is also available as an open-source project to the community.
  • As more teams came on board, we had to keep up with the scale of development. In addition to helping build GraphQL services, we also needed to replace manual workflows for schema collaboration with more tools that address the end-to-end schema development workflow to help work with the federated graph.
  • Today we have more than 200 services that are part of the federated graph. Federated GraphQL continues to be a success story at Netflix. We are migrating our Streaming APIs to a federated architecture and continue to invest in improving the performance and observability of the graph.

In this article, we will describe our migration journey toward a Federated GraphQL architecture. Specifically, we will talk about the GraphQL platform we built consisting of the Domain Graph Services (DGS) Framework for implementing GraphQL services in Java using Spring Boot and graphql-java and tools for schema development. We will also describe how the ecosystem has evolved at various stages of adoption.

Why Federated GraphQL?

Netflix has been evolving its slate of original content over the past several years. Many teams within the studio organization work on applications and services to facilitate production, such as talent management, budgeting, and post-production.

In a few cases, teams had independently created their APIs using gRPC, REST, and even GraphQL. Clients would have to talk to one or more backend services to fetch the data they needed. There was no consistency in implementation, and we had many ways of fetching the same data resulting in multiple sources of truth. To remedy this, we created a unified GraphQL API backed by a monolith. The client teams could then access all the data needed through this unified API, and they only needed to talk to a single backend service. The monolith, in turn, would do all the work of communicating with all the required backends to fetch and return the data in one GraphQL response.

However, this monolith did not scale as more teams added their data behind this unified GraphQL API. It required domain knowledge to determine how to translate the incoming requests to corresponding calls out to the various services. This created maintenance and operational burden on the team maintaining this graph. In addition, the evolution of the schema was also not owned by the product teams primarily responsible for the data, which resulted in poorly designed APIs for clients.

We wanted to explore a different ownership model, such that the teams owning the data could also be responsible for their GraphQL API while still maintaining the unified GraphQL API for client developers to interact with (see Figure 1).


Figure 1: Federated ownership of graph

In 2019, Apollo released the Federation Spec, allowing teams to own subgraphs while still being part of a single graph. In other words, the ownership of the graph could be federated across multiple teams by breaking apart the various resolvers handling the incoming GraphQL requests. Instead of the monolith, we could now use a simple gateway that routes requests to the GraphQL backends, called Domain Graph Services (DGSs), each of which serves a subgraph. Each DGS handles fetching the data from the corresponding backends owned by the same team (see Figure 2). We started experimenting with a custom implementation of a federated GraphQL gateway and began working with a few teams to migrate to this new architecture.


Figure 2: Federated GraphQL Architecture

A couple of years later, we now have more than 150 services that are a part of this graph. Not only have we expanded the studio graph, but we have created more graphs for other domains, such as our internal platform tools, and another for migrating our streaming APIs as well.

The Early Adoption Phase

When we started the migration, we had around 40 teams already contributing to the unified graph served by our GraphQL monolith. We asked all these teams to migrate to a completely new architecture, which required knowledge of Federated GraphQL – an entirely new concept for us as well. Providing a great developer experience was key to successfully driving adoption at this scale.

Initially, a few teams opted to onboard onto the new architecture. We swarmed with the developers on the team to better understand the developer workflow, the gaps, and the tools required to bridge the knowledge gap and ease the migration process.

Our goal was to make it as easy as possible for adopters to implement a new GraphQL service and make it part of the federated graph. We started to gradually build out the GraphQL platform consisting of several tools and frameworks and continued to evolve during various stages of adoption.

Our Evolving GraphQL Ecosystem

The Domain Graph Services (DGS) Framework is a Spring Boot library based on graphql-java that allows developers to easily wire-up GraphQL resolvers for their schema. Initially, we created this framework with the goal of providing Netflix-specific integrations for security, metrics, and tracing out of the box for developers at Netflix. In addition, we wanted to eliminate the manual wire-up of resolvers, which we could optimize using custom DGS annotations as part of the core. Figure 3 shows the modular architecture of the framework with several opt-in features for developers to choose from.


Figure 3: DGS Framework Architecture

When using the DGS Framework, developers can simply focus on the business domain logic and less on learning all the specifics of GraphQL. In addition, we created a code generation Gradle plugin for generating Java or Kotlin classes that represent the schema. This eliminates the manual creation of these classes.
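
To give a feel for this, here is a hedged sketch of a DGS data fetcher using the open-source framework’s annotations. The domain types (`Production`, `ProductionService`), package layout and schema field are hypothetical examples, not Netflix’s actual code.

```java
import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsQuery;
import com.netflix.graphql.dgs.InputArgument;
import java.util.List;

// Hypothetical domain types owned by the team; names are illustrative only.
record Production(String id, String title) {}

interface ProductionService {
    List<Production> findByTitle(String titleFilter);
}

// A DGS data fetcher: the @DgsQuery method resolves the `productions` query
// field declared in this service's subgraph schema; the federated gateway
// routes matching queries here.
@DgsComponent
public class ProductionsDataFetcher {

    private final ProductionService productionService;

    public ProductionsDataFetcher(ProductionService productionService) {
        this.productionService = productionService;
    }

    @DgsQuery
    public List<Production> productions(@InputArgument String titleFilter) {
        return productionService.findByTitle(titleFilter);
    }
}
```

In a setup like this, the schema stays with the DGS, and classes such as `Production` could be generated from it by the code generation plugin rather than written by hand.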

Over time, we added more features that were more generally useful, even for service developers outside Netflix. We decided to open-source the DGS Framework and the DGS code generation plugin in early 2021 and continue evolving it. We also created the DGS IntelliJ plugin that provides navigation from schema to implementation of data resolvers and code completion snippets for implementing your DGS.

Having tackled implementing a GraphQL service, the next step was to register the DGS so it could be part of the federated graph. We implemented a schema registry service to map the schemas to their corresponding services. The federated gateway uses the schema registry to determine which services to reach out to, given an incoming query. We also implemented a self-service UI for this schema registry service to help developers manage their DGSs and discover the federated schema.

Finally, we enhanced our existing observability tools for GraphQL. Our distributed tracing tool allows easy debugging of performance issues and request errors by providing an overall view of the call graph, in addition to the ability to view correlated logs.

Scaling the Federated GraphQL Platform

We started this effort in early 2019, and since then, we have more than 200 teams participating in the federated graph. The adoption of federated GraphQL architecture has been so successful that we ended up creating more graphs for other domains in addition to our Studio Graph. We now have one for internal platform tooling, which we call the Enterprise Graph, and another for our streaming APIs.

After introducing the new Enterprise graph, we quickly realized that teams were interested in exposing the same DGS as part of the Enterprise and Studio graphs. Similarly, clients were interested in fetching data from both graphs. We then merged the Studio and Enterprise graphs into one larger supergraph. This created a different set of challenges for us related to scaling the graph for the size and number of developers.

The larger graph made it harder to scale our schema review process, since it had been mostly manually overseen by members of our schema review group. We needed to create more tooling to automate some of these processes. We created a tool called GraphDoctor to lint the schema and automatically comment on PRs related to schema changes for all enrolled services. To help with schema collaboration, we created GraphLabs, which stages a sandboxed environment to test schema changes without affecting other parts of the graph. This allows both front-end and back-end developers to collaborate on schema changes more rapidly (see Figure 4).


Figure 4: Schema Development Workflow

Developer Support Model

We built the GraphQL platform to facilitate implementing domain graph services and working with the graph. However, this alone would not have been sufficient. We needed to complement the experience with good developer support. Initially, we offered a white-glove migration experience by swarming with the teams and doing much of the migration work for them. This provided many insights into what we needed to build to improve the developer experience. We identified gaps in the existing solutions that could help speed up implementation by allowing developers to focus on the business logic and eliminate repetitive code setup.

Once we had a fairly stable platform, we could onboard many more teams at a rapid pace. We also invested heavily in providing our developers with good documentation and tutorials on federation and GraphQL concepts so they can self-service easily. We continue to offer developer support on our internal communication channels via Slack during business hours to help answer any questions and troubleshoot issues as they arise.

Developer Impact

The GraphQL platform provides a completely paved path for the entire workflow, starting from schema design, implementation of the schema in a GraphQL service, registering the service to be a part of the federated graph, and operating the same once deployed. This has helped more teams adopt the architecture, making GraphQL more popular than traditional REST APIs at Netflix. Specifically, Federated GraphQL greatly simplifies data access for front-end developers, allowing teams to move quickly on their deliverables.

Our Learnings

By investing heavily in developer experience, we were able to drive adoption at a much more accelerated pace than we would have otherwise. We started small by leveraging community tools. That helped us identify gaps and where we needed custom functionality. We built the DGS Framework and an ecosystem of tools, such as the code generation plugin and even one for basic schema management.

Having tackled the basic workflow, we could focus our efforts on more comprehensive schema workflow tools. As adoption increased, we were able to identify problems and adapt the platform to make it work with larger graphs and scale it for an increasing number of developers. We automated a part of the schema review process, which has made working with larger graphs easier. We continue to see new use cases emerge and are working to evolve our platform to provide paved-path solutions for the same.

What’s ahead?

So far, we have migrated our Studio architecture to federated GraphQL and merged a new graph for internal platform teams with the Studio Graph to form one larger supergraph. We are now migrating our Netflix streaming APIs that power the discovery experience in the Netflix UI to the same model. This new graph comes with a different set of challenges. The Netflix streaming service is supported across various devices, and the UI is rendered differently on each platform. The schema needs to be well-designed to accommodate the different use cases.

Another significant difference is that the streaming services need to handle significantly higher RPS, unlike the other existing graphs. We are identifying performance bottlenecks in the framework and tooling to make our GraphQL services more performant. In parallel, we are also improving our observability tooling to ensure we can operate these services at scale.

Enhancing Your “Definition of Done” Can Improve Your Minimum Viable Architecture

Key Takeaways

  • Adding sustainability criteria to a team’s “Definition of Done” (DoD) can help them evaluate the quality of their architectural decisions as they evolve their Minimum Viable Architectures (MVAs).
  • Using the evolution of the Minimum Viable Product (MVP) as a way to examine when architectural decisions need to be made provides concrete guidance on MVA decisions, including when the MVA can be considered temporarily “done”.
  • MVP as a concept never goes away. The DoD provides a way to make sure that each MVP and its associated MVA are sustainable.
  • Continuous Delivery practices can provide teams with the means to automate, evaluate, and enforce their Definition of Done. When delivering in rapid cycles, doing so may be the only practical way to ensure a robust DoD.
  • Software architecture is now a continual flow of decisions that are revisited continuously, and architectural criteria in DoDs have become ephemeral. Software architectures are never “done”, but effective automated DoDs can help to improve them with each release.
     

A Definition of Done (DoD) is a description of the criteria that a development team and its stakeholders have established to determine whether a software product is releasable (for a more complete explanation of the Definition of Done in an agile context, see this blog post). The DoD helps to ensure that all stakeholders have a shared understanding of what minimum bar a product increment must achieve before it can be released. The DoD usually focuses on functional aspects of quality, but teams can strengthen the quality and sustainability of their products if they expand their DoD to include architectural considerations.

Architecture and the Definition of Done

Typical contents of a DoD include criteria such as:

  • The code must pass code review/inspection
  • The code must pass all unit tests
  • The code must pass all functional tests
  • The product must be defect-free
  • The build must be deployed for testing on its target platforms
  • The code must be documented.

Extending a DoD with architectural concerns means strengthening the DoD with criteria that assess the sustainability of the solution over time. Examples of this include:

  • Assessments of the code’s maintainability and extensibility
  • Assessments of the security of the product
  • Assessments of the scalability of the product
  • Assessments of the performance of the product
  • Assessments of the resiliency of the product
  • Assessments of the amount of technical debt (TD) incurred during the development of the product, and estimates for “repaying” that technical debt

By extending their DoD with architectural concerns, a team is able to evaluate the sustainability of their product over time every time they evaluate whether their product can be released.

Software architecture, agility, and the DoD

In a series of prior articles, we introduced the Minimum Viable Architecture (MVA) concept, which provides the architectural foundation for a Minimum Viable Product (MVP). Many people think of an MVP as something that is useful only in the early stages of product development.

We don’t think of  MVPs and MVAs as “one and done”, but instead as a series of incremental deliveries. The product evolves MVP by MVP. Each MVP should ideally increase customer satisfaction with the product, while each MVA should increase the system’s quality while supporting the MVP delivery (see Figure 1). When a team uses an agile approach, every release is an incremental MVP that helps a team to evaluate the value of that release.

Figure 1: A definition of an MVA

An MVP is an implicit promise to customers that what they are experiencing will continue to be valuable for the life of the product. It is not a “throw-away” prototype to evaluate a concept; the “product” part of MVP implies that a customer who finds the MVP valuable can reasonably expect that they will be able to continue to use the product in the future.

Because of this implicit promise of support over time, every MVP has a supporting MVA that provides an essential architectural foundation to ensure that the MVP will be supportable over time in a sustainable manner. While the typical DoD focuses on functionality, i.e. the MVP, extending the DoD to evaluate architectural concerns helps a team to make sure that the MVA is also “fit for purpose.” When extended to consider architectural concerns, the DoD can help a team evaluate whether its MVA is “good enough.”

Just as the MVP can only be evaluated by delivering running code to real users or customers and getting their feedback, the MVA can only be assessed by stringent testing and evaluation of executing code. The only way to validate the assumptions made by an architecture is to build it and subject the critical parts of the system to tests. Not the whole system, but enough of it that critical architectural decisions can be evaluated empirically. This is the concept behind the MVA.

In an agile approach, time is not on your side

The challenge that teams who work in short cycles using agile approaches face is that they have almost no chance of really evaluating the quality and sustainability of their architecture without automating most of the evaluation work. Most teams are challenged to simply develop a stable, potentially valuable MVP. Making sure the MVP is sustainable by building an MVA is yet more work.

Without automation, teams simply cannot find the time to develop the MVP/MVA and evaluate whether it meets the DoD. Adding more people does not help, as doing so will drag down the effectiveness of the team. Handing the evaluation off to another team is similarly ineffective as it also reduces team effectiveness and slows the receipt of important feedback.

The most effective way to automate the DoD is to build it into the team’s automated continuous delivery (CD) pipeline, supplemented by longer-running quality tests that help to evaluate the fitness for purpose of the MVA. This approach is sometimes referred to as using fitness functions. Doing so has several benefits:

  • Developers obtain immediate feedback about the quality of their work
  • The entire team benefits by not having to work around changes that are not yet “fully baked”.
  • Parts of the system that cannot be evaluated when code is delivered into the code repository are checked frequently to catch architectural degradation.

In practical terms, this often includes adding the following kinds of tools and evaluation steps to the CD pipeline:

  • Code scanning tools. Tools like lint and its successors have been around for a long time. Though many developers use IDEs that catch common coding errors, having a way to check for common coding errors, the existence of comments, usage of style conventions, and many other conditions that can be caught by static code analysis is a useful starting point for automating some basic architectural assessment of the system. For example, some types of security vulnerabilities can also be caught by code-scanning tools, and modularity and code complexity problems can be flagged (see the sketch after this list). The usual argument against code scanning is that, if the rules are not tuned effectively, it can generate more noise than signal. When it does, developers ignore it and it becomes ineffective.
  • Build tools. Making sure code can be built into an executable using agreed-upon standard architectural components and environment settings uncovers subtle problems that often fall under the “it works fine on my machine” category of bugs. When developers use different versions of components, or new components not agreed upon by the rest of the team, they introduce errors caused by configurations that are hard to isolate. Sometimes old versions of components have known security vulnerabilities so building with patched versions helps to eliminate security vulnerabilities.
  • Automated provisioning and deployment tools. Once the code is built into an executable, it needs to be deployed into a standard testing environment. Decisions about standard configurations are important architectural decisions; without them, code can fail in unpredictable ways. When these environments are hand-built by developers, their configurations can “drift” from those of the eventual production environment, resulting in another class of “it works fine in the testing environment” kinds of errors. We’ve also noticed that when it’s not easy to create a fresh testing environment, such as when testing environments have to be manually created, people will reuse testing environments. Over time, this can cause the testing environment’s state to drift from that of the intended production environment. Similarly, when someone has to manually deploy code they can forget things, or even just do things in a different order, so that the testing environment is subtly but significantly different from the production environment.
  • Unit testing tools. At its most fundamental level, unit testing makes sure that APIs work the way that the team has agreed they should. While automated builds catch degradation in interfaces, unit testing makes sure the agreed-upon behavior is still provided. Contract testing approaches provide a useful way to evaluate when applications break API commitments.
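
The modularity and dependency rules mentioned above can also be written as executable fitness functions that run with the ordinary test suite. The sketch below uses ArchUnit with JUnit 5 as one illustrative option; the package names and the specific rule are assumptions, not a prescription.

```java
import com.tngtech.archunit.junit.AnalyzeClasses;
import com.tngtech.archunit.junit.ArchTest;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

// Runs as an ordinary test in the CD pipeline and fails the build when the
// dependency rule is violated, catching architectural drift early.
@AnalyzeClasses(packages = "com.example.orders")
public class ArchitectureFitnessTest {

    @ArchTest
    static final ArchRule domainDoesNotDependOnInfrastructure =
            noClasses().that().resideInAPackage("..domain..")
                       .should().dependOnClassesThat().resideInAPackage("..infrastructure..");
}
```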

Incorporating these checks into the CD pipeline provides immediate feedback on the architectural fitness of the system, but to do so the checks need to run quickly or developers will rebel. The architectural checks in the CD pipeline represent only the first line of defense. Checks that provide deeper inspection should be run as automated background tasks, perhaps run overnight, or periodically in the background. These include:

  • API-based testing of functional requirements. Agile development cycles are too short to do meaningful manual testing. As a result, everything, including user interfaces, needs to have an API that can be driven by automated tests. Functional testing isn’t architectural, per se, but giving everything an API is useful for automating integration and system testing in a scalable way.
  • API-based scalability, throughput, and performance testing. These tests include things like scaling up the number of user instances, or process instances, to see what happens to the system under load. Even when the back-end consists largely of microservices that are dynamically replicated by cloud infrastructure, you still need to be able to easily scale the number of clients (see the sketch after this list).
  • Automated reliability testing. Approaches like Netflix’s “chaos monkey” help to evaluate how a system will respond to various kinds of failures, which in turn helps a team to develop a resilient architecture.
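
As a rough sketch of the simplest form such a scalability check might take, the snippet below fires a burst of concurrent requests at an API and fails if any of them do not succeed. The endpoint URL, user count and pass criterion are placeholder assumptions; dedicated load-testing tools that track latency percentiles are usually a better fit for the overnight runs described here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ApiLoadCheck {

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://test.example.com/api/orders"))
                .GET()
                .build();

        // Simulate a burst of concurrent "users" against the test environment.
        int concurrentUsers = 200;
        List<CompletableFuture<HttpResponse<String>>> calls = new ArrayList<>();
        for (int i = 0; i < concurrentUsers; i++) {
            calls.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
        }

        long ok = calls.stream()
                .map(CompletableFuture::join)
                .filter(r -> r.statusCode() == 200)
                .count();

        System.out.printf("%d/%d requests succeeded%n", ok, concurrentUsers);
        if (ok < concurrentUsers) {
            throw new IllegalStateException("Scalability check failed");
        }
    }
}
```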

Other useful tools and manual techniques include:

  • Test data management. Creating, maintaining, randomizing, and refreshing test data is complex and error-prone. Because it’s hard, people tend not to do it; they copy production data, leading to security vulnerabilities, or they use small manually-created test data that does not expose real-world edge cases or data volumes. The result is that important quality issues can remain undiscovered, including important behavioral defects, unhandled out-of-bounds conditions, stack overflows, race conditions, and security errors, to name a few
  • Code and architectural reviews. Manual inspections for code quality, architectural integrity, reliability, and maintainability can be valuable, but they usually can’t be release-blocking if the organization wants to deliver on short cycles. The issues that they find are real, and need to be addressed, but if the organization wants to obtain fast customer feedback, manual reviews have to be focused on important but non-blocking quality problems. Manual inspections simply cannot respond to the speed and volume of change.
  • Ethical hacking and simulated attacks. We live in a world in which vulnerabilities are constantly tested. Known vulnerability testing needs to be automated, as an extension of basic vulnerability checking in code scans, but there is still an important role for having skilled, creative, but friendly hackers constantly trying to infiltrate the system. The vulnerabilities they find should be folded into the automated testing described above.

Can the Definition of Done be overridden?

Strictly speaking, no. If the team agrees that the DoD criteria really represent the minimum quality bar that it is willing to accept for a released product increment (MVP), they should not lower their standards for any reason. No one is forcing them to adopt their DoD; it’s their decision, but if it means anything, they should not abandon it because it’s inconvenient.

This means that a team should thoughtfully discuss what releasable means to them and to their customers. While their DoD might be absolute, they might also want to have a separate category of tests they run to better inform their release decisions and their architectural decisions. If they are willing to release an MVP to obtain customer feedback they might be willing, especially early in the product’s lifecycle, to sacrifice scalability or maintainability, recognizing that they are consciously incurring technical debt. In making that decision, however, they need to understand that if they make a decision that renders the product unsustainable, they may inadvertently kill the product.

Is the architecture ever done?

With each MVP delivery, there are several possible outcomes:

  1. The MVP is successful and we believe that the MVA doesn’t need to change.
  2. The MVP is successful but the MVA is not sustainable. This outcome could follow the previous one after a period of time, as the MVA may decay over time, even if the MVP does not change.
  3. The MVP is mostly successful as its functionality meets a majority of the business stakeholders’ needs, but the MVA is not, as its Quality Attribute Requirements (QARs) are not fully met. The MVA needs improvement, and technical debt may accrue as a result.
  4. The MVP is partially but not wholly successful, but it can be fixed. As a result, the MVA needs to change, too. Again, technical debt may accrue as a result.
  5. The MVP isn’t successful and so the MVA doesn’t matter

We can group these five potential outcomes into the following four scenarios to help us evaluate whether we are really “done” by asking ourselves a few key questions as follows:

  • Scenario 1 (outcome 1): In this scenario, we may believe that we are “done” since both MVA and MVP appear to be successful. However, are we really done? What about the sustainability of the MVA? Do we really believe that the MVA won’t change over time and that technical debt (TD) won’t start accruing as a result of these changes?
  • Scenario 2 (outcome 2): Here, we believe that we are “done” with the MVP, but we have doubts about the long-term sustainability of the MVA. How can we confirm these doubts and prove that the MVA isn’t sustainable? Did we incur an excessive amount of TD while delivering the MVP? Did we test that key QARs such as scalability, performance, and resiliency will be met over a reasonable period of time? Can we estimate the effort necessary to “repay” some of the TD we have incurred so far?
  • Scenario 3 (outcomes 3 and 4): In this scenario, the MVP is mostly successful but we believe that the MVA isn’t sustainable since we need to make some significant architectural changes either immediately or in the near future. In other words, we know that we are not “done” with the architecture. How can we assess the sustainability of the MVA? Did we test whether the system is scalable and resilient? How will its performance change over time? How much TD was incurred in the MVP delivery?
  • Scenario 4 (outcome 5) is the only one where we can say that “we are done” – but obviously not in a good way. This means going back to the drawing board, usually because the organization’s understanding of customer needs was significantly lacking.

In summary, architectural criteria in a DoD are ephemeral in nature. MVAs are never really “done” as long as MVPs remain in use. As we noted in an earlier article, teams have to resist two temptations regarding their MVA: the first is ignoring its long-term value altogether and focusing only on quickly delivering functional product capabilities using a “throw-away” MVA. The second temptation they must resist is over-architecting the MVA to solve problems they may never encounter. This latter temptation bedevils traditional teams, who are under the illusion that they have the luxury of time, while the first temptation usually afflicts agile teams, who never seem to have enough time for the work they take on.

Conclusion

The Definition of Done helps agile teams make release decisions and guide their work. Most DoDs focus on functional completeness, and in so doing risk releasing products that may become unsustainable over time. Adding architectural considerations to their DoD helps teams correct this problem.

The problem, for most teams, is that they barely have enough time to evaluate a functional DoD, let alone consider architectural concerns. They cannot, realistically, reduce scope to make room for this work and risk unsatisfied stakeholders. Their only realistic solution is to automate as much of their DoD as possible, including its architectural criteria, to give themselves room to handle architectural concerns as they come up.

In expanding their DoD with architectural concerns, the concept of a Minimum Viable Architecture (MVA) helps them limit the scope of the architectural work to that which is only absolutely necessary to support their latest release, which can be thought of as an evolution of their Minimum Viable Product (MVP).

Evaluating the MVA (and functional aspects of the release) using a largely automated DoD provides teams with concrete empirical evidence that their architecture is fit for purpose, something that purely manual inspections can never do. Manual inspection still has a role, but more in developing new ideas and novel approaches to solving problems.

Modern software architecture has evolved into a continual flow of decisions that are revisited continuously. A Definition of Done, extended with architectural concerns, helps teams to continually examine and validate, or reject, their architectural decisions, preventing unhealthy architectural technical debt from accumulating.

Adaptive, Socio-Technical Systems with Architecture for Flow: Wardley Maps, DDD, and Team Topologies

Key Takeaways

  • When building and improving adaptive, socio-technical systems, we need to consider the system as a whole instead of focusing on local optimization of separate parts. We need a holistic approach that involves perspectives from business strategy, software architecture and design, and team organization, e.g., by combining Wardley Mapping, Domain-Driven Design, and Team Topologies
  • A Wardley Map – as a part of Wardley Mapping – provides a structured way from a business strategy point of view to discuss and visualize the landscape in which an organization is operating and competing. It helps to anticipate change and identify areas where an organization can innovate, improve efficiency, or outsource to gain a competitive advantage
  • Team Topologies’ well-defined team types and interaction modes help teams to adapt quickly to new circumstances and achieve a fast and sustainable flow of change from a team perspective
  • Domain-Driven Design (DDD) helps to discover the core domain providing competitive advantage and leverage modularity with bounded contexts as suitable seams to split a system apart and well-defined ownership boundaries from a software design perspective
  • The combination of Wardley Mapping, DDD, and Team Topologies provides a powerful holistic toolset to design, build, and evolve adaptive socio-technical systems that are optimized for a fast and sustainable flow of change and can evolve and thrive in the face of constant change  

In a world of rapid changes and increasing uncertainties, organizations must continuously adapt and evolve to remain competitive and excel in the market. Designing for adaptability sounds easier said than done. How do you design and build systems that can evolve and thrive in the face of constant change? My suggestion is to take a holistic approach that combines different viewpoints from business strategy, software architecture, and team organization.

This article provides a high-level introduction to combining Wardley Mapping, Domain-Driven Design (DDD), and Team Topologies to design and build adaptive, socio-technical systems optimized for a fast flow of change.

This article walks through evolving an example legacy system: an online school solution for junior students. It describes creating a Wardley Map to visualize the business landscape and demonstrates connecting the map with DDD to discover the core domain and decompose the monolithic big ball of mud into modular components (bounded contexts). Additionally, this article covers identifying suitable team boundaries for Team Topologies’ team types, leveraging the previously created Wardley Map as a foundation.

A systemic approach for optimizing organizations

Most initiatives that try to change or optimize a system are directed at improving its parts taken separately – they focus on the local optimization of separate parts rather than on the system as a whole.

According to Dr. Russell Ackoff – one of the pioneers of systems thinking – local optimization of separate parts of a system will not improve the performance of the whole. He stated that “a system is more than the sum of its parts. It’s a product of their interactions. The way parts fit together determines the performance of a system – not how they perform taken separately.” In addition, when building systems in general, we need to consider effectiveness (building the right thing) and efficiency (building the thing right).


Figure 1 – General challenges of building systems

Building the right thing addresses questions related to effectiveness, such as how aligned our solution is to user needs. Creating meaningful value for its customers is vital for an organization’s success. That involves understanding the problem and sharing a common understanding. Building the thing right focuses on efficiency, in particular, the efficiency of engineering practices. Efficiency is about utilization. It’s not only crucial to generate value but also to be able to deliver that value well. It’s about how fast we can deliver changes, how quickly and easily we can make a change effective and adapt to new circumstances. The one (building the right thing) does not go without the other (building the thing right). But as Dr. Russell Ackoff points out, “doing the wrong thing right is not nearly as good as doing the right thing wrong.”

Designing organizations for adaptability

To build adaptive, socio-technical systems that can evolve and thrive in the face of constant change – with effectiveness and efficiency in mind to build the right thing right, and considering the whole – we need a holistic approach combining various perspectives. It requires understanding the context-specific business landscape in which an organization is operating and competing, including the external forces impacting that landscape, in order to design effective business strategies. It requires gaining domain knowledge and understanding the business domain to build a system that is closely aligned with the business needs and strategy. And it requires aligning not only the technical solution but also the teams, and evolving their interactions to fit the system we build and the strategy we plan.


Figure 2 – Combining Wardley Mapping, Domain-Driven Design, and Team Topologies

Or in other words: one approach to designing, building, and evolving adaptive, socio-technical systems optimized for a fast flow of change could be connecting the dots between Wardley Mapping, Domain-Driven Design (DDD), and Team Topologies as Architecture for Flow.

Using Wardley Maps to visualize the landscape

Creating a Wardley Map is a good starting point for gaining a common understanding of the business landscape in which an organization is operating and competing. A Wardley Map is part of Wardley Mapping – a business strategy framework invented by Simon Wardley. A Wardley Map visualizes the landscape and the evolution of a value chain. It provides a structured way to discuss the landscape in a group, and it helps to identify areas where an organization can innovate, improve efficiency, or outsource to gain a competitive advantage.

When creating a Wardley Map, we typically start with identifying users and their user needs – they represent the anchor of the map. The users could be customers, business partners, shareholders, internal users, etc. In an example of an online school solution for junior students, as illustrated in Figure 3, the teachers and students represent the users. The teachers would like to create course content, plan a class, and support the students during their studies. The students have the user needs of studying courses, requesting and receiving help, and receiving evaluation feedback. Both need to sign up and sign in as well. Those user needs are directly or indirectly fulfilled by a chain of components delivering value to the users representing the value chain. The teachers and students are interacting directly with an online school component. That component is most visible to the users and located at the top of the value chain. At this point, the online school component is reflecting a monolithic big ball of mud that we can decompose into smaller parts when we come to the DDD perspective.


Figure 3 – The value chain (y-axis) of a Wardley Map

The online school component depends on other components, such as infrastructure components, e.g., data storage, a search engine, a message broker, an SMTP server, and compute platform components running on top of a virtual machine component. They are less visible to the user and placed further down the value chain.

The components of a value chain are typically mapped to evolution stages: genesis, custom-built, product (+rental), e.g., off-the-shelf products or open-source solutions, and commodity (+utility). Each evolution stage comes with different characteristics, as Figure 4 illustrates. Towards the left of the Wardley Map, components change far more frequently than components towards the right. On the left, we are dealing with a high level of uncertainty and an undefined, poorly understood market, while towards the right, components become more stable, known, widespread, and standardized, and the markets are well-defined and mature.


Figure 4 – Characteristics (extract) of the evolution stages

We can use these characteristics to determine the stage of evolution for the components of the value chain of our online school solution, representing the current landscape – as depicted in Figure 5. The online school component is a volatile component that changes frequently and provides a competitive advantage; it goes into the custom-built evolution stage. For the infrastructure components, such as the search engine, data storage, and message broker, we are currently using open-source solutions, and the VM component is provided by a server hosting provider as an off-the-shelf product. These infrastructure components therefore go into the product (+rental) evolution stage.


Figure 5 – The components of a value chain mapped to evolution stages (x-axis)
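For readers who prefer code to diagrams, the mapping in Figure 5 can also be captured as a simple data structure. The Python sketch below is purely illustrative – it is not part of Wardley Mapping itself – and the component names and stage assignments follow the online school example described above.

# Illustrative only: the online school value chain components and their
# current evolution stages, expressed as data.
from enum import Enum

class Evolution(Enum):
    GENESIS = 1
    CUSTOM_BUILT = 2
    PRODUCT_RENTAL = 3
    COMMODITY_UTILITY = 4

online_school_components = {
    "online school":   Evolution.CUSTOM_BUILT,    # volatile, competitive advantage
    "search engine":   Evolution.PRODUCT_RENTAL,  # open-source solution
    "data storage":    Evolution.PRODUCT_RENTAL,  # open-source solution
    "message broker":  Evolution.PRODUCT_RENTAL,  # open-source solution
    "virtual machine": Evolution.PRODUCT_RENTAL,  # off-the-shelf hosting product
}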

This Wardley Map represents the very first iteration. A Wardley Map is not supposed to represent a landscape with perfect precision; it provides a useful abstraction and approximation. Considerable value comes from creating a Wardley Map together with a group of people: it fosters a common understanding of the landscape among participants, and sharing the map with others allows for challenging one’s own assumptions. These conversations help to jointly discuss and understand the current and future landscape. We can use this map of the current landscape as a foundation for future discussions to evolve our system.

A Wardley Map is just one part of Wardley Mapping. In general, Wardley Mapping helps to design and evolve effective business strategies based on situational awareness and movement following a strategy cycle. Figure 6 illustrates the strategy cycle that “is a representation of change and how we need to react to it”, according to Simon Wardley.


Figure 6 – The strategy cycle of Wardley Mapping

The strategy cycle consists of five sections and starts with the purpose, the why of the business, which describes the reason and motivation for the organization’s existence. The landscape represents the competitive environment in which an organization operates – visualized by a Wardley Map, as described earlier.

Anticipating changes and identifying areas to innovate requires understanding the external forces impacting the landscape – described as climatic patterns. For example, one climatic pattern is that the landscape is never static but highly dynamic: everything evolves through the forces of supply and demand competition. Cloud-hosted services reflect this climatic pattern: what was non-existent decades ago evolved through genesis and custom-built, became a product, and is now a commodity.

To be able to respond to changes quickly and absorb changes gracefully, Wardley Mapping recommends applying context-independent doctrinal principles. Doctrinal principles are universal principles that each industry can apply regardless of their context. For example, the doctrinal principle of “Using appropriate methods per evolution stage” recommends building components in the genesis or custom-built evolution stage in-house using preferably agile methods, using or buying off-the-shelf products or open-source software for components in product (+rental) with preferably lean methods, or outsourcing components in commodity to utility suppliers using preferably six-sigma methods. Later in this article, we will address how to apply this doctrinal principle with DDD’s subdomain types supporting make-buy-outsource decisions.

Leadership is the last section of Wardley Mapping’s strategy cycle and is about context-specific decisions about what strategy to choose considering the landscape, climate, and doctrine. Simon Wardley provided a collection of gameplays that describe strategic actions an organization can take in terms of creating new markets, competing in established markets, protecting existing market positions, and exiting declining markets.  

Requirements for flow optimization from a team perspective

To optimize for a fast flow of change from a team perspective, we need to avoid functional silo teams with handovers. Instead, we need to aim for autonomous, cross-functional teams that design, develop, test, deploy, and operate the systems they are responsible for. We need to avoid repeated handovers so that work is not handed off to another team when implementing and releasing changes. We need to use small, long-lived teams as the norm. The teams need to own the system or the subsystem they are responsible for; they need end-to-end responsibility to achieve fast flow. We also need to limit the teams’ cognitive load: if a team’s cognitive load is exceeded, it becomes a delivery bottleneck, leading to quality issues and delays. While communication within the teams is highly desired, we have to restrict ongoing high-bandwidth communication between the teams to enable fast flow (see Figure 7).


Figure 7 – Requirements for flow optimization from a team perspective

Fundamentals of Team Topologies

And that’s where Team Topologies can help us, with its well-defined team types (see Figure 8) and well-defined interaction modes (see Figure 9). Stream-aligned teams are autonomous, cross-functional teams that are aligned to a continuous stream of work and focus on a fast flow of changes. To produce a steady flow of feature deliveries and focus on a fast flow of changes, stream-aligned teams need support from other teams, e.g., from platform teams. Platform teams support stream-aligned teams in delivering their work and are responsible for self-service platforms that stream-aligned teams can easily consume; they provide internal, self-service services and tools for using the platform they are responsible for. Enabling teams can be considered internal coaches supporting stream-aligned teams in identifying and acquiring missing capabilities. Complicated subsystem teams – an optional team type – support stream-aligned teams on particularly complicated subsystems that require specialized knowledge.


Figure 8 – The four team types of Team Topologies

All of these team types aim to increase autonomy and reduce the cognitive load of the stream-aligned teams to enable a fast flow of change in the end.

Arranging teams into the aforementioned team types is not enough to become effective. How these teams interact with each other, and when to change and evolve team interactions, is highly relevant for organizational effectiveness.

Figure 9 illustrates the interaction modes promoted by Team Topologies. With collaboration, teams are working very closely together over a limited period of time. It is suitable for rapid discovery and innovation, e.g., when exploring new technologies. Collaboration is meant to be short-lived. X-as-a-service suits well when one team needs to use a code library, a component, an API, or a platform, that can be effectively provided by another team “as a service”. It works best where predictable delivery is needed. Facilitating comes into play when one team would benefit from the active help of another team. This interaction mode is typical for enabling teams.


Figure 9 – The three interaction modes of Team Topologies

The combination of stream-aligned teams, platform teams, enabling teams, and optional complicated subsystem teams and their interaction modes of collaboration, X-as-a-service, and facilitating promotes organizational effectiveness.

Identifying suitable streams of changes

To apply Team Topologies and optimize a system for flow, we can use the previously created Wardley Map of the online school as a foundation. Optimizing a system for a fast flow of change requires knowing where the most important changes in a system occur – the streams of changes. According to Team Topologies, streams can be task-, role-, activity-, geography-, or customer-segment-oriented. In the online school example, we focus on activity streams represented by the user needs of our Wardley Map. The user needs of creating course content, planning classes, etc. – as depicted in Figure 10 – are good candidates for activity-oriented stream types. They are the focus when optimizing for flow.


Figure 10 – User needs as activity-oriented streams of changes

Partitioning the problem domain and discovering the core

The users and user needs not only represent the anchor of our Wardley Map, they also represent the problem domain. And that’s where DDD can come in. DDD helps us to gain domain knowledge of our problem domain and to partition the problem domain into smaller parts – the subdomains. But not all subdomains are equal – some are more valuable to the business than others. We have different types of subdomains, as illustrated in Figure 11.


Figure 11 – The subdomain types of DDD

The core domain is the essential part of our problem domain, providing competitive advantage. That is the subdomain in which we have to strategically invest the most and build software in-house. The user needs of creating course content, planning a class, learning support, and studying courses fall within the core domain. They provide a competitive advantage, leading to a high level of differentiation. Buying or outsourcing the solutions of this subdomain would jeopardize the business’s success, so we have to build the software for the core domain in-house.

The evaluation of student progress does not necessarily provide a competitive advantage, but it supports the teachers’ experience and is necessary for the organization to succeed. These user needs belong to a supporting subdomain – see Figure 11. The supporting subdomains help to support the core domain. They do not provide a competitive advantage but are necessary for the organization’s success and are typically prevalent in other competitors’ solutions as well. If possible, we should look out for buying off-the-shelf products or using open-source software solutions for supporting subdomains. If that is not possible and we have to custom-build the software for supporting subdomains, we should not invest heavily in that part of the system.

The user needs of signing in and signing up embody user needs of a generic subdomain. Generic subdomains are subdomains that many business systems have, such as authentication and registration. They aren’t core and provide no competitive advantage, but businesses cannot work without them. They are usually already solved by someone else. Buying off-the-shelf products or using open-source solutions, or outsourcing to utility suppliers should be applied to generic subdomains’ solutions.

The subdomain types help us to prioritize the strategic investment and support build, buy, and outsourcing decisions.
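To make that decision support concrete, the classification of the online school example could be summarized as a simple lookup. The Python sketch below is illustrative only; the user needs and decisions are taken from the preceding paragraphs.

# Illustrative mapping of user needs to subdomain type and sourcing decision.
subdomain_decisions = {
    "create course content":     ("core",       "build in-house, invest the most"),
    "plan a class":              ("core",       "build in-house, invest the most"),
    "support learning":          ("core",       "build in-house, invest the most"),
    "study courses":             ("core",       "build in-house, invest the most"),
    "evaluate student progress": ("supporting", "prefer off-the-shelf or open source; build with limited investment if needed"),
    "sign up / sign in":         ("generic",    "buy, use open source, or outsource"),
}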

Bounded Contexts as suitable seams for decomposition and team boundaries

The solutions of the subdomains are currently all mingled together in a tightly coupled monolithic big ball of mud with a messy model and no clear boundaries. To be responsive to changes, the architecture of our online school example needs to leverage modularity with high functional cohesion and loose coupling. We need to decompose our online school component into modular components. And that’s where we come to the bounded contexts of DDD. Bounded contexts group related business behavior together and reflect boundaries where a domain model can be applied. Bounded contexts not only help to split a system apart, but also work well as ownership boundaries. Designing bounded contexts and domain models involves a close collaboration between the domain experts and development teams to gain a shared understanding of the domain. There exist different complementary techniques for designing bounded contexts and domain models, such as EventStorming, Domain Storytelling, etc.

Figure 12 depicts the bounded contexts of the online school example. The bounded contexts of content creation, class management, course studies, and learning support fulfil the core domain related user needs. They are strategically important and require the most development effort. They go into the custom-built evolution stage and are built in-house.

The bounded contexts of student evaluation and notification handling belong to supporting subdomains. There might exist solutions on the market already. However, the teams decided that a higher level of specialization is necessary and to build them in-house, but the development investment should not be too high.

The identity and access management bounded context belongs to a generic subdomain. There exist several solutions on the market already. It should go either into the product (+rental) or commodity (+utility) evolution stage.


Figure 12 – The bounded contexts of the online school example

Bounded contexts not only help to split a system apart but also work well as ownership boundaries, forming a unit of purpose, mastery, and autonomy. They indicate suitable team boundaries for stream-aligned teams, as Figure 13 illustrates.


Figure 13 – Bounded contexts as well-defined team boundaries for stream-aligned teams

Identifying services supporting flow of change

To be able to focus on a fast flow of change, stream-aligned teams need support from other teams; they rely on other teams to support them in delivering their work. That requires identifying the services needed to support a reliable flow of change, which can form self-service platforms provided as easily consumable X-as-a-services. In general, a platform can vary in its level of abstraction: at a higher level, a platform can reflect a design system, a data platform, etc.; at a lower level, a platform can abstract away infrastructure or cross-cutting capabilities. In our online school example, the infrastructure-related components of our Wardley Map located in the product (+rental) and commodity (+utility) evolution stages are potential candidates for forming a platform that could be effectively provided as a service by platform teams (see Figure 14).


Figure 14 – Services for reliable flow of change

A possible team constellation

The previous considerations might result in the first-draft team constellation illustrated in Figure 15. In general, most teams in an organization will be cross-functional, autonomous teams with end-to-end responsibility. To achieve clear responsibility boundaries, one bounded context shall be owned by one team only; however, one team can own multiple bounded contexts. The four core domain related bounded contexts residing in the custom-built evolution stage are split among three stream-aligned teams. The supporting and generic subdomain related bounded contexts of this example are handled by another stream-aligned team. The infrastructure components are taken care of by one or multiple platform teams.


Figure 15 – The first draft of a possible team constellation

The platform teams could provide a variety of platforms to fulfil the user needs of the stream-aligned teams. That could be visualized in a different Wardley Map where the stream-aligned teams become the internal users with their own user needs. From there, we can continue with identifying and bridging capability gaps, which is where enabling teams can come in.

You can start small

You do not need to learn and know Wardley Mapping, Domain-Driven Design, and Team Topologies in detail before applying and combining them. You can start with the parts that are most useful for your context. You could start by creating a Wardley Map in a group together to generate a shared understanding of your competitive landscape. A significant value comes already from the conversations when creating and sharing the map with others and challenging your own assumptions. You could use the map as a structured way to guide and continue future conversations, e.g., identifying suitable streams of change and team boundaries, as illustrated in this article.

You can also think of starting with your teams and analyzing their current cognitive load and delivery bottlenecks. Are they dealing with repeated handovers, high levels of communication and coordination effort, blocking dependencies, a lack of ownership boundaries, high team cognitive load, etc.? The conversation could lead to alignment to streams, identifying suitable team boundaries, decomposing the system, and so on.

Alternatively, you might want to start with your current software architecture and assess its responsiveness to change, e.g., by analyzing what parts are entangled in a specific change and how these entangled parts are coupled. That might bring the conversation towards identifying suitable seams for modularization, where DDD can help with subdomains and bounded contexts. At some point, the paths of each individual starting point eventually cross, leading to Architecture for Flow. This is just one approach to designing and building adaptive socio-technical systems. In addition, you can complement optimizing your system for a fast flow of change with additional techniques and frameworks, e.g., value stream mapping, independent service heuristics, Cynefin, and many more.

Designing the Jit Analytics Architecture for Scale and Reuse

Key Takeaways

  • With analytics becoming a key service in SaaS products – both for vendors and users – we can often leverage the same architecture for multiple use cases.
  • When using serverless and event-driven architecture, the AWS building blocks available make it easy to design a robust and scalable analytics solution with existing data.
  • Leveraging cost-effective AWS services like AWS Kinesis Data Firehose, EventBridge, and Timestream, it is possible to quickly ramp up analytics for both internal and user consumption.
  • It’s important to note that different serverless building blocks have different limitations and capabilities, and working around these so you don’t have any bottlenecks is critical in the design phase of your architecture.
  • Be sure you consider the relevant data schemas, filtering, dimensions, and other data engineering aspects you will need to extract the most accurate data that will ensure reliable and high-quality metrics.
     

Analytics has become a core feature when building SaaS applications over event-driven architecture, as it is much easier to monitor usage patterns and present the data visually. Therefore, it isn’t surprising that this quickly became a feature request from our clients inside our own SaaS platform.

This brought about a Eureka! moment, where we understood that at the same time we set out to build this new feature for our clients, we could also better understand how our clients use our systems through internal analytics dashboards.

At Jit, we’re a security startup that helps development organizations quickly identify and easily resolve security issues in their applications. Our product has reached a certain level of maturity, where it is important for us to enable our users to have a visual understanding of their progress on their security journey. At the same time, we want to understand which product features are the most valuable to our clients and users.

This got us thinking about the most efficient way to architect an analytics solution that ingests data from the same source but presents that data to a few separate targets.

The first target is a customer metric lake, essentially a tenant-separated store of metrics over time. The other targets are third-party visualization and research tools for better product analysis that leverage the same data ingestion architecture.

At the time of writing, these tools are Mixpanel and HubSpot, both used by our go-to-market and product teams. This allows the aforementioned teams to collect valuable data on both individual tenant’s usage and general usage trends in the product.


If you’ve ever encountered a similar engineering challenge, we’re happy to dive into how we built this from the ground up using a serverless architecture.

As a serverless-based application, our primary data store is DynamoDB; however, we quickly understood that it does not have the time series capabilities we would require to aggregate and present the analytics data. Implementing this with our existing tooling would take much longer and would require significant investment for each new metric we’d like to monitor, measure, and present to our clients. So we set out to create something from scratch that we could build quickly with AWS building blocks and that would provide the dual functionality we were looking to achieve.

To create individualized graphs for each client, we recognized the necessity for processing data in a time series manner. Additionally, maintaining robust tenant isolation, ensuring each client can only access their unique data and thus preventing any potential data leakage, was a key design principle in this architecture. This took us on a journey to finding the right tools for the most economical job with the lowest management overhead and cost. We’ll walk through the technical considerations and implementation of building new analytics dashboards for internal and external consumptions.

Designing the Analytics Architecture

The architecture begins with the data source from which the data is ingested – events written by Jit’s many microservices. These events represent every little occurrence that happens across the system, such as a newly discovered security finding, a security finding that was fixed, and more. Our goal is to listen to all of these events and be able to eventually query them in a time series manner and present graphs that are based on them to our users.


Into the AWS EventBridge

These events are then fed into AWS EventBridge, where they are processed and transformed according to predefined criteria to convert them to a unified format that consists of data, metadata, and a metric name. This can be achieved by using EventBridge Rules. Since our architecture is already event driven and all of these events are already written to different event bridges, we simply needed to add EventBridge Rules in the places where we wanted to funnel the “KPI-Interesting” data into the analytics feed, which was easy to do programmatically.

Once the data and relevant events are transformed as part of the EventBridge Rule, they are sent into Amazon Kinesis Firehose. This can be achieved with EventBridge Rule’s Target feature, which can send the transformed events to various targets.

The events that are transformed into a unified schema must contain the following parameters to not be filtered out:

  1. metric_name field, which maps to the metric being measured over time.
  2. metadata dictionary – which contains all of the metadata on the event, where each table (the tenant isolation) is eventually created based upon the tenant_id parameter.
  3. data dictionary – which must contain event_time which tells the actual time that the event arrived (as the analytics and metrics will always need to be measured and visualized over a period of time).

Schema structure:

{
 "metric_name": "relevant_metric_name",
 "metadata": {
   "tenant_id": "relevant_tenant_id",
   "other_metadata_fields": "metadata_fields",
   ...
 },
 "data": {
   "event_time": ,
   "other_data_fields": ,
   ...
 }
}

AWS Kinesis Firehose

AWS Kinesis Data Firehose (Firehose in short) is the service that aggregates multiple events for the analytics engine and sends them to our target S3 bucket.


Once the number of events exceeds the threshold (which can be a size or a period of time), they are sent in a batch to S3 buckets to await being written to the time series database, as well as to any other event subscribers, such as our unified system that needs to get all tenant events.

Firehose’s job here is a vital part of the architecture. Because it waits for a threshold and then sends the events as a small batch, we know that when our code kicks in and begins processing the events, we’ll work with a small batch of events with a predictable size. This allows us to avoid memory errors and unforeseen issues.

Once one of the thresholds is passed, Kinesis performs a final validation on the data being sent, verifies that the data strictly complies with the required schema format, and discards anything that does not comply.

Invoking a Lambda from within Firehose allows us to discard the non-compliant events and perform additional transformation and enrichment, such as adding a tenant name. This involves querying an external system and enriching the data with information about the environment it’s running on. These properties are critical for the next phase, which creates one table per tenant in our time series database.
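Conceptually, such a Firehose data transformation Lambda might look roughly like the Python sketch below. The record envelope (recordId, base64-encoded data, result) is the standard Firehose transformation contract; the schema checks and the get_tenant_name lookup are simplified, hypothetical stand-ins for the actual enrichment logic.

import base64
import json

def get_tenant_name(tenant_id):
    # Hypothetical enrichment call to an external system.
    return f"tenant-{tenant_id}"

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Discard events that do not comply with the unified schema.
        if ("metric_name" not in payload
                or "tenant_id" not in payload.get("metadata", {})
                or "event_time" not in payload.get("data", {})):
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        # Enrich the event, e.g., with a resolved tenant name.
        payload["metadata"]["tenant_name"] = get_tenant_name(payload["metadata"]["tenant_id"])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}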

In the code section below, we can see:

  • A batching window is defined, in our case – 60 seconds or 5MB (the earlier of the two)
  • The data transformation lambda that validates and transforms all arriving events to ensure reliable, unified, and valid events for downstream services.

The lambda that handles the data transformation is called enrich-lambda. Note that Serverless Framework transforms its name into a lambda resource called EnrichDashdataLambdaFunction, so pay attention to this gotcha if you are also using Serverless Framework.

MetricsDLQ:
 Type: AWS::SQS::Queue
 Properties:
   QueueName: MetricsDLQ
KinesisFirehouseDeliveryStream:
 Type: AWS::KinesisFirehose::DeliveryStream
 Properties:
   DeliveryStreamName: metrics-firehose
   DeliveryStreamType: DirectPut
   ExtendedS3DestinationConfiguration:
     Prefix: "Data/" # This prefix is the actual one that later lambdas listen upon new file events
     ErrorOutputPrefix: "Error/"
     BucketARN: !GetAtt MetricsBucket.Arn # Bucket to save the data
     BufferingHints:
       IntervalInSeconds: 60
       SizeInMBs: 5
     CompressionFormat: ZIP
     RoleARN: !GetAtt FirehoseRole.Arn
     ProcessingConfiguration:
       Enabled: true
       Processors:
         - Parameters:
             - ParameterName: LambdaArn
               ParameterValue: !GetAtt EnrichDashdataLambdaFunction.Arn
           Type: Lambda # Enrichment lambda
EventBusRoleForFirehosePut:
 Type: AWS::IAM::Role
 Properties:
   AssumeRolePolicyDocument:
     Version: '2012-10-17'
     Statement:
       - Effect: Allow
         Principal:
           Service:
             - events.amazonaws.com
         Action:
           - sts:AssumeRole
   Policies:
     - PolicyName: FirehosePut
       PolicyDocument:
         Statement:
           - Effect: Allow
             Action:
               - firehose:PutRecord
               - firehose:PutRecordBatch
             Resource:
               - !GetAtt KinesisFirehouseDeliveryStream.Arn
     - PolicyName: DLQSendMessage
       PolicyDocument:
         Statement:
           - Effect: Allow
             Action:
               - sqs:SendMessage
             Resource:
               - !GetAtt MetricsDLQ.Arn

Below is the code for the EventBridge rules that map Jit events in the system to a unified structure and send the data to Firehose (below is the serverless.yaml snippet).

A code example of our event mappings:

FindingsUploadedRule:
 Type: AWS::Events::Rule
 Properties:
   Description: "When we finished uploading findings we send this notification."
   State: "ENABLED"
   EventBusName: findings-service-bus
   EventPattern:
     source:
       - "findings"
     detail-type:
       - "findings-uploaded"
   Targets:
     - Arn: !GetAtt KinesisFirehouseDeliveryStream.Arn
       Id: findings-complete-id
       RoleArn: !GetAtt EventBusRoleForFirehosePut.Arn
       DeadLetterConfig:
         Arn: !GetAtt MetricsDLQ.Arn
       InputTransformer:
         InputPathsMap:
           tenant_id: "$.detail.tenant_id"
           event_id: "$.detail.event_id"
           new_findings_count: "$.detail.new_findings_count"
           existing_findings_count: "$.detail.existing_findings_count"
           time: "$.detail.created_at"
         InputTemplate: >
           {
             "metric_name": "findings_upload_completed",
             "metadata": {
               "tenant_id": ,
               "event_id": ,
             },
             "data": {
               "new_findings_count": ,
               "existing_findings_count": ,
               "event_time": 

Here we transform an event named “findings-uploaded” that is already in the system (that other services listen to) into a unified event that is ready to be ingested by the metric service.

Timestream – Time Series Database

While, as a practice, you should always try to make do with the technologies you’re already using in-house and extend them to the required use case if possible (to reduce complexity), in Jit’s case, DynamoDB simply wasn’t the right fit for the purpose.

To be able to handle time series data on AWS (and perform diverse queries) while maintaining a reasonable total cost of ownership (TCO) for this service, new options needed to be explored. This data would later be represented in a custom dashboard per client, where time series capabilities were required (with the required strict format described above). After comparing possible solutions, we decided on the fully managed and low-cost database with SQL-like querying capabilities called Timestream as the core of the architecture.

Below is a sample piece of code that demonstrates what this looks like in practice:

SELECT * FROM "Metrics"."b271c41c-0e62-48d2-940e-d8c80b1fe242" 
WHERE time BETWEEN ago(1d) and now()

While other technologies were explored, such as Elasticsearch, we realized that they’d either be harder to manage and implement correctly as time series databases (for example, there would be greater difficulty with rolling out indexes and performing tenant separation and isolation) or would be much more costly. Whereas with Timestream, a table per tenant is simple, and it is far more economical, as it is priced solely by use. The pricing includes writing, querying, and storage usage. This may seem like a lot at first glance, but our comparison showed that with our predictable usage and the “peace of mind” that using it provides (given that it’s a serverless Amazon service with practically no management overhead), it is the more economically viable solution.

There are three core attributes for data in Timestream that optimize it for this use case – dimensions, measures, and time (you can learn more about each in the docs):

The dimensions essentially describe the data, such as unique identifiers per client (taken from the user’s metadata) and, in our case, the environment. The tenant_id is stripped out of the event and used as the Timestream table name, which is how tenant isolation is achieved. The remaining dimensions enable partitioning by these fields, which makes querying the data later efficient: the more dimensions we utilize, the less data needs to be scanned during queries, because the data is partitioned based on these dimensions, effectively creating an index. This, in turn, enhances query performance and provides greater economies of scale.

Measures are essentially anything you need to increment or enumerate (such as temperature or weight). In our case, these are the values we measure in different events, which works well for aggregating data.

Time is pretty straightforward; it is the timestamp of the event (when it was written to the database), which is also a critical function in analytics, as most queries and measurements are based on a certain time frame or window to evaluate success or improvement.
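Putting the three attributes together, writing an enriched event into a tenant’s table might look roughly like the boto3 sketch below. The database name, dimension names, and measure are illustrative assumptions based on the examples in this article, not the exact production code.

import time
import boto3

timestream = boto3.client("timestream-write")

def write_metric(tenant_id, environment, metric_name, value, event_time_ms=None):
    # One table per tenant: the tenant_id from the event metadata becomes the table name.
    timestream.write_records(
        DatabaseName="Metrics",              # assumed database name
        TableName=tenant_id,                 # tenant isolation via a table per tenant
        Records=[{
            "Dimensions": [                  # dimensions describe the data
                {"Name": "environment", "Value": environment},
                {"Name": "metric_name", "Value": metric_name},
            ],
            "MeasureName": metric_name,      # the value we aggregate over time
            "MeasureValue": str(value),
            "MeasureValueType": "BIGINT",
            "Time": str(event_time_ms or int(time.time() * 1000)),
            "TimeUnit": "MILLISECONDS",      # the time of the event
        }],
    )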

Visualizing the Data with Mixpanel and Segment

Once the ingestion, transformation, batch writing, and querying technology were defined, the dashboarding was easy. We explored the option of using popular open-source tools like Grafana and Kibana, which integrate pretty seamlessly with Timestream; however, we wanted to provide maximum customizability for our clients inside their UI, so we decided to go with homegrown and embeddable graphs.

Once Firehose has written the data to S3 in the desired format, a dedicated Lambda reads the data, transforms it into Timestream records, and writes them (as noted above, into a table per tenant, using the `tenant_id` from the metadata field). Another Lambda then sends this pre-formatted data to Segment and Mixpanel, providing a bird’s-eye view of all the tenant data for both internal ingestion and external user consumption. This is where it gets fun.

We leveraged Mixpanel and Segment internally, and we built the UI for our clients by exposing an API that performs the query against Timestream (which is tenant-separated by IAM permissions), making it possible for each client to visualize only their own data.
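Behind that API, the per-tenant query stays simple because the table name is the tenant identifier. A rough boto3 sketch, with the database name and time window as assumptions:

import boto3

timestream_query = boto3.client("timestream-query")

def get_tenant_metrics(tenant_id, days=30):
    # Each tenant only ever queries its own table; IAM policies restrict access further.
    query = (
        f'SELECT * FROM "Metrics"."{tenant_id}" '
        f"WHERE time BETWEEN ago({days}d) AND now()"
    )
    return timestream_query.query(QueryString=query)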


This enabled us to leverage Mixpanel and Segment as the analytics backbone and to give our clients Lego-like building blocks for the graphs they consume.

Leveraging tools like Mixpanel and Segment enables us to have cross-tenant and cross-customer insights for our graphs, to optimize our features and products for our users.

Important Caveats and Considerations

Deciding to go with Timestream and a fully serverless implementation does come with cost considerations and scale limitations. We spoke about the Timestream attributes above; however, each one has a certain threshold that cannot be exceeded, and it’s important to be aware of these. For example, there is a limit of 128 dimensions and a maximum of 1024 measures per table, so you have to ensure you architect your system not to exceed these thresholds.

When it comes to storage, there are two primary tiers: memory and magnetic (i.e., long-term; note that “magnetic” here refers to AWS Timestream’s long-term, cost-effective storage, not magnetic tapes). Memory storage is priced higher but comes with faster querying, within a limited retention window (two weeks in our case). You can feasibly store up to 200 years of data on magnetic, but everything has cost implications (we chose one year, as we felt that was sufficient – and it can be dynamically increased as needed). The great thing about AWS-based services is that much of the heavy lifting is done automatically, such as data being tiered automatically from the memory store to the magnetic store.

Other limitations include the number of tables per account (a 50K threshold), and there is also a 10MB minimum metered per query (and roughly a 1-second query time – which might not be as fast as other engines, but the cost advantage was significant enough for us to compromise on query speed). Therefore, you should be aware of the TCO and optimize queries to make full use of the 10MB minimum – and go higher when possible – while also reducing latency for clients. A good way to combat this is to cache data rather than run a full query in real time, and to consolidate data into a single query through unions.

Serverless Lego FTW!

By leveraging existing AWS services over a serverless architecture, we were able to ramp up our analytics capabilities quite quickly, with little management and maintenance overhead and a low-cost, pay-per-use model that keeps us cost effective. The best part of this scalable and flexible system is that it also makes it easy to add new metrics as our clients’ needs evolve.

Since all the events already exist in the system and are parsed through event bridges, adding a new, relevant metric is an easy extension of the existing framework: you create the relevant transformation and have a new metric in the system that you can query nearly instantaneously.

Through this framework, it is easy to add “consumers” in the future that leverage the same aggregated data. By building upon serverless building blocks, like Legos, it was possible to develop a scalable solution that supports a large and growing number of metrics in parallel, while future-proofing the architecture as business and technology requirements continuously evolve.

Using Project Orleans to Build Actor-Based Solutions on the .NET platform

Key Takeaways

  • Project Orleans has been completely overhauled in the latest version, making it easy to work with. It has also been re-written to fit in with the new IHost abstraction that was introduced in .NET Core.
  • The actor model is wonderful for the scenarios where it makes sense. It makes development a lot easier for scenarios where you can break down your solution into small stateful entities.
  • The code that developers need to write can be kept highly focused on solving the business needs, instead of on the clustering, networking and scaling, as this is all managed by Project Orleans under the hood, abstracted away.
  • Project Orleans makes heavy use of code generators. By simply implementing marker interfaces, the source generators will automatically add code to your classes during the build. This keeps your code simple and clean.
  • Getting started with Project Orleans is just a matter of adding references to a couple of NuGet packages, and adding a few lines of code to the startup of your application. After that, you can start creating Grains, by simply adding a new interface and implementation.

In this article, we will take a look at Project Orleans, which is an actor model framework from Microsoft. It has been around for a long time, but the new version, version 7, makes it a lot easier to get started with, as it builds on top of the .NET IHost abstraction. This allows us to add it to pretty much any .NET application in a simple way. On top of that it abstracts away most of the complicated parts, allowing us to focus on the important stuff, the problems we need to solve.

Project Orleans

Project Orleans is a framework designed and built by Microsoft to enable developers to build solutions using the actor-model, which is a way of building applications that enables developers to architect and build certain types of solutions in a much easier way than it would be to do it using for example an n-tier architecture.

Instead of building a monolith, or a services-based architecture where the services are statically provisioned, it allows you to decompose your application into lots of small, stateful services that can be provisioned dynamically when you need them. On top of that, they are spread out across a cluster more or less automatically.

This type of architecture lends itself extremely well to certain types of solutions, for example IoT devices, online gaming or auctions – basically, any solution that would benefit from an interactive, stateful “thing” that manages the current state and functionality, like a digital representation of an IoT device, a player in an online game, or an auction. Each of these scenarios becomes a lot easier to build when backed by an in-memory representation that can be called, compared to trying to manage it using an n-tier application and some state store.

Initially, Orleans was created to run Halo. And using the actor-model to back a game like that makes it possible to do things like modelling each player as its own tiny service, or actor, that handles that specific gamer’s inputs. Or model each game session as its own actor. And to do it in a distributed way that has few limitations when it needs to scale.

However, since its initial creation and use in Halo, it has been used to run a lot of different services, both at Microsoft and elsewhere, enabling many large, highly scalable solutions built by decomposing them into thousands of small, stateful services. Unfortunately, it is hard to know which systems are using Orleans, as not all companies are open about their tech stack, for different reasons. But looking around on the internet, you can find some examples: Microsoft uses it to run several Xbox services (Halo and Gears of War, for example), Skype, Azure IoT Hub and Digital Twins, and Honeywell uses it to build an IoT solution. It is definitely used in many more places to run some really cool services, even if it is not as easy as you might hope to find out where.

As you can see, it has been around for a long time, but has recently been revamped to fit better into the new .NET core world.

The actor pattern

The actor pattern is basically a way to model your application as a bunch of small services, called actors, where each actor represents a “thing”. So, for example, you could have an actor per player in an online game. Or maybe an actor for each of your IoT devices. But the general idea is that an actor is a named, singleton service with state. With that as a baseline, you can then build pretty much whatever you want.

It might be worth noting that using an actor-model approach is definitely not the right thing for all solutions. I would even say that most solutions do not benefit very much from it, or might even become more complex if it is used. But when it fits, it allows the solution to be built in a much less complex way, as it allows for stateful actors to be created.

Having to manage state in stateless services, like we normally do, can become a bit complicated. You constantly need to retrieve the state that you want to work with for each request, then manipulate it in whatever way you want, and finally persist it again. This can be both slow and tedious, and potentially put a lot of strain on the data store. So, we often try to speed this up, and take some of the load off the backing store, by using a cache – which in turn adds even more complexity. With Project Orleans, your state and functionality are already instantiated and ready in memory in a lot of cases. And when they aren’t, it handles the instantiation for you. This removes a lot of the tedious, repetitive work needed for data store communication, as well as the need for a cache, as the state is already in memory.

So, if you have any form of entity that works as a form of state machine for example, it becomes a lot easier to work with, as the entity is already set up in the correct state when you need it. On top of that, the single threaded nature of actors allows you to ignore the problems of multi-threading. Instead, you can focus on solving your business problems.

Imagine an online auction system that allows users to place bids and read existing bids. Without Orleans, you would probably handle this by having an Auction service that allows customers to perform these tasks by reading and writing data to several tables in a datastore, potentially supported by some form of cache to speed things up as well. However, in a high-load scenario, managing multiple bids coming in at once can get very complicated. More precisely, it requires you to figure out how to handle the locks in the database correctly to make sure that only the right bids are accepted based on several business rules. But you also need to make sure that the locks don’t cause performance issues for the reads. And so on …

By creating an Auction-actor for each auction item instead, all of this can be kept in memory. This makes it possible to easily query and update the bids without having to query a data store for each call. And because the data is in-memory, verifying whether a bid is valid or not is simply a matter of comparing it to the existing list of bids, and making a decision based on that. And since it is single-threaded by default, you don’t have to handle any complex threading. All bids will be handled sequentially. The performance will also be very good, as everything is in-memory, and there is no need to wait for data to be retrieved.

Sure, bids that are made probably need to be persisted in a database as well. But maybe you can get away with persisting them using an asynchronous approach to improve throughput. Or maybe you would still have to slow down the bidding process by writing it to the database straight away. Either way, it is up to you to make that decision, instead of being forced in either direction because of the architecture that was chosen.

Challenges we might face when using the actor pattern

First of all, you have to have a scenario that works well with the pattern. And that is definitely not all scenarios. But other than that, some of the challenges include things like figuring out what actors make the most sense, and how they can work together to create the solution.

Once that is in place, things like versioning of them can definitely cause some problems if you haven’t read up properly on how you should be doing that. Because Orleans is a distributed system, when you start rolling out updates, you need to make sure that the new versions of actors are backwards compatible, as there might be communication going on in the cluster using both the old and the new version at the same time. On top of that, depending on the chosen persistence, you might also have to make sure that the state is backwards compatible as well.

In general, it is not a huge problem, but it is something that you need to consider. That said, you often need to consider these things anyway if you are building any form of service-based solution.

Actor-based development with Project Orleans

It is actually quite simple to build actor based systems with Project Orleans. The first step is to define the actors, or Grains as they are called in Orleans. This is a two-part process. The first part is to define the API we need to interact with the actor, which is done using a plain old .NET interface.

There are a couple of requirements for the interface though. First of all, it needs to extend one of a handful of interfaces that come with Orleans; which one depends on what type of key you want to use. The choices you have are IGrainWithStringKey, IGrainWithGuidKey, IGrainWithIntegerKey, or a compound version of them.

All the methods on the interface also need to be async, as the calls might be going across the network. It could look something like this:

public interface IHelloGrain : IGrainWithStringKey
{
    Task<string> SayHello(User user);
}

Any parameters being sent to, or from the interface, also need to be marked with a custom serialisation attribute called GenerateSerializer, and a serialisation helper attribute called Id. Orleans uses a separate serialisation solution, so the Serializable attribute doesn’t work unfortunately. So, it could end up looking something like this:

[GenerateSerializer]
public class User
{
    [Id(0)] public int Id { get; set; }
    [Id(1)] public string FirstName { get; set; }
    [Id(2)] public string LastName { get; set; }
}

The second part is to create the grain implementation. This is done by creating a C# class, that inherits from Grain, and implements the defined interface.

Because Orleans is a little bit magic – more on that later on – we only need to implement our own custom parts. So, implementing the IHelloGrain could look something like this:

public class HelloGrain : Grain, IHelloGrain
{
    public Task<string> SayHello(User user)
    {
        return Task.FromResult($"Hello {user.FirstName} {user.LastName}!");
    }
}

It is a good idea to put the grains in a separate class library if you are going to have a separate client, as both the server and client part of the system need to be able to access them. However, if you are only using it behind something else, for example a web API, and there is no external client talking to the Orleans cluster, it isn’t strictly necessary.

A thing to note here is that you should not expose your cluster to the rest of the world. There is no security built into the cluster communication, so the recommended approach is to keep the Orleans cluster “hidden” behind something like a web API.

Once the grains are defined and implemented, it is time to create the server part of the solution, the Silos.

Luckily, we don’t have to do very much at all to set these up. They are built on top of the IHost interface that has been introduced in .NET. And because of that, we just need to call a simple extension method to get our silo registered. That will also take care of registering all the grain types by using reflection. In its simplest form, it ends up looking like this:

var host = Host.CreateDefaultBuilder()
    .UseOrleans((ctx, silo) => {
        silo.UseLocalhostClustering();
    })
    .Build();

This call will also register a service called IGrainFactory that allows us to access the grains inside the cluster. So, when we want to talk to a grain, we just write something like this:

var grain = grainFactory.GetGrain<IHelloGrain>(id);
var response = await grain.SayHello(myUser);

And the really cool thing is that we don't need to manually create the grain. If a grain of the requested type with the requested ID doesn't exist, it will automatically be created for us. And if it isn't used for a while, Orleans will deactivate it to free up memory. However, if you request the same grain again after it has been deactivated, or potentially because a silo has been killed, a new instance is created and returned to us automatically. And if we have enabled persistence, it will also have its state restored by the time it is returned.
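
If you do enable persistence, it is mostly a matter of injecting a persistent state object into the grain. A minimal sketch could look like the following; the IPersistentHelloGrain name, the "greetings" state name, and the "greetingStore" storage name are all made up for illustration, and a matching storage provider (for example silo.AddMemoryGrainStorage("greetingStore")) has to be registered on the silo:

using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

[GenerateSerializer]
public class GreetingState
{
    [Id(0)] public int GreetingCount { get; set; }
}

public interface IPersistentHelloGrain : IGrainWithStringKey
{
    Task<string> SayHello(User user);
}

public class PersistentHelloGrain : Grain, IPersistentHelloGrain
{
    private readonly IPersistentState<GreetingState> _state;

    public PersistentHelloGrain(
        [PersistentState("greetings", "greetingStore")] IPersistentState<GreetingState> state)
    {
        _state = state;
    }

    public async Task<string> SayHello(User user)
    {
        // The state was loaded from storage before this activation was handed to the
        // caller, so the counter survives deactivation and silo restarts.
        _state.State.GreetingCount++;
        await _state.WriteStateAsync();
        return $"Hello {user.FirstName} {user.LastName}! (greeting #{_state.State.GreetingCount})";
    }
}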

How Project Orleans makes it easier for us to use the actor pattern

Project Orleans removes a lot of the ceremony when it comes to actor-based development. For example, setting up the cluster has been made extremely easy by using something called a clustering provider. On top of that, it uses code generation, and other .NET features, to make the network aspect of the whole thing a lot simpler. It also hides the messaging part that is normally a part of doing actor development, and simply provides us with asynchronous interfaces instead. That way, we don’t have to create and use messages to communicate with the actors.

For example, setting up the server part, the silo, is actually as simple as running something like this:

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(builder =>
    {
        builder.UseAzureStorageClustering(options => options.ConfigureTableServiceClient(connectionString));
    })
    .Build();

As you can see, there is not a lot that we need to configure. It is all handled by conventions and smart design. This is something that can be seen with the code-generation as well.

When you want to interact with a grain, you just ask for an instance of the interface that defines the grain and supply the ID of the grain you want to work with. Orleans then returns a proxy class that allows you to talk to the grain without having to manage any of the networking yourself, like this:

var grain = grainFactory.GetGrain<IHelloGrain>(id);
var response = await grain.SayHello(myUser);

A lot of these simplifications are made possible using some really nice code generation that kicks into action as soon as you reference the Orleans NuGet packages.

Where can readers go when they want to learn more about Project Orleans and the actor model?

The easiest way to get more information about building solutions with Project Orleans is to simply go to the official Project Orleans docs and have a look. Just remember that when you are looking for information about Orleans, you need to make sure the documentation is for version 7+. The older versions looked a bit different, so documentation for those unfortunately won't be of much use.

Where to go from here?

With Project Orleans being as easy to get started with as it is, it makes for a good candidate to play around with if you have some time left over and want to try something new, or if you think it might fit your problem. There are also a lot of samples on GitHub from the people behind the project if you feel like you need some inspiration. Sometimes it can be a bit hard to figure out what you can do with a new technology, and how to do it, and looking through some of the samples gives you a nice view into what the authors of the project think it should be used for. I must admit, some of the samples are a bit, well, let's call it contrived, and made up mostly to show off some parts of the functionality. But they might still provide you with some inspiration for how you can use it to solve your problem.

For me, I ended up rebuilding an auction system in a few hours just to prove to my client how much easier their system would be to manage using an actor-based model. They have yet to implement it in production, but due to the simplicity of working with Project Orleans, it was easy to create a proof of concept in just a few hours. I really recommend doing that if you have a scenario where you think it might work. Just remember to set a timer, because it is very easy to get carried away and just add one more feature.

In tech, it is rare to find something as complicated as clustering packaged into something as simple to work with as Project Orleans. Often the goal is to make things simple and easy to use, but as developers we tend to expose every single config knob we can find. Project Orleans has stayed away from this and provides a nice experience that is actually fun to work with.

A Case for Event-Driven Architecture With Mediator Topology

Key Takeaways

  • Event-Driven Architecture is powerful and can be very simple to implement and support if a suitable topology is selected.
  • Open-source frameworks for durable workflows and orchestration can help build reliable systems and save many person-months of custom development.
  • KEDA supports many different metrics and can help configure advanced autoscaling rules where classical CPU-based scaling would not be efficient.
  • Even seemingly trivial business cases might require a sophisticated architecture design to satisfy the requirements.
  • Event-Driven Architecture enables elastic scalability even with an orchestration approach.
     

Today, I want to share a story about a business case where we used Event-Driven Architecture with Mediator topology and some interesting implementation details, such as elastic scalability, reliability, and durable workflows. All were built using Kubernetes, KEDA, AWS, and .NET technologies.

The Business Problem

Let's start with the business case. In our product, users upload files to later share and review online as part of due diligence processes. But behind the scenes, everything is much more complicated. Each file must be processed: we convert it to a basic format and optimize it for viewing in browsers, generate previews, determine the language and recognize text in images, collect metadata, and perform other operations. The files include documents, pictures, technical drawings, archives (.zip), and even videos.

Sometimes we can get hundreds of thousands of files uploaded in a day, and sometimes there are days without activity. Still, users generally want to start collaborating on a file as soon as possible after uploading it. So we need an architecture that will scale elastically and be cost-effective.

Also, each file carries essential and sensitive business information for the customer. We cannot afford to lose a file somewhere in the middle of the process.

It is clear that when we are talking about hundreds of thousands or even millions of files overall, it is crucial to have good system observability so that problems can be identified and solved quickly when they arise.

Another important detail that can affect the architecture design is that processing one file can involve a dozen steps. Each step can take anywhere from a few seconds to an hour and can consume a lot of CPU, RAM, and IO. We also want to be able to modify the file-processing workflow easily and quickly.

We use third-party SDKs to process files; these SDKs are not always reliable and can sometimes corrupt memory and crash with a memory access violation, a stack overflow, and so on.

The Implementation

Let’s now see how we implemented it.

Scalability requirements pushed us to the idea of building a solution based on events. But at the same time, we could not compromise on reliability, observability, and ease of system support.

We chose an Event-Driven Architecture with the Mediator topology pattern. There is a special service called Event Mediator (we call it internally Orchestrator). It receives the initial message to process the file and executes a file-processing script, which we call a workflow. The workflow is a declarative description of what must be done with a particular file as a set of discrete steps. Each step type is implemented as a separate stateless service. In pattern terms, they are called Event Processors, but we call them Converters.

The first diagram shows how it works in general. When a user has uploaded a file, we send a command to the Mediator to process it. Based on the file type, the Mediator selects the required workflow and starts it. Notably, the Mediator itself does not touch files. Instead, it sends a command to a queue corresponding to a specific operation type and waits for a response. The service that implements this type of operation (a Converter) receives the command from the queue, processes the corresponding file, and sends a response back to the Mediator indicating that the operation is complete and where the result is stored. After receiving the answer, the Mediator starts the next step in the same way until the entire workflow is finished. The output of one step can be the input for the next step. At the end, the Mediator notifies the rest of the system that processing is complete.


Now that we understand how the solution works, let’s look at how the required architectural characteristics are achieved.

Scalability

Let’s start with scalability.

All services are containerized and run in a Kubernetes cluster on AWS. The Kubernetes Event-driven Autoscaler (KEDA) component is also installed in the cluster. Converters implement the Competing Consumers pattern.

On top of that, scaling rules are configured for each Converter type depending on the queue length. KEDA automatically monitors queues, and for example, if there are 100 text recognition commands in the queue, it will automatically instantiate 100 new pods in the cluster and later automatically remove pods when there are fewer commands in the queue. We specifically chose to scale based on queue length because it works more reliably and transparently than classic CPU scaling. We have many different file processing operations running simultaneously, and the load does not always correlate linearly with the CPU load.

Of course, running new pods requires more nodes in the cluster. The Cluster Autoscaler helps us with this. It monitors the load on the cluster and adds or removes nodes as KEDA scales pods.

One interesting nuance here is that during scale-in, you do not want to stop a container in the middle of processing a file and start over. Luckily, Kubernetes allows you to address this. When it decides to terminate a pod, Kubernetes sends a SIGTERM signal to announce its intent. The container can delay its shutdown until the message it is currently processing is complete, and Kubernetes will wait up to terminationGracePeriodSeconds after the SIGTERM before forcibly killing the replica.
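
For a .NET worker, this graceful-shutdown behaviour can be sketched roughly as follows; the queue and processing helpers here are hypothetical stand-ins rather than our real code:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

public class ConverterWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // stoppingToken is signalled when Kubernetes sends SIGTERM to the pod.
        while (!stoppingToken.IsCancellationRequested)
        {
            var message = await ReceiveNextMessageAsync(stoppingToken);
            if (message is null)
                continue;

            // Deliberately not passing stoppingToken: the current file is
            // processed to completion even while shutdown is in progress.
            await ProcessFileAsync(message, CancellationToken.None);
            await AcknowledgeAsync(message);
        }
        // The loop only exits between messages, so the pod terminates cleanly
        // as long as processing fits inside terminationGracePeriodSeconds.
    }

    private record QueueMessage(string FileId);

    // Hypothetical stand-ins for the real queue client and converter logic.
    private Task<QueueMessage?> ReceiveNextMessageAsync(CancellationToken ct) =>
        Task.FromResult<QueueMessage?>(null);
    private Task ProcessFileAsync(QueueMessage message, CancellationToken ct) =>
        Task.CompletedTask;
    private Task AcknowledgeAsync(QueueMessage message) =>
        Task.CompletedTask;
}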

OK, converters scale elastically, but what about the Mediator? Could it be the bottleneck in the system?

Yes, the Mediator and the workflow engine can scale horizontally. KEDA scales it too, but this time depending on the number of active workflows. We monitor the size of some lists in Redis, which are used internally by the workflow engine and correlate with the active workflows count and current load.

As I mentioned, the Mediator does not perform any operations other than the orchestration of processes, so its resource consumption is minimal. For example, when we have thousands of active workflows and scale out to 200 converters, only about five instances of Mediator are needed.


The cluster is homogeneous – we do not have separate node types to run converters and orchestrator instances.

Maintainability and Extensibility

Let’s talk about how easy it is to implement and maintain this system.

The converters are stateless services with fairly simple logic – take a command from the queue, run the processing of the specified file (invoke methods from 3rd-party libraries), save the result, and send a response.

Implementing workflow functionality yourself is very difficult, and I don't recommend anyone do it. There are quite mature solutions on the market in which many years and millions of dollars have been invested: Temporal.io, Camunda, Azure Durable Functions, and AWS Step Functions, to name a few.

Because our stack is .NET and we are hosted in AWS, and for several other historical reasons, we chose Daniel Gerlag’s Workflow Core library.

It is lightweight, easy to use, and covers our use cases completely. However, Workflow Core is not under active development. As an alternative, you might look at MassTransit’s State Machine by Chris Patterson, which is actively maintained and has some additional features.

Implementing the Mediator is also simple – the source code is primarily a set of declaratively described workflows in the form of a sequence of steps for each type of file.

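To give a feel for what such a declaratively described workflow looks like, here is a minimal sketch using the Workflow Core API; the data class and the step classes are simplified, hypothetical examples rather than our production workflows:

using WorkflowCore.Interface;
using WorkflowCore.Models;

public class FileProcessingData
{
    public string UploadedPath { get; set; } = "";
    public string ConvertedPath { get; set; } = "";
}

// One example step; in the real system a step publishes a command to the
// converter's queue and waits for the response instead of doing the work inline.
public class ConvertToPdfStep : StepBody
{
    public string SourcePath { get; set; } = "";
    public string ResultPath { get; set; } = "";

    public override ExecutionResult Run(IStepExecutionContext context)
    {
        ResultPath = SourcePath + ".pdf";
        return ExecutionResult.Next();
    }
}

public class GeneratePreviewStep : StepBody
{
    public string SourcePath { get; set; } = "";

    public override ExecutionResult Run(IStepExecutionContext context) => ExecutionResult.Next();
}

public class DocumentProcessingWorkflow : IWorkflow<FileProcessingData>
{
    public string Id => "document-processing";
    public int Version => 1;

    public void Build(IWorkflowBuilder<FileProcessingData> builder)
    {
        builder
            // Each step reads its input from, and writes its output to, the shared workflow data.
            .StartWith<ConvertToPdfStep>()
                .Input(step => step.SourcePath, data => data.UploadedPath)
                .Output(data => data.ConvertedPath, step => step.ResultPath)
            .Then<GeneratePreviewStep>()
                .Input(step => step.SourcePath, data => data.ConvertedPath);
    }
}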

Testing the system is possible at many levels. Covering the workflows with unit tests is easy, as it does not require running the converters or instantiating the orchestrator service. It is also helpful to check that all steps are invoked as expected, that retry and timeout policies and error handling work correctly, that steps update the workflow state, and so on. The Workflow Core library has built-in support for that. Finally, we can run end-to-end integration tests where we start all converters, the orchestrator, the database, Redis, and the queues. Docker Compose makes this an easy one-click or command-line option for local development.

So when we need to make changes, it’s just a matter of changing the workflow description or sometimes adding a new converter service to the system to support new operations or trying an alternative solution.

Reliability

Finally, we come to perhaps the most critical aspect of the system – reliability.

Let's start by identifying what can go wrong: any service can go down at any time, the system load can grow faster than the system can scale, some services or infrastructure can be temporarily unavailable, and there can be code defects that lead to files being processed incorrectly and needing to be re-processed.

The most straightforward cases for reliability involve the converter services. The service locks a message in the queue when it starts processing and deletes it when it has finished its work and sent the result. If the service crashes, the message will become available again in the queue after a short timeout and can be processed by another instance of the converter. If the load grows faster than new instances are added or there are problems with the infrastructure, messages accumulate in the queue. They will be processed right after the system stabilizes.

In the case of the Mediator, all the heavy lifting is again done by the Workflow Core library. Because all running workflows and their state are stored in the database, if an abnormal termination of the service occurs, the workflows will continue execution from the last recorded state.

Also, we have configurations to retry failed steps, timeouts, alternative scenarios, and limits on the maximum number of parallel workflows.

What’s more, the entire system is idempotent, allowing every operation to be retried safely without side effects and mitigating the concern of duplicate messages being received. AWS S3 policies allow us to remove any temporary files automatically and avoid garbage accumulation from failed operations.

Another benefit of Kubernetes is the ability to set resource limits and minimum resource requirements for each service, e.g., the maximum CPU and RAM it can use and the minimum it needs to start. You do not have to worry about the noisy-neighbor problem when several pods run on the same cluster node and one of the instances develops a memory leak or an infinite loop.


Thanks to the Mediator approach and durable workflows, we have very good system observability. At any moment, we completely understand the stage at which each file is, how many files there are, and other important metrics. In case of defects, we can review the historical data, restart the file processing for the affected files, or take other actions as necessary.

We built dashboards with all the critical metrics of AWS infrastructure and application metrics.

Overall, the system can restore itself even after a large-scale failure.

Conclusion

The end result is a highly scalable system that is easy to extend, modify, and test, with good observability and cost-effectiveness.

Finally, some interesting statistics. It took us only two months to build the walking skeleton of the system, thanks to the use of off-the-shelf components. The recorded peak throughput was about 5,000 files per hour. This is not a capacity limit – we intentionally limited auto-scaling and are now lifting the limits gradually to avoid unexpected infrastructure bills, such as those caused by a defect leading to a runaway process. The largest batch uploaded by users in a day was 250,000 files. In total, we have already processed millions of files since we switched to the new solution.

We did have some failures and incidents. Most were related to defects in third-party libraries appearing only in edge cases under heavy load. As responsible engineers, we tried our best to contribute diagnostic information and fixes, and we are very grateful to the OSS maintainers for their support and responsiveness. One lesson we learned here is to have easily configurable limits so that when something goes wrong, you can reduce the load, let the system stabilize, recover, and continue operations under degraded throughput while working on the fix.

A Guide to the Quarkus 3 Azure Functions Extension: Bootstrap Java Microservices with Ease

Key Takeaways

  • Serverless architecture automatically scales functions up and down in relation to incoming network traffic to the application. It enables enterprise companies to reduce the cost of infrastructure maintenance.
  • Azure Functions provides a serverless platform on the Azure cloud. Developers can build, deploy, and manage event-driven serverless functions using multiple languages and Azure service integrations. The platform aims to provide a good developer experience.
  • Quarkus enables integrating Azure Functions with various endpoints (e.g., Funqy, RESTEasy Reactive) with the goal of accelerating the inner loop development.
  • Quarkus 3 introduces a new Azure Functions integration that allows developers to bootstrap Quarkus microservices automatically using CDI beans.

A glance at serverless

“Why does serverless matter?” This is a common question from IT professionals at many industry events when I present a serverless topic, regardless of the roles they work in. To answer it, I'll give you a quick overview of serverless in terms of background, benefits, and technology.

For decades, enterprise companies have spent tons of time and effort trying to reduce infrastructure maintenance costs and maximize infrastructure resource utilization. They have accomplished much through application modernization, virtualization, and containerization.

But this journey can't end as long as they continue to adopt new technologies and develop new business models on top of them. In the meantime, enterprises realized that many applications didn't need to run all the time (i.e., 24 x 7 x 365); after tracing the usage metrics, they found that some applications were only used for particular business use cases a few times per week or even once a month.

The serverless architecture was designed to solve this challenge by hibernating applications when there is no incoming network traffic to the application workloads, and waking them up to respond quickly when new traffic arrives.

Hyperscalers have been providing serverless technologies for years through cloud services such as AWS Lambda, Azure Functions, and Google Cloud Functions to meet this market demand. These serverless services allow developers to choose from multiple programming languages, such as Java, JavaScript, Python, C#, Ruby, Go, and more, for their application runtimes.

In this article, we’ll focus on serverless Java with Azure Functions in terms of how Quarkus, a new Kubernetes native Java framework, integrates Java microservices into Azure Functions with an improved developer experience.

Key benefits of using Azure Functions

Let's take a step back to understand why developers would choose Azure Functions among the many serverless platforms for deploying their serverless applications. Azure Functions is a serverless computing service that enables developers to build event-driven, scalable applications and services in the Azure cloud without managing the underlying infrastructure. Here are the key benefits to know about Azure Functions:

  • Support for multiple languages: Azure Functions allows developers to choose the programming language they are most comfortable with, including Java, JavaScript, Python, C#, and PowerShell, to build their serverless applications.
  • Event driven: Azure Functions are triggered by events such as data changes or queue messages. These triggers can be configured using bindings to connect to various services, such as Azure Event Grid, Azure Blob Storage, or Azure Cosmos DB.
  • Cost effective: Azure Functions charges only for the resources you actually use, based on the number of executions, execution time, and memory consumption of the functions. A free tier is also available for developers to experiment with function deployment.
  • Integration with other Azure services: Azure Functions can be integrated with other Azure services such as Azure Logic Apps, Azure Stream Analytics, and Azure API Management. This makes building complex workflows and integrating them with other systems easy.
  • Accelerating the outer loop: Azure Functions enables developers to deploy and manage functions easily and quickly using the Azure portal, Visual Studio, or command-line tools. This accelerates the outer loop, making it easy to test and deploy changes to serverless functions quickly.
  • Scalability: Azure Functions automatically scale up or down based on the incoming network traffic. This means developers don’t have to worry about managing resources during peak usage periods.
  • Monitoring and logging: Azure Functions provides built-in monitoring and logging capabilities through Azure Monitor. Developers can monitor function execution, track function dependencies, and troubleshoot issues using logs.

Overall, Azure Functions provides a powerful and flexible platform for building serverless applications in the cloud. It offers a wide range of features and integration options that make it easy for developers to build, deploy, and manage their applications.

How Quarkus makes Azure Functions even better

Quarkus is a Kubernetes-native Java framework that not only optimizes Java applications for Kubernetes with extremely fast startup and response times, based on fast-jar and native executables, but also provides a great developer experience through Live Coding, Dev Services, Dev UI, Continuous Testing, and Remote Development. If you have had a chance to attend developer conferences, you could easily see how keen many Java developers are to move from traditional Java frameworks to Quarkus because of these out-of-the-box features and benefits.

Since the very early days of the Quarkus community, Quarkus has offered Azure Functions extensions that let developers write deployable cloud-native microservices with various endpoints, such as RESTEasy Reactive, Undertow, Reactive Routes, and Funqy HTTP, on the Azure Functions runtime.

This journey didn't end there and has arrived at a new era with Quarkus 3. Azure Function classes you write with Contexts and Dependency Injection (CDI) and ArC beans are now automatically integrated with Quarkus. This means that when developers implement Azure Functions using the @Inject annotation, Quarkus is bootstrapped automatically. Developers can also select the function's lifecycle scope, such as application or request scope, for their Java beans as usual.

You might ask, “What is the difference between Quarkus Funqy and the new Azure Functions integration? Why would I need to use the new Azure Functions integration with Quarkus rather than Quarkus Funqy?” Quarkus Funqy can only be invoked by HttpTriggers at the moment. In contrast, Azure Functions can be invoked by different event types, such as HttpTrigger, BlobTrigger, and CosmosDBTrigger, along with additional supported bindings.

Quarkus Funqy was designed for the lowest common denominator to be a portable, simple, and functional layer across multiple serverless platforms such as Azure Functions, AWS Lambda, Google Cloud Functions, and Kubernetes Knative.

Getting started developing Azure Functions with Quarkus

Let's try creating a new Quarkus project to implement Azure Functions from scratch. The following step-by-step tutorial helps you easily understand how the Azure Functions and Quarkus integration works and what the Java function code looks like.

1. Create a new function project

Use the Quarkus command (CLI) to generate a new project.

You can also use Maven or Gradle, but the Quarkus CLI provides a better developer experience for creating projects, managing extensions, and building and deploying applications, with auto-completion and shorter commands on top of the underlying build tool.

Run the following command in your local terminal to create a new Quarkus project with the quarkus-azure-functions extension:

quarkus create app quarkus-azure-function 
--extension=quarkus-azure-functions

The output should look like this:

-----------
selected extensions: 
- io.quarkus:quarkus-azure-functions

applying codestarts...
  java
  maven
  quarkus
  config-properties
  dockerfiles
  maven-wrapper
  azure-functions-example

-----------
[SUCCESS] ✅  quarkus project has been successfully generated in:
--> /YOUR_WORKING_DIR/quarkus-azure-function
-----------
Navigate into this directory and get started: quarkus dev

Let's verify that the Azure Functions and CDI injection libraries were added. Open the pom.xml file and check whether the following dependencies were appended automatically.

    
    <dependency>
      <groupId>io.quarkus</groupId>
      <artifactId>quarkus-azure-functions</artifactId>
    </dependency>
    <dependency>
      <groupId>io.quarkus</groupId>
      <artifactId>quarkus-arc</artifactId>
    </dependency>
    <dependency>
      <groupId>com.microsoft.azure.functions</groupId>
      <artifactId>azure-functions-java-library</artifactId>
    </dependency>

You can also find the azure-functions-maven-plugin in the pom.xml file, which lets you package the functions and deploy them to an Azure Function App.


<plugin>
  <groupId>com.microsoft.azure</groupId>
  <artifactId>azure-functions-maven-plugin</artifactId>
  <version>${azure.functions.maven.plugin.version}</version>
  <executions>
    <execution>
      <id>package-functions</id>
      <goals>
        <goal>package</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <appName>${functionAppName}</appName>
    <resourceGroup>${functionResourceGroup}</resourceGroup>
    <appServicePlanName>java-functions-app-service-plan</appServicePlanName>
    <region>${functionAppRegion}</region>
    <runtime>
      <os>linux</os>
      <javaVersion>11</javaVersion>
    </runtime>
    <appSettings>
      <property>
        <name>FUNCTIONS_EXTENSION_VERSION</name>
        <value>~4</value>
      </property>
    </appSettings>
  </configuration>
</plugin>

2. Explore the sample function code

Open the Function.java file in the src/main/java/org/acme directory. Note how a GreetingService CDI bean is injected into the Azure Function and used when the response body content is built.

    @Inject
    GreetingService service;

As in normal Azure Functions development with Java, you can still use the @FunctionName annotation to turn a method into an Azure Function that an HTTP request will trigger.

    @FunctionName("HttpExample")
    public HttpResponseMessage run(
            @HttpTrigger(
                name = "req",
                methods = {HttpMethod.GET, HttpMethod.POST},
                authLevel = AuthorizationLevel.ANONYMOUS)
                HttpRequestMessage<Optional<String>> request,
            final ExecutionContext context) {
        context.getLogger().info("Java HTTP trigger processed a request.");

        // Parse query parameter
        final String query = request.getQueryParameters().get("name");
        final String name = request.getBody().orElse(query);

        if (name == null) {
            return request.createResponseBuilder(HttpStatus.BAD_REQUEST).body("Please pass a name on the query string or in the request body").build();
        } else {
            return request.createResponseBuilder(HttpStatus.OK).body(service.greeting(name)).build();
        }
    }

3. Test Azure Functions locally

Let's tweak the function a little bit before you verify it locally. Go back to the Function.java file and change the function name to “greeting”.

@FunctionName("greeting")

Open the GreetingService.java file and replace the return code with the following line:

return "Welcome to Azure Functions with Quarkus, " + name;

Let's run your function in a simulated local Azure Functions environment. Note that you need to use Java 11 (with JAVA_HOME set properly) and have Azure Functions Core Tools version 4 installed in your local environment. Then run the following Maven command:

./mvnw clean package azure-functions:run

The output should end up with the following logs:

Functions:

        greeting: [GET,POST] http://localhost:7071/api/greeting

For detailed output, run func with --verbose flag.
Worker process started and initialized.
...
INFO: quarkus-azure-function 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.0.4.) started in 0.353s.

You might have a question. “Can I use the Quarkus dev mode to test the functions while I keep changing the code (aka Live Coding)?” Unfortunately, Quarkus dev mode doesn’t support Azure Functions integration at this time (Quarkus 3.0).

Invoke the REST API to access the function locally. Run the following curl command:

curl -d "Daniel" http://localhost:7071/api/greeting ; echo

The output should look like this:

Welcome to Azure Functions with Quarkus, Daniel

Stop the simulated environment by pressing CTRL-C.

4. Deploy to Azure Cloud

First of all, make sure to log in to Azure from your local environment using the following az command. If you haven't installed the az command, see the instructions on How to install the Azure CLI.

az login

You will use the azure-functions-maven-plugin again to deploy the greeting function to an Azure Function App. Run the following Maven command:

./mvnw clean package azure-functions:deploy

The output should end with a BUILD SUCCESS message, including an HTTP Trigger URL:

[INFO] HTTP Trigger Urls:
[INFO]   greeting : https://quarkus-azure-function-1681980284966.azurewebsites.net/api/greeting

Let's go to the Azure Portal and navigate to the Function App menu, as shown in Figure 1. There, you will see the deployed function (e.g., quarkus-azure-function-1681980284966).


Figure 1. Azure Function App

Select the function name to find more information about the function in terms of the resource group, trigger URL, and metrics, as shown in Figure 2.


Figure 2. Azure Function App Detail

Invoke the REST API to access the function on Azure Function App. Run the following curl command:

curl -d "Daniel on Azure" https://YOUR_TRIGGER_URL/api/greeting ; echo

The output should look like this:

Welcome to Azure Functions with Quarkus, Daniel on Azure

Due to the cold start, the first invocation of the function will take a few seconds.

Summary

This article showed how Quarkus 3 integrates Azure Functions programmatically using various triggers and bindings that Azure Functions provides. Developers can also understand the differences between Quarkus Funqy and Azure Functions integration for implementing serverless applications.

From a best-practice perspective, you might wonder whether you should implement multiple functions (e.g., endpoints) in a single application (e.g., a JAR file) or only one function per application. There are always tradeoffs in terms of fault tolerance, maintenance, resource optimization, and startup time. For example, if you only care about fast startup and response time, one function per application is better, due to the smaller size of the packaged application. But it won't be easy to maintain many separate applications as you add more business services with this development practice.


Tales of Kafka at Cloudflare: Lessons Learnt on the Way to 1 Trillion Messages

Key Takeaways

  • Kafka clusters are used at Cloudflare to process large amounts of data, with a general-purpose message bus cluster developed to decouple teams, scale effectively, and process trillions of messages.
  • To address the issue of unstructured communication for event-driven systems, a strong contract should be in place: cross-platform data format Protobuf helped Cloudflare achieve that.
  • Investing in metrics on development tooling is critical to allow problems to be easily surfaced: Cloudflare enriched the SDK with OpenTracing and Prometheus metrics to understand how the system behaves and make better decisions, especially during incidents.
  • To enable consistency in the adoption and use of SDKs and promote best practices, it is important to prioritize clear documentation on patterns.
  • Cloudflare aims to achieve a balance between flexibility and simplicity: while a configurable setup may offer more flexibility, a simpler one allows standardization across different pipelines.

 

Cloudflare has produced over 1 trillion messages to Kafka in less than six years, just for inter-service communication. As the company and the application services team grew, they had to adapt their tooling to continue delivering fast.

We will discuss the early days of working in distributed domain-based teams and how abstractions were built on top of Kafka to reach the 1 trillion message mark.

We will also cover real incidents faced in recent years due to scalability limitations and the steps and patterns applied to deal with increasing demand.

What Is Cloudflare?

Cloudflare provides a global network to its customers and allows them to secure their websites, APIs, and internet traffic.

This network also protects corporate networks and enables customers to run and deploy entire applications on the edge.

Cloudflare offers a range of products, including CDN, Zero Trust, and Cloudflare Workers, to achieve these goals, identifying and blocking malicious activity and allowing customers to focus on their work.


Figure 1: Cloudflare’s Global Network

Looking at the Cloudflare network from an engineering perspective, there are two primary components: the Global Edge network and the Cloudflare control plane.

A significant portion of the network is built using Cloudflare’s products, with Workers deployed and used on the edge network. The control plane, on the other hand, is a collection of data centers where the company runs Kubernetes, Kafka, and databases on bare metal. All Kafka producers and consumers are usually deployed into Kubernetes, but the specific deployment location depends on the workload and desired outcomes.

In this article, we will focus on the Cloudflare control plane and explore how inter-service communication and enablement tools are scaled to support operations.

Kafka

Apache Kafka is built around the concept of clusters, which consist of multiple brokers, with each cluster having a designated leader broker responsible for coordination. In the diagram below, broker 2 serves as the leader.

Figure 2: Kafka Cluster

Messages are organized into topics, such as user events (for example, user creation or user information updates). Topics are then divided into partitions, an approach that allows Kafka to scale horizontally. In the diagram, there are partitions for topic A on both brokers, with each partition having a designated leader that acts as its “source of truth”. To ensure resilience, partitions are replicated according to a predetermined replication factor, with three being the usual minimum. The services that send messages to Kafka are called producers, while those that read messages are called consumers.

Cloudflare Engineering Culture

In the past, Cloudflare operated as a monolithic PHP application, but as the company grew and diversified, this approach proved to be limiting and risky. Rather than mandating specific tools or programming languages, teams are now empowered to build and maintain their own services, and the company encourages experimentation and advocates for effective tools and practices. The Application Services team is a relatively new addition to the engineering organization; its goal is to make it easier for other teams to succeed by providing pre-packaged tooling that incorporates best practices, allowing development teams to focus on delivering value.

Tight Coupling

With the product offerings growing, there was a need to find better ways of enabling teams to work at their own pace and decouple from their peers. The engineering team also needed more control over backing off requests and guaranteeing work completion.

As we were already running Kafka clusters to process large amounts of data, we decided to invest time in creating a general-purpose message bus cluster: onboarding is straightforward, requiring a pull request into a repository, which sets up everything needed for a new topic, including the replication strategy, retention period, and ACLs. The diagram illustrates how the Messagebus cluster can help decouple different teams.

Figure 3: The general-purpose message bus cluster

For example, three teams can emit messages that the audit log system is interested in, without the need for any awareness of the specific services. With less coupling, the engineering team can work more efficiently and scale effectively.

Unstructured Communication

With an event-driven system, to avoid coupling, systems shouldn’t be aware of each other. Initially, we had no enforced message format and producer teams were left to decide how to structure their messages. This can lead to unstructured communication and pose a challenge if the teams don’t have a strong contract in place, with an increased number of unprocessable messages.

To avoid unstructured communication, the team searched for solutions within the Kafka ecosystem and found two viable options, Apache Avro and protobuf, with the latter being the final choice. We had previously been using JSON, but found it difficult to enforce compatibility and the JSON messages were larger compared to protobuf.


Figure 4: A protobuf message

Protobuf provides strict message types and inherent forward and backward compatibility, with the ability to generate code in multiple languages also a major advantage. The team encourages detailed comments on their protobuf messages and uses Prototool, an open-source tool by Uber, for breaking change detection and enforcing stylistic rules.

Figure 5: Switching to Protobuf
 
Protobuf alone was not enough: different teams could still emit messages to the same topic, and the consumer might not be able to process them because the format was not what it expected. Additionally, configuring Kafka consumers and producers was not an easy task, requiring intricate knowledge of the workload. As most teams were using Go, we decided to build a “message bus client library” in Go, incorporating best practices and allowing teams to move faster.

To avoid teams emitting different messages to the same topic, we made the controversial decision to enforce (on the client side) one protobuf message type per topic. While this decision enabled easy adoption, it resulted in numerous topics being created, each with multiple partitions replicated at a factor of at least three.

Connectors

The team had made significant progress in simplifying the Kafka infrastructure by introducing tooling and abstractions, but we realized that there were further use cases and patterns that needed to be addressed to ensure best practices were followed, so the team developed the connector framework.


Figure 6: The connector framework

Based on Kafka connectors, the framework enables engineers to create services that read data from one system and push it to another, such as Kafka or Quicksilver, Cloudflare's edge database. To simplify the process, we use Cookiecutter to template the service creation, and engineers only need to enter a few parameters into the CLI.

The configuration process for the connector is simple and can be done through environment variables without any code changes.

In the example below, the reader is Kafka and the writer is Quicksilver. The connector is set to read from topic 1 and topic 2 and apply the function pf_edge. This is the complete configuration needed, which also includes metrics, alerts, and everything else required to move into production, allowing teams to easily follow best practices. Teams have the option to register custom transformations, which would be the only pieces of code they would need to write.


Figure 7: A simple connector

For example, we utilize connectors in the communication preferences service: if a user wants to opt out of marketing information in the Cloudflare dashboard, they interact with this service to do so. The communication preference update is stored in its database, and a message is emitted to Kafka. To ensure that the change is reflected in three different systems, we use separate connectors that sync the change to a transactional email service, a customer management system, and a marketing email system. This approach makes the system eventually consistent, and we leverage the guarantees provided by Kafka to ensure that the process happens smoothly.

Figure 8: Connector and communication preferences

Visibility

As our customer base grew rapidly during the pandemic, so did the throughput, highlighting scalability issues in some of the abstractions that we had created.

One example is the audit logs, which we handle for our Kafka customers: we built a system to manage these logs, allowing producer teams to produce the events, while we listen for them, recording the data in our database.

Figure 9: Adding the log push for Audit logs

We expose this information through an API and an integration called log push that enables us to push the audit log data into various data buckets, such as Cloudflare R2 or Amazon S3.

During the pandemic, many more audit logs were being registered, and customers started using our APIs to get the latest data. As this approach was not scalable, we decided to develop a pipeline to address the issue, creating a small service that listens for audit log events and transforms them into the appropriate format for direct storage in a bucket, without overloading the APIs.

We encountered further issues as we accumulated logs and were unable to clear them out quickly enough, resulting in lags and breaches of our SLAs. We were uncertain about the cause of the lag as we lacked the tools and instrumentation in our SDK to diagnose the problem: was the bottleneck reading from Kafka, transformation, or saving data to the database?


Figure 10: Where is the bottleneck?

We decided to address it by enhancing our SDK with Prometheus metrics, with histograms measuring the time each step takes in processing a message. This helped us identify slower steps, but we couldn’t tell which specific component was taking longer for a specific message. To solve this, we explored OpenTelemetry, focusing on its tracing integrations: there were not many good integrations for OpenTracing on Kafka, and it was challenging to propagate traces across different services during a production incident.

With the team enriching the SDK with OpenTracing, we were able to identify that pushing data to the bucket and reading from Kafka were both bottlenecks, prioritizing the fixes for those issues.


Figure 11: Identifying the bottlenecks

Adding metrics into the SDK, we were able to get a better overview of the health of the cluster and the services.

Noisy On-call

We encountered a challenge with the large number of metrics we had collected, leading to a noisy on-call experience, with many alerts related to unhealthy applications and lag issues.

Figure 12: Alerting pipeline

The basic alerting pipeline consists of Prometheus and AlertManager, which pages us through PagerDuty. As manually restarting or scaling services up and down was not ideal, we decided to explore how to leverage Kubernetes and implement health checks.

In Kubernetes, there are three types of health checks: liveness, readiness, and startup probes. For Kafka applications, implementing a readiness probe is not useful because they usually don't expose an HTTP server. To address this, an alternative approach was implemented.

Figure 13: Health checks and Kafka

When a request for the liveness check is received, we attempt a basic operation with a broker, such as listing the topics, and if the response is successful, the check passes. However, there are cases where the application is still healthy but unable to produce or consume messages, which led the team to implement smarter health checks for the consumers.

Figure 14: Health checks implementation

The current offset for Kafka is the last available offset on the partition, while the committed offset is the last offset that the consumer successfully consumed.
By retrieving these offsets during a health check, we can determine whether the consumer is operating correctly: if we can't retrieve the offsets, there are likely underlying issues, and the consumer is reported as unhealthy. If the offsets are retrievable, we compare the last committed offset to the current one. If they are the same, no new messages have been appended, and the consumer is considered healthy. If they differ, we check whether the last committed offset is the same as the one recorded during the previous check: if it is, the consumer is not making progress, so it is likely stuck and needs a restart. This process resulted in better on-call experiences and happier customers.
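
To make the logic concrete, here is an illustrative sketch of such a health check. It is not our actual SDK code (which is written in Go); it is a simplified C# example using the Confluent.Kafka client, purely to show the offset comparison:

using System;
using System.Collections.Generic;
using Confluent.Kafka;

public class ConsumerHealthCheck
{
    private readonly IConsumer<Ignore, byte[]> _consumer;
    private readonly Dictionary<TopicPartition, long> _previousCommitted = new();

    public ConsumerHealthCheck(IConsumer<Ignore, byte[]> consumer) => _consumer = consumer;

    public bool IsHealthy()
    {
        try
        {
            var committed = _consumer.Committed(_consumer.Assignment, TimeSpan.FromSeconds(5));

            foreach (var tpo in committed)
            {
                var watermarks = _consumer.QueryWatermarkOffsets(tpo.TopicPartition, TimeSpan.FromSeconds(5));
                long current = watermarks.High.Value;   // next offset to be written on the partition
                long lastCommitted = tpo.Offset.Value;

                // Nothing new appended since the last commit: the consumer is idle but healthy.
                if (lastCommitted >= current)
                    continue;

                // New messages exist but the committed offset has not moved since the
                // previous check: the consumer is likely stuck and should be restarted.
                if (_previousCommitted.TryGetValue(tpo.TopicPartition, out var previous) && previous == lastCommitted)
                    return false;

                _previousCommitted[tpo.TopicPartition] = lastCommitted;
            }

            return true;
        }
        catch (KafkaException)
        {
            // If the offsets cannot be retrieved at all, report the consumer as unhealthy.
            return false;
        }
    }
}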

Inability to Keep Up

We had a system where teams could produce events from Kafka for their email system. These events contained a template, for example, an “under attack” template, that includes information about a website under attack and the identity of the attacker, along with metadata.

We would listen for the event, retrieve the email template from their registry, enrich it, and dispatch it to the customers. However, we started to experience load issues: spikes in the production rate caused a lag in consumption, impacting important OTP messages and our SLOs.

Figure 15: The lag in consumption

Batching

We started exploring different solutions to address the problem; an initial attempt at scaling the number of partitions and consumers did not provide significant improvement.

Figure 16: The batching approach

We decided to implement a simpler but more effective approach: batch consuming. We process a certain number of messages at a time, apply the transformation, and dispatch them in batches. This proved effective and allowed the team to easily handle high production rates.
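
As an illustration of the idea (again in C# with Confluent.Kafka rather than our actual Go SDK), a batch-consuming loop can be sketched like this, with the dispatch callback standing in for the transformation and the batched send:

using System;
using System.Collections.Generic;
using Confluent.Kafka;

public static class BatchConsumer
{
    public static void Run(
        IConsumer<string, byte[]> consumer,
        Action<IReadOnlyList<ConsumeResult<string, byte[]>>> dispatchBatch,
        int batchSize = 100)
    {
        var batch = new List<ConsumeResult<string, byte[]>>(batchSize);

        while (true)
        {
            // Wait briefly for the next message; null means the timeout elapsed.
            var result = consumer.Consume(TimeSpan.FromMilliseconds(250));
            if (result is not null)
                batch.Add(result);

            // Flush either when the batch is full or when traffic pauses.
            if (batch.Count >= batchSize || (result is null && batch.Count > 0))
            {
                dispatchBatch(batch.ToArray());  // transform + send the whole batch at once
                consumer.Commit();               // commit offsets only after the batch is handled
                batch.Clear();
            }
        }
    }
}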


Figure 17: No lag in consumption with batching

Documentation

While developing our SDK, we found that many developers were encountering issues using it. Some were hitting bugs, while others were unsure how to implement certain features or interpret specific errors. To address this, we created channels on Google Chat where users could come and ask us questions. We had one person on call to respond, and we spent time documenting our findings and answers in our wiki. This helped improve the overall user experience of the SDK.

Conclusions

There are four lessons to be learned:

  • Always find the right balance between flexibility and simplicity: while a configurable setup may offer more flexibility, a simpler one allows standardization across different pipelines.
  • Visibility: adding metrics to the SDK as soon as possible can help teams understand how the system is behaving and make better decisions, especially during incidents.
  • Contracts: enforcing one strong, strict contract gives great visibility into what is happening inside a topic, allowing you to know who is writing to and reading from it.
  • Document the good work that you do so that you don’t have to spend time answering questions or helping people debug production issues. This can be achieved through channels like Google Chat and wikis.

By following these rules, we were able to improve our systems and make our customers happy, even in high-stress situations.

 

API Design Reviews Are Dead. Long Live API Design Reviews!

Key Takeaways

  • Define consistent language and terminology across your API team to ensure consistency and avoid confusion later on.
  • Apply shared style guidelines and linting rules to enable better governance and standardization.
  • Establish your API design review team early on, accounting for all stakeholders and not just the technical ones!
  • Create an API catalog that actually reflects the scope and depth of APIs in your organization to improve discoverability and visibility. 
  • Whenever possible, maximize your use of shared components and models to help your developers scale and replicate with ease. 

In the course of designing APIs at scale, it takes deliberate effort to create consistency. The primary difference between a bunch of APIs and something that feels like a true platform is consistency. In this case, consistency simply means that if you use multiple APIs, factors like naming conventions, interaction patterns like paging, and auth mechanisms are standard across the board.

Traditionally, review committees have traumatized API developers with delay-inducing discoveries when development is thought to be complete. Worse, design by committee can take over, stalling progress or encouraging developers to find ways to sidestep the process to avoid the pain altogether.

To truly unlock a modern platform, enablement through decentralized governance is a much more scalable and engaging approach. This simply means that each domain or functional area has a subject matter expert who has been educated on the standards and overall architecture to be a well-enabled guide for API developers.

More importantly, agreeing on API design before the bulk of development is complete can largely avoid last-minute discoveries that put delivery timeframes in jeopardy (often referred to as a design-first approach). Using a spec format like OpenAPI (the de facto standard for HTTP/”REST” APIs) provides the ability to define an API before any development, which enables much earlier alignment and identification of issues.

With this context in mind, let’s take a closer look at how to conduct API design reviews, and how to develop processes and prepare the organization to avoid protracted timelines and a lack of developer engagement.

Here are some key prerequisites to ensure a smooth process:

1. Define Consistent Language/Terms

API usage is a very distilled experience, and as such, the impact of language is disproportionately higher than in most other design realms. Each team member may have a slightly different way of defining and describing various terms, which manifests as confusion and decreased productivity for API teams.

While API portals/documentation are essential to a great developer experience, well-designed APIs should tell most of the story without the consumer having to think about it much. If terms are familiar to the consumer, and interaction patterns are obvious, then the experience can be quick and painless. Consistency is the primary difference in experience between a bunch of APIs and something that feels like one platform.

When establishing your API program and governance process, start with shared language. While it can seem impossible at first, defining a customer-centric shared vocabulary/grammar for your platform is essential, and an overall accelerator for an organization. Many terms can have varied meanings inside of a company, and to make things worse, these are often terms that end-consumers wouldn’t even recognize.

Doing this homework upfront avoids conflicts over naming in the midst of designing APIs. Work through each domain with relevant stakeholders to define shared terminology, and ensure wide availability and awareness to API designers. And once you’ve settled on internal standardization of terms, don’t forget to check if it fits with your external needs as well. Using customer language and having a customer-centric view of API development helps teams avoid confusing their customers with unfamiliar technical terms, so ensure there’s synchronization between your internal understanding and external understanding.

2. Define Shared Components

When API consumers encounter models or parameters that vary between APIs, it can be a confusing, frustrating, and time-consuming process. For example, if you use one API which refers to contact information, and the next API in the same platform uses a completely different model, consumers often have to resolve these differences. Worse, systemic differences in handling this data can unfold, creating functional differences.

As early as possible, identify common components (models, parameters, headers, etc.) and the systems that support them. Linking to shared components in API definitions ensures that future changes to those components are easier to roll out across the platform, and it reduces undue cognitive burden on API consumers.

The more common components you have, the greater the opportunity for consistency, reusability, further collaboration, and enhanced efficiency. All of us in the developer world love the DRY principle (Don't Repeat Yourself), and the more shared components there are, the easier it is to innovate without having to build the same thing from scratch over and over again. Shared components also allow your team to scale quickly, training up new developers or stakeholders outside of the API team with ease.

3. Apply Shared Style Guides and Linting Rules

For the vast majority of simple naming conventions, interaction patterns, and auth mechanisms, style guides backed by automation can flag inconsistencies as early as possible.

The first style guides were developed between 2013 and 2015, setting expectations for look and feel (aka DX) for API development teams. The need for design consistency was apparent from the outset of API platform development, and early efforts by PayPal (I was a part of this team back in the day, actually!) and Heroku resulted in some of the first style guides from successful programs being shared publicly.

While there are a variety of automation tools available to help with style guides, the open-source tool Spectral has emerged as a standard for defining API linting rulesets. Aligning upfront on the conventions for paths, parameters, and more, and defining automated linting rules, will avoid delays from conflicts over which conventions are ‘correct’. Have the discussion once, define the rules, and try not to talk about it again; just make the lint errors go away!
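To make this tangible, here is a minimal sketch of wiring two such rules up with Spectral’s programmatic packages. The rules themselves (operations must have descriptions, path segments should be kebab-case) are invented examples of the kinds of conventions a style guide might enforce, and the exact ruleset shape can vary between Spectral versions:

```typescript
import { Spectral } from "@stoplight/spectral-core";
import { truthy, pattern } from "@stoplight/spectral-functions";

const spectral = new Spectral();

spectral.setRuleset({
  rules: {
    "operation-description": {
      description: "Every operation must have a description.",
      message: "Operation is missing a description.",
      severity: "error",
      // Simplified JSONPath for this sketch: matches each child of every path item.
      given: "$.paths[*][*]",
      then: { field: "description", function: truthy },
    },
    "paths-kebab-case": {
      description: "Path segments should be kebab-case.",
      severity: "warn",
      given: "$.paths[*]~", // the path keys themselves
      then: {
        function: pattern,
        functionOptions: { match: "^(\\/[a-z0-9-]+|\\/\\{[a-zA-Z0-9]+\\})+$" },
      },
    },
  },
});

// An abbreviated OpenAPI document that violates both rules.
const doc = {
  openapi: "3.0.3",
  info: { title: "Orders", version: "1.0.0" },
  paths: { "/customerOrders": { get: {} } },
};

spectral.run(doc).then((results) => {
  for (const r of results) {
    console.log(`${r.code}: ${r.message}`);
  }
});
```

The point is less the specific rules than the workflow: designers see violations as they iterate, so reviewers never have to argue about casing or missing descriptions again.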

Design standards that can’t be automated should be documented and made easily available to API designers. Training that explains the importance of both automated and manually verified rules can motivate developers to fully support the initiative and avoid surprises and friction.

4. Establish API Design Reviewers Across the Org

While an API enablement team should exist to curate these design standards and foster community, review authority should be delegated to each functional area or domain.

Although API standards are important, domain knowledge of systemic constraints, specific customer needs, and organizational strengths and weaknesses is best held by an expert who is part of that world. If centralized API enablement team members are expected to know about everything in the company, bottlenecks leading to delivery delays and developer disengagement are nearly guaranteed.

Training workshops can be a powerful technique for spreading awareness of the importance of API standards, and they will often surface the right SMEs to take on governance authority. Look for individuals who express a passion for APIs (I often refer to them as a ‘band of rebels’), exhibit an awareness of the relevance of consistency and standards, and have the technical respect of their peers and/or reports.

Developing a successful API will involve many people across your organization, often with contrasting skill sets: some will build and deploy the API, while others sit on the strategic side of the business problem, identifying the value of your API. Don’t forget the business stakeholders when deciding who to involve in the design review. Too often we include only the technical side, and that can result in failure later on. The more perspectives, the better!

5. Ensure Portfolio/API Catalog Alignment

Your platform should have product managers who agree on the overall composition of the API portfolio/catalog. Catalogs come in many different forms, but all of them organize your APIs so it’s easier to find what you need without knowing exactly what you’re looking for. A catalog allows potential users to browse the available APIs grouped by functionality or other user concerns.

Good catalogs are searchable or filterable so that developers can easily narrow down the options, and they offer comparable, digestible detail for each API with a clear path forward.
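The shape of a catalog entry can be surprisingly simple. The following TypeScript sketch is hypothetical (the fields, example entry, and helper functions are invented), but it shows the kind of comparable, filterable detail that makes a catalog browsable:

```typescript
// Hypothetical catalog entry: enough comparable detail for a developer
// to decide whether an API is worth a closer look.
interface CatalogEntry {
  name: string;
  domain: string;      // business domain / functional grouping
  summary: string;     // one-line, customer-language description
  useCases: string[];
  lifecycle: "proposed" | "beta" | "ga" | "deprecated";
  docsUrl: string;
  owner: string;
}

const catalog: CatalogEntry[] = [
  {
    name: "Orders API",
    domain: "commerce",
    summary: "Create and track customer orders.",
    useCases: ["checkout", "order status"],
    lifecycle: "ga",
    docsUrl: "https://developer.example.com/orders",
    owner: "commerce-team",
  },
];

// Simple browse/search helpers a catalog UI might expose.
function byDomain(domain: string): CatalogEntry[] {
  return catalog.filter((e) => e.domain === domain);
}

function search(keyword: string): CatalogEntry[] {
  const k = keyword.toLowerCase();
  return catalog.filter(
    (e) =>
      e.summary.toLowerCase().includes(k) ||
      e.useCases.some((u) => u.toLowerCase().includes(k))
  );
}

console.log(search("order").map((e) => e.name)); // ["Orders API"]
```

Whatever the exact fields, the key is that every entry carries the same digestible detail, so APIs can be compared side by side.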

For any newly proposed API, a functional overview with use cases and basic naming should be reviewed as early as possible. This ensures language alignment, reusability, and the overall “fit” of the new API within the larger platform.

Your enablement team should have product managers who own the portfolio alignment process, each covering a manageable collection of domains. At the very least, a regular venue for domain-specific PMs to have alignment discussions is key.

While that can seem like a lot, remember that API standards should evolve through iteration. As each API is designed, you’ll recognize opportunities to refine standards. With that in mind, make sure the basics are covered in your upfront homework, and make sure API governors have a clear understanding of how to propose and adopt changes to standards.

Conducting API Design Reviews

If you’ve completed the above prerequisites, there’s not much to do in API design review! If domain-centric SMEs are involved, design review can often be largely integrated into ongoing design efforts. If “fit” in the platform is aligned early, design reviewers should have the confidence that this API belongs in the bigger picture. Additionally, if API designers see linting errors as they’re iterating, there should be no discussions about basic conventions beyond educating developers on the relevance of various linting rules, or simply how to resolve lint errors.

Not everything can be automated, and sometimes product and architecture needs conflict. Make your API design review the time when manually enforced conventions are checked, customer language is validated (this is hard to automate), and final alignment is solidified. With that scope in mind, meetings can often be skipped entirely in favor of asynchronous discussions.

Most importantly, keep a close eye on API design review cycle time; it should drop noticeably over time as decentralized SMEs grow more comfortable with existing standards and with how to adopt new ones.
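One lightweight way to watch that trend is to record when each review opens and is approved, then track the median cycle time per quarter. The sketch below is hypothetical; the record shape and sample data are invented for illustration:

```typescript
// Hypothetical review record: when a design review was opened and approved.
interface ReviewRecord {
  api: string;
  openedAt: Date;
  approvedAt: Date;
}

// Median review cycle time, in days, for a set of reviews.
function medianCycleTimeDays(reviews: ReviewRecord[]): number {
  const days = reviews
    .map((r) => (r.approvedAt.getTime() - r.openedAt.getTime()) / 86_400_000)
    .sort((a, b) => a - b);
  const mid = Math.floor(days.length / 2);
  return days.length % 2 ? days[mid] : (days[mid - 1] + days[mid]) / 2;
}

const q1Reviews: ReviewRecord[] = [
  { api: "Orders", openedAt: new Date("2024-01-02"), approvedAt: new Date("2024-01-12") },
  { api: "Billing", openedAt: new Date("2024-02-01"), approvedAt: new Date("2024-02-05") },
];

console.log(medianCycleTimeDays(q1Reviews)); // 7
```

If that number isn’t falling quarter over quarter, it’s a signal that standards, training, or the review process itself needs attention.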