Domain-Driven Cloud: Aligning your Cloud Architecture to your Business Model

Key Takeaways

  • Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on the bounded contexts of your business model. DDC extends the principles of Domain-Driven Design (DDD) beyond traditional software systems to create a unifying architecture approach across business domains, software systems and cloud infrastructure.
  • DDC creates a cloud architecture that evolves as your business changes, improves team autonomy and promotes low coupling between distributed workloads. DDC simplifies security, governance and cost management in a way that promotes transparency within your organization.
  • In practice, DDC aligns your bounded contexts with AWS Organizational Units (OU’s) and Azure Management Groups (MG’s). Bounded contexts are categorized as domain contexts based on your business model and supporting technical contexts. DDC gives you freedom to implement different AWS Account or Azure Subscription taxonomies while still aligning to your business model.
  • DDC uses inheritance to enforce policies and controls downward while reporting costs and compliance upwards. Using DDC makes it automatically transparent how your cloud costs align to your business model without implementing complex reports and error-prone tagging requirements.
  • DDC aligns with established AWS and Azure well-architected best practices. You can implement DDC in 5 basic steps, whether you are starting a new migration (greenfield) or upgrading your existing cloud architecture (brownfield).

Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on your business model. DDC uses the bounded contexts of your business model as inputs and outputs a flexible cloud architecture to support all of the workloads in your organization and evolve as your business changes. DDC promotes team autonomy by giving teams the ability to innovate within guardrails. Operationally, DDC simplifies security, governance, integration and cost management in a way that promotes transparency for IT and business stakeholders alike.

Based on Domain-Driven Design (DDD) and the architecture principle of high cohesion and low coupling, this article introduces DDC including the technical and human benefits of aligning your cloud architecture to the bounded contexts in your business model. You will learn how DDC can be implemented in cloud platforms including Amazon Web Services (AWS) and Microsoft Azure while aligning with their well-architected frameworks. Using illustrative examples from one of our real customers, you will learn the 5 steps to implementing DDC in your organization.

What is Domain-Driven Cloud (DDC)?

DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure.  

Our customers perpetually strive to align “people, process and technology” together so they can work in harmony to deliver business outcomes. However, in practice, this often falls down as the Business (Biz), IT Development (Dev) and IT Operations (Ops) all go to their separate corners to design solutions for complex problems that actually span all three.

What emerges are business process redesigns, enterprise architectures and cloud platform architectures, all designed and implemented by different groups using different approaches and localized languages.

What’s missing is a unified architecture approach using a shared language that integrates BizDevOps. This is where DDC steps in, with a specific focus on aligning the cloud architecture and software systems that run on them to the bounded contexts of your business model, identified using DDD. Figure 1 illustrates how DDC extends the principles of DDD to include cloud infrastructure architecture and in doing so creates a unified architecture that aligns BizDevOps.


In DDC, the most important cloud services are AWS Organizational Units (OU’s) that contain Accounts and Azure Management Groups (MG’s) that contain Subscriptions. Because 100% of the cloud resources you secure, use and pay for are connected to Accounts and Subscriptions, these are the natural cost and security containers. By enabling management and security at the higher OU/MG level and anchoring these on the bounded contexts of your business model, you can now create a unifying architecture spanning Biz, Dev and Ops. You can do this while giving your teams flexibility in how they use Accounts and Subscriptions to meet specific requirements.

Why align your Cloud Architecture with your Business Model?

The benefits of aligning your cloud architecture to your organization’s business model include:

  • Evolves with your Business – Businesses are not static and neither is your cloud architecture. As markets change and your business evolves, new contexts may emerge and others may consolidate or fade away. Some contexts that historically were strategic differentiators may drive less business value today. The direct alignment of your cloud management, security and costs to bounded contexts means your cloud architecture evolves with your business.
  • Improves Team Autonomy – While some cloud management tasks must be centralized, DDC recommends giving teams autonomy within their domain contexts for things like provisioning infrastructure and deploying applications. This enables innovation within guardrails so your agile teams can go faster and be more responsive to changes as your business grows. It also ensures dependencies between workloads in different contexts are explicit with the goal of promoting a loosely-coupled architecture aligned to empowered teams.
  • Promotes High Cohesion and Low Coupling – Aligning your networks to bounded contexts enables you to explicitly allow or deny network connectivity between all contexts. This is extraordinarily powerful, especially for enforcing low coupling across your cloud platform and avoiding a modern architecture that looks like a bowl of spaghetti. Within a context, teams and workloads ideally have high cohesion with respect to security, network integration and alignment on supporting a specific part of your business. You also have freedom to make availability and resiliency decisions at both the bounded context and workload levels.
  • Increases Cost Transparency – By aligning your bounded contexts to OU’s and MG’s, all cloud resource usage, budgets and costs are precisely tracked at a granular level. Then they are automatically summarized at the bounded contexts without custom reports and nagging all your engineers to tag everything! With DDC you can look at your monthly cloud bill and know the exact cloud spend for each of your bounded contexts, enabling you to assess whether these costs are commensurate with each context’s business value. Cloud budgets and alarms can be delegated to context-aligned teams enabling them to monitor and optimize their spend while your organization has a clear top-down view of overall cloud costs.
  • Domain-Aligned Security – Security policies, controls, identity and access management all line up nicely with bounded contexts. Some policies and controls can be deployed across-the-board to all contexts to create a strong security baseline. From here, selected controls can be safely delegated to teams for self-management while still enforcing enterprise security standards.
  • Repeatable with Code Templates – Both AWS and Azure provide ways to provision new Accounts or Subscriptions consistently from a code-based blueprint. In DDC, we recommend defining one template for all domain contexts, then using this template (plus configurable input parameters) to provision and configure new OU’s and Accounts or MG’s and Subscriptions as needed. These management constructs are free (you only pay for the actual resources used within them), enabling you to build out your cloud architecture incrementally yet towards a defined future-state, without incurring additional cloud costs along the way.
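To make the "one template plus input parameters" idea concrete, here is a hedged sketch in Python. All names (the template class, policy names, context names) are invented for illustration; in practice this would be a CloudFormation/Control Tower blueprint or an Azure Bicep/ARM template rather than application code.

```python
from dataclasses import dataclass, field

@dataclass
class DomainContextTemplate:
    """Hypothetical blueprint shared by all domain contexts."""
    baseline_policies: list = field(
        default_factory=lambda: ["deny-public-storage", "require-encryption"])

    def provision(self, context_name, environments=("Prod", "NonProd")):
        # Returns the OU/MG plus Account/Subscription definitions
        # for one bounded context.
        return {
            "ou": context_name,
            "accounts": [f"{context_name}-{env}" for env in environments],
            "policies": list(self.baseline_policies),
        }

template = DomainContextTemplate()
orders = template.provision("Orders")                          # default environments
payers = template.provision("Payers", environments=("Prod",))  # parameterized
```

The same template provisions every domain context; only the parameters differ, which is what keeps the build-out incremental yet consistent.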

DDC may not be the best approach in all situations. Alternatives such as organizing your cloud architecture by tenant/customer (SaaS) or legal entity are viable options, too.

Unfortunately, we often see customers default to organizing their cloud architecture by their current org structure, following Conway’s Law from the 1960s. We think this is a mistake and that DDC is a better alternative for one simple reason: your business model is more stable than your org structure.

One of the core tenets of good architecture is that more stable components should not depend on less stable components (aka the Stable Dependencies Principle). Organizations, especially large ones, like to reorganize often, making their org structure less stable than their business model. Basing your cloud architecture on your org structure means that every time you reorganize, your cloud architecture is directly impacted, which may in turn impact all the workloads running in your cloud environment. Why do this? Basing your cloud architecture on your organization’s business model enables it to evolve naturally as your business strategy evolves, as seen in Figure 2.


We recognize that, as Ruth Malan states, “If the architecture of the system and the architecture of the organization are at odds, the architecture of the organization wins”. We also acknowledge there is work to do with how OU’s/MG’s and all the workloads within them best align to team boundaries and responsibilities. We think ideas like Team Topologies may help here.

We are seeing today’s organizations move away from siloed departmental projects within formal communications structures to cross-functional teams creating products and services that span organizational boundaries. These modern solutions run in the cloud, so we feel the time is right for evolving your enterprise architecture in a way that unifies Biz, Dev and Ops using a shared language and architecture approach.

What about Well-Architected frameworks?

Both AWS’s Well-Architected framework and Azure’s Well-Architected framework provide a curated set of design principles and best practices for designing and operating systems in your cloud environments. DDC fully embraces these frameworks and at SingleStone we use these with our customers. While these frameworks provide specific recommendations and benefits for organizing your workloads into multiple Accounts or Subscriptions, managed with OU’s and MG’s, they leave it to you to figure out the best taxonomy for your organization.

DDC is opinionated on basing your cloud architecture on your bounded contexts, while being 100% compatible with models like AWS’s Separated AEO/IEO and design principles like “Perform operations as code” and “Automatically recover from failure”. You can adopt DDC and apply these best practices, too. Tools such as AWS Landing Zone and Azure Landing Zones can accelerate the setup of your cloud architecture while also being domain-driven.

5 Steps for Implementing Domain-Driven Cloud

Do you think a unified architecture using a shared language across BizDevOps might benefit your organization? While a comprehensive list of all tasks is beyond the scope of this article, here are the five basic steps you can follow, with illustrations from one of our customers who recently migrated to Azure.

Step 1: Start with Bounded Contexts

The starting point for implementing DDC is a set of bounded contexts that describes your business model. The steps to identify your bounded contexts are not covered here, but the process described in Domain-Driven Discovery is one approach.

Once you identify your bounded contexts, organize them into two groups:

  • Domain contexts are directly aligned to your business model.
  • Technical contexts support all domain contexts with shared infrastructure and services.

To illustrate, let’s look at our customer, a medical supply company. Their domain and technical contexts are shown in Figure 3.


Your organization’s domain contexts would be different, of course.

For technical contexts, the number will depend on factors including your organization’s industry, complexity, regulatory and security requirements. A Fortune 100 financial services firm will have more technical contexts than a new media start-up. With that said, as a starting point DDC recommends six technical contexts for supporting all your systems and data.

  • Cloud Management – Context for the configuration and management of your cloud platform including OU/MG’s, Accounts/Subscriptions, cloud budgets and cloud controls.
  • Security – Context for identity and access management, secrets management and other shared security services used by any workload.
  • Network – Context for all centralized networking services including subnets, firewalls, traffic management and on-premise network connectivity.
  • Compliance – Context for any compliance-related services and data storage that supports regulatory, audit and forensic activities.
  • Platform Services – Context for common development and operations services including CI/CD, package management, observability, logging, compute and storage.
  • Analytics – Context for enterprise data warehouses, governance, reporting and dashboards.

You don’t have to create all of these up front. Start with Cloud Management and build out the others as needed.

Step 2: Build a Solid Foundation

With your bounded contexts defined, it’s now time to build a secure cloud foundation for supporting your organization’s workloads today and in the future. In our experience, it is helpful to organize your cloud capabilities into three layers based on how they support your workloads. For our medical supply customer, Figure 4 shows their contexts aligned to the Application, Platform and Foundation layers of their cloud architecture.


With DDC, you align AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) to bounded contexts. By align, we mean you name them after your bounded contexts. These are the highest levels of management and through the use of inheritance they give you the ability to standardize controls and settings across your entire cloud architecture.

DDC gives you flexibility in how best to organize your Accounts and Subscription taxonomy, from coarse-grained to fine-grained, as seen in Figure 5.

DDC recommends starting with one OU/MG and at least two Accounts/Subscriptions per bounded context. If your organization has higher workload isolation requirements, DDC can support this too, as seen in Figure 5.
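To make the coarse-versus-fine spectrum concrete, here is an illustrative sketch (context and workload names are invented) of deriving an Account/Subscription taxonomy from bounded contexts:

```python
def account_taxonomy(contexts, granularity="coarse", workloads=None):
    """Derive Account/Subscription names per bounded context.

    'coarse' follows DDC's recommended start: Prod + NonProd per context.
    'fine' isolates every workload environment in its own Account/Subscription.
    """
    workloads = workloads or {}
    taxonomy = {}
    for ctx in contexts:
        if granularity == "coarse":
            taxonomy[ctx] = [f"{ctx}-Prod", f"{ctx}-NonProd"]
        else:
            taxonomy[ctx] = [f"{ctx}-{wl}-{env}"
                             for wl in workloads.get(ctx, [])
                             for env in ("Dev", "Test", "Prod")]
    return taxonomy

coarse = account_taxonomy(["Orders", "Payers"])
fine = account_taxonomy(["Orders"], granularity="fine",
                        workloads={"Orders": ["order-api"]})
```

Either taxonomy still hangs under the same context-named OU/MG, so moving from coarse to fine later does not change the business-model alignment.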


For our customer who had a small cloud team new to Azure, separate Subscriptions for Prod and NonProd for each context made sense as a starting point, as shown in Figure 6.


Figure 7 shows what this would look like in AWS.


For our customer, further environments like Dev, Test and Stage could be created within their respective Prod and NonProd Subscriptions. This provides isolation between environments with the ability to configure environment-specific settings at the Subscription or lower levels. They also decided to build just the Prod Subscriptions for the six technical contexts to keep things simple to start. Again, if your organization wanted to create separate Accounts or Subscriptions for every workload environment, this can be done too while still aligning with DDC.

From a governance perspective, in DDC we recommend domain contexts inherit security controls and configurations from technical contexts. Creating a strong security posture in your technical contexts enables all your workloads that run in domain contexts to inherit this security by default. Domain contexts can then override selected controls and settings on a case-by-case basis balancing team autonomy and flexibility with required security guardrails.
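The inherit-then-override model can be sketched as follows. The policy names here are illustrative stand-ins, not actual Azure Policy or AWS service control policy identifiers:

```python
# Baseline controls defined once in the technical contexts.
BASELINE = {"encryption": "required", "public-ips": "deny",
            "allowed-regions": ["us-east-1"]}

def effective_policy(baseline, overrides=None):
    """Domain contexts inherit the baseline; selected settings can be
    overridden case by case, within agreed guardrails."""
    policy = dict(baseline)  # copy, so the shared baseline stays untouched
    policy.update(overrides or {})
    return policy

orders_policy = effective_policy(BASELINE)  # inherits everything as-is
analytics_policy = effective_policy(
    BASELINE, {"allowed-regions": ["us-east-1", "eu-west-1"]})
```

The baseline stays the single source of truth, while each override is an explicit, reviewable exception per context.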

Using DDC, your organization can grant autonomy to teams to enable innovation within guardrails. Leveraging key concepts from Team Topologies, stream-aligned teams can be self-sufficient within domain contexts when creating cloud infrastructure, deploying releases and monitoring their workloads. Platform teams, primarily working in technical contexts, can focus on designing and running highly-available services used by the stream-aligned teams. These teams work together to create the right balance between centralization and decentralization of cloud controls to meet your organization’s security and risk requirements, as shown in Figure 8.


As this figure shows, policies and controls defined at higher level OU’s/MG’s are enforced downwards while costs and compliance are reported upwards. For our medical supply customer, this means their monthly Azure bill is automatically itemized by their bounded contexts with summarized cloud costs for Orders, Distributors and Payers to name a few.

This makes it easy for their CTO to share cloud costs with their business counterparts and establish realistic budgets that can be monitored over time. Just like costs, policy compliance across all contexts can be reported upwards with evidence stored in the Compliance technical context for auditing or forensic purposes. Services such as Azure Policy and AWS Audit Manager are helpful for continually maintaining compliance across your cloud environments by organizing your policies and controls in one place for management.
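Because Accounts and Subscriptions sit under context-named OUs/MGs, rolling costs up to bounded contexts is a simple aggregation. A sketch with invented figures (in a real bill the context comes from the OU/MG hierarchy; here the account name encodes it):

```python
from collections import defaultdict

# Hypothetical monthly line items: (Account/Subscription name, cost in USD).
line_items = [("Orders-Prod", 1200.0), ("Orders-NonProd", 300.0),
              ("Distributors-Prod", 800.0), ("Payers-Prod", 450.0)]

def costs_by_context(items):
    """Sum subscription-level costs up to their bounded context.

    The context is recoverable from the hierarchy (here, the name prefix)
    because OUs/MGs are named after bounded contexts -- no per-resource
    tagging discipline required."""
    totals = defaultdict(float)
    for account, cost in items:
        context = account.rsplit("-", 1)[0]
        totals[context] += cost
    return dict(totals)

context_costs = costs_by_context(line_items)
```

This is the mechanism behind "your monthly bill is itemized by bounded context": the rollup falls out of the hierarchy rather than out of custom reports.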

Step 3: Align Workloads to Bounded Contexts

With a solid foundation and our bounded contexts identified, the next step is to align your workloads to the bounded contexts. Identifying all the workloads that will run in your cloud environment is often done during a cloud migration discovery, aided in part by a configuration management database (CMDB) that contains your organization’s portfolio of applications.

When aligning workloads to bounded contexts we prefer a workshop approach that promotes discussion and collaboration. In our experience this makes DDC understandable and relatable to the teams involved in the migration. Because teams must develop and support these workloads, the workshop also highlights where organizational structures may align (or not) to bounded contexts. This workshop (or a follow-up one) can also identify which applications should be independently deployable and how each team’s ownership boundaries map to bounded contexts.

For our medical supply customer, this workshop revealed that a shared CI/CD tool in the Platform Services context needed permissions to deploy a new version of their Order Management system in the Orders context. This drove a discussion about how secrets and permissions would be managed across contexts, identifying new capabilities for secrets management that were prioritized during the cloud migration. By creating a reusable solution that worked for all future workloads in domain contexts, the cloud team created a new capability that improved the speed of future migrations.

Figure 9 summarizes how our customer aligned their workloads to bounded contexts, which are aligned to their Azure Management Groups.


Within the Orders context, our customer used Azure Resource Groups for independently deployable applications or services that contain Azure Resources, as shown in Figure 10.


This design served as a starting point for their initial migration of applications from a data center to Azure. Over the next few years, their goal was to re-factor these applications into multiple independent micro-services. When that time came, they could do this incrementally, one application at a time, by creating additional Resource Groups for each service.

If our customer were using AWS, Figure 10 would look very similar but use Organizational Units, Accounts and AWS CloudFormation stacks for organizing independently deployable applications or services that contain resources. One difference between the cloud providers is that AWS allows nested stacks (stacks within stacks) whereas Azure Resource Groups cannot be nested.

For networking, in order for workloads running in domain contexts to access shared services in technical contexts, their networks must be connected or permissions explicitly enabled to allow access. While the Network technical context contains centralized networking services, by default each Account or Subscription aligned to a domain context will have its own private network containing subnets that are independently created, maintained and used by the workloads running inside them.

Depending on the total number of Accounts or Subscriptions, this may be desired, or it may be too many separate networks to manage (each potentially has its own IP range). Alternatively, core networks can be defined in the Network context and shared with specific domain or technical contexts, thereby avoiding every context having its own private network. The details of cloud networking are beyond the scope of this article, but DDC enables multiple networking options while still aligning your cloud architecture to your business model. Bottom line: you don’t have to sacrifice network security to adopt DDC.
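The "explicitly allow or deny connectivity between contexts" idea boils down to a deny-by-default allow-list. A minimal sketch, with an invented set of allowed paths:

```python
# Hypothetical allow-list of context-to-context network paths.
ALLOWED_PATHS = {
    ("Orders", "Platform Services"),
    ("Orders", "Security"),
    ("Payers", "Platform Services"),
}

def can_connect(src_context, dst_context):
    """Deny by default: connectivity between contexts must be explicit."""
    return (src_context, dst_context) in ALLOWED_PATHS

# Domain contexts reach shared services in technical contexts, but not
# each other, which keeps coupling between domains low and visible.
```

In AWS or Azure the allow-list would be realized with peering, firewall, or routing rules, but the governance question is the same: every cross-context path is a deliberate, reviewable decision.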

Step 4: Migrate Workloads

Having identified where each workload will run, it is time to begin moving each one into the right Account or Subscription. While this was a new migration for our customer (greenfield), for your organization this may involve re-architecting your existing cloud platform (brownfield). Migrating a portfolio of workloads to AWS or Azure and the steps for architecting your cloud platform are beyond the scope of this article, but with respect to DDC here is a checklist of the key things to keep in mind:

  • Name your AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) after your bounded contexts.
  • Organize your contexts into domain and technical groupings, with:
    • Technical contexts as the foundation and platform layers of your cloud architecture.
    • Domain contexts as the application layer of your cloud architecture.
  • Centralize common controls in technical contexts for a strong security posture.
  • Decentralize selected controls in domain contexts to promote team autonomy, speed and agility.
  • Use inheritance within OU’s or MG’s for enforcing policies and controls downward while reporting cost and compliance upwards.
  • Decide on your Account / Subscription taxonomy within the OU’s / MG’s, balancing workload isolation with management complexity.
  • Decide how your networks will map to domain and technical contexts, balancing centralization versus decentralization.
  • Create domain context templates for consistency and use these when provisioning new Accounts / Subscriptions.

For brownfield deployments of DDC that are starting with an existing cloud architecture, the basic recipe is:

  1. Create new OU’s / MG’s named after your bounded contexts. For a period of time these will live side-by-side with your existing OU’s / MG’s and should have no impact on current operations.
  2. Implement policies and controls within the new OU’s / MG’s for your technical contexts, using inheritance as appropriate.
  3. Create a common code template for all domain contexts that inherits policies and controls from your technical contexts. Use parameters for anything that’s different between contexts.
  4. Based on the output of your workload mapping workshop, for each workload either:
    • a. Create a new Account / Subscription using the common template, aligned with your desired account taxonomy, for holding the workload, or
    • b. Migrate an existing Account / Subscription, including all workloads and resources within it, to the new OU / MG. When migrating, pay careful attention to controls from the originating OU / MG to ensure they are also enabled in the target OU / MG.
  5. The order you move workloads will be driven by the dependencies between your workloads, so this should be understood before beginning. The same goes for shared services that workloads depend on.
  6. Depending on the number of workloads to migrate, this may take weeks or months (but hopefully not years). Work methodically as you migrate workloads, verifying that controls, costs and compliance are working correctly for each context.
  7. Once done, decommission the old OU / MG structure and any Accounts / Subscriptions no longer in use.
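Step 5’s note that migration order is driven by dependencies can be sketched as a topological sort: migrate each workload only after everything it depends on has moved. The workload names below are invented for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: workload -> set of workloads it depends on.
dependencies = {
    "order-ui": {"order-api"},
    "order-api": {"identity-service"},
    "identity-service": set(),
}

# static_order() yields dependencies before the workloads that rely on them.
migration_order = list(TopologicalSorter(dependencies).static_order())
```

The same ordering logic applies to shared services in technical contexts: they move (or are stood up) before the domain-context workloads that consume them.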

Step 5: Inspect and Adapt

Your cloud architecture is not a static artifact; the design will continue to evolve over time as your business changes and new technologies emerge. New bounded contexts will appear that require changes to your cloud platform. Ideally, much of this work is codified and automated, but in all likelihood you will still have some manual steps involved as your bounded contexts evolve.

Your Account / Subscription taxonomy may change over time too, starting with fewer to simplify initial management and growing as your teams and processes mature. The responsibility boundaries of teams and how these align to bounded contexts will also mature over time. Methods like GitOps work nicely alongside DDC to keep your cloud infrastructure flexible and extensible over time and continually aligned with your business model.


DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure (BizDevOps). DDC is based on the software architecture principle of high cohesion and low coupling that is used when designing complex distributed systems, like your AWS and Azure environments. Employing the transparency and shared language benefits of DDD when creating your organization’s cloud architecture results in a secure-yet-flexible platform that naturally evolves as your business changes over time.

Special thanks to John Chapin, Casey Lee, Brandon Linton and Nick Tune for feedback on early drafts of this article and Abby Franks for the images.

Dealing with Java CVEs: Discovery, Detection, Analysis, and Resolution

Key Takeaways

  • Including a dependency vulnerability check (Software Composition Analysis or SCA) as part of a continuous integration or continuous delivery pipeline is important to maintain an effective security posture.
  • The same vulnerability can be critical in one application and harmless in another. Humans should be “kept in the loop” here, and only the developers maintaining the application can make an effective decision.
  • It is essential to prevent vulnerability alert fatigue. We should not get used to the dependency check failing; if we do, a critical vulnerability may pass unnoticed.
  • It is crucial to quickly upgrade vulnerable dependencies or suppress false positives even if we are maintaining dozens of services.
  • Developers should invest in tools that help with discovery, detection, analysis and resolution of vulnerabilities. Examples include OWASP dependency check, GitHub Dependabot, Checkmarx, Snyk and Dependency Shield.

Modern Java applications are built on top of countless open-source libraries. The libraries encapsulate common, repetitive code and allow application programmers to focus on delivering customer value. But the libraries come with a price – security vulnerabilities. A security issue in a popular library enables malicious actors to attack a wide range of targets cheaply.

Therefore, it’s crucial to have dependency vulnerability checks (a.k.a. Software Composition Analysis or SCA) as part of the CI pipeline. Unfortunately, the security world is not black and white; one vulnerability can be totally harmless in one application and a critical issue in another, so the scans always need human oversight to determine whether a report is a false positive.
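In essence, an SCA step compares the application’s dependency coordinates against a vulnerability database and fails the build on any match. A deliberately simplified sketch: real tools also match version ranges and CPE identifiers, and the database entry below is illustrative, not an authoritative record.

```python
# Toy vulnerability database: (group:artifact, version) -> CVE ids.
KNOWN_VULNS = {
    ("org.springframework:spring-beans", "5.3.17"): ["CVE-2022-22965"],
}

def scan(dependencies):
    """Return the CVE ids found for the given dependency list."""
    findings = []
    for coordinate, version in dependencies:
        findings.extend(KNOWN_VULNS.get((coordinate, version), []))
    return findings

report = scan([("org.springframework:spring-beans", "5.3.17"),
               ("com.example:safe-lib", "1.0.0")])
# A CI pipeline would fail the build whenever report is non-empty.
```

The hard part, as the rest of this article shows, is not the lookup but deciding what to do with each finding.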

This article explores examples of vulnerabilities commonly found in standard Spring Boot projects over the last few years. It is written from the perspective of software engineers and focuses on the challenges faced when using widely available tools such as the OWASP dependency check.

As software engineers are dedicated to delivering product value, they view security as one of their many responsibilities. Despite its importance, security can sometimes get in the way and be neglected because of the complexity of other tasks.

Vulnerability resolution lifecycle

A typical vulnerability lifecycle looks like this:

Discovery

A security researcher usually discovers the vulnerability. It gets reported to the impacted OSS project and, through a chain of various non-profit organizations, ends up in the NIST National Vulnerability Database (NVD). For instance, the Spring4Shell vulnerability was logged in this manner.

Detection

When a vulnerability is reported, it is necessary to detect that the application contains the vulnerable dependency. Fortunately, a plethora of tools are available that can assist with the detection.

One of the popular solutions is the OWASP dependency check – it can be used as a Gradle or Maven plugin. When executed, it compares all your application dependencies with the NIST NVD database and Sonatype OSS index. It allows you to suppress warnings and generate reports and is easy to integrate into the CI pipeline. The main downside is that it sometimes produces false positives as the NIST NVD database does not provide the data in an ideal format. Moreover, the first run takes ages as it downloads the whole vulnerability database.

Various free and commercial tools are available, such as GitHub Dependabot, Checkmarx, and Snyk. Generally, these tools function similarly, scanning all dependencies and comparing them against a database of known vulnerabilities. Commercial providers often invest in maintaining a more accurate database, so commercial tools may produce fewer false positives and false negatives.

Analysis

After a vulnerability is detected, a developer must analyze the impact. As you will see in the examples below, this is often the most challenging part. The individual performing the analysis must understand the vulnerability report, the application code, and the deployment environment to see if the vulnerability can be exploited. Typically, this falls to the application programmers as they are the only ones who have all the necessary context.

Resolution

The vulnerability has to be resolved.

  1. Ideally, this is achieved by upgrading the vulnerable dependency to a fixed version.
  2. If no fix is released yet, the application programmer may apply a workaround, such as changing a configuration, filtering an input, etc.
  3. More often than not, the vulnerability report is a false positive, usually because the vulnerability can’t be exploited in the given environment. In such cases, the report has to be suppressed so the team does not become accustomed to failing vulnerability checks.

Once the analysis is done, the resolution is usually straightforward but can be time-consuming, especially if there are dozens of services to patch. It’s important to simplify the resolution process as much as possible. Since this is often tedious manual work, automating it to the greatest extent possible is advisable. Tools like Dependabot or Renovate can help in this regard to some extent.
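Suppression is how accepted false positives stay documented without keeping the pipeline red. The OWASP dependency check uses an XML suppression file for this; the Python sketch below just models the idea, with illustrative findings:

```python
# Each suppression records what is ignored and, crucially, why.
suppressions = [
    {"cve": "CVE-2016-1000027", "package": "org.springframework:spring-web",
     "reason": "HttpInvokerServiceExporter is not used in this service"},
]

def unsuppressed(findings, suppressions):
    """Drop findings covered by a suppression; everything else still fails CI."""
    ignored = {(s["cve"], s["package"]) for s in suppressions}
    return [f for f in findings if (f["cve"], f["package"]) not in ignored]

findings = [
    {"cve": "CVE-2016-1000027", "package": "org.springframework:spring-web"},
    {"cve": "CVE-2022-22965", "package": "org.springframework:spring-beans"},
]
remaining = unsuppressed(findings, suppressions)  # only Spring4Shell is left
```

Recording the reason alongside each suppression is what keeps the list auditable when the next engineer (or auditor) asks why a CVE is being ignored.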

Vulnerability examples

Let’s examine some vulnerability examples and the issues that can be encountered when resolving them.

Spring4Shell (CVE-2022-22965, score 9.8)

Let’s start with a serious vulnerability – Spring Framework RCE via Data Binding on JDK 9+, a.k.a. Spring4Shell, which allows an attacker to remotely execute code just by calling HTTP endpoints.

Detection

It was easy to detect this vulnerability. Spring is quite a prominent framework; the vulnerability was present in most of its versions, and it was discussed all over the internet. Naturally, all the detection tools were able to detect it.


In the early announcement of the vulnerability, it was stated that only applications using Spring WebMvc/Webflux deployed as WAR to a servlet container are affected. In theory, deployment with an embedded servlet container should be safe. Unfortunately, the announcement lacked the vulnerability details, making it difficult to confirm whether this was indeed the case. However, this vulnerability was highly serious, so it should have been mitigated promptly.


The fix was released in a matter of hours, so the best way was to wait for it and upgrade. Tools like Dependabot or Renovate can help apply the upgrade across all your services.
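For instance, GitHub Dependabot is enabled by committing a small configuration file to the repository; a minimal sketch for a Maven project (adjust the ecosystem for Gradle or other build tools):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "maven"   # use "gradle" for Gradle builds
    directory: "/"               # where the build file lives
    schedule:
      interval: "daily"          # look for new versions every day
```

Dependabot then opens an upgrade pull request in each repository as soon as the fixed version is published.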

If there was a desire to resolve the vulnerability sooner, a workaround was available. But it meant applying an obscure configuration without a clear understanding of what it did. The decision to manually apply it across all services or wait for the fix could have been a challenging one to make.

HttpInvoker RCE (CVE-2016-1000027, score 9.8)

Let’s continue to focus on Spring for a moment. This vulnerability has the same criticality as Spring4Shell: 9.8. But one might notice the date is 2016 and wonder why it hasn’t been fixed yet, or why it lacks a fancy name. The reason lies in its location, the HttpInvoker component, which is used for an RPC communication style that was popular in the 2000s but is seldom used nowadays. To make it even more confusing, the vulnerability was published in 2020, four years after it was initially reported, due to administrative reasons.


This issue was reported by the OWASP dependency check and other tools. As it did not affect many applications, it did not make the headlines.


Reading the NIST CVE detail doesn’t reveal much:

Pivotal Spring Framework through 5.3.16 suffers from a potential remote code execution (RCE) issue if used for Java deserialization of untrusted data. Depending on how the library is implemented within a product, this issue may or [may] not occur, and authentication may be required.

This sounds pretty serious, prompting immediate attention and a search through the link to find more details. However, the concern turns out to be a false alarm, as it only applies if HttpInvokerServiceExporter is used.


No fixed version of the library was released, as Pivotal did not consider it a bug: it was a feature of obsolete code that was supposed to be used only for internal communication. The whole functionality was dropped altogether in Spring 6, a few years later.

The only action to take is to suppress the warning. With the free OWASP dependency check, this process can be quite time-consuming if it has to be done manually for each service.

There are several ways to simplify the flow. One is to expose a shared suppression file and reference it from all your projects by specifying its URL. Another is to employ a simple service like Dependency Shield to streamline the whole suppression flow. The important point is to have a process that makes suppression easy, as most of the reports received are likely false positives.
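For illustration, a suppression entry for this report in an OWASP dependency check suppression file might look like the following (the packageUrl pattern is an assumption; match it to the artifact flagged in your own report):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<suppressions xmlns="https://jeremylong.github.io/DependencyCheck/dependency-suppression.1.3.xsd">
    <suppress>
        <notes>CVE-2016-1000027 is a false positive for us: HttpInvokerServiceExporter is not used.</notes>
        <packageUrl regex="true">^pkg:maven/org\.springframework/spring\-web@.*$</packageUrl>
        <cve>CVE-2016-1000027</cve>
    </suppress>
</suppressions>
```

Hosting this file at a shared URL and referencing it from every build keeps the suppression decision in one reviewed place.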

SnakeYAML RCE (CVE-2022-1471, score 9.8)

Another critical vulnerability, this time in the SnakeYAML parsing library. Once again, it involves remote code execution, with a score of 9.8. However, it was only exploitable if the SnakeYAML Constructor class was used to parse a YAML document provided by an attacker.


It was detected by vulnerability scanning tools. SnakeYAML is used by Spring to parse YAML configuration files, so it’s quite widespread.


Is the application parsing YAML that could be provided by an attacker, for example, on a REST API? Is the unsafe Constructor class being used? If so, the system is vulnerable. The system is safe if it is simply used to parse Spring configuration files. An individual who understands the code and its usage must make the decision. The situation could either be critical, requiring immediate attention and correction, or it could be safe and therefore ignored.


The issue was quickly fixed. What made it tricky is that SnakeYAML is not a direct dependency; it is introduced transitively by Spring, which makes it harder to upgrade. If you want to upgrade SnakeYAML, you may do it in several ways.

  1. If using the Spring Boot dependency management plugin with the Spring Boot BOM, either:
    a. override the snakeyaml.version property, or
    b. override the dependency management declaration.
  2. If not using dependency management, add SnakeYAML as a direct dependency to the project and override the version.
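For option 1a, a single property override is enough when inheriting from the Spring Boot parent. A Maven sketch (this assumes SnakeYAML 2.0, the version that fixes CVE-2022-1471, is compatible with your Spring Boot version; verify before upgrading):

```xml
<!-- pom.xml: override the SnakeYAML version managed by the Spring Boot BOM -->
<properties>
    <snakeyaml.version>2.0</snakeyaml.version>
</properties>
```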

Combined with complex multi-project builds, it’s almost impossible for tools to upgrade the version automatically. Neither Dependabot nor Renovate is able to do that, and even a commercial tool like Snyk fails with “could not apply the upgrade, dependency is managed externally.”

And, of course, once the version is overridden, it is essential to remember to remove the override when the new version arrives through Spring. In our case, it’s better to temporarily suppress the warning until Spring ships the new version.

Misidentified Avro vulnerability

Vulnerability CVE-2021-43045 is a bug in the .NET version of the Avro library, so it’s unlikely to affect a Java project. How, then, is it reported? Unfortunately, the NIST report contains the broad cpe:2.3:a:apache:avro:*:*:*:*:*:*:*:* identifier. No wonder the tools mistakenly identify the Java org.apache.avro artifact as vulnerable, even though it’s from a completely different ecosystem.

Resolution: Suppress


Let’s look back at the different stages of vulnerability resolution and how to streamline each of them as much as possible, so that reports do not block engineers for too long.


The most important part of detection is to avoid getting used to failing dependency checks. Ideally, the build should fail if a vulnerable dependency is detected. To be able to enable that, the resolution needs to be as painless and as fast as possible; no one wants to encounter a pipeline broken by a false positive.
Since the OWASP dependency check primarily uses the NIST NVD database, it is especially prone to false positives. However, as we have seen, false positives are inevitable with any tool, as the analysis is rarely straightforward.
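With the OWASP dependency-check Gradle plugin, for example, failing the build on severe findings while keeping suppressions in a dedicated file takes two settings (the CVSS threshold is a per-team choice):

```groovy
// build.gradle: OWASP dependency-check plugin configuration
dependencyCheck {
    failBuildOnCVSS = 7.0                        // break the build on CVSS >= 7.0
    suppressionFile = 'owasp-suppressions.xml'   // reviewed false positives live here
}
```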


Analysis is the hard part, and the one where tooling can’t help us much. Consider the SnakeYAML remote code execution vulnerability as an example: for it to be exploitable, the library has to be used unsafely, such as for parsing data provided by an attacker. Regrettably, no tool is likely to reliably detect whether an application and all its libraries contain vulnerable code, so this part will always need some human intervention.


Upgrading the library to a fixed version is relatively straightforward for direct dependencies, and tools like Dependabot and Renovate can help in the process. However, these tools fail if the vulnerable dependency is introduced transitively or through dependency management. Manually overriding the dependency may be an acceptable solution for a single project; when multiple services are being maintained, centrally managed dependency management should be introduced to streamline the process.
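One common setup is a small shared platform (BOM) that every service imports, so a transitive override is made once and rolled out by bumping a single version. A Gradle sketch with hypothetical coordinates (com.example:platform-bom does not exist; substitute your own):

```groovy
// Each service's build.gradle imports the shared BOM
dependencies {
    implementation platform('com.example:platform-bom:1.2.3')
}
```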

Most reports are false positives, so it’s crucial to have an easy way to suppress the warning. When using OWASP dependency check, either try a shared suppression file or a tool like Dependency Shield that helps with this task.

It often makes sense to suppress a report temporarily, either to unblock the pipeline until somebody has time to analyze the report properly, or until the transitive dependency is updated in the project that introduced it.

Building Kafka Event-Driven Applications with KafkaFlow

Key Takeaways

  • KafkaFlow is an open-source project that streamlines Kafka-based event-driven applications, simplifying the development of Kafka consumers and producers.
  • This .NET framework offers an extensive range of features, including middlewares, message handlers, type-based deserialization, concurrency control, batch processing, and more.
  • By utilizing middlewares, developers can encapsulate the logic for processing messages, which leads to better separation of concerns and maintainable code.
  • The project can be extended, creating the possibility of customization and the growth of an ecosystem of add-ons.
  • Developers benefit from KafkaFlow by being able to focus on what matters, spending more time on the business logic rather than investing in low-level concerns.

KafkaFlow is an open-source framework by FARFETCH. It helps .NET developers working with Apache Kafka to create event-driven applications. KafkaFlow lets developers easily set up “Consumers” and “Producers.” The simplicity makes it an attractive framework for businesses seeking efficiency and robustness in their applications.

In this article, we will explore what KafkaFlow has to offer. If you build Apache Kafka Consumers and Producers using .NET, this article will provide a glance at how KafkaFlow can simplify your life.

Why Should You Care About It?

KafkaFlow provides an abstraction layer over the Confluent .NET Kafka client. It does so while making it easier to use, maintain, and test Kafka consumers and producers.

Imagine you need to build a Client Catalog for marketing initiatives. You will need a service to consume messages that capture new Clients. Once you start laying out your required service, you notice that existing services are not consistent in how they consume messages.

It’s common to see teams struggling with, and repeatedly re-solving, simple problems such as graceful shutdown. You may figure out that you have four different implementations of a JSON serializer across the organization, to name just one of the challenges.

Adopting a framework like KafkaFlow simplifies the process and can speed up the development cycle. KafkaFlow has a set of features designed to enhance the developer experience:

  1. Middlewares: KafkaFlow allows developers to create middlewares to process messages, enabling more control and customization of the Kafka consumer/producer pipeline.
  2. Handlers: Introduces the concept of message handlers, allowing developers to forward message processing from a topic to a message-type dedicated handler.
  3. Deserialization Algorithms: Offers a set of Serialization and Deserialization algorithms out-of-the-box.
  4. Multi-threaded Consumers: Provides multi-threading with message order guaranteed, helping to ensure optimal use of system resources.
  5. Administration API and Dashboard: Provides API and Dashboards to manage Consumers and Consumer groups, with operations such as pausing, resuming, or rewinding offsets, all at runtime.
  6. Consumer Throttling: Provides an easy way to bring priorities to topic consumption.

Let’s explore these features so you can see their potential to address a problem like this.

KafkaFlow Producers: Simplified Message Production

Let’s start with message producers.

Producing a message to Kafka is not rocket science. Even so, KafkaFlow provides a higher-level abstraction over the producer interface of Confluent’s .NET Kafka client, simplifying the code and increasing maintainability.

Here’s an example of how to send a message with a KafkaFlow producer:

await _producers["my-topic-events"]
    .ProduceAsync("my-topic", message.Id.ToString(), message);

This way, you can produce messages to Kafka without dealing directly with serialization or other complexities of the underlying Kafka client.

Not only that, but defining and managing Producers is pleasantly done through a Fluent Interface on your service configuration.

services.AddKafka(kafka => kafka
    .AddCluster(cluster => cluster
        .WithBrokers(new[] { "host:9092" })
        .AddProducer("my-topic-events", producer => producer
            .DefaultTopic("my-topic"))));

Producers tend to be simple, but there are some common concerns to address, like compression or serialization. Let’s explore that.

Custom Serialization/Deserialization in KafkaFlow

One of the attractive features of Apache Kafka is being agnostic of data formats. However, that transfers the responsibility to producers and consumers. Without a thoughtful approach, it may lead to many ways to achieve the same result across the system. That makes serialization an obvious use case to be handled by a client framework.

KafkaFlow has serializers available for JSON, Protobuf, and even Avro. Those can be used simply by adding them to the middleware configuration.

.AddProducer("my-topic-events", producer => producer
       .AddMiddlewares(middlewares => middlewares
             .AddSerializer<JsonCoreSerializer>()))

The list is not restricted to those three, thanks to the ability to use custom serializers/deserializers for messages. While Confluent’s .NET Kafka client already supports custom serialization/deserialization, KafkaFlow simplifies the process by providing a more elegant way to handle it.

As an example, to use a custom serializer, you would do something like this:

public class MySerializer : ISerializer
{
    public Task SerializeAsync(object message, Stream output, ISerializerContext context)
    {
        // Serialization logic here
    }

    public Task<object> DeserializeAsync(Stream input, Type type, ISerializerContext context)
    {
        // Deserialization logic here
    }
}

// Register the custom serializer when setting up the Kafka producer

.AddProducer("my-topic-events", producer => producer
       .AddMiddlewares(middlewares => middlewares
             .AddSerializer<MySerializer>()))

Message Handling in KafkaFlow

Consumers bring a ton of questions and possibilities. The first one is “How do you handle a message?”

Let’s start with the simplest way. With the advent of libraries like MediatR, which popularized the CQRS and Mediator patterns, .NET developers got used to decoupling message handlers from the request/message receiver. KafkaFlow brings that same principle to Kafka consumers.

KafkaFlow message handlers allow developers to define specific logic to process messages from a Kafka topic. KafkaFlow’s message handler structure is designed for better separation of concerns and cleaner, more maintainable code.

Here’s an example of a message handler:

public class MyMessageHandler : IMessageHandler<MyMessageType>
{
    public Task Handle(IMessageContext context, MyMessageType message)
    {
        // Message handling logic here.
    }
}

This handler can be registered in the consumer configuration:

.AddConsumer(consumer => consumer
       .AddMiddlewares(middlewares => middlewares
             .AddTypedHandlers(handlers => handlers
                   .AddHandler<MyMessageHandler>())))

With this approach, it’s easy to separate Consumers from Handlers, simplifying maintainability and testability.

This may look like unneeded complexity if you have a microservice handling one topic with only one message type. In that case, you can take advantage of middlewares.

Middleware in KafkaFlow

KafkaFlow is middleware-oriented. Maybe you noticed a reference to “Middlewares” in the Message Handler snippets, and you may be asking yourself what a middleware is.

Middlewares are what make Typed Handlers possible. Messages are delivered to a middleware pipeline, and the middlewares are invoked in sequence. You might be familiar with this concept if you have used MediatR pipelines. Middlewares can also be used to apply a series of transformations; in other words, a given middleware can transform the incoming message before passing it to the following middleware.

A Middleware in KafkaFlow encapsulates the logic for processing messages. The pipeline is extensible, allowing developers to add behavior to the message-processing pipeline.

Here’s an example of a middleware:

public class MyMiddleware : IMessageMiddleware
{
    public async Task Invoke(IMessageContext context, MiddlewareDelegate next)
    {
        // Pre-processing logic here.
        await next(context);
        // Post-processing logic here.
    }
}

To use this middleware, it can be registered in the consumer configuration:

.AddConsumer(consumer => consumer
       .AddMiddlewares(middlewares => middlewares
             .Add<MyMiddleware>()))

This way, developers can plug custom logic into the message-processing pipeline, gaining flexibility and control.

Typed Handlers are a form of Middleware. So, you can even handle a message without a Typed Handler, implementing your middleware, or you can take advantage of Middlewares to build a Message pipeline that performs validations, enrichment, etc., before handling that message.

Handling Concurrency in KafkaFlow

Once you start thinking about infrastructure efficiency, you will notice that many Kafka Consumers are underutilized. The most common implementation is single-threaded, which caps resource utilization. So, when you need to scale, you do it horizontally to keep the desired throughput.

KafkaFlow brings another option to achieve infrastructure efficiency. KafkaFlow gives developers control over how many messages can be processed concurrently by a single consumer. It uses the concept of Workers that can work together consuming a topic.
This functionality allows you to optimize your Kafka consumer to better match your system’s capabilities.

Here’s an example of how to set the number of concurrent workers for a consumer:

.AddConsumer(consumer => consumer
       .WithWorkersCount(10) // Set the number of workers.
       .AddMiddlewares(middlewares => middlewares
             .AddTypedHandlers(handlers => handlers
                   .AddHandler<MyMessageHandler>())))

KafkaFlow guarantees message order even with concurrent workers: messages with the same partition key are dispatched to the same worker.

Batch Processing

With scale, you will face the tradeoff between latency and throughput. To handle that tradeoff, KafkaFlow has an important feature called “Batch Consuming.” This feature addresses the need for efficiency and performance in consuming and processing messages from Kafka in a batch-wise manner. It plays an important role in use cases where a group of messages needs to be processed together rather than individually.

What Is Batch Consuming?

Batch consuming is an approach where, instead of processing messages one at a time as they come in, the system groups several messages together and processes them all at once. This method is more efficient for dealing with large amounts of data, particularly when messages are independent of each other, and performing operations as a batch leads to an increase in overall performance.

KafkaFlow’s Approach to Batch Consuming

KafkaFlow takes advantage of the system of Middlewares to provide batch processing. The Batch Processing Middleware lets you group messages according to batch size or timespan. Once one of those conditions is reached, the Middleware will forward the group of messages to the next middleware.

services.AddKafka(kafka => kafka
    .AddCluster(cluster => cluster
        .WithBrokers(new[] { "host:9092" })
        .AddConsumer(consumerBuilder => consumerBuilder
            .Topic("my-topic")
            .WithGroupId("my-group")
            .WithBufferSize(100)
            .WithWorkersCount(10)
            .AddMiddlewares(middlewares => middlewares
                .BatchConsume(100, TimeSpan.FromSeconds(10))))));

The Impact of Batch Consuming on Performance

With batch processing, developers can achieve higher throughput in their Kafka-based applications. It allows for faster processing as the overhead associated with initiating and finalizing each processing task is significantly reduced. This leads to an overall increase in system performance.

Also, this approach reduces network I/O operations as data is fetched in larger chunks, which can further improve processing speed, especially in systems where network latency is a concern.

Consumer Administration with KafkaFlow

KafkaFlow also simplifies administrative tasks related to managing Kafka consumers. You can start, stop, pause consumers, rewind offsets, and much more with KafkaFlow’s administrative API.

The Administration API can be used through a programming interface, a REST API, or a dashboard UI.


KafkaFlow administration Dashboard

Consumer Throttling

Often, underlying technologies may not be able to deal with high-load periods as well as Kafka consumers can, which can cause stability problems. That is where throttling comes in.

Consumer Throttling is an approach to managing the consumption of messages, enabling applications to dynamically fine-tune the rate at which they consume messages based on metrics.


Imagine you’re running an application where you want to segregate atomic and bulk actions into different consumers and topics. You may prefer to prioritize the processing of atomic actions over bulk actions. Traditionally, managing this differentiation could be challenging, given the potential discrepancies in the rate of message production.

Consumer Throttling is valuable in such instances, allowing you to monitor the consumer lag of the consumer responsible for atomic actions. Based on this metric, you can apply throttling to the consumer handling the bulk actions, ensuring that atomic actions are processed as a priority.

The result? An efficient, flexible, and optimized consumption process.

Adding throttling to a consumer is straightforward with a KafkaFlow fluent interface. Here’s a simple example:

.AddConsumer(consumer => consumer
        .AddMiddlewares(middlewares => middlewares
                .ThrottleConsumer(t => t
                        .ByOtherConsumersLag("atomic-actions-consumer")
                        .WithInterval(TimeSpan.FromSeconds(5))
                        .AddAction(a => a.AboveThreshold(10).ApplyDelay(100))
                        .AddAction(a => a.AboveThreshold(100).ApplyDelay(1_000))
                        .AddAction(a => a.AboveThreshold(1_000).ApplyDelay(10_000)))))

KafkaFlow: Looking Toward the Future

As of now, KafkaFlow provides a robust, developer-friendly abstraction over Kafka that simplifies building real-time data processing applications with .NET. However, like any active open-source project, it’s continually evolving and improving.

Given the project’s current trajectory, we might anticipate several developments. For instance, KafkaFlow could further enhance its middleware system, providing even more control and flexibility over message processing. We might also see more extensive administrative APIs, providing developers with even greater control over their Kafka clusters.

Because KafkaFlow is extensible by design, we can expect its community to grow, leading to more contributions, innovative features, extensions, and support. As more developers and organizations adopt KafkaFlow, we’re likely to see an increase in learning resources, tutorials, case studies, and other community-generated content that can help new users get started and existing users get more from the library.


KafkaFlow is a handy and developer-friendly tool that simplifies work with Kafka in .NET. It shines in the area of developer experience and usability. The framework design lends itself well to clean, readable code. With a clear separation of concerns through middlewares and message handlers, as well as abstractions over complex problems when building applications on top of Apache Kafka, KafkaFlow helps to keep your codebase manageable and understandable.

Furthermore, the community around KafkaFlow is growing. If you’re using Kafka and looking to improve productivity and reliability, KafkaFlow is certainly worth considering.

AI, ML, and Data Engineering InfoQ Trends Report – September 2023

Key Takeaways

  • Generative AI, powered by Large Language Models (LLMs) like GPT-3 and GPT-4, has gained significant prominence in the AI and ML industry, with widespread adoption driven by technologies like ChatGPT.
  • Major tech players such as Google and Meta have announced their own generative AI models, indicating the industry’s commitment to advancing these technologies.
  • Vector databases and embedding stores are gaining attention due to their role in enhancing observability in generative AI applications.
  • Responsible and ethical AI considerations are on the rise, with calls for stricter safety measures around large language models and an emphasis on improving the lives of all people through AI.
  • Modern data engineering is shifting towards decentralized and flexible approaches, with the emergence of concepts like Data Mesh, which advocates for federated data platforms partitioned across domains.

The InfoQ Trends Reports provide InfoQ readers with an opinionated high-level overview of the topics we believe architects and technical leaders should pay attention to. In addition, they also help the InfoQ editorial team focus on writing news and recruiting article authors to cover innovative technologies.

In this annual report, the InfoQ editors discuss the current state of AI, ML, and data engineering and what emerging trends you as a software engineer, architect, or data scientist should watch. We curate our discussions into a technology adoption curve with supporting commentary to help you understand how things are evolving.

In this year’s podcast, the InfoQ editorial team was joined by external panelist Sherin Thomas, software engineer at Chime. The following sections of the article summarize some of these trends and where different technologies fall on the technology adoption curve.

Generative AI

Generative AI, including Large Language Models (LLMs) like GPT-3 and GPT-4 and chatbots like ChatGPT, has become a major force in the AI and ML industry. These technologies have garnered significant attention, especially given the progress they have made over the last year. We have seen wide adoption of these technologies by users, driven in particular by ChatGPT. Multiple players, such as Google and Meta, have announced their own generative AI models.

The next step we expect is a larger focus on LLMOps to operate these large language models in an enterprise setting. We are divided in whether prompt engineering will be a large topic in the future or whether the adoption will be so widespread that everyone will be able to contribute to the prompts used.

Vector Databases and Embedding Stores

With the rise in LLM technology there’s a growing focus on vector databases and embedding stores. One intriguing application gaining traction is the use of sentence embeddings to enhance observability in generative AI applications.

The need for vector search databases arises from the limitations of large language models, which have a finite token history. Vector databases can store document summaries as feature vectors generated by these language models, potentially resulting in millions or more feature vectors. With traditional databases, finding relevant documents becomes challenging as the dataset grows. Vector search databases enable efficient similarity searches, allowing users to locate the nearest neighbors to a query vector, enhancing the search process.
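The core mechanic can be sketched in a few lines: rank stored embeddings by similarity to a query embedding. A toy illustration in plain Python (real vector databases use approximate nearest-neighbor indexes to do this over millions of vectors):

```python
import math

# Toy in-memory "vector store": document ids mapped to embedding vectors.
docs = {
    "doc-1": [0.9, 0.1, 0.0],
    "doc-2": [0.1, 0.8, 0.1],
    "doc-3": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=1):
    """Return the ids of the k stored vectors most similar to the query."""
    return sorted(store, key=lambda d: cosine(query, store[d]), reverse=True)[:k]

print(nearest([0.8, 0.2, 0.0], docs))  # ['doc-1']
```

The difference in a real system is scale: the linear scan above is replaced by an index structure so the nearest neighbors can be found without comparing the query to every stored vector.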

A notable trend is the surge in funding for these technologies, signaling investor recognition of their significance. However, adoption among developers has been slower, but it’s expected to pick up in the coming years. Vector search databases like Pinecone, Milvus, and open-source solutions like Chroma are gaining attention. The choice of database depends on the specific application and the nature of the data being searched.

In various fields, including Earth observation, vector databases have demonstrated their potential. NASA, for instance, leveraged self-supervised learning and vector search technology to analyze satellite images of Earth, aiding scientists in tracking weather phenomena such as hurricanes over time.

Robotics and Drone Technologies

The cost of robots is going down. In the past, legged balancing robots were hard to acquire, but some models are already available for around $1,500. This allows more users to apply robot technologies in their applications. The Robot Operating System (ROS) is still the leading software framework in this field, but companies like VIAM are also developing middleware solutions that make it easier to integrate and configure plugins for robotics development.

We expect that advances in unsupervised learning and foundational models will translate into improved capabilities, for example, by integrating a large language model into the path-planning part of a robot to enable planning using natural language.

Responsible and Ethical AI

As AI starts to affect all of humanity, there is a growing interest in responsible and ethical AI. People are simultaneously calling for stricter safety measures around large language models and expressing frustration when the models’ output reminds them of the safeguards in place.

It remains important for engineers to keep in mind that AI should improve the lives of all people, not just a select few. We expect AI regulation to have an impact similar to the one GDPR had a few years ago.

We have seen some AI initiatives fail because of bad data. Data discovery, data operations, data lineage, labeling, and good model development practices are going to take center stage, because data is crucial to explainability.

Data Engineering

The state of modern data engineering is marked by a dynamic shift towards more decentralized and flexible approaches to manage the ever-growing volumes of data. Data Mesh, a novel concept, has emerged to address the challenges posed by centralized data management teams becoming bottlenecks in data operations. It advocates for a federated data platform partitioned across domains, where data is treated as a product. This allows domain owners to have ownership and control over their data products, reducing the reliance on central teams. While promising, Data Mesh adoption may face hurdles related to expertise, necessitating advanced tooling and infrastructure for self-service capabilities.

Data observability has become paramount in data engineering, analogous to system observability in application architectures. Observability is essential at all layers, including data observability, especially in the context of machine learning. Trust in data is pivotal for AI success, and data observability solutions are crucial for monitoring data quality, model drift, and exploratory data analysis to ensure reliable machine learning outcomes. This paradigm shift in data management and the integration of observability across the data and ML pipelines reflect the evolving landscape of data engineering in the modern era.

Explaining the updates to the curve

With this trends report also comes an updated graph showing what we believe the state of certain technologies is. The categories are based on the book “Crossing the Chasm” by Geoffrey Moore. At InfoQ we mostly focus on categories that have not yet crossed the chasm.

One notable upgrade from innovators to early adopters is “AI Coding Assistants”. Although they were very new last year and hardly used, we see more and more companies offering them as a service to their employees to make them more efficient. They are not a default part of every stack, and we are still discovering how to use them most effectively, but we believe that adoption will continue to grow.

Something we believe is crossing the chasm right now is natural language processing. This will not come as a surprise, as many companies are currently trying to figure out how to adopt generative AI capabilities in their product offerings following the massive success of ChatGPT. We thus decided to move it across the chasm into the early majority category. There is still a lot of potential for growth here, and time will teach us what the best practices and capabilities of this technology are.

There are some notable categories that did not move at all: synthetic data generation, brain-computer interfaces, and robotics. All of these seem consistently stuck in the innovators category. The most promising is synthetic data generation, which has lately been getting more attention with the GenAI hype. We see more and more companies talking about generating more of their training data, but we have not seen enough applications actually using it in their stack to warrant moving it to the early adopters category. Robotics has been getting a lot of attention for multiple years now, but its adoption rate is still too low for us to warrant a move.

We also introduced several new categories to the graph. A notable one is vector search databases, which come as a byproduct of the GenAI hype: as we gain a better understanding of how to represent concepts as vectors, the need to efficiently store and retrieve those vectors grows. We also added explainable AI to the innovators category. We believe that computers explaining why they made a certain decision will be vital for widespread adoption, to combat hallucinations and other dangers. However, we currently don’t see enough work in the industry to warrant a higher category.


Conclusion

The field of AI, ML, and Data Engineering keeps growing year over year. There is still a lot of growth ahead in both the technological capabilities and the possible applications. It’s exciting for us editors at InfoQ to be so close to the progress, and we are looking forward to making the same report next year. In the podcast we make several predictions for the coming year, ranging from “there will be no AGI” to “Autonomous Agents will be a thing”. We hope you enjoyed listening to the podcast and reading this article, and we would love to see your predictions and comments below.

Managing the Carbon Emissions Associated with Generative AI

Key Takeaways

  • There’s an increasing concern around carbon emissions as generative AI becomes more integrated in our everyday lives
  • The comparisons of carbon emissions between generative AI and the commercial aviation industry are misleading
  • Organizations should incorporate best practices to mitigate emissions specific to generative AI. Transparency requirements could be crucial to both training and using AI models
  • Improving energy efficiency in AI models is valuable not only for sustainability but also for improving capabilities and reducing costs
  • Prompt engineering becomes key to reducing the computational resources, and thus the carbon emitted, when using generative AI. Prompts that generate shorter outputs use less computation, which leads to a new practice: “green prompt engineering”


Recent developments in generative AI are transforming our industry and our broader society. Language models like ChatGPT and Copilot are drafting letters and writing code, image and video generation models can create compelling content from a simple prompt, and music and voice models allow easy synthesis of speech in anyone’s voice and the creation of sophisticated music.

Conversations on the power and potential value of this technology are happening around the world. At the same time, people are talking about risks and threats.

From extremist worries about superintelligent AI wiping out humanity, to more grounded concerns about the further automation of discrimination and the amplification of hate and misinformation, people are grappling with how to assess and mitigate the potential negative consequences of this new technology.

People are also increasingly concerned about the energy use and corresponding carbon emissions of these models. Dramatic comparisons have resurfaced in recent months.

One article, for example, equates the carbon emissions of training GPT-3 to driving to the moon and back; another, meanwhile, explains that training an AI model emits massively more carbon than a long-distance flight.

The ultimate impact will depend on how this technology is used and to what degree it is integrated into our lives.

It is difficult to anticipate exactly how it will impact our day to day, but one current example, the search giants integrating generative AI into their products, is fairly clear.

As per a recent Wired article:

Martin Bouchard, cofounder of Canadian data center company QScale, believes that, based on his reading of Microsoft and Google’s plans for search, adding generative AI to the process will require “at least four or five times more computing per search” at a minimum.

It’s clear that generative AI is not to be ignored.

Are carbon emissions of generative AI overhyped?

However, the concerns about the carbon emissions of generative AI may be overhyped. It’s important to put things in perspective: the entire global tech sector accounts for 1.8% to 3.9% of global greenhouse-gas emissions but only a fraction of those emissions are caused by AI[1]. Dramatic comparisons between AI and aviation or other sources of carbon are creating confusion from differences in scale: while there are many cars and aircraft traveling millions of kilometers every day, training a modern AI model like the GPT models is something that only happens a relatively small number of times.

Admittedly, it’s unclear exactly how many large AI models have been trained. Ultimately, that depends on how we define “large AI model.” However, if we consider models at the scale of GPT-3 or larger, it is clear that there have been fewer than 1,000 such models trained. To do a little math:


A recent estimate suggests that training GPT-3 emitted 500 metric tons of CO2. Meta’s LLaMA model was estimated to emit 173 tons. Training 1,000 500-ton models would involve a total emission of about 500,000 metric tons of CO2. Newer models may increase the emissions somewhat, but the 1,000 models is almost certainly an overestimate and so accounts for this. The commercial aviation industry emitted about 920,000,000 metric tons of CO2 in 2019[2], almost 2,000 times as much as LLM training, and keep in mind that this compares one year of aviation to multiple years of LLM training. The training of LLMs is still not negligible, but the dramatic comparisons are misleading. More nuanced thinking is needed.
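The back-of-the-envelope math above can be reproduced in a few lines. The figures below are the estimates quoted in this article, not measured values:

```python
# Rough comparison of total LLM training emissions vs. commercial aviation,
# using the estimates quoted in the text (metric tons of CO2).
GPT3_TRAINING_TONS = 500          # one estimate for training GPT-3
ASSUMED_LARGE_MODELS = 1_000      # deliberate overestimate of GPT-3-scale models
AVIATION_2019_TONS = 920_000_000  # commercial aviation in 2019

llm_total = GPT3_TRAINING_TONS * ASSUMED_LARGE_MODELS
ratio = AVIATION_2019_TONS / llm_total

print(f"All large-model training: ~{llm_total:,} t CO2")
print(f"One year of aviation is ~{ratio:,.0f}x larger")  # roughly 1,800x, i.e. almost 2,000x
```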

This, of course, is only considering the training of such models. The serving and use of the models also requires energy and has associated emissions. Based on one analysis, ChatGPT might emit about 15,000 metric tons of CO2 to operate for a year. Another analysis suggests much less at about 1,400 metric tons. Not negligible, but still nothing compared to aviation.

Emissions transparency is needed

But even if the concerns about the emissions of AI are somewhat overhyped, they still merit attention, especially as generative AI becomes integrated into more and more of our modern life. As AI systems continue to be developed and adopted, we need to pay attention to their environmental impact. There are many well-established practices that should be leveraged, and also some ways to mitigate emissions that are specific to generative AI.

Firstly, transparency is crucial. We recommend transparency requirements that allow for monitoring of the carbon emissions related to both training and use of AI models. This will allow those deploying these models, as well as end users, to make informed decisions about their use of AI based on its emissions, and to incorporate AI-related emissions into their greenhouse gas inventories and net zero targets. This is one component of holistic AI transparency.

As an example of how such requirements might work, France has recently passed a law mandating telecommunications companies to provide transparency reporting around their sustainability efforts. A similar law could require products incorporating AI systems to report carbon emissions to their customers and also for model providers to integrate carbon emissions data into their APIs.

Greater transparency can lead to stronger incentives to build energy-efficient generative AI systems, and there are many ways to increase efficiency. In another recent InfoQ article, Sara Bergman, Senior Software Engineer at Microsoft, encourages people to consider the entire lifecycle of an AI system and provides advice on applying the tools and practices from the Green Software Foundation to making AI systems more energy efficient, including careful selection of server hardware and architecture, as well as time and region shifting to find less carbon-intensive electricity. But generative AI presents some unique opportunities for efficiency improvements.

Efficiency: Energy use and model performance

As explored in Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning, the carbon emissions associated with training or using a generative AI model depend on many factors, including:

  • Number of model parameters
  • Quantization (numeric precision)
  • Model architecture
  • Efficiency of GPUs or other hardware used
  • Carbon-intensity of electricity used

The last two factors are relevant for any software and are well explored by others, such as in the InfoQ article mentioned above. We will therefore focus on the first three factors here, all of which involve a tradeoff between energy use and model performance.

It’s worth noting that efficiency is valuable not only for sustainability concerns. More efficient models can improve capabilities in situations where less data is available, decrease costs, and unlock the possibility of running on edge devices.

Number of model parameters

As shown in this figure from OpenAI’s paper, “Language Models are Few-Shot Learners“, larger models tend to perform better.

This is also a point made in Emergent Abilities of Large Language Models:

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models.

We see that not only do larger models do better at a given task, but there are actually entirely new capabilities that emerge only as models get large.  Examples of such emergent capabilities include adding and subtracting large numbers, toxicity classification, and chain of thought techniques for math word problems.

But training and using larger models requires more computation and thus more energy. We therefore see a tradeoff between the capabilities and performance of a model and its computational, and thus carbon, intensity.


Quantization

There has been significant research into the quantization of models, where lower-precision numbers are used in model computations, reducing computational intensity at the expense of some accuracy. Quantization has typically been applied to allow models to run on more modest hardware, for example, enabling LLMs to run on a consumer-grade laptop. The tradeoff between decreased computation and decreased accuracy is often very favorable, making quantized models extremely energy-efficient for a given level of capability. There are related techniques, such as “distillation“, that use a larger model to train a smaller model that can perform extremely well for a given task.

Distillation technically requires training two models, so it could well increase the carbon emissions related to model training; however it should compensate for this by decreasing the model’s in-use emissions. Distillation of an existing already-trained model can also be a good solution. It’s even possible to leverage both distillation and quantization together to create a more efficient model for a given task.
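As an illustrative sketch (not how production quantization libraries work), symmetric int8 quantization maps floating-point weights onto integer codes in [-127, 127], trading a small reconstruction error for a representation a quarter the size of 32-bit floats:

```python
# Minimal symmetric int8 quantization sketch: floats -> int8 codes -> floats.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127  # map the largest weight to 127
    q = [round(w / scale) for w in weights]     # integer codes in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)

# Each restored weight is close to the original, at a quarter of the storage.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, round(max_err, 6))
```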

Model Architecture

Model architecture can have an enormous impact on computational intensity, so choosing a simpler model can be the most effective way to decrease the carbon emissions of an AI system. While GPT-style transformers are very powerful, simpler architectures can be effective for many applications. Models like ChatGPT are considered “general-purpose”, meaning they can be used for many different applications; when the application is fixed, such a complex model may be unnecessary. A custom model for the task may achieve adequate performance with a much simpler and smaller architecture, decreasing carbon emissions. Another useful approach is fine-tuning: the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning discusses how fine-tuning “offers better accuracy as well as dramatically lower computational costs”.

Putting carbon and accuracy metrics on the same level

The term “accuracy” easily feeds into a “more is better” mentality. To address this, it is critical to understand the requirements for the given application – “enough is enough”. In some cases, the latest and greatest model may be needed, but for other applications, older, smaller, possibly quantized models might be perfectly adequate. In some cases, correct behavior may be required for all possible inputs, while other applications may be more fault tolerant. Once the application and level of service required is properly understood, an appropriate model can be selected by comparing performance and carbon metrics across the options. There may also be cases in which a suite of models can be leveraged. Requests can, by default, be passed to simpler, smaller models, but in cases in which the task can’t be handled by the simple model, it can be passed off to a more sophisticated model.
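The suite-of-models idea can be sketched as a simple cascade. `small_model` and `large_model` below are stand-in functions, not a real API; the point is the routing logic:

```python
# Sketch of a two-tier model cascade: try the cheap model first,
# escalate to the expensive one only when it is not confident.
def small_model(prompt):
    # Stand-in: pretends to handle short prompts, abstains otherwise.
    if len(prompt.split()) <= 5:
        return ("small answer", 0.9)   # (answer, confidence)
    return ("", 0.2)

def large_model(prompt):
    # Stand-in for a larger, more capable, more carbon-intensive model.
    return ("large answer", 0.99)

def answer(prompt, threshold=0.8):
    result, confidence = small_model(prompt)
    if confidence >= threshold:
        return result                  # cheap path: most requests stop here
    return large_model(prompt)[0]      # costly model only when needed

print(answer("capital of France?"))    # handled by the small model
print(answer("write a detailed essay about the history of aviation"))  # escalates
```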

Here, integrating carbon metrics into DevOps (or MLOps) processes is important. Tools like codecarbon make it easy to track and account for the carbon emissions associated with training and serving a model. Integrating this or a similar tool into continuous integration test suites allows carbon, accuracy, and other metrics to be analyzed in concert. For example, while experimenting with model architecture, tests can immediately report both accuracy and carbon, making it easier to find the right architecture and choose the right hyperparameters to meet accuracy requirements while minimizing carbon emissions.
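As a sketch of what such an integration might look like, the CI check below gates a training run on both an accuracy floor and a carbon budget. `train_and_measure` and its numbers are hypothetical; in practice, a tool such as codecarbon would supply the emissions figure:

```python
# Sketch of a CI check that gates on both accuracy and a carbon budget.
def train_and_measure(architecture):
    # Hypothetical per-architecture results: (accuracy, kg CO2eq per run).
    results = {"small": (0.91, 0.4), "large": (0.93, 3.2)}
    return results[architecture]

ACCURACY_FLOOR = 0.90   # minimum acceptable accuracy for this application
CARBON_BUDGET = 1.0     # kg CO2eq allowed per training run

def passes(architecture):
    accuracy, carbon = train_and_measure(architecture)
    return accuracy >= ACCURACY_FLOOR and carbon <= CARBON_BUDGET

print(passes("small"))  # True: meets the floor within budget
print(passes("large"))  # False: slightly better accuracy, 8x the carbon
```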

It’s also important to remember that experimentation itself will result in carbon emissions. In the experimentation phase of the MLOps cycle, experiments are performed with different model families and architectures to determine the best option, which can be considered in terms of accuracy, carbon and, potentially, other metrics. This can save carbon in the long run as the model continues to be trained with real-time data and/or is put into production, but excessive experimentation can waste time and energy. The appropriate balance will vary depending on many factors, but this can be easily analyzed when carbon metrics are available for running experiments as well as production training and serving of the model.

Green prompt engineering

When it comes to carbon emissions associated with the serving and use of a generative model, prompt engineering becomes very important as well. For most generative AI models — like  GPT — the computational resources used, and thus carbon emitted, depend on the number of tokens passed to and generated by the model.

While the exact details depend on the implementation, prompts are generally passed into transformer models “all at once”, which might make it seem that the amount of computation doesn’t depend on the prompt’s length. However, because the self-attention mechanism scales quadratically with input length, it is reasonable to expect implementations to avoid computation for unused portions of the input, meaning that shorter prompts save computation and thus energy.
For the output, it is clear that the computational cost is proportional to the number of tokens produced, as the model needs to be “run again” for each token generated.

This is reflected in the pricing structure for OpenAI’s API access to GPT4. At the time of writing, the costs for the base GPT4 model are $0.03/1k prompt tokens and $0.06/1k sampled tokens. The prompt length and length of the output in tokens are both incorporated into the price, reflecting the fact that both influence the amount of computation that is required.

So, shorter prompts and prompts that will generate shorter outputs will use less computation. This suggests a new process of “green prompt engineering”. With proper support for experimentation in an MLOps platform, it becomes relatively easy to experiment with shortening prompts while continuously evaluating the impact of both carbon and system performance.
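As a rough sketch of this tradeoff, the estimate below uses the GPT-4 prices quoted above, with a naive word count standing in for a real tokenizer:

```python
# Rough per-request cost estimate from prompt and output length.
# Prices are the GPT-4 figures quoted in the text; the whitespace word
# count is a crude stand-in for an actual tokenizer.
PROMPT_PRICE = 0.03 / 1000    # $ per prompt token
OUTPUT_PRICE = 0.06 / 1000    # $ per sampled (output) token

def estimate_cost(prompt, expected_output_tokens):
    prompt_tokens = len(prompt.split())   # crude approximation
    return prompt_tokens * PROMPT_PRICE + expected_output_tokens * OUTPUT_PRICE

verbose = estimate_cost("Please write a long, detailed, multi-paragraph summary", 800)
concise = estimate_cost("Summarize in two sentences", 60)

# The shorter prompt and shorter requested output cost (and compute) far less.
print(f"verbose: ${verbose:.4f}, concise: ${concise:.4f}")
```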

Beyond single prompts, there are also interesting approaches being developed to improve efficiency for more complex uses of LLMs, as in this paper.


Conclusion

Although possibly overhyped, the carbon emissions of AI are still a concern and should be managed with appropriate best practices. Transparency is needed to support effective decision-making and consumer awareness, and integrating carbon metrics into MLOps workflows can support smart choices about model architecture, size, and quantization, as well as effective green prompt engineering. This article is only an overview and just scratches the surface; for those who truly want to do green generative AI, I encourage you to follow the latest research.



Streamlining Code with Unnamed Patterns/Variables: A Comparative Study of Java, Kotlin, and Scala

Key Takeaways

  • Java’s JEP 443: Enhances code readability by allowing the omission of unnecessary components in pattern matching and unused variables.
  • Kotlin’s underscore for unused parameters: Simplifies code by denoting unused parameters in functions, lambdas, or destructuring declarations.
  • Scala’s underscore: Serves as a wildcard and a way to ignore unused variables and perform conversions, improving code conciseness.
  • Underscore as Syntactic Sugar: A common feature in many languages, including Java, Kotlin, and Scala, that simplifies code.
  • Enhanced Code Readability and Maintainability: The underscore character improves code readability and maintainability.
  • Future Language Evolution: Expect further enhancements and innovative uses of the underscore as languages evolve.

In the world of programming, the underscore (`_`) is a character with a wide range of uses. It’s often referred to as syntactic sugar, as it simplifies the code and makes it more concise.

This article will explore the use of underscores in three popular programming languages: Java, Kotlin, and Scala.

Java: Unnamed Patterns and Variables with JEP 443

Java, the ever-evolving language, has taken another significant step towards enhancing its code readability and maintainability with the introduction of JEP 443. This proposal, titled “Unnamed Patterns and Variables (Preview),” has been targeted for JDK 21.

The JEP aims to enhance the language with unnamed patterns, which match a record component without stating the component’s name or type, and unnamed variables, which can be declared but not used.

Both are denoted by the underscore character (`_`), as in `r instanceof ColoredPoint(Point p, _)`.

Unnamed Patterns

Unnamed patterns are designed to streamline data processing, particularly when working with record classes. They allow developers to omit the type and name of a record component in pattern matching, which can significantly improve code readability.

Consider the following code snippet:

if (r instanceof ColoredPoint(Point p, Color c)) {
    // ...
}

If the Color c component is not needed in the if block, it can be laborious and unclear to include it in the pattern. With JEP 443, developers can simply omit unnecessary components, resulting in cleaner, more readable code:

if (r instanceof ColoredPoint(Point p, _)) {
    // ...
}

This feature is particularly useful in nested pattern-matching scenarios where only some components of a record class are required. For example, consider a record class ColoredPoint that contains a Point and a Color. If you only need the x coordinate of the Point, you can use an unnamed pattern to omit the y and Color components:

if (r instanceof ColoredPoint(Point(int x, _), _)) {
    // ...
}

Unnamed Variables

Unnamed variables are useful when a variable must be declared, but its value is not used. This is common in loops, try-with-resources statements, catch blocks, and lambda expressions.

For instance, consider the following loop:

for (Order order : orders) {
    if (total < limit) total++;
}

In this case, the order variable is not used within the loop. With JEP 443, developers can replace the unused variable with an underscore, making the code more concise and clear:

for (Order _ : orders) {
    if (total < limit) total++;
}

Unnamed variables can also be beneficial in switch statements where the same action is executed for multiple cases, and the variables are not used. For example:

switch (b) {
    case Box(RedBall _), Box(BlueBall _) -> processBox(b);
    case Box(GreenBall _) -> stopProcessing();
    case Box(_) -> pickAnotherBox();
}

In this example, the first two cases use unnamed pattern variables because their right-hand sides do not use the box’s component. The third case uses the unnamed pattern to match any remaining box, including one whose component is null.

Enabling the Preview Feature

Unnamed patterns and variables are a preview feature and are disabled by default. To compile code that uses them, developers must enable preview features, as shown in the following command:

javac --release 21 --enable-preview Main.java

The same flag is also required to run the program:

java --enable-preview Main

However, one can directly run this using the source code launcher. In that case, the command line would be:

java --source 21 --enable-preview Main.java

The jshell option is also available but requires enabling the preview feature as well:

jshell --enable-preview

Kotlin: Underscore for Unused Parameters

In Kotlin, the underscore character (_) is used to denote unused parameters in a function, lambda, or destructuring declaration. This feature allows developers to omit names for such parameters, leading to cleaner and more concise code.

If a lambda parameter is unused, Kotlin developers can place an underscore instead of its name. This is particularly useful when working with functions that take a lambda with multiple parameters, only some of which are needed.

Consider the following Kotlin code snippet:

mapOf(1 to "one", 2 to "two", 3 to "three")
   .forEach { (_, value) -> println("$value!") }

In this example, the forEach function requires a lambda that takes two parameters: a key and a value. However, we’re only interested in the value, so we replace the key parameter with an underscore.

Let’s consider another code snippet:

var name: String by Delegates.observable("no name") {
    kProperty, oldValue, newValue -> println("$oldValue")
}

In this instance, if the kProperty and newValue parameters are not used within the lambda, including them can be laborious and unclear. With the underscore feature, developers can simply replace the unused parameters with underscores:

var name: String by Delegates.observable("no name") {
    _, oldValue, _ -> println("$oldValue")
}

This feature is also useful in destructuring declarations where you want to skip some of the components:

val (w, x, y, z) = listOf(1, 2, 3, 4)
print(x + z) // 'w' and 'y' remain unused

With the underscore feature, developers can replace the unused components with underscores:

val (_, x, _, z) = listOf(1, 2, 3, 4)
print(x + z)

This feature is not unique to Kotlin. Other languages, like Haskell, use the underscore character as a wildcard in pattern matching, and C# supports `_` as a discard. Similar semantics may be applied in future versions of Java.

Scala: The Versatility of Underscore

In Scala, the underscore (`_`) is a versatile character with a wide range of uses. However, this can sometimes lead to confusion and increase the learning curve for new Scala developers. In this section, we’ll explore the different and most common usages of underscores in Scala.

Pattern Matching and Wildcards

The underscore is widely used as a wildcard and in matching unknown patterns. This is perhaps the first usage of underscore that Scala developers would encounter.

Module Import

We use underscore when importing packages to indicate that all or some members of the module should be imported:

// imports all the members of the package junit. (equivalent to wildcard import in java using *)
import org.junit._

// imports all the members of junit except Before.
import org.junit.{Before => _, _}

// imports all the members of junit but renames Before to B4.
import org.junit.{Before => B4, _}

Existential Types

The underscore is also used as a wildcard to match all types in type constructors such as List, Array, Seq, Option, or Vector.

// Using underscore in List
val list: List[_] = List(1, "two", true)

// Using underscore in Array
val array: Array[_] = Array(1, "two", true)
println(array.mkString("Array(", ", ", ")"))

// Using underscore in Seq
val seq: Seq[_] = Seq(1, "two", true)

// Using underscore in Option
val opt: Option[_] = Some("Hello")

// Using underscore in Vector
val vector: Vector[_] = Vector(1, "two", true)

With `_`, we allow elements of any type in these collections.


Match Expressions

Using the match keyword, developers can use the underscore to catch all possible cases not handled by any of the defined cases. For example, given an item price, the decision to buy or sell the item is made based on certain special prices. If the price is 130, the item is to be bought; if it’s 150, it is to be sold. For any other price, approval needs to be obtained:

def itemTransaction(price: Double): String = {
 price match {
   case 130 => "Buy"
   case 150 => "Sell"

   // if price is neither 130 nor 150, this case is executed
   case _ => "Need approval"
 }
}

println(itemTransaction(130)) // Buy
println(itemTransaction(150)) // Sell
println(itemTransaction(70))  // Need approval
println(itemTransaction(400)) // Need approval

Ignoring Things

The underscore can ignore variables and types not used anywhere in the code.

Ignored Parameter

For example, in function execution, developers can use the underscore to hide unused parameters:

val ints = (1 to 4).map(_ => "Hello")
println(ints) // Vector(Hello, Hello, Hello, Hello)

Developers can also use the underscore to access nested collections:

val books = Seq(("Moby Dick", "Melville", 1851), ("The Great Gatsby", "Fitzgerald", 1925), ("1984", "Orwell", 1949), ("Brave New World", "Huxley", 1932))

val recentBooks = books
 .filter(_._3 > 1900)  // keep only books published after 1900
 .filter(_._2.startsWith("F"))  // keep only books whose author's name starts with 'F'
 .map(_._1)  // return only the first element of the tuple; the book title

println(recentBooks) // List(The Great Gatsby)

In this example, the underscore is used to refer to the elements of the tuples in the list. The filter function selects only the books that satisfy the given conditions, and then the map function transforms the tuples to just their first element (book title). The result is a sequence with book titles that meet the criteria.

Ignored Variable

When a developer encounters details that aren’t necessary or relevant, they can utilize the underscore to ignore them.

For example, a developer wants only the first element in a split string:

val text = "a,b"
val Array(a, _) = text.split(",")

The same principle applies if a developer only wants to consider the second element in a construct.

val Array(_, b) = text.split(",")

The principle can indeed be extended to more than two entries. For instance, consider the following example:

val text = "a,b,c,d,e"
val Array(a, _*) = text.split(",")

In this example, a developer splits the text into an array of elements but is only interested in the first element, 'a'. The underscore together with an asterisk (`_*`) ignores the rest of the entries in the array, focusing only on the required element.

The underscore can also be used to ignore arbitrary elements in the middle:

val text = "a,b,c,d,e"
val Array(a, b, _, d, e) = text.split(",")

Variable Initialization to Its Default Value

When a specific initial value is not needed, you can use the underscore to initialize a member variable to its default value:

var x: String = _
x = "real value"
println(x) // real value

However, this doesn’t work for local variables; local variables must be initialized.


Conversions

The underscore can also be used in several kinds of conversions.

Function Reassignment (Eta expansion)

With the underscore, a method can be converted to a function. This can be useful to pass around a function as a first-class value.

def greet(prefix: String, name: String): String = s"$prefix $name"

// Eta expansion to turn greet into a function
val greeting = greet _

println(greeting("Hello", "John"))

Variable Argument Sequence

A sequence can be converted to variable arguments using `seqName: _*` (a special instance of type ascription).

def multiply(numbers: Int*): Int = {
 numbers.reduce(_ * _)
}

val factors = Seq(2, 3, 4)
val product = multiply(factors: _*)
// Convert the Seq factors to varargs using factors: _*

println(product) // Should print: 24

Partially-Applied Function

By providing only a portion of the required arguments in a function and leaving the remainder to be passed later, a developer can create what’s known as a partially-applied function. The underscore substitutes for the parameters that have not yet been provided.

def sum(x: Int, y: Int): Int = x + y
val sumToTen = sum(10, _: Int)
val sumFiveAndTen = sumToTen(5)

println(sumFiveAndTen) // 15

The use of underscores in a partially-applied function can also be grouped as ignoring things. A developer can ignore entire groups of parameters in functions with multiple parameter groups, creating a special kind of partially-applied function:

def bar(x: Int, y: Int)(z: String, a: String)(b: Float, c: Float): Int = x
val foo = bar(1, 2) _

println(foo("Some string", "Another string")(3f / 5, 6f / 5)) // 1

Assignment Operators (Setters overriding)

Overriding the default setter can be considered a kind of conversion using the underscore:

class User {
 private var pass = ""
 def password = pass
 def password_=(str: String): Unit = {
   require(str.nonEmpty, "Password cannot be empty")
   require(str.length >= 6, "Password length must be at least 6 characters")
   pass = str
 }
}

val user = new User
user.password = "Secr3tC0de"
println(user.password) // should print: "Secr3tC0de"

try {
 user.password = "123" // will fail because it's less than 6 characters
 println("Password should be at least 6 characters")
} catch {
 case _: IllegalArgumentException => println("Invalid password")
}

Higher-Kinded Type

A higher-kinded type is one that abstracts over a type that, in turn, abstracts over another type; in this way, Scala can generalize across type constructors. It is quite similar to the existential type. Higher-kinded types can be defined using the underscore:

trait Wrapper[F[_]] {
 def wrap[A](value: A): F[A]
}

object OptionWrapper extends Wrapper[Option] {
 override def wrap[A](value: A): Option[A] = Option(value)
}

val wrappedInt = OptionWrapper.wrap(5)

val wrappedString = OptionWrapper.wrap("Hello")

In the above example, Wrapper is a trait with a higher-kinded type parameter F[_]. It provides a method wrap that wraps a value into the given type constructor. OptionWrapper is an object extending this trait for the Option type. The underscore in F[_] stands for any type, making Wrapper generic over type constructors such as Option.

These are some examples of how the underscore makes Scala a powerful tool for simplifying and improving the readability of your code. It’s a feature that aligns well with Scala’s philosophy of being a concise and expressive language that promotes readable and maintainable code.


Conclusion

The introduction of unnamed patterns and variables in Java through JEP 443 marks a significant milestone in the language’s evolution. This feature, which allows developers to streamline their code by omitting unnecessary components and replacing unused variables, brings Java closer to the expressiveness and versatility of languages like Kotlin and Scala.

However, it’s important to note that while this is a substantial step forward, Java’s journey in this area is still incomplete. Languages like Kotlin and Scala have long embraced similar concepts, using them in various ways to enhance code readability, maintainability, and conciseness. These languages have demonstrated the power of such concepts in making code more efficient and easier to understand.

In comparison, Java’s current use of unnamed patterns and variables, although beneficial, is still somewhat limited. The potential for Java to further leverage these concepts is vast. Future updates to the language could incorporate more advanced uses of unnamed patterns and variables, drawing inspiration from how these concepts are utilized in languages like Kotlin and Scala.

Nonetheless, adopting unnamed patterns and variables in Java is a significant step towards enhancing the language’s expressiveness and readability. As Java continues to evolve and grow, we expect to see further innovative uses of these concepts, leading to more efficient and maintainable code. The journey is ongoing, and it’s an exciting time to be a part of the Java community.

Happy coding!

Leveraging Eclipse JNoSQL 1.0.0: Quarkus Integration and Building a Pet-Friendly REST API

Key Takeaways

  • Eclipse JNoSQL leverages the Jakarta EE standard specifications, specifically Jakarta NoSQL and Jakarta Data, to ensure compatibility with various NoSQL database vendors and promote interoperability.
  • Eclipse JNoSQL seamlessly integrates with the Quarkus framework, enabling developers to build cloud-native applications with the benefits of both frameworks, such as rapid development, scalability, and resilience.
  • With Eclipse JNoSQL, developers can simplify the integration process, communicate seamlessly with diverse NoSQL databases, and future-proof their applications by easily adapting to changing database requirements.
  • By embracing Eclipse JNoSQL, developers can unlock the power of NoSQL databases while maintaining a familiar programming syntax, enabling efficient and effective data management in modern application development.
  • Eclipse JNoSQL 1.0.0 marks a significant milestone in the evolution of NoSQL database integration, providing developers with comprehensive tools and features to streamline their data management processes.
  • The release of Eclipse JNoSQL empowers developers to leverage the benefits of NoSQL databases, including scalability, flexibility, and performance, while ensuring compatibility and ease of use through standardized specifications.

In today’s data-driven world, the ability to seamlessly integrate and manage data from diverse sources is crucial for the success of modern applications. Eclipse JNoSQL, with its latest release of version 1.0.0, presents a comprehensive solution that simplifies the integration of NoSQL databases. This article explores the exciting new features and enhancements introduced in Eclipse JNoSQL 1.0.0, highlighting its significance in empowering developers to efficiently harness the power of NoSQL databases. From advanced querying capabilities to seamless compatibility with the Quarkus framework, Eclipse JNoSQL opens up new possibilities for streamlined and future-proof data management.

Polyglot persistence refers to utilizing multiple database technologies to store different data types within a single application. It recognizes that different data models and storage technologies are better suited to specific use cases. In modern enterprise applications, polyglot persistence is crucial for several reasons.

Firstly, it allows enterprises to leverage the strengths of various database technologies, including NoSQL databases, to efficiently handle different data requirements. NoSQL databases excel at managing unstructured, semi-structured, and highly scalable data, making them ideal for scenarios like real-time analytics, content management systems, or IoT applications.

By adopting polyglot persistence, enterprises can select the most suitable database technology for each data model, optimizing performance, scalability, and flexibility. For example, a social media platform may store user profiles and relationships in a graph database while utilizing a document database for managing user-generated content.

Eclipse JNoSQL, an open-source framework, simplifies the integration of NoSQL databases within Jakarta EE applications. It provides a unified API and toolset, abstracting the complexities of working with different NoSQL databases and facilitating seamless development and maintenance.

Eclipse JNoSQL is a compatible implementation of Jakarta NoSQL, a specification defining a standard API for interacting with various NoSQL databases in a Jakarta EE environment. By embracing Jakarta NoSQL, developers can leverage Eclipse JNoSQL to seamlessly integrate different NoSQL databases into their Jakarta EE applications, ensuring vendor independence and flexibility.

Why Eclipse JNoSQL?

Eclipse JNoSQL serves as a Java solution for seamless integration between Java and NoSQL databases, specifically catering to the needs of enterprise applications. It achieves this by providing a unified API and utilizing the specifications based on four different NoSQL database types: key-value, column family, document, and graph.

[Click on the image to view full-size]

Using Eclipse JNoSQL, developers can leverage the same mapping annotations, such as @Entity, @Id, and @Column, regardless of the underlying NoSQL database. This approach allows developers to explore the benefits of different NoSQL databases without the burden of learning multiple APIs. It reduces the cognitive load and will enable developers to focus more on business logic while taking full advantage of the capabilities offered by the NoSQL database.

The extensibility of the Eclipse JNoSQL API is another critical advantage. It allows developers to work with specific behaviors of different NoSQL databases. For example, developers can utilize the Cassandra Query Language (CQL) through the same API if working with Cassandra.

The use of Eclipse JNoSQL simplifies the transition between different NoSQL databases. Without learning other classes and methods, developers can utilize the same API to work with multiple databases, such as MongoDB and ArangoDB. This approach enhances developer productivity and reduces the learning curve of integrating various NoSQL databases.

[Click on the image to view full-size]

While the Jakarta Persistence specification is commonly used for relational databases, it is unsuitable for NoSQL databases due to the fundamental differences in behavior and data models. Eclipse JNoSQL acknowledges these differences and provides a dedicated API explicitly designed for NoSQL databases, enabling developers to effectively leverage the unique capabilities of each NoSQL database.

Additionally, when working with the graph database implementation in Eclipse JNoSQL, it utilizes Apache TinkerPop, a standard interface for interacting with graph databases. By leveraging Apache TinkerPop, Eclipse JNoSQL ensures compatibility with various graph database vendors, allowing developers to work seamlessly with different graph databases using a consistent API. This standardization simplifies graph database integration, promotes interoperability, and empowers developers to harness the full potential of graph data in enterprise applications.

Eclipse JNoSQL simplifies Java and NoSQL database integration for enterprise applications. It provides a unified API, allowing developers to focus on business logic while seamlessly working with different NoSQL databases. Developers can explore the benefits of NoSQL databases without learning multiple APIs, thereby improving development efficiency and reducing the cognitive load associated with integrating diverse data sources.

Eclipse JNoSQL is an advanced Java framework that facilitates seamless integration between Java applications and various persistence layers, explicitly focusing on NoSQL databases. It supports two key specifications under the Jakarta EE umbrella: Jakarta Data and Jakarta NoSQL.

[Click on the image to view full-size]

  • Jakarta Data simplifies the integration process between Java applications and different persistence layers. It provides a unified repository interface that allows developers to work with multiple persistence layers using a single interface. This feature eliminates the need to learn and adapt to different APIs for each persistence layer, streamlining the development process. Additionally, Jakarta Data introduces a user-friendly and intuitive approach to handling pagination, making it easier for developers to manage large datasets efficiently. Eclipse JNoSQL extends Jakarta Data’s capabilities to support pagination within NoSQL databases, enhancing the overall data management experience.

[Click on the image to view full-size]

  • Jakarta NoSQL: On the other hand, Jakarta NoSQL focuses on working with NoSQL databases. It offers a fluent API that simplifies the interaction with various NoSQL databases. This API provides a consistent and intuitive way to perform operations and queries within the NoSQL data model. By leveraging Jakarta NoSQL, developers can harness the power of NoSQL databases while enjoying the benefits of a standardized and cohesive API, reducing the learning curve associated with working with different NoSQL databases.

Eclipse JNoSQL provides comprehensive support for integrating Java applications with persistence layers. Jakarta Data enables seamless integration across different persistence layers, and Jakarta NoSQL specifically caters to NoSQL databases. These specifications enhance developer productivity, reduce complexity, and promote interoperability within the Jakarta ecosystem, empowering developers to work efficiently with traditional and NoSQL data stores.

What’s New in Eclipse JNoSQL 1.0.0

Eclipse JNoSQL 1.0.0 ships some exciting features. These upgrades improve the framework's capabilities and simplify connecting Java applications with NoSQL databases.

  • More straightforward database configuration: One of the notable enhancements is the introduction of simplified database configuration. Developers can now easily configure and connect to NoSQL databases without the need for complex and time-consuming setup procedures. This feature significantly reduces the initial setup overhead and allows developers to focus more on the core aspects of their application development.
  • Improved Java Record support: The latest update includes enhanced support for Java Records, previewed in Java 14 and finalized in Java 16, which allow for the concise and convenient creation of immutable data objects. This update enables developers to easily map Java Records to NoSQL data structures, making data handling more efficient and effortless. This improvement also leads to better code readability, maintainability, and overall productivity in development.
  • Several bug fixes: Eclipse JNoSQL 1.0.0 introduces new features and fixes several bugs reported by the developer community.
  • Enhanced repository interfaces: The latest version comes with improved repository interfaces that effectively connect Java applications and NoSQL databases. These interfaces support a higher level of abstraction, making it easier for developers to interact with databases, retrieve and store data, and perform query operations. The updated repository interfaces in Eclipse JNoSQL also offer enhanced functionality, providing developers with greater flexibility and ease of performing database operations.

Eclipse JNoSQL 1.0.0 has introduced new features that improve the integration between Java and NoSQL databases. With these enhancements, developers can more efficiently utilize the full potential of NoSQL databases in their Java applications. These improvements also allow developers to focus on building innovative solutions rather than dealing with database integration complexities.

Show Me the Code

We will now dive into a live code session where we create a simple Pet application that integrates with a MongoDB database. While we acknowledge the popularity of MongoDB, it’s important to note that the concepts discussed here can be applied to other document databases, such as ArangoDB.

Before we proceed, it’s essential to ensure that the minimum requirements for Eclipse JNoSQL 1.0.0 are met. This includes Java 17, the Jakarta Contexts and Dependency Injection (CDI) specification, the Jakarta JSON Binding (JSON-B) specification and the Jakarta JSON Processing (JSON-P) specification that are compatible with Jakarta EE 10. Additionally, the Eclipse MicroProfile Config specification, version 3.0 or higher, is also required. Any Jakarta EE vendor compatible with version 10.0 and Eclipse MicroProfile Config 3.0 or any MicroProfile vendor compatible with version 6.0 or higher can run Eclipse JNoSQL. This broad compatibility allows flexibility in choosing a compatible Jakarta EE or MicroProfile vendor.

[Click on the image to view full-size]

It’s important to note that while we focus on the live code session, this article will not cover the installation and usage of MongoDB in production or recommend specific solutions like DBaaS with MongoDB Atlas. For demonstration purposes, feel free to install and use MongoDB in any preferred way. The article will use a simple Docker command to set up a database instance.

docker run -d --name mongodb-instance -p 27017:27017 mongo

Now that we have met the prerequisites, we are ready to proceed with the live code session, building and executing the Pet application that leverages Eclipse JNoSQL to interact with the MongoDB database.

In the next step, we will include Eclipse JNoSQL in the project, making it easier to handle dependencies and simplify the configuration process. The updated version of Eclipse JNoSQL streamlines the inclusion of dependencies, eliminating the need to add multiple dependencies individually.

To get started, we can explore the available database repositories to determine which database we want to use and the required configurations. In our case, we will be using the MongoDB database. You can find MongoDB’s necessary credentials and dependencies at this GitHub repository.

If you are using a Maven project, you can include the MongoDB dependency by adding the following dependency to your project’s pom.xml file:
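A sketch of what that dependency declaration might look like in pom.xml (the groupId, artifactId, and version shown here are assumptions based on the Eclipse JNoSQL databases repository; confirm the exact coordinates in the GitHub repository linked above):

```xml
<!-- Eclipse JNoSQL driver for MongoDB; coordinates are assumed,
     verify them against the JNoSQL databases repository -->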


By adding this dependency, Eclipse JNoSQL will handle all the configurations and dependencies required to work with MongoDB in your project. This streamlined process simplifies the setup and integration, allowing you to focus on developing your Pet application without getting caught up in complex dependency management.

With Eclipse JNoSQL integrated into the project and the MongoDB dependency added, we are ready to explore leveraging its features and effectively interacting with the MongoDB database.

When integrating Eclipse JNoSQL with the MongoDB database, we can use the power of the Eclipse MicroProfile Config specification to handle the necessary credentials and configuration information. Eclipse MicroProfile Config allows us to conveniently overwrite these configurations through system environment variables, providing flexibility and adhering to the principles and practice of the Twelve-Factor App.

For example, we can define the MongoDB connection’s database name and host URL in an Eclipse MicroProfile Config configuration file (usually Here are the sample configurations:
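A sketch of those entries in (the property keys are assumed from the environment-variable names discussed below; check the MongoDB driver documentation for the exact keys):

# Name of the MongoDB database the document template connects to
# Host and port of the MongoDB instance started with the Docker command above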


These configurations specify that the database name is pets and the MongoDB host URL is localhost:27017. However, instead of hardcoding these values, Eclipse MicroProfile Config can overwrite them based on system environment variables. This approach allows the development team to dynamically configure the database credentials and connection details without modifying the application code. The configurations will automatically be overwritten at runtime by setting environment variables, such as JNOSQL_DOCUMENT_DATABASE and JNOSQL_MONGODB_HOST. This flexibility ensures easy configuration management across different environments without requiring manual changes.
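The override mechanism can be tried directly from a shell. MicroProfile Config maps a property name to an environment variable by replacing non-alphanumeric characters with underscores and upper-casing the result; the staging values below are made-up placeholders, not real endpoints:

```shell
# MicroProfile Config maps jnosql.document.database -> JNOSQL_DOCUMENT_DATABASE
# (non-alphanumeric characters become underscores, letters are upper-cased).
# The values below are placeholder examples, not real endpoints.
export JNOSQL_DOCUMENT_DATABASE=pets-staging
export JNOSQL_MONGODB_HOST=mongo-staging:27017

# The running application picks these up at startup; no code or
# property-file change is needed.
echo "$JNOSQL_DOCUMENT_DATABASE at $JNOSQL_MONGODB_HOST"
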

The combination of Eclipse MicroProfile Config and Eclipse JNoSQL enables the development of Twelve-Factor Apps, which adhere to best practices for modern cloud-native application development. It provides a seamless way to handle configuration management, making it easier to adapt the application to different environments and promoting a consistent deployment process. We can achieve a highly configurable and portable solution that aligns with the principles of modern application development.

To proceed with the implementation, we will create a model for our sample application, focusing on the pet world. In this case, we will restrict the pets to cats and dogs, and we will define their name and breed, which cannot be changed once created.

We will utilize Java records to achieve immutability, which provides a concise and convenient way to define immutable data objects. We will create two records, Dog and Cat, implementing the Pet interface.

Here is an example of the code structure:

public sealed interface Pet permits Cat, Dog {

    String name();

    String breed();
}

@Entity
public record Dog(@Id String id, @Column String name, @Column String breed) implements Pet {

    public static Dog create(Faker faker) {
        var dog = faker.dog();
        return new Dog(UUID.randomUUID().toString(),, dog.breed());
    }
}

@Entity
public record Cat(@Id String id, @Column String name, @Column String breed) implements Pet {

    public static Cat create(Faker faker) {
        var cat =;
        return new Cat(UUID.randomUUID().toString(),, cat.breed());
    }
}
The above code defines the Pet interface as a sealed interface, allowing only the Cat and Dog records to implement it. Both records contain the fields id, name, and breed. The @Id and @Column annotations mark the fields as identifiers and persistable attributes, respectively.

Additionally, static factory methods are included in each record to generate a new instance of a Dog or a Cat using a Faker object from the Java Faker project, a library for generating fake data.

With this modeling structure, we achieve immutability, define the necessary annotations to mark the classes as entities, and provide the essential attributes for persisting the pets in the database. This approach aligns with modern Java frameworks and facilitates the integration of the pet objects with the MongoDB database through Eclipse JNoSQL.

Now, let’s see the code in action! We can utilize the Template interface provided by Eclipse JNoSQL to seamlessly perform operations with the NoSQL database. Depending on the specific requirements, we can explore specialized interfaces such as DocumentTemplate for document databases or provider-specific templates like MongoDBTemplate for MongoDB.

An example code snippet demonstrating some essential operations using the Jakarta NoSQL API:

@Inject
Template template;

// ...

var faker = new Faker();
var cat = Cat.create(faker);

Optional<Cat> optional = template.find(Cat.class,;
System.out.println("The result: " + optional);

for (int index = 0; index < 100; index++) {

List<Cat> result ="breed").eq(cat.breed()).result();
System.out.println("The query by breed: " + result);

template.delete(Cat.class,;
In the code snippet above, we first inject the Template interface using CDI. Then, we create a new Cat object using the Java Faker library and insert it into the database using the insert() method.

We can retrieve the inserted Cat object from the database using the find() method, which returns an object of type Optional. In this case, we print the result to the console.

Next, we insert 100 more randomly generated cat objects into the database using a loop.

We query using the select() method, filtering the cat objects by the breed attribute. The result is stored in a list and printed to the console.

Finally, we delete the previously inserted cat object from the database using the delete() method.

Using the Jakarta NoSQL API and the Template interface, we can perform various operations with the NoSQL database without being aware of the specific implementation details. Eclipse JNoSQL handles the underlying database operations, allowing developers to focus on writing concise and efficient code for their applications.

This code demonstrates the power and simplicity of working with NoSQL databases using Eclipse JNoSQL in a Jakarta EE application.

Pagination is a common requirement when working with large datasets. In this case, we can leverage the Jakarta Data specification and the repository feature to seamlessly handle pagination. By creating interfaces that extend the appropriate repository interface, such as PageableRepository, the framework will automatically implement the necessary methods for us.

Here’s an example of how we can integrate pagination into our Cat and Dog repositories:

public interface CatRepository extends PageableRepository<Cat, String>, PetQueries<Cat> {
}

public interface DogRepository extends PageableRepository<Dog, String>, PetQueries<Dog> {

    default Dog register(Dog dog, Event<Dog> event) {;
        return dog;
    }
}

public interface PetQueries<T extends Pet> {

    List<T> findByName(String name);

    List<T> findByBreed(String breed);
}
In the code above, we define two repository interfaces, CatRepository and DogRepository, which extend the PageableRepository interface. It allows us to perform pagination queries on the Cat and Dog entities.

Additionally, we introduce a PetQueries interface that defines standard query methods for both Cat and Dog entities. This interface can be shared among multiple repositories, allowing code reuse and modularization.

In the DogRepository, we also showcase using default methods that have been available since Java 8. We define a custom method, register, which triggers an event and saves the dog object. It demonstrates the flexibility of adding custom business logic to the repository interface while benefiting from the framework’s underlying repository implementation.

By leveraging the repository feature and implementing the appropriate interfaces, Eclipse JNoSQL handles the implementation details for us. We can now seamlessly perform pagination queries and execute custom methods with ease.

This integration of pagination and repository interfaces demonstrates how Eclipse JNoSQL, in combination with Jakarta Data, simplifies the development process and promotes code reuse and modularization within the context of a Jakarta EE application.

Let’s put the pagination into action by injecting the DogRepository and using it to perform the pagination operations. The code snippet below demonstrates this in action:

@Inject
DogRepository repository;

@Inject
Event<Dog> event;

var faker = new Faker();
var dog = Dog.create(faker);
repository.register(dog, event);

for (int index = 0; index < 100; index++) {;

Pageable pageable = Pageable.ofSize(10).sortBy(Sort.asc("name"), Sort.asc("breed"));
Page<Dog> dogs = repository.findAll(pageable);
while (dogs.hasContent()) {
    System.out.println("The page number: " + pageable.page());
    System.out.println("The dogs: " + dogs.content());
    pageable =;
    dogs = repository.findAll(pageable);

In the code above, we first inject the DogRepository and Event using CDI. We then create a new dog using the Java Faker library and register it by calling the repository.register() method. The register() method also triggers an event using the Event object.

Next, we generate and save 100 more dogs into the database using a loop and the method.

To perform pagination, we create a Pageable object with a page size of 10 and sort the dogs by name and breed in ascending order. We then call the repository.findAll() method passing the Pageable object to retrieve the first page of dogs.

We iterate over the pages using a while loop and print the page number and the number of dogs on each page. We update the Pageable object to the next page using the method and call repository.findAll() again to fetch the next page of dogs. Finally, we call repository.deleteAll() to delete all dogs from the database.

This code demonstrates the pagination feature, retrieving dogs in batches based on the defined page size and sorting criteria. It provides a convenient way to handle large datasets and display them to users in a more manageable manner.

In this code session, we witnessed the seamless integration between Eclipse JNoSQL and a MongoDB database in a Jakarta EE application. We explored the power of Eclipse JNoSQL, Jakarta Data, and Eclipse MicroProfile Config in simplifying the development process and enhancing the capabilities of our pet application.

The code showcased the modeling of pets using Java records and annotations, immutability, and entity mapping. We leveraged the Template interface to effortlessly perform operations with the MongoDB database. Pagination was implemented using Jakarta Data’s PageableRepository, providing an easy way to handle large datasets.

Using Eclipse MicroProfile Config enabled dynamic configuration management, allowing us to easily overwrite properties using system environment variables. This flexibility aligned with the Twelve-Factor App principles, making our application more adaptable and portable across environments.

The complete code for this session is available at jnosql-1-0-se. Additionally, you can explore more samples and demos with Java SE and Java EE at the following links: demos-se and demos-ee.

By utilizing Eclipse JNoSQL and its associated repositories, developers can harness the power of NoSQL databases while enjoying the simplicity and flexibility provided by the Jakarta EE and MicroProfile ecosystems. 

Eclipse JNoSQL empowers developers to focus on business logic and application development, abstracting away the complexities of NoSQL integration and allowing for a seamless exploration of polyglot persistence.

Quarkus Integration

One of the standout features in this release is the integration of Eclipse JNoSQL with Quarkus, a popular and highly efficient framework in the market. This integration is available as a separate module, providing seamless compatibility between Eclipse JNoSQL and Quarkus.

Integrating with Quarkus expands the possibilities for using Eclipse JNoSQL in your applications. You can now leverage the power of Eclipse JNoSQL with Quarkus’ lightweight, cloud-native runtime environment. The integration module currently supports ArangoDB, DynamoDB, Redis, Elasticsearch, MongoDB, and Cassandra databases, thanks to the significant contributions from Maximillian Arruda and Alessandro Moscatelli.

To stay updated on the communication and new updates between Quarkus and Eclipse JNoSQL, you can follow the repository at quarkiverse/quarkus-jnosql.

To start a project from scratch, you can explore the Quarkus extension for Eclipse JNoSQL on

To conclude this session, let’s look at three classes: Fish as an entity, FishService as a service, and FishResource as a resource. We can create a REST API for managing fish data with these classes. This sample introduces a fun twist by focusing on fish as pets, adding a pet-friendly touch.

@Entity
public class Fish {

    @Id
    public String id;
    @Column
    public String name;
    @Column
    public String color;
    // getters and setters
}

@ApplicationScoped
public class FishService {

    @Inject
    private DocumentTemplate template;

    private Faker faker = new Faker();

    public List<Fish> findAll() {
        return this.template.select(Fish.class).result();
    }

    public Fish insert(Fish fish) { = UUID.randomUUID().toString();
        return this.template.insert(fish);
    }

    public void delete(String id) {
        this.template.delete(Fish.class, id);
    }

    public Optional<Fish> findById(String id) {
        return this.template.find(Fish.class, id);
    }
    // other methods
}

@Path("fishes")
public class FishResource {

    @Inject
    private FishService service;

    @GET @Path("{id}")
    public Fish findId(@PathParam("id") String id) {
        return service.findById(id)
                .orElseThrow(() -> new WebApplicationException(Response.Status.NOT_FOUND));
    }

    @GET @Path("random")
    public Fish random() {
        return service.random();
    }

    @GET
    public List<Fish> findAll() {
        return this.service.findAll();
    }

    @POST
    public Fish insert(Fish fish) { = null;
        return this.service.insert(fish);
    }

    @PUT @Path("{id}")
    public Fish update(@PathParam("id") String id, Fish fish) {
        return this.service.update(id, fish)
                .orElseThrow(() -> new WebApplicationException(Response.Status.NOT_FOUND));
    }

    @DELETE @Path("{id}")
    public void delete(@PathParam("id") String id) {
        service.delete(id);
    }
}
The provided classes demonstrate the implementation of a REST API for managing fish data. The Fish class represents the entity, with name, color, and id as its properties. The FishService class provides methods for interacting with the fish data using the DocumentTemplate. Finally, the FishResource class serves as the REST resource, handling the HTTP requests and delegating the operations to the FishService.

You can find the detailed code for this example at quarkus-mongodb. With this code, you can explore the integration between Eclipse JNoSQL and Quarkus, building a pet-friendly REST API for managing fish data.


The release of Eclipse JNoSQL 1.0.0 marks a significant milestone in NoSQL database integration with Java applications. With its rich features and seamless compatibility with Jakarta EE, Eclipse MicroProfile, and now Quarkus, Eclipse JNoSQL empowers developers to leverage the full potential of NoSQL databases in modern enterprise applications.

The integration with Quarkus opens up new possibilities for developers, allowing them to harness the power of Eclipse JNoSQL in Quarkus’ lightweight, cloud-native runtime environment. With support for popular databases like ArangoDB, DynamoDB, Redis, Elasticsearch, MongoDB, and Cassandra, developers can choose the suitable database for their specific needs and easily switch between them.

Throughout this article, we explored the core features of Eclipse JNoSQL, including the unified API for various NoSQL databases, the use of Jakarta Data and repository interfaces for seamless integration, and the benefits of using Java records and annotations for entity mapping and immutability.

We witnessed how Eclipse JNoSQL simplifies the development process, abstracting away the complexities of NoSQL database integration and allowing developers to focus on writing clean, efficient code. The seamless integration with Eclipse MicroProfile Config further enhances flexibility, enabling dynamic configuration management.

Moreover, the demonstration of building a pet-friendly REST API using Eclipse JNoSQL showcased its simplicity and effectiveness in real-world scenarios. We explored the usage of repositories, pagination, and standard query methods, highlighting the power of Eclipse JNoSQL in handling large datasets and delivering efficient database operations.

As Eclipse JNoSQL continues to evolve, it offers developers an extensive range of options and flexibility in working with NoSQL databases. The vibrant community and ongoing contributions ensure that Eclipse JNoSQL remains at the forefront of NoSQL database integration in the Java ecosystem.

Eclipse JNoSQL 1.0.0 empowers developers to seamlessly integrate NoSQL databases into their Java applications, providing a powerful and efficient solution for modern enterprise needs.

Evolving the Federated GraphQL Platform at Netflix

Key Takeaways

  • Federated GraphQL distributes the ownership of the graph across several teams. This requires all teams to adopt and learn federated GraphQL and can be accomplished by providing a well-rounded ecosystem for developer and schema workflows.
  • Before you start building custom tools, it helps to use existing tools and resources adopted by the community and gradually work with initial adopters to identify gaps.
  • The (D)omain (G)raph (S)ervices (DGS) Framework is a Spring Boot-based Java framework that allows developers to easily build GraphQL services that can then be part of the federated graph. It provides many out-of-the-box integrations with the Netflix ecosystem, but the core is also available as an open-source project to the community.
  • As more teams came on board, we had to keep up with the scale of development. In addition to helping build GraphQL services, we also needed to replace manual workflows for schema collaboration with more tools that address the end-to-end schema development workflow to help work with the federated graph.
  • Today we have more than 200 services that are part of the federated graph. Federated GraphQL continues to be a success story at Netflix. We are migrating our Streaming APIs to a federated architecture and continue to invest in improving the performance and observability of the graph.

In this article, we will describe our migration journey toward a Federated GraphQL architecture. Specifically, we will talk about the GraphQL platform we built consisting of the Domain Graph Services (DGS) Framework for implementing GraphQL services in Java using Spring Boot and graphql-java and tools for schema development. We will also describe how the ecosystem has evolved at various stages of adoption.

Why Federated GraphQL?

Netflix has been evolving its slate of original content over the past several years. Many teams within the studio organization work on applications and services to facilitate production, such as talent management, budgeting, and post-production.

In a few cases, teams had independently created their APIs using gRPC, REST, and even GraphQL. Clients would have to talk to one or more backend services to fetch the data they needed. There was no consistency in implementation, and we had many ways of fetching the same data resulting in multiple sources of truth. To remedy this, we created a unified GraphQL API backed by a monolith. The client teams could then access all the data needed through this unified API, and they only needed to talk to a single backend service. The monolith, in turn, would do all the work of communicating with all the required backends to fetch and return the data in one GraphQL response.

However, this monolith did not scale as more teams added their data behind the unified GraphQL API. Translating incoming requests into the corresponding calls out to the various services required domain knowledge, which created a maintenance and operational burden for the team maintaining the graph. In addition, the evolution of the schema was not owned by the product teams primarily responsible for the data, which resulted in poorly designed APIs for clients.

We wanted to explore a different ownership model, such that the teams owning the data could also be responsible for their GraphQL API while still maintaining the unified GraphQL API for client developers to interact with (see Figure 1).


Figure 1: Federated ownership of graph

In 2019, Apollo released the Federation Spec, allowing teams to own subgraphs while still being part of a single graph. In other words, the ownership of the graph could be federated across multiple teams by breaking apart the various resolvers handling the incoming GraphQL requests. Instead of the monolith, we could now use a simple gateway that routes requests to the GraphQL backends, called Domain Graph Services (DGSs), each of which serves a subgraph. Each DGS handles fetching data from the corresponding backends owned by the same team (see Figure 2). We started experimenting with a custom implementation of the Federated GraphQL Gateway and began working with a few teams to migrate to this new architecture.
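To make the idea concrete, here is a minimal sketch of two subgraph schemas; the types and fields are invented for illustration, while the @key and @external directives come from the Apollo Federation spec:

```graphql
# Subgraph owned by the "movies" team: it owns the Movie entity,
# identified across the graph by the movieId key.
type Movie @key(fields: "movieId") {
  movieId: ID!
  title: String!
}

# Subgraph owned by the "production" team: it extends the same entity
# with its own fields, without owning the Movie type itself.
extend type Movie @key(fields: "movieId") {
  movieId: ID! @external
  productionStatus: String
}
```

The gateway composes these subgraphs into one unified schema and routes each part of an incoming query to the DGS that owns the requested fields.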


Figure 2: Federated GraphQL Architecture

A couple of years later, we now have more than 150 services that are part of this graph. Not only have we expanded the studio graph, but we have also created graphs for other domains, such as one for our internal platform tools and another for our streaming APIs, which we are migrating to this architecture.

The Early Adoption Phase

When we started the migration, we had around 40 teams already part of the federated graph served by our GraphQL monolith. We asked all these teams to migrate to a completely new architecture, which required knowledge of Federated GraphQL – an entirely new concept for us as well. Providing a great developer experience was key to successfully driving adoption at this scale.

Initially, a few teams opted to onboard onto the new architecture. We swarmed with the developers on the team to better understand the developer workflow, the gaps, and the tools required to bridge the knowledge gap and ease the migration process.

Our goal was to make it as easy as possible for adopters to implement a new GraphQL service and make it part of the federated graph. We started to gradually build out the GraphQL platform consisting of several tools and frameworks and continued to evolve during various stages of adoption.

Our Evolving GraphQL Ecosystem

The Domain Graph Services (DGS) Framework is a Spring Boot library based on graphql-java that allows developers to easily wire-up GraphQL resolvers for their schema. Initially, we created this framework with the goal of providing Netflix-specific integrations for security, metrics, and tracing out of the box for developers at Netflix. In addition, we wanted to eliminate the manual wire-up of resolvers, which we could optimize using custom DGS annotations as part of the core. Figure 3 shows the modular architecture of the framework with several opt-in features for developers to choose from.


Figure 3: DGS Framework Architecture

When using the DGS Framework, developers can focus on the business domain logic rather than on the specifics of GraphQL. In addition, we created a code generation Gradle plugin that generates the Java or Kotlin classes representing the schema, eliminating the manual creation of these classes.
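To give a flavor of the programming model, here is a minimal data fetcher along the lines of the framework’s getting-started examples; the Show type and the in-memory data are placeholders for illustration, not Netflix code:

```java
import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsQuery;
import com.netflix.graphql.dgs.InputArgument;

import java.util.List;

// The @DgsComponent / @DgsQuery annotations wire this method to the
// "shows" query field in the schema; no manual resolver registration.
@DgsComponent
public class ShowsDataFetcher {

    @DgsQuery
    public List<Show> shows(@InputArgument String titleFilter) {
        // In a real DGS this would call the team's own backend services.
        List<Show> all = List.of(new Show("Example Show", 4));
        if (titleFilter == null) return all;
        return all.stream().filter(s -> s.title().contains(titleFilter)).toList();
    }

    public record Show(String title, int seasons) {}
}
```

With the code generation plugin, types like Show are generated from the schema instead of being written by hand.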

Over time, we added more features that were more generally useful, even for service developers outside Netflix. We decided to open-source the DGS Framework and the DGS code generation plugin in early 2021 and continue evolving it. We also created the DGS IntelliJ plugin that provides navigation from schema to implementation of data resolvers and code completion snippets for implementing your DGS.

Having tackled implementing a GraphQL service, the next step was to register the DGS so it could be part of the federated graph. We implemented a schema registry service to map the schemas to their corresponding services. The federated gateway uses the schema registry to determine which services to reach out to, given an incoming query. We also implemented a self-service UI for this schema registry service to help developers manage their DGSs and discover the federated schema.

Finally, we enhanced our existing observability tools for GraphQL. Our distributed tracing tool allows easy debugging of performance issues and request errors by providing an overall view of the call graph, in addition to the ability to view correlated logs.

Scaling the Federated GraphQL Platform

We started this effort in early 2019, and since then, we have more than 200 teams participating in the federated graph. The adoption of federated GraphQL architecture has been so successful that we ended up creating more graphs for other domains in addition to our Studio Graph. We now have one for internal platform tooling, which we call the Enterprise Graph, and another for our streaming APIs.

After introducing the new Enterprise graph, we quickly realized that teams were interested in exposing the same DGS as part of both the Enterprise and Studio graphs. Similarly, clients were interested in fetching data from both graphs. We therefore merged the Studio and Enterprise graphs into one larger supergraph. This merge created a new set of challenges related to scaling the graph, both in schema size and in the number of developers working on it.

The larger graph made it harder to scale our schema review process, since it had been mostly overseen manually by members of our schema review group. We needed more tooling to automate some of these processes. We created a tool called GraphDoctor to lint the schema and automatically comment on PRs containing schema changes for all enrolled services. To help with schema collaboration, we created GraphLabs, which stages a sandboxed environment for testing schema changes without affecting other parts of the graph. This allows front-end and back-end developers to collaborate on schema changes more rapidly (see Figure 4).


Figure 4: Schema Development Workflow

Developer Support Model

We built the GraphQL platform to facilitate implementing domain graph services and working with the graph. However, this alone would not have been sufficient. We needed to complement the experience with good developer support. Initially, we offered a white-glove migration experience by swarming with the teams and doing much of the migration work for them. This provided many insights into what we needed to build to improve the developer experience. We identified gaps in the existing solutions and opportunities to speed up implementation by allowing developers to focus on the business logic and eliminating repetitive code setup.

Once we had a fairly stable platform, we could onboard many more teams at a rapid pace. We also invested heavily in providing our developers with good documentation and tutorials on federation and GraphQL concepts so they can self-service easily. We continue to offer developer support on our internal communication channels via Slack during business hours to help answer any questions and troubleshoot issues as they arise.

Developer Impact

The GraphQL platform provides a completely paved path for the entire workflow: designing the schema, implementing it in a GraphQL service, registering the service as part of the federated graph, and operating the service once deployed. This has helped more teams adopt the architecture, making GraphQL more popular than traditional REST APIs at Netflix. In particular, federated GraphQL greatly simplifies data access for front-end developers, allowing teams to move quickly on their deliverables.

Our Learnings

By investing heavily in developer experience, we were able to drive adoption at a much more accelerated pace than we would have otherwise. We started small by leveraging community tools. That helped us identify gaps and where we needed custom functionality. We built the DGS Framework and an ecosystem of tools, such as the code generation plugin and even one for basic schema management.

Having tackled the basic workflow, we could focus our efforts on more comprehensive schema workflow tools. As adoption increased, we were able to identify problems and adapt the platform to work with larger graphs and an increasing number of developers. We automated part of the schema review process, which has made working with larger graphs easier. We continue to see new use cases emerge and are evolving our platform to provide paved-path solutions for them.

What’s ahead?

So far, we have migrated our Studio architecture to federated GraphQL and merged a new graph for internal platform teams with the Studio Graph to form one larger supergraph. We are now migrating the Netflix streaming APIs that power the discovery experience in the Netflix UI to the same model. This new graph comes with a different set of challenges. The Netflix streaming service is supported across various devices, and the UI is rendered differently on each platform. The schema needs to be well designed to accommodate the different use cases.

Another key difference is that the streaming services must handle significantly higher request rates (RPS) than the existing graphs. We are identifying performance bottlenecks in the framework and tooling to make our GraphQL services more performant. In parallel, we are improving our observability tooling to ensure we can operate these services at scale.

AI-based Prose Programming for Subject Matter Experts: Will this work?

Key Takeaways

  • Recent advances in prose-to-code generation via Large Language Models (LLMs) will make it practical for non-programmers to “program in prose” for practically useful program complexities, a long-standing dream of computer scientists and subject-matter experts alike.
  • Assuming that correctness of the code and explainability of the results remain important, testing the code will still have to be done using more traditional approaches. Hence, the non-programmers must understand the notion of testing and coverage.
  • Program understanding, visualization, exploration, and simulation will become even more relevant in the future to illustrate what the generated program does to subject matter experts.
  • There is a strong synergy with very high-level programming languages and domain-specific languages (DSLs) because the to-be-generated programs are shorter (and less error prone) and more directly aligned with the execution semantics (and therefore easier to understand).
  • I think it is still an open question how far the approach scales and what integrated tools that exploit both LLMs’ “prose magic” and more traditional ways of computing will look like. I illustrate this with an open-source demonstrator implemented in JetBrains MPS.



As a consequence of AI, machine learning, neural networks, and in particular Large Language Models (LLMs) like ChatGPT, there is an ongoing discussion about the future of programming. It has two main areas. One focuses on how AI can help developers code more efficiently. We have probably all asked ChatGPT to generate small-ish fragments of code from prose descriptions and pasted them into whatever larger program we were developing. Or used GitHub Copilot directly in our IDEs.

This works quite well because, as programmers, we can verify that the code makes sense just by looking at it or trying it out in a “safe” environment. Eventually (or even in advance), we write tests to validate that the generated code works in all relevant scenarios. And the AI-generated code doesn’t even have to be completely correct because it is useful to developers if it reaches 80% correctness. Just like when we look up things on Stackoverflow, it can serve as an inspiration/outline/guidance/hint to allow the programmer to finish the job manually. I think it is indisputable that this use of AI provides value to developers.

The second discussion area is whether this will enable non-programmers to instruct computers. The idea is that they just write a prompt, and the AI generates code that makes the machine do whatever they intended. The key difference to the previous scenario is that the inherent safeguards against generated nonsense aren’t there, at least not obviously.

A non-programmer user can’t necessarily look at the code and check it for plausibility, they can’t necessarily bring a generated 80% solution to 100%, and they don’t necessarily write tests. So will this approach work, and how must languages and tools change to make it work? This is the focus of this article.

Why not use AI directly?

You might ask: why generate programs in the first place? Why don’t we just use a general-purpose AI to “do the thing” instead of generating code that then “does the thing”? Let’s say we are working in the context of tax calculation. Our ultimate goal is a system that calculates the tax burden for any particular citizen based on various data about their incomes, expenses, and life circumstances.

We could use an approach where a citizen enters their data into some kind of form and then submits the form data (say, as JSON) to an AI (either a generic LLM or a tax-calculation-specific model), which then directly computes the taxes. There’s no program in between, AI-generated or otherwise (except the one that collects the data, formats the JSON, and submits it to the AI). This approach is unlikely to be good enough in most cases, for the following reasons:

  • AI-based software isn’t good at mathematical calculations [1]; this isn’t a tax-specific issue since most real-world domains contain numeric calculations.
  • If an AI is only 99% correct, the 1% wrong is often a showstopper.
  • Whatever the result is, it can’t be explained or “justified” to the end user (I will get back to this topic below).
  • Running a computation for which a deterministic algorithm exists with a neural network is inefficient in terms of computing power and the resulting energy and water consumption.
  • If there’s a change to the algorithm, we have to retrain the network, which is even more computationally expensive.

To remedy these issues, we use an approach where a subject matter expert who is not a programmer, say our tax consultant, describes the logic of the tax calculation to the AI, and the AI generates a classical, deterministic algorithm which we then repeatedly run on citizens’ data. Assuming the generated program is correct, all the above drawbacks are gone:

  • It calculates the result with the required numeric precision.
  • By tracing the calculation algorithm, we can explain and justify the result (again, I will explain this in more detail below).
  • It will be correct in 100% of the cases (assuming the generated program is correct).
  • The computation is as energy efficient as any program today.
  • The generated code can be adapted incrementally as requirements evolve.
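The deterministic alternative is ordinary code. As a minimal sketch, assuming a hypothetical two-bracket tax rule (the allowance, threshold, and rates below are invented for illustration, not taken from any real tax code):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical two-bracket tax rule, invented purely for illustration.
public class TaxCalculator {
    static final BigDecimal ALLOWANCE = new BigDecimal("10000");
    static final BigDecimal THRESHOLD = new BigDecimal("50000");
    static final BigDecimal LOW_RATE  = new BigDecimal("0.20");
    static final BigDecimal HIGH_RATE = new BigDecimal("0.40");

    // Deterministic: the same income always yields the same tax, to the cent,
    // and every intermediate value (taxable, lower, upper) can be inspected
    // to explain the result.
    public static BigDecimal tax(BigDecimal income) {
        BigDecimal taxable = income.subtract(ALLOWANCE).max(BigDecimal.ZERO);
        BigDecimal lower = taxable.min(THRESHOLD).multiply(LOW_RATE);
        BigDecimal upper = taxable.subtract(THRESHOLD).max(BigDecimal.ZERO).multiply(HIGH_RATE);
        return lower.add(upper).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(tax(new BigDecimal("72000"))); // prints 14800.00
    }
}
```

Unlike a neural network, each branch and intermediate value here maps directly onto a clause of the (hypothetical) regulation, which is exactly what makes the result explainable.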

Note that we assume here that correctness for all (relevant) cases and explainability are important. If you don’t agree with these premises, you can probably stop reading; you are likely of the opinion that AI will replace more or less all traditionally programmed software. I decidedly don’t share this opinion, at least not for the next 5–7 years.

Correctness and Creativity

Based on our experience with LLMs writing essays or Midjourney & Co generating images, we ascribe creativity to these AIs. Without defining precisely what “creativity” means, I see it here as a degree of variability in the results generated for the same or slightly different prompts. This is a result of how word prediction works and the fact that these tools employ randomness in the result generation process (Stephen Wolfram explains this quite well in his essay). This feels almost like making a virtue from the fault that neural networks generally aren’t precisely deterministic.

Just do an experiment and ask an image-generating AI to render technical subjects such as airplanes or cranes, subjects for which a specific notion of “correct” exists; jet airliners just don’t have two wings of different lengths or two engines on one wing and one on the other. The results are generally disappointing. If, instead, you try to generate “fantasy dogs running in the forest while it rains,” the imprecision and variability are much more tolerable, to the point that we interpret them as “creativity.” Generating programs is more like rendering images of airplanes than of running dogs. Creativity is not a feature for this use case of AI.


Let me briefly linger on the notion of explainability. Consider again your tax calculation. Let’s say it asks you to pay 15.323 EUR for a particular period of time. Based on your own estimation, this seems too much, so you ask, “Why is it 15.323 EUR?” If an AI produces the result directly, it can’t answer this question. It might (figuratively) reply with the weights, thresholds, and activation levels of the internal neurons. Still, those have absolutely no meaning to you as a human. Their connection to the logic of tax calculation is, at best, very indirect. Maybe it can even (figuratively) show you that your case looks very similar to these 250 others, and therefore, somehow, your tax amount has to be 15.323 EUR. A trained neural network is essentially just an extremely tight curve fit, one with a huge number of parameters. It’s a form of “empirical programming”: it brute-force replicates existing data and extrapolates.

It’s just like in science: to explain what fitted data means, you have to connect it to a scientific theory, i.e., “fit the curve” with physical quantities that we know about. The equivalent of a scientific theory (stretching the analogy a bit) is a “traditional” program that computes the result based on a “meaningful” algorithm. The user can inspect the intermediate values, see the branches the program took, the criteria for decisions, and so on. This serves as a reasonable first-order answer to the “why” question – especially if the program is expressed with abstractions, structures, and names that make sense in the context of the tax domain [2].

A well-structured program can also be easily traced back to the law or regulations that back up the particular program code. Program state, expressed with reasonably domain-aligned abstractions, plus a connection to the “requirements” (the law in case of tax calculation) is a really good answer to the “why.” Even though there is research into explainable AI, I don’t think the current approach of deep learning will be able to do this anytime soon. And the explanations that ChatGPT provides are often hollow or superficial. Try to ask “why” one or two more times, and you’ll quickly see that it can’t really explain a lot.

Domain-Specific Tools and Languages

A part of the answer of whether subject-matter expert prose programming works is domain-specific languages (DSL). A DSL is a (software) language that is tailor-made for a particular problem set – for example, for describing tax calculations and the data structures necessary for them, or for defining questionnaires used in healthcare to diagnose conditions like insomnia or drug abuse. DSLs are developed with the SMEs in the field and rely on abstractions and notations familiar to them. Consequently, if the AI generates DSL code, subject matter experts will be more able to read the code and validate “by looking” that it is correct.

There’s an important comment I must make here about the syntax. As we know, LLMs work with text, so we have to use a textual syntax for the DSL when we interact with the LLM. However, this does not mean that the SME has to look at this for validation and other purposes. The user-facing syntax can be a mix of whatever makes sense: graphical, tables, symbolic, Blockly-style, or textual. While representing classical programming languages graphically often doesn’t work well, it works much better if the language has been designed from the get-go with the two syntaxes in mind – the DSL community has lots of experience with this.

More generally, if the code is written by the AI and only reviewed or adapted slightly by humans, then the age-old trade-off between writability and readability is decided in favor of readability. I think the tradeoff has always tended in this direction because code is read much more often than it is written, plus IDEs have become more and more helpful with the writing part. Nonetheless, if the AI writes the code, then the debate is over.

A second advantage to generating code in very high-level languages such as DSLs is that it is easier for the AI to get it right. Remember that LLMs are Word Prediction Machines. We can reduce the risk of predicting wrong by limiting the vocabulary and simplifying the grammar. There will be less non-essential variability in the sentences, so there will be a higher likelihood of correctly generated code. We should ensure that the programming language is good at separating concerns. No “technical stuff” mixed with the business logic the SME cares about.

The first gateway for correctness is the compiler (or syntax/type checker in case of an interpreted language). Any generated program that does not type check or compile can be rejected immediately, and the AI can automatically generate another one. Here is another advantage of high-level languages: you can more easily build type systems that, together with the syntactic structure, constrain programs to be meaningful in the domain. In the same spirit, the fewer (unnecessary) degrees of freedom a language has, the easier it is to analyze the programs relative to interesting properties. For example, a state machine model is easier to model check than a C program. It is also easier to extract an “explanation” for the result, and, in the end, it is easier for an SME to learn to validate the program by reading it or running it with some kind of simulator or debugger. There’s just less clutter, which simplifies everybody’s (and every tool’s) life.

There are several examples that use this approach. Chat Notebooks in Mathematica allow users to write prose, and ChatGPT generates the corresponding Wolfram Language code that can then be executed in Mathematica. A similar approach has been demonstrated for Apache Spark and itemis CREATE, a state machine modeling tool (the linked article is in German, but the embedded video is in English). I will discuss my demonstrator a bit more in the next section.

The approach of generating DSL code also has a drawback: the internet isn’t full of example code expressed in your specific language for the LLM to learn from. However, it turns out that “teaching” ChatGPT the language works quite well. I figure there are two reasons: one is that even though the language is domain-specific, many parts of it, for example, expressions, are usually very similar to traditional programming languages. And second, because DSLs are higher-level relative to the domain, the syntax is usually a bit more “prose-like”; so expressing something “in the style of the DSL I explained earlier” is not a particular challenge for an AI.

The size of the language you can teach to an LLM is limited by the “working memory” of the LLM, but it is fair to assume that this will grow in the future, allowing more sophisticated DSLs. And I am sure that other models will be developed that are optimized for structured text, following a formal schema rather than the structure of (English) prose.

A demonstrator

I have implemented a system that demonstrates the approach of combining DSLs and LLMs. The demonstrator is based on JetBrains’ MPS and ChatGPT; the code is available on github. The example language focuses on forms with fields and calculated values; more sophisticated versions of such forms are used, for example, as assessments in healthcare. Here is an example form:

In addition to the forms, the language also supports expressing tests; these can be executed via an interpreter directly in MPS.
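Since the actual notation lives inside MPS, here is only an invented textual approximation of the kind of form and test the language expresses (this is not the demonstrator’s real syntax):

```
form BmiCheck
  field height_m  : number
  field weight_kg : number
  calculated bmi = weight_kg / (height_m * height_m)

test BmiCheckBasics
  height_m = 2.0, weight_kg = 80.0
  assert bmi == 20.0
```

A test is just a list of field values plus assertions on the calculated values, which the MPS interpreter can evaluate directly.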

In this video, I show how ChatGPT 3.5 turbo generates meaningfully interesting forms from prose prompts. Admittedly, this is a simple language, and the DSLs we use for real-world systems are more complex. I have also done other experiments where the language was more complicated, and it worked reasonably well. And as I have said, LLMs will become better and more optimized for this task. In addition, most DSLs have different aspects or viewpoints, and a user often has to generate only small parts of the model, which, from the perspective of the LLM, can be seen as smaller languages.

A brief description of how this demonstrator is implemented technically can be found in the README on github.

Understanding and testing the generated code

Understanding what a piece of code does just by reading it only goes so far. A better way to understand code is to run it and observe the behavior. In the case of our tax calculation example, we might check that the amount of tax our citizen has to pay is correct relative to what the regulations specify. Or we validate that the calculated values in the healthcare forms above have the expected values. For realistically complex programs, there is a lot of variability in the behavior; there are many case distinctions (tax calculations are a striking example, and so are algorithms in healthcare), so we write tests to validate all relevant cases.

This doesn’t go away just because the code is AI-generated. It is even more critical because if we don’t express the prose requirements precisely, the generated code is likely to be incorrect or incomplete – even if we assume the AI doesn’t hallucinate nonsense. Suppose we use a dialog-based approach to get the code right incrementally. In that case, we need regression testing to ensure previously working behavior isn’t destroyed by an AI’s “improvement” of the code. So all of this leads to the conclusion that if we let the AI generate programs, the non-programmer subject matter expert must be in control of a regression test suite – one that has reasonably good coverage.

I don’t think that it is efficient – even for SMEs – to make every small change to the code through a prose instruction to the AI. Over time they will get a feel for the language and make code changes directly. The demo application I described above allows users to modify the generated form, and when they then instruct the LLM to modify further, the LLM continues from the result of the user’s modified state. Users and the LLM can truly collaborate. The tooling also supports “undo”: if the AI changes the code in a way that does more harm than good, you want to be able to roll back. The demonstrator I have built keeps the history of {prompt-reply}-pairs as a list of nodes in MPS; stepwise undo is supported just by deleting the tail of the list.

So how can SMEs get to the required tests? If the test language is simple enough (which is often the case for DSLs, based on my experience), they can manually write the tests. This is the case in my demonstrator system, where tests are just a list of field values and calculation assertions. It’s inefficient to have an LLM generate the tests based on a much more verbose prose description. This is especially true with good tool support where, as in the demonstrator system, the list of fields and calculations is already pre-populated in the test. An alternative to writing the tests is to record them while the user “plays” with the generated artifact. While I have not implemented this for the demonstrator, I have done it for a similar health assessment DSL in a real project: the user can step through a fully rendered form, enter values, and express “ok” or “not ok” on displayed calculated values.

Note that users still have to think about relevant test scenarios, and they still have to continue creating tests until a suitable coverage metric shows green. A third option is to use existing test case generation tools. Based on analysis of the program, they can come up with a range of tests that achieve good coverage. The user will usually still have to manually provide the expected output values (or, more generally, assert the behavior) for each automatically generated set of inputs. For some systems, such test case generators can generate the correct assertion as well, but then the SME user at least has to review them thoroughly – because they will be wrong if the generated program is wrong. Technically speaking, test case generation can only verify a program, not validate it.

Mutation testing (where a program is automatically modified to identify parts that don’t affect test outcomes) is a good way of identifying holes in the coverage; the nice thing about this approach is that it does not rely on fancy program analysis, so it’s easy to implement, also for your own (domain-specific) languages. In fact, the MPS infrastructure on which we have built our demonstrator DSL supports coverage analysis (based on the interpreter that runs the tests), and we also have a prototype program mutator.
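The idea can be sketched in a few lines of Java; the setup and names here are invented for illustration (a real mutation tool would generate the mutants by rewriting the program automatically):

```java
import java.util.List;
import java.util.function.IntBinaryOperator;

// Toy mutation-testing sketch: run the same test suite against the
// original operation and against "mutants"; a mutant that survives
// (passes the suite) reveals a hole in the coverage.
public class MutationDemo {
    // The "program" under test: a calculated field, total = a + b.
    public static final IntBinaryOperator original = (a, b) -> a + b;

    // Mutants a mutation tool would generate automatically.
    public static final List<IntBinaryOperator> mutants = List.of(
            (a, b) -> a - b,  // arithmetic operator flipped
            (a, b) -> a       // second operand dropped
    );

    // A weak suite: its only case uses b == 0, where all three agree.
    public static boolean suitePasses(IntBinaryOperator op) {
        return op.applyAsInt(4, 0) == 4;
    }

    public static void main(String[] args) {
        long killed = mutants.stream().filter(m -> !suitePasses(m)).count();
        System.out.println("killed " + killed + " of " + mutants.size() + " mutants");
    }
}
```

Running this prints `killed 0 of 2 mutants`: both mutants survive because the suite never exercises a nonzero second operand. Adding a case such as `op.applyAsInt(2, 3) == 5` kills both, which is exactly the coverage hole mutation testing is meant to expose.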

We can also consider having the tests generated by an AI. Of course, this carries the risk of self-fulfilling prophecies: if the AI “misunderstands” the prose specification, it might generate a wrong program and tests that falsely corroborate that wrong program. To remedy this issue, we can have the program and the tests generated by different AIs; at the very least, you should use two separate ChatGPT sessions. In my experiments, ChatGPT couldn’t generate the correct expected values for the form calculations; it could not “execute” the expressions it had generated into the form earlier. Instead of generating tests, we can generate properties [3] for verification tools, such as model checkers. In contrast to generated tests, generated properties provide a higher degree of confidence. Here’s the important thing: even if tests or properties are generated (by traditional test generators or via AI), at least the tests have to be validated by a human. Succeeding tests or tool-based program verifications are only useful if they ensure the right thing.

There’s also a question about debugging. What happens if the generated code doesn’t work for particular cases? Just writing prompts à la “the code doesn’t work in this case, fix it!” is inefficient; experiments with my demonstrator confirm this suspicion. It will eventually become more efficient to adapt the generated code directly. So again: the code has to be understood and “debugged.” A nicely domain-aligned language (together with simulators, debuggers, and other means of relating the program source to its behavior) can go a long way, even for SMEs. The field of program tracing, execution visualization, live programming, and integrated programming environments where there’s less distinction between the program and its executions is very relevant here. I think much more research and development are needed for programs without obvious graphical representations; the proverbial bouncing ball from the original Live Programming demo comes to mind.

There’s also another problem I call “broken magic.” If SMEs are used to things “just working” based on their prose AI prompt, and they are essentially shielded from the source code and, more generally, how the generated program works, then it will be tough for them to dig into that code to fix something. The more “magic” you put into the source-to-behavior path, the harder it is for (any kind of) user to go from behavior back to the program during debugging. You need quite fancy debuggers, which can be expensive to build. This is another lesson learned from years and years of using DSLs without AI.

Summing up

Let’s revisit the skills the SMEs will need in order to reliably use AI to “program” in the context of a particular domain. In addition to being able to write prompts, they will have to learn how to review, write or record tests, and understand coverage to appreciate which tests are missing and when enough tests are available. They have to understand the “paradigm” and structure of the generated code so they can make sense of explanations and make incremental changes. For this to work in practice, we software engineers have to adapt the languages and tools we use as the target of AI code generation:

  • Smaller and more domain-aligned languages have a higher likelihood that the generated code will be correct and are easier for SMEs to understand; this includes the language for writing tests.
  • We need program visualizers, animators, simulators, debuggers, and other tools that reduce the gap between a program and its set of executions.
  • Finally, any means of test case generation, program analysis, and the like will be extremely useful.

So, the promise that AI will let humans communicate with computers using the humans’ language is realistic to a degree. While we can express the expected behavior as prose, humans have to be able to validate that the AI-generated programs are correct in all relevant cases. I don’t think that doing this just via a prose interface will work well; some degree of education on “how to talk to computers” will still be needed, and the diagnosis that this kind of education is severely lacking in most fields except computer science remains true even with the advent of AI.

Of course, things will change as AI improves – especially in the case of groundbreaking new ideas where classical, rule-based AI is meaningfully integrated with LLMs. Maybe more or less manual validation is no longer necessary because the AI is somehow good enough to always generate the correct programs. I don’t think this will happen in the next 5–7 years. Predicting beyond is difficult – so I don’t.


  • [1] In the future, LLMs will likely be integrated with arithmetic engines like Mathematica, so this particular problem might go away.
  • [2] Imagine the same calculation expressed as a C program with a set of global integer variables all named i1 through i500. Even though the program can absolutely produce the correct results and is fully deterministic, inspecting the program’s execution – or some kind of report auto-generated from it – won’t explain anything to a human. Abstractions and names matter a lot!
  • [3] Properties are generalized statements about the behavior of a system that verification tools try to prove or try to find counterexamples for.


Thanks to Sruthi Radhakrishnan, Dennis Albrecht, Torsten Görg, Meite Boersma, and Eugen Schindler for feedback on previous versions of this article.

I also want to thank Srini Penchikala and Maureen Spencer for reviewing and copyediting this article.

Monitoring Critical User Journeys in Azure

Key Takeaways

  • A critical user journey (CUJ) is an approach that maps out the key interactions between users and a product. CUJs are a great way to understand the effectiveness of application flows and identify bottlenecks.
  • Tools like Prometheus and Grafana provide a standardized way to collect and visualize these metrics and monitor CUJs.
  • In the Flowe technology stack, fulfilling a CUJ often involves a user’s request being handled by a mix of PaaS and Serverless technology.
  • In the current economic climate, pricing is a critical factor for any monitoring solution. The decision of build vs buy must be analyzed closely, even when running initial experiments.
  • Infrastructure as Code (IaC) frameworks like Azure Bicep help provision resources and organize CUJ metric collection as resources are deployed.


The Need for Application Monitoring

I work as an SRE for Flowe, an Italian challenger digital bank where the software, rather than physical bank branches, is the main product. Everything is in the mobile app and, professionally speaking, ensuring the continued operation of this service is my main concern. Unlike at a traditional bank, customers rely on this mobile app as the key point of interaction with our services.

Flowe established a monitoring team to ensure proper integration between the bank platform and its numerous third-party services (core banking, card issuer, etc.). This team is available 24/7, and when something goes wrong on a third-party system (e.g., callbacks are not working), they open a ticket with that third party.

Although they do an excellent job, the monitoring team doesn’t have deep knowledge about the system architecture, business logic, or even all the components of the bank platform. Their scope is limited to third parties.

This means that if a third party is not responding, they are quick to open tickets and monitor them until the incident is closed. However, the monitoring team lacks the development skills and knowledge to catch bugs, improve availability/deploy systems, measure performances, monitor dead letter queues (DLQs), etc. For this reason, at least initially, when Flowe launched, senior developers were in charge of these tasks.

However, after the first year of life, we realized developers were too busy building new features, etc., and they didn’t have time for day-to-day platform observation. So we ended up creating our SRE team with the primary goal of making sure the banking platform ran smoothly.

SRE Team Duties

What Flowe needed from an SRE team changed over time. As explained in the previous paragraph, the first necessity was covering what developers and the monitoring team couldn’t do: monitor exceptions and API response status codes, find bugs, watch the Azure Service Bus DLQs, measure performance, adopt infrastructure as code (IaC), improve deployment systems, and ensure high availability.

The transition of responsibilities toward the SRE team was slow but effective, and over time the SRE team has grown, expanding its skill set and covering more and more aspects of Flowe’s daily routine. We started to assist the “caring team” (customer service) and put Critical User Journeys (CUJs) in place.

CUJs are a great way to understand the effectiveness of application flows and identify bottlenecks. One example of a CUJ in our case is the “card top up process”, an asynchronous flow that involves different actors and services owned by many organizations. This CUJ gives us the context of the transaction and enables us to understand where users encounter issues and what the bottlenecks are. Solving issues rapidly is extremely important. Most users that get stuck in some process don’t chat with the caring team but simply leave a low app rating.

SRE Team Tools

Azure Application Insights is an amazing APM tool, and we use it intensively for diagnostic purposes within our native iOS/Android SDK. However, although we had decided to use Application Insights, its integration into the core Azure Monitor suite lacked some features that were critical to our usage.

For example, alerts can only be sent via email and SMS; there is no native integration with other services such as PagerDuty or Slack. Moreover, custom dashboards built with Azure Workbooks proved neither flexible nor scalable enough for our needs.

For all of the mentioned reasons, we, as the SRE team, decided to put in place two well-known open-source products to help us with monitoring and alerting tasks: Prometheus and Grafana.

Prometheus is an open-source project hosted by the Cloud Native Computing Foundation (CNCF). Prometheus uses the pull model to collect metrics from configured targets and saves the data in its time series database. Prometheus has its own data model – a plain text format – and as long as you can convert your data to this format, you can use Prometheus to collect and store it.
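For example, a single gauge in Prometheus’s text exposition format looks like this (the metric and label names here are purely illustrative):

```text
# HELP dlq_messages Number of messages currently sitting in a dead-letter queue
# TYPE dlq_messages gauge
dlq_messages{topic="payments",subscription="cards"} 3
```

Anything that can emit lines in this shape over HTTP can act as a Prometheus scrape target.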

Grafana is another CNCF project that allows you to create dashboards to visualize data collected from hundreds of different places (data sources). Usually, Grafana is used with Prometheus since it can understand and display its data format. Still, many other data sources, such as Azure Monitor, Azure DevOps, DataDog, and GitHub, can be used. Grafana handles alerts, integrating with many services such as Teams or Slack.

Monitoring Solution

As we adopted Prometheus and Grafana, we needed to do so in a cost-effective manner – the two key metrics were the size of our team and the amount of data processed and stored. So we did a proof of concept using an Azure Virtual Machine (VM) with a Docker Compose file to start both Prometheus and Grafana containers and practice with them. The solution was cheap, but managing an entire VM just to run two containers wastes time.
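Such a proof of concept can be sketched with a Compose file along these lines (a minimal sketch; the volume names are assumptions, and the ports are the two tools’ defaults):

```yaml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"   # Prometheus UI and API
    volumes:
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"   # Grafana UI
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  prometheus-data:
  grafana-data:
```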

For this reason, we looked at the managed version offered by Azure:

  • Grafana Managed Instance costs 6€/month per active user – with our six-person SRE team, that is 36€/month, a bit more expensive than the roughly 30€/month for the VM.
  • The Prometheus Managed pricing model is not straightforward, especially if you are starting from scratch and don’t have any existing metrics to rely on. Moreover, you have to pay for notifications (push, emails, webhooks, etc.).

After some experiments and research on all possible solutions, we saw that a combination of SaaS and serverless Azure solutions seemed the best option for our use case. Let’s see which ones, how, and why they’ve been integrated with each other.

Azure Managed Grafana

Azure Managed Grafana is a fully managed service for monitoring solutions, and it offered exactly what we were looking for: automatic software upgrades, SLA guarantees, availability zones, Single Sign-On with AAD, and integration with Azure Monitor (via Managed Identity) ready out of the box.

Provisioning Grafana using Bicep

Bicep provides a simple way of provisioning our cloud resources through Infrastructure as Code principles. This allows repeatable deployments as well as a way to record the resource’s place along the CUJ. The Bicep definition of a Grafana Managed instance is simple.

resource grafana 'Microsoft.Dashboard/grafana@2022-08-01' = {
  name: name
  location: resourceGroup().location
  sku: {
    name: 'Standard'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    apiKey: 'Disabled'
    autoGeneratedDomainNameLabelScope: 'TenantReuse'
    deterministicOutboundIP: 'Enabled'
    publicNetworkAccess: 'Enabled'
    zoneRedundancy: 'Enabled'
  }
}

In this configuration, one detail is worth highlighting: `deterministicOutboundIP` is set to `Enabled`. This gives us two static outbound IPs that we will use later to restrict access to the Prometheus instance to Grafana only.

Finally, we needed to grant the Grafana Admin Role to our AAD group and the Monitoring Reader Role to the Grafana Managed Identity to get access to Azure Monitor logs.

@description('This is the built-in Grafana Admin role.')
resource grafanaAdminRole 'Microsoft.Authorization/roleDefinitions@2022-04-01' existing = {
  scope: subscription()
  name: '22926164-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
}

@description('This is the built-in Monitoring Reader role.')
resource monitoringReaderRole 'Microsoft.Authorization/roleDefinitions@2022-04-01' existing = {
  scope: subscription()
  name: '43d0d8ad-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
}

// Object ID of the SRE AD group
var sreGroupObjectId = 'aed71f3f-xxxx-xxxx-xxxx-xxxxxxxxxxxx'

resource adminRoleSreGroupAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(subscription().id, sreGroupObjectId, grafanaAdminRole.id)
  properties: {
    roleDefinitionId: grafanaAdminRole.id
    principalId: sreGroupObjectId
    principalType: 'Group'
  }
  dependsOn: [
    grafana
  ]
}

resource monitoringRoleSubscription 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(subscription().id, monitoringReaderRole.id)
  properties: {
    roleDefinitionId: monitoringReaderRole.id
    principalId: grafana.identity.principalId
    principalType: 'ServicePrincipal'
  }
  dependsOn: [
    grafana
  ]
}

Azure Container Apps (ACA) for Prometheus

Azure Container Apps is a serverless solution to run containers on Azure. Under the hood, it runs containers on top of Kubernetes; this is completely managed by Azure, and the Kubernetes API (and any associated complexity) is never exposed to the user. Plus, it can scale from 0 to 30 replicas (0 is free!), and you can attach volumes via a File Share mounted on a Storage Account (a necessary option to run Prometheus, as it is a time series database). We chose this service for its simplicity and flexibility, at a cost of around 20€/month when running 24/7.

Provisioning Prometheus on ACA using Bicep

We start by defining a Storage Account that will be used to mount a volume on the container, initially allowing connections from outside.

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-09-01' = {
  name: storageAccountName
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_ZRS'
  }
  properties: {
    supportsHttpsTrafficOnly: true
    minimumTlsVersion: 'TLS1_2'
    networkAcls: {
      defaultAction: 'Allow'
    }
    largeFileSharesState: 'Enabled'
  }
}

And then the related File Share using the SMB protocol.

resource fileServices 'Microsoft.Storage/storageAccounts/fileServices@2022-09-01' = {
  name: 'default'
  parent: storageAccount
}

resource fileShare 'Microsoft.Storage/storageAccounts/fileServices/shares@2022-09-01' = {
  name: name
  parent: fileServices
  properties: {
    accessTier: 'TransactionOptimized'
    shareQuota: 2
    enabledProtocols: 'SMB'
  }
}
Selecting the right `accessTier` here is important: we initially chose the `Hot` option, but it was an expensive choice with no performance gain. `TransactionOptimized` is much cheaper and better suited to Prometheus’s workload.

This File Share resource will be mounted on the container, so the local environment for Prometheus must be prepared by provisioning two folders: `data` and `config`. In my case, the latter must contain the Prometheus configuration file, named `prometheus.yml`.

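For illustration, a minimal `prometheus.yml` that could live in the `config` folder might look like this (the job name, scrape interval, path, and target host are assumptions, not values from our setup):

```yaml
global:
  scrape_interval: 60s

scrape_configs:
  # Scrape a custom exporter endpoint; host and path are placeholders
  - job_name: 'monitoring-function'
    metrics_path: /api/exporter
    scheme: https
    static_configs:
      - targets: ['monitoring-func.example.net']
```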
The `data` folder is used to store the time series database. In our Bicep file, we launch a shell script through a Bicep Deployment Script to ensure these prerequisites exist at each pipeline run. And finally, here is the container app with its accessory resources – the environment and a Log Analytics workspace.

resource logAnalytics 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
  name: containerAppLogAnalyticsName
  location: resourceGroup().location
  properties: {
    sku: {
      name: 'PerGB2018'
    }
  }
}

var vnetConfiguration = {
  internal: false
  infrastructureSubnetId: subnetId
}

resource containerAppEnv 'Microsoft.App/managedEnvironments@2022-10-01' = {
  name: containerAppEnvName
  location: resourceGroup().location
  sku: {
    name: 'Consumption'
  }
  properties: {
    appLogsConfiguration: {
      destination: 'log-analytics'
      logAnalyticsConfiguration: {
        customerId: logAnalytics.properties.customerId
        sharedKey: logAnalytics.listKeys().primarySharedKey
      }
    }
    vnetConfiguration: vnetConfiguration
  }
}

resource permanentStorageMount 'Microsoft.App/managedEnvironments/storages@2022-10-01' = {
  name: storageAccountName
  parent: containerAppEnv
  properties: {
    azureFile: {
      accountName: storageAccountName
      accountKey: storageAccountKey
      shareName: fileShareName
      accessMode: 'ReadWrite'
    }
  }
}

resource containerApp 'Microsoft.App/containerApps@2022-10-01' = {
  name: containerAppName
  location: resourceGroup().location
  properties: {
    managedEnvironmentId: containerAppEnv.id
    configuration: {
      ingress: {
        external: true
        targetPort: 9090
        allowInsecure: false
        ipSecurityRestrictions: [for (ip, index) in ipAllowRules: {
          action: 'Allow'
          description: 'Allow access'
          name: 'GrantRule${index}'
          ipAddressRange: '${ip}/32'
        }]
        traffic: [
          {
            latestRevision: true
            weight: 100
          }
        ]
      }
    }
    template: {
      revisionSuffix: toLower(utcNow())
      containers: [
        {
          name: 'prometheus'
          image: 'prom/prometheus:latest'
          probes: [
            {
              type: 'liveness'
              httpGet: {
                path: '/-/healthy'
                port: 9090
                scheme: 'HTTP'
              }
              periodSeconds: 120
              timeoutSeconds: 5
              initialDelaySeconds: 10
              successThreshold: 1
              failureThreshold: 3
            }
            {
              type: 'readiness'
              httpGet: {
                path: '/-/ready'
                port: 9090
                scheme: 'HTTP'
              }
              periodSeconds: 120
              timeoutSeconds: 5
              initialDelaySeconds: 10
              successThreshold: 1
              failureThreshold: 3
            }
          ]
          resources: {
            cpu: json('0.75')
            memory: '1.5Gi'
          }
          command: [
            // Prometheus launch arguments (elided)
          ]
          volumeMounts: [
            {
              mountPath: '/prometheus'
              volumeName: 'azurefilemount'
            }
          ]
        }
      ]
      volumes: [
        {
          name: 'azurefilemount'
          storageType: 'AzureFile'
          storageName: storageAccountName
        }
      ]
      scale: {
        minReplicas: 1
        maxReplicas: 1
      }
    }
  }
  dependsOn: [
    permanentStorageMount
  ]
}

The above script is a bit long but hopefully still easy to understand. (If not, check out the documentation.) However, some details are worth highlighting and explaining.

Since we don’t need to browse the Prometheus dashboard publicly, the ACA firewall is configured to block traffic from anything except Grafana, which uses the two static outbound IPs (passed via the `ipAllowRules` parameter).

To achieve this result, ingress must be enabled (`external` set to `true`). The same isolation should be applied to the underlying Storage Account. However, at the time of writing, direct isolation between ACA and Storage Accounts is not supported; the Storage Account firewall can make exceptions only for a few sources, such as VNets.

For this reason, we had to place the ACA environment in a VNet (unfortunately, a /23 subnet is a requirement). Due to a Bicep bug, the network rule will not work if defined in the first Storage Account definition shown above; it must instead be repeated at the end of the Bicep script.

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-09-01' = {
  name: storageAccountName
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: storageAccountSku
  }
  properties: {
    supportsHttpsTrafficOnly: true
    minimumTlsVersion: 'TLS1_2'
    networkAcls: {
      defaultAction: 'Deny'
      bypass: 'AzureServices'
      virtualNetworkRules: [for i in range(0, length(storageAccountAllowedSubnets)): {
        id: virtualNetworkSubnets[i].id
      }]
    }
    largeFileSharesState: 'Enabled'
  }
}
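For reference, the dedicated subnet can be sketched like this (the VNet name and address ranges are assumptions; only the /23 size is an actual requirement):

```bicep
resource vnet 'Microsoft.Network/virtualNetworks@2022-09-01' = {
  name: 'vnet-monitoring'
  location: resourceGroup().location
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.10.0.0/16'
      ]
    }
    subnets: [
      {
        // ACA requires a dedicated subnet of at least /23 size
        name: 'snet-aca'
        properties: {
          addressPrefix: '10.10.0.0/23'
        }
      }
    ]
  }
}
```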

Wrapping up the monitoring solution

What has been described so far can be represented diagrammatically in this way:

[Diagram: the monitoring solution – Grafana, Prometheus on Azure Container Apps, and Azure Monitor]

  • The SRE AD group and the related Managed Identity used for automation can access Grafana through AAD
  • Grafana can access Azure Monitor using its Managed Identity: metrics, logs, and resource graphs can be queried using Kusto
  • The Grafana IPs are allowed to connect to the Prometheus ACA, hosted in a custom VNet
  • A File Share mounted on a Storage Account is used as the volume to run the Prometheus container
  • Potentially, Prometheus can be scaled up and down – so far, we haven’t needed to do this

At this point a question arises: how can Prometheus access the Bank Platform – hosted in a closed VNet – to collect aggregate data?

This can be addressed with a different serverless solution: Azure Functions.

Monitoring Function App

An Azure Function app hosted in the Bank Platform VNet is able to collect data from all the production components around the platform: it has access to Azure SQL Database, Cosmos DB, Azure Batch, Service Bus, and even the Kubernetes cluster.

It would be possible to query this data using a combination of tools such as the Azure SDKs, so why did we choose a Function app? Because it can expose REST APIs to the scheduled Prometheus jobs, and when those jobs are paused or stopped, a Function app costs nothing, being a serverless solution. Moreover, we could configure the Function app to accept connections only from a specific VNet – the Prometheus VNet in this case.

The complete diagram then appears this way:

[Diagram: the complete monitoring solution, including the Monitoring Function app inside the Bank Platform VNet]

In our case, the Monitoring Function app runs on .NET 7 using the new isolated worker process.

In `Program.cs`, create and run the host.

var host = Host.CreateDefaultBuilder()
    .ConfigureAppConfiguration((ctx, builder) =>
    {
        if (ctx.HostingEnvironment.IsDevelopment())
            builder.AddUserSecrets(Assembly.GetExecutingAssembly(), optional: false);

        // On .NET 7, this is fast enough to be used
        var configuration = builder.Build()!;

        // This logger is useful when diagnosing startup issues on the Azure Portal
        var logger = LoggerFactory.Create(config => config.AddConsole())
            .CreateLogger("Startup");

        logger.LogInformation("Environment: {env}", ctx.HostingEnvironment.EnvironmentName);

        builder.AddAzureAppConfiguration(options =>
        {
            options.ConfigureRefresh(opt =>
            {
                // Auto app settings refresh
            });

            options.ConfigureKeyVault(opt =>
            {
                // KeyVault integration
            });
        }, false);
    })
    .ConfigureServices((ctx, services) =>
    {
        // Register services in IoC container
    })
    .ConfigureFunctionsWorkerDefaults((ctx, builder) =>
    {
        if (ctx.HostingEnvironment.IsDevelopment())
        {
            // Development-only worker setup
        }

        string? connectionString = ctx.Configuration.GetConnectionString("ApplicationInsights:SRE");
        if (string.IsNullOrWhiteSpace(connectionString))
            throw new InvalidOperationException("Missing Application Insights connection string");

        builder.Services.AddLogging(logging => logging
            .AddApplicationInsights(
                config => config.ConnectionString = connectionString,
                options => { }));
    })
    .Build();

await host.RunAsync();

Each function namespace represents a different monitoring context; for example, we have a namespace dedicated to Azure Service Bus, another to Azure Batch, and so on. Each namespace provides an extension method that registers all the services it needs into the `IServiceCollection`. These extension methods are called from `ConfigureServices`.

Monitoring Examples

Before concluding, I want to provide some real usage examples.

Application Insights Availability Integration

Ping availability tests provided by Application Insights (AI) cost 0.0006€ per test. However, you can ping your services with custom code and send the results to Application Insights using the official SDK for free.

Here is the code of the Availability section of the Monitoring Function app.

    // _telemetryClient is the Application Insights TelemetryClient injected into this class
    private async Task PingRegionAsync(
        string url,
        string testName)
    {
        const string LOCATION = "westeurope";

        string operationId = Guid.NewGuid().ToString("N");

        var availabilityTelemetry = new AvailabilityTelemetry
        {
            Id = operationId,
            Name = testName,
            RunLocation = LOCATION,
            Success = false,
            Timestamp = DateTime.UtcNow,
        };

        // Not ideal, but we just need an estimation
        var stopwatch = Stopwatch.StartNew();

        try
        {
            await ExecuteTestAsync(url);
            availabilityTelemetry.Success = true;
        }
        catch (Exception ex)
        {
            if (ex is HttpRequestException reqEx && reqEx.StatusCode == System.Net.HttpStatusCode.NotFound)
                _logger.LogError(reqEx, "Probably a route is missing");

            HandleError(availabilityTelemetry, ex);
        }
        finally
        {
            availabilityTelemetry.Duration = stopwatch.Elapsed;
            _telemetryClient.TrackAvailability(availabilityTelemetry);
        }
    }

    private async Task ExecuteTestAsync(string url)
    {
        using var cancelAfterDelay = new CancellationTokenSource(TimeSpan.FromSeconds(20));

        string response;
        try
        {
            response = await _httpClient.GetStringAsync(url, cancelAfterDelay.Token);
        }
        catch (OperationCanceledException)
        {
            throw new TimeoutException();
        }

        switch (response.ToLowerInvariant())
        {
            case "healthy":
                break;
            default:
                _logger.LogCritical("Something is wrong");
                throw new Exception("Unknown error");
        }
    }

    private void HandleError(AvailabilityTelemetry availabilityTelemetry, Exception ex)
    {
        availabilityTelemetry.Message = ex.Message;

        var exceptionTelemetry = new ExceptionTelemetry(ex);
        exceptionTelemetry.Context.Operation.Id = availabilityTelemetry.Id;
        exceptionTelemetry.Properties.Add("TestName", availabilityTelemetry.Name);
        exceptionTelemetry.Properties.Add("TestLocation", availabilityTelemetry.RunLocation);
        _telemetryClient.TrackException(exceptionTelemetry);
    }
Initially, an `AvailabilityTelemetry` object is created and set up. Then, the ping operation is performed; depending on the result, different information is stored in AI using the SDK’s objects.

Note that the `stopwatch` object is not accurate, but it is enough for our use case.

Card Top up Critical User Journey

This is an example of a Critical User Journey (CUJ) where a user wants to top up their bank account through an external service. Behind the scenes, a third-party service notifies Flowe about the top up via REST APIs. Since Grafana has access to Azure Monitor, displaying the count of callbacks received is simply a matter of running a Kusto query against the Application Insights resource.

requests
| where (url == "")
| where timestamp >= $__timeFrom and timestamp < $__timeTo
| summarize Total = count()

[Screenshot: Grafana panel showing the total count of callbacks received]

It is also possible to display the same data as a time series chart but group callbacks by their status code.

requests
| where (url == "")
| where timestamp >= $__timeFrom and timestamp < $__timeTo
| summarize response = dcount(id) by resultCode, bin(timestamp, 1m)
| order by timestamp asc

[Screenshot: time series chart of callbacks grouped by status code]

After the callback is received, a Flowe internal asynchronous flow is triggered to let microservices communicate with each other through integration events.

To complete this CUJ, the same Grafana dashboard shows the number of items that ended up in a DLQ due to failures. Azure Monitor does not expose this kind of data directly, so custom code had to be written. The Monitoring Function app exposes an endpoint that returns aggregate data about the items stuck in the Azure Service Bus DLQs.

    // Function attribute, gauge name, and ProduceGauge arguments are reconstructed;
    // PrometheusFactory and PrometheusHelper are our custom helpers
    [Function("DlqExporter")]
    public async Task<HttpResponseData> RunAsync(
        [HttpTrigger(
            AuthorizationLevel.Function,
            methods: "GET",
            Route = "exporter"
        )] HttpRequestData req)
    {
        return await ProcessSubscriptionsAsync(req);
    }

    private async Task<HttpResponseData> ProcessSubscriptionsAsync(HttpRequestData req)
    {
        var registry = Metrics.NewCustomRegistry();
        _gauge = PrometheusFactory.ProduceGauge(
            registry,
            "dlq_messages",
            "Number of DLQs grouped by subscription and subject",
            labelNames: new[] { "type", "topic", "subscription", "subject" });

        foreach (Topic topic in _sbOptions.serviceBus.Topics!)
            foreach (var subscription in topic.Subscriptions!)
                await ProcessSubscriptionDlqs(_sbOptions.serviceBus.Name!, topic.Name!, subscription, _gauge);

        return await GenerateResponseAsync(req, registry);
    }

    private async Task ProcessSubscriptionDlqs(string serviceBus, string topic, string subscription, Gauge gauge)
    {
        var stats = await _serviceBusService.GetDeadLetterMessagesRecordsAsync(serviceBus, topic, subscription);

        var groupedStats = stats
            .GroupBy(x => x.Subject, (key, group) => new { Subject = key, Count = group.Count() });

        foreach (var stat in groupedStats)
            gauge
                .WithLabels("dlqs", topic, subscription, stat.Subject)
                .Set(stat.Count);
    }

    private static async Task<HttpResponseData> GenerateResponseAsync(HttpRequestData req, CollectorRegistry registry)
    {
        var result = await PrometheusHelper.GenerateResponseFromRegistryAsync(registry);

        var response = req.CreateResponse(HttpStatusCode.OK);
        response.Headers.Add("Content-Type", $"{MediaTypeNames.Text.Plain}; charset=utf-8");
        await response.WriteStringAsync(result);

        return response;
    }

In this case, the Function app doesn’t query the Azure Service Bus instance directly; the instance is instead wrapped by another custom service, accessed through `_serviceBusService`.

P.S. We are working to publish this service on GitHub!

Once the data is returned to Prometheus, Grafana can show it using PromQL.
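For instance, assuming the exporter publishes a gauge named `dlq_messages` (an illustrative name) with a `topic` label, a panel could sum the stuck items per topic with a query like:

```promql
sum by (topic) (dlq_messages)
```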


[Screenshot: Grafana panel of DLQ items collected via Prometheus]

OnBoarding Critical User Journey

The “OnBoarding” CUJ starts when customers open the app for the first time and finishes when the customer successfully opens a new bank account. It is a complicated journey because the user’s digital identity must be confirmed, and personal data is processed by Know Your Customer (KYC) and anti-money laundering services. To complete the process, a lot of third parties are involved.

Here, I want to share a piece of this CUJ dashboard where, among other data, a sequence of APIs is monitored and a funnel is built on top of them.

[Screenshot: OnBoarding CUJ dashboard with the API funnel]


The queries used to build this dashboard are similar to those described above.


Critical User Journeys are an effective way to decide what metrics to monitor and to guide their collection. Bringing these metrics together in tools like Prometheus and Grafana simplifies the way SREs and architects share responsibility for overseeing operations. Custom code may be needed to collect certain metrics for different CUJs, but all teams benefit from the resulting simplicity in monitoring the overall workflow.

Usually, Prometheus and Grafana are used to monitor Kubernetes and application metrics such as average API response time and CPU usage.

Instead, this post shows how to calculate and display CUJs over aggregated data, necessarily collected using custom code.

The architectural choices point to a cost-effective solution in Azure, but keep in mind the deployment simplicity and the requirements an organization may have (such as SSO and security concerns). Plus, the need for maintenance is almost eliminated.

Note: All the numbers and names in the screenshots shown in this article were taken from test environments (and sometimes mixed together). The charts above expose no real data or real flows.