Domain-Driven Cloud: Aligning your Cloud Architecture to your Business Model

Key Takeaways

  • Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on the bounded contexts of your business model. DDC extends the principles of Domain-Driven Design (DDD) beyond traditional software systems to create a unifying architecture approach across business domains, software systems and cloud infrastructure.
  • DDC creates a cloud architecture that evolves as your business changes, improves team autonomy and promotes low coupling between distributed workloads. DDC simplifies security, governance and cost management in a way that promotes transparency within your organization.
  • In practice, DDC aligns your bounded contexts with AWS Organizational Units (OU’s) and Azure Management Groups (MG’s). Bounded contexts are categorized as domain contexts based on your business model and supporting technical contexts. DDC gives you freedom to implement different AWS Account or Azure Subscription taxonomies while still aligning to your business model.
  • DDC uses inheritance to enforce policies and controls downward while reporting costs and compliance upwards. Using DDC makes it automatically transparent how your cloud costs align to your business model without implementing complex reports and error-prone tagging requirements.
  • DDC aligns with established AWS and Azure well-architected best practices. You can implement DDC in 5 basic steps, whether you are starting a new migration (greenfield) or upgrading your existing cloud architecture (brownfield).

Domain-Driven Cloud (DDC) is an approach for creating your organization’s cloud architecture based on your business model. DDC uses the bounded contexts of your business model as inputs and outputs a flexible cloud architecture to support all of the workloads in your organization and evolve as your business changes. DDC promotes team autonomy by giving teams the ability to innovate within guardrails. Operationally, DDC simplifies security, governance, integration and cost management in a way that promotes transparency for IT and business stakeholders alike.

Based on Domain-Driven Design (DDD) and the architecture principle of high cohesion and low coupling, this article introduces DDC including the technical and human benefits of aligning your cloud architecture to the bounded contexts in your business model. You will learn how DDC can be implemented in cloud platforms including Amazon Web Services (AWS) and Microsoft Azure while aligning with their well-architected frameworks. Using illustrative examples from one of our real customers, you will learn the 5 steps to implementing DDC in your organization.

What is Domain-Driven Cloud (DDC)?

DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure.  

Our customers perpetually strive to align “people, process and technology” together so they can work in harmony to deliver business outcomes. However, in practice, this often falls down as the Business (Biz), IT Development (Dev) and IT Operations (Ops) all go to their separate corners to design solutions for complex problems that actually span all three.

What emerges are business process redesigns, enterprise architectures and cloud platform architectures, all designed and implemented by different groups using different approaches and localized languages.

What’s missing is a unified architecture approach using a shared language that integrates BizDevOps. This is where DDC steps in, with a specific focus on aligning the cloud architecture and software systems that run on them to the bounded contexts of your business model, identified using DDD. Figure 1 illustrates how DDC extends the principles of DDD to include cloud infrastructure architecture and in doing so creates a unified architecture that aligns BizDevOps.

In DDC, the most important cloud services are AWS Organizational Units (OU’s) that contain Accounts and Azure Management Groups (MG’s) that contain Subscriptions. Because 100% of the cloud resources you secure, use and pay for are connected to Accounts and Subscriptions, these are the natural cost and security containers. By enabling management and security at the higher OU/MG level and anchoring these on the bounded contexts of your business model, you can now create a unifying architecture spanning Biz, Dev and Ops. You can do this while giving your teams flexibility in how they use Accounts and Subscriptions to meet specific requirements.

Why align your Cloud Architecture with your Business Model?

The benefits of aligning your cloud architecture to your organization’s business model include:

  • Evolves with your Business – Businesses are not static and neither is your cloud architecture. As markets change and your business evolves, new contexts may emerge and others may consolidate or fade away. Some contexts that historically were strategic differentiators may drive less business value today. The direct alignment of your cloud management, security and costs to bounded contexts means your cloud architecture evolves with your business.
  • Improves Team Autonomy – While some cloud management tasks must be centralized, DDC recommends giving teams autonomy within their domain contexts for things like provisioning infrastructure and deploying applications. This enables innovation within guardrails so your agile teams can go faster and be more responsive to changes as your business grows. It also ensures dependencies between workloads in different contexts are explicit with the goal of promoting a loosely-coupled architecture aligned to empowered teams.
  • Promotes High Cohesion and Low Coupling – Aligning your networks to bounded contexts enables you to explicitly allow or deny network connectivity between all contexts. This is extraordinarily powerful, especially for enforcing low coupling across your cloud platform and avoiding a modern architecture that looks like a bowl of spaghetti. Within a context, teams and workloads ideally have high cohesion with respect to security, network integration and alignment on supporting a specific part of your business. You also have freedom to make availability and resiliency decisions at both the bounded context and workload levels.
  • Increases Cost Transparency – By aligning your bounded contexts to OU’s and MG’s, all cloud resource usage, budgets and costs are precisely tracked at a granular level. Then they are automatically summarized at the bounded contexts without custom reports and nagging all your engineers to tag everything! With DDC you can look at your monthly cloud bill and know the exact cloud spend for each of your bounded contexts, enabling you to assess whether these costs are commensurate with each context’s business value. Cloud budgets and alarms can be delegated to context-aligned teams enabling them to monitor and optimize their spend while your organization has a clear top-down view of overall cloud costs.
  • Domain-Aligned Security – Security policies, controls, identity and access management all line up nicely with bounded contexts. Some policies and controls can be deployed across-the-board to all contexts to create a strong security baseline. From here, selected controls can be safely delegated to teams for self-management while still enforcing enterprise security standards.
  • Repeatable with Code Templates – Both AWS and Azure provide ways to provision new Accounts or Subscriptions consistently from a code-based blueprint. In DDC, we recommend defining one template for all domain contexts, then using this template (plus configurable input parameters) to provision and configure new OU’s and Accounts or MG’s and Subscriptions as needed. These management constructs are free (you only pay for the actual resources used within them), enabling you to build out your cloud architecture incrementally yet towards a defined future-state, without incurring additional cloud costs along the way.

DDC may not be the best approach in all situations. Alternatives such as organizing your cloud architecture by tenant/customer (SaaS) or legal entity are viable options, too.

Unfortunately, we often see customers default to organizing their cloud architecture by their current org structure, following Conway’s Law from the 1960’s. We think this is a mistake and that DDC is a better alternative for one simple reason: your business model is more stable than your org structure.

One of the core tenets of good architecture is that we don’t have more stable components depending on less stable components (aka the Stable Dependencies Principle). Organizations, especially large ones, like to reorganize often, making their org structure less stable than their business model. Basing your cloud architecture on your org structure means that every time you reorganize, your cloud architecture is directly impacted, which may affect all the workloads running in your cloud environment. Why do this? Basing your cloud architecture on your organization’s business model enables it to evolve naturally as your business strategy evolves, as seen in Figure 2.

We recognize that, as Ruth Malan states, “If the architecture of the system and the architecture of the organization are at odds, the architecture of the organization wins”. We also acknowledge there is work to do with how OU’s/MG’s and all the workloads within them best align to team boundaries and responsibilities. We think ideas like Team Topologies may help here.

We are seeing today’s organizations move away from siloed departmental projects within formal communications structures to cross-functional teams creating products and services that span organizational boundaries. These modern solutions run in the cloud, so we feel the time is right for evolving your enterprise architecture in a way that unifies Biz, Dev and Ops using a shared language and architecture approach.

What about Well-Architected frameworks?

Both AWS’s Well-Architected framework and Azure’s Well-Architected framework provide a curated set of design principles and best practices for designing and operating systems in your cloud environments. DDC fully embraces these frameworks and at SingleStone we use these with our customers. While these frameworks provide specific recommendations and benefits for organizing your workloads into multiple Accounts or Subscriptions, managed with OU’s and MG’s, they leave it to you to figure out the best taxonomy for your organization.

DDC is opinionated on basing your cloud architecture on your bounded contexts, while being 100% compatible with models like AWS’s Separated AEO/IEO and design principles like “Perform operations as code” and “Automatically recover from failure”. You can adopt DDC and apply these best practices, too. Tools such as AWS Landing Zone and Azure Landing Zones can accelerate the setup of your cloud architecture while also being domain-driven.

5 Steps for Implementing Domain-Driven Cloud

Do you think a unified architecture using a shared language across BizDevOps might benefit your organization? While a comprehensive list of all tasks is beyond the scope of this article, here are the five basic steps you can follow, with illustrations from one of our customers who recently migrated to Azure.

Step 1: Start with Bounded Contexts

The starting point for implementing DDC is a set of bounded contexts that describes your business model. The steps to identify your bounded contexts are not covered here, but the process described in Domain-Driven Discovery is one approach.

Once you identify your bounded contexts, organize them into two groups:

  • Domain contexts are directly aligned to your business model.
  • Technical contexts support all domain contexts with shared infrastructure and services

To illustrate, let’s look at our customer who is a medical supply company. Their domain and technical contexts are shown in Figure 3.

Your organization’s domain contexts would be different, of course.

For technical contexts, the number will depend on factors including your organization’s industry, complexity, regulatory and security requirements. A Fortune 100 financial services firm will have more technical contexts than a new media start-up. With that said, as a starting point DDC recommends six technical contexts for supporting all your systems and data.

  • Cloud Management – Context for the configuration and management of your cloud platform including OU/MG’s, Accounts/Subscriptions, cloud budgets and cloud controls.
  • Security – Context for identity and access management, secrets management and other shared security services used by any workload.
  • Network – Context for all centralized networking services including subnets, firewalls, traffic management and on-premise network connectivity.
  • Compliance – Context for any compliance-related services and data storage that supports regulatory, audit and forensic activities.
  • Platform Services – Context for common development and operations services including CI/CD, package management, observability, logging, compute and storage.
  • Analytics – Context for enterprise data warehouses, governance, reporting and dashboards.

You don’t have to create all of these up-front; start with Cloud Management and build out the rest as needed.

Step 2: Build a Solid Foundation

With your bounded contexts defined, it’s now time to build a secure cloud foundation for supporting your organization’s workloads today and in the future. In our experience, we have found it is helpful to organize your cloud capabilities into three layers based on how they support your workloads. For our medical supply customer, Figure 4 shows their contexts aligned to the Application, Platform and Foundation layers of their cloud architecture.

With DDC, you align AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) to bounded contexts. By align, we mean you name them after your bounded contexts. These are the highest levels of management and through the use of inheritance they give you the ability to standardize controls and settings across your entire cloud architecture.
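
If you are on AWS, this naming can be scripted. The sketch below (Kotlin, using the AWS SDK for Java v2) is one illustrative way to create an OU per bounded context under the organization root; the context names come from the medical supply example, and Accounts would still be created separately (for example, through AWS Control Tower's Account Factory) and placed under each OU. Treat it as a sketch under those assumptions, not a complete provisioning template.

// Sketch only: assumes the AWS SDK for Java v2 "organizations" module is on the classpath
// and credentials with organizations:CreateOrganizationalUnit permissions are configured.
import software.amazon.awssdk.services.organizations.OrganizationsClient
import software.amazon.awssdk.services.organizations.model.CreateOrganizationalUnitRequest

fun main() {
    // Bounded contexts from the example; replace with the contexts of your own business model.
    val boundedContexts = listOf("Orders", "Distributors", "Payers")

    OrganizationsClient.create().use { orgs ->
        val rootId = orgs.listRoots().roots().first().id() // parent of the top-level OUs

        boundedContexts.forEach { context ->
            val ou = orgs.createOrganizationalUnit(
                CreateOrganizationalUnitRequest.builder()
                    .parentId(rootId)
                    .name(context) // the OU is named after the bounded context
                    .build()
            ).organizationalUnit()
            println("Created OU ${ou.name()} (${ou.id()})")
        }
    }
}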

DDC gives you flexibility in how best to organize your Accounts and Subscription taxonomy, from coarse-grained to fine-grained, as seen in Figure 5.

DDC recommends starting with one OU/MG and at least two Accounts/Subscriptions per bounded context. If your organization has higher workload isolation requirements, DDC can support this too, as seen in Figure 5.

For our customer who had a small cloud team new to Azure, separate Subscriptions for Prod and NonProd for each context made sense as a starting point, as shown in Figure 6.

Figure 7 shows what this would look like in AWS.

For our customer, additional environments like Dev, Test and Stage could be created within their respective Prod and Non-Prod Subscriptions. This gives them isolation between environments with the ability to configure environment-specific settings at the Subscription or lower levels. They also decided to build just the Prod Subscriptions for the six technical contexts to keep things simple to start. Again, if your organization wanted to create separate Accounts or Subscriptions for every workload environment, this can be done too while still aligning with DDC.

From a governance perspective, in DDC we recommend domain contexts inherit security controls and configurations from technical contexts. Creating a strong security posture in your technical contexts enables all your workloads that run in domain contexts to inherit this security by default. Domain contexts can then override selected controls and settings on a case-by-case basis balancing team autonomy and flexibility with required security guardrails.

Using DDC, your organization can grant autonomy to teams to enable innovation within guardrails. Leveraging key concepts from team topologies, stream-aligned teams can be self-sufficient within domain contexts when creating cloud infrastructure, deploying releases and monitoring their workloads. Platform teams, primarily working in technical contexts, can focus on designing and running highly-available services used by the stream-aligned teams. These teams work together to create the right balance between centralization and decentralization of cloud controls to meet your organization’s security and risk requirements, as shown in Figure 8.

As this figure shows, policies and controls defined at higher level OU’s/MG’s are enforced downwards while costs and compliance are reported upwards. For our medical supply customer, this means their monthly Azure bill is automatically itemized by their bounded contexts with summarized cloud costs for Orders, Distributors and Payers to name a few.

This makes it easy for their CTO to share cloud costs with their business counterparts and establish realistic budgets that can be monitored over time. Just like costs, policy compliance across all contexts can be reported upwards with evidence stored in the Compliance technical context for auditing or forensic purposes. Services such as Azure Policy and AWS Audit Manager are helpful for continually maintaining compliance across your cloud environments by organizing your policies and controls in one place for management.
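
To make "costs report upwards" concrete on the AWS side, the sketch below (Kotlin, AWS SDK for Java v2) sums one bounded context's monthly spend by grouping Cost Explorer results by linked account and keeping only the Accounts under that context's OU. The OU id, date range and metric are placeholders; on Azure, the equivalent is scoping a Cost Management query to the Management Group.

// Sketch only: assumes the AWS SDK for Java v2 "organizations" and "costexplorer" modules;
// pagination is omitted for brevity.
import software.amazon.awssdk.services.costexplorer.CostExplorerClient
import software.amazon.awssdk.services.costexplorer.model.DateInterval
import software.amazon.awssdk.services.costexplorer.model.GetCostAndUsageRequest
import software.amazon.awssdk.services.costexplorer.model.Granularity
import software.amazon.awssdk.services.costexplorer.model.GroupDefinition
import software.amazon.awssdk.services.costexplorer.model.GroupDefinitionType
import software.amazon.awssdk.services.organizations.OrganizationsClient
import software.amazon.awssdk.services.organizations.model.ListAccountsForParentRequest

// Spend for one bounded context (e.g. the Orders OU) over a date range such as "2023-09-01" to "2023-10-01".
fun boundedContextSpend(ouId: String, start: String, end: String): Double {
    val accountIds = OrganizationsClient.create().use { orgs ->
        orgs.listAccountsForParent(
            ListAccountsForParentRequest.builder().parentId(ouId).build()
        ).accounts().map { it.id() }
    }

    return CostExplorerClient.create().use { ce ->
        ce.getCostAndUsage(
            GetCostAndUsageRequest.builder()
                .timePeriod(DateInterval.builder().start(start).end(end).build())
                .granularity(Granularity.MONTHLY)
                .metrics("UnblendedCost")
                .groupBy(GroupDefinition.builder()
                    .type(GroupDefinitionType.DIMENSION)
                    .key("LINKED_ACCOUNT")
                    .build())
                .build()
        ).resultsByTime()
            .flatMap { it.groups() }
            .filter { it.keys().first() in accountIds } // keep only this context's accounts
            .sumOf { it.metrics()["UnblendedCost"]!!.amount().toDouble() }
    }
}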

Step 3: Align Workloads to Bounded Contexts

With a solid foundation and our bounded contexts identified, the next step is to align your workloads to the bounded contexts. Identifying all the workloads that will run in your cloud environment is often done during a cloud migration discovery, aided in part by a configuration management database (CMDB) that contains your organization’s portfolio of applications.

When aligning workloads to bounded contexts we prefer a workshop approach that promotes discussion and collaboration. In our experience this makes DDC understandable and relatable by the teams involved in migration. Because teams must develop and support these workloads, the workshop also highlights where organizational structures may align (or not) to bounded contexts. This workshop (or a follow-up one) can also identify which applications should be independently deployable and how the team’s ownership boundaries map to bounded contexts.

For our medical supply customer, this workshop revealed that a shared CI/CD tool in the Shared Services context needed permissions to deploy a new version of their Order Management system in the Orders context. This drove a discussion on how secrets and permissions would be managed across contexts, identifying new capabilities needed for secrets management that were prioritized during cloud migration. By creating a reusable solution that worked for all future workloads in domain contexts, the cloud team created a new capability that improved the speed of future migrations.

Figure 9 summarizes how our customer aligned their workloads to bounded contexts, which are aligned to their Azure Management Groups.

Within the Order context, our customer used Azure Resource Groups for independently deployable applications or services that contain Azure Resources, as shown in Figure 10.

This design served as a starting point for their initial migration of applications running in a data center to Azure. Over the next few years, their goal was to refactor these applications into multiple independent microservices. When that time came, they could do this incrementally, one application at a time, by creating additional Resource Groups for each service.

If our customer were using AWS, Figure 10 would look very similar but use Organizational Units, Accounts and AWS CloudFormation Stacks for organizing independently deployable applications or services that contain resources. One difference between the cloud providers is that AWS allows nested stacks (stacks within stacks) whereas Azure Resource Groups cannot be nested.

For networking, in order for workloads running in domain contexts to access shared services in technical contexts, their networks must be connected or permissions explicitly enabled to allow access. While the Network technical context contains centralized networking services, by default each Account or Subscription aligned to a domain context will have its own private network containing subnets that are independently created, maintained and used by the workloads running inside them.

Depending on the total number of Accounts or Subscriptions, this may be desired or it may be too many separate networks to manage (each potentially has its own IP range). Alternatively, core networks can be defined in the Network context and shared with specific domain or technical contexts, thereby avoiding every context having its own private network. The details of cloud networking are beyond the scope of this article, but DDC enables multiple networking options while still aligning your cloud architecture to your business model. Bottom line: you don’t have to sacrifice network security to adopt DDC.

Step 4: Migrate Workloads

Now that we had identified where each workload would run, it was time to begin moving them into the right Account or Subscription. While this was a new migration for our customer (greenfield), for your organization this may involve re-architecting your existing cloud platform (brownfield). Migrating a portfolio of workloads to AWS or Azure and the steps for architecting your cloud platform are beyond the scope of this article, but with respect to DDC, here is a checklist of the key things to keep in mind:

  • Name your AWS Organizational Units (OU’s) or Azure Management Groups (MG’s) after your bounded contexts.
  • Organize your contexts into domain and technical groupings, with:
    • Technical contexts as the foundation and platform layers of your cloud architecture.
    • Domain contexts as the application layer of your cloud architecture.
  • Centralize common controls in technical contexts for a strong security posture.
  • Decentralize selected controls in domain contexts to promote team autonomy, speed and agility.
  • Use inheritance within OU’s or MG’s for enforcing policies and controls downward while reporting cost and compliance upwards.
  • Decide on your Account / Subscription taxonomy within the OU’s / MG’s, balancing workload isolation with management complexity.
  • Decide how your networks will map to domain and technical contexts, balancing centralization versus decentralization.
  • Create domain context templates for consistency and use these when provisioning new Accounts / Subscriptions.

For brownfield deployments of DDC that are starting with an existing cloud architecture, the basic recipe is:

  1. Create new OU’s / MG’s named after your bounded contexts. For a period of time these will live side-by-side with your existing OU’s / MG’s and should have no impact on current operations.
  2. Implement policies and controls within the new OU’s / MG’s for your technical contexts, using inheritance as appropriate.
  3. Create a common code template for all domain contexts that inherits policies and controls from your technical contexts. Use parameters for anything that’s different between contexts.
  4. Based on the output of your workloads mapping workshop, for each workload either:
    • a.  Create a new Account / Subscription using the common template, aligned with your desired account taxonomy, for holding the workload or
    • b.  Migrate an existing Account / Subscription, including all workloads and resources within it, to the new OU / MG (see the sketch after this list). When migrating, pay careful attention to controls from the originating OU / MG to ensure they are also enabled in the target OU / MG.
  5. The order you move workloads will be driven by the dependencies between your workloads, so this should be understood before beginning. The same goes for shared services that workloads depend on.
  6. Depending on the number of workloads to migrate, this may take weeks or months (but hopefully not years). Work methodically as you migrate workloads, verifying that controls, costs and compliance are working correctly for each context.
  7. Once done, decommission the old OU / MG structure and any Accounts / Subscriptions no longer in use.
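
For step 4b, the account move itself is a single API call on AWS. A hedged Kotlin sketch using the AWS SDK for Java v2, with placeholder ids:

// Sketch only: moves an existing Account from its old parent into the new bounded-context OU.
import software.amazon.awssdk.services.organizations.OrganizationsClient
import software.amazon.awssdk.services.organizations.model.MoveAccountRequest

fun moveAccountToBoundedContextOu(accountId: String, oldParentId: String, newOuId: String) {
    OrganizationsClient.create().use { orgs ->
        orgs.moveAccount(
            MoveAccountRequest.builder()
                .accountId(accountId)          // e.g. the account holding the Orders workloads
                .sourceParentId(oldParentId)   // the OU (or root) it sits under today
                .destinationParentId(newOuId)  // the OU named after the bounded context
                .build()
        )
    }
    // Policies attached to the old OU stop applying after the move, so verify the equivalent
    // controls are in place on the target OU before migrating the workloads it contains.
}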

Step 5: Inspect and Adapt

Your cloud architecture is not a static artifact; the design will continue to evolve over time as your business changes and new technologies emerge. New bounded contexts will appear that require changes to your cloud platform. Ideally, much of this work is codified and automated, but in all likelihood you will still have some manual steps involved as your bounded contexts evolve.

Your Account / Subscription taxonomy may change over time too, starting with fewer to simplify initial management and growing as your teams and processes mature. The responsibility boundaries of teams and how these align to bounded contexts will also mature over time. Methods like GitOps work nicely alongside DDC to keep your cloud infrastructure flexible and extensible over time and continually aligned with your business model.

Conclusion

DDC extends the principles of DDD beyond traditional software systems to create a unifying architecture spanning business domains, software systems and cloud infrastructure (BizDevOps). DDC is based on the software architecture principle of high cohesion and low coupling that is used when designing complex distributed systems, like your AWS and Azure environments. Employing the transparency and shared language benefits of DDD when creating your organization’s cloud architecture results in a secure-yet-flexible platform that naturally evolves as your business changes over time.

Special thanks to John Chapin, Casey Lee, Brandon Linton and Nick Tune for feedback on early drafts of this article and Abby Franks for the images.

Dealing with Java CVEs: Discovery, Detection, Analysis, and Resolution

Key Takeaways

  • Including a dependency vulnerability check (Software Composition Analysis or SCA) as part of a  continuous integration or continuous delivery pipeline is important to maintain an effective security posture.
  • The same vulnerability can be critical in one application and harmless in another. Humans should be “kept in the loop” here, and only the developers maintaining the application can make an effective decision.
  • It is essential to prevent vulnerability alert fatigue. We should not get used to the fact that the dependency check is failing. If we do, a critical vulnerability may pass unnoticed.
  • It is crucial to quickly upgrade vulnerable dependencies or suppress false positives even if we are maintaining dozens of services.
  • Developers should invest in tools that help with discovery, detection, analysis and resolution of vulnerabilities. Examples include OWASP dependency check, GitHub Dependabot, Checkmarx, Snyk and Dependency Shield.

Modern Java applications are built on top of countless open-source libraries. The libraries encapsulate common, repetitive code and allow application programmers to focus on delivering customer value. But the libraries come with a price – security vulnerabilities. A security issue in a popular library enables malicious actors to attack a wide range of targets cheaply.

Therefore, it’s crucial to have dependency vulnerability checks (a.k.a. Software Composition Analysis or SCA) as part of the CI pipeline. Unfortunately, the security world is not black and white; one vulnerability can be totally harmless in one application and a critical issue in another, so the scans always need human oversight to determine whether a report is a false positive.

This article will explore examples of vulnerabilities commonly found in standard Spring Boot projects over the last few years. It is written from the perspective of software engineers and focuses on the challenges they face when utilizing widely available tools such as the OWASP dependency check.

As software engineers are dedicated to delivering product value, they view security as one of their many responsibilities. Despite its importance, security can sometimes get in the way and be neglected because of the complexity of other tasks.

Vulnerability resolution lifecycle

A typical vulnerability lifecycle looks like this:

Discovery

A security researcher usually discovers the vulnerability. It gets reported to the impacted OSS project and, through a chain of various non-profit organizations, ends up in the NIST National Vulnerability Database (NVD). For instance, the Spring4Shell vulnerability was logged in this manner.

Detection

When a vulnerability is reported, it is necessary to detect that the application contains the vulnerable dependency. Fortunately, a plethora of tools are available that can assist with the detection.

One of the popular solutions is the OWASP dependency check – it can be used as a Gradle or Maven plugin. When executed, it compares all your application dependencies with the NIST NVD database and Sonatype OSS index. It allows you to suppress warnings and generate reports and is easy to integrate into the CI pipeline. The main downside is that it sometimes produces false positives as the NIST NVD database does not provide the data in an ideal format. Moreover, the first run takes ages as it downloads the whole vulnerability database.
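
As a rough illustration, wiring the plugin into a Gradle build can look like the build.gradle.kts sketch below; the plugin version, CVSS threshold and suppression file path are assumptions to adapt to your project.

// build.gradle.kts - a minimal sketch of the OWASP dependency check in a Gradle build.
plugins {
    java
    id("org.owasp.dependencycheck") version "8.4.0" // illustrative version
}

dependencyCheck {
    // Fail the build when a dependency carries a CVE at or above this CVSS score,
    // so vulnerable builds cannot slip through the CI pipeline unnoticed.
    failBuildOnCVSS = 7.0f
    // Shared file for accepted risks and known false positives.
    suppressionFile = "config/dependency-check-suppressions.xml"
}

// Run with: ./gradlew dependencyCheckAnalyze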

Various free and commercial tools are available, such as GitHub Dependabot, Checkmarx, and Snyk. Generally, these tools function similarly, scanning all dependencies and comparing them against a database of known vulnerabilities. Commercial providers often invest in maintaining a more accurate database. As a result, commercial tools may produce fewer false positives and false negatives.

Analysis

After a vulnerability is detected, a developer must analyze the impact. As you will see in the examples below, this is often the most challenging part. The individual performing the analysis must understand the vulnerability report, the application code, and the deployment environment to see if the vulnerability can be exploited. Typically, this falls to the application programmers as they are the only ones who have all the necessary context.

Resolution

The vulnerability has to be resolved.

  1. Ideally, this is achieved by upgrading the vulnerable dependency to a fixed version.
  2.  If no fix is released yet, the application programmer may apply a workaround, such as changing a configuration, filtering an input, etc.
  3. More often than not, the vulnerability report is a false positive, usually because the vulnerability can’t be exploited in the given environment. In such cases, the report has to be suppressed to prevent the team from becoming accustomed to failing vulnerability checks.

Once the analysis is done, the resolution is usually straightforward but can be time-consuming, especially if there are dozens of services to patch. It’s important to simplify the resolution process as much as possible. Since this is often tedious manual work, automating it to the greatest extent possible is advisable. Tools like Dependabot or Renovate can help in this regard to some extent.

Vulnerability examples

Let’s examine some vulnerability examples and the issues that can be encountered when resolving them.

Spring4Shell (CVE-2022-22965, score 9.8)

Let’s start with a serious vulnerability – Spring Framework RCE via Data Binding on JDK 9+, a.k.a. Spring4Shell, which allows an attacker to remotely execute code just by calling HTTP endpoints.

Detection

It was easy to detect this vulnerability. Spring is quite a prominent framework; the vulnerability was present in most of its versions, and it was discussed all over the internet. Naturally, all the detection tools were able to detect it.

Analysis

In the early announcement of the vulnerability, it was stated that only applications using Spring WebMvc/Webflux deployed as WAR to a servlet container are affected. In theory, deployment with an embedded servlet container should be safe. Unfortunately, the announcement lacked the vulnerability details, making it difficult to confirm whether this was indeed the case. However, this vulnerability was highly serious, so it should have been mitigated promptly.

Resolution

The fix was released in a matter of hours, so the best way was to wait for the fix and upgrade. Tools like Dependabot or Renovate can help to do that in all your services.

If there was a desire to resolve the vulnerability sooner, a workaround was available. But it meant applying an obscure configuration without a clear understanding of what it did. The decision to manually apply it across all services or wait for the fix could have been a challenging one to make.

HttpInvoker RCE (CVE-2016-1000027, score 9.8)

Let’s continue to focus on Spring for a moment. This vulnerability has the same criticality as Spring4Shell (9.8). But one might notice the date is 2016 and wonder why it hasn’t been fixed yet or why it lacks a fancy name. The reason lies in its location within the HttpInvoker component, used for an RPC communication style that was popular in the 2000s but is seldom used nowadays. To make it even more confusing, the vulnerability was only published in 2020, four years after it was initially reported, due to administrative reasons.

Detection

This issue was reported by the OWASP dependency check and other tools. As it did not affect many applications, it did not make headlines.

Analysis

Reading the NIST CVE detail doesn’t reveal much:

Pivotal Spring Framework through 5.3.16 suffers from a potential remote code execution (RCE) issue if used for Java deserialization of untrusted data. Depending on how the library is implemented within a product, this issue may or [may] not occur, and authentication may be required.

This sounds pretty serious, prompting immediate attention and a search through the link to find more details. However, the concern turns out to be a false alarm, as it only applies if HttpInvokerServiceExporter is used.
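
In practice, the analysis boils down to searching the codebase for an exporter bean like the Kotlin sketch below (Spring 5.x API; the OrderService name and interface are invented for illustration). If nothing exposes such an endpoint to untrusted callers, the report is a false positive.

// Sketch only: the pattern that makes CVE-2016-1000027 relevant. HttpInvokerServiceExporter
// deserializes Java-serialized request bodies, which is the remote code execution vector.
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.remoting.httpinvoker.HttpInvokerServiceExporter

interface OrderService { // invented example service
    fun findOrder(id: String): String
}

@Configuration
class RemotingConfig {

    @Bean("/orderService") // exposed over HTTP at /orderService
    fun orderServiceExporter(orderService: OrderService) =
        HttpInvokerServiceExporter().apply {
            setService(orderService)
            setServiceInterface(OrderService::class.java)
        }
}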

Resolution

No fixed version of the library was released, as Pivotal did not consider it a bug. It was a feature of obsolete code that was supposed to be used only for internal communication. The whole functionality was dropped altogether in Spring 6, a few years later.

The only action to take is to suppress the warning. Using the free OWASP dependency check, this process can be quite time-consuming if it has to be done manually for each service.

There are several ways to simplify the flow. One is to expose a shared suppression file and use it in all your projects by specifying its URL. Another is to employ a simple service like Dependency Shield to streamline the whole suppression flow. The important point is that a process is needed to simplify the suppression, as most of the reports received are likely false positives.

SnakeYAML RCE (CVE-2022-1471, score 9.8)

Another critical vulnerability has emerged, this time in the SnakeYAML parsing library. Once again, it involves remote code execution, with a score of 9.8. However, it was only applicable if the SnakeYAML Constructor class had been used to parse YAML provided by an attacker.

Detection

It was detected by vulnerability scanning tools. SnakeYAML is used by Spring to parse YAML configuration files, so it’s quite widespread.

Analysis

Is the application parsing YAML that could be provided by an attacker, for example, on a REST API? Is the unsafe Constructor class being used? If so, the system is vulnerable. The system is safe if SnakeYAML is only used to parse Spring configuration files. An individual who understands the code and its usage must make the decision. The situation could either be critical, requiring immediate attention and correction, or it could be safe and therefore ignored.
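
The distinction the analysis hinges on fits in a few lines. The Kotlin sketch below uses the SnakeYAML 1.x API (2.x changed the constructors slightly); UserProfile is an invented example type.

// Sketch only: new Constructor(...) lets the YAML document instantiate arbitrary classes,
// which is what CVE-2022-1471 abuses; SafeConstructor only builds maps, lists and scalars.
import org.yaml.snakeyaml.Yaml
import org.yaml.snakeyaml.constructor.Constructor
import org.yaml.snakeyaml.constructor.SafeConstructor

data class UserProfile(var name: String = "", var email: String = "")

fun handleUploadedYaml(yamlFromRequest: String) {
    // Vulnerable pattern when the input can come from an attacker.
    val risky = Yaml(Constructor(UserProfile::class.java))
    val profile: UserProfile = risky.load(yamlFromRequest)

    // Safer pattern for untrusted input.
    val safe = Yaml(SafeConstructor())
    val data: Map<String, Any> = safe.load(yamlFromRequest)

    println("parsed ${profile.name} and keys ${data.keys}")
}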

Resolution

The issue was quickly fixed. What made it tricky was that SnakeYAML was not a direct dependency; it’s introduced transitively by Spring, which made it harder to upgrade. If you want to upgrade SnakeYAML, you may do it in several ways.

  1. If using the Spring Boot dependency management plugin with Spring Boot BOM,
    • a.    the snakeyaml.version variable can be overridden.
    • b.    the dependency management declaration can be overridden.
  2. If not using dependency management, SnakeYAML must be added as a direct dependency to the project, and the version must be overridden (see the sketch after this list).
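
As a rough build.gradle.kts sketch of both options (the version number is illustrative, and the override should be removed once Spring Boot manages a fixed version):

// build.gradle.kts - sketch of forcing a newer transitive SnakeYAML.

// Option 1a, with the Spring Boot dependency-management plugin: override the version
// property it exposes for the managed dependency.
extra["snakeyaml.version"] = "2.0"

// Option 2, without dependency management: declare SnakeYAML directly so your version
// wins over the one Spring pulls in transitively.
dependencies {
    implementation("org.yaml:snakeyaml:2.0")
}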

When combined with complex multi-project builds, it's almost impossible for tools to upgrade the version automatically. Neither Dependabot nor Renovate is able to do that. Even a commercial tool like Snyk fails with “could not apply the upgrade, dependency is managed externally.”

And, of course, once the version is overridden, it is essential to remember to remove the override when the new version is eventually updated in Spring. In our case, it's better to temporarily suppress the warning until the new version is used in Spring.

Misidentified Avro vulnerability

Vulnerability CVE-2021-43045 is a bug in the .NET version of the Avro library, so it's unlikely to affect a Java project. How, then, does it get reported? Unfortunately, the NIST report contains the cpe:2.3:a:apache:avro:*:*:*:*:*:*:*:* identifier. No wonder the tools mistakenly identify the Java org.apache.avro:avro artifact as vulnerable, even though it's from a completely different ecosystem.

Resolution: Suppress

Summary

Let’s look back at the different stages of the vulnerability resolution and how to streamline it as much as possible so the reports do not block the engineers for too long.

Detection

The most important part of detection is to avoid getting used to failing dependency checks. Ideally, the build should fail if a vulnerable dependency is detected. To be able to do that, the resolution needs to be as painless and as fast as possible. No one wants to encounter a broken pipeline due to a false positive.
 
Since the OWASP dependency check primarily uses the NIST NVD database, it sometimes struggles with false positives. However, as the examples above show, some false positives are inevitable with any tool, as the analysis is rarely straightforward.

Analysis

This is the hard part and, actually, the one where tooling can't help us much. Consider the SnakeYAML remote code execution vulnerability as an example. For it to be exploitable, the library would have to be used unsafely, such as parsing data provided by an attacker. Regrettably, no tool is likely to reliably detect whether an application and all its libraries contain vulnerable code. So, this part will always need some human intervention.

Resolution

Upgrading the library to a fixed version is relatively straightforward for direct dependencies. Tools like Dependabot and Renovate can help in the process. However, the tools fail if the vulnerable dependency is introduced transitively or through dependency management. Manually overriding the dependency may be an acceptable solution for a single project. In cases where multiple services are being maintained, we should introduce centrally managed dependency management to streamline the process.

Most reports are false positives, so it’s crucial to have an easy way to suppress the warning. When using OWASP dependency check, either try a shared suppression file or a tool like Dependency Shield that helps with this task.

It often makes sense to suppress the report temporarily, either to unblock the pipeline until somebody has time to analyze the report properly or until the transitive dependency is updated in the project that introduced it.

Building Kafka Event-Driven Applications with KafkaFlow

Key Takeaways

  • KafkaFlow is an open-source project that streamlines Kafka-based event-driven applications, simplifying the development of Kafka consumers and producers.
  • This .NET framework offers an extensive range of features, including middleware, message handlers, type-based deserialization, concurrency control, batch processing, etc.
  • By utilizing middlewares, developers can encapsulate the logic for processing messages, which leads to better separation of concerns and maintainable code.
  • The project can be extended, creating the possibility of customization and the growth of an ecosystem of add-ons.
  • Developers benefit from KafkaFlow by being able to focus on what matters, spending more time on the business logic rather than investing in low-level concerns.

KafkaFlow is an open-source framework by FARFETCH. It helps .NET developers working with Apache Kafka to create event-driven applications. KafkaFlow lets developers easily set up “Consumers” and “Producers.” The simplicity makes it an attractive framework for businesses seeking efficiency and robustness in their applications.

In this article, we will explore what KafkaFlow has to offer. If you build Apache Kafka Consumers and Producers using .NET, this article will provide a glance at how KafkaFlow can simplify your life.

Why Should You Care About It?

KafkaFlow provides an abstraction layer over the Confluent .NET Kafka client. It does so while making it easier to use, maintain, and test Kafka consumers and producers.

Imagine you need to build a Client Catalog for marketing initiatives. You will need a service to consume messages that capture new Clients. Once you start laying out your required service, you notice that existing services are not consistent in how they consume messages.

It’s common to see teams struggling with, and repeatedly re-solving, simple problems such as graceful shutdowns. You’ve figured out that you have four different implementations of a JSON serializer across the organization, just to name one of the challenges.

Adopting a framework like KafkaFlow simplifies the process and can speed up the development cycle. KafkaFlow has a set of features designed to enhance the developer experience:

  1. Middlewares: KafkaFlow allows developers to create middlewares to process messages, enabling more control and customization of the Kafka consumer/producer pipeline.
  2. Handlers: Introduces the concept of message handlers, allowing developers to forward message processing from a topic to a message-type dedicated handler.
  3. Deserialization Algorithms: Offers a set of Serialization and Deserialization algorithms out-of-the-box.
  4. Multi-threaded Consumers: Provides multi-threading with message order guaranteed, helping to ensure optimal use of system resources.
  5. Administration API and Dashboard: Provides API and Dashboards to manage Consumers and Consumer groups, with operations such as pausing, resuming, or rewinding offsets, all at runtime.
  6. Consumer Throttling: Provides an easy way to bring priorities to topic consumption.

Let’s explore them so you can see the potential to address a problem like this.

KafkaFlow Producers: Simplified Message Production

Let’s start with message producers.

Producing a message to Kafka is not rocket science. Even so, KafkaFlow provides a higher-level abstraction over the producer interface from Confluent’s .NET Kafka client, simplifying the code and increasing maintainability.

Here’s an example of how to send a message with a KafkaFlow producer:

await _producers["my-topic-events"]
    .ProduceAsync("my-topic", message.Id.ToString(), message);

This way, you can produce messages to Kafka without dealing directly with serialization or other complexities of the underlying Kafka client.

Not only that, but defining and managing Producers is pleasantly done through a Fluent Interface on your service configuration.

services.AddKafka(kafka => kafka
    .AddCluster(cluster => cluster
        .WithBrokers(new[] { "host:9092" })
        .AddProducer(
            "product-events",
            producer =>
                producer
            ...
        )
    )
);

Producers tend to be simple, but there are some common concerns to address, like compression or serialization. Let’s explore that.

Custom Serialization/Deserialization in KafkaFlow

One of the attractive features of Apache Kafka is being agnostic of data formats. However, that transfers the responsibility to producers and consumers. Without a thoughtful approach, it may lead to many ways to achieve the same result across the system. That makes serialization an obvious use case to be handled by a client framework.

KafkaFlow has serializers available for JSON, Protobuf, and even Avro. Those can be used simply by adding them to the middleware configuration.

.AddProducer(producer => producer
    ...
    .AddMiddlewares(middlewares => middlewares
        ...
        .AddSerializer<JsonCoreSerializer>() // e.g., the JSON serializer
    )
)

The list is not restricted to those three due to its ability to use custom serializers/deserializers for messages. While Confluent’s .NET Kafka client already supports custom serialization/deserialization, KafkaFlow simplifies the process by providing a more elegant way to handle it.

As an example, to use a custom serializer, you would do something like this:

public class MySerializer : ISerializer
{
    public Task SerializeAsync(object message, Stream output, ISerializerContext context)
    {
        // Serialization logic here
        return Task.CompletedTask;
    }

    public Task<object> DeserializeAsync(Stream input, Type type, ISerializerContext context)
    {
        // Deserialization logic here
        return Task.FromResult<object>(null);
    }
}

// Register the custom serializer when setting up the Kafka consumer/producer

.AddProducer(producer => producer
    ...
    .AddMiddlewares(middlewares => middlewares
        ...
        .AddSerializer<MySerializer>()
    )
)

Message Handling in KafkaFlow

Consumers bring a ton of questions and possibilities. The first one is “How do you handle a message?”

Let’s start with the simplest way. With the advent of libraries like MediatR that popularized the CQRS and Mediator patterns, .NET developers got used to decoupling message handlers from the request/message receiver. KafkaFlow brings that same principle to Kafka Consumers.

KafkaFlow message handlers allow developers to define specific logic to process messages from a Kafka topic. KafkaFlow’s message handler structure is designed for better separation of concerns and cleaner, more maintainable code.

Here’s an example of a message handler:

public class MyMessageHandler : IMessageHandler<MyMessageType>
{
    public Task Handle(IMessageContext context, MyMessageType message)
    {
        // Message handling logic here.
        return Task.CompletedTask;
    }
}

This handler can be registered in the consumer configuration:

.AddConsumer(consumer => consumer
    ...
    .AddMiddlewares(middlewares => middlewares
        ...
        .AddTypedHandlers(handlers => handlers
            .AddHandler<MyMessageHandler>()
        )
    )
)

With this approach, it’s easy to separate Consumers from Handlers, simplifying maintainability and testability.

This may look like unneeded complexity if you have a microservice handling one topic with only one message type. In that case, you can take advantage of middlewares.

Middleware in KafkaFlow

KafkaFlow is middleware-oriented. Maybe you noticed a reference to “Middlewares” in the Message Handlers snippets. So, you may be asking yourself what a Middleware is.

Middlewares are what make Typed Handlers possible. Messages are delivered to a middleware pipeline that will be invoked in sequence. You might be familiar with this concept if you have used MediatR pipelines. Also, Middlewares can be used to apply a series of transformations. In other words, a given Middleware can transform the incoming message before passing it to the following Middleware.

A Middleware in KafkaFlow encapsulates the logic for processing messages. The pipeline is extensible, allowing developers to add behavior to the message-processing pipeline.

Here’s an example of a middleware:

public class MyMiddleware : IMessageMiddleware
{
    public async Task Invoke(IMessageContext context, MiddlewareDelegate next)
    {
        // Pre-processing logic here.
        await next(context);
        // Post-processing logic here.
    }
}

To use this middleware, it can be registered in the consumer configuration:

.AddConsumer(consumer => consumer
    ...
    .AddMiddlewares(middlewares => middlewares
        ...
        .Add<MyMiddleware>()
    )
)

This way, developers can plug custom logic into the message processing pipeline, providing flexibility and control.

Typed Handlers are a form of Middleware. So, you can even handle a message without a Typed Handler by implementing your own middleware, or you can take advantage of Middlewares to build a message pipeline that performs validations, enrichment, etc., before handling that message.

Handling Concurrency in KafkaFlow

Once you start thinking about infrastructure efficiency, you will notice that many Kafka Consumers are underutilized. The most common implementation is single-threaded, which caps resource utilization. So, when you need to scale, you do it horizontally to keep the desired throughput.

KafkaFlow brings another option to achieve infrastructure efficiency. KafkaFlow gives developers control over how many messages can be processed concurrently by a single consumer. It uses the concept of Workers that can work together consuming a topic.
This functionality allows you to optimize your Kafka consumer to better match your system’s capabilities.

Here’s an example of how to set the number of concurrent workers for a consumer:

.AddConsumer(consumer => consumer
    .Topic("topic-name")
    .WithGroupId("sample-group")
    .WithBufferSize(100)
    .WithWorkersCount(10) // Set the number of workers.
    .AddMiddlewares(middlewares => middlewares
        ...
    )
)

KafkaFlow guarantees order even with concurrent workers.

Batch Processing

With scale, you will face the tradeoff between latency and throughput. To handle that tradeoff, KafkaFlow has an important feature called “Batch Consuming.” This feature addresses the need for efficiency and performance in consuming and processing messages from Kafka in a batch-wise manner. It plays an important role in use cases where a group of messages needs to be processed together rather than individually.

What Is Batch Consuming?

Batch consuming is an approach where, instead of processing messages one at a time as they come in, the system groups several messages together and processes them all at once. This method is more efficient for dealing with large amounts of data, particularly if messages are independent of each other. Performing operations as a batch will lead to an increase in overall performance.

KafkaFlow’s Approach to Batch Consuming

KafkaFlow takes advantage of the system of Middlewares to provide batch processing. The Batch Processing Middleware lets you group messages according to batch size or timespan. Once one of those conditions is reached, the Middleware will forward the group of messages to the next middleware.

services.AddKafka(kafka => kafka
    .AddCluster(cluster => cluster
        .WithBrokers(new[] { "host:9092" })
        .AddConsumer(
            consumerBuilder => consumerBuilder
            ...
            .AddMiddlewares(
                middlewares => middlewares
                    ...
                    .BatchConsume(100, TimeSpan.FromSeconds(10))
                    .Add<MyMiddleware>() // e.g., a middleware that handles the whole batch
            )
        )
    )
);

The Impact of Batch Consuming on Performance

With batch processing, developers can achieve higher throughput in their Kafka-based applications. It allows for faster processing as the overhead associated with initiating and finalizing each processing task is significantly reduced. This leads to an overall increase in system performance.

Also, this approach reduces network I/O operations as data is fetched in larger chunks, which can further improve processing speed, especially in systems where network latency is a concern.

Consumer Administration with KafkaFlow

KafkaFlow also simplifies administrative tasks related to managing Kafka consumers. You can start, stop, pause consumers, rewind offsets, and much more with KafkaFlow’s administrative API.

The Administration API can be used through a programming interface, a REST API, or a Dashboard UI.

KafkaFlow administration Dashboard

Consumer Throttling

Often, underlying technologies may not be able to deal with high-load periods in the same way as Kafka Consumers. That can bring stability problems. That is where throttling comes in.

Consumer Throttling is an approach to managing the consumption of messages, enabling applications to dynamically fine-tune the rate at which they consume messages based on metrics.

Prioritization

Imagine you’re running an application where you want to segregate atomic and bulk actions into different consumers and topics. You may prefer to prioritize the processing of atomic actions over bulk actions. Traditionally, managing this differentiation could be challenging, given the potential discrepancies in the rate of message production.

Consumer Throttling is valuable in such instances, allowing you to monitor the consumer lag of the consumer responsible for atomic actions. Based on this metric, you can apply throttling to the consumer handling the bulk actions, ensuring that atomic actions are processed as a priority.

The result? An efficient, flexible, and optimized consumption process.

Adding throttling to a consumer is straightforward with a KafkaFlow fluent interface. Here’s a simple example:

.AddConsumer(
    consumer => consumer
        .Topic("bulk-topic")
        .WithName("bulkConsumer")
        .AddMiddlewares(
            middlewares => middlewares
                .ThrottleConsumer(
                    t => t
                        .ByOtherConsumersLag("singleConsumer")
                        .WithInterval(TimeSpan.FromSeconds(5))
                        .AddAction(a => a.AboveThreshold(10).ApplyDelay(100))
                        .AddAction(a => a.AboveThreshold(100).ApplyDelay(1_000))
                        .AddAction(a => a.AboveThreshold(1_000).ApplyDelay(10_000)))
                .AddSerializer()
        )
)

KafkaFlow: Looking Toward the Future

As of now, KafkaFlow provides a robust, developer-friendly abstraction over Kafka that simplifies building real-time data processing applications with .NET. However, like any active open-source project, it’s continually evolving and improving.

Given the project’s current trajectory, we might anticipate several developments. For instance, KafkaFlow could further enhance its middleware system, providing even more control and flexibility over message processing. We might also see more extensive administrative APIs, providing developers with even greater control over their Kafka clusters.

Because KafkaFlow is extensible by design, we can expect its community to grow, leading to more contributions, innovative features, extensions, and support. As more developers and organizations adopt KafkaFlow, we’re likely to see an increase in learning resources, tutorials, case studies, and other community-generated content that can help new users get started and existing users get more from the library.

Conclusion

KafkaFlow is a handy and developer-friendly tool that simplifies work with Kafka in .NET. It shines in the area of developer experience and usability. The framework design lends itself well to clean, readable code. With a clear separation of concerns through middlewares and message handlers, as well as abstractions over complex problems when building applications on top of Apache Kafka, KafkaFlow helps to keep your codebase manageable and understandable.

Furthermore, the community around KafkaFlow is growing. If you’re using Kafka and looking to improve productivity and reliability, KafkaFlow is certainly worth considering.

InfoQ AI, ML, and Data Engineering Trends Report – September 2023

Key Takeaways

  • Generative AI, powered by Large Language Models (LLMs) like GPT-3 and GPT-4, has gained significant prominence in the AI and ML industry, with widespread adoption driven by technologies like ChatGPT.
  • Major tech players such as Google and Meta have announced their own generative AI models, indicating the industry’s commitment to advancing these technologies.
  • Vector databases and embedding stores are gaining attention due to their role in enhancing observability in generative AI applications.
  • Responsible and ethical AI considerations are on the rise, with calls for stricter safety measures around large language models and an emphasis on improving the lives of all people through AI.
  • Modern data engineering is shifting towards decentralized and flexible approaches, with the emergence of concepts like Data Mesh, which advocates for federated data platforms partitioned across domains.

The InfoQ Trends Reports provide InfoQ readers with an opinionated high-level overview of the topics we believe architects and technical leaders should pay attention to. In addition, they also help the InfoQ editorial team focus on writing news and recruiting article authors to cover innovative technologies.

In this annual report, the InfoQ editors discuss the current state of AI, ML, and data engineering and what emerging trends you as a software engineer, architect, or data scientist should watch. We curate our discussions into a technology adoption curve with supporting commentary to help you understand how things are evolving.

In this year’s podcast, the InfoQ editorial team was joined by external panelist Sherin Thomas, software engineer at Chime. The following sections of the article summarize some of these trends and where different technologies fall on the technology adoption curve.

Generative AI

Generative AI, including Large Language Models (LLMs) like GPT-3, GPT-4, and ChatGPT, has become a major force in the AI and ML industry. These technologies have garnered significant attention, especially given the progress they have made over the last year. We have seen wide adoption of these technologies by users, driven in particular by ChatGPT. Multiple players such as Google and Meta have announced their own generative AI models.

The next step we expect is a larger focus on LLMOps to operate these large language models in an enterprise setting. We are divided on whether prompt engineering will remain a large topic in the future or whether adoption will become so widespread that everyone will be able to contribute to the prompts used.

Vector Databases and Embedding Stores

With the rise in LLM technology there’s a growing focus on vector databases and embedding stores. One intriguing application gaining traction is the use of sentence embeddings to enhance observability in generative AI applications.

The need for vector search databases arises from the limitations of large language models, which have a finite token history. Vector databases can store document summaries as feature vectors generated by these language models, potentially resulting in millions or more feature vectors. With traditional databases, finding relevant documents becomes challenging as the dataset grows. Vector search databases enable efficient similarity searches, allowing users to locate the nearest neighbors to a query vector, enhancing the search process.
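To make the retrieval idea concrete, here is a minimal sketch of the nearest-neighbor lookup a vector search database performs, using brute-force cosine similarity over a hypothetical store of embeddings (the array sizes and data are placeholders). Production systems such as Pinecone, Milvus, or Chroma use approximate indexes rather than this exhaustive scan, but the underlying concept is the same.

# Minimal nearest-neighbor search over document embeddings (illustrative only).
import numpy as np

def cosine_similarity(query: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of stored vectors."""
    query_norm = query / np.linalg.norm(query)
    vector_norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return vector_norms @ query_norm

# Hypothetical store: one embedding per document summary (384 is a typical
# sentence-embedding dimension; the values here are random placeholders).
embeddings = np.random.rand(100_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)

scores = cosine_similarity(query, embeddings)
top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 nearest neighbors
print(top_k)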

A notable trend is the surge in funding for these technologies, signaling investor recognition of their significance. Adoption among developers has been slower, but it’s expected to pick up in the coming years. Vector search databases like Pinecone and Milvus, and open-source solutions like Chroma, are gaining attention. The choice of database depends on the specific application and the nature of the data being searched.

In various fields, including Earth observation, vector databases have demonstrated their potential. NASA, for instance, leveraged self-supervised learning and vector search technology to analyze satellite images of Earth, aiding scientists in tracking weather phenomena such as hurricanes over time.

Robotics and Drone Technologies

The cost of robots is going down. In the past, legged balancing robots were hard to acquire, but some models are now available for around $1,500. This allows more users to apply robot technologies in their applications. The Robot Operating System (ROS) is still the leading software framework in this field, but companies like VIAM are also developing middleware solutions that make it easier to integrate and configure plugins for robotics development.

We expect that advances in unsupervised learning and foundational models will translate into improved capabilities, for example, integrating a large language model into a robot’s path-planning component to enable planning using natural language.

Responsible and Ethical AI

As AI starts to affect all of humanity, there is a growing interest in responsible and ethical AI. People are simultaneously calling for stricter safety measures around large language models and expressing frustration when the output of such models reminds users of the safeguards in place.

It remains important for engineers to keep in mind that AI should improve the lives of all people, not just a select few. We expect AI regulation to have an impact similar to the one GDPR had a few years ago.

We have seen some AI projects fail because of bad data. Data discovery, data operations, data lineage, labeling, and good model development practices are going to take center stage. Data is crucial to explainability.

Data Engineering

The state of modern data engineering is marked by a dynamic shift towards more decentralized and flexible approaches to manage the ever-growing volumes of data. Data Mesh, a novel concept, has emerged to address the challenges posed by centralized data management teams becoming bottlenecks in data operations. It advocates for a federated data platform partitioned across domains, where data is treated as a product. This allows domain owners to have ownership and control over their data products, reducing the reliance on central teams. While promising, Data Mesh adoption may face hurdles related to expertise, necessitating advanced tooling and infrastructure for self-service capabilities.

Data observability has become paramount in data engineering, analogous to system observability in application architectures. Observability is essential at all layers, including data observability, especially in the context of machine learning. Trust in data is pivotal for AI success, and data observability solutions are crucial for monitoring data quality, model drift, and exploratory data analysis to ensure reliable machine learning outcomes. This paradigm shift in data management and the integration of observability across the data and ML pipelines reflect the evolving landscape of data engineering in the modern era.

Explaining the updates to the curve

With this trends report also comes an updated graph showing what we believe the state of certain technologies is. The categories are based on the book “Crossing the Chasm” by Geoffrey Moore. At InfoQ, we mostly focus on categories that have not yet crossed the chasm.

One notable upgrade from innovators to early adopters is “AI Coding Assistants”. Although they were very new and hardly used last year, we see more and more companies offering them as a service to their employees to make them more efficient. They are not a default part of every stack, and we are still discovering how to use them most effectively, but we believe that adoption will continue to grow.

Something we believe is crossing the chasm right now is natural language processing. This will not come as a surprise to anyone, as many companies are currently trying to figure out how to adopt generative AI capabilities in their product offerings following the massive success of ChatGPT. We therefore decided to move it across the chasm into the early majority category. There is still a lot of potential for growth here, and time will teach us more about the best practices and capabilities of this technology.

There are some notable categories that did not move at all: technologies such as synthetic data generation, brain-computer interfaces, and robotics. All of these seem to be consistently stuck in the innovators category. The most promising in this regard is synthetic data generation, which has lately been getting more attention with the GenAI hype. We see more and more companies talking about generating more of their training data, but we have not seen enough applications actually using it in their stack to warrant moving it to the early adopters category. Robotics has been getting a lot of attention for multiple years now, but its adoption rate is still too low to warrant a move.

We also introduced several new categories to the graph. A notable one is vector search databases, which come as a byproduct of the GenAI hype. As we gain a better understanding of how to represent concepts as vectors, there is also a greater need for efficiently storing and retrieving those vectors. We also added explainable AI to the innovators category. We believe that computers explaining why they made a certain decision will be vital for widespread adoption, to combat hallucinations and other dangers. However, we currently don’t see enough work in the industry to warrant a higher category.

Conclusion

The field of AI, ML, and Data Engineering keeps on growing year over year. There is still a lot of growth in both the technological capabilities as well as the possible applications. It’s exciting for us editors at InfoQ to be so close to the progress, and we are looking forward to making the same report next year. In the podcast we make several predictions for the coming year, which range from “there will be no AGI” to “Autonomous Agents will be a thing”. We hope you enjoyed listening to the podcast and reading this article, and would love to see your predictions and comments below this article.

Managing the Carbon Emissions Associated with Generative AI

Key Takeaways

  • There’s increasing concern around carbon emissions as generative AI becomes more integrated into our everyday lives
  • Comparisons of carbon emissions between generative AI and the commercial aviation industry are misleading
  • Organizations should incorporate best practices to mitigate emissions specific to generative AI; transparency requirements could be crucial for both training and using AI models
  • Improving energy efficiency in AI models is valuable not only for sustainability but also for improving capabilities and reducing costs
  • Prompt engineering becomes key to reducing the computational resources, and thus the carbon emitted, when using generative AI; prompts that generate shorter outputs use less computation, which leads to a new practice, “green prompt engineering”

Introduction

Recent developments in generative AI are transforming our industry and our broader society. Language models like ChatGPT and Copilot are drafting letters and writing code, image and video generation models can create compelling content from a simple prompt, and music and voice models allow easy synthesis of speech in anyone’s voice and the creation of sophisticated music.

Conversations on the power and potential value of this technology are happening around the world. At the same time, people are talking about risks and threats.

From extremist worries about superintelligent AI wiping out humanity, to more grounded concerns about the further automation of discrimination and the amplification of hate and misinformation, people are grappling with how to assess and mitigate the potential negative consequences of this new technology.

People are also increasingly concerned about the energy use and corresponding carbon emissions of these models. Dramatic comparisons have resurfaced in recent months.

One article, for example, equates the carbon emissions of training GPT-3 to driving to the moon and back; another, meanwhile, explains that training an AI model emits massively more carbon than a long-distance flight.

The ultimate impact will depend on how this technology is used and to what degree it is integrated into our lives.

It is difficult to anticipate exactly how it will impact our day-to-day lives, but one current example, the search giants integrating generative AI into their products, is fairly clear.

As per a recent Wired article:

Martin Bouchard, cofounder of Canadian data center company QScale, believes that, based on his reading of Microsoft and Google’s plans for search, adding generative AI to the process will require “at least four or five times more computing per search” at a minimum.

It’s clear that generative AI is not to be ignored.

Are carbon emissions of generative AI overhyped?

However, the concerns about the carbon emissions of generative AI may be overhyped. It’s important to put things in perspective: the entire global tech sector accounts for 1.8% to 3.9% of global greenhouse-gas emissions, but only a fraction of those emissions is caused by AI[1]. Dramatic comparisons between AI and aviation or other sources of carbon create confusion because of differences in scale: while there are many cars and aircraft traveling millions of kilometers every day, training a modern AI model like the GPT models is something that has only happened a relatively small number of times.

Admittedly, it’s unclear exactly how many large AI models have been trained. Ultimately, that depends on how we define “large AI model.” However, if we consider models at the scale of GPT-3 or larger, it is clear that there have been fewer than 1,000 such models trained. To do a little math:

A recent estimate suggests that training GPT-3 emitted 500 metric tons of CO2, and Meta’s LLaMA model was estimated to emit 173 tons. Training 1,000 such 500-ton models would emit a total of about 500,000 metric tons of CO2. Newer models may increase emissions somewhat, but 1,000 models is almost certainly an overestimate and so accounts for this. The commercial aviation industry emitted about 920,000,000 metric tons of CO2 in 2019[2], almost 2,000 times as much as LLM training, and keep in mind that this compares one year of aviation to multiple years of LLM training. The training of LLMs is still not negligible, but the dramatic comparisons are misleading. More nuanced thinking is needed.
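For readers who want to check the arithmetic, here is the same back-of-the-envelope calculation spelled out as a short script; the figures are the estimates quoted above, not new data.

# Back-of-the-envelope comparison of LLM training emissions vs. commercial aviation,
# using the estimates quoted in the text.
gpt3_training_tons = 500            # estimated metric tons of CO2 to train GPT-3
assumed_model_count = 1_000         # deliberately generous count of GPT-3-scale models
llm_training_total = gpt3_training_tons * assumed_model_count   # 500,000 t

aviation_2019_tons = 920_000_000    # commercial aviation emissions in 2019, metric tons

print(llm_training_total)                       # 500000
print(aviation_2019_tons / llm_training_total)  # 1840.0 -> "almost 2,000 times" more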

This, of course, only considers the training of such models. Serving and using the models also requires energy and has associated emissions. Based on one analysis, ChatGPT might emit about 15,000 metric tons of CO2 to operate for a year. Another analysis suggests much less, about 1,400 metric tons. Not negligible, but still nothing compared to aviation.

Emissions transparency is needed

But even if the concerns about the emissions of AI are somewhat overhyped, they still merit attention, especially as generative AI becomes integrated into more and more of our modern life. As AI systems continue to be developed and adopted, we need to pay attention to their environmental impact. There are many well-established practices that should be leveraged, and also some ways to mitigate emissions that are specific to generative AI.

Firstly, transparency is crucial. We recommend transparency requirements to allow for monitoring of the carbon emissions related to both training and use of AI models. This will allow those deploying these models, as well as end users, to make informed decisions about their use of AI based on its emissions, and to incorporate AI-related emissions into their greenhouse gas inventories and net-zero targets. This is one component of holistic AI transparency.

As an example of how such requirements might work, France has recently passed a law mandating telecommunications companies to provide transparency reporting around their sustainability efforts. A similar law could require products incorporating AI systems to report carbon emissions to their customers and also for model providers to integrate carbon emissions data into their APIs.

Greater transparency can lead to stronger incentives to build energy-efficient generative AI systems, and there are many ways to increase efficiency. In another recent InfoQ article, Sara Bergman, Senior Software Engineer at Microsoft, encourages people to consider the entire lifecycle of an AI system and provides advice on applying the tools and practices from the Green Software Foundation to making AI systems more energy efficient, including careful selection of server hardware and architecture, as well as time and region shifting to find less carbon-intensive electricity. But generative AI presents some unique opportunities for efficiency improvements.

Efficiency: Energy use and model performance

As explored in Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning, the carbon emissions associated with training or using a generative AI model depend on many factors, including:

  • Number of model parameters
  • Quantization (numeric precision)
  • Model architecture
  • Efficiency of GPUs or other hardware used
  • Carbon-intensity of electricity used

The last two factors are relevant for any software and are well explored by others, such as in the InfoQ article mentioned above. We will therefore focus on the first three factors here, all of which involve some tradeoff between energy use and model performance.

It’s worth noting that efficiency is valuable not only for sustainability concerns. More efficient models can improve capabilities in situations where less data is available, decrease costs, and unlock the possibility of running on edge devices.

Number of model parameters

As shown in this figure from OpenAI’s paper, “Language Models are Few-Shot Learners”, larger models tend to perform better.

This is also a point made in Emergent Abilities of Large Language Models:

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models.

We see that not only do larger models perform better at a given task, but entirely new capabilities emerge only as models get large. Examples of such emergent capabilities include adding and subtracting large numbers, toxicity classification, and chain-of-thought techniques for math word problems.

But training and using larger models requires more computation and thus more energy. We therefore see a tradeoff between a model’s capabilities and performance and its computational, and thus carbon, intensity.

Quantization

There has been significant research into the quantization of models, where lower-precision numbers are used in model computations, reducing computational intensity at the expense of some accuracy. It has typically been applied to allow models to run on more modest hardware, for example, enabling LLMs to run on a consumer-grade laptop. The tradeoff between decreased computation and decreased accuracy is often very favorable, making quantized models extremely energy-efficient for a given level of capability. There are related techniques, such as “distillation”, that use a larger model to train a smaller model that can perform extremely well for a given task.

Distillation technically requires training two models, so it could well increase the carbon emissions related to model training; however, it should compensate for this by decreasing the model’s in-use emissions. Distillation of an existing, already-trained model can also be a good solution. It’s even possible to combine distillation and quantization to create a more efficient model for a given task.
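As a toy illustration of the quantization idea described above (not the per-channel or optimized schemes real LLM runtimes use), here is a minimal sketch of mapping float32 weights onto int8 with a single scale factor, showing both the memory saving and the rounding error that represents the accuracy tradeoff.

# Toy post-training quantization of weights from float32 to int8 (illustrative only).
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto the int8 range using a single scale factor."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(weights.nbytes, q.nbytes)            # 64 bytes vs 16 bytes: 4x smaller
print(np.max(np.abs(weights - restored)))  # small rounding error: the accuracy tradeoff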

Model Architecture

Model architecture can have an enormous impact on computational intensity, so choosing a simpler model can be the most effective way to decrease the carbon emissions of an AI system. While GPT-style transformers are very powerful, simpler architectures can be effective for many applications. Models like ChatGPT are considered “general-purpose,” meaning they can be used for many different applications. However, when the application is fixed, using a complex model may be unnecessary. A custom model for the task may achieve adequate performance with a much simpler and smaller architecture, decreasing carbon emissions. Another useful approach is fine-tuning: the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning discusses how fine-tuning “offers better accuracy as well as dramatically lower computational costs”.

Putting carbon and accuracy metrics on the same level

The term “accuracy” easily feeds into a “more is better” mentality. To address this, it is critical to understand the requirements of the given application: “enough is enough”. In some cases, the latest and greatest model may be needed, but for other applications, older, smaller, possibly quantized models might be perfectly adequate. In some cases, correct behavior may be required for all possible inputs, while other applications may be more fault tolerant. Once the application and the required level of service are properly understood, an appropriate model can be selected by comparing performance and carbon metrics across the options. There may also be cases in which a suite of models can be leveraged: requests are passed to simpler, smaller models by default, and only when the simple model cannot handle the task is the request passed off to a more sophisticated model, as sketched below.
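A minimal sketch of that routing idea might look like the following; the model calls and the confidence heuristic are hypothetical placeholders, and a real system would substitute actual model invocations and a task-appropriate escalation criterion.

# Sketch: route requests to a small, low-carbon model first and escalate only when needed.
# The "models" below are placeholders; swap in real model calls and a real confidence check.

def answer_with_small_model(prompt: str) -> tuple[str, float]:
    # Placeholder for a cheap, small model; returns an answer and a confidence score.
    answer = f"[small-model answer to: {prompt}]"
    confidence = 0.9 if len(prompt) < 200 else 0.5   # hypothetical heuristic
    return answer, confidence

def answer_with_large_model(prompt: str) -> str:
    # Placeholder for a larger, more capable (and more carbon-intensive) model.
    return f"[large-model answer to: {prompt}]"

def route(prompt: str, confidence_threshold: float = 0.8) -> str:
    answer, confidence = answer_with_small_model(prompt)
    if confidence >= confidence_threshold:
        return answer                        # most requests stop here
    return answer_with_large_model(prompt)   # escalate only the hard cases

print(route("Summarize this paragraph in one sentence."))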

Here, integrating carbon metrics into DevOps (or MLOps) processes is important. Tools like codecarbon make it easy to track and account for the carbon emissions associated with training and serving a model. Integrating this or a similar tool into continuous integration test suites allows carbon, accuracy, and other metrics to be analyzed in concert. For example, while experimenting with model architecture, tests can immediately report both accuracy and carbon, making it easier to find the right architecture and choose the right hyperparameters to meet accuracy requirements while minimizing carbon emissions.
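As a rough sketch of what that integration could look like, codecarbon’s EmissionsTracker can wrap an experiment so that carbon is reported next to accuracy; the train_and_evaluate function here is a hypothetical stand-in for your own training and evaluation step.

# Sketch: report carbon alongside accuracy for an experiment using codecarbon.
from codecarbon import EmissionsTracker

def train_and_evaluate() -> float:
    # Hypothetical stand-in for the real training/evaluation step being measured.
    return 0.91

tracker = EmissionsTracker(project_name="model-architecture-experiment")
tracker.start()
try:
    accuracy = train_and_evaluate()
finally:
    emissions_kg = tracker.stop()   # estimated kg of CO2-equivalent for this run

print(f"accuracy={accuracy:.3f}  emissions={emissions_kg:.6f} kg CO2eq")
# A CI pipeline could surface both numbers, flagging runs where carbon rises
# without a corresponding accuracy gain.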

It’s also important to remember that experimentation itself will result in carbon emissions. In the experimentation phase of the MLOps cycle, experiments are performed with different model families and architectures to determine the best option, which can be considered in terms of accuracy, carbon and, potentially, other metrics. This can save carbon in the long run as the model continues to be trained with real-time data and/or is put into production, but excessive experimentation can waste time and energy. The appropriate balance will vary depending on many factors, but this can be easily analyzed when carbon metrics are available for running experiments as well as production training and serving of the model.

Green prompt engineering

When it comes to the carbon emissions associated with serving and using a generative model, prompt engineering becomes very important as well. For most generative AI models, like GPT, the computational resources used, and thus the carbon emitted, depend on the number of tokens passed to and generated by the model.

While the exact details depend on the implementation, prompts are generally passed “all at once” into transformer models. This might make it seem like the amount of computation doesn’t depend on the length of a prompt. However, given the quadratic cost of the self-attention mechanism, it’s reasonable to expect that implementations avoid computation for unused portions of the input, meaning that shorter prompts save computation and thus energy.

For the output, the computational cost is clearly proportional to the number of tokens produced, as the model needs to be “run again” for each token generated.

This is reflected in the pricing structure for OpenAI’s API access to GPT-4. At the time of writing, the costs for the base GPT-4 model are $0.03/1k prompt tokens and $0.06/1k sampled tokens. The prompt length and the length of the output in tokens are both incorporated into the price, reflecting the fact that both influence the amount of computation required.
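Using those published rates, a short script makes the effect of prompt and output length concrete; the token counts below are illustrative.

# How prompt and output length drive per-request cost at the GPT-4 rates quoted above.
PROMPT_RATE = 0.03 / 1000    # dollars per prompt token
SAMPLED_RATE = 0.06 / 1000   # dollars per generated token

def request_cost(prompt_tokens: int, output_tokens: int) -> float:
    return prompt_tokens * PROMPT_RATE + output_tokens * SAMPLED_RATE

print(request_cost(500, 200))   # ~0.027: a 500-token prompt producing 200 tokens
print(request_cost(300, 100))   # ~0.015: trimming both nearly halves cost (and computation)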

So, shorter prompts, and prompts that generate shorter outputs, will use less computation. This suggests a new practice of “green prompt engineering”. With proper support for experimentation in an MLOps platform, it becomes relatively easy to experiment with shortening prompts while continuously evaluating the impact on both carbon and system performance.

Beyond single prompts, there are also interesting approaches being developed to improve efficiency for more complex uses of LLMs, as in this paper.

Conclusion

Although possibly overhyped, the carbon emissions of AI are still of concern and should be managed with appropriate best practices. Transparency is needed to support effective decision-making and consumer awareness. Integrating carbon metrics into MLOps workflows can support smart choices about model architecture, size, and quantization, as well as effective green prompt engineering. This article is only an overview and just scratches the surface; for those who truly want to do green generative AI, I encourage you to follow the latest research.

Footnotes