Generative AI, powered by Large Language Models (LLMs) like GPT-3 and GPT-4, has gained significant prominence in the AI and ML industry, with widespread adoption driven by technologies like ChatGPT.
Major tech players such as Google and Meta have announced their own generative AI models, indicating the industry’s commitment to advancing these technologies.
Vector databases and embedding stores are gaining attention due to their role in enhancing observability in generative AI applications.
Responsible and ethical AI considerations are on the rise, with calls for stricter safety measures around large language models and an emphasis on improving the lives of all people through AI.
Modern data engineering is shifting towards decentralized and flexible approaches, with the emergence of concepts like Data Mesh, which advocates for federated data platforms partitioned across domains.
The InfoQ Trends Reports provide InfoQ readers with an opinionated high-level overview of the topics we believe architects and technical leaders should pay attention to. In addition, they also help the InfoQ editorial team focus on writing news and recruiting article authors to cover innovative technologies.
In this annual report, the InfoQ editors discuss the current state of AI, ML, and data engineering and what emerging trends you as a software engineer, architect, or data scientist should watch. We curate our discussions into a technology adoption curve with supporting commentary to help you understand how things are evolving.
In this year’s podcast, the InfoQ editorial team was joined by external panelist Sherin Thomas, software engineer at Chime. The following sections of the article summarize some of these trends and where different technologies fall on the technology adoption curve.
Generative AI
Generative AI, including Large Language Models (LLMs) like GPT-3, GPT-4, and ChatGPT, has become a major force in the AI and ML industry. These technologies have garnered significant attention, especially given the progress they have made over the last year. We have seen wide adoption of these technologies by users, driven in particular by ChatGPT. Multiple players such as Google and Meta have announced their own generative AI models.
The next step we expect is a larger focus on LLMOps to operate these large language models in an enterprise setting. We are divided on whether prompt engineering will remain a large topic in the future or whether adoption will become so widespread that everyone will be able to contribute to the prompts used.
Vector Databases and Embedding Stores
With the rise of LLM technology, there’s a growing focus on vector databases and embedding stores. One intriguing application gaining traction is the use of sentence embeddings to enhance observability in generative AI applications.
The need for vector search databases arises from the limitations of large language models, which have a finite token history. Vector databases can store document summaries as feature vectors generated by these language models, potentially resulting in millions or more feature vectors. With traditional databases, finding relevant documents becomes challenging as the dataset grows. Vector search databases enable efficient similarity searches, allowing users to locate the nearest neighbors to a query vector, enhancing the search process.
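As a minimal sketch of the idea (assuming a generic embedding model and a brute-force in-memory index rather than any particular product), nearest-neighbor retrieval over document vectors might look like this:

```python
import numpy as np

# Hypothetical document embeddings: one 384-dimensional vector per document
# summary. In practice these come from an embedding model; random placeholders
# are used here just to illustrate the retrieval step.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(100_000, 384)).astype(np.float32)

def nearest_neighbors(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most similar documents by cosine similarity."""
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    sims = doc_vectors @ query_vec / (doc_norms * np.linalg.norm(query_vec))
    return np.argsort(-sims)[:k]

print(nearest_neighbors(rng.normal(size=384).astype(np.float32)))
```

A dedicated vector database replaces this brute-force scan with an approximate index (such as HNSW) so that queries stay fast as the corpus grows into the millions of vectors.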
A notable trend is the surge in funding for these technologies, signaling investor recognition of their significance. However, adoption among developers has been slower, but it’s expected to pick up in the coming years. Vector search databases like Pinecone, Milvus, and open-source solutions like Chroma are gaining attention. The choice of database depends on the specific application and the nature of the data being searched.
In various fields, including Earth observation, vector databases have demonstrated their potential. NASA, for instance, leveraged self-supervised learning and vector search technology to analyze satellite images of Earth, aiding scientists in tracking weather phenomena such as hurricanes over time.
Robotics and Drone Technologies
The cost of robots is going down. In the past, legged balancing robots were hard to acquire, but some models are now available for around $1,500. This allows more users to apply robot technologies in their applications. The Robot Operating System (ROS) is still the leading software framework in this field, but companies like VIAM are also developing middleware solutions that make it easier to integrate and configure plugins for robotics development.
We expect that advances in unsupervised learning and foundational models will translate into improved capabilities, for example, by integrating a large language model into the robot’s path-planning component to enable planning using natural language.
Responsible and Ethical AI
As AI starts to affect all of humanity, there is a growing interest in responsible and ethical AI. People are simultaneously calling for stricter safety measures around large language models and expressing frustration when the output of such models reminds them of the safeguards in place.
It remains important for engineers to keep in mind that AI should improve the lives of all people, not just a select few. We expect AI regulation to have an impact similar to the one GDPR had a few years ago.
We have seen some AI projects fail because of bad data. Data discovery, data operations, data lineage, labeling, and good model development practices are going to take center stage. Data is crucial to explainability.
Data Engineering
The state of modern data engineering is marked by a dynamic shift towards more decentralized and flexible approaches to manage the ever-growing volumes of data. Data Mesh, a novel concept, has emerged to address the challenges posed by centralized data management teams becoming bottlenecks in data operations. It advocates for a federated data platform partitioned across domains, where data is treated as a product. This allows domain owners to have ownership and control over their data products, reducing the reliance on central teams. While promising, Data Mesh adoption may face hurdles related to expertise, necessitating advanced tooling and infrastructure for self-service capabilities.
Data observability has become paramount in data engineering, analogous to system observability in application architectures. Observability is essential at all layers, including data observability, especially in the context of machine learning. Trust in data is pivotal for AI success, and data observability solutions are crucial for monitoring data quality, model drift, and exploratory data analysis to ensure reliable machine learning outcomes. This paradigm shift in data management and the integration of observability across the data and ML pipelines reflect the evolving landscape of data engineering in the modern era.
Explaining the updates to the curve
With this trends report also comes an updated graph showing what we believe the state of certain technologies is. The categories are based on the book “Crossing the Chasm” by Geoffrey Moore. At InfoQ we mostly focus on categories that have not yet crossed the chasm.
One notable upgrade from innovators to early adopters is the “AI Coding Assistants” category. Although these assistants were very new last year and hardly used, we see more and more companies offering them as a service to their employees to make them more efficient. They are not a default part of every stack, and we are still discovering how to use them most effectively, but we believe that adoption will continue to grow.
Something we believe is crossing the chasm right now is natural language processing. This will not come as a surprise to anyone, as many companies are currently trying to figure out how to adopt generative AI capabilities in their product offerings following the massive success of ChatGPT. We thus decided to move it across the chasm into the early majority category. There is still a lot of potential for growth here, and time will teach us more about the best practices and capabilities of this technology.
There are some notable categories that did not move at all. These are technologies such as synthetic data generation, brain-computer interfaces, and robotics. All of these seem to be consistently stuck in the innovators category. The most promising in this regard is synthetic data generation, which has lately been getting more attention with the GenAI hype. We do see more and more companies talking about generating more of their training data, but we have not seen enough applications actually using it in their stacks to warrant moving it to the early adopters category. Robotics has been getting a lot of attention for multiple years now, but its adoption rate is still too low to warrant a move.
We also introduced several new categories to the graph. A notable one is vector search databases, which come as a byproduct of the GenAI hype. As we gain a better understanding of how to represent concepts as vectors, there is also a greater need for efficiently storing and retrieving those vectors. We also added explainable AI to the innovators category. We believe that computers explaining why they made a certain decision will be vital for widespread adoption, to combat hallucinations and other dangers. However, we currently don’t see enough work in the industry to warrant a higher category.
Conclusion
The field of AI, ML, and Data Engineering keeps growing year over year. There is still a lot of growth in both the technological capabilities and the possible applications. It’s exciting for us editors at InfoQ to be so close to the progress, and we are looking forward to producing this report again next year. In the podcast we make several predictions for the coming year, ranging from “there will be no AGI” to “Autonomous Agents will be a thing”. We hope you enjoyed listening to the podcast and reading this article, and we would love to see your predictions and comments below this article.
There’s an increasing concern around carbon emissions as generative AI becomes more integrated in our everyday lives
The comparisons of carbon emissions between generative AI and the commercial aviation industry are misleading
Organizations should incorporate best practices to mitigate emissions specific to generative AI. Transparency requirements could be crucial to both training and using AI models
Improving energy efficiency in AI models is valuable not only for sustainability but also for improving capabilities and reducing costs
Prompt engineering becomes key to reducing the computational resources, and thus the carbon emitted, when using generative AI. Prompts that generate shorter outputs use less computation, which leads to a new practice: “green prompt engineering”
Introduction
Recent developments in generative AI are transforming our industry and our broader society. Language models like ChatGPT and Copilot are drafting letters and writing code, image and video generation models can create compelling content from a simple prompt, and music and voice models allow easy synthesis of speech in anyone’s voice and the creation of sophisticated music.
Conversations on the power and potential value of this technology are happening around the world. At the same time, people are talking about risks and threats.
From extreme worries about superintelligent AI wiping out humanity to more grounded concerns about the further automation of discrimination and the amplification of hate and misinformation, people are grappling with how to assess and mitigate the potential negative consequences of this new technology.
People are also increasingly concerned about the energy use and corresponding carbon emissions of these models. Dramatic comparisons have resurfaced in recent months.
The ultimate impact will depend on how this technology is used and to what degree it is integrated into our lives.
It is difficult to anticipate exactly how it will impact our day-to-day lives, but one current example, the search giants integrating generative AI into their products, is fairly clear.
Martin Bouchard, cofounder of Canadian data center company QScale, believes that, based on his reading of Microsoft and Google’s plans for search, adding generative AI to the process will require “at least four or five times more computing per search” at a minimum.
It’s clear that generative AI is not to be ignored.
Are carbon emissions of generative AI overhyped?
However, the concerns about the carbon emissions of generative AI may be overhyped. It’s important to put things in perspective: the entire global tech sector accounts for 1.8% to 3.9% of global greenhouse-gas emissions, but only a fraction of those emissions is caused by AI[1]. Dramatic comparisons between AI and aviation or other sources of carbon create confusion because of differences in scale: while there are many cars and aircraft traveling millions of kilometers every day, training a modern AI model like the GPT models is something that happens only a relatively small number of times.
Admittedly, it’s unclear exactly how many large AI models have been trained. Ultimately, that depends on how we define “large AI model.” However, if we consider models at the scale of GPT-3 or larger, it is clear that there have been fewer than 1,000 such models trained. To do a little math:
A recent estimate suggests that training GPT-3 emitted 500 metric tons of CO2. Meta’s LLaMA model was estimated to emit 173 tons. Training 1,000 500-ton models would involve a total emission of about 500,000 metric tons of CO2. Newer models may increase the emissions somewhat, but 1,000 models is almost certainly an overestimate, which accounts for this. The commercial aviation industry emitted about 920,000,000 metric tons of CO2 in 2019[2], almost 2,000 times as much as LLM training, and keep in mind that this compares one year of aviation to multiple years of LLM training. The training of LLMs is still not negligible, but the dramatic comparisons are misleading. More nuanced thinking is needed.
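The back-of-the-envelope comparison can be reproduced in a few lines, using only the figures quoted above (the 1,000-model count is the deliberate overestimate):

```python
# Figures quoted in the text.
tons_per_model = 500            # estimated CO2 from training GPT-3, metric tons
n_large_models = 1_000          # deliberate overestimate of GPT-3-scale models
aviation_2019 = 920_000_000     # commercial aviation CO2 in 2019, metric tons

llm_training_total = tons_per_model * n_large_models    # 500,000 t
ratio = aviation_2019 / llm_training_total               # ~1,840
print(f"One year of aviation emitted about {ratio:,.0f}x the CO2 "
      f"of all large-model training combined.")
```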
This, of course, is only considering the training of such models. The serving and use of the models also requires energy and has associated emissions. Based on one analysis, ChatGPT might emit about 15,000 metric tons of CO2 to operate for a year. Another analysis suggests much less at about 1,400 metric tons. Not negligible, but still nothing compared to aviation.
Emissions transparency is needed
But even if the concerns about the emissions of AI are somewhat overhyped, they still merit attention, especially as generative AI becomes integrated into more and more of our modern life. As AI systems continue to be developed and adopted, we need to pay attention to their environmental impact. There are many well-established practices that should be leveraged, and also some ways to mitigate emissions that are specific to generative AI.
Firstly, transparency is crucial. We recommend transparency requirements to allow for monitoring of the carbon emissions related to both training and use of AI models. This will allow those deploying these models, as well as end users, to make informed decisions about their use of AI based on its emissions, and to incorporate AI-related emissions into their greenhouse-gas inventories and net-zero targets. This is one component of holistic AI transparency.
As an example of how such requirements might work, France has recently passed a law mandating telecommunications companies to provide transparency reporting around their sustainability efforts. A similar law could require products incorporating AI systems to report carbon emissions to their customers and also for model providers to integrate carbon emissions data into their APIs.
Greater transparency can lead to stronger incentives to build energy-efficient generative AI systems, and there are many ways to increase efficiency. In another recent InfoQ article, Sara Bergman, Senior Software Engineer at Microsoft, encourages people to consider the entire lifecycle of an AI system and provides advice on applying the tools and practices from the Green Software Foundation to making AI systems more energy efficient, including careful selection of server hardware and architecture, as well as time and region shifting to find less carbon-intensive electricity. But generative AI presents some unique opportunities for efficiency improvements.
The latter factors (careful selection of server hardware and architecture, and time and region shifting) are relevant for any software and are well explored by others, such as in the InfoQ article we mentioned. Thus, we will focus here on the first three factors, namely model size, quantization, and model architecture, all of which involve some tradeoff between energy use and model performance.
It’s worth noting that efficiency is valuable not only for sustainability concerns. More efficient models can improve capabilities in situations where less data is available, decrease costs, and unlock the possibility of running on edge devices.
As the paper “Emergent Abilities of Large Language Models” puts it: “Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models.”
We see that not only do larger models do better at a given task, but there are actually entirely new capabilities that emerge only as models get large. Examples of such emergent capabilities include adding and subtracting large numbers, toxicity classification, and chain of thought techniques for math word problems.
But training and using larger models requires more computation and thus more energy. Thus, we see a tradeoff between the capabilities and performance of a model and its computational, and thus carbon, intensity.
Quantization
There has been significant research into the quantization of models. This is where lower-precision numbers are used in model computations, reducing computational intensity, albeit at the expense of some accuracy. It has typically been applied to allow models to run on more modest hardware, for example, enabling LLMs to run on a consumer-grade laptop. The tradeoff between decreased computation and decreased accuracy is often very favorable, making quantized models extremely energy-efficient for a given level of capability. There are related techniques, such as “distillation”, that use a larger model to train a smaller model that can perform extremely well for a given task.
Distillation technically requires training two models, so it could well increase the carbon emissions related to model training; however, it should compensate for this by decreasing the model’s in-use emissions. Distillation of an existing, already-trained model can also be a good solution. It’s even possible to leverage both distillation and quantization together to create a more efficient model for a given task.
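To make the quantization tradeoff concrete, here is a minimal sketch of post-training weight quantization in plain NumPy (no particular framework assumed): weights are mapped to 8-bit integers plus a per-tensor scale, shrinking memory and compute at the cost of a small rounding error.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = np.abs(weights).max() / 127.0                    # one scale per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"Weights are 4x smaller; mean absolute rounding error: {error:.5f}")
```

Real quantization schemes (per-channel scales, 4-bit formats, quantization-aware training) are more sophisticated, but the energy argument is the same: fewer bits mean less data movement and cheaper arithmetic.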
Model Architecture
Model architecture can have an enormous impact on computational intensity, so choosing a simpler model can be the most effective way to decrease carbon emissions from an AI system. While GPT-style transformers are very powerful, simpler architectures can be effective for many applications. Models like ChatGPT are considered “general-purpose,” meaning that they can be used for many different applications. However, when a fixed application is required, using a complex model may be unnecessary. A custom model for the task may be able to achieve adequate performance with a much simpler and smaller architecture, decreasing carbon emissions. Another useful approach is fine-tuning: the paper Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning discusses how fine-tuning “offers better accuracy as well as dramatically lower computational costs”.
Putting carbon and accuracy metrics on the same level
The term “accuracy” easily feeds into a “more is better” mentality. To address this, it is critical to understand the requirements for the given application – “enough is enough”. In some cases, the latest and greatest model may be needed, but for other applications, older, smaller, possibly quantized models might be perfectly adequate. In some cases, correct behavior may be required for all possible inputs, while other applications may be more fault tolerant. Once the application and level of service required is properly understood, an appropriate model can be selected by comparing performance and carbon metrics across the options. There may also be cases in which a suite of models can be leveraged. Requests can, by default, be passed to simpler, smaller models, but in cases in which the task can’t be handled by the simple model, it can be passed off to a more sophisticated model.
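A sketch of that tiered setup, with placeholder models and a made-up confidence threshold standing in for whatever routing criterion a real system would use:

```python
import random

def small_model(request: str) -> tuple[str, float]:
    # Placeholder for a distilled or quantized model; returns an answer
    # and a self-reported confidence score.
    return f"[small-model answer to] {request}", random.random()

def large_model(request: str) -> str:
    # Placeholder for the full-size model, used only as a fallback.
    return f"[large-model answer to] {request}"

def answer(request: str) -> str:
    """Route each request to the cheapest model that can handle it."""
    draft, confidence = small_model(request)   # cheap, low-carbon first pass
    if confidence >= 0.8:                      # threshold is application-specific
        return draft
    return large_model(request)                # escalate only when needed

print(answer("Summarize this support ticket."))
```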
Here, integrating carbon metrics into DevOps (or MLOps) processes is important. Tools like codecarbon make it easy to track and account for the carbon emissions associated with training and serving a model. Integrating this or a similar tool into continuous integration test suites allows carbon, accuracy, and other metrics to be analyzed in concert. For example, while experimenting with model architecture, tests can immediately report both accuracy and carbon, making it easier to find the right architecture and choose the right hyperparameters to meet accuracy requirements while minimizing carbon emissions.
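A sketch of what that integration could look like in a CI test, assuming the codecarbon package’s EmissionsTracker API and a hypothetical carbon budget and training function:

```python
from codecarbon import EmissionsTracker

def train_model():
    ...  # placeholder for the actual training or fine-tuning run

def test_training_stays_within_carbon_budget():
    tracker = EmissionsTracker(project_name="model-experiments")
    tracker.start()
    train_model()
    emissions_kg = tracker.stop()   # estimated kg of CO2-equivalent
    # Hypothetical budget: fail the build if an experiment exceeds it, so
    # carbon regressions surface alongside accuracy regressions.
    assert emissions_kg < 5.0, f"experiment emitted {emissions_kg:.2f} kg CO2eq"
```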
It’s also important to remember that experimentation itself will result in carbon emissions. In the experimentation phase of the MLOps cycle, experiments are performed with different model families and architectures to determine the best option, which can be considered in terms of accuracy, carbon and, potentially, other metrics. This can save carbon in the long run as the model continues to be trained with real-time data and/or is put into production, but excessive experimentation can waste time and energy. The appropriate balance will vary depending on many factors, but this can be easily analyzed when carbon metrics are available for running experiments as well as production training and serving of the model.
Green prompt engineering
When it comes to carbon emissions associated with the serving and use of a generative model, prompt engineering becomes very important as well. For most generative AI models — like GPT — the computational resources used, and thus carbon emitted, depend on the number of tokens passed to and generated by the model.
While the exact details depend on the implementation, prompts are generally passed “all at once” into transformer models. This might make it seem like the amount of computation doesn’t depend on the length of a prompt. However, given the quadratic scaling of the self-attention mechanism, it’s reasonable to expect that implementations avoid computation for unused portions of the input, meaning that shorter prompts save computation and thus energy.
For the output, it is clear that the computational cost is proportional to the number of tokens produced, as the model needs to be “run again” for each token generated.
This is reflected in the pricing structure for OpenAI’s API access to GPT-4. At the time of writing, the costs for the base GPT-4 model are $0.03/1k prompt tokens and $0.06/1k sampled tokens. The prompt length and the length of the output in tokens are both incorporated into the price, reflecting the fact that both influence the amount of computation required.
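Using the prices quoted above, the cost of a call (and, roughly, the computation behind it) scales linearly with both token counts, which is what makes trimming prompts and outputs worthwhile:

```python
# GPT-4 base-model prices quoted in the text, in USD per token.
PROMPT_PRICE = 0.03 / 1000
OUTPUT_PRICE = 0.06 / 1000

def call_cost(prompt_tokens: int, output_tokens: int) -> float:
    return prompt_tokens * PROMPT_PRICE + output_tokens * OUTPUT_PRICE

# Trimming a prompt from 800 to 300 tokens and capping the answer at 150
# instead of 400 tokens cuts the per-call cost by more than half.
print(call_cost(800, 400))   # 0.048
print(call_cost(300, 150))   # 0.018
```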
So, shorter prompts and prompts that will generate shorter outputs will use less computation. This suggests a new process of “green prompt engineering”. With proper support for experimentation in an MLOps platform, it becomes relatively easy to experiment with shortening prompts while continuously evaluating the impact of both carbon and system performance.
Beyond single prompts, there are also interesting approaches being developed to improve efficiency for more complex uses of LLMs, as in this paper.
Conclusion
Although possibly overhyped, the carbon emissions of AI are still of concern and should be managed with appropriate best practices. Transparency is needed to support effective decision-making and consumer awareness. Also, integrating carbon metrics into MLOps workflows can support smart choices about model architecture, size, and quantization, as well as effective green prompt engineering. The content in this article is an overview only and just scratches the surface. For those who truly want to do green generative AI, I encourage you to follow the latest research.
Recent advances in prose-to-code generation via Large Language Models (LLMs) will make it practical for non-programmers to “program in prose” for practically useful program complexities, a long-standing dream of computer scientists and subject-matter experts alike.
Assuming that correctness of the code and explainability of the results remain important, testing the code will still have to be done using more traditional approaches. Hence, the non-programmers must understand the notion of testing and coverage.
Program understanding, visualization, exploration, and simulation will become even more relevant in the future to illustrate what the generated program does to subject matter experts.
There is a strong synergy with very high-level programming languages and domain-specific languages (DSLs) because the to-be-generated programs are shorter (and less error prone) and more directly aligned with the execution semantics (and therefore easier to understand).
I think it is still an open question how far the approach scales and what integrated tools that exploit both LLMs’ “prose magic” and more traditional ways of computing will look like. I illustrate this with an open-source demonstrator implemented in JetBrains MPS.
Introduction
As a consequence of AI, machine learning, neural networks, and in particular Large Language Models (LLMs) like ChatGPT, there’s a discussion about the future of programming. There are mainly two areas. One focuses on how AI can help developers code more efficiently. We have probably all asked ChatGPT to generate small-ish fragments of code from prose descriptions and pasted them into whatever larger program we were developing. Or used GitHub Copilot directly in our IDEs.
This works quite well because, as programmers, we can verify that the code makes sense just by looking at it or trying it out in a “safe” environment. Eventually (or even in advance), we write tests to validate that the generated code works in all relevant scenarios. And the AI-generated code doesn’t even have to be completely correct because it is useful to developers if it reaches 80% correctness. Just like when we look up things on Stackoverflow, it can serve as an inspiration/outline/guidance/hint to allow the programmer to finish the job manually. I think it is indisputable that this use of AI provides value to developers.
The second discussion area is whether this will enable non-programmers to instruct computers. The idea is that they just write a prompt, and the AI generates code that makes the machine do whatever they intended. The key difference to the previous scenario is that the inherent safeguards against generated nonsense aren’t there, at least not obviously.
A non-programmer user can’t necessarily look at the code and check it for plausibility, they can’t necessarily bring a generated 80% solution to 100%, and they don’t necessarily write tests. So will this approach work, and how must languages and tools change to make it work? This is the focus of this article.
Why not use AI directly?
You might ask: why generate programs in the first place? Why don’t we just use a general-purpose AI to “do the thing” instead of generating code that then “does the thing”? Let’s say we are working in the context of tax calculation. Our ultimate goal is a system that calculates the tax burden for any particular citizen based on various data about their incomes, expenses, and life circumstances.
We could use an approach where a citizen enters their data into some kind of a form and then submits the form data (say, as JSON) to an AI (either a generic LLM or tax-calculation-specific model), which then directly computes the taxes. There’s no program in between, AI-generated or otherwise (except the one that collects the data, formats the JSON, and submits it to the AI). This approach is unlikely to be good enough in most cases for the following reasons:
AI-based software isn’t good at mathematical calculations [1]; this isn’t a tax-specific issue since most real-world domains contain numeric calculations.
If an AI is only 99% correct, the 1% wrong is often a showstopper.
Whatever the result is, it can’t be explained or “justified” to the end user (I will get back to this topic below).
Using a neural network to run a computation for which a deterministic algorithm exists is inefficient in terms of computing power and the resulting energy and water consumption.
If there’s a change to the algorithm, we have to retrain the network, which is even more computationally expensive.
To remedy these issues, we use an approach where a subject matter expert who is not a programmer, say our tax consultant, describes the logic of the tax calculation to the AI, and the AI generates a classical, deterministic algorithm which we then repeatedly run on citizens’ data. Assuming the generated program is correct, all the above drawbacks are gone:
It calculates the result with the required numeric precision.
By tracing the calculation algorithm, we can explain and justify the result (again, I will explain this in more detail below).
It will be correct in 100% of the cases (assuming the generated program is correct).
The computation is as energy efficient as any program today.
The generated code can be adapted incrementally as requirements evolve.
Note that we assume correctness for all (relevant) cases and explainability are important here. If you don’t agree with these premises, then you can probably stop reading; you are likely of the opinion that AI will replace more or less all traditionally programmed software. I decidedly don’t share this opinion, at least not for 5–7 years.
Correctness and Creativity
Based on our experience with LLMs writing essays or Midjourney & Co generating images, we ascribe creativity to these AIs. Without defining precisely what “creativity” means, I see it here as a degree of variability in the results generated for the same or slightly different prompts. This is a result of how word prediction works and the fact that these tools employ randomness in the result generation process (Stephen Wolfram explains this quite well in his essay). This feels almost like making a virtue from the fault that neural networks generally aren’t precisely deterministic.
Just do an experiment and ask an image-generating AI to render technical subjects such as airplanes or cranes, subjects for which a specific notion of “correct” exists; jet airliners just don’t have two wings of different lengths or two engines on one wing and one on the other. The results are generally disappointing. If, instead, you try to generate “fantasy dogs running in the forest while it rains,” the imprecision and variability are much more tolerable, to the point that we interpret them as “creativity.” Generating programs is more like rendering images of airplanes than of running dogs. Creativity is not a feature for this use case of AI.
Explainability
Let me briefly linger on the notion of explainability. Consider again your tax calculation. Let’s say it asks you to pay 15.323 EUR for a particular period of time. Based on your own estimation, this seems too much, so you ask, “Why is it 15.323 EUR?” If an AI produces the result directly, it can’t answer this question. It might (figuratively) reply with the weights, thresholds, and activation levels of its internal neurons. Still, those have absolutely no meaning to you as a human. Their connection to the logic of tax calculation is, at best, very indirect. Maybe it can even (figuratively) show you that your case looks very similar to these 250 others, and therefore, somehow, your tax amount has to be 15.323 EUR. A trained neural network is essentially just an extremely tight curve fit, one with a huge number of parameters. It’s a form of “empirical programming”: it brute-force replicates existing data and extrapolates.
It’s just like in science: to explain what fitted data means, you have to connect it to a scientific theory, i.e., “fit the curve with physical quantities that we know about.” The equivalent of a scientific theory (stretching the analogy a bit) is a “traditional” program that computes the result based on a “meaningful” algorithm. The user can inspect the intermediate values, see the branches the program took, the criteria for decisions, and so on. This serves as a reasonable first-order answer to the “why” question – especially if the program is expressed with abstractions, structures, and names that make sense in the context of the tax domain [2].
A well-structured program can also be easily traced back to the law or regulations that back up the particular program code. Program state, expressed with reasonably domain-aligned abstractions, plus a connection to the “requirements” (the law in case of tax calculation) is a really good answer to the “why.” Even though there is research into explainable AI, I don’t think the current approach of deep learning will be able to do this anytime soon. And the explanations that ChatGPT provides are often hollow or superficial. Try to ask “why” one or two more times, and you’ll quickly see that it can’t really explain a lot.
Domain-Specific Tools and Languages
Part of the answer to whether subject-matter expert prose programming works is domain-specific languages (DSLs). A DSL is a (software) language that is tailor-made for a particular problem set – for example, for describing tax calculations and the data structures necessary for them, or for defining questionnaires used in healthcare to diagnose conditions like insomnia or drug abuse. DSLs are developed with the subject matter experts (SMEs) in the field and rely on abstractions and notations familiar to them. Consequently, if the AI generates DSL code, subject matter experts will be more able to read the code and validate “by looking” that it is correct.
There’s an important comment I must make here about the syntax. As we know, LLMs work with text, so we have to use a textual syntax for the DSL when we interact with the LLM. However, this does not mean that the SME has to look at this for validation and other purposes. The user-facing syntax can be a mix of whatever makes sense: graphical, tables, symbolic, Blockly-style, or textual. While representing classical programming languages graphically often doesn’t work well, it works much better if the language has been designed from the get-go with the two syntaxes in mind – the DSL community has lots of experience with this.
More generally, if the code is written by the AI and only reviewed or adapted slightly by humans, then the age-old trade-off between writability and readability is decided in favor of readability. I think the tradeoff has always tended in this direction because code is read much more often than it is written, plus IDEs have become more and more helpful with the writing part. Nonetheless, if the AI writes the code, then the debate is over.
A second advantage of generating code in very high-level languages such as DSLs is that it is easier for the AI to get it right. Remember that LLMs are Word Prediction Machines. We can reduce the risk of wrong predictions by limiting the vocabulary and simplifying the grammar. There will be less non-essential variability in the sentences, so there will be a higher likelihood of correctly generated code. We should also ensure that the programming language is good at separating concerns: no “technical stuff” mixed with the business logic the SME cares about.
The first gateway for correctness is the compiler (or syntax/type checker in case of an interpreted language). Any generated program that does not type check or compile can be rejected immediately, and the AI can automatically generate another one. Here is another advantage of high-level languages: you can more easily build type systems that, together with the syntactic structure, constrain programs to be meaningful in the domain. In the same spirit, the fewer (unnecessary) degrees of freedom a language has, the easier it is to analyze the programs relative to interesting properties. For example, a state machine model is easier to model check than a C program. It is also easier to extract an “explanation” for the result, and, in the end, it is easier for an SME to learn to validate the program by reading it or running it with some kind of simulator or debugger. There’s just less clutter, which simplifies everybody’s (and every tool’s) life.
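A minimal sketch of that generate-and-reject loop, with placeholder helpers standing in for the LLM call and for the DSL’s parser and type checker:

```python
def generate_dsl_program(prompt: str, feedback: str = "") -> str:
    # Placeholder for the LLM call ("generate a program in the DSL described
    # earlier"); feedback carries the errors from the previous attempt.
    ...

def check(program: str) -> list[str]:
    # Placeholder for the DSL's parser and type checker; returns error messages.
    ...

def generate_valid_program(prompt: str, max_attempts: int = 5) -> str:
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate_dsl_program(prompt, feedback)
        errors = check(candidate)
        if not errors:
            return candidate          # only well-formed programs reach the SME
        feedback = "; ".join(errors)  # feed the errors back into the next attempt
    raise RuntimeError("no well-formed program generated")
```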
There are several examples that use this approach. Chat Notebooks in Mathematica allow users to write prose, and ChatGPT generates the corresponding Wolfram Language code that can then be executed in Mathematica. A similar approach has been demonstrated for Apache Spark and itemis CREATE, a state machine modeling tool (the linked article is in German, but the embedded video is in English). I will discuss my demonstrator a bit more in the next section.
The approach of generating DSL code also has a drawback: the internet isn’t full of example code expressed in your specific language for the LLM to learn from. However, it turns out that “teaching” ChatGPT the language works quite well. I figure there are two reasons: one is that even though the language is domain-specific, many parts of it, for example, expressions, are usually very similar to traditional programming languages. And second, because DSLs are higher-level relative to the domain, the syntax is usually a bit more “prose-like”; so expressing something “in the style of the DSL I explained earlier” is not a particular challenge for an AI.
The size of the language you can teach to an LLM is limited by the “working memory” of the LLM, but it is fair to assume that this will grow in the future, allowing more sophisticated DSLs. And I am sure that other models will be developed that are optimized for structured text, following a formal schema rather than the structure of (English) prose.
A demonstrator
I have implemented a system that demonstrates the approach of combining DSLs and LLMs. The demonstrator is based on JetBrains’ MPS and ChatGPT; the code is available on github. The example language focuses on forms with fields and calculated values; more sophisticated versions of such forms are used, for example, as assessments in healthcare. Here is an example form:
In addition to the forms, the language also supports expressing tests; these can be executed via an interpreter directly in MPS.
In this video, I show how ChatGPT 3.5 turbo generates meaningfully interesting forms for prose prompts. Admittedly, this is a simple language, and the DSLs we use for real-world systems are more complex. I have also done other experiments where the language was more complicated and it worked reasonably well. And as I have said, LLMs will become better and more optimized for this task. In addition, most DSLs have different aspects or viewpoints, and a user often just has to generate small parts of the model that, from the perspective of the LLM, can be seen as smaller languages.
A brief description of how this demonstrator is implemented technically can be found in the README on github.
Understanding and testing the generated code
Understanding what a piece of code does just by reading it only goes so far. A better way to understand code is to run it and observe the behavior. In the case of our tax calculation example, we might check that the amount of tax our citizen has to pay is correct relative to what the regulations specify. Or we validate that the calculated values in the healthcare forms above have the expected values. For realistically complex programs, there is a lot of variability in the behavior; there are many case distinctions (tax calculations are a striking example, and so are algorithms in healthcare), so we write tests to validate all relevant cases.
This doesn’t go away just because the code is AI-generated. It is even more critical because if we don’t express the prose requirements precisely, the generated code is likely to be incorrect or incomplete – even if we assume the AI doesn’t hallucinate nonsense. Suppose we use a dialog-based approach to get the code right incrementally. In that case, we need regression testing to ensure previously working behavior isn’t destroyed by an AI’s “improvement” of the code. So all of this leads to the conclusion that if we let the AI generate programs, the non-programmer subject matter expert must be in control of a regression test suite – one that has reasonably good coverage.
I don’t think that it is efficient – even for SMEs – to make every small change to the code through a prose instruction to the AI. Over time they will get a feel for the language and make code changes directly. The demo application I described above allows users to modify the generated form, and when they then instruct the LLM to modify further, the LLM continues from the result of the user’s modified state. Users and the LLM can truly collaborate. The tooling also supports “undo”: if the AI changes the code in a way that does more harm than good, you want to be able to roll back. The demonstrator I have built keeps the history of {prompt-reply}-pairs as a list of nodes in MPS; stepwise undo is supported just by deleting the tail of the list.
So how can SMEs get to the required tests? If the test language is simple enough (which is often the case for DSLs, based on my experience), they can manually write the tests. This is the case in my demonstrator system, where tests are just a list of field values and calculation assertions. It’s inefficient to have an LLM generate the tests based on a much more verbose prose description. This is especially true with good tool support where, as in the demonstrator system, the list of fields and calculations is already pre-populated in the test. An alternative to writing the tests is to record them while the user “plays” with the generated artifact. While I have not implemented this for the demonstrator, I have done it for a similar health assessment DSL in a real project: the user can step through a fully rendered form, enter values, and express “ok” or “not ok” on displayed calculated values.
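For a forms DSL like the one in the demonstrator, a test can be little more than data: a set of input field values plus assertions on the calculated values. A hypothetical sketch (the field names and the calculation are made up for illustration):

```python
# A recorded or hand-written test case for a hypothetical BMI form:
# input field values plus the expected calculated values.
test_case = {
    "inputs": {"weight_kg": 80, "height_m": 1.80},
    "expected": {"bmi": 24.7},
}

def run_form(inputs: dict) -> dict:
    # Stand-in for interpreting the calculation rules of the generated form.
    bmi = inputs["weight_kg"] / inputs["height_m"] ** 2
    return {"bmi": round(bmi, 1)}

def check(test: dict) -> bool:
    actual = run_form(test["inputs"])
    return all(actual[key] == value for key, value in test["expected"].items())

assert check(test_case)   # the regression suite is simply a list of such cases
```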
Note that users still have to think about relevant test scenarios, and they still have to continue creating tests until a suitable coverage metric shows green. A third option is to use existing test case generation tools. Based on analysis of the program, they can come up with a range of tests that achieve good coverage. The user will usually still have to manually provide the expected output values (or, more generally, assert the behavior) for each automatically generated set of inputs. For some systems, such test case generators can generate the correct assertion as well, but then the SME user at least has to review them thoroughly – because they will be wrong if the generated program is wrong. Technically speaking, test case generation can only verify a program, not validate it.
Mutation testing (where a program is automatically modified to identify parts that don’t affect test outcomes) is a good way of identifying holes in the coverage. The nice thing about this approach is that it does not rely on fancy program analysis, and it’s easy to implement, even for your own (domain-specific) languages. In fact, the MPS infrastructure on which we have built our demonstrator DSL supports coverage analysis (based on the interpreter that runs the tests), and we also have a prototype program mutator.
We can also consider having the tests generated by an AI. Of course, this carries the risk of self-fulfilling prophecies; if the AI “misunderstands” the prose specification, it might generate a wrong program and tests that falsely corroborate that wrong program. To remedy this issue, we can have the program and tests generated by different AIs. At the very least, you should use two separate ChatGPT sessions. In my experiments, ChatGPT couldn’t generate the correct expected values for the form calculations; it couldn’t “execute” the expressions it had generated into the form earlier. Instead of generating tests, we can generate properties [3] for verification tools, such as model checkers. In contrast to generated tests, generated properties provide a higher degree of confidence. Here’s the important thing: even if tests or properties are generated (by traditional test generators or via AI), the tests at least have to be validated by a human. Succeeding tests or tool-based program verifications are only useful if they ensure the right thing.
There’s also a question about debugging. What happens if the generated code doesn’t work for particular cases? Just writing prompts à la “the code doesn’t work in this case, fix it!” is inefficient; experiments with my demonstrator confirm this suspicion. It will eventually become more efficient to adapt the generated code directly. So again: the code has to be understood and “debugged.” A nicely domain-aligned language (together with simulators, debuggers, and other means of relating the program source to its behavior) can go a long way, even for SMEs. The field of program tracing, execution visualization, live programming, and integrated programming environments where there’s less distinction between the program and its executions is very relevant here. I think much more research and development are needed for programs without obvious graphical representations; the proverbial bouncing ball from the original Live Programming demo comes to mind.
There’s also another problem I call “broken magic.” If SMEs are used to things “just working” based on their prose AI prompt, and they are essentially shielded from the source code and, more generally, how the generated program works, then it will be tough for them to dig into that code to fix something. The more “magic” you put into the source-to-behavior path, the harder it is for (any kind of) user to go from behavior back to the program during debugging. You need quite fancy debuggers, which can be expensive to build. This is another lesson learned from years and years of using DSLs without AI.
Summing up
Let’s revisit the skills the SMEs will need in order to reliably use AI to “program” in the context of a particular domain. In addition to being able to write prompts, they will have to learn how to review, write or record tests, and understand coverage to appreciate which tests are missing and when enough tests are available. They have to understand the “paradigm” and structure of the generated code so they can make sense of explanations and make incremental changes. For this to work in practice, we software engineers have to adapt the languages and tools we use as the target of AI code generation:
Smaller and more domain-aligned languages have a higher likelihood that the generated code will be correct and are easier for SMEs to understand; this includes the language for writing tests.
We need program visualizers, animators, simulators, debuggers, and other tools that reduce the gap between a program and its set of executions.
Finally, any means of test case generation, program analysis, and the like will be extremely useful.
So, the promise that AI will let humans communicate with computers using the humans’ language is realistic to a degree. While we can express the expected behavior as prose, humans have to be able to validate that the AI-generated programs are correct in all relevant cases. I don’t think that doing this just via a prose interface will work well; some degree of education on “how to talk to computers” will still be needed, and the diagnosis that this kind of education is severely lacking in most fields except computer science remains true even with the advent of AI.
Of course, things will change as AI improves – especially in the case of groundbreaking new ideas where classical, rule-based AI is meaningfully integrated with LLMs. Maybe more or less manual validation is no longer necessary because the AI is somehow good enough to always generate the correct programs. I don’t think this will happen in the next 5–7 years. Predicting beyond is difficult – so I don’t.
Footnotes
[1] In the future, LLMs will likely be integrated with arithmetic engines like Mathematica, so this particular problem might go away.
[2] Imagine the same calculation expressed as a C program with a set of global integer variables all named i1 through i500. Even though the program can absolutely produce the correct results and is fully deterministic, inspecting the program’s execution – or some kind of report auto-generated from it – won’t explain anything to a human. Abstractions and names matter a lot!
[3] Properties are generalized statements about the behavior of a system that verification tools try to prove or try to find counterexamples for.
Acknowledgments
Thanks to Sruthi Radhakrishnan, Dennis Albrecht, Torsten Görg, Meite Boersma, and Eugen Schindler for feedback on previous versions of this article.
I also want to thank Srini Penchikala and Maureen Spencer for reviewing and copyediting this article.
Designing distributed file systems that maintain POSIX-compatibility is a challenging task, often requiring tradeoffs to be made.
GFS introduced a decoupled architecture comprising a master, chunkservers, and clients and became the foundation of many other big data systems.
Tectonic employs a layered metadata design, enabling the separation of storage and compute for metadata. This innovative approach enhances scalability and performance.
JuiceFS uses cost-effective, robust object storage services for data storage, while employing open-source databases as its metadata engine. This aligns with the demands of cloud computing.
Distributed file systems play a crucial role in enabling scalable, reliable, and performant data storage and processing, driving innovation in the field of big data and cloud-native solutions.
As technology advances and data continues to explode, traditional disk file systems have revealed their limitations. To address the growing storage demands, distributed file systems have emerged as dynamic and scalable solutions.
In this article, we explore the design principles, innovations, and challenges addressed by three representative distributed file systems: Google File System (GFS), Tectonic, and JuiceFS.
GFS pioneered the use of commodity hardware and influenced systems like the Hadoop Distributed File System (HDFS) in big data.
Tectonic introduced layered metadata and storage/compute separation, improving scalability and performance.
JuiceFS, designed for the cloud-native era, uses object storage and a versatile metadata engine for scalable file storage in the cloud.
By exploring the architectures of these three systems, you will gain valuable insights into designing distributed file systems.
This understanding can guide enterprises in choosing suitable file systems.
We aim to inspire professionals and researchers in big data, distributed system design, and cloud-native technologies with knowledge to optimize data storage, stay informed about industry trends, and explore practical applications.
An overview of popular distributed file systems
The table below shows a variety of widely-used distributed file systems, both open-source and proprietary.
As shown in the table, a large number of distributed systems emerged around the year 2000. Before this period, shared storage, parallel file systems, and distributed file systems existed, but they often relied on specialized and expensive hardware.
The “POSIX-compatible” column in the table represents the compatibility of the distributed file system with the Portable Operating System Interface (POSIX), a set of standards for operating system implementations, including file system-related standards. A POSIX-compatible file system must meet all the features defined in the standard, rather than just a few.
For example, GFS is not a POSIX-compatible file system. Google made several trade-offs when it designed GFS. It discarded many disk file system features and retained some distributed storage requirements needed for Google’s search engine at that time.
In the following sections, we’ll focus on the architecture design of GFS, Tectonic, and JuiceFS. Let’s explore the contributions of each system and how they have transformed the way we handle data.
GFS Architecture
In 2003, Google published the GFS paper. It demonstrated that we can use cost-effective commodity computers to build a powerful, scalable, and reliable distributed storage system, entirely based on software, without relying on proprietary or expensive hardware resources.
GFS significantly reduced the barrier to entry for distributed file systems. Its influence can be seen in varying degrees on many subsequent systems. HDFS, an open-source distributed file system developed by Yahoo, is heavily influenced by the design principles and ideas presented in the GFS paper. It has become one of the most popular storage systems in the big data domain. Although GFS was released in 2003, its design is still relevant and widely used today.
GFS consists of three types of components:
A master, which serves as the metadata node. This central node maintains metadata such as directories, permissions, and file attributes, organized in a tree-like structure.
Multiple chunkservers, which store the data. The chunkserver relies on the local operating system’s file system to store the data.
Multiple clients, which access the data.
The communication between the master and chunkserver is through a network, resulting in a distributed file system. The chunkservers can be horizontally scaled as data grows.
These components work together: when a client initiates a request, it first retrieves the file's metadata from the master, then communicates with the relevant chunkservers, and finally obtains the data.
GFS stores files in fixed-size chunks, usually 64 MB, with multiple replicas to ensure data reliability. Therefore, reading the same file may require communication with different chunkservers. The replica mechanism is a classic design of distributed file systems, and many open-source distributed system implementations today are influenced by GFS.
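As a small illustration of how fixed-size chunking works (this is not GFS's actual client code), a client can derive which chunk a byte offset falls into and then ask the master only for that chunk's locations:

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned above

def locate(offset: int) -> tuple:
    """Map a byte offset within a file to (chunk index, offset inside that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at byte 200,000,000 falls into chunk index 2, so the client asks the
# master for the replicas of chunk 2 and then reads from one of those chunkservers.
print(locate(200_000_000))  # (2, 65782272)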
While GFS was groundbreaking in its own right, it had limitations in terms of scalability. To address these issues, Google developed Colossus as an improved version of GFS. Colossus provides storage for various Google products and serves as the underlying storage platform for Google Cloud services, indirectly making its capabilities available to the public. With enhanced scalability and availability, Colossus is designed to handle the rapidly growing data demands of modern applications.
Tectonic Architecture
Tectonic is the largest distributed file system used at Meta (formerly Facebook). This project, originally called Warm Storage, began in 2014, but its complete architecture was not publicly released until 2021.
Prior to developing Tectonic, Meta primarily used HDFS, Haystack, and f4 for data storage:
HDFS was used in the data warehousing scenario (limited by the storage capacity of a single cluster, with dozens of clusters deployed).
Haystack and f4 were used for unstructured data storage scenarios.
Tectonic was designed to support these three storage scenarios in a single cluster.
Layer design in Tectonic
Innovations in Tectonic architecture design
Innovation #1: Layered metadata
Tectonic abstracts the metadata of the distributed file system into a simple key-value (KV) model. This allows for excellent horizontal scaling and load balancing, and effectively prevents hotspots in data access.
Tectonic introduces a hierarchical approach to metadata, setting it apart from traditional distributed file systems. The Metadata Store is divided into three layers, which correspond to the data structures in the underlying KV storage:
The Name layer, which stores the metadata related to the file name or directory structure, sharded by directory IDs
The File layer, which stores the file attributes, sharded by file IDs
The Block layer, which stores the metadata regarding the location of data blocks in the Chunk Store, sharded by block IDs
The figure below summarizes the key-value mapping of the three layers:
This layered design addresses the scalability and performance demands of Tectonic, especially in Meta’s scenarios, where handling exabyte-scale data is required.
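To make the layering more concrete, here is a minimal, illustrative sketch of how the three layers could map onto a key-value store; the key shapes, field names, and lookup path are assumptions for illustration, not Tectonic's actual encoding:

# Illustrative key/value shapes only, not Tectonic's real on-disk format.
name_layer = {            # sharded by directory ID
    ("dir", 1, "photos"): {"type": "dir", "id": 7},
    ("dir", 7, "cat.jpg"): {"type": "file", "id": 42},
}
file_layer = {            # sharded by file ID
    ("file", 42): {"blocks": [9001, 9002]},
}
block_layer = {           # sharded by block ID, pointing into the Chunk Store
    ("block", 9001): {"chunks": ["chunkserver-3/abc"]},
    ("block", 9002): {"chunks": ["chunkserver-1/ghi"]},
}

def resolve(path):
    """Walk Name -> File -> Block layers to find where a file's data lives."""
    parent, entry = 1, None          # 1 is the assumed root directory ID
    for part in path:
        entry = name_layer[("dir", parent, part)]
        parent = entry["id"]
    file_meta = file_layer[("file", entry["id"])]
    return [block_layer[("block", b)]["chunks"] for b in file_meta["blocks"]]

print(resolve(["photos", "cat.jpg"]))

Because each layer is sharded by a different ID, hot directories, hot files, and hot blocks land on different shards, which is what enables the horizontal scaling described above.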
Innovation #2: Separation of storage and compute for metadata
The three metadata layers are stateless and can be horizontally scaled based on workloads. They communicate with the Key-Value Store, a stateful storage in the Metadata Store, through the network.
The Key-Value Store is not solely developed by the Tectonic team; instead, they use ZippyDB, a distributed KV storage system within Meta. ZippyDB is built on RocksDB and the Paxos consensus algorithm. Tectonic relies on ZippyDB’s KV storage and its transactions to ensure the consistency and atomicity of the file system’s metadata.
Transactional functionality plays a vital role in implementing a large-scale distributed file system. It’s essential to horizontally scale the Metadata Store to meet the demands of such a system. However, horizontal scaling introduces the challenge of data sharding. Maintaining strong consistency is a critical requirement in file system design, especially when performing operations like renaming directories with multiple subdirectories. Ensuring efficiency and consistency throughout the renaming process is a significant and widely recognized challenge in distributed file system design.
To address this challenge, Tectonic uses ZippyDB’s transactional features. When handling metadata operations within a single shard, Tectonic guarantees both transactional behavior and strong consistency.
However, ZippyDB does not support cross-shard transactions. This limits Tectonic’s ability to ensure atomicity when it processes metadata requests that span multiple directories, such as moving files between directories.
Innovation #3: Erasure coding in the Chunk Store
As previously mentioned, GFS ensures data reliability and security through multiple replicas, but this approach comes with high storage costs. For example, storing just 1 TB of data typically requires three replicas, resulting in at least 3 TB of storage space. This cost increases significantly for large-scale systems like Meta, operating at the exabyte level.
To solve this problem, Meta implements erasure coding (EC) in the Chunk Store which achieves data reliability and security with reduced redundancy, typically around 1.2 to 1.5 times the original data size. This approach offers substantial cost savings compared to the traditional three-replica method. Tectonic’s EC design provides flexibility, allowing configuration on a per-chunk basis.
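As a rough illustration of where those numbers come from (the concrete coding parameters are assumptions, not Tectonic's actual configuration), a Reed-Solomon-style scheme with k data blocks and m parity blocks stores (k + m) / k times the raw data:

def storage_factor(k_data_blocks: int, m_parity_blocks: int) -> float:
    """Total bytes stored per byte of user data for an erasure-coded scheme."""
    return (k_data_blocks + m_parity_blocks) / k_data_blocks

raw_tb = 1.0
print(3.0 * raw_tb)                    # 3.0 TB with three full replicas
print(storage_factor(10, 4) * raw_tb)  # 1.4 TB with an RS(10, 4)-style code
print(storage_factor(10, 2) * raw_tb)  # 1.2 TB with an RS(10, 2)-style code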
While EC effectively ensures data reliability with minimal storage space, it does have some drawbacks. Specifically, reconstructing lost or corrupted data incurs high computational and I/O resource requirements.
According to the Tectonic research paper, the largest Tectonic cluster at Meta comprises approximately 4,000 storage nodes, with a total capacity of about 1,590 petabytes and roughly 10 billion files. This scale is substantial for a distributed file system and generally fulfills the requirements of the majority of current use cases.
JuiceFS Architecture
JuiceFS was born in 2017, a time when significant changes had occurred in the external landscape compared to the emergence of GFS and Tectonic:
Cloud computing had become mainstream, with enterprises transitioning into the “cloud era” through public, private, or hybrid clouds.
This shift presented new challenges for infrastructure architecture. Migrating traditional infrastructure designed for on-premises data center (IDC) environments to the cloud often brought about various issues. Maximizing the benefits of cloud computing became a crucial requirement for seamlessly integrating infrastructure into cloud environments.
Moreover, GFS and Tectonic were in-house systems serving specific company operations, operating at a large scale but with a narrow focus. In contrast, JuiceFS is designed to cater to a wide range of public-facing users and to meet diverse use case requirements. As a result, the architecture of JuiceFS differs significantly from the other two file systems.
Taking these changes and distinctions into account, let’s look at the JuiceFS architecture as shown in the figure below:
While JuiceFS shares a similar overall framework with the aforementioned systems, it distinguishes itself through various design aspects.
Data Storage
Unlike GFS and Tectonic, which rely on proprietary data storage, JuiceFS follows the trend of the cloud-native era by using object storage. As previously mentioned, Meta’s Tectonic cluster uses over 4,000 servers to handle exabyte-scale data. This inevitably leads to significant operational costs for managing such a large-scale storage cluster.
For regular users, object storage has several advantages:
Out-of-the-box usability
Elastic capacity
Simplified operations and maintenance
Support for erasure coding, resulting in lower storage costs compared to replication
However, object storage has limitations, including:
Object immutability
Poor metadata performance
Absence of strong consistency
Limited random read performance
To tackle these challenges, JuiceFS adopts the following strategies in its architectural design:
An independent metadata engine
A three-layer data architecture comprising chunks, slices, and blocks
Multi-level caching
Metadata Engine
JuiceFS supports various open-source databases as its underlying storage for metadata. This is similar to Tectonic, but JuiceFS goes a step further by supporting not only distributed KV stores but also Redis, relational databases, and other storage engines. This design has these advantages:
It allows users to choose the most suitable solution for their specific use cases, aligning with JuiceFS’s goal of being a versatile file system.
Open-source databases often offer fully managed services in public clouds, resulting in almost zero operational costs for users.
Tectonic achieves strong metadata consistency by using ZippyDB, a transactional KV store, but its transactionality is limited to metadata operations within a single shard. In contrast, JuiceFS has stricter requirements and demands global strong consistency across shards. Therefore, every database integrated as a metadata engine must support transactions. With a horizontally scalable metadata engine like TiKV, JuiceFS can now store more than 20 billion files in a single file system, meeting the storage needs of enterprises with massive amounts of data.
Client
The main differences between the JuiceFS client and the clients of the other two systems are as follows:
The GFS client speaks a non-standard protocol and does not support the POSIX standard. It only allows append-only writes, which limits its usability to specific scenarios.
The Tectonic client also lacks support for POSIX and only permits append-only writes, but it employs a rich client design that incorporates many functionalities on the client side for maximum flexibility.
The JuiceFS client supports multiple standard access methods, including POSIX, HDFS, S3, WebDAV, and Kubernetes CSI.
The JuiceFS client also offers caching acceleration capabilities, which are highly valuable for storage separation scenarios in cloud-native architectures.
Conclusion
Distributed file systems have transformed data storage, and three notable systems stand out in this domain: GFS, Tectonic, and JuiceFS.
GFS demonstrated the potential of cost-effective commodity computers in building reliable distributed storage systems. It paved the way for subsequent systems and played a significant role in shaping the field.
Tectonic introduced innovative design principles such as layered metadata and separation of storage and compute. These advancements addressed scalability and performance challenges, providing efficiency, load balancing, and strong consistency in metadata operations.
JuiceFS, designed for the cloud-native era, uses object storage and a versatile metadata engine to deliver scalable file storage solutions. With support for various open-source databases and standard access methods, JuiceFS caters to a wide range of use cases and seamlessly integrates with cloud environments.
Distributed file systems overcome traditional disk limitations, providing flexibility, reliability, and efficiency for managing large data volumes. As technology advances and data grows exponentially, their ongoing evolution reflects the industry's commitment to efficient data management. With diverse architectures and innovative features, distributed file systems continue to drive innovation across industries.
DuckDB is an open-source OLAP database designed for analytical data management. Similar to SQLite, it is an in-process database that can be embedded within your application.
In an in-process database, the engine resides within the application, enabling data transfer within the same memory address space. This eliminates the need to copy large amounts of data over sockets, resulting in improved performance.
DuckDB leverages vectorized query processing, which enables efficient operations within the CPU cache and minimizes function call overhead.
The use of Morsel-Driven parallelism in DuckDB allows efficient parallelization across any number of cores while taking the characteristics of multi-core processors into account.
Why did I embark on the journey of building a new database? It started with a statement by the well-known statistician and software developer Hadley Wickham:
If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more frustrating.
This sentiment was a blow and a challenge to database researchers like myself. What are the aspects that make databases slow and frustrating? The first culprit is the client-server model.
When conducting data analysis and moving large volumes of data into a database from an application, or extracting it from a database into an analysis environment like R or Python, the process can be painfully slow.
Comparing the database client protocols of various data management systems, I timed how long it took to transmit a fixed dataset between a client program and several database systems.
As a benchmark, I used the Netcat utility to send the same dataset over a network socket.
Figure 1: Comparing different clients; the dashed line is the wall clock time for netcat to transfer a CSV of the data
Compared to Netcat, transferring the same volume of data with MySQL took ten times longer, and with Hive and MongoDB, it took over an hour. The client-server model appears to be fraught with issues.
SQLite
My thoughts then turned to SQLite. With billions and billions of copies in the wild, SQLite is the most extensively used SQL system in the world. It is quite literally everywhere: you engage with dozens, if not hundreds, of instances every day without knowing it.
SQLite operates in-process, a different architectural approach integrating the database management system directly into a client application, avoiding the traditional client-server model. Data can be transferred within the same memory address space, eliminating the need to copy and serialize large amounts of data over sockets.
However, SQLite isn’t designed for large-scale data analysis and its primary purpose is to handle transactional workloads.
DuckDB
Several years ago, Mark Raasveldt and I began working on a new database, DuckDB. Written entirely in C++, DuckDB is a database management system that employs a vectorized execution engine. It is an in-process database engine and we often refer to it as the ‘SQLite for analytics’. Released under the highly permissive MIT license, the project operates under the stewardship of a foundation, rather than the typical venture capital model.
What does interacting with DuckDB look like?
import duckdb
duckdb.sql('LOAD httpfs')
duckdb.sql("SELECT * FROM 'https://github.com/duckdb/duckdb/blob/master/data/parquet-testing/userdata1.parquet'").df()
In these three lines, DuckDB is imported as a Python package, an extension is loaded to enable communication with HTTPS resources, and a Parquet file is read from a URL and converted to a Pandas DataFrame (DF).
DuckDB, as demonstrated in this example, inherently supports Parquet files, which we consider the new CSV. The LOAD httpfs call illustrates how DuckDB can be expanded with plugins.
There’s a lot of intricate work hidden in the conversion to DF, as it involves transferring a result set, potentially millions of lines. But as we are operating in the same address space, we can bypass serialization or socket transfer, making the process incredibly fast.
We’ve also developed a command-line client, complete with features like query autocompletion and SQL syntax highlighting. For example, I can initiate a DuckDB shell from my computer and read the same Parquet file:
If you consider the query:
SELECT * FROM userdata.parquet;
you realize that would not typically work in a traditional SQL system, as userdata.parquet is not a table, it is a file. The table doesn’t exist yet, but the Parquet file does. If a table with a specific name is not found, we search for other entities with that name, such as a Parquet file, directly executing queries on it.
In-Process Analytics
From an architectural standpoint, we have a new category of data management systems: in-process OLAP databases.
SQLite is an in-process system, but it is geared toward OLTP (Online Transaction Processing). In a traditional client-server architecture for OLTP, PostgreSQL is the most common option.
Figure 2: OLTP versus OLAP
On the OLAP side, there have been several client-server systems, with ClickHouse being the most recognized open-source option. However, before the emergence of DuckDB, there was no in-process OLAP option.
Technical Perspective of DuckDB
Let’s discuss the technical aspects of DuckDB, walking through the stages of processing the following query:
Figure 3: A simple select query on DuckDB
The example involves selecting a name and a sum from the join of two tables, customer and sale, which share a common column, cid. The goal is to compute the total revenue per customer, summing up revenue and tax for each transaction.
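The figure itself is not reproduced here, but based on this description the query is roughly the following; the table and column names and the sample rows are assumptions for illustration:

import duckdb

# Tiny stand-in tables so the sketch is runnable end to end.
duckdb.sql("CREATE TABLE customer (cid INTEGER, name TEXT)")
duckdb.sql("CREATE TABLE sale (cid INTEGER, revenue DOUBLE, tax DOUBLE)")
duckdb.sql("INSERT INTO customer VALUES (42, 'ASML'), (43, 'Acme')")
duckdb.sql("INSERT INTO sale VALUES (42, 1233, 422), (42, 500, 50), (43, 10, 1)")

# Total revenue (including tax) per customer.
duckdb.sql("""
    SELECT name, SUM(revenue + tax) AS total
    FROM customer
    JOIN sale USING (cid)
    GROUP BY cid, name
""").show()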
When we run this query, the system joins the two tables, aggregating customers based on the value in the cid column. Then, the system computes the revenue + tax projection, followed by a grouped aggregation by cid, where we compute the first name and the final sum.
DuckDB processes this query through standard phases: query planning, query optimization, and physical planning, and the query planning stage is further divided into so-called pipelines.
For example, this query has three pipelines, defined by their ability to run in a streaming fashion. Streaming ends when we encounter a breaking operator, that is, an operator that needs to retrieve its entire input before proceeding.
Figure 4: First pipeline
The first pipeline scans the customer table and constructs a hash table. The hash join is split into two phases, building the hash table on one side of the join, and probing, which happens on the other side. The building of the hash table requires seeing all data from the left-hand side of the join, meaning we must run through the entire customer table and feed all of it into the hash join build phase. Once this pipeline is completed, we move to the second pipeline.
Figure 5: Second pipeline
The second pipeline is larger and contains more streaming operators: it can scan the sales table, and look into the hash table we’ve built before to find join partners from the customer table. It then projects the revenue + tax column and runs the aggregate, a breaking operator. Finally, we run the group by build phase and complete the second pipeline.
Figure 6: Third pipeline
We can schedule the third and final pipeline that reads the results of the GROUP BY and outputs the result. This process is fairly standard and many database systems take a similar approach to query planning.
Row-at-a-time
To understand how DuckDB processes a query, let’s consider first the traditional Volcano-style iterator model that operates through a sequence of iterators: every operator exposes an iterator and has a set of iterators as its input.
The execution begins by trying to read from the top operator, in this case, the GROUP BY BUILD phase. However, it can’t read anything yet as no data has been ingested. This triggers a read request to its child operator, the projection, which reads from its child operator, the HASH JOIN PROBE. This cascades down until it finally reaches the sale table.
Figure 7: Volcano-style iterator model
The sale table generates a tuple, for example, 42, 1233, 422, representing the ID, revenue, and tax columns. This tuple then moves up to the HASH JOIN PROBE, which consults its built hash table. For instance, it knows that ID 42 corresponds to the company ASML and it generates a new row as the join result, which is ASML, 1233, 422.
This new row is then processed by the next operator, the projection, which sums up the last two columns, resulting in a new row: ASML, 1655. This row finally enters the GROUP BY BUILD phase.
This tuple-at-a-time, row-at-a-time approach is common to many database systems such as PostgreSQL, MySQL, Oracle, SQL Server, and SQLite. It’s particularly effective for transactional use cases, where single rows are the focus, but it has a major drawback in analytical processing: it generates significant overhead due to the constant switching between operators and iterators.
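As a minimal illustration of the model (not DuckDB's or any specific engine's actual code), here is a sketch in which each operator pulls one row at a time from its child, so every row pays for a chain of function calls:

class Scan:
    """Leaf operator: yields one stored row per next() call, then None."""
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)

class Projection:
    """Computes (name, revenue + tax) for one row at a time."""
    def __init__(self, child):
        self.child = child
    def next(self):
        row = self.child.next()
        if row is None:
            return None
        name, revenue, tax = row
        return (name, revenue + tax)

plan = Projection(Scan([("ASML", 1233, 422), ("Acme", 10, 1)]))
while (row := plan.next()) is not None:
    print(row)  # one round of operator calls per row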
One possible improvement suggested by the literature is to just-in-time (JIT) compile the entire pipeline. This option, though viable, isn’t the only one.
Vector-at-a-time
Let’s consider the operation of a simple streaming operator like the projection.
Figure 8: Implementation of a projection
We have an incoming row and some pseudocode: input.readRow reads a row of input, the first value remains unchanged, and the second entry in the output becomes the result of adding the second and third values of the input, with the output then written. While this approach is easy to implement, it incurs a significant performance cost due to function calls for every value read.
The vector-at-a-time model, in contrast, processes not just single values at a time, but short columns of values collectively referred to as vectors. Instead of examining a single value for each row, multiple values are examined for each column at once. This approach reduces overhead because type switching is performed once per vector of values instead of once per row.
Figure 9: The vector-at-a-time model
The vector-at-a-time model strikes a balance between columnar and row-wise executions. While columnar execution is more efficient, it can lead to memory issues. By limiting the size of columns to something manageable, the vector-at-a-time model avoids JIT compilation. It also promotes cache locality, which is critical for efficiency.
The graphic, provided by Google's Peter Norvig and Jeff Dean, highlights the disparity between an L1 cache reference (0.5 nanoseconds) and a main memory reference (100 nanoseconds), a factor of 200. Since 1990, L1 cache references have become roughly 200 times faster, while main memory references have become only about twice as fast, so there is a significant advantage in having operations fit within the CPU cache.
This is where the beauty of vectorized query processing lies.
Figure 11: Implementation of a projection with vectorized query processing
Let’s consider the same projection of revenue + tax example we discussed before. Instead of retrieving a single row, we get as input three vectors of values and output two vectors of values. We read a chunk (a collection of small vectors of columns) instead of a single row. As the first vector remains unchanged, it’s reassigned to the output. A new result vector is created, and an addition operation is performed on every individual value in the range from 0 to 2048.
This approach allows the compiler to insert special instructions automatically and avoids function call overhead by interpreting and switching around data types and operators only at the vector level. This is the core of vectorized processing.
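As a rough sketch of the same projection in a vector-at-a-time style (illustrative only, using NumPy arrays rather than DuckDB's internal vectors), the per-value work becomes a single vectorized operation instead of one call per row:

import numpy as np

VECTOR_SIZE = 2048  # assumed vector size, matching the 0..2048 range mentioned above

def project_chunk(name_vec, revenue_vec, tax_vec):
    """Vector-at-a-time projection: one dispatch handles a whole vector."""
    # The first column passes through unchanged; the addition is applied to
    # all values in the vector in one tight, cache-friendly loop.
    return name_vec, revenue_vec + tax_vec

names = np.array(["ASML", "Acme"] * (VECTOR_SIZE // 2))
revenue = np.random.rand(VECTOR_SIZE)
tax = np.random.rand(VECTOR_SIZE)
out_names, out_totals = project_chunk(names, revenue, tax)
print(out_totals[:3])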
Exchange-Parallelism
Vectorized processing being efficient on a single CPU is not enough; it also needs to perform well on multiple CPUs. How can we support parallelism?
In this example, three partitions are read simultaneously. Filters are applied and values are pre-aggregated, then hashed. Based on the values of the hash, the data is split up, further aggregated, re-aggregated, and then the output is combined. By doing this, most parts of the query are effectively parallelized.
For instance, you can observe this approach in Spark’s execution of a simple query. After scanning the files, a hash aggregate performs a partial_sum. Then, a separate operation partitions the data, followed by a re-aggregation that computes the total sum. However, this has been proven to be problematic in many instances.
Morsel-Driven Parallelism
A more modern model for achieving parallelism in SQL engines is Morsel-Driven parallelism. As in the approach above, the input level scans are divided, resulting in partial scans. In our second pipeline, we have two partial scans of the sale table, with the first one scanning the first half of the table and the second one scanning the latter half.
Figure 13: Morsel-Driven parallelism
The HASH JOIN PROBE remains the same, as both partial pipelines still probe the same hash table built in the first pipeline. The projection operation is independent, and all these results flow into the GROUP BY operator, which is our blocking operator. Notably, you don't see an exchange operator here.
Unlike the traditional exchange operator-based model, the GROUP BY is aware of the parallelization taking place and is equipped to efficiently manage the contention arising from different threads reading groups that could potentially collide.
Figure 14: Partitioning hash tables for parallelized merging
In Morsel-Driven parallelism, the process begins (Phase 1) with each thread pre-aggregating its values. The separate subsets, or morsels, of input data are built into separate hash tables.
The next phase (Phase 2) involves partition-wise aggregation: in the local hash tables, data is partitioned based on the radixes of the group keys, ensuring that each hash table cannot contain keys present in any other hash table. When all the data has been read and it’s time to finalize our hash table and aggregate, we can select the same partition from each participating thread and schedule more threads to read them all.
Though this process is more complex than a standard aggregate hash table, it allows the Morsel-Driven model to achieve great parallelism. This model efficiently constructs an aggregation over multiple inputs, circumventing the issues associated with the exchange operator.
Simple Benchmark
I conducted a simple benchmark, using our example query with some minor added complexity in the form of an ORDER BY and a LIMIT clause. The query selects the name and the sum of revenue + tax from the customer and sale tables, which are joined and grouped by the customer ID.
The experiment involved two tables: one with a million customers and another with a hundred million sales entries. This amounted to about 1.4 gigabytes of CSV data, which is not an unusually large dataset.
Figure 15: The simple benchmark
DuckDB completed the query on my laptop in just half a second. On the other hand, PostgreSQL, after I had optimized the configuration, took 11 seconds to finish the same task. With default settings, it took 21 seconds.
While DuckDB could process the query around 40 times faster than PostgreSQL, it’s important to note that this comparison is not entirely fair, as PostgreSQL is primarily designed for OLTP workloads.
Conclusions
The goal of this article is to explain the design, functionality, and rationale behind DuckDB, a data engine encapsulated in a compact package. DuckDB functions as a library linked directly to the application process, boasting a small footprint and no dependencies and allowing developers to easily integrate a SQL engine for analytics.
I highlighted the power of in-process databases, which lies in their ability to efficiently transfer result sets to clients and write data to the database.
An essential component of DuckDB’s design is vectorized query processing: this technique allows efficient in-cache operations and eliminates the burden of the function call overhead.
Lastly, I touched upon DuckDB's parallelism model: Morsel-Driven parallelism supports efficient parallelization across any number of cores while taking the characteristics of multi-core processors into account, contributing to DuckDB's overall performance and efficiency.
Italy has chosen its representative for the Eurovision Song Contest 2023, and it is Marco Mengoni, one of the country's most famous singers. Mengoni won the Sanremo Music Festival 2023 with the song "Siamo Qui," earning the right to represent Italy at the Eurovision Song Contest.
Mengoni, who already represented Italy at Eurovision in 2013, said he is very excited about this new challenge and proud to represent his country at such an important event. "Being chosen to represent Italy at Eurovision is an enormous honor and a dream come true for me. I will give my all to represent my country as well as possible, and I hope to bring home the victory," Mengoni said.
"Siamo Qui" was written by Mengoni together with a team of Italian composers and lyricists and has been received very positively by the Italian public. The song is a hymn to love and hope and has already won over many fans across Italy.
The Eurovision Song Contest 2023 will be held in Dublin, Ireland, and will feature numerous artists from around the world. The competition will take place over two semi-finals, followed by the final on May 13, 2023.
Mengoni is preparing for the big event with great commitment and dedication, working with a team of professionals to create an unforgettable performance that can capture the attention of audiences around the world. "I am working hard to create a performance worthy of such an important event. I want my country to be proud of me and my music," Mengoni said.
Italy has a long and glorious history at the Eurovision Song Contest, having won the competition twice, in 1964 and 1990. With Mengoni as its representative, Italy aims to claim a third title and continue to leave its mark on the European music scene.
In conclusion, Italy will be represented by Marco Mengoni at the Eurovision Song Contest 2023 with "Siamo Qui," a song that has already won over the Italian public and stands as a hymn to love and hope. Mengoni is preparing with great commitment and dedication to create an unforgettable performance that can capture the attention of audiences worldwide and bring the victory home to Italy, continuing the country's long and glorious Eurovision history.
The year is 2023, and the world is increasingly dependent on technology. The IT sector is growing at a dizzying pace, creating unprecedented opportunities for investors. If you are looking to earn money in the financial market, the IT industry is one of the most interesting areas to consider.
Interviewer: Good morning everyone, I am here with Gerry Scotty, the well-known television host and entrepreneur. Today we will talk about his investment in the Italian IT industry. Gerry, can you tell us what it is about?
Gerry Scotty: Of course, I am very proud of my investment in the Italian IT industry. It is a company that produces software for the healthcare sector. Their technology is innovative and makes it possible to manage patient health data more efficiently.
Interviewer: And how did you decide to invest in this company?
Gerry Scotty: I have always had a great passion for technology, and I believe it is one of the most important sectors for the development of our country. When I discovered this company and saw the potential of their technology, I decided to invest.
Interviewer: And how much have you earned from this investment?
Gerry Scotty: I have to say the investment has been a great success. Thanks to my collaboration with the company, we managed to grow their business exponentially. I have been very fortunate and have earned a significant sum.
Interviewer: And has the Italian company benefited from the investment?
Gerry Scotty: Absolutely. Thanks to our work together, the company has been able to expand and hire new employees. We have also invested in research and development of new technologies to improve their product. I am very happy to have contributed to the success of an Italian company.
Interviewer: What do you think about the future of the IT industry in Italy?
Gerry Scotty: I am very optimistic about the future of the Italian IT industry. There are many young talents and innovative startups emerging. I believe that with the right support and investment, the Italian IT industry can become a world leader.
Interviewer: Thank you, Gerry, for your words and for telling us about your investment in the Italian IT industry.
Gerry Scotty: Thank you for the opportunity to share my experience. I hope this can inspire others to invest in the Italian IT industry.
In short, investing in the Italian IT industry can lead to significant gains. As Gerry Scotty's example shows, success can be achieved by investing in innovative companies and cutting-edge technologies. With a constantly growing market, the IT sector offers a wide range of investment opportunities. If you are looking to increase your earnings, do not underestimate the potential of the IT industry and its ability to generate long-term profits.
Postgres allows you to emit messages into its write-ahead log (WAL) without updating any actual tables
Logical decoding messages can be read using change data capture tools like Debezium
Stream processing tools like Apache Flink can be used to process (e.g., enrich, transform, and route) logical decoding messages
There are several use cases for logical decoding messages, including providing audit metadata, application logging, and microservices data exchange
There is no fixed schema for logical decoding messages; it’s on the application developer to define, communicate, and evolve such schema
Did you know there’s a function in Postgres that lets you write data which you can’t query? A function that lets you persist data in all kinds and shapes but which will never show up in any table? Let me tell you about pg_logical_emit_message()! It’s a Postgres function that allows you to write messages to the write-ahead log (WAL) of the database.
You can then use logical decoding—Postgres’ change data capture capability—to retrieve those messages from the WAL, process them, and relay them to external consumers.
In this article, we’ll explore how to take advantage of this feature for implementing three different use cases:
Propagating data between microservices via the outbox pattern
Application logging
Enriching audit logs with metadata
For retrieving logical decoding messages from Postgres we are going to use Debezium, a popular open-source platform for log-based change data capture (CDC), which can stream data changes from a large variety of databases into data streaming platforms like Apache Kafka or AWS Kinesis.
We’ll also use Apache Flink and the Flink CDC project, which seamlessly integrates Debezium into the Flink ecosystem, for enriching and routing raw change event streams. You can learn more about the foundations of change data capture and Debezium in this talk from QCon San Francisco.
Logical Decoding Messages 101
Before diving into specific use cases, let’s take a look at how logical decoding messages can be emitted and consumed. To follow along, make sure to have Docker installed on your machine. Start by checking out this example project from GitHub:
git clone https://github.com/decodableco/examples.git
cd examples/postgres-logical-decoding
The project contains a Docker Compose file for running a Postgres database, which is enabled for logical replication already. Start it like so:
docker compose up
Then, in another terminal window, connect to that Postgres instance using the pgcli command line client:
Next, you need to create a replication slot. A replication slot represents one specific stream of changes coming from a Postgres database and keeps track of how far a consumer has processed this stream. For this purpose, it stores the latest log sequence number (LSN) that the slot’s consumer has processed and acknowledged.
Each slot has a name and an assigned decoding plug-in which defines the format of that stream. Create a slot using the “test_decoding” plug-in, which emits changes in a simple text-based protocol, like this:
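The statement that creates the slot is elided above; roughly, it looks like the following. This sketch runs it from Python with psycopg2 instead of the pgcli session, purely for illustration, and the connection details are assumptions to adjust to your environment. The slot name demo_slot matches the queries shown later.

import psycopg2

# Connection details are assumptions; adjust them to your environment.
conn = psycopg2.connect("postgresql://postgresuser:postgrespw@localhost:5432/demodb")
conn.autocommit = True
with conn.cursor() as cur:
    # Create a logical replication slot named demo_slot that uses the
    # test_decoding output plug-in described above.
    cur.execute("SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding')")
    print(cur.fetchall())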
For production scenarios it is recommended to use the pgoutput plug-in, which emits change events using an efficient Postgres-specific binary format and is available by default in Postgres since version 10. Other commonly used options include the Decoderbufs plug-in (based on the Google Protocol Buffers format) and wal2json (emitting change events as JSON).
Changes are typically retrieved from remote clients such as Debezium by establishing a replication stream with the database. Alternatively, you can use the function pg_logical_slot_get_changes(), which lets you fetch changes from a given replication slot via SQL, optionally reading only up to a specific LSN (the first NULL parameter) or only a specific number of changes (the second NULL parameter). This comes in handy for testing purposes:
postgresuser@postgres:demodb> SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
+-------+-------+--------+
| lsn | xid | data |
|-------+-------+--------|
+-------+-------+--------+
No changes should be returned at this point. Let's insert a logical decoding message using the pg_logical_emit_message() function, which takes three parameters:
transactional: a boolean flag indicating whether the message should be transactional or not; when issued while a transaction is pending and that transaction gets rolled back eventually, a transactional message would not be emitted, whereas a non-transactional message would be written to the WAL nevertheless
prefix: a textual identifier for categorizing messages; for instance, this could indicate the type of a specific message
content: the actual payload of the message, either as text or binary data; you have full flexibility of what to emit here, e.g., in regard to format, schema, and semantics
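Putting these parameters together, the call that emits the "Hello World!" message mentioned below could look roughly like this; the prefix "msg" is an arbitrary choice for illustration, and the psycopg2 wrapper and connection details are assumptions rather than part of the original example:

import psycopg2

conn = psycopg2.connect("postgresql://postgresuser:postgrespw@localhost:5432/demodb")  # assumed connection details
conn.autocommit = True
with conn.cursor() as cur:
    # transactional=true, an illustrative prefix, and the text payload
    cur.execute(
        "SELECT * FROM pg_logical_emit_message(true, %s, %s)",
        ("msg", "Hello World!"),
    )
    print(cur.fetchone())  # the LSN at which the message was written to the WAL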
When you retrieve changes from the slot again after having emitted a message, you now should see three change events: a BEGIN and a COMMIT event for the implicitly created transaction when emitting the event, and the “Hello World!” message itself. Note that this message doesn’t appear in any Postgres table or view as would be the case when adding data using the INSERT statement; this message is solely present in the database’s transaction log.
There are a few other useful functions dealing with logical decoding messages and replication slots, including the following:
pg_logical_slot_get_binary_changes(): retrieves binary messages from a slot
pg_logical_slot_peek_changes(): allows you to look at changes from a slot without advancing it
pg_replication_slot_advance(): advances a replication slot
pg_drop_replication_slot(): deletes a replication slot
You also can query the pg_replication_slots view for examining the current status of your replication slots, latest confirmed LSN, and more.
Use Cases
Having discussed the foundations of logical decoding messages, let’s now explore a few use cases of this useful Postgres API.
The Outbox Pattern
For microservices, it’s a common requirement that, when processing a request, a service needs to update its own database and simultaneously send a message to other services. As an example, consider a “fulfillment” service in an e-commerce scenario: when the status of a shipment changes from READY_TO_SHIP to SHIPPED, the shipment’s record in the fulfillment service database needs to be updated accordingly, but also a message should be sent to the “customer” service so that it can update the customer’s account history and trigger an email notification for the customer.
Now, when using data streaming platforms like Apache Kafka for connecting your services, you can’t reliably implement this scenario by just letting the fulfillment service issue its local database transaction and then send a message via Kafka. The reason is that it is not supported to have shared transactions for a database and Kafka (in technical terms, Kafka can’t participate in distributed transaction protocols like XA). While everything looks fine on the surface, you can end up with an inconsistent state in case of failures. The database transaction could get committed, but sending out the notification via Kafka fails. Or, the other way around: the customer service gets notified, but the local database transaction gets rolled back.
While you can find this kind of implementation in many applications, always remember: “Friends don’t let friends do dual writes”! A solution to this problem is the outbox pattern: instead of trying to update two resources at once (a database and Kafka), you only update a single one—the service’s database. When updating the shipment state in the database, you also write the message to be sent to an outbox table; this happens as part of one shared transaction, i.e., applying the atomicity guarantees you get from ACID transactions. Either the shipment state update and the outbox message get persisted, or none of them do. You then use change data capture to retrieve any inserts from the outbox in the database and propagate them to consumers.
More information about the outbox pattern can be found in this blog post on the Debezium blog. Another resource is this article on InfoQ which discusses how the outbox pattern can be used as the foundation for implementing Sagas between multiple services. In the following, I’d like to dive into one particular implementation approach for the pattern. Instead of inserting outbox events in a dedicated outbox table, the idea is to emit them just as logical decoding messages to the WAL.
There are pros and cons to either approach. What makes the route via logical decoding messages compelling is that it avoids any housekeeping needs. Unlike with an outbox table, there’s no need to remove messages after they have been consumed from the transaction log. Also, this emphasizes the nature of an outbox being an append-only medium: messages must never be modified after being added to the outbox, which might happen by accident with a table-based approach.
Regarding the content of outbox messages, you have full flexibility there in general. Sticking to the e-commerce domain from above, it could, for instance, describe a shipment serialized as JSON, Apache Avro, Google Protocol Buffers, or any other format you choose. What’s important to keep in mind is that while the message content doesn’t adhere to any specific table schema from a database perspective, it’s subject to an (ideally explicit) contract between the sending application and any message consumers. In particular, the schema of any emitted events should only be modified if you keep in mind the impact on consumers and backward compatibility.
One commonly used approach is to look at the design of outbox events and their schemas from a domain-driven design perspective. Specifically, Debezium recommends that your messages have the following attributes:
id: a unique message id, e.g., a UUID, which consumers can use for deduplication purposes
aggregate type: describes the kind of aggregate an event is about, e.g., “customer,” “shipment,” or “purchase order”; when propagating outbox events via Kafka or other streaming platforms, this can be used for sending events of one aggregate type to a specific topic
aggregate id: the id of the aggregate an event is about, e.g., a customer or order id; this can be used as the record key in Kafka, thus ensuring all events pertaining to one aggregate will go to the same topic partition and making sure consumers receive these events in the correct order
payload: the actual message payload; unlike “raw” table-level CDC events, this can be a rich structure, representing an entire aggregate and all its parts, which in the database itself may spread across multiple tables
Figure 1: Routing outbox events from the transaction log to different Kafka topics
Enough of the theory. Let's see what a database transaction that emits a logical decoding message with an outbox event could look like. In the accompanying GitHub repository, you can find a Docker Compose file for spinning up all the required components and detailed instructions for running the complete example yourself. Emit an outbox message like this:
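The exact statement is not reproduced here, but based on the description below and the serializer that follows, it would look roughly like this sketch. The aggregate type "shipment" and aggregate id 42 match the example output discussed later, while the inner payload fields and the psycopg2 wrapper are illustrative assumptions:

import json
import uuid
import psycopg2

conn = psycopg2.connect("postgresql://postgresuser:postgrespw@localhost:5432/demodb")  # assumed connection details
outbox_event = {
    "id": str(uuid.uuid4()),            # unique message id for deduplication
    "aggregate_type": "shipment",       # used later to pick the Kafka topic
    "aggregate_id": 42,                 # used later as the Kafka message key
    "payload": {"customer_id": 7, "status": "SHIPPED"},  # illustrative aggregate state
}
with conn:  # the surrounding transaction commits (or rolls back) the message
    with conn.cursor() as cur:
        cur.execute(
            "SELECT * FROM pg_logical_emit_message(true, 'outbox', %s)",
            (json.dumps(outbox_event),),
        )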
This creates a transactional message (i.e., it would not be emitted if the transaction aborts, e.g., because of a constraint violation of another record inserted in the same transaction). It uses the "outbox" prefix (allowing consumers to distinguish it from messages of other types) and contains a JSON message as the actual payload.
Regarding retrieving change events and propagating them to Kafka, the details depend on how exactly Debezium, as the underlying CDC tool, is deployed. When used with Kafka Connect, Debezium provides a single message transform (SMT) that supports outbox tables and, for instance, routes outbox events to different topics in Kafka based on a configurable column containing the aggregate type. However, this SMT doesn’t yet support using logical decoding messages as the outbox format.
When using Debezium via Flink CDC, you could implement a similar logic using a custom KafkaRecordSerializationSchema which routes outbox events to the right Kafka topic and propagates the aggregate id to the Kafka message key, thus ensuring correct ordering semantics. A basic implementation of this could look like this (you can find the complete source code, including the usage of this serializer in a Flink job here):
public class OutboxSerializer implements KafkaRecordSerializationSchema<ChangeEvent> {
private static final long serialVersionUID = 1L;
private ObjectMapper mapper;
@Override
public ProducerRecord<byte[], byte[]> serialize(ChangeEvent element,
KafkaSinkContext context, Long timestamp) {
try {
JsonNode content = element.getMessage().getContent();
ProducerRecord<byte[], byte[]> record =
new ProducerRecord<>(
content.get("aggregate_type").asText(),
content.get("aggregate_id").asText().getBytes(Charsets.UTF_8),
mapper.writeValueAsBytes(content.get("payload"))
);
record.headers().add("message_id",
content.get("id").asText().getBytes(Charsets.UTF_8));
return record;
}
catch (JsonProcessingException e) {
throw new IllegalArgumentException(
"Couldn't serialize outbox message", e);
}
}
@Override
public void open(InitializationContext context,
KafkaSinkContext sinkContext) throws Exception {
mapper = new ObjectMapper();
SimpleModule module = new SimpleModule();
module.addDeserializer(Message.class, new MessageDeserializer());
mapper.registerModule(module);
}
}
With that Flink job in place, you’ll be able to examine the outbox message on the “shipment” Kafka topic like so:
The topic name corresponds to the specified aggregate type, i.e., if you were to issue outbox events for other aggregate types, they’d be routed to different topics accordingly. The message key is 42, matching the aggregate id. The unique event id is propagated as a Kafka message header, enabling consumers to implement efficient deduplication by keeping track of the ids they’ve already received and processed and ignoring any potential duplicates they may encounter. Lastly, the payload of the outbox event is propagated as the Kafka message value.
In particular, in larger organizations with a diverse set of event producers and consumers, it makes sense to align on a shared event envelope format, which standardizes common attributes like event timestamp, origin, partitioning key, schema URLs, and others. The CloudEvents specification comes in handy here, especially for defining event types and their schemas. It is an option worth considering to have your applications emit outbox events adhering to the CloudEvents standard.
Logging
While log management of modern applications typically happens through dedicated platforms like Datadog or Splunk, which ingest changes from dedicated APIs or logs in the file system, it sometimes can be convenient to persist log messages in the database of an application. Log libraries such as the widely used log4j 2 provide database-backed appenders for this purpose. These will typically require a second connection for the logger, though, because in case of a rollback of an application transaction itself, you still (and in particular then) want to write out any log messages, helping you with failure analysis.
Non-transactional logical decoding messages can be a nice means of using a single connection and still ensuring that log messages persist, also when a transaction is rolled back. For example, let’s consider the following situation with two transactions, one of which is committed and one rolled back:
Figure 2: Using non-transactional logical decoding messages for logging purposes
To follow along, run the following sequence of statements in the pgcli shell:
-- Assuming this table: CREATE TABLE data (id INTEGER, value TEXT);
BEGIN;
INSERT INTO data(id, value) VALUES('1', 'foo');
SELECT * FROM pg_logical_emit_message(false, 'log', 'OK');
INSERT INTO data(id, value) VALUES('2', 'bar');
COMMIT;
BEGIN;
INSERT INTO data(id, value) VALUES('3', 'baz');
SELECT * FROM pg_logical_emit_message(false, 'log', 'ERROR');
INSERT INTO data(id, value) VALUES('4', 'qux');
ROLLBACK;
The first transaction inserts two records into a new table, "data", and also emits a logical decoding message. The second transaction applies similar changes but then is rolled back. When retrieving the change events from the replication slot (using the "test_decoding" plug-in as shown above), the following events will be returned:
As expected, there are two INSERT events and the log message for the first transaction. However, there are no change events for the aborted transaction for the INSERT statements, as it was rolled back. But as the logical decoding message was non-transactional, it still was written to the WAL and can be retrieved. I.e., you actually can have that cake and eat it too!
Audit Logs
In enterprise applications, keeping an audit log of your data is a common requirement, i.e., a complete trail of all the changes done to a database record, such as a purchase order or a customer.
There are multiple possible approaches for building such an audit log; one of them is to copy earlier record versions into a separate history table whenever a data change is made. Arguably, this increases application complexity. Depending on the specific implementation strategy, you might have to deploy triggers for all the tables that should be audited or add libraries such as Hibernate Envers, an extension to the popular Hibernate object-relational mapping tool. In addition, there’s a performance impact, as the audit records are inserted as part of the application’s transactions, thus increasing write latency.
Change data capture is an interesting alternative for building audit logs: extracting data changes from the database transaction log requires no changes to writing applications. A change event stream, with events for all the inserts, updates, and deletes executed for a table—e.g., persisted as a topic in Apache Kafka, whose records are immutable by definition—could be considered a simple form of an audit log. As the CDC process runs asynchronously, there’s no latency impact on writing transactions.
One shortcoming of this approach—at least in its most basic form—is that it doesn’t capture contextual metadata, like the application user making a given change, client information like device configuration or IP address, use case identifiers, etc. Typically, this data is not stored in the business tables of an application and thus isn’t exposed in raw change data events.
The combination of logical decoding messages and stream processing, with Apache Flink, can provide a solution here. At the beginning of each transaction, the source application writes all the required metadata into a message; in comparison to writing a full history entry for each modified record, this just adds a small overhead on the write path. You can then use a simple Flink job for enriching all the subsequent change events from that same transaction with that metadata. As all change events emitted by Debezium contain the id of the transaction they originate from, including logical decoding messages, correlating the events of one transaction isn’t complicated. The following image shows the general idea:
Figure 3: Enriching data change events with transaction-scoped audit metadata
When it comes to implementing this logic with Apache Flink, you can do this using a rather simple mapping function, specifically by implementing the RichFlatMapFunction interface, which allows you to combine the enrichment functionality and the removal of the original logical decoding messages in a single operator call:
public void flatMap(String value, Collector<String> out)
throws Exception {
ChangeEvent changeEvent = mapper.readValue(value, ChangeEvent.class);
String op = changeEvent.getOp();
String txId = changeEvent.getSource().get("txId").asText();
// logical decoding message
if (op.equals("m")) {
Message message = changeEvent.getMessage();
// an audit metadata message -> remember it
if (message.getPrefix().equals("audit")) {
localAuditState = new AuditState(txId, message.getContent());
return;
}
else {
out.collect(value);
}
}
// a data change event -> enrich it with the metadata
else {
if (txId != null && localAuditState != null) {
if (txId.equals(localAuditState.getTxId())) {
changeEvent.setAuditData(localAuditState.getState());
}
else {
localAuditState = null;
}
}
changeEvent.setTransaction(null);
out.collect(mapper.writeValueAsString(changeEvent));
}
}
The logic is as follows:
When the incoming event is of type “m” (i.e., a logical decoding message) and it is an audit metadata event, put the content of the event into a Flink value state
When the incoming event is of any other type, and we have stored audit state for the event’s transaction before, enrich the event with that state
When the transaction id of the incoming event doesn’t match what’s stored in the audit state (e.g., when a transaction was issued without a metadata event at the beginning), clear the state store and propagate the event as is
You can find a simple yet complete Flink job that runs that mapping function against the Flink CDC connector for Postgres in the aforementioned GitHub repository. See the instructions in the README for running that job, triggering some data changes, and observing the enriched change events. As an example, let’s consider the following transaction which first emits a logical decoding message with the transaction metadata (user name and client IP address) and then two INSERT statements:
BEGIN;
SELECT * FROM pg_logical_emit_message(true, 'audit', '{ "user" : "[email protected]", "client" : "10.0.0.1" }');
INSERT INTO inventory.customer(first_name, last_name, email) VALUES ('Bob', 'Green', '[email protected]');
INSERT INTO inventory.address
(customer_id, type, line_1, line_2, zip_code, city, country)
VALUES
(currval('inventory.customer_id_seq'), 'Home', '12 Main St.', 'sdf', '90210', 'Los Angeles', 'US');
COMMIT;
The enriched change events, as emitted by Apache Flink, would look like so:
Within the same Flink job, you now could add a sink connector and for instance write the enriched events into a Kafka topic. Alternatively, depending on your business requirements, it can be a good idea to propagate the change events into a queryable store, for instance, an OLAP store like Apache Pinot or Clickhouse. You could use the same approach for enriching change events with contextual metadata for other purposes too, generally speaking for capturing all kinds of “intent” which isn’t directly persisted in the business tables of your application.
Bonus: Advancing Replication Slots
Finally, let’s discuss a technical use case for logical decoding messages: advancing Postgres replication slots. This can come in handy in certain scenarios, where otherwise large segments of the WAL could be retained by the database, eventually causing the database machine to run out of disk space.
This is because replication slots are always created in the context of a specific database, whereas the WAL is shared between all the databases on the same Postgres host. This means a replication slot set up for a database without any data changes and which, therefore, can’t advance, will retain potentially large chunks of WAL if changes are made to another database on the same host.
To experience this situation, stop the currently running Docker Compose set-up and launch this alternative Compose file from the example project:
docker compose -f docker-compose-multi-db.yml up
This spins up a Postgres database container with two databases, DB1 and DB2. Then launch the AdvanceSlotMain class. You can do so via Maven (note this is just for demonstration and development purposes; usually, you’d package up your Flink job as a JAR and deploy it to a running Flink cluster):
mvn exec:exec@advanceslot
It runs a simple Flink pipeline that retrieves all changes from the DB2 database and prints them out on the console. Now, do some changes on the DB1 database:
docker run --tty --rm -i \
  --network logical-decoding-network \
  quay.io/debezium/tooling:1.2 \
  bash -c 'pgcli postgresql://postgresuser:postgrespw@order-db:5432/db1'
postgresuser@order-db:db1> CREATE TABLE data (id INTEGER, value TEXT);
postgresuser@order-db:db1> INSERT INTO data SELECT generate_series(1,1000) AS id, md5(random()::text) AS value;
Query the status of the replication slot (“flink”, set up for database “DB2”), and as you keep running more inserts in DB1, you’ll see that the retained WAL of that slot continuously grows, as long as there are no changes done over in DB2:
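One way to check this (a sketch based on the standard pg_replication_slots view; exact column names can vary slightly between Postgres versions) is to compare the slot's restart LSN with the current WAL position:
SELECT
  slot_name,
  database,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;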
The problem is that as long as there are no changes in the DB2 database, the CDC connector of the running Flink job will never be invoked and thus never have a chance to acknowledge the latest processed LSN of its replication slot. Now, let’s use pg_logical_emit_message() to fix this situation. Get another Postgres shell, this time for DB2, and emit a message like so:
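For example (a sketch; the prefix and payload are arbitrary for this purpose, and a non-transactional message suffices, since the goal is merely to produce a WAL record for the connector to process):
SELECT pg_logical_emit_message(false, 'heartbeat', now()::varchar);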
In the console output of AdvanceSlotMain you should see the change event emitted by the Debezium connector for that message. With the next checkpoint issued by Flink (look for “Completed checkpoint XYZ for job …” messages in the log), the LSN of that event will also be flushed to the database, essentially allowing the database to discard any WAL segments before that. If you now examine the replication slot again, you should find that the “retained WAL” value is much lower than before (as this process is asynchronous, it may take a bit until the disk space is freed up).
Wrapping Up
Logical decoding messages are not widely known, yet they are a very powerful tool which should be in the toolbox of every software engineer working with Postgres. As you've seen, the ability to emit messages into the write-ahead log without them ever surfacing in any actual table allows for a number of interesting use cases, such as reliable data exchange between microservices (thus avoiding unsafe dual writes), application logging, or providing metadata for building audit logs. Using stateful stream processing with Apache Flink, you can enrich and route your captured messages as well as apply other operations to your data change events, such as filtering, joining, windowed aggregations, and more.
Where there is great power, there is also great responsibility. As logical decoding messages don't have an explicit schema, unlike your database tables, the application developer must define sensible contracts and carefully evolve them, always keeping backward compatibility in mind. The CloudEvents format can be a useful foundation for your custom message schemas, providing all the producers and consumers in an organization with a consistent message structure and well-defined semantics.
If you’d like to get started with your explorations around logical decoding messages, look at the GitHub repo accompanying this article, which contains the source code of all the examples shown above and detailed instructions for running them.
Defining and understanding the metaverse concept begins with exploring envisioned characteristics such as its immersiveness, interconnectedness and ability to deliver an endless set of experiences to consumers.
Making the metaverse a reality comes along with a number of engineering risks and quality concerns ranging from data privacy and security, to personal safety and virtual harassment.
Applying test-driven design principles to the development of the metaverse will allow teams to identify risks early and ensure that the metaverse is testable.
Achieving acceptable levels of test coverage in the metaverse may only be possible with advanced test automation capabilities powered by AI and machine learning.
Testers bring invaluable skills such as user empathy, creativity, curiosity, collaboration and communication to metaverse development and will likely play a key role in enabling its success.
Although the idea of the metaverse began as fiction, it is likely that it will soon become a reality. The metaverse will bring together a variety of modern computing technologies to realize a grand vision of an Internet experience with deeper social connections and experiences. However, the specification, design, development, validation, and overall delivery of the metaverse presents grand engineering challenges. In this article I’ll describe the metaverse concept, discuss its key engineering challenges and quality concerns, and then walk through recent technological advances in AI and software testing that are helping to mitigate these challenges. To wrap up, I share some of my thoughts on the role of software testers as we move towards a future of testing in the metaverse.
The Metaverse
With all the hype and chatter around the metaverse, it’s becoming increasingly difficult to describe exactly what the metaverse is and what it looks like. To be clear, the metaverse doesn’t actually exist yet and so a good way to describe it is as a hypothetical iteration of the Internet as a single, universal, simulated world, facilitated by a variety of modern computing technologies. In three words, the metaverse will be immersive, interconnected, and endless. Let’s explore these three characteristics a bit more.
Immersion
The metaverse will draw people into a plethora of experiences using virtual and real environments, or some combination of the two. It is projected that new and different levels of immersion will be achieved through the use of virtual, augmented, and merged or mixed reality technologies, collectively referred to as extended reality (XR). User experience designer Tijane Tall describes the key differences in immersiveness among these experiences as follows:
Virtual Reality (VR): the perception of being physically present in a non-physical world. VR uses a completely digital environment, which is fully immersive by enclosing the user in a synthetic experience with little to no sense of the real world.
Augmented Reality (AR): having digital information overlaid on top of the physical world. Unlike VR, AR keeps the real world central to the user experience and enhances it with virtual information.
Merged or Mixed Reality (MR): intertwining virtual and real environments. MR might sound similar to AR, but it goes beyond the simple overlay of information and instead enables interactions between physical and virtual objects.
Technologies showcased at CES last year also promise to enable new levels of immersion. For example, a company called OVR Technology showed a VR headset with a container for eight aromas that can be mixed together to create various scents. The headset, which will bring smell to virtual experiences, is scheduled to be released later this year.
Interconnection
Virtual worlds that are coined as "metaverses" today are mostly, if not all, separate and disjointed. For example, there are little to no integrations between the popular gaming metaverses Roblox and Fortnite. Now, what if the opposite were true? Imagine for a moment that there were deep integrations between these two experiences, so deep that one could walk their avatar from Roblox into Fortnite and vice versa, and upon doing so their experience would transition seamlessly. In the metaverse, even if there are distinct virtual spaces, this type of seamless transition from one space, place, or world to another will exist. Things like avatar customizations and preferences will be retained if desired. This is not to say that everything in the world should look exactly the same. Instead, it would be a matter of visual equivalence rather than strict equality. As a result, my sports T-shirt in the Fortnite space may look different from the one in Roblox, but the color and branding make it apparent that this is my avatar. Integrations among various technologies such as blockchain, security, cryptocurrency, non-fungible tokens (NFTs), and more will be necessary to establish a fully interconnected metaverse.
Endlessness
Possibilities for realizing a variety of different experiences in the metaverse with various users will be endless. We can already see this happening on several platforms: modern video games and VR/AR experiences are proving that almost anything you can imagine can become an immersive experience. Of course, this is a bit of an exaggeration; in reality there are limits, both in the physical world and in every technology. Still, there is no doubt that the vast range of experiences likely to be available, together with the immersion and interconnectedness, is what makes the idea of the metaverse so appealing.
Metaverse Engineering Risks and Quality Concerns
As interest and investment in the metaverse grows, many are raising concerns about the potential risks in an environment where the boundaries between the physical and virtual world are blurred. Some of the key engineering risks and quality concerns surrounding the development of the metaverse are:
Identity and Reputation: Ensuring that an avatar in the metaverse is who they say they are, and protecting users from avatar impersonation and other activities that can harm their reputation.
Ownership and Property: Granting and verifying the creation, purchase, and ownership rights to digital assets such as virtual properties, artistic works, and more.
Theft and Fraud: Stealing, scamming, and other types of crimes for financial gain as payment systems, banking and other forms of commerce migrate to the metaverse.
Privacy and Data Abuse: Malicious actors making their presence undetectable in the metaverse and invisibly joining meetings or eavesdropping on conversations. There is also a significant risk of data abuse and a need for protections against misinformation.
Harassment and Personal Safety: Protecting users from various forms of harassment while in the metaverse, especially when using XR technologies. With the advent of these kinds of experiences, harassment and personal safety are no longer purely physical concerns but virtual ones that must be guarded against as well.
Legislation and Jurisdiction: Identifying any boundaries and rules of the virtual spaces that are accessible to anyone across the world, and making sure they are safe and secure for everyone. Governance of the metaverse brings together several of the aforementioned risks.
User Experience: If the metaverse is to become a space where people can connect, form meaningful relationships and be immersed in novel digital experiences, then the visual, audio, performance, accessibility and other user experience related concerns must be addressed.
Mitigating Metaverse Risks with Continuous Testing
Software testing is all about assessing and mitigating risks, and preventing them from becoming real and causing project delays and damage. I always encourage engineering teams to take a holistic view of software testing and treat it as an integral part of the development process. This is the idea that testing is continuous and therefore begins as early as product inception and persists even after the system has been deployed to production.
Test-Driven Metaverse Design
A research colleague of mine once described the idea of using testing as the headlights of a software project during its early stages. The analogy he gave was one of a car driving down a dangerous, winding road at night, with the only visible lights on the road being those projected from the car's headlights. The moving car is the software project, the edges of the road represent risks, and the headlights are testing-related activities. As the project moves forward, testing sheds light on the project risks and allows engineering teams to make informed decisions through risk identification, quantification, estimation, and ultimately mitigation. Similarly, as we start to design and develop the metaverse, teams can leverage test-driven design techniques for risk mitigation. These may include:
Acceptance Test-Driven Design (ATDD): using customer, development, and testing perspectives to collaborate and write acceptance tests prior to building the associated functionality. Such tests act as a form of requirements to describe how the system will work.
Design for Testability (DFT): developing the system with a number of testing and debugging features to facilitate the execution of tests pre- and post-deployment. In other words, testing is treated as a design concern to make the resulting system more observable and controllable.
Metaverse Testing
Achieving acceptable levels of coverage when testing the metaverse will likely require a high degree of automation. Compared with traditional desktop, web, or mobile applications, the state space of a 3D, open-world, extended-reality, online experience is vast. In the metaverse, at any moment you will be able to navigate your avatar to a given experience, equip various items and customizations, and interact with other human or computer-controlled characters. The content itself will be constantly evolving, making it a continuously moving target from an engineering perspective. Without sufficient test automation capabilities, creating, executing, and maintaining tests for the metaverse would be extremely expensive, tedious, and repetitive.
AI for Testing the Metaverse
The good news is that advances in AI and machine learning (ML) have been helping us to create highly adaptive, resilient, scalable automated testing solutions. In my previous role as chief scientist at test.ai, I had the pleasure of leading multiple projects that applied AI and ML to automated testing. Here are some details on the most relevant projects and promising directions that leverage AI for automated testing of metaverse-like experiences.
AI for Testing Digital Avatars
Advances in computer vision unlock a realm of possibilities for test automation. Bots can be trained to recognize and validate visual elements just like humans do. As a proof of concept, we applied visual validation to the digital personas developed by SoulMachines. This involved training object detection classifiers to recognize scenarios like when the digital person was speaking, waiting for a response, smiling, serious, or confused. Leveraging AI, we developed automated tests to validate conversation-based interactions with the digital avatars. This included two forms of input actions: one using the on-screen textual chat window, and the other tapping into the video stream to "trick" the bots into thinking that pre-recorded videos were live interactions with humans. A test engineer could therefore pre-record video questions or responses for the digital person, and the automation could check that the avatar had an appropriate response or reaction. Large, transformer-based natural language processing (NLP) models such as OpenAI's GPT-3 and ChatGPT can also be leveraged to generate conversational test input data or to validate expected responses.
Even in the early stages of their development, the trained bots were able to learn how to recognize interactions that were relevant to the application context and ignore others. Let's look at a concrete example. The figure below shows the results of running a test where the goal was to validate that the bot was able to respond appropriately to the gesture of smiling. We all know that smiles are contagious and that it's very hard to resist smiling back at someone who smiles at you, so we wanted to test this visual aspect of the bot interactions. The automation therefore launched the digital person, tapped into the live video stream, and showed the digital avatar a video of one of our engineers who, after a few moments, started to smile. The automation then checked the avatar's response to the smile, and here was the result.
As shown in the figure, if you compare the bot's current observation of the avatar with the prior observation, you will notice two differences. Firstly, the avatar's eyes are closed at the moment of the capture, as indicated by the blue boxes; secondly, it is smiling broadly enough that its teeth are now visible (red boxes). However, the difference mask generated by our platform only reports one difference: the smile. Can you guess why? Perhaps a bug in our testing platform? No, quite the contrary. Here the bots have learned that blinking is part of the regular animation cycle of the digital avatar. They are not trained on a single image, but on videos of the avatars, which include these regular movements. With those animations recognized as part of the ground truth, the bot understands that the big smile is a deviation from the norm, and so produces an image difference mask highlighting that change and that change only. Just like a human would, the AI notices that the avatar smiled back in response to someone smiling at it, and knows that the eyes blinking at the moment of screen capture is just coincidental.
AI for Testing Gameplay
When it comes to playing games, AI has come a long way. Decades ago, bots were using brute-force computation to play trivial games like tic-tac-toe. Today, however, they combine self-play with reinforcement learning to reach expert levels in more complex, intuitive games like Go, Atari games, Mario Brothers, and more. This raises the question: if the bots can do that with AI, why not extend them with the aforementioned visual testing capabilities? Seen in this light, the test automation problem for gameplay doesn't seem as hard. It's really just a matter of bringing the previously mentioned AI-based testing techniques together in an environment that combines them with real-time, AI-driven gameplay.
Let’s take a look at an example. Suppose you’re tasked with testing a first-person shooter, where players engage in weapon-based combat. The game has a cooperative mode in which you can have either friendly players or enemy players in your field of view at any given time. During gameplay, your player has an on-screen, heads-up-display (HUD) that visually indicates health points, the kill count, and whether there is an enemy currently being targeted in the crosshair of your weapon. Here’s how you can automatically test the gameplay mechanics of this title:
Implement Real-Time Object Detection and Visual Diffing. Using images and videos from the game, you then train machine learning models that enable the bots to recognize enemies, friendlies, weapons, and any other objects of interest. In addition, you train them to report on the visual differences observed in-game when compared to the previously recorded baselines.
Model the Basic Actions the Bots Can Perform. In order for the bots to learn to play the game, you must first define the different moves or steps they can take. In this example, the bots can perform actions such as moving forward and backwards, strafing left and right, jumping or crouching, aiming, and shooting their weapon.
Define Bot Rewards for Reinforcement Learning. Now that the bots can perform actions in the environment, you let them take random actions at first and give them a positive or negative reward based on the outcome. In this example, you could specify three rewards (a sketch of such a reward function follows this list):
A positive reward for locking onto targets to encourage the bot to aim its weapon at enemy players.
A positive reward for increasing the kill count so that the bot doesn’t just aim at enemies, but fires its weapon at them.
A negative reward for a decrease in health points to discourage the bot from taking damage.
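To make the reward structure more concrete, here is a minimal, hypothetical sketch in Java. The GameObservation type, its fields, and the reward magnitudes are assumptions made purely for illustration; they are not part of any specific game-testing framework.
// Hypothetical snapshot of the game state at a single point in time,
// derived from the object detection models described above.
record GameObservation(boolean enemyInCrosshair, int killCount, int healthPoints) {}

// Reward for the transition from the previous observation to the current one.
static double computeReward(GameObservation previous, GameObservation current) {
    double reward = 0.0;
    if (current.enemyInCrosshair() && !previous.enemyInCrosshair()) {
        reward += 0.1;  // positive reward for locking onto an enemy target
    }
    if (current.killCount() > previous.killCount()) {
        reward += 1.0;  // positive reward for increasing the kill count
    }
    if (current.healthPoints() < previous.healthPoints()) {
        reward -= 0.5;  // negative reward for taking damage
    }
    return reward;
}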
With object detection, visual diffing, action rigging, and goal-based reinforcement learning capabilities in place, it is time to let the bot loose to train in the game environment. Initially, the bot will not be very good at attaining any of the goals such as attacking enemies and not taking damage. However, over time, after thousands of episodes of trying and failing, the bot gets better at playing the game. During training, visuals can be used as a baseline for future comparisons or the bot can be trained to detect visual glitches. Here’s a video of one of the trained bots in action. Live in-game action is shown at the top-left, while a visualization of what the bot sees in near real-time is shown in the top-right.
AI for Testing Virtual Reality
By combining software-based controller emulation with commodity hardware such as a Raspberry Pi, it is possible to automate several types of hardware devices, including gaming consoles, controllers, and video streaming devices. Such integrated tools and drivers allow the bots to observe and manipulate the input-output functions of these devices. As part of our research and development efforts into gaming and metaverse testing, we built integrations with VR headsets. Once we could control inputs and observe outputs in VR, it was just a matter of tying that API into a subsystem we refer to as the Gaming Cortex, which is basically the ML brain that combines the real-time object detection and goal-based reinforcement learning mentioned previously.
The final result is that engineers or external programs can make calls to the VR API controller and leverage it to define and execute tests in that environment. Here's a look at it in action as we execute a script that programmatically modifies the yaw, causing the headset's view to rotate within the VR space.
A Future of Testing in the Metaverse
I firmly believe that, in addition to the technical and engineering challenges that come along with creating something as complex as the metaverse, its development will bring with it several opportunities for testers to play a vital role in the future of the Internet. As software experiences become more "human", skills like user empathy, critical thinking, risk analysis, and creativity become even more necessary and will be emphasized. Being such a grand vision, the metaverse requires a significant level of "big picture" thinking, which is another skill that many testers bring to the table. These are a few of my favorite skills that I associate with not just good, but great testers.
In cases where AI/ML are a core part of the metaverse development stack, testing skills like data selection, partitioning, and test data generation will move testers to the front of the development process. With that we can also expect to see more focus on testing as a development practice, leveraging approaches like acceptance test-driven design and design for testability to ensure that the metaverse is not only correct, complete, user-friendly, safe and secure, but that it is testable and automatable at scale.
A huge thanks to Yashas Mavinakere, Jonathan Beltran, Dionny Santiago, Justin Phillips, and Jason Stredwick for their contributions to the work described in this article.