Matthias Merdes
Principal Software Architect
Heidelberg Institute for Geoinformation Technology (HeiGIT)
April/August 2022
@neckargold
1. Introduction
A lot has been written on software engineering in general. While there is certainly no single universal definition of software engineering, a pragmatic definition might include activities such as versioning, automating, building, testing, documenting, and deploying certain artifacts, e.g. source code or compiled libraries, and performing those activities in a systematic, reproducible, tool-supported, and ideally automated manner. This article aims at collecting and describing similar practices and tools for working with data schemas in the context of Kafka applications.
One major goal for any form of software engineering is to increase predictability through systematic actions and to generally decrease the likelihood of surprises. As surprises are more dangerous and damaging the later they happen, an important aspect is to perform actions at development time rather than runtime. To a considerable extent, this can be achieved in the schema engineering space as well. Such schemas can be modelled, developed, tested, published, and referenced much in the same way as other artifacts. As treating schemas as first-class citizens is a relatively new approach - at least for Kafka-based systems - a number of interesting challenges arise. In the following, we will describe a 'practical schema engineering' approach, sharing lessons learned in the context of introducing Kafka at company scale in a fintech.
When trying to constrain data by using a schema, one of the first questions is which schema technology to choose. A detailed discussion about the pros and cons of existing schema formats is beyond the scope of this article. As with most data-related topics, Martin Kleppmann’s book [DDIA] is a very good reference for the various options. At the time our project started (2020), Apache Avro was clearly the dominant schema format in the Kafka community. While support for other options such as ProtoBuf or JSON schema has since been added to important tools such as schema registries, Avro is probably still the most mature one.
The decision for or against using an explicit data schema is a fundamental one. Using a schema in a Kafka project usually means that the payload of Kafka messages must adhere to the rules contained in the schema. This implies that Kafka clients, namely message producers and message consumers, can take advantage of the meta-information encoded in the schema. Of course, Kafka applications can be developed without using schemas at all. This might even be the simplest solution for very small message data models or for immutable models (say constrained by some external standard not expected to evolve during a project’s lifetime). For most projects, however, usage of an explicit schema is beneficial - especially if the Kafka system is part of a business-critical production system. Making the structure of the message data explicit by using a schema can move many surprises from runtime to development time or at least deployment time. This is comparable to a compiler, which can detect many types of errors early, whereas an interpreted language would fail only at runtime. Obviously, this is highly desirable for every important system - usually paying off the increased development-time cost many times over. Interestingly, even traditionally schema-free NoSQL databases such as MongoDB now offer optional schema constraint support to facilitate the development of long-lived systems with non-trivial data structure.
In the following we will start with describing techniques for developing, documenting, and describing schemas. We will then describe some more advanced approaches for generating, manipulating, and testing/compatibility-checking schemas at development or build time. And of course, to be useful, schemas have to be deployed just as every other software artifact.
This article is not intended to be an introduction to Kafka itself or to (AVRO) schemas, so a basic familiarity with both topics will be helpful. It is also an overview article rather than a step-by-step tutorial, so some details will have to be omitted. All code examples are given as JUnit5 tests in Kotlin.
2. Schema creation and editing
As hinted at above, a schema-based approach to Kafka messaging implies that the payload of all messages on a given topic must be compliant with the schema associated with that topic. So the first step is to somehow create a schema document. These schemas can be edited manually, derived from classes in a domain model either at development time or on the fly upon first usage, or even created programmatically. We will give an overview of these options in the following sections with an emphasis on editor and automation support.
2.1. Schema-first or code-first?
In order to create the actual schema document, i.e. an avsc-file in the case of AVRO schemas, there are two basic options:
- create the schema by hand, then (optionally) generate Java classes
- start with a Java or Kotlin model, then generate the schema document
Both approaches have their respective advantages and drawbacks. If the starting point is a Java or Kotlin domain model, then the full expressiveness of the underlying programming language can be used in the model, e.g. inheritance and composition. In the case of a Kotlin domain model, even the nullability information of Kotlin’s type system can be used to create optional AVRO types. See the avro4k library for an example of such capabilities. A slight advantage of the schema-first approach is that the majority of tools as well as documentation and tutorials seem to treat it as the default approach, so it is perhaps a bit easier to adopt.
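As a minimal illustration of the nullability mapping, the following sketch derives a schema from a data class with a nullable property. It assumes the avro4k 1.x API with its Avro.default entry point and an active kotlinx.serialization compiler plugin; the Customer class is purely hypothetical.
import com.github.avrokotlin.avro4k.Avro
import kotlinx.serialization.Serializable

@Serializable
data class Customer(
    val id: String,
    val nickname: String? = null
)

fun main() {
    // prints the generated avsc document; the nullable "nickname" property
    // is emitted as a union with "null", i.e. an optional AVRO field
    println(Avro.default.schema(Customer.serializer()).toString(true))
}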
2.2. Manual schema editing
A schema-first approach comes with its own special set of implications. It is relatively cumbersome to maintain a large schema, as the work has to be done at the textual abstraction level. This is partly alleviated by the capabilities of IDEs. In addition to the standard support for highlighting basic JSON validation errors, some IDEs allow users to register the AVRO grammar (a form of 'metaschema' with respect to the domain schema) and thus offer basic code completion and validation features (see image below). Still, the AVRO grammar has limited expressiveness, e.g. the concept of optionality has to be emulated by a union between the actual type and null. Moreover, even with such IDE support, some important aspects, such as edit-time type safety, are not well supported, as shown in the next section.

2.3. Smoke-testing AVRO schema documents
While code completion support is very useful when composing a schema by hand, its ability to detect errors is rather limited. Here, a simple smoke test can be a life-saver when a large and/or complex schema is edited. Even in small schema documents such as the avsc snippet below (a slight modification of an example from the official AVRO guide) some errors can be hard to spot. Everything seems to look good: it is valid JSON and the IDE-based avsc schema checker does not complain.

So let’s run the following smoke test against this innocent-looking schema document after performing some manual changes.
package io.payment.schemaeng.compatibility

import org.apache.avro.Schema
import org.junit.jupiter.api.Test
import java.io.File

class SchemaSmokeTests {

    @Test
    fun `manually edited schema document can be parsed as valid schema`() {
        // fails fast if the document is not a structurally valid AVRO schema
        Schema.Parser().parse(File("src/test/resources/schema/user.avsc"))
    }
}
Alas, the test fails and we are alerted to a typing error in the schema (integer instead of the correct int)!

Of course, carefully modelling a schema to represent the domain correctly is still a challenging human activity, but the combination of JSON validation, AVRO metaschema awareness of the IDE, and such a simple smoke test goes a long way to make this activity less error-prone and frustrating.
2.4. Schema-first: generation of Java classes from schema
If such a schema is to be used in producer, consumer, or streaming projects
then it is often helpful to generate a domain model from the schema.
For JVM-based projects the Apache AVRO library can be employed for this.
The avro-tools
jar included in the distribution may be used for code generation and
a number of other features such as conversions between various AVRO representations.
In an automated build, the usage of an additional Gradle plugin can be very helpful
(e.g. https://github.com/davidmc24/gradle-avro-plugin).
It should be noted that there might be cases where the generation of a domain model
from the schema can be omitted, especially for schemas which are small or of a simple structure
or in the realm of dynamically typed programming languages.
Even when working in a statically typed language such as Java or Kotlin
it is possible to work with the generic types
org.apache.avro.generic.IndexedRecord
and org.apache.avro.generic.GenericRecord
from the Apache AVRO library
which allow for index- or name-based access to fields
instead of creating an explicit domain model.
In most cases, however, generating an explicit model is advantageous.
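To illustrate the generic alternative, the following minimal sketch builds and reads a record without any generated classes. It assumes that the user schema from the smoke test above contains a string field called name.
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import java.io.File

fun main() {
    // parse the schema at run-time instead of generating a domain model from it
    val schema = Schema.Parser().parse(File("src/test/resources/schema/user.avsc"))

    // name-based write and read access via the generic API
    val user: GenericRecord = GenericData.Record(schema)
    user.put("name", "Jane Doe")
    println(user.get("name"))
}
Losing compile-time type safety is the obvious trade-off of this style: a misspelled field name only fails at run-time.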
2.5. Code-first: generation of schema documents from Java or Kotlin classes
While many tutorials emphasize the quasi-standard tooling from Apache AVRO to start with a schema and generate Java classes from it, the opposite direction is also possible - and may be simpler and more elegant in many cases [CodeFirst].
In the following we will give a small Kotlin-based example to illustrate the code-first approach,
taken again from a payment project.
It utilizes the avro4k library which is an extension of Kotlin’s serialization framework.
The kotlinx.serialization
component also provides a Gradle plugin which is run transparently during the compilation process.
@Serializable
@AvroDoc("Top-level representation of a merchant configuration to be consumed by PMPs")
@AvroNamespace("com.unzer.payment.configservice.message")
@AvroName("MerchantConfig")
@AvroProp("version", "0.0.2")
data class MerchantConfig constructor(
    @AvroDoc("Unique Id for the given merchant with respect to the legacy core system")
    var uniqueId: String,

    @AvroDoc("Version of the merchant object in ConfigService database")
    val merchantVersion: Int,

    @AvroDoc("The merchant's name")
    var name: String,

    @AvroDoc("Unzer-wide unique Id")
    @AvroDefault(Avro.NULL)
    var unzerId: String?,

    @AvroDoc("List of channels belonging to this merchant")
    var channels: List<Channel>
)
In the Kotlin data class above, the @Serializable annotation marks the class as a candidate for serialization. The @Avro* annotations are offered by the avro4k library to control various aspects of the schema generation process. Avro4k also provides a simple bridge to the Apache Avro data structure org.apache.avro.generic.GenericRecord via the utility method com.github.avrokotlin.avro4k.Avro#toRecord(). This GenericRecord can then be used together with Confluent’s (de)serializers. Accessing the generated schema is equally simple using the method com.github.avrokotlin.avro4k.Avro#schema(). The schema text is in avsc format and can then be stored as a file for subsequent registration (see below).
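A minimal sketch of this extraction step could look as follows; it assumes the avro4k 1.x entry point Avro.default and the MerchantConfig class from above, and the output path is arbitrary.
import com.github.avrokotlin.avro4k.Avro
import java.io.File

fun main() {
    // derive the AVRO schema from the annotated Kotlin data class
    val schema = Avro.default.schema(MerchantConfig.serializer())

    // store the avsc text so it can be versioned, reviewed, and registered
    // like any other build artifact
    File("build/schema/merchant-config.avsc").writeText(schema.toString(true))
}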
It might be tempting not to extract the schema file at all but to rely on auto-registration capabilities offered by some serializers at the time of first message creation. While this is technically possible and may be acceptable for very simple schemas and during early development or on non-production stages, many of the intended schema-related benefits (such as fast failure) would be lost.
This approach avoids an explicit generation step during the build like the one required for generating Java classes from a schema. Moreover, the expressiveness of programming languages and the IDE tooling for working with Java or Kotlin codebases are much more advanced than the tooling for working with JSON-based avsc schemas. For large and/or complex schemas this can make a huge difference.
2.6. Programmatic creation
While the schema- or code-first approaches described above
are widely used and should be sufficient for most applications,
there is also a little-known alternative:
Building up a schema model by coding against an API.
The ubiquitous Apache Avro library allows just that.
The following class diagram shows the org.apache.avro.Schema
interface and
a small yet representative subset of its 24 inner (sub)classes which implement the AVRO schema API.

As an example, we will create a very small AVRO schema programmatically. This model consists of a single record with only three differently typed fields but should suffice to show the technique in principle. Imagine that we would like to model a simplified 'incident', i.e. the occurrence of an urgent problem in a running software system. This is a typical example of a messaging payload: a distinct event happening at a certain point in time. The concrete task is then to create the following avsc file using the Apache AVRO schema API:
Incident schema to be created programmatically
{
  "type" : "record",
  "name" : "Incident",
  "namespace" : "example.namespace",
  "doc" : "a simple incident",
  "fields" : [ {
    "name" : "title",
    "type" : "string"
  }, {
    "name" : "production",
    "type" : "boolean"
  }, {
    "name" : "timestamp",
    "type" : "long"
  } ]
}
In the following test we build up the schema programmatically and then compare it to the avsc text file.
package io.payment.schemaeng.compatibility

import org.apache.avro.Schema
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import java.io.File

class SchemaCreationTests {

    private val expectedSchema = File("src/test/resources/schema/incident.avsc").readText()

    @Test
    fun `a schema can be created programmatically`() {
        val field1 = Schema.Field("title", Schema.create(Schema.Type.STRING))
        val field2 = Schema.Field("production", Schema.create(Schema.Type.BOOLEAN))
        val field3 = Schema.Field("timestamp", Schema.create(Schema.Type.LONG))
        val fields = listOf(field1, field2, field3)
        val newSchema = Schema.createRecord("Incident", "a simple incident", "example.namespace", false, fields)
        assertEquals(expectedSchema, newSchema.toString(true))
    }
}
Related techniques also allow for programmatic manipulation of schemas created by traditional (code-first or schema-first) means. In our project we used such an approach to automatically read an existing schema, manipulate it programmatically, and save it back. The automatic manipulation step consisted of adding several hundred possibilities for an ENUM to a certain field. Those values originated in another system and would have been very cumbersome and error-prone to maintain with a manual text-only approach. In a similar vein, we also employed such a technique to remove some specially marked fields of a large schema in a compatible way in order to produce a special variant of the schema. In this way, a family of schemas with varying content but guaranteed compatibility relationships can be created and maintained. This can be seen as a form of variability management for schemas [SPLE].
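The following sketch shows the general shape of such a manipulation step. The field name, file paths, and the source of the ENUM symbols are purely hypothetical; since AVRO Schema and Field objects cannot be reused once attached to a record, the record is rebuilt field by field.
import org.apache.avro.Schema
import java.io.File

fun main() {
    val original = Schema.Parser().parse(File("src/main/resources/schema/transaction.avsc"))

    // hypothetical external source for the several hundred ENUM symbols
    val symbols = File("src/main/resources/payment-method-codes.txt").readLines()
    val enumSchema = Schema.createEnum("PaymentMethod", "generated from an external code list", original.namespace, symbols)

    // copy every field; the (hypothetical) field "paymentMethod" gets the generated ENUM type
    val rebuiltFields = original.fields.map { field ->
        val type = if (field.name() == "paymentMethod") enumSchema else field.schema()
        Schema.Field(field.name(), type, field.doc(), field.defaultVal())
    }
    val manipulated = Schema.createRecord(original.name, original.doc, original.namespace, original.isError, rebuiltFields)

    File("build/schema/transaction-enriched.avsc").writeText(manipulated.toString(true))
}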
A programmatic approach can also be used for AVRO schema creation if another kind of model (say, an XML schema or some form of database schema) is the source of truth and no other means of conversion exist. While not recommended as a generic method to create new schemas from scratch, a programmatic approach can simplify conversion from other schema representations (or variability management for large and 'similar' schemas) in selected situations.
3. Documentation
Documentation for any long-lived artifact is paramount to its usefulness. While manually created documentation is rather unpopular with developers and generally hard to maintain, automated generation is often simple to set up and always stays in sync with the artifact. In the following sections, we will discuss some tools that can read an avsc schema file and produce a more human-readable documentation, which is important for non-technical stakeholders. The role of metadata and its relevance for interoperability is also discussed.
3.1. Avrodoc-based documentation
In the Java world, Javadoc is a ubiquitous tool for such automated creation of browsable documentation. In a similar vein, there is a tool named 'Avrodoc' for the Apache Avro space. Originally created by Martin Kleppmann himself in 2012, it is not actively maintained anymore, but a number of forks exist which can be executed in a Docker container with moderate effort. Avrodoc reads a number of avsc files and generates a single-page application from them as one self-contained HTML document. This generation can be run for every change during a standard automatic build. Technically, Avrodoc reads the "doc" keys containing a description for every type. Needless to say, the quality of the resulting documentation depends heavily on the human-edited domain-related descriptions. Other information such as domain datatype, technical type (e.g. enum, union, map etc.), optionality, or default values can be inferred from the schema as well, resulting in a complete functional and technical specification of the schema. For an example generated from a detailed schema for financial transactions see the following image.

The output of this tool is especially important for large schemas with tens or hundreds of types. It can easily be deployed and used for communication with domain experts within a team as well as for technical teams consuming data streams based on the schema in question.
3.2. AsyncAPI specifications and documentation
While Avrodoc works well, it is not really supported anymore. A more modern alternative is the usage of AsyncAPI specifications. Such specifications have a much wider scope than just schema documentation. They represent complete descriptions of asynchronous systems, roughly comparable to OpenAPI (Swagger) for REST systems [asyncAPI], and details are thus beyond the scope of this article. However, the AsyncAPI tooling is up-to-date and maintained, and the specification allows for referencing AVRO schemas when describing the payload of messages. The AsyncAPI generator tool will then embed a browsable documentation of the AVRO schema. An example from a simple use case in the payment domain with a medium-sized schema can be seen in the following image.

3.3. Metadata based on the CloudEvents specification
In one of our use cases with an especially large and nested schema with frequent changes it has also proved beneficial to model metadata explicitly and in a uniform way. An emerging standard for defining the format of event data is the CloudEvents specification (at version 1.0.2 at the time of writing).
This specification prescribes important metadata, such as event id, creation timestamp, and a reference to the (AVRO) schema. When such metadata is used for all topics on a company scale it provides one important building block for a data catalogue. In the future, standardized metadata may also play a crucial role connecting systems of different enterprises [Flow].
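The following sketch illustrates what such metadata can look like on the wire, using the 'binary content mode' of the CloudEvents Kafka protocol binding, in which the attributes travel as Kafka headers prefixed with ce_. Topic name, source, type, and schema URL are placeholders.
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.ProducerRecord
import java.time.Instant
import java.util.UUID

fun toCloudEventRecord(key: String, payload: GenericRecord): ProducerRecord<String, GenericRecord> {
    val record = ProducerRecord<String, GenericRecord>("merchant-config", key, payload)
    record.headers().apply {
        // required CloudEvents attributes
        add("ce_specversion", "1.0".toByteArray())
        add("ce_id", UUID.randomUUID().toString().toByteArray())
        add("ce_source", "/payment/config-service".toByteArray())
        add("ce_type", "com.example.merchant-config.updated".toByteArray())
        // optional attributes: creation timestamp and a reference to the AVRO schema
        add("ce_time", Instant.now().toString().toByteArray())
        add("ce_dataschema", "https://schema-registry.example.com/subjects/merchant-config-value/versions/12".toByteArray())
    }
    return record
}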
3.4. Self-description and discoverability
While there is a small overlap between the AsyncAPI and CloudEvents specifications, they serve different purposes: the former includes information on the whole event-driven system, such as servers or security mechanisms, while the latter focuses on the structure of the events themselves. The former may include a reference to the whole schema, i.e. the domain aspect of an event; the latter is restricted to the technical aspects shared by all events of a given topic. Together, both specifications can be viewed as simple building blocks for discoverability and self-description, thus providing some first steps towards a data mesh [DataMesh].
4. Test-driven schema compatibility engineering
In order to work with schemas at run-time, three things are necessary:
- a schema compatibility model
- a schema registry, i.e. a server component to serve schema versions to Kafka clients
- schema-aware (de)serializers used in producers and consumers
The latter are provided in the AVRO serialization libraries by Confluent and work well with their schema registry product. However, other compatible registries exist and may be used with the Confluent serializers. In the following sections we will take a closer look at the very important concept of schema compatibility, and how to check for such compatibility as early as possible.
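As a minimal sketch of what this looks like on the producer side, the following configuration wires Confluent's schema-aware KafkaAvroSerializer to a producer of GenericRecord values; broker and registry URLs are placeholders.
import io.confluent.kafka.serializers.KafkaAvroSerializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.StringSerializer

fun buildProducer(): KafkaProducer<String, GenericRecord> {
    val config = mapOf(
        ProducerConfig.BOOTSTRAP_SERVERS_CONFIG to "kafka.example.com:9092",
        ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG to StringSerializer::class.java,
        ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG to KafkaAvroSerializer::class.java,
        // coordinates of the schema registry used by the serializer
        "schema.registry.url" to "https://schema-registry.example.com",
        // fail fast instead of silently auto-registering new schema versions from the producer
        "auto.register.schemas" to false
    )
    return KafkaProducer(config)
}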
4.1. Confluent compatibility model for schema evolution
The systematic approach to schema compatibility in the Kafka space was pioneered by Confluent,
who also drive much of the development of the Kafka broker itself.
In order to reap the benefits of schema engineering at run-time,
a special component, called a schema registry, is needed.
Such a registry enables Kafka clients to check whether incoming messages are compatible with their local expectations.
Compatibility is a rather broad term and a fine-grained model of compatibility levels can help
with an adequate reaction to changing schemas.
The Confluent schema compatibility model [Evolution] includes the following nuances:
- BACKWARD
- BACKWARD_TRANSITIVE
- FORWARD
- FORWARD_TRANSITIVE
- FULL
- FULL_TRANSITIVE
As can be seen, there are three basic classes of compatibility modes, each coming in two flavors.
Generally, the TRANSITIVE suffix implies that compatibility is checked against all previous versions along the schema history, while the simple versions (without that suffix) require checks against the last version only. For BACKWARD compatibility, all fields may be deleted but only optional fields may be added; FORWARD compatibility, conversely, allows all fields to be added but only optional fields to be deleted. As the name implies, the FULL compatibility modes embody the strongest guarantees, hence the fewest changes are allowed, i.e. addition and deletion of optional fields only.
The benefit of this strict compatibility checking is that producer and consumer updates can be done in any order.
The main reason for employing such a compatibility model, and indeed the whole schema-based approach,
is the ability to perform controlled migrations of producer and consumer systems.
In this context, a forward
migration requires producers to be updated first
while a backward
migration requires consumers to be updated first.
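In practice, the desired compatibility level is configured per subject on the schema registry, so that incompatible versions are rejected at registration time. A sketch using Confluent's registry client (assuming its CachedSchemaRegistryClient API; URL and subject name are placeholders) might look like this:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

fun enforceFullTransitiveCompatibility() {
    val client = CachedSchemaRegistryClient("https://schema-registry.example.com", 100)
    // reject any new schema version that is not fully compatible with all previous versions
    client.updateCompatibility("merchant-config-value", "FULL_TRANSITIVE")
}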
4.2. Compatibility testing with Confluent libraries
One major advantage of a schema-centric approach is the ability to do schema compatibility analysis and migration in a controlled way.
The following listing shows a simplified JUnit test for checking certain compatibility levels between the current (version 3.8) and the last (version 3.7) versions of a schema.
package io.payment.schemaeng.compatibility

import io.confluent.kafka.schemaregistry.avro.AvroSchema
import org.junit.jupiter.api.Assertions.assertTrue
import org.junit.jupiter.api.DisplayName
import org.junit.jupiter.api.Test
import java.io.File

class SchemaCompatibilityTests {

    val backwardChecker = io.confluent.kafka.schemaregistry.CompatibilityChecker.BACKWARD_CHECKER
    val forwardChecker = io.confluent.kafka.schemaregistry.CompatibilityChecker.FORWARD_CHECKER

    val schema37 = AvroSchema(File("src/test/resources/schema/p3-txn-schema-3.7-internal.avsc").readText())
    val schema38 = AvroSchema(File("src/test/resources/schema/p3-txn-schema-3.8-internal.avsc").readText())

    @Test
    @DisplayName("schema version 3.8 is forward-compatible with version 3.7")
    fun schemaIsForwardCompatibleWithOlderVersion() {
        assertTrue(schema38 isForwardCompatibleWith schema37)
    }

    @Test
    @DisplayName("schema version 3.8 is backward-compatible with version 3.7")
    fun schemaIsBackwardCompatibleWithOlderVersion() {
        assertTrue(schema38 isBackwardCompatibleWith schema37)
    }

    // the checkers return a list of incompatibilities; an empty list means "compatible"
    infix fun AvroSchema.isBackwardCompatibleWith(other: AvroSchema) = backwardChecker
        .isCompatible(this, listOf(other))
        .isEmpty()

    infix fun AvroSchema.isForwardCompatibleWith(other: AvroSchema) = forwardChecker
        .isCompatible(this, listOf(other))
        .isEmpty()
}
Such a test can be run in an IDE (as well as in a CI build) in the usual way:

This approach has a huge benefit: Fast Failure. It allows for finding unintended schema compatibility breaches early at development time and not only at schema registration time. Such fast feedback can save considerable trouble when handling non-trivial schemas.
5. Publication / Deployment
5.1. Infrastructure as code
A major improvement to operating software has been the move in recent years towards infrastructure as code (IaC). This implies the definition of infrastructure components in a textual format which can be versioned and collaborated upon. Such definitions are then executed against the target environment in an automated and reproducible manner - a huge improvement over ad-hoc manual changes to infrastructure. There are many such tools (Ansible, Puppet, Terraform) and various different approaches. Often a dedicated configuration language with deliberately limited expressiveness is used, such as the HashiCorp Configuration Language (HCL) for Terraform, but recently libraries for expressing infrastructure concerns in a general-purpose programming language have emerged, e.g. AWS CDK.
5.2. Automatic deployment of schemas
Today, Kafka servers are increasingly hosted by specialized or general-purpose cloud companies, the most prominent examples being Confluent Cloud, Aiven for Kafka, and AWS MSK. These offerings vary greatly in terms of features, maturity, and level of abstraction, and a comparison is naturally beyond the scope of this article. All of them offer a schema registry service to manage schemas and their evolving versions at run-time. This was pioneered by Confluent, but open source alternatives (e.g. Karapace by Aiven) and proprietary systems (such as the AWS Glue Schema Registry) exist.
While those hosted Kafka systems usually offer a GUI to define elements such as topics, users, ACLs, and schemas it is highly preferable to define these building blocks in an IaC approach. Here, we are particularly interested in managing schemas. In our projects we had a good experience with deploying the schemas to the schema registry using Aiven’s Terraform provider. If a Kafka cloud vendor does not support Terraform, there are usually other, sometimes proprietary IaC tools. As a last resort, schemas can often be registered via a REST service offered by the cloud vendor. One way or another, automated and reproducible deployment of schemas should be treated with the same care and rigor as automated deployments of any other software component.
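For the last-resort option, a minimal sketch using Confluent's registry client (a thin wrapper around the registry's REST API) might look like this; URL, file path, and subject name are placeholders, and in a real pipeline this step would be driven by the CI/CD system or an IaC tool instead.
import io.confluent.kafka.schemaregistry.avro.AvroSchema
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import java.io.File

fun registerSchema(): Int {
    val client = CachedSchemaRegistryClient("https://schema-registry.example.com", 100)
    val schema = AvroSchema(File("src/main/resources/schema/merchant-config.avsc").readText())
    // by convention, the subject for a topic's value schema is "<topic>-value"
    return client.register("merchant-config-value", schema)
}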
6. Conclusion
As laid out above, a surprising number of tools and techniques exist to treat schemas as first-class citizens from a software engineering viewpoint. However, there seems to be a lack of resources on those practices. This article aims to fill this void by creating an overview and starting point for further exploration.
In our experience, an early investment in schema engineering pays off very well. It is, of course, possible to introduce schema-related practices iteratively, but adopting a schema at all, as early as possible, is paramount. Introducing a schema only when the application is already in production is difficult and dangerous. As many non-trivial use cases will require schema support in the long run anyway, it will often be reasonable to reap the benefits in terms of communication, predictability, and fast feedback as early as possible.
Happy Schema Engineering!
7. Acknowledgements
Special thanks go to Ulrike Stampa, Thilo Espenlaub, and Tess Akerlund at Unzer for developing these ideas together and supporting this article, respectively.
References
- [DDIA] Martin Kleppmann: 'Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems'
- [CodeFirst] Stefano Zanella: 'Publishing Avro records to Kafka with Kotlin, avro4k and Spring Boot'
- [asyncAPI] Daniel Kocot: 'AsyncAPI – Documentation of event- and message-driven architectures'
- [Evolution] Confluent: 'Schema evolution and compatibility' - https://docs.confluent.io/platform/current/schema-registry/avro.html
- [Flow] James Urquhart: 'Flow Architectures: The Future of Streaming and Event-Driven Integration'
- [DataMesh] Zhamak Dehghani: 'Data Mesh: Delivering Data-Driven Value at Scale'
- [SPLE] Pohl, Böckle, van der Linden: 'Software Product Line Engineering'