Introduction to schemas
Before deepening into the different supporting technologies, let’s create a baseline about schemas and message brokers or async server-server communication.
Schema = Struct.
The shape and format of a “message” are built and delivered between different applications/services/electronic entities.
Schemas can be found in SQL & No SQL databases, in different shapes of the data the database expects to receive (for example, first_name:string, first.name etc..).
An unfamiliar or noncompliant schema will result in a drop, and the database will not save the record.
Schemas can also be found when two logical entities are communicating, for example, two microservices.
Imagine A writes a message to B, which expects a specific format (like Protobuf), and its logic or code also expects specific keys and value types, as an example, typo in column name. Unexpected schema or different format will result in a consumer.
Schemas are a manual or have an automatic contract for stable communication that dictates how two entities should communicate.
The following compared technologies will help you maintain and enforce schemas between services as data flows from one service to another.
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
Capabilities
Data integration engine
Event-driven ETL
No-code ETL jobs
Data preparation
Main components of AWS Glue are the Data Catalog which stores metadata, and an ETL engine that can automatically generate Scala or Python code. Common data sources would be Amazon S3, RDS, and Aurora.
What is Confluent Schema Registry?
Confluent Schema Registry provides a serving layer for your metadata.
It provides a RESTful interface for storing and retrieving your Avro®, JSON Schema, and Protobuf schemas.
It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows the evolution of schemas according to the configured compatibility settings and expanded support for these schema types.
It provides serializers that plug into Apache Kafka® clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats.
Schema Registry lives outside of and separately from your Kafka brokers.
Your producers and consumers still talk to Kafka to publish and read data (messages) to topics.
Concurrently, they can also talk to Schema Registry to send and retrieve schemas that describe the data models for the messages.
What is Memphis.dev Schemaverse?
Memphis Schemaverse provides a robust schema store and schema management layer on top of Memphis broker without a standalone compute unit or dedicated resources.
With a unique & modern UI and programmatic approach, technical and non-technical users can create and define different schemas, attach the schema to multiple stations and choose if the schema should be enforced or not.
Memphis’ low-code approach removes the serialization part as it is embedded within the producer library.
Schemaverse supports versioning, GitOps methodologies, and schema evolution.
Schemaverse’s main purpose is to act as an automatic gatekeeper and ensure the format and structure of ingested messages to a Memphis station and to reduce consumer crashes, as often happens if certain producers produce an event with an unfamiliar schema.
Current version common use cases
Schema enforcement between microservices
Data contracts
Convert events’ format
Create an organizational standard around the different consumers and producers
Validation and Enforcement
When data streaming applications are integrated with schema management, schemas used for data production are validated against schemas within a central registry, allowing you to centrally control data quality.
AWS Glue offers enforcement and validation using Glue schema registry for Java-based applications using Apache Kafka, AWS MSK, Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.
Schema registry validates and enforces message schemas at both the client and server sides. Validation will take place on the client side, and by performing a serialization over the about-to-be-produced data by retrieving the schema from the schema registry.
Confluent provides read-to-use serialization functions that can be used.
Schema updates and evolution will require booting the client and fetching the updates, so as to change the schema at the registry level. It first required to be switched into a certain mode (forward/backward), perform the change, and then bring back to default.
Schemaverse validates and enforces the schema at the client level as well without the need for manual schema fetch, and supports runtime evolution, meaning clients don’t need a reboot to apply new schema changes, including different data formats.
Schemaverse also makes the serialization/deserialization transparent to the client and embeds it within the SDK based on the required data format.
Serialization/Deserialization
When sending data over the network it needs to be encoded into bytes before.
AWS Glue and Schema Registry works similarly. Each created schema has an ID.
When the application producing data has registered its schema, the Schema Registry serializer validates that the record being produced by the application is structured with the fields and data types matching a registered schema.
Deserialization will take place by a similar process by fetching the needed schema based on the given ID within the message.
In AWS Glue and Schema Registry, It is the client responsibility to implement and deal with the serialisation while in Schemaverse it is fully transparent and all is needed by the client is to produce a message that complies with the required structure.
Popular formats
Usually, services communicate with each other using one of the following
- JSON. JSON stands for JavaScript Object Notation and is a text format for storing and transporting data.
When using JSON to send a message, we’ll first have to serialize the object representing the message into a JSON-formatted string, then transmit.
{“sensorId”: 32,”sensorValue”: 24.5}
This string is 36 characters long, but the information content of the string is only 6 characters long. This means that about 16% of transmitted data is actual data, while the rest is metadata. The ratio of useful data in the whole message is increased by decreasing key length or increasing value size, for example, when using a string or array.
- Protobuf. Protocol Buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
Protocol Buffers use a binary format to transfer messages.
Using Protocol Buffers in your code is slightly more complicated than using JSON.
The user must first define a message using the .proto file. This file is then compiled using Google’s protoc compiler, which generates source files that contain the Protocol Buffer implementation for the defined messages.
This is how our message would look in the .proto definition file:
message TemperatureSensorReading {
optional uint32 sensor_id = 1;
optional float sensor_value = 2;
}
When serializing the message from the example above, it’s only 7 bytes long. This can be confusing initially because we would expect uint32 and float to be 8 bytes long when combined. However, Protocol Buffers won’t use all 4 bytes for uint32 if they can encode the data in fewer bytes. In this example, the sensor_id value can be stored in 1 byte. It means that in this serialized message, 1 byte is metadata for the first field, and the field data itself is only 1 byte long. The remaining 5 bytes are metadata and data for the second field; 1 byte for metadata and 4 bytes for data because float always uses 4 bytes in Protocol Buffers. This gives us 5 bytes or 71% of actual data in a 7-byte message.
The main difference between the two is that JSON is just text, while Protocol Buffers are binary. This difference has a crucial impact on the size and performance of moving data between different devices.
Benchmark
In this benchmark, we will take the same message structure and examine the size difference, as well as the network performance.
We used Memphis schemaverse to act as our destination and a simple Python app as the sender.
The gaming industry use cases are usually considered to have a large payload, and to demonstrate the massive savings between the different formats. We used one of “Blizzard” example schemas.
The full used .proto can be found here.
Each packet weights 959.55KiB on average.
As can be seen, the average savings are between 618.19% to 807.93%!!!
Another key aspect to take under consideration would be the additional step of Serialization/Deserialization.
One more key aspect to take under consideration is the serialization function and its potential impact on performance, as it is, in fact, a similar action to compression.
Quick side-note. Memphis Schemaverse eliminates the need to implement Serialization/Deserialization functions, but behind the scenes, a set of conversion functions will take place.
Going back to serialization. The tests were performed using a Macbook Pro M1 Pro and 16 GB RAM using Google.Protobuf.JsonFormatter and Google.Protobuf.JsonParser Python serialization/deserialization used google.protobuf.json_format
Comparison
Join 4500+ others and sign up for our data engineering newsletter.
Originally published at Memphis.dev By Yaniv Ben Hemo, Co-Founder & CEO at Memphis.dev
Follow Us to get the latest updates!
Github • Docs • Discord