OpenTelemetry Collector's New Filter Processor
Traditionally, trace data has been used reactively: when things break, it helps you reduce your MTTR (mean time to recovery). At Tracetest.io, we enable trace data to also be used proactively, by enabling E2E tests in a fraction of the time typically required. The secret sauce is the trace data itself, as we rely on it to conduct trace-based tests.
As traces are the main component enabling trace-based testing, we have two ways of collecting them:
- Direct integration: Used with trace data stores that allow an API or direct connection to query and retrieve a trace. This includes Jaeger, Tempo, OpenSearch, AWS X-Ray, and the list is growing.
- Via OpenTelemetry Collector: The user defines a second pipeline in their collector and sends traces directly to Tracetest. This is the approach used with Datadog, New Relic, Honeycomb, Lightstep, and others.
However, when using the OpenTelemetry Collector with a large volume of traces, storage can quickly become a problem, so we had to find a way to filter out traces that are not relevant to Tracetest. This was especially important because we never wanted to position Tracetest as a competitor to existing trace stores such as Jaeger, Tempo, or Lightstep. The solution we found was to filter spans based on an attribute in their Trace State object.
What is Trace State?
Trace State is one of the components of the TraceContext. The OpenTelemetry documentation defines both:

"A Context is an object that contains the information for the sending and receiving service to correlate one span with another and associate it with the trace overall."

"Trace State, a list of key-value pairs that can carry vendor-specific trace information."

This means that anything that is part of the TraceContext is propagated automatically. This was very important for us: if we set a value in the Trace State, it gets propagated to all the services that take part in the operation captured by that trace.
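Concretely, the Trace State travels in the W3C Trace Context HTTP headers alongside the trace and span IDs. A propagated request might carry headers like these (the IDs are illustrative values taken from the W3C spec's examples):

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: tracetest=true,rojo=00f067aa0ba902b7
```

Any service that honors W3C Trace Context forwards the tracestate header on its outgoing calls, which is what makes filtering on it at the collector possible.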
Filtering by Trace State
If you had to filter spans based on the Trace State object a few months ago, you probably noticed there was no way to do it with the core distribution of the OpenTelemetry Collector. You had to rely on the contrib distribution and use the tail_sampling processor.

However, a few months ago the OpenTelemetry team rewrote part of the filter processor and made it possible to use the OpenTelemetry Transformation Language (OTTL) to build your filters. This change made it possible to use the filter processor to access attributes in the Trace State object.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    timeout: 100ms
  filter/tracetest:
    error_mode: ignore
    traces:
      span:
        # drop every span whose trace state does not contain
        # the key "tracetest" with the value "true"
        - 'trace_state["tracetest"] != "true"'

exporters:
  logging:
    loglevel: debug

service:
  pipelines:
    traces/1:
      receivers: [otlp]
      processors: [filter/tracetest, batch]
      exporters: [logging]
```
For more complex filters, check the OTTL documentation.
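As a sketch of what more elaborate conditions can look like (the attribute values and the service-name pattern here are hypothetical), the same processor can combine several OTTL conditions, and a span is dropped when any of them matches:

```yaml
processors:
  filter/noise:
    error_mode: ignore
    traces:
      span:
        # drop health-check spans
        - 'attributes["http.target"] == "/healthz"'
        # drop spans emitted by internal services, matched by regex
        - 'IsMatch(resource.attributes["service.name"], "internal-.*")'
```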
Why is this important?
There are three main arguments for migrating from tail_sampling to filter:

- It's available in the core distribution: Most vendor-specific collectors are based on the core distribution. Thus, if you use an up-to-date vendor-specific collector, you probably have access to the filter processor, but not to tail_sampling.
- It's easier to write: If you have ever written a tail sampling rule, you know. It's not hard, but the syntax is more cumbersome than OTTL. Having the filter processor makes it easier to understand what is being filtered out of your pipeline.
- It performs better: Tail sampling requires the collector to keep each trace in memory until it decides whether to sample it. Depending on your configuration and the number of spans your system generates, this processor can be very heavy on memory usage, which means a higher cost of maintaining your collector.
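To illustrate the syntax difference, a roughly equivalent rule with the tail_sampling processor might look like the sketch below (the policy name is ours, and the exact policy syntax can vary between collector versions):

```yaml
processors:
  tail_sampling:
    # spans are buffered in memory for this long before a sampling decision
    decision_wait: 10s
    policies:
      - name: keep-tracetest-traces
        type: trace_state
        trace_state:
          key: tracetest
          values: ["true"]
```

Note that, unlike filter, this requires tuning decision_wait and accepting the memory cost of buffering traces until a decision is made.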
Conclusion
The new capabilities of the filter processor are very useful, and filters are now easy to write and maintain. If you tried it before and felt it was lacking something, it's worth revisiting to check out the new filtering capabilities.

After we saw the new capabilities, we dropped our recommendation of tail sampling and started suggesting that users adopt the filter processor instead.
I want to send big kudos to the OpenTelemetry Collector team for this change!
Want to enable your trace data to power trace-based tests, allowing you to build powerful E2E tests in minutes rather than days? Give tracetest.io a try!