
Samplers

Samplers are available as libraries that you import into your services, or they are created by standalone components, such as the kafka-sampler, that retrieve Data Samples from a system (such as a data broker or a database) on their own.

They are designed to not interfere with the normal operation of your systems and to not impact your application performance. Go to the Data Collection page to get a deeper understanding of how a Sampler works.

Samplers need to be able to decode the data that they intercept so that it can be evaluated by their configured Streams, which decide whether or not that Data Sample needs to be processed. Therefore, you need to choose a Sampler that is compatible with the message encoding (e.g. Protocol Buffers, JSON...) that your service works with.

Note

All Samplers are able to process JSON messages. Since JSON is self-describing, the message itself is enough to decode its contents (no external schema is required). And since, at least when using Samplers within your services, it is usually possible to convert any object to JSON, this option works as a fallback when the encoding your service uses is not supported. Of course, there is a performance penalty to consider when converting messages to JSON.
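
As a minimal sketch of this fallback, assuming a hypothetical SampleJSON method (the actual Go Sampler API is documented in its Godoc page), a service can marshal any value to JSON before handing it to a Sampler:

```go
package sampling

import (
	"context"
	"encoding/json"
	"fmt"
)

// jsonSampler is a hypothetical stand-in for a Sampler that only understands
// JSON messages; it is not the actual Neblic Go Sampler API.
type jsonSampler interface {
	SampleJSON(ctx context.Context, msg string) error
}

// sampleAsJSON converts an arbitrary Go value to JSON so that any Sampler can
// decode it. Keep in mind the extra CPU cost of the conversion on hot paths.
func sampleAsJSON(ctx context.Context, s jsonSampler, v any) error {
	b, err := json.Marshal(v)
	if err != nil {
		return fmt.Errorf("converting value to JSON: %w", err)
	}
	return s.SampleJSON(ctx, string(b))
}
```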

Best practices

Because their performance impact is negligible when no Streams are configured, it is recommended to add Samplers wherever data is transformed or exchanged. This lets you track how your data evolves throughout your system.

Note

Unlike logging, where it is usually recommended to avoid instrumenting the critical path to prevent excessive noise and increased costs, Samplers can be dynamically configured, so you can add them anywhere without worrying about impacting your application or your costs.

Samplers have sensible defaults to protect your services and never export large amounts of data without your permission. By default, they have a rate limit that puts a ceiling on how many Data Samples they export per second. All of these defaults can be adjusted at runtime using a Neblic client, such as neblictl (e.g. to temporarily increase the rate limit while troubleshooting an issue).

For example, it is common to add Samplers in all service or module boundaries:

  • Requests and responses sent between services (e.g. HTTP API requests, gRPC calls...)
  • Data passed to/from modules/libraries used within the same service
  • Requests/responses or messages to external systems (e.g. DBs, Apache Kafka...)

Other interesting places could be:

  • Before/after relevant data transformation
  • When a service starts, to register its configuration

To make it easier to get Data Samples from multiple places, Neblic provides helpers, wrappers, and components that can automatically add Samplers in multiple places in your system, e.g. in all gRPC requests/responses or in multiple Kafka topics. Check out the next sections to see which Samplers Neblic provides.

Configuration

A Sampler name and resource id pair identifies a particular set of Samplers. For example, if you have multiple replicas of the same service, each replica will register a Sampler with the same name and resource id. All of these Samplers are treated as a group, and you can configure them all together.

The Data Collection page shows all the operations performed by a Sampler and the Collector, and provides the foundation for understanding how to configure Sampler behaviour. See this how-to page to learn how to configure Samplers using neblictl.

Streams

Streams are the initial configuration that Samplers need before they can start generating Data Telemetry. To create a Stream, you need to provide a target Sampler and a rule that selects which Data Samples are part of the Stream.

Data Samples can have a key associated with them. By setting up a Stream as keyed, you can gather Data Telemetry separately for each key value. This feature is handy, for example, for collecting independent Data Telemetry for different customers (by using the customer ID as the key) or for various Event types (using the event type ID as the key). Currently, only Events are compatible with keyed Streams. For details on which functions are compatible with keyed Streams, please check the reference table.
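
As an illustration of these two concepts, the snippet below pairs a Data Sample with a rule that selects it and a key that splits the resulting telemetry per customer. Both the rule syntax and the field names are assumptions made for the example; check the reference documentation for the actual rule language.

```
// Illustrative Data Sample (JSON):
{ "customer_id": "acme", "status": "error", "amount": 42 }

// Illustrative Stream rule (the syntax shown here is an assumption):
sample.status == "error"

// Keyed Stream: using customer_id as the key yields independent
// Data Telemetry for "acme", "globex", and so on.
```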

Digests and Metrics

Digests are generated at the Stream level: first, you need to create a Stream, and then you will be able to generate the required Digests. Metrics are generated from Digests, so you first need to create a Digest; the Collector will then automatically generate and export Metrics based on its contents.

Events

Events are generated in the Collector and also work at the Stream level. To generate Events you will need to create a Stream and configure it to export Raw Data to the Collector.

Then, you can create Events by specifying the target Stream and a rule that will trigger the generation of the Event.

Available Samplers

Go

Info

Check this guide for an example of how to use it and the Godoc page for reference.

The Go Sampler module allows you to get Data Samples from your Go services.

Supported encodings

| Encoding | Description |
| --- | --- |
| JSON | A string containing a JSON object. |
| Protobuf | It can efficiently process proto.Message objects if the Protobuf message definition is provided. |
| Go native object | Only capable of accessing exported fields. This is a limitation of the Go language. |

gRPC interceptors

The Go library also provides a package to easily add Samplers as gRPC interceptors. This automatically adds a Sampler to each gRPC method, intercepting all requests and responses.
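
As a minimal sketch of the pattern, the snippet below registers a unary interceptor when building a gRPC server. neblicUnaryInterceptor is a hypothetical placeholder for the interceptor constructor that the Neblic Go package provides; its real name and options are in the Godoc reference.

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
)

// neblicUnaryInterceptor is a hypothetical stand-in for the interceptor provided
// by the Neblic Go Sampler package; see the Godoc page for the actual API.
func neblicUnaryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (any, error) {
		// A real interceptor would hand req (and the returned response) to a Sampler here.
		return handler(ctx, req)
	}
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	// Chaining the Sampler interceptor means every gRPC method gets its
	// requests and responses sampled automatically.
	srv := grpc.NewServer(grpc.ChainUnaryInterceptor(neblicUnaryInterceptor()))
	// Register your services on srv here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```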

Instrumentation overhead (advanced)

The Go Sampler module is fully functional, but it has not yet been thoroughly optimized. Therefore, there are some use cases where it may exhibit more overhead than desired, which you need to be aware of if your application processes a large volume of data or is resource-constrained.

In most cases the instrumentation overhead won't affect your application. In any case, since the Sampler behavior can be dynamically configured at runtime, it is recommended to set an input limiter when the Sampler is first deployed and to adjust it while monitoring its impact. You can also adjust it as needed during a troubleshooting session and keep it mostly inactive when unused.

Check the benchmarks page to see the latest results. The sections below offer a performance analysis for each supported Data Sample encoding. The majority of the overhead is due to the serialization and/or deserialization of the Data Sample. Therefore, the overhead is mostly influenced by the number of fields in the Data Sample and, to a lesser degree, its overall size.

JSON samples

The JSON sample is deserialized with go-json to determine which Streams it belongs to and to generate Digests. This action dominates the Sampler's computational overhead. The plan is to stop deserializing it into a map[string]interface{} and instead simply iterate over it, converting values as needed on the fly and reusing memory buffers when possible. See this project item for details.

Info

If your application is very resource-constrained, you can create a Stream that doesn't require deserializing the JSON sample (e.g. by using the rule true), forward the raw Data Samples to the Collector, and generate the Digests there. This adds network overhead, since the Data Samples need to be forwarded to the Collector, but it does not impact your application's CPU usage.

Protobuf Samples

Stream matching is very fast, since it doesn't require deserializing the Data Sample. But generating Digests requires the Data Sample to be deserialized into a map[string]interface{}, which dominates the Sampler's processing overhead. The plan is to avoid this deserialization and iterate over its fields using the proto reflect package instead. See this project item for details.

Forwarding raw Data Samples encoded as Protobuf is also not as performant as we would like (see this issue). The plan is to stop serializing them to JSON when forwarding them to the Collector and use a more efficient serialization format instead. It is also planned to avoid serializing the whole sample unless strictly necessary. See this and this project items for details.

Go native samples

Stream matching and Digest generation require the sample to be deserialized into a map[string]interface{}, which has some overhead but is quite acceptable since it doesn't require as many memory allocations as deserializing a JSON or Protobuf encoded sample. It is unclear whether iterating over the sample with reflection, instead of creating a map, would reduce the deserialization overhead.

Exporting raw Data Samples requires them to be serialized to JSON. As in the previous case, using a different serialization format and only exporting the necessary parts of the data sample will help reduce its impact. See this and this project items for details.

Kafka

Neblic provides a standalone service called kafka-sampler capable of automatically monitoring your Apache Kafka topics and creating Samplers that will allow you to inspect all data that flows through them.

Supported encodings

| Encoding | Description |
| --- | --- |
| JSON | A string containing a JSON object. |

Instrumentation overhead (advanced)

The kafka-sampler service is based on the Go Sampler. Check its overhead analysis for details.

Check this guide to learn how to use it.

Advanced

Using OpenTelemetry SDK

The Neblic Collector is built on top of the OpenTelemetry stack and, as a result, is capable of understanding and processing samples encoded as OpenTelemetry logs if they are correctly formatted. Any OpenTelemetry SDK implementation that supports logs can be used to generate samples that Neblic will process.

Concept mapping between OpenTelemetry and Neblic:

| OpenTelemetry | Neblic |
| --- | --- |
| Resource | Resource |
| InstrumentationScope | Sampler |
| Attribute com.neblic.sample.stream.names | Stream |
| Attribute com.neblic.sample.key | Key |

Note

OpenTelemetry recommends using appenders to propagate logs, but that approach does not work for this use case, so the Logs API is used directly instead.

Steps to follow:

  • Create a LoggerProvider with the desired Resource name.
  • Create a Logger with the desired sampler name as the InstrumentationScope name.
  • Emit a log with:
    • Attribute com.neblic.sample.stream.names with value all
    • Attribute com.neblic.sample.key with the desired key value
    • Attribute com.neblic.sample.type with value raw
    • Attribute com.neblic.sample.encoding with value json
    • Body with the serialized version of the data

Once the Collector receives the first sample, the Sampler will appear in the control plane like any other Sampler (but with limited functionality).
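
A minimal sketch of these steps using the OpenTelemetry Go SDK is shown below. The OTLP endpoint, service and Sampler names, and the sample payload are assumptions made for illustration; the attribute keys and values follow the list above.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlplog/otlploggrpc"
	otellog "go.opentelemetry.io/otel/log"
	sdklog "go.opentelemetry.io/otel/sdk/log"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func main() {
	ctx := context.Background()

	// OTLP log exporter pointing at the Neblic Collector (the address is an assumption).
	exporter, err := otlploggrpc.New(ctx,
		otlploggrpc.WithEndpoint("localhost:4317"),
		otlploggrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Step 1: a LoggerProvider with the desired Resource name.
	provider := sdklog.NewLoggerProvider(
		sdklog.WithProcessor(sdklog.NewBatchProcessor(exporter)),
		sdklog.WithResource(resource.NewWithAttributes(semconv.SchemaURL,
			semconv.ServiceName("checkout-service"))),
	)
	defer provider.Shutdown(ctx)

	// Step 2: a Logger whose InstrumentationScope name acts as the Sampler name.
	logger := provider.Logger("orders-sampler")

	// Step 3: emit a log record carrying the serialized Data Sample and the Neblic attributes.
	body, _ := json.Marshal(map[string]any{"order_id": "1234", "status": "paid"})

	var rec otellog.Record
	rec.SetBody(otellog.StringValue(string(body)))
	rec.AddAttributes(
		otellog.String("com.neblic.sample.stream.names", "all"),
		otellog.String("com.neblic.sample.key", "1234"),
		otellog.String("com.neblic.sample.type", "raw"),
		otellog.String("com.neblic.sample.encoding", "json"),
	)
	logger.Emit(ctx, rec)
}
```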