Data collection

The data collection and analysis pipeline is divided between two components: the Sampler and the Collector. The diagram below provides a high-level view of how the Data Samples are transformed and forwarded across the pipeline:

flowchart LR;
    Application--Data Samples-->Sampler
    Sampler--Digests/Raw Data-->Collector
    Collector--Data Telemetry/Raw Data-->Store

Sampler

The diagram below illustrates the operations carried out within a Sampler:

flowchart LR;
    Application--Data Sample-->in
    subgraph Sampler
        in((Limiter In))-->sampler
        sampler((Sampler))-->streams
        streams((Stream\nassignation))-->digest
        digest((Digest))--Value/Struct Digests-->out
        streams-->forward
        forward((Forward))--Raw Data-->out
    end
    out((Limiter Out))--Digests/Raw Data-->Collector

All the operations are configurable and can be dynamically enabled or disabled as needed:

  • Limiter In: Limits how many Data Samples per second can be processed by applying token bucket rate limiting (see the sketch after this list).
    • This limiter discards Data Samples at random, but it makes it easy to set a fixed upper limit.
  • Sampler: Performs deterministic sampling based on the Data Sample key (see the sketch after this list).
    • The deterministic sampling selects a random subset of keys to process while discarding the others. It may be trickier to adjust since it is configured by specifying a percentage of the total Data Samples instead of using a fixed upper limit.
    • The advantage is that different Samplers will sample the same subset of Data Samples as long as they have the same deterministic sampling configuration.
  • Stream assignation: Determines which Stream the Data Sample belongs to.
  • Digest: Analyzes the Data Sample and builds Value Digests and/or Struct Digests for each Stream.
  • Forward: The Sampler can also forward the Data Sample raw content to the Collector for further analysis, storage, or Event detection.
  • Limiter Out: Limits how many Data Samples and Digests per second can be forwarded downstream by applying token bucket rate limiting.
    • This limiter discards Data Samples at random, but it makes it easy to set a fixed upper limit.
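
The snippet below is a minimal Go sketch of the two rate/volume control steps described above. It is purely illustrative: the Limiter and DeterministicSampler names are hypothetical and not part of the Sampler's actual API. It shows a token bucket limiter that drops Data Samples once the configured rate is exceeded, and a deterministic sampler that hashes the Data Sample key so that every Sampler with the same configuration keeps the same subset of keys.

    package main

    import (
        "fmt"
        "hash/fnv"

        "golang.org/x/time/rate"
    )

    // Limiter is an illustrative token bucket limiter: it allows up to
    // perSecond Data Samples per second and discards the rest.
    type Limiter struct{ bucket *rate.Limiter }

    func NewLimiter(perSecond int) *Limiter {
        return &Limiter{bucket: rate.NewLimiter(rate.Limit(perSecond), perSecond)}
    }

    // Allow reports whether the Data Sample may proceed; Samples that arrive
    // while the bucket is empty are simply dropped.
    func (l *Limiter) Allow() bool { return l.bucket.Allow() }

    // DeterministicSampler keeps a fixed percentage of keys: a given key is
    // always either kept or discarded, regardless of which Sampler sees it.
    type DeterministicSampler struct{ percentage uint32 } // 0..100

    func (s DeterministicSampler) Keep(key string) bool {
        h := fnv.New32a()
        h.Write([]byte(key))
        // Hashing the key and comparing against the configured percentage
        // keeps a stable ~percentage% subset of all keys.
        return h.Sum32()%100 < s.percentage
    }

    func main() {
        limiter := NewLimiter(1000)                     // at most 1000 Data Samples per second
        sampler := DeterministicSampler{percentage: 10} // keep ~10% of keys

        key := "order-42"
        if limiter.Allow() && sampler.Keep(key) {
            fmt.Println("process Data Sample", key)
        }
    }

Note how the sampling decision depends only on the key and the configured percentage, which is why two Samplers with the same configuration select the same Data Samples, whereas the limiter depends on arrival rate and therefore discards an unpredictable subset.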

Info

Digests can also be generated in the Collector which may be a better option in certain use cases:

  • The Collector receives data from all the Sampler replicas (if you have multiple replicas of an instrumented service, you will have multiple identical Samplers), so it can generate Digests that aggregate all their data together.
  • Generating the Digest has a computational overhead that you may want to avoid entirely if your application has limited resources. Note that although the application won't be impacted by the computational overhead, forwarding the Data Sample to the Collector will increase its network usage.

Collector

The Collector is the central point that receives data from all Samplers. Its main function is to evaluate rules on Raw Data to detect Events and generate Metrics from Digests. It can also forward Raw Data and Digests to an external store.

flowchart LR;
    sampler((Sampler))-.Raw data.->digest
    sampler((Sampler))--Raw data-->event
    sampler--Digests-->metric
    sampler--Raw data/Digests-->forward
    subgraph Collector
        digest((Digest))-.Digests.->exporter
        event((Event\nevaluation))--Events-->exporter
        metric((Metric\ngeneration))--Metrics-->exporter
        forward((Forward))--Raw Data/Digests-->exporter
    end
    exporter((Exporter))--Data Telemetry/Raw Data-->Store

    style digest stroke-dasharray: 5 5

  • Digest: Analyzes the Raw Data and builds Value Digests and/or Struct Digests. Digests are usually built in the Sampler itself, but can also be configured to be built in the Collector.
  • Event evaluation: Evaluates rules on the received Raw Data and generates an Event if it matches.
  • Metric generation: Converts the digest statistics into OpenTelemetry metrics (see the sketch after this list).
  • Forward: Forwards Raw Data and Digests as-is to an external Store.
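
To make the Metric generation step more concrete, here is a minimal Go sketch that records a Digest's statistics through the OpenTelemetry metric API. The ValueDigest fields, metric names, and attribute are assumptions made for this example and do not reflect the Collector's actual data model.

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
    )

    // ValueDigest holds a few summary statistics; the exact fields carried by
    // a real Digest are an assumption made for this sketch.
    type ValueDigest struct {
        Stream string
        Count  int64
        Sum    float64
    }

    // exportDigest records the digest statistics as OpenTelemetry metrics,
    // labelling them with the Stream they belong to.
    func exportDigest(ctx context.Context, d ValueDigest) error {
        meter := otel.Meter("collector")

        count, err := meter.Int64Counter("digest.sample.count")
        if err != nil {
            return err
        }
        sum, err := meter.Float64Counter("digest.value.sum")
        if err != nil {
            return err
        }

        attrs := metric.WithAttributes(attribute.String("stream", d.Stream))
        count.Add(ctx, d.Count, attrs)
        sum.Add(ctx, d.Sum, attrs)
        return nil
    }

    func main() {
        d := ValueDigest{Stream: "orders", Count: 128, Sum: 1024.5}
        if err := exportDigest(context.Background(), d); err != nil {
            log.Fatal(err)
        }
    }

In a real deployment the global meter provider would be configured with an exporter so that the recorded metrics reach a metrics backend; the example above uses the default no-op provider, which keeps it self-contained.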

Future planned work

Filter raw data

To generate Events, the current method requires sending the entire Data Sample in its raw form to the Collector. The Collector then evaluates the configured event rules on the Data Sample.

However, it is often more efficient to send only the data needed to evaluate the event rules. For more information and updates on this feature, please refer to this project item.
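
As a rough illustration of that idea (a hypothetical sketch, not the planned implementation), a Sampler could project the Data Sample down to only the fields referenced by the event rules before forwarding it:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // filterFields keeps only the top-level fields that the configured event
    // rules reference, so the Sampler forwards a smaller payload. Extracting
    // field paths from real rules would be more involved; this is a sketch.
    func filterFields(raw []byte, fields []string) ([]byte, error) {
        var sample map[string]json.RawMessage
        if err := json.Unmarshal(raw, &sample); err != nil {
            return nil, err
        }
        filtered := make(map[string]json.RawMessage, len(fields))
        for _, f := range fields {
            if v, ok := sample[f]; ok {
                filtered[f] = v
            }
        }
        return json.Marshal(filtered)
    }

    func main() {
        raw := []byte(`{"status":"error","latency_ms":1234,"payload":"large body omitted"}`)
        // Hypothetical rule that only inspects "status" and "latency_ms".
        out, err := filterFields(raw, []string{"status", "latency_ms"})
        if err != nil {
            panic(err)
        }
        fmt.Println(string(out)) // {"latency_ms":1234,"status":"error"}
    }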