Monitor More Than Scrapers: OpenTelemetry Pipeline Health

Adding an OpenTelemetry (OTel) Collector can bring order to your app telemetry. But there’s a catch: if the Collector itself backs up or crashes, you can believe you have observability while silently losing data. This guide covers monitoring not just scrapers (crawlers) but the whole OTel pipeline (receiver → processor → exporter): what to watch, how to implement it (config examples, metrics to collect, and alert design), and how to catch drops and backpressure early.

What You’ll Learn

The key signals for OpenTelemetry Collector pipeline health
How to use zPages, health checks, and self-metrics for visibility and detection
Alert patterns to catch congestion, drops, and retries before they become outages

Why this monitoring matters

A common failure mode in web scraping platforms is: “We monitor scraper uptime, but we don’t monitor the observability pipeline.” In practice, incidents often happen outside the scraper itself—for example:

Collector memory pressure leading to OOMKill or CPU throttling
The exporter destination (backend/SaaS) slows down, and queues keep growing
Batching or sampling increases latency enough to break an SLO
Data is received, but one signal type (for example, logs only) is being dropped

Key premise: the OTel Collector is a data pipeline with three stages: receive, process, and export. Process-level monitoring (CPU/memory) is therefore necessary but not sufficient; you also need data-flow monitoring.

Define “healthy” (so alerts are actionable)

“Pipeline health” is vague unless you break it down. In monitoring design, splitting it into at least these three layers means each alert points at a specific stage of the pipeline rather than at “the Collector” in general.

1) Alive

The Collector process is running, and its receiving endpoints respond. This maps to Kubernetes readiness/liveness checks and an HTTP health endpoint.
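As a rough sketch of the “alive” layer, the fragment below points Kubernetes probes at the health_check extension. It assumes the extension serves on its default port 13133 and path /, and that it is bound to an address the kubelet can reach (recent Collector versions may default to localhost). The container name, image tag, and probe timings are illustrative, not recommendations.

```yaml
# Fragment of a Pod spec (container name and timings are assumptions)
containers:
  - name: otel-collector
    image: otel/opentelemetry-collector-contrib:latest  # pin a version in practice
    ports:
      - name: health
        containerPort: 13133   # health_check extension default port
    livenessProbe:             # restart the container if the Collector stops responding
      httpGet:
        path: /
        port: health
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:            # stop routing traffic while the Collector is not ready
      httpGet:
        path: /
        port: health
      periodSeconds: 5
```

These probes only confirm that the process answers HTTP. They say nothing about whether data is flowing, which is what the next two layers cover.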

2) Receiving as expected

The receiver is ingesting data at the expected rate and passing it into processors. Signals to watch include accepted counts, refusals/errors, and queue growth.

3) Exporting as expected

The exporter is successfully delivering data to your backend. Evidence includes “successful sends,” “send failures,” “retries,” “drops,” and “queue backlog.”

Watch out: health checks can stay green while an exporter destination is degraded and your queues grow without bound. If you treat “alive” as “healthy,” you can miss data loss for a long time.

Start with these extensions

The Collector includes extensions that make operations dramatically easier. The official configuration docs commonly show adding health_check/pprof/zpages under extensions. (See the Collector configuration documentation.)

health_check

An endpoint designed for load balancers and Kubernetes probes. Use this primarily for “alive” monitoring.

zPages

A built-in UI for peeking into live internal state. Official troubleshooting guidance calls out zPages as a way to inspect live receiver/exporter behavior. (See Collector troubleshooting.)

pprof

Used to capture CPU/memory profiles. It’s especially useful when you need to decide whether a “stuck pipeline” is caused by load (capacity) or configuration (pipeline behavior).
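A minimal sketch of the three extensions with explicit endpoints. The ports shown are the commonly documented defaults; the bind addresses are an assumption (health_check exposed for probes, the two debugging endpoints kept on localhost and reached via port-forwarding).

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133    # reachable by kubelet / load balancer probes
  zpages:
    endpoint: localhost:55679  # browse /debug/pipelinez or /debug/tracez
  pprof:
    endpoint: localhost:1777   # Go pprof handlers under /debug/pprof/

service:
  # An extension is only started if it is listed here
  extensions: [health_check, zpages, pprof]
```

Keeping zPages and pprof off public interfaces is a deliberate choice in this sketch: both expose internal state that is useful for triage but not meant for untrusted networks.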

Self-monitoring the Collector

The OTel Collector can emit its own internal telemetry (self-metrics). Collect these metrics with Prometheus (or another metrics backend), then build dashboards and alerts on top. This is the core of pipeline health monitoring.
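One possible way to wire this up, shown as two separate fragments: the Collector raises its internal telemetry level, and Prometheus scrapes the Collector’s own metrics endpoint. The hostname otel-collector is hypothetical, port 8888 is the usual default for internal metrics, and the way to change the listen address differs between Collector versions, so check the docs for the version you run.

```yaml
# Collector config fragment: raise internal telemetry verbosity
service:
  telemetry:
    metrics:
      level: detailed   # one of: none, basic, normal, detailed
---
# Prometheus config fragment: scrape the Collector's own metrics
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8888"]  # hypothetical hostname; 8888 is the usual default port
```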

What to monitor

Exact metric names can vary by distribution and version. But the monitoring dimensions are stable. At minimum, cover:

Ingress volume: received counts/bytes by receiver and pipeline
Egress volume: successful sent counts/bytes by exporter
Failures: exporter errors, retries, refusals/rejections
Drops: drops in processors/exporters (direct evidence of data loss)
Backlog: queue length, backpressure signals, latency
Resources: CPU, memory, GC, goroutines/threads (environment-dependent)

Operational tip: visualize the delta between the “front door” (accepted/received) and the “exit” (sent). If the gap keeps growing, you’re accumulating backlog, retrying, or dropping somewhere.
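The front-door versus exit comparison can be sketched as Prometheus recording rules. The metric names below follow the Collector’s commonly seen self-metrics for traces and are an assumption: some versions add a _total suffix, and metrics and logs pipelines use different names. The rule names are made up for this example.

```yaml
groups:
  - name: otel_collector_flow
    rules:
      # Spans accepted per second across all receivers
      - record: otelcol:spans_accepted:rate5m
        expr: sum(rate(otelcol_receiver_accepted_spans[5m]))
      # Spans successfully sent per second across all exporters
      - record: otelcol:spans_sent:rate5m
        expr: sum(rate(otelcol_exporter_sent_spans[5m]))
      # Positive values mean more is coming in than going out
      - record: otelcol:spans_gap:rate5m
        expr: otelcol:spans_accepted:rate5m - otelcol:spans_sent:rate5m
```

If a pipeline samples or filters on purpose, or fans out to several exporters, the expected gap is not zero. In that case compare against a baseline instead of against zero.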

Prevent memory-driven incidents

One of the most common Collector failures is OOM (out-of-memory). Add the memory_limiter processor so the Collector can shed load before it crashes. The official Go package documentation explicitly warns that incoming data can still consume additional memory before the memory limiter can reject it—so you need to design with real headroom, not “set it and forget it.” (See memorylimiterprocessor docs.)
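For containerized deployments, a percentage-based variant can be easier to keep in sync with the container’s memory limit than fixed MiB values. This is a sketch with illustrative numbers; the right headroom depends on your traffic shape.

```yaml
processors:
  memory_limiter:
    check_interval: 1s          # how often memory usage is measured
    limit_percentage: 80        # hard limit as a share of total available memory
    spike_limit_percentage: 20  # soft limit = hard limit minus this spike allowance
```

Leaving the remaining share of memory unallocated is the headroom the warning above refers to: data that arrives between two checks still has to fit somewhere.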

Example configuration

This is a minimal setup for “health,” “internal visibility,” and “self-metrics readiness.” Adjust exporters (Prometheus/OTLP/vendor) and authentication for your environment.

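# Operational extensions; each one must also be listed under service.extensions
extensions:
  health_check: {}   # liveness/readiness endpoint
  pprof: {}          # CPU/memory profiling
  zpages: {}         # live debugging pages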

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  memory_limiter:
    # Tune these values for your environment
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch: {}

exporters:
  # Prints telemetry to the Collector's own log; replace with your backend exporter
  debug: {}

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]

The key detail is wiring extensions into service.extensions: an extension that is configured under extensions but not listed there is never started. The official docs explain the overall structure (pipelines/receivers/processors/exporters/extensions). (See Collector configuration.)

Common failure patterns (and what they usually mean)

Receiving increases, but exporting doesn’t

Common causes include an outage at the exporter destination, expired credentials, DNS/network failure, or rate limiting. Confirm by correlating exporter retries and queue growth.
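How long the Collector can absorb a slow or unavailable destination is governed by the exporter’s queue and retry settings. The fragment below is a sketch for an OTLP exporter; the endpoint is hypothetical and the values mirror commonly documented defaults rather than tuned recommendations.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical destination
    sending_queue:
      enabled: true
      num_consumers: 10      # parallel senders draining the queue
      queue_size: 1000       # once full, new data is rejected and counted as enqueue failures
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # first backoff delay
      max_interval: 30s      # upper bound on backoff delay
      max_elapsed_time: 300s # after this, the data is given up on (dropped)
```

Read together with the self-metrics, these settings tell you how much time a growing queue buys before retries turn into drops.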

Drops start increasing

Likely culprits include memory limiting, queue saturation, or a permanent exporter error. Drops mean confirmed data loss, so alert on this with the highest urgency.

Latency keeps rising

Batch settings, sampling, and exporter backpressure all affect end-to-end latency. If you run an SLO, treat “pipeline latency” as a first-class signal.
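As an illustration of how batching bounds added latency, the batch processor flushes when either the size or the timeout is reached, whichever comes first. The values below are examples only.

```yaml
processors:
  batch:
    timeout: 5s                 # flush at least this often, even if the batch is small
    send_batch_size: 8192       # flush as soon as this many items are buffered
    send_batch_max_size: 10000  # upper bound on a single batch (must be >= send_batch_size)
```

With these example values, a quiet pipeline can hold data for up to about 5 seconds before export, so the timeout is the first number to check against a latency SLO.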

The Collector crashes

Besides OOM, misconfiguration or missing certificates can stop the Collector at startup, which from the outside looks like “nothing is being received.” If you run distroless images (or otherwise lack debugging tools in-container), exposing health_check and zPages is practical insurance for faster triage. (See this experience-based discussion on Reddit.)

Note: Reddit is anecdotal. It’s useful for incident intuition, but validate conclusions against your own metrics and the official docs.

Alert design

“Alert when it’s down” is often too late. Add conditions that detect data-flow degradation early.

Recommended conditions

Drops: alert as soon as any drops are recorded within a 5-minute window (no waiting period)
Ingress–egress delta: alert if the gap grows continuously for 10–15 minutes
Exporter failure rate: alert if failures exceed a threshold (for example, sustained >1%)
Memory pressure: alert if memory stays near the limit or GC pressure stays elevated
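The conditions above could be expressed as Prometheus alerting rules along these lines. Treat this as a sketch: the metric names are the commonly seen trace self-metrics and may differ by Collector version (for example, a _total or unit suffix), the severity values are a made-up convention, and the memory threshold assumes the 512 MiB limit used in this article’s example.

```yaml
groups:
  - name: otel_collector_pipeline
    rules:
      - alert: OtelCollectorDroppingSpans
        # Any enqueue failure in the last 5 minutes means data was lost
        expr: sum(increase(otelcol_exporter_enqueue_failed_spans[5m])) > 0
        labels:
          severity: page
      - alert: OtelCollectorIngressEgressGap
        # Accepting faster than sending for 15 minutes: backlog keeps growing
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m]))
            - sum(rate(otelcol_exporter_sent_spans[5m])) > 0
        for: 15m
        labels:
          severity: ticket
      - alert: OtelCollectorExportFailureRate
        # More than 1% of export attempts failing, sustained
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans[5m]))
            /
          (sum(rate(otelcol_exporter_sent_spans[5m]))
            + sum(rate(otelcol_exporter_send_failed_spans[5m])))
            > 0.01
        for: 10m
        labels:
          severity: page
      - alert: OtelCollectorMemoryNearLimit
        # RSS above 90% of an assumed 512 MiB limit
        expr: otelcol_process_memory_rss > 0.9 * 512 * 1024 * 1024
        for: 10m
        labels:
          severity: ticket
```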

Split notification targets

Separate “fix now” (on-call) from “improve next business day” (engineering follow-up). In practice, drops and Collector down belong in the former; rising latency and slowly growing queues often belong in the latter.
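If alerts are delivered through Alertmanager, the split can be sketched as a routing tree. This assumes alerts carry a severity label whose value page means “fix now” (a hypothetical convention); the receiver names are placeholders, and the actual pager or chat integrations are left out.

```yaml
route:
  receiver: engineering-followup   # default: next-business-day queue
  routes:
    - matchers:
        - severity="page"          # drops, Collector down, and similar
      receiver: oncall
receivers:
  - name: oncall                   # attach your pager integration here
  - name: engineering-followup     # attach your chat or ticket integration here
```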

Comparison: monitoring approaches

Here’s a practical mapping of common tools for pipeline monitoring by intent.

Approach	Primary purpose	Strength	Weakness
health_check	Liveness	Lightweight; ideal for LB/probes	Doesn’t reveal data loss
zPages	Real-time debugging	Lets you inspect live state quickly	Not ideal for continuous monitoring
self-metrics	Continuous monitoring	Backbone for dashboards and alerting	Requires design and upkeep
pprof	Root-cause analysis	Tracks down CPU/memory causes	Requires operator skill

Operational pitfalls to avoid

Assuming “the Collector is stable”

The Collector is robust, but if ingest spikes or the export destination slows, it can destabilize unless you’ve designed for backpressure and memory safety. Scraping workloads often have sharp peaks, so plan around memory protection (memory_limiter) and batching from day one. (See memorylimiterprocessor docs.)

Fragmented monitoring

If the app is monitored in an APM tool, the Collector in a different dashboard, and the backend somewhere else, triage slows down. Put “ingress → Collector → egress” on the same dashboard so incidents become pattern-matching rather than guesswork.

References

Start with official sources for configuration, debugging, and memory protection.


Summary

Define OTel health as more than “alive”: verify it’s receiving and exporting as expected
Use health_check/zPages/pprof, but make self-metrics the foundation for continuous monitoring
Alert proactively on drops, ingress–egress deltas, failure rates, and queue/backlog signals

About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
