Monitor More Than Scrapers: OpenTelemetry Pipeline Health
Adding an OpenTelemetry (OTel) Collector can bring order to your app telemetry. But there's a catch: if the Collector itself gets backed up or crashes, you can end up "thinking you have observability" while silently losing data. This guide focuses on monitoring not just scrapers (crawlers) but the OTel pipeline (receiver → processor → exporter) end to end: what to watch, how to implement it (config examples, metrics to collect, and alert design), and how to catch drops and backpressure early.
Why this monitoring matters
A common failure mode in web scraping platforms is: "We monitor scraper uptime, but we don't monitor the observability pipeline." In practice, incidents often happen outside the scraper itself.
Key premise: the OTel Collector is a data pipeline that receives, processes, and exports. That means process-level monitoring (CPU/memory) is necessary but not sufficient; you also need data-flow monitoring.
Define "healthy" (so alerts are actionable)
"Pipeline health" is vague unless you break it down. In monitoring design, splitting it into at least these three layers reduces surprises.
1) Alive
The Collector process is running, and its receiving endpoints respond. This maps to Kubernetes readiness/liveness checks and an HTTP health endpoint.
2) Receiving as expected
The receiver is ingesting data at the expected rate and passing it into processors. Evidence includes accepted counts, refusals/errors, and growing queues.
3) Exporting as expected
The exporter is successfully delivering data to your backend. Evidence includes successful sends, send failures, retries, drops, and queue backlog.
Watch out: health checks can stay green while an exporter destination is degraded and your queues fill up and start dropping data. If you treat "alive" as "healthy," you can miss data loss for a long time.
Start with these extensions
The Collector includes extensions that make operations dramatically easier. The official configuration docs commonly show adding health_check/pprof/zpages under extensions. (See the Collector configuration documentation.)
health_check
An endpoint designed for load balancers and Kubernetes probes. Use this primarily for "alive" monitoring.
zPages
A built-in UI for peeking into live internal state. Official troubleshooting guidance calls out zPages as a way to inspect live receiver/exporter behavior. (See Collector troubleshooting.)
pprof
Used to capture CPU/memory profiles. It's especially useful when you need to decide whether a "stuck pipeline" is caused by load (capacity) or configuration (pipeline behavior).
Self-monitoring the Collector
The OTel Collector can emit its own internal telemetry (self-metrics). Collect these metrics with Prometheus (or another metrics backend), then build dashboards and alerts on top. This is the core of pipeline health monitoring.
What to monitor
Exact metric names can vary by distribution and version, but the monitoring dimensions are stable. At minimum, cover the receive and export signals described above: accepted counts and refusals/errors on the receiver side; successful sends, send failures, retries, drops, and queue backlog on the exporter side.
Operational tip: visualize the delta between the "front door" (accepted/received) and the "exit" (sent). If the gap keeps growing, you're accumulating backlog, retrying, or dropping somewhere.
Prevent memory-driven incidents
One of the most common Collector failures is OOM (out-of-memory). Add the memory_limiter processor so the Collector can shed load before it crashes. The official Go package documentation explicitly warns that incoming data can still consume additional memory before the memory limiter can reject it, so you need to design with real headroom, not "set it and forget it." (See memorylimiterprocessor docs.)
Example configuration
This is a minimal setup for health, internal visibility, and self-metrics readiness. Adjust exporters (Prometheus/OTLP/vendor) and authentication for your environment.
extensions:
  health_check: {}
  pprof: {}
  zpages: {}
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  memory_limiter:
    # Tune these values for your environment
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch: {}
exporters:
  debug: {}
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
The key detail is wiring extensions into service.extensions: an extension that is only declared under extensions, but not listed there, is not started. The official docs explain the overall structure (pipelines/receivers/processors/exporters/extensions). (See Collector configuration.)
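The configuration above does not show how the Collector's self-metrics reach a metrics backend. As a minimal sketch, assuming the Collector serves its internal metrics in Prometheus format on port 8888 (a common default, but it depends on your version and service.telemetry settings) and is reachable under the hypothetical hostname otel-collector, a Prometheus scrape job could look like this:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: otel-collector               # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8888"]   # assumed host and internal-metrics port
```

Check the port and path against your distribution before relying on this; if the target shows as down in Prometheus, the internal metrics endpoint is probably bound to a different address.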
Common failure patterns (and what they usually mean)
Receiving increases, but exporting doesn't
Common causes include an outage at the exporter destination, expired credentials, DNS/network failure, or rate limiting. Confirm by correlating exporter retries and queue growth.
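To make queue growth easy to correlate, one option is a Prometheus recording rule that expresses the exporter queue as a fraction of its capacity. This is a sketch, not a definitive rule: the metric names otelcol_exporter_queue_size and otelcol_exporter_queue_capacity are assumptions that should be checked against what your Collector version actually exposes, and the rule name is hypothetical.

```yaml
# Prometheus rules file (fragment)
groups:
  - name: otel-collector-queue
    rules:
      # 0.0 = empty queue, 1.0 = full queue (new data gets rejected or dropped)
      - record: otelcol:exporter_queue_utilization:ratio
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity
```

A ratio that climbs while the receive rate stays flat points at the exporter destination rather than at an ingest spike.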
Drops start increasing
Likely culprits include memory limiting, queue saturation, or a permanent exporter error. Drops mean confirmed data loss, so alert on this with the highest urgency.
Latency keeps rising
Batch settings, sampling, and exporter backpressure all affect end-to-end latency. If you run an SLO, treat "pipeline latency" as a first-class signal.
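As an illustration of the batch settings that influence latency, here is a sketch of an explicitly configured batch processor. The values are illustrative assumptions, not recommendations; the timeout is the setting that most directly bounds how long a partially filled batch waits before being sent.

```yaml
# Collector config (fragment)
processors:
  batch:
    send_batch_size: 8192       # send once this many items are buffered
    send_batch_max_size: 8192   # upper bound on a single batch
    timeout: 200ms              # send whatever is buffered after this long
```

Larger batches and longer timeouts reduce export overhead at the cost of added delay, so tune them against the latency you can tolerate.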
The Collector crashes
Besides OOM, misconfiguration or missing certificates can make it look like "nothing is being received." If you run distroless images (or otherwise lack debugging tools in-container), exposing health_check and zPages is practical insurance for faster triage (field-proven advice). (See this experience-based discussion on Reddit.)
Note: Reddit is anecdotal. It's useful for incident intuition, but validate conclusions against your own metrics and the official docs.
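For crash detection on Kubernetes, the health_check extension can back the container probes. The sketch below assumes the extension listens on port 13133 with the root path, and that it must bind to a non-loopback address for the kubelet to reach it; verify both against your Collector version.

```yaml
# Collector config (fragment): make the health endpoint reachable by the kubelet
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # assumed port; some versions bind to localhost by default
---
# Kubernetes container spec (fragment)
livenessProbe:
  httpGet:
    path: /
    port: 13133
readinessProbe:
  httpGet:
    path: /
    port: 13133
```

These probes only cover the "alive" layer; they stay green while an exporter is failing, so they complement data-flow alerts rather than replace them.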
Alert design
"Alert when it's down" is often too late. Add conditions that detect data-flow degradation early.
Recommended conditions
- Drops: alert immediately if drops are non-zero over a 5-minute window
- Ingress-egress delta: alert if the gap grows continuously for 10 to 15 minutes
- Exporter failure rate: alert if failures exceed a threshold (for example, sustained >1%)
- Memory pressure: alert if memory stays near the limit or GC pressure stays elevated
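The first three conditions can be sketched as Prometheus alerting rules for the traces pipeline. Everything named here is an assumption to verify: the otelcol_* metric names vary by Collector version (some expose a _total suffix), the alert names and severity values are hypothetical, and the drop rule treats queue enqueue failures as a proxy for drops.

```yaml
# Prometheus rules file (fragment)
groups:
  - name: otel-collector-pipeline
    rules:
      - alert: OtelCollectorDroppingSpans
        # Any span rejected by a full exporter queue in the last 5 minutes
        expr: sum(increase(otelcol_exporter_enqueue_failed_spans[5m])) > 0
        labels:
          severity: page
      - alert: OtelCollectorIngressEgressGap
        # Accepted rate stays above sent rate for 15 minutes
        expr: |
          sum(rate(otelcol_receiver_accepted_spans[5m]))
            - sum(rate(otelcol_exporter_sent_spans[5m])) > 0
        for: 15m
        labels:
          severity: ticket
      - alert: OtelCollectorExportFailureRate
        # More than 1% of export attempts fail, sustained for 10 minutes
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans[5m]))
            /
          (sum(rate(otelcol_exporter_sent_spans[5m]))
            + sum(rate(otelcol_exporter_send_failed_spans[5m]))) > 0.01
        for: 10m
        labels:
          severity: page
```

If the pipeline samples or filters data, or fans out to several exporters, the accepted and sent rates will differ by design, so the gap rule needs a threshold that reflects your expected ratio. Metrics and logs pipelines would need equivalent rules on their own counters.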
Split notification targets
Separate "fix now" (on-call) from "improve next business day" (engineering follow-up). In practice, drops and Collector down belong in the former; rising latency and slowly growing queues often belong in the latter.
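One way to implement this split is Alertmanager routing on a label. The sketch assumes your alert rules attach a hypothetical severity label with the values page and ticket; the receiver names are also hypothetical, and each receiver still needs its real notification settings (pager, chat, email).

```yaml
# alertmanager.yml (fragment)
route:
  receiver: engineering-followup     # default: next-business-day queue
  routes:
    - matchers:
        - severity="page"
      receiver: on-call              # "fix now" alerts
receivers:
  - name: on-call
  - name: engineering-followup
```

Defaulting to the follow-up receiver means an alert without a severity label never pages anyone by accident.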
Comparison: monitoring approaches
Here's a practical mapping of common tools for pipeline monitoring by intent.
| Approach | Primary purpose | Strength | Weakness |
|---|---|---|---|
| health_check | Liveness | Lightweight; ideal for LB/probes | Doesn't reveal data loss |
| zPages | Real-time debugging | Lets you inspect live state quickly | Not ideal for continuous monitoring |
| self-metrics | Continuous monitoring | Backbone for dashboards and alerting | Requires design and upkeep |
| pprof | Root-cause analysis | Tracks down CPU/memory causes | Requires operator skill |
Operational pitfalls to avoid
Assuming "the Collector is stable"
The Collector is robust, but if ingest spikes or the export destination slows, it will destabilize unless you've designed for backpressure and memory safety. Scraping workloads often have sharp peaks, so plan around memory protection (memory_limiter) and batching from day one. (See memorylimiterprocessor docs.)
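On the export side, designing for backpressure usually means giving the exporter a bounded queue and a retry policy. The following is a sketch for an OTLP exporter, assuming a hypothetical backend at backend.example.com:4317; the values are illustrative, and the option names should be confirmed against your Collector version.

```yaml
# Collector config (fragment)
exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical destination
    sending_queue:
      enabled: true
      queue_size: 1000          # batches held while the destination is slow
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s    # after this, the data is dropped
```

The queue is bounded on purpose: a larger queue_size buys time during an outage but consumes memory, so size it together with the memory_limiter settings rather than in isolation.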
Fragmented monitoring
If the app is monitored in an APM tool, the Collector in a different dashboard, and the backend somewhere else, triage slows down. Put "ingress → Collector → egress" on the same dashboard so incidents become pattern-matching rather than guesswork.
References
Start with official sources for configuration, debugging, and memory protection.
Summary
- Define OTel health as more than "alive": verify it's receiving and exporting as expected
- Use health_check/zPages/pprof, but make self-metrics the foundation for continuous monitoring
- Alert proactively on drops, ingress-egress deltas, failure rates, and queue/backlog signals