OpenTelemetry Has Won Observability — Here's What To Do With It | AI Plus

The Lock-In Problem Is Solved

For most of the 2010s, choosing an observability vendor meant marrying their instrumentation library. Datadog agents, New Relic SDKs, Honeycomb's Beelines: each one tied your application code to a specific backend. Switching vendors meant re-instrumenting every service. That era is over.

OpenTelemetry is now the de facto standard for distributed tracing, metrics, and logs. AWS, GCP, Azure, Datadog, Honeycomb, Grafana, New Relic, Dynatrace, and every major observability platform support it natively. You instrument once. You route anywhere.

What OpenTelemetry Actually Is

OpenTelemetry is a CNCF project, formed in 2019 from the merger of two competing standards, OpenCensus (Google) and OpenTracing (CNCF), and promoted to CNCF incubating status in 2021. That merger matters because it ended fragmentation. Before OpenTelemetry, you had to pick a side; now there is only one standard.

The project defines three observability signals:

Traces: Distributed request flows across service boundaries, represented as a tree of spans with timing and attributes.
Metrics: Aggregated numerical measurements — counters, gauges, histograms — for dashboards and alerting.
Logs: Structured event records that can be correlated with traces and metrics via shared context (trace IDs, span IDs).

The architecture has two main components: the SDK (embedded in your application) and the Collector (a standalone binary that receives, processes, and exports telemetry). Understanding the separation between these two is the key insight most teams miss.

The Specification vs. the SDK: Why the Separation Matters

OpenTelemetry defines an API specification separate from the SDK implementation. This is deliberate. Library authors can instrument their code against the API without taking a hard dependency on any specific SDK. When your application runs without an SDK configured, those API calls become no-ops with negligible overhead.

When you do configure an SDK, it wires up to the API and begins exporting. The SDK handles batching, retry logic, sampling, and context propagation. The specification defines the contract; the SDK provides the implementation. This architecture is why instrumented libraries cost almost nothing in deployments that never configure an SDK, which matters for library authors shipping to a broad user base.

The practical implication: instrument your libraries against the OTel API only. In applications, write instrumentation against the same API and configure the SDK once at startup. Never let a library take a hard SDK dependency.
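The split is easier to see in a toy model. The sketch below is not the OpenTelemetry API; every name in it (NoOpTracer, RecordingTracer, set_tracer, get_tracer, library_fetch) is hypothetical and only illustrates the pattern: library code talks to a small API surface, and that surface does nothing until the application installs a real implementation.

```python
from contextlib import contextmanager


class NoOpTracer:
    """Default tracer: accepts calls and records nothing."""

    @contextmanager
    def start_span(self, name):
        yield None


class RecordingTracer:
    """Stand-in for an SDK tracer: remembers the names of finished spans."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)


# What library code sees when the application has not configured anything.
_tracer = NoOpTracer()


def set_tracer(tracer):
    """Called once by the application at startup, never by a library."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def library_fetch(url):
    """Library code: depends only on get_tracer(), never on an SDK class."""
    with get_tracer().start_span("library.fetch"):
        return f"fetched {url}"
```

Calling library_fetch before set_tracer does its work and records nothing; after set_tracer(RecordingTracer()) the same call also leaves a finished span behind, without any change to the library.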

Auto-Instrumentation vs. Manual Spans

OpenTelemetry provides language-specific auto-instrumentation that patches popular frameworks and libraries without code changes:

Java: A Java agent (-javaagent:opentelemetry-javaagent.jar) that instruments Spring, Tomcat, gRPC, JDBC, and dozens of other libraries via bytecode manipulation at startup.
Python: opentelemetry-instrument python app.py — instruments Django, Flask, FastAPI, SQLAlchemy, Redis clients, and more automatically.
Node.js: Require the @opentelemetry/auto-instrumentations-node package and it hooks into Express, HTTP, gRPC, MySQL, Postgres, and others via module patching.

Auto-instrumentation gives you infrastructure-level visibility immediately. You will see HTTP request durations, database query latency, external API calls — all the mechanical plumbing — without touching application code.

But auto-instrumentation does not know what your code means. It cannot tell you that the checkout service is slow because of a specific promo code validation rule. For business-logic visibility, you need manual spans:

Wrap any operation that can fail or is latency-sensitive in a custom span.
Add semantic attributes: user ID, order ID, experiment variant, feature flag values.
Record exceptions explicitly with span.recordException(err) and span.setStatus({ code: SpanStatusCode.ERROR }).

The right approach: use auto-instrumentation as the baseline, add manual spans at every decision point that matters to your business. Start with auto, add manual incrementally as you learn what questions you cannot answer.
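What a manual span adds can be sketched without any OTel dependency. The Span class, start_span helper, validate_promo function, and the attribute names order.id and promo.code below are all made up for illustration; a real SDK span carries the same kinds of data (attributes, exception events, status) with a richer interface.

```python
from contextlib import contextmanager


class Span:
    """Minimal stand-in for a span: a name, attributes, events, a status."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.events = []
        self.status = "UNSET"

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def record_exception(self, exc):
        self.events.append(
            {"name": "exception", "type": type(exc).__name__, "message": str(exc)}
        )


@contextmanager
def start_span(name, sink):
    """Open a span, mark it ERROR if the body raises, then hand it to sink."""
    span = Span(name)
    try:
        yield span
    except Exception as exc:
        span.record_exception(exc)
        span.status = "ERROR"
        raise
    finally:
        sink.append(span)


def validate_promo(code, order_id, sink):
    """Business logic wrapped in a custom span with business attributes."""
    with start_span("checkout.validate_promo", sink) as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("promo.code", code)
        if not code.startswith("SAVE"):
            raise ValueError(f"unknown promo code: {code}")
        return 10  # discount percentage
```

With the promo code attached as an attribute, a backend query for errored checkout.validate_promo spans grouped by promo.code points at the failing rule, which is the kind of question auto-instrumentation alone cannot answer.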

The Collector Is the Architectural Lynchpin

Most teams start by sending telemetry directly from their application SDK to a backend. This works but gives up the most valuable feature of the OTel architecture: the Collector pipeline.

The OTel Collector is a vendor-agnostic proxy that sits between your applications and your backends. Configure it with receivers (OTLP, Jaeger, Prometheus, Zipkin), processors (sampling, attribute manipulation, PII scrubbing), and exporters (Datadog, Honeycomb, Tempo, CloudWatch — any combination).

Why this matters in practice:

Fan-out: Send traces to Honeycomb for exploration AND Grafana Tempo for long-term retention simultaneously. No application changes required.
PII scrubbing: Strip or hash sensitive attributes (email addresses, IP addresses, session tokens) before they leave your network perimeter — before any data reaches a vendor.
Sampling decisions: Apply tail-based sampling at the Collector level, keeping errors and slow traces, discarding healthy fast ones.
Backend migration: Switch from Datadog to Grafana by changing a Collector config file. Your application instrumentation is untouched.

Deploy the Collector as a sidecar in Kubernetes or as a DaemonSet. For high-traffic environments, deploy a tiered architecture: sidecar Collectors per pod forwarding to gateway Collectors that handle tail sampling across the full request population.
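The receiver, processor, exporter flow can be modelled in a few lines. This is a conceptual sketch, not Collector code: real pipelines are declared in the Collector's YAML config, and the names here (scrub_pii, run_pipeline, SENSITIVE_KEYS, the session.token attribute) are assumptions chosen for the example.

```python
import hashlib

# Attribute keys treated as sensitive in this example (assumed names).
SENSITIVE_KEYS = {"user.email", "client.address", "session.token"}


def _digest(value):
    """Short SHA-256 digest so values stay groupable but not readable."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]


def scrub_pii(span):
    """Processor: return a copy of the span with sensitive values hashed."""
    cleaned = dict(span)
    cleaned["attributes"] = {
        key: _digest(value) if key in SENSITIVE_KEYS else value
        for key, value in span["attributes"].items()
    }
    return cleaned


def run_pipeline(spans, processors, exporters):
    """Run each span through every processor, then fan out to every exporter."""
    for span in spans:
        for process in processors:
            span = process(span)
        for export in exporters:
            export(span)
```

Passing two exporters, for example the append methods of two lists standing in for Honeycomb and Tempo, delivers the same scrubbed span to both. A plain hash of an email address can still be reversed by guessing, so a production setup would more likely drop the value or use a keyed hash.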

Sampling Strategies: Head-Based vs. Tail-Based

Tracing every request at full fidelity is expensive. At 10,000 requests per second you produce 864 million traces per day, and storing all of them costs real money. Sampling is not optional at scale, but sampling strategy determines what you can actually debug.

Head-based sampling makes the keep/drop decision at the start of a request, before any downstream spans are created. It is simple to implement and has minimal overhead. The problem: you cannot sample based on outcomes you do not yet know. You might drop the one request that failed. At 1% head sampling, you will routinely have no trace for your most interesting bugs.
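A minimal sketch of a head-based decision, assuming 32-character hex trace IDs and a ratio sampler that looks only at the ID. The function names are hypothetical; the approach is in the spirit of trace-ID ratio samplers rather than a copy of any SDK's implementation.

```python
import random


def head_sample(trace_id: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64.

    The decision uses nothing but the trace ID, so it is fixed before the
    request has run and cannot take errors or latency into account.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def random_trace_id(rng: random.Random) -> str:
    """Generate a 128-bit trace ID as 32 lowercase hex characters."""
    return f"{rng.getrandbits(128):032x}"
```

Because the decision depends only on the ID, every service that sees the same trace ID reaches the same verdict. A failing request has the same 1% chance of being kept as any other.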

Tail-based sampling buffers all spans for a trace and makes the decision only after the entire trace is complete. This lets you:

Keep 100% of traces containing any error span.
Keep 100% of traces exceeding a latency threshold (e.g., total duration over 2 seconds).
Keep a configurable percentage of healthy fast traces (e.g., 1% baseline).

The tail_sampling processor, shipped in the Collector's contrib distribution, implements this. Configure policies in YAML: combine error-status policies with latency policies and a probabilistic baseline. The result is a trace corpus that is disproportionately made up of the interesting cases you actually want to debug.

The operational cost: tail-based sampling requires buffering in-flight spans in Collector memory. For meaningful decisions, all spans from a given trace must reach the same Collector instance, typically via trace-ID-based load balancing in front of a Collector pool. This is more infrastructure to operate, but the debuggability improvement is not marginal: with the policies above you keep every errored trace the Collector receives, where 1% head sampling keeps roughly one in a hundred.
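The two moving parts, routing by trace ID and deciding after the trace is complete, can be sketched as follows. This is an illustrative model, not the Collector's implementation: the span dictionaries, field names, and the functions route and keep_trace are assumptions made for the example.

```python
import hashlib


def _hash64(text: str) -> int:
    """Stable 64-bit integer derived from a string."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def route(trace_id: str, collectors: list) -> str:
    """Pick a Collector from the trace ID alone, so every span of a trace
    lands on the same instance."""
    return collectors[_hash64(trace_id) % len(collectors)]


def keep_trace(spans, latency_threshold_ms=2000, baseline_ratio=0.01) -> bool:
    """Decide on a complete, non-empty trace: errors, then latency, then baseline."""
    # Policy 1: keep any trace that contains an error span.
    if any(span["status"] == "ERROR" for span in spans):
        return True
    # Policy 2: keep traces whose end-to-end duration exceeds the threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > latency_threshold_ms:
        return True
    # Policy 3: keep a fixed share of healthy traces, chosen by trace-ID hash.
    return _hash64(spans[0]["trace_id"]) < int(baseline_ratio * (1 << 64))
```

Plain modulo routing reshuffles most traces whenever the pool size changes, which is one reason production setups tend to use consistent hashing for this step.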

The Current State of the Three Signals

OpenTelemetry's signals matured at different rates:

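Traces: GA and stable since 2021. The most mature signal. The HTTP semantic conventions are stable; conventions for other areas such as RPC, databases, and messaging have stabilized on their own schedules, so check the status of each before depending on it. Use this signal first.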
Metrics: GA and stable since 2022. The OTLP metrics protocol is production-ready. Semantic conventions cover HTTP server and client metrics, runtime metrics (JVM, Python runtime), and system metrics.
Logs: Stable since 2023. The log data model and OTLP logs protocol allow structured logs to carry trace context (trace ID, span ID), enabling correlation between logs and traces in backends that support it. Grafana Loki + Tempo correlation is the most mature implementation today.

The semantic conventions are what make cross-signal correlation actually work. When your HTTP span and your HTTP metric use the same attribute names (http.request.method, http.response.status_code, server.address), backends can join them. Adhere to the published semantic conventions — do not invent your own attribute names for standard operations.
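To see why shared names matter, consider a backend that joins a metric data point to spans by attribute values. The sketch below is a simplified assumption about how such a join could work, with hypothetical helper names and example data shapes; the three attribute names are the ones listed above.

```python
JOIN_KEYS = ("http.request.method", "http.response.status_code", "server.address")


def join_key(attributes):
    """Return the tuple of join-attribute values, or None if any is missing."""
    try:
        return tuple(attributes[key] for key in JOIN_KEYS)
    except KeyError:
        return None


def spans_for_metric(metric_point, spans):
    """Find the spans whose join attributes match those of a metric point."""
    key = join_key(metric_point["attributes"])
    if key is None:
        return []
    return [span for span in spans if join_key(span["attributes"]) == key]
```

A span that records the method under a home-grown name such as method has no join key and silently falls out of the result, even though it describes the same request.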

Vendor Landscape in 2026

With instrumentation now vendor-neutral, the choice of backend is about query model, cost, and operational preference:

Honeycomb: The most opinionated and arguably best product for trace exploration. BubbleUp for automatic correlation discovery, high-cardinality column support, and a query model built around arbitrary-width events. Best for teams that want to do sophisticated trace analysis and are willing to pay for a managed product.
Grafana stack (LGTM): Loki (logs) + Grafana (dashboards) + Tempo (traces) + Mimir (metrics). Fully open source, self-hostable, or available managed via Grafana Cloud. The LGTM stack is the right choice for teams that want to own their infrastructure and avoid vendor lock-in entirely. Tempo's trace-to-logs and trace-to-metrics correlations work well when everything uses OTel semantic conventions.
Datadog: Excellent OTel ingestion support via the Datadog agent (which speaks OTLP). Best for teams already on Datadog for APM and infrastructure monitoring who want to standardize on OTel instrumentation while keeping the Datadog UI and alerting. Cost scales steeply with data volume.
AWS CloudWatch: The AWS Distro for OpenTelemetry (ADOT) provides AWS-managed OTel Collectors and deep integration with CloudWatch, X-Ray, and Amazon Managed Grafana. Practical choice for AWS-first teams who want to minimize operational surface area. X-Ray's trace visualization is functional but not as expressive as Honeycomb or Tempo.

What Is Still Hard

OpenTelemetry has not solved everything. Be honest about what remains rough:

Profiling signal: The profiling specification (continuous CPU and memory profiling correlated with traces) is in development but not yet stable as of mid-2026. Expect it to reach GA in 2026 or 2027. Until then, profiling remains vendor-specific.
Business metric correlation: Connecting a slow database query to revenue impact requires joining observability data with business event data. OpenTelemetry does not define how to do this. You need to add business attributes to your spans (order value, user tier, revenue-generating path) and build the analysis yourself in your backend.
Collector configuration complexity: A production OTel Collector config with tail sampling, multiple exporters, PII scrubbing, and attribute transformations can become hundreds of lines of YAML. The Collector has an extensive pipeline model, but the learning curve for complex configurations is real. Use the OTel Collector Builder and test configurations locally with the file exporter before deploying.

Getting Started: Instrument a Service in 5 Steps

For a Node.js service (same pattern applies to Python with equivalent packages):

Step 1: Install packages: npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
Step 2 — Create an instrumentation file (tracing.js): Initialize the NodeSDK with your service name, the auto-instrumentations plugin, and an OTLP HTTP exporter pointing at your Collector endpoint or directly at a vendor's OTLP ingest URL.
Step 3 — Start the SDK before your app: In your entry point, call sdk.start() before requiring any other modules. For Node.js, use --require ./tracing.js in your startup command.
Step 4 — Add manual spans for business logic: Wrap checkout, payment processing, recommendation queries — anything with business significance — in custom spans. Add attributes for order ID, user segment, and experiment flags.
Step 5 — Deploy a Collector sidecar: Run the OTel Collector alongside your service, configured to receive OTLP on localhost:4318 and export to your chosen backend. This decouples backend configuration from application deployment.

Actionable Decision Framework

Here is how to make the key decisions without overthinking them:

Signal priority: Implement traces first, then metrics, then logs. Traces give you the most debugging leverage per unit of instrumentation effort. Logs are valuable but you likely already have them — focus on connecting them to traces via trace context injection.
Backend selection: If you are self-hosting, use the LGTM Grafana stack. If you want managed with excellent UX for trace analysis, use Honeycomb. If you are already on Datadog for infra monitoring, standardize on OTel instrumentation and keep Datadog as the backend. Do not optimize for backend choice early — the point of OTel is that you can change your mind.
Collector from day one: Even if you only have one backend today, deploy the Collector. The cost is minimal; the flexibility payoff when you add a second backend or need to change vendors is significant.
Sampling policy: Start with head-based sampling at 10–20% if you need to control costs immediately. Plan to migrate to tail-based sampling once you have a Collector pool — the improvement in error visibility is worth the operational complexity.
Semantic conventions: Enforce them. Add a lint step or CI check that validates your custom span attribute names against the OTel semantic conventions registry. Consistency now means cross-signal correlation later, and it means any new backend you adopt will understand your data without transformation.
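A CI check for attribute names can start very small. The sketch below assumes a hand-maintained allowlist and a company prefix for custom attributes; KNOWN_CONVENTIONS is a tiny illustrative subset rather than the real registry, and acme. is a hypothetical namespace.

```python
import re

# Illustrative subset; a real check would load the published registry.
KNOWN_CONVENTIONS = {
    "http.request.method",
    "http.response.status_code",
    "server.address",
}
CUSTOM_PREFIX = "acme."  # hypothetical company namespace for custom attributes
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")


def lint_attribute_names(names):
    """Return one message per attribute name that breaks the naming rules."""
    problems = []
    for name in names:
        if name in KNOWN_CONVENTIONS:
            continue
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: expected lowercase dot-separated name")
        elif not name.startswith(CUSTOM_PREFIX):
            problems.append(
                f"{name}: not a known convention; custom names must start "
                f"with {CUSTOM_PREFIX}"
            )
    return problems
```

Run over the attribute names found in a codebase, this flags both style violations such as httpMethod and outdated names such as http.method, which the current HTTP conventions replaced with http.request.method. Failing the build when the list is non-empty is one way to wire it into CI.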

The observability vendor wars are over. The instrumentation problem is solved. What remains is the operational discipline of deploying it correctly, configuring your pipeline for cost and fidelity, and building the organizational habits around trace-driven debugging. Those habits — going to traces first when something is slow, annotating deployments as trace attributes, writing runbooks that reference specific span attributes — are what separate teams that get value from observability from teams that just have dashboards.
