01

Introduction

There’s a persistent illusion in software engineering: that logs and metrics are enough to understand what’s happening in production. For simple monolithic applications, maybe that’s true.

But once you’re operating in a world of microservices, asynchronous calls and distributed data flows, that illusion starts to break down.

We learned that the hard way…

02

It started with a simple complaint

“The service is too slow.”

Not a crash. Not a high error rate. Just persistent latency on one specific endpoint.

At first, it didn’t seem alarming. But the issue wouldn’t go away. This particular route was consistently underperforming, and we couldn’t tell why. The endpoint wasn’t isolated: it relied on a database and several other services, so the slowdown could be coming from anywhere in the chain.

We did what we always do:

  • Checked the logs
  • Monitored our service metrics through Grafana dashboards

They confirmed the delay. But they didn’t explain it. We had symptoms, not a diagnosis.

03

Why Traditional Observability Falls Short in Distributed Systems

Our architecture isn’t monolithic. It’s composed of many loosely coupled services that talk to each other. And yet, our tooling was still assuming a single box, a single stack trace, a single scope.

Here’s the core problem: Traditional tools show what happens inside a service. Not what happens across services.

That means:

[Figure: Three gaps in traditional tooling: logs, metrics, and time visibility across services.]

Trying to debug a latency issue in this environment with only logs and metrics is like trying to understand a relay race by watching just one runner.

We needed a way to see the full picture, to trace requests as they moved through our entire system. That’s when we discovered distributed tracing wasn’t just helpful, it was essential.

04

What we really needed

In a distributed system like ours, built on a microservices architecture, a single operation often involves multiple services calling one another, creating a chain of requests. You send one request, and suddenly you’re calling several services, each with its own little mission.

Debugging latency in such a chain without the right tools is like finding a needle in a haystack. We needed something that could stitch together the entire request lifecycle and highlight any performance bottlenecks.

We didn’t just need more data, we needed better visibility.

That’s when we turned to distributed tracing.
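To make "stitching" concrete: distributed tracing works by passing a shared trace ID along with every outgoing call, so that spans recorded by different services can be joined later. The sketch below imitates that idea using the W3C Trace Context `traceparent` header format, which OpenTelemetry propagates by default. The helper names (`new_traceparent`, `child_traceparent`) are hypothetical and exist only for illustration; real instrumentation libraries do this for you.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace at the edge of the system (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)    # 16 hex chars, unique to this span
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace in a downstream service (hypothetical helper)."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed or missing header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    # Same trace ID, new span ID: this is what links the services together.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Service A receives a request and calls service B, which calls service C.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
header_c = child_traceparent(header_b)

for name, header in (("A", header_a), ("B", header_b), ("C", header_c)):
    print(name, header)
```

All three headers carry the same trace ID but different span IDs, which is what lets a tracing backend reassemble one request out of spans reported by separate services.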

05

Choosing the Right Tool for the Job

Once we knew we needed distributed tracing, the next question was: how do we actually implement it without overhauling everything?

Writing custom tracing logic was not a viable option: we wanted visibility without disrupting existing functionality or introducing extensive code changes.

06

Evaluating Observability Solutions

We set out to evaluate observability tools with a clear goal in mind: find something that made tracing easy, both to implement and to interpret.

Here’s what we looked for:

  • Ease of integration with minimal code changes
  • Support for open standards and multi-language environments
  • Trace visualization and diagnostics
  • Compatibility with our existing stack
  • Long-term maintenance

Here’s a comparison of the tools we considered:

[Figure: Tool comparison]

07

A Stack That Fit Without Disruption

After evaluating our options, we landed on a self-hosted, open, and flexible observability stack that checked all the boxes and, most importantly, didn’t require us to refactor everything.

Our setup included:

  • Elastic APM Server for collecting and correlating application traces
  • Kibana for visualizing traces and creating dashboards
  • OpenTelemetry zero-code instrumentation to capture traces without touching application code

[Figure: Four-step observability pipeline]

08

Leveraging OpenTelemetry’s Zero-Code Instrumentation

What really stood out was OpenTelemetry’s zero-code instrumentation. It gave us full visibility across our microservices, all without changing a single line of application logic.

By simply configuring the OpenTelemetry agent, we could capture detailed request traces spanning services, databases, and third-party dependencies. This was a game-changer in our setup, where any component could be the source of a slowdown.

We specifically leveraged Python’s OpenTelemetry auto-instrumentation agent, which dynamically instruments supported libraries at runtime, no manual tracing logic needed.
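Conceptually, "instrumenting at runtime" means wrapping a library’s functions when the process starts, so that every call is timed and recorded without the application knowing. The snippet below is a deliberately simplified illustration of that patching idea, not how OpenTelemetry’s agent is actually implemented; `FakeHttpClient`, `instrument`, and `RECORDED_SPANS` are made-up names.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter (assumption)


class FakeHttpClient:
    """Pretend third-party library that the application calls directly."""

    def get(self, url: str) -> str:
        time.sleep(0.01)  # simulate network latency
        return f"response from {url}"


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that records a timing span."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            # Record the span even if the wrapped call raised.
            RECORDED_SPANS.append({
                "name": f"{cls.__name__}.{method_name}",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    setattr(cls, method_name, wrapper)


# The "agent" patches the library once at startup...
instrument(FakeHttpClient, "get")

# ...and the application code below stays exactly as it was.
body = FakeHttpClient().get("http://inventory/items")
print(body, RECORDED_SPANS)
```

The application line at the bottom never mentions tracing, yet a span is recorded for it. That separation is the property we were after.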

Here’s how we set it up for our FastAPI services:

Step 1: Install Required Packages

We installed the OpenTelemetry distro package, which pulls in the API, the SDK, and the auto-instrumentation tools, together with the OTLP exporter package:

Bash
pip install opentelemetry-distro opentelemetry-exporter-otlp

Then, we ran the bootstrap command to automatically install instrumentation packages matching our installed dependencies:

Bash
opentelemetry-bootstrap -a install

Step 2: Configure Environment Variables

We configured the OpenTelemetry agent via environment variables that specify the service name, the trace exporter, the resource attributes, and the OTLP endpoint to send data to:

Bash
export OTEL_SERVICE_NAME=my-service
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://apm-server:8200
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,project.name=project-name
export OTEL_PYTHON_TRACES_INSTRUMENTATION_ENABLED=True

Step 3: Run the FastAPI Application with Auto-Instrumentation

We launched the FastAPI app wrapped by the OpenTelemetry instrumentation agent. This automatically patches supported libraries, such as FastAPI, HTTP clients, and database drivers, at runtime without modifying our code:

Bash
opentelemetry-instrument python main.py

09

Integrating OpenTelemetry with APM Server for Full Observability

With OpenTelemetry capturing trace data across our microservices, the next step was clear: feed this data into our monitoring stack and make it usable. That meant collecting, processing, and visualizing traces efficiently, and this is where Elastic APM Server came in.

Luckily, our Elastic Stack was already up and running, with APM Server deployed and connected to Elasticsearch. So our focus shifted entirely to making sure the instrumented services could send their traces to APM with minimal effort.

Thanks to OpenTelemetry’s support for OTLP, all it took was a single environment variable:

Bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://apm-server:8200

And just like that, our OpenTelemetry-instrumented services were sending trace data directly to APM Server.
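If traces don’t show up, it helps to know which URL the exporter actually targets. As we understand the OpenTelemetry exporter specification, when OTLP is sent over HTTP the base endpoint gets `/v1/traces` appended, while a signal-specific `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` is used as given (over gRPC, the base endpoint is used directly). The function below is a hypothetical sketch of that rule for the HTTP case, not the SDK’s own code.

```python
def resolve_traces_url(env: dict) -> str:
    """Sketch of how an OTLP/HTTP exporter picks the URL for trace data."""
    # A signal-specific endpoint wins and is used as given.
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific
    # Otherwise the base endpoint gets the trace path appended.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    return base.rstrip("/") + "/v1/traces"


print(resolve_traces_url({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://apm-server:8200"}))
# prints: http://apm-server:8200/v1/traces
```

Passing `dict(os.environ)` instead of the literal dictionary would show the URL for the current shell’s settings.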

10

Visualizing Traces in Kibana

Once trace data is flowing into the APM Server, it is indexed in Elasticsearch and can be visualized via Kibana’s APM dashboard. Kibana provides a rich UI for exploring distributed traces, service maps, and latency breakdowns, enabling quick identification of bottlenecks and errors.

Key features of Kibana’s APM dashboard include:

  • Service overview: Displays performance metrics, error rates, and throughput for each instrumented service.
  • Transaction traces: Detailed timing and span data for individual requests, showing where time is spent across the call path.
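To illustrate what "where time is spent" means, here is a rough sketch of the arithmetic behind a latency breakdown: each span’s own time is its duration minus the time spent in its children. The spans, service names, and timings below are invented for the example and are not data from our incident.

```python
from collections import defaultdict

# Made-up spans for one request; times are milliseconds since the request began.
SPANS = [
    {"id": "a", "parent": None, "service": "api-gateway", "name": "GET /orders", "start": 0, "end": 480},
    {"id": "b", "parent": "a", "service": "orders", "name": "list_orders", "start": 10, "end": 470},
    {"id": "c", "parent": "b", "service": "postgres", "name": "SELECT orders", "start": 20, "end": 60},
    {"id": "d", "parent": "b", "service": "pricing", "name": "POST /quote", "start": 70, "end": 450},
]


def self_times(spans):
    """Return {span id: time spent in the span itself, excluding its children}.

    Assumes a span's children run one after another without overlapping.
    """
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["end"] - span["start"]
    return {s["id"]: (s["end"] - s["start"]) - child_total[s["id"]] for s in spans}


times = self_times(SPANS)
slowest = max(SPANS, key=lambda s: times[s["id"]])
for span in SPANS:
    print(f'{span["service"]:<12} {span["name"]:<14} self={times[span["id"]]} ms')
print("bottleneck:", slowest["service"], slowest["name"])
```

In this invented trace the gateway and the orders service look slow end to end, but almost all of the 480 ms belongs to the downstream pricing call. Per-service logs and metrics tend to hide exactly this kind of attribution.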

With this setup, we finally had full observability, from the moment a request entered the system to the last line of code it touched.

11

Conclusion: From Fixing Latency to Facing the Future

What began as a single performance complaint turned into a deep dive into our system’s blind spots and a much-needed push toward better observability. Distributed tracing gave us the visibility we lacked, helping us pinpoint bottlenecks that logs alone could never reveal. With OpenTelemetry and APM Server working in tandem, we didn’t just fix the issue, we built the foundation for a more resilient system.

But observability doesn’t stop once the dashboards light up.

As our systems grow in complexity and our architecture continues to evolve, so must our approach to tracing and monitoring. And now, a new paradigm is emerging on the horizon: AI agents. No longer experimental, these agents are executing autonomous tasks, making decisions, and orchestrating workflows across entire platforms.

This shift introduces a whole new level of opacity. Unlike traditional services, AI agents don’t always follow predictable paths. They learn, adapt, and behave differently with each execution. And that makes the question of observability even more urgent:

How do we trace something that evolves as it runs?

To tackle this, the concept of observability is expanding. Beyond metrics, logs, and traces, we now need telemetry that can capture the intent, behavior, and impact of AI-driven systems.

That’s the next chapter. And just like before, it starts with a question and the curiosity to follow the traces wherever they lead.