
Enforcing Per-Step SLOs in DAG Workflows with OpenTelemetry Spans

By Riley

Why per-step SLOs are hard in DAG-based internal workflows

When internal workflows become DAGs (data pipelines, approval flows, enrichment jobs, backfills, report builds), teams often track only the overall runtime and failure rate. That’s useful, but it hides the real problem: one “slow” step can quietly eat most of the budget and degrade the whole workflow even if the final status is technically “success.”

Per-step SLOs solve that by making each node accountable. The catch is that many teams assume they need a custom orchestrator or a large platform-engineering project to enforce step budgets. In practice, you can get most of the value by using OpenTelemetry spans as the enforcement primitive: every step emits spans with consistent naming and attributes, and your tooling evaluates those spans against budgets.

Model a workflow step as an enforceable span

The key idea is simple: treat each DAG node as a span boundary, not just a log boundary. That means:

  • One trace per workflow run (the entire DAG execution).
  • One span per step (each node in the DAG).
  • Span attributes that carry the enforcement context (step name, workflow name, environment, retry count, tenant, etc.).

In OpenTelemetry terms, a step span should have a stable, queryable identity. A practical convention is:

  • Root span name: the workflow identifier (e.g., workflow.invoice_reconcile). Traces themselves are unnamed in OpenTelemetry, so the root span carries the run's identity.
  • Span name: step identifier (e.g., step.fetch_ledger).
  • Attributes: workflow.name, step.name, run.id, attempt, dag.node_id, service.name, deployment.environment.

This gives you a durable contract: dashboards, alerts, and SLO evaluation can key off attributes instead of brittle string parsing.

Define per-step SLOs as budgets, not vibes

Per-step SLOs should be written like budgets you can enforce:

  • Latency objective: e.g., p95 < 2s for step.fetch_ledger in production.
  • Error objective: e.g., < 0.5% span error rate for step.enrich_customer.
  • Freshness/timeout objective (optional): e.g., hard timeout at 30s for a supplier API call step.

Two practical rules keep this from turning into an unmaintainable spreadsheet:

  • Start with the steps that dominate runtime or incident load, not every node.
  • Separate “steady-state” SLOs from “backfill” or “bulk” modes using attributes like workflow.mode.
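One way to keep budgets enforceable rather than aspirational is to store them as data. A minimal sketch (the registry shape and the step names are illustrative assumptions, not a real Windmill schema):

```python
from typing import Optional

# Declarative budget registry, keyed by (workflow, step, mode) so
# steady-state and backfill runs get different objectives.
STEP_SLOS = {
    ("invoice_reconcile", "fetch_ledger", "steady"): {
        "p95_latency_s": 2.0,
        "max_error_rate": 0.005,
    },
    ("invoice_reconcile", "fetch_ledger", "backfill"): {
        "p95_latency_s": 30.0,
        "max_error_rate": 0.02,
    },
}

def budget_for(workflow: str, step: str, mode: str = "steady") -> Optional[dict]:
    """Look up a step's budget; None means 'no SLO defined yet', which is
    expected for nodes you have deliberately not onboarded."""
    return STEP_SLOS.get((workflow, step, mode))
```

Returning None for unlisted steps encodes the first rule above: only the steps that dominate runtime or incidents carry budgets at all.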

Instrumenting steps with OpenTelemetry spans

You do not need a custom orchestrator to emit spans. You need consistent instrumentation in the code that runs each node. If your workflow engine runs scripts, containers, or functions, each unit of work can create a span at the top of the step and close it at completion.

At a minimum, each step span should capture:

  • Start/end time (automatic in spans).
  • Status: OK vs ERROR.
  • Failure reason as an attribute (sanitized), plus an event for the exception type.
  • Retry metadata: attempt number, whether it’s a retry, and upstream dependency info.

One subtle but important point: retries can distort percentiles if you don’t model them carefully. Consider two approaches:

  • Single span per logical step with events for retries (good for “user-visible” latency).
  • One span per attempt with attempt attribute (good for diagnosing flaky dependencies).

You can support both by nesting: a parent step span and child attempt spans.

Enforcement patterns without building a custom orchestrator

“Enforce” can mean several things operationally. OpenTelemetry spans let you implement enforcement as policy around execution, not as new scheduling software.

1) Fast feedback during execution with timeouts

The simplest enforcement is a hard timeout per step. If a step’s SLO is “must finish in 10s,” the code can enforce a 10s timeout and mark the span as ERROR on timeout. That prevents slow degradation from consuming downstream capacity.

2) Post-run gating and automatic triage

Some workflows shouldn’t fail just because a step missed its p95 budget once. Instead, evaluate spans after the run and choose an action:

  • Open an incident or page if the burn rate is high for that step.
  • Quarantine the workflow version if a new deploy caused systematic regression.
  • Create an auto-ticket with the top slow spans and their attributes.

This is where span attributes pay off: you can automatically bucket regressions by step, tenant, dependency, or environment.
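The evaluation itself can be simple once spans are exported. A stdlib-only sketch, assuming you materialize one step's finished spans as (duration, ok) records from your trace backend:

```python
from statistics import quantiles

def evaluate_step(records, p95_budget_s, max_error_rate):
    """Compare one step's finished spans against its budget.

    `records` is a list of (duration_s, ok) tuples for a single step.
    Returns a list of human-readable breaches; empty means within budget.
    """
    durations = sorted(d for d, _ in records)
    # statistics.quantiles with n=20 yields 19 cut points; index 18 is p95.
    p95 = quantiles(durations, n=20, method="inclusive")[18]
    error_rate = sum(1 for _, ok in records if not ok) / len(records)
    breaches = []
    if p95 > p95_budget_s:
        breaches.append(f"p95 {p95:.2f}s > budget {p95_budget_s}s")
    if error_rate > max_error_rate:
        breaches.append(f"error rate {error_rate:.3f} > {max_error_rate}")
    return breaches
```

The returned breach list is what you route to the action: page, ticket, or quarantine, depending on severity and burn rate.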

3) Dependency-aware budgets in DAGs

DAGs introduce a unique problem: some steps are allowed to be slow only if upstream steps are fast (or vice versa). You can express that through trace structure:

  • Critical path analysis: use spans to compute which steps dominate the end-to-end runtime.
  • Queue vs execution time: split spans into “queued” and “running” so you don’t punish a step for worker saturation.

This avoids blaming the wrong node when the real issue is scheduling or resource contention.
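The queue-vs-execution split reduces to simple timestamp arithmetic; a sketch (field names are illustrative, and each component would be emitted as a child span or span attribute):

```python
def split_queue_and_run(scheduled_at: float, started_at: float, finished_at: float):
    """Split a step's wall time into queue wait vs actual execution.

    Emit these as two child spans (e.g. step.x.queued / step.x.running) or
    as attributes on the step span, so worker saturation is attributed to
    scheduling rather than to the step's own code.
    """
    return {
        "queued_s": started_at - scheduled_at,
        "running_s": finished_at - started_at,
    }
```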

How Windmill fits naturally into this approach

Windmill is designed around DAG workflows and production monitoring, so it’s a natural place to standardize step instrumentation and SLO evaluation without building yet another orchestration layer. With a code-first workflow model and deep observability, you can keep the enforcement logic close to the steps themselves while still centralizing how you view and alert on spans.

If your team already exports traces and metrics, integrating workflow execution with OpenTelemetry-friendly conventions makes it much easier to build consistent SLO reporting across many scripts and services. Windmill also supports exporting to OpenTelemetry and Prometheus, which helps you keep vendor choice open while still enforcing step-level budgets in one workflow system. The project home is windmill.dev.

Operational details that make per-step SLOs actually work

Naming and cardinality discipline

Span attributes are powerful, but high-cardinality fields can explode cost and reduce signal. Keep step.name and workflow.name stable, and be cautious with raw user IDs or unbounded payload identifiers. If you need per-tenant breakdowns, use a controlled tenant.id that’s expected and limited.
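A tiny sketch of the "controlled tenant.id" idea (the allowlist is an illustrative assumption; the point is that the attribute's value set stays bounded):

```python
ALLOWED_TENANTS = {"acme", "globex", "initech"}  # bounded, reviewed set

def controlled_tenant_id(raw_tenant: str) -> str:
    """Bound attribute cardinality: known tenants pass through, everything
    else collapses into a single 'other' bucket."""
    return raw_tenant if raw_tenant in ALLOWED_TENANTS else "other"
```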

Separate “step correctness” from “step performance”

A step can be correct but slow, or fast but wrong. Use spans for performance and reliability signals, and pair them with application-level checks (counts, invariants, row deltas) when correctness matters.

Alerts tied to actions

A per-step SLO is only useful if it changes what happens next. Make the action explicit: page, auto-rollback, throttle, route to a different worker group, or open a ticket with trace links. If the action is unclear, the SLO will become noise.

Two workflow maintainability patterns worth borrowing

Per-step SLOs also influence how you design the DAG. Branching steps with different performance characteristics should be explicit and named, not hidden in one “do_everything” node. If you’re refining how you structure these branches, the article on branching logic patterns to keep no-code workflows maintainable is a useful companion.

And if you’re dealing with urgent operational work triggered by step regressions, having a lightweight triage system matters as much as the tracing. The post on avoiding the priority inversion backlog trap maps well to how SLO burn alerts can otherwise hijack your roadmap.

What you get by treating spans as the enforcement layer

By elevating spans from “nice to have” observability to “the contract” for step budgets, you gain a shared language across teams: performance regressions are traceable to specific DAG nodes, SLO ownership becomes concrete, and enforcement can be implemented with timeouts, gates, and targeted alerts—without building a custom orchestrator just to answer “which step broke the budget?”
