Operations Guide

Integration Observability: SLIs, Alerts, and Ownership

How to measure, monitor, and own integration reliability. Covers service level indicators, alerting strategies, runbooks, and operational ownership models for enterprise teams.

1. Why integration observability matters

Integrations are the connective tissue between business systems. When they fail silently, the effects cascade: sales can't see customer support tickets, finance doesn't get accurate revenue data, and operations makes decisions on stale information.

Observability transforms integrations from black boxes into understood, measurable components of your infrastructure. You can answer questions like:

  • Is data flowing as expected?
  • How fresh is the data in our destination systems?
  • What's the error rate, and which errors need immediate attention?
  • Are we meeting our SLAs?

See Wallace AI for how ThreadSync provides integration observability.

2. Key SLIs for integration platforms

Service Level Indicators (SLIs) are the metrics you measure. Service Level Objectives (SLOs) are the targets you set. Here are the essential SLIs for integration platforms:

Latency

Definition: Time from when data changes in the source to when it's available in the destination.

Example SLO: p95 latency < 5 minutes for CRM-to-warehouse sync

Data freshness

Definition: Age of the most recent data in the destination system.

Example SLO: Maximum data age < 1 hour for operational syncs

Error rate

Definition: Percentage of sync operations that fail.

Example SLO: Error rate < 0.1% over any 24-hour window

Throughput

Definition: Volume of records processed per time period.

Example SLO: Sustained throughput > 10,000 records/minute

Queue depth

Definition: Number of pending events waiting to be processed.

Example SLO: Queue depth < 1,000 for P1 integrations
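To make the SLIs above concrete, here is a minimal sketch of computing error rate, data freshness, and throughput from a list of sync records. The record shape (`completed_at`, `succeeded`, `record_count`) and the synthetic data are assumptions for illustration, not an actual platform API.

```python
from datetime import datetime, timedelta

# Hypothetical sync history: one record per sync operation.
# Tuple shape (completed_at, succeeded, record_count) is illustrative.
now = datetime(2024, 1, 1, 12, 0)
syncs = [
    (now - timedelta(minutes=m), m % 7 != 0, 500)  # every 7th sync fails
    for m in range(1, 61)
]

def error_rate(syncs):
    """Fraction of sync operations that failed."""
    failures = sum(1 for _, ok, _ in syncs if not ok)
    return failures / len(syncs)

def freshness(syncs, now):
    """Age of the most recent *successful* sync (data freshness SLI)."""
    latest = max(t for t, ok, _ in syncs if ok)
    return now - latest

def throughput(syncs, window):
    """Records successfully processed per minute over the window."""
    total = sum(n for t, ok, n in syncs if ok and now - t <= window)
    return total / (window.total_seconds() / 60)
```

Each function maps directly to one SLI definition above; in practice these aggregations would run in your metrics pipeline rather than in application code.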

3. Alerting strategies

Effective alerting requires balancing coverage (catching real issues) with noise (avoiding alert fatigue). Here's a tiered approach:

P1: Page immediately

  • Complete integration failure (no data flowing)
  • Error rate > 10% for > 5 minutes
  • Data freshness > 2x SLO
  • Security-critical integration down

P2: Notify during business hours

  • Error rate > 1% for > 15 minutes
  • Latency > 2x baseline
  • Queue depth growing unexpectedly
  • Approaching rate limits

P3: Log for review

  • Individual record failures (below threshold)
  • Minor latency variations
  • Schema drift warnings
  • Capacity planning signals
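The tiered thresholds above can be expressed as a simple classifier. This is a sketch covering a subset of the conditions listed (the function name and metric inputs are hypothetical; real alerting would live in your monitoring system's rule language):

```python
def classify_alert(error_rate, error_minutes, freshness_ratio, data_flowing):
    """Map observed conditions to an alert tier using the thresholds above.

    error_rate      -- fraction of failing syncs (0.0-1.0)
    error_minutes   -- how long the error rate has been elevated
    freshness_ratio -- observed data age divided by the freshness SLO
    data_flowing    -- False means complete integration failure
    """
    if not data_flowing:
        return "P1"  # complete failure: page immediately
    if error_rate > 0.10 and error_minutes > 5:
        return "P1"  # error rate > 10% for > 5 minutes
    if freshness_ratio > 2.0:
        return "P1"  # data freshness > 2x SLO
    if error_rate > 0.01 and error_minutes > 15:
        return "P2"  # error rate > 1% for > 15 minutes
    return "P3"      # below thresholds: log for review
```

Encoding the tiers as code (or as declarative alert rules) keeps the quarterly threshold reviews honest: the thresholds live in one place instead of scattered across dashboards.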

Alert hygiene practices

  • Review and tune thresholds quarterly
  • Track alert-to-incident ratio (aim for > 50%)
  • Require runbook links in alert definitions
  • Set up alert aggregation to prevent storms

4. Ownership models

Who owns integration reliability? Clear ownership prevents the "not my problem" dynamic that causes incidents to escalate slowly.

Centralized model

Who: Platform/integration team owns all integrations

Best for: Organizations with many similar integrations, strong platform team

Trade-off: Can become a bottleneck; domain knowledge gaps

Federated model

Who: Business teams own their integrations; platform team provides tooling

Best for: Organizations with diverse integration needs, strong domain teams

Trade-off: Inconsistent practices; governance challenges

Hybrid model

Who: Platform team owns infrastructure and critical integrations; business teams own domain-specific integrations

Best for: Most enterprises; balances expertise and scalability

Trade-off: Requires clear handoff processes

Ownership documentation

For each integration, document:

  • Owner: Team and individual responsible
  • Escalation path: Who to contact if owner unavailable
  • Business criticality: P1/P2/P3
  • SLO commitments: What uptime/freshness is expected
  • Runbook location: Where to find troubleshooting docs

5. Runbooks and incident response

Runbooks turn tribal knowledge into documented procedures. Every integration should have a runbook covering common failure scenarios.

Runbook structure

  1. Symptoms: What does this failure look like?
  2. Impact: What business processes are affected?
  3. Diagnosis steps: How to determine root cause
  4. Resolution steps: How to fix (with commands)
  5. Verification: How to confirm fix worked
  6. Escalation: When to escalate and to whom
  7. Post-incident: What to document after resolution

Common integration failure scenarios

  • Authentication failure: Token expired, credentials rotated
  • Rate limiting: Exceeded source/destination API limits
  • Schema change: Source schema changed unexpectedly
  • Network issue: Connectivity between systems interrupted
  • Data quality: Invalid data causing downstream failures
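Of the scenarios above, rate limiting is usually handled automatically with retries rather than a runbook. A minimal sketch of exponential backoff with jitter, assuming a hypothetical `sync_once` callable that raises on transient errors (`RuntimeError` stands in for whatever rate-limit exception your client raises):

```python
import random
import time

def sync_with_backoff(sync_once, max_attempts=5, base_delay=1.0):
    """Retry a sync call with exponential backoff and jitter.

    Waits base_delay * 2^attempt seconds (randomized +/-50%) between
    attempts, re-raising the error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return sync_once()
        except RuntimeError:  # stand-in for a rate-limit / transient error
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter matters: without it, many workers that were throttled at the same moment retry at the same moment and get throttled again.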

Incident timeline

  1. Detection: Alert fires or user reports issue
  2. Triage: Assess severity, engage owner
  3. Diagnosis: Follow runbook, identify root cause
  4. Mitigation: Restore service, even if the fix is temporary
  5. Resolution: Implement permanent fix
  6. Review: Post-incident review and documentation

See Solutions for Operations and Support for how ThreadSync helps with incident response.

Need better integration visibility?

See how ThreadSync provides observability for enterprise integrations.

Request Demo | Explore Wallace AI