Operations Guide
Integration Observability: SLIs, Alerts, and Ownership
How to measure, monitor, and own integration reliability. Covers service level indicators, alerting strategies, runbooks, and operational ownership models for enterprise teams.
1. Why integration observability matters
Integrations are the connective tissue between business systems. When they fail silently, the effects cascade: sales can't see customer support tickets, finance doesn't get accurate revenue data, and operations makes decisions on stale information.
Observability transforms integrations from black boxes into understood, measurable components of your infrastructure. You can answer questions like:
- Is data flowing as expected?
- How fresh is the data in our destination systems?
- What's the error rate, and which errors need immediate attention?
- Are we meeting our SLAs?
See Wallace AI for how ThreadSync provides integration observability.
2. Key SLIs for integration platforms
Service Level Indicators (SLIs) are the metrics you measure. Service Level Objectives (SLOs) are the targets you set. Here are the essential SLIs for integration platforms:
Latency
Definition: Time from when data changes in the source to when it's available in the destination.
Example SLO: p95 latency < 5 minutes for CRM-to-warehouse sync
Data freshness
Definition: Age of the most recent data in the destination system.
Example SLO: Maximum data age < 1 hour for operational syncs
Error rate
Definition: Percentage of sync operations that fail.
Example SLO: Error rate < 0.1% over any 24-hour window
Throughput
Definition: Volume of records processed per time period.
Example SLO: Sustained throughput > 10,000 records/minute
Queue depth
Definition: Number of pending events waiting to be processed.
Example SLO: Queue depth < 1,000 for P1 integrations
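The SLIs above can be computed from raw sync events. Here is a minimal sketch in Python, assuming each sync attempt is recorded with a source-change timestamp, a destination-landing timestamp, and a success flag (the `SyncEvent` shape and `sli_report` function are illustrative, not a ThreadSync API):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class SyncEvent:
    source_changed_at: datetime    # when the record changed in the source
    landed_at: Optional[datetime]  # when it became queryable in the destination (None on failure)
    succeeded: bool

def sli_report(events: list, now: datetime) -> dict:
    """Compute p95 latency, data freshness, and error rate over a window of events."""
    ok = [e for e in events if e.succeeded and e.landed_at is not None]
    latencies = sorted(
        (e.landed_at - e.source_changed_at).total_seconds() for e in ok
    )
    # Nearest-rank p95; a real metrics backend would use histograms instead.
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    freshness = (now - max(e.landed_at for e in ok)).total_seconds() if ok else None
    error_rate = 1 - len(ok) / len(events) if events else 0.0
    return {"p95_latency_s": p95, "freshness_s": freshness, "error_rate": error_rate}
```

Each value in the report maps directly to one of the SLOs above: compare `p95_latency_s` against the latency SLO, `freshness_s` against the freshness SLO, and so on.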
3. Alerting strategies
Effective alerting requires balancing coverage (catching real issues) with noise (avoiding alert fatigue). Here's a tiered approach:
P1: Page immediately
- Complete integration failure (no data flowing)
- Error rate > 10% for > 5 minutes
- Data freshness > 2x SLO
- Security-critical integration down
P2: Notify during business hours
- Error rate > 1% for > 15 minutes
- Latency > 2x baseline
- Queue depth growing unexpectedly
- Approaching rate limits
P3: Log for review
- Individual record failures (below threshold)
- Minor latency variations
- Schema drift warnings
- Capacity planning signals
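The tiers above can be encoded as an ordered rule check so every alert carries a consistent severity. A hedged sketch (the thresholds are this guide's examples; the function and parameter names are illustrative):

```python
from typing import Optional

def classify_alert(
    error_rate: float,          # fraction of failing syncs, e.g. 0.12 = 12%
    error_duration_min: float,  # how long the error rate has been elevated
    freshness_s: float,         # current data age in the destination
    freshness_slo_s: float,     # the freshness SLO for this integration
    data_flowing: bool = True,
) -> Optional[str]:
    """Return 'P1', 'P2', or 'P3' per the tiers above, or None if no alert."""
    if not data_flowing:
        return "P1"                                   # complete integration failure
    if error_rate > 0.10 and error_duration_min > 5:
        return "P1"
    if freshness_s > 2 * freshness_slo_s:
        return "P1"
    if error_rate > 0.01 and error_duration_min > 15:
        return "P2"
    if error_rate > 0:
        return "P3"                                   # individual failures: log for review
    return None
```

Checking rules from most to least severe guarantees a single, unambiguous tier per evaluation.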
Alert hygiene practices
- Review and tune thresholds quarterly
- Track alert-to-incident ratio (aim for > 50%: more than half of alerts should correspond to real incidents)
- Require runbook links in alert definitions
- Set up alert aggregation to prevent storms
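Alert aggregation, the last practice above, can be as simple as suppressing repeats of the same alert key within a cooldown window. A minimal sketch (the class name and default window are illustrative):

```python
from datetime import datetime, timedelta

class AlertAggregator:
    """Fire at most one alert per (integration, rule) key per cooldown window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)):
        self.window = window
        self._last_fired: dict = {}

    def should_fire(self, key: tuple, now: datetime) -> bool:
        prev = self._last_fired.get(key)
        if prev is not None and now - prev < self.window:
            return False  # suppressed: the same alert fired recently
        self._last_fired[key] = now
        return True
```

Keying on (integration, rule) rather than on individual records is what turns a thousand-record failure into one page instead of a storm.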
4. Ownership models
Who owns integration reliability? Clear ownership prevents the "not my problem" dynamic that lets incidents linger while everyone assumes someone else is handling them.
Centralized model
Who: Platform/integration team owns all integrations
Best for: Organizations with many similar integrations, strong platform team
Trade-off: Can become a bottleneck; domain knowledge gaps
Federated model
Who: Business teams own their integrations; platform team provides tooling
Best for: Organizations with diverse integration needs, strong domain teams
Trade-off: Inconsistent practices; governance challenges
Hybrid model
Who: Platform team owns infrastructure and critical integrations; business teams own domain-specific integrations
Best for: Most enterprises; balances expertise and scalability
Trade-off: Requires clear handoff processes
Ownership documentation
For each integration, document:
- Owner: Team and individual responsible
- Escalation path: Who to contact if owner unavailable
- Business criticality: P1/P2/P3
- SLO commitments: What uptime/freshness is expected
- Runbook location: Where to find troubleshooting docs
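This documentation is easiest to keep current when it lives as machine-checkable metadata next to the integration itself. A sketch of a validator, assuming each ownership record is a plain dict (the field names are illustrative):

```python
REQUIRED_FIELDS = {"owner", "escalation_path", "criticality", "slo", "runbook_url"}

def validate_ownership(record: dict) -> None:
    """Raise ValueError if an integration's ownership record is incomplete."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing ownership fields: {sorted(missing)}")
    if record["criticality"] not in {"P1", "P2", "P3"}:
        raise ValueError("criticality must be P1, P2, or P3")
```

Running a check like this in CI means an integration cannot ship without a named owner, a criticality tier, and a runbook link.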
5. Runbooks and incident response
Runbooks turn tribal knowledge into documented procedures. Every integration should have a runbook covering common failure scenarios.
Runbook structure
- Symptoms: What does this failure look like?
- Impact: What business processes are affected?
- Diagnosis steps: How to determine root cause
- Resolution steps: How to fix (with commands)
- Verification: How to confirm fix worked
- Escalation: When to escalate and to whom
- Post-incident: What to document after resolution
Common integration failure scenarios
- Authentication failure: Token expired, credentials rotated
- Rate limiting: Exceeded source/destination API limits
- Schema change: Source schema changed unexpectedly
- Network issue: Connectivity between systems interrupted
- Data quality: Invalid data causing downstream failures
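A sync client can map raw errors onto these scenarios so that alerts, dashboards, and runbooks all speak the same vocabulary. A sketch keyed on HTTP status codes (the mapping is illustrative; real classification depends on each API's error format):

```python
from typing import Optional

def classify_failure(status_code: Optional[int], error_body: str = "") -> str:
    """Map a failed sync call to one of the scenarios above."""
    body = error_body.lower()
    if status_code in (401, 403):
        return "authentication_failure"  # token expired, credentials rotated
    if status_code == 429:
        return "rate_limited"            # exceeded API limits: back off and retry
    if status_code == 400 and "unknown field" in body:
        return "schema_change"           # source schema changed unexpectedly
    if status_code is None or status_code in (502, 503, 504):
        return "network_issue"           # connectivity interrupted (None = no response)
    if status_code == 422:
        return "data_quality"            # invalid data rejected downstream
    return "unknown"
```

Emitting the scenario name as a label on the failure metric lets each runbook's diagnosis section start from a pre-classified cause.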
Incident timeline
- Detection: Alert fires or user reports issue
- Triage: Assess severity, engage owner
- Diagnosis: Follow runbook, identify root cause
- Mitigation: Restore service, even if temporary fix
- Resolution: Implement permanent fix
- Review: Post-incident review and documentation
See Solutions for Operations and Support for how ThreadSync helps with incident response.
Need better integration visibility?
See how ThreadSync provides observability for enterprise integrations.
