Operations Guide
Integration Observability: SLIs, Alerts, and Ownership
How to measure, monitor, and own integration reliability. Covers service level indicators, alerting strategies, runbooks, and operational ownership models for enterprise teams.
1. Why integration observability matters
Integrations are the connective tissue between business systems. When they fail silently, the effects cascade: sales can't see customer support tickets, finance doesn't get accurate revenue data, and operations makes decisions on stale information.
Observability transforms integrations from black boxes into understood, measurable components of your infrastructure. You can answer questions like:
- Is data flowing as expected?
- How fresh is the data in our destination systems?
- What's the error rate, and which errors need immediate attention?
- Are we meeting our SLAs?
See Wallace AI for how ThreadSync provides integration observability.
2. Key SLIs for integration platforms
Service Level Indicators (SLIs) are the metrics you measure. Service Level Objectives (SLOs) are the targets you set. Here are the essential SLIs for integration platforms:
Latency
Definition: Time from when data changes in the source to when it's available in the destination.
Example SLO: p95 latency < 5 minutes for CRM-to-warehouse sync
Data freshness
Definition: Age of the most recent data in the destination system.
Example SLO: Maximum data age < 1 hour for operational syncs
Error rate
Definition: Percentage of sync operations that fail.
Example SLO: Error rate < 0.1% over any 24-hour window
Throughput
Definition: Volume of records processed per time period.
Example SLO: Sustained throughput > 10,000 records/minute
Queue depth
Definition: Number of pending events waiting to be processed.
Example SLO: Queue depth < 1,000 for P1 integrations
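The SLIs above can be computed from raw sync events. Here is a minimal sketch in Python, assuming each sync attempt is recorded with a source-change timestamp, a destination-landing timestamp, and a success flag (the `SyncEvent` shape and `sli_report` function are illustrative, not a ThreadSync API):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class SyncEvent:
    source_changed_at: datetime    # when the record changed in the source
    landed_at: Optional[datetime]  # when it became queryable in the destination (None on failure)
    succeeded: bool

def sli_report(events: list, now: datetime) -> dict:
    """Compute p95 latency, data freshness, and error rate over a window of events."""
    ok = [e for e in events if e.succeeded and e.landed_at is not None]
    latencies = sorted(
        (e.landed_at - e.source_changed_at).total_seconds() for e in ok
    )
    # Nearest-rank p95; a real metrics backend would use histograms instead.
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    freshness = (now - max(e.landed_at for e in ok)).total_seconds() if ok else None
    error_rate = 1 - len(ok) / len(events) if events else 0.0
    return {"p95_latency_s": p95, "freshness_s": freshness, "error_rate": error_rate}
```

Each value in the report maps directly to one of the SLOs above: compare `p95_latency_s` against the latency SLO, `freshness_s` against the freshness SLO, and so on.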
3. Alerting strategies
Effective alerting requires balancing coverage (catching real issues) with noise (avoiding alert fatigue). Here's a tiered approach:
P1: Page immediately
- Complete integration failure (no data flowing)
- Error rate > 10% for > 5 minutes
- Data freshness > 2x SLO
- Security-critical integration down
P2: Notify during business hours
- Error rate > 1% for > 15 minutes
- Latency > 2x baseline
- Queue depth growing unexpectedly
- Approaching rate limits
P3: Log for review
- Individual record failures (below threshold)
- Minor latency variations
- Schema drift warnings
- Capacity planning signals
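The tiers above can be encoded as an ordered rule check so every alert carries a consistent severity. A hedged sketch (the thresholds are this guide's examples; the function and parameter names are illustrative):

```python
from typing import Optional

def classify_alert(
    error_rate: float,          # fraction of failing syncs, e.g. 0.12 = 12%
    error_duration_min: float,  # how long the error rate has been elevated
    freshness_s: float,         # current data age in the destination
    freshness_slo_s: float,     # the freshness SLO for this integration
    data_flowing: bool = True,
) -> Optional[str]:
    """Return 'P1', 'P2', or 'P3' per the tiers above, or None if no alert."""
    if not data_flowing:
        return "P1"                                   # complete integration failure
    if error_rate > 0.10 and error_duration_min > 5:
        return "P1"
    if freshness_s > 2 * freshness_slo_s:
        return "P1"
    if error_rate > 0.01 and error_duration_min > 15:
        return "P2"
    if error_rate > 0:
        return "P3"                                   # individual failures: log for review
    return None
```

Checking rules from most to least severe guarantees a single, unambiguous tier per evaluation.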
Alert hygiene practices
- Review and tune thresholds quarterly
- Track alert-to-incident ratio (aim for > 50%: more than half of alerts should correspond to real incidents)
- Require runbook links in alert definitions
- Set up alert aggregation to prevent storms
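Alert aggregation, the last practice above, can be as simple as suppressing repeats of the same alert key within a cooldown window. A minimal sketch (the class name and default window are illustrative):

```python
from datetime import datetime, timedelta

class AlertAggregator:
    """Fire at most one alert per (integration, rule) key per cooldown window."""

    def __init__(self, window: timedelta = timedelta(minutes=30)):
        self.window = window
        self._last_fired: dict = {}

    def should_fire(self, key: tuple, now: datetime) -> bool:
        prev = self._last_fired.get(key)
        if prev is not None and now - prev < self.window:
            return False  # suppressed: the same alert fired recently
        self._last_fired[key] = now
        return True
```

Keying on (integration, rule) rather than on individual records is what turns a thousand-record failure into one page instead of a storm.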
4. Ownership models
Who owns integration reliability? Clear ownership prevents the "not my problem" dynamic that lets incidents linger while everyone assumes someone else is handling them.
Centralized model
Who: Platform/integration team owns all integrations
Best for: Organizations with many similar integrations, strong platform team
Trade-off: Can become a bottleneck; domain knowledge gaps
Federated model
Who: Business teams own their integrations; platform team provides tooling
Best for: Organizations with diverse integration needs, strong domain teams
Trade-off: Inconsistent practices; governance challenges
Hybrid model
Who: Platform team owns infrastructure and critical integrations; business teams own domain-specific integrations
Best for: Most enterprises; balances expertise and scalability
Trade-off: Requires clear handoff processes
Ownership documentation
For each integration, document:
- Owner: Team and individual responsible
- Escalation path: Who to contact if owner unavailable
- Business criticality: P1/P2/P3
- SLO commitments: What uptime/freshness is expected
- Runbook location: Where to find troubleshooting docs
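This documentation is easiest to keep current when it lives as machine-checkable metadata next to the integration itself. A sketch of a validator, assuming each ownership record is a plain dict (the field names are illustrative):

```python
REQUIRED_FIELDS = {"owner", "escalation_path", "criticality", "slo", "runbook_url"}

def validate_ownership(record: dict) -> None:
    """Raise ValueError if an integration's ownership record is incomplete."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing ownership fields: {sorted(missing)}")
    if record["criticality"] not in {"P1", "P2", "P3"}:
        raise ValueError("criticality must be P1, P2, or P3")
```

Running a check like this in CI means an integration cannot ship without a named owner, a criticality tier, and a runbook link.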
5. Runbooks and incident response
Runbooks turn tribal knowledge into documented procedures. Every integration should have a runbook covering common failure scenarios.
Runbook structure
- Symptoms: What does this failure look like?
- Impact: What business processes are affected?
- Diagnosis steps: How to determine root cause
- Resolution steps: How to fix (with commands)
- Verification: How to confirm fix worked
- Escalation: When to escalate and to whom
- Post-incident: What to document after resolution
Common integration failure scenarios
- Authentication failure: Token expired, credentials rotated
- Rate limiting: Exceeded source/destination API limits
- Schema change: Source schema changed unexpectedly
- Network issue: Connectivity between systems interrupted
- Data quality: Invalid data causing downstream failures
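A sync client can map raw errors onto these scenarios so that alerts, dashboards, and runbooks all speak the same vocabulary. A sketch keyed on HTTP status codes (the mapping is illustrative; real classification depends on each API's error format):

```python
from typing import Optional

def classify_failure(status_code: Optional[int], error_body: str = "") -> str:
    """Map a failed sync call to one of the scenarios above."""
    body = error_body.lower()
    if status_code in (401, 403):
        return "authentication_failure"  # token expired, credentials rotated
    if status_code == 429:
        return "rate_limited"            # exceeded API limits: back off and retry
    if status_code == 400 and "unknown field" in body:
        return "schema_change"           # source schema changed unexpectedly
    if status_code is None or status_code in (502, 503, 504):
        return "network_issue"           # connectivity interrupted (None = no response)
    if status_code == 422:
        return "data_quality"            # invalid data rejected downstream
    return "unknown"
```

Emitting the scenario name as a label on the failure metric lets each runbook's diagnosis section start from a pre-classified cause.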
Incident timeline
- Detection: Alert fires or user reports issue
- Triage: Assess severity, engage owner
- Diagnosis: Follow runbook, identify root cause
- Mitigation: Restore service, even if temporary fix
- Resolution: Implement permanent fix
- Review: Post-incident review and documentation
See Solutions for Operations and Support for how ThreadSync helps with incident response.
Need better integration visibility?
See how ThreadSync provides observability for enterprise integrations.
