Incident Response Best Practices & Tools
Introduction
Your website just went down. Your monitoring system sent out alerts 47 seconds ago. But your team has no idea who should respond, what they should check, or who needs to be notified.
For the next 23 minutes, your team is running around, making phone calls, checking things randomly, and hoping someone figures out the problem before customers get too angry.
This chaos is what happens without proper incident management processes. Even the best monitoring system is useless if your team doesn’t know how to respond when an alert fires.
Incident management transforms your team from reactive firefighters to coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved faster.
What is Incident Management?
Incident management is a set of documented processes and tools that guide your team’s response when something goes wrong with your website or application.
It answers these questions:
- Who gets notified when an incident occurs?
- What’s the priority of different types of incidents?
- Who leads the response effort?
- What are the steps to diagnose and fix the problem?
- Who communicates to affected customers?
- How do we learn from this incident to prevent it next time?
An incident management process ensures that instead of chaos, there’s order. Instead of guessing, there’s clarity.
Types of Incidents
Severity Levels
Classifying incidents by severity determines response urgency:
Severity 1 – Critical (respond immediately):
- Website completely down (HTTP 500, unable to connect, etc.)
- Core functionality unavailable (checkout not working, login broken, etc.)
- Revenue-impacting issues
- Security incidents
- Multiple users affected simultaneously
- Response time target: 5 minutes to acknowledge, 15 minutes to diagnose
- Resolution time target: < 30 minutes
- Notifications: Immediate (phone calls, SMS, alerts)
Severity 2 – High (respond within 30 minutes):
- Partial functionality broken (some features unavailable but core works)
- Performance degradation (>50% slower than baseline)
- API errors affecting a small percentage of requests
- Single region/user affected
- Response time target: 30 minutes to acknowledge
- Resolution time target: < 2 hours
- Notifications: Email, Slack, standard alerting
Severity 3 – Medium (respond within 4 hours):
- Minor functionality issues
- Small performance degradation
- UI/UX issues not affecting core functionality
- Issues affecting < 0.1% of users
- Response time target: 4 hours
- Resolution time target: < 24 hours
- Notifications: Email, daily report
Severity 4 – Low (handle during business hours):
- Feature requests tracked as incidents
- Documentation issues
- Non-urgent improvements
- Can wait for next planned maintenance
- Response time target: Next business day
- Resolution time target: < 1 week
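These definitions are easier to enforce when they live in code as well as in a document, so alerting automation and dashboards share one source of truth. Here's a minimal Python sketch encoding the targets listed above; the structure and field names are illustrative, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    acknowledge_within: timedelta    # response time target
    resolve_within: timedelta        # resolution time target
    notify_via: tuple                # notification channels

# Targets copied from the severity definitions above.
SEVERITY_POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=5), timedelta(minutes=30),
                      ("phone", "sms", "alert")),
    2: SeverityPolicy("High", timedelta(minutes=30), timedelta(hours=2),
                      ("email", "slack")),
    3: SeverityPolicy("Medium", timedelta(hours=4), timedelta(hours=24),
                      ("email", "daily_report")),
    4: SeverityPolicy("Low", timedelta(days=1), timedelta(weeks=1),
                      ("email",)),
}

def policy_for(severity: int) -> SeverityPolicy:
    """Look up the response targets for a given severity level."""
    return SEVERITY_POLICIES[severity]
```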
Incident Management Roles and Responsibilities
Incident Commander
The incident commander is the leader who coordinates the entire response:
Before an incident:
- Aware of common failure scenarios
- Trained in incident management procedures
- Has escalation authority
- Understands business impact of outages
During an incident:
- Declared when a Severity 1 or other critical issue occurs
- Coordinates all response activities
- Makes triage decisions (stop investigating, go straight to fix?)
- Communicates with stakeholders
- Ensures no critical steps are missed
- Stays focused on resolution, not blame
- Maintains timeline of events
After an incident:
- Leads post-incident review
- Documents lessons learned
- Ensures action items are assigned
Who serves as incident commander:
- On-call engineer (rotates daily/weekly)
- DevOps lead
- Principal engineer
- Never: Someone on vacation, someone new to the team, someone without authority
Engineering Team / Responders
Engineers who diagnose and fix the problem:
Responsibilities:
- Investigate root cause
- Implement fix or workaround
- Test fix before deploying
- Validate fix resolved the issue
- Report findings to incident commander
Not responsible for:
- Deciding severity level (that’s incident commander’s job)
- External communication (incident commander does that)
- Running post-incident reviews (the incident commander leads the scheduled review)
DevOps / Infrastructure Team
Infrastructure team handles deployment, rollbacks, and infrastructure issues:
Responsibilities:
- Deploy fixes to production
- Rollback problematic deployments
- Scale infrastructure (if capacity issue)
- Restart services (if hang/crash)
- Check infrastructure monitoring
- Verify no infrastructure-level issues
Communications Lead
Person responsible for external communication:
Responsibilities:
- Update status page
- Notify customers (if incident is public)
- Prepare communication templates
- Provide updates to stakeholders
- Handle customer support escalations
Not responsible for:
- Technical decisions (focus on communication accuracy)
- Fixing the issue (focus on transparency)
On-Call Rotation
Establish who is on-call for every hour of the week:
Example rotation:
- Engineer A: Monday-Wednesday
- Engineer B: Wednesday-Friday
- Engineer C: Friday-Monday
- Repeats weekly
On-call responsibilities:
- Available to respond to critical alerts 24/7
- Acknowledge incidents within 5 minutes
- Investigate root cause or escalate
On-call compensation:
- Paid on-call stipend ($300-1000/week)
- Paid time off for incidents after hours
- Night incident = day off next day
- Rotation frequency: about one week on-call in every three (for a 3-engineer team)
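A rotation like the example above can be computed instead of maintained by hand, which removes any "who is on-call right now?" ambiguity. A minimal Python sketch assuming handoffs at the start of Wednesday, Friday, and Monday; the names and handoff times are placeholders:

```python
from datetime import date

# Weekday -> on-call engineer, following the example rotation above
# (Monday=0 ... Sunday=6; handoff assumed at the start of the day).
ROTATION = {
    0: "Engineer A",  # Monday
    1: "Engineer A",  # Tuesday
    2: "Engineer B",  # Wednesday handoff
    3: "Engineer B",  # Thursday
    4: "Engineer C",  # Friday handoff
    5: "Engineer C",  # Saturday
    6: "Engineer C",  # Sunday
}

def on_call_for(day: date) -> str:
    """Return who is on-call for a given calendar day."""
    return ROTATION[day.weekday()]

print(on_call_for(date.today()))
```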
Incident Response Process
Phase 1: Detection (Automated)
Monitoring system detects issue:
```text
10:23:47 - Monitoring system detects website returns HTTP 500
10:23:48 - Alert fires to incident commander
10:23:49 - SMS, Slack, email sent to on-call engineer
10:23:50 - Incident created in incident tracking system
```
Time to phase completion: ~3 seconds (automated)
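Detection is normally your monitoring tool's job, but the automation behind it is simple enough to sketch. A minimal Python example that checks one URL and opens a Sev 1 incident when the check fails; `create_incident` is a hypothetical stand-in for whatever your tracking and paging system exposes:

```python
import requests

def create_incident(summary: str, severity: int) -> None:
    """Hypothetical hook into your incident tracker / paging system."""
    print(f"[SEV {severity}] {summary}")

def check_site(url: str) -> None:
    """One uptime check: open a Sev 1 incident on server error or timeout."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code >= 500:
            create_incident(f"{url} returned HTTP {response.status_code}", severity=1)
    except requests.RequestException as exc:
        create_incident(f"{url} unreachable: {exc}", severity=1)

check_site("https://example.com")
```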
Phase 2: Acknowledgment (5 minutes)
On-call engineer acknowledges incident:
```text
10:24:15 - Engineer receives SMS and wakes up
10:24:45 - Engineer acknowledges incident ("I'm on it")
10:25:00 - Incident commander notified
10:25:15 - Customer communication prepared ("We're investigating")
```
Time to phase completion: 5 minutes max
What happens if there's no acknowledgment?
- Alert escalates to the backup engineer (team lead, manager)
- Escalates again after 3 minutes if there's still no response
- Eventually reaches the manager on call
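The escalation path is just data plus a timeout loop. A minimal Python sketch mirroring the 5-minute and 3-minute windows above; `page()` and `is_acknowledged()` are hypothetical helpers for your paging provider and incident tracker:

```python
import time

# Who to page, and how long to wait before escalating to the next step.
ESCALATION_CHAIN = [
    ("on-call engineer",    5 * 60),
    ("backup engineer",     3 * 60),
    ("engineering manager", 3 * 60),
]

def page(who: str, incident_id: str) -> None:
    """Hypothetical: send SMS / push / phone call via your paging provider."""
    print(f"Paging {who} about {incident_id}")

def is_acknowledged(incident_id: str) -> bool:
    """Hypothetical: ask the incident tracker whether anyone acknowledged."""
    return False

def escalate(incident_id: str) -> None:
    """Walk the escalation chain until someone acknowledges the incident."""
    for who, wait_seconds in ESCALATION_CHAIN:
        page(who, incident_id)
        time.sleep(wait_seconds)
        if is_acknowledged(incident_id):
            return
    print(f"No acknowledgment for {incident_id}; keep paging the manager on call.")
```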
Phase 3: Triage (5-15 minutes)
Quick assessment of situation:
Questions answered:
- What’s broken? (service, feature, entire site?)
- How many users affected? (single user, region, everyone?)
- What’s the root cause? (code, infrastructure, third-party?)
- Can we fix it quickly (< 15 min) or do we need a workaround?
- Do we need to involve other teams?
Triage decision:
- Go straight to fix: Issue is obvious, fix is known
- Investigate further: Need more information to diagnose
- Implement workaround: Can’t fix now, implement temporary solution
- Escalate: Need help from other team/specialist
Example triage:
```text
10:25 - Alert: HTTP 500 errors on checkout API
10:26 - Engineer checks logs: "Database connection pool exhausted"
10:27 - Engineer checks database: All connections in use
10:28 - Engineer finds recent deploy increased connection usage
10:29 - Decision: Rollback deploy (quick fix), investigate later
```
Time to phase completion: 5-15 minutes
Phase 4: Mitigation (15-30 minutes)
Stop the bleeding, even if it’s not the real fix:
Options:
- Rollback recent deployment
- Kill runaway process
- Increase capacity (scale up)
- Switch to degraded mode (disable non-critical features)
- Failover to backup system
- Implement traffic limits (prevent cascade)
The goal here isn't to find the root cause yet, just to stop the damage.
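The "switch to degraded mode" option above can be as simple as a feature flag that request handlers consult. A minimal Python sketch; the flag store and feature names are assumptions (in production the flags would live in a shared store so every instance sees the change):

```python
# In-memory feature flags for illustration only.
FEATURE_FLAGS = {
    "recommendations": True,
    "search_autocomplete": True,
}

def disable_feature(name: str) -> None:
    """Mitigation: switch off a non-critical feature to shed load."""
    FEATURE_FLAGS[name] = False

def render_homepage() -> str:
    """The core page keeps working even with extras disabled."""
    parts = ["product listing"]
    if FEATURE_FLAGS["recommendations"]:
        parts.append("personalized recommendations")
    return ", ".join(parts)

disable_feature("recommendations")  # during the incident
print(render_homepage())            # core functionality still serves
```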
Example mitigation:
```text
10:29 - Engineer initiates rollback of 10:15 deployment
10:31 - Database connection pool recovers
10:32 - HTTP 500 errors stop
10:33 - Site partially recovered (some features disabled)
10:34 - Customer communication: "Issue mitigated, investigating full cause"
```
Time to phase completion: 5-15 minutes
Phase 5: Root Cause Investigation (30+ minutes)
Now investigate what actually caused the problem:
Questions answered:
- Why did this happen?
- When did it start?
- Is it code, infrastructure, configuration, or third-party?
- Could this happen again?
Investigation process:
- Review logs from time of incident
- Check recent deployments/changes
- Check infrastructure metrics (CPU, memory, connections)
- Check third-party service status pages
- Interview users about what they were doing when it failed
- Correlate the timeline of events
Example investigation:
```text
Review deploy at 10:15:
- Changed database connection pooling settings
- Increased concurrent connections from 100 to 150
- New code wasn't closing connections properly
- Quickly hit 150 connection limit
- New queries couldn't get connections
- Database returned errors
- Application crashed with HTTP 500

Root cause: Improper connection handling in new code, exacerbated by increased pool size limit.
```
Time to phase completion: 30+ minutes (depends on complexity)
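For the connection-pool scenario above, "check the database" can be a direct query against the server's own statistics. A minimal Python sketch assuming PostgreSQL and the psycopg2 driver; the connection string is a placeholder:

```python
import psycopg2

# Placeholder DSN: point this at the affected database.
conn = psycopg2.connect("dbname=app user=readonly host=db.internal")
try:
    with conn.cursor() as cur:
        # Connection count broken down by state
        # (active, idle, idle in transaction, ...).
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
        for state, count in cur.fetchall():
            print(f"{state}: {count}")
finally:
    conn.close()
```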
Phase 6: Permanent Fix (varies)
Implement actual solution to prevent recurrence:
Options:
- Revert problematic code
- Fix the code properly
- Update configuration
- Increase infrastructure capacity
- Implement circuit breakers or throttling
Example permanent fix:
```text
In code review: Find the issue in the connection handling code
Fix: Ensure connections are always closed (finally block)
Test: Verify fix with load test that previously failed
Deploy: Roll out fix to production
Verify: Monitor metrics post-deployment
```
Time to phase completion: 30 minutes to several hours (depends on severity and fix complexity)
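The fix described above, making sure connections always go back to the pool, usually comes down to a `finally` block or a context manager. A minimal Python sketch assuming a psycopg2-style pool (`getconn`/`putconn`); the query and table are illustrative:

```python
from contextlib import contextmanager

@contextmanager
def borrowed_connection(pool):
    """Yield a pooled connection and guarantee it is returned, even on error."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)  # always release, even if the query raised

def fetch_order(pool, order_id):
    with borrowed_connection(pool) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
```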
Phase 7: Communication
Ongoing customer communication throughout incident:
Template updates:
```text
10:24:00 - "We're investigating elevated error rates. More updates in 5 minutes."
10:29:00 - "We've identified the issue and are implementing a fix."
10:32:00 - "The issue has been mitigated. We're investigating the root cause."
10:40:00 - "We've identified the root cause: improper connection handling in a recent deployment."
10:50:00 - "We've deployed a permanent fix and are monitoring closely."
11:00:00 - "Issue fully resolved. See blog post for details."
```
Phase 8: Post-Incident Review (Within 24 hours)
Team reviews incident to learn and improve:
What to discuss:
- Timeline of events
- What went well
- What went poorly
- Root cause (not blame)
- Action items to prevent recurrence
- Process improvements
Action items assigned:
- Engineer A: Add monitoring for connection pool exhaustion
- Engineer B: Add circuit breaker for database failures
- Engineer C: Improve load test to catch connection issues
- DevOps: Increase database connection pool capacity
Incident Management Tools
Alerting and Notification
PagerDuty:
- Routes alerts to on-call engineer
- Escalates if not acknowledged
- Tracks incident timeline
- Integrates with monitoring tools
- Schedules on-call rotations
Incident.io:
- Simpler PagerDuty alternative
- Automated incident declaration
- Slack integration
- Postmortems
- Trending by root cause
Built-in monitoring tools:
- Most monitoring tools have basic alerting
- Sufficient for small teams
- Limited escalation/rotation features
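Most monitoring tools can forward alerts to these services out of the box. If you need to wire it up yourself, PagerDuty's Events API v2 accepts a small JSON payload; a minimal Python sketch, with the routing key as a placeholder from your PagerDuty service integration:

```python
import requests

def trigger_pagerduty(summary: str, source: str, severity: str = "critical") -> None:
    """Send a trigger event to PagerDuty's Events API v2."""
    payload = {
        "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # critical, error, warning, or info
        },
    }
    response = requests.post("https://events.pagerduty.com/v2/enqueue",
                             json=payload, timeout=10)
    response.raise_for_status()

trigger_pagerduty("Checkout API returning HTTP 500", source="uptime-monitor")
```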
Status Pages
Atlassian Statuspage (statuspage.io):
- Public-facing status page
- Automated incident updates
- Component status tracking
- Scheduled maintenance notifications
- Analytics on uptime/incidents
- Higher tiers add more customization and better reporting
Self-hosted:
- Some teams build custom status pages
- More control but requires maintenance
Communication
Slack:
- #incidents channel for incident discussion
- Integrates with PagerDuty/monitoring
- Real-time team communication
- Searchable incident history
Email/SMS:
- Critical for out-of-band communication
- Reaches people without Slack
- Backup communication method
Incident Management Best Practices
1. Declare Incidents Early
Bad: Wait until you’re certain it’s a real incident before declaring
Good: Declare immediately when severity 1 criteria met, even if you’re not 100% sure
Why: Getting team together quickly can resolve issues faster. False alarms are better than slow response to real incidents.
2. Timebox Investigation
Bad: Spend 2+ hours investigating root cause while customers are frustrated
Good:
- First 30 minutes: Mitigation focus
- Next 30 minutes: Initial diagnosis
- Then: Longer investigation if needed
Why: Users care about resolution, not perfect root cause. Fix the problem, investigate thoroughly after.
3. Assign Clear Owner
Bad: Team of engineers all working on incident with no clear owner, duplicating effort
Good: Incident commander assigns specific tasks to specific people
Why: Accountability, avoids duplication, ensures nothing falls through cracks.
4. Update Status Page Regularly
Bad: First update 30 minutes after incident, then nothing for 2 hours
Good: Update status page every 15-30 minutes with any progress, even “still investigating”
Why: Customers are less frustrated when they know you’re actively working on it.
5. Have Runbooks for Common Issues
Runbook: Step-by-step procedure for diagnosing/fixing common problems
Examples:
- “Database connection exhaustion”
- “High memory usage causing slowness”
- “API rate limit hit from third-party service”
- “SSL certificate expiration”
- “CDN cache invalidation issue”
Value: New/junior engineers can fix issues without deep expertise. Reduces MTTR significantly.
6. Blameless Post-Incident Reviews
Bad: “John deployed bad code, we’re removing his deployment access”
Good: “The code review process didn’t catch the issue. Let’s add load testing to code review.”
Why: Focus on systems and processes, not individuals. Engineers need to feel safe admitting mistakes so incidents are reported quickly.
7. Track Metrics Over Time
Metrics to track:
- MTTD (Mean Time To Detection): How quickly we detect issues
- MTTR (Mean Time To Resolution): How quickly we fix issues
- MTBF (Mean Time Between Failures): How often incidents occur
- Incident cost: Revenue lost per incident
Goal: Improve each metric quarter over quarter
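These metrics fall out of timestamps you are already capturing in your incident tracker. A minimal Python sketch computing them from a list of incident records; the field names and sample data are illustrative, and MTTR is measured here from detection to resolution:

```python
from datetime import datetime, timedelta

# Sample incident records: when each started, was detected, and was resolved.
incidents = [
    {"started":  datetime(2024, 1, 3, 10, 20),
     "detected": datetime(2024, 1, 3, 10, 23),
     "resolved": datetime(2024, 1, 3, 10, 55)},
    {"started":  datetime(2024, 1, 19, 2, 0),
     "detected": datetime(2024, 1, 19, 2, 4),
     "resolved": datetime(2024, 1, 19, 3, 10)},
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([i["detected"] - i["started"] for i in incidents])
mttr = mean([i["resolved"] - i["detected"] for i in incidents])
# MTBF: average gap between the starts of consecutive incidents.
gaps = [b["started"] - a["started"] for a, b in zip(incidents, incidents[1:])]
mtbf = mean(gaps) if gaps else None

print(f"MTTD: {mttd}  MTTR: {mttr}  MTBF: {mtbf}")
```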
Common Incident Management Mistakes
Mistake 1: No On-Call Rotation
Problem: Everyone is on-call, which means no one feels responsible
Solution: Explicit rotation where one person is on-call and has clear responsibilities
Mistake 2: No Escalation Path
Problem: Engineer doesn’t respond to alert, nobody else knows to take over
Solution: Alert escalates to backup engineer after 5 minutes, to manager after 10 minutes
Mistake 3: Vague Severity Levels
Problem: Everyone argues about whether incident is Sev 1 or Sev 2 while customers wait
Solution: Clear, objective criteria for severity levels
Mistake 4: No Runbooks
Problem: Each engineer diagnoses issue differently, takes different time to resolve
Solution: Standard procedures documented for common issues
Mistake 5: No Status Page
Problem: Customers have no idea what’s happening, assume worst
Solution: Public status page updated every 15 minutes during incidents
Conclusion
Incident management transforms your team from reactive, panicked firefighters into coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved 2-5x faster.
The best incident management system includes:
- Clear roles and responsibilities
- Documented response procedures
- On-call rotation with escalation
- Monitoring/alerting to detect issues
- Status page for customer communication
- Post-incident reviews to prevent recurrence
Start with the basics: define severity levels, establish an on-call rotation, create a runbook for your most common incidents. As your maturity increases, add PagerDuty, status pages, and more sophisticated processes.
The ROI is clear: a single incident prevented, or resolved faster, can pay for all of your incident management tools and processes.
Ready to implement incident management? CheckMe.dev integrates with PagerDuty and Incident.io, and includes detailed incident logs and reporting. Know immediately when an incident occurs, understand exactly what failed, and resolve issues faster. Start your free trial today.


