Incident Response Best Practices & Tools
Introduction
Your website just went down. Your monitoring system sent out alerts 47 seconds ago. But your team has no idea who should respond, what they should check, or who needs to be notified.
For the next 23 minutes, your team is running around, making phone calls, checking things randomly, and hoping someone figures out the problem before customers get too angry.
This chaos is what happens without proper incident management processes. Even the best monitoring system is useless if your team doesn’t know how to respond when an alert fires.
Incident management transforms your team from reactive firefighters to coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved faster.
What is Incident Management?
Incident management is a set of documented processes and tools that guide your team’s response when something goes wrong with your website or application.
It answers these questions:
- Who gets notified when an incident occurs?
- What’s the priority of different types of incidents?
- Who leads the response effort?
- What are the steps to diagnose and fix the problem?
- Who communicates to affected customers?
- How do we learn from this incident to prevent it next time?
An incident management process ensures that instead of chaos, there’s order. Instead of guessing, there’s clarity.
Types of Incidents
Severity Levels
Classifying incidents by severity determines response urgency:
Severity 1 – Critical (respond immediately):
- Website completely down (HTTP 500, unable to connect, etc.)
- Core functionality unavailable (checkout not working, login broken, etc.)
- Revenue-impacting issues
- Security incidents
- Multiple users affected simultaneously
- Response time target: 5 minutes to acknowledge, 15 minutes to diagnose
- Resolution time target: < 30 minutes
- Notifications: Immediate (phone calls, SMS, alerts)
Severity 2 – High (respond within 30 minutes):
- Partial functionality broken (some features unavailable but core works)
- Performance degradation (>50% slower than baseline)
- API errors affecting a small percentage of requests
- Single region/user affected
- Response time target: 30 minutes to acknowledge
- Resolution time target: < 2 hours
- Notifications: Email, Slack, standard alerting
Severity 3 – Medium (respond within 4 hours):
- Minor functionality issues
- Small performance degradation
- UI/UX issues not affecting core functionality
- Issues affecting < 0.1% of users
- Response time target: 4 hours
- Resolution time target: < 24 hours
- Notifications: Email, daily report
Severity 4 – Low (handle during business hours):
- Feature requests tracked as incidents
- Documentation issues
- Non-urgent improvements
- Can wait for next planned maintenance
- Response time target: Next business day
- Resolution time target: < 1 week
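These definitions are easier to enforce when they live in code as well as in a document, so alerting automation and dashboards share one source of truth. Here's a minimal Python sketch encoding the targets listed above; the structure and field names are illustrative, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    acknowledge_within: timedelta    # response time target
    resolve_within: timedelta        # resolution time target
    notify_via: tuple                # notification channels

# Targets copied from the severity definitions above.
SEVERITY_POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=5), timedelta(minutes=30),
                      ("phone", "sms", "alert")),
    2: SeverityPolicy("High", timedelta(minutes=30), timedelta(hours=2),
                      ("email", "slack")),
    3: SeverityPolicy("Medium", timedelta(hours=4), timedelta(hours=24),
                      ("email", "daily_report")),
    4: SeverityPolicy("Low", timedelta(days=1), timedelta(weeks=1),
                      ("email",)),
}

def policy_for(severity: int) -> SeverityPolicy:
    """Look up the response targets for a given severity level."""
    return SEVERITY_POLICIES[severity]
```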
Incident Management Roles and Responsibilities
Incident Commander
The incident commander is the leader who coordinates the entire response:
Before an incident:
- Aware of common failure scenarios
- Trained in incident management procedures
- Has escalation authority
- Understands business impact of outages
During an incident:
- Declared when a Severity 1 or other critical issue occurs
- Coordinates all response activities
- Makes triage decisions (stop investigating, go straight to fix?)
- Communicates with stakeholders
- Ensures no critical steps are missed
- Stays focused on resolution, not blame
- Maintains timeline of events
After an incident:
- Leads post-incident review
- Documents lessons learned
- Ensures action items are assigned
Who serves as incident commander:
- On-call engineer (rotates daily/weekly)
- DevOps lead
- Principal engineer
- Never: Someone on vacation, someone new to the team, someone without authority
Engineering Team / Responders
Engineers who diagnose and fix the problem:
Responsibilities:
- Investigate root cause
- Implement fix or workaround
- Test fix before deploying
- Validate fix resolved the issue
- Report findings to incident commander
Not responsible for:
- Deciding severity level (that’s incident commander’s job)
- External communication (incident commander does that)
- Running post-incident reviews (the incident commander leads the scheduled review)
DevOps / Infrastructure Team
Infrastructure team handles deployment, rollbacks, and infrastructure issues:
Responsibilities:
- Deploy fixes to production
- Rollback problematic deployments
- Scale infrastructure (if capacity issue)
- Restart services (if hang/crash)
- Check infrastructure monitoring
- Verify no infrastructure-level issues
Communications Lead
Person responsible for external communication:
Responsibilities:
- Update status page
- Notify customers (if incident is public)
- Prepare communication templates
- Provide updates to stakeholders
- Handle customer support escalations
Not responsible for:
- Technical decisions (focus on communication accuracy)
- Fixing the issue (focus on transparency)
On-Call Rotation
Establish who is on-call for every hour of the week:
Example rotation:
- Engineer A: Monday-Wednesday
- Engineer B: Wednesday-Friday
- Engineer C: Friday-Monday
- Repeats weekly
On-call responsibilities:
- Available to respond to critical alerts 24/7
- Acknowledge incidents within 5 minutes
- Investigate root cause or escalate
On-call compensation:
- Paid on-call stipend ($300-1000/week)
- Paid time off for incidents after hours
- Night incident = day off next day
- Rotation frequency: about one week on-call in every three (for a 3-engineer team)
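A rotation like the example above can be computed instead of maintained by hand, which removes any "who is on-call right now?" ambiguity. A minimal Python sketch assuming handoffs at the start of Wednesday, Friday, and Monday; the names and handoff times are placeholders:

```python
from datetime import date

# Weekday -> on-call engineer, following the example rotation above
# (Monday=0 ... Sunday=6; handoff assumed at the start of the day).
ROTATION = {
    0: "Engineer A",  # Monday
    1: "Engineer A",  # Tuesday
    2: "Engineer B",  # Wednesday handoff
    3: "Engineer B",  # Thursday
    4: "Engineer C",  # Friday handoff
    5: "Engineer C",  # Saturday
    6: "Engineer C",  # Sunday
}

def on_call_for(day: date) -> str:
    """Return who is on-call for a given calendar day."""
    return ROTATION[day.weekday()]

print(on_call_for(date.today()))
```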
Incident Response Process
Phase 1: Detection (Automated)
Monitoring system detects issue:
```text
10:23:47 - Monitoring system detects website returns HTTP 500
10:23:48 - Alert fires to incident commander
10:23:49 - SMS, Slack, email sent to on-call engineer
10:23:50 - Incident created in incident tracking system
```
Time to phase completion: ~3 seconds (automated)
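Detection is normally your monitoring tool's job, but the automation behind it is simple enough to sketch. A minimal Python example that checks one URL and opens a Sev 1 incident when the check fails; `create_incident` is a hypothetical stand-in for whatever your tracking and paging system exposes:

```python
import requests

def create_incident(summary: str, severity: int) -> None:
    """Hypothetical hook into your incident tracker / paging system."""
    print(f"[SEV {severity}] {summary}")

def check_site(url: str) -> None:
    """One uptime check: open a Sev 1 incident on server error or timeout."""
    try:
        response = requests.get(url, timeout=10)
        if response.status_code >= 500:
            create_incident(f"{url} returned HTTP {response.status_code}", severity=1)
    except requests.RequestException as exc:
        create_incident(f"{url} unreachable: {exc}", severity=1)

check_site("https://example.com")
```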
Phase 2: Acknowledgment (5 minutes)
On-call engineer acknowledges incident:
```text
10:24:15 - Engineer receives SMS and wakes up
10:24:45 - Engineer acknowledges incident ("I'm on it")
10:25:00 - Incident commander notified
10:25:15 - Customer communication prepared ("We're investigating")
```
Time to phase completion: 5 minutes max
What happens if there's no acknowledgment?
- Alert escalates to the backup engineer (team lead, manager)
- Escalates again after 3 minutes if there's still no response
- Eventually reaches the manager on call
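The escalation path is just data plus a timeout loop. A minimal Python sketch mirroring the 5-minute and 3-minute windows above; `page()` and `is_acknowledged()` are hypothetical helpers for your paging provider and incident tracker:

```python
import time

# Who to page, and how long to wait before escalating to the next step.
ESCALATION_CHAIN = [
    ("on-call engineer",    5 * 60),
    ("backup engineer",     3 * 60),
    ("engineering manager", 3 * 60),
]

def page(who: str, incident_id: str) -> None:
    """Hypothetical: send SMS / push / phone call via your paging provider."""
    print(f"Paging {who} about {incident_id}")

def is_acknowledged(incident_id: str) -> bool:
    """Hypothetical: ask the incident tracker whether anyone acknowledged."""
    return False

def escalate(incident_id: str) -> None:
    """Walk the escalation chain until someone acknowledges the incident."""
    for who, wait_seconds in ESCALATION_CHAIN:
        page(who, incident_id)
        time.sleep(wait_seconds)
        if is_acknowledged(incident_id):
            return
    print(f"No acknowledgment for {incident_id}; keep paging the manager on call.")
```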
Phase 3: Triage (5-15 minutes)
Quick assessment of situation:
Questions answered:
- What’s broken? (service, feature, entire site?)
- How many users affected? (single user, region, everyone?)
- What’s the root cause? (code, infrastructure, third-party?)
- Can we fix it quickly (< 15 min) or do we need a workaround?
- Do we need to involve other teams?
Triage decision:
- Go straight to fix: Issue is obvious, fix is known
- Investigate further: Need more information to diagnose
- Implement workaround: Can’t fix now, implement temporary solution
- Escalate: Need help from other team/specialist
Example triage:
```text
10:25 - Alert: HTTP 500 errors on checkout API
10:26 - Engineer checks logs: "Database connection pool exhausted"
10:27 - Engineer checks database: All connections in use
10:28 - Engineer finds recent deploy increased connection usage
10:29 - Decision: Rollback deploy (quick fix), investigate later
```
Time to phase completion: 5-15 minutes
Phase 4: Mitigation (15-30 minutes)
Stop the bleeding, even if it’s not the real fix:
Options:
- Rollback recent deployment
- Kill runaway process
- Increase capacity (scale up)
- Switch to degraded mode (disable non-critical features)
- Failover to backup system
- Implement traffic limits (prevent cascade)
The goal here isn't to find the root cause yet, just to stop the damage.
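The "switch to degraded mode" option above can be as simple as a feature flag that request handlers consult. A minimal Python sketch; the flag store and feature names are assumptions (in production the flags would live in a shared store so every instance sees the change):

```python
# In-memory feature flags for illustration only.
FEATURE_FLAGS = {
    "recommendations": True,
    "search_autocomplete": True,
}

def disable_feature(name: str) -> None:
    """Mitigation: switch off a non-critical feature to shed load."""
    FEATURE_FLAGS[name] = False

def render_homepage() -> str:
    """The core page keeps working even with extras disabled."""
    parts = ["product listing"]
    if FEATURE_FLAGS["recommendations"]:
        parts.append("personalized recommendations")
    return ", ".join(parts)

disable_feature("recommendations")  # during the incident
print(render_homepage())            # core functionality still serves
```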
Example mitigation:
```text
10:29 - Engineer initiates rollback of 10:15 deployment
10:31 - Database connection pool recovers
10:32 - HTTP 500 errors stop
10:33 - Site partially recovered (some features disabled)
10:34 - Customer communication: "Issue mitigated, investigating full cause"
```
Time to phase completion: 5-15 minutes
Phase 5: Root Cause Investigation (30+ minutes)
Now investigate what actually caused the problem:
Questions answered:
- Why did this happen?
- When did it start?
- Is it code, infrastructure, configuration, or third-party?
- Could this happen again?
Investigation process:
- Review logs from time of incident
- Check recent deployments/changes
- Check infrastructure metrics (CPU, memory, connections)
- Check third-party service status pages
- Interview users about what they were doing when it failed
- Correlate the timeline of events
Example investigation:
```text
Review deploy at 10:15:
- Changed database connection pooling settings
- Increased concurrent connections from 100 to 150
- New code wasn't closing connections properly
- Quickly hit 150 connection limit
- New queries couldn't get connections
- Database returned errors
- Application crashed with HTTP 500

Root cause: Improper connection handling in new code, exacerbated by increased pool size limit.
```
Time to phase completion: 30+ minutes (depends on complexity)
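For the connection-pool scenario above, "check the database" can be a direct query against the server's own statistics. A minimal Python sketch assuming PostgreSQL and the psycopg2 driver; the connection string is a placeholder:

```python
import psycopg2

# Placeholder DSN: point this at the affected database.
conn = psycopg2.connect("dbname=app user=readonly host=db.internal")
try:
    with conn.cursor() as cur:
        # Connection count broken down by state
        # (active, idle, idle in transaction, ...).
        cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state")
        for state, count in cur.fetchall():
            print(f"{state}: {count}")
finally:
    conn.close()
```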
Phase 6: Permanent Fix (varies)
Implement actual solution to prevent recurrence:
Options:
- Revert problematic code
- Fix the code properly
- Update configuration
- Increase infrastructure capacity
- Implement circuit breakers or throttling
Example permanent fix:
```text
In code review: Find the issue in the connection handling code
Fix: Ensure connections are always closed (finally block)
Test: Verify fix with load test that previously failed
Deploy: Roll out fix to production
Verify: Monitor metrics post-deployment
```
Time to phase completion: 30 minutes to several hours (depends on severity and fix complexity)
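The fix described above, making sure connections always go back to the pool, usually comes down to a `finally` block or a context manager. A minimal Python sketch assuming a psycopg2-style pool (`getconn`/`putconn`); the query and table are illustrative:

```python
from contextlib import contextmanager

@contextmanager
def borrowed_connection(pool):
    """Yield a pooled connection and guarantee it is returned, even on error."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)  # always release, even if the query raised

def fetch_order(pool, order_id):
    with borrowed_connection(pool) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()
```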
Phase 7: Communication
Ongoing customer communication throughout incident:
Template updates:
```text
10:24:00 - "We're investigating elevated error rates. More updates in 5 minutes."
10:29:00 - "We've identified the issue and are implementing a fix."
10:32:00 - "The issue has been mitigated. We're investigating the root cause."
10:40:00 - "We've identified the root cause: improper connection handling in a recent deployment."
10:50:00 - "We've deployed a permanent fix and are monitoring closely."
11:00:00 - "Issue fully resolved. See blog post for details."
```
Phase 8: Post-Incident Review (Within 24 hours)
Team reviews incident to learn and improve:
What to discuss:
- Timeline of events
- What went well
- What went poorly
- Root cause (not blame)
- Action items to prevent recurrence
- Process improvements
Action items assigned:
- Engineer A: Add monitoring for connection pool exhaustion
- Engineer B: Add circuit breaker for database failures
- Engineer C: Improve load test to catch connection issues
- DevOps: Increase database connection pool capacity
Incident Management Tools
Alerting and Notification
PagerDuty:
- Routes alerts to on-call engineer
- Escalates if not acknowledged
- Tracks incident timeline
- Integrates with monitoring tools
- Schedules on-call rotations
Incident.io:
- Simpler PagerDuty alternative
- Automated incident declaration
- Slack integration
- Postmortems
- Trending by root cause
Built-in monitoring tools:
- Most monitoring tools have basic alerting
- Sufficient for small teams
- Limited escalation/rotation features
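Most monitoring tools can forward alerts to these services out of the box. If you need to wire it up yourself, PagerDuty's Events API v2 accepts a small JSON payload; a minimal Python sketch, with the routing key as a placeholder from your PagerDuty service integration:

```python
import requests

def trigger_pagerduty(summary: str, source: str, severity: str = "critical") -> None:
    """Send a trigger event to PagerDuty's Events API v2."""
    payload = {
        "routing_key": "YOUR_INTEGRATION_ROUTING_KEY",  # placeholder
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # critical, error, warning, or info
        },
    }
    response = requests.post("https://events.pagerduty.com/v2/enqueue",
                             json=payload, timeout=10)
    response.raise_for_status()

trigger_pagerduty("Checkout API returning HTTP 500", source="uptime-monitor")
```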
Status Pages
Atlassian Statuspage (statuspage.io):
- Public-facing status page
- Automated incident updates
- Component status tracking
- Scheduled maintenance notifications
- Analytics on uptime/incidents
- Higher tiers add more customization and better reporting
Self-hosted:
- Some teams build custom status pages
- More control but requires maintenance
Communication
Slack:
- #incidents channel for incident discussion
- Integrates with PagerDuty/monitoring
- Real-time team communication
- Searchable incident history
Email/SMS:
- Critical for out-of-band communication
- Reaches people without Slack
- Backup communication method
Incident Management Best Practices
1. Declare Incidents Early
Bad: Wait until you’re certain it’s a real incident before declaring
Good: Declare immediately when severity 1 criteria met, even if you’re not 100% sure
Why: Getting team together quickly can resolve issues faster. False alarms are better than slow response to real incidents.
2. Timebox Investigation
Bad: Spend 2+ hours investigating root cause while customers are frustrated
Good:
- First 30 minutes: Mitigation focus
- Next 30 minutes: Initial diagnosis
- Then: Longer investigation if needed
Why: Users care about resolution, not perfect root cause. Fix the problem, investigate thoroughly after.
3. Assign Clear Owner
Bad: Team of engineers all working on incident with no clear owner, duplicating effort
Good: Incident commander assigns specific tasks to specific people
Why: Accountability, avoids duplication, ensures nothing falls through cracks.
4. Update Status Page Regularly
Bad: First update 30 minutes after incident, then nothing for 2 hours
Good: Update status page every 15-30 minutes with any progress, even “still investigating”
Why: Customers are less frustrated when they know you’re actively working on it.
5. Have Runbooks for Common Issues
Runbook: Step-by-step procedure for diagnosing/fixing common problems
Examples:
- “Database connection exhaustion”
- “High memory usage causing slowness”
- “API rate limit hit from third-party service”
- “SSL certificate expiration”
- “CDN cache invalidation issue”
Value: New/junior engineers can fix issues without deep expertise. Reduces MTTR significantly.
6. Blameless Post-Incident Reviews
Bad: “John deployed bad code, we’re removing his deployment access”
Good: “The code review process didn’t catch the issue. Let’s add load testing to code review.”
Why: Focus on systems and processes, not individuals. Engineers need to feel safe admitting mistakes so incidents are reported quickly.
7. Track Metrics Over Time
Metrics to track:
- MTTD (Mean Time To Detection): How quickly we detect issues
- MTTR (Mean Time To Resolution): How quickly we fix issues
- MTBF (Mean Time Between Failures): How often incidents occur
- Incident cost: Revenue lost per incident
Goal: Improve each metric quarter over quarter
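These metrics fall out of timestamps you are already capturing in your incident tracker. A minimal Python sketch computing them from a list of incident records; the field names and sample data are illustrative, and MTTR is measured here from detection to resolution:

```python
from datetime import datetime, timedelta

# Sample incident records: when each started, was detected, and was resolved.
incidents = [
    {"started":  datetime(2024, 1, 3, 10, 20),
     "detected": datetime(2024, 1, 3, 10, 23),
     "resolved": datetime(2024, 1, 3, 10, 55)},
    {"started":  datetime(2024, 1, 19, 2, 0),
     "detected": datetime(2024, 1, 19, 2, 4),
     "resolved": datetime(2024, 1, 19, 3, 10)},
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([i["detected"] - i["started"] for i in incidents])
mttr = mean([i["resolved"] - i["detected"] for i in incidents])
# MTBF: average gap between the starts of consecutive incidents.
gaps = [b["started"] - a["started"] for a, b in zip(incidents, incidents[1:])]
mtbf = mean(gaps) if gaps else None

print(f"MTTD: {mttd}  MTTR: {mttr}  MTBF: {mtbf}")
```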
Common Incident Management Mistakes
Mistake 1: No On-Call Rotation
Problem: Everyone is on-call, which means no one feels responsible
Solution: Explicit rotation where one person is on-call and has clear responsibilities
Mistake 2: No Escalation Path
Problem: Engineer doesn’t respond to alert, nobody else knows to take over
Solution: Alert escalates to backup engineer after 5 minutes, to manager after 10 minutes
Mistake 3: Vague Severity Levels
Problem: Everyone argues about whether incident is Sev 1 or Sev 2 while customers wait
Solution: Clear, objective criteria for severity levels
Mistake 4: No Runbooks
Problem: Each engineer diagnoses issue differently, takes different time to resolve
Solution: Standard procedures documented for common issues
Mistake 5: No Status Page
Problem: Customers have no idea what’s happening, assume worst
Solution: Public status page updated every 15 minutes during incidents
Conclusion
Incident management transforms your team from reactive, panicked firefighters into coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved 2-5x faster.
The best incident management system includes:
- Clear roles and responsibilities
- Documented response procedures
- On-call rotation with escalation
- Monitoring/alerting to detect issues
- Status page for customer communication
- Post-incident reviews to prevent recurrence
Start with the basics: define severity levels, establish an on-call rotation, create a runbook for your most common incidents. As your maturity increases, add PagerDuty, status pages, and more sophisticated processes.
The ROI is clear: a single incident prevented, or resolved faster, can pay for all of your incident management tools and processes.
Ready to implement incident management? CheckMe.dev integrates with PagerDuty and Incident.io, and includes detailed incident logs and reporting. Know immediately when an incident occurs, understand exactly what failed, and resolve issues faster. Start your free trial today.


