Incident Management for Website Monitoring

Response Best Practices & Tools

Introduction

Your website just went down. Your monitoring system sent out alerts 47 seconds ago. But your team has no idea who should respond, what they should check, or who needs to be notified.

For the next 23 minutes, your team scrambles: making phone calls, checking dashboards at random, and hoping someone figures out the problem before customers get too angry.

This chaos is what happens without proper incident management processes. Even the best monitoring system is useless if your team doesn’t know how to respond when an alert fires.

Incident management transforms your team from reactive firefighters to coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved faster.

What is Incident Management?

Incident management is a set of documented processes and tools that guide your team’s response when something goes wrong with your website or application.

It answers these questions:

  • Who gets notified when an incident occurs?
  • What’s the priority of different types of incidents?
  • Who leads the response effort?
  • What are the steps to diagnose and fix the problem?
  • Who communicates to affected customers?
  • How do we learn from this incident to prevent it next time?

An incident management process ensures that instead of chaos, there’s order. Instead of guessing, there’s clarity.

Types of Incidents

Severity Levels

Classifying incidents by severity determines response urgency (a small classification sketch follows the severity definitions below):

Severity 1 – Critical (respond immediately):

  • Website completely down (HTTP 500, unable to connect, etc.)
  • Core functionality unavailable (checkout not working, login broken, etc.)
  • Revenue-impacting issues
  • Security incidents
  • Multiple users affected simultaneously
  • Response time target: 5 minutes to acknowledge, 15 minutes to diagnose
  • Resolution time target: < 30 minutes
  • Notifications: Immediate (phone calls, SMS, alerts)

Severity 2 – High (respond within 30 minutes):

  • Partial functionality broken (some features unavailable but core works)
  • Performance degradation (>50% slower than baseline)
  • API errors affecting small percentage of requests
  • Single region/user affected
  • Response time target: 30 minutes to acknowledge
  • Resolution time target: < 2 hours
  • Notifications: Email, Slack, standard alerting

Severity 3 – Medium (respond within 4 hours):

  • Minor functionality issues
  • Small performance degradation
  • UI/UX issues not affecting core functionality
  • Issues affecting < 0.1% of users
  • Response time target: 4 hours
  • Resolution time target: < 24 hours
  • Notifications: Email, daily report

Severity 4 – Low (handle during business hours):

  • Feature requests tracked as incidents
  • Documentation issues
  • Non-urgent improvements
  • Can wait for next planned maintenance
  • Response time target: Next business day
  • Resolution time target: < 1 week
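
These criteria are easiest to apply consistently when they are written down as code or configuration. Below is a minimal Python sketch, under the assumption that your monitoring tool can supply signals like site_down, perf_degradation_pct, and users_affected_pct (all hypothetical field names); tune the thresholds to your own definitions.

def classify_severity(site_down: bool,
                      core_feature_broken: bool,
                      revenue_impacting: bool,
                      security_incident: bool,
                      perf_degradation_pct: float = 0.0,
                      users_affected_pct: float = 0.0) -> int:
    """Map incident signals to a severity level (1 = critical, 4 = low)."""
    # Severity 1: site down, core functionality broken, revenue or security impact
    if site_down or core_feature_broken or revenue_impacting or security_incident:
        return 1
    # Severity 2: significant degradation or more than a tiny slice of users affected
    if perf_degradation_pct > 50 or users_affected_pct >= 0.1:
        return 2
    # Severity 3: minor issues affecting a small number of users
    if perf_degradation_pct > 0 or users_affected_pct > 0:
        return 3
    # Severity 4: everything else waits for business hours
    return 4


# Checkout is broken for everyone -> severity 1, page the on-call engineer
print(classify_severity(site_down=False, core_feature_broken=True,
                        revenue_impacting=True, security_incident=False))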

Incident Management Roles and Responsibilities

Incident Commander

The incident commander is the leader who coordinates the entire response:

Before an incident:

  • Aware of common failure scenarios
  • Trained in incident management procedures
  • Has escalation authority
  • Understands business impact of outages

During an incident:

  • Takes command when a severity 1 (critical) incident is declared
  • Coordinates all response activities
  • Makes triage decisions (stop investigating, go straight to fix?)
  • Communicates with stakeholders
  • Ensures no critical steps are missed
  • Stays focused on resolution, not blame
  • Maintains timeline of events

After an incident:

  • Leads post-incident review
  • Documents lessons learned
  • Ensures action items are assigned

Who serves as incident commander:

  • On-call engineer (rotates daily/weekly)
  • DevOps lead
  • Principal engineer
  • Never: Someone on vacation, someone new to the team, someone without authority

Engineering Team / Responders

Engineers who diagnose and fix the problem:

Responsibilities:

  • Investigate root cause
  • Implement fix or workaround
  • Test fix before deploying
  • Validate fix resolved the issue
  • Report findings to incident commander

Not responsible for:

  • Deciding severity level (that’s incident commander’s job)
  • External communication (incident commander does that)
  • Post-incident reviews (wait for scheduled review)

DevOps / Infrastructure Team

Infrastructure team handles deployment, rollbacks, and infrastructure issues:

Responsibilities:

  • Deploy fixes to production
  • Rollback problematic deployments
  • Scale infrastructure (if capacity issue)
  • Restart services (if hang/crash)
  • Check infrastructure monitoring
  • Verify no infrastructure-level issues

Communications Lead

Person responsible for external communication:

Responsibilities:

  • Update status page
  • Notify customers (if incident is public)
  • Prepare communication templates
  • Provide updates to stakeholders
  • Handle customer support escalations

Not responsible for:

  • Technical decisions (focus on communication accuracy)
  • Fixing the issue (focus on transparency)

On-Call Rotation

Establish who is on call at any given time:

Example rotation:

  • Engineer A: Monday-Wednesday
  • Engineer B: Wednesday-Friday
  • Engineer C: Friday-Monday
  • Repeats weekly

On-call responsibilities:

  • Available to respond to critical alerts 24/7
  • 5-minute response time
  • Acknowledge incident within 5 minutes
  • Investigate root cause or escalate

On-call compensation:

  • Paid on-call stipend ($300-1000/week)
  • Paid time off for incidents after hours
  • Night incident = day off next day
  • Rotation frequency: 1 week per engineer per month (for 3-engineer team)

Incident Response Process

Phase 1: Detection (Automated)

Monitoring system detects issue:

10:23:47 - Monitoring system detects website returns HTTP 500
10:23:48 - Alert fires to incident commander
10:23:49 - SMS, Slack, email sent to on-call engineer
10:23:50 - Incident created in incident tracking system

Time to phase completion: ~3 seconds (automated)
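
Under the hood, detection is usually nothing more than a scheduled HTTP check plus a hand-off to the alerting tool. A minimal Python sketch of that check (the health-check URL and the alert hand-off are placeholders):

import urllib.error
import urllib.request


def check_site(url: str, timeout: float = 10.0) -> tuple:
    """Run a single uptime check and return (healthy, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:              # urlopen raises on 4xx/5xx
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, TimeoutError) as exc:
        return False, f"unable to connect: {exc}"


healthy, detail = check_site("https://example.com/health")   # placeholder URL
if not healthy:
    # Hand off to your alerting integration here (PagerDuty, SMS, Slack, ...)
    print(f"ALERT: uptime check failed ({detail})")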

Phase 2: Acknowledgment (5 minutes)

On-call engineer acknowledges incident:

10:24:15 - Engineer receives SMS and wakes up
10:24:45 - Engineer acknowledges incident ("I'm on it")
10:25:00 - Incident commander notified
10:25:15 - Customer communication prepared ("We're investigating")

Time to phase completion: 5 minutes max

What happens if no one acknowledges?

  • Alert escalates to backup engineer (team lead, manager)
  • 3-minute escalation if no response
  • Eventually reaches manager on call
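
Paging tools such as PagerDuty implement this chain for you, but the logic itself is simple. A minimal sketch, where notify() and is_acknowledged() stand in for your alerting tool's API (hypothetical hooks):

import time

# Escalation chain and per-step timeout, mirroring the policy above.
ESCALATION_CHAIN = ["on-call engineer", "backup engineer", "manager on call"]
ESCALATION_TIMEOUT_S = 3 * 60      # escalate after 3 minutes with no acknowledgment


def escalate(incident_id: str, notify, is_acknowledged):
    """Walk the escalation chain until someone acknowledges the incident."""
    for person in ESCALATION_CHAIN:
        notify(person, incident_id)
        deadline = time.monotonic() + ESCALATION_TIMEOUT_S
        while time.monotonic() < deadline:
            if is_acknowledged(incident_id):
                return person              # this person now owns the incident
            time.sleep(10)                 # poll every 10 seconds
    return None                            # nobody responded: page the whole team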

Phase 3: Triage (5-15 minutes)

Quick assessment of situation:

Questions answered:

  1. What’s broken? (service, feature, entire site?)
  2. How many users affected? (single user, region, everyone?)
  3. What’s the root cause? (code, infrastructure, third-party?)
  4. Can we fix it quickly (< 15 min) or do we need a workaround?
  5. Do we need to involve other teams?

Triage decision:

  • Go straight to fix: Issue is obvious, fix is known
  • Investigate further: Need more information to diagnose
  • Implement workaround: Can’t fix now, implement temporary solution
  • Escalate: Need help from other team/specialist

Example triage:

10:25 - Alert: HTTP 500 errors on checkout API
10:26 - Engineer checks logs: "Database connection pool exhausted"
10:27 - Engineer checks database: All connections in use
10:28 - Engineer finds recent deploy increased connection usage
10:29 - Decision: Rollback deploy (quick fix), investigate later

Time to phase completion: 5-15 minutes

Phase 4: Mitigation (15-30 minutes)

Stop the bleeding, even if it’s not the real fix:

Options:

  • Rollback recent deployment
  • Kill runaway process
  • Increase capacity (scale up)
  • Switch to degraded mode (disable non-critical features; see the sketch at the end of this phase)
  • Failover to backup system
  • Implement traffic limits (prevent cascade)

The goal is not to find the root cause yet, just to stop the damage.

Example mitigation:

10:29 - Engineer initiates rollback of 10:15 deployment
10:31 - Database connection pool recovers
10:32 - HTTP 500 errors stop
10:33 - Site partially recovered (some features disabled)
10:34 - Customer communication: "Issue resolved, investigating full cause"

Time to phase completion: 5-15 minutes
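
To make the "degraded mode" option above concrete, here is a minimal sketch using a simple in-process kill switch; real setups usually read these flags from a config store or feature-flag service so they can be flipped without a deploy. The feature names and helper functions are hypothetical.

# Kill switch for non-critical features: checkout stays on, recommendations are shed.
FEATURE_ENABLED = {"checkout": True, "recommendations": False}


def get_price(product_id: str) -> float:
    return 9.99                      # placeholder

def get_recommendations(product_id: str) -> list:
    return ["sku-123"]               # placeholder for the non-critical dependency

def render_product_page(product_id: str) -> dict:
    page = {"product": product_id, "price": get_price(product_id)}
    if FEATURE_ENABLED.get("recommendations", True):
        # Expensive, non-critical call that gets skipped while the incident is ongoing
        page["recommendations"] = get_recommendations(product_id)
    return page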

Phase 5: Root Cause Investigation (30+ minutes)

Now investigate what actually caused the problem:

Questions answered:

  • Why did this happen?
  • When did it start?
  • Is it code, infrastructure, configuration, or third-party?
  • Could this happen again?

Investigation process:

  1. Review logs from time of incident
  2. Check recent deployments/changes
  3. Check infrastructure metrics (CPU, memory, connections)
  4. Check third-party service status pages
  5. Interview users about what they were doing when it failed
  6. Correlate events timeline

Example investigation:

Review deploy at 10:15:
- Changed database connection pooling settings
- Increased concurrent connections from 100 to 150
- New code wasn't closing connections properly
- Quickly hit 150 connection limit
- New queries couldn't get connections
- Database returned errors
- Application crashed with HTTP 500

Root cause: Improper connection handling in new code, exacerbated by 
increased pool size limit.

Time to phase completion: 30+ minutes (depends on complexity)

Phase 6: Permanent Fix (varies)

Implement actual solution to prevent recurrence:

Options:

  • Revert problematic code
  • Fix the code properly
  • Update configuration
  • Increase infrastructure capacity
  • Implement circuit breakers or throttling

Example permanent fix:

In code review: Find the issue in connection handling code
Fix: Ensure connections are always closed (finally block)
Test: Verify fix with load test that previously failed
Deploy: Roll out fix to production
Verify: Monitor metrics post-deployment

Time to phase completion: 30 minutes to several hours (depends on severity and fix complexity)
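
The "always closed" fix described above usually comes down to a try/finally block or a context manager around every checkout from the pool. A minimal Python sketch, where pool.get_connection() and pool.release() stand in for whatever your pooling library provides:

from contextlib import contextmanager


@contextmanager
def pooled_connection(pool):
    """Check a connection out of the pool and guarantee it is returned."""
    conn = pool.get_connection()       # placeholder for your pool library's API
    try:
        yield conn
    finally:
        pool.release(conn)             # always runs, even if the query raised


def fetch_order(pool, order_id):
    # The buggy version returned early on errors without releasing the connection,
    # leaking connections until the 150-connection pool was exhausted.
    with pooled_connection(pool) as conn:
        return conn.execute("SELECT * FROM orders WHERE id = %s", (order_id,))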

Phase 7: Communication

Ongoing customer communication throughout incident:

Template updates:

10:24:00 - "We're investigating elevated error rates. More updates in 5 minutes."
10:29:00 - "We've identified the issue and are implementing a fix."
10:32:00 - "The issue has been mitigated. We're investigating root cause."
10:40:00 - "We've identified root cause: improper connection handling in recent deployment."
10:50:00 - "We've deployed permanent fix and monitoring closely."
11:00:00 - "Issue fully resolved. See blog post for details."

Phase 8: Post-Incident Review (Within 24 hours)

Team reviews incident to learn and improve:

What to discuss:

  • Timeline of events
  • What went well
  • What went poorly
  • Root cause (not blame)
  • Action items to prevent recurrence
  • Process improvements

Action items assigned:

  • Engineer A: Add monitoring for connection pool exhaustion
  • Engineer B: Add circuit breaker for database failures (sketched below)
  • Engineer C: Improve load test to catch connection issues
  • DevOps: Increase database connection pool capacity
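
For the circuit breaker action item, the core idea is to stop hammering a failing dependency and fail fast until it recovers. A minimal sketch of the pattern (not any particular library):

import time


class CircuitBreaker:
    """Fail fast after repeated errors, then retry after a cool-down period."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency marked unhealthy")
            self.failures = 0                      # cool-down elapsed, try again
        try:
            result = fn(*args, **kwargs)
            self.failures = 0                      # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise


# Usage: wrap database calls so a dead database degrades gracefully
# breaker = CircuitBreaker()
# rows = breaker.call(run_query, "SELECT 1")      # run_query is your own function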

Incident Management Tools

Alerting and Notification

PagerDuty:

  • Routes alerts to on-call engineer
  • Escalates if not acknowledged
  • Tracks incident timeline
  • Integrates with monitoring tools
  • Schedules on-call rotations
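
For reference, triggering an incident programmatically is a single HTTP call to PagerDuty's Events API v2. A minimal sketch (the routing key comes from a service integration in your PagerDuty account; check their documentation for the current schema):

import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"        # placeholder

event = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "payload": {
        "summary": "HTTP 500 errors on checkout API",
        "source": "uptime-monitor",                 # hypothetical monitor name
        "severity": "critical",
    },
}

req = urllib.request.Request(
    PAGERDUTY_EVENTS_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())        # expect 202 Accepted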

Incident.io:

  • Simpler PagerDuty alternative
  • Automated incident declaration
  • Slack integration
  • Postmortems
  • Trending by root cause

Built-in monitoring tools:

  • Most monitoring tools have basic alerting
  • Sufficient for small teams
  • Limited escalation/rotation features

Status Pages

Statuspage.io:

  • Public-facing status page
  • Automated incident updates
  • Component status tracking
  • Scheduled maintenance notifications
  • Analytics on uptime/incidents

Atlassian Statuspage:

  • Statuspage.io is Atlassian’s product, so this is the same tool at its higher tiers
  • More customization
  • Better reporting

Self-hosted:

  • Some teams build custom status pages
  • More control but requires maintenance

Communication

Slack:

  • #incidents channel for incident discussion
  • Integrates with PagerDuty/monitoring
  • Real-time team communication
  • Searchable incident history

Email/SMS:

  • Critical for out-of-band communication
  • Reaches people without Slack
  • Backup communication method

Incident Management Best Practices

1. Declare Incidents Early

Bad: Wait until you’re certain it’s a real incident before declaring

Good: Declare immediately when severity 1 criteria met, even if you’re not 100% sure

Why: Getting team together quickly can resolve issues faster. False alarms are better than slow response to real incidents.

2. Timebox Investigation

Bad: Spend 2+ hours investigating root cause while customers are frustrated

Good:

  • First 30 minutes: Mitigation focus
  • Next 30 minutes: Initial diagnosis
  • Then: Longer investigation if needed

Why: Users care about resolution, not perfect root cause. Fix the problem, investigate thoroughly after.

3. Assign Clear Owner

Bad: Team of engineers all working on incident with no clear owner, duplicating effort

Good: Incident commander assigns specific tasks to specific people

Why: Accountability, avoids duplication, ensures nothing falls through cracks.

4. Update Status Page Regularly

Bad: First update 30 minutes after incident, then nothing for 2 hours

Good: Update status page every 15-30 minutes with any progress, even “still investigating”

Why: Customers are less frustrated when they know you’re actively working on it.

5. Have Runbooks for Common Issues

Runbook: Step-by-step procedure for diagnosing/fixing common problems

Examples:

  • “Database connection exhaustion”
  • “High memory usage causing slowness”
  • “API rate limit hit from third-party service”
  • “SSL certificate expiration”
  • “CDN cache invalidation issue”

Value: New/junior engineers can fix issues without deep expertise. Reduces MTTR significantly.
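
Runbooks get even more valuable when the diagnostic steps are scriptable. As an illustration, a minimal sketch for the "database connection exhaustion" runbook, assuming PostgreSQL and the psycopg2 driver (the DSN is hypothetical; the 150-connection limit comes from the earlier example):

import psycopg2

CONN_LIMIT = 150   # pool limit from the earlier incident example


def check_connection_usage(dsn: str) -> None:
    """Runbook step: report how many database connections are in use."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT state, count(*) FROM pg_stat_activity GROUP BY state;")
            rows = cur.fetchall()
    total = sum(count for _, count in rows)
    print(f"{total}/{CONN_LIMIT} connections in use")
    for state, count in rows:
        print(f"  {state or 'unknown'}: {count}")
    if total > 0.9 * CONN_LIMIT:
        print("ACTION: pool nearly exhausted -> follow rollback/restart steps in runbook")


# check_connection_usage("dbname=app user=readonly host=db.internal")   # hypothetical DSN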

6. Blameless Post-Incident Reviews

Bad: “John deployed bad code, we’re removing his deployment access”

Good: “The code review process didn’t catch the issue. Let’s add load testing to code review.”

Why: Focus on systems and processes, not individuals. Engineers need to feel safe admitting mistakes so incidents are reported quickly.

7. Track Metrics Over Time

Metrics to track:

  • MTTD (Mean Time To Detection): How quickly we detect issues
  • MTTR (Mean Time To Resolution): How quickly we fix issues
  • MTBF (Mean Time Between Failures): How often incidents occur
  • Incident cost: Revenue lost per incident

Goal: Improve each metric quarter over quarter
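
All of these fall out of timestamps you are already recording per incident. A minimal sketch, assuming each incident record carries started_at, detected_at, and resolved_at datetimes (hypothetical field names):

from datetime import datetime
from statistics import mean


def incident_metrics(incidents: list) -> dict:
    """Compute MTTD, MTTR (minutes) and MTBF (hours) from incident records."""
    mttd = mean((i["detected_at"] - i["started_at"]).total_seconds() / 60 for i in incidents)
    mttr = mean((i["resolved_at"] - i["detected_at"]).total_seconds() / 60 for i in incidents)
    starts = sorted(i["started_at"] for i in incidents)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]
    return {
        "MTTD_minutes": round(mttd, 1),
        "MTTR_minutes": round(mttr, 1),
        "MTBF_hours": round(mean(gaps), 1) if gaps else None,
    }


incidents = [
    {"started_at": datetime(2024, 1, 1, 10, 20), "detected_at": datetime(2024, 1, 1, 10, 23),
     "resolved_at": datetime(2024, 1, 1, 10, 50)},
    {"started_at": datetime(2024, 1, 8, 14, 0), "detected_at": datetime(2024, 1, 8, 14, 5),
     "resolved_at": datetime(2024, 1, 8, 15, 0)},
]
print(incident_metrics(incidents))   # e.g. {'MTTD_minutes': 4.0, 'MTTR_minutes': 41.0, ...}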

Common Incident Management Mistakes

Mistake 1: No On-Call Rotation

Problem: Everyone is on-call, which means no one feels responsible

Solution: Explicit rotation where one person is on-call and has clear responsibilities

Mistake 2: No Escalation Path

Problem: Engineer doesn’t respond to alert, nobody else knows to take over

Solution: Alert escalates to backup engineer after 5 minutes, to manager after 10 minutes

Mistake 3: Vague Severity Levels

Problem: Everyone argues about whether incident is Sev 1 or Sev 2 while customers wait

Solution: Clear, objective criteria for severity levels

Mistake 4: No Runbooks

Problem: Each engineer diagnoses issue differently, takes different time to resolve

Solution: Standard procedures documented for common issues

Mistake 5: No Status Page

Problem: Customers have no idea what’s happening, assume worst

Solution: Public status page updated every 15 minutes during incidents

Conclusion

Incident management transforms your team from reactive, panicked firefighters into coordinated responders. When an incident occurs, everyone knows their role, the process is clear, and issues get resolved 2-5x faster.

The best incident management system includes:

  1. Clear roles and responsibilities
  2. Documented response procedures
  3. On-call rotation with escalation
  4. Monitoring/alerting to detect issues
  5. Status page for customer communication
  6. Post-incident reviews to prevent recurrence

Start with the basics: define severity levels, establish an on-call rotation, create a runbook for your most common incidents. As your maturity increases, add PagerDuty, status pages, and more sophisticated processes.

The ROI is clear: a single prevented incident, or even one noticeably faster resolution, typically pays for all incident management tools and processes.


Ready to implement incident management? CheckMe.dev integrates with PagerDuty and Incident.io, and includes detailed incident logs and reporting. Know immediately when an incident occurs, understand exactly what failed, and resolve issues faster. Start your free trial today.
