Incident Response Runbook (ReasonKit MCP (Pro))
This runbook provides standardized procedures for handling security and operational incidents involving ReasonKit MCP (Pro) deployments.
1. Incident Classification
| Severity | Description | Response Time |
|---|---|---|
| P0 (Critical) | Core reasoning engine down; Data breach; PII leakage. | < 15 Minutes |
| P1 (High) | Significant latency degradation (>2s); Specific ThinkTools failing. | < 1 Hour |
| P2 (Medium) | Minor feature bugs; Intermittent API errors. | < 4 Hours |
| P3 (Low) | Documentation typos; Aesthetic UI issues. | Next Business Day |
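The response-time targets in the table can be turned into concrete deadlines for paging or dashboards. The following is a minimal sketch, not part of ReasonKit itself: the names `Severity`, `RESPONSE_TARGETS`, and `response_deadline` are hypothetical, and "Next Business Day" is left as `None` because the runbook does not define a business calendar.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Severity(Enum):
    P0 = "Critical"
    P1 = "High"
    P2 = "Medium"
    P3 = "Low"


# Response-time targets copied from the classification table.
# P3 ("Next Business Day") has no fixed duration, so it maps to None.
RESPONSE_TARGETS = {
    Severity.P0: timedelta(minutes=15),
    Severity.P1: timedelta(hours=1),
    Severity.P2: timedelta(hours=4),
    Severity.P3: None,
}


def response_deadline(severity: Severity, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None when no fixed target exists."""
    target = RESPONSE_TARGETS[severity]
    if target is None:
        return None
    return detected_at + target


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(response_deadline(Severity.P1, now))
```

Keeping the targets in one mapping means a change to the table only has to be mirrored in one place.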
2. Response Phases
Phase 1: Identification & Triage
- Alert Source: PagerDuty, Datadog (Logic Drift), or User Report.
- Action: Validate the incident. Identify the affected `TenantID` and `RequestID`.
Phase 2: Containment
- Action: If a specific model is hallucinating, switch the reasoning profile to a fallback model (e.g., switch from `deep` to `balanced`).
- Action: If PII leakage is detected, rotate API keys and flush the associated vector cache immediately.
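The two containment rules can be expressed as a small decision function. This is a sketch under stated assumptions: `ContainmentInput`, `FALLBACK_PROFILE`, and `containment_actions` are hypothetical names, and only the `deep` to `balanced` step comes from this runbook. It returns a checklist of actions rather than performing them.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical fallback map; only the deep -> balanced step is named in the runbook.
FALLBACK_PROFILE = {"deep": "balanced"}


@dataclass
class ContainmentInput:
    hallucinating: bool    # a specific model is producing hallucinated output
    pii_leak: bool         # PII leakage has been detected
    current_profile: str   # reasoning profile currently in use


def containment_actions(incident: ContainmentInput) -> List[str]:
    """Return the ordered containment steps that apply to this incident."""
    actions: List[str] = []
    if incident.hallucinating:
        fallback = FALLBACK_PROFILE.get(incident.current_profile)
        if fallback is not None:
            actions.append(f"switch profile {incident.current_profile} -> {fallback}")
        else:
            # The runbook does not define a fallback for other profiles.
            actions.append(f"no fallback defined for profile {incident.current_profile}")
    if incident.pii_leak:
        actions.append("rotate API keys")
        actions.append("flush associated vector cache")
    return actions


if __name__ == "__main__":
    print(containment_actions(ContainmentInput(True, True, "deep")))
```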
Phase 3: Eradication & Recovery
- Action: Patch the vulnerable protocol or logic step.
- Action: Redeploy affected ReasonKit nodes.
- Action: Verify system health via `rk benchmark`.
Phase 4: Post-Mortem
- Action: Document the root cause.
- Action: Update the ThinkTool protocol to prevent recurrence.
3. Specific Scenarios
Scenario: Reasoning Logic Drift (Confidence < 0.4)
- Detection: Prometheus alert on `reasonkit_logic_confidence`.
- Immediate Action: Check the upstream LLM status page. If the model is degraded, use `rk-core` to route to a secondary provider.
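The drift rule in this scenario reduces to a threshold check followed by one branch on upstream health. The sketch below assumes the 0.4 threshold from the scenario title; the function names and returned labels are hypothetical, and the branch for a healthy upstream is an assumption because the runbook does not specify it.

```python
DRIFT_THRESHOLD = 0.4  # from the scenario title: Confidence < 0.4


def is_logic_drift(confidence: float, threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when the observed confidence falls below the drift threshold."""
    return confidence < threshold


def next_step(confidence: float, upstream_degraded: bool) -> str:
    """Map a confidence reading and upstream status to the next step."""
    if not is_logic_drift(confidence):
        return "no_action"
    if upstream_degraded:
        # Runbook: route to a secondary provider via rk-core.
        return "route_to_secondary_provider"
    # Not covered by the runbook; assumed here to fall back to normal triage.
    return "continue_triage"


if __name__ == "__main__":
    print(next_step(0.35, upstream_degraded=True))
```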
Scenario: Unauthorized API Access
- Detection: Multiple `security.auth_fail` events in audit logs.
- Immediate Action: Revoke the compromised API key. Whitelist known IP ranges in the Gateway Router.
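"Multiple" failures can be made precise with a sliding-window count per API key. The sketch below is illustrative only: the event shape (`event`, `api_key`, `timestamp` fields), the default threshold of 5, and the 5-minute window are all assumptions, since the runbook does not define the audit log schema or a numeric trigger.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Deque, Dict, Iterable, Set


def keys_with_repeated_auth_failures(
    events: Iterable[dict],
    threshold: int = 5,                      # assumed value; tune per deployment
    window: timedelta = timedelta(minutes=5),  # assumed value
) -> Set[str]:
    """Return API keys with at least `threshold` auth failures inside any `window`.

    `events` must be sorted by timestamp in ascending order.
    """
    recent: Dict[str, Deque[datetime]] = defaultdict(deque)
    flagged: Set[str] = set()
    for event in events:
        if event["event"] != "security.auth_fail":
            continue
        timestamps = recent[event["api_key"]]
        now = event["timestamp"]
        timestamps.append(now)
        # Drop failures that have aged out of the window.
        while timestamps and now - timestamps[0] > window:
            timestamps.popleft()
        if len(timestamps) >= threshold:
            flagged.add(event["api_key"])
    return flagged
```

Keys returned by this function are candidates for revocation; the decision to revoke still belongs to the responder.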
Scenario: PII Leakage
- Detection: `security.pii_redacted` failure in logs.
- Immediate Action: Isolate the reasoning node. Execute `rk-mem cache-clear --tenant <id>`. Notify the Privacy Officer.
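When the cache flush is scripted, the tenant ID is worth validating before it reaches the command line. The helper below is a hypothetical sketch: it only builds the argument list for the `rk-mem cache-clear --tenant <id>` command quoted above and does not run it. The validation rules are assumptions, since the runbook does not define the format of a tenant ID.

```python
from typing import List


def cache_clear_command(tenant_id: str) -> List[str]:
    """Build the argv list for flushing one tenant's cache.

    Rejects empty IDs, IDs containing whitespace, and IDs starting with "-"
    so the value cannot be misread as another option.
    """
    if (
        not tenant_id
        or tenant_id.startswith("-")
        or any(ch.isspace() for ch in tenant_id)
    ):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return ["rk-mem", "cache-clear", "--tenant", tenant_id]


if __name__ == "__main__":
    # "tenant-123" is a placeholder, not a real tenant.
    print(cache_clear_command("tenant-123"))
```

Passing the result as a list to a process launcher, rather than joining it into a shell string, avoids shell quoting issues.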
4. Communication Plan
- Internal: Join the `#war-room-reasonkit` Slack channel.
- External: Update `status.reasonkit.sh`. Send “Incident Identified” email to affected Enterprise admins.