Claude Opus 4.7 (recommended for nuanced blameless rewrites); Claude Sonnet 4.6 or GPT-4o work well for straightforward incidentsYour team resolved a P0 outage two hours ago. The on-call engineer has a Slack thread, a PagerDuty timeline, and some runbook notes — but nothing structured. You need a complete post-mortem document ready for the engineering all-hands tomorrow morning, and you have thirty minutes.Developer Tools

تولید خودکار گزارش پس‌مرگ بدون سرزنش از تایم‌لاین خام حادثه

May 28, 2026

اشتراک‌گذاری:

تولید خودکار گزارش پس‌مرگ بدون سرزنش از تایم‌لاین خام حادثه

Why this prompt matters

Post-mortems that skip the root cause / contributing factors distinction produce action items that fix symptoms, not systems. Teams that consistently write shallow post-mortems repeat the same incident categories — different service, same missing circuit breaker, same absent alert. The blameless framing is also not optional: post-mortem cultures where engineers feel implicitly blamed for outages produce teams that under-report near-misses, which means preventable P0s go undetected until they actually happen.

What we use it for

Your team resolved a P0 outage two hours ago. The on-call engineer has a Slack thread, a PagerDuty timeline, and some runbook notes — but nothing structured. You need a complete post-mortem document ready for the engineering all-hands tomorrow morning, and you have thirty minutes.

Prompt

Act as a senior site reliability engineer with extensive experience writing blameless post-mortems for high-traffic systems.

Context:
The following incident has been resolved. You have been given a raw timeline of events, the contributing factors identified during the retrospective, and the remediation steps the team took.

Incident details:
- Service affected: [SERVICE NAME, e.g., "Payment API", "User Authentication Service"]
- Severity: [P0/P1/P2]
- Duration: [START TIME] to [END TIME] ([TOTAL DURATION])
- Customer impact: [DESCRIBE IMPACT, e.g., "100% of checkout requests failed for 47 minutes"]
- Raw timeline / notes: [PASTE YOUR INCIDENT NOTES, SLACK THREAD, OR RUNBOOK ENTRIES HERE]

Task:
Write a complete, professional blameless post-mortem document following Google's SRE post-mortem culture principles. The document must identify system failures and process gaps — never individual blame.

Constraints:
- Use blameless language throughout. Say "the deployment pipeline did not have a gate for X" not "the engineer forgot to check X"
- Distinguish between root cause (the fundamental system or process failure) and contributing factors (conditions that allowed the root cause to have impact)
- Action items must be specific, ownable, and measurable — not vague ("improve monitoring")
- Do not pad the timeline. Only include events that affected the incident trajectory
- If the raw notes contain blame language, neutralize it in the post-mortem

Output Format:
**Incident Post-Mortem: [INCIDENT TITLE]**
**Date:** [DATE]  **Severity:** [P0/P1/P2]  **Duration:** [X hours Y minutes]  **Status:** Resolved

**Executive Summary**
[2-3 sentences: what failed, for how long, customer impact, and status]

**Timeline** (all times in [TIMEZONE])
[Chronological bullet list: time → what happened / who detected it / what action was taken]

**Root Cause**
[Single paragraph identifying the fundamental system or process failure. No blame language.]

**Contributing Factors**
[Bulleted list: each factor that allowed the root cause to cause impact]

**Impact**
[Quantified: % of users affected, requests failed, revenue exposure if known, SLA breach if applicable]

**What Went Well**
[Honest assessment: detection speed, communication, rollback success, etc.]

**What Went Poorly**
[Honest, blameless assessment: slow escalation paths, missing runbooks, unclear ownership, etc.]

**Action Items**
| Priority | Action | Owner Role | Due |
|---|---|---|---|
| P0 | [Specific action] | [Role, not name] | [DATE] |

**Lessons Learned**
[2-3 sentences: the systemic lesson this incident teaches, framed for future decision-making]

Result

**Incident Post-Mortem: Payment API Outage — Connection Pool Exhaustion** **Date:** 2026-05-14 **Severity:** P0 **Duration:** 1 hour 23 minutes **Status:** Resolved

**Executive Summary** A deployment of Payment API v4.17.2 introduced a database query that did not release connections under timeout conditions, exhausting the PostgreSQL connection pool within 34 minutes of deploy. Checkout requests failed for 100% of users between 14:02 and 15:25 UTC. No payment data was lost; all in-flight transactions were rolled back cleanly by PostgreSQL.

**Timeline** (all times UTC) - 13:31 — Payment API v4.17.2 deployed to production via automated pipeline. Health checks passed. - 13:58 — First spike in payment_api_p99_latency (180ms → 2.4s). Alert fires to on-call channel. - 14:02 — db_pool_available_connections hits zero. Checkout endpoint returns 503 for all requests. - 14:09 — On-call engineer identifies connection pool exhaustion in Datadog. Escalates to database team. - 14:17 — Database team identifies new query in v4.17.2 as source of leaked connections. Initiates rollback. - 14:31 — Rollback to v4.17.1 begins after second-engineer sign-off. - 14:37 — Rollback complete. Connection pool begins recovering. - 15:25 — All metrics return to baseline. Incident resolved.

**Root Cause** The new getPaymentMethodDetails() query introduced in v4.17.2 opened a database transaction but did not include a timeout handler for the network I/O path. Under elevated latency from the payment provider API (which was experiencing independent slowdowns), the transaction blocked indefinitely without releasing its connection back to the pool. With no connection limit per query type and no circuit breaker on the I/O path, the pool exhausted completely within 34 minutes of the deploy.

**Contributing Factors** - The CI pipeline had no static analysis check for connection lifecycle management in database queries - The staging environment uses a connection pool 10× smaller than production, masking this failure mode at staging load - The rollback approval process requires a second engineer sign-off with no defined fast-path escalation, adding 14 minutes to recovery time - No alert existed for connection pool availability below 20%; the first alert fired only after the pool was already fully exhausted

**Impact** - 100% of checkout requests failed for 83 minutes - ~14,200 failed checkout attempts (at pre-incident rate of 171/min) - No SLA breach (threshold: 120 minutes for P0 incidents) - No payment data loss; all in-flight transactions rolled back cleanly

**What Went Well** - Latency alerting fired within 27 minutes of deploy — well within the team's 30-minute detection target - Database team identified the specific query responsible within 8 minutes of escalation - The rollback itself executed in 6 minutes with no secondary failures or data inconsistencies

**What Went Poorly** - The rollback approval gate added 14 minutes with no fast-path for P0 incidents during peak hours - No connection pool headroom alert meant the service transitioned from 100% available to 0% with no intermediate warning - The PR review checklist does not include connection lifecycle verification for database query changes

**Action Items** | Priority | Action | Owner Role | Due | |---|---|---|---| | P0 | Add alert for db_pool_available_connections below 30% | On-Call Infrastructure | 2026-05-21 | | P0 | Define P0 fast-path rollback approval (max 5 min, auto-escalate) | Engineering Manager | 2026-05-21 | | P1 | Add CI linting rule for unguarded database transactions | Backend Platform | 2026-05-28 | | P1 | Scale staging connection pool to 1:3 ratio of production | Database Reliability | 2026-06-04 | | P2 | Add timeout handler to all payment provider I/O paths | Payments Team | 2026-06-11 |

**Lessons Learned** Connection lifecycle management requires both explicit code review verification and automated tooling — relying on developer attention alone is insufficient at current deployment frequency. The contributing factors identified here (missing alerts, undersized staging, slow rollback gates) are systemic patterns worth auditing across all critical services, not just the payment stack.

پست‌مرتم‌هایی که تحت فشار زمان نوشته می‌شوند — یا چند روز بعد از فروکش کردن آدرنالین — معمولاً سطحی از آب درمی‌آیند. این گزارش‌ها عوامل مؤثر را از قلم می‌اندازند، الگوهای سیستمی که حادثه را ممکن کرده‌اند نادیده می‌گیرند، یا بی‌صدا با انتخاب کلمات به کسی انگ می‌زنند. این پرامپت ساختاری را تحمیل می‌کند که پست‌مرتم‌ها را واقعاً مفید می‌کند: زبان بدون سرزنش، جداسازی root cause از عوامل مؤثر، و آیتم‌های عملی که آنقدر مشخص هستند که بتوان بستشان کرد.

پرامپت

این متن را کامل کپی کنید. قبل از اجرا، هر فیلد داخل براکت را با جزئیات واقعی حادثه‌تان جایگزین کنید.

Act as a senior site reliability engineer with extensive experience writing blameless post-mortems for high-traffic systems.

Context:
The following incident has been resolved. You have been given a raw timeline of events, the contributing factors identified during the retrospective, and the remediation steps the team took.

Incident details:
- Service affected: [SERVICE NAME, e.g., "Payment API", "User Authentication Service"]
- Severity: [P0/P1/P2]
- Duration: [START TIME] to [END TIME] ([TOTAL DURATION])
- Customer impact: [DESCRIBE IMPACT, e.g., "100% of checkout requests failed for 47 minutes"]
- Raw timeline / notes: [PASTE YOUR INCIDENT NOTES, SLACK THREAD, OR RUNBOOK ENTRIES HERE]

Task:
Write a complete, professional blameless post-mortem document following Google's SRE post-mortem culture principles. The document must identify system failures and process gaps — never individual blame.

Constraints:
- Use blameless language throughout. Say "the deployment pipeline did not have a gate for X" not "the engineer forgot to check X"
- Distinguish between root cause (the fundamental system or process failure) and contributing factors (conditions that allowed the root cause to have impact)
- Action items must be specific, ownable, and measurable — not vague ("improve monitoring")
- Do not pad the timeline. Only include events that affected the incident trajectory
- If the raw notes contain blame language, neutralize it in the post-mortem

Output Format:
**Incident Post-Mortem: [INCIDENT TITLE]**
**Date:** [DATE]  **Severity:** [P0/P1/P2]  **Duration:** [X hours Y minutes]  **Status:** Resolved

**Executive Summary**
[2-3 sentences: what failed, for how long, customer impact, and status]

**Timeline** (all times in [TIMEZONE])
[Chronological bullet list: time → what happened / who detected it / what action was taken]

**Root Cause**
[Single paragraph identifying the fundamental system or process failure. No blame language.]

**Contributing Factors**
[Bulleted list: each factor that allowed the root cause to cause impact]

**Impact**
[Quantified: % of users affected, requests failed, revenue exposure if known, SLA breach if applicable]

**What Went Well**
[Honest assessment: detection speed, communication, rollback success, etc.]

**What Went Poorly**
[Honest, blameless assessment: slow escalation paths, missing runbooks, unclear ownership, etc.]

**Action Items**
| Priority | Action | Owner Role | Due |
|---|---|---|---|
| P0 | [Specific action] | [Role, not name] | [DATE] |

**Lessons Learned**
[2-3 sentences: the systemic lesson this incident teaches, framed for future decision-making]

چرا هر بخش اینجاست

تمایز بین root cause و عوامل مؤثر مهم‌ترین تصمیم ساختاری در این پرامپت است. بدون این تمایز، تیم‌ها می‌نویسند «دیتابیس افتاد» به‌عنوان root cause و کار را تمام شده می‌دانند. پرامپت شما را مجبور می‌کند عمیق‌تر بروید: چه ویژگی از سیستم باعث شد که «افتادن دیتابیس» ممکن شود؟ آیا circuit breaker وجود نداشت؟ آلرت برای connection pool نبود؟ محیط staging کوچک‌تر از حدی بود که این مدل خرابی را در تست بار نشان دهد؟ اینها همان عوامل مؤثر هستند — و در واقع همان چیزهایی که باید درستشان کنید.

زبان بدون سرزنش یک محدودیت است، نه یک ترجیح لحنی. پرامپت صریحاً به مدل می‌گوید اگر در یادداشت‌های خامتان زبان سرزنش‌آمیز وجود داشت، آن را خنثی کند. این مهم است چون تردهای خام اسلک و کانال‌های حادثه پر از جمله‌هایی مثل «جان بدون مرور runbook تغییر را پوش کرد» هستند — که وقتی به صورت سیستمی بازنویسی شود، می‌شود «چک‌لیست استقرار شامل یک مرحله اجباری مرور runbook برای این نوع سرویس نبود». راه‌حل یکی است، اما یک نسخه فرهنگی می‌سازد که مهندسان نزدیک‌به‌حادثه‌ها را گزارش دهند و نسخه دیگر نه.

جدول آیتم‌های عملی عمداً با نقش (نه نام) و تاریخ سررسید فرمت شده است. پست‌مرتم‌هایی که آیتم‌های عملی را به افراد خاص نسبت می‌دهند وقتی آن فرد تیم را ترک کند از کار می‌افتند. نسبت دادن به نقش باعث می‌شود آیتم در برابر تغییرات سازمانی پایدار بماند. تاریخ سررسید اولویت‌بندی را تحمیل می‌کند؛ بدون آن، آیتم‌های P1 ماه‌ها در بک‌لاگ می‌مانند.

خروجی نمونه

در اینجا خروجی این پرامپت وقتی به یک حادثه exhaustion connection pool داده می‌شود:

Incident Post-Mortem: Payment API Outage — Connection Pool Exhaustion
Date: 2026-05-14 Severity: P0 Duration: 1 hour 23 minutes Status: Resolved

Executive Summary
یک استقرار از Payment API v4.17.2 یک query دیتابیس معرفی کرد که تحت شرایط timeout اتصالات را آزاد نمی‌کرد و در عرض ۳۴ دقیقه بعد از استقرار، connection pool PostgreSQL را کاملاً پر کرد. بین ساعت ۱۴:۰۲ تا ۱۵:۲۵ UTC درخواست‌های checkout برای ۱۰۰٪ کاربران شکست خورد. هیچ داده پرداختی از دست نرفت؛ تمام تراکنش‌های در جریان به‌طور تمیز rollback شدند.

Root Cause
query جدید getPaymentMethodDetails() یک تراکنش دیتابیس باز کرد اما برای مسیر I/O شبکه timeout handler نداشت. تحت تأخیر بالای API ارائه‌دهنده پرداخت، تراکنش بدون آزاد کردن اتصال indefinitely بلاک می‌شد. بدون محدودیت اتصال به ازای نوع query و بدون circuit breaker، pool در عرض ۳۴ دقیقه کاملاً خالی شد.

Contributing Factors

خط لوله CI هیچ بررسی استاتیکی برای مدیریت چرخه حیات اتصال در query‌های دیتابیس نداشت
محیط staging از connection pool استفاده می‌کند که ۱۰ برابر کوچک‌تر از production است و این مدل خرابی را در load testing در مقیاس staging پنهان کرد
فرآیند تأیید rollback نیاز به امضای مهندس دوم داشت که ۱۴ دقیقه به زمان بازیابی اضافه کرد بدون escalation تعریف‌شده در صورت عدم دسترسی تأییدکننده
هیچ آلرتی برای availability connection pool زیر ۲۰٪ وجود نداشت — اولین آلرت فقط بعد از اینکه pool کاملاً خالی شده بود فعال شد

Action Items (excerpt)

P0: افزودن آلرت headroom connection pool در ۳۰٪ باقی‌مانده — On-Call Infrastructure — 2026-05-21
P0: تعریف مسیر سریع تأیید rollback برای حوادث P0 (حداکثر ۵ دقیقه) — Engineering Manager — 2026-05-21
P1: افزودن قانون linting به CI برای تراکنش‌های دیتابیس بدون نگهبان — Backend Platform — 2026-05-28

چگونه بیشترین بهره را از پرامپت ببریم

کیفیت خروجی مستقیماً با کیفیت ورودی خام شما رابطه دارد. یک ترد اسلک با timestamp و جزئیات فنی تایم‌لاین دقیق‌تری نسبت به یک خلاصه پاراگرافی تولید می‌کند. اگر یادداشت‌های حادثه‌تان ناقص است، یک پاس دوم اضافه کنید: بعد از خروجی اولیه پست‌مرتم، با «حالا آیتم‌های عملی را بررسی کن — آیا هیچکدام مبهم یا غیرقابل انتساب هستند؟ اگر بله، آنها را طوری بازنویسی کن که مشخص باشند.» پیگیری کنید. مدل آنها را دقیق‌تر می‌کند.

برای انواع تکراری حادثه (خرابی‌های مربوط به استقرار، مسائل دیتابیس، خرابی APIهای شخص ثالث)، می‌توانید یک کتابخانه از پست‌مرتم‌های قبلی بسازید و از مدل بخواهید مورد جدید را مقایسه کند: «اینجا سه پست‌مرتم قبلی از سرویس پرداخت ما است. چه عوامل مؤثری در این حادثه جدید ظاهر شده که در موارد قبلی هم بود؟» این کار الگوهای سیستمی را آشکار می‌کند که پست‌مرتم‌های فردی از قلم می‌اندازند.

Claude Opus 4.7 بهترین بازنویسی‌های بدون سرزنش را تولید می‌کند — به‌طور قابل اعتمادی زبان سرزنش ظریفی را که GPT-4o گاهی باقی می‌گذارد می‌گیرد. برای حوادث ساده با یادداشت‌های خام تمیز، Claude Sonnet 4.6 یا GPT-4o سریع و کافی هستند.

productivityincident-responsepost-mortemsredevopsblameless-culture

اشتراک‌گذاری: