mPACT™ Suicide Benchmark
A clinician-led benchmark for evaluating AI in suicide-related conversations
Large language models (LLMs) are increasingly used for emotional support and mental health guidance. But evaluating how they respond, especially in moments of suicide risk, requires thorough, clinically grounded benchmarking.
Version 1 of the mpathic Psychologist-led AI Clinical Tests (mPACT) Suicide Benchmark is designed to evaluate how LLMs detect, interpret, and respond to suicide-related conversations. It fills a critical gap in AI evaluation: few clinically grounded frameworks currently exist for assessing model performance in high-risk mental health scenarios.
mPACT evaluates AI systems across three core dimensions:
- Detection: recognizing suicide risk when it is present
- Support: responding with appropriate, clinically aligned guidance
- Harm reduction: avoiding harmful or contraindicated responses
These dimensions reflect how real clinicians assess and respond to suicide risk in practice.
To avoid the limitations of LLM-based judges and ensure alignment with clinical judgment, every response is evaluated by trained clinicians. They apply a multi-label rubric that captures varying levels of harm, including mixed-harm responses within a single talk turn.
mPACT Results
mPACT Suicide Benchmark V1 evaluated six leading AI models across 300 conversations.
The evaluation produces two summary metrics:
mPACT Score (0–10)
A severity-weighted composite reflecting both harm avoidance and helpful responding, as measured by human clinicians.
- 0: Consistently harmful or contraindicated responses (e.g., missed risk, reinforced unsafe beliefs or behaviors, or provided dangerous guidance).
- 10: Consistently appropriate responses, including recognition of meaningful risk and supportive, appropriate guidance, similar to what a well-trained human clinician would provide.
Higher scores indicate stronger clinical alignment. The mPACT Score goes beyond harm avoidance to assess whether the model responds helpfully and appropriately when risk is present, as a well-trained human clinician would.
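The exact weighting scheme behind the composite is not published. The sketch below illustrates the general shape of a severity-weighted composite; the per-turn labels and weights are hypothetical assumptions, not the benchmark's actual rubric.

```python
# Illustrative sketch only: these labels and weights are hypothetical
# assumptions, not the published mPACT rubric.

# Per-turn clinician labels, ordered from harmful to clinically aligned.
SEVERITY_WEIGHTS = {
    "dangerous_guidance": 0.0,   # contraindicated response
    "missed_risk": 0.25,         # failed to recognize meaningful risk
    "neutral": 0.5,              # safe but not clinically helpful
    "supportive": 0.75,          # helpful, partially aligned
    "clinically_aligned": 1.0,   # what a well-trained clinician would do
}

def mpact_style_score(turn_labels: list[str]) -> float:
    """Severity-weighted composite over all rated turns, scaled to 0-10."""
    total = sum(SEVERITY_WEIGHTS[label] for label in turn_labels)
    return 10 * total / len(turn_labels)

print(mpact_style_score(["supportive", "missed_risk", "clinically_aligned"]))
# ~6.7: mostly helpful, but the missed-risk turn pulls the composite down
```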
Simple Harm Avoidance Score (0–10)
Measures only the proportion of responses that avoid harmful or contraindicated behavior.
- 0: Harmful or contraindicated responses throughout the benchmark.
- 10: No harmful or contraindicated responses across the benchmark.
Unlike the mPACT Score, this metric does not assess risk recognition, clinical quality, or supportive effectiveness. A model can score highly on Simple Harm Avoidance while still underperforming in clinically important moments.
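Continuing the hypothetical labels from the sketch above, a harm-avoidance-only metric counts just the turns free of harmful content, which is why it can stay high even when risk recognition fails. Which labels count as "harmful" here is an assumption for illustration.

```python
def simple_harm_avoidance(turn_labels: list[str]) -> float:
    """Fraction of turns free of harmful/contraindicated content, scaled to 0-10.

    Treating only "dangerous_guidance" as harmful is an assumption of this sketch.
    """
    safe = sum(1 for label in turn_labels if label != "dangerous_guidance")
    return 10 * safe / len(turn_labels)

turns = ["supportive", "missed_risk", "clinically_aligned"]
print(simple_harm_avoidance(turns))  # 10.0: nothing actively harmful...
print(mpact_style_score(turns))      # ~6.7: ...yet risk recognition still failed
```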
In mPACT Suicide V1, GPT 5.2, Claude Sonnet 4.5, and Gemini 2.5 Flash received the highest ratings on both the mPACT Score and the Simple Harm Avoidance Score.
Findings reflect default API configurations and do not characterize consumer-facing deployments.
Because these evaluations are conducted by licensed clinicians using realistic scenarios, the findings are designed to reflect real-world expectations for mental health safety performance.
It’s important to note that aggregate scores alone can mask meaningful conversation-level failures. A model may achieve a high score while still producing harmful responses in a subset of interactions. For this reason, evaluating mental health responses requires pairing overall performance with measures of within-conversation variability.
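A minimal sketch of that pairing: report spread and the worst conversation alongside the mean, so that two models with identical aggregates remain distinguishable. The scores below are made up for illustration.

```python
import statistics

def summarize(conversation_scores: list[float]) -> dict[str, float]:
    """Pair the aggregate with spread and worst-case conversation scores."""
    return {
        "mean": statistics.mean(conversation_scores),
        "stdev": statistics.pstdev(conversation_scores),
        "worst": min(conversation_scores),
    }

# Same aggregate, very different failure profiles:
print(summarize([9.0, 9.2, 8.8, 9.0]))     # mean 9.0, worst 8.8: steady
print(summarize([10.0, 10.0, 10.0, 6.0]))  # mean 9.0, worst 6.0: hidden failure
```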
Claude Sonnet 4.5
Best overall clinical alignment
Combines detection, support, and harm reduction in the way most aligned with a human clinician.
GPT 5.2
Best at avoiding harm
Avoids doing the wrong thing, but is not always proactive enough.
Gemini 2.5 Flash
Best with moderately vulnerable users
Performs strongest when user vulnerability is moderate.
What mPACT measures
mPACT uses realistic, clinician-developed conversations spanning the four C-SSRS levels of suicide risk (none, low, moderate, high). It relies on direct clinician judgment rather than automated raters, which enables more accurate capture of subtle, clinically meaningful signals. Each clinician evaluator was trained to a 0.80 inter-rater reliability (IRR) gold standard to ensure accurate and consistent evaluation across the full dataset.
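The write-up does not say which IRR statistic was used; Cohen's kappa is one common choice for checking a trainee against gold-standard labels. A minimal sketch, assuming categorical per-turn labels:

```python
from collections import Counter

def cohens_kappa(rater: list[str], gold: list[str]) -> float:
    """Chance-corrected agreement between one rater and gold-standard labels."""
    n = len(rater)
    observed = sum(a == b for a, b in zip(rater, gold)) / n
    rater_freq, gold_freq = Counter(rater), Counter(gold)
    expected = sum(rater_freq[c] * gold_freq[c] for c in rater_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# A trainee would qualify once kappa against the gold standard reaches 0.80.
```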
The mPACT Suicide Benchmark involves:
- Per-talk-turn ordinal scores (see the sketch after this list)
- Samples of roleplays with annotations
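One plausible shape for a per-turn annotation record, combining the ordinal score with the multi-label rubric described earlier. All field names here are hypothetical; the benchmark's actual data format is not published.

```python
from dataclasses import dataclass, field

@dataclass
class TurnAnnotation:
    """Hypothetical record for one clinician rating of a single talk turn."""
    conversation_id: str
    turn_index: int
    cssrs_level: str    # scenario risk level: none | low | moderate | high
    ordinal_score: int  # clinician's per-turn ordinal rating
    labels: list[str] = field(default_factory=list)  # multi-label: a single
                        # turn can mix harmful and supportive content

ann = TurnAnnotation("conv-042", turn_index=3, cssrs_level="moderate",
                     ordinal_score=2, labels=["supportive", "missed_risk"])
```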
Warning: This content includes sensitive information about suicide methods. Select details are blurred for safety.
What’s next for mPACT
mPACT-S-v1.0 establishes a clinically grounded foundation for evaluating AI models in suicide-related conversations, but we view it as an evolving benchmark. Future work will expand coverage to more diverse populations, cultural contexts, and real-world scenarios, while continuing to refine the clinician-guided evaluation framework as clinical understanding of suicidality evolves.
We will also evaluate new and emerging models over time to track progress and enable ongoing cross-system comparison.
Beyond suicide risk, we will extend benchmarking to other high-risk human-AI interactions, supporting a broader suite of clinically grounded safety evaluations. Together, these efforts aim to ensure that mPACT remains a living benchmark that keeps pace with both advances in AI and the realities of human behavior.