mPACT™ Methodology
mpathic Psychologist-led AI Clinical Tests (mPACT) is a clinically grounded benchmarking framework developed by mpathic, a company focused on keeping humans safe in the AI era through technology-enabled, expert-led human data and evaluation services. mpathic partners with frontier AI builders to prevent harmful or unwanted model behaviors across the full AI lifecycle, spanning use cases from user wellbeing and mental health to financial risk and customer support.
Built on this foundation, mPACT evaluates how AI systems perform in high-stakes human interactions. Developed by licensed mental health professionals, the framework measures not only whether AI systems avoid harm, but also whether they respond in ways that are actively supportive and aligned with real-world standards of care.
Clinical Foundation
mPACT is rooted in established psychological theory and clinical practice. Rather than treating AI behavior as a purely technical problem, it approaches it as a clinical one, shaped by how trained clinicians assess risk, interpret ambiguity, and respond ethically in moments of vulnerability.
These constructs inform both the design of evaluation scenarios and the scoring of model behavior, so that benchmarks are grounded in the clinical judgment, ethical reasoning, and real-world decision-making clinicians use to understand risk.
A core clinical principle underlying mPACT is harm minimization: failing to detect risk is treated as more consequential than over-detection, reflecting real-world priorities in crisis care.
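The harm-minimization principle can be expressed as an asymmetric error cost, penalizing missed risk more than over-flagging. A minimal sketch; the weights and function name are illustrative assumptions, not mPACT's actual parameters:

```python
def harm_weighted_error(false_negatives: int, false_positives: int,
                        fn_weight: float = 5.0, fp_weight: float = 1.0) -> float:
    """Asymmetric error cost: missed risk (false negatives) is penalized
    more heavily than over-detection (false positives).

    The 5:1 ratio is purely illustrative.
    """
    return fn_weight * false_negatives + fp_weight * false_positives


# Under this weighting, one missed risk signal costs more than
# four over-detections.
```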
Methodology
mPACT introduces a distinct, clinician-led approach to AI evaluation that prioritizes realism, rigor, and reproducibility.
Clinician-Authored Scenarios. All test cases are created by licensed clinicians with expertise in the relevant domain. These scenarios:
- Reflect real-world, multi-turn conversations
- Include both explicit and subtle expressions of risk
- Span a range of severity levels using validated frameworks
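A scenario with these properties might be represented as a simple record. The field names and severity scale below are illustrative assumptions, not mPACT's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical representation of a clinician-authored test case."""
    scenario_id: str
    domain: str                 # e.g. "self-harm", "eating disorders"
    severity: int               # e.g. 1 (low) to 5 (imminent), per a validated scale
    turns: list = field(default_factory=list)      # multi-turn user/assistant exchange
    risk_cues: list = field(default_factory=list)  # explicit and subtle risk signals
```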
Expert Evaluation. Unlike many benchmarks, mPACT does not rely on LLM-based judges. Instead:
- Trained clinicians evaluate model responses directly
- Evaluators are calibrated to gold-standard (> .80) inter-rater reliability
- Multi-label annotation captures both helpful and harmful behaviors within the same response
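A calibration target like >.80 is typically checked with a chance-corrected agreement statistic. The two-rater sketch below uses Cohen's kappa as one common choice; the source does not specify which reliability index mPACT uses:

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```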
Multi-Dimensional Assessment. mPACT evaluates AI behavior across three core dimensions:
- Detection: Does the model recognize risk signals?
- Interpretation: Does it correctly understand severity and context?
- Response: Does it respond in a supportive and clinically appropriate way?
Severity-Weighted Scoring. mPACT introduces a composite scoring system that:
- Weighs performance more heavily in higher-risk scenarios
- Distinguishes between harm avoidance and active safety promotion
- Enables consistent comparison across models and over time
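One simple way to realize such a composite is a weighted mean in which higher-severity scenarios carry larger weights. This is an illustrative sketch, not mPACT's published formula:

```python
def composite_score(scored_scenarios: list) -> float:
    """Severity-weighted mean of per-scenario scores in [0, 1].

    scored_scenarios: list of (score, severity_weight) pairs; higher-severity
    scenarios carry larger weights and so dominate the composite index.
    """
    total_weight = sum(weight for _, weight in scored_scenarios)
    return sum(score * weight for score, weight in scored_scenarios) / total_weight
```

Under this weighting, a perfect response in a high-severity scenario outweighs a failure in a low-severity one, which supports consistent comparison across models as long as the scenario set and weights stay fixed.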
Key Differentiators
mPACT represents a shift from conventional AI benchmarks in several important ways:
- Clinician-Led, End-to-End: From scenario creation to evaluation, licensed clinicians are embedded throughout the process.
- Realistic, Multi-Turn Conversations: mPACT does not rely solely on composite or end-of-conversation safety judgments. It evaluates model behavior at the talk-turn level, capturing how risk signals emerge, escalate, or are missed across the interaction.
- Beyond Harm Avoidance: Most benchmarks focus on preventing harmful outputs. mPACT also measures whether systems validate user experience, assess risk appropriately, and provide supportive, aligned responses.
- Human Evaluation Over AI Judges: By relying on clinician raters instead of LLM evaluators, mPACT prioritizes clinical validity and reduces bias and drift.
- Standardized, Comparable Index: mPACT provides a reproducible scoring system that supports cross-model comparisons, regulatory evaluation, and longitudinal tracking of improvements.
Credits
mPACT is built through close collaboration between clinicians, researchers, and engineers.
- Research Team: Developed methodology, scoring systems, and analysis
- Fuji Robledo Yamamoto, PhD, Alisa A. Breetz, PhD, Jasmin Acevedo, DSW, Shamal Lalvani, PhD, Caraline Bruzinski, MS, Megan Greenlaw, MA, Xieyining Huang, PhD, Valerie Douglass, PhD, Maggie Gorraiz, PhD, Danielle A. Schlosser, PhD, Sarah Peregrine Lord, PsyD & Alison Cerezo, PhD
- Engineering & Product Teams: Built infrastructure and execution workflows
- Brian Williams, Amber Jolley-Paige, PhD, Luke Lovett, Matt Cox, Kruti Dickson, Natalie Wrona, Johnny O’Connell, Lucy Tate, Lina Jones, Matthew Borgman
- Clinical Contributors: 75+ licensed mental health professionals who designed scenarios and conducted evaluations
- External Reviewers: Independently reviewed by clinical leaders and subject matter experts, including:
- Jessica Jackson, PhD, Founder & CEO, Therapy Is For Everyone Psychological & Consultation Services, PLLC; Caroline Figueroa, MD, PhD, Stanford University and Delft University of Technology; Adrian Aguilera, PhD, Chancellor's Professor, UC Berkeley; Dr. Ursula Whiteside, CEO, NowMattersNow.org; Ellen E. Fitzsimmons-Craft, PhD, FAED, LP, and Denise Wilfley, PhD, Center for Healthy Weight and Wellness, Washington University School of Medicine
Their combined expertise enables mPACT to bridge clinical rigor with scalable AI evaluation.