mPACT™ Methodology
mpathic Psychologist-led AI Clinical Tests (mPACT) is a clinically grounded benchmarking framework developed by mpathic, a company focused on keeping humans safe in the AI era through technology-enabled, expert-led human data and evaluation services. mpathic partners with frontier AI builders to prevent harmful or unwanted model behaviors across the full AI lifecycle, spanning use cases from user wellbeing and mental health to financial risk and customer support.
Built on this foundation, mPACT evaluates how AI systems perform in high-stakes human interactions. Developed by licensed mental health professionals, the framework measures not only whether AI systems avoid harm, but also whether they respond in ways that are actively supportive and aligned with real-world standards of care.
Clinical Foundation
mPACT is rooted in established psychological theory and clinical practice. Rather than treating AI behavior as a purely technical problem, it approaches it as a clinical one, shaped by how trained clinicians assess risk, interpret ambiguity, and respond ethically in moments of vulnerability.
These constructs inform both the design of evaluation scenarios and the scoring of model behavior, so that benchmarks are grounded in the clinical judgment, ethical reasoning, and real-world decision-making clinicians use to understand risk.
A core clinical principle underlying mPACT is harm minimization: failing to detect risk is treated as more consequential than over-detection, reflecting real-world priorities in crisis care.
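The harm-minimization principle can be expressed as an asymmetric error cost, penalizing missed risk more than over-flagging. A minimal sketch; the weights and function name are illustrative assumptions, not mPACT's actual parameters:

```python
def harm_weighted_error(false_negatives: int, false_positives: int,
                        fn_weight: float = 5.0, fp_weight: float = 1.0) -> float:
    """Asymmetric error cost: missed risk (false negatives) is penalized
    more heavily than over-detection (false positives).

    The 5:1 ratio is purely illustrative.
    """
    return fn_weight * false_negatives + fp_weight * false_positives


# Under this weighting, one missed risk signal costs more than
# four over-detections.
```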
Methodology
mPACT introduces a distinct, clinician-led approach to AI evaluation that prioritizes realism, rigor, and reproducibility.
Clinician-Authored Scenarios. All test cases are created by licensed clinicians with expertise in the relevant domain. These scenarios:
- Reflect real-world, multi-turn conversations
- Include both explicit and subtle expressions of risk
- Span a range of severity levels using validated frameworks
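A scenario with these properties might be represented as a simple record. The field names and severity scale below are illustrative assumptions, not mPACT's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical representation of a clinician-authored test case."""
    scenario_id: str
    domain: str                 # e.g. "self-harm", "eating disorders"
    severity: int               # e.g. 1 (low) to 5 (imminent), per a validated scale
    turns: list = field(default_factory=list)      # multi-turn user/assistant exchange
    risk_cues: list = field(default_factory=list)  # explicit and subtle risk signals
```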
Expert Evaluation. Unlike many benchmarks, mPACT does not rely on LLM-based judges. Instead:
- Trained clinicians evaluate model responses directly
- Evaluators are calibrated to gold-standard (> .80) inter-rater reliability
- Multi-label annotation captures both helpful and harmful behaviors within the same response
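A calibration target like >.80 is typically checked with a chance-corrected agreement statistic. The two-rater sketch below uses Cohen's kappa as one common choice; the source does not specify which reliability index mPACT uses:

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```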
Multi-Dimensional Assessment. mPACT evaluates AI behavior across three core dimensions:
- Detection: Does the model recognize risk signals?
- Interpretation: Does it correctly understand severity and context?
- Response: Does it respond in a supportive and clinically appropriate way?
Severity-Weighted Scoring. mPACT introduces a composite scoring system that:
- Weighs performance more heavily in higher-risk scenarios
- Distinguishes between harm avoidance and active safety promotion
- Enables consistent comparison across models and over time
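One simple way to realize such a composite is a weighted mean in which higher-severity scenarios carry larger weights. This is an illustrative sketch, not mPACT's published formula:

```python
def composite_score(scored_scenarios: list) -> float:
    """Severity-weighted mean of per-scenario scores in [0, 1].

    scored_scenarios: list of (score, severity_weight) pairs; higher-severity
    scenarios carry larger weights and so dominate the composite index.
    """
    total_weight = sum(weight for _, weight in scored_scenarios)
    return sum(score * weight for score, weight in scored_scenarios) / total_weight
```

Under this weighting, a perfect response in a high-severity scenario outweighs a failure in a low-severity one, which supports consistent comparison across models as long as the scenario set and weights stay fixed.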
Key Differentiators
mPACT represents a shift from conventional AI benchmarks in several important ways:
- Clinician-Led, End-to-End: From scenario creation to evaluation, licensed clinicians are embedded throughout the process.
- Realistic, Multi-Turn Conversations: mPACT does not rely solely on composite or end-of-conversation safety judgments. It evaluates model behavior at the talk-turn level, capturing how risk signals emerge, escalate, or are missed across the interaction.
- Beyond Harm Avoidance: Most benchmarks focus on preventing harmful outputs. mPACT also measures whether systems validate user experience, assess risk appropriately, and provide supportive, aligned responses.
- Human Evaluation Over AI Judges: By relying on clinician raters instead of LLM evaluators, mPACT prioritizes clinical validity and reduces bias and drift.
- Standardized, Comparable Index: mPACT provides a reproducible scoring system that supports cross-model comparisons, regulatory evaluation, and longitudinal tracking of improvements.
Credits
mPACT is built through close collaboration between clinicians, researchers, and engineers.
- Research Team: Developed methodology, scoring systems, and analysis
- Fuji Robledo Yamamoto, PhD, Alisa A. Breetz, PhD, Jasmin Acevedo, DSW, Shamal Lalvani, PhD, Caraline Bruzinski, MS, Megan Greenlaw, MA, Xieyining Huang, PhD, Valerie Douglass, PhD, Maggie Gorraiz, PhD, Danielle A. Schlosser, PhD, Sarah Peregrine Lord, PsyD & Alison Cerezo, PhD
- Engineering & Product Teams: Built infrastructure and execution workflows
- Brian Williams, Amber Jolley-Paige, PhD, Luke Lovett, Matt Cox, Kruti Dickson, Natalie Wrona, Johnny O’Connell, Lucy Tate, Lina Jones, Matthew Borgman
- Clinical Contributors: 75+ licensed mental health professionals who designed scenarios and conducted evaluations
- External Reviewers: Independently reviewed by clinical leaders and subject matter experts, including:
- Jessica Jackson, PhD, Founder & CEO, Therapy Is For Everyone Psychological & Consultation Services, PLLC; Caroline Figueroa, MD, PhD, Stanford University and Delft University of Technology; Adrian Aguilera, PhD, Chancellor's Professor, UC Berkeley; Dr. Ursula Whiteside, CEO, NowMattersNow.org; Ellen E. Fitzsimmons-Craft, PhD, FAED, LP, and Denise Wilfley, PhD, Center for Healthy Weight and Wellness, Washington University School of Medicine
Their combined expertise enables mPACT to bridge clinical rigor with scalable AI evaluation.