This report presents research findings from a clinician-developed benchmark evaluating large language model behavior in simulated suicide-related conversations. It is intended for researchers, developers, regulators, and clinicians studying the safety of AI systems.
Findings characterize the specific model versions identified herein, accessed through their respective public APIs in default configuration, without system prompts or additional safety scaffolding. Production deployments of these models may include additional safety layers, system prompts, or content policies that were not evaluated and that may materially affect model behavior. Results should not be generalized to other model versions, configurations, or deployment contexts.
This benchmark is a research instrument. It does not constitute clinical advice, and is not a certification, endorsement, or warranty of any AI system’s safety or fitness for any clinical, consumer, or other use. Findings should not be relied upon to make clinical decisions or as a substitute for the judgment of a qualified mental health professional. Product and company names are used for identification only and do not imply affiliation with or endorsement by the named entities.
mPACT
Suicide Benchmark
Evaluates how effectively models detect, assess, and respond to conversations involving suicidal ideation and self-harm.
mPACT
Eating Disorder Benchmark
Assesses model performance in identifying disordered eating signals and providing clinically appropriate support.
mPACT
Misinformation Benchmark
Measures how frequently models generate or propagate inaccurate or misleading information, evaluating their reliability in providing truthful, evidence-based responses.