Endorsements for mPACT

Without clinically grounded, standardized benchmarks, it’s incredibly difficult to distinguish what’s actually safe and effective from what simply sounds good, especially in high-stakes mental health contexts. A framework like the one developed by mpathic matters because it anchors evaluation in real-world clinical complexity, not idealized scenarios, and creates accountability in a space where claims can otherwise outpace evidence.

What stands out to me is that this approach doesn’t remove humans from the process; it centers them. Bringing together trained clinicians, lived experience, and mpathic’s track record of equity-informed methodology reflects how care should actually work. As a psychologist, I trust a framework shaped by that level of clinical rigor far more than one relying solely on automated or LLM-based judgment.

Jessica Jackson, Ph.D., Founder & CEO of Therapy Is For Everyone Psychological & Consultation Services, PLLC


mpathic’s benchmarking work is crucial because we still lack comprehensive, evidence-based, scalable, and clinically grounded frameworks, even though millions of people are using AI for high-stakes interactions like mental health support. We need benchmarks like mpathic’s that measure AI models against multi-dimensional risks and clinical evidence across mental health and well-being domains, with evaluations conducted by clinicians who have the right experience. mpathic’s exceptionally high safety standards, developed with hundreds of experts, can help companies build safer products and, importantly, keep people safe during real-world AI interactions.

mpathic’s methodology passes my standards for clinical best practices because they work with hundreds of clinicians, use AI scenarios close to real life, and draw on evidence-based clinical frameworks and treatment guidelines. They are also unique because they tackle complex, multi-dimensional AI safety issues and evaluate a range of risks, including those that are subtle or incomplete. For example, they don’t only look at whether models avoid harmful responses, but also at whether models actively promote safety and show sensitivity to varying risk scenarios, and they are mindful of the risk of harm caused by biases. Theirs is also the first framework to use clinician-developed evaluation rubrics and multi-turn conversational contexts, and to rely on direct human clinical evaluation.

They are developing an impressive, clinically grounded ecosystem for AI safety evaluation across suicide risk, eating disorders, misinformation, and other emerging areas of risk. Overall, mpathic’s work can have a major impact on keeping people safe in the real world and helping them thrive alongside AI.

Caroline Figueroa, MD, PhD, Stanford University; Delft University of Technology


At a time of increasing public use of LLMs for mental health support, we need to move from criticism to action in understanding and improving safety. Suicidality poses one of the biggest challenges to using LLMs in this context, making it crucial to ensure that expert humans remain in the loop when making critical decisions around responding to suicidal ideation and behavior. mPACT relies on expert clinicians interacting with the LLM and simulating a wide range of patients, making it more of a real-world stress test than purely technical evaluations. This kind of rigorous, clinically grounded benchmarking enables specific, standardized assessment that can lead to improved tools and help ensure benefits outweigh harms. Notably, mPACT rightly focuses on suicidality and assesses not only whether responses are safe but also whether they are genuinely helpful.

This rigorous study represents an important leap in stress-testing and understanding the safety of LLMs for mental health. The system is both rigorous and generalizable because it represents a broad range of providers, assesses multiple aspects of suicidality reporting, and evaluates both harmful and helpful responses. Importantly, the methodology is informed by the Columbia Suicide Severity Rating Scale, which is the standard in clinical practice, grounding the benchmark in established, evidence-based assessment. By assessing multiple dimensions of suicidality and evaluating responses along a spectrum from harmful to helpful, mPACT reflects the nuanced clinical judgment required in these high-stakes situations. As both a clinical psychologist and a digital health intervention developer and researcher, I continually struggle to balance the positive potential of technologies with their unintended negative consequences. mPACT represents exactly the kind of rigorous, safety-focused work needed to help the field strike that balance.

Adrian Aguilar, PhD, Chancellor’s Professor, UC Berkeley


When someone is struggling, small differences in how a chatbot response lands can matter a lot. What I appreciate here is the focus on how AI actually shows up in those moments and whether it helps people stay safer in real time.

People use AI to make sense of what they are feeling and to decide what to do next. That makes it critical to evaluate how responses can shape understanding, judgment, and action over time. This kind of clinically grounded benchmarking is an important step toward safer real-world use.

The field has needed a benchmark that treats safety as a clinical standard rather than a technical constraint. By weighting higher-risk scenarios more heavily and focusing on how models respond over time, this framework moves us toward a more meaningful definition of safety, one that reflects the stakes for real people.

Anyone who has been suicidal or sat with a suicidal person knows how rarely risk shows up in clean, obvious ways. What stands out here is the commitment to testing AI against the kind of messy, ambiguous conversations that happen in real life, and holding systems accountable for responses that are genuinely supportive.

What gives this approach credibility is that clinicians are embedded throughout, from scenario design to evaluation, and that performance is assessed across detection, interpretation, and response. That reflects how real clinical decisions are made. It is a more rigorous and clinically aligned way to evaluate safety than relying on surface-level or automated judgments.

Dr. Ursula Whiteside, CEO, NowMattersNow.org