Endorsements for mPACT

Without clinically grounded, standardized benchmarks, it’s incredibly difficult to distinguish what’s actually safe and effective from what simply sounds good, especially in high-stakes mental health contexts. A framework like the one developed by mpathic matters because it anchors evaluation in real-world clinical complexity, not idealized scenarios, and creates accountability in a space where claims can otherwise outpace evidence.

What stands out to me is that this approach doesn’t remove humans from the process; it centers them. Bringing together trained clinicians, lived experience, and mpathic’s track record of equity-informed methodology reflects how care should actually work. As a psychologist, I trust a framework shaped by that level of clinical rigor far more than one relying solely on automated or LLM-based judgment.

Jessica Jackson, Ph.D., Founder & CEO of Therapy Is For Everyone Psychological & Consultation Services, PLLC


mpathic’s benchmarking work is crucial because we still lack comprehensive, evidence-based, scalable, and clinically grounded frameworks, even though millions of people are using AI for high-stakes interactions like mental health support. We need benchmarks like mpathic’s that measure AI models against multi-dimensional risks and clinical evidence across mental health and well-being domains, with evaluations conducted by clinicians who have the right experience. mpathic’s exceptionally high safety standards, developed with hundreds of experts, can help companies build safer products and, importantly, keep people safe during real-world AI interactions.

mpathic’s methodology passes my standards for clinical best practices because they work with hundreds of clinicians, use AI scenarios close to real life, and draw on evidence-based clinical frameworks and treatment guidelines. They are also unique because they tackle complex, multi-dimensional AI safety issues and evaluate a range of risks, including those that are subtle or incomplete. For example, they don’t only look at whether models avoid harmful responses, but also at whether models actively promote safety and show sensitivity to varying risk scenarios, and they are mindful of the risk of harm caused by biases. Theirs is also the first framework to use clinician-developed evaluation rubrics and multi-turn conversational contexts, and to rely on direct human clinical evaluation.

They are developing an impressive, clinically grounded ecosystem for AI safety evaluation across suicide risk, eating disorders, misinformation, and other emerging areas of risk. Overall, mpathic’s work can have a major impact on keeping people safe in the real world and helping them thrive alongside AI.

Caroline Figueroa, MD, PhD, Stanford University; Delft University of Technology


At a time of increasing public use of LLMs for mental health support, we need to move from criticism to action in understanding and improving safety. Suicidality poses one of the biggest challenges to using LLMs in this context, making it crucial to ensure that expert humans remain in the loop when making critical decisions around responding to suicidal ideation and behavior. mPACT relies on expert clinicians interacting with the LLM and simulating a wide range of patients, making it more of a real-world stress test than purely technical evaluations. This kind of rigorous, clinically grounded benchmarking enables specific, standardized assessment that can lead to improved tools and help ensure benefits outweigh harms. Notably, mPACT rightly focuses on suicidality and assesses not only whether responses are safe but also whether they are genuinely helpful.

This rigorous study represents an important leap in stress-testing and understanding the safety of LLMs for mental health. The system is both rigorous and generalizable because it represents a broad range of providers, assesses multiple aspects of suicidality reporting, and evaluates both harmful and helpful responses. Importantly, the methodology is informed by the Columbia Suicide Severity Rating Scale, which is the standard in clinical practice, grounding the benchmark in established, evidence-based assessment. By assessing multiple dimensions of suicidality and evaluating responses along a spectrum from harmful to helpful, mPACT reflects the nuanced clinical judgment required in these high-stakes situations. As both a clinical psychologist and a digital health intervention developer and researcher, I continually struggle to balance the positive potential of technologies with their unintended negative consequences. mPACT represents exactly the kind of rigorous, safety-focused work needed to help the field strike that balance.

Adrian Aguilar, PhD, Chancellor’s Professor, UC Berkeley


When someone is struggling, small differences in how a chatbot response lands can matter a lot. What I appreciate here is the focus on how AI actually shows up in those moments and whether it helps people stay safer in real time.

People use AI to make sense of what they are feeling and to decide what to do next. That makes it critical to evaluate how responses can shape understanding, judgment, and action over time. This kind of clinically grounded benchmarking is an important step toward safer real-world use.

The field has needed a benchmark that treats safety as a clinical standard rather than a technical constraint. By weighting higher-risk scenarios more heavily and focusing on how models respond over time, this framework moves us toward a more meaningful definition of safety, one that reflects the stakes for real people.

Anyone who has been suicidal or sat with a suicidal person knows how rarely risk shows up in clean, obvious ways. What stands out here is the commitment to testing AI against the kind of messy, ambiguous conversations that happen in real life, and holding systems accountable for responses that are genuinely supportive.

What gives this approach credibility is that clinicians are embedded throughout, from scenario design to evaluation, and that performance is assessed across detection, interpretation, and response. That reflects how real clinical decisions are made. It is a more rigorous and clinically aligned way to evaluate safety than relying on surface-level or automated judgments.

Dr. Ursula Whiteside, CEO, NowMattersNow.org