The methodology of mPACT

Establishing a new benchmark for model behavior in high-stakes domains


“How do I lose weight fast?” “How do I keep going when things get hard?” “Can I trust what AI tells me?”

Every day, millions of people turn to large language models (LLMs) with questions like these, some with potentially critical consequences. But at times, AI systems falter, lacking the sensitivity and clinical expertise to respond appropriately.

As media headlines regularly attest, this is an immediate and widespread issue.

mpathic Psychologist-led AI Clinical Tests (mPACT) is an expert-led benchmarking framework designed to identify where models fall short and to inform how risk can be minimized as people increasingly rely on these LLMs.

Model limitations: what we’re seeing so far

Over the first three mPACT benchmarks — suicide, eating disorders (ED) and misinformation — several model shortcomings have come into focus. Human behavior is complex, motivations can be murky and, particularly in high-risk domains like ED, signals are often indirect and hard to detect.

But missing these signals can have critical consequences, and AI systems often lack the judgment and nuanced discernment needed to avoid harm. mPACT benchmarks probe models' ability to assess severity and contextual risk.

While LLMs often avoided overt harm, several struggled to meet clinical best practices, responding in ways that were not consistently clinically appropriate or that did not do enough to minimize risk. Failure modes often took the form of overreach, sycophancy and the sharing of misinformation.

Benchmarks built with clinical rigor

mPACT benchmarks are grounded in established psychological standards and clinical practices. Instead of treating risk as a strictly technical consideration, they examine it through a more comprehensive clinical lens — one shaped by how clinicians assess risk, interpret ambiguity and respond in moments of vulnerability.

These constructs inform the design of evaluation scenarios as well as how model behavior is scored, ensuring frameworks align with the clinical judgment, ethical reasoning and existing standards that clinicians rely on to understand risk. 

Jessica Jackson, Ph.D. and Founder & CEO of Therapy Is For Everyone Psychological & Consultation Services, PLLC, reviewed mPACT, praising its approach: “Without clinically grounded, standardized benchmarks, it’s incredibly difficult to distinguish what’s actually safe and effective from what simply sounds good, especially in high-stakes mental health contexts. A framework like the one developed by mpathic matters because it anchors evaluation in real-world clinical complexity, not idealized scenarios, and creates accountability in a space where claims can otherwise outpace evidence.”

A clinically grounded methodology

mPACT represents a shift from conventional benchmarks in several key respects, prioritizing realism, rigor and reproducibility across all tasks. This means:

  • Clinician-led, end-to-end: From scenario creation to evaluation, licensed clinicians are embedded throughout the process. Tested scenarios include both explicit and subtle expressions of risk, using validated frameworks to span a range of severity levels.
  • Human evaluation over AI judges: By relying on clinician raters instead of LLMs, mPACT elevates clinical validity and reduces bias and drift. Evaluators are calibrated to gold-standard (> .80) inter-rater reliability (IRR), and multilabel annotation captures helpful and harmful behaviors within the same response.
  • Realistic, multiturn conversations: mPACT benchmarks don’t rely solely on composite or end-of-conversation safety judgments. Instead, they evaluate model behavior at the talk-turn or utterance level, capturing how risk signals emerge, escalate or are missed across the interaction.
  • Multidimensional assessment: mPACT evaluates AI behavior across three core dimensions. Detection: Does the model recognize risk signals? Interpretation: Does it correctly understand severity and context? Response: Does it respond in a safe, supportive and clinically appropriate way?
  • Beyond harm avoidance: Most benchmarks focus on preventing harmful outputs. mPACT also measures whether systems validate user experience, assess risk appropriately and provide appropriate escalations and resources. 
  • Standardized, comparable index: mPACT provides a reproducible severity-weighted scoring system that supports cross-model comparisons and longitudinal tracking of safety improvements.
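To make the multilabel, turn-level scoring idea above concrete, here is a minimal sketch of a severity-weighted conversation index. The label names, severity weights, and aggregation rule are all illustrative assumptions, not mpathic's actual scheme; the point is only that one response can earn credit and penalty at once, and that per-turn scores roll up into a comparable index.

```python
# Hypothetical sketch of a severity-weighted safety index in the spirit of
# mPACT's scoring. Labels, weights, and aggregation are illustrative only.

SEVERITY_WEIGHTS = {  # higher weight = more severe failure mode
    "misses_risk_signal": 3.0,
    "misinformation": 3.0,
    "sycophancy": 2.0,
    "overreach": 1.5,
}
HELPFUL_LABELS = {"validates_experience", "appropriate_escalation", "shares_resources"}

def turn_score(labels: set[str]) -> float:
    """Score one talk-turn from its multilabel annotation.

    Helpful labels add credit; harmful labels subtract severity-weighted
    penalties, so a single response can be credited and penalized at once.
    """
    credit = sum(1.0 for label in labels if label in HELPFUL_LABELS)
    penalty = sum(SEVERITY_WEIGHTS.get(label, 0.0) for label in labels)
    return credit - penalty

def conversation_index(turn_labels: list[set[str]]) -> float:
    """Average turn-level scores into a per-conversation index that
    supports cross-model comparison and longitudinal tracking."""
    if not turn_labels:
        return 0.0
    return sum(turn_score(t) for t in turn_labels) / len(turn_labels)

# A three-turn conversation: supportive, then sycophantic, then a well-handled
# escalation. Turn scores: +1.0, -2.0, +2.0 → index 0.33.
convo = [
    {"validates_experience"},
    {"sycophancy"},
    {"appropriate_escalation", "shares_resources"},
]
print(round(conversation_index(convo), 2))  # → 0.33
```

Scoring at the turn level rather than only at conversation end is what lets an evaluation show *where* in an interaction a risk signal was missed or an escalation arrived late, rather than collapsing everything into a single pass/fail verdict.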

What we’ve found — and what comes next

Across mPACT’s suicide, eating disorders and misinformation benchmarks, leading LLMs generally avoided overtly harmful behavior. Of the six models we ranked — including widely used versions of Claude Sonnet, ChatGPT, Gemini, LLaMA, Mistral and Grok — each demonstrated a variety of strengths and weaknesses, with that variation showing up most clearly when signals were indirect or context-dependent. 

Unlike other benchmarks, mPACT's evaluations reflect the complexity of real human-AI interactions, and its frameworks make risky behaviors more reliably measurable, highlighting where AI systems still have room to improve beyond basic harm avoidance.

From here, we’ll evolve the first three benchmarks to include more diverse populations, cultural contexts and real-world scenarios while refining the evaluation frameworks to keep pace with the latest clinical understanding of each domain. We’ll also examine new domains, continuing to look for ways to minimize risk and, in keeping with mpathic’s mission, keep humans safe in the AI era.
