mPACT™ Eating Disorders Benchmark
A clinician-led benchmark for evaluating AI models in eating disorder–related conversations
Large language models (LLMs) are increasingly used for dieting advice, fitness guidance, and mental health support. However, evaluating whether these systems respond appropriately in eating disorder contexts presents a distinct challenge. Unlike other risk domains, eating disorder risk is often indirect, normalized, and embedded within culturally reinforced ideas of health, discipline, and self-improvement.
Version 1 of the mpathic Psychologist-led AI Clinical Tests (mPACT) Eating Disorders Benchmark is designed to evaluate how LLMs detect, interpret, and respond to disordered eating–related conversations. It addresses a critical gap in AI evaluation: the lack of clinically grounded frameworks capable of capturing subtle, context-dependent eating disorder risk signals.
What mPACT-ED measures
mPACT-ED evaluates AI systems across three core dimensions.
These dimensions reflect how clinicians assess and respond to eating disorder pathology in real-world settings.
mPACT Eating Disorders Results
mPACT Eating Disorders Benchmark V1 evaluated six leading AI models, resulting in a single summary metric:
mPACT Score (0–10)
A severity-weighted composite reflecting both harm avoidance and helpful responding, as measured by human clinicians.
- 0: Consistently harmful or contraindicated responses (e.g., missed risk, reinforced unsafe beliefs or behaviors, or provided dangerous guidance).
- 10: Consistently appropriate responses, including recognition of meaningful risk and supportive, appropriate guidance, similar to what a well-trained human clinician would provide.
Higher scores indicate stronger clinical alignment. The mPACT Score goes beyond harm avoidance to assess whether the model responds helpfully and appropriately, similar to a human clinician, when risk is present.
Across models, overall performance falls within a relatively narrow range, clustering around a neutral baseline. This suggests that most systems are generally capable of avoiding overtly harmful behavior.
Findings reflect default API configurations and do not characterize consumer-facing deployments.
However, aggregate performance masks important differences in how models respond to different levels of risk.
When examined more granularly, several consistent patterns emerge across models:
Eating disorder risk is most difficult to detect when user behavior appears “normal.”
Harm is often subtle, not explicit.
Harmful responses often appear through normalization of restrictive behavior, reinforcement of weight-loss goals, validation of maladaptive beliefs, or framing pathology as discipline or wellness. These responses may appear supportive while still reinforcing disordered eating.
Harm avoidance is not the same as clinical best practices.
Some responses avoided clearly harmful content but failed to recognize eating disorder risk, challenge maladaptive beliefs, or meaningfully intervene.
Mixed responses are common.
Many responses contained both helpful and harmful elements within the same turn. This reflects a core limitation of binary evaluation frameworks and underscores the need for multi-label, clinician-informed evaluation.
Claude Sonnet 4.5
Best overall clinical alignment
Most consistently helpful responses, with the lowest rates of harmful behavior
Llama 3.3 70B
Consistent across varying user vulnerability
Lower variability than most models, but less proactive clinical depth
Gemini 2.5 Flash
Strong on explicit risk
Performs well in high and acute risk scenarios
GPT 5.2
Clinically aligned behaviors
Frequent use of clinically aligned behaviors, such as encouraging support-seeking and discouraging risky behaviors.
Why Eating Disorders as a Benchmark
Eating disorders are among the most lethal psychiatric conditions, with elevated mortality driven by both medical complications and suicide risk. At the same time, only a minority of individuals receive treatment, owing to barriers to access and the fact that these behaviors often feel intentional or even desirable to the person.
LLMs are increasingly used in domains closely tied to eating disorder risk—including dieting, weight loss, and body image—where harmful behaviors may be normalized or encouraged.
Real-world failures have shown that AI systems can:
- Promote restrictive eating or weight-loss behaviors
- Reinforce maladaptive beliefs about food and body image
- Provide guidance that appears “healthy” but is not clinically recommended
These risks highlight the need for clinically grounded, domain-specific evaluation frameworks.
Why disordered eating signals are often missed by AI
Eating disorder harm avoidance differs from other domains in several key ways:
- Risk is often indirect and ambiguous. Users may frame harmful behaviors as “discipline,” “health,” or “optimization.”
- Symptoms are ego-syntonic. Individuals may not view behaviors as problematic and may seek reinforcement.
- Cultural norms blur the line. Dieting, weight loss, and restriction are socially accepted and often encouraged.
- Harm can be subtle. Responses that appear supportive may reinforce pathology (e.g., validating restriction or weight goals).
These dynamics make detection, interpretation, and response generation significantly more complex than in domains with clearer signals.
Benchmark design
mPACT-ED uses realistic, clinician-authored conversations spanning five levels of eating disorder risk: No evidence of risk, Low risk, Moderate risk, High risk, and Acute/Medical risk.
The benchmark includes:
- Multi-turn roleplays reflecting real-world interactions
- Conversations developed by licensed clinicians with eating disorder expertise
- Diverse personas and real-world stressors
- Both direct and indirect expressions of pathology
Unlike benchmarks that rely on LLM-generated prompts or automated judges, mPACT-ED uses fully human-generated conversations, clinician-driven evaluation and structured, multi-label annotation at the talk-turn level. This approach enables detection of subtle and clinically meaningful patterns, including cases where helpful and harmful behaviors co-occur within a single response.
Scoring framework
Each response is evaluated using multiple clinical labels (e.g., supportive behaviors, harmful guidance, missed risk). Because responses often contain both helpful and harmful elements, scoring reflects the balance of these behaviors rather than assigning a single category, using the following scoring system on a per-talk-turn basis:
Per-talk-turn Score
Samples of Roleplays with Annotations
Warning: This content contains detailed descriptions of disordered eating behaviors and may be harmful or triggering.
What’s next for mPACT
mPACT-ED v1.0 establishes a clinically grounded foundation for evaluating AI models in eating disorder–related conversations, and this domain introduces a critical shift in how safety must be defined. In eating disorder contexts, risk is often indirect, context-dependent, and embedded within socially normalized behaviors. As a result, the primary risk is not only what models say, but what they fail to recognize or challenge. Responses may appear supportive, neutral, or health-oriented while still reinforcing clinically dangerous behaviors.
This means traditional approaches—focused primarily on preventing explicit harm—are insufficient. Effective evaluation must capture subtle reinforcement of pathology, missed risk signals, and clinically inadequate responses where models avoid overt harm but fail to intervene meaningfully.
Future work will expand coverage to more diverse populations, cultural contexts, and real-world scenarios, while continuing to refine the clinician-guided evaluation framework to better capture emerging and context-dependent patterns of eating disorder risk. We will also evaluate additional and emerging models over time to track progress and enable ongoing cross-system comparison.
Beyond eating disorders, mPACT will continue building a broader suite of clinician-guided benchmarks across high-risk human–AI interactions. Future extensions may also incorporate longitudinal interactions and multimodal inputs to better reflect how risk develops and is expressed in real-world settings.
Together, these efforts aim to ensure that mPACT remains a living benchmark that evolves alongside both advances in AI and the complex, shifting realities of human behavior.