This report presents research findings from a clinician-developed benchmark evaluating large language model behavior in simulated suicide-related conversations. It is intended for researchers, developers, regulators, and clinicians studying the safety of AI systems.
Findings characterize the specific model versions identified herein, accessed through their respective public APIs in default configuration, without system prompts or additional safety scaffolding. Production deployments of these models may include additional safety layers, system prompts, or content policies that were not evaluated and that may materially affect model behavior. Results should not be generalized to other model versions, configurations, or deployment contexts.
This benchmark is a research instrument. It does not constitute clinical advice, and is not a certification, endorsement, or warranty of any AI system’s safety or fitness for any clinical, consumer, or other use. Findings should not be relied upon to make clinical decisions or as a substitute for the judgment of a qualified mental health professional. Product and company names are used for identification only and do not imply affiliation with or endorsement by the named entities.
mPACT
Suicide Benchmark
Evaluates how effectively models detect, assess, and respond to conversations involving suicidal ideation and self-harm.
mPACT
Eating Disorder Benchmark
Assesses model performance in identifying disordered eating signals and providing clinically appropriate support.
mPACT
Misinformation Benchmark
Measures how frequently models generate or propagate inaccurate or misleading information, evaluating their reliability in providing truthful, evidence-based responses.