argbe.tech - news
AWS brings Amazon Nova LLM-as-a-Judge evaluations to SageMaker AI
Amazon SageMaker AI now offers an optimized evaluation workflow that uses Amazon Nova as an LLM judge to score pairwise model outputs with bias-aware metrics.
AWS added an Amazon Nova LLM-as-a-Judge workflow in Amazon SageMaker AI to compare generative model outputs side by side and summarize preference metrics.
- The judge model was trained with supervised learning plus reinforcement learning from human preference data spanning knowledge, creativity, coding, math, specialized domains, toxicity, and over 90 languages.
- AWS reports an internal bias study covering more than 10,000 human-preference judgments across 75 third‑party models, with roughly 3% aggregate bias versus human annotations.
- On judge benchmarks, Nova LLM-as-a-Judge is reported at 45% accuracy on JudgeBench and 68% on PPE (versus Meta's J1 8B at 42% and 60%, respectively).
- The SageMaker workflow takes JSONL inputs containing a prompt plus response_A and response_B, runs the evaluation as a managed job, and writes outputs (win rate, ties, a 95% confidence interval, and errors) to Amazon S3.
- An AWS example compares Qwen2.5 1.5B (hosted on SageMaker) against Claude 3.7 Sonnet (via Amazon Bedrock), using GPU instances such as ml.g5.12xlarge.
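As a rough illustration of the workflow described above, the sketch below builds a pairwise JSONL input file (field names follow the article's description: prompt, response_A, response_B) and shows one plausible way the summary metrics could be derived from per-example judge verdicts. The file name, example prompts, and the summarize helper are all hypothetical; the managed SageMaker job's actual input schema and metric computation may differ.

```python
import json
import math

# Hypothetical pairwise-evaluation records; the field names (prompt,
# response_A, response_B) follow the article's description of the input.
records = [
    {"prompt": "Explain TCP slow start.",
     "response_A": "Answer from model A ...",
     "response_B": "Answer from model B ..."},
    {"prompt": "Write a haiku about autumn.",
     "response_A": "Answer from model A ...",
     "response_B": "Answer from model B ..."},
]

# JSONL input: one JSON object per line (file name is illustrative).
with open("pairwise_eval.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

def summarize(verdicts):
    """Aggregate per-example judge verdicts ("A", "B", or "tie") into the
    kind of summary the article mentions: win rate, ties, and a 95%
    confidence interval (here via a simple normal approximation)."""
    n = len(verdicts)
    wins_a = sum(v == "A" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    p = wins_a / n                       # win rate for model A
    se = math.sqrt(p * (1 - p) / n)      # normal-approximation std. error
    ci95 = (p - 1.96 * se, p + 1.96 * se)
    return {"win_rate_A": p, "ties": ties, "ci95": ci95}

print(summarize(["A", "A", "B", "tie", "A"]))
```

This only mimics the shape of the reported outputs; the actual job also records per-example errors and writes its results to Amazon S3.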