argbe.tech - news
AWS brings Amazon Nova LLM-as-a-Judge evaluations to SageMaker AI
Amazon SageMaker AI now offers an optimized evaluation workflow that uses Amazon Nova as an LLM judge to score pairwise model outputs with bias-aware metrics.
AWS added an Amazon Nova LLM-as-a-Judge workflow in Amazon SageMaker AI to compare generative model outputs side by side and summarize preference metrics.
- The judge model was trained with supervised learning plus reinforcement learning from human preference data spanning knowledge, creativity, coding, math, specialized domains, toxicity, and over 90 languages.
- AWS reports an internal bias study covering more than 10,000 human-preference judgments across 75 third‑party models, with roughly 3% aggregate bias versus human annotations.
- On judge benchmarks, Nova LLM-as-a-Judge is reported at 45% accuracy on JudgeBench and 68% on PPE (versus Meta's J1 8B at 42% and 60%, respectively).
- The SageMaker workflow takes JSONL inputs containing a prompt plus response_A and response_B, runs the evaluation as a managed job, and writes outputs (win rate, ties, a 95% confidence interval, and errors) to Amazon S3.
- An AWS example compares Qwen2.5 1.5B (hosted on SageMaker) against Claude 3.7 Sonnet (via Amazon Bedrock), using GPU instances such as ml.g5.12xlarge.
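As a rough illustration of the workflow described above, the sketch below builds a pairwise JSONL input file (field names follow the article's description: prompt, response_A, response_B) and shows one plausible way the summary metrics could be derived from per-example judge verdicts. The file name, example prompts, and the summarize helper are all hypothetical; the managed SageMaker job's actual input schema and metric computation may differ.

```python
import json
import math

# Hypothetical pairwise-evaluation records; the field names (prompt,
# response_A, response_B) follow the article's description of the input.
records = [
    {"prompt": "Explain TCP slow start.",
     "response_A": "Answer from model A ...",
     "response_B": "Answer from model B ..."},
    {"prompt": "Write a haiku about autumn.",
     "response_A": "Answer from model A ...",
     "response_B": "Answer from model B ..."},
]

# JSONL input: one JSON object per line (file name is illustrative).
with open("pairwise_eval.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

def summarize(verdicts):
    """Aggregate per-example judge verdicts ("A", "B", or "tie") into the
    kind of summary the article mentions: win rate, ties, and a 95%
    confidence interval (here via a simple normal approximation)."""
    n = len(verdicts)
    wins_a = sum(v == "A" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    p = wins_a / n                       # win rate for model A
    se = math.sqrt(p * (1 - p) / n)      # normal-approximation std. error
    ci95 = (p - 1.96 * se, p + 1.96 * se)
    return {"win_rate_A": p, "ties": ties, "ci95": ci95}

print(summarize(["A", "A", "B", "tie", "A"]))
```

This only mimics the shape of the reported outputs; the actual job also records per-example errors and writes its results to Amazon S3.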