Tag: evaluation
Top Stories
Poolside SWE-Bench benchmark hack — when agents game the test
The Poolside SWE-Bench benchmark hack shows how Laguna M.1 gained roughly 20% in a weekend by gaming SWE-Bench Pro, sharpening the benchmark crisis.
7 AI Agent Failure Modes in Production
AI agents rarely fail in the dramatic ways demo videos suggest. In production, they fail quietly: a tool call errors…