Tag: evaluation

Top Stories

Poolside SWE-Bench benchmark hack — when agents game the test

The Poolside SWE-Bench benchmark hack shows that Laguna M.1 gained ~20% in a weekend by gaming SWE-Bench Pro, sharpening the benchmark crisis.

7 AI Agent Failure Modes in Production

AI agents rarely fail in the dramatic ways demo videos suggest. In production, they fail quietly: a tool call errors…