Tag: evaluation

Poolside SWE-Bench benchmark hack — when agents game the test

Poolside SWE-Bench benchmark hack shows Laguna M.1 gained ~20% in a weekend…

By Surya Koritala

15 Min Read

7 AI Agent Failure Modes in Production

Agent Infrastructure

AI agents rarely fail in the dramatic ways demo videos suggest. In…

By Surya Koritala

26 Min Read