A Weekend With CrewAI: What I Built and What Broke

I’d been hearing about CrewAI for months — every multi-agent tutorial cites it, every engineering Substack mentions it. So one Saturday I cleared my afternoon, brewed coffee, and decided to actually use it. Three days later I had something working, several things broken, and a clearer view of where the framework fits. Here’s what I built, what worked, what didn’t, and whether I’d reach for it again.

The setup

Video: Andrew Ng's Sequoia AI Ascent talk on AI agentic workflows, the frame for why frameworks like CrewAI matter.

CrewAI is a Python framework for building multi-agent systems. The pitch is “let your agents work like a team.” You define agents — each with a role, a goal, and a backstory — give them tasks, and the framework handles the handoffs between them.

It’s open source, MIT-licensed, and has been on the receiving end of a lot of hype since mid-2024. The DeepLearning.AI short course taught by founder João Moura, with an introduction from Andrew Ng, probably did more for the framework’s profile than anything else.

Installation is one command — pip install crewai crewai-tools — and if you’ve used LangChain before you’ll feel at home in about ten minutes. If you haven’t, give yourself a couple of hours with the docs first.

Image: DeepLearning.AI promotional banner for the 'Multi AI Agent Systems with crewAI' short course.

What I built

I wanted a content research crew. Three agents:

A researcher that pulls notes on a topic
A writer that drafts an article
An editor that critiques and rewrites for tone

This is the textbook CrewAI use case. It’s also kind of the only thing every tutorial builds, which made me suspicious — but I figured I’d start with the path of least resistance.

# Requires SERPER_API_KEY (for SerperDevTool) and an LLM API key in the environment.
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool

# The only agent with a tool: web search via Serper.
researcher = Agent(
    role='Senior Research Analyst',
    goal='Gather accurate, up-to-date information on a given topic',
    backstory=(
        'You are a seasoned research analyst at a leading think tank. '
        "You read primary sources carefully and never make claims you can't cite."
    ),
    tools=[SerperDevTool()],
    verbose=True,
)

writer = Agent(
    role='Tech Journalist',
    goal='Write a 1500-word article that is accurate, readable, and engaging',
    backstory=(
        'You write for a respected technology publication. '
        'Your stylistic models are Stratechery and MIT Technology Review.'
    ),
    verbose=True,
)

editor = Agent(
    role='Senior Editor',
    goal='Tighten prose, fact-check claims, and ensure the piece reads cleanly',
    backstory='You edit for a magazine where every sentence has to earn its place.',
)

# {topic} is filled in from the inputs passed to kickoff().
research_task = Task(
    description='Research {topic}. Find at least 5 primary sources.',
    expected_output='A research brief with citations.',
    agent=researcher,
)

writing_task = Task(
    description='Using the research brief, draft a 1500-word article.',
    expected_output='The article in markdown.',
    agent=writer,
)

editing_task = Task(
    description='Edit the draft for tone, accuracy, and flow.',
    expected_output='The polished article.',
    agent=editor,
)

# Sequential process: tasks run in list order, each output feeding the next.
crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, writing_task, editing_task],
    process=Process.sequential,
)
result = crew.kickoff(inputs={'topic': 'Agent2Agent protocol'})

It ran. The output was… fine.

About fifty lines of code, and the crew ran on the first try. That’s the thing about CrewAI: it makes the first 80% of a multi-agent system trivial. You write personas, define tasks, kick off the crew, and watch it print to the console.

The output was OK. Not great, not bad: about what you’d expect from a sequential GPT-4o pipeline with role prompts. The research was thorough but generic. The writing read like every other AI-written tech article (slightly too neutral, slightly too hedged). The editor pass cleaned up obvious phrasing but didn’t transform the piece.

I’d just spent an afternoon. The framework worked. And yet the output was unmistakably mid. So I kept going.

Day 2: The role-prompt rabbit hole

Here’s where CrewAI gets interesting. The agent’s backstory and role aren’t just labels — they’re injected into every LLM call as system context. The writer’s prose voice is essentially determined by how you wrote the backstory.

I spent most of Day 2 iterating on backstories. “Tech journalist” gave me generic writing. “Tech journalist at a respected technology publication” was slightly better. “Tech journalist who has just read three Stratechery posts and is now writing in that voice for an audience of engineering managers” was noticeably better.

This is the part nobody warns you about. Most of your work isn’t writing code, it’s writing personas. Bad personas mean bad outputs, even with the best models. Treat it like creative writing — you’re casting actors.
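To make the mechanism concrete, here is a minimal sketch of how a role, goal, and backstory might be folded into a system prompt. The template, the `Persona` class, and the `build_system_prompt` helper are illustrative assumptions, not CrewAI's actual internals; the point is only that every word of the backstory ends up in front of the model on each call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    role: str
    goal: str
    backstory: str


def build_system_prompt(persona: Persona) -> str:
    """Fold a persona into one system prompt (illustrative template only)."""
    return (
        f"You are {persona.role}. {persona.backstory}\n"
        f"Your personal goal is: {persona.goal}"
    )


GOAL = "Write a 1500-word article that is accurate, readable, and engaging"

# Two backstory variants in the spirit of Day 2, from terse to specific.
variants = [
    Persona("Tech Journalist", GOAL, "You are a tech journalist."),
    Persona(
        "Tech Journalist",
        GOAL,
        "You have just read three Stratechery posts and are now writing "
        "in that voice for an audience of engineering managers.",
    ),
]

prompts = [build_system_prompt(p) for p in variants]
for prompt in prompts:
    print(prompt, end="\n\n")
```

Printing the variants side by side makes it easier to see how much of the prompt is persona and how little is task.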

Day 3: Where it broke

Two things broke for me.

First: the researcher would occasionally just confabulate. I’d given it SerperDevTool from crewai-tools for search, but the agent didn’t always invoke it. When it didn’t, I got plausible-sounding nonsense back. The fix was making search a required step in the task description (“Use the search tool at least twice before drafting any claims”). After that, hallucinations dropped but didn’t disappear.
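Prompting the agent to search is only half of it; the other half is checking the brief before it reaches the writer. A minimal sketch of such a check, assuming citations appear as plain URLs in the brief. `check_brief` and its threshold are hypothetical names. Recent CrewAI releases document a `guardrail` callable on `Task` that a function like this could be adapted to, but confirm the expected signature for your version.

```python
import re

# Matches http(s) URLs up to whitespace or a closing bracket/quote.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def check_brief(brief: str, min_sources: int = 5) -> tuple[bool, str]:
    """Return (ok, message): ok is True when enough distinct URLs are cited."""
    urls = {u.rstrip(".,;") for u in URL_PATTERN.findall(brief)}
    if len(urls) >= min_sources:
        return True, f"{len(urls)} distinct sources cited"
    return False, (
        f"Only {len(urls)} distinct sources cited; "
        f"use the search tool and cite at least {min_sources}."
    )


# Placeholder brief: three citations, two of which are the same URL.
sample_brief = """
First claim (https://example.com/spec).
Second claim, same source: https://example.com/spec.
Third claim, see https://example.org/adopters
"""

ok, message = check_brief(sample_brief)
print(ok, message)
```

A URL count says nothing about whether the sources support the claims, so this catches the "never searched at all" failure and little else.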

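Second: token costs ramped fast. Three agents, several reasoning and tool-call turns each, with context handed from task to task: easily 20-30 LLM calls per crew run. (A sequential process has no supervisor; a coordinating manager only enters with Process.hierarchical, and it would add calls of its own.) With GPT-4o I was burning around $0.15 per article. With Claude Sonnet 4.6 I was closer to $0.05. That’s manageable for content production. It adds up if you run this in a loop.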
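The arithmetic behind a per-run figure is worth writing down. A back-of-envelope sketch, assuming 25 calls per run averaging 1,500 input and 250 output tokens, priced at $2.50 and $10.00 per million tokens. All four numbers are illustrative assumptions rather than measurements from this run; substitute your own trace data and your provider's current price list.

```python
def cost_per_run(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the dollar cost of one crew run from average per-call usage."""
    input_cost = calls * input_tokens_per_call * input_price_per_million / 1_000_000
    output_cost = calls * output_tokens_per_call * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Assumed figures, chosen only to illustrate the calculation.
per_run = cost_per_run(
    calls=25,
    input_tokens_per_call=1_500,
    output_tokens_per_call=250,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)
print(f"per run: ${per_run:.4f}")
print(f"per 1,000 runs: ${per_run * 1_000:.2f}")
```

With these inputs a run costs about $0.16, so a loop of 1,000 articles lands near $156.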

⚠️ Don’t skip tracing. Wire up LangSmith or Langfuse on day one. Without tracing, when an agent produces a bad output, you’re guessing whether it was the prompt, the tool call, or the context. With tracing it’s obvious in 30 seconds. The setup is twenty minutes.
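For a sense of what a trace records, here is a minimal stand-in: a decorator that logs each step's name, input, output, and duration to an in-memory list. It is a sketch of the idea only, not a substitute for LangSmith or Langfuse, and `traced`, `TRACE`, and the stub steps are hypothetical names.

```python
import functools
import time

TRACE: list[dict] = []


def traced(step_name: str):
    """Record the input, output, and duration of every call to the wrapped step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(text: str) -> str:
            start = time.perf_counter()
            output = func(text)
            TRACE.append({
                "step": step_name,
                "input": text,
                "output": output,
                "seconds": time.perf_counter() - start,
            })
            return output
        return wrapper
    return decorator


# Stub steps standing in for real agent calls.
@traced("research")
def research(topic: str) -> str:
    return f"notes on {topic}"


@traced("write")
def write(notes: str) -> str:
    return f"draft based on {notes}"


draft = write(research("Agent2Agent protocol"))
for record in TRACE:
    print(f"{record['step']}: {record['input']!r} -> {record['output']!r}")
```

Even this much shows which step received what, which is the question you end up asking when an output looks wrong.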

When CrewAI shines (and when it doesn’t)

After three days I had a clearer sense of where the framework fits.

| Workflow shape | CrewAI verdict |
| --- | --- |
| Linear research → write → edit pipeline | Great fit. You’ll have working code in an afternoon. |
| Workflow with branching, retries, escalation | Reach for LangGraph instead. |
| Single agent with many tools | Use LangChain directly. CrewAI’s role abstraction adds friction here. |
| Production system needing durable state | Skip it: no checkpointing. LangGraph or a workflow engine. |
| Prototype to test a multi-agent idea quickly | Best in class. The ‘crew of personas’ framing makes prototypes self-documenting. |

My personal mapping after a weekend. Yours may differ.

What I’d skip next time

If I were starting over from scratch:

Skip the elaborate backstories on Day 1. Start with terse roles, see what breaks, then iterate. Spending two hours crafting a backstory before the agent has even run is premature optimization.
Wire up tracing immediately. Mentioned above. Worth repeating.
Use Claude Sonnet for the writer agent, not GPT-4o. Better prose, lower cost per article. GPT-4o was fine for research; Sonnet’s prose was noticeably better for the writing step.
Set max_iter on every agent. CrewAI agents ship with a default iteration cap, but it’s generous enough that a stuck agent can burn through a lot of calls before it gives up. Cap iterations early.
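What an iteration cap buys you is easiest to see in a hand-rolled loop. The sketch below is plain Python, not CrewAI code: `run_with_cap` and the stub steps are hypothetical, and the only CrewAI name involved is the `max_iter` argument on `Agent` mentioned in the list above.

```python
from typing import Callable, Optional


def run_with_cap(
    step: Callable[[int], Optional[str]], max_iter: int
) -> tuple[Optional[str], int]:
    """Call step(i) until it returns an answer or the cap is reached.

    Returns (answer, iterations_used); answer is None if the cap was hit.
    """
    for i in range(1, max_iter + 1):
        answer = step(i)
        if answer is not None:
            return answer, i
    return None, max_iter


def never_finishes(iteration: int) -> Optional[str]:
    """Stub agent step that keeps 'thinking' and never produces an answer."""
    return None


def finishes_on_third(iteration: int) -> Optional[str]:
    """Stub agent step that answers on its third iteration."""
    return "final answer" if iteration == 3 else None


stuck = run_with_cap(never_finishes, max_iter=5)
done = run_with_cap(finishes_on_third, max_iter=5)
print(stuck)
print(done)
```

The cap turns a stuck agent from an open-ended bill into a bounded one: at most `max_iter` calls, then a clear failure to handle.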

The verdict

For a content pipeline? I’d use CrewAI again. For anything more complex? Probably not — LangGraph or a hand-rolled agent loop would give me more control.

CrewAI is at its best when you don’t fight it. If your workflow maps to a small team passing notes, the framework gets you to working code in an afternoon. If your workflow is something else, force-fitting it into the role-based abstraction will cost you more than building it directly.

That’s it. Three days. One working content crew. Several broken experiments. A clearer picture of where multi-agent frameworks fit and where they don’t.

Frequently asked questions

What is CrewAI?

CrewAI is an open-source Python framework for building multi-agent systems. You define agents with roles, goals, and backstories; give them tasks; and the framework handles the handoffs. It’s MIT-licensed and available at github.com/crewAIInc/crewAI. The framework was created by João Moura.

How does CrewAI compare to LangGraph?

They optimize for different shapes. CrewAI is opinionated about role-based teams with sequential or hierarchical handoffs. LangGraph is the lower-level primitive — you model agents as graphs with conditional edges. If your workflow is a linear pipeline of personas, CrewAI is faster to ship. If it branches, loops, or needs durable state, LangGraph is the better fit.
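The difference in shape can be sketched without either library. Below, a workflow is a dictionary of nodes plus a routing function per node, so the edit step can send the draft back for a rewrite. This is plain Python meant to illustrate conditional edges, not LangGraph's API; every name in it is hypothetical.

```python
from typing import Callable

State = dict


def write(state: State) -> State:
    revision = state["revision"] + 1
    return {**state, "draft": f"draft v{revision}", "revision": revision}


def edit(state: State) -> State:
    # Stub editor: approve only once the draft has been revised twice.
    return {**state, "approved": state["revision"] >= 2}


NODES: dict[str, Callable[[State], State]] = {"write": write, "edit": edit}

# Conditional edges: each router inspects the state and names the next node.
ROUTES: dict[str, Callable[[State], str]] = {
    "write": lambda s: "edit",
    "edit": lambda s: "END" if s["approved"] else "write",
}


def run_graph(start: str, state: State, max_steps: int = 10) -> tuple[State, list[str]]:
    """Walk the graph from start until a router returns END or max_steps is hit."""
    path: list[str] = []
    node = start
    for _ in range(max_steps):
        if node == "END":
            break
        path.append(node)
        state = NODES[node](state)
        node = ROUTES[node](state)
    return state, path


final_state, path = run_graph("write", {"revision": 0, "approved": False})
print(path)
print(final_state)
```

A sequential crew corresponds to the special case where every router returns the next node in a fixed list.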

Is CrewAI production-ready?

For content pipelines and other linear workflows, yes — many teams run it in production. For complex workflows requiring durable state, retries, and human-in-the-loop pauses, it’s less mature than LangGraph. The framework also lacks first-class checkpointing, so you can’t pause a crew mid-run and resume it hours later the way LangGraph supports out of the box.

Last updated: May 20, 2026. Related: Agent Infrastructure.
