I’d been hearing about CrewAI for months — every multi-agent tutorial cites it, every engineering Substack mentions it. So one Saturday I cleared my afternoon, brewed coffee, and decided to actually use it. Three days later I had something working, several things broken, and a clearer view of where the framework fits. Here’s what I built, what worked, what didn’t, and whether I’d reach for it again.
The setup
CrewAI is a Python framework for building multi-agent systems. The pitch is “let your agents work like a team.” You define agents — each with a role, a goal, and a backstory — give them tasks, and the framework handles the handoffs between them.
It’s open source, MIT-licensed, and has been on the receiving end of a lot of hype since mid-2024. The DeepLearning.AI course co-taught by founder João Moura and Andrew Ng probably did more for the framework’s profile than anything else.
Installation is one command — pip install crewai crewai-tools — and if you’ve used LangChain before you’ll feel at home in about ten minutes. If you haven’t, give yourself a couple of hours with the docs first.

What I built
I wanted a content research crew. Three agents:
- A researcher that pulls notes on a topic
- A writer that drafts an article
- An editor that critiques and rewrites for tone
This is the textbook CrewAI use case. It’s also kind of the only thing every tutorial builds, which made me suspicious — but I figured I’d start with the path of least resistance.
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool
researcher = Agent(
role='Senior Research Analyst',
goal='Gather accurate, up-to-date information on a given topic',
backstory=(
'You are a seasoned research analyst at a leading think tank. '
"You read primary sources carefully and never make claims you can't cite."
),
tools=[SerperDevTool()],
verbose=True,
)
writer = Agent(
role='Tech Journalist',
goal='Write a 1500-word article that is accurate, readable, and engaging',
backstory=(
'You write for a respected technology publication. '
'Your stylistic models are Stratechery and MIT Technology Review.'
),
verbose=True,
)
editor = Agent(
role='Senior Editor',
goal='Tighten prose, fact-check claims, and ensure the piece reads cleanly',
backstory='You edit for a magazine where every sentence has to earn its place.',
)
research_task = Task(
description='Research {topic}. Find at least 5 primary sources.',
expected_output='A research brief with citations.',
agent=researcher,
)
writing_task = Task(
description='Using the research brief, draft a 1500-word article.',
expected_output='The article in markdown.',
agent=writer,
)
editing_task = Task(
description='Edit the draft for tone, accuracy, and flow.',
expected_output='The polished article.',
agent=editor,
)
crew = Crew(
agents=[researcher, writer, editor],
tasks=[research_task, writing_task, editing_task],
process=Process.sequential,
)
result = crew.kickoff(inputs={'topic': 'Agent2Agent protocol'})
It ran. The output was… fine.
About forty lines of code, and the crew was running on the first try. That’s the thing about CrewAI — it makes the first 80% of a multi-agent system trivial. You write personas, define tasks, kick off the crew, watch it print to the console.
The output was OK. Not great, not bad — about what you’d expect from a sequential GPT-4 pipeline with role-prompts. The research was thorough but generic. The writing read like every other AI-written tech article (slightly too neutral, slightly too hedged). The editor pass cleaned up obvious phrasing but didn’t transform the piece.
I’d just spent an afternoon. The framework worked. And yet the output was unmistakably mid. So I kept going.
Day 2: The role-prompt rabbit hole
Here’s where CrewAI gets interesting. The agent’s backstory and role aren’t just labels — they’re injected into every LLM call as system context. The writer’s prose voice is essentially determined by how you wrote the backstory.
I spent most of Day 2 iterating on backstories. “Tech journalist” gave me generic writing. “Tech journalist at a respected technology publication” was slightly better. “Tech journalist who has just read three Stratechery posts and is now writing in that voice for an audience of engineering managers” was noticeably better.
This is the part nobody warns you about. Most of your work isn’t writing code, it’s writing personas. Bad personas mean bad outputs, even with the best models. Treat it like creative writing — you’re casting actors.
“Most of your work in CrewAI isn’t writing code, it’s writing personas. You’re casting actors, not configuring software.”
Day 3: Where it broke
Two things broke for me.
First: the researcher would occasionally just confabulate. CrewAI ships with Serper as the default search tool, but the agent didn’t always invoke it. When it didn’t, I got plausible-sounding nonsense back. The fix was making search a required step in the task description (“Use the search tool at least twice before drafting any claims”). After that, hallucinations dropped but didn’t disappear.
Second: token costs ramped fast. Three agents, three turns each, plus the supervisor coordinating — easily 20-30 LLM calls per crew run. With GPT-4o I was burning around $0.15 per article. With Claude Sonnet 4.6 I was closer to $0.05. That’s manageable for content production. It adds up if you run this in a loop.
When CrewAI shines (and when it doesn’t)
After three days I had a clearer sense of where the framework fits.
| Workflow shape | CrewAI verdict |
|---|---|
| Linear research → write → edit pipeline | Great fit. You’ll have working code in an afternoon. |
| Workflow with branching, retries, escalation | Reach for LangGraph instead. |
| Single agent with many tools | Use LangChain directly. CrewAI’s role abstraction adds friction here. |
| Production system needing durable state | Skip it — no checkpointing. LangGraph or a workflow engine. |
| Prototype to test a multi-agent idea quickly | Best in class. The ‘crew of personas’ framing makes prototypes self-documenting. |
What I’d skip next time
If I were starting over from scratch:
- Skip the elaborate backstories on Day 1. Start with terse roles, see what breaks, then iterate. Spending two hours crafting a backstory before the agent has even run is premature optimization.
- Wire up tracing immediately. Mentioned above. Worth repeating.
- Use Claude Sonnet for the writer agent, not GPT-4o. Better prose, lower cost per article. GPT-4o was fine for research; Sonnet’s prose was noticeably better for the writing step.
- Set max_iter on every agent. CrewAI agents will loop forever if you let them. Cap iterations early.
The verdict
For a content pipeline? I’d use CrewAI again. For anything more complex? Probably not — LangGraph or a hand-rolled agent loop would give me more control.
CrewAI is at its best when you don’t fight it. If your workflow maps to a small team passing notes, the framework gets you to working code in an afternoon. If your workflow is something else, force-fitting it into the role-based abstraction will cost you more than building it directly.
That’s it. Three days. One working content crew. Several broken experiments. A clearer picture of where multi-agent frameworks fit and where they don’t.
Frequently asked questions
What is CrewAI?
CrewAI is an open-source Python framework for building multi-agent systems. You define agents with roles, goals, and backstories; give them tasks; and the framework handles the handoffs. It’s MIT-licensed and available at github.com/crewAIInc/crewAI. The framework was created by João Moura.
No Title
They optimize for different shapes. CrewAI is opinionated about role-based teams with sequential or hierarchical handoffs. LangGraph is the lower-level primitive — you model agents as graphs with conditional edges. If your workflow is a linear pipeline of personas, CrewAI is faster to ship. If it branches, loops, or needs durable state, LangGraph is the better fit.
Is CrewAI production-ready?
For content pipelines and other linear workflows, yes — many teams run it in production. For complex workflows requiring durable state, retries, and human-in-the-loop pauses, it’s less mature than LangGraph. The framework also lacks first-class checkpointing, so you can’t pause a crew mid-run and resume it hours later the way LangGraph supports out of the box.
Primary sources
- CrewAI — official docs
- CrewAI — GitHub repo
- DeepLearning.AI — Multi AI Agent Systems with crewAI course
- CrewAI main site
- João Moura — Twitter
Last updated: May 20, 2026. Related: Agent Infrastructure.