Firecrawl turns the messy web into clean, LLM-ready data. Give it a URL and it returns markdown or structured JSON; give it a domain and it crawls or maps the whole thing; give it a query and it searches and returns full page content. Its popularity among AI tooling has a simple explanation: feeding models good text is a bottleneck, and Firecrawl is the most batteries-included answer to it. This page covers where it shines and where the open-source build quietly diverges from the hosted one.
What it actually does
The API is a small set of verbs, each doing one job well:
- Scrape a single URL to markdown, JSON, HTML, or a screenshot, with JS rendering via a headless browser.
- Crawl a site recursively into a set of pages.
- Map a site to discover all its URLs fast.
- Search the web and return full page content, with domain include/exclude filters.
- Batch scrape hundreds or thousands of URLs asynchronously.
- Extract structured data with a schema, increasingly framed as an agentic step.
- Parse (added in the v2.10 line) uploads a local file such as a PDF or DOCX and returns markdown while preserving tables and reading order.
The v2 rewrite (mid-2025) is the version to target; the project now ships official SDKs across Python, Node, Go, Ruby, PHP, .NET, and more.
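As a rough illustration of how the verbs above map onto HTTP calls, the sketch below builds (but does not send) a request for each one using only the standard library. The `/v2/...` paths, the bearer-token header, and the payload fields are assumptions about the v2 REST API that this page does not confirm, so check them against the API reference. `ENDPOINTS` and `build_request` are hypothetical names. Parse is left out because its endpoint is not specified here.

```python
import json
import urllib.request

# Assumed v2 REST paths, one per verb; verify against the API reference.
ENDPOINTS = {
    "scrape": "/v2/scrape",
    "crawl": "/v2/crawl",
    "map": "/v2/map",
    "search": "/v2/search",
    "batch_scrape": "/v2/batch/scrape",
    "extract": "/v2/extract",
}


def build_request(verb, payload, api_key, base_url="https://api.firecrawl.dev"):
    """Build (but do not send) a JSON POST request for one Firecrawl verb."""
    if verb not in ENDPOINTS:
        raise ValueError(f"unknown verb: {verb}")
    return urllib.request.Request(
        url=base_url.rstrip("/") + ENDPOINTS[verb],
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_request(
        "scrape",
        {"url": "https://firecrawl.dev", "formats": ["markdown"]},
        api_key="fc-YOUR_API_KEY",
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=60)
```

Passing `base_url="http://localhost:3002"` would point the same request at a self-hosted instance, which makes it easy to compare the two paths on identical input.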
Install and access
There are two paths, and the gap between them matters.
Cloud (firecrawl.dev), the path the docs assume:
pip install firecrawl-py # Python
npm install firecrawl # Node

from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
print(app.scrape("firecrawl.dev").markdown)
Self-hosted via docker compose:
docker compose build
docker compose up
# API on http://localhost:3002
The self-hosted stack is real: a TypeScript API, a Playwright rendering service, Redis for queues and rate limiting, and a Postgres store. It works for scrape and crawl. But read the next section before you assume it is the cloud minus the bill.
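Because the stack has several moving parts, it can help to confirm the API port is accepting connections before sending jobs. The sketch below is one minimal way to do that with the standard library, assuming the default port 3002 mentioned above. `wait_for_port` is a hypothetical helper, and an open TCP port only shows the process is listening, not that Redis, Postgres, or the rendering service are healthy.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    ready = wait_for_port("localhost", 3002, timeout_s=30.0)
    print("API port open" if ready else "API port not reachable")
```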
The self-hosting caveat nobody tells you up front
This is the judgment the README soft-pedals. The open-source build (AGPL-3.0) is not feature-complete against the cloud. Concretely, from the issue tracker and self-host docs as of 2026-06:
- The anti-bot and IP-rotation layer (“fire-engine”) is not in the open-source build, so heavily defended sites behave differently.
- The agentic Extract path depends on internal services that are not open-sourced; self-hosters report needing a Firecrawl API key even when bringing their own LLM key (#3553).
- Some interaction endpoints expect Supabase, which self-host does not fully support (#3570).
If your use case is “scrape a known set of cooperative pages,” self-hosting is fine. If it is “extract structured data from arbitrary defended sites,” the cloud is doing more than the container can.
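The rule of thumb above can be written down as a small decision helper. This is only a sketch of the reasoning in this section, not anything Firecrawl ships: `Needs` and `choose_backend` are hypothetical names, and the three flags simply mirror the three gaps listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    defended_sites: bool = False         # targets behind anti-bot protection
    agentic_extract: bool = False        # the agentic Extract path
    interaction_endpoints: bool = False  # endpoints that expect Supabase


def choose_backend(needs):
    """Return ("cloud" | "self-host", reasons) based on the gaps listed above."""
    reasons = []
    if needs.defended_sites:
        reasons.append("fire-engine (anti-bot, IP rotation) is not in the open-source build")
    if needs.agentic_extract:
        reasons.append("agentic Extract depends on services that are not open-sourced")
    if needs.interaction_endpoints:
        reasons.append("some interaction endpoints expect Supabase")
    return ("cloud" if reasons else "self-host", reasons)


if __name__ == "__main__":
    print(choose_backend(Needs()))
    print(choose_backend(Needs(defended_sites=True, agentic_extract=True)))
```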
When it fits, and when it does not
It fits RAG pipelines, agents that need fresh web context, and anyone who would otherwise hand-roll Playwright plus a markdown cleaner. It fits less well if you need fine-grained, stateful scraping logic in your own language (a scraping library gives more control), or if AGPL copyleft is a problem for your distribution model.
How it compares
| Project | Shape | Note |
|---|---|---|
| firecrawl/firecrawl | Hosted API + self-host, full verb set | AGPL-3.0, cloud is more capable |
| D4Vinci/Scrapling | Python scraping library | Code-first control, you own the logic |
| Jina Reader (jina-ai/reader) | URL-prefix to LLM text | Minimal, single-purpose |
Scrapling is the better fit when you want to write scraping code and control every step; Firecrawl is the better fit when you want to hand a URL to an API and get clean data back. Jina Reader is the lightweight middle ground for “just give me the text of this page.”
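The difference in shape between the two hosted options shows up in what a request looks like. The sketch below contrasts them under two assumptions: that Jina Reader is addressed by prefixing the target with `https://r.jina.ai/`, and that a Firecrawl scrape takes a JSON body with `url` and `formats` fields. Both helper names are hypothetical.

```python
def jina_reader_url(target_url):
    """URL-prefix shape: the whole request is a GET on a prefixed URL (assumed prefix)."""
    return "https://r.jina.ai/" + target_url


def firecrawl_scrape_body(target_url, formats=("markdown",)):
    """API shape: a JSON body, with room for options such as output formats (assumed fields)."""
    return {"url": target_url, "formats": list(formats)}


if __name__ == "__main__":
    page = "https://example.com/docs"
    print(jina_reader_url(page))
    print(firecrawl_scrape_body(page, formats=("markdown", "html")))
```

The prefix form has nowhere to put options beyond the URL itself, which is the sense in which it is single-purpose; the JSON body is where the larger verb set gets its knobs.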
Gotchas from the issue tracker
- Scrape had no hard timeout, so slow or unreachable pages could return empty responses that were hard to debug (#3751).
- The rate limiter could fail closed when Redis hiccuped, rejecting everything (#3728); a nuq worker Redis connection leak that hammered self-hosters was fixed on 2026-06-10 (#3662).
- The agentic Extract path has burned through credits on JavaScript-heavy SPA pages with no clear cap (#3552), so watch spend on dynamic targets.
The throughline: the hosted API is polished, while the self-hosted build needs operational care around Redis and rendering.
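Two of the gotchas above (no hard timeout, uncapped spend) can be guarded against on the client side regardless of how the server behaves. The following is a minimal sketch, not part of any Firecrawl SDK: `scrape_with_deadline`, `CreditBudget`, and the exception classes are hypothetical, `scrape_fn` stands in for whatever call returns the page text, and the credit figures are whatever usage numbers the caller chooses to record.

```python
import concurrent.futures as cf


class EmptyScrapeError(RuntimeError):
    """Raised when a scrape returns no usable text."""


class BudgetExceeded(RuntimeError):
    """Raised when recorded credit usage has reached the caller's cap."""


def scrape_with_deadline(scrape_fn, url, timeout_s=30.0):
    """Run scrape_fn(url) with a hard client-side deadline; reject empty results."""
    executor = cf.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(scrape_fn, url)
    try:
        text = future.result(timeout=timeout_s)
    except cf.TimeoutError:
        raise TimeoutError(f"scrape of {url!r} exceeded {timeout_s}s") from None
    finally:
        # Do not block on the worker; a timed-out call keeps running in the
        # background until it finishes, but the caller is released now.
        executor.shutdown(wait=False)
    if not text or not text.strip():
        raise EmptyScrapeError(f"scrape of {url!r} returned no content")
    return text


class CreditBudget:
    """Caller-side spend cap: check() before each call, record() after it."""

    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def check(self):
        if self.spent >= self.limit:
            raise BudgetExceeded(f"spent {self.spent} of {self.limit} credits")

    def record(self, credits_used):
        self.spent += credits_used


if __name__ == "__main__":
    budget = CreditBudget(limit=100)
    budget.check()
    text = scrape_with_deadline(lambda u: "# Example page", "https://example.com", timeout_s=5.0)
    budget.record(1)
    print(text, budget.spent)
```

Turning an empty response into an explicit error is the point of the wrapper: it converts the hard-to-debug case into one that shows up in logs with the URL attached.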
Background worth knowing
Firecrawl began as a product from the Mendable team and now operates under its own org with a credit-based cloud business. The main repo is AGPL-3.0 while SDKs and UI bits are more permissive. That dual reality (open core, capable cloud) is the honest frame for deciding which path you are really adopting.
Related reading
If you are wiring the web into models, compare with D4Vinci/Scrapling for code-first scraping and microsoft/markitdown for turning local documents into markdown once you have them.