LangGraph as a crawler orchestration layer: structure first, intelligence later
After Nutch, Scrapy, and years of custom crawlers, I’ve found that LangGraph, though not built for crawling, fits the shape. Deterministic first; add AI to the nodes that earn it.
I’ve built crawlers and scrapers for very different jobs. Nutch for massive crawls feeding a search index. Scrapy for focused scrapes from a known domain. Custom code when the off-the-shelf options didn’t fit. Each one taught me a version of the same lesson: the scraping isn’t the hard part. The retry, rate-limit, and partial-failure orchestration around the scraping is the hard part. That’s where most of the bugs lived.
Lately I’ve been using LangGraph for agentic projects, orchestrating LLM decisions across multi-step tasks. And it hit me: LangGraph is the orchestrator that would have done the job for those earlier crawlers. Not because crawlers need LLMs from day one, but because many grow into them. The orchestration shape (typed state, named nodes, conditional edges) is exactly what every crawler I’ve built has grown into on its own. Build it deterministic first; put the LLM in the node that earns it when you need to. The rest of the graph doesn’t notice.
The deterministic case is where most crawlers live. Selectors and a parser do the job for structured HTML. LLM-based extraction is a newer primitive that earns a node when page layouts go messy or structure drifts per page. With LangGraph, that’s just a choice you make inside one node. The rest of the graph (fetch, retry, rate-limit, save) doesn’t change either way.
Why this shape wants a graph
The naive version of crawl-then-extract is:
for url in urls:
    html = fetch(url)
    data = extract(html)
    save(data)
Every crawler starts this simple and then grows past it. extract fails on one URL and you want to retry with a different strategy. fetch hits a rate limit and you have to back off before the next one. The for-loop can’t express either cleanly, so you end up wrapping it in state.
A graph makes the state the point. Control flow becomes data: a declared shape of nodes and transitions. The retry is an edge back to fetch. The backoff is a node that sleeps and returns. Adding a new failure mode is adding an edge, not refactoring an algorithm.
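To make "control flow becomes data" concrete before any framework enters the picture, here is a minimal plain-Python sketch. The node names, the outcome strings, and the TRANSITIONS table are all hypothetical, and the fetch stub simply pretends its first attempt is rate-limited. The point is only that the retry and the backoff live in a table, not in nested control flow.

```python
import time

# Each node does one job and reports an outcome string.
def fetch(ctx: dict) -> str:
    ctx["attempts"] += 1
    # Stub: pretend the first attempt hits a rate limit.
    return "rate_limited" if ctx["attempts"] == 1 else "ok"

def backoff(ctx: dict) -> str:
    time.sleep(ctx["delay"])  # sleep, then hand control back
    return "done"

def extract(ctx: dict) -> str:
    ctx["item"] = {"url": ctx["url"]}  # stub extraction
    return "ok"

NODES = {"fetch": fetch, "backoff": backoff, "extract": extract}

# Control flow as data: (node, outcome) -> next node, None means stop.
TRANSITIONS = {
    ("fetch", "ok"): "extract",
    ("fetch", "rate_limited"): "backoff",
    ("backoff", "done"): "fetch",  # the retry is just an edge back to fetch
    ("extract", "ok"): None,
}

def run(ctx: dict, start: str = "fetch") -> list[str]:
    trace = []
    node = start
    while node is not None:
        trace.append(node)
        outcome = NODES[node](ctx)
        node = TRANSITIONS[(node, outcome)]
    return trace

ctx = {"url": "https://example.com", "attempts": 0, "delay": 0.0}
trace = run(ctx)
print(trace)
```

Handling a new failure mode here means adding a row to TRANSITIONS (and perhaps a node), which is the property the graph frameworks formalize.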
LangGraph was built for agents, not crawlers
LangGraph came out of LangChain to orchestrate LLM-driven agents: ReAct loops, tool-calling, multi-agent coordination. None of that is crawling.
But the problems are structurally identical. An agent needs to decide what to do next based on accumulated context, handle tool failures, retry gracefully, checkpoint state so long runs can resume. So does a crawler. The orchestrator doesn’t care whether “the thing that just happened” was an LLM tool call or an HTTP request. State is state. A retry is a retry.
If you’ve built crawlers and you’re looking for a framework, the instinct is “find a crawler framework.” After doing it enough times, my take is that crawlers and agents face the same orchestration problem, and LangGraph already does the work I’d otherwise be hand-rolling.
LangGraph in three primitives
LangGraph gives you three things:
State: a typed object (usually a TypedDict) that every node reads from and writes to. It’s the single thing that flows through the pipeline.
Nodes: Python functions that take the state and return a partial update. Nothing else. No magic.
Edges: how nodes connect. A linear edge says “after node A, go to node B.” A conditional edge says “after A, look at the state and pick a destination.”
That’s the whole framework. Everything else (checkpointing, human-in-the-loop, streaming) builds on those three primitives.
A simplified state snapshot: {"urls": ["A", "B"], "items": [], "errors": [], "pending": null}
A minimal pipeline
Start with state. Everything that moves through the pipeline lives in a typed shape:
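from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    urls: list[str]                # queue of URLs still to process
    items: list[dict]              # extracted records waiting to be saved
    errors: list[tuple[str, str]]  # (url, error message) pairs
    pending_html: str | None       # body of the page currently in flight
    pending_url: str | None        # URL of the page currently in flight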
The first node pulls a URL off the queue and fetches it:
def fetch(state: State) -> dict:
    url = state["urls"][0]  # head of the queue; assumes urls is non-empty
    html = http_get(url)  # stub: your fetcher of choice
    return {"pending_html": html, "pending_url": url}
Extract runs on what fetch returned. If parsing fails, the URL moves into errors instead of items. The parse_structured call is where you’d plug BeautifulSoup, Parsel, a regex, or an LLM returning JSON. LangGraph doesn’t care which:
def extract(state: State) -> dict:
    try:
        item = parse_structured(state["pending_html"])  # stub: your parser
        return {"items": state["items"] + [item]}
    except Exception as e:  # narrow this to your parser's failure type
        return {
            "errors": state["errors"] + [(state["pending_url"], str(e))]
        }
Save flushes items and drops the processed URL off the queue. A small router decides whether to loop back or end:
def save(state: State) -> dict:
    if state["items"]:
        db.insert_many(state["items"])  # stub: your DB client
    # Drop the processed URL and clear the flushed items.
    return {"urls": state["urls"][1:], "items": []}


def route(state: State) -> str:
    # Loop back while URLs remain, otherwise finish.
    return "fetch" if state["urls"] else END
The graph wires the nodes and compiles:
g = StateGraph(State)
g.add_node("fetch", fetch)
g.add_node("extract", extract)
g.add_node("save", save)

g.add_edge(START, "fetch")
g.add_edge("fetch", "extract")
g.add_edge("extract", "save")
g.add_conditional_edges("save", route, {"fetch": "fetch", END: END})

graph = g.compile()
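And run it against an initial state. Each URL costs three steps (fetch, extract, save), and LangGraph caps the number of steps per run via recursion_limit (historically 25 by default), so set it explicitly for anything beyond a handful of URLs:
final = graph.invoke(
    {
        "urls": [
            "https://en.wikipedia.org/wiki/Web_crawler",
            "https://en.wikipedia.org/wiki/Web_scraping",
            "https://en.wikipedia.org/wiki/Apache_Nutch",
            "https://www.python.org/",
        ],
        "items": [],
        "errors": [],
        "pending_html": None,
        "pending_url": None,
    },
    config={"recursion_limit": 100},  # roughly 3 steps per URL, plus headroom
)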
Three nodes, one conditional, pluggable extraction. The orchestration stays the same whether parse_structured is a selector library or an LLM returning JSON. You pick the extraction strategy per project based on how messy the HTML is.
Where AI earns a node
The deterministic extract above works until the HTML stops cooperating: layouts drift per-domain, or what you actually want is something a parser can’t give you. When that happens, you don’t rewrite the node. You add structure.
Three topologies emerge in practice. Each declares the AI choice in the graph itself rather than hiding it inside node code.
Topology 1, deterministic only. The extract node from the previous section. No LLM anywhere. The shape we’ve already built.
Topology 2, LLM-only extraction. When layouts are messy enough that selectors are a dead end, extract itself becomes an LLM call:
import json


def extract(state: State) -> dict:
    prompt = f"Extract item fields as JSON:\n\n{state['pending_html']}"
    response = llm_client.complete(prompt)  # stub: your LLM client
    try:
        item = json.loads(response)
        return {"items": state["items"] + [item]}
    except json.JSONDecodeError as e:
        # The model returned something that is not valid JSON.
        return {
            "errors": state["errors"] + [(state["pending_url"], str(e))]
        }
One node, LLM inside. The graph shape is unchanged from Topology 1.
Topology 3, deterministic with LLM fallback. This is where the graph actually grows. Split extract into two single-purpose nodes, and let a conditional edge decide which one runs:
import json  # used by extract_llm


def extract_det(state: State) -> dict:
    try:
        item = parse_structured(state["pending_html"])  # stub: your parser
        return {"items": state["items"] + [item], "last_parse": "ok"}
    except Exception:  # narrow this to your parser's failure type
        return {"last_parse": "failed"}


def extract_llm(state: State) -> dict:
    prompt = f"Extract fields as JSON:\n\n{state['pending_html']}"
    response = llm_client.complete(prompt)  # stub: your LLM client
    try:
        item = json.loads(response)
        return {"items": state["items"] + [item]}
    except json.JSONDecodeError as e:
        return {
            "errors": state["errors"] + [(state["pending_url"], str(e))]
        }


def route_after_det(state: State) -> str:
    # Pure inspection: skip the LLM when the deterministic parse succeeded.
    return "save" if state.get("last_parse") == "ok" else "extract_llm"
Wire them with one conditional edge:
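# These replace the single "extract" node and its two edges from the minimal pipeline.
g.add_node("extract_det", extract_det)
g.add_node("extract_llm", extract_llm)

g.add_edge("fetch", "extract_det")
g.add_conditional_edges(
    "extract_det",
    route_after_det,
    {"save": "save", "extract_llm": "extract_llm"},
)
g.add_edge("extract_llm", "save")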
State gains one field (last_parse) to carry the deterministic outcome to the router. The LLM choice is now part of the declared graph, not buried inside a node’s if-branch. And because conditional edges can’t mutate state, the routing decision is pure inspection. The nodes do the work.
A simplified state after a successful extraction: {"items": [{"title": "Widget", "price": 42, "description": "useful widget"}], "errors": []}
When a project drifts from deterministic-is-enough into ambiguous-layouts territory, the migration isn’t a rewrite. It’s a graph extension.
What about throughput?
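The example above processes URLs one at a time for readability. Real crawls fan out. LangGraph’s Send primitive is how. Return a list like [Send("fetch", {"url": u}) for u in urls] from the conditional edge function that follows a dispatcher node, and the runtime runs those fetches concurrently, each receiving its own payload as input. State keys that parallel branches write to (items, errors) need a reducer so their updates merge instead of colliding. The rest of the graph (retry, rate-limit, save) keeps its shape. You get concurrency without threading code or a separate worker pool.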
For crawls big enough that one process isn’t enough, LangGraph can run as a persistent service backed by a shared Postgres checkpointer. Each run checkpoints its state to that database, so workers on different machines can run the same graph, and a run that dies can resume from its last checkpoint after a restart. Most small-to-medium jobs don’t need it. When you do, the graph you wrote doesn’t change, only where it runs.
The one sharp edge
Conditional edges in LangGraph are read-only. The router function gets state, looks at it, and returns a destination. It can’t mutate state along the way. If you want to tag a URL as “failed, retry” and route it to the retry branch, the tagging has to happen in the node that precedes the router, or you reach for Command, an API that packages a state update and a routing decision into one node return value.
The first non-trivial retry graph you build will run into this. The intuitive shape (update state as you decide where to go) doesn’t work. Put the mutation in a node, let the routing logic stay pure.
The thing that made it click for me
Crawlers I’ve built had a custom retry/rate-limit/state-transition layer. Libraries handled the mechanics; the coordination between them was always slightly different from project to project, and it was where most of the bugs lived. Not in the parsing. Not in the fetching. In the orchestration.
LangGraph’s contribution is that the orchestration layer becomes a declared shape: nodes, edges, conditions. Not control flow code. The bugs don’t go away (there’s no free lunch), but they move. From somewhere in a sprawling custom orchestrator to somewhere in a graph definition that fits on one page. That’s a much smaller surface to test and reason about. And because each node has a clean state-in, state-out contract, you can test them individually with no framework mocking at all.
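As a sketch of what that looks like, here is the extract node tested as a plain function. The parse_structured stub (it pulls the title text or raises ValueError) and the make_state helper are hypothetical stand-ins for a real parser and fixture; nothing from LangGraph is imported.

```python
def parse_structured(html: str) -> dict:
    """Hypothetical stub parser: returns the <title> text or raises ValueError."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        raise ValueError("no <title> element")
    return {"title": html[start + len("<title>"):end]}


def extract(state: dict) -> dict:
    # Same contract as the article's node: state in, partial update out.
    try:
        item = parse_structured(state["pending_html"])
        return {"items": state["items"] + [item]}
    except ValueError as e:
        return {"errors": state["errors"] + [(state["pending_url"], str(e))]}


def make_state(html: str) -> dict:
    return {
        "urls": ["https://example.com/"],
        "items": [],
        "errors": [],
        "pending_html": html,
        "pending_url": "https://example.com/",
    }


def test_extract_success() -> None:
    update = extract(make_state("<html><title>Widget</title></html>"))
    assert update == {"items": [{"title": "Widget"}]}


def test_extract_failure() -> None:
    update = extract(make_state("<html></html>"))
    assert update == {"errors": [("https://example.com/", "no <title> element")]}


test_extract_success()
test_extract_failure()
```

The same functions run unchanged under pytest; the direct calls at the bottom are only there so the snippet works as a script.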
When I’d reach for this vs. Scrapy
Scrapy is still right for some jobs: multi-domain scrape projects where spider definitions, auto-throttling, and the built-in scheduler do the heavy lifting. Custom code is still right when you need total control and nothing off-the-shelf fits.
LangGraph is worth considering when:
- The pipeline has complex state transitions, not just “scrape each page independently”
- You want typed recovery paths and declared state instead of flags and nested try/except
- You’d rather debug a declared graph than a sprawling custom orchestrator
- You’d like the freedom to drop an LLM into one node (or not) without rewriting the pipeline
- Scale is small-to-medium (hundreds to tens of thousands of URLs, not millions)
When not to reach for it
For a one-shot script that crawls a handful of pages and dumps to a CSV, you don’t need a graph framework. You need a for loop and a short afternoon.
For millions of pages where distributed infrastructure is the real constraint, LangGraph isn’t the answer either. Its strength is reliability at small-to-medium scale, not raw throughput. A purpose-built distributed pipeline does that job.
Where this could go next
Each sequel is its own extension to the same shape: checkpointing so a failed run resumes from the last durable node, retry and backoff promoted to graph nodes when failure patterns get specific, a validation node between extract and save when silent field gaps are the real risk, or the functional API (@task, @entrypoint) as a different style of declaring the same graph. Multi-agent patterns open up further out, when one graph outgrows a single orchestration boundary. For now, this is the shape I reach for when “crawl some things, extract from them, save” shows up again. It’s the first time in a while that the orchestration layer has felt like something I can declare instead of something I have to defend.
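As one example of those extensions, a validation node could be sketched like this. REQUIRED_FIELDS is a hypothetical schema, and the node assumes the same state keys as the minimal pipeline; it would sit between extract and save and move incomplete records into errors instead of letting them through silently.

```python
REQUIRED_FIELDS = ("title", "price")  # hypothetical schema


def validate(state: dict) -> dict:
    valid = []
    errors = list(state["errors"])
    for item in state["items"]:
        # A field counts as missing when absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if item.get(f) in (None, "")]
        if missing:
            errors.append(
                (state["pending_url"], f"missing fields: {', '.join(missing)}")
            )
        else:
            valid.append(item)
    return {"items": valid, "errors": errors}


state = {
    "urls": ["https://example.com/w"],
    "items": [{"title": "Widget", "price": 42}, {"title": "", "price": 5}],
    "errors": [],
    "pending_html": None,
    "pending_url": "https://example.com/w",
}
update = validate(state)
print(update)
```

Wiring it in would mean one extra add_node call and pointing the extract edge at validate instead of save.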