I want to talk about a problem that sounds boring until it ruins your demo: latency in multi-agent systems.
Here's the setup. You have an AI assistant. It needs to create a persona, run a web search, and check your campaign status. Three tool calls to SocioLogic's API. If each call takes 200ms round-trip, that's 600ms just in network time, before any actual computation happens. Add LLM inference time on each side and you're easily looking at 2-3 seconds for what should feel instant.
Now imagine those agents are calling other agents. Agent A asks Agent B for data. Agent B needs to verify credentials with Agent C. Each hop adds latency. In a three-hop chain with 200ms per hop, you're at 600ms of pure network overhead. In practice it's worse, because you also have TLS handshakes, DNS resolution, and serialization at each step.
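To make that concrete, here's a minimal TypeScript sketch of the sequential case. The base URL and endpoint paths are illustrative stand-ins, not the real API:

```ts
// Three dependent tool calls made one after another; each await pays a full
// network round trip. Endpoint names here are illustrative, not the real API.
const BASE = "https://api.sociologic.example";

async function assistantTurn(): Promise<void> {
  const t0 = Date.now();
  const persona = await fetch(`${BASE}/personas`, { method: "POST" }); // round trip 1
  const search = await fetch(`${BASE}/search?q=product+launch`);       // round trip 2
  const status = await fetch(`${BASE}/campaigns/demo/status`);         // round trip 3
  console.log(
    `statuses: ${persona.status}/${search.status}/${status.status}, ` +
      `network time: ${Date.now() - t0}ms`, // roughly 3x one round trip
  );
}
```

Independent calls can be overlapped with Promise.all, but genuinely dependent chains, where each call needs the previous result, can't be parallelized away. That's why per-hop latency is the number that matters.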
This is why we moved SocioLogic's core infrastructure to Cloudflare Workers. And I want to share what we learned, because the tradeoffs aren't obvious.
Why Edge, Specifically
The traditional answer to "my API is slow" is "get a bigger server" or "add a cache." Both help, but neither solves the fundamental physics problem: light through fiber needs about 50ms one way between San Francisco and Tokyo. No amount of server optimization changes that.
Edge deployment means your code runs in 300+ locations worldwide. When an agent in Tokyo calls our API, the request hits a Cloudflare node in Tokyo. When an agent in Frankfurt calls the same API, it hits Frankfurt. The code is identical; only the location changes.
For our use case, this dropped median response times from 180ms to 34ms globally. P99 went from 450ms to 87ms. That's not a tuning improvement; that's a category change.
What We Moved (And What We Didn't)
Not everything belongs on the edge. Here's how we split things:
On the edge (Cloudflare Workers):
- Agent card resolution and capability lookups
- Authentication and token validation
- Signal Relay WebSocket termination
- Rate limiting and abuse detection
- Request routing and load balancing
Still centralized:
- Persona generation (requires GPU inference)
- Campaign execution (long-running, stateful)
- Billing reconciliation
- Registry verification pipeline
The pattern: anything that's stateless, read-heavy, and latency-sensitive goes to the edge. Anything that's stateful, write-heavy, or compute-intensive stays centralized. The edge layer handles the fast path; the origin handles the heavy lifting.
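A compressed sketch of that routing split, assuming the ambient types from @cloudflare/workers-types and an illustrative ORIGIN_URL binding (the route names are stand-ins, not our real surface):

```ts
// Minimal Cloudflare Worker sketch of the edge/origin split described above.
export interface Env {
  ORIGIN_URL: string; // centralized origin, e.g. "https://origin.internal.example"
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { pathname } = new URL(request.url);

    // Fast path: stateless, read-heavy lookups answered at the edge.
    if (pathname.startsWith("/agents/") && request.method === "GET") {
      return resolveAgentCard(pathname);
    }

    // Heavy path: stateful or compute-bound work proxied to the origin.
    return fetch(new Request(`${env.ORIGIN_URL}${pathname}`, request));
  },
};

async function resolveAgentCard(pathname: string): Promise<Response> {
  // Placeholder: a real version would read from KV (see the state sketch below).
  return Response.json({ agent: pathname.split("/")[2], capabilities: [] });
}
```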
The Hard Parts
Edge deployment isn't free, and I don't mean the bill (though that's a conversation too). The real costs are architectural.
State management is genuinely hard. Workers are stateless by default. We use Cloudflare's Durable Objects for session state and KV for cached lookups, but the mental model is different from "I have a database and I query it." You have to think carefully about consistency. When a user updates their agent card, how quickly does that propagate to all 300+ edge locations? Our answer: KV with a 60-second TTL for most data, Durable Objects for anything that needs strong consistency. It's not elegant, but it's predictable.
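Here's a minimal sketch of that read-through pattern, with hypothetical binding names. Sixty seconds is also KV's minimum expirationTtl, which is part of why it's the default:

```ts
// Read-through cache sketch for agent cards: KV with a 60-second TTL,
// falling back to the origin on a miss. Binding names are hypothetical.
export interface Env {
  AGENT_CARDS: KVNamespace;
  ORIGIN_URL: string;
}

async function getAgentCard(id: string, env: Env): Promise<unknown> {
  // Edge-local read; eventually consistent across locations.
  const cached = await env.AGENT_CARDS.get(id, { type: "json" });
  if (cached !== null) return cached;

  // Miss: fetch from the origin, then cache for 60s (KV's minimum TTL).
  const res = await fetch(`${env.ORIGIN_URL}/agents/${id}`);
  const card = await res.json();
  await env.AGENT_CARDS.put(id, JSON.stringify(card), { expirationTtl: 60 });
  return card;
}
```

Anything that can't tolerate that staleness, like a live session, goes to a Durable Object instead, which gives you a single serialized point of truth at the cost of being pinned to one location.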
Debugging distributed systems is painful. When something goes wrong on a Worker in São Paulo at 3am, you're working with logs and traces, not a debugger. We put a lot of time into structured logging and distributed tracing early, and it's paid off, but the feedback loop is still slower than debugging a monolith.
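The single highest-leverage habit, in our experience: emit one JSON object per event, so logs stay queryable no matter which colo they came from. A sketch, with illustrative field names:

```ts
// Structured log sketch; field names are illustrative.
// One JSON object per event keeps logs queryable across 300+ locations.
function logEvent(fields: Record<string, unknown>): void {
  console.log(
    JSON.stringify({
      ts: new Date().toISOString(),
      colo: "GRU", // edge location; in a handler, available via request.cf?.colo
      ...fields,
    }),
  );
}

// Usage inside a handler:
logEvent({ event: "agent_card_miss", agentId: "agent-123", traceId: "abc-42", durationMs: 18 });
```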
Cold starts exist, even on Workers. Cloudflare Workers have a much faster cold start than Lambda (typically under 5ms vs. hundreds of milliseconds), but it's not zero. For agent-to-agent calls where every millisecond counts, we use Workers that stay warm through strategic health checks. It adds complexity but keeps P99 latencies honest.
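One way to implement the warm-keeping, sketched with a Cron Trigger and illustrative routes:

```ts
// Sketch: a Cron Trigger that pings latency-critical routes so their isolates
// stay resident. The routes and hostname are illustrative.
export default {
  async scheduled(_controller: ScheduledController, _env: unknown, ctx: ExecutionContext) {
    const routes = ["/agents/health", "/auth/health", "/relay/health"];
    ctx.waitUntil(
      Promise.all(routes.map((path) => fetch(`https://api.sociologic.example${path}`))),
    );
  },
};
```

The caveat: warmth is per location, and a single cron fires from one place. In practice the probes need to originate near the traffic you care about, for example from external monitors in several regions.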
The Latency Multiplication Problem
Here's why this matters beyond just "faster is better." In traditional web apps, latency is additive. User makes a request, server processes it, server responds. One round trip.
In multi-agent systems, latency is multiplicative: the depth of the call chain multiplies the per-hop cost. An orchestrating agent might make 3-5 tool calls sequentially. If those tool calls themselves involve agent-to-agent communication, you get nested latency. A three-level-deep call chain at 200ms per level is 600ms. The same chain at 35ms per level is 105ms. The user experience difference between those two numbers is the difference between "this feels responsive" and "is it broken?"
We've seen real-world call chains of 4-5 hops in production. At our old latency numbers, those chains would take over a second just in network time. At current numbers, they're under 200ms. That's the difference between an agent that feels like a tool and an agent that feels like a colleague.
Practical Advice
If you're building agent infrastructure and thinking about edge deployment, here's what I'd suggest:
- Measure first. Instrument your existing system and find out where latency actually lives (see the timing sketch after this list). You might be surprised. We thought our biggest bottleneck was compute; it was actually DNS resolution and TLS handshakes.
- Start with the read path. Move reads to the edge before writes. Reads are stateless, cacheable, and low-risk. Agent discovery and capability lookup are perfect candidates.
- Design for eventual consistency. If you need strong consistency everywhere, the edge will fight you. Decide what actually needs to be strongly consistent (billing, auth) and what can tolerate a 60-second delay (capability listings, public metadata).
- Invest in observability early. You cannot debug a distributed system by reading code. You need traces, metrics, and structured logs from day one.
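The timing sketch mentioned in the first bullet: wrap each phase you control and log a per-step breakdown, so latency gets attributed before anyone starts optimizing. Step names and the origin URL are illustrative:

```ts
// Wrap a step, log its duration as structured JSON, return its result.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const t0 = Date.now();
  try {
    return await fn();
  } finally {
    console.log(JSON.stringify({ step: label, durationMs: Date.now() - t0 }));
  }
}

// Usage: attribute one request's cost to its phases.
async function handle(): Promise<unknown> {
  return timed("origin_fetch", () =>
    fetch("https://origin.internal.example/agents/demo").then((r) => r.json()),
  );
}
```

Phases you can't see from inside a Worker, like DNS resolution and TLS setup, are easier to measure from the client side, for example with curl's --write-out timing variables.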
What's Next for Us
We're working on pushing persona memory lookups to the edge using Cloudflare's Vectorize for vector search. This would let agents retrieve persona context without a round trip to our origin servers. Early tests show a 4x improvement in memory retrieval latency. More on that when we've hardened it for production.
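For the curious, the shape of what we're testing looks roughly like this. Binding, index, and field names are hypothetical, and the exact query options may shift as we harden it:

```ts
// Hedged sketch of the edge memory lookup: a Vectorize index queried
// directly from a Worker, no origin round trip.
export interface Env {
  PERSONA_MEMORY: VectorizeIndex;
}

async function recallPersonaContext(env: Env, queryVector: number[], personaId: string) {
  const result = await env.PERSONA_MEMORY.query(queryVector, {
    topK: 5,
    returnMetadata: "all",
    filter: { personaId }, // assumes personaId is an indexed metadata field
  });
  return result.matches.map((m) => m.metadata);
}
```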
The agent infrastructure layer needs to be as fast as the agents it serves. For us, that means meeting agents where they are, literally, at the nearest edge node. Thirty-four milliseconds at a time.