Margot started the way most side projects do: with a question nobody asked and a weekend with nothing better to do. The question was simple enough — what if I wrapped a bunch of Google Cloud APIs in a single chat interface and let an LLM figure out which one to call? That was November 2024. Fourteen months later, the system has 140+ tools, four layers of memory, browser automation, an iOS companion, and a coding subagent that costs a fraction of a cent per task.
This post is an honest look at how Margot evolved: what worked, what didn't, and the decisions that shaped the system as it exists today.
The First Weekend
The initial prototype was embarrassingly simple: a FastAPI server, a single OpenAI function-calling prompt, and about five Google Analytics tools. The idea was to skip the GA4 UI entirely — just ask "what were my top pages last week?" and get an answer. It worked well enough that I kept going.
Within two weeks, the tool count had grown to include Search Console, Gmail, and Drive. Each integration followed the same pattern: write the API wrapper, define the function schema, add it to the system prompt. The coordinator model handled routing surprisingly well, even without explicit instructions about when to use which tool.
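To make the "define the function schema" step concrete, here is one illustrative tool definition in the OpenAI function-calling format. The tool name and parameters are my assumptions for a "top pages" query, not Margot's actual code:

```python
# A hypothetical GA4 tool schema in the OpenAI function-calling format.
# The name "ga4_top_pages" and its fields are illustrative assumptions.
top_pages_tool = {
    "name": "ga4_top_pages",
    "description": "Return the most-viewed pages for a GA4 property over a date range.",
    "parameters": {
        "type": "object",
        "properties": {
            "property_id": {"type": "string", "description": "GA4 property ID"},
            "start_date": {"type": "string", "description": "e.g. '7daysAgo'"},
            "end_date": {"type": "string", "description": "e.g. 'today'"},
            "limit": {"type": "integer", "description": "Max rows to return"},
        },
        "required": ["property_id"],
    },
}
```

Each new integration repeats this shape: the API wrapper does the work, and a schema like this tells the coordinator model when and how to call it.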
The Coordinator Problem
The first real architectural challenge came when the tool count crossed 30. The system prompt was getting unwieldy, and the model occasionally picked the wrong tool — calling a GA4 function when the user clearly wanted Search Console data, or invoking Gmail search when they asked about a file in Drive.
I tried keyword routing first. It was fast and cheap but hopelessly brittle. "Check my search traffic" — is that Search Console or GA4? "Find the report" — Gmail attachment or Drive file? Every edge case needed another rule, and the rules started conflicting.
The solution was to trust the LLM more, not less. The LOBSTER-01 framework organizes tools into four tiers and lets the model reason about which tier applies. Purpose-built tools first, shell commands second, browser automation third, direct answer as a fallback. The model's accuracy jumped above 95% once it had a clear decision framework instead of a flat list of 60+ tools.
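The tier ordering above can be sketched as a small registry that sorts candidate tools by priority before they reach the coordinator prompt. This is a minimal illustration of the idea, not the LOBSTER-01 implementation; all names here are hypothetical:

```python
from dataclasses import dataclass, field

# Tier priority mirroring the four-tier decision framework described above:
# purpose-built tools first, shell second, browser third, direct answer last.
TIERS = ["purpose_built", "shell", "browser", "direct_answer"]

@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)  # tool name -> tier name

    def register(self, name: str, tier: str) -> None:
        assert tier in TIERS, f"unknown tier: {tier}"
        self.tools[name] = tier

    def candidates_by_tier(self, names: list) -> list:
        """Order candidate tools by tier priority so the coordinator
        sees an explicit decision framework rather than a flat list."""
        return sorted(names, key=lambda n: TIERS.index(self.tools[n]))

registry = ToolRegistry()
registry.register("ga4_report", "purpose_built")
registry.register("run_shell", "shell")
registry.register("browser_click", "browser")

order = registry.candidates_by_tier(["browser_click", "ga4_report", "run_shell"])
```

The point is that the model still makes the final call; the tiers just give it a consistent order in which to consider its options.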
Memory Changes Everything
The single biggest upgrade wasn't adding more tools — it was adding memory. Before the four-layer system, every conversation started from zero. The user had to re-explain their GA4 property IDs, their preferred report format, which email account to use. It was functional but exhausting.
The memory architecture draws from Mem0's approach: separate layers for different temporal needs. Short-term memory holds the current conversation. Semantic memory stores persistent facts ("my main GA4 property is 123456789"). Episodic memory summarizes past conversations so the system can reference previous work. Procedural memory captures workflow patterns so repeated tasks get faster over time.
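The four layers can be sketched as a single store with one retrieval method that pulls from all of them. This is a toy version under stated assumptions: real retrieval would use embeddings, and every name here is illustrative rather than Margot's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Four layers, per the architecture described above (names illustrative).
    short_term: list = field(default_factory=list)   # current conversation turns
    semantic: dict = field(default_factory=dict)     # persistent facts, keyed by topic
    episodic: list = field(default_factory=list)     # summaries of past conversations
    procedural: dict = field(default_factory=dict)   # workflow patterns by task name

    def assemble_context(self, query: str, max_items: int = 5) -> list:
        """Gather relevant items from all four layers into one context.
        Naive substring matching stands in for embedding retrieval."""
        hits = []
        hits += [f"fact: {k}={v}" for k, v in self.semantic.items() if k in query]
        hits += [f"episode: {e}" for e in self.episodic
                 if any(word in e for word in query.split())]
        hits += [f"workflow: {k}" for k in self.procedural if k in query]
        hits += [f"turn: {t}" for t in self.short_term[-2:]]  # recency window
        return hits[:max_items]

mem = MemoryStore()
mem.semantic["ga4_property"] = "123456789"
mem.short_term.append("what were my top pages last week?")
ctx = mem.assemble_context("show ga4_property report")
```

Even this toy version shows the payoff: the stored property ID surfaces automatically, so the user never has to repeat it.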
The extraction pipeline runs after each conversation, pulling new memories asynchronously. Deduplication keeps the knowledge base from growing without bound: if a new memory is more than 95% similar to an existing one, the pipeline updates that entry rather than adding a duplicate. In practice, the system assembles relevant context from all four layers in under 150ms.
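The update-or-insert logic can be sketched in a few lines. As an assumption for this sketch, `difflib.SequenceMatcher` stands in for the embedding similarity a production system would use, so the example stays dependency-free:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.95  # >95% similar -> update in place, not duplicate

def upsert_memory(store: list, new_memory: str) -> list:
    """Refresh a near-identical existing memory, or append a new one.
    SequenceMatcher is a stand-in for embedding-based similarity."""
    for i, existing in enumerate(store):
        if SequenceMatcher(None, existing, new_memory).ratio() > SIMILARITY_THRESHOLD:
            store[i] = new_memory  # near-duplicate: update the existing entry
            return store
    store.append(new_memory)       # genuinely new: append
    return store

store = ["my main GA4 property is 123456789"]
upsert_memory(store, "my main GA4 property is 123456789.")  # near-duplicate
upsert_memory(store, "preferred report format: weekly CSV")  # new fact
```

After both calls the store still holds two entries: the first was updated in place, the second appended.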
Browser Automation and the Ralph Loop
Browser automation was the feature I resisted longest and now use the most. The Ralph Loop pattern — screenshot, reason, act, verify — handles surprisingly complex web interactions. LinkedIn job applications, form filling, data extraction from sites without APIs. The key insight was skills: site-specific knowledge about CSS selectors, workflow sequences, and common failure modes. 65 discovered skills later, the browser agent handles most tasks without intervention.
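The screenshot-reason-act-verify cycle can be sketched as a bounded loop over four pluggable hooks. The callables here are assumptions standing in for real browser and LLM integrations, not the actual Ralph Loop code:

```python
def ralph_loop(goal, screenshot, reason, act, verify, max_steps=10):
    """One sketch of the screenshot -> reason -> act -> verify cycle.
    All four hooks are illustrative stand-ins for real integrations."""
    for _ in range(max_steps):
        state = screenshot()          # capture the current page state
        if verify(goal, state):       # goal reached? stop here
            return True
        action = reason(goal, state)  # model decides the next action
        act(action)                   # execute it (click, type, navigate...)
    return False                      # give up after max_steps

# Toy simulation: "clicking" three times reaches the goal.
page = {"clicks": 0}
done = ralph_loop(
    goal=3,
    screenshot=lambda: page["clicks"],
    reason=lambda goal, state: "click",
    act=lambda action: page.update(clicks=page["clicks"] + 1),
    verify=lambda goal, state: state >= goal,
)
```

The `max_steps` bound matters: without it, a misread page state can send the loop in circles forever.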
The hardening sprint (LOBSTER2) added the reliability guarantees that made browser automation production-ready. A canonical failure taxonomy, deterministic recovery paths, eval gates before every release. Without those guardrails, browser automation is a party trick. With them, it's a tool people actually depend on.
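An eval gate reduces to a simple rule: run every case, and block the release if the pass rate is below a bar. This sketch uses a hypothetical threshold and case shape, purely to illustrate the mechanism:

```python
def release_gate(eval_cases, run_case, min_pass_rate=0.9):
    """Run every eval case; ship only if the pass rate clears the bar.
    The 0.9 threshold and case shape are illustrative assumptions."""
    results = [bool(run_case(case)) for case in eval_cases]
    pass_rate = sum(results) / len(results)
    return pass_rate >= min_pass_rate, pass_rate

# Toy harness: a case is (input, expected) against a stub "agent".
agent = lambda text: text.upper()
cases = [("ok", "OK"), ("hi", "HI"), ("no", "NOPE")]
ship_ok, rate = release_gate(cases, lambda c: agent(c[0]) == c[1])
```

Here one failing case drops the pass rate to 2/3, so the gate blocks the release, which is exactly the behavior that turns browser automation from a party trick into something dependable.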
What I'd Do Differently
If I started over, I'd build the memory system first, not last. Too many early architectural decisions assumed statelessness, and retrofitting memory required touching nearly every component.
I'd also invest in structured evaluation earlier. For the first six months, "does it work?" meant me trying things manually. The eval harness and scorecard system that now gates browser automation releases should have been project-wide from the start.
And I'd pick the coordinator model more carefully. I went through four different models before landing on Grok 4.1 Fast, and each migration was painful. The 2M token context window and strong tool-calling accuracy made it worth the switch, but I wish I'd been more systematic about model evaluation from day one.
Where It Goes Next
Margot is open source and free to use — bring your own API keys. The current focus is on reliability: better error recovery, more comprehensive testing (1,248 tests and counting), and expanding the eval harness beyond browser automation to cover the full tool ecosystem.
The system is far from finished, but it's reached the point where I use it every day for real work. That's the bar I set for the first version, and clearing it felt like the right time to write about the journey.