Technical Paper

Executive Summary

Margot is a multi-agent AI assistant for macOS and iOS that orchestrates tools across Google Cloud services, GitHub, desktop operations, browser automation, and specialized knowledge bases. Built with a Tauri 2.0 + React frontend, a native SwiftUI iOS client, and a FastAPI/Python backend, the system is designed around a model-selectable coordinator agent that routes tasks to purpose-built tools and specialized subagents, backed by a four-layer cognitive memory architecture and a context laundering system that keeps conversations sharp and affordable.

This paper documents the system as of March 2026, covering the model-selectable coordinator, context laundering, memory system, tool ecosystem, specialized subagents, browser automation, research capabilities, self-extending skills, scheduled task automation, performance optimizations, and the iOS app.

Key specifications: Model-selectable coordinator routed through any model on OpenRouter, with 10 preconfigured models including Grok 4.1 Fast, Claude Sonnet 4.6, and Gemini 3.1 Pro. 84 Google Cloud tools spanning GA4, Search Console, Merchant Center, Gmail, Drive, Sheets, Calendar, and PageSpeed Insights. 21 Desktop Commander tools for filesystem and terminal operations. 48 browser skills with deterministic recovery and release gating. Three tiers of research, from standard chat to four-agent deep research with cross-pollination. Context laundering that processes tool payloads before LLM reinjection. Image generation (Gemini 3 Pro), video generation (Veo 3.1, Sora 2), and voice synthesis (ElevenLabs). Scheduled task automation with verification contracts and mutation guardrails. A backend test suite with 1,248 passing tests, 1 skipped, and zero failures.

Technology Stack

Layer	Technology	Purpose
Frontend	Tauri 2.0 + React/TypeScript	Cross-platform desktop shell with native performance
Backend	FastAPI (Python 3.12)	Async API server with SSE streaming
Database	PostgreSQL 18 + pgvector 0.8.1	Relational storage with vector similarity search
Coordinator LLM	Model-selectable via OpenRouter (default: Grok 4.1 Fast)	Primary orchestrator with 10 preconfigured models
Browser Agent	Gemini 3 Flash (loop) / Grok 4.1 Fast (orchestration)	Autonomous browser actions
Context Laundering	GPT-4o-mini	Tool payload compression before LLM reinjection
Embeddings	OpenAI text-embedding-3-small	1536-dimension vectors for memory search
Image Generation	Gemini 3 Pro Image (Nano Banana Pro)	Text-to-image, editing, and blending
Video Generation	Veo 3.1 (Google AI Studio)	Text-to-video, image-to-video, frame interpolation
Video Generation	Sora 2 (OpenAI)	Text-to-video and image-to-video
Voice Synthesis	ElevenLabs (eleven_turbo_v2_5)	Text-to-speech with streaming
Browser	Playwright + CDP (Chrome port 9222)	Automated web interaction
iOS App	SwiftUI + SSE + APNs	Native client with push notifications
Networking	Tailscale (WireGuard mesh)	Secure remote access for iOS

The Coordinator Agent

The coordinator is the central intelligence of the system. Every user query passes through it, and it decides whether to answer directly, invoke tools, delegate to subagents, or execute multi-step workflows through its ReAct (Reason + Act) loop.

Model Selection and Configuration

The coordinator is model-selectable, routing through any model available on OpenRouter. Ten models are preconfigured across certified and fallback tiers, and users can switch coordinators per conversation from both the desktop and iOS clients.

Model	Provider	Tier
Grok 4.1 Fast	xAI	Certified (default)
Claude Sonnet 4.6	Anthropic	Certified
Gemini 3.1 Pro	Google	Certified
GPT-5.4	OpenAI	Fallback
GLM-5	Z-AI	Fallback
MiniMax M2.5	MiniMax	Fallback
MiniMax M2.7	MiniMax	Fallback
Kimi K2.5	Moonshot AI	Fallback
MiMo V2 Pro	Xiaomi	Fallback
MiMo Omni	Xiaomi	Fallback

Parameter	Value
Max ReAct Iterations	30
Prompt Caching	Enabled (5-minute TTL, 85% hit rate)
Tool Calling Accuracy	>95%

Coordinator Resilience

Interactive chat includes automatic resilience for model failures. A failure that occurs before the first real content chunk arrives is classified as either transient or structural. A transient failure triggers one retry with the selected model. If the retry fails, or if the failure is classified as a model/provider issue, the coordinator automatically falls back to the default certified model (Grok 4.1 Fast). Scheduled tasks do not participate in user-selected coordinator routing and remain pinned to the default certified coordinator path.
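The retry-then-fallback policy can be illustrated with a minimal sketch. The names `call_with_resilience`, `ModelError`, `FailureClass`, and the model identifier string are hypothetical, and the sketch assumes a structural failure is the same thing as a model/provider issue; it is not the system's actual implementation.

```python
import asyncio
from enum import Enum

DEFAULT_MODEL = "grok-4.1-fast"  # hypothetical identifier for the default certified model


class FailureClass(Enum):
    TRANSIENT = "transient"
    STRUCTURAL = "structural"  # assumed here to mean a model/provider issue


class ModelError(Exception):
    """Raised by the model call before any content chunk has been produced."""

    def __init__(self, message: str, failure_class: FailureClass):
        super().__init__(message)
        self.failure_class = failure_class


async def call_with_resilience(call_model, selected_model: str, prompt: str) -> tuple[str, str]:
    """Return (model_used, response) after applying retry and fallback."""
    attempts = 0
    while True:
        try:
            return selected_model, await call_model(selected_model, prompt)
        except ModelError as err:
            attempts += 1
            # Exactly one retry on the selected model, and only for transient failures.
            if err.failure_class is FailureClass.TRANSIENT and attempts == 1:
                continue
            break
    # Retry exhausted or structural failure: fall back to the default certified model.
    return DEFAULT_MODEL, await call_model(DEFAULT_MODEL, prompt)
```

In this sketch a failure of the fallback call itself propagates to the caller, which is one possible design choice among several.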

Internet Provider Selection

Each chat session supports a scoped internet provider picker, allowing users to switch between Brave Search (lower-cost web_search) and Firecrawl (search/fetch parity with web_fetch and research runtime support). The selection persists per conversation and is available on both desktop and iOS clients.

4-Tier Tool Selection Strategy (LOBSTER-01)

Rather than relying on brittle keyword routing, the coordinator uses an LLM-native decision framework organized into four prioritized tiers:

Tier	Approach	Examples
1 — Purpose-Built	Direct tool invocation	Google Cloud APIs, GitHub MCP, Desktop Commander file ops, research, image/video generation
2 — Shell Fallback	Terminal commands with safety analysis	System info, package management, git operations (3-level safety: SAFE / NEEDS_CONFIRMATION / BLOCKED)
3 — Browser Navigation	Lightweight browser tools for known URLs	Page snapshots, form filling, button clicks (6 always-available tools; full Ralph Loop requires toggle)
4 — Direct Answer	No tool needed	General knowledge, conversation, reasoning

This hierarchy ensures that the most efficient and reliable path is always attempted first. Shell commands undergo a three-level safety analysis: SAFE commands execute immediately, NEEDS_CONFIRMATION commands require user approval, and BLOCKED commands (destructive operations like rm -rf, sudo, mkfs) are refused entirely. The safety system decomposes pipe chains and scans embedded Python and AppleScript for dangerous patterns.
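A simplified sketch of the three-level classification follows. The command lists are illustrative placeholders (the actual rule sets are not given here), the function names are hypothetical, and the scan of embedded Python and AppleScript is omitted. The chain is split naively on shell operators, so quoted operators make a segment unparseable and the sketch refuses it, which errs on the conservative side.

```python
import re
import shlex
from enum import Enum


class Safety(Enum):
    SAFE = 0
    NEEDS_CONFIRMATION = 1
    BLOCKED = 2


# Illustrative rule sets only; the real lists are assumptions.
BLOCKED_COMMANDS = {"sudo"}
SAFE_COMMANDS = {"ls", "cat", "echo", "pwd", "grep", "wc", "head", "tail"}


def classify_segment(segment: str) -> Safety:
    """Classify one command from a decomposed chain."""
    try:
        tokens = shlex.split(segment)
    except ValueError:
        return Safety.BLOCKED  # unparseable input is refused
    if not tokens:
        return Safety.SAFE
    command, args = tokens[0], tokens[1:]
    if command in BLOCKED_COMMANDS or command.startswith("mkfs"):
        return Safety.BLOCKED
    # Treat any rm flag combining recursive and force as destructive.
    if command == "rm" and any(
        a.startswith("-") and "r" in a.lower() and "f" in a.lower() for a in args
    ):
        return Safety.BLOCKED
    if command in SAFE_COMMANDS:
        return Safety.SAFE
    return Safety.NEEDS_CONFIRMATION


def classify_command(command_line: str) -> Safety:
    """Decompose a chain on |, ||, && and ; and return the strictest verdict."""
    segments = re.split(r"\|\||&&|[|;]", command_line)
    return max((classify_segment(s) for s in segments), key=lambda s: s.value)
```

Taking the strictest verdict across segments means a single blocked command anywhere in a pipe chain blocks the whole line.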

ReAct Loop

The coordinator operates through a ReAct (Reason + Act) loop that supports up to 30 iterations per query. Each iteration follows a consistent cycle: the model reasons about the current state, selects and invokes one or more tools, observes the results, and decides whether to continue or produce a final response.

Parallel tool calling is enabled when the coordinator identifies independent operations. Multiple tools execute concurrently via asyncio.gather(), with dependency detection preventing conflicts such as simultaneous read/write operations on the same file. This achieves a 30–70% reduction in total API call time for multi-tool queries, depending on tool mix.
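One way to combine asyncio.gather() with dependency detection is to group calls into batches so that conflicting calls never share a batch. The sketch below assumes a conflict means two calls touching the same file where at least one writes; `ToolCall`, `plan_batches`, and `execute` are hypothetical names, not the system's own.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class ToolCall:
    name: str
    run: Callable[[], Awaitable[str]]
    path: Optional[str] = None  # file the call touches, if any
    writes: bool = False


def conflicts(a: ToolCall, b: ToolCall) -> bool:
    """Two calls conflict if they touch the same file and at least one writes."""
    return a.path is not None and a.path == b.path and (a.writes or b.writes)


def plan_batches(calls: list[ToolCall]) -> list[list[ToolCall]]:
    """Place each call in the batch right after the last batch it conflicts with."""
    batches: list[list[ToolCall]] = []
    for call in calls:
        target = 0
        for i, batch in enumerate(batches):
            if any(conflicts(call, other) for other in batch):
                target = i + 1
        if target == len(batches):
            batches.append([])
        batches[target].append(call)
    return batches


async def execute(calls: list[ToolCall]) -> list[str]:
    """Run batches in order; calls inside a batch run concurrently."""
    results: list[str] = []
    for batch in plan_batches(calls):
        results.extend(await asyncio.gather(*(c.run() for c in batch)))
    return results
```

Results come back in batch order rather than submission order, so a caller that needs the original order would have to carry an index alongside each call.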

Cognitive Memory Architecture

Margot’s memory system is inspired by the Mem0 architecture and organized into four distinct layers, each serving a different temporal and functional purpose. In benchmarks, this four-layer system achieves a 26% accuracy improvement over baseline single-layer memory approaches.

Memory Layers

Layer	Purpose	Storage	Retention
Short-Term	Current conversation context	PostgreSQL (conversations + messages)	Session duration
Semantic	Long-term facts and preferences	PostgreSQL + pgvector (HNSW index)	Persistent with confidence decay
Episodic	Past conversation summaries	PostgreSQL + pgvector	Persistent with importance scoring
Procedural	Reusable workflow patterns	PostgreSQL + pgvector	Persistent with success tracking

Before each coordinator invocation, the Memory Manager assembles context from all four layers: the last 20 messages from short-term memory, the top 5 semantically relevant facts, the top 3 related episodic summaries, and the top 2 applicable procedural workflows. This assembly targets sub-150ms latency using HNSW vector indexes with configurable similarity thresholds (0.7 for semantic, 0.6 for episodic, 0.5 for procedural).
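The assembly step can be sketched with the limits and thresholds stated above. The vector search itself is left out: the sketch assumes each layer has already returned candidate hits with similarity scores, and it assumes the thresholds are inclusive. `Hit`, `select_hits`, and `assemble_context` are illustrative names.

```python
from dataclasses import dataclass

# (top_k, similarity threshold) per layer, as stated in the text.
LAYER_LIMITS = {"semantic": (5, 0.7), "episodic": (3, 0.6), "procedural": (2, 0.5)}
SHORT_TERM_MESSAGES = 20


@dataclass
class Hit:
    text: str
    similarity: float


def select_hits(layer: str, hits: list[Hit]) -> list[Hit]:
    """Keep hits at or above the layer threshold, best first, capped at top_k."""
    top_k, threshold = LAYER_LIMITS[layer]
    eligible = [h for h in hits if h.similarity >= threshold]
    return sorted(eligible, key=lambda h: h.similarity, reverse=True)[:top_k]


def assemble_context(messages: list[str], hits_by_layer: dict[str, list[Hit]]) -> dict[str, list[str]]:
    """Combine recent messages with the filtered hits from the three vector layers."""
    context = {"short_term": messages[-SHORT_TERM_MESSAGES:]}
    for layer in LAYER_LIMITS:
        context[layer] = [h.text for h in select_hits(layer, hits_by_layer.get(layer, []))]
    return context
```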

Memory Extraction Pipeline

After each conversation, an asynchronous background task extracts new memories using the MemoryExtractor, which is powered by the coordinator model. The extractor performs three operations: semantic memory extraction (identifying facts, preferences, context, and entities), episodic summary generation (condensing conversations with importance scoring), and procedural workflow detection (identifying repeatable patterns from tool usage sequences). Deduplication is handled via similarity thresholds: memories with >0.95 cosine similarity to existing entries update the existing memory rather than creating duplicates.
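The deduplication rule can be shown with a small in-memory sketch. It uses a plain Python list in place of PostgreSQL and pgvector, and the `upsert_memory` helper and its record layout are assumptions made for illustration.

```python
import math

DEDUP_THRESHOLD = 0.95  # cosine similarity above which a memory counts as a duplicate


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def upsert_memory(store: list[dict], text: str, embedding: list[float]) -> str:
    """Update the closest existing memory if it is a near-duplicate, else insert."""
    best = max(store, key=lambda m: cosine_similarity(m["embedding"], embedding), default=None)
    if best is not None and cosine_similarity(best["embedding"], embedding) > DEDUP_THRESHOLD:
        best["text"] = text
        best["embedding"] = embedding
        return "updated"
    store.append({"text": text, "embedding": embedding})
    return "inserted"
```

With real 1536-dimension embeddings the nearest-neighbor lookup would be delegated to the vector index rather than computed in a Python loop.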

Integrated Tool Ecosystem

Margot integrates tools organized into several service categories. All tool interactions are managed through the Model Context Protocol (MCP) pattern, providing consistent invocation, error handling, and result formatting.

Google Cloud Services (84 Tools)

The Google Cloud integration provides the largest single tool surface, spanning analytics, search performance, merchant data, email, file storage, spreadsheets, calendar management, and website performance auditing.

Service	Tools	Connection Type
Google Analytics 4	Reports, real-time data, properties, conversions, custom dimensions	Marketing
Google Search Console	Search analytics, sitemaps, site management	Marketing
Google Merchant Center	Product listings, status, issues	Marketing
Gmail	Search, read, send, reply, labels, batch operations (14 tools)	Productivity
Google Drive	Search, read, upload, download, folder management (10 tools)	Productivity
Google Sheets	Read, write, format spreadsheet data	Productivity
Google Calendar	Events, recurring meetings, available slot finder	Productivity
PageSpeed Insights	Performance audits, Core Web Vitals, mobile vs desktop	Productivity

Gmail (14 tools): Search with Gmail query syntax, full message and thread reading, send with CC/BCC and HTML formatting, reply with proper threading, label management, and batch delete/archive operations.

Drive (10 tools): File search with query syntax, automatic export of Google Docs to text/CSV/PDF, resumable uploads with MIME auto-detection, and folder management.

Sheets (8 tools): A1 notation reads, multi-range batch operations, append to next empty row, RAW and USER_ENTERED input modes, and copy-paste with format preservation.

Calendar (13 tools): Event CRUD with Google Meet integration, natural language quick-add, RRULE recurring events, free/busy checking, and an intelligent available-slot finder that respects working hours.

Multi-Connection OAuth Architecture

A key architectural decision is the multi-connection OAuth system that supports multiple Google accounts per user, with automatic routing based on tool type. Marketing tools (GA4, GSC, GMC) route to one OAuth connection, while productivity tools (Gmail, Drive, Sheets, Calendar) route to another. This prevents scope conflicts and allows clean separation between business analytics and personal productivity.

Credentials are stored with Fernet AES-128 encryption in PostgreSQL, with automatic token refresh and connection-type routing. The system supports marketing, productivity, and custom connection types, with a maximum of 5 connections per user.
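Routing by tool type could look like the following sketch. It assumes the service can be inferred from a token in the tool name (as in gcloud_run_ga4_report); the other tool names used in the example, and the helper names themselves, are hypothetical.

```python
from typing import Optional

# Service tokens mapped to connection types, following the routing described above.
SERVICE_CONNECTION_TYPE = {
    "ga4": "marketing", "gsc": "marketing", "gmc": "marketing",
    "gmail": "productivity", "drive": "productivity",
    "sheets": "productivity", "calendar": "productivity",
}
MAX_CONNECTIONS_PER_USER = 5


def connection_type_for(tool_name: str) -> Optional[str]:
    """Infer the connection type from a service token in the tool name."""
    for token in tool_name.lower().split("_"):
        if token in SERVICE_CONNECTION_TYPE:
            return SERVICE_CONNECTION_TYPE[token]
    return None  # not a Google tool, so no OAuth connection is needed


def can_add_connection(existing_connection_count: int) -> bool:
    """Enforce the per-user connection cap."""
    return existing_connection_count < MAX_CONNECTIONS_PER_USER
```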

Desktop Commander (21 Tools)

Desktop Commander provides local filesystem operations, terminal command execution, process management, and file search. These tools operate through MCP and are classified either as Tier 1 purpose-built tools (file operations) or as Tier 2 shell fallback (terminal commands, subject to the three-level safety analysis).

Knowledge Sources

Research agents draw on specialized knowledge APIs, which the LLM selects autonomously based on query context rather than through hardcoded routing rules:

Source	API	Best For	Credibility Tier
Wikipedia	Wikipedia API	Background facts, definitions	Tier 2
arXiv	arXiv API	CS/physics/math preprints	Tier 1
PubMed	PubMed/NCBI	Biomedical peer-reviewed research	Tier 1
Stack Exchange	Stack Exchange API	Programming Q&A	Tier 3
News	NewsAPI	Current events (30 days)	Tier 2

Specialized Subagents

Margot delegates specialized tasks to purpose-built subagents. Each subagent uses a model optimized for its domain and operates with its own tool set and system prompt.

Nano Banana Pro — Image Generation

Model: Gemini 3 Pro Image (via Google AI Studio)

Three operations: text-to-image generation (up to 4K resolution, 1–4 images per request, up to 14 reference images for style guidance), image editing (natural language modifications to existing images), and image blending (merging 2–14 images). Features include text rendering in images, Google Search grounding for real-time data, and professional camera controls. Nano Banana is user-toggled and bypasses the coordinator entirely for direct model routing.

Sora 2 — Video Generation

Model: OpenAI Sora 2 / Sora 2 Pro

Text-to-video and image-to-video generation supporting 720p and 1080p resolutions with configurable duration. Output is delivered as base64-encoded MP4. The Pro variant offers higher quality at increased cost and processing time.

Veo 3.1 — Video Generation

Model: Google Veo 3.1 (veo-3.1-generate-preview via Google AI Studio)

Veo 3.1 provides three video creation modes: text-to-video generation with native AI-generated audio at up to 1080p resolution, image-to-video animation (single image with prompt-guided motion), and frame interpolation (smooth transitions between two keyframe images). Videos are generated at 24fps in 4-, 6-, or 8-second clips, with an extension feature allowing iterative lengthening up to 148 seconds total.

Operation	Input	Resolution	Cost
Text-to-Video	Text prompt	720p / 1080p	$0.35 / $0.50 per 8s
Image-to-Video	Image + text prompt	720p / 1080p	$0.35 / $0.50 per 8s
Frame Interpolation	2 images + prompt	720p / 1080p	$0.35 / $0.50 per 8s
Video Extension	Existing Veo video	720p only	Proportional

Like Nano Banana, Veo 3.1 is user-toggled via the chat interface and bypasses the coordinator entirely for direct model routing. SSE streaming provides real-time generation status updates to the frontend.

ElevenLabs — Voice Synthesis

Model: eleven_turbo_v2_5 (ElevenLabs)

Text-to-speech conversion with multiple voice options and streaming support. The coordinator can invoke voice synthesis as a tool, converting any text response or content into audio output. The system supports quality control parameters and voice selection, enabling audio delivery alongside or in place of text responses.

Browser Agent — Web Automation

The browser agent uses the Ralph Loop pattern for iterative web interaction through a Chrome instance with CDP enabled on port 9222. The architecture consists of three layers:

RalphLoop: The core iteration loop that takes a screenshot/snapshot, sends it to the LLM for action selection, executes the action, and verifies the result. Each loop has configurable max iterations (default 15) and a 5-minute timeout. The loop includes modal-only snapshot filtering (reducing 60K+ characters to 5–10K of relevant content), deterministic click paths for known UI patterns, element blacklisting after repeated failures, and automatic form autofill from user configuration.

BrowserOrchestrator: Manages sequences of Ralph Loops for complex multi-step workflows. For example, a LinkedIn job application is decomposed into three sequential loops: find_job, apply, and return_to_list. The orchestrator tracks completion rates and supports configurable failure modes (FAIL_FAST, SKIP_CYCLE, CONTINUE).

Skills Framework V2: 48 skills provide site-specific knowledge including CSS selectors, workflow sequences, and known gotchas. Skills are selected via weighted scoring across five dimensions: exact trigger match (100 points), domain match (80 points), URL pattern match (60 points), partial trigger match (40 points), and tag match (20 points). A token budget system prevents context bloat from over-injection.

Browser Autonomy Hardening

A 12-ticket hardening sprint (LOBSTER2) added contract-driven reliability and release gating to the browser automation system:

Deterministic Recovery: A canonical failure taxonomy classifies every browser error by failure_code, failure_origin, and retry_class. Trace artifacts capture full execution context for each failure. A bounded self-healing loop generates candidate fixes from traces, validates them through a chain of checks, runs shadow replays, and makes accept/reject decisions with persisted recovery state.

Eval Gates and Scorecard: A three-stage quality gate system (stage1/stage2/production) runs evaluation harness packs before any release. A scorecard CLI aggregates pass/fail metrics across all evaluators. Rollout telemetry dimensions track reliability metrics in production, with an operator runbook documenting triage, rollback, and escalation procedures.

Unified Skill Inventory: Deterministic shared selectors and capability policy enforcement ensure consistent element targeting across skills. A validation chain verifies candidate skills before shadow replay, preventing regression from generated recovery paths.

Teach Recording

Users can record browser workflows as structured step sequences—clicks, inputs, navigations, and file uploads—and replay them deterministically. The recorder captures events including file_upload actions with stable path placeholders (resume_path, cover_letter_path, portfolio_path), enabling reliable replay of workflows that involve document attachments.

Deterministic replay executes recorded steps via the Ralph executor with native browser_file_upload support. Legacy recorded steps using fill/type actions with upload-like selectors are auto-normalized to explicit upload_file steps for backward compatibility. The recorder verification system distinguishes active listeners from stale localStorage flags, and modal state detection surfaces actionable guidance when the browser is blocked by a native file chooser dialog rather than reporting a generic failure.

Research Capabilities

Margot offers three tiers of research, each balancing depth, cost, and speed:

Feature	Standard Chat	Light Research V2	Deep Research V2
Architecture	Single model (Grok)	Subtopic pipeline	4 independent researchers
Depth	Single perspective	Subtopic decomposition	Angle-based with cross-pollination
Qualifying Questions	No	No	Yes (when needed)
Source Credibility	Basic	Basic	Required (4-tier system)
Criticism Search	Optional	Optional	Required
Synthesis	Direct response	Automatic report	Consensus/disagreement analysis
Estimated Time	5–15 seconds	30–90 seconds	90–180 seconds
Estimated Cost	~$0.001	~$0.01	~$0.016

Light Research V2

The mid-tier research option uses a three-phase subtopic pipeline: a planning phase decomposes the query into 2–4 focused subtopics with tailored search queries, a parallel research phase dispatches independent subtopic researchers that each search, scrape, and extract facts concurrently, and a synthesis phase combines all findings into a formatted report with citations via a built-in to_markdown() method.

Phase	Component	Description
Planning	ResearchCoordinator	Decomposes query into 2–4 focused subtopics with search queries
Research	SubtopicResearcher	For each subtopic: search → scrape → compress (fact extraction)
Synthesis	ReportSynthesizer	Combines findings into FinalReport with executive summary, sections, and citations

All three phases use Grok 4.1 Fast as a unified model. The system supports up to 4 concurrent researchers, 5 search results per subtopic, and 3 URLs scraped per subtopic. Granular SSE progress events stream real-time updates including planning status, per-subtopic search/scrape/compress progress, and synthesis state. The frontend renders these as an animated progress display positioned between the user query and assistant response.

Deep Research V2

The premium research tier uses four independent researcher agents (Alpha, Beta, Gamma, Delta), all running Grok 4.1 Fast, with a six-phase architecture:

Phase 1 — Query Analysis: Evaluates scope, complexity, and domain. Ambiguous queries trigger clarifying questions before research begins.

Phase 2 — Angle Generation: Four distinct research angles are generated to ensure diverse coverage (historical context, technical mechanisms, stakeholder perspectives, contrarian viewpoints).

Phase 3 — Round 1 Independent Research: All four agents research their assigned angles in parallel, isolated from each other to prevent groupthink. Each agent uses web search and knowledge APIs with credibility assessment.

Phase 4 — Summary Extraction: Each agent’s findings are compressed into structured summaries with key insights, confidence levels, and source citations.

Phase 5 — Round 2 Cross-Pollination: Agents receive summaries from the other three agents and conduct additional research to fill gaps, challenge assumptions, and explore connections they missed initially.

Phase 6 — Final Synthesis: A synthesis agent produces a comprehensive report with consensus points, key insights, areas of disagreement, and a structured final report with full source attribution.
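The isolation in Round 1 and the summary exchange in Round 2 can be sketched as two gather calls. The `research` and `summarize` callables are placeholders supplied by the caller, and `run_two_rounds` is a hypothetical name; query analysis, angle generation, and final synthesis are not shown.

```python
import asyncio
from typing import Awaitable, Callable

# research(agent, angle, peer_summaries) -> findings
Researcher = Callable[[str, str, dict[str, str]], Awaitable[str]]


async def run_two_rounds(
    angles: dict[str, str],
    research: Researcher,
    summarize: Callable[[str], str],
) -> dict[str, str]:
    """angles maps agent name to research angle; returns agent name to round-2 findings."""
    agents = list(angles)
    # Round 1: every agent works in isolation, with no peer summaries.
    round1 = await asyncio.gather(*(research(a, angles[a], {}) for a in agents))
    # Summary extraction: compress each agent's findings.
    summaries = {a: summarize(findings) for a, findings in zip(agents, round1)}
    # Round 2: each agent receives the other agents' summaries, never its own.
    round2 = await asyncio.gather(
        *(research(a, angles[a], {p: s for p, s in summaries.items() if p != a}) for a in agents)
    )
    return dict(zip(agents, round2))
```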

Skills Framework

Margot’s skills system has two layers: autonomous tool generation (self-extending skills) and a curated skill library with intelligent discovery and routing (Skills V2).

Self-Extending Skills

Through three coordinator tools (create_skill, list_generated_skills, delete_skill), the system can convert natural language descriptions into validated Python tools that are hot-registered into the coordinator’s tool set.

Pipeline: The SkillGenerator calls the LLM to produce three files: skill.py (the tool implementation), skill_meta.json (function schema in OpenAI format), and test_skill.py (automated tests). The SkillValidator then performs AST-level analysis, checking against an import whitelist, scanning for banned patterns, and verifying the function signature matches the declared schema.

Validation: Generated skills must pass both static analysis (AST validation) and dynamic testing (pytest with a 30-second timeout). Only skills that pass all checks are registered.

Hot-Registration: The SkillExecutor dynamically imports approved skills and the SkillRegistry generates OpenAI function-format tool definitions with a skill_ prefix. New tools are immediately available to the coordinator without restart. A 20-skill cap prevents unbounded growth.
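The static-analysis stage can be approximated with Python's ast module. The whitelist and banned-call sets below are illustrative assumptions, as is the `validate_skill` signature; the real SkillValidator's rules are not specified here. The pytest stage and hot-registration are not shown.

```python
import ast

ALLOWED_IMPORTS = {"json", "math", "re", "datetime"}  # illustrative whitelist
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}  # illustrative banned patterns


def validate_skill(source: str, function_name: str, declared_params: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the source passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]

    problems: list[str] = []
    for node in ast.walk(tree):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {module}")
        # Flag direct calls to banned built-ins.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            problems.append(f"banned call: {node.func.id}")

    # Check that the top-level function matches the declared schema parameters.
    functions = [
        n for n in tree.body
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == function_name
    ]
    if not functions:
        problems.append(f"missing function: {function_name}")
    elif [a.arg for a in functions[0].args.args] != declared_params:
        problems.append("signature does not match declared schema")
    return problems
```

Name-based checks like these are easy to evade (for example through attribute access or aliasing), which is one reason a dynamic test stage is a useful second gate.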

Skills V2 — Discovery and Routing

The curated skill library contains 31 skills discovered from disk, each defined with YAML frontmatter metadata (triggers, domains, URL patterns, requirements, allowed tools). Skills are feature-flagged via enable_skills_v2 and support four agent types: Browser, Coordinator, Desktop, and Research. When a task arrives, all skills are scored against the current context and the highest-scoring matches are injected into the agent’s prompt.

Scoring Signal	Weight	Description
Exact trigger match	100	Task text matches a declared trigger phrase exactly
Domain match	80	Current URL domain matches a skill’s declared domains
URL pattern match	60	Current URL matches a skill’s regex URL patterns
Partial trigger match	40	Task text partially overlaps with trigger phrases
Tag match	20	Task metadata matches skill tags

A token budget system (2,000–3,000 tokens per injection, max 1,500 per skill) prevents context bloat by truncating at paragraph and sentence boundaries. Re-evaluation triggers fire on domain changes, after 3 consecutive failures, or when a required tool is missing from the allowed tool set.
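The weighted scoring can be sketched as follows. The weights come from the table above; how signals combine is an assumption. This sketch sums all matching signals, treats exact and partial trigger matches as mutually exclusive, and defines a partial match as any shared word. The `Skill` fields and `score_skill` are illustrative names, and the token budget step is not shown.

```python
import re
from dataclasses import dataclass, field
from urllib.parse import urlparse

WEIGHTS = {"exact_trigger": 100, "domain": 80, "url_pattern": 60, "partial_trigger": 40, "tag": 20}


@dataclass
class Skill:
    name: str
    triggers: list[str] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)
    url_patterns: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)


def score_skill(skill: Skill, task: str, url: str = "", task_tags: frozenset = frozenset()) -> int:
    """Sum the weights of every signal that matches the current context."""
    task_lc = task.lower().strip()
    task_words = set(task_lc.split())
    triggers = [t.lower() for t in skill.triggers]
    score = 0
    if task_lc in triggers:
        score += WEIGHTS["exact_trigger"]
    elif any(task_words & set(t.split()) for t in triggers):
        score += WEIGHTS["partial_trigger"]
    host = urlparse(url).hostname or ""
    if any(host == d or host.endswith("." + d) for d in skill.domains):
        score += WEIGHTS["domain"]
    if url and any(re.search(p, url) for p in skill.url_patterns):
        score += WEIGHTS["url_pattern"]
    if task_tags & set(skill.tags):
        score += WEIGHTS["tag"]
    return score
```

Skills would then be ranked by score and injected from the top until the token budget is exhausted.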

Context Laundering

Large tool payloads are processed by the context laundering service before being appended back into the coordinator’s LLM context as tool results. This prevents context bloat while preserving the information the coordinator needs to reason about tool outputs.

Laundering Modes

Mode	Behavior	When Used
Raw	Payload passed through unmodified	Small tool outputs below threshold
Summary	Compressed by a fast helper model (GPT-4o-mini)	Large tool outputs exceeding character thresholds
Truncated	Payload cut to size with truncation marker	Extremely large outputs or fallback when summary fails
Already Processed	Skipped (marked by upstream)	Deep Research V2 worker outputs with source-aware compression

Context laundering is integrated across all runtimes: streaming chat, non-streaming chat, the scheduled task executor, legacy deep research, and Deep Research V2. Deep Research V2 uses its own worker-side structured compression upstream and marks those results as already processed so the shared launderer does not summarize them again.

Each tool type has a specific character threshold that triggers laundering. Typical reductions are 80–95% for GA4 reports and Google Cloud responses, 80–90% for web search results, and 70–85% for file contents. An admin endpoint (GET /api/admin/launderer-stats) provides visibility into laundering metrics.
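Mode selection can be sketched as a single function. All numeric thresholds, the tool keys, the truncation marker text, and the `launder` signature are hypothetical; the `summarize` callable stands in for the helper-model call.

```python
from typing import Callable

# Hypothetical per-tool character thresholds; the actual values are not listed here.
SUMMARY_THRESHOLDS = {"ga4_report": 4_000, "web_search": 6_000}
DEFAULT_THRESHOLD = 8_000
HARD_LIMIT = 200_000  # hypothetical size beyond which summarizing is skipped
TRUNCATION_MARKER = "\n[... truncated ...]"


def launder(
    tool: str,
    payload: str,
    summarize: Callable[[str], str],
    already_processed: bool = False,
) -> tuple[str, str]:
    """Return (mode, text) for a tool payload before it re-enters the LLM context."""
    if already_processed:
        return "already_processed", payload  # compressed upstream, do not touch
    threshold = SUMMARY_THRESHOLDS.get(tool, DEFAULT_THRESHOLD)
    if len(payload) <= threshold:
        return "raw", payload
    if len(payload) <= HARD_LIMIT:
        try:
            return "summary", summarize(payload)
        except Exception:
            pass  # fall through to truncation when the helper model fails
    return "truncated", payload[:threshold] + TRUNCATION_MARKER
```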

Scheduled Task Execution

Beyond monitoring, Margot can execute full coordinator workflows on a schedule. The TaskExecutor runs arbitrary multi-tool workflows—such as monthly PageSpeed audits written to Google Sheets—using the same ReAct loop and tool set available in interactive chat, with additional safety guarantees designed for unattended operation.

Scheduled tasks support the same three schedule modes as watches: slot-based (fixed time slots), interval-based (every N minutes), and cron expression scheduling. Runtime guidance is loaded from editable markdown prompts rather than hardcoded instructions, allowing workflow behavior to be tuned without code changes.

Verification Contract

Every scheduled execution is subject to a global verification contract that ensures work was actually performed and results are correct. The executor fails the run if any of the following conditions are detected:

Check	Fails When
Tool execution	No successful tool call occurred during the run
Write verification	Prompt-required file writes are missing
Read-back verification	Written data was not read back and confirmed
Model self-report	The model reports `VERIFIED: false` in its output
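The four checks translate naturally into a function that returns the list of failed checks. The `RunRecord` structure is an assumed representation of what the executor tracks during a run; only the `VERIFIED: false` marker is taken from the table.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    successful_tool_calls: int
    required_writes: set[str] = field(default_factory=set)     # writes the prompt requires
    completed_writes: set[str] = field(default_factory=set)    # writes that actually happened
    read_back_confirmed: set[str] = field(default_factory=set) # writes confirmed by read-back
    model_output: str = ""


def verify_run(run: RunRecord) -> list[str]:
    """Return the names of failed checks; an empty list means the run passes."""
    failures = []
    if run.successful_tool_calls == 0:
        failures.append("tool_execution")
    if run.required_writes - run.completed_writes:
        failures.append("write_verification")
    if run.completed_writes - run.read_back_confirmed:
        failures.append("read_back_verification")
    if "VERIFIED: false" in run.model_output:
        failures.append("model_self_report")
    return failures
```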

Mutation Guardrails

Scheduled tasks default to non-destructive behavior. For workflows that preserve, append, or upsert data, the executor monitors for suspicious file truncation—such as a Sheets write that replaces existing rows with fewer rows than expected. If detected, the run fails unless the prompt explicitly authorizes replacement. This prevents accidental data loss during unattended execution.

Coverage checks compare the number of processed items (e.g., analyzed URLs) against rows written and expected totals, catching partial completions that would otherwise appear successful.
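Both guardrails reduce to simple comparisons. In this sketch, "fewer rows than expected" is interpreted as fewer rows after the write than existed before it, which is an assumption; the function names are illustrative.

```python
def check_truncation(existing_rows: int, rows_after_write: int, replacement_authorized: bool = False) -> bool:
    """Return True when the write is acceptable under the non-destructive default."""
    return replacement_authorized or rows_after_write >= existing_rows


def check_coverage(processed: int, rows_written: int, expected_total: int) -> list[str]:
    """Return a description of each coverage gap; empty means full coverage."""
    issues = []
    if processed < expected_total:
        issues.append(f"processed {processed} of {expected_total} expected items")
    if rows_written < processed:
        issues.append(f"wrote {rows_written} rows for {processed} processed items")
    return issues
```

For example, a run that analyzes 8 of 10 URLs and writes 8 rows would pass a naive "rows were written" check but is reported here as a partial completion.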

Skill Context and Fallback

When Skills V2 is enabled, the executor prepares and injects skill context for scheduled runs, including read_skill execution paths during task loops. This gives scheduled workflows the same domain-specific knowledge available in interactive chat.

When a scheduled workflow fails verification or times out, the executor can invoke a deterministic backend-only fallback in the same execution pass. For example, a PageSpeed-to-Sheets workflow that fails the LLM-driven path will fall back to a hardcoded sequence: fetch PageSpeed data, save JSON artifact, resolve the target Drive file and sheet, write rows, and read back for verification. On failure, the fallback returns the exact step that failed and the last successful step for diagnostics.
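A deterministic fallback with step-level diagnostics can be sketched as an ordered list of named steps. The step names and the shape of the result dictionary are hypothetical; each step here is a plain callable that mutates shared state.

```python
from typing import Callable


def run_fallback(steps: list[tuple[str, Callable[[dict], None]]]) -> dict:
    """Run steps in order, stopping at the first failure and reporting where it happened."""
    state: dict = {}
    last_ok = None
    for name, step in steps:
        try:
            step(state)
        except Exception as err:
            return {
                "ok": False,
                "failed_step": name,
                "last_successful_step": last_ok,
                "error": str(err),
            }
        last_ok = name
    return {"ok": True, "failed_step": None, "last_successful_step": last_ok, "error": None}
```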

Performance and Optimization

Parallel Tool Execution

Independent tools execute concurrently via asyncio.gather() with dependency detection to prevent conflicts. This achieves 30–70% faster execution for multi-tool queries, depending on whether all tools can run in parallel or some require sequential execution.

Prompt Caching

OpenRouter automatically caches prompts with a 5-minute TTL for supported models. The message structure is optimized for maximum cache hits: a static system prompt, a semi-static memory context in a separate message, and dynamic conversation content. Production measurements show an 85% cache hit rate, reducing prompt processing time by 500–800ms per request.

Context Management

Context management uses two components working together: a ContextWindowManager that enforces a sliding window, and the context laundering service that compresses large tool outputs before they enter the context (see Context Laundering section).

The sliding window maintains a 50,000-token maximum with a 30-message limit. The system message is always preserved, and when pruning is needed, the oldest non-system messages are removed first to keep the most recent context intact. Token counting uses tiktoken with the cl100k_base encoding. Per-conversation metrics—token_count, tokens_saved, and context_pruned_count—are tracked in PostgreSQL for monitoring.
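The pruning rule can be sketched as below. To stay dependency-free, the sketch replaces tiktoken with a rough four-characters-per-token estimate, and it assumes the 30-message limit counts the system message and that system messages sit at the front of the list. The `prune` function is an illustrative name, not the ContextWindowManager API.

```python
MAX_TOKENS = 50_000
MAX_MESSAGES = 30


def count_tokens(text: str) -> int:
    # Stand-in for a cl100k_base token count: roughly four characters per token.
    return max(1, len(text) // 4)


def prune(
    messages: list[dict],
    max_tokens: int = MAX_TOKENS,
    max_messages: int = MAX_MESSAGES,
) -> tuple[list[dict], int]:
    """Drop the oldest non-system messages until both limits hold.

    Returns (kept_messages, pruned_count).
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(count_tokens(m["content"]) for m in msgs)

    pruned = 0
    while rest and (
        len(system) + len(rest) > max_messages
        or total(system) + total(rest) > max_tokens
    ):
        rest.pop(0)  # oldest non-system message goes first
        pruned += 1
    return system + rest, pruned
```

The returned count corresponds to what a metric like context_pruned_count could record per conversation.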

Performance Targets

Operation	Target Latency
Memory context assembly	<150ms (all four layers)
Semantic vector search (HNSW)	<50ms
Simple query (no tools)	<1 second
Single tool call	<2 seconds
Multi-tool workflow (3–5 tools)	<5 seconds
Complex workflow (6–10 tools)	<10 seconds
Embedding generation	~200ms per embedding

Test Coverage

The backend unit test suite reports 1,248 passing tests, 1 skipped, and zero failures. Coverage spans the memory system, Google Cloud tools, Gmail, Drive, Sheets, Calendar, PageSpeed, browser automation, skill generation, context laundering, scheduled task execution, shell safety, and API endpoints.

Real-Time Status Indicators

During the ReAct loop, the system emits lightweight SSE status events that replace generic loading messages with specific progress updates. Five status types map to different stages of query processing:

Status	When Emitted	Example
Analyzing	Start of request	Understanding your request...
Working	Before tool execution	Using Web Search...
Processing	After tool execution	Analyzing results...
Thinking	Subsequent ReAct iterations	Considering next steps...
Writing	First content chunk	Writing response...

A mapping of internal tool names to human-friendly display names (e.g., gcloud_run_ga4_report becomes Analytics Report) provides clear visibility into which tools are active. Each status event is under 100 bytes with less than 5ms overhead, transmitted over the existing SSE connection with no additional infrastructure.
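A status event of this kind can be serialized in the standard SSE wire format. The event name, the JSON field names, the title-case fallback for unmapped tools, and the generic "Working..." message are assumptions; the example messages and the gcloud_run_ga4_report mapping come from the text above.

```python
import json
from typing import Optional

# Internal tool names mapped to display names; entries beyond the first are illustrative.
DISPLAY_NAMES = {
    "gcloud_run_ga4_report": "Analytics Report",
    "web_search": "Web Search",
}


def display_name(tool_name: str) -> str:
    # Fall back to a title-cased version of the internal name (an assumption).
    return DISPLAY_NAMES.get(tool_name, tool_name.replace("_", " ").title())


def status_event(status: str, tool_name: Optional[str] = None) -> str:
    """Serialize one status update as a Server-Sent Events frame."""
    messages = {
        "analyzing": "Understanding your request...",
        "working": f"Using {display_name(tool_name)}..." if tool_name else "Working...",
        "processing": "Analyzing results...",
        "thinking": "Considering next steps...",
        "writing": "Writing response...",
    }
    payload = json.dumps({"status": status, "message": messages[status]}, separators=(",", ":"))
    return f"event: status\ndata: {payload}\n\n"
```

Compact JSON separators keep each frame small, which helps stay within the per-event size described above.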

iOS App

The iOS app is a full-featured native SwiftUI client with the same capabilities as the desktop, connected through Tailscale for secure WireGuard-based mesh networking.

Core Features

SSE Streaming: Real-time token-by-token response streaming, matching the desktop experience.

Push Notifications: APNs integration for reminders and watch alerts. Notification taps create contextual conversations with seed messages.

Markdown Rendering: Full markdown support with code syntax highlighting and table rendering.

Vega-Lite Charts: Inline data visualization that detects Vega-Lite JSON specs (by $schema or data+mark+encoding structure) and renders them natively via WKWebView with Vega/Vega-Embed from CDN. Charts use a dark theme with white labels and green marks, responsive width, and a 300pt default / 400pt max height.

Connection Monitoring: Real-time backend health checks with connecting/connected/disconnected states and throttled retry logic.

Media Prefetching: Background actor for prefetching thumbnails and media assets to reduce perceived latency.

Conversation Management

The conversation service provides full chat history management with auto-generated titles, full-text search, pinning, pagination, and export capabilities. Conversations are stored with auto-updating metadata triggers in PostgreSQL, and message retrieval is ordered chronologically with user-scoped authorization.

Operation	Description	Performance
Auto-Title Generation	LLM-generated titles from first 1–2 messages (50 char max)	<2 seconds
List Conversations	Paginated listing with sort by recent/created/pinned	<100ms
Full-Text Search	PostgreSQL GIN-indexed search across titles and message content	<200ms
Export	JSON (structured with metadata) or Markdown (human-readable)	<500ms
Update/Delete	Title, pinning status, or full conversation deletion	<50ms

The Conversation History Foundation (Slices A–M) was validated across backend, desktop, and iOS. The desktop UI supports conversation pinning, search, and sidebar navigation, while the iOS app mirrors this functionality with native SwiftUI components and background data synchronization.

Advanced Settings

The iOS app exposes coordinator model selection (persisted via selectedCoordinatorModel), an internet provider picker (Brave or Firecrawl, defaulting to Brave), and agent toggles for Nano Banana (images), Veo (video), and Browser Control, along with reasoning mode and media mode toggles. A Skills Library view allows browsing and managing all 48 skills. The Scheduled Tasks view supports creating, monitoring, and reviewing automation runs. A tool trace feed surfaces inline reasoning and full execution traces for every tool call. All settings are persisted via @AppStorage and passed to the API as snake_case fields.
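The camelCase-to-snake_case mapping between client settings and API fields can be sketched as follows. The real client is written in Swift and may rely on its own key-encoding facilities; this Python version only illustrates the naming convention. Apart from `selectedCoordinatorModel`, the setting keys and values shown in the usage note are hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase identifier to snake_case."""
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def settings_payload(settings: dict) -> dict:
    """Rename every key so the request body uses snake_case field names."""
    return {to_snake_case(key): value for key, value in settings.items()}
```

For example, `settings_payload({"selectedCoordinatorModel": "example-model"})` yields a body keyed by `selected_coordinator_model`.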

Conversation Service

The backend Conversation Service manages the full lifecycle of chat conversations across both desktop and iOS clients. It handles auto-titling, search, pagination, and export, with all operations scoped to the authenticated user.

Operation	Description	Latency
Auto-Title Generation	LLM-generated titles from first 1–2 messages (50 char max, fallback to truncated message)	<2 seconds
List Conversations	Paginated listing with sort by recent/created/pinned and optional pinned-only filter	<100ms
Full-Text Search	PostgreSQL GIN-indexed search across titles and message content	<200ms
Export	JSON (structured with metadata) or Markdown (human-readable with headers)	<500ms
Update / Delete	Title editing, pinning status, or full conversation deletion with cascade	<50ms
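The auto-title behavior in the table (LLM-generated, capped at 50 characters, falling back to a truncated message) can be sketched as follows. The `llm_title` callable is a hypothetical placeholder for whatever model call the service makes, and ending a truncated title with an ellipsis is an assumption.

```python
from typing import Callable, Optional

MAX_TITLE_CHARS = 50  # limit stated in the table above


def truncate_title(text: str, limit: int = MAX_TITLE_CHARS) -> str:
    """Collapse whitespace and cut to `limit` characters, ending with an ellipsis."""
    collapsed = " ".join(text.split())
    if len(collapsed) <= limit:
        return collapsed
    return collapsed[: limit - 1].rstrip() + "…"


def generate_title(first_message: str,
                   llm_title: Callable[[str], Optional[str]]) -> str:
    """Prefer the model's title; fall back to the truncated first message."""
    try:
        candidate = llm_title(first_message)
    except Exception:
        candidate = None  # any model failure triggers the fallback path
    if candidate and candidate.strip():
        return truncate_title(candidate)
    return truncate_title(first_message)
```

Because the fallback never calls the model, a conversation still gets a usable title when the LLM request fails or times out.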

Security Architecture

OAuth Credential Security

Google Cloud OAuth tokens are encrypted with Fernet (AES-128 in CBC mode with HMAC-SHA256 authentication) before being stored in PostgreSQL, with automatic token refresh and connection-type routing. The multi-connection architecture supports up to 5 OAuth connections per user, scoped by connection type (marketing, productivity, custom). Credentials are isolated per user: every query is scoped by user_id, which prevents cross-user access.

Shell Safety Guardrails

All shell commands undergo pre-execution safety analysis with a three-level classification system. SAFE commands (read-only operations like ls, cat, df) execute immediately. NEEDS_CONFIRMATION commands (potentially impactful operations) require explicit user approval through the coordinator. BLOCKED commands (destructive operations including rm -rf, sudo, mkfs, and disk formatting) are refused entirely. The safety system decomposes pipe chains to analyze each component, scans Python code passed inline via python -c for dangerous patterns, and extracts commands from AppleScript do shell script blocks.
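A minimal sketch of the three-level classification with pipe-chain decomposition is shown below. It covers only a fraction of the described system: the command lists are reduced to the examples named in the paper, the inline python -c and AppleScript scanning are omitted, and the choices to treat unparseable input as NEEDS_CONFIRMATION and empty input as SAFE are assumptions.

```python
import shlex
from enum import Enum
from typing import List


class Safety(Enum):
    SAFE = 0
    NEEDS_CONFIRMATION = 1
    BLOCKED = 2


SAFE_COMMANDS = {"ls", "cat", "df"}   # read-only examples from the paper
BLOCKED_COMMANDS = {"sudo", "mkfs"}   # destructive examples from the paper
SEPARATOR_CHARS = set("|&;")


def split_segments(command: str) -> List[List[str]]:
    """Split a command line into per-command token lists at |, ||, &&, ; and &."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    segments, current = [], []
    for token in lexer:
        if set(token) <= SEPARATOR_CHARS:
            segments.append(current)
            current = []
        else:
            current.append(token)
    segments.append(current)
    return [seg for seg in segments if seg]


def classify_segment(tokens: List[str]) -> Safety:
    name = tokens[0].rsplit("/", 1)[-1]  # /bin/rm -> rm
    if name in BLOCKED_COMMANDS or name.startswith("mkfs."):
        return Safety.BLOCKED
    if name == "rm":
        # Gather short flags so that -rf, -fr and "-r -f" are all recognized.
        flags = "".join(t[1:] for t in tokens[1:]
                        if t.startswith("-") and not t.startswith("--"))
        if "f" in flags and ("r" in flags or "R" in flags):
            return Safety.BLOCKED
    if name in SAFE_COMMANDS:
        return Safety.SAFE
    return Safety.NEEDS_CONFIRMATION


def classify(command: str) -> Safety:
    """Return the most severe level found in any segment of the command."""
    try:
        segments = split_segments(command)
    except ValueError:  # e.g. unbalanced quotes
        return Safety.NEEDS_CONFIRMATION
    if not segments:
        return Safety.SAFE
    return max((classify_segment(s) for s in segments), key=lambda lvl: lvl.value)
```

Taking the maximum severity across segments means a harmless prefix such as `cat log.txt |` cannot mask a blocked command later in the chain.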

Email Formatting

When Margot sends emails via Gmail, a dedicated Email Formatter service converts markdown content to professionally styled HTML that renders consistently across email clients. Features include Apple-style font stacks, responsive tables with alternating row colors and hover states, dark-themed code blocks, blue-accented blockquotes, and smart metric highlighting, where positive percentages appear in green and negative percentages in red. A plain text fallback is generated automatically for clients that do not support HTML.
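The metric-highlighting rule (positive percentages in green, negative in red) can be sketched with a regular expression. This is an illustrative assumption about how such a formatter might work: the color values are hypothetical, and only explicitly signed percentages are colored so that neutral figures such as "45%" are left alone.

```python
import html
import re

GREEN = "#1a7f37"  # hypothetical color values
RED = "#cf222e"

# A sign, digits, optional decimals, then %; the sign must not follow a word
# character or dot, so ranges such as "2-3%" are left untouched.
SIGNED_PERCENT = re.compile(r"(?<![\w.])([+-])(\d+(?:\.\d+)?%)")


def highlight_metrics(text: str) -> str:
    """Escape text for HTML, then wrap signed percentages in colored spans."""
    escaped = html.escape(text)

    def colorize(match: re.Match) -> str:
        color = GREEN if match.group(1) == "+" else RED
        return f'<span style="color:{color};font-weight:600">{match.group(0)}</span>'

    return SIGNED_PERCENT.sub(colorize, escaped)
```

Inline styles are used because many email clients strip stylesheet blocks, so styling has to travel with each element.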

Database Architecture

PostgreSQL 18 with pgvector 0.8.1 provides the persistence layer. Core tables include conversations and messages (with auto-updating triggers for metadata), semantic_memories, episodic_memories, and procedural_memories (all with HNSW vector indexes), google_cloud_connections (encrypted OAuth tokens), scheduled_tasks, and memory_operations (audit log). Performance is optimized through HNSW indexes (m=16, ef_construction=64), composite indexes on user_id + timestamp, GIN indexes for full-text search, and JSONB for flexible metadata.

Architecture Principles

LLM-driven routing over hardcoded rules. Keyword-based tool routing creates brittle systems. Letting models make intelligent tool selection decisions based on full context produces more reliable and adaptable behavior. The model-selectable coordinator takes this further by allowing users to choose the best model for their workflow.

Server-side calculations for data accuracy. LLMs are unreliable at arithmetic. All financial calculations, metric aggregations, and data transformations happen in Python before reaching the model, eliminating entire categories of errors.
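As an illustration of this principle, the sketch below computes totals and a period-over-period change in Python with exact decimal arithmetic, so the model receives finished numbers rather than raw rows. The function names and the shape of the summary dictionary are assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Iterable, Optional


def percent_change(current: Decimal, previous: Decimal) -> Optional[Decimal]:
    """Percentage change rounded to one decimal place; None if undefined."""
    if previous == 0:
        return None  # avoid division by zero; the caller decides how to phrase this
    change = (current - previous) / previous * 100
    return change.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def metric_summary(current: Iterable, previous: Iterable) -> dict:
    """Aggregate two periods and return strings ready to place in a prompt."""
    # Converting through str() avoids binary floating-point artifacts.
    cur_total = sum((Decimal(str(v)) for v in current), Decimal("0"))
    prev_total = sum((Decimal(str(v)) for v in previous), Decimal("0"))
    change = percent_change(cur_total, prev_total)
    return {
        "current_total": str(cur_total),
        "previous_total": str(prev_total),
        "percent_change": None if change is None else f"{change:+}%",
    }
```

The model is then asked only to explain figures such as `percent_change`, never to derive them.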

Structural enforcement over prompt engineering. Validation gates, typed schemas, and AST analysis provide guarantees that no amount of prompt refinement can match. The self-extending skills pipeline exemplifies this: generated code must pass static analysis and automated tests before registration.
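A static-analysis gate of the kind described can be sketched with Python's standard ast module. The forbidden import and call lists below are hypothetical examples, not the pipeline's actual policy, and a real gate would also run the automated tests mentioned above.

```python
import ast
from typing import List

FORBIDDEN_IMPORTS = {"os", "subprocess", "socket"}  # hypothetical policy
FORBIDDEN_CALLS = {"eval", "exec", "__import__"}    # hypothetical policy


def static_violations(source: str) -> List[str]:
    """Return a list of policy violations found in generated skill code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    violations.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append(f"import from {node.module}")
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id in FORBIDDEN_CALLS):
            violations.append(f"call to {node.func.id}()")
    return violations


def passes_gate(source: str) -> bool:
    """A skill is eligible for registration only if no violations are found."""
    return not static_violations(source)
```

Unlike a prompt instruction, this check fails closed: code that does not parse or that touches a forbidden name is never registered.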

Memory as a first-class system component. The four-layer memory architecture transforms Margot from a stateless assistant into one that accumulates knowledge, recognizes patterns, and improves with use. The 26% accuracy improvement over baseline validates the investment in purpose-built memory infrastructure.

Margot demonstrates that multi-agent AI systems can be built for production use today, given careful attention to reliability, performance, and user experience.