MARGOT

Multi-Agent AI Desktop Assistant

Technical White Paper v2.1 March 2026
Patrick Kidwell Mountain Meadow Systems

Executive Summary

Margot is a multi-agent AI desktop assistant that orchestrates over 140 tools across Google Cloud services, GitHub, desktop operations, browser automation, computer use, and specialized knowledge bases. Built with a Tauri 2.0 + React frontend and a FastAPI/Python backend, the system is designed around a central coordinator agent that routes tasks to purpose-built tools and specialized subagents, backed by a four-layer cognitive memory architecture.

This paper documents the system as of February 2026, covering the coordinator agent, memory system, tool ecosystem, specialized subagents, browser automation, research capabilities, self-extending skills, proactive monitoring, performance optimizations, and the iOS companion app.

Key specifications: Coordinator powered by Grok 4.1 Fast with a 2 million token context window. 68 Google Cloud tools spanning GA4, Search Console, Merchant Center, Gmail, Drive, Sheets, and Calendar. 21 Desktop Commander tools for filesystem and terminal operations. 65 discovered browser skills with deterministic recovery and release gating. Three tiers of research from standard chat to four-agent deep research with cross-pollination. Image generation (Gemini 3 Pro), video generation (Veo 3.1, Sora 2), voice synthesis (ElevenLabs), and RAG-enhanced coding (Freya). A test suite with 1,248 passed tests and zero failures.

Technology Stack

LayerTechnologyPurpose
FrontendTauri 2.0 + React/TypeScriptCross-platform desktop shell with native performance
BackendFastAPI (Python 3.12)Async API server with SSE streaming
DatabasePostgreSQL 18 + pgvector 0.8.1Relational storage with vector similarity search
Coordinator LLMGrok 4.1 Fast (xAI via OpenRouter)Primary orchestrator with 2M token context
Browser AgentGemini 3 Flash (loop) / Grok 4.1 Fast (orchestration)Autonomous browser actions
Fast LLMGPT-4o-miniTool result summarization
Coding LLMMiniMax-M2 (230B MoE)RAG-enhanced code generation (Freya)
EmbeddingsOpenAI text-embedding-3-small1536-dimension vectors for memory search
Image GenerationGemini 3 Pro Image (Nano Banana Pro)Text-to-image, editing, and blending
Video GenerationVeo 3.1 (Google AI Studio)Text-to-video, image-to-video, frame interpolation
Video GenerationSora 2 (OpenAI)Text-to-video and image-to-video
Voice SynthesisElevenLabs (eleven_turbo_v2_5)Text-to-speech with streaming
BrowserPlaywright + CDP (Chrome port 9222)Automated web interaction
iOS AppSwiftUI + SSE + APNsNative companion with push notifications
NetworkingTailscale (WireGuard mesh)Secure remote access for iOS

The Coordinator Agent

The coordinator is the central intelligence of the system. Every user query passes through it, and it decides whether to answer directly, invoke tools, delegate to subagents, or execute multi-step workflows through its ReAct (Reason + Act) loop.

Model and Configuration

ParameterValue
Modelx-ai/grok-4.1-fast via OpenRouter
Context Window2,000,000 tokens
Max Output Tokens4,096
Temperature0.3 (optimized for tool calling)
Max ReAct Iterations30
Prompt CachingEnabled (5-minute TTL, 85% hit rate)
Tool Calling Accuracy>95%

The 2 million token context window is 12x larger than the previous coordinator model, effectively eliminating conversation length as a constraint. Combined with prompt caching that achieves 85% hit rates in production, typical follow-up queries process in under 2 seconds.

4-Tier Tool Selection Strategy (LOBSTER-01)

Rather than relying on brittle keyword routing, the coordinator uses an LLM-native decision framework organized into four prioritized tiers:

TierApproachExamples
1 — Purpose-BuiltDirect tool invocationGoogle Cloud APIs, GitHub MCP, Desktop Commander file ops, research, image/video generation
2 — Shell FallbackTerminal commands with safety analysisSystem info, package management, git operations (3-level safety: SAFE / NEEDS_CONFIRMATION / BLOCKED)
3 — Browser NavigationLightweight browser tools for known URLsPage snapshots, form filling, button clicks (6 always-available tools; full Ralph Loop requires toggle)
4 — Direct AnswerNo tool neededGeneral knowledge, conversation, reasoning

This hierarchy ensures that the most efficient and reliable path is always attempted first. Shell commands undergo a three-level safety analysis: SAFE commands execute immediately, NEEDS_CONFIRMATION commands require user approval, and BLOCKED commands (destructive operations like rm -rf, sudo, mkfs) are refused entirely. The safety system decomposes pipe chains and scans embedded Python and AppleScript for dangerous patterns.

ReAct Loop

The coordinator operates through a ReAct (Reason + Act) loop that supports up to 30 iterations per query. Each iteration follows a consistent cycle: the model reasons about the current state, selects and invokes one or more tools, observes the results, and decides whether to continue or produce a final response.

Parallel tool calling is enabled when the coordinator identifies independent operations. Multiple tools execute concurrently via asyncio.gather(), with dependency detection preventing conflicts such as simultaneous read/write operations on the same file. This achieves a 30–70% reduction in total API call time for multi-tool queries, depending on tool mix.

Cognitive Memory Architecture

Margot’s memory system is inspired by the Mem0 architecture and organized into four distinct layers, each serving a different temporal and functional purpose. In benchmarks, this four-layer system achieves a 26% accuracy improvement over baseline single-layer memory approaches.

Memory Layers

LayerPurposeStorageRetention
Short-TermCurrent conversation contextPostgreSQL (conversations + messages)Session duration
SemanticLong-term facts and preferencesPostgreSQL + pgvector (HNSW index)Persistent with confidence decay
EpisodicPast conversation summariesPostgreSQL + pgvectorPersistent with importance scoring
ProceduralReusable workflow patternsPostgreSQL + pgvectorPersistent with success tracking

Before each coordinator invocation, the Memory Manager assembles context from all four layers: the last 20 messages from short-term memory, the top 5 semantically relevant facts, the top 3 related episodic summaries, and the top 2 applicable procedural workflows. This assembly targets sub-150ms latency using HNSW vector indexes with configurable similarity thresholds (0.7 for semantic, 0.6 for episodic, 0.5 for procedural).

Memory Extraction Pipeline

After each conversation, an asynchronous background task extracts new memories using the MemoryExtractor, which is powered by the same Grok 4.1 Fast model as the coordinator. The extractor performs three operations: semantic memory extraction (identifying facts, preferences, context, and entities), episodic summary generation (condensing conversations with importance scoring), and procedural workflow detection (identifying repeatable patterns from tool usage sequences). Deduplication is handled via similarity thresholds: memories with >0.95 cosine similarity to existing entries update the existing memory rather than creating duplicates.

Integrated Tool Ecosystem

Margot integrates over 140 tools organized into several service categories. All tool interactions are managed through the Model Context Protocol (MCP) pattern, providing consistent invocation, error handling, and result formatting.

Google Cloud Services (68 Tools)

The Google Cloud integration provides the largest single tool surface, spanning analytics, search performance, merchant data, email, file storage, spreadsheets, calendar management, and website performance auditing.

ServiceToolsConnection Type
Google Analytics 4Reports, real-time data, properties, conversions, custom dimensionsMarketing
Google Search ConsoleSearch analytics, sitemaps, site managementMarketing
Google Merchant CenterProduct listings, status, issuesMarketing
GmailSearch, read, send, reply, labels, batch operations (14 tools)Productivity
Google DriveSearch, read, upload, download, folder management (10 tools)Productivity
Google SheetsRead, write, format spreadsheet dataProductivity
Google CalendarEvents, recurring meetings, available slot finderProductivity
PageSpeed InsightsPerformance audits, Core Web Vitals, mobile vs desktopProductivity

Gmail (14 tools): Search with Gmail query syntax, full message and thread reading, send with CC/BCC and HTML formatting, reply with proper threading, label management, and batch delete/archive operations.

Drive (10 tools): File search with query syntax, automatic export of Google Docs to text/CSV/PDF, resumable uploads with MIME auto-detection, and folder management.

Sheets (8 tools): A1 notation reads, multi-range batch operations, append to next empty row, RAW and USER_ENTERED input modes, and copy-paste with format preservation.

Calendar (13 tools): Event CRUD with Google Meet integration, natural language quick-add, RRULE recurring events, free/busy checking, and an intelligent available-slot finder that respects working hours.

Multi-Connection OAuth Architecture

A key architectural decision is the multi-connection OAuth system that supports multiple Google accounts per user, with automatic routing based on tool type. Marketing tools (GA4, GSC, GMC) route to one OAuth connection, while productivity tools (Gmail, Drive, Sheets, Calendar) route to another. This prevents scope conflicts and allows clean separation between business analytics and personal productivity.

Credentials are stored with Fernet AES-128 encryption in PostgreSQL, with automatic token refresh and connection-type routing. The system supports marketing, productivity, and custom connection types, with a maximum of 5 connections per user.

Desktop Commander (21 Tools)

Local filesystem operations, terminal command execution, process management, and file search capabilities. These operate through the MCP protocol and are classified as Tier 1 purpose-built tools for file operations, or Tier 2 shell fallback for terminal commands with the three-level safety analysis.

Knowledge Sources

Specialized APIs for research agents, selected autonomously by the LLM based on query context rather than hardcoded routing rules:

SourceAPIBest ForCredibility Tier
WikipediaWikipedia APIBackground facts, definitionsTier 2
arXivarXiv APICS/physics/math preprintsTier 1
PubMedPubMed/NCBIBiomedical peer-reviewed researchTier 1
Stack ExchangeStack Exchange APIProgramming Q&ATier 3
NewsNewsAPICurrent events (30 days)Tier 2

Specialized Subagents

Margot delegates specialized tasks to purpose-built subagents. Each subagent uses a model optimized for its domain and operates with its own tool set and system prompt.

Freya — Coding Subagent

Model: MiniMax-M2 (230B parameters, 10B active via MoE, 128K context)

Freya is a RAG-enhanced coding agent for code generation, refactoring, debugging, and test writing. It operates with its own ReAct loop and has access to Desktop Commander tools for reading, writing, and executing code. A best practices database (PostgreSQL with pgvector) provides relevant patterns and examples via hybrid search, and the agent includes auto-linting on file writes and auto-commit capabilities.

Freya is user-selected (not auto-delegated by the coordinator), providing a clean UX where users explicitly choose the coding mode. At approximately $0.003 per task versus $0.05 for comparable models, it delivers strong coding performance at 17x lower cost.

Nano Banana Pro — Image Generation

Model: Gemini 3 Pro Image (via Google AI Studio)

Three operations: text-to-image generation (up to 4K resolution, 1–4 images per request, up to 14 reference images for style guidance), image editing (natural language modifications to existing images), and image blending (merging 2–14 images). Features include text rendering in images, Google Search grounding for real-time data, and professional camera controls. Like Freya, Nano Banana is user-toggled and bypasses the coordinator entirely for direct model routing.

Sora 2 — Video Generation

Model: OpenAI Sora 2 / Sora 2 Pro

Text-to-video and image-to-video generation supporting 720p and 1080p resolutions with configurable duration. Output is delivered as base64-encoded MP4. The Pro variant offers higher quality at increased cost and processing time.

Veo 3.1 — Video Generation

Model: Google Veo 3.1 (veo-3.1-generate-preview via Google AI Studio)

Veo 3.1 provides three video creation modes: text-to-video generation with native AI-generated audio at up to 1080p resolution, image-to-video animation (single image with prompt-guided motion), and frame interpolation (smooth transitions between two keyframe images). Videos are generated at 24fps in 4, 6, or 8-second clips, with an extension feature allowing iterative lengthening up to 148 seconds total.

OperationInputResolutionCost
Text-to-VideoText prompt720p / 1080p$0.35 / $0.50 per 8s
Image-to-VideoImage + text prompt720p / 1080p$0.35 / $0.50 per 8s
Frame Interpolation2 images + prompt720p / 1080p$0.35 / $0.50 per 8s
Video ExtensionExisting Veo video720p onlyProportional

Like Nano Banana, Veo 3.1 is user-toggled via the chat interface and bypasses the coordinator entirely for direct model routing. SSE streaming provides real-time generation status updates to the frontend.

ElevenLabs — Voice Synthesis

Model: eleven_turbo_v2_5 (ElevenLabs)

Text-to-speech conversion with multiple voice options and streaming support. The coordinator can invoke voice synthesis as a tool, converting any text response or content into audio output. The system supports quality control parameters and voice selection, enabling audio delivery alongside or in place of text responses.

Browser Agent — Web Automation

The browser agent uses the Ralph Loop pattern for iterative web interaction through a Chrome instance with CDP enabled on port 9222. The architecture consists of three layers:

RalphLoop: The core iteration loop that takes a screenshot/snapshot, sends it to the LLM for action selection, executes the action, and verifies the result. Each loop has configurable max iterations (default 15) and a 5-minute timeout. The loop includes modal-only snapshot filtering (reducing 60K+ characters to 5–10K of relevant content), deterministic click paths for known UI patterns, element blacklisting after repeated failures, and automatic form autofill from user configuration.

BrowserOrchestrator: Manages sequences of Ralph Loops for complex multi-step workflows. For example, a LinkedIn job application is decomposed into three sequential loops: find_job, apply, and return_to_list. The orchestrator tracks completion rates and supports configurable failure modes (FAIL_FAST, SKIP_CYCLE, CONTINUE).

Skills Framework V2: 65 discovered skills (33 browser-specific) provide site-specific knowledge including CSS selectors, workflow sequences, and known gotchas. Skills are selected via weighted scoring across five dimensions: exact trigger match (100 points), domain match (80 points), URL pattern match (60 points), partial trigger match (40 points), and tag match (20 points). A token budget system prevents context bloat from over-injection.

Browser Autonomy Hardening

A 12-ticket hardening sprint (LOBSTER2) added contract-driven reliability and release gating to the browser automation system:

Deterministic Recovery: A canonical failure taxonomy classifies every browser error by failure_code, failure_origin, and retry_class. Trace artifacts capture full execution context for each failure. A bounded self-healing loop generates candidate fixes from traces, validates them through a chain of checks, runs shadow replays, and makes accept/reject decisions with persisted recovery state.

Eval Gates and Scorecard: A three-stage quality gate system (stage1/stage2/production) runs evaluation harness packs before any release. A scorecard CLI aggregates pass/fail metrics across all evaluators. Rollout telemetry dimensions track reliability metrics in production, with an operator runbook documenting triage, rollback, and escalation procedures.

Unified Skill Inventory: Deterministic shared selectors and capability policy enforcement ensure consistent element targeting across skills. A validation chain verifies candidate skills before shadow replay, preventing regression from generated recovery paths.

Teach Recording

Users can record browser workflows as structured step sequences—clicks, inputs, navigations, and file uploads—and replay them deterministically. The recorder captures events including file_upload actions with stable path placeholders (resume_path, cover_letter_path, portfolio_path), enabling reliable replay of workflows that involve document attachments.

Deterministic replay executes recorded steps via the Ralph executor with native browser_file_upload support. Legacy recorded steps using fill/type actions with upload-like selectors are auto-normalized to explicit upload_file steps for backward compatibility. The recorder verification system distinguishes active listeners from stale localStorage flags, and modal state detection surfaces actionable guidance when the browser is blocked by a native file chooser dialog rather than reporting a generic failure.

Research Capabilities

Margot offers three tiers of research, each balancing depth, cost, and speed:

FeatureStandard ChatLight Research V2Deep Research V2
ArchitectureSingle model (Grok)Subtopic pipeline4 independent researchers
DepthSingle perspectiveSubtopic decompositionAngle-based with cross-pollination
Qualifying QuestionsNoNoYes (when needed)
Source CredibilityBasicBasicRequired (4-tier system)
Criticism SearchOptionalOptionalRequired
SynthesisDirect responseAutomatic reportConsensus/disagreement analysis
Estimated Time5–15 seconds30–90 seconds90–180 seconds
Estimated Cost~$0.001~$0.01~$0.016

Light Research V2

The mid-tier research option uses a three-phase subtopic pipeline: a planning phase decomposes the query into 2–4 focused subtopics with tailored search queries, a parallel research phase dispatches independent subtopic researchers that each search, scrape, and extract facts concurrently, and a synthesis phase combines all findings into a formatted report with citations via a built-in to_markdown() method.

PhaseComponentDescription
PlanningResearchCoordinatorDecomposes query into 2–4 focused subtopics with search queries
ResearchSubtopicResearcherFor each subtopic: search → scrape → compress (fact extraction)
SynthesisReportSynthesizerCombines findings into FinalReport with executive summary, sections, and citations

All three phases use Grok 4.1 Fast as a unified model. The system supports up to 4 concurrent researchers, 5 search results per subtopic, and 3 URLs scraped per subtopic. Granular SSE progress events stream real-time updates including planning status, per-subtopic search/scrape/compress progress, and synthesis state. The frontend renders these as an animated progress display positioned between the user query and assistant response.

Deep Research V2

The premium research tier uses four independent researcher agents (Alpha, Beta, Gamma, Delta), all running Grok 4.1 Fast, with a six-phase architecture:

Phase 1 — Query Analysis: Evaluates scope, complexity, and domain. Ambiguous queries trigger clarifying questions before research begins.

Phase 2 — Angle Generation: Four distinct research angles are generated to ensure diverse coverage (historical context, technical mechanisms, stakeholder perspectives, contrarian viewpoints).

Phase 3 — Round 1 Independent Research: All four agents research their assigned angles in parallel, isolated from each other to prevent groupthink. Each agent uses web search and knowledge APIs with credibility assessment.

Phase 4 — Summary Extraction: Each agent’s findings are compressed into structured summaries with key insights, confidence levels, and source citations.

Phase 5 — Round 2 Cross-Pollination: Agents receive summaries from the other three agents and conduct additional research to fill gaps, challenge assumptions, and explore connections they missed initially.

Phase 6 — Final Synthesis: A synthesis agent produces a comprehensive report with consensus points, key insights, areas of disagreement, and a structured final report with full source attribution.

Skills Framework

Margot’s skills system has two layers: autonomous tool generation (self-extending skills) and a curated skill library with intelligent discovery and routing (Skills V2).

Self-Extending Skills

Through three coordinator tools (create_skill, list_generated_skills, delete_skill), the system can convert natural language descriptions into validated Python tools that are hot-registered into the coordinator’s tool set.

Pipeline: The SkillGenerator calls the LLM to produce three files: skill.py (the tool implementation), skill_meta.json (function schema in OpenAI format), and test_skill.py (automated tests). The SkillValidator then performs AST-level analysis, checking against an import whitelist, scanning for banned patterns, and verifying the function signature matches the declared schema.

Validation: Generated skills must pass both static analysis (AST validation) and dynamic testing (pytest with a 30-second timeout). Only skills that pass all checks are registered.

Hot-Registration: The SkillExecutor dynamically imports approved skills and the SkillRegistry generates OpenAI function-format tool definitions with a skill_ prefix. New tools are immediately available to the coordinator without restart. A 20-skill cap prevents unbounded growth.

Skills V2 — Discovery and Routing

The curated skill library contains 31 skills discovered from disk, each defined with YAML frontmatter metadata (triggers, domains, URL patterns, requirements, allowed tools). Skills are feature-flagged via enable_skills_v2 and support four agent types: Browser, Coordinator, Desktop, and Research. When a task arrives, all skills are scored against the current context and the highest-scoring matches are injected into the agent’s prompt.

Scoring SignalWeightDescription
Exact trigger match100Task text matches a declared trigger phrase exactly
Domain match80Current URL domain matches a skill’s declared domains
URL pattern match60Current URL matches a skill’s regex URL patterns
Partial trigger match40Task text partially overlaps with trigger phrases
Tag match20Task metadata matches skill tags

A token budget system (2,000–3,000 tokens per injection, max 1,500 per skill) prevents context bloat by truncating at paragraph and sentence boundaries. Re-evaluation triggers fire on domain changes, after 3 consecutive failures, or when a required tool is missing from the allowed tool set.

Proactive Monitoring

Margot includes a heartbeat-based watch system that runs scheduled evaluations and delivers alerts via iOS push notifications. The system supports three schedule modes: slot-based (fixed time slots), interval-based (every N minutes), and cron expression scheduling.

Evaluator Types

EvaluatorPurposeDetection Method
Uptime CheckHTTP health monitoringConfigurable timeout, status codes, redirect following
Content ChangeWebsite content monitoringSHA-256 hash comparison with optional CSS selector targeting
GitHub PR StaleStale pull request detectionAge threshold via GitHub MCP
GA4 MetricsAnalytics anomaly detectionThreshold-based alerts on key metrics
Custom LLMFlexible evaluationCoordinator-driven analysis of any data source

Alert lifecycle includes acknowledgment and resolution tracking. When a watch triggers an alert, the iOS app receives an APNs push notification containing a seed message. Tapping the notification creates a new conversation where the coordinator proactively investigates the alert, providing immediate context and recommended actions.

Scheduled Task Execution

Beyond monitoring, Margot can execute full coordinator workflows on a schedule. The TaskExecutor runs arbitrary multi-tool workflows—such as monthly PageSpeed audits written to Google Sheets—using the same ReAct loop and tool set available in interactive chat, with additional safety guarantees designed for unattended operation.

Scheduled tasks support the same three schedule modes as watches: slot-based (fixed time slots), interval-based (every N minutes), and cron expression scheduling. Runtime guidance is loaded from editable markdown prompts rather than hardcoded instructions, allowing workflow behavior to be tuned without code changes.

Verification Contract

Every scheduled execution is subject to a global verification contract that ensures work was actually performed and results are correct. The executor fails the run if any of the following conditions are detected:

CheckFails When
Tool executionNo successful tool call occurred during the run
Write verificationPrompt-required file writes are missing
Read-back verificationWritten data was not read back and confirmed
Model self-reportThe model reports VERIFIED: false in its output

Mutation Guardrails

Scheduled tasks default to non-destructive behavior. For workflows that preserve, append, or upsert data, the executor monitors for suspicious file truncation—such as a Sheets write that replaces existing rows with fewer rows than expected. If detected, the run fails unless the prompt explicitly authorizes replacement. This prevents accidental data loss during unattended execution.

Coverage checks compare the number of processed items (e.g., analyzed URLs) against rows written and expected totals, catching partial completions that would otherwise appear successful.

Skill Context and Fallback

When Skills V2 is enabled, the executor prepares and injects skill context for scheduled runs, including read_skill execution paths during task loops. This gives scheduled workflows the same domain-specific knowledge available in interactive chat.

When a scheduled workflow fails verification or times out, the executor can invoke a deterministic backend-only fallback in the same execution pass. For example, a PageSpeed-to-Sheets workflow that fails the LLM-driven path will fall back to a hardcoded sequence: fetch PageSpeed data, save JSON artifact, resolve the target Drive file and sheet, write rows, and read back for verification. On failure, the fallback returns the exact step that failed and the last successful step for diagnostics.

Performance and Optimization

Parallel Tool Execution

Independent tools execute concurrently via asyncio.gather() with dependency detection to prevent conflicts. This achieves 30–70% faster execution for multi-tool queries, depending on whether all tools can run in parallel or some require sequential execution.

Prompt Caching

OpenRouter automatically caches prompts for Grok 4.1 Fast with a 5-minute TTL. The message structure is optimized for maximum cache hits: a static system prompt, a semi-static memory context in a separate message, and dynamic conversation content. Production measurements show an 85% cache hit rate, reducing prompt processing time by 500–800ms per request.

Context Management

Context management uses two components working together: a ContextWindowManager that enforces a sliding window, and a ToolResultSummarizer that compresses large tool outputs before they enter the context.

The sliding window maintains a 50,000-token maximum with a 30-message limit. The system message is always preserved, and when pruning is needed, the oldest non-system messages are removed first to keep the most recent context intact. Token counting uses tiktoken with the cl100k_base encoding. Per-conversation metrics—token_count, tokens_saved, and context_pruned_count—are tracked in PostgreSQL for monitoring.

The ToolResultSummarizer compresses large tool outputs using GPT-4o-mini at temperature 0.3 before they are fed back into the coordinator context. Each tool type has a specific character threshold that triggers summarization:

Tool CategoryThresholdTypical Reduction
GA4 reports5,000 chars90–95%
Web search results3,000 chars80–90%
File contents10,000 chars70–85%
Google Cloud responses5,000 chars85–95%

Performance Targets

OperationTarget Latency
Memory context assembly<150ms (all four layers)
Semantic vector search (HNSW)<50ms
Simple query (no tools)<1 second
Single tool call<2 seconds
Multi-tool workflow (3–5 tools)<5 seconds
Complex workflow (6–10 tools)<10 seconds
Embedding generation~200ms per embedding

Test Coverage

The backend unit test suite contains 1,248 passed tests with 1 skipped and zero failures. Coverage spans the memory system, Google Cloud tools, Gmail, Drive, Sheets, Calendar, PageSpeed, Freya coding agent, browser automation, skill generation, watch evaluators, shell safety, and API endpoints.

Real-Time Status Indicators

During the ReAct loop, the system emits lightweight SSE status events that replace generic loading messages with specific progress updates. Five status types map to different stages of query processing:

StatusWhen EmittedExample
AnalyzingStart of requestUnderstanding your request...
WorkingBefore tool executionUsing Web Search...
ProcessingAfter tool executionAnalyzing results...
ThinkingSubsequent ReAct iterationsConsidering next steps...
WritingFirst content chunkWriting response...

A mapping of 65+ internal tool names to human-friendly display names (e.g., gcloud_run_ga4_report becomes Analytics Report) provides clear visibility into which tools are active. Each status event is under 100 bytes with less than 5ms overhead, transmitted over the existing SSE connection with no additional infrastructure.

iOS Companion App

The iOS app provides a native SwiftUI interface for interacting with Margot remotely, connected through Tailscale for secure WireGuard-based mesh networking.

Core Features

SSE Streaming: Real-time token-by-token response streaming, matching the desktop experience.

Push Notifications: APNs integration for reminders and watch alerts. Notification taps create contextual conversations with seed messages.

Markdown Rendering: Full markdown support with code syntax highlighting and table rendering.

Vega-Lite Charts: Inline data visualization that detects Vega-Lite JSON specs (by $schema or data+mark+encoding structure) and renders them natively via WKWebView with Vega/Vega-Embed from CDN. Charts use a dark theme with white labels and green marks, responsive width, and a 300pt default / 400pt max height.

Connection Monitoring: Real-time backend health checks with connecting/connected/disconnected states and throttled retry logic.

Media Prefetching: Background actor for prefetching thumbnails and media assets to reduce perceived latency.

Conversation Management

The conversation service provides full chat history management with auto-generated titles, full-text search, pinning, pagination, and export capabilities. Conversations are stored with auto-updating metadata triggers in PostgreSQL, and message retrieval is ordered chronologically with user-scoped authorization.

OperationDescriptionPerformance
Auto-Title GenerationLLM-generated titles from first 1–2 messages (50 char max)<2 seconds
List ConversationsPaginated listing with sort by recent/created/pinned<100ms
Full-Text SearchPostgreSQL GIN-indexed search across titles and message content<200ms
ExportJSON (structured with metadata) or Markdown (human-readable)<500ms
Update/DeleteTitle, pinning status, or full conversation deletion<50ms

The Conversation History Foundation (Slices A–M) was validated across backend, desktop, and iOS. The desktop UI supports conversation pinning, search, and sidebar navigation, while the iOS app mirrors this functionality with native SwiftUI components and background data synchronization.

Advanced Settings

The iOS app exposes agent toggles for Freya (coding), Nano Banana (images), Veo (video), and Browser Control, along with model parameter sliders for temperature (0–1, step 0.1) and max tokens (1,024–16,384, step 1,024). A debug mode toggle reveals diagnostic info and enables log export. All settings are persisted via @AppStorage and passed to the API as snake_case fields.

Conversation Service

The backend Conversation Service manages the full lifecycle of chat conversations across both desktop and iOS clients. It handles auto-titling, search, pagination, and export, with all operations scoped to the authenticated user.

OperationDescriptionLatency
Auto-Title GenerationLLM-generated titles from first 1–2 messages (50 char max, fallback to truncated message)<2 seconds
List ConversationsPaginated listing with sort by recent/created/pinned and optional pinned-only filter<100ms
Full-Text SearchPostgreSQL GIN-indexed search across titles and message content<200ms
ExportJSON (structured with metadata) or Markdown (human-readable with headers)<500ms
Update / DeleteTitle editing, pinning status, or full conversation deletion with cascade<50ms

Security Architecture

OAuth Credential Security

Google Cloud OAuth tokens are stored with Fernet AES-128 encryption in PostgreSQL, with automatic token refresh and connection-type routing. The multi-connection architecture supports up to 5 OAuth connections per user, scoped by connection type (marketing, productivity, custom). Credentials are isolated per user with user_id-scoped queries preventing cross-user access.

Shell Safety Guardrails

All shell commands undergo pre-execution safety analysis with a three-level classification system. SAFE commands (read-only operations like ls, cat, df) execute immediately. NEEDS_CONFIRMATION commands (potentially impactful operations) require explicit user approval through the coordinator. BLOCKED commands (destructive operations including rm -rf, sudo, mkfs, and disk formatting) are refused entirely. The safety system decomposes pipe chains to analyze each component, scans embedded Python code via python -c for dangerous patterns, and extracts commands from AppleScript do shell script blocks.

Email Formatting

When Margot sends emails via Gmail, a dedicated Email Formatter service converts markdown content to professionally styled HTML with proper rendering across email clients. Features include Apple-style font stacks, responsive tables with alternating row colors and hover states, dark-themed code blocks, blue-accented blockquotes, and smart metric highlighting where positive percentages appear in green and negative percentages in red. A plain text fallback is generated automatically for clients that do not support HTML.

Database Architecture

PostgreSQL 18 with pgvector 0.8.1 provides the persistence layer. Core tables include conversations and messages (with auto-updating triggers for metadata), semantic_memories, episodic_memories, and procedural_memories (all with HNSW vector indexes), google_cloud_connections (encrypted OAuth tokens), coding_practices (RAG knowledge base for Freya), and memory_operations (audit log). Performance is optimized through HNSW indexes (m=16, ef_construction=64), composite indexes on user_id + timestamp, GIN indexes for full-text search, and JSONB for flexible metadata.

Architecture Principles

LLM-driven routing over hardcoded rules. Keyword-based tool routing creates brittle systems. Letting models like Grok 4.1 Fast make intelligent tool selection decisions based on full context produces more reliable and adaptable behavior.

Server-side calculations for data accuracy. LLMs are unreliable at arithmetic. All financial calculations, metric aggregations, and data transformations happen in Python before reaching the model, eliminating entire categories of errors.

Structural enforcement over prompt engineering. Validation gates, typed schemas, and AST analysis provide guarantees that no amount of prompt refinement can match. The self-extending skills pipeline exemplifies this: generated code must pass static analysis and automated tests before registration.

Memory as a first-class system component. The four-layer memory architecture transforms Margot from a stateless assistant into one that accumulates knowledge, recognizes patterns, and improves with use. The 26% accuracy improvement over baseline validates the investment in purpose-built memory infrastructure.

Margot represents a practical demonstration that multi-agent AI systems can be built for production use today, with careful attention to reliability, performance, and user experience.