How to Build an AI Agent (the Hard Way)

Recently, I was asked to stress test an agent architecture built to handle thousands of invocations per hour. I may cover how this architecture actually works in a later article. The agent serves one purpose: monitoring employee business travel and relaying advisories and warnings to employees ahead of and during their travel period.

Due to both business and technical constraints, this agent needed to be built the hard way, using code. I'm not recommending that every organization build agents like this. You'll be the best judge of what works for you. That said, this method scales well to thousands of invocations per hour, and there aren't many agents connected to Workday, running continuously, that I've seen do that yet.

The stack: Workday (travel booking data and employee profiles), X Filtered Stream API (real-time local intelligence), a small selection of government advisory websites, OpenWeather, Slack, Apple Messages for Business via Amazon Connect, Gemini 3 Flash via Google Vertex for inference, and Vercel for hosting.

The Data

The agent pulls from two Workday APIs:

Get_Travel_Booking_Records returns active and upcoming employee travel metadata, e.g. destinations, dates, and booking references.
WQL returns emergency contacts and communication channel preferences.

Government travel advisories are the primary, generally static, safety data source, and most national governments publish open datasets for travel advisories. These are typically low-volume, high-token-count text documents (the UK's FCDO advisories, for example, run 5,000+ words per country). Vector search is unnecessary, and arguably inappropriate, for this. The documents are small enough and structured enough that file-based retrieval with grep works, and it's faster, cheaper, and more predictable than embedding them.
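As a rough illustration of what grep-style retrieval over advisory files can look like, here is a minimal Python sketch. It is not the production code: the function name, the .txt layout, and the sample advisory text are all assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path


def grep_advisories(root: Path, pattern: str, context: int = 1) -> list[dict]:
    """Return regex matches, with surrounding lines, from every .txt file under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(root.rglob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                hits.append({
                    "file": path.name,
                    "line": i + 1,  # 1-based, like grep -n
                    "excerpt": "\n".join(lines[start:end]),
                })
    return hits


def demo() -> list[dict]:
    # Hypothetical advisory text, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "brazil.txt").write_text(
            "Summary\nProtests are possible in major cities.\nCarry identification.\n",
            encoding="utf-8",
        )
        return grep_advisories(root, r"protest|civil unrest")
```

Only the matching excerpts, not the full 5,000-word document, would then be handed to the model.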

For real-time local intelligence, the agent creates X Filtered Stream connections scoped to each active travel event. The stream rules use the from: operator to track specific, verified local accounts (news outlets, emergency services, transport authorities) in the destination region, and they persist only for the duration of the trip. OpenWeather provides weather data for severe weather alerts.
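To make the rule scoping concrete, the sketch below builds the JSON body for adding filtered-stream rules, packing from: clauses into as few rules as fit under a length cap. The helper name, the trip tag format, the account handles, and the 512-character default are assumptions; the real cap depends on the API access tier, and the code only builds the payload without sending it.

```python
def build_stream_rules(trip_id: str, accounts: list[str], max_len: int = 512) -> dict:
    """Pack `from:` clauses into OR-joined rules no longer than max_len characters."""
    rules: list[str] = []
    current = ""
    for account in accounts:
        clause = f"from:{account.lstrip('@')}"
        if len(clause) > max_len:
            raise ValueError(f"clause for {account!r} exceeds max_len")
        candidate = clause if not current else f"{current} OR {clause}"
        if len(candidate) > max_len:
            # Current rule is full: close it and start a new one.
            rules.append(current)
            current = clause
        else:
            current = candidate
    if current:
        rules.append(current)
    # Tagging every rule with the trip makes it easy to find them again at teardown.
    return {"add": [{"value": value, "tag": f"trip:{trip_id}"} for value in rules]}


# Hypothetical handles, for illustration only.
payload = build_stream_rules("T1", ["@a_news", "b_transit"])
```

When the trip ends, the rules carrying that trip's tag can be looked up and deleted, which is one way to keep a stream from outliving the travel period.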

Model Selection

Gemini 3 Flash handles the agent's reasoning and tool orchestration. On the τ²-Bench telecom benchmarks, the model comes in ahead of most competitors, and its capability-to-price ratio matters for high-volume invocations: $0.50 per million input tokens and $3 per million output tokens. GLM-5 is another fine choice.
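A quick back-of-the-envelope calculation shows why the pricing matters at this volume. The prices are the ones quoted above and the 4,000 input tokens match the per-invocation context ceiling reported later in this article; the 300 output tokens and 5,000 invocations per hour are assumptions chosen only to make the arithmetic concrete.

```python
INPUT_PRICE_PER_M = 0.50   # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_M = 3.00  # USD per million output tokens (from the article)


def cost_per_invocation(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one model call at the quoted prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Assumed workload: 4,000 input tokens, 300 output tokens, 5,000 calls per hour.
per_call = cost_per_invocation(4_000, 300)
per_hour = per_call * 5_000
print(f"per call: ${per_call:.4f}, per hour: ${per_hour:.2f}")
```

Under those assumptions, inference costs about $0.0029 per invocation, or roughly $14.50 per hour.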

Inference runs through Google Vertex with about 1s latency and 130 tokens per second, which keeps conversations feeling responsive. Vertex also supports HIPAA compliance and data residency controls, which rounds out the privacy posture for a system that handles employee PII.

Memory Architecture

The agent's memory is structured around a sandboxed filesystem. Every agent execution is pre-loaded with its own directory structure:

[Figure: the agent's sandboxed memory directory structure]

The /policies and /advisories directories are templated: every agent sandbox gets the same base set of documents, refreshed on a schedule. The /memories directory is scoped to the current employee. No cross-user memory leakage by design.
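One way to provision such a sandbox is sketched below. The function name, the template location, and the use of an opaque employee reference as the directory name are assumptions, not the actual layout.

```python
import shutil
from pathlib import Path


def build_sandbox(template_root: Path, sandbox_root: Path, employee_ref: str) -> Path:
    """Create a per-employee sandbox: shared templates copied in, memories private."""
    sandbox = sandbox_root / employee_ref
    # Templated directories: every sandbox gets the same base documents.
    for shared in ("policies", "advisories"):
        shutil.copytree(template_root / shared, sandbox / shared, dirs_exist_ok=True)
    # Memories are never copied from anywhere else, so nothing leaks across users.
    (sandbox / "memories").mkdir(parents=True, exist_ok=True)
    return sandbox
```

Because /memories is created empty and lives under the employee's own directory, the isolation falls out of the directory layout rather than from any check inside the agent.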

Context files hold the most important, immediately relevant information: the employee's active booking details (destination, dates, hotel, airline), current advisory level, active alerts, and confirmed preferences. These are plain text files the agent reads at the start of each invocation and rewrites when new information arrives.

The /advisories and /policies directories are queried with grep. When an advisory updates, the file store refreshes. The agent searches policies with targeted queries ("employee travel insurance coverage for civil unrest" or "corporate policy on travel to Level 4 advisory countries") rather than embedding the full documents into the context window.

Archival memory stores information the agent has learned that isn't immediately needed: historical preferences from past trips, resolved alerts, prior interactions. Older memory files are rotated out of the active /memories directory into cold storage and retrieved on demand when the agent encounters a returning traveler.

External search (ClickHouse) stores and indexes the X stream data. Streams generate volume. A single active trip to a major city might pull hundreds of posts per day. ClickHouse handles the ingest, and the agent queries it for relevant signals when composing alerts. Stale data (streams from completed trips, resolved incidents) gets purged on a rolling schedule.

Durable Execution

Every agent invocation runs inside an execution framework that provides automatic retries, step-level idempotency, and state that survives failures. If the agent crashes midway through a multi-step operation (fetch advisory, compose alert, deliver message), it resumes from the last completed step rather than restarting from scratch.
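The resume-from-last-step behavior can be illustrated with a toy checkpointing runner. This is a simplified sketch, not the execution framework itself: run_durable, the JSON checkpoint file, and the step signature are all hypothetical, and a real framework adds retries, locking, and encryption on top.

```python
import json
from pathlib import Path
from typing import Callable

# A step is a (name, function) pair; the function receives the accumulated
# data and returns a dict of new values to merge into it.
Step = tuple[str, Callable[[dict], dict]]


def run_durable(steps: list[Step], checkpoint: Path) -> dict:
    """Run steps in order, persisting progress so a rerun skips completed steps."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text(encoding="utf-8"))
    else:
        state = {"done": [], "data": {}}
    for name, fn in steps:
        if name in state["done"]:
            continue  # completed in an earlier attempt, so it is not re-executed
        state["data"].update(fn(state["data"]))
        state["done"].append(name)
        # Persist after every step; a crash in the next step loses nothing.
        checkpoint.write_text(json.dumps(state), encoding="utf-8")
    return state["data"]
```

If the compose step raises, calling run_durable again with the same checkpoint file skips the fetch step and picks up at compose, which is the property that keeps a retry from re-sending a message that was already delivered.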

Each execution step runs with org-scoped encryption. Payloads are encrypted at each step boundary using organization-specific keys, so a compromised step in one tenant's execution can't expose another tenant's data.

Message delivery uses cascading retries with dynamic timeouts. The agent's primary channel is the employee's stated preference (Apple Messages for Business, for end-to-end encryption with a recognizable sender identity). If delivery fails after the configured retry window, it falls back to Slack, then SMS. The retry intervals and timeout thresholds aren't static. The agent adjusts them based on travel urgency and the destination's current advisory level. A Level 4 advisory destination gets shorter retry windows than a Level 1.
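A minimal sketch of the cascade follows. The channel identifiers, the 600-second base window, the divide-by-level scaling, and the send callback signature are assumptions for illustration; the real intervals are chosen by the agent.

```python
from typing import Callable, Optional

# Fallback order after the employee's preferred channel (per the article).
CHANNEL_ORDER = ["apple_messages", "slack", "sms"]


def retry_window_seconds(advisory_level: int, base: float = 600.0) -> float:
    """Shorter retry window for higher advisory levels (hypothetical scaling)."""
    level = min(max(advisory_level, 1), 4)
    return base / level


def deliver(
    message: str,
    preferred: str,
    advisory_level: int,
    send: Callable[[str, str, float], bool],
    attempts_per_channel: int = 3,
) -> Optional[str]:
    """Try the preferred channel first, then fall back; return the channel that worked."""
    channels = [preferred] + [c for c in CHANNEL_ORDER if c != preferred]
    timeout = retry_window_seconds(advisory_level) / attempts_per_channel
    for channel in channels:
        for _ in range(attempts_per_channel):
            if send(channel, message, timeout):
                return channel
    return None  # every channel exhausted; the caller should escalate
```

Here send is any callable that attempts one delivery within the given timeout and reports success. With these assumed numbers, a Level 4 destination gets a 150-second window per channel, against 600 seconds for Level 1.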

Human-in-the-Loop

Before a trip begins, the agent pings the employee on Slack to confirm their travel preferences and communication channel. This contact is low-pressure and encourages organic adoption: "I see you have a trip to São Paulo on March 15. Want me to send travel alerts to iMessage, Slack, or email?"

The agent stores the confirmed preferences as a context file. If the employee doesn't respond, the agent waits a configurable number of days (agent-adjusted by urgency and travel date: 3 days for a Level 1 destination, 1 day for a flight tomorrow), then follows up via email. The approval step gates user personalization. When no confirmation is received, the agent follows a default messaging pipeline.
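The follow-up timing can be reduced to a small rule. The sketch below is one hypothetical rule that reproduces the two examples above (3 days for a Level 1 destination, 1 day for a flight tomorrow); the behavior for intermediate advisory levels is an assumption.

```python
from datetime import date, timedelta


def follow_up_date(today: date, departure: date, advisory_level: int) -> date:
    """Pick the day to send the email follow-up if Slack goes unanswered."""
    # Assumed scale: 3 days at Level 1, 2 at Level 2, 1 at Level 3 and above.
    by_level = max(1, 4 - advisory_level)
    # Never wait past departure, and always wait at least one day.
    days_left = max(1, (departure - today).days)
    return today + timedelta(days=min(by_level, days_left))
```

In the production system this decision is made by the agent rather than hard-coded, so a fixed rule like this is closer to the default a deterministic fallback might use.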

Guardrails

The agent's tools are deterministic. The agent itself is probabilistic. This distinction matters for reliability.

Every tool (fetch booking, query advisory, deliver message) is a pure function with defined inputs, outputs, and error handling. The agent decides which tools to call and when, but the tools themselves don't hallucinate. A tool either succeeds, fails, or retries. There's no ambiguity in the execution path.

For PII protection, the agent runs a guardrail layer on all inputs and outputs that scans for and redacts personally identifiable information before it reaches the model or gets written to logs.

The guardrail intercepts messages at both the input boundary (before the agent reasons over employee data) and the output boundary (before composing a message). Employee names, phone numbers, and booking references are masked in all log output. Critical information, like emergency contact details, is stored in-memory as reference IDs. Updates to Workday refresh the memory block; the agent is aware of what PII is available, but not what the PII is.
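The reference-ID idea can be sketched as a small vault that swaps detected values for opaque tokens before text reaches the model or the logs, and swaps them back only at delivery time. The class name, token format, and regular expressions are illustrative assumptions; the patterns cover only phone numbers and email addresses, and names or booking references would more plausibly be matched against the Workday record than found by regex.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class PiiVault:
    """Maps PII values to opaque reference IDs and back."""

    def __init__(self) -> None:
        self._ref_by_value: dict[str, str] = {}
        self._value_by_ref: dict[str, str] = {}

    def _ref(self, kind: str, value: str) -> str:
        # Reuse the same reference when the same value appears again.
        if value not in self._ref_by_value:
            ref = f"<{kind}_{len(self._ref_by_value) + 1}>"
            self._ref_by_value[value] = ref
            self._value_by_ref[ref] = value
        return self._ref_by_value[value]

    def redact(self, text: str) -> str:
        """Replace detected PII with reference IDs (safe for the model and logs)."""
        text = EMAIL.sub(lambda m: self._ref("EMAIL", m.group()), text)
        return PHONE.sub(lambda m: self._ref("PHONE", m.group()), text)

    def restore(self, text: str) -> str:
        """Swap reference IDs back to real values, at the delivery boundary only."""
        for ref, value in self._value_by_ref.items():
            text = text.replace(ref, value)
        return text
```

The model sees that a phone reference exists and can decide to use it, but the digits themselves stay inside the vault until a deterministic tool resolves them.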

Observability

Tracing runs through Sentry. Every execution step, tool call, and model inference is instrumented as a span in a distributed trace. Payloads in traces are sanitized: no employee names, PII, or booking references. The trace tells you what the agent did and why (which tool, which branch, what the model's reasoning was), without exposing who it was about. This matters for debugging at scale.

What This Proved

The exercise validated a few things worth sharing.

First, file-based retrieval for government advisories worked better than expected. The documents are well-structured, update infrequently, and respond well to targeted pattern matching. Embedding them added latency and cost without improving retrieval quality.

Second, the filesystem-based memory architecture kept context lean. At peak load, the agent read only the active booking file, current advisory, and confirmed preferences into context, staying under 4,000 tokens per invocation. Everything else was queryable on demand from the sandboxed directory or ClickHouse.

Third, Gemini 3 Flash through Vertex handled the volume. Latency of about 1s and throughput of 130 tokens/second held steady under sustained load, and the per-token pricing made high-volume invocations economically viable.

Fourth, durable execution with encrypted step boundaries is table stakes for any agent that handles employee data. If you're building agents that touch HR systems, payroll, or travel records, and you're not encrypting payloads at each step boundary with org-scoped keys, you have an exposure surface you probably haven't mapped.

Intense, but reliable.