Building long-running agents for monitoring financial events

This is a technical write-up of the architecture behind Zeed's monitoring system for financial events. V1 of the product shipped to customers in Jun-25.

The Problem

Investors, generally, have to stay on top of events. Bankruptcies, clinical trial results, insider trading, merger arbitrage opportunities, credit rating changes, to name a few.

Special situations and event-driven desks build entire strategies around these events. When an event occurs and meets specific criteria, they need to quickly evaluate whether to act. I've worked on strategies like this—during my time as a trader, we monitored hurricanes in the Gulf Coast for their impact on oil supply and refinery capacity.

There are so many events that you cannot possibly analyse them all, but you can't afford to miss anything either. There may be 300 opportunities on a given day, but you have finite resources to explore, evaluate, and decide which ones to act on, if any.

  • Scaling humans to monitor these opportunities would be too expensive.
  • Single-agent, chat-style AI is reactive and lacks the firepower to help.
  • We need something that is always on, doesn't need babysitting, and learns over time.

Approaching the Problem

We started with rigid, workflow-based systems, with fixed data APIs, if/else logic gates, and LLMs for the critical reasoning/research. It worked, but we hit a wall in terms of intelligence. The more rigid the structure, the less helpful and expressive the agent could be in guiding the research. We found that our scaffolding was holding back the agent.

All the LLMs we were using were working "below" our initial scaffolding. A rigid, deterministic workflow was controlling our creative, non-deterministic LLMs.

Our intuition was that this needed to flip.
The flip: from rigid planner on top to LLMs leading
Before: rigid planner constraining creative LLMs. After: LLMs lead, scaffolding supports.

Why did we need multi-agent?

There are three reasons to bother with multi-agent systems. It would be a whole lot easier to just have a single agent in a for loop.

  1. Context management. When the number of tokens you need to process exceeds the limit at which LLMs stay performant. Multi-agent isn't the only solution; compacting/summarising as you go along is a trivial, yet lacklustre, answer to this.
  2. Speed. Even if an agent can manage its own context, it is still single-threaded and sequential. Reading 5 documents one after another is slower than reading them at the same time. Tok/s is the limiting factor for time-sensitive tasks.
  3. Cost management. Whilst it would be great to use Opus 4.5 for everything, it is expensive. Where smaller models are capable of handling smaller tasks, they present a more cost-effective solution.

In approaching this problem, it was clear that we would need multiple agents. If we are processing 100 events in a single agent run, with each event carrying vastly different context, then a single agent's context window would not suffice. Plus, our users expect most of these agents to finish by the time the stock market opens or closes, so timing was a critical factor. The challenge is that they also can't be invoked too early, because the events haven't taken place yet (e.g. pre-market). Therefore, we also needed the speed of a multi-agent setup.

Core Architectural Decision: A library, not a meeting room

We modelled the system with inspiration from how actual trading desks are organised. There is a single fulcrum point where all the information coalesces: only one person has the context of everything. This is the fund manager / head of desk. Everyone else feeds relevant information to them, but the decisions are really only made by a single person.

Contrary to this, I have also personally been on desks that operate with a co-head structure. This sounds great, double the brain power, best of both...but experience tells me that in reality, this is incredibly difficult to pull off.

We also decided that a "council" approach (multiple analysts working together, or pair programming) would not work here. We wanted to discourage groupthink whilst enabling collective intelligence. We did not want democracy averaging, as mentioned in the precursor to this article.

We decided on a Library model, where agents can "publish" their completed work when they are done.

Library vs Meeting Room coordination models
Meeting room: agents talk to each other, creating groupthink. Library: agents publish to a shared resource, enabling collective intelligence without context collision.
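
To make the Library concrete, here is a minimal sketch, assuming a simple file-backed store (the entry fields and naming are illustrative, not our production schema). Workers publish finished work and can read what others have published, but they never message each other directly.

```python
import json
import time
from pathlib import Path

class Library:
    """Shared publish/read store. Workers publish finished work and can read what
    others have published; they never message each other directly."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def publish(self, worker_id: str, event_id: str, summary: str) -> Path:
        entry = {
            "worker_id": worker_id,
            "event_id": event_id,
            "summary": summary,          # the finished work, not the messy process
            "published_at": time.time(),
        }
        path = self.root / f"{event_id}--{worker_id}.json"
        path.write_text(json.dumps(entry, indent=2))
        return path

    def read_all(self, event_id: str | None = None) -> list[dict]:
        entries = [json.loads(p.read_text()) for p in self.root.glob("*.json")]
        if event_id is not None:
            entries = [e for e in entries if e["event_id"] == event_id]
        return entries
```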

How it works

At the heart of the system is the primary orchestrator agent. This is (metaphorically) the fund manager. The agent has its own workspace (sandbox) that persists across the lifetime of the agent, including across separate agent runs. This orchestrator has a plan, defined in its workspace (but not loaded into its context window until needed).

1. Building the Universe

Each time there is an agent run, the agent reads its plan and gets to work. Its primary job is to find and filter events, then dispatch workers. This may mean hitting an API, running some code, or browsing a website. It is not doing any research work; it's building the universe of events to consider.

Then, the orchestrator can interact with the agent swarm. It instructs the swarm, based on its plan, exactly what it wants to know about the universe of events it has found. At this stage, there are between 50 and 400 events.
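
As a rough sketch of this stage (the Job fields, event shape, and filtering criteria below are illustrative assumptions), the orchestrator's output is just a batch of jobs, an event plus instructions, posted to a queue for the swarm to pick up:

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class Job:
    event_id: str
    instructions: str            # what the orchestrator wants to know, per its plan
    time_budget_s: int = 600     # preliminary research is capped at ~10 minutes
    depth: str = "preliminary"   # later escalated to "deep" for interesting events

def build_universe(raw_events: list[dict], plan: str) -> list[Job]:
    """Stage 1: no research, just turn found events into jobs for the swarm."""
    return [
        Job(event_id=e["id"], instructions=f"Per plan: {plan}. Assess: {e['headline']}")
        for e in raw_events
    ]

job_queue: Queue[Job] = Queue()
for job in build_universe(
    raw_events=[{"id": "evt-1", "headline": "Chapter 11 filing, small-cap retailer"}],
    plan="flag distressed-debt situations with tradeable securities",
):
    job_queue.put(job)   # swarm workers pick these up independently
```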

2. Preliminary Research

When each of these events with instructions (jobs) is posted to the swarm, it gets put into a queue and picked up by swarm workers.

Each worker has its own fresh sandbox (to execute code, search the web, etc.). The worker's job, at this stage, is to do preliminary research, i.e. a small amount of cheap work to figure out if it's worth doing a large amount of expensive work. Since the work is easy and the scale is high, we use a smaller, less intelligent model.

For most event types, this is between 1-5 minutes of work and we constrain it to 10 minutes. More on this later. All the work done up until that point is persisted in the sandbox, and it completes its job by generating a single message, communicating what it has found to the orchestrator.
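
A sketch of the preliminary worker loop, assuming a hypothetical call_llm helper and the Job shape from the earlier sketch (neither is a real SDK): a cheap model, a hard time budget, work persisted to the worker's own sandbox, and a single message back to the orchestrator.

```python
import time
from pathlib import Path

CHEAP_MODEL = "small-model"   # placeholder; a Flash-class model in our case

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for an LLM call; swap in your provider's client."""
    raise NotImplementedError

def preliminary_research(job, sandbox: Path) -> str:
    """Cheap work to decide whether expensive work is warranted."""
    deadline = time.monotonic() + job.time_budget_s
    notes = sandbox / f"{job.event_id}-prelim.md"
    findings = []

    while time.monotonic() < deadline:
        prior = notes.read_text() if notes.exists() else ""
        step = call_llm(CHEAP_MODEL, f"{job.instructions}\n\nNotes so far:\n{prior}")
        findings.append(step)
        notes.write_text("\n\n".join(findings))   # everything persists in the sandbox
        if "DONE" in step:                        # assumed convention for completion
            break

    # The worker's only output to the orchestrator is a single message.
    return call_llm(CHEAP_MODEL, f"Summarise for the orchestrator in a few sentences:\n{notes.read_text()}")
```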

This is a low-trust activity: the orchestrator is encouraged to be very specific about what it is looking for before "real" work on an event is warranted, and it doesn't leave much judgement up to the worker itself. In some cases, this level of research is sufficient for what the orchestrator is looking for before flagging it for the user.

Each worker can access the Library. The Library is a centralised repository of "published" information; each orchestrator has one, and it contains entries from all swarm workers for that specific agent. It is only here that a worker can determine if it has actually seen an event before! Whilst it can read the work that other agents have published, it cannot see all the messy process behind it (to prevent context bloat).
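
With the Library as the only shared state, "have I seen this event before?" becomes a lookup against published entries. A sketch, reusing the Library class from the earlier example:

```python
def already_covered(library: Library, event_id: str) -> bool:
    """Before doing any work, a worker checks whether a previous worker for this
    orchestrator has already published on this event."""
    return len(library.read_all(event_id=event_id)) > 0
```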

2.5 Independent Evaluation

After each job is complete, the outputs of the workers are scored against a list of heuristics (as decided by the orchestrator). This is a 3rd-party, independent eval on each job result. I call this "2.5" because we added it early on and never removed it, but it is unclear how useful it actually is to the orchestrator. It's more for our eval systems.

Example heuristics include conciseness, faithfulness, novelty and logical correctness. Different event types warrant different heuristics as it relates to selection and filtering. E.g. an old event that everybody knows about, and that is already priced in, is not important to the manager. No novelty!
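
A sketch of the independent evaluation step, assuming a hypothetical judge model and reusing the call_llm placeholder from earlier; the judge only ever sees the worker's final message, never its sandbox.

```python
import json

EVAL_MODEL = "judge-model"   # placeholder; a small, cheap model is enough here

def evaluate(message: str, heuristics: list[str]) -> dict[str, float]:
    """Score a worker's final message on each orchestrator-chosen heuristic (0-1)."""
    prompt = (
        "Score the following research note from 0 to 1 on each heuristic. "
        f"Heuristics: {heuristics}. Reply as JSON mapping heuristic to score.\n\n"
        f"Note:\n{message}"
    )
    return json.loads(call_llm(EVAL_MODEL, prompt))

# e.g. evaluate(note, ["conciseness", "faithfulness", "novelty", "logical_correctness"])
```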

3. The First Filter

The orchestrator polls the agent swarm to know the status of the jobs. Unless it is extremely time-sensitive (sometimes it is!), it waits for all the jobs to complete. It now has the complete context of all the jobs, provided by the final messages from the cheap workers.

The orchestrator generally does one of the following things at this stage:

  • It decides the work done on any specific event was insufficient, and so asks for more information
  • It decides that the event was insignificant
  • It decides an event is interesting and goes deeper

Now, the number of events we are left with is between 10 and 50. Any more than that and the agent hasn't done enough work to actually curate the list for the end user! The orchestrator engages the agent swarm again, but this time to do "real" work.

Event filtering funnel
The filtering process: 400 events become 30 after preliminary research, then 10 actionable insights after deep research.
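
A sketch of that first-pass triage, reusing the call_llm placeholder; note that the verdict comes from the orchestrator model itself, not from hard-coded rules (the model name and reply format are assumptions):

```python
from enum import Enum

class Verdict(Enum):
    NEEDS_MORE = "needs_more"        # preliminary work was insufficient
    INSIGNIFICANT = "insignificant"  # drop the event
    INTERESTING = "interesting"      # escalate to deep research

def triage(event_id: str, prelim_message: str, plan: str) -> Verdict:
    """The verdict comes from the orchestrator model, not hard-coded rules."""
    answer = call_llm(
        "orchestrator-model",   # placeholder name
        f"Plan:\n{plan}\n\nPreliminary note for {event_id}:\n{prelim_message}\n\n"
        "Reply with exactly one of: needs_more, insignificant, interesting.",
    )
    return Verdict(answer.strip())

# Interesting events get re-queued as deep-research jobs; needs_more events get
# re-queued with sharper instructions; the rest are dropped.
```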

4. Deep Research

Each worker picks up where it left off, and now has fresh feedback from the orchestrator on what direction to take. The difference is that we now give this worker a smarter brain (a more expensive model) and more time. The time budget is set in the orchestrator's plan, typically between 10 and 60 minutes.

The worker is now going to find all the relevant sources, run all the analysis behind this event with its expansive skill set, including searching the Library and referencing financial models.

We use a method very similar to Deep Researcher with Test-Time Diffusion (actually a few months before this paper was published!). In my opinion this is a fancier version of Ralph, which has been doing the rounds these past few weeks. To summarise, it runs an agent until it returns. Then the outer agent loop tells it to go again with the same prompt, continuing from where it left off. The agent looks at the changes made in the previous iteration and proceeds accordingly. We run a max_depth of 20 iterations in production, and since it's all running in a sandbox and modifying/creating files, we don't find ourselves suffering from context rot.

Test-Time Diffusion loop
The deep research loop: generate, evaluate, refine, continue — 20 iterations until the agent is satisfied.
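
A sketch of that outer loop: run the agent to completion, then restart it with the same prompt so it continues from the files it already modified in the sandbox, up to a maximum depth. run_agent_once here is a placeholder for the real agent runtime.

```python
from pathlib import Path

MAX_DEPTH = 20   # production setting

def run_agent_once(prompt: str, sandbox: Path) -> bool:
    """One full agent run inside the sandbox. Returns True once the agent
    declares the report finished. Placeholder for the real agent runtime."""
    raise NotImplementedError

def deep_research(prompt: str, sandbox: Path) -> Path:
    for _ in range(MAX_DEPTH):
        # Same prompt every iteration; the agent inspects the files it changed
        # last time and continues from there, so its context never bloats.
        if run_agent_once(prompt, sandbox):
            break
    return sandbox / "report.md"   # illustrative output path
```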

This final piece of work, when complete, is then published to the Library, and we again run the 3rd party evaluators. This is completely self-directed planning and execution.

5. Synthesis

The orchestrator waits for all the deep research jobs to finish, again with the option to instruct a worker to do more. When assessing the output, it reads each report with a "fresh" context window (essentially a sub-agent of itself), and decides what it wants to pull back into its main context. This is very closely modelled on an analyst-manager relationship.
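
A sketch of the fresh-context read, reusing the call_llm placeholder: each report is distilled by a sub-agent with an empty context, and only the distilled take goes back into the orchestrator's main context.

```python
from pathlib import Path

def assess_report(report_path: Path, plan: str) -> str:
    """Read one deep-research report with a fresh context window (a sub-agent of
    the orchestrator) and return only what is worth pulling into main context."""
    report = report_path.read_text()
    return call_llm(
        "orchestrator-model",   # same model as the orchestrator, but a fresh context
        f"You are reviewing an analyst's report against this plan:\n{plan}\n\n"
        f"Report:\n{report}\n\n"
        "Return only the points the fund manager should carry forward.",
    )

# Usage: distilled = [assess_report(p, plan) for p in sandbox.glob("*-report.md")]
```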

Finally, it decides which research is actionable enough to be communicated back to the user. It dispatches one final synthesis worker to produce the content summarising all the work that was done, and sends this back to the user.

6. Learn and Adapt

Everything the orchestrator deems relevant, it persists in its sandbox (through organised folders, markdown files, CSV files, etc.). At times it has learnt something about the process, or ways to improve it, and it can make small changes to update its own plan.

This can be things to watch out for in future runs. For example, an orchestrator may have picked up, whilst reading one of the workers' reports, that a next-stage clinical trial result has been scheduled two weeks from now. The orchestrator saves this in the sandbox, and reminds itself to remind the user a day before, and to monitor it on the day itself!
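
A sketch of how such a reminder might be persisted, assuming a simple dated markdown note inside the orchestrator's sandbox (the file layout is illustrative):

```python
from datetime import date, timedelta
from pathlib import Path

def remember(sandbox: Path, note: str, remind_on: date) -> None:
    """Persist a learned fact as a dated reminder that the orchestrator
    re-reads at the start of future runs."""
    reminders = sandbox / "memory" / "reminders.md"
    reminders.parent.mkdir(parents=True, exist_ok=True)
    with reminders.open("a") as f:
        f.write(f"- [{remind_on.isoformat()}] {note}\n")

# e.g. a next-stage clinical trial readout spotted in a worker's report, two weeks out:
readout = date.today() + timedelta(weeks=2)
remember(
    sandbox=Path("orchestrator-sandbox"),
    note="Trial readout expected; nudge the user the day before and monitor on the day.",
    remind_on=readout - timedelta(days=1),
)
```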

Additionally, the user can chat to the orchestrator, and based on the questions being asked, it can update its plan or change how it interacts with the swarm.

How is the Orchestrator so OP?

I missed out one step from the above — stage 0.

0. The Agent Builder

The key to the OP (overpowered) orchestrator is an unbelievable planning agent that constructs it. This isn't new, but at the time we didn't realise it was a planning agent; we always referred to it as an agent builder, e.g. defining the toolset, the system prompt, etc. The general idea: a user types what they want to monitor, and the agent builder asks clarifying questions for specifics on what it should be on the lookout for. The agent builder then transforms this into something tangible.

Data

One of the critical things is where the data should be sourced from. In some instances it is public websites, and the builder creates its own custom scraper or reverse-engineered API. Other times it is data available within APIs our system already has access to, or a private data source that the user has access to.

Two things we did unlocked a lot of performance for us.

Shared Sandbox

The biggest by far was the notion of a shared sandbox across agent runs for persisted storage. This was heavily inspired by using Claude Code for structured research. In fact, the entirety of the MVP was scaffolded on top of a local CC instance; I used this MVP to explain to the team how I thought this should be built.

Agent Initialisation

The second was having the agent builder do the orchestrator initialisation. Critically, this includes putting the agent definition and plan within the sandbox. This doesn't load the whole plan into context, but allows the orchestrator to grep and understand the plan as and when needed, including for debugging. The plan_update setup at the end of agent runs was also critical.
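
A sketch of what "grep the plan, don't load it" looks like in practice, assuming the plan lives as a markdown file in the sandbox:

```python
from pathlib import Path

def grep_plan(sandbox: Path, pattern: str, context_lines: int = 2) -> str:
    """Return only the matching slice of the plan, with a little surrounding
    context, so the orchestrator never loads the whole plan into its window."""
    lines = (sandbox / "plan.md").read_text().splitlines()
    hits = []
    for i, line in enumerate(lines):
        if pattern.lower() in line.lower():
            lo, hi = max(0, i - context_lines), i + context_lines + 1
            hits.append("\n".join(lines[lo:hi]))
    return "\n...\n".join(hits) if hits else "no match"

# e.g. grep_plan(orchestrator_sandbox, "clinical trial") pulls in just that section.
```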

The strength of the builder is in understanding the system's capabilities deeply. This is a system VERY good at monitoring. If you want a simple one-off chat, we explicitly point the user to a separate part of the product (which is essentially a general-purpose orchestrator agent).

The agent builder is really the CEO of the system: it hardly does anything, but it makes sure everyone else is organised in the optimal way and deals with any big problems that require seismic shifts. Everything else is handled by the orchestrator and its swarm.

We also version control the sandbox, to make sure there are checkpoints for the agent builder to restore the system back to in case anything goes wrong.
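
A sketch of sandbox checkpointing with plain git; the agent builder can roll the whole system back to a known-good state if a run goes wrong (paths and commit messages are illustrative):

```python
import subprocess
from pathlib import Path

def checkpoint(sandbox: Path, label: str) -> None:
    """Commit the full sandbox state so there is a restore point for this run."""
    subprocess.run(["git", "-C", str(sandbox), "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", str(sandbox), "commit", "--allow-empty", "-m", f"checkpoint: {label}"],
        check=True,
    )

def restore(sandbox: Path, ref: str = "HEAD~1") -> None:
    """Roll the sandbox back to a previous checkpoint if a run goes wrong."""
    subprocess.run(["git", "-C", str(sandbox), "reset", "--hard", ref], check=True)
```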

Sandbox is all you need

Each agent having its own sandbox with a file system and code execution environment, and scaling that, is what I was betting on.

Have some faith in the agents—hundreds of billions of parameters know more than you think. They just need clear direction and objectives. The filesystem enables tracking progress, refinement across runs, and the Library of previous work. Resetting the sandbox is what we found produced inconsistency between agent runs; persisting the sandbox led to a reduction in the decay of agent performance, even when changing models!

Economics

There is a fatal drawback to this system, and likely the reason it isn't widely used. It is computationally expensive. Here is an illustrative example of the models we used.

  • Agent builder: Gemini 2.5 Pro
  • Orchestrator: Gemini 2.5 Pro
  • Preliminary: Gemini 2.5 Flash
  • Deep Research: Gemini 2.5 Pro
  • Synthesis: Gemini 2.5 Pro
  • Evaluation: GPT-4o mini

These were SOTA when we initially built the system, and they just about worked. The performance was on the edge of what we needed, and we were banking on the models improving. Now that we have moved to the current best in class, e.g. Gemini 3 Pro, our reliability and quality have improved, reducing the 're-run' rate to near zero.

For 10 agents, with each agent processing roughly 200 events a day and deep researching 30 of them, the system cost us around $5,000/month in LLM inference alone, for a single user. This is after optimising a lot of things. Originally, I wanted Opus 4 and Claude Code in each sandbox. With how much more expensive the cost per token was, and how much more token-hungry CC is, we estimated the cost would be nearly 4 times as much, a tradeoff we weren't comfortable with.
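
For intuition, here is a back-of-envelope version of that figure. The per-job costs below are illustrative assumptions chosen to roughly reproduce the ~$5,000/month number, not our actual rates:

```python
AGENTS = 10
EVENTS_PER_AGENT_PER_DAY = 200
DEEP_DIVES_PER_AGENT_PER_DAY = 30

# Assumed per-job inference costs, not actual rates.
PRELIM_COST = 0.02   # cheap model, a few minutes of work
DEEP_COST = 0.40     # expensive model, 10-60 minutes, up to 20 iterations

daily = AGENTS * (
    EVENTS_PER_AGENT_PER_DAY * PRELIM_COST
    + DEEP_DIVES_PER_AGENT_PER_DAY * DEEP_COST
)
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")   # ~$160/day, ~$4,800/month
```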

5-10 agents is what we expect the average trading desk to use. Most 'work' would struggle to justify spending that much on compute, unless it was generating value exceeding it. At $60K a year (if we priced it at cost to the customer), it's roughly a half to a third of an actual person you'd hire to do the role. We expect the value of the research produced by the agent to far exceed that, since a human can't process this much information that quickly. Our bet is to double down on LLMs' strengths: their capacity to consume!

The same system, if used to track Vinted listings, concert tickets or Jellycats, would (and has!) worked exceptionally well. But it's like bringing an RPG to a fist fight: completely the wrong tool for those hobbyist tasks. Hence, finance is the ideal domain, especially when $XXM is being invested into the opportunities surfaced. Just one deal over the course of a year can justify the value of the system.

Always on

We wanted the agents to be 'always on', constantly working and giving outputs. The challenge is that a lot of this data we couldn't subscribe to or listen for updates on through webhooks or similar. And whilst we technically could have kept sandboxes running all the time, it was already an expensive ordeal. So we opted to get as close as we could, through periodic runs that create the illusion of being 'always on.' It saves us an order of magnitude on a cost basis, so this was a reasonable trade-off.
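
A sketch of that illusion: instead of keeping sandboxes warm, a scheduler wakes each orchestrator at the times its plan cares about, runs it end to end, then tears everything down. The specific run times are illustrative.

```python
import datetime as dt

# Illustrative wake-up times; in production these anchor to market open/close.
RUN_TIMES = [dt.time(8, 30), dt.time(16, 30)]   # pre-market sweep, post-close review

def next_run(now: dt.datetime) -> dt.datetime:
    """Periodic runs instead of always-on sandboxes: find the next wake-up time."""
    for t in RUN_TIMES:
        candidate = dt.datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    return dt.datetime.combine(now.date() + dt.timedelta(days=1), RUN_TIMES[0])

# A scheduler sleeps until next_run(now), spins up the sandbox, runs the
# orchestrator end to end, persists the sandbox, and tears the machine down.
```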

Limitations

We're token HUNGRY: we consume whatever we can, as long as it helps us achieve our objective. This is a mindset, and designed as a feature of the system, for a world where inference cost looks like a rounding error. But when thinking about scale, and our current cost basis, it is a limitation.

We are overly reliant on sandboxes. We don't manage our own sandboxes, and this introduces a critical dependency. If, for some reason, an image doesn't wake up in a few seconds, or fails altogether, it is a massive problem: the agent can't run!

Our report generation / deep research process could also be better. We believe in the TTD approach for structured content, but balancing structure with novelty is hard. For this use case we want the 'outlier' insights, and we are unsure whether this is a model-intelligence problem or a process problem. At the same time, you want the agent to communicate with the user in a consistent manner each day; without that, you lose the user's trust.

What we learned

We set out to solve a very specific problem, but ended up having to study anthropology in the process. Our system is highly specialised for the type of work we claim to solve, but we have many ideas about how this applies to other fields/industries.

What we built was expensive and opinionated—but it worked.

The biggest lessons:

  • Agents perform better with persistent memory than fresh starts
  • Separating planning from execution produces better outcomes
  • The "council" approach averages out intelligence—let your best actor lead

Coordination is the hard part. The models/humans are good enough—the question is whether your system lets them work together without getting in each other's way.

Further Reading