What Breaks When Your Agent Has 100,000 Tools

Most AI agents demo well and fall apart in production. We've spent the past year building an AI coworker that lives in Slack, connects to your company's tools, and automates real work. Here's what we learned about agent architecture along the way.

Intelligence is not the bottleneck -- tool use is

Every few months a new frontier model drops with a 20% benchmark improvement, and our agent gets smarter overnight without us writing a line of code. That's great, but intelligence was never the real bottleneck.

The bottleneck is tool use. An AI that can reason brilliantly about your marketing spend is useless if it can't call the Meta Ads API. An AI that writes perfect status updates is useless if it can't post to Slack. The unlock isn't making the model smarter -- it's giving it hands.

We support ~3,000 integrations, each bringing anywhere from 10 to 100+ individual tools. A single user who connects Notion, Linear, HubSpot, and Gmail might give the agent access to 200+ tools -- already more than 99% of ChatGPT users ever connect, even though ChatGPT has integrations too. The difference between "theoretically supports tools" and "actually connected to your tools" is the difference between a toy and a product.

But that raises an obvious question: how do you expose an agent to tens of thousands of potential tools without blowing up its context window?

The context window is prime real estate

The naive approach is to describe every available tool in the system prompt so the model knows what it can do. This is catastrophically wasteful.

We went through three iterations:

  1. Everything in context. Hundreds of tool schemas dumped into the system prompt. Slow, expensive, and the model got confused about which tool to use.

  2. Search-based discovery. Tools live in files, and the agent searches for them when needed. Problem: the agent doesn't know what it doesn't know. If you ask about the weather, it won't think to grep for a "web search" function.

  3. One-line summaries with lazy loading. Each capability gets a single-line description in the system prompt -- we call these "skills" (a pattern that's become common in agent frameworks, though we use it in some novel ways). We have ~18 core skills, plus one for every integration the user connects. When the agent decides it needs one, it reads the full skill file in one step: detailed instructions, code examples, known gotchas, and the right function signatures to call. A user with 50 integrations has ~68 skills, but that's still just 68 lines of context. Maximum discoverability, minimum cost.

    The important nuance: when you connect a new integration, the agent explores it first. It tests the available API endpoints, discovers your team's IDs and project names, figures out what works and what doesn't, and writes all of this into a new skill file. The next time any agent invocation needs that integration, it doesn't search the codebase or guess at function signatures -- it reads the skill and immediately knows how to write the right code. This is strictly better than search-based discovery because the agent doesn't need to formulate a query for something it doesn't know exists yet.
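The lazy-loading loop can be sketched in a few lines. This is a minimal illustration, not our implementation -- the function names and the one-level directory layout are assumptions: `skill_summaries` builds the one-line index that goes into the system prompt, and `load_skill` pages in the full file only on demand.

```python
from pathlib import Path

def skill_summaries(skills_dir: Path) -> str:
    """Build the one-line-per-skill index injected into the system prompt.

    Each skill contributes only its title line, so 68 skills cost
    just 68 lines of context."""
    lines = []
    for f in sorted(skills_dir.glob("*.md")):
        title = f.read_text().splitlines()[0].lstrip("# ").strip()
        lines.append(f"- {f.stem}: {title}")
    return "\n".join(lines)

def load_skill(skills_dir: Path, name: str) -> str:
    """Page the full skill file in only when the agent decides it needs it."""
    return (skills_dir / f"{name}.md").read_text()
```

The index is cheap to rebuild on every invocation, so newly connected integrations show up immediately without any prompt changes.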

The general principle: treat your context window like RAM in a memory-constrained system. Page things in only when needed. Keep the hot path small.

Code is the best tool-calling interface

Standard tool calling (JSON schemas, function calling APIs) works fine for 10-20 tools. It completely breaks down at scale. You can't put 500 tool schemas in context, and even if you could, the model would struggle to pick the right one.

Our solution: the agent writes code. Instead of calling a send_email tool through a structured API, it writes a Python script that imports a send_email function and calls it. This sounds like a hack, but it's actually strictly superior:

  • Composition. The agent can call three tools in a for loop, filter results with conditionals, and handle errors -- all in one turn. With structured tool calling, each of these would be a separate round trip.
  • Discoverability. The agent can browse a directory of available functions the same way a human developer would. It reads the module, sees the function signatures, and figures out how to use them.
  • Scalability. Adding a new tool means adding a Python function with a docstring. No schema changes, no prompt engineering.
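To make the contrast concrete, here's the kind of script the agent might write. The tool functions below are stubs standing in for generated API wrappers (the real ones would call Linear and an email provider); the point is that search, filter, loop, and error handling compose in a single turn.

```python
# Stubs standing in for generated tool modules (illustrative, not real APIs).
def search_issues(query: str) -> list[dict]:
    return [{"id": "ENG-1", "title": "Fix login", "stale_days": 12},
            {"id": "ENG-2", "title": "Ship v2", "stale_days": 1}]

def send_email(to: str, subject: str, body: str) -> None:
    print(f"email to {to}: {subject}")

def nudge_stale_issues(threshold_days: int = 7) -> list[str]:
    """One agent-written script: search, filter, act -- all in one turn.
    With structured tool calling, each step would be a separate round trip."""
    nudged = []
    for issue in search_issues("assignee:me"):
        if issue["stale_days"] >= threshold_days:
            try:
                send_email("me@example.com",
                           f"Stale issue {issue['id']}",
                           f"'{issue['title']}' has had no activity for "
                           f"{issue['stale_days']} days.")
                nudged.append(issue["id"])
            except Exception:
                continue  # one failed send shouldn't abort the whole batch
    return nudged
```

With a function-calling API, the loop, the threshold comparison, and the error recovery would each cost a model round trip; as code, the model pays for them once.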

LLMs are trained on enormous amounts of code. They're already good at this. Leaning into that strength -- treating the agent as a developer rather than a tool-caller -- was one of our best decisions.

Memory through plain text files

LLMs are stateless. There are many approaches to giving agents memory -- vector databases, RAG pipelines, summary-based context injection, persistent scratchpads. We tried most of them and landed on something surprisingly simple: markdown files on a shared filesystem.

When our agent explores a new integration -- say, your Linear account -- it writes down what it learned into that integration's skill file. The file structure looks roughly like this:

/skills/
├── linear.md          # Team IDs, project names, tips, broken endpoints
├── notion.md          # Workspace structure, key page IDs, usage patterns
├── hubspot.md         # Contact properties, pipeline stages, gotchas
├── browser.md         # How to use the browser API, form filling patterns
├── scheduled_crons.md # How to create and manage automations
└── ...

Each file accumulates institutional knowledge over time. A simplified example:

# Linear

## Teams
- Engineering (ID: eng-abc) -- used for most issues
- Design (ID: des-xyz) -- only for design-specific work

## Tips
- Always use "To Do" status, not "Triage"
- The list_labels endpoint is currently broken; use search_issues instead
- Peter prefers issues assigned to him to include a deadline

This doubles as self-healing. If an API call fails, the agent updates the file so future invocations don't repeat the mistake. If a user says "always put issues in To Do, not Triage," that preference gets appended. It's version-controlled institutional memory in a format the model already understands natively.
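A minimal sketch of that self-healing append, assuming the file layout above; the helper name and section convention are our illustration, not the actual implementation.

```python
from pathlib import Path

def record_tip(skill_file: Path, tip: str, section: str = "## Tips") -> None:
    """Append a learned correction to a skill file so no future invocation
    repeats the mistake. Idempotent: an already-recorded tip is skipped.
    New tips land at the top of the section."""
    text = skill_file.read_text() if skill_file.exists() else ""
    if tip in text:
        return
    if text and not text.endswith("\n"):
        text += "\n"
    if section not in text:
        text = text + ("\n" if text else "") + section + "\n"
    head, _, tail = text.partition(section + "\n")
    skill_file.write_text(head + section + "\n" + f"- {tip}\n" + tail)
```

The idempotence check matters: an agent that hits the same broken endpoint twice should not fill the file with duplicate warnings.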

The key insight is that the filesystem is shared across the whole team. Every agent invocation -- regardless of which user triggered it -- reads and writes to the same skill files. One person's correction benefits everyone.

We tried more sophisticated approaches. They all performed worse than plain text that the model can read and write directly.

Proactive agents are a UX minefield

Most AI products are reactive: you ask a question, you get an answer. We wanted our agent to act on its own -- reading Slack messages, suggesting automations, following up on unanswered questions.

This is conceptually exciting and practically treacherous. The failure modes are social, not technical:

  • Too aggressive: The agent appears in every Slack thread with unsolicited opinions. People hate it.
  • Too generic: "Have you considered automating your workflow?" Thanks, very helpful.
  • Wrong audience: Posting a bot message in #general where the CEO sees it before anyone has context on what this thing is.

We learned to start conservatively. During the first few days after install, the agent introduces itself in small channels (not #general) with concrete examples relevant to that channel's topic. It reads messages four times a day but mostly just reacts with emoji and answers questions that have gone unanswered for 2+ hours. Low-stakes, high-signal actions that build trust before attempting anything ambitious.

The harder problem is suggesting automations that are actually useful. We run a workflow where the agent reads Slack history, cross-references available integrations, and proposes personalized automations to team members. Honestly, it still often suggests generic things. Making this specific and genuinely helpful is an open problem we're actively working on.

The economics of agent crons

Users can create scheduled automations with natural language: "Every morning at 9am, check the weather in Munich and post it in my DMs." This creates a cron that spins up an LLM-powered agent run on schedule.

We learned the hard way that this needs cost guardrails. One early user set up a cron running every 5 minutes that cost ~$5,000/month and did nothing useful.

The solution is a cost hierarchy:

  1. Script crons: Pure Python, no LLM calls. The agent writes the automation code once; it runs forever for nearly free. Example: an outage detector that checks 10 provider status pages every minute. The agent did the creative work (finding the right endpoints, writing the check logic); now it runs as a script.

  2. Conditional agent crons: A cheap Python condition check (is there a new message? did this file change?) runs first. The expensive LLM agent only spins up if the condition is met.

  3. Full agent crons: LLM runs every time. Expensive but sometimes necessary for tasks that genuinely require reasoning.
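Tier 2 is the interesting one: a near-free guard in front of an expensive run. A minimal sketch of the gating logic, with names that are ours:

```python
from typing import Callable

def conditional_cron(check: Callable[[], bool],
                     run_agent: Callable[[], None]) -> bool:
    """Tier 2 in the hierarchy: a cheap pure-Python predicate runs on
    every tick; the expensive LLM invocation only starts when it fires.
    Returns whether the agent actually ran."""
    if not check():
        return False
    run_agent()
    return True
```

Tiers 1 and 3 fall out as degenerate cases: a script cron is `check` doing all the work with no agent behind it, and a full agent cron is `check=lambda: True`.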

We even had the agent analyze its own spending and suggest where it could downgrade from option 3 to option 1. It worked surprisingly well -- turns out LLMs are decent at optimizing their own resource usage if you ask them to.

The general pattern: use intelligence once to create automation that runs forever without intelligence. The best agent invocation is the one that makes future agent invocations unnecessary.

Thread routing: making stateless feel stateful

Our agent lives in Slack, where conversations happen across DMs, threads, channels, and reactions. An LLM has a single linear context window. Making these two models coexist gracefully is harder than it sounds.

The interesting problem: a user DMs the agent a question, gets an answer in a thread, then sends a new top-level DM with a follow-up. These are two separate conversations from the system's perspective, but one continuous conversation from the user's perspective.

We solve this with forwarding logic. The new agent invocation checks recent DM history, determines if the message is a follow-up, and routes it to the original conversation where all the context already exists. The user never sees any of this complexity -- they just DM naturally and get coherent responses.
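A sketch of that forwarding decision. The 30-minute window and the field names are assumptions, and the keyword-overlap check is a naive stand-in where the real system would ask the model whether the new DM continues a recent thread.

```python
from datetime import timedelta

FOLLOW_UP_WINDOW = timedelta(minutes=30)  # assumed heuristic window

def is_follow_up(text: str, summary: str) -> bool:
    """Stand-in for an LLM judgment call: naive keyword overlap."""
    words = {w.lower().strip("?.,!") for w in text.split()}
    return bool(words & {w.lower() for w in summary.split()})

def route_message(new_msg: dict, recent_threads: list[dict]):
    """Decide where a new top-level DM belongs. If a recent thread exists
    and the message reads like a continuation, return that thread's id so
    the invocation runs with the original context; otherwise return None
    and start a fresh conversation."""
    for thread in recent_threads:  # newest first
        recent = new_msg["ts"] - thread["last_ts"] <= FOLLOW_UP_WINDOW
        if recent and is_follow_up(new_msg["text"], thread["summary"]):
            return thread["id"]
    return None
```

Routing happens before the main agent run, so a misrouted follow-up costs one cheap check rather than a confused conversation.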

We also handle the messier Slack interactions that most agent frameworks ignore. If a user deletes a message, the agent is informed the user lost interest and should probably stop working on that task. If a user clicks an approval button and then un-approves, the agent is told the user changed their mind. Edited messages are treated as corrections. These feel like edge cases until you realize they happen constantly in real Slack usage, and an agent that ignores them feels broken in subtle, trust-eroding ways.

The lesson: the hard problems in agent engineering are often about routing, state management, and UX -- not model intelligence. Getting the plumbing right matters more than shaving milliseconds off inference.

What's actually hard

Agent engineering forces you to think at an unusual level of abstraction. Every decision has to work across a million different situations: different tools, different team structures, different communication styles, different preferences. You can't hardcode workflows because every team is different.

The meta-skill is finding the right level of specificity. Skill files need to be specific enough to be useful but general enough to transfer across situations. Proactive behaviors need to be assertive enough to deliver value but restrained enough not to annoy. Context injection needs to inform without overwhelming.

We're still early. But the compounding is real -- every model improvement, every prompt refinement, every new skill we add benefits every user simultaneously. The bet is simple: models will keep getting smarter, and if we build the right scaffolding around them, the gap between "AI assistant" and "AI coworker" closes fast.


Viktor is an AI coworker that lives in Slack, connects to 3,000+ integrations, and does real work for your team. Try Viktor →