An 8-Question Checklist Before You Buy an AI Agent for Your Team

Key Takeaways

  • Most AI agent demos lie by omission. They show the happy path with one perfectly worded prompt. The honest evaluation is what happens when the prompt is messy, the data is bad, or the action fails halfway.
  • The eight questions below are the ones that separate real products from chat wrappers. If a vendor cannot answer them clearly, walk away.
  • Security and audit matter more than features. A tool you cannot trust with your CRM is not actually a tool you can use, no matter how good its outputs look.
  • Pricing models tell you what the vendor optimizes for. Per-seat pricing lets the vendor profit from idle seats. Per-action pricing ties what you pay to what you actually use.
  • The right answer for your team depends on what you do most. Writing-heavy teams need different things than ops teams or growth teams.

Why this checklist exists

Last year I sat through 11 AI agent demos in seven weeks. Every one of them went well. Every one of them looked like the answer. Three of them I bought. Two of them ended up shelfware within 90 days.

The pattern: I was asking the wrong questions in the demos. I was watching the output and thinking it was the product. The output is the easy part.

What separated the product that worked from the two that did not was infrastructure I could not see during the demo: how they handled security, what happened when an action failed, how they integrated with our existing tools, what the audit log looked like, and what they did when they were wrong.

This checklist is what I run through now before I sign any AI agent contract.


Question 1: Does it execute, or does it just draft?

The first thing to figure out is what category you are evaluating.

Some "AI agents" are essentially chat surfaces with better prompts. They write the email, they outline the report, they draft the campaign. You still have to go execute it.

Others are full execution agents. They draft, you approve, they execute. The difference is whether the work actually leaves the chat window and shows up in your CRM, your ad platform, your repo.

Ask in the demo: "Show me an action that touches a real third-party tool and changes its state." If the demo cannot do this, it is a writing assistant, not an agent.

This is not a disqualifier. Writing assistants are useful. But know what you are buying.


Question 2: How does it handle authentication to my tools?

Every AI agent that touches your stack needs credentials. The question is how it stores them and what it can do with them.

Acceptable answers:

  • OAuth where the tool supports it (HubSpot, Slack, Google, GitHub, Notion all support this)
  • Encrypted API keys with scoped permissions (most other tools)
  • A clear scoped-down permission set per integration (read-only where possible, write only where needed)

Unacceptable answers:

  • "We use the same credentials as our admin user."
  • "You give us your password."
  • Vague answers about encryption without specifics on key management.

Ask: "What happens if I disconnect a single integration? Does the rest still work?" The answer should be yes.

We covered this in more depth in Is Your AI Agent Safe?


Question 3: Does it review before it acts?

This is the single most important behavioral question.

A good agent shows you what it will do before it does it. You see the draft email, the proposed campaign change, the new GitHub branch, before any of those things go live. You approve, edit, or reject.

A bad agent fires actions and then asks for forgiveness. This looks impressive in the demo. It is a nightmare in production. Three weeks in, someone will discover that an agent paused the wrong campaign, or sent the wrong email to the wrong customer, or merged the wrong PR.

Ask: "Show me what a customer reply looks like before it goes out." The answer should be a draft you can approve. If the answer is "it just sends," walk away unless you only plan to use it for low-stakes notifications.

We wrote up the long argument for this in Don't Let Your AI Agent Act Without Asking.


Question 4: What does the audit log actually show?

Every action an agent takes should be logged. Not just "agent ran" but specifically:

  • What action was taken
  • Which integration was touched
  • What the input was
  • What the output was
  • Who approved it (if approval was required)
  • Timestamp

Without this, you cannot answer the question "did the agent do that or did a person do that?" When something breaks, you need to know. When compliance asks, you need to show.

Ask in the demo: "Show me the audit log for the last 10 actions." If the answer is fuzzy, the audit infrastructure does not exist yet, and you should not put it near anything important.
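For reference, a single entry that covers the fields above might look something like this. The field names are illustrative, not a specific vendor's schema.

    # Illustrative audit log entry. Field names are hypothetical.
    audit_entry = {
        "timestamp": "2025-06-12T14:03:22Z",
        "action": "update_deal_stage",
        "integration": "hubspot",
        "input": {"deal_id": "9127344", "stage": "contractsent"},
        "output": {"status": 200, "previous_stage": "presentationscheduled"},
        "approved_by": "maria@example.com",   # None if no approval was required
    }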


Question 5: How many integrations does it actually have, and how deep are they?

Vendors love big integration numbers. The numbers are often misleading.

The right way to evaluate: pick the three tools you actually need to connect, and ask the vendor to do a specific workflow in each one.

  • For HubSpot: "Update the deal stage on this specific deal."
  • For Stripe: "Pull all invoices from last month over $1,000."
  • For Linear: "Open a new issue in the Engineering team with priority Medium and link to this Slack thread."

Surface-level integrations only let you read basic data. Real integrations let you take granular actions. The difference is huge.
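To make "granular write action" concrete, here is roughly what the HubSpot example above looks like at the API level, using HubSpot's public CRM v3 endpoint. The deal ID, stage value, and token are placeholders.

    # Sketch of a granular write: update one deal's stage via HubSpot's CRM v3 API.
    # Deal ID, stage value, and token are placeholders.
    import requests

    def update_deal_stage(deal_id: str, stage: str, token: str) -> None:
        resp = requests.patch(
            f"https://api.hubapi.com/crm/v3/objects/deals/{deal_id}",
            headers={"Authorization": f"Bearer {token}"},
            json={"properties": {"dealstage": stage}},
            timeout=10,
        )
        resp.raise_for_status()   # surface failures instead of swallowing them

A surface-level integration cannot do this; it can only pull a list of deals and leave the update to you.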

For context: when we say Viktor connects to "3,000+ integrations," we mean it has real read and write access via real API surfaces, not just data export.


Question 6: How does pricing work, and what is the worst case?

Pricing models tell you what the vendor optimizes for.

Per-seat pricing. You pay per user, regardless of usage. Predictable. Tends to lead to "wait, are we using this?" conversations after 6 months. Vendor wins when seats sit idle.

Per-action or credit-based pricing. You pay for what gets used. The vendor wins when you use it more, which is theoretically aligned with your interest if the agent is creating real value.

Hybrid. Some flat fee plus usage. Common for enterprise.

For each model, ask: "What is the worst-case bill if my team really leans into this?" If the vendor cannot give you a confident answer, that is a signal to be cautious.

A reasonable rule of thumb: an AI agent should pay back at least 5x its monthly cost in saved hours. If the math is closer than that, you are paying for novelty.


Question 7: How does it fail, and how does it tell me when it failed?

This is the question vendors hate. It is the most useful one.

Every agent fails sometimes. The model is wrong. The integration is down. The data is unexpected. The right question is what happens then.

Acceptable failure modes:

  • The agent surfaces the failure in the channel where the work was requested
  • It does not retry destructive actions silently
  • It logs the failure with enough detail that a human can debug
  • It pings the human who approved the action rather than posting into the void

Unacceptable failure modes:

  • Silent failure, where you only find out a week later when something downstream breaks
  • Auto-retry on destructive actions
  • Generic "something went wrong" messages with no detail

Ask in the demo: "What happens if the integration times out halfway through this action?" Watch how concrete the answer is.
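If you want a mental model for a good answer, it looks something like the sketch below. The names are hypothetical; the point is the asymmetry between read-only and destructive actions.

    # Hypothetical sketch of a sane retry policy: retries for read-only actions,
    # never a silent retry for destructive ones, and every failure is surfaced.
    import time

    def run_action(action: dict, execute, notify, max_retries: int = 3):
        attempts = 1 if action.get("destructive") else max_retries
        for attempt in range(1, attempts + 1):
            try:
                return execute(action)
            except Exception as exc:
                notify(f"{action['name']} failed (attempt {attempt}): {exc!r}")
                if attempt == attempts:
                    raise                 # surfaced, never swallowed
                time.sleep(2 ** attempt)  # back off before a safe retry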


Question 8: Where does it live in our team's day?

This is the cultural question. It is the one that determines whether your team will actually use it.

If the agent lives in a separate web app you have to remember to open, adoption will be poor. People will use it for the first week and then forget.

If the agent lives in Slack or Microsoft Teams (where conversations are already happening), it gets used. People @mention it the way they would @mention a colleague.

If your team is heavy on email, an agent that integrates directly with Gmail is the better fit. If your team is heavy on calendar, an agent that lives inside your calendar tool is.

Match the surface to where work actually happens. The best AI agent in the wrong surface gets used none of the time.


What good answers look like, summarized

  • Does it execute? Good answer: demos a real action against a real tool. Walk-away signal: only shows drafts in chat.
  • Authentication? Good answer: OAuth, scoped, per-integration. Walk-away signal: password sharing, vague encryption claims.
  • Review-first? Good answer: draft → approve → execute. Walk-away signal: fires actions silently.
  • Audit log? Good answer: detailed log of every action. Walk-away signal: "we have logs" with no specifics.
  • Integration depth? Good answer: demonstrates granular write actions. Walk-away signal: read-only or surface-level.
  • Pricing? Good answer: clear worst-case math. Walk-away signal: cannot give a worst-case.
  • Failure handling? Good answer: surfaces failures, no silent retries. Walk-away signal: generic error messages.
  • Surface? Good answer: lives where work happens. Walk-away signal: yet another tab.

What about ROI calculations?

Vendors will offer to do these for you. They are usually generous to themselves.

Run your own. Pick five workflows your team actually does every week. For each one, write down the average time it takes today and the average time it would take if the agent did the work and a human reviewed it. Multiply by frequency and your blended hourly cost.

If the answer is more than 5x the agent's price, it is probably worth piloting. If the answer is less, you are paying for the wrong workflows.
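Here is the back-of-the-envelope version of that calculation with made-up numbers, just to show the shape of the math.

    # Back-of-the-envelope ROI check. All numbers are made up for illustration.
    workflows = [
        # (name, runs per month, minutes today, minutes with agent + review)
        ("triage support emails",   120, 12, 3),
        ("update CRM after calls",   80, 10, 2),
        ("weekly metrics summary",     4, 90, 20),
        ("post-meeting follow-ups",  60,  8, 2),
        ("invoice chasing",          30, 15, 4),
    ]
    hourly_cost = 75      # blended hourly cost in dollars
    agent_price = 500     # agent's monthly price in dollars

    saved_hours = sum(runs * (before - after) / 60 for _, runs, before, after in workflows)
    payback = saved_hours * hourly_cost / agent_price
    print(f"{saved_hours:.0f} hours saved per month, {payback:.1f}x payback")
    # Pilot it if payback is comfortably above 5x; otherwise pick different workflows.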


Frequently Asked Questions

Should I run a pilot before signing an annual contract? Yes. A 30-day pilot with one team and three workflows tells you more than three vendor demos. Most credible vendors will offer one.

How do I know if the agent is actually saving us time? Track time-to-completion on the workflows you piloted. Compare before and after. The honest answer is sometimes "it is the same, but the work is more consistent." That is still worth something, but it is a different sale.

Do I need a technical person on the evaluation? For the security, integration, and audit questions, yes. For the workflow and ROI questions, the operator who feels the pain should drive.

What if my team uses Microsoft Teams and not Slack? The same questions apply. Make sure the vendor supports your surface natively. We covered Microsoft Teams agents in Best AI Agents for Microsoft Teams.

Is Viktor a good fit for our team? Honestly, it depends on the workflows. If you are mostly doing writing work, ChatGPT Teams may be enough. If you have repetitive cross-tool ops work that nobody owns, Viktor is built for exactly that. Free credits to try, no card required.


Viktor is an AI coworker that lives in Slack, connects to 3,000+ integrations, executes after human approval, and logs every action. Add Viktor to your workspace, free to start →