AgenticAssure
The pytest for AI agents. Test your LLM-powered agents with YAML scenarios, structured scoring, and CLI-first reporting. Open source. Framework-agnostic. Ship with confidence.

Features

Scorers: passfail, exact, regex, similarity
Reports: CLI (Rich), HTML, JSON
Adapters: plug in any agent (OpenAI, LangChain, custom)
Config: retries, timeouts, fail_fast, tag filtering

Getting Started
pip install agenticassure

Initialize a new project and run your first suite:

agenticassure init my-tests
cd my-tests
agenticassure run scenarios/ --adapter my_agent.Agent

How It Works
1. Define Scenarios
Write test cases in simple YAML files. Each scenario specifies an input prompt, the expected output or tool calls, and which scorers to use. No code required for the test definitions themselves.
2. Write an Adapter
Create a small wrapper class with a single run() method that connects AgenticAssure to your agent. Works with any framework — OpenAI, LangChain, or your own custom setup.
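An adapter might look like the following minimal sketch. The class name, the `run()` signature, and the dict it returns are assumptions for illustration, not the library's actual protocol:

```python
class EchoAgentAdapter:
    """Wraps a trivial 'agent' so a test runner can drive it.

    A real adapter would call your LLM or agent framework here
    (OpenAI, LangChain, a custom stack) and translate its response.
    The returned dict shape is a hypothetical example.
    """

    def run(self, prompt: str) -> dict:
        response = f"You said: {prompt}"
        return {
            "output": response,   # text for the scorers to inspect
            "tool_calls": [],     # tools the agent invoked, if any
        }
```

Because the wrapper is this small, swapping frameworks means changing only the body of `run()`, never the test suite itself.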
3. Run and Report
Execute your scenarios from the CLI or Python. AgenticAssure runs each test, scores the results, and generates structured reports in your terminal, as HTML, or as JSON.
Example Scenario
A single YAML file defines your entire test suite. Each scenario is self-contained — input, expectations, and scoring rules all in one place.
suite:
  name: customer-support-agent
  config:
    default_timeout: 30
    retries: 1
    default_scorers: ["passfail"]
  scenarios:
    - name: Basic greeting
      input: "Hello, who are you?"
      expected_output: "hello"
      tags: [basic]
    - name: Order lookup
      input: "Look up order ORD-001"
      expected_tools: [get_order]
      expected_tool_args:
        get_order:
          order_id: "ORD-001"
    - name: Return policy (semantic match)
      input: "What is your return policy?"
      expected_output: "We offer a 30-day return policy."
      scorers: [similarity]
      metadata:
        similarity_threshold: 0.8

Scorers
Every scenario is evaluated by one or more scorers. A test passes only when all its scorers pass. Mix and match them per scenario.
PassFail
The default scorer. Checks that output exists, expected tools were called with the right arguments, and the expected output appears in the response.
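The checks above can be sketched as a single function. This is an illustrative re-implementation, not the library's code; the field names (`output`, `tool_calls`, `expected_tools`, and so on) are assumptions modeled on the scenario YAML:

```python
def passfail(result: dict, expected: dict) -> bool:
    """Pass when output exists, expected text appears in it, and
    every expected tool was called with the right arguments."""
    output = result.get("output")
    if not output:
        return False
    # Expected substring must appear (case-insensitive), if given.
    want = expected.get("expected_output")
    if want and want.lower() not in output.lower():
        return False
    # Every expected tool must have been called with matching args.
    calls = {c["name"]: c.get("args", {}) for c in result.get("tool_calls", [])}
    for tool in expected.get("expected_tools", []):
        if tool not in calls:
            return False
        for key, value in expected.get("expected_tool_args", {}).get(tool, {}).items():
            if calls[tool].get(key) != value:
                return False
    return True
```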
Exact Match
Strict string comparison. The agent's output must match the expected output exactly — useful for deterministic responses.
Regex
Pattern matching against the agent's output. Define a regex in the scenario metadata to validate structure, formats, or specific content patterns.
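The core of such a check is a one-liner. In this sketch the metadata key `regex` is an assumption, not necessarily the key the library reads:

```python
import re

def regex_scorer(output: str, metadata: dict) -> bool:
    """Pass when the pattern from scenario metadata matches the output."""
    pattern = metadata["regex"]  # hypothetical metadata key
    return re.search(pattern, output) is not None
```

For example, a pattern like `ORD-\d{3}` could verify that the agent's reply actually contains a well-formed order ID.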
Similarity
Semantic comparison using cosine similarity via sentence-transformers. Set a threshold to control how close the meaning needs to be. Great for natural language responses.
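The thresholding logic behind this scorer can be sketched as follows. In the real scorer the vectors would be sentence-transformers embeddings of the agent output and the expected output; here plain lists of floats stand in for them:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_scorer(vec_output, vec_expected, threshold: float = 0.8) -> bool:
    # Pass when the embeddings are close enough in meaning,
    # as controlled by the scenario's similarity_threshold.
    return cosine_similarity(vec_output, vec_expected) >= threshold
```

Raising the threshold demands a closer paraphrase; lowering it tolerates looser rewordings, which is usually what you want for free-form natural language answers.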