AgenticAssure

The pytest for AI agents. Test your LLM-powered agents with YAML scenarios, structured scoring, and CLI-first reporting. Open source. Framework-agnostic. Ship with confidence.

Features

- Scorers: built-in passfail, exact, regex, and similarity
- Reports: CLI (Rich), HTML, and JSON formats
- Adapters: plug in any agent (OpenAI, LangChain, or custom)
- Config: retries, timeouts, fail_fast, tag filtering

Getting Started

pip install agenticassure

# Initialize a new project:

agenticassure init my-tests
cd my-tests
agenticassure run scenarios/ --adapter my_agent.Agent

How It Works

1. Define Scenarios

Write test cases in simple YAML files. Each scenario specifies an input prompt, the expected output or tool calls, and which scorers to use. No code required for the test definitions themselves.

2. Write an Adapter

Create a small wrapper class with a single run() method that connects AgenticAssure to your agent. Works with any framework — OpenAI, LangChain, or your own custom setup.
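A minimal adapter might look like the sketch below. The single `run()` method is from the docs above, but the return shape (a dict with `output` and `tool_calls`) and the stubbed agent call are assumptions for illustration, not the library's documented interface:

```python
class Agent:
    """Illustrative adapter connecting AgenticAssure to your agent.

    The dict return shape is an assumption; check your installed
    version for the real contract.
    """

    def run(self, prompt: str) -> dict:
        # Invoke your agent however it is normally called. Here we
        # fake a reply so the sketch is self-contained and runnable.
        reply = f"Echo: {prompt}"
        tool_calls = []  # e.g. [{"name": "get_order", "args": {...}}]
        return {"output": reply, "tool_calls": tool_calls}
```

Replace the body of `run()` with a call into OpenAI, LangChain, or your own stack; the point is that the harness only ever sees this one method.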

3. Run and Report

Execute your scenarios from the CLI or Python. AgenticAssure runs each test, scores the results, and generates structured reports in your terminal, as HTML, or as JSON.

Example Scenario

A single YAML file defines your entire test suite. Each scenario is self-contained — input, expectations, and scoring rules all in one place.

suite:
  name: customer-support-agent
  config:
    default_timeout: 30
    retries: 1
    default_scorers: ["passfail"]

scenarios:
  - name: Basic greeting
    input: "Hello, who are you?"
    expected_output: "hello"
    tags: [basic]

  - name: Order lookup
    input: "Look up order ORD-001"
    expected_tools: [get_order]
    expected_tool_args:
      get_order:
        order_id: "ORD-001"

  - name: Return policy (semantic match)
    input: "What is your return policy?"
    expected_output: "We offer a 30-day return policy."
    scorers: [similarity]
    metadata:
      similarity_threshold: 0.8

Scorers

Every scenario is evaluated by one or more scorers. A test passes only when all its scorers pass. Mix and match them per scenario.
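The all-scorers-must-pass rule reduces to a simple conjunction. The dispatch table below is a sketch under assumed semantics (the scorer names come from this page, but the internal mechanism is guessed for illustration):

```python
# Hypothetical scorer dispatch: each scorer maps (output, expected) -> bool.
SCORERS = {
    "passfail": lambda out, exp: bool(out) and (exp or "") in out,
    "exact": lambda out, exp: out == exp,
}

def evaluate(output: str, expected: str, scorer_names: list) -> bool:
    # A scenario passes only when every selected scorer passes.
    return all(SCORERS[name](output, expected) for name in scorer_names)
```

So a scenario tagged with both `passfail` and `exact` fails unless the output contains the expected text and matches it verbatim.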

PassFail

The default scorer. Checks that output exists, expected tools were called with the right arguments, and the expected output appears in the response.
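The checks above can be sketched as follows. This is an illustrative re-implementation, not the library's code; in particular, the case-insensitive substring match and the omission of tool-argument checking are simplifying assumptions:

```python
def passfail(output, expected_output=None, expected_tools=None,
             called_tools=None):
    """Illustrative pass/fail logic: output exists, expected tools were
    called, and the expected text appears in the response.
    (Tool-argument matching is omitted here for brevity.)"""
    if not output:
        return False
    if expected_output is not None and \
            expected_output.lower() not in output.lower():
        return False
    if expected_tools and not set(expected_tools) <= set(called_tools or []):
        return False
    return True
```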

Exact Match

Strict string comparison. The agent's output must match the expected output exactly — useful for deterministic responses.

Regex

Pattern matching against the agent's output. Define a regex in the scenario metadata to validate structure, formats, or specific content patterns.
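In essence, a regex scorer is a search over the agent's output, as in this sketch (the function name is illustrative):

```python
import re

def regex_scorer(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the agent's output."""
    return re.search(pattern, output) is not None
```

For example, the pattern `ORD-\d{3}` would validate that an order ID in the expected format appears somewhere in the response.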

Similarity

Semantic comparison using cosine similarity via sentence-transformers. Set a threshold to control how close the meaning needs to be. Great for natural language responses.
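In practice the embeddings come from sentence-transformers; the comparison itself reduces to cosine similarity against a threshold, sketched here in plain Python (the function names are illustrative, and real embedding vectors would have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_scorer(emb_output, emb_expected, threshold=0.8):
    """Pass when the embeddings are at least `threshold` similar."""
    return cosine_similarity(emb_output, emb_expected) >= threshold
```

A threshold of 0.8, as in the example scenario above, accepts paraphrases of the expected answer while rejecting responses about a different topic.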