
Prompt Testing & Regression

Tests are not just verification — they are the safety net for evolving AI specifications.

In Genum, testing is a core capability. Every time you define a prompt, you also pin down its expected behavior in tests. As your prompts evolve, it's critical to ensure that their behavior remains stable, even when models or parameters change.

This is where regression testing becomes essential.


Why Testing Matters

Regression-proof your specs
Ensure that changes to prompts or model configurations don't break expected logic or outputs.

Behavioral alignment
Test cases verify not only syntax but also semantics — does the model still behave as intended?

Build trust in automation
With reliable test coverage, you can safely iterate, tune, and deploy prompts into production environments.


How to Create Test Cases

From Playground Output

After running a prompt in the Playground and reviewing the output:

  1. Click Save as Expected if the result is valid.
  2. Then click Create Test Case to capture the prompt, input, and expected output.

These test cases are tied to the specific prompt specification.
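Conceptually, a test case freezes the prompt, its input, and the expected output into one snapshot tied to a prompt version. A minimal sketch in Python (the `TestCase` type and its field names are illustrative, not Genum's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    """A frozen snapshot tying a prompt specification to one input/output pair."""
    prompt_id: str       # which prompt specification this test belongs to
    prompt_version: str  # the spec version the expectation was captured from
    input_text: str      # the input used in the Playground run
    expected: str        # the output saved via "Save as Expected"

# Capturing a reviewed Playground result as a test case:
case = TestCase(
    prompt_id="support-triage",
    prompt_version="v3",
    input_text="My invoice is wrong",
    expected="Category: billing",
)
```

Because the dataclass is frozen, the expectation cannot drift silently after capture; a new expectation means a new test case.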


From Logs

You can also create test cases from execution logs:

  • If an agent run or API call shows meaningful output,
  • And you'd like to freeze it for future checks,
  • You can convert the log entry into a test case.

This is especially useful for:

  • Auditing unexpected results
  • Testing corner cases from production
  • Backfilling test coverage

Running Regression Tests

Once test cases are in place, you can:

  • Run all test cases before committing changes
  • Re-run specific tests during tuning
  • Compare outputs to expected results using AI, strict, or manual assertions

Assertion Modes

  • AI – Semantic similarity via LLM
  • Strict – Exact match on output
  • Manual – Human-reviewed assertions
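The three modes differ only in how the actual output is compared to the expectation. A minimal sketch of the dispatch, assuming a naive token-overlap stub in place of the real LLM similarity judge (mode names mirror the list above; everything else is illustrative):

```python
def assert_output(mode: str, actual: str, expected: str) -> bool:
    """Compare actual vs. expected output under the given assertion mode."""
    if mode == "strict":
        # Exact match: any character difference fails.
        return actual == expected
    if mode == "ai":
        # Semantic similarity: in practice an LLM judges equivalence;
        # here a crude token-overlap ratio stands in for that call.
        a, e = set(actual.lower().split()), set(expected.lower().split())
        return len(a & e) / max(len(a | e), 1) >= 0.5
    if mode == "manual":
        # Manual assertions are resolved by a human reviewer, not by code.
        raise NotImplementedError("manual mode requires human review")
    raise ValueError(f"unknown assertion mode: {mode}")
```

Strict mode is the right default for structured outputs (JSON, labels); AI mode tolerates paraphrase in free-text answers at the cost of determinism.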

Memory Key Functionality in Tests

You can use memory keys in your test cases to verify:

  • That the prompt behaves correctly based on memory-driven context
  • That client-specific or scenario-specific responses are reliably produced

Memory in Genum is not static storage — it's a programmable extension to the prompt logic and should be tested like any other input.

See more on memory
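One practical pattern is to pin a memory scenario alongside each input, so the same prompt is tested once per memory key. A toy stand-in for a memory-aware run (illustrative only; Genum resolves memory keys on its own side):

```python
def run_with_memory(prompt: str, user_input: str, memory: dict) -> str:
    """Toy stand-in for a memory-aware prompt run: memory values are
    injected into the context before the model sees the input."""
    tone = memory.get("tone", "neutral")
    return f"[{tone}] {prompt}: {user_input}"

# One test case per memory scenario: same prompt and input,
# different memory key, different expected output.
scenarios = [
    ({"tone": "formal"}, "[formal] summarize: Q3 numbers"),
    ({"tone": "casual"}, "[casual] summarize: Q3 numbers"),
]
for memory, expected in scenarios:
    assert run_with_memory("summarize", "Q3 numbers", memory) == expected
```

This makes memory a first-class test input: a regression in memory handling fails a scenario even when the prompt text itself is unchanged.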


Promote with Confidence

Before shipping a prompt to production:

  1. Run your full regression suite
  2. Validate that no critical behavior has regressed
  3. Commit the updated prompt version with confidence
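The checklist above amounts to a simple gate: run everything, and block promotion on any failure. A minimal sketch, assuming the suite runner reports each case as a name/pass pair (the reporting shape is an assumption, not Genum's actual output):

```python
def regression_gate(results: list[tuple[str, bool]]) -> bool:
    """Return True only if every test case passed; report failures by name.

    `results` pairs a test-case name with its pass/fail outcome --
    a stand-in for whatever the suite runner actually reports.
    """
    failures = [name for name, passed in results if not passed]
    for name in failures:
        print(f"REGRESSION: {name}")
    return not failures

# Promote only when the full suite is green:
suite = [("billing-case", True), ("refund-case", True)]
assert regression_gate(suite)  # safe to commit the new prompt version
```

Wiring a gate like this into CI turns "test before you promote" from a habit into an enforced rule.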

In Genum, tests are not optional. They are the quality framework for AI logic.

Stay safe, stable, and reliable — test before you promote. ✅