Tests
Test Suite - A grouping of one or more tests, useful for keeping related tests organized
Test - An input combined with a grouping of checks against which an output will be evaluated
Input - A question posed to the LLM (aka question)
Output - The LLM’s response to the input (aka answer)
Check - An operator/criteria pair used to evaluate whether the output meets certain expectations
Operator - One of includes, excludes, etc. An operator is either unary, like is_safe, or binary (meaning it takes a criteria filled in by the user). Each operator is a tool we’ve defined to check one specific item in the output
Criteria - Free text filled in by the user that a binary operator evaluates the output against
Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question and we compare that right answer to the answer the LLM produces.
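
The relationship between tests, checks, operators, and criteria can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the evaluate_check helper, and the substring-based operator bodies are hypothetical and are not the platform's implementation. In particular, is_safe here is a toy stand-in built on a made-up blocklist.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


# Hypothetical binary operators: each takes the output plus user-supplied criteria.
def includes(output: str, criteria: str) -> bool:
    return criteria.lower() in output.lower()


def excludes(output: str, criteria: str) -> bool:
    return criteria.lower() not in output.lower()


# Toy stand-in for a unary operator: it takes only the output, no criteria.
_TOY_BLOCKLIST = {"password", "ssn"}  # assumption, for illustration only


def is_safe(output: str) -> bool:
    return not any(term in output.lower() for term in _TOY_BLOCKLIST)


BINARY_OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "includes": includes,
    "excludes": excludes,
}
UNARY_OPERATORS: Dict[str, Callable[[str], bool]] = {"is_safe": is_safe}


@dataclass
class Check:
    operator: str                   # e.g. "includes", "is_safe"
    criteria: Optional[str] = None  # required only for binary operators


@dataclass
class Test:
    input: str                      # the question posed to the LLM
    checks: List[Check] = field(default_factory=list)


@dataclass
class TestSuite:
    name: str
    tests: List[Test] = field(default_factory=list)


def evaluate_check(check: Check, output: str) -> bool:
    """Apply one check's operator to an output."""
    if check.operator in UNARY_OPERATORS:
        return UNARY_OPERATORS[check.operator](output)
    if check.operator in BINARY_OPERATORS:
        if check.criteria is None:
            raise ValueError(f"{check.operator} needs criteria")
        return BINARY_OPERATORS[check.operator](output, check.criteria)
    raise ValueError(f"unknown operator: {check.operator}")


suite = TestSuite(
    name="refund policy",
    tests=[
        Test(
            input="What is the refund window?",
            checks=[Check("includes", "30 days"), Check("is_safe")],
        )
    ],
)
answer = "Refunds are accepted within 30 days."
passed = [evaluate_check(c, answer) for c in suite.tests[0].checks]
```

Keeping unary and binary operators in separate lookup tables makes the "does this operator need criteria?" question explicit, which mirrors the unary/binary distinction in the definitions above.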
Platform
Custom operators - Operators you define via prompt engineering. Can be configured on the settings page.
LLM (Large Language Model) - The underlying model used to generate responses.
LLM as Judge - A process where one LLM evaluates the output of another LLM.
SME (Subject matter expert) - An expert in the domain under question (e.g. lawyer, CPA)
Runs
Run result - The result of evaluating a test suite. A run has two phases:
- Output gathering - The model generates a response for each test input
- Evaluation - Each check is evaluated against the gathered outputs
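
The two phases can be sketched as follows. This is a minimal illustration, not the platform's code: run_suite, fake_model, and the representation of a check as a (name, predicate) pair are all assumptions made for the example, and a canned lookup stands in for a real LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a check is a (name, predicate) pair; each test maps an input to its checks.
NamedCheck = Tuple[str, Callable[[str], bool]]


def run_suite(
    tests: Dict[str, List[NamedCheck]],
    model: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, Dict[str, bool]]]:
    # Phase 1: output gathering. Call the model once per test input.
    outputs = {question: model(question) for question in tests}

    # Phase 2: evaluation. Apply every check to the output gathered for its test.
    results = {
        question: {name: predicate(outputs[question]) for name, predicate in checks}
        for question, checks in tests.items()
    }
    return outputs, results


def fake_model(question: str) -> str:
    """Hypothetical stand-in for an LLM: returns canned answers."""
    canned = {"What is the refund window?": "Refunds are accepted within 30 days."}
    return canned.get(question, "I don't know.")


tests = {
    "What is the refund window?": [
        ("includes 30 days", lambda out: "30 days" in out),
        ("excludes guarantee", lambda out: "guarantee" not in out),
    ]
}
outputs, results = run_suite(tests, fake_model)
```

Separating the phases means all outputs exist before any check runs, so checks could be re-evaluated later without calling the model again.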