Tests

Test Suite - A grouping of one or more tests, useful for keeping related tests organized
Test - An input combined with a grouping of checks against which an output will be evaluated
Input - The question posed to the LLM (aka question)
Output - The LLM’s response to the input (aka answer)
Check - An operator/criteria pair used to evaluate whether the output meets certain expectations
Operator - One of includes, excludes, etc. An operator is either unary (like is_safe) or binary, meaning it takes a criterion filled in by the user. Each operator is a tool we’ve defined to check one specific aspect of the output
Criteria - Free text filled in by the user that the operator will be evaluated against
Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question, and we compare that right answer to the answer the LLM produces.
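The relationships between these terms can be sketched as plain data structures. This is a hypothetical illustration, not the platform's actual schema; all class and field names are assumptions chosen to mirror the glossary.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Check:
    operator: str                   # e.g. "includes", "excludes", "is_safe"
    criteria: Optional[str] = None  # None for unary operators like is_safe

@dataclass
class Test:
    input: str                      # the question posed to the LLM
    checks: list = field(default_factory=list)
    right_answer: Optional[str] = None  # alternative evaluation path

@dataclass
class TestSuite:
    name: str
    tests: list = field(default_factory=list)

suite = TestSuite(
    name="Refund policy",
    tests=[
        Test(
            input="Can I return an opened item?",
            checks=[
                Check(operator="includes", criteria="30-day window"),  # binary
                Check(operator="is_safe"),                             # unary
            ],
        )
    ],
)
```

Note how a binary operator carries a user-supplied criterion while a unary one leaves it empty.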

Platform

Custom operator - An operator you define yourself via prompt engineering. Can be configured on the settings page.
LLM (Large Language Model) - The underlying model used to generate responses.
LLM as Judge - A process in which one LLM evaluates the output of another.
SME (Subject Matter Expert) - An expert in the domain under question (e.g. lawyer, CPA, etc.)
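One way to picture a prompt-engineered custom operator is as a judge prompt with a slot for the user's criterion. The template below is purely illustrative; the platform's real configuration format is not shown here.

```python
# Hypothetical judge-prompt template for a custom operator; the wording
# and placeholders are assumptions, not the platform's actual prompt.
CUSTOM_OPERATOR_PROMPT = """\
You are a judge. Answer PASS if the OUTPUT below {criteria},
otherwise answer FAIL. Reply with exactly one word.

OUTPUT:
{output}
"""

def render_judge_prompt(criteria: str, output: str) -> str:
    """Fill the template with the user's criterion and the model's output."""
    return CUSTOM_OPERATOR_PROMPT.format(criteria=criteria, output=output)

prompt = render_judge_prompt(
    criteria="cites a primary source",
    output="See 21 CFR 101.9 for labeling requirements.",
)
print(prompt)
```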

Runs

Run result - The result of evaluating a test suite. A run has two phases:
  1. Output gathering - The model generates responses for each test input to gather outputs
  2. Evaluation - Each check is evaluated based on the outputs
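The two phases above can be sketched as a small loop: gather every output first, then grade each check against its output. The `model` and `evaluate` callables here are hypothetical stand-ins, not the platform's real API.

```python
def run_suite(tests, model, evaluate):
    """Sketch of a run's two phases (illustrative names only)."""
    # Phase 1: output gathering — the model answers every test input
    outputs = [model(t["input"]) for t in tests]
    # Phase 2: evaluation — each check is graded against its output
    return [
        {"input": t["input"],
         "output": out,
         "check_results": [evaluate(out, c) for c in t["checks"]]}
        for t, out in zip(tests, outputs)
    ]

results = run_suite(
    tests=[{"input": "What is 2+2?", "checks": [("includes", "4")]}],
    model=lambda q: "The answer is 4.",  # stub model
    evaluate=lambda out, check: check[1] in out,  # stub "includes" grader
)
print(results[0]["check_results"])  # [True]
```

Separating the phases means all outputs exist before any grading starts, so an evaluation failure never loses already-gathered responses.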
Test result - The result of evaluating a single test
Auto eval - The result of grading an output against its respective checks
Model - The model used to produce outputs. Can be either a foundation model or your own custom model.
Eval model - Some of our operators employ LLM as judge; this is the model that will be used as the judge.
Run parameters - The knobs available to you to tune either how the evaluation is conducted (evaluation parameters) or how the model responses are produced (model parameters).
Heavyweight factor - Each auto eval will be run this many times, and the result is the mode of those runs.
Duration/Latency - The time it takes for a model to produce a response.
Confidence - When evaluating benchmark performance, we use a large, highly capable model as the primary judge for scoring responses. For confidence assessment, we employ a smaller, more efficient model that analyzes both the question and the surrounding test context to verify the coherence of each question-answer pair, providing a reliability metric for our evaluation results. High confidence indicates a high likelihood that the check has been scored correctly.
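The heavyweight factor amounts to repeating each auto eval and keeping the modal verdict, which smooths over a noisy judge. A minimal sketch, with a stub judge standing in for a real LLM-as-judge call:

```python
from collections import Counter

def auto_eval(judge, output, check, heavyweight_factor=3):
    """Run the (possibly noisy) judge `heavyweight_factor` times
    and return the most common verdict (the mode)."""
    verdicts = [judge(output, check) for _ in range(heavyweight_factor)]
    return Counter(verdicts).most_common(1)[0][0]

# Stub judge that gives a wrong verdict every third call, mimicking noise.
calls = {"n": 0}
def flaky_judge(output, check):
    calls["n"] += 1
    return calls["n"] % 3 != 0  # True, True, False, True, True, False, ...

result = auto_eval(flaky_judge, "some output", "some check", heavyweight_factor=3)
print(result)  # True — the single wrong verdict is outvoted
```

An odd heavyweight factor guarantees a strict majority for boolean verdicts, which is why the mode is well defined here.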

Human Review

Auto eval re-review (aka single run review) - A second pass by a human reviewer. Each of the checks posed to the model must be answered or flagged.
Pairwise review - Two outputs are compared side by side, and the reviewer selects the better one.
Human review template - A user-configured review question that will only be answered by a human reviewer. This lets you evaluate outputs in ways that may be hard to codify using checks alone.