Glossary
Tests
Check - An operator/criteria pair
Criteria - Free text filled in by the user that the operator evaluates the output against
Input - A question posed to the LLM (aka question)
Operator - One of includes, excludes, etc. An operator is either unary (like is_safe) or binary (meaning it takes a criteria filled in by the user). Each operator is a tool we’ve defined to check one specific aspect of the output.
Output - The LLM’s response to the input (aka answer)
Test - An input combined with one or more checks against which an output will be evaluated (see the data-model sketch at the end of this section)
Test Suite - A grouping of one or more tests, useful for keeping related tests organized
Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question and we compare that right answer to the answer the LLM produces.
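The sketch below shows roughly how these terms fit together as a data model. It is a minimal illustration in Python; the dataclass and field names are hypothetical and not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Check:
    """An operator/criteria pair."""
    operator: str                    # e.g. "includes", "excludes", "is_safe"
    criteria: Optional[str] = None   # free text; None for unary operators like is_safe

@dataclass
class Test:
    """An input plus the checks its output will be evaluated against."""
    input: str                           # the question posed to the LLM
    checks: list[Check] = field(default_factory=list)
    right_answer: Optional[str] = None   # optional reference answer to compare against

@dataclass
class TestSuite:
    """A grouping of related tests."""
    name: str
    tests: list[Test] = field(default_factory=list)

# Example: one test with a binary check and a unary check
suite = TestSuite(
    name="refund-policy",
    tests=[
        Test(
            input="What is the refund window?",
            checks=[
                Check(operator="includes", criteria="30 days"),
                Check(operator="is_safe"),
            ],
        )
    ],
)
```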
Platform
Custom operators - Operators you define yourself via prompt engineering. Configured on the settings page.
LLM - Large language model
LLM as judge - A process by which one LLM is used to grade the response of another LLM (see the sketch at the end of this section).
SME - Subject matter expert - A specialist in the domain in question (e.g. a lawyer or CPA)
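As a rough illustration of LLM as judge, the sketch below has one model grade another model's output against a single check. Both call_model and the grading prompt are hypothetical stand-ins, not the platform's actual client or prompt.

```python
def call_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in: replace with a real LLM client call.
    raise NotImplementedError

def judge(eval_model: str, operator: str, criteria: str, output: str) -> bool:
    # Ask the eval model (the judge) whether the output satisfies one check.
    prompt = (
        "You are grading an LLM response.\n"
        f"Check: does the response satisfy the operator '{operator}' "
        f"with criteria '{criteria}'?\n"
        f"Response:\n{output}\n"
        "Answer PASS or FAIL."
    )
    verdict = call_model(eval_model, prompt)
    return verdict.strip().upper().startswith("PASS")
```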
Runs
Auto eval - The result of grading an output against its respective checks
Confidence - When evaluating benchmark performance, a large, highly capable model serves as the primary judge for scoring responses. For confidence, a smaller, more efficient model analyzes the question and surrounding test context to verify the coherence of each question-answer pair, providing a reliability metric for the evaluation results. A high confidence indicates a high likelihood that the check has been scored correctly.
Duration/Latency - The time it takes for a model to produce a response.
Eval model - Some of our operators employ LLM as judge. This is the model that will be used as judge.
Heavyweight factor - Each auto eval will be run this many times, and the result is the mode.
Model - The model used to produce outputs. Can either be a foundation model or your own custom model.
Run parameters - These are the knobs available to you to tune either how the evaluation is conducted (evaluation parameters) or how the model responses are produced (model parameters).
Run result - An evaluated test suite. The evaluation takes place in two phases (see the run sketch at the end of this section):
- Output gathering - Each test input is passed off to the model to gather outputs
- Evaluation - The checks are evaluated
Test result - The result of evaluating a single test
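The sketch below ties the run terms together, reusing the hypothetical Check/Test/TestSuite shapes and the call_model/judge helpers from the earlier sketches. It is illustrative only: outputs are gathered first, then each check is evaluated heavyweight-factor times and the mode of the verdicts is kept.

```python
from statistics import mode

def run_suite(suite: TestSuite, model: str, eval_model: str, heavyweight_factor: int = 1):
    results = []
    for test in suite.tests:
        # Phase 1: output gathering - pass the test input to the model
        output = call_model(model, test.input)

        # Phase 2: evaluation - grade the output against each check
        check_results = []
        for check in test.checks:
            # Heavyweight factor: repeat the auto eval and keep the mode
            verdicts = [
                judge(eval_model, check.operator, check.criteria or "", output)
                for _ in range(heavyweight_factor)
            ]
            check_results.append(mode(verdicts))
        results.append({"input": test.input, "output": output, "checks": check_results})
    return results
```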
Human Review
Auto eval re-review (aka Single run review) - A human pass over an auto eval. Each of the checks graded by the model must be answered or flagged by the reviewer.
Pairwise review - Two outputs are compared side by side and the user picks a winner.
Human review template - A user-configured review question that will only be answered by a human reviewer. This allows you to evaluate outputs in ways that may be hard to codify using checks alone.
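The records a human reviewer produces might look roughly like the sketch below; the class and field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class PairwiseReview:
    input: str
    output_a: str
    output_b: str
    winner: Literal["a", "b"]        # the reviewer's pick

@dataclass
class HumanReviewTemplateAnswer:
    question: str                    # e.g. "Is the tone appropriate for a support reply?"
    answer: Optional[str] = None     # filled in only by a human reviewer
```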