Tests
Test Suite - A grouping of one or more tests; useful for keeping related tests organized
Test - An input combined with a set of checks against which the output will be evaluated
Input - A question posed to the LLM (aka question)
Output - The LLM’s response to the input (aka answer)
Check - An operator/criteria pair used to evaluate whether the output meets certain expectations (see the sketch after this list)
Operator - One of includes, excludes, etc. An operator is either unary, like is_safe, or binary, meaning it takes criteria filled in by the user. Each operator is a tool we’ve defined to check one specific aspect of the output
Criteria - Free text, filled in by the user, that a binary operator evaluates the output against
Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question and we compare it to the answer the LLM produces.
Custom operators - Operators you define yourself via prompt engineering. They can be configured on the settings page.
LLM (Large Language Model) - The underlying model used to generate responses.
LLM as Judge - A process where one LLM evaluates the output of another LLM.
SME (Subject matter expert) - An expert in the domain in question (e.g. a lawyer or CPA)
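To make the relationships above concrete, here is a minimal sketch of how a suite, its tests, and their checks could be modeled. The class and field names (TestSuite, Test, Check) are illustrative assumptions, not the product's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Check:
    """An operator/criteria pair evaluated against an output."""
    operator: str                    # e.g. "includes", "excludes", "is_safe"
    criteria: Optional[str] = None   # free text; None for unary operators like "is_safe"

@dataclass
class Test:
    """An input plus the checks its output will be evaluated against."""
    input: str                            # the question posed to the LLM
    checks: list[Check] = field(default_factory=list)
    right_answer: Optional[str] = None    # optional reference answer to compare against

@dataclass
class TestSuite:
    """A named grouping of related tests."""
    name: str
    tests: list[Test] = field(default_factory=list)

# Example: a small suite with one test and two checks
suite = TestSuite(
    name="refund-policy",
    tests=[
        Test(
            input="What is the refund window for annual plans?",
            checks=[
                Check(operator="includes", criteria="30 days"),
                Check(operator="is_safe"),  # unary operator: no criteria needed
            ],
        )
    ],
)
```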
Runs
Run result - The result of evaluating a test suite. A run has two phases (sketched in code at the end of this section):
- Output gathering - The model generates a response to each test input
- Evaluation - Each check is evaluated against the gathered outputs
Test result - The result of evaluating a single test
Auto eval - The result of grading an output against its respective checks
Model - The model used to produce outputs. Can either be a foundation model or your own custom model.
Eval model - Some of our operators employ LLM as judge. This is the model that will be used as judge.
Run parameters - These are the knobs available to you to tune either how the evaluation is conducted (evaluation parameters) or how the model responses are produced (model parameters).
Heavyweight factor - Each auto eval is run this many times, and the result is the mode of the individual grades.
Duration/Latency - The time it takes for a model to produce a response.
Confidence - When evaluating benchmark performance, we use a large, highly capable model as the primary judge for scoring responses. For confidence assessment, we use a smaller, more efficient model that analyzes the question and surrounding test context to verify the coherence of each question-answer pair, providing a reliability metric for the evaluation results. A high confidence indicates a high likelihood that the check has been scored correctly.
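As a rough illustration of how the two run phases and the heavyweight factor fit together, the sketch below assumes hypothetical generate(input) and judge(output, check) helpers and reuses the data model sketched in the Tests section; it is not the actual run pipeline.

```python
from collections import Counter

def run_suite(suite, generate, judge, heavyweight_factor=1):
    """Sketch of a two-phase run: gather outputs, then evaluate checks.

    `generate(input) -> str` produces the model's output (hypothetical helper).
    `judge(output, check) -> str` grades one check, e.g. "pass"/"fail" (hypothetical helper).
    Each check is judged `heavyweight_factor` times and the mode of the grades is kept.
    """
    # Phase 1: output gathering
    outputs = {test.input: generate(test.input) for test in suite.tests}

    # Phase 2: evaluation
    results = []
    for test in suite.tests:
        output = outputs[test.input]
        for check in test.checks:
            grades = [judge(output, check) for _ in range(heavyweight_factor)]
            mode_grade = Counter(grades).most_common(1)[0][0]  # most frequent grade
            results.append((test.input, check.operator, mode_grade))
    return results
```

For example, with heavyweight_factor=5 each check is judged five times and the most common grade wins, which smooths out variance in the judge model's answers.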
Human Review
Auto eval re-review (aka Single run review) - A second pass by a human reviewer over an auto eval. Each of the checks posed to the model must be answered or flagged by the reviewer.
Pairwise review - Two outputs are compared side by side and the reviewer selects the better one (see the sketch below).
Human review template - A user-configured review question that will only be answered by a human reviewer. This allows you to evaluate outputs in ways that may be hard to codify solely using checks.
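For illustration only, a pairwise review result and an answer to a human review template question might be recorded along these lines; the class and field names are assumptions rather than the product's schema.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PairwiseReview:
    """A reviewer's side-by-side comparison of two outputs to the same input."""
    input: str
    output_a: str
    output_b: str
    preferred: Literal["A", "B"]   # which output the reviewer judged better
    notes: str = ""

@dataclass
class HumanReviewAnswer:
    """A reviewer's answer to a user-configured human review template question."""
    question: str   # e.g. "Is the tone appropriate for a customer email?"
    answer: str
```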