Tests

Test Suite - A grouping of one or more tests, useful for keeping related tests organized
Test - An input combined with a grouping of checks against which an output will be evaluated
Input - The question posed to the LLM (aka question)
Output - The LLM’s response to the input (aka answer)
Check - An operator/criteria pair used to evaluate whether the output meets certain expectations
Operator - One of includes, excludes, etc. An operator is either unary (like is_safe) or binary, meaning it takes a criterion filled in by the user. Each operator is a tool we’ve defined to check one specific aspect of the output
Criteria - Free text filled in by the user that the operator will be evaluated against
Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question, and we compare that right answer to the answer the LLM produces.
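The relationships between these terms can be sketched as plain data structures. This is a hypothetical illustration, not the platform's actual schema; all class and field names are assumptions chosen to mirror the glossary.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Check:
    operator: str                   # e.g. "includes", "excludes", "is_safe"
    criteria: Optional[str] = None  # None for unary operators like is_safe

@dataclass
class Test:
    input: str                      # the question posed to the LLM
    checks: list = field(default_factory=list)
    right_answer: Optional[str] = None  # alternative evaluation path

@dataclass
class TestSuite:
    name: str
    tests: list = field(default_factory=list)

suite = TestSuite(
    name="Refund policy",
    tests=[
        Test(
            input="Can I return an opened item?",
            checks=[
                Check(operator="includes", criteria="30-day window"),  # binary
                Check(operator="is_safe"),                             # unary
            ],
        )
    ],
)
```

Note how a binary operator carries a user-supplied criterion while a unary one leaves it empty.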

Platform

Custom operator - An operator you define yourself via prompt engineering. Can be configured on the settings page.
LLM (Large Language Model) - The underlying model used to generate responses.
LLM as Judge - A process in which one LLM evaluates the output of another.
SME (Subject Matter Expert) - An expert in the domain under question (e.g. lawyer, CPA, etc.)
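One way to picture a prompt-engineered custom operator is as a judge prompt with a slot for the user's criterion. The template below is purely illustrative; the platform's real configuration format is not shown here.

```python
# Hypothetical judge-prompt template for a custom operator; the wording
# and placeholders are assumptions, not the platform's actual prompt.
CUSTOM_OPERATOR_PROMPT = """\
You are a judge. Answer PASS if the OUTPUT below {criteria},
otherwise answer FAIL. Reply with exactly one word.

OUTPUT:
{output}
"""

def render_judge_prompt(criteria: str, output: str) -> str:
    """Fill the template with the user's criterion and the model's output."""
    return CUSTOM_OPERATOR_PROMPT.format(criteria=criteria, output=output)

prompt = render_judge_prompt(
    criteria="cites a primary source",
    output="See 21 CFR 101.9 for labeling requirements.",
)
print(prompt)
```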

Runs

Run result - The result of evaluating a test suite. A run has two phases:
  1. Output gathering - The model generates responses for each test input to gather outputs
  2. Evaluation - Each check is evaluated based on the outputs
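The two phases above can be sketched as a small loop: gather every output first, then grade each check against its output. The `model` and `evaluate` callables here are hypothetical stand-ins, not the platform's real API.

```python
def run_suite(tests, model, evaluate):
    """Sketch of a run's two phases (illustrative names only)."""
    # Phase 1: output gathering — the model answers every test input
    outputs = [model(t["input"]) for t in tests]
    # Phase 2: evaluation — each check is graded against its output
    return [
        {"input": t["input"],
         "output": out,
         "check_results": [evaluate(out, c) for c in t["checks"]]}
        for t, out in zip(tests, outputs)
    ]

results = run_suite(
    tests=[{"input": "What is 2+2?", "checks": [("includes", "4")]}],
    model=lambda q: "The answer is 4.",  # stub model
    evaluate=lambda out, check: check[1] in out,  # stub "includes" grader
)
print(results[0]["check_results"])  # [True]
```

Separating the phases means all outputs exist before any grading starts, so an evaluation failure never loses already-gathered responses.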
Test result - The result of evaluating a single test
Auto eval - The result of grading an output against its respective checks
Model - The model used to produce outputs. Can be either a foundation model or your own custom model.
Eval model - Some of our operators employ LLM as judge; this is the model that will be used as the judge.
Run parameters - The knobs available to you to tune either how the evaluation is conducted (evaluation parameters) or how the model responses are produced (model parameters).
Heavyweight factor - Each auto eval will be run this many times, and the result is the mode of those runs.
Duration/Latency - The time it takes for a model to produce a response.
Confidence - When evaluating benchmark performance, we use a large, highly capable model as the primary judge for scoring responses. For confidence assessment, we employ a smaller, more efficient model that analyzes both the question and the surrounding test context to verify the coherence of each question-answer pair, providing a reliability metric for our evaluation results. High confidence indicates a high likelihood that the check has been scored correctly.
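The heavyweight factor amounts to repeating each auto eval and keeping the modal verdict, which smooths over a noisy judge. A minimal sketch, with a stub judge standing in for a real LLM-as-judge call:

```python
from collections import Counter

def auto_eval(judge, output, check, heavyweight_factor=3):
    """Run the (possibly noisy) judge `heavyweight_factor` times
    and return the most common verdict (the mode)."""
    verdicts = [judge(output, check) for _ in range(heavyweight_factor)]
    return Counter(verdicts).most_common(1)[0][0]

# Stub judge that gives a wrong verdict every third call, mimicking noise.
calls = {"n": 0}
def flaky_judge(output, check):
    calls["n"] += 1
    return calls["n"] % 3 != 0  # True, True, False, True, True, False, ...

result = auto_eval(flaky_judge, "some output", "some check", heavyweight_factor=3)
print(result)  # True — the single wrong verdict is outvoted
```

An odd heavyweight factor guarantees a strict majority for boolean verdicts, which is why the mode is well defined here.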

Human Review

Auto eval re-review (aka single run review) - A second pass by a human reviewer. Each of the checks posed to the model must be answered or flagged.
Pairwise review - Two outputs are compared side by side, and the reviewer selects the better one.
Human review template - A user-configured review question that will only be answered by a human reviewer. This lets you evaluate outputs in ways that may be hard to codify using checks alone.