Tests

Check - An operator/criteria pair

Criteria - Free text, filled in by the user, against which the operator evaluates the output

Input - The prompt posed to the LLM (aka question)

Operator - One of includes, excludes, etc. An operator is either unary (like is_safe) or binary (meaning it takes criteria filled in by the user). Each operator is a tool we’ve defined to check one specific aspect of the output.
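
For illustration, a check (operator/criteria pair) could be represented as in the sketch below. The operator names and field names are illustrative, not the platform’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Check:
    """An operator/criteria pair. Unary operators need no criteria;
    binary operators evaluate the output against user-written criteria."""
    operator: str                   # e.g. "includes", "excludes", "is_safe"
    criteria: Optional[str] = None  # free text; only used by binary operators

# A binary check: the output must mention a refund window.
binary_check = Check(operator="includes", criteria="a 30-day refund window")

# A unary check: no criteria needed.
unary_check = Check(operator="is_safe")
```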

Output - The LLM’s response to the input (aka answer)

Test - An input combined with a grouping of one or more checks against which an output will be evaluated.

Test Suite - A grouping of one or more tests, useful for keeping related tests organized

Right answer - An alternative way to evaluate an output. The user writes the “right answer” to a given prompt/question and we compare that right answer to the answer the LLM produces.
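
As a rough sketch, the comparison below uses a simple normalized string match; in practice the platform may instead compare the two answers with an LLM judge.

```python
def matches_right_answer(llm_answer: str, right_answer: str) -> bool:
    """Naive comparison of the LLM's answer against the user-written right answer.
    A production comparison would tolerate paraphrase, e.g. via an LLM judge."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(right_answer) in normalize(llm_answer)

matches_right_answer("The warranty period is 2 years.", "2 years")  # True
```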

Platform

Custom operators - Operators you define yourself via prompt engineering. They can be configured on the settings page.

LLM - Large language model

LLM as judge - A process by which one LLM is used to grade the response of another LLM.
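
A minimal sketch of the idea is shown below; the prompt wording and the eval_model.complete() interface are assumptions, not the platform’s actual API.

```python
def llm_as_judge(question: str, answer: str, criteria: str, eval_model) -> bool:
    """Ask one LLM (the judge) whether another LLM's answer satisfies a criterion."""
    prompt = (
        "You are grading another model's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criterion: {criteria}\n"
        "Reply with exactly PASS or FAIL."
    )
    # Assumption: eval_model exposes a complete(prompt) -> str method.
    verdict = eval_model.complete(prompt)
    return verdict.strip().upper().startswith("PASS")
```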

SME - Subject matter expert - A specialist in the domain in question (e.g. a lawyer or CPA)

Runs

Auto eval - The result of grading an output against its respective checks

Confidence - A reliability metric for evaluation results. When evaluating benchmark performance, we use a large, highly capable model as the primary judge for scoring responses, and a smaller, more efficient model that analyzes both the question and surrounding test context to verify the coherence of each question-answer pair. A high confidence indicates a high likelihood that the check has been scored correctly.

Duration/Latency - The time it takes for a model to produce a response.

Eval model - The model used as the judge by operators that employ LLM as judge.

Heavyweight factor - Each auto eval will be run this many times, and the result is the mode.
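
For example, with a heavyweight factor of 3, the same check is judged three times and the most common verdict is kept. A sketch of that idea, not the platform’s implementation:

```python
from statistics import mode

def auto_eval_with_heavyweight_factor(check_fn, heavyweight_factor: int = 3) -> bool:
    """Run the same check several times and keep the most common verdict,
    smoothing out nondeterminism in the judge."""
    verdicts = [check_fn() for _ in range(heavyweight_factor)]
    return mode(verdicts)  # e.g. [True, False, True] -> True
```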

Model - The model used to produce outputs. Can either be a foundation model or your own custom model.

Run parameters - These are the knobs available to you to tune either how the evaluation is conducted (evaluation parameters) or how the model responses are produced (model parameters).
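
A hypothetical run configuration might separate the two groups like this (all parameter names and values here are illustrative):

```python
run_parameters = {
    "evaluation": {            # how the evaluation is conducted
        "eval_model": "some-judge-model",
        "heavyweight_factor": 3,
    },
    "model": {                 # how model responses are produced
        "model": "your-model-or-a-foundation-model",
        "temperature": 0.2,
        "max_tokens": 1024,
    },
}
```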

Run result - An evaluated test suite. This evaluation takes place over two phases:

  1. Output gathering - Each test input is passed off to the model to gather outputs
  2. Evaluation - Each output is graded against its test’s checks
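
Sketched in code, the two phases might look like the following; the model.complete() interface and the dict-based test shape are assumptions for illustration only.

```python
def run_test_suite(tests, model, evaluate_check):
    """Produce a run result in two phases: gather outputs, then evaluate checks."""
    # Phase 1: output gathering - pass each test's input to the model.
    # Assumption: model exposes a complete(input) -> str method.
    outputs = {test["input"]: model.complete(test["input"]) for test in tests}

    # Phase 2: evaluation - grade each output against its test's checks.
    results = []
    for test in tests:
        output = outputs[test["input"]]
        verdicts = [evaluate_check(check, test["input"], output) for check in test["checks"]]
        results.append({"test": test, "output": output, "passed": all(verdicts)})
    return results  # one test result per test
```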

Test result - The result of evaluating a single test

Human Review

Auto eval re-review (aka Single run review) - A second pass over the auto eval by a human reviewer. Each of the checks posed to the model must be answered or flagged.

Pairwise review - Two outputs are compared side by side and the user picks a winner.

Human review template - A user-configured review question that will only be answered by a human reviewer. This allows you to evaluate outputs in ways that may be hard to codify solely using checks.