Files referenced from the CSV are resolved relative to the root of the zip (e.g., documents/). Other paths are supported for maximum flexibility. The Files field is ignored when importing a CSV that is not inside of a zip.
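As a sketch of the expected layout (the file names here are hypothetical, not part of the spec), a zip for import might be assembled like this, with the CSV at the root and attached files under paths the Files column can reference:

```python
import zipfile

# Minimal sketch with hypothetical file names: package a suite CSV together
# with the files its "Files" column references, relative to the zip root.
with zipfile.ZipFile("suite_import.zip", "w") as zf:
    zf.write("tests.csv", arcname="tests.csv")          # the suite CSV itself
    zf.write("doc1.pdf", arcname="documents/doc1.pdf")  # referenced as documents/doc1.pdf
```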
Type | Import | Export | CSV | JSON | ZIP |
---|---|---|---|---|---|
Test Suite | ✓ | ✓ | ✓ | ✓ | ✓ |
Test Questions | ✓ | ✓ | | | |
Auto Eval | ✓ | ✓ | ✓ | | |
Human Review | ✓ | ✓ | | | |
Question Answer Pairs | ✓ | ✓ | | | |
Column Name (type) | Description |
---|---|
Suite Id (uuid) | Unique identifier for the suite (e.g., 19025787-7245-45aa-8d27-c6047bc804c0)
Title (str) | Title of the test suite (e.g., Math Evaluation Suite)
Description (str) | Description of the suite’s purpose or contents
Suite Version (int) | Version identifier for the suite (e.g., 1). Incremented each time the suite is updated and run
Number Of Tests (int) | Total number of tests included in the suite (e.g., 10)
Number Of Checks (int) | Total number of checks in the suite (e.g., 5)
Column Name (type) | Description |
---|---|
Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0)
Test Input (str) | The input or question the LLM will be asked (e.g., What is burden shifting under Title VII?)
Right Answer (str) | The correct answer for the test input
Tags (str) | Used to organize tests. Multiple values for one test are spread down the CSV column, one per row (e.g., math or law)
Files (str) | Name of a file attached to the test, or the path to the file inside of a zip (e.g., documents/doc1.pdf)
Context Keys (str) | JSON key for an input-context entry of the test (e.g., date)
Context Values (str) | JSON value paired with the context key (e.g., 2024-01-01)
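To make the layout concrete, here is a minimal, hypothetical sketch of writing one test row with Python's csv module; the column subset and example values are illustrative, not required:

```python
import csv

# Hypothetical subset of the columns above: one test row with an attached
# file and one context key/value pair.
fieldnames = ["Test Input", "Right Answer", "Tags", "Files",
              "Context Keys", "Context Values"]

with open("tests.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        "Test Input": "What holidays fall on the given date?",
        "Right Answer": "New Year's Day",
        "Tags": "calendar",
        "Files": "documents/doc1.pdf",  # path inside the import zip
        "Context Keys": "date",         # paired with the value below
        "Context Values": "2024-01-01",
    })
```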
Column Name (type) | Description |
---|---|
Operator (str) | The operator chosen for the check (e.g., includes)
Criteria (str) | The criteria for the operator that is checked against the LLM response (e.g., age, sex, religion)
Weight (int) | Numeric weight for the check (e.g., 1, 2), important for scoring
Category (str) | Category of the operator (e.g., Style, Correctness)
Extraction Prompt (str) | Instruction for extracting a value from the output (e.g., extract the table columns)
Conditional Operator (str) | Operator for the conditional check (e.g., satisfies_statement)
Conditional Criteria (str) | Criteria for a conditional check (e.g., mentions X)
Example Type (str) | Type of example, either positive (should pass) or negative (should fail)
Example Value (str) | Example value for the check (e.g., John Doe)
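A hedged sketch of a single check row using these columns (the operator, criteria, and example values are hypothetical, chosen only to show how the fields fit together):

```python
import csv

# Hypothetical check row: a conditional check with a positive example
# (an output that should pass the check).
fieldnames = ["Operator", "Criteria", "Weight", "Category",
              "Conditional Operator", "Conditional Criteria",
              "Example Type", "Example Value"]

with open("checks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        "Operator": "includes",
        "Criteria": "California",
        "Weight": 1,
        "Category": "Correctness",
        "Conditional Operator": "satisfies_statement",
        "Conditional Criteria": "mentions a location",
        "Example Type": "positive",  # should pass
        "Example Value": "The Bay Area is in California.",
    })
```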
Column Name (type) | Description |
---|---|
Test Input (str) | The input provided inside of a test
Column Name (type) | Description |
---|---|
Run Id (uuid) | Unique identifier for the run (e.g., 19025787-7245-45aa-8d27-c6047bc804c0)
Test Suite Id (uuid) | Unique identifier for the test suite that was run
Test Suite Title (str) | Title of the test suite that was run
Run Status (str) | Status of the run when exported (e.g., success, error)
Run Error Message (str) | Error message if the run failed
Run Error Analysis (str) | LLM analysis of the feedback from failed checks
Completed At (datetime) | Timestamp when the run completed
Run Parameters (dict) | Parameters used for the run
Percent Of Checks Passed (float) | Percent of checks that passed across the run
Amount Of Checks Passed (int) | Total number of checks that passed
Standard Deviation For Checks Passed (float) | Standard deviation of the check pass rate
Percent Of Tests Passed (float) | Percent of tests where all checks passed
Amount Of Tests Passed (int) | Total number of tests where all checks passed
Standard Deviation For Tests Passed (float) | Standard deviation of the test pass rate
Needs Review Percentage (float) | Percentage of results that are flagged for human review
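Assuming the export is a plain CSV with these column names, the headline run metrics could be read back like this (the file name is hypothetical):

```python
import csv

# Minimal sketch: print the summary metrics from an exported run CSV.
with open("run_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["Run Id"], row["Run Status"])
        print("  % checks passed:", row["Percent Of Checks Passed"])
        print("  % tests passed: ", row["Percent Of Tests Passed"])
        print("  needs review %: ", row["Needs Review Percentage"])
```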
Column Name (type) | Description |
---|---|
Test Result Id (uuid) | Unique identifier for the test result (e.g., 19025787-7245-45aa-8d27-c6047bc804c0)
Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0)
Test Status (str) | Status of the test result (e.g., success)
Test Error Message (str) | Error message if the test failed or encountered an error
Test Input (str) | The input or question the LLM was asked
LLM Output (str) | The output generated by the LLM from the provided test input
Files (str) | Names of the files that were passed to the LLM during evaluation
In Tokens (int) | Number of input tokens that were used for the test
Out Tokens (int) | Number of output tokens the LLM generated when answering the test input
Duration (float) | Time taken to run the test (in seconds)
Input Context Keys (str) | Context key that was added inside of the test (e.g., date)
Input Context Values (str) | Context value that was added inside of the test (e.g., 2024-01-01)
Output Context Keys (str) | Key for a matching output context value (e.g., reasoning)
Output Context Values (str) | Value for a matching output context key (e.g., The LLM provided a detailed explanation)
Column Name (type) | Description |
---|---|
Operator (str) | |
Criteria (str) | |
Auto Eval (str/float) | Human-readable representation of the auto eval (e.g., pass, fail); float scores are left as is
Edited Auto Eval (str) | Overridden auto eval score (e.g., pass, fail). This is shown in the UI instead of the original auto eval
Edited Auto Eval Feedback (str) | Feedback left when the auto eval was edited
Confidence Level (str) | Confidence in the auto eval score (e.g., high, low)
Feedback (str) | LLM feedback on the LLM output regarding the criteria and the auto eval score
Weight (float) | |
Example Type (str) | |
Example Value (str) | |
Extractor (str) | The value extracted from the LLM output using the extraction prompt defined in the check
Conditional Criteria (str) | |
Conditional Operator (str) | |
Category (str) |
Column Name (type) | Description |
---|---|
Run Id (uuid) | Unique identifier for the run
Run Name (str) | Name of the run
Run Review Status (str) | Status of the human review (e.g., completed)
Run Review Created By (str) | User who created the run review |
Run Review Created At (datetime) | Timestamp when the review was created |
Run Review Completion Time (datetime) | Timestamp when the review was completed |
Number Of Reviews (int) | Number of reviews chosen for each test result added to the queue (e.g., 1)
Assigned Reviewers (list[str]) | List of users who were assigned to review the run |
Pass Rate (float) | Percentage of checks that humans marked as pass |
Flagged Rate (float) | Percentage of checks that humans marked as flagged |
Auto Eval ↔ Reviewer Agreement (float) | Agreement rate between auto eval and human reviewers (as a percentage) |
Reviewer ↔ Reviewer Agreement (float) | Agreement rate between different human reviewers (as a percentage) |
Column Name (type) | Description |
---|---|
Test Result Id (str) | |
Test Input (str) | |
LLM Output (str) | |
Files (str) | |
Completed At (datetime) | Timestamp when the review of the test result was completed
Completed By (str) | User who completed the review
Test Review Feedback (str) | Feedback from the reviewer on a test result from a human review |
Column Name (type) | Description |
---|---|
Check Type (str) | For checks that come from auto eval, this defaults to Auto-eval review; otherwise it comes from the type selected in the human review template
Metric Name (str) | Name of the metric being evaluated, drawn from the human review template. Blank for auto eval checks
Operator (str) | |
Criteria (str) | |
Auto Eval (str) | |
Reviewer Response (str) | Either the human review template response or the human-readable auto eval score (e.g., pass, fail)
Column Name (type) | Description |
---|---|
Question (str) | The input or question that matches a test input from within a test suite
Answer (str) | LLM generated response to the question |
In Tokens (int) | |
Out Tokens (int) | |
Duration (float) |
In Tokens, Out Tokens, and Duration default to 0 if not provided.
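A small, hypothetical sketch of producing a question answer pairs CSV, with the numeric fields explicitly set to their default of 0:

```python
import csv

# Illustrative only: write one question-answer pair; token counts and
# duration are set to 0, matching their default when not provided.
with open("qa_pairs.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Question", "Answer", "In Tokens", "Out Tokens", "Duration"])
    writer.writeheader()
    writer.writerow({
        "Question": "Where is the Bay Area located?",
        "Answer": "Northern California, United States",
        "In Tokens": 0, "Out Tokens": 0, "Duration": 0,
    })
```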
| Test Id | Test Input | Tags | Operator | Criteria |
|---|---|---|---|---|
| 19025787-7245-45aa-8d27-c6047bc804c0 | Where is the Bay Area located? | Bay | includes | California |
| | | Easy | includes_exactly | Northern California, United States |
| | | | excludes | Los Angeles |
| | | | excludes_exactly | Atlantic Ocean |
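For reference, the example above could be generated with a short script; blank cells are written as empty strings so the Test Id, Test Input, and Tags values spread down their columns (the file name is hypothetical):

```python
import csv

# Sketch reproducing the table above: later rows leave Test Id and
# Test Input blank, and extra tags and checks for the same test each
# get their own row.
rows = [
    ["19025787-7245-45aa-8d27-c6047bc804c0",
     "Where is the Bay Area located?", "Bay", "includes", "California"],
    ["", "", "Easy", "includes_exactly", "Northern California, United States"],
    ["", "", "", "excludes", "Los Angeles"],
    ["", "", "", "excludes_exactly", "Atlantic Ocean"],
]

with open("example_tests.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Test Id", "Test Input", "Tags", "Operator", "Criteria"])
    writer.writerows(rows)
```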