Overview

All information in your suites and run results can be viewed from within the Platform; however, we also support exporting and importing the most common file types. For readability we recommend CSV, and if you need to pull the data locally and parse it, JSON is recommended. Exporting as CSV or JSON will only export the content of the suite, not the files associated with the tests. If you would like to access the files, export as ZIP. When you export a suite as a ZIP, all files associated with the tests are exported and can be found inside documents/. Other paths are also supported for maximum flexibility. The Files field is ignored when importing a CSV that is not inside a ZIP.
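
For example, a ZIP prepared for import can place the suite CSV at the root of the archive and the attached files under documents/. A minimal sketch in Python (the file names here are hypothetical placeholders):

```python
import zipfile

# Package a suite CSV together with its attached files for ZIP import.
# The CSV should follow the format described in the sections below;
# "my_suite.csv" and the PDF names are hypothetical placeholders.
with zipfile.ZipFile("my_suite.zip", "w") as archive:
    archive.write("my_suite.csv", arcname="my_suite.csv")      # suite content
    archive.write("doc1.pdf", arcname="documents/doc1.pdf")    # files referenced in the Files column
    archive.write("doc2.pdf", arcname="documents/doc2.pdf")
```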

File Operations

Import and export are supported for the following types; the available file formats are CSV, JSON, and ZIP:

  • Test Suite
  • Test Questions
  • Auto Eval
  • Human Review
  • Question Answer Pairs
When exporting a suite, top-level information about the suite is included; this information is not required when importing. If you are importing global checks along with tests, please ensure there is a blank row between the global checks and the tests; the expected format is shown in the examples below. An exhaustive list of supported columns and types can also be found below. Excluding suite information, all exported information can be re-imported. For example, if you were to export a suite and then re-import it, you should be left with the same suite.
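
To illustrate the blank-row convention, the sketch below reads an exported suite CSV and splits its rows at the first blank row, assuming the global check rows come before the blank row and the test rows after it (see the example files below for the exact layout; the file name is a hypothetical placeholder):

```python
import csv

# Split an exported suite CSV at the blank row that separates
# global checks from tests. Assumes global check rows appear first;
# the file name is a hypothetical placeholder.
with open("exported_suite.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

global_checks, tests = [], []
seen_blank = False
for row in rows:
    if not any(cell.strip() for cell in row):
        seen_blank = True  # the separating blank row
        continue
    (tests if seen_blank else global_checks).append(row)

print(f"{len(global_checks)} global check rows, {len(tests)} test rows")
```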

Test Suite

Use these examples as a template for constructing your own test suites, or for reference on the expected format.

Example without files: Examples_of_Every_Operator_Suite.csv
Example with files: cuad_suite_short.zip

Suite

| Column Name (type) | Description |
| --- | --- |
| Suite Id (uuid) | Unique identifier for the suite (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Title (str) | Title of the test suite (e.g., Math Evaluation Suite) |
| Description (str) | Description of the suite’s purpose or contents |
| Suite Version (int) | Version identifier for the suite (e.g., 1); incremented when the suite is updated and run |
| Number Of Tests (int) | Total number of tests included in the suite (e.g., 10) |
| Number Of Checks (int) | Total number of checks in the suite (e.g., 5) |

Test

| Column Name (type) | Description |
| --- | --- |
| Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Input (str) | The input or question the LLM will be asked (e.g., What is burden shifting under Title VII?) |
| Right Answer (str) | The correct answer for the test input |
| Tags (str) | Used to organize tests; this field is spread down the column of the CSV (e.g., math or law) |
| Files (str) | Name of a file attached to the test, or the path to the file inside a ZIP (e.g., documents/doc1.pdf) |
| Context Keys (str) | Key of a JSON value that is used as input context for the test (e.g., date) |
| Context Values (str) | Value of a JSON key that is used as input context for the test (e.g., 2024-01-01) |
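
Input context is provided as key/value pairs. Below is a minimal sketch of laying a context dictionary across the paired Context Keys and Context Values columns, one pair per row, assuming the same spread-down-column convention used for other list-style fields (see Tips on formatting below; the dictionary contents are hypothetical):

```python
# Lay out an input context dictionary as paired Context Keys /
# Context Values cells, one pair per row. Only the first row carries
# the Test Input; later rows leave it blank. The context values here
# are hypothetical placeholders.
context = {"date": "2024-01-01", "jurisdiction": "California"}

rows = []
for i, (key, value) in enumerate(context.items()):
    rows.append({
        "Test Input": "What filings are due today?" if i == 0 else "",
        "Context Keys": key,
        "Context Values": value,
    })

for row in rows:
    print(row)
```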

Check

| Column Name (type) | Description |
| --- | --- |
| Operator (str) | The operator chosen for the check (e.g., includes) |
| Criteria (str) | The criteria for the operator that is checked against the LLM response (e.g., age, sex, religion) |
| Weight (int) | Numeric weight for the check (e.g., 1, 2); important for scoring |
| Category (str) | Category of the operator (e.g., Style, Correctness) |
| Extraction Prompt (str) | Instruction for extracting a value from the output (e.g., extract the table columns) |
| Conditional Operator (str) | Operator for the conditional check (e.g., satisfies_statement) |
| Conditional Criteria (str) | Criteria for a conditional check (e.g., mentions X) |
| Example Type (str) | Type of example, either positive (should pass) or negative (should fail) |
| Example Value (str) | Example value for the check (e.g., John Doe) |

Test Questions

Example: Examples_of_Every_Operator_questions.csv

Test Question

| Column Name (type) | Description |
| --- | --- |
| Test Input (str) | Test input provided inside a test |

This is the most compact and simple version of a test suite: it contains only the test inputs from the exported test suite, and it can be imported back into a test suite.
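
Because this format has a single column, it is easy to generate programmatically. A minimal sketch (the questions and file name are hypothetical placeholders):

```python
import csv

# Write a test questions CSV: one "Test Input" column, one question per row.
questions = [
    "What is burden shifting under Title VII?",
    "Where is the Bay Area located?",
]

with open("questions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Test Input"])
    writer.writerows([q] for q in questions)
```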

Auto Eval

Example: Examples_of_Every_Operator_Results.csv

Run Result

| Column Name (type) | Description |
| --- | --- |
| Run Id (uuid) | Unique identifier for the run (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Suite Id (uuid) | |
| Test Suite Title (str) | |
| Run Status (str) | Status of the run when exported (e.g., success, error) |
| Run Error Message (str) | Error message if the run failed |
| Run Error Analysis (str) | LLM analysis of the feedback from failed checks |
| Completed At (datetime) | |
| Run Parameters (dict) | Parameters used for the run |
| Percent Of Checks Passed (float) | |
| Amount Of Checks Passed (int) | |
| Standard Deviation For Checks Passed (float) | |
| Percent Of Tests Passed (float) | Percent of tests where all checks passed |
| Amount Of Tests Passed (int) | Total number of tests where all checks passed |
| Standard Deviation For Tests Passed (float) | |
| Needs Review Percentage (float) | Percentage of results that are flagged for human review |

Test Results

| Column Name (type) | Description |
| --- | --- |
| Test Result Id (uuid) | Unique identifier for the test result (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Status (str) | Status of the test result (e.g., success) |
| Test Error Message (str) | Error message if the test failed or encountered an error |
| Test Input (str) | |
| LLM Output (str) | The output generated by the LLM from the provided test input |
| Files (str) | Names of the files that were passed to the LLM during evaluation |
| In Tokens (int) | Number of input tokens used for the test |
| Out Tokens (int) | Number of output tokens the LLM generated when answering the test input |
| Duration (float) | Time taken to run the test (in seconds) |
| Input Context Keys (str) | Context key that was added inside the test (e.g., date) |
| Input Context Values (str) | Context value that was added inside the test (e.g., 2024-01-01) |
| Output Context Keys (str) | Keys for matching output context values (e.g., reasoning) |
| Output Context Values (str) | Values for matching output context keys (e.g., The LLM provided a detailed explanation) |
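
These per-test columns make it straightforward to summarize token usage and latency from an export. A minimal sketch, assuming the test result columns appear as CSV headers (in practice you may need to skip the run-level rows at the top of the export; the file name is a hypothetical placeholder):

```python
import csv

# Sum token usage and average the duration across test result rows.
# Spread-down list fields leave many cells blank, so only non-empty
# cells are counted.
in_tokens = out_tokens = 0
durations = []

with open("results_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row.get("In Tokens"):
            in_tokens += int(row["In Tokens"])
        if row.get("Out Tokens"):
            out_tokens += int(row["Out Tokens"])
        if row.get("Duration"):
            durations.append(float(row["Duration"]))

avg = sum(durations) / len(durations) if durations else 0.0
print(f"in tokens: {in_tokens}, out tokens: {out_tokens}, avg duration: {avg:.2f}s")
```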

Check

| Column Name (type) | Description |
| --- | --- |
| Operator (str) | |
| Criteria (str) | |
| Auto Eval (str/float) | Human-readable representation of the auto eval (e.g., pass, fail); floats are left as-is |
| Edited Auto Eval (str) | Overridden auto eval score (e.g., pass, fail); this is what is shown in the UI instead of the original auto eval |
| Edited Auto Eval Feedback (str) | Feedback left when the auto eval was edited |
| Confidence Level (str) | Confidence in the auto eval score (e.g., high, low) |
| Feedback (str) | LLM feedback on the LLM output regarding the criteria and the auto eval score |
| Weight (float) | |
| Example Type (str) | |
| Example Value (str) | |
| Extractor (str) | The value extracted from the LLM output using the extraction prompt defined in the check |
| Conditional Criteria (str) | |
| Conditional Operator (str) | |
| Category (str) | |
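
The Auto Eval and Edited Auto Eval columns can be used to recompute a pass rate from an export. A minimal sketch, preferring the edited score (what the UI shows) over the original, assuming the check columns appear as CSV headers (the file name is a hypothetical placeholder):

```python
import csv

# Compute a check pass rate from exported check rows, preferring the
# Edited Auto Eval value over the original Auto Eval when present.
passed = total = 0

with open("results_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        score = (row.get("Edited Auto Eval") or row.get("Auto Eval") or "").strip().lower()
        if score in ("pass", "fail"):
            total += 1
            if score == "pass":
                passed += 1

if total:
    print(f"{passed}/{total} checks passed ({100 * passed / total:.1f}%)")
```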

Human Review

Example: Examples_of_Every_Operator_Review.csv

Run Review

| Column Name (type) | Description |
| --- | --- |
| Run Id (uuid) | |
| Run Name (str) | |
| Run Review Status (str) | Status of the human review (e.g., completed) |
| Run Review Created By (str) | User who created the run review |
| Run Review Created At (datetime) | Timestamp when the review was created |
| Run Review Completion Time (datetime) | Timestamp when the review was completed |
| Number Of Reviews (int) | Number of reviews chosen for each test result added to the queue (e.g., 1) |
| Assigned Reviewers (list[str]) | List of users who were assigned to review the run |
| Pass Rate (float) | Percentage of checks that humans marked as pass |
| Flagged Rate (float) | Percentage of checks that humans marked as flagged |
| Auto Eval ↔ Reviewer Agreement (float) | Agreement rate between auto eval and human reviewers (as a percentage) |
| Reviewer ↔ Reviewer Agreement (float) | Agreement rate between different human reviewers (as a percentage) |

Only completed run reviews will be exported.

Test Review

| Column Name (type) | Description |
| --- | --- |
| Test Result Id (str) | |
| Test Input (str) | |
| LLM Output (str) | |
| Files (str) | |
| Completed At (datetime) | |
| Completed By (str) | |
| Test Review Feedback (str) | Feedback from the reviewer on a test result from a human review |

Human Review Check

| Column Name (type) | Description |
| --- | --- |
| Check Type (str) | For checks that come from auto eval, this defaults to Auto-eval review; otherwise, it comes from the type selected in the human review template |
| Metric Name (str) | Name of the metric being evaluated, drawn from the human review template; blank for auto eval checks |
| Operator (str) | |
| Criteria (str) | |
| Auto Eval (str) | |
| Reviewer Response (str) | Either the human review template response or the human-readable auto eval score (e.g., pass, fail) |

Question Answer Pairs

Example: Examples_of_Every_Operator_QA_Pairs.csv

Question Answer Pair

| Column Name (type) | Description |
| --- | --- |
| Question (str) | The input or question that matches a test input from within a test suite |
| Answer (str) | LLM-generated response to the question |
| In Tokens (int) | |
| Out Tokens (int) | |
| Duration (float) | |
Additionally, we support output context keys and values, which are used to store additional information you may want to collect from the LLM. Each column that is not one of the columns above will be treated as a key, and the value will be the contents of the cell in that column. An example can be found above. To support different workflows, columns other than Question and Answer are optional; metadata will default to 0 if not provided.
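
Below is a minimal sketch of a question answer pairs CSV, with the optional metadata columns and one extra column ("reasoning") that will be treated as an output context key (all row values are hypothetical placeholders):

```python
import csv

# Write a question answer pairs CSV. Only Question and Answer are
# required; In Tokens, Out Tokens, and Duration are optional, and the
# extra "reasoning" column becomes an output context key whose value
# is the cell contents.
fieldnames = ["Question", "Answer", "In Tokens", "Out Tokens", "Duration", "reasoning"]

with open("qa_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        "Question": "Where is the Bay Area located?",
        "Answer": "Northern California, United States",
        "In Tokens": 12,
        "Out Tokens": 8,
        "Duration": 1.4,
        "reasoning": "The LLM provided a detailed explanation",
    })
```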

Troubleshooting

Frequent issues

  • Test input is missing from the CSV file
  • Duplicated questions when uploading question answer pairs (see the sketch below)
  • Missing criteria values for non-unary operators
  • Using comma-separated values inside a single cell instead of spreading them down columns as intended
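
For question answer pair files, a small pre-upload check can catch empty and duplicated questions. A minimal sketch (the file name is a hypothetical placeholder):

```python
import csv
from collections import Counter

# Flag empty and duplicated questions in a question answer pairs CSV
# before uploading it.
with open("qa_pairs.csv", newline="", encoding="utf-8") as f:
    questions = [(row.get("Question") or "").strip() for row in csv.DictReader(f)]

empty_rows = [i for i, q in enumerate(questions, start=2) if not q]  # start=2 accounts for the header row
duplicates = [q for q, n in Counter(questions).items() if q and n > 1]

if empty_rows:
    print(f"Rows with an empty Question: {empty_rows}")
if duplicates:
    print(f"Duplicated questions: {duplicates}")
```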

Tips on formatting

For lists, in order to make modifications easier, we separate the values at the cell level instead of making them comma-separated. For example:

| Test Id | Test Input | Tags | Operator | Criteria |
| --- | --- | --- | --- | --- |
| 19025787-7245-45aa-8d27-c6047bc804c0 | Where is the Bay Area located? | Bay | includes | California |
| | | Easy | includes_exactly | Northern California, United States |
| | | | excludes | Los Angeles |
| | | | excludes_exactly | Atlantic Ocean |
Each row before the next test input belongs to the current test. Columns such as Tags are grouped together, while operators are separated at the row level. Please reference the example for additional information.

We support various encoding types, but we recommend UTF-8 as it is the most widely supported. We require that you fill out the criteria for every non-unary operator, meaning any operator that requires a criteria value. If you have any issues that cannot be resolved, please reach out to us at contact@vals.ai.
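
If you are generating suite CSVs programmatically, the same layout can be produced by only filling a cell on the row where a new value starts. A minimal sketch that reproduces the table above (the file name is a hypothetical placeholder):

```python
import csv

# Reproduce the spread-down-column layout from the table above: the
# test id and input appear once, tags and checks are written one per
# row, and the remaining cells are left blank.
test = {
    "Test Id": "19025787-7245-45aa-8d27-c6047bc804c0",
    "Test Input": "Where is the Bay Area located?",
    "Tags": ["Bay", "Easy"],
    "Checks": [
        ("includes", "California"),
        ("includes_exactly", "Northern California, United States"),
        ("excludes", "Los Angeles"),
        ("excludes_exactly", "Atlantic Ocean"),
    ],
}

with open("my_suite.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Test Id", "Test Input", "Tags", "Operator", "Criteria"])
    for i, (operator, criteria) in enumerate(test["Checks"]):
        writer.writerow([
            test["Test Id"] if i == 0 else "",
            test["Test Input"] if i == 0 else "",
            test["Tags"][i] if i < len(test["Tags"]) else "",
            operator,
            criteria,
        ])
```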