Overview
All information inside your suites and run results can be viewed from within the Platform, but we also support importing and exporting the most common file types. For readability we recommend CSV; if you need to pull the data locally and parse it, JSON is recommended. Exporting with just CSV or JSON will only export the content of the suite, not the files associated with the tests. If you would like to access those files, export as a ZIP: all files associated with the tests will be included and can be found inside documents/. Other paths are also supported for maximum flexibility.
The Files field is ignored when importing a CSV that is not inside a ZIP.
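For example, here is a minimal Python sketch (standard library only) that lists the contents of an exported ZIP and extracts the attached test files from documents/; the archive name is hypothetical, so substitute the file you downloaded from the Platform.

```python
import zipfile

# Inspect a suite exported as a ZIP. The archive name is hypothetical;
# substitute the file you downloaded from the Platform.
with zipfile.ZipFile("my_suite_export.zip") as archive:
    for name in archive.namelist():
        print(name)  # e.g., the suite CSV plus documents/doc1.pdf

    # Extract only the attached test files, which live under documents/.
    members = [n for n in archive.namelist() if n.startswith("documents/")]
    archive.extractall("exported_files", members=members)
```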
File Operations
| Type | Import | Export | CSV | JSON | ZIP |
|---|---|---|---|---|---|
| Test Suite | ✓ | ✓ | ✓ | ✓ | ✓ |
| Test Questions | ✓ | ✓ | | | |
| Auto Eval | ✓ | ✓ | ✓ | | |
| Human Review | ✓ | ✓ | | | |
| Question Answer Pairs | ✓ | ✓ | | | |
Test Suite
Use these examples as a template for constructing your own test suites or for reference on the expected format.
Example without files: Examples_of_Every_Operator_Suite.csv
Example with files: cuad_suite_short.zip
Suite
| Column Name (type) | Description |
|---|---|
| Suite Id (uuid) | Unique Identifier for the suite (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Title (str) | Title of the test suite (e.g., Math Evaluation Suite) |
| Description (str) | Description of the suite’s purpose or contents |
| Suite Version (int) | Version identifier for the suite (e.g., 1); incremented when the suite is updated and run |
| Number Of Tests (int) | Total number of tests included in the suite (e.g., 10) |
| Number Of Checks (int) | Total number of checks in the suite (e.g., 5) |
Test
| Column Name (type) | Description |
|---|---|
| Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Input (str) | The input or question the LLM will be asked (e.g., What is burden shifting under Title VII?) |
| Right Answer (str) | The correct answer for the test input |
| Tags (str) | Used to organize tests. This field is spread down the column of the CSV (e.g., math or law) |
| Files (str) | Name of a file attached to the test, or the path to the file inside a ZIP (e.g., documents/doc1.pdf) |
| Context Keys (str) | Key to a json value that is used as the input context for the test (e.g., date) |
| Context Values (str) | Value to a json key that is used as the input context for the test (e.g., 2024-01-01) |
Check
| Column Name (type) | Description |
|---|---|
| Operator (str) | The operator chosen for the check (e.g., includes) |
| Criteria (str) | The criteria for the operator that is checked against the LLM response (e.g., age, sex, religion) |
| Weight (int) | Numeric weight for the check (e.g., 1, 2), important for scoring |
| Category (str) | Category of the operator (e.g., Style, Correctness) |
| Extraction Prompt (str) | Instruction for extracting a value from the output (e.g., extract the table columns) |
| Conditional Operator (str) | Operator for the conditional check (e.g., satisfies_statement) |
| Conditional Criteria (str) | Criteria for a conditional check (e.g., mentions X) |
| Example Type (str) | Type of example, either positive (should pass) or negative (should fail) |
| Example Value (str) | Example value for the check (e.g., John Doe) |
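As a starting point, the sketch below writes a minimal suite CSV using the column names from the tables above. It assumes, as in the example files, that test and check columns share one row per check; treat the exact set of required columns as an assumption and compare against Examples_of_Every_Operator_Suite.csv before importing.

```python
import csv

# Minimal suite CSV sketch. Header names follow the tables above; whether
# every column is required, and the exact ordering, is an assumption --
# compare with Examples_of_Every_Operator_Suite.csv before importing.
header = [
    "Title", "Description", "Test Input", "Right Answer",
    "Tags", "Operator", "Criteria", "Weight", "Category",
]
rows = [
    {
        "Title": "Math Evaluation Suite",
        "Description": "Basic arithmetic checks",
        "Test Input": "What is 2 + 2?",
        "Right Answer": "4",
        "Tags": "math",
        "Operator": "includes",
        "Criteria": "4",
        "Weight": 1,
        "Category": "Correctness",
    },
]

with open("my_suite.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)
```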
Test Questions
Example: Examples_of_Every_Operator_questions.csv
Test Question
| Column Name (type) | Description |
|---|---|
| Test Input (str) | Test Input provided inside of a test |
Auto Eval
Example: Examples_of_Every_Operator_Results.csv
Run Result
| Column Name (type) | Description |
|---|---|
| Run Id (uuid) | Unique identifier for the run (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Suite Id (uuid) | |
| Test Suite Title (str) | |
| Run Status (str) | Status of the run when exported (e.g., success, error) |
| Run Error Message (str) | Error message if the run failed |
| Run Error Analysis (str) | LLM Analysis of the feedback from failed checks |
| Completed At (datetime) | |
| Run Parameters (dict) | Parameters used for the run |
| Percent Of Checks Passed (float) | |
| Amount Of Checks Passed (int) | |
| Standard Deviation For Checks Passed (float) | |
| Percent Of Tests Passed (float) | Percent of tests where all checks passed |
| Amount Of Tests Passed (int) | Total number of tests where all checks passed |
| Standard Deviation For Tests Passed (float) | |
| Needs Review Percentage (float) | Percentage of results that are flagged for human review |
Test Result
| Column Name (type) | Description |
|---|---|
| Test Result Id (uuid) | Unique identifier for the test result (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Id (uuid) | Unique identifier for the test (e.g., 19025787-7245-45aa-8d27-c6047bc804c0) |
| Test Status (str) | Status of the test result (e.g., success) |
| Test Error Message (str) | Error message if the test failed or encountered an error |
| Test Input (str) | |
| LLM Output (str) | The output generated by the LLM from the provided test input |
| Files (str) | Name of the files that were passed to the LLM during evaluation |
| In Tokens (int) | Number of input tokens used for the test |
| Out Tokens (int) | Number of output tokens that the LLM generated when answering the test input |
| Duration (float) | Time taken to run the test (in seconds) |
| Input Context Keys (str) | Context key that was added inside of the test (e.g., date) |
| Input Context Values (str) | Context value that was added inside of the test (e.g., 2024-01-01) |
| Output Context Keys (str) | Keys for matching output context values. (e.g., reasoning) |
| Output Context Values (str) | Values for matching output context keys. (e.g., The LLM provided a detailed explanation) |
Check
| Column Name (type) | Description |
|---|---|
| Operator (str) | |
| Criteria (str) | |
| Auto Eval (str/float) | Human-readable representation of the auto eval (e.g., pass, fail); floats are left as-is |
| Edited Auto Eval (str) | Overridden auto eval score (e.g., pass, fail). This is shown in the UI instead of the original auto eval |
| Edited Auto Eval Feedback (str) | Feedback left when the auto eval was edited |
| Confidence Level (str) | Confidence in the auto eval score (e.g., high, low) |
| Feedback (str) | LLM feedback on LLM output regarding criteria and the auto eval score |
| Weight (float) | |
| Example Type (str) | |
| Example Value (str) | |
| Extractor (str) | The value extracted from the LLM output using the extraction prompt defined in the check |
| Conditional Criteria (str) | |
| Conditional Operator (str) | |
| Category (str) |
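As a quick post-processing example, the sketch below tallies check outcomes from an exported Auto Eval CSV, preferring the Edited Auto Eval value when one exists (mirroring what the UI shows). The file name comes from the example above; the assumption that every row carries these check columns should be verified against your own export.

```python
import csv
from collections import Counter

# Tally check outcomes from an exported Auto Eval CSV. Column names follow
# the tables above; Edited Auto Eval, when present, takes precedence over
# the original score, as it does in the UI.
counts = Counter()
with open("Examples_of_Every_Operator_Results.csv", newline="") as f:
    for row in csv.DictReader(f):
        score = row.get("Edited Auto Eval") or row.get("Auto Eval", "")
        if score:
            counts[score.strip().lower()] += 1

total = counts["pass"] + counts["fail"]
if total:
    print(f"pass rate: {counts['pass'] / total:.1%}")
```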
Human Review
Example: Examples_of_Every_Operator_Review.csv
Run Review
| Column Name (type) | Description |
|---|---|
| Run Id (uuid) | |
| Run Name (str) | |
| Run Review Status (str) | Status of the human review (e.g., completed) |
| Run Review Created By (str) | User who created the run review |
| Run Review Created At (datetime) | Timestamp when the review was created |
| Run Review Completion Time (datetime) | Timestamp when the review was completed |
| Number Of Reviews (int) | Number of reviews chosen for each test result added to the queue (e.g., 1) |
| Assigned Reviewers (list[str]) | List of users who were assigned to review the run |
| Pass Rate (float) | Percentage of checks that humans marked as pass |
| Flagged Rate (float) | Percentage of checks that humans marked as flagged |
| Auto Eval ↔ Reviewer Agreement (float) | Agreement rate between auto eval and human reviewers (as a percentage) |
| Reviewer ↔ Reviewer Agreement (float) | Agreement rate between different human reviewers (as a percentage) |
Test Review
| Column Name (type) | Description |
|---|---|
| Test Result Id (str) | |
| Test Input (str) | |
| LLM Output (str) | |
| Files (str) | |
| Completed At (datetime) | |
| Completed By (str) | |
| Test Review Feedback (str) | Feedback from the reviewer on a test result from a human review |
Human Review Check
| Column Name (type) | Description |
|---|---|
| Check Type (str) | For checks that come from auto eval, this defaults to Auto-eval review; otherwise, it comes from the type selected in the human review template |
| Metric Name (str) | Name of the metric being evaluated, drawn from the human review template. Blank for auto eval checks |
| Operator (str) | |
| Criteria (str) | |
| Auto Eval (str) | |
| Reviewer Response (str) | Either the human review template response or the human-readable auto eval score (e.g., pass, fail) |
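If you want to recompute the Auto Eval ↔ Reviewer agreement yourself, a minimal sketch like the one below can compare the Auto Eval and Reviewer Response columns of the exported review CSV; treating a matching pass/fail string as agreement is a simplifying assumption, not necessarily the Platform's exact formula.

```python
import csv

# Estimate Auto Eval <-> Reviewer agreement from an exported Human Review
# CSV. Column names follow the tables above; counting an exact pass/fail
# string match as "agreement" is a simplifying assumption.
matches, compared = 0, 0
with open("Examples_of_Every_Operator_Review.csv", newline="") as f:
    for row in csv.DictReader(f):
        auto = (row.get("Auto Eval") or "").strip().lower()
        human = (row.get("Reviewer Response") or "").strip().lower()
        if auto and human:
            compared += 1
            matches += auto == human

if compared:
    print(f"agreement: {matches / compared:.1%} over {compared} checks")
```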
Question Answer Pairs
Example: Examples_of_Every_Operator_QA_Pairs.csv
Question Answer Pair
| Column Name (type) | Description |
|---|---|
| Question (str) | The input or question that matches a test input from within a test suite |
| Answer (str) | LLM generated response to the question |
| In Tokens (int) | |
| Out Tokens (int) | |
| Duration (float) |
In Tokens, Out Tokens, and Duration default to 0 if not provided.
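If you generate answers outside the Platform, you can build a question answer pair CSV like the sketch below and import it; the generate_answer helper and output file name are placeholders for your own code, and the token counts are left at 0 since they are optional.

```python
import csv
import time

def generate_answer(question: str) -> str:
    # Placeholder for your own model call.
    return "stub answer"

questions = ["Where is the Bay Area located?"]

with open("qa_pairs.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["Question", "Answer", "In Tokens", "Out Tokens", "Duration"]
    )
    writer.writeheader()
    for q in questions:
        start = time.monotonic()
        answer = generate_answer(q)
        writer.writerow({
            "Question": q,
            "Answer": answer,
            "In Tokens": 0,   # defaults to 0 if not provided
            "Out Tokens": 0,  # defaults to 0 if not provided
            "Duration": round(time.monotonic() - start, 3),
        })
```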
Troubleshooting
Frequent issues
- Test input is missing from the CSV file
- Duplicated questions when uploading question answer pairs
- Missing criteria values for non-unary operators
- Using comma-separated values inside a single cell instead of spreading them down the column as intended
Tips on formatting
For lists, in order to make modifications easier we separate the values at the cell level instead of making them comma separated. For example:

| Test Id | Test Input | Tags | Operator | Criteria |
|---|---|---|---|---|
| 19025787-7245-45aa-8d27-c6047bc804c0 | Where is the Bay Area located? | Bay | includes | California |
| | | Easy | includes_exactly | Northern California, United States |
| | | | excludes | Los Angeles |
| | | | excludes_exactly | Atlantic Ocean |
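To produce this spread-down layout programmatically, write one row per value and fill the shared cells only on the first row, as in this standard-library sketch (the file name is hypothetical; the values are taken from the example above):

```python
import csv

# Write multiple checks for one test by spreading values down the column:
# shared cells (Test Id, Test Input) are filled only on the first row.
checks = [
    ("Bay",  "includes",         "California"),
    ("Easy", "includes_exactly", "Northern California, United States"),
    ("",     "excludes",         "Los Angeles"),
    ("",     "excludes_exactly", "Atlantic Ocean"),
]

with open("spread_down_example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Test Id", "Test Input", "Tags", "Operator", "Criteria"])
    for i, (tag, operator, criteria) in enumerate(checks):
        test_id = "19025787-7245-45aa-8d27-c6047bc804c0" if i == 0 else ""
        test_input = "Where is the Bay Area located?" if i == 0 else ""
        writer.writerow([test_id, test_input, tag, operator, criteria])
```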