All results are displayed in a table on the Results tab.
Feature | Options/Actions |
---|---|
Filter by | Model, Status, Test Suite, Run By, Archived |
Search Includes | Run Name, Test Suite Name |
Sort by | Pass Rate, Run Date |
Show Columns | Toggle visibility of individual table columns |
There are a few actions available from this view:
... Archived filter is selected

Clicking on a row will bring you to the single run page.
The left side of the single run page shows the input, output, check results, and other metrics for each test result.
The right side contains top-level information about the run, including statistics, error analysis, and run parameters.
Feature | Options/Actions |
---|---|
Filter by Check Status | Checks Passed, Checks Failed, Some Checks Failed |
Filter by Attributes | Tag, Has Output Error, Has Low Confidence Checks |
Search Includes | Input, Output, Checks, LLM Feedback, Context |
By default, the Vals system will compute a confidence for every test - either “High” or “Low”. If the confidence is “High”, this means our system has flagged that it is very likely we’ve graded this output correctly. If the confidence is “Low”, then it means there is greater uncertainty or ambiguity in either the grading or in the criteria.
The confidence level is listed in each check next to the grade.
Several statistics are reported for every run.
Additionally, each check is by default given a certain category: e.g. “Correctness”, “Format”, “Style”, etc. The run result page will also show a pass percentage for each category (note: the categories can be overridden).
Finally, if tags are assigned to the tests, the run result page will show a performance breakdown by tag.
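To make the category and tag breakdowns concrete, here is a minimal sketch of how a pass percentage per group could be computed. The record layout (`category`, `tags`, `passed`) is a hypothetical one chosen for illustration, not the Vals data schema, and each check is assumed to carry the tags of the test it belongs to.

```python
from collections import defaultdict

# Hypothetical check records; field names are illustrative, not the Vals schema.
checks = [
    {"category": "Correctness", "tags": ["contracts"], "passed": True},
    {"category": "Correctness", "tags": ["contracts"], "passed": False},
    {"category": "Format", "tags": ["tax"], "passed": True},
    {"category": "Style", "tags": ["tax", "contracts"], "passed": True},
]


def pass_rate_by(checks, key_fn):
    """Return {group: percentage of checks passed}.

    key_fn maps a check to the list of groups it counts toward.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for check in checks:
        for group in key_fn(check):
            totals[group] += 1
            passes[group] += int(check["passed"])
    return {group: 100.0 * passes[group] / totals[group] for group in totals}


# One group per check (its category) versus possibly several (its tags).
by_category = pass_rate_by(checks, lambda c: [c["category"]])
by_tag = pass_rate_by(checks, lambda c: c["tags"])
```

A check with several tags counts toward each of them, so the per-tag percentages do not need to sum to the overall pass rate.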
You can choose to compare two runs either through the table or when looking at a single run. This allows you to see the results of two runs side by side - this is commonly done to see the differences between two models.
We automatically compute several statistics, such as the likelihood of a statistically significant difference between the two runs.
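The documentation does not say which statistical test is used. As one illustration of what a significance check between two pass rates can look like, the sketch below runs a standard two-proportion z-test. The function name and inputs are assumptions made for this example, and the test treats the two runs as independent samples.

```python
import math


def two_proportion_z_test(passed_a, total_a, passed_b, total_b):
    """Two-sided z-test for a difference between two pass rates.

    Illustrative only: this is a textbook test, not necessarily
    the method the Vals system applies.
    """
    rate_a = passed_a / total_a
    rate_b = passed_b / total_b
    # Pooled pass rate under the null hypothesis of no difference.
    pooled = (passed_a + passed_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        # Both runs passed everything or failed everything: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(80, 100, 60, 100)
```

When both runs use the same test suite, the results are paired, so a paired test such as McNemar's may be a better fit than this independent-samples version.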
You can also filter to only show tests that were marked differently by the auto-grader by pressing “Hide tests with no differences”.