All results are displayed in the table on the results tab.
It’s possible to filter the results by the model that was run, the current status, who ran it, and which test suite it came from, as well as to search by run name or test suite name. You can show or hide table columns by pressing “Show Columns”, and sort by pass rate or start time.
There are a few actions available from this view:
Clicking on a row will bring you to the single run page.
The left side shows the results of every test - including the input, the output from the LLM, how it performed on each check, and other information. The right side contains top-level information about the run - including statistics about overall performance, a free-text summary, and the parameters of the run.
It’s possible to filter the results to only view certain tests. First, you can filter based on failures - showing only the tests where some, all, or none of the checks failed.
It is also possible to search for a given string in the test. This searches over the input, output, checks, LLM feedback, and context by default.
Additionally, you can filter to only tests that have a given tag.
Finally, you can filter to only the tests where the model had an error when producing the output (e.g., if you were looking for cases where the model exceeded its token limit).
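To make the filter semantics concrete, the sketch below applies the same kinds of filters to a plain list of test records in Python. The record shape and field names are hypothetical and purely illustrative - they are not the format Vals uses.

```python
# Hypothetical test-result records; field names are illustrative, not the Vals export format.
tests = [
    {"name": "refund policy", "input": "What is your refund policy?",
     "output": "Refunds are available within 30 days.",
     "tags": ["billing"], "checks_failed": 1, "checks_total": 3, "error": None},
    {"name": "long document summary", "input": "Summarize this 500-page report.",
     "output": "", "tags": ["stress"], "checks_failed": 0, "checks_total": 2,
     "error": "token limit exceeded"},
]

# Only tests where some (but not all) checks failed.
some_failed = [t for t in tests if 0 < t["checks_failed"] < t["checks_total"]]
# Only tests carrying a given tag.
tagged = [t for t in tests if "billing" in t["tags"]]
# Only tests where the model errored while producing its output.
with_errors = [t for t in tests if t["error"] is not None]
# Simple substring search over the input and output.
matching = [t for t in tests if "refund" in (t["input"] + " " + t["output"]).lower()]
```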
By default, the Vals system computes a confidence for every test - either “High” or “Low”. “High” confidence means our system is very likely to have graded the output correctly; “Low” confidence means there is greater uncertainty or ambiguity in either the grading or the criteria.
The confidence score is listed in each check next to the grade.
Several statistics are reported for every run.
Additionally, each check is assigned a category by default: e.g., “Correctness”, “Format”, or “Style”. The run result page also shows a pass percentage for each category (note: the categories can be overridden).
Finally, if tags are assigned to tests, the page shows a performance breakdown by tag.
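To illustrate how these breakdowns are derived, the sketch below computes a pass rate per category and per tag from a flat list of check results. The data layout and field names are hypothetical, for illustration only; they do not reflect how Vals stores results internally.

```python
from collections import defaultdict

# Hypothetical flat list of check results; field names are illustrative only.
checks = [
    {"test_id": "t1", "category": "Correctness", "tags": ["billing"], "passed": True},
    {"test_id": "t1", "category": "Format", "tags": ["billing"], "passed": False},
    {"test_id": "t2", "category": "Correctness", "tags": ["refunds"], "passed": True},
]

def pass_rate_by(key_fn, results):
    """Group check results by key_fn and report the fraction that passed."""
    totals, passes = defaultdict(int), defaultdict(int)
    for check in results:
        for key in key_fn(check):
            totals[key] += 1
            passes[key] += check["passed"]
    return {key: passes[key] / totals[key] for key in totals}

by_category = pass_rate_by(lambda c: [c["category"]], checks)
by_tag = pass_rate_by(lambda c: c["tags"], checks)
print(by_category)  # {'Correctness': 1.0, 'Format': 0.0}
print(by_tag)       # {'billing': 0.5, 'refunds': 1.0}
```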
You can choose to compare two runs either through the table or when looking at a single run. This allows you to see the results of two runs side by side - this is commonly done to see the differences between two models.
We automatically compute several statistics, such as the likelihood of a statistically significant difference between the two runs.
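As a rough illustration of this kind of comparison, the snippet below runs a two-proportion z-test on the overall pass rates of two runs. This is a minimal sketch of one standard approach, not necessarily the statistic Vals actually computes.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(pass_a, total_a, pass_b, total_b):
    """Two-sided z-test for a difference between two pass rates (illustrative only)."""
    p_a, p_b = pass_a / total_a, pass_b / total_b
    pooled = (pass_a + pass_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical runs: run A passed 78/100 checks, run B passed 64/100.
z, p = two_proportion_z_test(78, 100, 64, 100)
print(f"z = {z:.2f}, p = {p:.3f}")  # a small p-value suggests a significant difference
```

Note that when the same tests are graded under both runs, a paired test (e.g., McNemar’s) can be more appropriate than an unpaired one, which is one reason to rely on the built-in statistics rather than a hand-rolled comparison.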
You can also filter to only show tests that were marked differently by the auto-grader by pressing “Hide tests with no differences”.