Getting Started
go-eval v0.8 is an open-source evaluation toolkit for Go teams building LLM products. It runs inside go test, stays opt-in through GOEVAL=1, and covers judge metrics, deterministic checks, structured artifacts, tool trajectories, multi-step agent scenarios, result comparison, and summaries.
Use Cases
- RAG response quality checks in go test
- Deterministic output validation for JSON and artifacts
- Agent trajectory checks for tool-use workflows
- Ordered agent scenario contracts with per-step tool policies
- Tiered CI slices for critical, standard, and extended cases
- Prompt or model regression checks in CI pipelines
- Repeatability checks for flaky judge metrics
Core Concepts
Create Your First Test Run
Start with keyed eval.Case literals, a cheap deterministic check, and one judge metric. Keyed case literals are required by v0.4 and later because Case has a private blank field.
package yourpkg_test

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestSupportReply(t *testing.T) {
	// openAIJudge is not defined in this snippet; declare it elsewhere in the test package.
	runner := eval.NewRunner(openAIJudge, eval.WithResultSink(eval.DefaultResultSink()))
	c := eval.Case{
		Input:    "How do I cancel my plan?",
		Output:   "You can cancel from Billing > Subscription.",
		Expected: "cancel",
		Metadata: map[string]any{
			"flow": "support.reply", "tier": "critical", "dataset": "support/v1",
		},
	}
	// Precheck runs the cheap Contains guard before the Compound judge metric.
	result := runner.Run(t, eval.Precheck{
		Pre: eval.Contains{},
		Main: eval.Compound{
			Dimensions: []eval.Dimension{
				{Name: "helpfulness", Rubric: "Actionable next step", Threshold: 0.7},
				{Name: "policy_alignment", Rubric: "No unsafe guidance", Threshold: 0.9},
			},
		},
	}, c)
	if !result.Passed {
		t.Fatalf("eval failed: %s", result.Reason)
	}
}

Metrics Overview
| Metric | Type | Purpose | Threshold |
|---|---|---|---|
| Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8 |
| Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9 |
| AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7 |
| ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7 |
| GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7 |
| Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension |
| Contains | Deterministic | Check that output contains a required substring | binary |
| Regex | Deterministic | Validate output against a regular expression | binary |
| JSONPath | Deterministic | Assert a value inside JSON output | binary |
| FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config |
| ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary |
| ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary |
| ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary |
| ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config |
| ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary |
| ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary |
| ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary |
| ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary |
| ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary |
| OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config |
| ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0 |
| ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8 |
| RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary |
| ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary |
| StepBudget | Trajectory | Keep tool-call count within a configured budget | binary |
| Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric |
| Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks |
| Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate |
| WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max |
| WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max |
Deterministic Metrics
Deterministic metrics do not call an LLM judge. They are fast, cheap, and reproducible, making them useful for prechecks, output length budgets, tool policy checks, and structured output validation.
- Contains: substring presence check.
- Regex: regular-expression output validation.
- JSONPath: exact value check at a JSON path.
- FieldCount: configurable minimum non-null JSON fields.
- OutputLengthBudget: configurable rune and word limits.
- ArtifactSubset: partial JSON structure validation.
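To make "fast, cheap, and reproducible" concrete, here is a minimal standard-library sketch of three such checks. It does not use go-eval and is not its implementation; the function names `containsCheck`, `regexCheck`, and `withinBudget` are hypothetical, and treating a limit of 0 as "no limit" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// containsCheck passes when output includes the required substring.
func containsCheck(output, want string) bool {
	return strings.Contains(output, want)
}

// regexCheck passes when output matches the pattern.
// An invalid pattern is reported as an error rather than a failed check.
func regexCheck(output, pattern string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	return re.MatchString(output), nil
}

// withinBudget passes when output stays within both limits.
// ASSUMPTION: a limit of 0 means "no limit" in this sketch.
func withinBudget(output string, maxRunes, maxWords int) bool {
	if maxRunes > 0 && utf8.RuneCountInString(output) > maxRunes {
		return false
	}
	if maxWords > 0 && len(strings.Fields(output)) > maxWords {
		return false
	}
	return true
}

func main() {
	out := "You can cancel from Billing > Subscription."
	matched, err := regexCheck(out, `Billing\s*>\s*Subscription`)
	if err != nil {
		panic(err)
	}
	fmt.Println(containsCheck(out, "cancel"), matched, withinBudget(out, 0, 180))
}
```

Because none of these functions depend on a model or on the network, the same input always gives the same verdict, which is what makes checks of this kind suitable as guards in front of judge metrics.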
Agent Scenarios
Use Runner.RunScenario for ordered multi-step agent flows where correctness depends on accumulated history, artifacts, state, tool policy, and per-step contracts.
r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.DefaultTierFilter(),
)
result := r.RunScenario(t, eval.Scenario{
	Name: "planning_to_route_ready",
	Tier: "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting",
			Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request",
			Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})
if !result.Passed {
	t.Fatalf("scenario failed")
}

Scenario result sinks include normal metric rows plus a _scenario_summary row with step names, tool calls, emitted artifact keys, failed metrics, repeat counts, and redacted metadata. Set Step.ExpectFail for negative cases that should fail their checks.
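The scenario above sets a repeat count of 3 and a pass rate of 2.0 / 3.0. As a rough illustration of how a pass-rate rule can be evaluated, the sketch below aggregates a list of run outcomes. The `repeatOutcome` type and `aggregate` function are hypothetical names for this example, and the rule "passes when passes / runs >= pass rate" is an assumption about the semantics, not a statement of how go-eval computes it.

```go
package main

import "fmt"

// repeatOutcome summarizes several runs of the same scenario or metric.
type repeatOutcome struct {
	Runs   int
	Passes int
	Rate   float64
	Passed bool
}

// aggregate passes when the share of passing runs meets minRate.
// ASSUMPTION: zero runs never pass.
func aggregate(results []bool, minRate float64) repeatOutcome {
	out := repeatOutcome{Runs: len(results)}
	for _, ok := range results {
		if ok {
			out.Passes++
		}
	}
	if out.Runs > 0 {
		out.Rate = float64(out.Passes) / float64(out.Runs)
	}
	out.Passed = out.Runs > 0 && out.Rate >= minRate
	return out
}

func main() {
	// Two of three runs pass, which meets a 2/3 pass rate.
	fmt.Printf("%+v\n", aggregate([]bool{true, false, true}, 2.0/3.0))
	// One of three runs passes, which does not.
	fmt.Printf("%+v\n", aggregate([]bool{true, false, false}, 2.0/3.0))
}
```

Writing the threshold as 2.0 / 3.0 rather than a rounded literal such as 0.67 matters under this rule: two passes out of three yield exactly the same float64 value as the constant, whereas 0.67 would reject them.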
Grouped Contracts
Use Contract to group several checks into one named requirement. It keeps the report readable while preserving per-check dimensions for debugging.
readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key: "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
	StopOnFailure: true,
}
r.Run(t, readyRoute, c)

Artifact Checks
Use Case.Artifacts for named JSON payloads that should be validated separately from final prose: route state, planner output, budget data, tool traces, or workflow state. v0.8 adds absence checks, array exclusion, JSON subsets, wildcard paths, output length budgets, and normalizers.
c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}
fold := eval.ChainNormalizers(eval.CaseFoldNormalizer(), eval.SpanishASCIIFoldNormalizer())
r.Run(t, eval.ArtifactExists{Key: "route"}, c)
r.Run(t, eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"}, c)
r.Run(t, eval.ArtifactNumberLTE{Key: "route", Path: "total_minutes", Max: 120}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{Key: "route", Path: "stops[*].name", Expected: "Aeropuerto"}, c)
r.Run(t, eval.ArtifactSubset{Key: "route", Expected: json.RawMessage(`{"status":"ready"}`)}, c)

Trajectory Checks
Use Case.Turns and Case.ExpectedToolCalls for conversation and tool-use workflows. JSON datasets can include optional turns and expected_tool_calls fields.
c := eval.Case{
	Input: "Where is order 42?",
	Output: "Order 42 arrives tomorrow.",
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{
				Name: "orders.lookup",
				Arguments: json.RawMessage(`{"order_id":"42"}`),
				Result: "delivery_date=tomorrow",
			},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}
r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

ToolCallAccuracy supports MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. Arguments compare as normalized JSON when MatchArgs is enabled. Required and forbidden tool checks support exact names and glob-style patterns.
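Two ideas in the paragraph above, comparing arguments as normalized JSON and matching tool names against glob-style patterns, can be approximated with the standard library. The sketch below is an illustration under assumptions, not go-eval's code: `sameJSON` and `matchesAny` are hypothetical helpers, "normalized" is taken to mean "equal after decoding", and `path.Match` stands in for whatever glob syntax the library actually implements.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"reflect"
)

// sameJSON reports whether two JSON documents decode to the same value,
// so key order and whitespace do not affect the comparison.
func sameJSON(a, b []byte) (bool, error) {
	var va, vb any
	if err := json.Unmarshal(a, &va); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(va, vb), nil
}

// matchesAny reports whether name matches any exact name or glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	eq, err := sameJSON(
		[]byte(`{"order_id":"42","express":false}`),
		[]byte(`{ "express": false, "order_id": "42" }`),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(eq)                                                      // prints true
	fmt.Println(matchesAny([]string{"orders.*"}, "orders.lookup"))       // prints true
	fmt.Println(matchesAny([]string{"orders.refund*"}, "orders.lookup")) // prints false
}
```

Note that decoding keeps JSON types distinct, so the string "42" and the number 42 do not compare equal here. Also, `path.Match` never lets `*` cross a `/`, which is harmless for dotted tool names but may differ from the library's own pattern rules.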
Tier Filtering
Use DefaultTierFilter when you want GOEVAL_TIER to select only critical, standard, or extended cases. The filter is opt-in so ordinary runners ignore the environment variable.
r := eval.NewRunner(judge, eval.DefaultTierFilter())

# fast CI slice
GOEVAL=1 GOEVAL_TIER=critical go test ./...
# broader pre-merge slice
GOEVAL=1 GOEVAL_TIER=critical,standard go test ./...

Repeat And Budgets
Wrap noisy judge metrics with Repeat when pass rate matters, or add token and latency budgets when resource usage is part of correctness.
r.Run(t, eval.Repeat{
	Metric: eval.Faithfulness{Threshold: 0.8},
	N: 3,
	PassRate: 2.0 / 3.0,
}, c)
r.Run(t, eval.WithTokenBudget(1200, eval.Faithfulness{Threshold: 0.8}), c)
r.Run(t, eval.WithLatencyBudget(2*time.Second, eval.AnswerRelevancy{Threshold: 0.7}), c)

Save And Compare Results
Add a result sink to persist JSONL rows. v0.8 can compare two result files, summarize one result file, and write scenario summary rows for multi-step runs. Use WithRedactors when reasons or metadata may contain sensitive IDs.
r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize .eval-results/results.jsonl

Use compare.CaseIDFromMetadata when the conventional Case.Metadata["case_id"] key should identify rows across runs.
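To show why a stable case identifier matters when comparing runs, here is a small standard-library sketch that finds regressions between two JSONL inputs. Everything about the row shape is an assumption made for illustration: the `row` type and its `case_id`, `metric`, and `passed` fields are hypothetical and may not match the rows go-eval writes, and `parseRows` and `regressions` are not part of the library or the CLI.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// row is a HYPOTHETICAL result row; the real JSONL schema may differ.
type row struct {
	CaseID string `json:"case_id"`
	Metric string `json:"metric"`
	Passed bool   `json:"passed"`
}

// parseRows reads one JSON object per line and indexes pass/fail
// by "case_id/metric". Blank lines are skipped.
func parseRows(jsonl string) (map[string]bool, error) {
	rows := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var rw row
		if err := json.Unmarshal([]byte(line), &rw); err != nil {
			return nil, err
		}
		rows[rw.CaseID+"/"+rw.Metric] = rw.Passed
	}
	return rows, sc.Err()
}

// regressions lists keys that passed in old but fail in cur, sorted.
func regressions(old, cur map[string]bool) []string {
	var out []string
	for key, passed := range old {
		if now, ok := cur[key]; passed && ok && !now {
			out = append(out, key)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	old, err := parseRows(`{"case_id":"a","metric":"Contains","passed":true}`)
	if err != nil {
		panic(err)
	}
	cur, err := parseRows(`{"case_id":"a","metric":"Contains","passed":false}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(regressions(old, cur)) // prints [a/Contains]
}
```

Rows can only be paired across files if both runs produce the same key for the same case, which is the role a consistent `case_id` plays in this sketch.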
CLI
Install the optional CLI for common workflows:
go install github.com/igcodinap/go-eval/cmd/goeval@latest
goeval test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize current/results.jsonl
goeval version

CI/CD
Enable evaluations explicitly with GOEVAL=1. Without it, evals skip and normal test runs stay fast.
Install DefaultTierFilter on the runner when CI should select tiers with GOEVAL_TIER.
# Enable evals
GOEVAL=1 go test ./...
# Save result rows
GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...
# Trace judge prompts and responses when debugging
GOEVAL=1 GOEVAL_TRACE=1 go test -v ./...
# Critical tier only
GOEVAL=1 GOEVAL_TIER=critical go test ./...

Troubleshooting
Evals are skipped unexpectedly
Confirm GOEVAL=1 is set before running tests.
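When diagnosing a skip, it can help to reason about the decision as a small function of the environment. The sketch below models a runner that has a tier filter installed. It is an approximation under assumptions: `shouldRun` is a hypothetical helper, and the rule that an empty GOEVAL_TIER selects every tier is a guess at the behavior rather than something this page confirms.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldRun decides whether a case with the given tier runs, and if not, why.
// getenv is injected so the logic can be exercised without changing the
// process environment; in a test you would pass os.Getenv.
func shouldRun(getenv func(string) string, tier string) (bool, string) {
	if getenv("GOEVAL") != "1" {
		return false, "GOEVAL is not set to 1"
	}
	selected := getenv("GOEVAL_TIER")
	if selected == "" {
		// ASSUMPTION: no tier selection means every tier runs.
		return true, ""
	}
	for _, want := range strings.Split(selected, ",") {
		if strings.TrimSpace(want) == tier {
			return true, ""
		}
	}
	return false, "tier " + tier + " is not in GOEVAL_TIER=" + selected
}

func main() {
	env := map[string]string{"GOEVAL": "1", "GOEVAL_TIER": "critical,standard"}
	getenv := func(key string) string { return env[key] }
	for _, tier := range []string{"critical", "extended"} {
		ok, why := shouldRun(getenv, tier)
		fmt.Println(tier, ok, why)
	}
}
```

In a real test the returned reason could be passed to `t.Skip`, so the test log states which variable caused the skip.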
Trace output is missing
Use both GOEVAL=1 and GOEVAL_TRACE=1, and run tests with -v so t.Log output is visible.
Judge calls fail intermittently
Verify credentials, rate limits, and model availability. Use deterministic prechecks to reduce judge call volume.
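The effect of a deterministic precheck on judge call volume can be seen in a minimal sketch: the expensive check runs only when the cheap guard passes. The `result` and `check` types and the `precheck` function below are simplified stand-ins written for this example, not go-eval's types, and the judge is a counter rather than a real model call.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a simplified stand-in for an evaluation result.
type result struct {
	Passed bool
	Reason string
}

// check is any function that evaluates an output.
type check func(output string) result

// precheck runs the cheap guard first and calls the expensive check
// only when the guard passes.
func precheck(guard, expensive check) check {
	return func(output string) result {
		if r := guard(output); !r.Passed {
			return result{Passed: false, Reason: "precheck failed: " + r.Reason}
		}
		return expensive(output)
	}
}

func main() {
	judgeCalls := 0
	guard := func(out string) result {
		if strings.Contains(out, "cancel") {
			return result{Passed: true}
		}
		return result{Reason: `output does not mention "cancel"`}
	}
	judge := func(out string) result { // stand-in for a paid LLM judge call
		judgeCalls++
		return result{Passed: true}
	}
	m := precheck(guard, judge)

	r1 := m("Please contact sales.")
	fmt.Println(r1.Passed, r1.Reason, judgeCalls)
	r2 := m("You can cancel from Billing.")
	fmt.Println(r2.Passed, r2.Reason, judgeCalls)
}
```

Outputs that fail the guard never reach the judge, so they cannot trigger rate limits or transient judge errors.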
Need detailed API reference
Use package docs: go doc github.com/igcodinap/go-eval