go-eval
LLM evaluation for Go, inside standard go test.
go-eval v0.8 combines LLM-as-judge metrics, deterministic JSON and artifact checks, typed tool trajectories, multi-step agent scenarios, grouped contracts, tiered CI slices, repeatability helpers, JSONL reporting, and optional judge adapters while keeping the core stdlib-only.
Go-native
Runs through testing.T, benchmarks, subtests, -parallel, and CI.
Agent-aware
Checks turns, tools, artifacts, scenario state, and step contracts.
Local-first
Opt-in eval gate, stdlib core, JSONL output, OpenAI or Ollama adapters.
Install
go get github.com/igcodinap/go-eval
go install github.com/igcodinap/go-eval/cmd/goeval@latest
Quick Start
Write evaluation cases as standard Go tests. Case literals must use keyed fields; this is required from v0.4 onward.
package evaltest

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestRAGAnswer(t *testing.T) {
	judge := newMyJudge(t)
	r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))
	c := eval.Case{
		Input: "What's the capital of France?",
		Output: myRAG.Answer("What's the capital of France?"),
		Context: []string{"Paris is the capital of France."},
		Metadata: map[string]any{
			"flow": "rag.answer", "tier": "critical", "dataset": "capitals/v1",
		},
	}
	r.Run(t, eval.Faithfulness{Threshold: 0.8}, c)
	r.Run(t, eval.Hallucination{Threshold: 0.9}, c)
}

Run with: GOEVAL=1 go test ./...
CI-safe by default
Without GOEVAL=1, eval runs skip. Use GOEVAL_TRACE=1 only when you need prompt and response logs.
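A minimal sketch of how an opt-in gate of this kind can work, using only the standard library. The helper name evalEnabled and the rule that only the exact value 1 opens the gate are assumptions made for illustration; this is not go-eval's actual implementation.

```go
package main

import (
	"fmt"
	"os"
)

// evalEnabled reports whether the opt-in gate is open.
// Assumption: only the exact value "1" enables evals, mirroring GOEVAL=1.
// The lookup function is injected so the gate can be checked without
// touching the real process environment.
func evalEnabled(getenv func(string) string) bool {
	return getenv("GOEVAL") == "1"
}

func main() {
	if !evalEnabled(os.Getenv) {
		fmt.Println("evals skipped: set GOEVAL=1 to run them")
		return
	}
	fmt.Println("evals enabled")
}
```

Inside a test, the same check would typically sit in front of a call to t.Skip, so a plain go test ./... never reaches the judge.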
Metrics
v0.8 includes LLM-as-Judge, Deterministic, Trajectory, and Wrapper metrics.
| Metric | Type | Purpose | Threshold |
|---|---|---|---|
| Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8 |
| Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9 |
| AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7 |
| ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7 |
| GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7 |
| Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension |
| Contains | Deterministic | Check that output contains a required substring | binary |
| Regex | Deterministic | Validate output against a regular expression | binary |
| JSONPath | Deterministic | Assert a value inside JSON output | binary |
| FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config |
| ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary |
| ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary |
| ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary |
| ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config |
| ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary |
| ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary |
| ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary |
| ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary |
| ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary |
| OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config |
| ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0 |
| ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8 |
| RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary |
| ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary |
| StepBudget | Trajectory | Keep tool-call count within a configured budget | binary |
| Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric |
| Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks |
| Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate |
| WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max |
| WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max |
Agent Scenarios
Use RunScenario for ordered multi-turn flows where each step can have its own input, tool policy, artifact contract, timeout, state, and repeat pass-rate requirement.
result := r.RunScenario(t, eval.Scenario{
	Name: "planning_to_route_ready",
	Tier: "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting", Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request", Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})
if !result.Passed {
	t.Fatalf("scenario failed")
}

Scenario runs write normal metric rows plus a _scenario_summary JSONL row when a result sink is configured.
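The repeat setting in the scenario above (N: 3 with PassRate: 2.0 / 3.0) reads as "at least two of three runs must pass". The following stdlib-only sketch shows one way such a threshold could be evaluated. The function name meetsPassRate and the small floating-point tolerance are assumptions of this sketch, not part of go-eval.

```go
package main

import "fmt"

// epsilon absorbs floating-point rounding when a ratio such as 2/3 is
// compared against a required rate written as 2.0/3.0.
const epsilon = 1e-9

// meetsPassRate reports whether passed out of total runs reaches the
// required pass rate. A run count of zero never satisfies the requirement.
func meetsPassRate(passed, total int, required float64) bool {
	if total <= 0 {
		return false
	}
	return float64(passed)/float64(total)+epsilon >= required
}

func main() {
	outcomes := []bool{true, false, true} // three hypothetical repeat runs
	passed := 0
	for _, ok := range outcomes {
		if ok {
			passed++
		}
	}
	fmt.Println(meetsPassRate(passed, len(outcomes), 2.0/3.0)) // prints "true"
}
```

A pass-rate requirement like this tolerates occasional nondeterministic failures from the agent or judge while still failing a scenario that is broken most of the time.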
Grouped Contracts
Contract turns several low-level checks into one named product requirement with per-check dimensions. It is especially useful inside scenario steps.
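As a rough illustration of the grouping idea, independent of the library, the sketch below folds several checks into one named result that passes only when every check passes and records one pass/fail dimension per check. The check and groupResult types and the runGroup function are hypothetical names invented for this sketch; go-eval's own Contract usage follows.

```go
package main

import (
	"fmt"
	"strings"
)

// check is a hypothetical named predicate over an output string.
type check struct {
	name string
	fn   func(output string) bool
}

// groupResult is one named result with a pass/fail dimension per check.
type groupResult struct {
	Name       string
	Passed     bool
	Dimensions map[string]bool
}

// runGroup evaluates every check without short-circuiting, so each
// dimension is always reported, and passes only when all checks pass.
func runGroup(name, output string, checks []check) groupResult {
	res := groupResult{
		Name:       name,
		Passed:     true,
		Dimensions: make(map[string]bool, len(checks)),
	}
	for _, c := range checks {
		ok := c.fn(output)
		res.Dimensions[c.name] = ok
		if !ok {
			res.Passed = false
		}
	}
	return res
}

func main() {
	checks := []check{
		{"mentions_ready", func(s string) bool { return strings.Contains(s, "ready") }},
		{"max_5_words", func(s string) bool { return len(strings.Fields(s)) <= 5 }},
	}
	res := runGroup("ready_route", "Route is ready.", checks)
	fmt.Println(res.Name, res.Passed, res.Dimensions)
}
```

Running every check even after one fails keeps the per-check dimensions complete, which makes a failed group easier to diagnose from a single result row.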
readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key: "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
}
r.Run(t, readyRoute, c)

Artifact Checks
Case.Artifacts stores named structured JSON outputs alongside text output. v0.8 adds absence checks, array exclusion, JSON subset checks, wildcard paths, output length budgets, and normalizers.
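To make "JSON subset" concrete, here is a stdlib-only sketch of one plausible reading: every key in the expected object must appear in the actual object with a matching value, and extra keys in the actual object are ignored. The names isSubset and jsonSubset are hypothetical, and the choice to compare arrays and scalars by exact equality is an assumption of this sketch; go-eval's ArtifactSubset may behave differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// isSubset reports whether every key/value in want also appears in got.
// Objects are compared recursively; arrays and scalars must match exactly
// (an assumption of this sketch).
func isSubset(want, got any) bool {
	wantObj, ok := want.(map[string]any)
	if !ok {
		return reflect.DeepEqual(want, got)
	}
	gotObj, ok := got.(map[string]any)
	if !ok {
		return false
	}
	for k, wv := range wantObj {
		gv, present := gotObj[k]
		if !present || !isSubset(wv, gv) {
			return false
		}
	}
	return true
}

// jsonSubset decodes both documents and applies isSubset.
func jsonSubset(want, got []byte) (bool, error) {
	var w, g any
	if err := json.Unmarshal(want, &w); err != nil {
		return false, fmt.Errorf("decode expected: %w", err)
	}
	if err := json.Unmarshal(got, &g); err != nil {
		return false, fmt.Errorf("decode actual: %w", err)
	}
	return isSubset(w, g), nil
}

func main() {
	artifact := []byte(`{"status":"ready","total_minutes":98,"stops":[{"name":"Pajaritos"}]}`)
	ok, err := jsonSubset([]byte(`{"status":"ready"}`), artifact)
	fmt.Println(ok, err) // prints "true <nil>"
}
```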
c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}
fold := eval.ChainNormalizers(
	eval.CaseFoldNormalizer(),
	eval.SpanishASCIIFoldNormalizer(),
)
r.Run(t, eval.ArtifactJSONPath{
	Key: "route", Path: "status", Expected: "ready",
}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{
	Key: "route", Path: "stops[*].name", Expected: "Aeropuerto",
}, c)
r.Run(t, eval.ArtifactSubset{
	Key: "route", Expected: json.RawMessage(`{"status":"ready"}`),
}, c)

Trajectory Checks
Use Turn, ToolCall, Case.Turns, and Case.ExpectedToolCalls to evaluate agent tool-use paths without leaving the normal metric pipeline.
c := eval.Case{
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}
r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

Match modes are MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. JSON datasets can include optional turns and expected_tool_calls fields.
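For readers unfamiliar with precision, recall, and F1 applied to tool calls, the sketch below computes them over tool names only, using the standard definitions. The function toolF1 is a hypothetical helper; it ignores arguments and ordering, and matches duplicates one-to-one, which are assumptions of this sketch and may differ from how ToolCallF1 matches calls.

```go
package main

import "fmt"

// toolF1 computes precision, recall, and F1 over tool names, treating both
// slices as multisets so duplicate calls must be matched one-to-one.
func toolF1(actual, expected []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int, len(expected))
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	actual := []string{"orders.lookup", "orders.refund"} // one unexpected extra call
	expected := []string{"orders.lookup"}
	p, r, f := toolF1(actual, expected)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
	// prints "precision=0.50 recall=1.00 f1=0.67"
}
```

In this example the extra call lowers precision but not recall, so an F1 threshold of 0.8 would fail the case even though the expected tool was called.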
Benchmarks
Track latency, token usage, and score quality across prompt or model changes using standard Go benchmarks and benchstat.
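As background for the score statistics reported by benchmark runs, this stdlib-only sketch computes a mean and standard deviation over a set of scores. The helper meanStddev is hypothetical, and the use of the population (rather than sample) standard deviation is an assumption; the document does not say which form go-eval reports.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the mean and population standard deviation of scores.
// An empty slice yields zeros rather than NaN.
func meanStddev(scores []float64) (mean, stddev float64) {
	if len(scores) == 0 {
		return 0, 0
	}
	for _, s := range scores {
		mean += s
	}
	mean /= float64(len(scores))

	var sumSq float64
	for _, s := range scores {
		d := s - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(scores)))
}

func main() {
	// Hypothetical judge scores from three benchmark iterations.
	mean, stddev := meanStddev([]float64{0.8, 0.9, 1.0})
	fmt.Printf("score_mean=%.3f score_stddev=%.3f\n", mean, stddev)
}
```

In a hand-written Go benchmark, custom values like these can be attached to the output with testing.B.ReportMetric, which is how they become visible to benchstat.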
func BenchmarkRAGLatency(b *testing.B) {
	r := eval.NewRunner(newMyJudge(b))
	c := eval.Case{Input: "...", Output: "...", Context: docs}
	eval.Bench(b, r, eval.Faithfulness{Threshold: 0.8}, c)
}

ns/op - Latency per judge call
tokens/op - Mean tokens consumed per call
score_mean - Average score across iterations
score_stddev - Score consistency across runs
CI/CD
Persist JSONL results, compare baselines, summarize one run, redact sensitive metadata, and filter case tiers while keeping normal CI fast by default.
Install DefaultTierFilter on the runner to use GOEVAL_TIER, and add WithRedactors before writing shared result logs.
GOEVAL=1 GOEVAL_TIER=critical GOEVAL_RESULTS_DIR=.eval-results go test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize .eval-results/results.jsonl
Environment Variables
GOEVAL=1 - Enable evaluations
GOEVAL_TRACE=1 - Log judge prompts and responses via t.Log
GOEVAL_TIER - Filter tiers when DefaultTierFilter is installed
GOEVAL_RESULTS_DIR - Write results.jsonl in this directory
CLI
The optional goeval CLI wraps common test and result workflows.
goeval test
Run go test with GOEVAL=1 set.
goeval test ./...
goeval compare
Compare baseline and current JSONL results; exits nonzero on regressions or missing rows.
goeval compare old/results.jsonl new/results.jsonl
goeval summarize
Summarize one JSONL result file by pass/fail, score, latency, and token aggregates.
goeval summarize current/results.jsonl
goeval version
Print CLI version information.
goeval version