go-eval

LLM evaluation for Go, inside standard go test.

go-eval v0.8 combines LLM-as-judge metrics, deterministic JSON and artifact checks, typed tool trajectories, multi-step agent scenarios, grouped contracts, tiered CI slices, repeatability helpers, JSONL reporting, and optional judge adapters while keeping the core stdlib-only.

Go-native

Runs through testing.T, benchmarks, subtests, -parallel, and CI.

Agent-aware

Checks turns, tools, artifacts, scenario state, and step contracts.

Local-first

Opt-in eval gate, stdlib core, JSONL output, OpenAI or Ollama adapters.

Install

go get github.com/igcodinap/go-eval
go install github.com/igcodinap/go-eval/cmd/goeval@latest

Quick Start

Write evaluation cases as standard Go tests. Use keyed fields in Case literals; v0.4 and later require them.

package evaltest

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestRAGAnswer(t *testing.T) {
	judge := newMyJudge(t)
	r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

	c := eval.Case{
		Input:   "What's the capital of France?",
		Output:  myRAG.Answer("What's the capital of France?"),
		Context: []string{"Paris is the capital of France."},
		Metadata: map[string]any{
			"flow": "rag.answer", "tier": "critical", "dataset": "capitals/v1",
		},
	}

	r.Run(t, eval.Faithfulness{Threshold: 0.8}, c)
	r.Run(t, eval.Hallucination{Threshold: 0.9}, c)
}

Run with: GOEVAL=1 go test ./...

CI-safe by default

Without GOEVAL=1, eval runs are skipped. Set GOEVAL_TRACE=1 only when you need prompt and response logs.
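The opt-in gate can be pictured as a small environment check. The sketch below uses only the standard library; evalEnabled is a hypothetical helper, and the assumption that only the exact value "1" enables evals is taken from the documented GOEVAL=1 usage. The Runner does its own gating, so this is an illustration, not its implementation.

```go
package main

import (
	"fmt"
	"os"
)

// evalEnabled reports whether evals are switched on.
// Assumption: only the exact value "1" counts, mirroring GOEVAL=1.
// The lookup function is injected so the check is easy to test.
func evalEnabled(getenv func(string) string) bool {
	return getenv("GOEVAL") == "1"
}

func main() {
	if !evalEnabled(os.Getenv) {
		// Inside a test, this branch would typically call t.Skip instead.
		fmt.Println("evals skipped: set GOEVAL=1 to run them")
		return
	}
	fmt.Println("evals enabled")
}
```

Because the gate is opt-in, a plain go test ./... in CI never reaches a judge call unless the variable is set explicitly.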

Metrics

v0.8 includes LLM-as-Judge, Deterministic, Trajectory, and Wrapper metrics.

Metric | Type | Purpose | Threshold
Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8
Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9
AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7
ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7
GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7
Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension
Contains | Deterministic | Check that output contains a required substring | binary
Regex | Deterministic | Validate output against a regular expression | binary
JSONPath | Deterministic | Assert a value inside JSON output | binary
FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config
ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary
ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary
ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary
ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config
ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary
ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary
ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary
ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary
ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary
OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config
ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0
ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8
RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary
ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary
StepBudget | Trajectory | Keep tool-call count within a configured budget | binary
Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric
Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks
Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate
WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max
WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max
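The Precheck idea in the table, gating an expensive judge metric behind a cheap deterministic guard, can be sketched in a few lines. Everything here is hypothetical: Result, Metric, precheck, and contains are stand-ins written for this example and do not mirror go-eval's types or constructor names. Whether the library reports a guarded metric as skipped or as failed is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Result is a hypothetical stand-in for a metric result.
type Result struct {
	Score   float64
	Passed  bool
	Skipped bool
	Reason  string
}

// Metric is a hypothetical scoring function over an output string.
type Metric func(output string) Result

// precheck runs guard first and calls expensive only when guard passes.
func precheck(guard, expensive Metric) Metric {
	return func(output string) Result {
		if g := guard(output); !g.Passed {
			return Result{Skipped: true, Reason: "guard failed: " + g.Reason}
		}
		return expensive(output)
	}
}

// contains is a cheap deterministic guard.
func contains(sub string) Metric {
	return func(output string) Result {
		if strings.Contains(output, sub) {
			return Result{Score: 1, Passed: true}
		}
		return Result{Reason: fmt.Sprintf("missing %q", sub)}
	}
}

func main() {
	calls := 0
	judge := func(output string) Result { // stands in for a paid LLM call
		calls++
		return Result{Score: 0.9, Passed: true}
	}
	m := precheck(contains("Paris"), judge)

	r := m("I do not know.")
	fmt.Println(r.Skipped, calls) // true 0

	r = m("Paris is the capital of France.")
	fmt.Println(r.Passed, calls) // true 1
}
```

The judge is invoked only once across the two cases, which is the saving the wrapper is meant to provide.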

Agent Scenarios

Use RunScenario for ordered multi-turn flows where each step can have its own input, tool policy, artifact contract, timeout, state, and repeat pass-rate requirement.

result := r.RunScenario(t, eval.Scenario{
	Name:  "planning_to_route_ready",
	Tier:  "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting", Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request", Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})

if !result.Passed {
	t.Fatal("scenario failed")
}

Scenario runs write normal metric rows plus a _scenario_summary JSONL row when a result sink is configured.
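The Repeat setting in the scenario above (N: 3, PassRate: 2.0 / 3.0) reads as "at least two of three runs must pass". A minimal sketch of that arithmetic follows; meetsPassRate is a hypothetical helper, and the small epsilon is an assumption added here to guard against floating-point rounding, not something the library is known to do.

```go
package main

import "fmt"

// meetsPassRate reports whether passed out of n runs satisfies the required rate.
// The epsilon absorbs rounding error when the rate is a fraction such as 2/3.
func meetsPassRate(passed, n int, required float64) bool {
	if n <= 0 {
		return false
	}
	const eps = 1e-9
	return float64(passed)/float64(n)+eps >= required
}

func main() {
	fmt.Println(meetsPassRate(2, 3, 2.0/3.0)) // true
	fmt.Println(meetsPassRate(1, 3, 2.0/3.0)) // false
}
```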

Grouped Contracts

Contract turns several low-level checks into one named product requirement with per-check dimensions. It is especially useful inside scenario steps.

readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key:      "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
}

r.Run(t, readyRoute, c)

Artifact Checks

Case.Artifacts stores named structured JSON outputs alongside text output. v0.8 adds absence checks, array exclusion, JSON subset checks, wildcard paths, output length budgets, and normalizers.

c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}

fold := eval.ChainNormalizers(
	eval.CaseFoldNormalizer(),
	eval.SpanishASCIIFoldNormalizer(),
)

r.Run(t, eval.ArtifactJSONPath{
	Key: "route", Path: "status", Expected: "ready",
}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{
	Key: "route", Path: "stops[*].name", Expected: "Aeropuerto",
}, c)
r.Run(t, eval.ArtifactSubset{
	Key: "route", Expected: json.RawMessage(`{"status":"ready"}`),
}, c)

Trajectory Checks

Use Turn, ToolCall, Case.Turns, and Case.ExpectedToolCalls to evaluate agent tool-use paths without leaving the normal metric pipeline.

c := eval.Case{
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}

r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

Match modes are MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. JSON datasets can include optional turns and expected_tool_calls fields.
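The precision, recall, and F1 figures that a metric like ToolCallF1 reports follow the usual definitions. The sketch below computes them over tool-call names only, treating calls as a multiset. f1ByName is a hypothetical helper; it ignores arguments (so it does not model MatchArgs), and returning zero for empty inputs is an assumption rather than documented library behavior.

```go
package main

import "fmt"

// f1ByName computes precision, recall, and F1 over tool-call names,
// treating calls as a multiset so repeated calls are counted.
func f1ByName(expected, actual []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int, len(expected))
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	expected := []string{"orders.lookup", "orders.status"}
	actual := []string{"orders.lookup", "email.send"}
	p, r, f := f1ByName(expected, actual)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
	// prints: precision=0.50 recall=0.50 f1=0.50
}
```

With one of two expected calls matched and one unexpected call made, the score lands at 0.5, which would fail the 0.8 threshold used in the example above.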

Benchmarks

Track latency, token usage, and score quality across prompt or model changes using standard Go benchmarks and benchstat.

func BenchmarkRAGLatency(b *testing.B) {
	r := eval.NewRunner(newMyJudge(b))
	c := eval.Case{Input: "...", Output: "...", Context: docs}

	eval.Bench(b, r, eval.Faithfulness{Threshold: 0.8}, c)
}

ns/op | Latency per judge call
tokens/op | Mean tokens consumed per call
score_mean | Average score across iterations
score_stddev | Score consistency across runs

CI/CD

Persist JSONL results, compare baselines, summarize one run, redact sensitive metadata, and filter case tiers while keeping normal CI fast by default.

Install DefaultTierFilter on the runner to use GOEVAL_TIER, and add WithRedactors before writing shared result logs.

GOEVAL=1 GOEVAL_TIER=critical GOEVAL_RESULTS_DIR=.eval-results go test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize .eval-results/results.jsonl

Environment Variables

  • GOEVAL=1 - Enable evaluations
  • GOEVAL_TRACE=1 - Log judge prompts and responses via t.Log
  • GOEVAL_TIER - Filter tiers when DefaultTierFilter is installed
  • GOEVAL_RESULTS_DIR - Write results.jsonl in this directory

CLI

The optional goeval CLI wraps common test and result workflows.

goeval test

Run go test with GOEVAL=1 set.

goeval test ./...

goeval compare

Compare baseline and current JSONL results; exits nonzero on regressions or missing rows.

goeval compare old/results.jsonl new/results.jsonl
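The comparison rule described above (flag regressions and missing rows) can be illustrated with a small JSONL reader. This is only a sketch: the row fields case, metric, and passed are hypothetical and were chosen for the example, the real results.jsonl schema may differ, and readRows and regressions are not part of the goeval CLI.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// row uses hypothetical field names; the real results.jsonl schema may differ.
type row struct {
	Case   string `json:"case"`
	Metric string `json:"metric"`
	Passed bool   `json:"passed"`
}

// readRows decodes one JSON object per non-empty line, keyed by case and metric.
func readRows(jsonl string) (map[string]row, error) {
	rows := make(map[string]row)
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var r row
		if err := json.Unmarshal([]byte(line), &r); err != nil {
			return nil, fmt.Errorf("decode row: %w", err)
		}
		rows[r.Case+"/"+r.Metric] = r
	}
	return rows, sc.Err()
}

// regressions lists baseline keys that are missing from current,
// or that passed in the baseline and fail in current.
func regressions(baseline, current map[string]row) []string {
	var out []string
	for key, old := range baseline {
		cur, ok := current[key]
		if !ok || (old.Passed && !cur.Passed) {
			out = append(out, key)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	oldRun := `{"case":"capitals","metric":"Faithfulness","passed":true}
{"case":"capitals","metric":"Hallucination","passed":true}`
	newRun := `{"case":"capitals","metric":"Faithfulness","passed":false}`

	base, err := readRows(oldRun)
	if err != nil {
		panic(err)
	}
	cur, err := readRows(newRun)
	if err != nil {
		panic(err)
	}
	fmt.Println(regressions(base, cur))
	// prints: [capitals/Faithfulness capitals/Hallucination]
}
```

A CI step built on this rule would exit nonzero whenever the returned slice is non-empty, matching the behavior described for goeval compare.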

goeval summarize

Summarize one JSONL result file by pass/fail, score, latency, and token aggregates.

goeval summarize current/results.jsonl

goeval version

Print CLI version information.

goeval version

Core Concepts

Case
Input, output, expected value, context, artifacts, turns, expected tool calls, metadata, and timeout.
Scenario
Ordered multi-step agent flow with history, artifacts, state, tools, and repeats.
Contract
A named group of checks reported as one business-level pass/fail result.
Artifacts
Named structured JSON outputs for deterministic workflow checks.
Trajectory
Typed turns and tool calls for agent path evaluation.
Metric
A stateless scoring function with thresholded pass/fail behavior.
Precheck
Conditional wrapper that gates expensive metrics behind cheap checks.
Repeat
Wrapper for repeated runs, pass-rate aggregation, and score variance.
TierFilter
GOEVAL_TIER-driven case slicing when DefaultTierFilter is installed.
Normalizer
String comparison hook for deterministic checks where case or accents vary.
Judge
Concurrency-safe LLM-as-judge implementation returning scores and reasons.
Runner
Executes cases with metrics, handles GOEVAL gating, assertions, and result sinks.
CaseMetadata
Standard keys such as flow, tier, and dataset for filtering and reports.
MockJudge
Scripted judge for tests that should not call an LLM.