go-eval

LLM evaluation for Go, inside standard go test.

go-eval v1.0 keeps its core stdlib-only while covering three areas. For scoring, it provides LLM-as-judge metrics, deterministic JSON and artifact checks, typed tool trajectories, and structured agent traces. For agent workflows, it provides multi-step scenarios and grouped contracts. For operations, it provides tiered CI slices, repeatability helpers, policy-aware summaries, baseline comparison, static reports, judge calibration, and profile-driven eval runs.

Go-native

Runs through testing.T, benchmarks, subtests, -parallel, and CI.

Agent-aware

Checks turns, tools, artifacts, traces, scenario state, and step contracts.

Ops-ready

Profiles, prerequisites, compare policies, reports, calibration, and JSONL output.

Install

go get github.com/igcodinap/go-eval
go install github.com/igcodinap/go-eval/cmd/goeval@latest

Quick Start

Write evaluation cases as standard Go tests. Use keyed fields in Case literals; v0.4 and later require them.

package evaltest

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestRAGAnswer(t *testing.T) {
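	// newMyJudge and myRAG are your own helpers: a Judge implementation
	// (see Judge Adapters) and the system under test.
	judge := newMyJudge(t)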
	r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

	c := eval.Case{
		Input:   "What's the capital of France?",
		Output:  myRAG.Answer("What's the capital of France?"),
		Context: []string{"Paris is the capital of France."},
		Metadata: map[string]any{
			"flow": "rag.answer", "tier": "critical", "dataset": "capitals/v1",
		},
	}

	r.Run(t, eval.Faithfulness{Threshold: 0.8}, c)
	r.Run(t, eval.Hallucination{Threshold: 0.9}, c)
}

Run with: GOEVAL=1 go test ./...

CI-safe by default

Without GOEVAL=1, eval runs are skipped. Set GOEVAL_TRACE=1 only when you need prompt and response logs.
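The opt-in gate can be pictured with a few lines of plain Go. This is only a sketch of the idea, not go-eval's code: `evalEnabled` is a hypothetical helper, and treating `"1"` as the only enabling value is an assumption, since the text above only documents `GOEVAL=1`. Inside a real test the skip branch would call `t.Skip` instead of printing.

```go
package main

import (
	"fmt"
	"os"
)

// evalEnabled reports whether opt-in eval runs are switched on.
// Hypothetical helper: only the literal value "1" enables evals here.
func evalEnabled(getenv func(string) string) bool {
	return getenv("GOEVAL") == "1"
}

func main() {
	if !evalEnabled(os.Getenv) {
		// In a test function this would be t.Skip(...).
		fmt.Println("skip: set GOEVAL=1 to run evals")
		return
	}
	fmt.Println("running evals")
}
```

Passing the lookup function in makes the gate easy to exercise without touching the real environment.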

Implementation Example

See how a Go API can wire go-eval into an agent workflow. The travel-planning example covers a goeval.json profile, shared runner options, redacted JSONL results, route artifact contracts, custom metrics, scenario state, tool policies, and compare gates.

Profile

Run critical integration evals with prerequisites and policy settings.

Contract

Validate route artifacts before judging final assistant prose.

Scenario

Exercise a multi-step trip-planning agent with required and forbidden tools.


Metrics

go-eval includes LLM-as-Judge, Deterministic, Trajectory, and Wrapper metrics.

Metric	Type	Purpose	Threshold
Faithfulness	LLM-as-Judge	Verify RAG outputs do not contradict retrieved context	0.8
Hallucination	LLM-as-Judge	Catch outputs that invent facts outside the supplied context	0.9
AnswerRelevancy	LLM-as-Judge	Ensure the output directly addresses the user input	0.7
ContextPrecision	LLM-as-Judge	Check whether retrieved context documents are relevant to the input	0.7
ContextRecall	LLM-as-Judge	Check whether retrieved context contains the expected answer or facts	0.7
AnswerCorrectness	LLM-as-Judge	Verify the output matches the expected answer semantically	0.7
NoiseSensitivity	LLM-as-Judge	Ensure the output ignores irrelevant or distracting retrieved context	0.7
TaskCompletion	LLM-as-Judge	Verify the agent completed the user task end-to-end	0.8
PlanAdherence	LLM-as-Judge	Check whether the agent followed the expected plan or workflow	0.7
GEval	LLM-as-Judge	Score custom criteria that built-in metrics do not cover	0.7
Compound	LLM-as-Judge	Evaluate several related rubric dimensions in one judge call	per-dimension
Contains	Deterministic	Check that output contains a required substring	binary
Regex	Deterministic	Validate output against a regular expression	binary
JSONPath	Deterministic	Assert a value inside JSON output	binary
FieldCount	Deterministic	Enforce a minimum number of non-null JSON fields	config
ArtifactExists	Deterministic	Check that a named structured artifact exists on the case	binary
ArtifactNotExists	Deterministic	Assert that an unwanted structured artifact was not emitted	binary
ArtifactJSONPath	Deterministic	Assert a JSON value inside a named artifact	binary
ArtifactFieldCount	Deterministic	Require enough non-null fields inside an artifact object	config
ArtifactNumberLTE	Deterministic	Check that a numeric artifact value stays under a maximum	binary
ArtifactArrayContains	Deterministic	Check that an artifact array contains an expected value	binary
ArtifactArrayNotContains	Deterministic	Check that an artifact array excludes an unwanted value	binary
ArtifactArrayMinLen	Deterministic	Require an artifact array to have at least a minimum length	binary
ArtifactSubset	Deterministic	Assert that an artifact contains a partial expected JSON structure	binary
ToolArgumentAccuracy	Deterministic	Verify tool names and JSON arguments match expectations	1.0
StepEfficiency	Deterministic	Verify the trace stays within step and tool-call budgets	1.0
OutputLengthBudget	Deterministic	Keep final output within rune or word limits	config
ToolCallAccuracy	Trajectory	Compare actual tool calls with expected calls under a match mode	1.0
ToolCallF1	Trajectory	Report precision, recall, and F1 for tool-call matches	0.8
RequiredTools	Trajectory	Fail when required tool names or name patterns are absent	binary
ForbiddenTool	Trajectory	Fail when disallowed tool names or patterns appear in the trajectory	binary
StepBudget	Trajectory	Keep tool-call count within a configured budget	binary
Precheck	Wrapper	Skip expensive LLM metrics when a cheap guard fails	wrapped metric
Contract	Wrapper	Group several deterministic or judge checks into one named result	all checks
Repeat	Wrapper	Run a metric multiple times and aggregate pass rate plus score stats	configurable pass rate
WithTokenBudget	Wrapper	Fail a wrapped metric when token usage exceeds a maximum	token max
WithLatencyBudget	Wrapper	Fail a wrapped metric when latency exceeds a maximum	duration max

Agent Scenarios

Use RunScenario for ordered multi-turn flows where each step can have its own input, tool policy, artifact contract, timeout, state, and repeat pass-rate requirement.

result := r.RunScenario(t, eval.Scenario{
	Name:  "planning_to_route_ready",
	Tier:  "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting", Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request", Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})

if !result.Passed {
	t.Fatal("scenario failed")
}

Scenario runs write normal metric rows plus a _scenario_summary JSONL row when a result sink is configured. Use LoadScenarios and BindScenarioDrivers to define scenarios in JSON while keeping drivers in Go.
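To make the repeat requirement concrete, here is a minimal sketch of how a pass-rate gate such as `N: 3, PassRate: 2.0 / 3.0` could be evaluated. It is an illustration under assumptions, not the library's implementation: `repeatPassed` is a hypothetical name, and the small epsilon is there only so that floating-point rounding does not reject a run that sits exactly on the required rate.

```go
package main

import "fmt"

// repeatPassed returns the observed pass rate and whether it meets minRate.
// An empty result set never passes.
func repeatPassed(results []bool, minRate float64) (rate float64, ok bool) {
	if len(results) == 0 {
		return 0, false
	}
	passes := 0
	for _, r := range results {
		if r {
			passes++
		}
	}
	rate = float64(passes) / float64(len(results))
	const eps = 1e-9 // tolerate rounding at the boundary
	return rate, rate+eps >= minRate
}

func main() {
	rate, ok := repeatPassed([]bool{true, false, true}, 2.0/3.0)
	fmt.Printf("rate=%.2f ok=%v\n", rate, ok)
}
```

With two passes out of three attempts the gate holds; one pass out of three would fail it.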

Grouped Contracts

Contract turns several low-level checks into one named product requirement with per-check dimensions. It is especially useful inside scenario steps.

readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key:      "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
}

r.Run(t, readyRoute, c)

Artifact Checks

Case.Artifacts stores named structured JSON outputs alongside the text output. Deterministic metrics then cover absence checks, array exclusion, JSON subset checks, wildcard paths, and output length budgets, with optional normalizers for string comparison.

c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}

fold := eval.ChainNormalizers(
	eval.CaseFoldNormalizer(),
	eval.SpanishASCIIFoldNormalizer(),
)

r.Run(t, eval.ArtifactJSONPath{
	Key: "route", Path: "status", Expected: "ready",
}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{
	Key: "route", Path: "stops[*].name", Expected: "Aeropuerto",
}, c)
r.Run(t, eval.ArtifactSubset{
	Key: "route", Expected: json.RawMessage(`{"status":"ready"}`),
}, c)
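For readers wondering what a JSON subset check means in practice, the sketch below shows one plausible reading using only `encoding/json`. It is not ArtifactSubset's actual code: `jsonSubset` and `isSubset` are hypothetical names, and the rule for arrays (same length, elements compared pairwise) is an assumption that the real metric may not share.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isSubset reports whether every key in want appears in got with a matching
// value. Objects are compared recursively; arrays must have the same length
// and are compared element by element; scalars must be equal.
func isSubset(want, got any) bool {
	switch w := want.(type) {
	case map[string]any:
		g, ok := got.(map[string]any)
		if !ok {
			return false
		}
		for k, wv := range w {
			gv, present := g[k]
			if !present || !isSubset(wv, gv) {
				return false
			}
		}
		return true
	case []any:
		g, ok := got.([]any)
		if !ok || len(g) != len(w) {
			return false
		}
		for i := range w {
			if !isSubset(w[i], g[i]) {
				return false
			}
		}
		return true
	default:
		return want == got
	}
}

// jsonSubset decodes both documents and applies isSubset.
func jsonSubset(wantRaw, gotRaw []byte) (bool, error) {
	var want, got any
	if err := json.Unmarshal(wantRaw, &want); err != nil {
		return false, fmt.Errorf("decode expected: %w", err)
	}
	if err := json.Unmarshal(gotRaw, &got); err != nil {
		return false, fmt.Errorf("decode artifact: %w", err)
	}
	return isSubset(want, got), nil
}

func main() {
	route := []byte(`{"status":"ready","total_minutes":98,"stops":[{"name":"Pajaritos"}]}`)
	ok, err := jsonSubset([]byte(`{"status":"ready"}`), route)
	fmt.Println(ok, err)
}
```

Extra keys in the artifact are ignored, which is what lets a contract pin down only the fields it cares about.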

Trajectory Checks

Use Turn, ToolCall, Case.Turns, and Case.ExpectedToolCalls to evaluate agent tool-use paths without leaving the normal metric pipeline.

c := eval.Case{
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}

r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

Match modes are MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. JSON datasets can include optional turns and expected_tool_calls fields.
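As a rough illustration of what a tool-call F1 score measures, the sketch below computes precision, recall, and F1 over tool names only. It is a simplified stand-in, not ToolCallF1 itself: `toolCallF1` is a hypothetical function, it ignores arguments (so it does not model `MatchArgs`), and it treats both lists as unordered multisets.

```go
package main

import "fmt"

// toolCallF1 computes precision, recall, and F1 over tool names,
// treating actual and expected calls as multisets (order is ignored).
func toolCallF1(actual, expected []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int)
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	p, r, f := toolCallF1(
		[]string{"orders.lookup", "orders.refund"},
		[]string{"orders.lookup"},
	)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
}
```

In the example, the extra `orders.refund` call lowers precision to one half while recall stays at one, which is the kind of over-calling a 0.8 threshold would catch.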

Structured Traces

Use Case.Trace when your agent can emit structured spans, tool calls, artifact records, or state deltas. Trace IDs link metric rows, scenario summaries, and trace records in downstream reports.

r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.WithTraceSink(eval.DefaultTraceSink()),
)

c := eval.Case{
	Input:   "Find a route and charge the card",
	Output:  answer,
	TraceID: "route-42",
	Trace: &eval.Trace{
		ID:   "route-42",
		Name: "checkout_route",
		Spans: []eval.Span{{
			Name: "charge",
			Kind: "tool_call",
			ToolCall: &eval.ToolCall{
				Name:      "payments.charge",
				Arguments: json.RawMessage(`{"amount":42}`),
			},
		}},
	},
}

When GOEVAL_RESULTS_DIR is set, DefaultTraceSink writes traces.jsonl alongside results.jsonl. Trace writes use the same WithRedactors hooks as result JSONL.
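The idea of running redaction hooks before rows reach a JSONL file can be sketched with the standard library alone. Everything named here (`redactor`, `redactKeys`, `toJSONL`, the `[REDACTED]` marker, and the `card_number` field) is hypothetical and chosen for illustration; the signature of go-eval's own WithRedactors hooks is not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// redactor rewrites one row in place before it is serialized.
type redactor func(row map[string]any)

// redactKeys returns a redactor that masks the given top-level keys.
func redactKeys(keys ...string) redactor {
	return func(row map[string]any) {
		for _, k := range keys {
			if _, ok := row[k]; ok {
				row[k] = "[REDACTED]"
			}
		}
	}
}

// toJSONL applies every redactor to a shallow copy of each row and renders
// one JSON object per line, leaving the caller's rows untouched.
func toJSONL(rows []map[string]any, redactors ...redactor) (string, error) {
	var sb strings.Builder
	for _, row := range rows {
		clone := make(map[string]any, len(row))
		for k, v := range row {
			clone[k] = v
		}
		for _, r := range redactors {
			r(clone)
		}
		line, err := json.Marshal(clone)
		if err != nil {
			return "", err
		}
		sb.Write(line)
		sb.WriteByte('\n')
	}
	return sb.String(), nil
}

func main() {
	rows := []map[string]any{
		{"trace_id": "route-42", "card_number": "4111111111111111"},
	}
	out, err := toJSONL(rows, redactKeys("card_number"))
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Redacting a copy rather than the original keeps in-memory results usable for assertions while the persisted log stays clean.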

Benchmarks

Track latency, token usage, and score quality across prompt or model changes using standard Go benchmarks and benchstat.

func BenchmarkRAGLatency(b *testing.B) {
	r := eval.NewRunner(newMyJudge(b))
	c := eval.Case{Input: "...", Output: "...", Context: docs}

	eval.Bench(b, r, eval.Faithfulness{Threshold: 0.8}, c)
}

ns/op - Latency per judge call
tokens/op - Mean tokens consumed per call
score_mean - Average score across iterations
score_stddev - Score consistency across runs
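For reference, the two score statistics can be computed as follows. This is a sketch under an assumption: it uses the population standard deviation (dividing by n), and the document does not say whether eval.Bench uses that or the sample form. In a real benchmark, values like these are typically attached with `b.ReportMetric`.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the arithmetic mean and the population standard
// deviation of scores. Both are zero for an empty slice.
func meanStddev(scores []float64) (mean, stddev float64) {
	if len(scores) == 0 {
		return 0, 0
	}
	for _, s := range scores {
		mean += s
	}
	mean /= float64(len(scores))

	var sumSq float64
	for _, s := range scores {
		d := s - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(scores)))
}

func main() {
	mean, sd := meanStddev([]float64{0.8, 0.9, 1.0})
	fmt.Printf("score_mean=%.3f score_stddev=%.3f\n", mean, sd)
}
```

A low standard deviation across iterations suggests the judge scores the same case consistently; a high one is a hint to look at Repeat or calibration.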

Eval Operations

v1.0 adds an operations layer for repeatable eval runs: define goeval.json profiles, preflight prerequisites, run profile-aware tests, and apply the same policy to compare and summarize commands.

Profiles

Name PR, nightly, provider, or release-gate run shapes once.

Prerequisites

Require env vars, files, TCP endpoints, or custom checks before a run.

Compare policies

Set score tolerances, case IDs, and regression behavior in config.

Reliability

Summarize pass rates, p95 latency/tokens, scenario totals, and flaky identities.

{
  "profiles": {
    "pr": {
      "packages": ["./..."],
      "tiers": ["critical"],
      "results_dir": ".goeval/pr"
    },
    "google": {
      "packages": ["./..."],
      "tiers": ["critical", "standard"],
      "results_dir": ".goeval/google",
      "prerequisites": [
        {"type": "env", "name": "GEMINI_API_KEY"},
        {"type": "env", "name": "GOOGLE_ROUTES_API_KEY"}
      ],
      "missing_prerequisite": "skip"
    }
  },
  "compare": {
    "case_id_key": "case_id",
    "default": {
      "score_tolerance": 0.02,
      "fail_on_missing": true,
      "fail_on_regression": true
    }
  }
}
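The prerequisite entries in the profile above can be read as simple preflight checks. The sketch below shows one way such checks might look with the standard library. It is not the goeval CLI's implementation: the `prerequisite` struct and function names are hypothetical, and the `"file"` and `"tcp"` type strings are assumptions (only `"env"` appears in the config shown).

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// prerequisite mirrors the {"type": ..., "name": ...} entries in a profile.
type prerequisite struct {
	Type string
	Name string
}

// check returns nil when the prerequisite is satisfied.
func check(p prerequisite) error {
	switch p.Type {
	case "env":
		if v, ok := os.LookupEnv(p.Name); !ok || v == "" {
			return fmt.Errorf("env %s is not set", p.Name)
		}
	case "file": // assumed type name
		if _, err := os.Stat(p.Name); err != nil {
			return fmt.Errorf("file %s: %w", p.Name, err)
		}
	case "tcp": // assumed type name; Name is host:port
		conn, err := net.DialTimeout("tcp", p.Name, 2*time.Second)
		if err != nil {
			return fmt.Errorf("tcp %s: %w", p.Name, err)
		}
		conn.Close()
	default:
		return fmt.Errorf("unknown prerequisite type %q", p.Type)
	}
	return nil
}

// firstMissing returns the first unmet prerequisite as an error, or nil.
func firstMissing(ps []prerequisite) error {
	for _, p := range ps {
		if err := check(p); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := firstMissing([]prerequisite{{Type: "env", Name: "GEMINI_API_KEY"}})
	if err != nil {
		// With "missing_prerequisite": "skip" the run would be skipped here.
		fmt.Println("skip:", err)
		return
	}
	fmt.Println("prerequisites satisfied")
}
```

Checking before `go test` starts keeps a missing API key from showing up as dozens of confusing metric failures.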

goeval test --profile pr
goeval test --profile google --config goeval.json -run Route
goeval compare --policy goeval.json --format json old/results.jsonl new/results.jsonl
goeval compare --fail-on-regression=false old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json new/results.jsonl
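To show how a `score_tolerance` of 0.02 and `fail_on_missing` might shape a comparison, here is a minimal sketch. It is an illustration, not the `goeval compare` algorithm: `compareScores` and the verdict names are hypothetical, the case IDs are made up, and treating a drop of exactly the tolerance as acceptable is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

type verdict string

const (
	verdictOK        verdict = "ok"
	verdictRegressed verdict = "regressed"
	verdictMissing   verdict = "missing"
)

// compareScores classifies each baseline case against the current run.
// A case regresses only when its score drops by more than tolerance.
func compareScores(baseline, current map[string]float64, tolerance float64) map[string]verdict {
	out := make(map[string]verdict, len(baseline))
	for id, old := range baseline {
		cur, found := current[id]
		switch {
		case !found:
			out[id] = verdictMissing
		case old-cur > tolerance:
			out[id] = verdictRegressed
		default:
			out[id] = verdictOK
		}
	}
	return out
}

func main() {
	baseline := map[string]float64{"capitals/paris": 0.90, "capitals/rome": 0.95, "capitals/oslo": 0.80}
	current := map[string]float64{"capitals/paris": 0.89, "capitals/rome": 0.85}
	result := compareScores(baseline, current, 0.02)

	// Sort IDs so the report order is stable.
	ids := make([]string, 0, len(result))
	for id := range result {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Printf("%s: %s\n", id, result[id])
	}
}
```

Keying the maps by a stable case ID rather than a test name is what lets the comparison survive test renames.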

Reports And Calibration

Render static HTML, Markdown, or JSON reports from JSONL result files. Use calibration to analyze judge disagreement and compare A/B variants.

Static Reports

Render HTML, Markdown, or JSON reports from one or two result files.

Judge Calibration

Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.

goeval report current/results.jsonl --out report.html
goeval report --baseline old/results.jsonl --current new/results.jsonl --format markdown
goeval calibrate --case-id-key case_id --judge-key judge current/results.jsonl
goeval calibrate --pairwise-key variant results.jsonl

CI/CD

Persist JSONL results, run named profiles, compare baselines with policy tolerances, summarize reliability, redact sensitive metadata, and filter case tiers while keeping normal CI fast by default.

Install DefaultTierFilter on the runner to use GOEVAL_TIER, declare run prerequisites in goeval.json or with eval.Require, and add WithRedactors before writing shared result logs.

goeval test --profile pr
goeval compare --policy goeval.json old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json .goeval/pr/results.jsonl
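When reading a reliability summary, it helps to know what a p95 figure represents. The sketch below uses the nearest-rank definition with integer arithmetic; whether `goeval summarize` uses nearest-rank or an interpolated percentile is not stated here, so treat the method and the `percentile` helper as assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100).
// It returns 0 for empty input and does not modify values.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	if p < 1 {
		p = 1
	}
	if p > 100 {
		p = 100
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	// rank = ceil(p * n / 100), computed without floating point.
	rank := (p*len(sorted) + 99) / 100
	return sorted[rank-1]
}

func main() {
	latenciesMs := []float64{120, 95, 310, 101, 99, 150, 2400, 110, 105, 98}
	fmt.Println("p95 latency (ms):", percentile(latenciesMs, 95))
}
```

A p95 is more useful than a mean for CI gates because a single slow judge call does not hide behind many fast ones.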

Environment Variables

GOEVAL=1 - Enable evaluations
GOEVAL_TRACE=1 - Log judge prompts and responses via t.Log
GOEVAL_TIER - Filter tiers when DefaultTierFilter is installed
GOEVAL_RESULTS_DIR - Write results.jsonl (and traces.jsonl when DefaultTraceSink is installed) to this directory

Judge Adapters

Optional judge adapters live in separate modules so the core package stays stdlib-only. Use the Ollama adapter for local LLM-as-judge scoring, or the OpenAI adapter for cloud-based evaluation.

go get github.com/igcodinap/go-eval/adapters/ollama
go get github.com/igcodinap/go-eval/adapters/openai github.com/sashabaranov/go-openai

import ollamaeval "github.com/igcodinap/go-eval/adapters/ollama"

judge := ollamaeval.NewJudge("llama3.2")
r := eval.NewRunner(judge)

r.Run(t, eval.Faithfulness{Threshold: 0.8}, eval.Case{
	Input:   "What is the capital of France?",
	Output:  "Paris is the capital of France.",
	Context: []string{"Paris is the capital of France."},
})

You can also implement your own Judge by wrapping any LLM provider:

type MyJudge struct{}

func (j *MyJudge) Evaluate(ctx context.Context, prompt string) (eval.JudgeResponse, error) {
	// 1. Send prompt to an LLM.
	// 2. Parse its JSON {"score": float, "reason": string} response.
	// 3. Return eval.JudgeResponse{Score, Reason, Tokens}.
	// Must be safe for concurrent use.
	return eval.JudgeResponse{}, nil
}
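Step 2 in the comments above, parsing the model's JSON reply, is where custom judges most often go wrong, because models tend to wrap JSON in extra prose. The sketch below shows one defensive way to handle that with the standard library. `judgeVerdict` and `parseVerdict` are hypothetical names, and the requirement that scores fall in [0, 1] is an assumption inferred from the thresholds used throughout this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// judgeVerdict matches the {"score": float, "reason": string} reply shape.
type judgeVerdict struct {
	Score  float64 `json:"score"`
	Reason string  `json:"reason"`
}

// parseVerdict extracts the outermost {...} span from raw model output,
// decodes it, and rejects scores outside the assumed [0, 1] range.
func parseVerdict(raw string) (judgeVerdict, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return judgeVerdict{}, fmt.Errorf("no JSON object in judge output")
	}
	var v judgeVerdict
	if err := json.Unmarshal([]byte(raw[start:end+1]), &v); err != nil {
		return judgeVerdict{}, fmt.Errorf("decode judge output: %w", err)
	}
	if v.Score < 0 || v.Score > 1 {
		return judgeVerdict{}, fmt.Errorf("score %v outside [0, 1]", v.Score)
	}
	return v, nil
}

func main() {
	v, err := parseVerdict(`Here is my verdict: {"score": 0.9, "reason": "grounded in context"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(v.Score, v.Reason)
}
```

Returning an error for malformed or out-of-range replies, rather than a zero score, keeps a flaky model response from being recorded as a genuine failure.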

CLI

The optional goeval CLI wraps common test, compare, summarize, report, and calibrate workflows.

`goeval test`

Run a named goeval.json profile with GOEVAL=1, tier filters, result directories, and prerequisites applied.

goeval test --profile pr

`goeval compare`

Compare baseline and current JSONL results with policy tolerances, case IDs, and regression rules.

goeval compare --policy goeval.json old/results.jsonl new/results.jsonl

`goeval summarize`

Summarize pass rates, p95 latency/tokens, metadata groups, scenario totals, and flaky identities.

goeval summarize --policy goeval.json current/results.jsonl

`goeval report`

Render static HTML, Markdown, or JSON reports from JSONL result files.

goeval report current/results.jsonl --out report.html

`goeval calibrate`

Analyze judge disagreement, aggregate duplicate rows, and compare A/B variants.

goeval calibrate --judge-key judge current/results.jsonl

`goeval version`

Print CLI version information.

goeval version

Core Concepts

Case

Input, output, expected value, context, artifacts, turns, traces, expected tool calls, metadata, and timeout.

Scenario

Ordered multi-step agent flow with history, artifacts, state, tools, and repeats.

Contract

A named group of checks reported as one business-level pass/fail result.

Artifacts

Named structured JSON outputs for deterministic workflow checks.

Trajectory

Typed turns and tool calls for agent path evaluation.

Trace

Structured agent execution with spans, tool calls, artifact records, and state deltas.

Metric

A stateless scoring function with thresholded pass/fail behavior.

Precheck

Conditional wrapper that gates expensive metrics behind cheap checks.

Repeat

Wrapper for repeated runs, pass-rate aggregation, and score variance.

Eval Profiles

Named goeval.json run shapes for packages, tiers, results, and prerequisites.

Prerequisite Checks

Env, file, TCP, or custom checks that can skip or fail a profile before go test runs.

Compare Policies

Policies for score tolerance, stable identity, and regression behavior.

Reliability Summaries

Pass rates, p95 latency/tokens, scenario totals, metadata groups, and flaky identities.

Reports

Static HTML, Markdown, or JSON evaluation reports from JSONL result files.

Calibration

Judge disagreement analysis and A/B variant comparison for eval reliability.

Scenario Datasets

Portable JSON scenario definitions with named drivers bound in Go.

Stable Case IDs

Case metadata identities that survive test renames across result comparisons.

TierFilter

GOEVAL_TIER-driven case slicing when DefaultTierFilter is installed.

Normalizer

String comparison hook for deterministic checks where case or accents vary.

Judge

Concurrency-safe LLM-as-judge implementation returning scores and reasons.

Runner

Executes cases with metrics, handles GOEVAL gating, assertions, and result sinks.

CaseMetadata

Standard keys such as flow, tier, and dataset for filtering and reports.

MockJudge

Scripted judge for tests that should not call an LLM.