
Getting Started

go-eval v1.0 is an open-source evaluation toolkit for Go teams building LLM products. It runs inside go test, stays opt-in through GOEVAL=1, and covers judge metrics, deterministic checks, structured artifacts, tool trajectories, structured agent traces, multi-step agent scenarios, profile-driven eval operations, result comparison, static reports, judge calibration, and policy-aware reliability summaries.

Use Cases

RAG response quality checks in go test
Deterministic output validation for JSON and artifacts
Agent trajectory checks for tool-use workflows
Structured agent traces with spans, tool calls, and state deltas
Ordered agent scenario contracts with per-step tool policies
Tiered CI slices for critical, standard, and extended cases
Profile-driven eval runs for PR, nightly, provider, and release gates
Prompt or model regression checks in CI pipelines
Static HTML, Markdown, and JSON reports for review and CI artifacts
Judge calibration and A/B variant comparison
Policy-aware summaries and repeatability checks for flaky judge metrics

Core Concepts

Case

Input, output, expected value, context, artifacts, turns, traces, expected tool calls, metadata, and timeout.

Scenario

Ordered multi-step flow with accumulated history, artifacts, state, and repeats.

Contract

Named group of checks that reports one business-level result.

Metric

Stateless scorer with thresholded pass/fail behavior.

Judge

Concurrency-safe LLM-as-judge implementation returning scores and reasons.

Runner

Executes Cases with Metrics and handles GOEVAL gating, result sinks, and assertions.

Artifact

Named structured JSON output checked deterministically.

Trajectory

Typed turns, required tools, forbidden tools, and expected tool calls.

Trace

Structured agent execution with spans, tool calls, artifact records, and state deltas.

Eval Profiles

goeval.json run profiles for packages, tiers, result directories, and prerequisites.

Compare Policies

Baseline policies for score tolerances, stable case identity, and regression gates.

Reports

Static HTML, Markdown, or JSON evaluation reports from JSONL result files.

Create Your First Test Run

Start with keyed eval.Case literals, a cheap deterministic check, and one judge metric. v0.4 and later require keyed literals because Case has an unexported blank field, so unkeyed literals written outside the package do not compile.

package yourpkg_test

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestSupportReply(t *testing.T) {
	// openAIJudge is a Judge constructed elsewhere, for example with the OpenAI adapter.
	runner := eval.NewRunner(openAIJudge, eval.WithResultSink(eval.DefaultResultSink()))

	c := eval.Case{
		Input:    "How do I cancel my plan?",
		Output:   "You can cancel from Billing > Subscription.",
		Expected: "cancel",
		Metadata: map[string]any{
			"flow": "support.reply", "tier": "critical", "dataset": "support/v1",
		},
	}

	result := runner.Run(t, eval.Precheck{
		Pre: eval.Contains{},
		Main: eval.Compound{
			Dimensions: []eval.Dimension{
				{Name: "helpfulness", Rubric: "Actionable next step", Threshold: 0.7},
				{Name: "policy_alignment", Rubric: "No unsafe guidance", Threshold: 0.9},
			},
		},
	}, c)

	if !result.Passed {
		t.Fatalf("eval failed: %s", result.Reason)
	}
}

Metrics Overview

Metric	Type	Purpose	Threshold
Faithfulness	LLM-as-Judge	Verify RAG outputs do not contradict retrieved context	0.8
Hallucination	LLM-as-Judge	Catch outputs that invent facts outside the supplied context	0.9
AnswerRelevancy	LLM-as-Judge	Ensure the output directly addresses the user input	0.7
ContextPrecision	LLM-as-Judge	Check whether retrieved context documents are relevant to the input	0.7
ContextRecall	LLM-as-Judge	Check whether retrieved context contains the expected answer or facts	0.7
AnswerCorrectness	LLM-as-Judge	Verify the output matches the expected answer semantically	0.7
NoiseSensitivity	LLM-as-Judge	Ensure the output ignores irrelevant or distracting retrieved context	0.7
TaskCompletion	LLM-as-Judge	Verify the agent completed the user task end-to-end	0.8
PlanAdherence	LLM-as-Judge	Check whether the agent followed the expected plan or workflow	0.7
GEval	LLM-as-Judge	Score custom criteria that built-in metrics do not cover	0.7
Compound	LLM-as-Judge	Evaluate several related rubric dimensions in one judge call	per-dimension
Contains	Deterministic	Check that output contains a required substring	binary
Regex	Deterministic	Validate output against a regular expression	binary
JSONPath	Deterministic	Assert a value inside JSON output	binary
FieldCount	Deterministic	Enforce a minimum number of non-null JSON fields	config
ArtifactExists	Deterministic	Check that a named structured artifact exists on the case	binary
ArtifactNotExists	Deterministic	Assert that an unwanted structured artifact was not emitted	binary
ArtifactJSONPath	Deterministic	Assert a JSON value inside a named artifact	binary
ArtifactFieldCount	Deterministic	Require enough non-null fields inside an artifact object	config
ArtifactNumberLTE	Deterministic	Check that a numeric artifact value is at or below a maximum	binary
ArtifactArrayContains	Deterministic	Check that an artifact array contains an expected value	binary
ArtifactArrayNotContains	Deterministic	Check that an artifact array excludes an unwanted value	binary
ArtifactArrayMinLen	Deterministic	Require an artifact array to have at least a minimum length	binary
ArtifactSubset	Deterministic	Assert that an artifact contains a partial expected JSON structure	binary
ToolArgumentAccuracy	Deterministic	Verify tool names and JSON arguments match expectations	1.0
StepEfficiency	Deterministic	Verify the trace stays within step and tool-call budgets	1.0
OutputLengthBudget	Deterministic	Keep final output within rune or word limits	config
ToolCallAccuracy	Trajectory	Compare actual tool calls with expected calls under a match mode	1.0
ToolCallF1	Trajectory	Report precision, recall, and F1 for tool-call matches	0.8
RequiredTools	Trajectory	Fail when required tool names or name patterns are absent	binary
ForbiddenTool	Trajectory	Fail when disallowed tool names or patterns appear in the trajectory	binary
StepBudget	Trajectory	Keep tool-call count within a configured budget	binary
Precheck	Wrapper	Skip expensive LLM metrics when a cheap guard fails	wrapped metric
Contract	Wrapper	Group several deterministic or judge checks into one named result	all checks
Repeat	Wrapper	Run a metric multiple times and aggregate pass rate plus score stats	configurable pass rate
WithTokenBudget	Wrapper	Fail a wrapped metric when token usage exceeds a maximum	token max
WithLatencyBudget	Wrapper	Fail a wrapped metric when latency exceeds a maximum	duration max
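
To make the precision, recall, and F1 figures reported by ToolCallF1 more concrete, here is a minimal sketch of the arithmetic. It is not go-eval's implementation: the toolF1 helper is hypothetical, and it assumes tool calls are matched by name only, as multisets, without comparing arguments.

```go
package main

import "fmt"

// toolF1 is a hypothetical helper, not part of go-eval. It matches tool
// calls by name only and treats both slices as multisets.
func toolF1(actual, expected []string) (precision, recall, f1 float64) {
	remaining := make(map[string]int)
	for _, name := range expected {
		remaining[name]++
	}
	matched := 0
	for _, name := range actual {
		if remaining[name] > 0 {
			remaining[name]--
			matched++
		}
	}
	if len(actual) > 0 {
		precision = float64(matched) / float64(len(actual))
	}
	if len(expected) > 0 {
		recall = float64(matched) / float64(len(expected))
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	// One expected call was made, plus one extra call that was not expected.
	actual := []string{"orders.lookup", "orders.refund"}
	expected := []string{"orders.lookup"}
	p, r, f := toolF1(actual, expected)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
}
```

In this example the extra call lowers precision to 0.5 while recall stays at 1, so F1 lands near 0.67 and would fall below the 0.8 threshold listed in the table.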

Deterministic Metrics

Deterministic metrics do not call an LLM judge. They are fast, cheap, and reproducible, making them useful for prechecks, output length budgets, tool policy checks, and structured output validation.

Contains

Substring presence check.

Regex

Regular-expression output validation.

JSONPath

Exact value check at a JSON path.

FieldCount

Configurable minimum non-null JSON fields.

OutputLengthBudget

Configurable rune and word limits.

ArtifactSubset

Partial JSON structure validation.
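
As an illustration of why these checks are cheap and reproducible, the sketch below shows the kind of logic behind a length budget and a non-null field count. Both helpers are hypothetical and use only the standard library. They assume that a limit of 0 means "no limit" and that only top-level fields are counted, which may differ from what go-eval does.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode/utf8"
)

// withinLengthBudget is a hypothetical helper. It counts runes (not bytes)
// and whitespace-separated words. A limit of 0 disables that check.
func withinLengthBudget(output string, maxRunes, maxWords int) bool {
	if maxRunes > 0 && utf8.RuneCountInString(output) > maxRunes {
		return false
	}
	if maxWords > 0 && len(strings.Fields(output)) > maxWords {
		return false
	}
	return true
}

// nonNullFields is a hypothetical helper. It counts the top-level fields
// of a JSON object whose value is not null.
func nonNullFields(raw []byte) (int, error) {
	var obj map[string]any
	if err := json.Unmarshal(raw, &obj); err != nil {
		return 0, err
	}
	count := 0
	for _, v := range obj {
		if v != nil { // JSON null decodes to a nil interface value
			count++
		}
	}
	return count, nil
}

func main() {
	fmt.Println(withinLengthBudget("Route is ready.", 0, 180))
	n, err := nonNullFields([]byte(`{"status":"ready","eta":null,"stops":[]}`))
	fmt.Println(n, err)
}
```

Counting runes instead of bytes keeps the budget fair for non-ASCII output such as Spanish text.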

Agent Scenarios

Use Runner.RunScenario for ordered multi-step agent flows where correctness depends on accumulated history, artifacts, state, tool policy, and per-step contracts.

r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.DefaultTierFilter(),
)

result := r.RunScenario(t, eval.Scenario{
	Name:  "planning_to_route_ready",
	Tier:  "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting",
			Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request",
			Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})

if !result.Passed {
	t.Fatal("scenario failed")
}

Scenario result sinks include normal metric rows plus a _scenario_summary row with step names, tool calls, emitted artifact keys, failed metrics, repeat counts, and redacted metadata. Set Step.ExpectFail for negative cases that should fail their checks.

Grouped Contracts

Use Contract to group several checks into one named requirement. It keeps the report readable while preserving per-check dimensions for debugging.

readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key:      "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
	StopOnFailure: true,
}

r.Run(t, readyRoute, c)

Artifact Checks

Use Case.Artifacts for named JSON payloads that should be validated separately from final prose: route state, planner output, budget data, tool traces, or workflow state. Artifact checks include absence checks, array exclusion, JSON subsets, wildcard paths, output length budgets, and normalizers.

c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}

fold := eval.ChainNormalizers(eval.CaseFoldNormalizer(), eval.SpanishASCIIFoldNormalizer())

r.Run(t, eval.ArtifactExists{Key: "route"}, c)
r.Run(t, eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"}, c)
r.Run(t, eval.ArtifactNumberLTE{Key: "route", Path: "total_minutes", Max: 120}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{Key: "route", Path: "stops[*].name", Expected: "Aeropuerto"}, c)
r.Run(t, eval.ArtifactSubset{Key: "route", Expected: json.RawMessage(`{"status":"ready"}`)}, c)
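
To show what a wildcard path such as stops[*].name combined with a folding normalizer amounts to, here is a standalone sketch using only the standard library. The fold and stopNamesContain helpers are hypothetical. They approximate case folding plus Spanish accent folding and are not the implementation behind CaseFoldNormalizer or SpanishASCIIFoldNormalizer.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// accentFolder maps lowercase Spanish accented letters to ASCII.
// The exact character set is an assumption made for this sketch.
var accentFolder = strings.NewReplacer(
	"á", "a", "é", "e", "í", "i", "ó", "o", "ú", "u", "ü", "u", "ñ", "n",
)

// fold lowercases s and then strips the accents listed above.
func fold(s string) string {
	return accentFolder.Replace(strings.ToLower(s))
}

// stopNamesContain reports whether any stops[*].name value in the route
// artifact equals want after both sides are folded.
func stopNamesContain(route []byte, want string) (bool, error) {
	var doc struct {
		Stops []struct {
			Name string `json:"name"`
		} `json:"stops"`
	}
	if err := json.Unmarshal(route, &doc); err != nil {
		return false, err
	}
	for _, stop := range doc.Stops {
		if fold(stop.Name) == fold(want) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	route := []byte(`{"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]}`)
	ok, err := stopNamesContain(route, "valparaiso")
	fmt.Println(ok, err)
}
```

Folding both the artifact value and the expected value keeps the check deterministic while tolerating casing and accent differences in model output.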

Trajectory Checks

Use Case.Turns and Case.ExpectedToolCalls for conversation and tool-use workflows. JSON datasets can include optional turns and expected_tool_calls fields.

c := eval.Case{
	Input:  "Where is order 42?",
	Output: "Order 42 arrives tomorrow.",
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{
				Name:      "orders.lookup",
				Arguments: json.RawMessage(`{"order_id":"42"}`),
				Result:    "delivery_date=tomorrow",
			},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}

r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

ToolCallAccuracy supports MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. Arguments compare as normalized JSON when MatchArgs is enabled. Required and forbidden tool checks support exact names and glob-style patterns.
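
The two matching ideas above, glob-style tool names and normalized JSON arguments, can be sketched with the standard library. This is an approximation under stated assumptions: matchesAny uses path.Match, whose glob dialect may differ from the one go-eval implements, and sameJSON treats two documents as equal when they decode to the same values. Both helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
	"reflect"
)

// matchesAny reports whether name matches any pattern, using path.Match
// glob syntax. Malformed patterns are treated as non-matching.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// sameJSON compares two JSON documents after decoding, so key order and
// whitespace do not affect the result. Invalid JSON never matches.
func sameJSON(a, b []byte) bool {
	var x, y any
	if json.Unmarshal(a, &x) != nil || json.Unmarshal(b, &y) != nil {
		return false
	}
	return reflect.DeepEqual(x, y)
}

func main() {
	fmt.Println(matchesAny([]string{"orders.*"}, "orders.lookup"))
	fmt.Println(matchesAny([]string{"orders.refund*"}, "orders.lookup"))
	fmt.Println(sameJSON(
		[]byte(`{"order_id":"42","express":true}`),
		[]byte(`{ "express": true, "order_id": "42" }`),
	))
}
```

Note that under this comparison the string "42" and the number 42 are different values, so an agent that passes the wrong argument type would still fail a MatchArgs check.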

Structured Traces

Use Case.Trace when your agent can emit structured spans, tool calls, artifact records, or state deltas. Case.TraceID and Result.TraceID link metric rows, scenario summaries, and trace records in downstream reports.

r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.WithTraceSink(eval.DefaultTraceSink()),
)

c := eval.Case{
	Input:   "Find a route and charge the card",
	Output:  answer,
	TraceID: "route-42",
	Trace: &eval.Trace{
		ID:   "route-42",
		Name: "checkout_route",
		Spans: []eval.Span{{
			Name: "charge",
			Kind: "tool_call",
			ToolCall: &eval.ToolCall{
				Name:      "payments.charge",
				Arguments: json.RawMessage(`{"amount":42}`),
			},
		}},
	},
}

When GOEVAL_RESULTS_DIR is set, DefaultTraceSink writes traces.jsonl alongside results.jsonl. Trace writes use the same WithRedactors hooks as result JSONL. Tool-call metrics and scenario tool contracts read trace tool-call spans when present, falling back to Case.Turns for legacy evals.

Tier Filtering

Use DefaultTierFilter when you want GOEVAL_TIER to select only critical, standard, or extended cases. The filter is opt-in so ordinary runners ignore the environment variable.

r := eval.NewRunner(judge, eval.DefaultTierFilter())

# fast CI slice
GOEVAL=1 GOEVAL_TIER=critical go test ./...

# broader pre-merge slice
GOEVAL=1 GOEVAL_TIER=critical,standard go test ./...
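
For intuition, selecting cases from a comma-separated GOEVAL_TIER value can be pictured as the check below. The tierSelected helper is hypothetical, and two of its behaviors are assumptions rather than documented go-eval semantics: an empty filter selects every tier, and tier names compare case-insensitively.

```go
package main

import (
	"fmt"
	"strings"
)

// tierSelected reports whether a case with the given tier should run
// under the given filter value (for example "critical,standard").
// Assumption: an empty filter selects every tier.
func tierSelected(filter, tier string) bool {
	if strings.TrimSpace(filter) == "" {
		return true
	}
	for _, t := range strings.Split(filter, ",") {
		if strings.EqualFold(strings.TrimSpace(t), tier) {
			return true
		}
	}
	return false
}

func main() {
	for _, tier := range []string{"critical", "standard", "extended"} {
		fmt.Println(tier, tierSelected("critical,standard", tier))
	}
}
```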

Repeat And Budgets

Wrap noisy judge metrics with Repeat when pass rate matters, or add token and latency budgets when resource usage is part of correctness.

r.Run(t, eval.Repeat{
	Metric:   eval.Faithfulness{Threshold: 0.8},
	N:        3,
	PassRate: 2.0 / 3.0,
}, c)

r.Run(t, eval.WithTokenBudget(1200, eval.Faithfulness{Threshold: 0.8}), c)
r.Run(t, eval.WithLatencyBudget(2*time.Second, eval.AnswerRelevancy{Threshold: 0.7}), c)
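
The pass-rate rule behind a configuration like N: 3, PassRate: 2.0 / 3.0 comes down to simple arithmetic: the share of passing runs must reach the required rate. The sketch below assumes exactly that rule. The passesRepeat helper is hypothetical, and go-eval may additionally report score statistics that are not modeled here.

```go
package main

import "fmt"

// passesRepeat reports whether the fraction of passing runs reaches the
// required pass rate. An empty result set never passes.
func passesRepeat(results []bool, passRate float64) bool {
	if len(results) == 0 {
		return false
	}
	passed := 0
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	return float64(passed)/float64(len(results)) >= passRate
}

func main() {
	// Two of three runs pass, which meets a 2/3 requirement.
	fmt.Println(passesRepeat([]bool{true, false, true}, 2.0/3.0))
	// One of three runs passes, which does not.
	fmt.Println(passesRepeat([]bool{true, false, false}, 2.0/3.0))
}
```

With three runs and a 2/3 requirement, a single flaky judge response no longer fails the test, while two failures still do.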

Eval Operations

Use goeval.json when a repo has different eval run shapes for PRs, nightly runs, provider-specific checks, or release gates. Profiles set GOEVAL=1, optional tiers, result directories, and prerequisites before delegating to go test.

{
  "profiles": {
    "pr": {
      "packages": ["./..."],
      "tiers": ["critical"],
      "results_dir": ".goeval/pr"
    },
    "google": {
      "packages": ["./..."],
      "tiers": ["critical", "standard"],
      "results_dir": ".goeval/google",
      "prerequisites": [
        {"type": "env", "name": "GEMINI_API_KEY"},
        {"type": "env", "name": "GOOGLE_ROUTES_API_KEY"}
      ],
      "missing_prerequisite": "skip"
    }
  },
  "compare": {
    "case_id_key": "case_id",
    "default": {
      "score_tolerance": 0.02,
      "fail_on_missing": true,
      "fail_on_regression": true
    }
  }
}

goeval test --profile pr
goeval test --profile google --config goeval.json -run Route

Test code can also declare direct prerequisites with eval.Require, eval.Env, eval.File, eval.TCP, and eval.Func.
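
The prerequisite handling described in this section can be pictured as two small steps: find what is missing, then map that to an action according to the profile's missing_prerequisite setting. The sketch below covers env prerequisites only. All names in it are hypothetical, and treating an empty variable as missing is an assumption. In real code, os.LookupEnv can be passed as the lookup function.

```go
package main

import "fmt"

type action string

const (
	actionRun  action = "run"
	actionSkip action = "skip"
	actionFail action = "fail"
)

// missingEnv returns the names that lookup reports as unset or empty.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

// decide maps missing prerequisites and the policy to an action.
// Any policy other than "fail" skips, mirroring a skip-by-default rule.
func decide(missing []string, policy string) action {
	switch {
	case len(missing) == 0:
		return actionRun
	case policy == "fail":
		return actionFail
	default:
		return actionSkip
	}
}

func main() {
	env := map[string]string{"GEMINI_API_KEY": "set"}
	lookup := func(name string) (string, bool) {
		v, ok := env[name]
		return v, ok
	}
	missing := missingEnv([]string{"GEMINI_API_KEY", "GOOGLE_ROUTES_API_KEY"}, lookup)
	fmt.Println(missing, decide(missing, "skip"))
}
```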

Save And Compare Results

Add a result sink to persist JSONL rows. v0.9 can compare two result files with policy tolerances, summarize reliability from one result file, match stable case IDs across test renames, and write scenario summary rows for multi-step runs. Use WithRedactors when reasons or metadata may contain sensitive IDs.

r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...
goeval compare --policy goeval.json --format json old/results.jsonl new/results.jsonl
goeval compare --case-id-key case_id --score-tolerance 0.02 old.jsonl new.jsonl
goeval compare --fail-on-regression=false old.jsonl new.jsonl
goeval summarize --policy goeval.json .eval-results/results.jsonl

To keep rows matched across test renames, store a stable ID under the conventional Case.Metadata["case_id"] key and select it with compare.StableCaseIDFromMetadata or a compare policy case_id_key.
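
As a rough model of what a comparison with score_tolerance, fail_on_missing, and fail_on_regression has to decide, the sketch below compares two score maps keyed by stable case ID. It is not the goeval compare implementation. The compareScores helper and the rule "a drop larger than the tolerance is a regression" are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// diff holds the case IDs flagged by a comparison.
type diff struct {
	Regressed []string // score dropped by more than the tolerance
	Missing   []string // present in the baseline, absent from the new run
}

// compareScores flags regressions and missing cases. Scores are keyed by
// a stable case ID so that renamed tests still line up.
func compareScores(baseline, current map[string]float64, tolerance float64) diff {
	var d diff
	for id, oldScore := range baseline {
		newScore, ok := current[id]
		switch {
		case !ok:
			d.Missing = append(d.Missing, id)
		case oldScore-newScore > tolerance:
			d.Regressed = append(d.Regressed, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(d.Regressed)
	sort.Strings(d.Missing)
	return d
}

func main() {
	baseline := map[string]float64{
		"support.cancel":  0.90,
		"support.refund":  0.85,
		"support.upgrade": 0.80,
	}
	current := map[string]float64{
		"support.cancel": 0.89, // within a 0.02 tolerance
		"support.refund": 0.70, // drops by 0.15
	}
	d := compareScores(baseline, current, 0.02)
	fmt.Println("regressed:", d.Regressed, "missing:", d.Missing)
}
```

A gate built on this result would fail the run when either list is non-empty and the corresponding fail_on_* option is enabled.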

Reports And Calibration

Render static HTML, Markdown, or JSON reports from JSONL result files. Use calibration to analyze judge disagreement and compare A/B variants across repeated runs.

goeval report current/results.jsonl --out report.html
goeval report --baseline old/results.jsonl --current new/results.jsonl --format markdown
goeval calibrate --case-id-key case_id --judge-key judge current/results.jsonl
goeval calibrate --pairwise-key variant results.jsonl

When --format is omitted, --out must use .html, .htm, .md, .markdown, or .json. Calibration aggregates duplicate judge or variant rows by mean score.
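
Both rules in the paragraph above are easy to picture in code. The sketch below infers a report format from the --out extension and averages duplicate rows by key. The helper names and the row type are hypothetical. Matching extensions case-insensitively is an assumption, and "html" as a format name is inferred from the extension list rather than confirmed by the CLI.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatFromOut infers the report format from the output file extension.
func formatFromOut(out string) (string, error) {
	switch ext := strings.ToLower(filepath.Ext(out)); ext {
	case ".html", ".htm":
		return "html", nil
	case ".md", ".markdown":
		return "markdown", nil
	case ".json":
		return "json", nil
	default:
		return "", fmt.Errorf("cannot infer report format from extension %q", ext)
	}
}

// row is a hypothetical result row: a judge or variant key plus a score.
type row struct {
	Key   string
	Score float64
}

// meanByKey collapses duplicate rows for the same key into their mean score.
func meanByKey(rows []row) map[string]float64 {
	sums := make(map[string]float64)
	counts := make(map[string]int)
	for _, r := range rows {
		sums[r.Key] += r.Score
		counts[r.Key]++
	}
	means := make(map[string]float64, len(sums))
	for key, sum := range sums {
		means[key] = sum / float64(counts[key])
	}
	return means
}

func main() {
	format, err := formatFromOut("report.html")
	fmt.Println(format, err)
	fmt.Println(meanByKey([]row{{"judge-a", 0.5}, {"judge-a", 1.0}, {"judge-b", 0.25}}))
}
```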

Judge Adapters

Optional judge adapters live in separate modules so the core package stays stdlib-only. Use the Ollama adapter for local LLM-as-judge scoring, or the OpenAI adapter for cloud-based evaluation.

go get github.com/igcodinap/go-eval/adapters/ollama
go get github.com/igcodinap/go-eval/adapters/openai github.com/sashabaranov/go-openai

import ollamaeval "github.com/igcodinap/go-eval/adapters/ollama"

judge := ollamaeval.NewJudge("llama3.2")
r := eval.NewRunner(judge)

r.Run(t, eval.Faithfulness{Threshold: 0.8}, eval.Case{
	Input:   "What is the capital of France?",
	Output:  "Paris is the capital of France.",
	Context: []string{"Paris is the capital of France."},
})

For non-default servers, configure the local endpoint with ollamaeval.WithBaseURL. The OpenAI adapter implements both Judge and RawJudge, enabling Compound metrics.

You can also implement your own Judge by wrapping any LLM provider. The interface requires a single method, and implementations must be safe for concurrent use:

type MyJudge struct{}

func (j *MyJudge) Evaluate(ctx context.Context, prompt string) (eval.JudgeResponse, error) {
	// 1. Send prompt to an LLM.
	// 2. Parse its JSON {"score": float, "reason": string} response.
	// 3. Return eval.JudgeResponse{Score, Reason, Tokens}.
	// Must be safe for concurrent use.
	return eval.JudgeResponse{}, nil
}
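
Step 2 in the comments above, parsing the judge's JSON reply, is where custom judges most often go wrong, so a defensive sketch may help. It uses only the standard library and local types. The judgeReply type and parseJudgeReply helper are hypothetical, and the assumption that scores fall in the range 0 to 1 is inferred from the thresholds in this guide rather than stated by the library.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// judgeReply mirrors the {"score": float, "reason": string} shape.
// Score is a pointer so a missing field can be told apart from 0.
type judgeReply struct {
	Score  *float64 `json:"score"`
	Reason string   `json:"reason"`
}

// parseJudgeReply decodes and validates a raw judge reply.
// Assumption: valid scores lie in [0, 1].
func parseJudgeReply(raw string) (score float64, reason string, err error) {
	var reply judgeReply
	if err := json.Unmarshal([]byte(raw), &reply); err != nil {
		return 0, "", fmt.Errorf("decode judge reply: %w", err)
	}
	if reply.Score == nil {
		return 0, "", errors.New("judge reply has no score")
	}
	if *reply.Score < 0 || *reply.Score > 1 {
		return 0, "", fmt.Errorf("score %v outside [0, 1]", *reply.Score)
	}
	return *reply.Score, reply.Reason, nil
}

func main() {
	score, reason, err := parseJudgeReply(`{"score": 0.8, "reason": "grounded in context"}`)
	fmt.Println(score, reason, err)

	_, _, err = parseJudgeReply(`{"reason": "forgot the score"}`)
	fmt.Println(err)
}
```

Returning an error for a malformed reply, rather than a zero score, lets the caller distinguish a judge failure from a genuinely poor output.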

CLI

Install the optional CLI for common workflows:

go install github.com/igcodinap/go-eval/cmd/goeval@latest

goeval test ./...
goeval test --profile pr
goeval compare --policy goeval.json old/results.jsonl new/results.jsonl
goeval summarize --policy goeval.json current/results.jsonl
goeval report current/results.jsonl --out report.html
goeval calibrate --judge-key judge current/results.jsonl
goeval version

CI/CD

Enable evaluations explicitly with GOEVAL=1. Without it, evals skip and normal test runs stay fast.

Install DefaultTierFilter on the runner when CI should select tiers with GOEVAL_TIER, or let a goeval.json profile set the tier and result directory for each pipeline shape.

# Enable evals
GOEVAL=1 go test ./...

# Save result rows
GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...

# Trace judge prompts and responses when debugging
GOEVAL=1 GOEVAL_TRACE=1 go test -v ./...

# Critical tier only
GOEVAL=1 GOEVAL_TIER=critical go test ./...

# Named PR profile
goeval test --profile pr

Troubleshooting

Evals are skipped unexpectedly

Confirm GOEVAL=1 is set before running tests.

Trace output is missing

Use both GOEVAL=1 and GOEVAL_TRACE=1, and run tests with -v so t.Log output is visible.

A profile skips instead of running

Check goeval.json prerequisites. Missing prerequisites skip by default unless the profile sets missing_prerequisite to fail.

Comparisons drift after test renames

Set a stable metadata case_id and compare with --case-id-key case_id or compare.StableCaseIDFromMetadata.

Judge calls fail intermittently

Verify credentials, rate limits, and model availability. Use deterministic prechecks to reduce judge call volume.

Need detailed API reference

Use package docs: go doc github.com/igcodinap/go-eval