
Getting Started

go-eval v0.8 is an open-source evaluation toolkit for Go teams building LLM products. It runs inside go test, stays opt-in through GOEVAL=1, and covers judge metrics, deterministic checks, structured artifacts, tool trajectories, multi-step agent scenarios, result comparison, and summaries.

Use Cases

  • RAG response quality checks in go test
  • Deterministic output validation for JSON and artifacts
  • Agent trajectory checks for tool-use workflows
  • Ordered agent scenario contracts with per-step tool policies
  • Tiered CI slices for critical, standard, and extended cases
  • Prompt or model regression checks in CI pipelines
  • Repeatability checks for flaky judge metrics

Core Concepts

Case
Input, output, expected value, context, artifacts, turns, expected tool calls, metadata, and timeout.
Scenario
Ordered multi-step flow with accumulated history, artifacts, state, and repeats.
Contract
Named group of checks that reports one business-level result.
Metric
Stateless scorer with thresholded pass/fail behavior.
Judge
Concurrency-safe LLM-as-judge implementation returning scores and reasons.
Runner
Executes Cases with Metrics and handles GOEVAL gating, result sinks, and assertions.
Artifact
Named structured JSON output checked deterministically.
Trajectory
Typed turns, required tools, forbidden tools, and expected tool calls.

Create Your First Test Run

Start with keyed eval.Case literals, a cheap deterministic check, and one judge metric. Keyed literals are required in v0.4 and later because Case has a private blank field, which makes unkeyed literals fail to compile outside the package.

package yourpkg_test

import (
	"testing"

	eval "github.com/igcodinap/go-eval"
)

func TestSupportReply(t *testing.T) {
	// openAIJudge is assumed to be a judge value defined elsewhere in this package.
	runner := eval.NewRunner(openAIJudge, eval.WithResultSink(eval.DefaultResultSink()))

	c := eval.Case{
		Input:    "How do I cancel my plan?",
		Output:   "You can cancel from Billing > Subscription.",
		Expected: "cancel",
		Metadata: map[string]any{
			"flow": "support.reply", "tier": "critical", "dataset": "support/v1",
		},
	}

	result := runner.Run(t, eval.Precheck{
		Pre: eval.Contains{},
		Main: eval.Compound{
			Dimensions: []eval.Dimension{
				{Name: "helpfulness", Rubric: "Actionable next step", Threshold: 0.7},
				{Name: "policy_alignment", Rubric: "No unsafe guidance", Threshold: 0.9},
			},
		},
	}, c)

	if !result.Passed {
		t.Fatalf("eval failed: %s", result.Reason)
	}
}
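
The precheck idea used above (run a cheap guard first, call the judge only if it passes) can be sketched with the standard library alone. This is an illustration of the pattern, not go-eval's implementation: the check type, precheck, and containsGuard are hypothetical names introduced here.

```go
package main

import (
	"strconv"
	"strings"
)

// check is a hypothetical stand-in for a metric: it returns pass/fail and a reason.
type check func(output string) (bool, string)

// containsGuard builds a cheap deterministic check for a required substring.
func containsGuard(sub string) check {
	return func(output string) (bool, string) {
		if strings.Contains(output, sub) {
			return true, ""
		}
		return false, "missing substring " + strconv.Quote(sub)
	}
}

// precheck runs the cheap guard first and calls the expensive check only
// when the guard passes. The third result reports whether the expensive
// check actually ran.
func precheck(guard, expensive check) func(output string) (bool, string, bool) {
	return func(output string) (bool, string, bool) {
		if ok, reason := guard(output); !ok {
			return false, "precheck failed: " + reason, false
		}
		ok, reason := expensive(output)
		return ok, reason, true
	}
}
```

With a guard such as containsGuard("cancel"), an output that never mentions cancellation fails immediately and costs no judge call.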

Metrics Overview

Metric | Type | Purpose | Threshold
Faithfulness | LLM-as-Judge | Verify RAG outputs do not contradict retrieved context | 0.8
Hallucination | LLM-as-Judge | Catch outputs that invent facts outside the supplied context | 0.9
AnswerRelevancy | LLM-as-Judge | Ensure the output directly addresses the user input | 0.7
ContextPrecision | LLM-as-Judge | Check whether retrieved context documents are relevant to the input | 0.7
GEval | LLM-as-Judge | Score custom criteria that built-in metrics do not cover | 0.7
Compound | LLM-as-Judge | Evaluate several related rubric dimensions in one judge call | per-dimension
Contains | Deterministic | Check that output contains a required substring | binary
Regex | Deterministic | Validate output against a regular expression | binary
JSONPath | Deterministic | Assert a value inside JSON output | binary
FieldCount | Deterministic | Enforce a minimum number of non-null JSON fields | config
ArtifactExists | Deterministic | Check that a named structured artifact exists on the case | binary
ArtifactNotExists | Deterministic | Assert that an unwanted structured artifact was not emitted | binary
ArtifactJSONPath | Deterministic | Assert a JSON value inside a named artifact | binary
ArtifactFieldCount | Deterministic | Require enough non-null fields inside an artifact object | config
ArtifactNumberLTE | Deterministic | Check that a numeric artifact value stays under a maximum | binary
ArtifactArrayContains | Deterministic | Check that an artifact array contains an expected value | binary
ArtifactArrayNotContains | Deterministic | Check that an artifact array excludes an unwanted value | binary
ArtifactArrayMinLen | Deterministic | Require an artifact array to have at least a minimum length | binary
ArtifactSubset | Deterministic | Assert that an artifact contains a partial expected JSON structure | binary
OutputLengthBudget | Deterministic | Keep final output within rune or word limits | config
ToolCallAccuracy | Trajectory | Compare actual tool calls with expected calls under a match mode | 1.0
ToolCallF1 | Trajectory | Report precision, recall, and F1 for tool-call matches | 0.8
RequiredTools | Trajectory | Fail when required tool names or name patterns are absent | binary
ForbiddenTool | Trajectory | Fail when disallowed tool names or patterns appear in the trajectory | binary
StepBudget | Trajectory | Keep tool-call count within a configured budget | binary
Precheck | Wrapper | Skip expensive LLM metrics when a cheap guard fails | wrapped metric
Contract | Wrapper | Group several deterministic or judge checks into one named result | all checks
Repeat | Wrapper | Run a metric multiple times and aggregate pass rate plus score stats | configurable pass rate
WithTokenBudget | Wrapper | Fail a wrapped metric when token usage exceeds a maximum | token max
WithLatencyBudget | Wrapper | Fail a wrapped metric when latency exceeds a maximum | duration max

Deterministic Metrics

Deterministic metrics do not call an LLM judge. They are fast, cheap, and reproducible, making them useful for prechecks, output length budgets, tool policy checks, and structured output validation.

Contains

Substring presence check.

Regex

Regular-expression output validation.

JSONPath

Exact value check at a JSON path.

FieldCount

Configurable minimum non-null JSON fields.

OutputLengthBudget

Configurable rune and word limits.

ArtifactSubset

Partial JSON structure validation.
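
To make the rune and word limits concrete, here is a minimal standard-library sketch of a length budget. It is not go-eval's OutputLengthBudget; the function name withinBudget and the convention that a limit of 0 means "no limit" are assumptions made for this example.

```go
package main

import (
	"strings"
	"unicode/utf8"
)

// withinBudget reports whether text stays within the rune and word limits.
// A limit of 0 disables that limit (an assumption of this sketch).
func withinBudget(text string, maxRunes, maxWords int) bool {
	// Count runes, not bytes, so accented characters count once.
	if maxRunes > 0 && utf8.RuneCountInString(text) > maxRunes {
		return false
	}
	// strings.Fields splits on runs of Unicode whitespace.
	if maxWords > 0 && len(strings.Fields(text)) > maxWords {
		return false
	}
	return true
}
```

Counting runes rather than bytes matters for non-ASCII output: "Propón la ruta" is 15 bytes but 14 runes.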

Agent Scenarios

Use Runner.RunScenario for ordered multi-step agent flows where correctness depends on accumulated history, artifacts, state, tool policy, and per-step contracts.

r := eval.NewRunner(
	judge,
	eval.WithResultSink(eval.DefaultResultSink()),
	eval.DefaultTierFilter(),
)

result := r.RunScenario(t, eval.Scenario{
	Name:  "planning_to_route_ready",
	Tier:  "critical",
	State: map[string]any{"locale": "es-CL"},
	Tools: eval.NewToolRegistry("plan_route", "select_map_items"),
	Repeat: eval.ScenarioRepeat{N: 3, PassRate: 2.0 / 3.0},
	Driver: func(ctx context.Context, req eval.StepRequest) (eval.StepResult, error) {
		return runAgentStep(ctx, req.Step.Input, req.History, req.Artifacts, req.State)
	},
	Steps: []eval.Step{
		{
			Name: "greeting",
			Input: "Hola",
			ForbiddenToolPatterns: []string{"plan_*", "select_*"},
			Timeout: 500 * time.Millisecond,
		},
		{
			Name: "ready_route_request",
			Input: "Propón la ruta",
			RequiredToolPatterns: []string{"plan_*"},
			Timeout: 3 * time.Second,
			Checks: []eval.Metric{
				eval.NewContract("ready_route",
					eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
					eval.ArtifactArrayMinLen{Key: "route", Path: "stops", MinLen: 2},
				),
			},
		},
	},
})

if !result.Passed {
	t.Fatal("scenario failed")
}

Scenario result sinks include normal metric rows plus a _scenario_summary row with step names, tool calls, emitted artifact keys, failed metrics, repeat counts, and redacted metadata. Set Step.ExpectFail for negative cases that should fail their checks.
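
The per-step tool patterns in the scenario above ("plan_*", "select_*") are glob-style. A rough way to reason about them is the standard library's path.Match, as in the sketch below. Whether go-eval uses the same glob semantics is an assumption; matchesAny and violations are hypothetical helpers, not library functions.

```go
package main

import "path"

// matchesAny reports whether name matches any glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchesAny(name string, patterns []string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// violations returns the forbidden tools that were called and the
// required patterns that no call satisfied.
func violations(calls, required, forbidden []string) (usedForbidden, missingRequired []string) {
	for _, c := range calls {
		if matchesAny(c, forbidden) {
			usedForbidden = append(usedForbidden, c)
		}
	}
	for _, r := range required {
		satisfied := false
		for _, c := range calls {
			if ok, err := path.Match(r, c); err == nil && ok {
				satisfied = true
				break
			}
		}
		if !satisfied {
			missingRequired = append(missingRequired, r)
		}
	}
	return usedForbidden, missingRequired
}
```

Under these semantics, a greeting step that calls select_map_items would be reported as using a forbidden tool, and a route step with no plan_* call would be reported as missing a required one.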

Grouped Contracts

Use Contract to group several checks into one named requirement. It keeps the report readable while preserving per-check dimensions for debugging.

readyRoute := eval.Contract{
	ContractName: "ready_route",
	Checks: []eval.Metric{
		eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"},
		eval.ArtifactSubset{
			Key:      "route",
			Expected: json.RawMessage(`{"success":true}`),
		},
		eval.OutputLengthBudget{MaxWords: 180},
	},
	StopOnFailure: true,
}

r.Run(t, readyRoute, c)
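
The grouping behavior can be pictured as a loop that records each check's outcome and folds them into one verdict. The sketch below is a simplified model, not go-eval's Contract: namedCheck, contractResult, and runContract are hypothetical, and the stop-on-first-failure behavior is an assumption about what StopOnFailure means.

```go
package main

// namedCheck is a hypothetical check with a label for reporting.
type namedCheck struct {
	Name string
	Run  func() bool
}

// contractResult carries one overall verdict plus per-check outcomes
// for the checks that actually ran.
type contractResult struct {
	Name   string
	Passed bool
	Checks map[string]bool
}

// runContract passes only if every executed check passes. When
// stopOnFailure is true, checks after the first failure are skipped.
func runContract(name string, stopOnFailure bool, checks []namedCheck) contractResult {
	res := contractResult{Name: name, Passed: true, Checks: map[string]bool{}}
	for _, c := range checks {
		ok := c.Run()
		res.Checks[c.Name] = ok
		if !ok {
			res.Passed = false
			if stopOnFailure {
				break
			}
		}
	}
	return res
}
```

Stopping early saves work when later checks are expensive; running everything gives a fuller picture when debugging a failed contract.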

Artifact Checks

Use Case.Artifacts for named JSON payloads that should be validated separately from final prose: route state, planner output, budget data, tool traces, or workflow state. v0.8 adds absence checks, array exclusion, JSON subsets, wildcard paths, output length budgets, and normalizers.

c := eval.Case{
	Output: "Route is ready.",
	Artifacts: map[string]json.RawMessage{
		"route": json.RawMessage(`{
			"status":"ready",
			"total_minutes":98,
			"stops":[{"name":"Pajaritos"},{"name":"Valparaíso"}]
		}`),
	},
}

fold := eval.ChainNormalizers(eval.CaseFoldNormalizer(), eval.SpanishASCIIFoldNormalizer())

r.Run(t, eval.ArtifactExists{Key: "route"}, c)
r.Run(t, eval.ArtifactJSONPath{Key: "route", Path: "status", Expected: "ready"}, c)
r.Run(t, eval.ArtifactNumberLTE{Key: "route", Path: "total_minutes", Max: 120}, c)
r.Run(t, eval.ArtifactArrayContains{
	Key: "route", Path: "stops[*].name", Expected: "pajaritos", Normalizer: fold,
}, c)
r.Run(t, eval.ArtifactArrayNotContains{Key: "route", Path: "stops[*].name", Expected: "Aeropuerto"}, c)
r.Run(t, eval.ArtifactSubset{Key: "route", Expected: json.RawMessage(`{"status":"ready"}`)}, c)
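
A partial-structure check like the last line above can be understood as a recursive subset test over decoded JSON. The sketch below uses only encoding/json and reflect. It is an illustration, not the library's ArtifactSubset: isSubset and jsonSubset are hypothetical names, and requiring arrays to match exactly is a simplifying assumption.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// isSubset reports whether every key in want appears in got with a
// matching value. Objects are compared recursively; arrays and scalars
// must be equal (a simplification in this sketch).
func isSubset(want, got any) bool {
	wantObj, ok := want.(map[string]any)
	if !ok {
		return reflect.DeepEqual(want, got)
	}
	gotObj, ok := got.(map[string]any)
	if !ok {
		return false
	}
	for key, wantVal := range wantObj {
		gotVal, present := gotObj[key]
		if !present || !isSubset(wantVal, gotVal) {
			return false
		}
	}
	return true
}

// jsonSubset decodes both documents and applies isSubset.
func jsonSubset(want, got []byte) (bool, error) {
	var w, g any
	if err := json.Unmarshal(want, &w); err != nil {
		return false, err
	}
	if err := json.Unmarshal(got, &g); err != nil {
		return false, err
	}
	return isSubset(w, g), nil
}
```

Extra fields in the artifact, such as total_minutes, do not affect the result; only keys named in the expected document are compared.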

Trajectory Checks

Use Case.Turns and Case.ExpectedToolCalls for conversation and tool-use workflows. JSON datasets can include optional turns and expected_tool_calls fields.

c := eval.Case{
	Input:  "Where is order 42?",
	Output: "Order 42 arrives tomorrow.",
	Turns: []eval.Turn{
		{Role: eval.RoleUser, Content: "Where is order 42?"},
		{Role: eval.RoleAssistant, ToolCalls: []eval.ToolCall{
			{
				Name:      "orders.lookup",
				Arguments: json.RawMessage(`{"order_id":"42"}`),
				Result:    "delivery_date=tomorrow",
			},
		}},
	},
	ExpectedToolCalls: []eval.ToolCall{
		{Name: "orders.lookup", Arguments: json.RawMessage(`{"order_id":"42"}`)},
	},
}

r.Run(t, eval.ToolCallAccuracy{Mode: eval.MatchStrict, MatchArgs: true}, c)
r.Run(t, eval.ToolCallF1{MatchArgs: true, Threshold: 0.8}, c)
r.Run(t, eval.RequiredTools{Patterns: []string{"orders.*"}}, c)
r.Run(t, eval.ForbiddenTool{Patterns: []string{"orders.refund*"}}, c)
r.Run(t, eval.StepBudget{MaxSteps: 1}, c)

ToolCallAccuracy supports MatchStrict, MatchUnordered, MatchSubset, and MatchSuperset. Arguments compare as normalized JSON when MatchArgs is enabled. Required and forbidden tool checks support exact names and glob-style patterns.
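
Comparing arguments as normalized JSON means key order and whitespace should not matter. One way to sketch that with the standard library is to decode and re-encode both sides, since encoding/json writes map keys in sorted order. This is an approximation for illustration only; canonical and sameArgs are hypothetical helpers, and go-eval's own normalization rules may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// canonical decodes a JSON document and re-encodes it. encoding/json
// emits object keys in sorted order and no insignificant whitespace.
func canonical(raw []byte) ([]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

// sameArgs reports whether two argument payloads are equal after
// canonicalization. Types still matter: "42" and 42 are different.
func sameArgs(a, b []byte) (bool, error) {
	ca, err := canonical(a)
	if err != nil {
		return false, err
	}
	cb, err := canonical(b)
	if err != nil {
		return false, err
	}
	return bytes.Equal(ca, cb), nil
}
```

Under this sketch, an agent that sends order_id as a number instead of a string would not match the expected call, which is usually the behavior you want from an argument check.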

Tier Filtering

Use DefaultTierFilter when you want GOEVAL_TIER to select only critical, standard, or extended cases. The filter is opt-in, so runners without it ignore the environment variable.

r := eval.NewRunner(judge, eval.DefaultTierFilter())

# fast CI slice
GOEVAL=1 GOEVAL_TIER=critical go test ./...

# broader pre-merge slice
GOEVAL=1 GOEVAL_TIER=critical,standard go test ./...
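
The comma-separated GOEVAL_TIER values shown above suggest a simple set-membership filter. The sketch below models that idea with the standard library. It is not the DefaultTierFilter implementation: parseTiers and tierSelected are hypothetical, and treating an empty value as "run every tier" is an assumption.

```go
package main

import "strings"

// parseTiers splits a comma-separated tier list into a set,
// ignoring surrounding whitespace and empty entries.
func parseTiers(value string) map[string]bool {
	set := map[string]bool{}
	for _, part := range strings.Split(value, ",") {
		if tier := strings.TrimSpace(part); tier != "" {
			set[tier] = true
		}
	}
	return set
}

// tierSelected reports whether a case with the given tier should run.
// An empty selection means no filtering (an assumption of this sketch).
func tierSelected(selected map[string]bool, caseTier string) bool {
	return len(selected) == 0 || selected[caseTier]
}
```

With GOEVAL_TIER=critical,standard, a case tagged "extended" would be skipped while "critical" and "standard" cases run.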

Repeat And Budgets

Wrap noisy judge metrics with Repeat when pass rate matters, or add token and latency budgets when resource usage is part of correctness.

r.Run(t, eval.Repeat{
	Metric:   eval.Faithfulness{Threshold: 0.8},
	N:        3,
	PassRate: 2.0 / 3.0,
}, c)

r.Run(t, eval.WithTokenBudget(1200, eval.Faithfulness{Threshold: 0.8}), c)
r.Run(t, eval.WithLatencyBudget(2*time.Second, eval.AnswerRelevancy{Threshold: 0.7}), c)
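
The N and PassRate arithmetic is easy to check by hand: with N = 3 and PassRate = 2/3, two passing attempts out of three are enough. The sketch below models that aggregation. It is an illustration rather than go-eval's Repeat: the repeat function and its attempt callback are hypothetical.

```go
package main

// repeat runs attempt n times and reports whether the fraction of
// passing attempts reaches minPassRate, along with the observed pass
// rate and the mean score.
func repeat(n int, minPassRate float64, attempt func(i int) (score float64, passed bool)) (ok bool, passRate, mean float64) {
	if n <= 0 {
		return false, 0, 0
	}
	passes, sum := 0, 0.0
	for i := 0; i < n; i++ {
		score, passed := attempt(i)
		sum += score
		if passed {
			passes++
		}
	}
	passRate = float64(passes) / float64(n)
	return passRate >= minPassRate, passRate, sum / float64(n)
}
```

For judge scores of 0.9, 0.6, and 0.85 against a 0.8 threshold, two of three attempts pass, so a 2/3 pass rate is met while a pass rate of 1.0 is not.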

Save And Compare Results

Add a result sink to persist JSONL rows. v0.8 can compare two result files, summarize one result file, and write scenario summary rows for multi-step runs. Use WithRedactors when reasons or metadata may contain sensitive IDs.

r := eval.NewRunner(judge, eval.WithResultSink(eval.DefaultResultSink()))

GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize .eval-results/results.jsonl

Use compare.CaseIDFromMetadata when the conventional Case.Metadata["case_id"] key should identify rows across runs.
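
If you want to post-process a results file yourself, JSONL is straightforward to read line by line. The sketch below tallies passes and failures per metric. The row schema is not documented on this page, so the "metric" and "passed" field names are hypothetical placeholders; inspect a real results.jsonl before relying on them. The summarize function here is also a stand-in, not the goeval summarize command.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
)

// row lists the fields this sketch reads. The JSON field names are
// hypothetical; the real result schema may differ.
type row struct {
	Metric string `json:"metric"`
	Passed bool   `json:"passed"`
}

// tally counts outcomes for one metric.
type tally struct{ Passed, Failed int }

// summarize reads JSONL data and counts passes and failures per metric.
// Blank lines are skipped. bufio.Scanner's default limit caps each line
// at 64 KiB.
func summarize(data []byte) (map[string]tally, error) {
	out := map[string]tally{}
	sc := bufio.NewScanner(bytes.NewReader(data))
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue
		}
		var r row
		if err := json.Unmarshal(line, &r); err != nil {
			return nil, err
		}
		t := out[r.Metric]
		if r.Passed {
			t.Passed++
		} else {
			t.Failed++
		}
		out[r.Metric] = t
	}
	return out, sc.Err()
}
```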

CLI

Install the optional CLI for common workflows:

go install github.com/igcodinap/go-eval/cmd/goeval@latest

goeval test ./...
goeval compare old/results.jsonl new/results.jsonl
goeval summarize current/results.jsonl
goeval version

CI/CD

Enable evaluations explicitly with GOEVAL=1. Without it, evals skip and normal test runs stay fast.

Install DefaultTierFilter on the runner when CI should select tiers with GOEVAL_TIER.

# Enable evals
GOEVAL=1 go test ./...

# Save result rows
GOEVAL=1 GOEVAL_RESULTS_DIR=.eval-results go test ./...

# Trace judge prompts and responses when debugging
GOEVAL=1 GOEVAL_TRACE=1 go test -v ./...

# Critical tier only
GOEVAL=1 GOEVAL_TIER=critical go test ./...

Troubleshooting

Evals are skipped unexpectedly

Confirm GOEVAL=1 is set before running tests.

Trace output is missing

Use both GOEVAL=1 and GOEVAL_TRACE=1, and run tests with -v so t.Log output is visible.

Judge calls fail intermittently

Verify credentials, rate limits, and model availability. Use deterministic prechecks to reduce judge call volume.

Need detailed API reference

Use package docs: go doc github.com/igcodinap/go-eval