Versioning & Experimentation
Purpose
Enable project admins to safely experiment with agent configurations (which include prompts embedded within them) and strategies without affecting production. The system provides a complete workflow from experimental testing to production deployment, with validation gates and confidence metrics.
Overview
The agent versioning and experimentation system allows admins to:
- Create experimental versions of agent configs (which include prompts embedded within them)
- Test changes in isolated environments
- Compare experimental versions against production
- Validate changes meet quality and performance thresholds
- Deploy with confidence after validation
- Rollback instantly if issues arise
Version Tracking in Outputs
Every agent output includes version information for traceability and debugging. All agent outputs contain an agentVersion field that specifies the semantic version (e.g., "1.2.3") of the agent that generated the output.
How Version is Determined
- Agents read their active version from the AgentVersionDeployment table during initialization
- The version corresponds to the currently deployed version for the agent in the production environment
- The version is included in all outputs, including error outputs and partial results
- This enables traceability: you can identify exactly which agent version generated any output in the system
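As an illustration, the sketch below shows one way an agent could resolve its active version during initialization and stamp it onto every output, including error outputs. The db helper, class, and output shape are hypothetical placeholders, not the actual implementation.

```typescript
// Hypothetical data-access helper; the real persistence layer may differ.
interface Db {
  getActiveDeployment(agentId: string, environment: string): Promise<{ versionId: string; version: string }>;
}

interface AgentOutput<T> {
  agentId: string;      // which agent type produced the output
  agentVersion: string; // semantic version of the deployed agent, e.g. "1.2.3"
  success: boolean;
  data?: T;
  error?: string;
}

class VersionedAgent {
  private version = "0.0.0"; // replaced during init()

  constructor(private readonly agentId: string, private readonly db: Db) {}

  // Read the active version from the AgentVersionDeployment table during initialization.
  async init(): Promise<void> {
    const deployment = await this.db.getActiveDeployment(this.agentId, "production");
    this.version = deployment.version;
  }

  // Every output (including error outputs) carries agentVersion for traceability.
  protected output<T>(data?: T, error?: string): AgentOutput<T> {
    return {
      agentId: this.agentId,
      agentVersion: this.version,
      success: error === undefined,
      data,
      error,
    };
  }
}
```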
Benefits
- Traceability: Track which version of an agent generated specific outputs
- Debugging: Identify version-specific issues by correlating outputs with agent versions
- Audit Trail: Maintain a complete record of which agent version was responsible for each result
- Experimentation: Compare outputs from different agent versions during A/B testing
- Rollback Analysis: Understand the impact of version changes by tracking outputs before and after deployments
The agentVersion field is separate from agentId (which identifies the agent type) and provides the specific version that generated the output. This is essential for the versioning and experimentation workflow, as it allows admins to track which version produced which results.
Database Schema
Core Tables
AgentVersion - Stores all agent versions with metadata:
{
id: string
agentId: string // Agent identifier (scheduler, query-strategy, etc.)
version: string // Semantic version (e.g., "1.2.3")
config: object // Agent configuration snapshot (JSONB) - from AgentConfig
// Includes prompts embedded within the config (e.g., systemPrompt, extractionPrompt, etc.)
codeHash: string // Hash of agent implementation code (for tracking code changes)
createdAt: Date
createdBy: string // 'learning-agent' | 'admin' | 'manual'
metadata: {
changelog?: string // What changed in this version
rationale?: string // Why the change was made
expectedImpact?: string // Expected improvements
performanceMetrics?: object // Historical performance data
}
status: 'draft' | 'experimental' | 'testing' | 'production' | 'deprecated'
}

Note: AgentVersion stores snapshots of configurations (which include prompts embedded within them), but does not replace AgentConfig. When a version is deployed:
- AgentConfig is updated to match the version's configuration from AgentVersion.config
- Agents read from AgentConfig at runtime (this is the source of truth)
- The AgentVersion serves as a historical record and rollback point
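For concreteness, an AgentVersion record for a hypothetical content-generation agent might look like the following; every value (model, prompt text, hash, dates) is illustrative only.

```typescript
// Illustrative AgentVersion record; every value here is made up for the example.
const exampleVersion = {
  id: "ver_001",
  agentId: "content-generation",
  version: "1.3.0",
  config: {
    model: "gpt-4o",   // assumed config field
    temperature: 0.4,  // assumed config field
    systemPrompt: "You are a financial newsletter writer. Be concise and factual.", // prompt embedded in the config
  },
  codeHash: "9f2c1ab4", // shortened example hash of the agent implementation
  createdAt: new Date("2024-05-01T12:00:00Z"),
  createdBy: "admin",
  metadata: {
    changelog: "Tightened the system prompt to reduce filler text",
    rationale: "Quality reviews flagged verbose introductions",
    expectedImpact: "Higher quality score at roughly the same cost",
  },
  status: "experimental",
};
```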
AgentVersionDeployment - Tracks which version is active in each environment:
{
id: string
agentId: string
versionId: string // Reference to AgentVersion
environment: 'production' | 'staging' | 'experimental' | 'development'
deployedAt: Date
deployedBy: string // Admin user ID
rollbackVersionId?: string // Previous version for quick rollback
deploymentNotes?: string // Why this version was deployed
}

Experimentation Tables
AgentExperiment - Track experimental runs and comparisons:
{
id: string
agentId: string
versionId: string // Experimental version being tested
baselineVersionId: string // Production version to compare against
status: 'running' | 'completed' | 'failed'
testConfig: {
testUsers?: string[] // Specific users to test with
testTickers?: string[] // Specific tickers to test with
testType: 'historical' | 'live' | 'synthetic'
sampleSize?: number // Number of test cases
dateRange?: { // For historical tests
start: Date
end: Date
}
}
results: {
executionTime: number // Average execution time (ms)
successRate: number // Success rate (0-1)
qualityScore: number // Quality score (0-1)
cost: number // API cost in USD
errorCount: number
sampleOutputs: object[] // Sample outputs for review
metrics: {
newsletterGenerated: number
averageEngagement?: number
userSatisfaction?: number
}
}
comparison: {
executionTimeDelta: number // % change vs baseline
successRateDelta: number
qualityScoreDelta: number
costDelta: number
isBetter: boolean // Overall assessment
}
createdAt: Date
completedAt?: Date
createdBy: string // Admin user ID
notes?: string
}

AgentValidation - Track validation checks before promotion:
{
id: string
versionId: string
validationType: 'performance' | 'quality' | 'cost' | 'error-rate' | 'manual-review'
status: 'pending' | 'passed' | 'failed' | 'warning'
threshold: number // Required threshold
actualValue: number // Actual measured value
passed: boolean
message?: string // Human-readable result
notes?: string
validatedBy?: string // Admin user ID
validatedAt?: Date
experimentId?: string // Link to experiment that generated this validation
}

Version Lifecycle
1. Version Creation
Versions can be created from multiple sources:
- Learning Agent: Automatically creates versions when optimizing configurations
- Admin Manual: Admins create versions via admin dashboard
- Experimental Fork: Create experimental version from existing production version
Status Flow:
draft → experimental → testing → production
                                     ↓
                                deprecated

2. Experimental Phase
- Versions with status: 'experimental' are for testing and validation
- Run in isolated execution context
- No impact on production users or data
- Can run test executions on:
- Historical data (replay past scenarios)
- Test user accounts
- Sample tickers
- Synthetic test cases
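For example, a historical replay over a 30-day window restricted to a couple of sample tickers could be described with a testConfig along these lines (values are illustrative):

```typescript
// Illustrative testConfig for a historical experiment (see AgentExperiment.testConfig).
const historicalTestConfig = {
  testType: "historical" as const,
  testTickers: ["AAPL", "MSFT"], // assumed sample tickers
  sampleSize: 50,                // number of past scenarios to replay
  dateRange: {
    start: new Date("2024-04-01"),
    end: new Date("2024-04-30"),
  },
};
```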
3. Testing Phase
- Versions promoted to status: 'testing' run alongside production
- A/B testing on a subset of traffic
- Performance metrics collected for comparison
- Can be promoted to production or reverted to experimental
4. Production Deployment
- Only versions that pass validation can be deployed to production
- Deployment Process (see the sketch below):
  - AgentVersionDeployment table is updated to mark the new version as active
  - AgentConfig table is updated to match the version's configuration from AgentVersion.config (includes prompts embedded within config)
  - Agents read from AgentConfig at runtime (the source of truth)
  - Agent instances reload configuration from database (hot-reload without restart, but may require agent re-initialization)
- Previous production version automatically tracked for rollback
- Note: Code changes (if any) still require code deployment, but config/prompt changes can be hot-reloaded
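A minimal sketch of that deployment step is shown below. The DeployDb helper names are assumptions standing in for the real persistence layer; the sketch only illustrates the ordering described above: activate the new deployment, sync AgentConfig from the version snapshot, and keep the previous version for rollback.

```typescript
// Hypothetical persistence helpers; names and signatures are assumptions, not the real API.
interface DeployDb {
  getActiveDeployment(
    agentId: string,
    environment: string,
  ): Promise<{ versionId: string; rollbackVersionId?: string } | null>;
  getVersion(versionId: string): Promise<{ id: string; config: object }>;
  upsertDeployment(row: {
    agentId: string;
    versionId: string;
    environment: string;
    deployedBy: string;
    rollbackVersionId?: string;
    deploymentNotes?: string;
  }): Promise<void>;
  updateAgentConfig(agentId: string, config: object): Promise<void>;
}

async function deployToProduction(
  db: DeployDb,
  agentId: string,
  versionId: string,
  deployedBy: string,
  notes?: string,
): Promise<void> {
  const previous = await db.getActiveDeployment(agentId, "production");
  const version = await db.getVersion(versionId);

  // 1. Mark the new version as active and remember the previous one for rollback.
  await db.upsertDeployment({
    agentId,
    versionId: version.id,
    environment: "production",
    deployedBy,
    rollbackVersionId: previous?.versionId,
    deploymentNotes: notes,
  });

  // 2. Sync AgentConfig (the runtime source of truth) from the version's config snapshot.
  await db.updateAgentConfig(agentId, version.config);

  // 3. Agents hot-reload AgentConfig from the database; no code deployment is needed
  //    unless the implementation itself changed (tracked via codeHash).
}
```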
5. Version Rollback
- Instant rollback to previous production version
- Updates AgentVersionDeployment table to point to previous version
- Agent's AgentConfig is updated to match the previous version's configuration (includes prompts embedded within config)
- Agents reload configuration from database (hot-reload)
- No code deployment required (unless code was also changed)
- Full audit trail maintained
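Rollback can reuse the same routine: read the rollbackVersionId recorded on the active deployment and deploy that version again. A sketch, using the same assumed DeployDb helpers and deployToProduction function from the deployment example above:

```typescript
// Sketch only; reuses the assumed DeployDb helpers and deployToProduction from the deployment sketch above.
async function rollbackProduction(db: DeployDb, agentId: string, deployedBy: string): Promise<void> {
  const current = await db.getActiveDeployment(agentId, "production");
  if (!current?.rollbackVersionId) {
    throw new Error(`No rollback version recorded for agent ${agentId}`);
  }
  // Re-deploying the previous version re-syncs AgentConfig, so agents hot-reload the old configuration.
  await deployToProduction(db, agentId, current.rollbackVersionId, deployedBy, "Rollback to previous production version");
}
```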
Experimentation Workflow
Creating an Experimental Version
- Fork from Production:
  - Admin selects current production version
  - Creates experimental copy with status: 'experimental'
  - Can modify configs and prompts in sandbox
- Edit Configuration:
  - Admin edits agent config via admin dashboard (prompts are embedded within the config)
  - Changes saved to experimental version only
  - No impact on production
- Run Test Execution:
  - Admin configures test parameters:
    - Test users/tickers
    - Test type (historical/live/synthetic)
    - Sample size
  - System runs experimental version on test data
  - Results stored in AgentExperiment table
- Review Comparison:
  - System compares experimental vs production results
  - Shows side-by-side metrics:
    - Execution time
    - Success rate
    - Quality scores
    - Cost impact
    - Sample outputs
  - Admin reviews comparison dashboard
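The comparison block stored on AgentExperiment can be derived directly from the two results objects. A minimal sketch, using the result fields from the schema above; the isBetter heuristic is illustrative, not the production rule:

```typescript
interface ExperimentResults {
  executionTime: number; // average execution time (ms)
  successRate: number;   // 0-1
  qualityScore: number;  // 0-1
  cost: number;          // USD
}

// Percent change of the experimental value relative to the baseline.
function pctDelta(experimental: number, baseline: number): number {
  return baseline === 0 ? 0 : ((experimental - baseline) / baseline) * 100;
}

function compareToBaseline(experimental: ExperimentResults, baseline: ExperimentResults) {
  const comparison = {
    executionTimeDelta: pctDelta(experimental.executionTime, baseline.executionTime),
    successRateDelta: pctDelta(experimental.successRate, baseline.successRate),
    qualityScoreDelta: pctDelta(experimental.qualityScore, baseline.qualityScore),
    costDelta: pctDelta(experimental.cost, baseline.cost),
    isBetter: false,
  };
  // Illustrative overall assessment: quality and success must not regress,
  // and execution time / cost must stay within example tolerances.
  comparison.isBetter =
    comparison.qualityScoreDelta >= 0 &&
    comparison.successRateDelta >= 0 &&
    comparison.executionTimeDelta <= 20 &&
    comparison.costDelta <= 10;
  return comparison;
}
```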
Validation System
Before promoting to production, versions must pass validation gates:
Automated Validations:
- Performance Validation:
  - Execution time must not exceed threshold (e.g., +20% vs baseline)
  - Success rate must meet minimum (e.g., ≥95%)
- Quality Validation:
  - Quality score must meet minimum threshold (e.g., ≥0.8)
  - Error rate must not exceed threshold (e.g., ≤5%)
- Cost Validation:
  - Cost increase must not exceed budget threshold (e.g., +10%)
- Error Rate Validation:
  - Error rate must not be higher than baseline
Manual Validations:
- Admin review of sample outputs
- Approval from required reviewers
- Business logic validation
Validation Results:
- All validations must pass for production promotion
- Warnings can be overridden with admin approval
- Failed validations block promotion
- Validation history stored for audit
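A sketch of how the automated gates might be evaluated against an experiment's metrics. The thresholds mirror the examples above but are assumptions; in practice they are configured per agent:

```typescript
type ValidationStatus = "passed" | "failed" | "warning";

interface ValidationCheck {
  validationType: "performance" | "quality" | "cost" | "error-rate";
  threshold: number;
  actualValue: number;
  passed: boolean;
  status: ValidationStatus;
  message: string;
}

// Evaluate the automated gates for one experiment (illustrative thresholds).
function runAutomatedValidations(c: {
  executionTimeDelta: number; // % vs baseline
  successRate: number;        // 0-1
  qualityScore: number;       // 0-1
  costDelta: number;          // % vs baseline
  errorRateDelta: number;     // % vs baseline
}): ValidationCheck[] {
  const check = (
    validationType: ValidationCheck["validationType"],
    actualValue: number,
    threshold: number,
    passed: boolean,
    message: string,
  ): ValidationCheck => ({ validationType, threshold, actualValue, passed, status: passed ? "passed" : "failed", message });

  return [
    check("performance", c.executionTimeDelta, 20, c.executionTimeDelta <= 20, "Execution time within +20% of baseline"),
    check("performance", c.successRate, 0.95, c.successRate >= 0.95, "Success rate at least 95%"),
    check("quality", c.qualityScore, 0.8, c.qualityScore >= 0.8, "Quality score at least 0.8"),
    check("cost", c.costDelta, 10, c.costDelta <= 10, "Cost increase within +10% budget"),
    check("error-rate", c.errorRateDelta, 0, c.errorRateDelta <= 0, "Error rate not higher than baseline"),
  ];
}

// Promotion is blocked unless every automated check passes.
const canPromote = (checks: ValidationCheck[]): boolean => checks.every((v) => v.passed);
```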
Promotion Workflow
- Run Validation Suite:
  - Admin triggers validation from dashboard
  - System runs all automated checks
  - Results displayed in validation dashboard
- Review Results:
  - Admin reviews validation results
  - Can view detailed comparison metrics
  - Can review sample outputs
- Approve Promotion:
  - If validations pass, admin can promote
  - Can promote to 'testing' (A/B test) or directly to 'production'
  - Promotion requires confirmation
  - Audit log entry created
- Deployment:
  - System updates AgentVersionDeployment table
  - Agent instances hot-reload configuration
  - Previous version tracked for rollback
  - Monitoring alerts configured
Admin Dashboard Features
Experimental Workspace (/admin/agents/experiments)
- Version Browser: View all versions for each agent
- Create Experimental: Fork production version to experimental
- Config Editor: Edit agent configurations in sandbox (includes prompts embedded within config)
- Test Runner: Configure and run test executions
- Comparison Dashboard: Side-by-side comparison of versions
- Validation Suite: Run and view validation results
- Promotion Controls: Promote versions with validation gates
Version Validator (/admin/agents/versions/[id]/validate)
- Validation Dashboard: View all validation checks
- Run Validations: Trigger validation suite
- Threshold Configuration: Configure validation thresholds per agent
- Override Controls: Override warnings with approval workflow
- History: View validation history for version
Version Comparison (/admin/agents/versions/compare)
- Side-by-Side View: Compare any two versions
- Metrics Comparison: Execution time, success rate, quality, cost
- Output Comparison: Sample outputs from each version
- Diff View: Configuration differences (prompts are embedded within config)
- Performance Charts: Visual comparison of metrics over time
Workflow Examples
Example 1: Testing a New Prompt
- Admin navigates to /admin/agents/experiments
- Selects "Content Generation Agent"
- Creates experimental version from current production
- Edits prompt with new instructions via prompt editor
- Tests prompt with sample input
- Reviews output quality
- Runs test execution on sample newsletters
- Compares results with production version
- Runs validation suite
- If validations pass, promotes to testing (A/B test)
- After sufficient data, promotes to production
Example 2: Optimizing Agent Configuration
- Admin navigates to /admin/agents/experiments
- Selects "Query Strategy Agent"
- Creates experimental version from current production
- Edits configuration (e.g., changes entity discovery settings)
- Saves experimental version
- Runs test execution on historical data (last 30 days)
- Reviews comparison dashboard:
- Execution time: -15% (improved)
- Success rate: 98% (same)
- Quality score: 0.85 (improved from 0.82)
- Cost: +5% (acceptable)
- Runs validation suite - all checks pass
- Promotes to testing status for A/B test
- Monitors A/B test results for 1 week
- Confirms improvements, promotes to production
Example 3: Quick Rollback
- New production version deployed
- Monitoring alerts show increased error rate
- Admin navigates to /admin/agents/versions
- Views current production version
- Clicks "Rollback" button
- Confirms rollback to previous version
- System updates AgentVersionDeployment table
- Agents hot-reload previous configuration
- Error rate returns to normal
- Admin investigates issue in experimental environment
Best Practices
- Always Test First: Never deploy directly to production without testing
- Use Historical Tests: Test on historical data to validate behavior
- Set Appropriate Thresholds: Configure validation thresholds based on business requirements
- Monitor A/B Tests: Use testing phase to gather real-world metrics
- Document Changes: Always include rationale and expected impact in version metadata
- Review Sample Outputs: Manually review sample outputs before promotion
- Gradual Rollout: Consider promoting to testing before production
- Keep Rollback Ready: Always know which version to rollback to
- Track Metrics: Monitor version performance after deployment
- Audit Trail: All changes are logged for compliance and debugging
Integration with Learning Agent
The Learning Agent can create optimized versions automatically:
- Learning Agent analyzes metrics and identifies optimizations
- Creates new agent version with optimized configuration
- Version starts as 'draft' status
- Admin reviews optimization rationale in metadata
- Admin can promote to experimental for testing
- After validation, admin promotes to production
- Learning Agent tracks performance of new version
- Cycle continues with continuous improvement
This integration ensures that automated optimizations go through the same validation process as manual changes, maintaining quality and safety.
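As a sketch of step 2 in the list above, the Learning Agent would insert a new AgentVersion row in 'draft' status with its rationale captured in metadata, so an admin can review it before any testing. The helper names below are assumptions:

```typescript
// Hypothetical persistence helper for creating version records.
interface VersionDb {
  createAgentVersion(row: {
    agentId: string;
    version: string;
    config: object;
    codeHash: string;
    createdBy: "learning-agent" | "admin" | "manual";
    status: "draft";
    metadata: { changelog?: string; rationale?: string; expectedImpact?: string };
  }): Promise<{ id: string }>;
}

// Called by the Learning Agent after it has produced an optimized configuration.
async function proposeOptimizedVersion(
  db: VersionDb,
  agentId: string,
  nextVersion: string,     // e.g. a bumped semantic version such as "1.4.0"
  optimizedConfig: object, // includes the embedded prompts
  codeHash: string,
  rationale: string,
): Promise<string> {
  const created = await db.createAgentVersion({
    agentId,
    version: nextVersion,
    config: optimizedConfig,
    codeHash,
    createdBy: "learning-agent",
    status: "draft", // drafts are only promoted to 'experimental' by an admin
    metadata: {
      changelog: "Automated optimization proposed by the Learning Agent",
      rationale,
      expectedImpact: "To be confirmed through experimentation and validation",
    },
  });
  return created.id;
}
```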