Versioning & Experimentation
Purpose
Enable project admins to safely experiment with agent configurations (which include prompts embedded within them) and strategies without affecting production. The system provides a complete workflow from experimental testing to production deployment, with validation gates and confidence metrics.
Overview
The agent versioning and experimentation system allows admins to:
- Create experimental versions of agent configs (which include prompts embedded within them)
- Test changes in isolated environments
- Compare experimental versions against production
- Validate changes meet quality and performance thresholds
- Deploy with confidence after validation
- Rollback instantly if issues arise
Version Tracking in Outputs
Every agent output includes version information for traceability and debugging. All agent outputs contain an agentVersion field that specifies the semantic version (e.g., "1.2.3") of the agent that generated the output.
How Version is Determined
- Agents read their active version from the AgentVersionDeployment table during initialization
- The version corresponds to the currently deployed version for the agent in the production environment
- The version is included in all outputs, including error outputs and partial results
- This enables traceability: you can identify exactly which agent version generated any output in the system
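As an illustration, the sketch below shows one way an agent could resolve its active version during initialization and stamp it onto every output, including error outputs. The db helper, class, and output shape are hypothetical placeholders, not the actual implementation.

```typescript
// Hypothetical data-access helper; the real persistence layer may differ.
interface Db {
  getActiveDeployment(agentId: string, environment: string): Promise<{ versionId: string; version: string }>;
}

interface AgentOutput<T> {
  agentId: string;      // which agent type produced the output
  agentVersion: string; // semantic version of the deployed agent, e.g. "1.2.3"
  success: boolean;
  data?: T;
  error?: string;
}

class VersionedAgent {
  private version = "0.0.0"; // replaced during init()

  constructor(private readonly agentId: string, private readonly db: Db) {}

  // Read the active version from the AgentVersionDeployment table during initialization.
  async init(): Promise<void> {
    const deployment = await this.db.getActiveDeployment(this.agentId, "production");
    this.version = deployment.version;
  }

  // Every output (including error outputs) carries agentVersion for traceability.
  protected output<T>(data?: T, error?: string): AgentOutput<T> {
    return {
      agentId: this.agentId,
      agentVersion: this.version,
      success: error === undefined,
      data,
      error,
    };
  }
}
```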
Benefits
- Traceability: Track which version of an agent generated specific outputs
- Debugging: Identify version-specific issues by correlating outputs with agent versions
- Audit Trail: Maintain a complete record of which agent version was responsible for each result
- Experimentation: Compare outputs from different agent versions during A/B testing
- Rollback Analysis: Understand the impact of version changes by tracking outputs before and after deployments
The agentVersion field is separate from agentId (which identifies the agent type) and provides the specific version that generated the output. This is essential for the versioning and experimentation workflow, as it allows admins to track which version produced which results.
Database Schema
Core Tables
AgentVersion - Stores all agent versions with metadata:
{
id: string
agentId: string // Agent identifier (scheduler, query-strategy, etc.)
version: string // Semantic version (e.g., "1.2.3")
config: object // Agent configuration snapshot (JSONB) - from AgentConfig
// Includes prompts embedded within the config (e.g., systemPrompt, extractionPrompt, etc.)
codeHash: string // Hash of agent implementation code (for tracking code changes)
createdAt: Date
createdBy: string // 'learning-agent' | 'admin' | 'manual'
metadata: {
changelog?: string // What changed in this version
rationale?: string // Why the change was made
expectedImpact?: string // Expected improvements
performanceMetrics?: object // Historical performance data
}
status: 'draft' | 'experimental' | 'testing' | 'production' | 'deprecated'
}

Note: AgentVersion stores snapshots of configurations (which include prompts embedded within them), but does not replace AgentConfig. When a version is deployed:
- AgentConfig is updated to match the version's configuration from AgentVersion.config
- Agents read from AgentConfig at runtime (this is the source of truth)
- The AgentVersion serves as a historical record and rollback point
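For concreteness, an AgentVersion record for a hypothetical content-generation agent might look like the following; every value (model, prompt text, hash, dates) is illustrative only.

```typescript
// Illustrative AgentVersion record; every value here is made up for the example.
const exampleVersion = {
  id: "ver_001",
  agentId: "content-generation",
  version: "1.3.0",
  config: {
    model: "gpt-4o",   // assumed config field
    temperature: 0.4,  // assumed config field
    systemPrompt: "You are a financial newsletter writer. Be concise and factual.", // prompt embedded in the config
  },
  codeHash: "9f2c1ab4", // shortened example hash of the agent implementation
  createdAt: new Date("2024-05-01T12:00:00Z"),
  createdBy: "admin",
  metadata: {
    changelog: "Tightened the system prompt to reduce filler text",
    rationale: "Quality reviews flagged verbose introductions",
    expectedImpact: "Higher quality score at roughly the same cost",
  },
  status: "experimental",
};
```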
AgentVersionDeployment - Tracks which version is active in each environment:
{
id: string
agentId: string
versionId: string // Reference to AgentVersion
environment: 'production' | 'staging' | 'experimental' | 'development'
deployedAt: Date
deployedBy: string // Admin user ID
rollbackVersionId?: string // Previous version for quick rollback
deploymentNotes?: string // Why this version was deployed
}

Experimentation Tables
AgentExperiment - Track experimental runs and comparisons:
{
id: string
agentId: string
versionId: string // Experimental version being tested
baselineVersionId: string // Production version to compare against
status: 'running' | 'completed' | 'failed'
testConfig: {
testUsers?: string[] // Specific users to test with
testTickers?: string[] // Specific tickers to test with
testType: 'historical' | 'live' | 'synthetic'
sampleSize?: number // Number of test cases
dateRange?: { // For historical tests
start: Date
end: Date
}
}
results: {
executionTime: number // Average execution time (ms)
successRate: number // Success rate (0-1)
qualityScore: number // Quality score (0-1)
cost: number // API cost in USD
errorCount: number
sampleOutputs: object[] // Sample outputs for review
metrics: {
newsletterGenerated: number
averageEngagement?: number
userSatisfaction?: number
}
}
comparison: {
executionTimeDelta: number // % change vs baseline
successRateDelta: number
qualityScoreDelta: number
costDelta: number
isBetter: boolean // Overall assessment
}
createdAt: Date
completedAt?: Date
createdBy: string // Admin user ID
notes?: string
}

AgentValidation - Track validation checks before promotion:
{
id: string
versionId: string
validationType: 'performance' | 'quality' | 'cost' | 'error-rate' | 'manual-review'
status: 'pending' | 'passed' | 'failed' | 'warning'
threshold: number // Required threshold
actualValue: number // Actual measured value
passed: boolean
message?: string // Human-readable result
notes?: string
validatedBy?: string // Admin user ID
validatedAt?: Date
experimentId?: string // Link to experiment that generated this validation
}

Version Lifecycle
1. Version Creation
Versions can be created from multiple sources:
- Learning Agent: Automatically creates versions when optimizing configurations
- Admin Manual: Admins create versions via admin dashboard
- Experimental Fork: Create experimental version from existing production version
Status Flow:
draft → experimental → testing → production
                                     ↓
                                deprecated

2. Experimental Phase
- Versions with status: 'experimental' are for testing and validation
- Run in isolated execution context
- No impact on production users or data
- Can run test executions on:
- Historical data (replay past scenarios)
- Test user accounts
- Sample tickers
- Synthetic test cases
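For example, a historical replay over a 30-day window restricted to a couple of sample tickers could be described with a testConfig along these lines (values are illustrative):

```typescript
// Illustrative testConfig for a historical experiment (see AgentExperiment.testConfig).
const historicalTestConfig = {
  testType: "historical" as const,
  testTickers: ["AAPL", "MSFT"], // assumed sample tickers
  sampleSize: 50,                // number of past scenarios to replay
  dateRange: {
    start: new Date("2024-04-01"),
    end: new Date("2024-04-30"),
  },
};
```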
3. Testing Phase
- Versions promoted to status: 'testing' run alongside production
- A/B testing on a subset of traffic
- Performance metrics collected for comparison
- Can be promoted to production or reverted to experimental
4. Production Deployment
- Only versions that pass validation can be deployed to production
- Deployment Process (see the sketch below):
  - AgentVersionDeployment table is updated to mark the new version as active
  - AgentConfig table is updated to match the version's configuration from AgentVersion.config (includes prompts embedded within config)
  - Agents read from AgentConfig at runtime (the source of truth)
  - Agent instances reload configuration from database (hot-reload without restart, but may require agent re-initialization)
- Previous production version automatically tracked for rollback
- Note: Code changes (if any) still require code deployment, but config/prompt changes can be hot-reloaded
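A minimal sketch of that deployment step is shown below. The DeployDb helper names are assumptions standing in for the real persistence layer; the sketch only illustrates the ordering described above: activate the new deployment, sync AgentConfig from the version snapshot, and keep the previous version for rollback.

```typescript
// Hypothetical persistence helpers; names and signatures are assumptions, not the real API.
interface DeployDb {
  getActiveDeployment(
    agentId: string,
    environment: string,
  ): Promise<{ versionId: string; rollbackVersionId?: string } | null>;
  getVersion(versionId: string): Promise<{ id: string; config: object }>;
  upsertDeployment(row: {
    agentId: string;
    versionId: string;
    environment: string;
    deployedBy: string;
    rollbackVersionId?: string;
    deploymentNotes?: string;
  }): Promise<void>;
  updateAgentConfig(agentId: string, config: object): Promise<void>;
}

async function deployToProduction(
  db: DeployDb,
  agentId: string,
  versionId: string,
  deployedBy: string,
  notes?: string,
): Promise<void> {
  const previous = await db.getActiveDeployment(agentId, "production");
  const version = await db.getVersion(versionId);

  // 1. Mark the new version as active and remember the previous one for rollback.
  await db.upsertDeployment({
    agentId,
    versionId: version.id,
    environment: "production",
    deployedBy,
    rollbackVersionId: previous?.versionId,
    deploymentNotes: notes,
  });

  // 2. Sync AgentConfig (the runtime source of truth) from the version's config snapshot.
  await db.updateAgentConfig(agentId, version.config);

  // 3. Agents hot-reload AgentConfig from the database; no code deployment is needed
  //    unless the implementation itself changed (tracked via codeHash).
}
```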
5. Version Rollback
- Instant rollback to previous production version
- Updates AgentVersionDeployment table to point to previous version
- Agent's AgentConfig is updated to match the previous version's configuration (includes prompts embedded within config)
- Agents reload configuration from database (hot-reload)
- No code deployment required (unless code was also changed)
- Full audit trail maintained
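Rollback can reuse the same routine: read the rollbackVersionId recorded on the active deployment and deploy that version again. A sketch, using the same assumed DeployDb helpers and deployToProduction function from the deployment example above:

```typescript
// Sketch only; reuses the assumed DeployDb helpers and deployToProduction from the deployment sketch above.
async function rollbackProduction(db: DeployDb, agentId: string, deployedBy: string): Promise<void> {
  const current = await db.getActiveDeployment(agentId, "production");
  if (!current?.rollbackVersionId) {
    throw new Error(`No rollback version recorded for agent ${agentId}`);
  }
  // Re-deploying the previous version re-syncs AgentConfig, so agents hot-reload the old configuration.
  await deployToProduction(db, agentId, current.rollbackVersionId, deployedBy, "Rollback to previous production version");
}
```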
Experimentation Workflow
Creating an Experimental Version
- Fork from Production:
  - Admin selects current production version
  - Creates experimental copy with status: 'experimental'
  - Can modify configs and prompts in sandbox
- Edit Configuration:
  - Admin edits agent config via admin dashboard (prompts are embedded within the config)
  - Changes saved to experimental version only
  - No impact on production
- Run Test Execution:
  - Admin configures test parameters:
    - Test users/tickers
    - Test type (historical/live/synthetic)
    - Sample size
  - System runs experimental version on test data
  - Results stored in AgentExperiment table
- Review Comparison:
  - System compares experimental vs production results
  - Shows side-by-side metrics:
    - Execution time
    - Success rate
    - Quality scores
    - Cost impact
    - Sample outputs
  - Admin reviews comparison dashboard
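The comparison block stored on AgentExperiment can be derived directly from the two results objects. A minimal sketch, using the result fields from the schema above; the isBetter heuristic is illustrative, not the production rule:

```typescript
interface ExperimentResults {
  executionTime: number; // average execution time (ms)
  successRate: number;   // 0-1
  qualityScore: number;  // 0-1
  cost: number;          // USD
}

// Percent change of the experimental value relative to the baseline.
function pctDelta(experimental: number, baseline: number): number {
  return baseline === 0 ? 0 : ((experimental - baseline) / baseline) * 100;
}

function compareToBaseline(experimental: ExperimentResults, baseline: ExperimentResults) {
  const comparison = {
    executionTimeDelta: pctDelta(experimental.executionTime, baseline.executionTime),
    successRateDelta: pctDelta(experimental.successRate, baseline.successRate),
    qualityScoreDelta: pctDelta(experimental.qualityScore, baseline.qualityScore),
    costDelta: pctDelta(experimental.cost, baseline.cost),
    isBetter: false,
  };
  // Illustrative overall assessment: quality and success must not regress,
  // and execution time / cost must stay within example tolerances.
  comparison.isBetter =
    comparison.qualityScoreDelta >= 0 &&
    comparison.successRateDelta >= 0 &&
    comparison.executionTimeDelta <= 20 &&
    comparison.costDelta <= 10;
  return comparison;
}
```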
Validation System
Before promoting to production, versions must pass validation gates:
Automated Validations:
- Performance Validation:
  - Execution time must not exceed threshold (e.g., +20% vs baseline)
  - Success rate must meet minimum (e.g., ≥95%)
- Quality Validation:
  - Quality score must meet minimum threshold (e.g., ≥0.8)
  - Error rate must not exceed threshold (e.g., ≤5%)
- Cost Validation:
  - Cost increase must not exceed budget threshold (e.g., +10%)
- Error Rate Validation:
  - Error rate must not be higher than baseline
Manual Validations:
- Admin review of sample outputs
- Approval from required reviewers
- Business logic validation
Validation Results:
- All validations must pass for production promotion
- Warnings can be overridden with admin approval
- Failed validations block promotion
- Validation history stored for audit
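A sketch of how the automated gates might be evaluated against an experiment's metrics. The thresholds mirror the examples above but are assumptions; in practice they are configured per agent:

```typescript
type ValidationStatus = "passed" | "failed" | "warning";

interface ValidationCheck {
  validationType: "performance" | "quality" | "cost" | "error-rate";
  threshold: number;
  actualValue: number;
  passed: boolean;
  status: ValidationStatus;
  message: string;
}

// Evaluate the automated gates for one experiment (illustrative thresholds).
function runAutomatedValidations(c: {
  executionTimeDelta: number; // % vs baseline
  successRate: number;        // 0-1
  qualityScore: number;       // 0-1
  costDelta: number;          // % vs baseline
  errorRateDelta: number;     // % vs baseline
}): ValidationCheck[] {
  const check = (
    validationType: ValidationCheck["validationType"],
    actualValue: number,
    threshold: number,
    passed: boolean,
    message: string,
  ): ValidationCheck => ({ validationType, threshold, actualValue, passed, status: passed ? "passed" : "failed", message });

  return [
    check("performance", c.executionTimeDelta, 20, c.executionTimeDelta <= 20, "Execution time within +20% of baseline"),
    check("performance", c.successRate, 0.95, c.successRate >= 0.95, "Success rate at least 95%"),
    check("quality", c.qualityScore, 0.8, c.qualityScore >= 0.8, "Quality score at least 0.8"),
    check("cost", c.costDelta, 10, c.costDelta <= 10, "Cost increase within +10% budget"),
    check("error-rate", c.errorRateDelta, 0, c.errorRateDelta <= 0, "Error rate not higher than baseline"),
  ];
}

// Promotion is blocked unless every automated check passes.
const canPromote = (checks: ValidationCheck[]): boolean => checks.every((v) => v.passed);
```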
Promotion Workflow
- Run Validation Suite:
  - Admin triggers validation from dashboard
  - System runs all automated checks
  - Results displayed in validation dashboard
- Review Results:
  - Admin reviews validation results
  - Can view detailed comparison metrics
  - Can review sample outputs
- Approve Promotion:
  - If validations pass, admin can promote
  - Can promote to 'testing' (A/B test) or directly to 'production'
  - Promotion requires confirmation
  - Audit log entry created
- Deployment:
  - System updates AgentVersionDeployment table
  - Agent instances hot-reload configuration
  - Previous version tracked for rollback
  - Monitoring alerts configured
Admin Dashboard Features
Experimental Workspace (/admin/agents/experiments)
- Version Browser: View all versions for each agent
- Create Experimental: Fork production version to experimental
- Config Editor: Edit agent configurations in sandbox (includes prompts embedded within config)
- Test Runner: Configure and run test executions
- Comparison Dashboard: Side-by-side comparison of versions
- Validation Suite: Run and view validation results
- Promotion Controls: Promote versions with validation gates
Version Validator (/admin/agents/versions/[id]/validate)
- Validation Dashboard: View all validation checks
- Run Validations: Trigger validation suite
- Threshold Configuration: Configure validation thresholds per agent
- Override Controls: Override warnings with approval workflow
- History: View validation history for version
Version Comparison (/admin/agents/versions/compare)
- Side-by-Side View: Compare any two versions
- Metrics Comparison: Execution time, success rate, quality, cost
- Output Comparison: Sample outputs from each version
- Diff View: Configuration differences (prompts are embedded within config)
- Performance Charts: Visual comparison of metrics over time
Workflow Examples
Example 1: Testing a New Prompt
- Admin navigates to /admin/agents/experiments
- Selects "Content Generation Agent"
- Creates experimental version from current production
- Edits prompt with new instructions via prompt editor
- Tests prompt with sample input
- Reviews output quality
- Runs test execution on sample newsletters
- Compares results with production version
- Runs validation suite
- If validations pass, promotes to testing (A/B test)
- After sufficient data, promotes to production
Example 2: Optimizing Agent Configuration
- Admin navigates to /admin/agents/experiments
- Selects "Query Strategy Agent"
- Creates experimental version from current production
- Edits configuration (e.g., changes entity discovery settings)
- Saves experimental version
- Runs test execution on historical data (last 30 days)
- Reviews comparison dashboard:
- Execution time: -15% (improved)
- Success rate: 98% (same)
- Quality score: 0.85 (improved from 0.82)
- Cost: +5% (acceptable)
- Runs validation suite - all checks pass
- Promotes to testing status for A/B test
- Monitors A/B test results for 1 week
- Confirms improvements, promotes to production
Example 3: Quick Rollback
- New production version deployed
- Monitoring alerts show increased error rate
- Admin navigates to /admin/agents/versions
- Views current production version
- Clicks "Rollback" button
- Confirms rollback to previous version
- System updates AgentVersionDeployment table
- Agents hot-reload previous configuration
- Error rate returns to normal
- Admin investigates issue in experimental environment
Best Practices
- Always Test First: Never deploy directly to production without testing
- Use Historical Tests: Test on historical data to validate behavior
- Set Appropriate Thresholds: Configure validation thresholds based on business requirements
- Monitor A/B Tests: Use testing phase to gather real-world metrics
- Document Changes: Always include rationale and expected impact in version metadata
- Review Sample Outputs: Manually review sample outputs before promotion
- Gradual Rollout: Consider promoting to testing before production
- Keep Rollback Ready: Always know which version to rollback to
- Track Metrics: Monitor version performance after deployment
- Audit Trail: All changes are logged for compliance and debugging
Integration with Learning Agent
The Learning Agent can create optimized versions automatically:
- Learning Agent analyzes metrics and identifies optimizations
- Creates new agent version with optimized configuration
- Version starts as 'draft' status
- Admin reviews optimization rationale in metadata
- Admin can promote to experimental for testing
- After validation, admin promotes to production
- Learning Agent tracks performance of new version
- Cycle continues with continuous improvement
This integration ensures that automated optimizations go through the same validation process as manual changes, maintaining quality and safety.
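As a sketch of step 2 in the list above, the Learning Agent would insert a new AgentVersion row in 'draft' status with its rationale captured in metadata, so an admin can review it before any testing. The helper names below are assumptions:

```typescript
// Hypothetical persistence helper for creating version records.
interface VersionDb {
  createAgentVersion(row: {
    agentId: string;
    version: string;
    config: object;
    codeHash: string;
    createdBy: "learning-agent" | "admin" | "manual";
    status: "draft";
    metadata: { changelog?: string; rationale?: string; expectedImpact?: string };
  }): Promise<{ id: string }>;
}

// Called by the Learning Agent after it has produced an optimized configuration.
async function proposeOptimizedVersion(
  db: VersionDb,
  agentId: string,
  nextVersion: string,     // e.g. a bumped semantic version such as "1.4.0"
  optimizedConfig: object, // includes the embedded prompts
  codeHash: string,
  rationale: string,
): Promise<string> {
  const created = await db.createAgentVersion({
    agentId,
    version: nextVersion,
    config: optimizedConfig,
    codeHash,
    createdBy: "learning-agent",
    status: "draft", // drafts are only promoted to 'experimental' by an admin
    metadata: {
      changelog: "Automated optimization proposed by the Learning Agent",
      rationale,
      expectedImpact: "To be confirmed through experimentation and validation",
    },
  });
  return created.id;
}
```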