Milestone 6 - Enhanced Data Collection

Summary

Add multiple data sources, improve deduplication, and add relevance scoring.

Goal

Expand data collection to multiple sources and improve data quality through better deduplication and relevance scoring.

Deliverables

Data Collection Agent (Enhanced)

✅ Independent Scheduling (Maintained):
- Continues running independently every 1-4 hours (configurable per ticker)
- High-priority tickers: Every 1 hour
- Low-priority tickers: Every 6-12 hours
- Reads latest queries from database (from Query Strategy Agent)
✅ Additional News Sources:
- Google News API integration
- RSS feed parsing (if available)
- At least 2-3 news sources total
✅ Social Media Integration (Basic):
- Twitter/X API integration (v2) - basic search
- Simple post collection (10-20 posts per ticker)
✅ Improved Deduplication:
- AI-powered content similarity (for near-duplicates)
- Title similarity with fuzzy matching
- URL normalization
✅ Relevance Scoring:
- AI-powered relevance scoring
- Filter low-relevance content (threshold-based)
- Score-based prioritization
✅ Source Management:
- Source health monitoring
- Graceful degradation when sources fail
- Source-specific rate limiting

Query Strategy Agent (Minor Enhancement)

✅ Better query optimization:
- Track query performance
- Prioritize high-performing queries

Agent Versioning System (Enhanced)

✅ Version Creation and Tracking:
- Uses AgentVersion table (already created in Milestone 1)
- Version creation for agent configs and prompts (configs include prompts embedded within them)
- Version metadata (changelog, rationale, expected impact)
- Version status tracking (draft, experimental, testing, production, deprecated)
✅ Enhanced Deployment System:
- Uses AgentVersionDeployment table (already created in Milestone 1)
- Only one version per agentId can be active in production
- Version promotion flow:
  1. Update AgentVersionDeployment to mark new version as active
  2. Update AgentConfig table to match version's configuration from AgentVersion.config
  3. Agents read from AgentConfig at runtime (hot-reload without restart)
- Rollback capability to previous production version:
  - Update AgentVersionDeployment to previous version
  - Update AgentConfig to match previous version's config
  - Agents automatically reload configuration
✅ Basic Experimentation:
- Create experimental versions from production
- Run test executions on experimental versions
- Compare experimental vs production performance
- Basic validation checks before promotion (stored in AgentValidation table)
✅ Admin Dashboard:
- Basic version management UI (/admin/agents/versions)
- Version comparison tools
- Deployment controls (promote to production, rollback)
- Simple validation dashboard

Note: The Agent Versioning System is a foundational infrastructure feature that enables safe iteration on agent improvements. While it's introduced in this milestone, it will be fully utilized in later milestones (particularly Milestone 14) when the Learning Agent creates optimized versions automatically.

Hermes Orchestrator (Instance Scaling Enhancement)

✅ Automatic Instance Scaling:
- Monitors job queue size and instance load via AgentInstance table
- Spawns new instances when queue grows or instances are at capacity
- Scales down instances when load decreases (terminates idle instances)
- Configurable scaling policies (min/max instances per agent type)
- Instance lifecycle fully managed by orchestrator
✅ Instance Health Monitoring:
- Monitors instance heartbeats via Agent Registry API
- Automatically replaces failed or unhealthy instances (instances with stale heartbeats)
- Tracks instance metrics (success rate, average execution time)
- Queries AgentInstance table for health status

Database Updates

✅ Agent version tables already exist (AgentVersion, AgentVersionDeployment - created in Milestone 1)
✅ Basic experimentation tables (AgentExperiment, AgentValidation)
✅ Version tracking in agent configs (configs stored in AgentConfig table, snapshots in AgentVersion table)

Task Timeline

Limitations (Acceptable for This Milestone)

Limited social media data (Twitter only, basic search)
No earnings calls or SEC filings yet
Basic relevance scoring
No advanced data enrichment
Basic versioning (no advanced analytics or automated optimization)
Manual version promotion (no automated promotion workflows)

Success Criteria

✅ Collects data from 3+ sources
✅ Collects 50+ news articles per ticker per day
✅ Collects 20+ social media posts per ticker per day
✅ Deduplication removes 80%+ of duplicates
✅ Relevance scoring filters low-quality content effectively
✅ System handles multiple sources gracefully
✅ Agent versioning system works correctly (version creation, deployment, rollback)
✅ Only one version per agent can be active in production at a time
✅ Version promotion updates both AgentVersionDeployment and AgentConfig tables
✅ Agents hot-reload configuration when version is promoted (no restart required)
✅ Admins can create and test experimental versions
✅ Version comparison and rollback function properly
✅ Orchestrator can automatically scale instances up/down based on load
✅ Orchestrator replaces failed instances automatically

Next Steps

After this milestone, data collection is robust with multiple sources and agent versioning enables safe iteration on improvements. Milestone 7 will add comprehensive analysis (competitive, sentiment, and event/context analysis).

Milestone 6 - Enhanced Data Collection

On this page