Project Planning
Milestone 3 - Basic Data Collection
Summary
Replace mock data with real data collection from a single, simple source.
Timeline
Weeks 7-8
Goal
Implement real data collection from one reliable source (NewsAPI) to replace hardcoded data.
Deliverables
Data Collection Agent (Enhanced)
- ✅ Agent Versioning:
- Reads active version from AgentVersionDeployment table during initialization
- Includes agentVersion field in all outputs
- Version information stored in AgentVersion table
- ✅ Agent Registration:
- Registers agent type metadata via Agent Registry API (POST /api/registry/register/)
- Registers instance via Agent Registry API (POST /api/register/) when spawned by orchestrator
- Reports heartbeat via Agent Registry API (POST /api/heartbeat/) with current load and status
- Updates capacity and load information in real-time
- ✅ Orchestrator-Triggered Execution:
- Runs on schedule created by admin (every 2-4 hours by default, configurable per ticker)
- Orchestrator invokes agent HTTP endpoint with job parameters
- Can also be triggered by orchestrator if data is stale during newsletter generation
- Uses hardcoded query mapping (ticker → company name) stored in code for this milestone
- ✅ NewsAPI integration:
- API client setup with authentication
- Query string construction from ticker (e.g., "AAPL" → "Apple Inc" → NewsAPI search query)
- Article fetching and parsing (title, URL, description, published date)
- Rate limiting handling (respects NewsAPI rate limits)
- ✅ Data storage via Agent Data API:
- Writes collected data via POST /api/data-collection/ endpoint
- News articles with metadata (title, url, snippet/description, publishedAt, collectedAt)
- Timestamps for freshness tracking
- Links to tickerId for association
- Includes agentVersion in all outputs
- ✅ Basic deduplication:
- URL-based deduplication (exact match)
- Title similarity (simple string matching, case-insensitive)
- ✅ Error handling:
- API failure handling
- Retry logic with exponential backoff
- Graceful degradation (continues with partial data if API fails partially)
- Updates AgentJobExecution table with status and errors
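The retry and graceful-degradation behaviour listed under error handling could be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the names retry_with_backoff and collect_all, the attempt count, and the base delay are all assumptions, and the fetch callable stands in for whatever performs the NewsAPI request.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,        # assumed value, not specified in the plan
    base_delay: float = 1.0,      # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure wait base_delay * 2**attempt and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; the caller records the error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


def collect_all(
    tickers: Iterable[str],
    fetch_for_ticker: Callable[[str], List[dict]],
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Graceful degradation: keep results for tickers that succeed and
    collect error messages for those that fail after all retries."""
    results: Dict[str, List[dict]] = {}
    errors: Dict[str, str] = {}
    for ticker in tickers:
        try:
            results[ticker] = retry_with_backoff(
                lambda: fetch_for_ticker(ticker), sleep=sleep
            )
        except Exception as exc:
            errors[ticker] = str(exc)
    return results, errors
```

In this sketch the errors dictionary is what would be written to the AgentJobExecution table alongside the job status, while the partial results still flow on to storage.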
Query Strategy (Minimal - Hardcoded)
- ✅ Simple query mapping (hardcoded in code):
- Ticker symbol → company name mapping (e.g., "AAPL" → "Apple Inc")
- Company name → NewsAPI search query (e.g., "Apple Inc" → "Apple Inc OR Apple")
- Mapping stored in code/configuration (not in database yet)
- No entity discovery yet
- No query optimization yet
- No database storage of queries yet (will be added in later milestones)
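One way the hardcoded mapping might look in code is sketched below. The dictionary contents beyond the AAPL example, the suffix-stripping rule, and the function names are assumptions made for illustration. The URL and the q, sortBy, and pageSize parameters are those of NewsAPI's public "everything" endpoint. The API key is omitted here and would be supplied separately.

```python
from urllib.parse import urlencode

NEWSAPI_EVERYTHING_URL = "https://newsapi.org/v2/everything"

# Hardcoded for this milestone; only the AAPL entry comes from the plan.
TICKER_TO_COMPANY = {
    "AAPL": "Apple Inc",
}

# Assumed list of corporate suffixes to strip when building the short name.
_SUFFIXES = (" Inc", " Corporation", " Corp", " Ltd")


def company_to_query(company: str) -> str:
    """Turn 'Apple Inc' into 'Apple Inc OR Apple'; names without a known
    suffix are used as-is."""
    for suffix in _SUFFIXES:
        if company.endswith(suffix):
            short_name = company[: -len(suffix)]
            return f"{company} OR {short_name}"
    return company


def build_newsapi_url(ticker: str, page_size: int = 20) -> str:
    """Build the request URL for a ticker; raises KeyError if unmapped."""
    company = TICKER_TO_COMPANY[ticker]
    params = {
        "q": company_to_query(company),
        "sortBy": "publishedAt",
        "pageSize": page_size,
    }
    return f"{NEWSAPI_EVERYTHING_URL}?{urlencode(params)}"
```

Keeping the mapping in a plain dictionary makes it straightforward to move into a database table in a later milestone without changing the callers.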
Limitations (Acceptable for This Milestone)
- Only one news source (NewsAPI)
- No social media data
- No earnings or SEC filings
- Simple query generation (ticker → company name mapping only, hardcoded in code)
- Queries not stored in database (hardcoded in code/configuration)
- Basic deduplication only (URL exact match and simple title similarity)
- No relevance scoring (all collected articles stored regardless of relevance)
- No entity discovery or relationship expansion (searches only for company name)
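To make the scope of the basic deduplication concrete, here is a small sketch under the simplest reading of the plan: URLs are compared exactly, and titles are compared after lowercasing and collapsing whitespace. The function names and the dictionary shape of an article are assumptions. Anything fuzzier than this normalized string match is outside what this example covers.

```python
from typing import Dict, List, Set


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def deduplicate(articles: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first article seen for each URL and each normalized title."""
    seen_urls: Set[str] = set()
    seen_titles: Set[str] = set()
    unique: List[Dict[str, str]] = []
    for article in articles:
        url = article["url"]
        title_key = normalize_title(article.get("title") or "")
        # Empty titles are never treated as duplicates of each other.
        if url in seen_urls or (title_key and title_key in seen_titles):
            continue
        seen_urls.add(url)
        if title_key:
            seen_titles.add(title_key)
        unique.append(article)
    return unique
```

Because only the first occurrence is kept, the order in which articles arrive decides which copy of a duplicate survives.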
Success Criteria
- ✅ Agent versioning is functional (agent reads active version, includes in outputs)
- ✅ Agent registration is functional (agent registers type and instance, reports heartbeat)
- ✅ Collects 10+ real news articles per company per day (cumulative across all collection runs)
- ✅ Data is stored correctly in DataSource table via Agent Data API with proper ticker associations
- ✅ All outputs include agentVersion field
- ✅ Deduplication removes obvious duplicates (same URL or very similar titles)
- ✅ Newsletter content uses real collected data (not mock data)
- ✅ System handles API failures gracefully (retries, continues with partial data)
- ✅ Orchestrator can trigger data collection on schedule (every 2-4 hours)
- ✅ Orchestrator can trigger immediate data collection if data is stale during newsletter generation
- ✅ Job execution is tracked in AgentJobExecution table
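As an illustration of the storage-related criteria, the sketch below shows how a parsed NewsAPI article might be turned into a record carrying the fields named in this milestone (tickerId, agentVersion, title, url, snippet, publishedAt, collectedAt). The function name to_record and the exact payload shape expected by POST /api/data-collection/ are assumptions. The input keys (title, url, description, publishedAt) are those NewsAPI returns for each article.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def to_record(
    article: Dict[str, Any],
    ticker_id: int,
    agent_version: str,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Map one NewsAPI article to a hypothetical Agent Data API record."""
    collected_at = now or datetime.now(timezone.utc)
    return {
        "tickerId": ticker_id,
        "agentVersion": agent_version,  # included in every output
        "title": article["title"],
        "url": article["url"],
        # NewsAPI may return null for description; store an empty snippet.
        "snippet": article.get("description") or "",
        "publishedAt": article["publishedAt"],
        "collectedAt": collected_at.isoformat(),  # used for freshness checks
    }
```

Passing the collection time in as a parameter keeps the function deterministic in tests, while the default uses the current UTC time.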
Next Steps
After this milestone, newsletters use real data from one source. Milestone 4 will add basic analysis to provide insights beyond just news summaries.