MediaPulse
Project Planning

Milestone 3 - Basic Data Collection

Summary

Replace mock data with real data collection from a single, simple source.

Timeline

Weeks 7-8

Goal

Implement real data collection from one reliable source (NewsAPI) to replace hardcoded data.

Deliverables

Data Collection Agent (Enhanced)

  • Agent Versioning:
    • Reads active version from AgentVersionDeployment table during initialization
    • Includes agentVersion field in all outputs
    • Version information stored in AgentVersion table
  • Agent Registration:
    • Registers agent type metadata via Agent Registry API (POST /api/registry/register/)
    • Registers instance via Agent Registry API (POST /api/register/) when spawned by orchestrator
    • Reports heartbeat via Agent Registry API (POST /api/heartbeat/) with current load and status
    • Updates capacity and load information in real time
  • Orchestrator-Triggered Execution:
    • Runs on a schedule created by an admin (every 2-4 hours by default, configurable per ticker)
    • Orchestrator invokes the agent's HTTP endpoint with job parameters
    • Can also be triggered by the orchestrator if data is stale during newsletter generation
    • Uses hardcoded query mapping (ticker → company name) stored in code for this milestone
  • NewsAPI integration:
    • API client setup with authentication
    • Query string construction from ticker (e.g., "AAPL" → "Apple Inc" → NewsAPI search query)
    • Article fetching and parsing (title, URL, description, published date)
    • Rate-limit handling (respects NewsAPI rate limits)
  • ✅ Data storage via Agent Data API:
    • Writes collected data via POST /api/data-collection/ endpoint
    • News articles with metadata (title, url, snippet/description, publishedAt, collectedAt)
    • Timestamps for freshness tracking
    • Links to tickerId for association
    • Includes agentVersion in all outputs
  • ✅ Basic deduplication:
    • URL-based deduplication (exact match)
    • Title similarity (simple string matching, case-insensitive)
  • ✅ Error handling:
    • API failure handling
    • Retry logic with exponential backoff
    • Graceful degradation (continues with partial data if some API requests fail)
    • Updates AgentJobExecution table with status and errors
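
The retry and graceful-degradation bullets above could be sketched roughly as follows, in Python. This is a minimal illustration, not the project's implementation: `retry_with_backoff`, `collect_all`, the attempt count, and the base delay are all hypothetical names and values, and a real agent would also write the outcome to AgentJobExecution.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 4,      # assumed value, not taken from the plan
    base_delay: float = 1.0,    # seconds; assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); after each failure wait base_delay * 2**attempt, then retry.

    The last error is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


def collect_all(fetchers: dict[str, Callable[[], list]], **retry_kwargs):
    """Run one fetcher per ticker; keep what succeeded and record the rest.

    Returns (results, errors) so the caller can store partial data and
    report the failures separately.
    """
    results: dict[str, list] = {}
    errors: dict[str, str] = {}
    for ticker, fetch in fetchers.items():
        try:
            results[ticker] = retry_with_backoff(fetch, **retry_kwargs)
        except Exception as exc:
            errors[ticker] = str(exc)
    return results, errors
```

With the defaults shown, a fetcher that keeps failing is retried after waits of 1, 2, and 4 seconds before its error is recorded. Passing `sleep` as a parameter keeps the helper testable without real delays.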

Query Strategy (Minimal - Hardcoded)

  • ✅ Simple query mapping (hardcoded in code):
    • Ticker symbol → company name mapping (e.g., "AAPL" → "Apple Inc")
    • Company name → NewsAPI search query (e.g., "Apple Inc" → "Apple Inc OR Apple")
    • Mapping stored in code/configuration, not in the database (database storage of queries will be added in later milestones)
    • No entity discovery yet
    • No query optimization yet
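
As an illustration of the hardcoded mapping described above, the ticker → company name → query steps might look like the Python sketch below. The names `TICKER_TO_COMPANY`, `COMPANY_ALIASES`, and `build_query` are hypothetical, and the alias table is an assumption about how the "Apple Inc OR Apple" form could be produced.

```python
# Hardcoded for this milestone; only the AAPL entry comes from the plan.
TICKER_TO_COMPANY: dict[str, str] = {
    "AAPL": "Apple Inc",
}

# Hypothetical short names that are OR-ed with the full company name.
COMPANY_ALIASES: dict[str, list[str]] = {
    "Apple Inc": ["Apple"],
}


def build_query(ticker: str) -> str:
    """Map a ticker to a search query string, e.g. 'Apple Inc OR Apple'."""
    symbol = ticker.upper()
    if symbol not in TICKER_TO_COMPANY:
        raise KeyError(f"No hardcoded company mapping for ticker {symbol!r}")
    company = TICKER_TO_COMPANY[symbol]
    terms = [company, *COMPANY_ALIASES.get(company, [])]
    return " OR ".join(terms)
```

Keeping both tables as plain dictionaries would make it straightforward to move them into the database in a later milestone without changing `build_query` callers.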

Limitations (Acceptable for This Milestone)

  • Only one news source (NewsAPI)
  • No social media data
  • No earnings or SEC filings
  • Simple query generation (ticker → company name mapping only)
  • Queries hardcoded in code/configuration, not stored in the database
  • Basic deduplication only (URL exact match and simple title similarity)
  • No relevance scoring (all collected articles stored regardless of relevance)
  • No entity discovery or relationship expansion (searches only for company name)

Success Criteria

  • ✅ Agent versioning is functional (agent reads the active version and includes it in outputs)
  • ✅ Agent registration is functional (agent registers type and instance, reports heartbeat)
  • ✅ Collects 10+ real news articles per company per day (cumulative across all collection runs)
  • ✅ Data is stored correctly in DataSource table via Agent Data API with proper ticker associations
  • ✅ All outputs include agentVersion field
  • ✅ Deduplication removes obvious duplicates (same URL or very similar titles)
  • ✅ Newsletter content uses real collected data (not mock data)
  • ✅ System handles API failures gracefully (retries, continues with partial data)
  • ✅ Orchestrator can trigger data collection on schedule (every 2-4 hours)
  • ✅ Orchestrator can trigger immediate data collection if data is stale during newsletter generation
  • ✅ Job execution is tracked in AgentJobExecution table
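
The deduplication criterion above (same URL or very similar titles) could be checked with a sketch like the following, in Python. It is an assumption-laden illustration: the `Article` type and function names are hypothetical, and "very similar titles" is interpreted here as titles that are equal after lowercasing and collapsing whitespace, which is one simple reading of case-insensitive string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical minimal article record; real records carry more metadata."""
    title: str
    url: str


def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def deduplicate(articles: list[Article]) -> list[Article]:
    """Keep the first article seen for each exact URL and each normalized title."""
    seen_urls: set[str] = set()
    seen_titles: set[str] = set()
    unique: list[Article] = []
    for article in articles:
        title_key = normalize_title(article.title)
        if article.url in seen_urls or title_key in seen_titles:
            continue  # duplicate by URL or by title
        seen_urls.add(article.url)
        seen_titles.add(title_key)
        unique.append(article)
    return unique
```

This keeps the first occurrence and preserves input order, so the result depends on the order in which articles were collected.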

Next Steps

After this milestone, newsletters use real data from one source. Milestone 4 will add basic analysis to provide insights beyond just news summaries.