Health Check Agent

The Health Check Agent monitors service availability, data quality, and performance metrics, and can raise alerts through configurable channels.

📋 Overview

Property	Value
Module	`src.agents.monitoring.health_check_agent`
Class	`HealthCheckAgent`
Author	Nguyen Dinh Anh Tuan
Version	2.0.0

🎯 Purpose

The Health Check Agent provides:

Service availability monitoring (HTTP, TCP, Cypher, SPARQL, Kafka)
Data quality validation (thresholds, counts, age)
Performance metrics collection (response times, latency)
Alerting integration (webhook, email, Slack)
Prometheus metrics export

📊 Monitoring Capabilities

Service Health Checks

Check Type	Protocol	Description
HTTP	REST	API endpoint availability
TCP	Socket	Database connection
Cypher	Neo4j	Graph database health
SPARQL	Fuseki	Triplestore availability
Kafka	Broker	Message queue status
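
As an illustration of the simplest check type in the table, a TCP check can be sketched with the standard library alone. The function name `check_tcp` and the returned keys are assumptions chosen to mirror the `status` and `response_time_ms` fields used elsewhere on this page; this is not the agent's actual implementation.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout: float = 5.0) -> dict:
    """Hypothetical TCP checker: open a connection and report status and latency."""
    started = time.perf_counter()
    try:
        # create_connection resolves the host and connects within the timeout
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        # Refused connections, timeouts, and DNS failures are all OSError subclasses
        return {"status": "unhealthy", "error": str(exc)}
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {"status": "healthy", "response_time_ms": round(elapsed_ms, 2)}
```

A call such as `check_tcp("localhost", 27017, timeout=5)` would correspond to the `mongodb` entry in the configuration section.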

Data Quality Metrics

Metric	Description
Data freshness	Age of latest data
Record count	Expected vs actual counts
Validation rate	Percentage passing validation
Error rate	Failed operations
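
The four metrics above reduce to simple arithmetic. The sketch below shows one way they might be computed; the function name, parameter names, and output keys are hypothetical, and the agent may define these metrics differently.

```python
from datetime import datetime, timezone


def data_quality_report(latest_timestamp, expected_count, actual_count,
                        valid_count, failed_operations, total_operations,
                        now=None):
    """Hypothetical helper computing the four data quality metrics."""
    now = now or datetime.now(timezone.utc)
    return {
        # Data freshness: age of the latest record, in seconds
        "freshness_seconds": (now - latest_timestamp).total_seconds(),
        # Record count: actual count as a fraction of the expected count
        "count_ratio": actual_count / expected_count if expected_count else 0.0,
        # Validation rate: percentage of records that passed validation
        "validation_rate_pct": (100.0 * valid_count / actual_count
                                if actual_count else 0.0),
        # Error rate: percentage of operations that failed
        "error_rate_pct": (100.0 * failed_operations / total_operations
                           if total_operations else 0.0),
    }
```

Guarding each division against a zero denominator keeps the report usable before any data has arrived.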

🔧 Architecture

┌─────────────────────────────────────────────┐
│             Health Check Agent              │
├─────────────────────────────────────────────┤
│                                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐ │
│  │  HTTP   │  │  TCP    │  │   Cypher    │ │
│  │ Checker │  │ Checker │  │   Checker   │ │
│  └────┬────┘  └────┬────┘  └──────┬──────┘ │
│       │            │              │         │
│       └────────────┼──────────────┘         │
│                    ▼                        │
│           ┌───────────────┐                 │
│           │ Health Status │                 │
│           │   Aggregator  │                 │
│           └───────┬───────┘                 │
│                   │                         │
│       ┌───────────┼───────────┐            │
│       ▼           ▼           ▼            │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐       │
│  │  Alert  │ │Prometheus│ │Dashboard│       │
│  │ Manager │ │ Export  │ │   API   │       │
│  └─────────┘ └─────────┘ └─────────┘       │
└─────────────────────────────────────────────┘
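
The aggregation step in the middle of the diagram can be sketched as follows. The keys `overall_health` and `services` match the usage example on this page; the function name, the `unhealthy_services` key, and the intermediate `degraded` state are assumptions made for illustration.

```python
def aggregate_health(service_results: dict) -> dict:
    """Hypothetical aggregator: fold per-service results into one overall status."""
    unhealthy = [name for name, result in service_results.items()
                 if result.get("status") != "healthy"]
    if not service_results:
        overall = "unknown"      # nothing has been checked yet
    elif not unhealthy:
        overall = "healthy"      # every checker reported healthy
    elif len(unhealthy) == len(service_results):
        overall = "unhealthy"    # every checker reported a failure
    else:
        overall = "degraded"     # assumed label for a partial outage
    return {
        "overall_health": overall,
        "services": service_results,
        "unhealthy_services": sorted(unhealthy),
    }
```

The alert manager, the Prometheus exporter, and the dashboard API could then all read from this single aggregated result rather than querying each checker separately.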

🚀 Usage

Basic Health Check

from src.agents.monitoring.health_check_agent import HealthCheckAgent

agent = HealthCheckAgent()

# Start continuous monitoring
agent.start_monitoring()

# Get current health status
status = agent.get_health_status()
print(f"Overall Health: {status['overall_health']}")
print(f"Services: {status['services']}")

Check Specific Service

# Check single service
neo4j_health = agent.check_service("neo4j")
print(f"Neo4j Status: {neo4j_health['status']}")
print(f"Response Time: {neo4j_health['response_time_ms']}ms")

# Check database connection
mongo_health = agent.check_database("mongodb")

Custom Health Checks

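import httpx  # assumption: any async HTTP client with an awaitable get() works here

# Register custom health check
@agent.register_check("my_service")
async def check_my_service():
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            response = await client.get("http://my-service/health")
        # Report healthy only when the endpoint answers with HTTP 200
        status = "healthy" if response.status_code == 200 else "unhealthy"
        return {"status": status, "code": response.status_code}
    except Exception as e:
        # Connection errors and timeouts count as unhealthy
        return {"status": "unhealthy", "error": str(e)}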

⚙️ Configuration

# config/health_check_config.yaml
health_check:
  enabled: true
  check_interval_seconds: 30
  
  services:
    backend:
      url: http://localhost:8001/health
      timeout: 5
      expected_status: 200
    
    neo4j:
      type: cypher
      url: bolt://localhost:7687
      query: "RETURN 1"
      timeout: 10
    
    mongodb:
      type: tcp
      host: localhost
      port: 27017
      timeout: 5
    
    redis:
      type: tcp
      host: localhost
      port: 6379
      timeout: 3
    
    stellio:
      url: http://localhost:8080/health
      timeout: 10
    
    fuseki:
      type: sparql
      url: http://localhost:3030/traffic/sparql
      query: "ASK { ?s ?p ?o }"
      timeout: 10

  # Alert configuration
  alerts:
    enabled: true
    channels:
      - type: webhook
        url: http://alert-service/webhook
      - type: email
        recipients:
          - admin@example.com
    
    rules:
      - name: service_down
        condition: status == "unhealthy"
        severity: critical
        cooldown_minutes: 5
      
      - name: high_latency
        condition: response_time_ms > 5000
        severity: warning
        cooldown_minutes: 15

  # Prometheus metrics
  prometheus:
    enabled: true
    port: 9090
    path: /metrics
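
The `cooldown_minutes` setting on each alert rule suggests that repeated alerts are suppressed for a period after a rule fires. The configuration does not spell out the exact semantics, so the following is only one plausible reading, with a hypothetical `CooldownTracker` class that tracks each rule and service pair separately.

```python
from datetime import datetime, timedelta


class CooldownTracker:
    """Hypothetical helper: suppress repeat alerts while a rule is cooling down."""

    def __init__(self):
        # Maps (rule_name, service) to the time the alert last fired
        self._last_fired = {}

    def should_fire(self, rule_name: str, service: str,
                    cooldown_minutes: int, now: datetime) -> bool:
        key = (rule_name, service)
        last = self._last_fired.get(key)
        if last is not None and now - last < timedelta(minutes=cooldown_minutes):
            return False  # still inside the cooldown window
        self._last_fired[key] = now
        return True
```

With `cooldown_minutes: 5`, a `service_down` alert for the same service would be sent at most once every five minutes under this interpretation. Suppressed alerts do not restart the window.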

📈 Prometheus Metrics

Metric	Type	Description
`service_health_status`	Gauge	Binary health (0=down, 1=up)
`service_response_time_seconds`	Histogram	Health check latency
`data_quality_score`	Gauge	Quality metric (0-100)
`health_check_total`	Counter	Total checks performed
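
To show what a scrape of the metrics endpoint might return, the sketch below renders two of these metrics in the Prometheus text exposition format using only the standard library. The `service` label and the `render_metrics` function are assumptions; a real deployment would more likely rely on a client library such as `prometheus_client` than on hand-built strings.

```python
def render_metrics(service_up: dict, checks_total: int) -> str:
    """Hypothetical renderer for the Prometheus text exposition format."""
    lines = [
        "# HELP service_health_status Binary health (0=down, 1=up)",
        "# TYPE service_health_status gauge",
    ]
    for name in sorted(service_up):
        value = 1 if service_up[name] else 0
        # One sample per service, distinguished by an assumed "service" label
        lines.append(f'service_health_status{{service="{name}"}} {value}')
    lines += [
        "# HELP health_check_total Total checks performed",
        "# TYPE health_check_total counter",
        f"health_check_total {checks_total}",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_metrics({"neo4j": True, "redis": False}, 42)` yields one gauge sample per service followed by the counter.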

Grafana Dashboard

{
  "panels": [
    {
      "title": "Service Health",
      "type": "stat",
      "targets": [
        {
          "expr": "service_health_status"
        }
      ]
    }
  ]
}

🛡️ Alerting

Alert Channels

# Configure alert channels
agent.configure_alerts({
    "webhook": {
        "url": "https://hooks.slack.com/services/xxx",
        "headers": {"Content-Type": "application/json"}
    },
    "email": {
        "smtp_server": "smtp.example.com",
        "recipients": ["ops@example.com"]
    }
})

Alert Rules

# Custom alert rule
agent.add_alert_rule(
    name="database_slow",
    condition=lambda status: status.get("response_time_ms", 0) > 1000,
    severity="warning",
    message="Database response time exceeded 1 second"
)
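
How callable conditions like the one above might be evaluated against a service status can be sketched as follows. `AlertRule` and `evaluate_rules` are hypothetical names that mirror the arguments of `add_alert_rule`; the agent's internal evaluation logic is not documented here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AlertRule:
    """Hypothetical container mirroring the add_alert_rule arguments."""
    name: str
    condition: Callable[[dict], bool]
    severity: str
    message: str


def evaluate_rules(rules: list, status: dict) -> list:
    """Return one alert dict for every rule whose condition matches the status."""
    alerts = []
    for rule in rules:
        try:
            triggered = bool(rule.condition(status))
        except Exception:
            # A faulty condition should not bring down the monitoring loop
            triggered = False
        if triggered:
            alerts.append({"rule": rule.name,
                           "severity": rule.severity,
                           "message": rule.message})
    return alerts
```

Using `status.get("response_time_ms", 0)` in the condition, as in the `database_slow` rule, keeps the rule safe when a failed check returns no timing data.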

📖 Related Documentation

Performance Monitor - Performance metrics
Data Quality Validator - Data validation
DevOps Guide - Monitoring setup

See the complete agents reference for all available agents.
