Health Check Agent

The Health Check Agent monitors service availability, data quality, and performance metrics, and sends alerts when a check fails or a threshold is exceeded.

📋 Overview

| Property | Value |
|----------|-------|
| Module | src.agents.monitoring.health_check_agent |
| Class | HealthCheckAgent |
| Author | Nguyen Dinh Anh Tuan |
| Version | 2.0.0 |

🎯 Purpose

The Health Check Agent provides:

  • Service availability monitoring (HTTP, TCP, Cypher, SPARQL, Kafka)
  • Data quality validation (thresholds, counts, age)
  • Performance metrics collection (response times, latency)
  • Alerting integration (webhook, email, Slack)
  • Prometheus metrics export

📊 Monitoring Capabilities

Service Health Checks

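| Check Type | Protocol | Description |
|------------|----------|-------------|
| HTTP | REST | API endpoint availability |
| TCP | Socket | Database connection |
| Cypher | Neo4j | Graph database health |
| SPARQL | Fuseki | Triplestore availability |
| Kafka | Broker | Message queue status |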
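
As an illustration of the TCP row above, here is a minimal sketch of a socket-level check using only the standard library. The function name `check_tcp` and the shape of the returned dictionary are assumptions made for this example (the `status` and `response_time_ms` keys mirror the fields used in the Usage section); the agent's real checker may work differently.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout: float = 5.0) -> dict:
    """Hypothetical helper: open a TCP connection and report the outcome."""
    started = time.perf_counter()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            status, error = "healthy", None
    except OSError as exc:
        status, error = "unhealthy", str(exc)
    elapsed_ms = (time.perf_counter() - started) * 1000
    return {
        "status": status,
        "response_time_ms": round(elapsed_ms, 2),
        "error": error,
    }
```

Calling `check_tcp("localhost", 27017, timeout=5)` would probe the MongoDB port from the configuration section. A successful connect only proves that the port accepts connections, not that the database can serve queries, which is why Neo4j and Fuseki get protocol-level checks (Cypher, SPARQL) instead.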

Data Quality Metrics

| Metric | Description |
|--------|-------------|
| Data freshness | Age of latest data |
| Record count | Expected vs actual counts |
| Validation rate | Percentage passing validation |
| Error rate | Failed operations |
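
The four metrics above reduce to simple arithmetic. The sketch below shows one way they could be computed. The function name, parameter names, and the choice to express rates as percentages of the actual record count are assumptions for illustration, not the agent's documented behavior.

```python
from datetime import datetime, timezone
from typing import Optional


def data_quality(latest: datetime, expected: int, actual: int,
                 valid: int, errors: int,
                 now: Optional[datetime] = None) -> dict:
    """Hypothetical helper: derive the four data quality metrics."""
    now = now or datetime.now(timezone.utc)
    return {
        # Data freshness: age of the newest record, in seconds.
        "freshness_seconds": (now - latest).total_seconds(),
        # Record count: actual relative to expected (1.0 means on target).
        "count_ratio": actual / expected if expected else 0.0,
        # Validation rate: percentage of records that passed validation.
        "validation_rate": 100.0 * valid / actual if actual else 0.0,
        # Error rate: percentage of operations that failed.
        "error_rate": 100.0 * errors / actual if actual else 0.0,
    }
```

For example, with 190 of 200 expected records present, 171 of them valid and 19 failed, this gives a count ratio of 0.95, a validation rate of 90%, and an error rate of 10%.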

🔧 Architecture

┌─────────────────────────────────────────────┐
│             Health Check Agent              │
├─────────────────────────────────────────────┤
│                                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐  │
│  │  HTTP   │  │   TCP   │  │   Cypher    │  │
│  │ Checker │  │ Checker │  │   Checker   │  │
│  └────┬────┘  └────┬────┘  └──────┬──────┘  │
│       │            │              │         │
│       └────────────┼──────────────┘         │
│                    ▼                        │
│            ┌───────────────┐                │
│            │ Health Status │                │
│            │  Aggregator   │                │
│            └───────┬───────┘                │
│                    │                        │
│        ┌───────────┼───────────┐            │
│        ▼           ▼           ▼            │
│   ┌─────────┐ ┌──────────┐ ┌─────────┐      │
│   │  Alert  │ │Prometheus│ │Dashboard│      │
│   │ Manager │ │  Export  │ │   API   │      │
│   └─────────┘ └──────────┘ └─────────┘      │
└─────────────────────────────────────────────┘
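
The aggregation step in the middle of the diagram can be sketched as follows. This is an assumption about how per-service results might be combined, not the agent's actual logic: the `degraded` and `unknown` states and the `unhealthy` list are invented for the example, while the `overall_health` and `services` keys match the status dictionary shown in the Usage section.

```python
def aggregate_health(results: dict) -> dict:
    """Hypothetical aggregator: fold per-service results into one status."""
    unhealthy = sorted(
        name for name, result in results.items()
        if result.get("status") != "healthy"
    )
    if not results:
        overall = "unknown"      # nothing has been checked yet
    elif not unhealthy:
        overall = "healthy"      # every service passed
    elif len(unhealthy) == len(results):
        overall = "unhealthy"    # every service failed
    else:
        overall = "degraded"     # partial outage
    return {
        "overall_health": overall,
        "services": results,
        "unhealthy": unhealthy,
    }
```

A three-state result lets the alert manager distinguish a partial outage from a total one, which a plain boolean cannot express.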

🚀 Usage

Basic Health Check

from src.agents.monitoring.health_check_agent import HealthCheckAgent

agent = HealthCheckAgent()

# Start continuous monitoring
agent.start_monitoring()

# Get current health status
status = agent.get_health_status()
print(f"Overall Health: {status['overall_health']}")
print(f"Services: {status['services']}")

Check Specific Service

# Check single service
neo4j_health = agent.check_service("neo4j")
print(f"Neo4j Status: {neo4j_health['status']}")
print(f"Response Time: {neo4j_health['response_time_ms']}ms")

# Check database connection
mongo_health = agent.check_database("mongodb")

Custom Health Checks

# Register custom health check.
# http_client is assumed to be an async HTTP client created elsewhere
# (for example an httpx.AsyncClient instance).
@agent.register_check("my_service")
async def check_my_service():
    try:
        response = await http_client.get("http://my-service/health")
        return {"status": "healthy", "code": response.status_code}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
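
To make the decorator pattern above concrete, here is a minimal sketch of a registry that stores async check functions by name and runs them concurrently with a timeout. The `CheckRegistry` class and its `run_all` method are hypothetical and exist only for this example; only the `register_check` name comes from the documented API, and its real implementation may differ.

```python
import asyncio


class CheckRegistry:
    """Hypothetical stand-in for the agent's custom check registry."""

    def __init__(self):
        self._checks = {}

    def register_check(self, name: str):
        # Decorator factory: remember the coroutine function under `name`.
        def decorator(func):
            self._checks[name] = func
            return func
        return decorator

    async def run_all(self, timeout: float = 5.0) -> dict:
        async def run_one(name, func):
            try:
                return name, await asyncio.wait_for(func(), timeout)
            except Exception as exc:
                # A raised exception or a timeout counts as unhealthy.
                error = f"{type(exc).__name__}: {exc}"
                return name, {"status": "unhealthy", "error": error}

        pairs = await asyncio.gather(
            *(run_one(name, func) for name, func in self._checks.items())
        )
        return dict(pairs)
```

Running the checks through `asyncio.gather` means one slow service does not delay the others, and the per-check timeout bounds the total run time.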

⚙️ Configuration

# config/health_check_config.yaml
health_check:
  enabled: true
  check_interval_seconds: 30

  services:
    backend:
      url: http://localhost:8001/health
      timeout: 5
      expected_status: 200

    neo4j:
      type: cypher
      url: bolt://localhost:7687
      query: "RETURN 1"
      timeout: 10

    mongodb:
      type: tcp
      host: localhost
      port: 27017
      timeout: 5

    redis:
      type: tcp
      host: localhost
      port: 6379
      timeout: 3

    stellio:
      url: http://localhost:8080/health
      timeout: 10

    fuseki:
      type: sparql
      url: http://localhost:3030/traffic/sparql
      query: "ASK { ?s ?p ?o }"
      timeout: 10

  # Alert configuration
  alerts:
    enabled: true
    channels:
      - type: webhook
        url: http://alert-service/webhook
      - type: email
        recipients:
          - admin@example.com

    rules:
      - name: service_down
        condition: status == "unhealthy"
        severity: critical
        cooldown_minutes: 5

      - name: high_latency
        condition: response_time_ms > 5000
        severity: warning
        cooldown_minutes: 15

  # Prometheus metrics
  prometheus:
    enabled: true
    port: 9090
    path: /metrics
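
The `cooldown_minutes` setting on each rule implies that a firing alert is suppressed for a while before it can fire again. A minimal sketch of that behavior, assuming cooldowns are tracked per rule and per service, is shown below. The `AlertCooldown` class is hypothetical and the agent may track cooldowns differently.

```python
import time


class AlertCooldown:
    """Hypothetical tracker: suppress repeat alerts inside a cooldown window."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so tests can control time
        self._last_fired = {}      # (rule, service) -> time of last alert

    def should_fire(self, rule: str, service: str, cooldown_minutes: float) -> bool:
        key = (rule, service)
        now = self._clock()
        last = self._last_fired.get(key)
        if last is not None and now - last < cooldown_minutes * 60:
            return False           # still cooling down, stay quiet
        self._last_fired[key] = now
        return True
```

With the `service_down` rule above (5 minute cooldown), a service that stays unhealthy across ten consecutive 30-second checks would produce one alert rather than ten.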

📈 Prometheus Metrics

| Metric | Type | Description |
|--------|------|-------------|
| service_health_status | Gauge | Binary health (0=down, 1=up) |
| service_response_time_seconds | Histogram | Health check latency |
| data_quality_score | Gauge | Quality metric (0-100) |
| health_check_total | Counter | Total checks performed |
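
For orientation, the sketch below renders a gauge such as `service_health_status` in the Prometheus text exposition format using only the standard library. The `service` label and the helper names are assumptions for this example. A real deployment would more likely rely on a Prometheus client library than on hand-built strings.

```python
def _escape_label(value) -> str:
    # Label values must escape backslash, double quote, and newline.
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples) -> str:
    """Hypothetical helper: samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{_escape_label(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

Rendering one sample per service lets a Grafana panel filter or group by the label value.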

Grafana Dashboard

{
  "panels": [
    {
      "title": "Service Health",
      "type": "stat",
      "targets": [
        {
          "expr": "service_health_status"
        }
      ]
    }
  ]
}

🛡️ Alerting

Alert Channels

# Configure alert channels
agent.configure_alerts({
    "webhook": {
        "url": "https://hooks.slack.com/services/xxx",
        "headers": {"Content-Type": "application/json"}
    },
    "email": {
        "smtp_server": "smtp.example.com",
        "recipients": ["ops@example.com"]
    }
})

Alert Rules

# Custom alert rule
agent.add_alert_rule(
    name="database_slow",
    condition=lambda status: status.get("response_time_ms", 0) > 1000,
    severity="warning",
    message="Database response time exceeded 1 second"
)
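
A rule like the one above is a predicate over a status dictionary. The following sketch shows how such rules might be evaluated against a single check result. The `AlertRule` dataclass and the `evaluate_rules` function are hypothetical names for this example, and treating a condition that raises as "not matched" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    """Hypothetical container mirroring the add_alert_rule arguments."""
    name: str
    condition: Callable[[dict], bool]
    severity: str
    message: str


def evaluate_rules(rules: List[AlertRule], status: dict) -> List[dict]:
    """Return one alert dict for every rule whose condition matches."""
    fired = []
    for rule in rules:
        try:
            matched = bool(rule.condition(status))
        except Exception:
            matched = False  # assumption: a broken condition never fires
        if matched:
            fired.append({
                "rule": rule.name,
                "severity": rule.severity,
                "message": rule.message,
            })
    return fired
```

Using `status.get("response_time_ms", 0)` in the condition, as the rule above does, keeps the predicate safe when a failed check returns no timing data.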

See the complete agents reference for all available agents.