Health Check Agent

The Health Check Agent monitors service availability, data quality, and performance metrics, and can raise alerts based on the results.

📋 Overview

Property  Value
Module    src.agents.monitoring.health_check_agent
Class     HealthCheckAgent
Author    Nguyen Dinh Anh Tuan
Version   2.0.0

🎯 Purpose​

The Health Check Agent provides:

  • Service availability monitoring (HTTP, TCP, Cypher, SPARQL, Kafka)
  • Data quality validation (thresholds, counts, age)
  • Performance metrics collection (response times, latency)
  • Alerting integration (webhook, email, Slack)
  • Prometheus metrics export

📊 Monitoring Capabilities

Service Health Checks

Check Type  Protocol  Description
HTTP        REST      API endpoint availability
TCP         Socket    Database connection
Cypher      Neo4j     Graph database health
SPARQL      Fuseki    Triplestore availability
Kafka       Broker    Message queue status
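
As a rough illustration of what a TCP-type check from the table might involve, the sketch below opens a socket and times the connection. The function name `tcp_check` is hypothetical, and the returned keys are modeled on the `status` and `response_time_ms` fields used in the usage examples on this page; the agent's actual implementation may differ.

```python
import socket
import time


def tcp_check(host: str, port: int, timeout: float = 5.0) -> dict:
    """Hypothetical TCP checker: healthy if a connection opens within the timeout."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:  # refused, unreachable, timed out, or DNS failure
        return {"status": "unhealthy", "error": str(exc)}
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"status": "healthy", "response_time_ms": round(elapsed_ms, 2)}
```

A successful handshake only shows that the port accepts connections. Protocol-level checks such as the Cypher or SPARQL types go further by running a query.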

Data Quality Metrics

Metric           Description
Data freshness   Age of latest data
Record count     Expected vs actual counts
Validation rate  Percentage passing validation
Error rate       Failed operations
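
To make the four metrics concrete, here is a minimal sketch of how they could be computed from raw counts. The helper `data_quality_metrics`, its parameters, and the output keys are all assumptions for illustration and are not taken from the agent's code.

```python
from datetime import datetime, timedelta, timezone


def data_quality_metrics(latest_timestamp, expected_count, actual_count,
                         valid_count, failed_ops, total_ops, now=None):
    """Compute the four metrics from the table (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        # Data freshness: age of the latest record, in seconds
        "data_age_seconds": (now - latest_timestamp).total_seconds(),
        # Record count: shortfall relative to the expected count
        "missing_records": max(expected_count - actual_count, 0),
        # Validation rate: percentage of records passing validation
        "validation_rate": 100.0 * valid_count / actual_count if actual_count else 0.0,
        # Error rate: percentage of operations that failed
        "error_rate": 100.0 * failed_ops / total_ops if total_ops else 0.0,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
metrics = data_quality_metrics(
    latest_timestamp=now - timedelta(minutes=5),
    expected_count=1000, actual_count=950,
    valid_count=931, failed_ops=3, total_ops=200, now=now,
)
print(metrics)
```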

🔧 Architecture

┌─────────────────────────────────────────────┐
│             Health Check Agent              │
├─────────────────────────────────────────────┤
│                                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐  │
│  │  HTTP   │  │   TCP   │  │   Cypher    │  │
│  │ Checker │  │ Checker │  │   Checker   │  │
│  └────┬────┘  └────┬────┘  └──────┬──────┘  │
│       │            │              │         │
│       └────────────┼──────────────┘         │
│                    ▼                        │
│            ┌───────────────┐                │
│            │ Health Status │                │
│            │  Aggregator   │                │
│            └───────┬───────┘                │
│                    │                        │
│        ┌───────────┼───────────┐            │
│        ▼           ▼           ▼            │
│   ┌─────────┐ ┌──────────┐ ┌─────────┐      │
│   │  Alert  │ │Prometheus│ │Dashboard│      │
│   │ Manager │ │  Export  │ │   API   │      │
│   └─────────┘ └──────────┘ └─────────┘      │
└─────────────────────────────────────────────┘
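
The aggregation step in the middle of the diagram can be sketched as a function that folds per-service results into one overall status. The keys `overall_health` and `services` match the usage examples on this page; the function name `aggregate_health` and the `degraded` and `unknown` states are assumptions.

```python
from typing import Dict


def aggregate_health(results: Dict[str, dict]) -> dict:
    """Combine per-service check results into one overall status (sketch)."""
    unhealthy = sorted(
        name for name, result in results.items()
        if result.get("status") != "healthy"
    )
    if not results:
        overall = "unknown"      # nothing was checked
    elif not unhealthy:
        overall = "healthy"      # every service passed
    elif len(unhealthy) < len(results):
        overall = "degraded"     # some, but not all, services failed
    else:
        overall = "unhealthy"    # every service failed
    return {"overall_health": overall, "services": results, "unhealthy": unhealthy}


summary = aggregate_health({
    "neo4j": {"status": "healthy", "response_time_ms": 12.4},
    "redis": {"status": "unhealthy", "error": "connection refused"},
})
print(summary["overall_health"])
```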

🚀 Usage

Basic Health Check

from src.agents.monitoring.health_check_agent import HealthCheckAgent

agent = HealthCheckAgent()

# Start continuous monitoring
agent.start_monitoring()

# Get current health status
status = agent.get_health_status()
print(f"Overall Health: {status['overall_health']}")
print(f"Services: {status['services']}")

Check Specific Service

# Check single service
neo4j_health = agent.check_service("neo4j")
print(f"Neo4j Status: {neo4j_health['status']}")
print(f"Response Time: {neo4j_health['response_time_ms']}ms")

# Check database connection
mongo_health = agent.check_database("mongodb")

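Custom Health Checks

import httpx  # assumption: any async HTTP client with an awaitable get() works here

# Register a custom health check
@agent.register_check("my_service")
async def check_my_service():
    try:
        async with httpx.AsyncClient(timeout=5) as http_client:
            response = await http_client.get("http://my-service/health")
        # Report healthy only for a 2xx response
        ok = 200 <= response.status_code < 300
        return {"status": "healthy" if ok else "unhealthy", "code": response.status_code}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}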
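
For readers curious how a decorator such as `register_check` could work, the following is a minimal sketch of a registry that stores async checks by name and runs them concurrently. `CheckRegistry` and `run_all` are hypothetical stand-ins, not the agent's real classes or methods.

```python
import asyncio
from typing import Awaitable, Callable, Dict

CheckFn = Callable[[], Awaitable[dict]]


class CheckRegistry:
    """Hypothetical stand-in for the agent's check registration."""

    def __init__(self) -> None:
        self._checks: Dict[str, CheckFn] = {}

    def register_check(self, name: str) -> Callable[[CheckFn], CheckFn]:
        def decorator(fn: CheckFn) -> CheckFn:
            self._checks[name] = fn
            return fn  # return the function unchanged so it stays callable
        return decorator

    async def run_all(self, timeout: float = 5.0) -> Dict[str, dict]:
        async def run_one(fn: CheckFn) -> dict:
            try:
                return await asyncio.wait_for(fn(), timeout=timeout)
            except Exception as exc:  # a failing or slow check must not break the others
                return {"status": "unhealthy", "error": repr(exc)}

        names = list(self._checks)
        results = await asyncio.gather(*(run_one(self._checks[n]) for n in names))
        return dict(zip(names, results))


registry = CheckRegistry()


@registry.register_check("always_ok")
async def always_ok() -> dict:
    return {"status": "healthy"}


print(asyncio.run(registry.run_all()))
```

Wrapping each check in its own timeout and exception handler means one misbehaving check is reported as unhealthy instead of stalling or crashing the whole run.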

⚙️ Configuration

# config/health_check_config.yaml
health_check:
  enabled: true
  check_interval_seconds: 30

  services:
    backend:
      url: http://localhost:8001/health
      timeout: 5
      expected_status: 200

    neo4j:
      type: cypher
      url: bolt://localhost:7687
      query: "RETURN 1"
      timeout: 10

    mongodb:
      type: tcp
      host: localhost
      port: 27017
      timeout: 5

    redis:
      type: tcp
      host: localhost
      port: 6379
      timeout: 3

    stellio:
      url: http://localhost:8080/health
      timeout: 10

    fuseki:
      type: sparql
      url: http://localhost:3030/traffic/sparql
      query: "ASK { ?s ?p ?o }"
      timeout: 10

  # Alert configuration
  alerts:
    enabled: true
    channels:
      - type: webhook
        url: http://alert-service/webhook
      - type: email
        recipients:
          - admin@example.com

    rules:
      - name: service_down
        condition: status == "unhealthy"
        severity: critical
        cooldown_minutes: 5

      - name: high_latency
        condition: response_time_ms > 5000
        severity: warning
        cooldown_minutes: 15

  # Prometheus metrics
  prometheus:
    enabled: true
    port: 9090
    path: /metrics
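
The `rules` entries pair a condition string with a cooldown. One way such rules could be evaluated is sketched below: the condition is parsed as `<field> <operator> <literal>` without `eval()`, and a repeat alert for the same rule and service is suppressed until the cooldown has elapsed. `evaluate_condition` and `AlertRules` are hypothetical names, and the parsing scheme is an assumption about the condition syntax.

```python
import operator
import time

_OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
        "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def evaluate_condition(condition: str, status: dict) -> bool:
    """Evaluate a '<field> <op> <literal>' condition without eval()."""
    field, op, literal = condition.split(maxsplit=2)
    if field not in status or op not in _OPS:
        return False
    # Quoted literals compare as strings, everything else as numbers
    expected = literal.strip("\"'") if literal[0] in "\"'" else float(literal)
    try:
        return bool(_OPS[op](status[field], expected))
    except TypeError:  # e.g. comparing a string field against a number
        return False


class AlertRules:
    """Fire matching rules, suppressing repeats during each rule's cooldown."""

    def __init__(self, rules, clock=time.monotonic):
        self.rules = rules
        self.clock = clock
        self._last_fired = {}  # (rule name, service) -> time of last alert

    def check(self, service: str, status: dict) -> list:
        fired = []
        for rule in self.rules:
            if not evaluate_condition(rule["condition"], status):
                continue
            key = (rule["name"], service)
            now = self.clock()
            last = self._last_fired.get(key)
            if last is not None and now - last < rule["cooldown_minutes"] * 60:
                continue  # still cooling down: suppress the repeat
            self._last_fired[key] = now
            fired.append({"rule": rule["name"], "severity": rule["severity"],
                          "service": service})
        return fired


rules = [
    {"name": "service_down", "condition": 'status == "unhealthy"',
     "severity": "critical", "cooldown_minutes": 5},
    {"name": "high_latency", "condition": "response_time_ms > 5000",
     "severity": "warning", "cooldown_minutes": 15},
]
engine = AlertRules(rules)
print(engine.check("neo4j", {"status": "unhealthy", "response_time_ms": 6000}))
```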

📈 Prometheus Metrics

Metric                         Type       Description
service_health_status          Gauge      Binary health (0=down, 1=up)
service_response_time_seconds  Histogram  Health check latency
data_quality_score             Gauge      Quality metric (0-100)
health_check_total             Counter    Total checks performed
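
To show what a scrape of the first metric might look like, this standard-library sketch renders `service_health_status` in the Prometheus text exposition format. The `service` label and the function `render_metrics` are assumptions; a real exporter would more likely rely on a client library than build the text by hand.

```python
def render_metrics(services: dict) -> str:
    """Render one gauge per service in the Prometheus text format (sketch)."""
    lines = [
        "# HELP service_health_status Binary health (0=down, 1=up)",
        "# TYPE service_health_status gauge",
    ]
    for name, result in sorted(services.items()):
        value = 1 if result.get("status") == "healthy" else 0
        # Assumes simple service names; label values containing quotes,
        # backslashes, or newlines would need escaping first.
        lines.append(f'service_health_status{{service="{name}"}} {value}')
    return "\n".join(lines) + "\n"


text = render_metrics({
    "neo4j": {"status": "healthy"},
    "redis": {"status": "unhealthy"},
})
print(text)
```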

Grafana Dashboard

{
  "panels": [
    {
      "title": "Service Health",
      "type": "stat",
      "targets": [
        {
          "expr": "service_health_status"
        }
      ]
    }
  ]
}

🛡️ Alerting

Alert Channels

# Configure alert channels
agent.configure_alerts({
    "webhook": {
        "url": "https://hooks.slack.com/services/xxx",
        "headers": {"Content-Type": "application/json"}
    },
    "email": {
        "smtp_server": "smtp.example.com",
        "recipients": ["ops@example.com"]
    }
})
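
As an illustration of what the webhook channel might send, the sketch below builds a JSON POST request for a single alert without sending it. The helper `build_webhook_request` and the alert fields `severity`, `service`, and `message` are assumptions. The `{"text": ...}` body follows the shape Slack incoming webhooks accept; other webhook receivers may expect a different payload.

```python
import json
import urllib.request
from typing import Optional


def build_webhook_request(url: str, alert: dict,
                          headers: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one alert."""
    text = f"[{alert['severity'].upper()}] {alert['service']}: {alert['message']}"
    body = json.dumps({"text": text}).encode("utf-8")
    # Caller-supplied headers override the default content type
    merged = {"Content-Type": "application/json", **(headers or {})}
    return urllib.request.Request(url, data=body, headers=merged, method="POST")


req = build_webhook_request(
    "https://hooks.slack.com/services/xxx",
    {"severity": "critical", "service": "neo4j", "message": "Service is down"},
)
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req, timeout=5)` would deliver it. Keeping construction separate from sending makes the payload easy to inspect and test.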

Alert Rules

# Custom alert rule
agent.add_alert_rule(
    name="database_slow",
    condition=lambda status: status.get("response_time_ms", 0) > 1000,
    severity="warning",
    message="Database response time exceeded 1 second"
)

See the complete agents reference for all available agents.