# Health Check Agent
The Health Check Agent provides comprehensive health monitoring for all services, data quality, and performance metrics, with built-in alerting.
## Overview
| Property | Value |
|---|---|
| Module | src.agents.monitoring.health_check_agent |
| Class | HealthCheckAgent |
| Author | Nguyen Dinh Anh Tuan |
| Version | 2.0.0 |
## Purpose
The Health Check Agent provides:
- Service availability monitoring (HTTP, TCP, Cypher, SPARQL, Kafka)
- Data quality validation (thresholds, counts, age)
- Performance metrics collection (response times, latency)
- Alerting integration (webhook, email, Slack)
- Prometheus metrics export
## Monitoring Capabilities

### Service Health Checks
| Check Type | Protocol | Description |
|---|---|---|
| HTTP | REST | API endpoint availability |
| TCP | Socket | Database connection |
| Cypher | Neo4j | Graph database health |
| SPARQL | Fuseki | Triplestore availability |
| Kafka | Broker | Message queue status |
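For illustration, a TCP-style probe of the kind listed above can be as simple as opening a socket and timing the connection. The sketch below is a standalone example (the helper name `tcp_check` is just for this snippet, not the agent's internal implementation); the host and port match the MongoDB entry in the configuration further down.

```python
import socket
import time

def tcp_check(host: str, port: int, timeout: float = 5.0) -> dict:
    """Open a TCP connection and report status plus latency in milliseconds."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return {"status": "healthy", "response_time_ms": round(elapsed_ms, 1)}
    except OSError as exc:
        return {"status": "unhealthy", "error": str(exc)}

# Example: probe the MongoDB port used in the configuration below
print(tcp_check("localhost", 27017, timeout=5.0))
```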
### Data Quality Metrics
| Metric | Description |
|---|---|
| Data freshness | Age of latest data |
| Record count | Expected vs actual counts |
| Validation rate | Percentage passing validation |
| Error rate | Failed operations |
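As a rough illustration of how these metrics can be derived, the sketch below computes freshness, count ratio, validation rate, and error rate from a handful of input values. The parameter names are assumptions for the example, not the agent's actual schema.

```python
from datetime import datetime, timezone

def data_quality_metrics(latest_timestamp: datetime, expected_count: int,
                         actual_count: int, passed: int, failed: int) -> dict:
    """Derive the four quality metrics from raw counts and the newest record's timestamp."""
    age_seconds = (datetime.now(timezone.utc) - latest_timestamp).total_seconds()
    total = passed + failed
    return {
        "data_freshness_seconds": age_seconds,
        "record_count_ratio": actual_count / expected_count if expected_count else 0.0,
        "validation_rate": passed / total if total else 1.0,
        "error_rate": failed / total if total else 0.0,
    }
```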
## Architecture
```
┌─────────────────────────────────────────────┐
│              Health Check Agent             │
├─────────────────────────────────────────────┤
│                                             │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐  │
│  │  HTTP   │  │   TCP   │  │   Cypher    │  │
│  │ Checker │  │ Checker │  │   Checker   │  │
│  └────┬────┘  └────┬────┘  └──────┬──────┘  │
│       │            │              │         │
│       └────────────┼──────────────┘         │
│                    ▼                        │
│           ┌─────────────────┐               │
│           │  Health Status  │               │
│           │   Aggregator    │               │
│           └────────┬────────┘               │
│                    │                        │
│       ┌────────────┼────────────────┐       │
│       ▼            ▼                ▼       │
│ ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│ │   Alert   │  │Prometheus │  │ Dashboard │ │
│ │  Manager  │  │  Export   │  │    API    │ │
│ └───────────┘  └───────────┘  └───────────┘ │
└─────────────────────────────────────────────┘
```
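The "Health Status Aggregator" box can be read as a simple fold over the individual checker results. The sketch below shows one plausible aggregation rule (any unhealthy service marks the whole system unhealthy); the actual class may weight services differently.

```python
def aggregate_health(service_results: dict) -> dict:
    """Combine per-service check results into a single overall status."""
    unhealthy = [name for name, result in service_results.items()
                 if result.get("status") != "healthy"]
    return {
        "overall_health": "unhealthy" if unhealthy else "healthy",
        "unhealthy_services": unhealthy,
        "services": service_results,
    }

print(aggregate_health({
    "backend": {"status": "healthy", "response_time_ms": 18.2},
    "neo4j": {"status": "unhealthy", "error": "connection refused"},
}))
```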
## Usage

### Basic Health Check
```python
from src.agents.monitoring.health_check_agent import HealthCheckAgent

agent = HealthCheckAgent()

# Start continuous monitoring
agent.start_monitoring()

# Get current health status
status = agent.get_health_status()
print(f"Overall Health: {status['overall_health']}")
print(f"Services: {status['services']}")
```
### Check Specific Service
```python
# Check single service
neo4j_health = agent.check_service("neo4j")
print(f"Neo4j Status: {neo4j_health['status']}")
print(f"Response Time: {neo4j_health['response_time_ms']}ms")

# Check database connection
mongo_health = agent.check_database("mongodb")
```
### Custom Health Checks
```python
import httpx  # async HTTP client assumed for this example

# Register custom health check
@agent.register_check("my_service")
async def check_my_service():
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get("http://my-service/health")
        return {"status": "healthy", "code": response.status_code}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}
```
## Configuration
```yaml
# config/health_check_config.yaml
health_check:
  enabled: true
  check_interval_seconds: 30
  services:
    backend:
      url: http://localhost:8001/health
      timeout: 5
      expected_status: 200
    neo4j:
      type: cypher
      url: bolt://localhost:7687
      query: "RETURN 1"
      timeout: 10
    mongodb:
      type: tcp
      host: localhost
      port: 27017
      timeout: 5
    redis:
      type: tcp
      host: localhost
      port: 6379
      timeout: 3
    stellio:
      url: http://localhost:8080/health
      timeout: 10
    fuseki:
      type: sparql
      url: http://localhost:3030/traffic/sparql
      query: "ASK { ?s ?p ?o }"
      timeout: 10

# Alert configuration
alerts:
  enabled: true
  channels:
    - type: webhook
      url: http://alert-service/webhook
    - type: email
      recipients:
        - admin@example.com
  rules:
    - name: service_down
      condition: status == "unhealthy"
      severity: critical
      cooldown_minutes: 5
    - name: high_latency
      condition: response_time_ms > 5000
      severity: warning
      cooldown_minutes: 15

# Prometheus metrics
prometheus:
  enabled: true
  port: 9090
  path: /metrics
```
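The snippet below sketches one way to load this file and hand it to the agent. Whether `HealthCheckAgent` accepts a `config` keyword like this is an assumption, so adapt it to the actual constructor.

```python
import yaml

from src.agents.monitoring.health_check_agent import HealthCheckAgent

# Load the YAML configuration shown above
with open("config/health_check_config.yaml") as fh:
    config = yaml.safe_load(fh)

agent = HealthCheckAgent(config=config)  # 'config' keyword is assumed
agent.start_monitoring()
```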
## Prometheus Metrics
| Metric | Type | Description |
|---|---|---|
| service_health_status | Gauge | Binary health (0=down, 1=up) |
| service_response_time_seconds | Histogram | Health check latency |
| data_quality_score | Gauge | Quality metric (0-100) |
| health_check_total | Counter | Total checks performed |
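The metrics above map directly onto `prometheus_client` primitives. The sketch below shows how they could be declared and updated; the `service` and `dataset` label names are assumptions for the example.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

SERVICE_HEALTH = Gauge("service_health_status", "Binary health (0=down, 1=up)", ["service"])
RESPONSE_TIME = Histogram("service_response_time_seconds", "Health check latency", ["service"])
QUALITY_SCORE = Gauge("data_quality_score", "Quality metric (0-100)", ["dataset"])
CHECKS_TOTAL = Counter("health_check_total", "Total checks performed", ["service"])

start_http_server(9090)  # expose /metrics on the port from the configuration above

# Record the outcome of one successful Neo4j check
SERVICE_HEALTH.labels(service="neo4j").set(1)
RESPONSE_TIME.labels(service="neo4j").observe(0.042)
CHECKS_TOTAL.labels(service="neo4j").inc()
```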
### Grafana Dashboard
```json
{
  "panels": [
    {
      "title": "Service Health",
      "type": "stat",
      "targets": [
        { "expr": "service_health_status" }
      ]
    }
  ]
}
```
## Alerting

### Alert Channels
```python
# Configure alert channels
agent.configure_alerts({
    "webhook": {
        "url": "https://hooks.slack.com/services/xxx",
        "headers": {"Content-Type": "application/json"}
    },
    "email": {
        "smtp_server": "smtp.example.com",
        "recipients": ["ops@example.com"]
    }
})
```
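For reference, a webhook channel like the one configured above typically boils down to a single HTTP POST. The helper below is an illustrative sketch using `requests`; the payload shape is an assumption, not the agent's wire format.

```python
import requests

def send_webhook_alert(url: str, service: str, result: dict,
                       severity: str = "warning") -> None:
    """POST a short alert message to a webhook endpoint (e.g. Slack)."""
    payload = {
        "text": f"[{severity.upper()}] {service} is {result.get('status', 'unknown')}: "
                f"{result.get('error', 'no details')}"
    }
    response = requests.post(url, json=payload, timeout=5)
    response.raise_for_status()
```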
### Alert Rules
```python
# Custom alert rule
agent.add_alert_rule(
    name="database_slow",
    condition=lambda status: status.get("response_time_ms", 0) > 1000,
    severity="warning",
    message="Database response time exceeded 1 second"
)
```
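Conceptually, a rule like `database_slow` pairs a predicate with a severity and a cooldown. The sketch below shows how such a rule could be evaluated against a check result; it mirrors the configuration semantics rather than the agent's internal classes.

```python
import time

class SimpleAlertRule:
    """Minimal rule: fire when the condition holds and the cooldown has elapsed."""

    def __init__(self, name, condition, severity, message, cooldown_minutes=15):
        self.name = name
        self.condition = condition
        self.severity = severity
        self.message = message
        self.cooldown_seconds = cooldown_minutes * 60
        self._last_fired = 0.0

    def evaluate(self, status: dict) -> bool:
        now = time.time()
        if self.condition(status) and now - self._last_fired >= self.cooldown_seconds:
            self._last_fired = now
            return True
        return False

rule = SimpleAlertRule(
    name="database_slow",
    condition=lambda s: s.get("response_time_ms", 0) > 1000,
    severity="warning",
    message="Database response time exceeded 1 second",
)
print(rule.evaluate({"response_time_ms": 1500}))  # True on the first slow result
```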
## Related Documentation
- Performance Monitor - Performance metrics
- Data Quality Validator - Data validation
- DevOps Guide - Monitoring setup
See the complete agents reference for all available agents.