# Health Checking System

This document describes the generalized health checking system for LabFusion Service Adapters.

## Overview

The health checking system is designed to be flexible and extensible, supporting different types of health checks for different services. It uses a strategy pattern with pluggable health checkers.

## Architecture

### Core Components

1. **BaseHealthChecker**: Abstract base class for all health checkers
2. **HealthCheckResult**: Standardized result object
3. **HealthCheckerRegistry**: Registry for different checker types
4. **HealthCheckerFactory**: Factory for creating checker instances
5. **ServiceStatusChecker**: Main orchestrator

### Health Checker Types

#### 1. API Health Checker (`APIHealthChecker`)

- **Purpose**: Check services with HTTP health endpoints
- **Use Case**: Most REST APIs, microservices
- **Configuration**:

```python
{
    "health_check_type": "api",
    "health_endpoint": "/api/health",
    "url": "https://service.example.com"
}
```

#### 2. Sensor Health Checker (`SensorHealthChecker`)

- **Purpose**: Check services via sensor data (e.g., Home Assistant entities)
- **Use Case**: Home Assistant, IoT devices, sensor-based monitoring
- **Configuration**:

```python
{
    "health_check_type": "sensor",
    "sensor_entity": "sensor.system_uptime",
    "url": "https://homeassistant.example.com"
}
```

#### 3. Custom Health Checker (`CustomHealthChecker`)

- **Purpose**: Complex health checks with multiple validation steps
- **Use Case**: Services requiring multiple checks, custom logic
- **Configuration**:

```python
{
    "health_check_type": "custom",
    "health_checks": [
        {
            "type": "api",
            "name": "main_api",
            "url": "https://service.example.com/api/health"
        },
        {
            "type": "sensor",
            "name": "uptime_sensor",
            "sensor_entity": "sensor.service_uptime"
        }
    ]
}
```

## Configuration

### Service Configuration Structure

```python
SERVICES = {
    "service_name": {
        "url": "https://service.example.com",
        "enabled": True,
        "health_check_type": "api|sensor|custom",  # choose one of the three

        # API-specific
        "health_endpoint": "/api/health",
        "token": "auth_token",
        "api_key": "api_key",

        # Sensor-specific
        "sensor_entity": "sensor.entity_name",

        # Custom-specific
        "health_checks": [
            {
                "type": "api",
                "name": "check_name",
                "url": "https://endpoint.com/health"
            }
        ]
    }
}
```

### Environment Variables

```bash
# Service URLs
HOME_ASSISTANT_URL=https://ha.example.com
FRIGATE_URL=http://frigate.local:5000
IMMICH_URL=http://immich.local:2283
N8N_URL=http://n8n.local:5678

# Authentication
HOME_ASSISTANT_TOKEN=your_token
FRIGATE_TOKEN=your_token
IMMICH_API_KEY=your_key
N8N_API_KEY=your_key
```

## Usage Examples

The snippets below use `await`, so they must run inside an `async` function (for example, one started with `asyncio.run`).

### Basic API Health Check

```python
from services.health_checkers import factory

# Create API checker
checker = factory.create_checker("api", timeout=5.0)

# Check service
config = {
    "url": "https://api.example.com",
    "health_endpoint": "/health",
    "enabled": True
}

result = await checker.check_health("example_service", config)
print(f"Status: {result.status}")
print(f"Response time: {result.response_time}s")
```

### Sensor-Based Health Check

```python
# Create sensor checker
checker = factory.create_checker("sensor", timeout=5.0)

# Check Home Assistant sensor
config = {
    "url": "https://ha.example.com",
    "sensor_entity": "sensor.system_uptime",
    "token": "your_token",
    "enabled": True
}

result = await checker.check_health("home_assistant", config)
print(f"Uptime: {result.metadata.get('sensor_state')}")
```

### Custom Health Check

```python
# Create custom checker
checker = factory.create_checker("custom", timeout=10.0)

# Check with multiple validations
config = {
    "url": "https://service.example.com",
    "enabled": True,
    "health_checks": [
        {
            "type": "api",
            "name": "main_api",
            "url": "https://service.example.com/api/health"
        },
        {
            "type": "api",
            "name": "database",
            "url": "https://service.example.com/api/db/health"
        }
    ]
}

result = await checker.check_health("complex_service", config)
print(f"Overall status: {result.status}")
print(f"Individual checks: {result.metadata.get('check_results')}")
```

## Health Check Results

### HealthCheckResult Structure

```python
{
    "status": "healthy|unhealthy|disabled|error|timeout|unauthorized|forbidden",
    "response_time": 0.123,  # seconds
    "error": "Error message if applicable",
    "metadata": {
        "http_status": 200,
        "response_size": 1024,
        "sensor_state": "12345",
        "last_updated": "2024-01-15T10:30:00Z"
    }
}
```

### Status Values

- **healthy**: Service is responding normally
- **unhealthy**: Service responded but with error status
- **disabled**: Service is disabled in configuration
- **timeout**: Request timed out
- **unauthorized**: Authentication required (HTTP 401)
- **forbidden**: Access forbidden (HTTP 403)
- **error**: Network or other error occurred

## Extending the System

### Adding a New Health Checker

1. **Create the checker class**:

   ```python
   from typing import Dict

   from .base import BaseHealthChecker, HealthCheckResult

   class MyCustomChecker(BaseHealthChecker):
       async def check_health(self, service_name: str, config: Dict) -> HealthCheckResult:
           # Implementation
           pass
   ```

2. **Register the checker**:

   ```python
   from services.health_checkers import registry

   registry.register("my_custom", MyCustomChecker)
   ```

3. **Use in configuration**:

   ```python
   {
       "health_check_type": "my_custom",
       "custom_param": "value"
   }
   ```

### Service-Specific Logic

The factory automatically selects the appropriate checker based on:

1. `health_check_type` in configuration
2. Service name patterns
3. Configuration presence (e.g., `sensor_entity` → sensor checker)

## Performance Considerations

- **Concurrent Checking**: All services are checked simultaneously
- **Checker Caching**: Checkers are cached per service to avoid recreation
- **Timeout Management**: Configurable timeouts per checker type
- **Resource Cleanup**: Proper cleanup of HTTP clients

## Monitoring and Logging

- **Debug Logs**: Detailed operation logs for troubleshooting
- **Performance Metrics**: Response times and success rates
- **Error Tracking**: Comprehensive error logging with context
- **Health Summary**: Overall system health statistics

## Best Practices

1. **Choose Appropriate Checker**: Use the right checker type for your service
2. **Set Reasonable Timeouts**: Balance responsiveness with reliability
3. **Handle Errors Gracefully**: Always provide meaningful error messages
4. **Monitor Performance**: Track response times and success rates
5. **Test Thoroughly**: Verify health checks work in all scenarios
6. **Document Configuration**: Keep service configurations well-documented

## Troubleshooting

### Common Issues

1. **Timeout Errors**: Increase timeout or check network connectivity
2. **Authentication Failures**: Verify tokens and API keys
3. **Sensor Not Found**: Check entity names and permissions
4. **Configuration Errors**: Validate service configuration structure

### Debug Tools

- **Debug Endpoint**: `/debug/logging` to test logging configuration
- **Health Check Logs**: Detailed logs for each health check operation
- **Metadata Inspection**: Check metadata for additional context
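
## Illustrative Sketch: Checker Selection Order

The selection order listed under Service-Specific Logic (explicit `health_check_type`, then service name patterns, then configuration presence) could look roughly like the function below. This is a minimal sketch, not the factory's actual code: `resolve_checker_type`, the `NAME_PATTERNS` table, the `health_checks` → custom rule, and the fallback to `"api"` are all assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical name patterns; the real factory's patterns are not documented here.
NAME_PATTERNS: Dict[str, str] = {
    "home_assistant": "sensor",
}


def resolve_checker_type(service_name: str, config: Dict[str, Any]) -> str:
    """Pick a checker type using the three-step order described above."""
    # 1. An explicit health_check_type in the configuration always wins.
    explicit = config.get("health_check_type")
    if explicit:
        return explicit

    # 2. Fall back to service name patterns (substring match, case-insensitive).
    lowered = service_name.lower()
    for pattern, checker_type in NAME_PATTERNS.items():
        if pattern in lowered:
            return checker_type

    # 3. Infer the type from which configuration keys are present.
    if "sensor_entity" in config:
        return "sensor"
    if "health_checks" in config:  # assumed rule, by analogy with sensor_entity
        return "custom"

    # Assumed default when nothing else matches.
    return "api"
```

Keeping the explicit setting first means a service can always override the inferred type by setting `health_check_type` in its configuration.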
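
## Illustrative Sketch: Concurrent Checks with Timeouts

The Performance Considerations section says that all services are checked simultaneously and that timeouts are configurable. One way to get that behavior with the standard library is `asyncio.gather` combined with `asyncio.wait_for`. The sketch below is an assumption about how such an orchestrator might be written, not the `ServiceStatusChecker` implementation; `fake_check`, `check_one`, `check_all`, and the `delay` config key are hypothetical, and results are plain dicts rather than `HealthCheckResult` objects.

```python
import asyncio
import time
from typing import Any, Dict


async def fake_check(service_name: str, config: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a checker's check_health; sleeps instead of doing network I/O."""
    await asyncio.sleep(config.get("delay", 0.0))  # "delay" is a hypothetical key
    return {"status": "healthy"}


async def check_one(name: str, config: Dict[str, Any], timeout: float) -> Dict[str, Any]:
    """Run one check and map failures onto the documented status values."""
    if not config.get("enabled", True):
        return {"status": "disabled", "response_time": 0.0}

    start = time.perf_counter()
    try:
        result = await asyncio.wait_for(fake_check(name, config), timeout=timeout)
    except asyncio.TimeoutError:
        result = {"status": "timeout"}
    except Exception as exc:  # any other failure becomes an "error" status
        result = {"status": "error", "error": str(exc)}
    result["response_time"] = time.perf_counter() - start
    return result


async def check_all(services: Dict[str, Dict[str, Any]], timeout: float = 5.0) -> Dict[str, Dict[str, Any]]:
    """Check every service concurrently and return results keyed by service name."""
    names = list(services)
    results = await asyncio.gather(
        *(check_one(name, services[name], timeout) for name in names)
    )
    return dict(zip(names, results))
```

Because each check catches its own exceptions, one slow or failing service does not prevent the others from reporting. A call such as `asyncio.run(check_all(SERVICES, timeout=5.0))` would return one result per configured service.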
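
## Illustrative Sketch: Registry and Factory

The Overview describes a strategy pattern with pluggable checkers, and the extension steps call `registry.register(...)` and `factory.create_checker(...)`. The sketch below shows one way those two pieces could fit together. The class names `BaseChecker`, `Registry`, `Factory`, and `AlwaysHealthyChecker` are hypothetical stand-ins, and the internals of the real `HealthCheckerRegistry` and `HealthCheckerFactory` may differ.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class BaseChecker(ABC):
    """Hypothetical stand-in for BaseHealthChecker: one strategy per check type."""

    def __init__(self, timeout: float = 5.0) -> None:
        self.timeout = timeout

    @abstractmethod
    async def check_health(self, service_name: str, config: Dict[str, Any]) -> Dict[str, Any]:
        """Return a result dict for the given service."""


class Registry:
    """Maps a health_check_type string to a checker class."""

    def __init__(self) -> None:
        self._types: Dict[str, Type[BaseChecker]] = {}

    def register(self, name: str, checker_cls: Type[BaseChecker]) -> None:
        if not issubclass(checker_cls, BaseChecker):
            raise TypeError("checker_cls must subclass BaseChecker")
        self._types[name] = checker_cls

    def get(self, name: str) -> Type[BaseChecker]:
        try:
            return self._types[name]
        except KeyError:
            raise ValueError(f"Unknown health check type: {name!r}") from None


class Factory:
    """Creates checker instances from registered types."""

    def __init__(self, registry: Registry) -> None:
        self._registry = registry

    def create_checker(self, checker_type: str, timeout: float = 5.0) -> BaseChecker:
        return self._registry.get(checker_type)(timeout=timeout)


class AlwaysHealthyChecker(BaseChecker):
    """Trivial strategy used only to exercise the registry and factory."""

    async def check_health(self, service_name: str, config: Dict[str, Any]) -> Dict[str, Any]:
        return {"status": "healthy"}


registry = Registry()
registry.register("always_healthy", AlwaysHealthyChecker)
factory = Factory(registry)

checker = factory.create_checker("always_healthy", timeout=2.0)
result = asyncio.run(checker.check_health("demo_service", {}))
```

Separating the registry from the factory keeps registration (which types exist) apart from construction (how an instance is configured), so a new checker type needs only a `register` call.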