labFusion/docs/CACHE_TROUBLESHOOTING.md
GSRN 79250ea3ab
refactor: Apply cache fixes directly to existing runner configs
- Update all runner configuration files with cache networking fixes:
  - config_docker.yaml
  - config_heavy.yaml
  - config_light.yaml
  - config_security.yaml
- Remove separate config_cache_fixed.yaml file
- Update troubleshooting scripts to use updated configs
- Update documentation to reference existing config files

All runner configs now have:
- Fixed cache host: host.docker.internal
- Fixed cache port: 44029
- Host networking for better container connectivity

This provides a cleaner approach by updating existing configs
instead of maintaining a separate fixed configuration file.
2025-09-15 16:44:16 +02:00

Cache Troubleshooting Guide

Problem Description

The LabFusion CI/CD pipelines were experiencing cache timeout errors:

    ::warning::Failed to restore: getCacheEntry failed: connect ETIMEDOUT 172.31.0.3:44029

This error occurs when the cache service is not accessible from the job containers due to Docker networking issues.
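
When the warning shows up in a long log, it helps to know exactly which address the job tried to reach. The sketch below pulls the `host:port` pair out of such lines. `extract_cache_endpoint` is a hypothetical helper name, not part of act_runner, and it assumes the address is an IPv4 literal as in the message above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print each unique
# "ip:port" that follows an ETIMEDOUT marker.
extract_cache_endpoint() {
  # "|| true" keeps the pipeline usable when nothing matches.
  { grep -oE 'ETIMEDOUT [0-9]+(\.[0-9]+){3}:[0-9]+' || true; } \
    | awk '{print $2}' \
    | sort -u
}
```

For example, `extract_cache_endpoint < runner.log` would list the endpoints that timed out. An address such as 172.31.0.3 lies in a private range that Docker commonly uses for bridge networks, which is consistent with a networking problem between the job container and the cache server.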

Root Cause

The issue has three contributing causes:

  1. Docker Networking: Job containers can't reach the cache server running on the host
  2. Random Port Assignment: Setting the cache port to 0 lets the runner pick an arbitrary free port, so the port is unpredictable
  3. Cache Service Location: The cache service binds to an IP address that containers can't access

Solutions Implemented

1. Workflow-Level Fixes

Added fail-on-cache-miss: false to all cache actions in:

  • .gitea/workflows/api-gateway.yml
  • .gitea/workflows/frontend.yml
  • .gitea/workflows/service-adapters.yml
  • .gitea/workflows/api-docs.yml
  • .gitea/workflows/ci.yml

This ensures that cache failures don't cause the entire pipeline to fail.
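
To check that no workflow was missed, a small audit can list files that use a cache action without the setting. This is a minimal sketch: `find_unguarded_cache_steps` is a hypothetical helper, it assumes the cache steps reference `actions/cache`, and it checks per file rather than per step, so a file with several cache steps passes as soon as one of them carries the setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print workflow files that use actions/cache but
# never set "fail-on-cache-miss: false". The directory defaults to
# .gitea/workflows, the location listed above.
find_unguarded_cache_steps() {
  local dir="${1:-.gitea/workflows}"
  local file
  for file in "$dir"/*.yml; do
    [ -f "$file" ] || continue   # skip the literal pattern if nothing matches
    if grep -q 'uses: *actions/cache' "$file" &&
       ! grep -qE 'fail-on-cache-miss: *false' "$file"; then
      printf '%s\n' "$file"
    fi
  done
}
```

Running `find_unguarded_cache_steps` from the repository root should print nothing once every workflow has been updated.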

2. Runner Configuration Fixes

Updated all existing runner configuration files with:

  • Fixed Host: host.docker.internal (lets containers reach the host)
  • Fixed Port: 44029 (instead of a random port chosen when the port is 0)
  • Host Network: Job containers use host networking, so they share the host's network stack

Updated files:

  • runners/config_docker.yaml
  • runners/config_heavy.yaml
  • runners/config_light.yaml
  • runners/config_security.yaml

3. Troubleshooting Tools

Created diagnostic scripts:

  • runners/fix-cache-issues.sh (Linux/macOS)
  • runners/fix-cache-issues.ps1 (Windows)

These scripts help diagnose and fix cache issues.

How to Apply the Fixes

Option 1: Use the Updated Configuration

  1. Stop your current runner:

    pkill -f act_runner
    
  2. Start it again with one of the updated configurations:

    ./act_runner daemon --config config_docker.yaml
    # or
    ./act_runner daemon --config config_heavy.yaml
    # or
    ./act_runner daemon --config config_light.yaml
    # or
    ./act_runner daemon --config config_security.yaml
    

Option 2: Run the Troubleshooting Script

Linux/macOS:

    cd runners
    ./fix-cache-issues.sh

Windows:

    cd runners
    .\fix-cache-issues.ps1

Option 3: Manual Configuration

Update your runner configuration with these key changes:

    cache:
      enabled: true
      host: "host.docker.internal"  # Fixed host
      port: 44029                   # Fixed port

    container:
      network: "host"               # Use host networking
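
After editing a config by hand, a quick check can confirm that the three values are present. The sketch below is an assumption-laden convenience: `check_runner_config` is a hypothetical helper that greps for the literal values shown above and does not parse YAML, so unusual formatting may be misreported.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the fixed cache settings are
# missing from a runner config file. Returns non-zero if any is missing.
check_runner_config() {
  local file="$1" status=0
  grep -qE '^[[:space:]]*host:[[:space:]]*"?host\.docker\.internal' "$file" \
    || { echo "$file: cache host is not host.docker.internal"; status=1; }
  grep -qE '^[[:space:]]*port:[[:space:]]*44029([^0-9]|$)' "$file" \
    || { echo "$file: cache port is not 44029"; status=1; }
  grep -qE '^[[:space:]]*network:[[:space:]]*"?host"?([[:space:]]|#|$)' "$file" \
    || { echo "$file: container network is not host"; status=1; }
  return "$status"
}
```

A loop such as `for f in runners/config_*.yaml; do check_runner_config "$f"; done` would cover all four updated files.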

Verification

After applying the fixes:

  1. Check Runner Logs: Look for cache service startup messages
  2. Test a Workflow: Run a simple workflow to verify cache works
  3. Monitor Cache Hits: Check if dependencies are being cached properly

Expected Results

  • No more ETIMEDOUT errors
  • Cache hits show "Cache hit!" messages
  • Faster build times due to dependency caching
  • Workflows continue even if cache fails
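
The first two expectations can be checked against a saved log. This is a rough sketch: `summarize_cache_log` is a hypothetical helper, and the case-insensitive "cache hit" pattern is an assumption about the log wording that may need adjusting for your cache action.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count timeout and cache-hit lines in a log file.
# Succeeds only when the log contains no ETIMEDOUT lines.
summarize_cache_log() {
  local log="$1"
  local timeouts hits
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  timeouts="$(grep -c 'ETIMEDOUT' "$log" || true)"
  hits="$(grep -ci 'cache hit' "$log" || true)"
  printf 'timeouts=%s hits=%s\n' "$timeouts" "$hits"
  [ "$timeouts" -eq 0 ]
}
```

For instance, `summarize_cache_log runner.log` prints one summary line and returns a non-zero status while timeouts are still present, which makes it easy to use in a follow-up script.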

Troubleshooting

If issues persist:

  1. Check Docker Networking:

    docker network ls
    docker network inspect bridge
    
  2. Verify the Cache Service Is Listening (on the runner host):

    netstat -tlnp | grep 44029
    
  3. Test Connectivity (from inside a job container):

    curl http://host.docker.internal:44029/
    
  4. Check Runner Logs:

    tail -f runner.log
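
If `curl` is not available in the job image, the connectivity check can be approximated with bash alone. The following is a sketch under stated assumptions: `probe_cache` is a hypothetical helper, it relies on bash's `/dev/tcp` redirection and the GNU coreutils `timeout` command, and it only tests that a TCP connection opens, not that the cache service answers correctly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to open a TCP connection to the cache endpoint
# within 3 seconds. Defaults match the fixed host and port used in this guide.
probe_cache() {
  local host="${1:-host.docker.internal}" port="${2:-44029}"
  # In the child shell, $0 is the host and $1 is the port.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

Run it where the job would run, for example inside a container started by the runner, since `host.docker.internal` may not resolve on a Linux host itself.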
    

Support

If you continue to experience cache issues after applying these fixes, please:

  1. Run the troubleshooting script and share the output
  2. Check the runner logs for any error messages
  3. Verify your Docker and network configuration