labFusion/docs/CACHE_TROUBLESHOOTING.md
GSRN e3800b49b8 docs: Add comprehensive cache troubleshooting guide
- Document the root cause of cache timeout errors
- Explain all implemented solutions
- Provide step-by-step fix instructions
- Include verification and troubleshooting steps
- Add support resources and additional help
2025-09-15 16:41:11 +02:00

Cache Troubleshooting Guide

Problem Description

The LabFusion CI/CD pipelines were experiencing cache timeout errors:

::warning::Failed to restore: getCacheEntry failed: connect ETIMEDOUT 172.31.0.3:44029

This error occurs when the cache service is not accessible from the job containers due to Docker networking issues.
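The address at the end of the warning is the endpoint the job container tried to reach, so it is worth pulling out before changing any configuration. A minimal sketch, assuming the warning has the shape shown above; `parse_cache_endpoint` is a hypothetical helper, not part of act_runner:

```shell
#!/usr/bin/env bash
# parse_cache_endpoint (hypothetical helper): print the host:port that
# follows "ETIMEDOUT" in a log line, or nothing if the line has no match.
parse_cache_endpoint() {
  printf '%s\n' "$1" | sed -n 's/.*ETIMEDOUT \([0-9.]*:[0-9]*\).*/\1/p'
}

line='::warning::Failed to restore: getCacheEntry failed: connect ETIMEDOUT 172.31.0.3:44029'
parse_cache_endpoint "$line"
```

If the printed port changes between runner restarts, that points at the random port assignment described under Root Cause.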

Root Cause

The issue is caused by:

  1. Docker Networking: Containers can't reach the cache server on the host
  2. Random Port Assignment: With the port set to 0, the cache server picks a random free port on each start, so its address is unpredictable
  3. Cache Service Location: The cache service binds to an IP that containers can't access

Solutions Implemented

1. Workflow-Level Fixes

Added fail-on-cache-miss: false to all cache actions in:

  • .gitea/workflows/api-gateway.yml
  • .gitea/workflows/frontend.yml
  • .gitea/workflows/service-adapters.yml
  • .gitea/workflows/api-docs.yml
  • .gitea/workflows/ci.yml

With this setting, a cache entry that cannot be restored does not fail the job, so the pipeline continues without the cache.
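To confirm that the setting is present everywhere, a quick scan of the workflow directory can help. This is a rough sketch: `missing_cache_flag` is a hypothetical helper, and it checks only that each file contains the setting at least once, not that every cache step in the file has it.

```shell
#!/usr/bin/env bash
# missing_cache_flag (hypothetical helper): print every *.yml file in the
# given directory that never sets "fail-on-cache-miss: false".
missing_cache_flag() {
  local dir="$1" f
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no files match
    grep -Eq 'fail-on-cache-miss:[[:space:]]*false' "$f" || printf '%s\n' "$f"
  done
}

missing_cache_flag .gitea/workflows
```

Empty output means every workflow file contains the setting.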

2. Runner Configuration Fixes

Created runners/config_cache_fixed.yaml with:

  • Fixed Host: host.docker.internal (allows containers to access host)
  • Fixed Port: 44029 (instead of random port 0)
  • Host Network: Job containers use host networking, so they share the host's network stack and can reach the cache port directly

3. Troubleshooting Tools

Created diagnostic scripts:

  • runners/fix-cache-issues.sh (Linux/macOS)
  • runners/fix-cache-issues.ps1 (Windows)

These scripts help diagnose and fix cache issues.

How to Apply the Fixes

Option 1: Use the Fixed Configuration

  1. Stop your current runner:

    pkill -f act_runner
    
  2. Start with the fixed configuration (adjust the path if you are not in the runners directory):

    ./act_runner daemon --config config_cache_fixed.yaml
    

Option 2: Run the Troubleshooting Script

Linux/macOS:

cd runners
./fix-cache-issues.sh

Windows:

cd runners
.\fix-cache-issues.ps1

Option 3: Manual Configuration

Update your runner configuration with these key changes:

cache:
  enabled: true
  host: "host.docker.internal"  # Fixed host
  port: 44029                   # Fixed port

container:
  network: "host"               # Use host networking
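Before restarting the runner, it can be useful to check that the configuration really pins the port. The sketch below assumes the two-level YAML layout shown above with an unquoted port value; `cache_port` and `check_cache_port` are hypothetical helpers, and the awk parsing is deliberately simple rather than a full YAML parser.

```shell
#!/usr/bin/env bash
# cache_port (hypothetical helper): print the "port:" value found inside
# the top-level "cache:" block of a runner config file.
cache_port() {
  awk '
    /^[^[:space:]#]/ { in_cache = ($1 == "cache:") }  # a top-level key starts a new block
    in_cache && $1 == "port:" { print $2; exit }
  ' "$1"
}

# check_cache_port (hypothetical helper): fail if the port is missing or 0.
check_cache_port() {
  local port
  port="$(cache_port "$1")"
  case "$port" in
    ''|0) echo "cache port is unset or 0 (random)"; return 1 ;;
    *)    echo "cache port is fixed at $port" ;;
  esac
}

# Example: check_cache_port runners/config_cache_fixed.yaml
```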

Verification

After applying the fixes:

  1. Check Runner Logs: Look for cache service startup messages
  2. Test a Workflow: Run a simple workflow to verify cache works
  3. Monitor Cache Hits: Check if dependencies are being cached properly

Expected Results

  • No more ETIMEDOUT errors
  • Cache hits show "Cache hit!" messages
  • Faster build times due to dependency caching
  • Workflows continue even if cache fails
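These expectations can be checked against a saved job or runner log. A minimal sketch, assuming the log is a plain text file containing the "Cache hit!" and ETIMEDOUT strings mentioned in this guide; `cache_log_summary` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# cache_log_summary (hypothetical helper): count the lines mentioning a
# cache hit and the lines mentioning ETIMEDOUT in a log file.
cache_log_summary() {
  local log="$1" hits timeouts
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  hits="$(grep -c 'Cache hit' "$log" || true)"
  timeouts="$(grep -c 'ETIMEDOUT' "$log" || true)"
  echo "hits=$hits timeouts=$timeouts"
}

# Example: cache_log_summary runner.log
```

After the fix, the timeout count should stay at zero and the hit count should grow on repeated runs.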

Troubleshooting

If issues persist:

  1. Check Docker Networking:

    docker network ls
    docker network inspect bridge
    
  2. Verify Cache Service:

    netstat -tlnp | grep 44029
    
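  3. Test Connectivity (run this from inside a job container, since host.docker.internal may not resolve on the host itself):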

    curl http://host.docker.internal:44029/
    
  4. Check Runner Logs:

    tail -f runner.log
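If curl is not available in the job image, a plain TCP probe answers the same question as step 3. This sketch assumes bash (for its `/dev/tcp` pseudo-device) and the `timeout` command from GNU coreutils are present; `probe_tcp` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# probe_tcp (hypothetical helper): try to open a TCP connection to host:port
# and report whether it succeeded within 3 seconds.
probe_tcp() {
  local host="$1" port="$2"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Example: probe_tcp host.docker.internal 44029
```

An "unreachable" result from inside a job container, combined with a listening port on the host in step 2, suggests the problem lies in container networking rather than in the cache service.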
    

Additional Resources

Support

If you continue to experience cache issues after applying these fixes, please:

  1. Run the troubleshooting script and share the output
  2. Check the runner logs for any error messages
  3. Verify your Docker and network configuration