# Cache Troubleshooting Guide

## Problem Description

The LabFusion CI/CD pipelines were experiencing cache timeout errors:
```
::warning::Failed to restore: getCacheEntry failed: connect ETIMEDOUT 172.31.0.3:44029
```

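`ETIMEDOUT` means the job container tried to connect to the cache service at the listed address and port and never got a response. Here the cause is Docker networking: the cache service is not reachable from inside the job containers.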
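The warning itself names the endpoint the job container tried to reach, which is the first thing to compare against your runner configuration. As a small sketch (the helper name `parse_cache_endpoint` is made up for illustration and is not part of act_runner), the address and port can be pulled out of such a log line like this:

```bash
# Extract "host:port" from a cache timeout warning line.
# parse_cache_endpoint is a hypothetical helper, not an act_runner command.
parse_cache_endpoint() {
  # Keep only the "ETIMEDOUT <ipv4>:<port>" part, then print the address field.
  printf '%s\n' "$1" \
    | grep -oE 'ETIMEDOUT [0-9]+(\.[0-9]+){3}:[0-9]+' \
    | awk '{print $2}'
}

line='::warning::Failed to restore: getCacheEntry failed: connect ETIMEDOUT 172.31.0.3:44029'
endpoint="$(parse_cache_endpoint "$line")"
echo "host=${endpoint%:*} port=${endpoint##*:}"
# prints: host=172.31.0.3 port=44029
```

If the port printed here changes from run to run, the runner is most likely still using a random port.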

## Root Cause

Three factors combine to cause the timeout:
1. **Docker Networking**: Job containers cannot reach the cache server running on the host
2. **Random Port Assignment**: With port 0, the cache server listens on a random free port, so the endpoint can change every time the runner restarts
3. **Cache Service Location**: The cache service binds to an IP address that containers cannot reach

## Solutions Implemented

### 1. Workflow-Level Fixes

Added `fail-on-cache-miss: false` to all cache actions in:
- `.gitea/workflows/api-gateway.yml`
- `.gitea/workflows/frontend.yml`
- `.gitea/workflows/service-adapters.yml`
- `.gitea/workflows/api-docs.yml`
- `.gitea/workflows/ci.yml`

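This makes it explicit that a cache miss must not fail the job. Note that `false` is already the default for this input in `actions/cache`, so the setting documents intent rather than changing behavior, and it does not remove the timeout itself; the runner configuration fixes below address that.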
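To confirm that no workflow was missed, a coarse file-level check can help. The sketch below assumes the cache steps reference `actions/cache` in their `uses:` line, which this guide does not state; `find_unguarded_workflows` is a hypothetical helper name. It only checks whether the setting appears anywhere in a file, so a file with several cache steps still needs a manual look.

```bash
# List workflow files that use a cache action but never set fail-on-cache-miss.
# find_unguarded_workflows is a hypothetical helper for illustration.
find_unguarded_workflows() {
  dir="${1:-.gitea/workflows}"
  for f in "$dir"/*.yml; do
    [ -f "$f" ] || continue   # skip the unexpanded glob when nothing matches
    # Assumption: cache steps are recognizable by the string "actions/cache".
    if grep -q 'actions/cache' "$f" && ! grep -q 'fail-on-cache-miss' "$f"; then
      echo "$f"
    fi
  done
}

# Run from the repository root; no output means every cache-using file has the setting.
find_unguarded_workflows
```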

### 2. Runner Configuration Fixes

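Created `runners/config_cache_fixed.yaml` with:
- **Fixed Host**: `host.docker.internal`, so job containers address the cache server on the host by a stable name. On Linux this name is not defined by default; if it does not resolve inside your containers, map it with Docker's `host-gateway` or use the host's IP address instead
- **Fixed Port**: `44029` (instead of random port 0)
- **Host Network**: Job containers use host networking, so they share the host's network stack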

### 3. Troubleshooting Tools

Created diagnostic scripts:
- `runners/fix-cache-issues.sh` (Linux/macOS)
- `runners/fix-cache-issues.ps1` (Windows)

These scripts help diagnose and fix cache issues.

## How to Apply the Fixes

### Option 1: Use the Fixed Configuration

1. Stop your current runner:
   ```bash
   pkill -f act_runner
   ```

2. Start with the fixed configuration (the file lives in `runners/`, so run this from that directory or adjust the path):
   ```bash
   ./act_runner daemon --config config_cache_fixed.yaml
   ```

### Option 2: Run the Troubleshooting Script

**Linux/macOS:**
```bash
cd runners
./fix-cache-issues.sh
```

**Windows:**
```powershell
cd runners
.\fix-cache-issues.ps1
```

### Option 3: Manual Configuration

Update your runner configuration with these key changes:

```yaml
cache:
  enabled: true
  host: "host.docker.internal"  # Fixed host
  port: 44029                   # Fixed port

container:
  network: "host"               # Use host networking
```
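Before restarting the runner, it can help to confirm that the config really pins the port. This is a minimal sketch, assuming `port` is indented under a top-level `cache:` key as in the snippet above; `cache_port` is a hypothetical helper, not an act_runner command, and it is a line-based check rather than a full YAML parser.

```bash
# Print the port configured under the top-level "cache:" key of a runner config.
# cache_port is a hypothetical helper for illustration.
cache_port() {
  awk '
    /^[^ #]/ { in_cache = ($1 == "cache:") }      # top-level key: entering or leaving cache
    in_cache && $1 == "port:" { print $2; exit }  # first port entry inside cache
  ' "$1"
}

config="${1:-config_cache_fixed.yaml}"
if [ -f "$config" ]; then
  port="$(cache_port "$config")"
  if [ -z "$port" ] || [ "$port" = "0" ]; then
    echo "cache port is not pinned in $config"
  else
    echo "cache port is fixed at $port"
  fi
fi
```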

## Verification

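After applying the fixes:

1. **Check Runner Logs**: Confirm that the cache service starts and that no `ETIMEDOUT` warnings follow
2. **Test a Workflow**: Run a simple workflow twice; the first run should save the cache and the second should restore it
3. **Monitor Cache Hits**: Check that dependency steps report cache hits on later runs instead of downloading everything again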
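One way to make the log check repeatable is to count the timeout warnings in a saved log. The sketch below assumes the warning text matches the message shown in the problem description and that you have a log file containing job output (for example a downloaded job log, or `runner.log` if your runner writes job output there); `count_cache_timeouts` is a hypothetical helper name.

```bash
# Count cache timeout warnings in a log file.
# count_cache_timeouts is a hypothetical helper for illustration.
count_cache_timeouts() {
  # grep -c prints the number of matching lines but exits 1 when that number is 0,
  # so "|| true" keeps the helper usable in scripts that run under "set -e".
  grep -c 'getCacheEntry failed: connect ETIMEDOUT' "$1" || true
}

if [ -f runner.log ]; then
  echo "cache timeouts: $(count_cache_timeouts runner.log)"
fi
```

A count of 0 after a fresh workflow run is consistent with the fix working, although it does not prove that the cache was actually restored.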

## Expected Results

- ✅ No more `ETIMEDOUT` errors
- ✅ Cache hits show "✅ Cache hit!" messages
- ✅ Faster build times due to dependency caching
- ✅ Workflows continue even if cache fails

## Troubleshooting

If issues persist:

1. **Check Docker Networking**:
   ```bash
   docker network ls
   docker network inspect bridge
   ```

2. **Verify Cache Service** (on the runner host, check that something is listening on the fixed port):
   ```bash
   netstat -tlnp | grep 44029
   ```

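3. **Test Connectivity** (from inside a job container; any HTTP response, even an error status, shows the port is reachable):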
   ```bash
   curl http://host.docker.internal:44029/
   ```

4. **Check Runner Logs**:
   ```bash
   tail -f runner.log
   ```
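If `curl` is not available in the job container, a plain TCP probe answers the same question. This is a sketch under a few assumptions: `bash` and the coreutils `timeout` command are present in the container (minimal images may lack both), and `probe_cache` is a hypothetical helper name. It relies on bash's built-in `/dev/tcp` redirection and bounds the attempt so a blocked connection does not hang.

```bash
# Return 0 if a TCP connection to host:port succeeds within the time limit.
# probe_cache is a hypothetical helper for illustration.
probe_cache() {
  host="$1"; port="$2"; secs="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>;
  # timeout kills the attempt after $secs seconds.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

if probe_cache host.docker.internal 44029; then
  echo "cache endpoint reachable"
else
  echo "cache endpoint NOT reachable"
fi
```

Run it inside a job container rather than on the host, since the failure described in this guide only shows up from the container's side of the network.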

## Additional Resources

- [Gitea Act Runner Documentation](https://gitea.com/gitea/act_runner/src/branch/main/docs/configuration.md)
- [GitHub Actions Cache Documentation](https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows)
- [Docker Networking Documentation](https://docs.docker.com/network/)

## Support

If you continue to experience cache issues after applying these fixes, please:
1. Run the troubleshooting script and share the output
2. Check the runner logs for any error messages
3. Verify your Docker and network configuration