
Commit 0ae5535

Optimize GitHub CI cache utilization (#21066)
I created two scripts to analyse the GitHub runner [cache](https://github.com/duckdb/duckdb/actions/caches) items:

- `scripts/github_cache_usage.py`: summarize entries, total size, and relative size per workflow.
- `scripts/ccache_workflow_summary.py`: summarize ccache usage for a GitHub workflow run.

### duckdb/duckdb cache usage

I've asked Mark to change the total cache size of the repo `duckdb/duckdb` from the default 10 GiB to 50 GiB. However, we're still reaching that limit, so GitHub evicts cache entries. Summarizing the cache entries gives:

```sql
┌──────────────────┬─────────────┬───────────┬──────────────┐
│ workflow         │ entry_count │ total_gib │ pct_of_total │
├──────────────────┼─────────────┼───────────┼──────────────┤
│ nightlytests     │          42 │    19.324 │         39.0 │
│ extensions       │          48 │    13.286 │         27.0 │
│ main             │          32 │    13.135 │         26.0 │
│ linux-release    │           6 │     1.496 │          3.0 │
│ osx              │           8 │     1.432 │          3.0 │
│ bundlestaticlibs │           4 │     0.658 │          1.0 │
│ unmapped/other   │           8 │      0.39 │          1.0 │
└──────────────────┴─────────────┴───────────┴──────────────┘
```

NightlyTests mostly runs on a nightly schedule, but still **consumes 39% (!) of the total repo cache**.
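The grouping behind a summary like the one above can be sketched in a few lines. This is a minimal sketch, not the actual `scripts/github_cache_usage.py` (which is not shown in this diff): `summarize_cache_entries` and `key_to_workflow` are hypothetical names, and the only assumed input shape is the `key`/`size_in_bytes` fields returned by the GitHub REST endpoint `GET /repos/{owner}/{repo}/actions/caches`.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def summarize_cache_entries(
    entries: List[Dict], key_to_workflow: Callable[[str], Optional[str]]
) -> List[Dict]:
    """Group cache entries by workflow and compute size shares.

    `entries` are dicts shaped like the items of the GitHub REST response
    for GET /repos/{owner}/{repo}/actions/caches; only `key` and
    `size_in_bytes` are used. Keys that `key_to_workflow` cannot map
    land in the "unmapped/other" bucket.
    """
    buckets = defaultdict(lambda: {"entry_count": 0, "total_bytes": 0})
    for entry in entries:
        workflow = key_to_workflow(entry["key"]) or "unmapped/other"
        buckets[workflow]["entry_count"] += 1
        buckets[workflow]["total_bytes"] += entry["size_in_bytes"]
    grand_total = sum(b["total_bytes"] for b in buckets.values()) or 1
    # Largest consumers first, mirroring the table layout above.
    rows = []
    for workflow, b in sorted(buckets.items(), key=lambda kv: -kv[1]["total_bytes"]):
        rows.append(
            {
                "workflow": workflow,
                "entry_count": b["entry_count"],
                "total_gib": round(b["total_bytes"] / 2**30, 3),
                "pct_of_total": round(100 * b["total_bytes"] / grand_total),
            }
        )
    return rows
```

The entries themselves can be fetched with `gh api repos/duckdb/duckdb/actions/caches --paginate` and fed straight into this function.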
### Ccache hit rate for nightly tests

Analysing the ccache hit rate for the branch [v1.5](https://github.com/duckdb/duckdb/actions/runs/22331644275) gives:

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Forcing async Sinks/Sources | 7% | Yes |
| Linux Memory Leaks | 0% | No |
| Release Assertions | 0% | Yes |
| Storage Initialization Verification | 5% | Yes |
| Extension updating test | 44% | Yes |
| Sqllogic tests | 0% | No |
| Smaller Binary | 48% | Yes |
| Hash Zero | 10% | Yes |
| Release Assertions with Clang | 55% | Yes |
| Tests different configurations with a debug build | 41% | Yes |
| Tests non-default vector and block size | 0% | No |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Release Assertions OSX Storage | 17% | Yes |

- Total jobs is 14 where 10 use a ccache file.
- The average ccache hit rate is 17%.

For the branch [v1.4](https://github.com/duckdb/duckdb/actions/runs/22331643758):

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Linux Memory Leaks | 10% | No |
| Release Assertions | 37% | Yes |
| Tests non-default vector and block size | 31% | Yes |
| Release Assertions with Clang | 100% | Yes |
| Extension updating test | 37% | Yes |
| Sqllogic tests | 11% | No |
| Smaller Binary | 100% | Yes |
| Release Assertions OSX Storage | 52% | Yes |
| Forcing async Sinks/Sources | 29% | Yes |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Storage Initialization Verification | 0% | Yes |
| Hash Zero | 31% | Yes |
| Tests different configurations with a debug build | 34% | Yes |

- Total jobs is 14 where 11 use a ccache file.
- The average ccache hit rate is 36%.
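The per-branch averages in the bullets are plain means over the jobs that report a percentage (`N/A` rows are skipped), rounded half up the same way `scripts/ccache_workflow_summary.py` does:

```python
# ccache_hit_rate columns of the v1.5 and v1.4 tables above, "N/A" rows excluded.
v1_5_rates = [7, 0, 0, 5, 44, 0, 48, 10, 55, 41, 0, 0, 17]
v1_4_rates = [10, 37, 31, 100, 37, 11, 100, 52, 29, 0, 0, 31, 34]


def average_hit_rate(rates):
    """Mean hit rate in percent, rounded half up."""
    return int(sum(rates) / len(rates) + 0.5)


print(average_hit_rate(v1_5_rates))  # 17
print(average_hit_rate(v1_4_rates))  # 36
```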
For the branch [main](https://github.com/duckdb/duckdb/actions/runs/22331643268):

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Forcing async Sinks/Sources | 5% | Yes |
| Linux Memory Leaks | 0% | No |
| Storage Initialization Verification | 1% | Yes |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Sqllogic tests | 0% | No |
| Release Assertions | 3% | Yes |
| Release Assertions OSX Storage | 10% | Yes |
| Tests different configurations with a debug build | 50% | Yes |
| Hash Zero | 6% | Yes |
| Extension updating test | 39% | Yes |
| Smaller Binary | 60% | Yes |
| Release Assertions with Clang | 55% | Yes |
| Tests non-default vector and block size | 0% | No |

- Total jobs is 14 where 10 use a ccache file.
- The average ccache hit rate is 18%.

### Remove ccache action for NightlyTests

The GitHub repo cache is limited to 50 GiB. We can increase this to ~80 GiB (the maximum is 100 GiB), but it's not something we can keep increasing. Looking at what is stored in the cache and why, we see that 39% of the GitHub repo cache is filled with ccache entries for the workflow [NightlyTests](https://github.com/duckdb/duckdb/actions/workflows/NightlyTests.yml). However, that workflow only runs once a night for the release branches, and its hit rates are low (17%-36% on average), so the cache does not give a meaningful speedup in those workflow runs. Disabling the ccache action for that workflow lowers the GitHub repo cache eviction pressure and lets the other ccache entries grow larger.

### Future work: Increase cache size for Main and Linux Release

Another problem is the low hit rate when a ccache file is found in a CI job.
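As a sanity check, the 39% NightlyTests share quoted above follows directly from the `total_gib` column of the cache-usage table:

```python
# total_gib column of the cache-usage summary above, in GiB.
sizes = {
    "nightlytests": 19.324,
    "extensions": 13.286,
    "main": 13.135,
    "linux-release": 1.496,
    "osx": 1.432,
    "bundlestaticlibs": 0.658,
    "unmapped/other": 0.39,
}
total = sum(sizes.values())
share = 100 * sizes["nightlytests"] / total
print(f"{total:.1f} GiB used, {share:.0f}% held by NightlyTests")
```

So the repository was sitting right at the 50 GiB limit, and dropping the NightlyTests entries frees roughly 19 GiB for the other workflows.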
This can be partially explained by the default maximum size of 500 MiB for the ccache action, while most CI jobs [generate](https://github.com/duckdb/duckdb/actions/runs/22341258586/job/64646738998#step:17:5) a ccache of `~0.6-1.0` GiB:

```
Cacheable calls:  1208 / 1216 (99.34%)
  Hits:              0 / 1208 ( 0.00%)
    Direct:          0
    Preprocessed:    0
  Misses:         1208 / 1208 (100.0%)
Errors:              8 / 1216 ( 0.66%)
Local storage:
  Cache size (GB): 1.0 /  0.5 (193.7%)   # <-- 2x allowed ccache size limit
  Cleanups:       1054
  Hits:              0 / 1208 ( 0.00%)
  Misses:         1208 / 1208 (100.0%)
```

ccache thus starts pruning files from its storage to make it fit in the default 500 MiB cache size limit. In the future, I'll explore when/where we should increase the ccache cache size limit.
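Spotting such overflowing jobs could be automated with a small parser over the stats report. This is a hypothetical helper, not part of this commit (`scripts/ccache_workflow_summary.py` only parses hit rates); the regex is modeled on the `Cache size (GB): 1.0 / 0.5 (193.7%)` line shown above.

```python
import re
from typing import Optional, Tuple

# Matches lines like "Cache size (GB): 1.0 / 0.5 (193.7%)" from the
# ccache statistics report; pattern is an assumption based on the report above.
CACHE_SIZE_RE = re.compile(
    r"Cache size \(GB\):\s*([0-9.]+)\s*/\s*([0-9.]+)\s*\(\s*([0-9.]+)\s*%\)"
)


def cache_size_usage(log_text: str) -> Optional[Tuple[float, float, float]]:
    """Return (used_gb, limit_gb, pct) from a ccache stats report, or None."""
    match = CACHE_SIZE_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2)), float(match.group(3))
```

Running it over each job's log would flag every job where `used_gb > limit_gb`, i.e. where ccache is pruning and the 500 MiB default is too small.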
2 parents fbeb793 + aec99ab commit 0ae5535

File tree

4 files changed: +474 −11 lines


.github/actions/ccache-action/action.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -3,7 +3,10 @@ runs:
   using: "composite"
   steps:
     - name: Setup Ccache
+      if: ${{ github.workflow != 'NightlyTests' }}
       uses: hendrikmuhs/ccache-action@main
       with:
         key: ${{ github.job }}
         save: ${{ github.repository != 'duckdb/duckdb' || contains('["refs/heads/main", "refs/heads/v1.4-andium", "refs/heads/v1.5-variegata"]', github.ref) }}
+        # Dump verbose ccache statistics report at end of CI job.
+        verbose: 2
```

.github/workflows/NightlyTests.yml

Lines changed: 2 additions & 11 deletions

```diff
@@ -25,7 +25,6 @@ concurrency:
 env:
   GH_TOKEN: ${{ secrets.GH_TOKEN }}
   DUCKDB_WASM_VERSION: "cf2048bd6d669ffa05c56d7d453e09e99de8b87e"
-  BRANCHES_TO_BE_CACHED: ${{ github.repository == 'duckdb/duckdb' && '["refs/heads/main", "refs/heads/v1.4-andium", "refs/heads/v1.5-variegata"]' || ''}}
   REDUCE_SYMBOLS: 1
   PYTHONUNBUFFERED: 1

@@ -251,11 +250,7 @@ jobs:
       with:
         fetch-depth: 0

-    - name: Setup Ccache
-      uses: hendrikmuhs/ccache-action@v1.2.11 # Note: pinned due to GLIBC incompatibility in later releases
-      with:
-        key: ${{ github.job }}
-        save: ${{ env.BRANCHES_TO_BE_CACHED == '' || contains(env.BRANCHES_TO_BE_CACHED, github.ref) }}
+    - uses: ./.github/actions/ccache-action

     - uses: ./.github/actions/cleanup_runner

@@ -542,11 +537,7 @@ jobs:
       shell: bash
       run: sudo apt-get update -y -qq && sudo apt-get install -y -qq ninja-build

-    - name: Setup Ccache
-      uses: hendrikmuhs/ccache-action@main
-      with:
-        key: ${{ github.job }}
-        save: ${{ env.BRANCHES_TO_BE_CACHED == '' || contains(env.BRANCHES_TO_BE_CACHED, github.ref) }}
+    - uses: ./.github/actions/ccache-action

     - uses: ./.github/actions/cleanup_runner
```

scripts/ccache_workflow_summary.py

Lines changed: 244 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Summarize ccache usage for a GitHub Actions workflow run.

Examples:
    python3 scripts/ccache_workflow_summary.py 22331644275
    python3 scripts/ccache_workflow_summary.py 22331644275 --repo duckdb/duckdb --format csv --output summary.csv
    python3 scripts/ccache_workflow_summary.py 22331644275 --keep-retries --verbose
"""

import argparse
import json
import re
import subprocess
import sys
from typing import Dict, List, Optional


HITS_RE = re.compile(r"Hits:\s*\d+\s*/\s*\d+\s*\(\s*([0-9]+(?:\.[0-9]+)?)\s*%\)")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Analyze ccache summary for a GitHub Actions run via gh CLI.")
    parser.add_argument("run_id", type=int, help="GitHub Actions run ID (for example: 22331644275).")
    parser.add_argument("--repo", default="duckdb/duckdb", help="Repository in owner/name format.")
    parser.add_argument(
        "--format",
        choices=["markdown", "table", "csv", "json"],
        default="markdown",
        help="Output format (default: markdown).",
    )
    parser.add_argument("--output", default="", help="Optional output file path; otherwise prints to stdout.")
    parser.add_argument(
        "--keep-retries",
        action="store_true",
        help="Include all job attempts. By default retries are collapsed to the latest attempt per job name.",
    )
    parser.add_argument("--verbose", action="store_true", help="Print progress messages to stderr.")
    return parser.parse_args()


def run_gh(args: List[str]) -> str:
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 0:
        cmd = " ".join(args)
        raise RuntimeError(f"Command failed ({proc.returncode}): {cmd}\n{proc.stderr.strip()}")
    return proc.stdout


def get_jobs(repo: str, run_id: int) -> List[Dict]:
    output = run_gh(["gh", "run", "view", str(run_id), "--repo", repo, "--json", "jobs"])
    payload = json.loads(output)
    return payload.get("jobs", [])


def dedupe_jobs_keep_latest(jobs: List[Dict]) -> List[Dict]:
    by_name: Dict[str, Dict] = {}
    for job in jobs:
        name = job.get("name", "")
        started_at = job.get("startedAt") or ""
        completed_at = job.get("completedAt") or ""
        dbid = int(job.get("databaseId") or 0)
        candidate_key = (started_at, completed_at, dbid)
        previous = by_name.get(name)
        if previous is None:
            by_name[name] = job
            continue
        prev_key = (
            previous.get("startedAt") or "",
            previous.get("completedAt") or "",
            int(previous.get("databaseId") or 0),
        )
        if candidate_key > prev_key:
            by_name[name] = job
    ordered_names = []
    seen = set()
    for job in jobs:
        name = job.get("name", "")
        if name not in seen:
            ordered_names.append(name)
            seen.add(name)
    return [by_name[name] for name in ordered_names if name in by_name]


def parse_hit_rate_percent(log_text: str) -> Optional[int]:
    in_stats = False
    for line in log_text.splitlines():
        if "ccache -s" in line:
            in_stats = True
            continue
        if in_stats and "ccache --version" in line:
            in_stats = False
            continue
        if not in_stats:
            continue
        match = HITS_RE.search(line)
        if match:
            value = float(match.group(1))
            return int(value + 0.5)
    return None


def parse_found_ccache_file(log_text: str) -> str:
    if "##[group]Restore cache" not in log_text:
        return "N/A"
    if "Cache hit for restore-key" in log_text or "Cache restored successfully" in log_text:
        return "Yes"
    if "No cache found." in log_text:
        return "No"
    return "No"


def format_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    widths = {col: len(col) for col in columns}
    for row in rows:
        for col in columns:
            widths[col] = max(widths[col], len(str(row.get(col, ""))))
    header = " ".join(col.ljust(widths[col]) for col in columns)
    sep = " ".join("-" * widths[col] for col in columns)
    body = [" ".join(str(row.get(col, "")).ljust(widths[col]) for col in columns) for row in rows]
    return "\n".join([header, sep] + body)


def format_markdown_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    header = "| " + " | ".join(columns) + " |"
    separator = "| " + " | ".join(["---"] * len(columns)) + " |"
    body = ["| " + " | ".join(str(row.get(col, "")) for col in columns) + " |" for row in rows]
    return "\n".join([header, separator] + body)


def to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    def escape(cell: str) -> str:
        if any(ch in cell for ch in [",", "\"", "\n"]):
            return "\"" + cell.replace("\"", "\"\"") + "\""
        return cell

    lines = [",".join(columns)]
    for row in rows:
        lines.append(",".join(escape(str(row.get(col, ""))) for col in columns))
    return "\n".join(lines)


def emit(text: str, output_path: str) -> None:
    if output_path:
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
            if not text.endswith("\n"):
                f.write("\n")
        return
    print(text)


def summarize_rows(rows: List[Dict[str, str]]) -> str:
    total_jobs = len(rows)
    found_ccache_jobs = sum(1 for row in rows if row.get("found_ccache_file") == "Yes")
    numeric_rates: List[int] = []
    for row in rows:
        rate = row.get("ccache_hit_rate", "")
        if rate.endswith("%"):
            try:
                numeric_rates.append(int(rate[:-1]))
            except ValueError:
                pass
    if numeric_rates:
        avg_rate = int(sum(numeric_rates) / len(numeric_rates) + 0.5)
        avg_text = f"{avg_rate}%"
    else:
        avg_text = "N/A"
    return (
        f"- Total jobs is {total_jobs} where {found_ccache_jobs} use a ccache file.\n"
        f"- The average ccache hit rate is {avg_text}."
    )


def main() -> int:
    args = parse_args()
    try:
        print(
            f"Fetching workflow run log files for run {args.run_id} from {args.repo}...",
            file=sys.stderr,
            flush=True,
        )
        jobs = get_jobs(args.repo, args.run_id)
        selected_jobs = jobs if args.keep_retries else dedupe_jobs_keep_latest(jobs)

        if args.verbose:
            print(
                f"Found {len(jobs)} jobs ({len(selected_jobs)} after retry filtering).",
                file=sys.stderr,
            )

        rows: List[Dict[str, str]] = []
        for job in selected_jobs:
            job_id = int(job.get("databaseId") or 0)
            name = str(job.get("name") or "")
            conclusion = str(job.get("conclusion") or "")
            if args.verbose:
                print(f"Fetching log for job {job_id}: {name}", file=sys.stderr, flush=True)

            try:
                log_text = run_gh(
                    ["gh", "run", "view", str(args.run_id), "--repo", args.repo, "--job", str(job_id), "--log"]
                )
            except RuntimeError:
                hit_rate = "N/A"
                found_ccache = "N/A"
            else:
                parsed_rate = parse_hit_rate_percent(log_text)
                hit_rate = f"{parsed_rate}%" if parsed_rate is not None else "N/A"
                found_ccache = parse_found_ccache_file(log_text)

            rows.append(
                {
                    "job_id": str(job_id),
                    "job_name": name,
                    "conclusion": conclusion,
                    "ccache_hit_rate": hit_rate,
                    "found_ccache_file": found_ccache,
                }
            )
            print(f"Processed job {job_id}: {name}", file=sys.stderr, flush=True)

        rows = [row for row in rows if row.get("conclusion") != "skipped"]
        columns = ["job_id", "job_name", "conclusion", "ccache_hit_rate", "found_ccache_file"]
        if args.format == "json":
            text = json.dumps(rows, indent=2)
        elif args.format == "csv":
            text = to_csv(rows, columns)
        elif args.format == "markdown":
            markdown_columns = ["job_name", "ccache_hit_rate", "found_ccache_file"]
            text = format_markdown_table(rows, markdown_columns) + "\n\n" + summarize_rows(rows)
        else:
            table_columns = ["job_name", "conclusion", "ccache_hit_rate", "found_ccache_file"]
            text = format_table(rows, table_columns) + "\n\n" + summarize_rows(rows)

        emit(text, args.output)
        return 0
    except RuntimeError as err:
        print(str(err), file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main())
```
