
Commit 0ae5535

Optimize GitHub CI cache utilization (#21066)
I created two scripts to analyse the GitHub runner [cache](https://github.com/duckdb/duckdb/actions/caches) items:

- `scripts/github_cache_usage.py`: summarize entries, total size, and relative size per workflow.
- `scripts/ccache_workflow_summary.py`: summarize ccache usage for a GitHub workflow run.

### duckdb/duckdb cache usage

I've asked Mark to change the total cache size of the repo `duckdb/duckdb` from the default 10 GiB to 50 GiB. However, we're still reaching that limit, so GitHub evicts cache entries. Summarizing the cache entries gives:

```sql
┌──────────────────┬─────────────┬───────────┬──────────────┐
│ workflow         │ entry_count │ total_gib │ pct_of_total │
├──────────────────┼─────────────┼───────────┼──────────────┤
│ nightlytests     │          42 │    19.324 │         39.0 │
│ extensions       │          48 │    13.286 │         27.0 │
│ main             │          32 │    13.135 │         26.0 │
│ linux-release    │           6 │     1.496 │          3.0 │
│ osx              │           8 │     1.432 │          3.0 │
│ bundlestaticlibs │           4 │     0.658 │          1.0 │
│ unmapped/other   │           8 │      0.39 │          1.0 │
└──────────────────┴─────────────┴───────────┴──────────────┘
```

NightlyTests mostly runs on a nightly schedule, but still **consumes 39% (!) of the total repo cache**.
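The grouping behind a summary like the one above can be sketched in a few lines. This is a minimal sketch, not the actual `scripts/github_cache_usage.py` (which is not shown in this diff): `summarize_cache_entries` and `key_to_workflow` are hypothetical names, and the only assumed input shape is the `key`/`size_in_bytes` fields returned by the GitHub REST endpoint `GET /repos/{owner}/{repo}/actions/caches`.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def summarize_cache_entries(
    entries: List[Dict], key_to_workflow: Callable[[str], Optional[str]]
) -> List[Dict]:
    """Group cache entries by workflow and compute size shares.

    `entries` are dicts shaped like the items of the GitHub REST response
    for GET /repos/{owner}/{repo}/actions/caches; only `key` and
    `size_in_bytes` are used. Keys that `key_to_workflow` cannot map
    land in the "unmapped/other" bucket.
    """
    buckets = defaultdict(lambda: {"entry_count": 0, "total_bytes": 0})
    for entry in entries:
        workflow = key_to_workflow(entry["key"]) or "unmapped/other"
        buckets[workflow]["entry_count"] += 1
        buckets[workflow]["total_bytes"] += entry["size_in_bytes"]
    grand_total = sum(b["total_bytes"] for b in buckets.values()) or 1
    # Largest consumers first, mirroring the table layout above.
    rows = []
    for workflow, b in sorted(buckets.items(), key=lambda kv: -kv[1]["total_bytes"]):
        rows.append(
            {
                "workflow": workflow,
                "entry_count": b["entry_count"],
                "total_gib": round(b["total_bytes"] / 2**30, 3),
                "pct_of_total": round(100 * b["total_bytes"] / grand_total),
            }
        )
    return rows
```

The entries themselves can be fetched with `gh api repos/duckdb/duckdb/actions/caches --paginate` and fed straight into this function.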
### Ccache hit rate for nightly tests

Analysing the ccache hit rate for the branch [v1.5](https://github.com/duckdb/duckdb/actions/runs/22331644275) gives:

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Forcing async Sinks/Sources | 7% | Yes |
| Linux Memory Leaks | 0% | No |
| Release Assertions | 0% | Yes |
| Storage Initialization Verification | 5% | Yes |
| Extension updating test | 44% | Yes |
| Sqllogic tests | 0% | No |
| Smaller Binary | 48% | Yes |
| Hash Zero | 10% | Yes |
| Release Assertions with Clang | 55% | Yes |
| Tests different configurations with a debug build | 41% | Yes |
| Tests non-default vector and block size | 0% | No |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Release Assertions OSX Storage | 17% | Yes |

- Total jobs is 14 where 10 use a ccache file.
- The average ccache hit rate is 17%.

For the branch [v1.4](https://github.com/duckdb/duckdb/actions/runs/22331643758):

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Linux Memory Leaks | 10% | No |
| Release Assertions | 37% | Yes |
| Tests non-default vector and block size | 31% | Yes |
| Release Assertions with Clang | 100% | Yes |
| Extension updating test | 37% | Yes |
| Sqllogic tests | 11% | No |
| Smaller Binary | 100% | Yes |
| Release Assertions OSX Storage | 52% | Yes |
| Forcing async Sinks/Sources | 29% | Yes |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Storage Initialization Verification | 0% | Yes |
| Hash Zero | 31% | Yes |
| Tests different configurations with a debug build | 34% | Yes |

- Total jobs is 14 where 11 use a ccache file.
- The average ccache hit rate is 36%.
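The per-branch averages in the bullets are plain means over the jobs that report a percentage (`N/A` rows are skipped), rounded half up the same way `scripts/ccache_workflow_summary.py` does:

```python
# ccache_hit_rate columns of the v1.5 and v1.4 tables above, "N/A" rows excluded.
v1_5_rates = [7, 0, 0, 5, 44, 0, 48, 10, 55, 41, 0, 0, 17]
v1_4_rates = [10, 37, 31, 100, 37, 11, 100, 52, 29, 0, 0, 31, 34]


def average_hit_rate(rates):
    """Mean hit rate in percent, rounded half up."""
    return int(sum(rates) / len(rates) + 0.5)


print(average_hit_rate(v1_5_rates))  # 17
print(average_hit_rate(v1_4_rates))  # 36
```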
For the branch [main](https://github.com/duckdb/duckdb/actions/runs/22331643268):

| job_name | ccache_hit_rate | found_ccache_file |
| --- | --- | --- |
| check-draft | N/A | N/A |
| Forcing async Sinks/Sources | 5% | Yes |
| Linux Memory Leaks | 0% | No |
| Storage Initialization Verification | 1% | Yes |
| Regression Tests between safe and unsafe builds | 0% | Yes |
| Sqllogic tests | 0% | No |
| Release Assertions | 3% | Yes |
| Release Assertions OSX Storage | 10% | Yes |
| Tests different configurations with a debug build | 50% | Yes |
| Hash Zero | 6% | Yes |
| Extension updating test | 39% | Yes |
| Smaller Binary | 60% | Yes |
| Release Assertions with Clang | 55% | Yes |
| Tests non-default vector and block size | 0% | No |

- Total jobs is 14 where 10 use a ccache file.
- The average ccache hit rate is 18%.

### Remove ccache action for NightlyTests

The GitHub repo cache is limited to 50 GiB. We can increase this to ~80 GiB (the maximum is 100 GiB), but it's not something we can keep increasing. Looking at what is stored in the cache and why, we see that 39% of the GitHub repo cache is filled with ccache entries for the workflow [NightlyTests](https://github.com/duckdb/duckdb/actions/workflows/NightlyTests.yml). However, that workflow only runs once a night for the release branches, and its hit rates are low (17%-36% on average), so the cache does not give a meaningful speedup in those workflow runs. Disabling the ccache action for that workflow lowers the GitHub repo cache eviction pressure and lets the other ccache entries grow larger.

### Future work: Increase cache size for Main and Linux Release

Another problem is the low hit rate when a ccache file is found in a CI job.
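As a sanity check, the 39% NightlyTests share quoted above follows directly from the `total_gib` column of the cache-usage table:

```python
# total_gib column of the cache-usage summary above, in GiB.
sizes = {
    "nightlytests": 19.324,
    "extensions": 13.286,
    "main": 13.135,
    "linux-release": 1.496,
    "osx": 1.432,
    "bundlestaticlibs": 0.658,
    "unmapped/other": 0.39,
}
total = sum(sizes.values())
share = 100 * sizes["nightlytests"] / total
print(f"{total:.1f} GiB used, {share:.0f}% held by NightlyTests")
```

So the repository was sitting right at the 50 GiB limit, and dropping the NightlyTests entries frees roughly 19 GiB for the other workflows.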
This can be partially explained by the default maximum size of 500 MiB for the ccache action, while most CI jobs [generate](https://github.com/duckdb/duckdb/actions/runs/22341258586/job/64646738998#step:17:5) a ccache of `~0.6-1.0` GiB:

```
Cacheable calls:  1208 / 1216 (99.34%)
  Hits:              0 / 1208 ( 0.00%)
    Direct:          0
    Preprocessed:    0
  Misses:         1208 / 1208 (100.0%)
Errors:              8 / 1216 ( 0.66%)
Local storage:
  Cache size (GB): 1.0 /  0.5 (193.7%)   # <-- 2x allowed ccache size limit
  Cleanups:       1054
  Hits:              0 / 1208 ( 0.00%)
  Misses:         1208 / 1208 (100.0%)
```

ccache thus starts pruning files from its storage to make it fit in the default 500 MiB cache size limit. In the future, I'll explore when/where we should increase the ccache cache size limit.
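Spotting such overflowing jobs could be automated with a small parser over the stats report. This is a hypothetical helper, not part of this commit (`scripts/ccache_workflow_summary.py` only parses hit rates); the regex is modeled on the `Cache size (GB): 1.0 / 0.5 (193.7%)` line shown above.

```python
import re
from typing import Optional, Tuple

# Matches lines like "Cache size (GB): 1.0 / 0.5 (193.7%)" from the
# ccache statistics report; pattern is an assumption based on the report above.
CACHE_SIZE_RE = re.compile(
    r"Cache size \(GB\):\s*([0-9.]+)\s*/\s*([0-9.]+)\s*\(\s*([0-9.]+)\s*%\)"
)


def cache_size_usage(log_text: str) -> Optional[Tuple[float, float, float]]:
    """Return (used_gb, limit_gb, pct) from a ccache stats report, or None."""
    match = CACHE_SIZE_RE.search(log_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2)), float(match.group(3))
```

Running it over each job's log would flag every job where `used_gb > limit_gb`, i.e. where ccache is pruning and the 500 MiB default is too small.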
2 parents fbeb793 + aec99ab commit 0ae5535

File tree

4 files changed: +474 −11 lines


.github/actions/ccache-action/action.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -3,7 +3,10 @@ runs:
   using: "composite"
   steps:
     - name: Setup Ccache
+      if: ${{ github.workflow != 'NightlyTests' }}
       uses: hendrikmuhs/ccache-action@main
       with:
         key: ${{ github.job }}
         save: ${{ github.repository != 'duckdb/duckdb' || contains('["refs/heads/main", "refs/heads/v1.4-andium", "refs/heads/v1.5-variegata"]', github.ref) }}
+        # Dump verbose ccache statistics report at end of CI job.
+        verbose: 2
```

.github/workflows/NightlyTests.yml

Lines changed: 2 additions & 11 deletions

```diff
@@ -25,7 +25,6 @@ concurrency:
 env:
   GH_TOKEN: ${{ secrets.GH_TOKEN }}
   DUCKDB_WASM_VERSION: "cf2048bd6d669ffa05c56d7d453e09e99de8b87e"
-  BRANCHES_TO_BE_CACHED: ${{ github.repository == 'duckdb/duckdb' && '["refs/heads/main", "refs/heads/v1.4-andium", "refs/heads/v1.5-variegata"]' || ''}}
   REDUCE_SYMBOLS: 1
   PYTHONUNBUFFERED: 1

@@ -251,11 +250,7 @@ jobs:
       with:
         fetch-depth: 0

-    - name: Setup Ccache
-      uses: hendrikmuhs/ccache-action@v1.2.11 # Note: pinned due to GLIBC incompatibility in later releases
-      with:
-        key: ${{ github.job }}
-        save: ${{ env.BRANCHES_TO_BE_CACHED == '' || contains(env.BRANCHES_TO_BE_CACHED, github.ref) }}
+    - uses: ./.github/actions/ccache-action

     - uses: ./.github/actions/cleanup_runner

@@ -542,11 +537,7 @@ jobs:
       shell: bash
       run: sudo apt-get update -y -qq && sudo apt-get install -y -qq ninja-build

-    - name: Setup Ccache
-      uses: hendrikmuhs/ccache-action@main
-      with:
-        key: ${{ github.job }}
-        save: ${{ env.BRANCHES_TO_BE_CACHED == '' || contains(env.BRANCHES_TO_BE_CACHED, github.ref) }}
+    - uses: ./.github/actions/ccache-action

     - uses: ./.github/actions/cleanup_runner
```

scripts/ccache_workflow_summary.py

Lines changed: 244 additions & 0 deletions

```python
#!/usr/bin/env python3
"""
Summarize ccache usage for a GitHub Actions workflow run.

Examples:
    python3 scripts/ccache_workflow_summary.py 22331644275
    python3 scripts/ccache_workflow_summary.py 22331644275 --repo duckdb/duckdb --format csv --output summary.csv
    python3 scripts/ccache_workflow_summary.py 22331644275 --keep-retries --verbose
"""

import argparse
import json
import re
import subprocess
import sys
from typing import Dict, List, Optional


HITS_RE = re.compile(r"Hits:\s*\d+\s*/\s*\d+\s*\(\s*([0-9]+(?:\.[0-9]+)?)\s*%\)")


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Analyze ccache summary for a GitHub Actions run via gh CLI.")
    parser.add_argument("run_id", type=int, help="GitHub Actions run ID (for example: 22331644275).")
    parser.add_argument("--repo", default="duckdb/duckdb", help="Repository in owner/name format.")
    parser.add_argument(
        "--format",
        choices=["markdown", "table", "csv", "json"],
        default="markdown",
        help="Output format (default: markdown).",
    )
    parser.add_argument("--output", default="", help="Optional output file path; otherwise prints to stdout.")
    parser.add_argument(
        "--keep-retries",
        action="store_true",
        help="Include all job attempts. By default retries are collapsed to the latest attempt per job name.",
    )
    parser.add_argument("--verbose", action="store_true", help="Print progress messages to stderr.")
    return parser.parse_args()


def run_gh(args: List[str]) -> str:
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 0:
        cmd = " ".join(args)
        raise RuntimeError(f"Command failed ({proc.returncode}): {cmd}\n{proc.stderr.strip()}")
    return proc.stdout


def get_jobs(repo: str, run_id: int) -> List[Dict]:
    output = run_gh(["gh", "run", "view", str(run_id), "--repo", repo, "--json", "jobs"])
    payload = json.loads(output)
    return payload.get("jobs", [])


def dedupe_jobs_keep_latest(jobs: List[Dict]) -> List[Dict]:
    by_name: Dict[str, Dict] = {}
    for job in jobs:
        name = job.get("name", "")
        started_at = job.get("startedAt") or ""
        completed_at = job.get("completedAt") or ""
        dbid = int(job.get("databaseId") or 0)
        candidate_key = (started_at, completed_at, dbid)
        previous = by_name.get(name)
        if previous is None:
            by_name[name] = job
            continue
        prev_key = (
            previous.get("startedAt") or "",
            previous.get("completedAt") or "",
            int(previous.get("databaseId") or 0),
        )
        if candidate_key > prev_key:
            by_name[name] = job
    ordered_names = []
    seen = set()
    for job in jobs:
        name = job.get("name", "")
        if name not in seen:
            ordered_names.append(name)
            seen.add(name)
    return [by_name[name] for name in ordered_names if name in by_name]


def parse_hit_rate_percent(log_text: str) -> Optional[int]:
    in_stats = False
    for line in log_text.splitlines():
        if "ccache -s" in line:
            in_stats = True
            continue
        if in_stats and "ccache --version" in line:
            in_stats = False
            continue
        if not in_stats:
            continue
        match = HITS_RE.search(line)
        if match:
            value = float(match.group(1))
            return int(value + 0.5)
    return None


def parse_found_ccache_file(log_text: str) -> str:
    if "##[group]Restore cache" not in log_text:
        return "N/A"
    if "Cache hit for restore-key" in log_text or "Cache restored successfully" in log_text:
        return "Yes"
    if "No cache found." in log_text:
        return "No"
    return "No"


def format_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    widths = {col: len(col) for col in columns}
    for row in rows:
        for col in columns:
            widths[col] = max(widths[col], len(str(row.get(col, ""))))
    header = " ".join(col.ljust(widths[col]) for col in columns)
    sep = " ".join("-" * widths[col] for col in columns)
    body = [" ".join(str(row.get(col, "")).ljust(widths[col]) for col in columns) for row in rows]
    return "\n".join([header, sep] + body)


def format_markdown_table(rows: List[Dict[str, str]], columns: List[str]) -> str:
    header = "| " + " | ".join(columns) + " |"
    separator = "| " + " | ".join(["---"] * len(columns)) + " |"
    body = ["| " + " | ".join(str(row.get(col, "")) for col in columns) + " |" for row in rows]
    return "\n".join([header, separator] + body)


def to_csv(rows: List[Dict[str, str]], columns: List[str]) -> str:
    def escape(cell: str) -> str:
        if any(ch in cell for ch in [",", "\"", "\n"]):
            return "\"" + cell.replace("\"", "\"\"") + "\""
        return cell

    lines = [",".join(columns)]
    for row in rows:
        lines.append(",".join(escape(str(row.get(col, ""))) for col in columns))
    return "\n".join(lines)


def emit(text: str, output_path: str) -> None:
    if output_path:
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
            if not text.endswith("\n"):
                f.write("\n")
        return
    print(text)


def summarize_rows(rows: List[Dict[str, str]]) -> str:
    total_jobs = len(rows)
    found_ccache_jobs = sum(1 for row in rows if row.get("found_ccache_file") == "Yes")
    numeric_rates: List[int] = []
    for row in rows:
        rate = row.get("ccache_hit_rate", "")
        if rate.endswith("%"):
            try:
                numeric_rates.append(int(rate[:-1]))
            except ValueError:
                pass
    if numeric_rates:
        avg_rate = int(sum(numeric_rates) / len(numeric_rates) + 0.5)
        avg_text = f"{avg_rate}%"
    else:
        avg_text = "N/A"
    return (
        f"- Total jobs is {total_jobs} where {found_ccache_jobs} use a ccache file.\n"
        f"- The average ccache hit rate is {avg_text}."
    )


def main() -> int:
    args = parse_args()
    try:
        print(
            f"Fetching workflow run log files for run {args.run_id} from {args.repo}...",
            file=sys.stderr,
            flush=True,
        )
        jobs = get_jobs(args.repo, args.run_id)
        selected_jobs = jobs if args.keep_retries else dedupe_jobs_keep_latest(jobs)

        if args.verbose:
            print(
                f"Found {len(jobs)} jobs ({len(selected_jobs)} after retry filtering).",
                file=sys.stderr,
            )

        rows: List[Dict[str, str]] = []
        for job in selected_jobs:
            job_id = int(job.get("databaseId") or 0)
            name = str(job.get("name") or "")
            conclusion = str(job.get("conclusion") or "")
            if args.verbose:
                print(f"Fetching log for job {job_id}: {name}", file=sys.stderr, flush=True)

            try:
                log_text = run_gh(
                    ["gh", "run", "view", str(args.run_id), "--repo", args.repo, "--job", str(job_id), "--log"]
                )
            except RuntimeError:
                hit_rate = "N/A"
                found_ccache = "N/A"
            else:
                parsed_rate = parse_hit_rate_percent(log_text)
                hit_rate = f"{parsed_rate}%" if parsed_rate is not None else "N/A"
                found_ccache = parse_found_ccache_file(log_text)

            rows.append(
                {
                    "job_id": str(job_id),
                    "job_name": name,
                    "conclusion": conclusion,
                    "ccache_hit_rate": hit_rate,
                    "found_ccache_file": found_ccache,
                }
            )
            print(f"Processed job {job_id}: {name}", file=sys.stderr, flush=True)

        rows = [row for row in rows if row.get("conclusion") != "skipped"]
        columns = ["job_id", "job_name", "conclusion", "ccache_hit_rate", "found_ccache_file"]
        if args.format == "json":
            text = json.dumps(rows, indent=2)
        elif args.format == "csv":
            text = to_csv(rows, columns)
        elif args.format == "markdown":
            markdown_columns = ["job_name", "ccache_hit_rate", "found_ccache_file"]
            text = format_markdown_table(rows, markdown_columns) + "\n\n" + summarize_rows(rows)
        else:
            table_columns = ["job_name", "conclusion", "ccache_hit_rate", "found_ccache_file"]
            text = format_table(rows, table_columns) + "\n\n" + summarize_rows(rows)

        emit(text, args.output)
        return 0
    except RuntimeError as err:
        print(str(err), file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main())
```
