Skip to content

fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases#532

Open
ssyn0813 wants to merge 7 commits intoCortexReach:masterfrom
ssyn0813:fix/adaptive-retrieval-cjk-accuracy
Open

fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases#532
ssyn0813 wants to merge 7 commits intoCortexReach:masterfrom
ssyn0813:fix/adaptive-retrieval-cjk-accuracy

Conversation

@ssyn0813
Copy link
Copy Markdown
Contributor

@ssyn0813 ssyn0813 commented Apr 5, 2026

Summary

Fixes four bugs in the filtering/retrieval pipeline that silently drop valid content, plus addresses the slash-command regression from the previous PR (#401).

  1. \p{Emoji} regex matches digits"12345", "8080" treated as pure emoji and skipped
  2. Slash command regex too broad — file paths like /usr/bin/node misidentified as slash commands
  3. CJK short text killed in adaptive-retrievallength < 5 hard threshold fires before CJK-aware branch
  4. CJK short text killed in noise-filter — same length < 5 with no CJK awareness

Bug 3+4 create a double loss: short CJK content can neither be stored nor retrieved.

Root Cause

  • Unicode \p{Emoji} includes ASCII digits 0-9, #, * as keycap bases
  • ^\// matches any /-prefixed text, not just /command format
  • Both adaptive-retrieval.ts:78 and noise-filter.ts:76 use length < 5 without considering CJK character density

Changes

  • src/adaptive-retrieval.ts:
    • \p{Emoji}\p{Extended_Pictographic} (avoids matching digits)
    • Slash regex → ^\/[a-z][\w-]*(\s|$) (matches /help, /recall my name, but not /usr/bin/node)
    • CJK-aware hard threshold + digit exemption for port/issue numbers
    • Slash command guard on FORCE_RETRIEVE to prevent /recall bypassing skip logic
  • src/noise-filter.ts:
    • CJK-aware hard threshold (CJK: length < 2, non-CJK: length < 5)
    • Boilerplate regex anchored to avoid false positives on longer sentences
  • test/adaptive-retrieval.test.mjs: 23 tests — emoji, slash (with and without arguments), CJK, existing behavior
  • test/noise-filter.test.mjs: 15 tests — CJK, English, patterns, options, filterNoise generic
  • package.json: Register both new test files

Addressed Review Feedback from #401

  • Slash commands with arguments (/recall my name, /remember content, /lesson text) are now correctly skipped — regression test added
  • Rebased on latest master — no conflicts

Test Plan

  • node --test test/adaptive-retrieval.test.mjs — 23/23 pass
  • node --test test/noise-filter.test.mjs — 15/15 pass

Supersedes: #401
Related: #127

ssyn0813 and others added 7 commits April 5, 2026 06:10
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use \p{Extended_Pictographic} instead of \p{Emoji} to avoid matching digits
- Narrow slash command regex to /word format, no longer matches file paths
- Add CJK-aware hard threshold (length < 2 for CJK, < 5 for non-CJK)
- Exempt digit-containing strings (port numbers, issue IDs) from length thresholds
- Lower CJK defaultMinLength from 6 to 3 for short meaningful CJK queries
- Lower non-CJK defaultMinLength from 15 to 13 to allow file path queries
- Prevent FORCE_RETRIEVE from hijacking slash commands like /recall
Short CJK text (2+ chars) is no longer falsely marked as noise.
Uses same CJK detection pattern as adaptive-retrieval.

Also tightens boilerplate greeting regex to only match standalone
greetings (≤1 trailing word), so real memories starting with "hello"
are not incorrectly filtered.
Change slash command regex from ^\/[a-z][\w-]*\s*$ to ^\/[a-z][\w-]*(\s|$)
so that argument-bearing commands like /recall my name, /remember content,
and /lesson text are still recognized and skipped.

File paths like /usr/bin/node remain unmatched because the second segment
starts with / not whitespace.

Add regression tests for argument-bearing slash commands.
Copy link
Copy Markdown
Collaborator

@AliceLJY AliceLJY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-scoped fix for three real edge cases:

  1. Emoji regex (\p{Emoji}\p{Extended_Pictographic}): 修复数字被误匹配为 emoji 的问题,#1238080 这类携带语义的输入不再被 skip
  2. Slash command regex (/^\///^\/[a-z][\w-]*(\s|$)/i): 区分 /help/usr/bin/node,文件路径不再误杀
  3. CJK threshold 下调合理(CJK 2 chars, EN 13 chars),含数字的字符串豁免长度截断

Tests 覆盖了每个 fix 的正反面,包括 existing behavior preservation。isNoise 的 CJK threshold 同步修复保持一致性。

LGTM ✅

@rwmjhb
Copy link
Copy Markdown
Collaborator

rwmjhb commented Apr 6, 2026

Review Summary

Automated multi-round review (6 rounds, Claude + Codex adversarial). Value score: 73%. Fixing the CJK content loss and emoji/slash-command regex bugs is important — good work identifying these.

Must Fix

  1. Build failurenpm test chain timed out at 180s. The sequential test structure may structurally exceed this budget. CI cannot certify the PR end-to-end.

  2. Stale base — PR is not rebased on current main despite claiming so. Integration risk is unverified. Please rebase.

Nice to Have

  • Digit exemption too broad — Single-char digit strings like "1" or "2" now escape all filters and could accumulate as low-value memories.
  • CJK soft threshold 6→3 — May allow short CJK affirmations (not in SKIP_PATTERNS) to trigger unnecessary retrieval.
  • Non-CJK soft threshold 15→13 — Behavioral change not documented or covered by boundary tests. Scope drift.
  • Write-side CJK fix incompletesmart-extractor.ts still hard-drops abstracts shorter than 5 characters, so CJK content saved by this PR's read-side fix may still be lost on the write side.
  • Slash-path fix partial — Single-segment root-style paths (e.g., /etc) still match the new slash-command regex.

Questions

  • Is the test timeout pre-existing on main or introduced by this PR?
  • Is the defaultMinLength 15→13 change intentional? If so, please document it.

Please address the two must-fix items (rebase + fix build/test timeout). Once green, this is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants