fix: improve adaptive retrieval and noise filter accuracy for CJK and edge cases #532
ssyn0813 wants to merge 7 commits into CortexReach:master
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use \p{Extended_Pictographic} instead of \p{Emoji} to avoid matching digits
- Narrow slash command regex to /word format, no longer matches file paths
- Add CJK-aware hard threshold (length < 2 for CJK, < 5 for non-CJK)
- Exempt digit-containing strings (port numbers, issue IDs) from length thresholds
- Lower CJK defaultMinLength from 6 to 3 for short meaningful CJK queries
- Lower non-CJK defaultMinLength from 15 to 13 to allow file path queries
- Prevent FORCE_RETRIEVE from hijacking slash commands like /recall
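The first bullet can be checked directly. The sketch below compares the two Unicode property classes on a "pure emoji" test; the constant names and the exact shape of the pattern are assumptions for illustration, not the code in src/adaptive-retrieval.ts. The patterns are built with the RegExp constructor only so the snippet does not depend on the compile target.

```typescript
// Hypothetical "pure emoji" checks; the real pattern in the PR may differ.
// \p{Emoji} includes the keycap bases 0-9, # and *, so digit strings match it.
const OLD_PURE_EMOJI = new RegExp("^[\\p{Emoji}\\s]+$", "u");
// \p{Extended_Pictographic} does not include ASCII digits, # or *.
const NEW_PURE_EMOJI = new RegExp("^[\\p{Extended_Pictographic}\\s]+$", "u");

const samples: string[] = ["8080", "#123", "👍🎉"];
for (const s of samples) {
  console.log(s, "old:", OLD_PURE_EMOJI.test(s), "new:", NEW_PURE_EMOJI.test(s));
}
```

With the old class, "8080" and "#123" count as pure emoji and would be skipped; with the new class only the actual pictographs match.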
Short CJK text (2+ chars) is no longer falsely marked as noise. Uses same CJK detection pattern as adaptive-retrieval. Also tightens boilerplate greeting regex to only match standalone greetings (≤1 trailing word), so real memories starting with "hello" are not incorrectly filtered.
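A minimal sketch of the two noise-filter rules described in this commit, assuming a simple range-based CJK check and a small greeting list. The names isNoiseSketch, CJK_RANGE, and STANDALONE_GREETING, the character ranges, and the greeting words are all hypothetical; the patterns in src/noise-filter.ts are not shown in this PR text.

```typescript
// Assumed CJK detection: kana, CJK unified ideographs, Hangul syllables.
const CJK_RANGE = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/;

// Assumed greeting pattern: a greeting word followed by at most one more token.
const STANDALONE_GREETING = /^(hi|hello|hey)\b[\s,!.]*(\S+)?[\s!.?]*$/i;

function isNoiseSketch(text: string): boolean {
  const t = text.trim();
  // CJK-aware hard threshold: length < 2 for CJK, length < 5 otherwise.
  const minLength = CJK_RANGE.test(t) ? 2 : 5;
  if (t.length < minLength) return true;
  // Only standalone greetings are boilerplate; longer text starting
  // with "hello" is kept.
  return STANDALONE_GREETING.test(t);
}

console.log(isNoiseSketch("你好"));
console.log(isNoiseSketch("hello there"));
console.log(isNoiseSketch("hello world program crashed on startup"));
```

Under these assumptions a two-character CJK string is kept, a bare greeting is filtered, and a real memory that merely starts with "hello" is kept.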
Change slash command regex from ^\/[a-z][\w-]*\s*$ to ^\/[a-z][\w-]*(\s|$) so that argument-bearing commands like /recall my name, /remember content, and /lesson text are still recognized and skipped. File paths like /usr/bin/node remain unmatched because the first segment (/usr) is followed by / rather than whitespace or end of input. Add regression tests for argument-bearing slash commands.
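The three variants of the slash-command pattern mentioned in this PR can be compared side by side. The regexes are taken from the commit message and review text; the constant names and the use of the i flag on all three are assumptions for illustration.

```typescript
// Original pattern: any text starting with "/" counts as a command.
const SLASH_ANY = /^\//;
// Intermediate pattern: only a bare /word, so arguments break the match.
const SLASH_BARE = /^\/[a-z][\w-]*\s*$/i;
// Final pattern: /word followed by whitespace or end of input.
const SLASH_WITH_ARGS = /^\/[a-z][\w-]*(\s|$)/i;

const inputs: string[] = ["/help", "/recall my name", "/usr/bin/node"];
for (const input of inputs) {
  console.log(
    input,
    "any:", SLASH_ANY.test(input),
    "bare:", SLASH_BARE.test(input),
    "withArgs:", SLASH_WITH_ARGS.test(input),
  );
}
```

Only the final pattern treats both /help and /recall my name as commands while leaving /usr/bin/node alone.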
AliceLJY
left a comment
Clean, well-scoped fix for three real edge cases:
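- Emoji regex (\p{Emoji} → \p{Extended_Pictographic}): fixes digits being mis-matched as emoji; meaning-bearing inputs such as #123 and 8080 are no longer skipped
- Slash command regex (/^\// → /^\/[a-z][\w-]*(\s|$)/i): distinguishes /help from /usr/bin/node, so file paths are no longer wrongly filtered out
- Lowering the CJK threshold is reasonable (CJK 2 chars, EN 13 chars), and digit-containing strings are exempt from the length cutoff
The tests cover the positive and negative cases of each fix, including existing behavior preservation. Applying the same CJK threshold fix to isNoise keeps the two filters consistent.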
LGTM ✅
Review Summary
Automated multi-round review (6 rounds, Claude + Codex adversarial). Value score: 73%. Fixing the CJK content loss and the emoji/slash-command regex bugs is important; good work identifying these.
Must Fix
Nice to Have
Questions
Please address the two must-fix items (rebase + fix build/test timeout). Once green, this is ready to merge.
Summary
Fixes four bugs in the filtering/retrieval pipeline that silently drop valid content, plus addresses the slash-command regression from the previous PR (#401).
1. \p{Emoji} regex matches digits: "12345" and "8080" are treated as pure emoji and skipped
2. /usr/bin/node is misidentified as a slash command
3. The length < 5 hard threshold fires before the CJK-aware branch
4. length < 5 is applied with no CJK awareness
Bug 3+4 create a double loss: short CJK content can neither be stored nor retrieved.
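The ordering problem behind the length-threshold bugs can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the CJK character ranges, and the way the hard threshold, digit exemption, and defaultMinLength are combined are hypothetical, and only the numeric values (hard threshold 2 vs 5, minimum length 6 → 3 for CJK and 15 → 13 for non-CJK) come from the commit messages.

```typescript
// Assumed CJK detection; the PR's actual pattern is not shown.
const CJK_CHARS = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/;

// Before (simplified): the generic cutoff runs first, so short CJK
// queries are skipped before any CJK-specific logic is reached.
function skipBefore(query: string): boolean {
  const q = query.trim();
  if (q.length < 5) return true;
  const minLength = CJK_CHARS.test(q) ? 6 : 15;
  return q.length < minLength;
}

// After (simplified): digits exempt the query from length thresholds,
// and both thresholds depend on whether the query contains CJK.
function skipAfter(query: string): boolean {
  const q = query.trim();
  if (/\d/.test(q)) return false;
  const isCjk = CJK_CHARS.test(q);
  if (q.length < (isCjk ? 2 : 5)) return true;
  return q.length < (isCjk ? 3 : 13);
}

for (const q of ["你好吗", "8080", "ok"]) {
  console.log(q, "before:", skipBefore(q), "after:", skipAfter(q));
}
```

In this sketch a three-character CJK query and a port number are skipped by the old order but pass the new one, while a two-letter English string is still skipped.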
Root Cause
- \p{Emoji} includes ASCII digits 0-9, #, * as keycap bases
- ^\// matches any /-prefixed text, not just the /command format
- adaptive-retrieval.ts:78 and noise-filter.ts:76 use length < 5 without considering CJK character density
Changes
src/adaptive-retrieval.ts:
- \p{Emoji} → \p{Extended_Pictographic} (avoids matching digits)
- Slash command regex ^\/[a-z][\w-]*(\s|$) (matches /help, /recall my name, but not /usr/bin/node)
- /recall no longer bypasses skip logic
src/noise-filter.ts:
- CJK-aware hard threshold (CJK: length < 2, non-CJK: length < 5)
test/adaptive-retrieval.test.mjs: 23 tests covering emoji, slash (with and without arguments), CJK, existing behavior
test/noise-filter.test.mjs: 15 tests covering CJK, English, patterns, options, filterNoise generic
package.json: Register both new test files
Addressed Review Feedback from #401
- Argument-bearing slash commands (/recall my name, /remember content, /lesson text) are now correctly skipped; regression test added
Test Plan
- node --test test/adaptive-retrieval.test.mjs: 23/23 pass
- node --test test/noise-filter.test.mjs: 15/15 pass
Supersedes: #401
Related: #127