[v1.37] Tokenization improvements #388

Open
g-despot wants to merge 2 commits into v1-37/main from v1-37/tokenizer
g-despot commented Apr 10, 2026

Summary

  • Accent folding: Document new textAnalyzer.asciiFold and asciiFoldIgnore property-level config for normalizing accented Latin characters to ASCII equivalents during indexing and querying.
  • Custom and per-property stopwords: Document invertedIndexConfig.stopwordPresets for named stopword lists and textAnalyzer.stopwordPreset for per-property overrides.
  • Tokenize endpoint: Document POST /v1/tokenize (freeform) and POST /v1/schema/{class}/properties/{prop}/tokenize (property-based) endpoints for testing tokenization.
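
To make the accent folding item concrete, here is a minimal Python sketch of what folding accented Latin characters to ASCII looks like. It illustrates the concept only and is not the server's implementation: the `ascii_fold` helper is hypothetical, and treating `asciiFoldIgnore` as a set of characters left unfolded is an assumption based on the option's name.

```python
import unicodedata


def ascii_fold(text: str, ignore: frozenset = frozenset()) -> str:
    """Fold accented Latin characters to their base letters.

    `ignore` is a hypothetical stand-in for asciiFoldIgnore: characters
    listed there are passed through unchanged (assumed semantics).
    """
    folded = []
    # Compose first so each accented letter is a single code point
    # that can be matched against the ignore set.
    for ch in unicodedata.normalize("NFC", text):
        if ch in ignore:
            folded.append(ch)
            continue
        # Decompose into base letter plus combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        folded.append(
            "".join(c for c in decomposed if not unicodedata.combining(c))
        )
    return "".join(folded)
```

With folding applied at both index and query time, a query for "cafe" and a document containing "café" reduce to the same token. This decomposition-based approach does not handle letters without a Unicode decomposition (for example "ø" or "ß"); a real analyzer would need an explicit mapping for those.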

Pages modified

  • concepts/indexing/inverted-index.md: New accent folding section, expanded stop words
  • concepts/search/keyword-search.md: Cross-links to new features
  • config-refs/indexing/inverted-index.mdx: stopwordPresets, textAnalyzer, and tokenize endpoint reference
  • tutorials/tokenization.md: Examples 4 (accent folding), 5 (custom stopwords), and 6 (tokenize endpoint)

orca-security-eu bot left a comment

Orca Security Scan Summary

| Status | Check | High | Medium | Low | Info |
|--------|-------|------|--------|-----|------|
| Passed | Infrastructure as Code | 0 | 0 | 0 | 0 |
| Passed | SAST | 0 | 2 | 0 | 0 |
| Passed | Secrets | 0 | 0 | 0 | 0 |
| Passed | Vulnerabilities | 0 | 0 | 0 | 0 |

🛡️ The following SAST misconfigurations have been detected

| Severity | Name | File |
|----------|------|------|
| medium | Missing Timeout in Requests Module Can Cause DoS | ...tokenize_endpoint.py |
| medium | Missing Timeout in Requests Module Can Cause DoS | ...tokenize_endpoint.py |
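
Both SAST findings concern HTTP calls made without a timeout. A minimal sketch of one way to address them when calling the freeform endpoint is below. The `tokenize_freeform` helper, the default timeout values, and the assumption that the endpoint returns JSON are illustrative and not taken from this PR; the request body schema is not shown here, so `payload` is passed through unchanged.

```python
from typing import Any, Optional, Tuple

# (connect timeout, read timeout) in seconds; illustrative values only.
DEFAULT_TIMEOUT: Tuple[float, float] = (3.05, 10.0)


def tokenize_freeform(
    base_url: str,
    payload: dict,
    timeout: Tuple[float, float] = DEFAULT_TIMEOUT,
    session: Optional[Any] = None,
) -> Any:
    """POST `payload` to /v1/tokenize with an explicit timeout.

    Hypothetical helper: `payload` must follow the endpoint's documented
    request schema, which this sketch does not define.
    """
    if session is None:
        # Imported lazily so callers can inject their own session object.
        import requests

        session = requests.Session()
    url = f"{base_url.rstrip('/')}/v1/tokenize"
    # Without `timeout`, requests waits indefinitely on an unresponsive
    # server, which is the behavior the scanner flags.
    resp = session.post(url, json=payload, timeout=timeout)
    resp.raise_for_status()
    return resp.json()  # assumes a JSON response body
```

Passing a `(connect, read)` tuple bounds both phases of the request separately; a single float would apply the same limit to each.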

g-despot changed the base branch from main to v1-37/main April 10, 2026 10:52
orca-security-eu bot left a comment

Orca Security Scan Summary

| Status | Check | High | Medium | Low | Info |
|--------|-------|------|--------|-----|------|
| Passed | Secrets | 0 | 0 | 0 | 0 |

g-despot requested a review from amourao April 11, 2026 12:39