Description
Since #2726, string sorts have regressed by about 15%.
I have not had time to triage this yet, but I strongly suspect that it is due to `impl SortKeyComputer for SortByString` performing a separate lookup in the column's dictionary for each term:
tantivy/src/collector/sort_key/sort_by_string.rs
Lines 62 to 72 in d0e1600
```rust
    fn convert_segment_sort_key(&self, term_ord_opt: Option<TermOrdinal>) -> Option<String> {
        let term_ord = term_ord_opt?;
        let str_column = self.str_column_opt.as_ref()?;
        let mut bytes = Vec::new();
        str_column
            .dictionary()
            .ord_to_term(term_ord, &mut bytes)
            .ok()?;
        String::try_from(bytes).ok()
    }
}
```
When ordering by strings, the resulting term ordinals tend to be sequential in the column's dictionary. Because the dictionary is compressed, each of these lookups decompresses a block of the term dictionary. Since the ordinals are potentially contiguous in the dictionary, the same block can end up being decompressed multiple times.
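To make the suspected cost concrete, here is a minimal toy model of a block-compressed dictionary that counts how many block "decompressions" each access pattern triggers. It is only a sketch: `ToyDictionary`, `lookup_each`, and `lookup_sorted_batch` are hypothetical names and do not reflect tantivy's actual dictionary implementation.

```rust
// Toy model only: NOT tantivy's dictionary. All names here are hypothetical.
struct ToyDictionary {
    // Each inner Vec stands in for one compressed block of terms.
    blocks: Vec<Vec<String>>,
    block_len: usize,
}

impl ToyDictionary {
    fn new(terms: &[&str], block_len: usize) -> Self {
        let blocks = terms
            .chunks(block_len)
            .map(|chunk| chunk.iter().map(|t| t.to_string()).collect())
            .collect();
        ToyDictionary { blocks, block_len }
    }

    /// One independent lookup per ordinal: every call pays for a block decompression.
    fn lookup_each(&self, ords: &[u64]) -> (Vec<String>, usize) {
        let mut decompressions = 0;
        let mut terms = Vec::with_capacity(ords.len());
        for &ord in ords {
            let ord = ord as usize;
            decompressions += 1;
            terms.push(self.blocks[ord / self.block_len][ord % self.block_len].clone());
        }
        (terms, decompressions)
    }

    /// Batched lookup over ascending ordinals: a block is decompressed only
    /// when the ordinal crosses into a different block.
    fn lookup_sorted_batch(&self, sorted_ords: &[u64]) -> (Vec<String>, usize) {
        let mut decompressions = 0;
        let mut current_block: Option<usize> = None;
        let mut terms = Vec::with_capacity(sorted_ords.len());
        for &ord in sorted_ords {
            let ord = ord as usize;
            let block_idx = ord / self.block_len;
            if current_block != Some(block_idx) {
                decompressions += 1;
                current_block = Some(block_idx);
            }
            terms.push(self.blocks[block_idx][ord % self.block_len].clone());
        }
        (terms, decompressions)
    }
}
```

In this model, resolving N contiguous ordinals one at a time costs N decompressions, while the batched pass costs roughly N divided by the block length.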
Previously on main, this used `sorted_ords_to_term_cb` to batch-convert the `TermOrdinal`s into terms. One way to get this performance back would be to change `SegmentSortKeyComputer::convert_segment_sort_key` into a batch method.
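A rough sketch of what a batch conversion could look like follows. It assumes a stand-in dictionary (`ToyTermDictionary`) with a hypothetical `sorted_ords_to_terms` method that loosely mimics the callback style of `sorted_ords_to_term_cb`. The function name `convert_segment_sort_keys` is also an assumption and not an existing tantivy API. The point it illustrates is that the batch has to be sorted by ordinal before the lookup and the results scattered back to their original positions.

```rust
type TermOrdinal = u64;

/// Hypothetical stand-in for a column dictionary; not tantivy's API.
struct ToyTermDictionary {
    terms: Vec<String>, // sorted; the index is the term ordinal
}

impl ToyTermDictionary {
    /// Resolves ascending ordinals in a single forward pass, handing each
    /// term to `cb`. Returns false as soon as an ordinal is out of range.
    fn sorted_ords_to_terms(
        &self,
        sorted_ords: impl Iterator<Item = TermOrdinal>,
        mut cb: impl FnMut(&str),
    ) -> bool {
        for ord in sorted_ords {
            match self.terms.get(ord as usize) {
                Some(term) => cb(term),
                None => return false,
            }
        }
        true
    }
}

/// Hypothetical batch counterpart to `convert_segment_sort_key`.
fn convert_segment_sort_keys(
    dict: &ToyTermDictionary,
    term_ords: &[Option<TermOrdinal>],
) -> Vec<Option<String>> {
    // Remember where each present ordinal came from.
    let mut present: Vec<(usize, TermOrdinal)> = term_ords
        .iter()
        .enumerate()
        .filter_map(|(pos, ord)| ord.map(|ord| (pos, ord)))
        .collect();
    // The batch lookup wants ascending ordinals, whatever order the caller used.
    present.sort_unstable_by_key(|&(_, ord)| ord);

    let mut out: Vec<Option<String>> = vec![None; term_ords.len()];
    let mut positions = present.iter().map(|&(pos, _)| pos);
    // Because ordinals are ascending, stopping at the first out-of-range one
    // leaves exactly the unresolvable entries as None.
    let _all_resolved = dict.sorted_ords_to_terms(
        present.iter().map(|&(_, ord)| ord),
        |term| {
            if let Some(pos) = positions.next() {
                out[pos] = Some(term.to_string());
            }
        },
    );
    out
}
```

Missing ordinals (`None`) and out-of-range ordinals both map to `None` in the output, which matches the behavior of the per-key version quoted above.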