use unicode-width instead of len() or grapheme cluster #7.#71
This PR is failing because of #![feature(rustc_private)]. Please suggest an alternative to get past this issue. Thanks,
| [dev-dependencies] | ||
| log = "0.4" | ||
| unicode-width = "0.1.5" |
[dev-dependencies] are not available for the "main" crate. You want [dependencies].
| use std::iter::{repeat, IntoIterator}; | ||
| use std::result; | ||
| extern crate unicode_width; |
nit: let's put this extern crate up with the extern crate log statement.
| } else { | ||
| row.push_str("--"); | ||
| } | ||
| row.push_str(if self.long_only { "-" } else { "--" }); |
| /// Note: Function was moved here from `std::str` because this module | ||
| /// is the only place that uses it, and because it was too specific for | ||
| /// a general string function. | ||
| fn each_split_within(desc: &String, lim: usize) -> Vec<String> { |
nit: This could be desc: &str
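To illustrate the nit, here is a minimal sketch of why `&str` is the more flexible parameter type. The helper names `word_count` and `word_count_demo` are hypothetical and exist only for this example; they are not part of the PR.

```rust
/// Hypothetical helper, named only for this illustration: counts the
/// whitespace-separated words in a description.
fn word_count(desc: &str) -> usize {
    desc.split_whitespace().count()
}

/// Calls `word_count` with an owned `String` and with a string literal.
/// Both compile because `&String` deref-coerces to `&str`.
fn word_count_demo() -> (usize, usize) {
    let owned = String::from("print this help menu");
    (word_count(&owned), word_count("verbose output"))
}
```

With a `&String` parameter, the second call would need a temporary `String` allocated just to satisfy the signature.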
| // A single word has gone over the limit. In this | ||
| // case we just accept that the word will be too long. | ||
| B | ||
| /// Note: Function was moved here from `std::str` because this module |
This comment probably isn't accurate anymore.
| let mut rows = Vec::new(); | ||
| for line in desc.trim().lines() { | ||
| let mut words = Vec::new(); | ||
| let mut word = String::new(); |
We've got a fair amount of allocation going on in this method that would be good to avoid if we can. It seems like we're processing the line multiple times to clear out excess whitespace. We could do this without the temporary strings by maintaining an index into the line that we're processing:

// Add an additional whitespace to flush the last word
let line_chars = line.chars().chain(Some(' '));
let words = line_chars.fold((Vec::new(), 0, 0), |(mut words, word_start_idx, last_idx), c| {
    // Get the current byte offset
    let idx = last_idx + c.len_utf8();
    // If the char is whitespace, advance the word start and maybe push a word
    if c.is_whitespace() {
        if word_start_idx != last_idx {
            words.push(&line[word_start_idx..last_idx]);
        }
        (words, idx, idx)
    }
    // If the char is not whitespace, continue, retaining the current word start
    else {
        (words, word_start_idx, idx)
    }
}).0;

The example uses the Iterator::fold method to let us thread state through our chars, so we can find the point at which a word starts, then keep that index until we hit the end of the word. Here's a runnable version you can check out.
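As a self-contained sketch of the same fold, the snippet can be wrapped in a function that returns slices borrowed from the input. The function name `split_words` is an assumption made for this example and does not appear in the PR.

```rust
/// Splits `line` into whitespace-separated words without allocating a
/// `String` per word; the returned slices borrow from `line`.
/// (`split_words` is a hypothetical name used only for this sketch.)
fn split_words(line: &str) -> Vec<&str> {
    // Append one trailing space so the last word is flushed.
    let line_chars = line.chars().chain(Some(' '));
    line_chars
        .fold((Vec::new(), 0, 0), |(mut words, word_start_idx, last_idx), c| {
            // Byte offset just past the current char.
            let idx = last_idx + c.len_utf8();
            if c.is_whitespace() {
                // A non-empty span before this whitespace is a finished word.
                if word_start_idx != last_idx {
                    words.push(&line[word_start_idx..last_idx]);
                }
                (words, idx, idx)
            } else {
                // Inside a word: keep the start index, advance the end.
                (words, word_start_idx, idx)
            }
        })
        .0
}
```

Tracking byte offsets with `len_utf8` keeps every slice boundary on a `char` boundary, so multi-byte input does not panic when it is sliced.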
| C | ||
| }).0; | ||
| let mut row = String::new(); |
We could cut down some more allocations in this part of the function too. We don't need the filter anymore because the words we get from above are all greater than 0 in length. So we could do something like this:

let mut current_row = String::new();
for word in words.iter() {
    let sep = if !current_row.is_empty() { Some(" ") } else { None };
    let width =
        current_row.width() + word.width() + sep.map(UnicodeWidthStr::width).unwrap_or(0);
    if width <= lim {
        if let Some(sep) = sep {
            current_row.push_str(sep);
        }
        current_row.push_str(word);
        continue;
    }
    if !current_row.is_empty() {
        rows.push(current_row.clone());
        current_row.clear();
    }
    current_row.push_str(word);
}
if !current_row.is_empty() {
    rows.push(current_row);
}

So we re-use the same current_row, with its capacity already set somewhere up around lim, instead of creating a new string buffer each time.
We also don't need to filter and copy rows; we can just return it as-is at the end of the method because it only has valid rows in it.
What do you think?
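For reference, a standalone sketch of this greedy row packing follows. To avoid an external dependency it assumes a stand-in `approx_width` that counts `char`s; that is not the true display width, and the real code would call `UnicodeWidthStr::width` from unicode-width instead. The names `approx_width` and `pack_rows` are hypothetical.

```rust
/// Stand-in for `UnicodeWidthStr::width` so the sketch needs no external
/// crate. It counts `char`s, which is NOT the true display width for
/// wide or zero-width characters.
fn approx_width(s: &str) -> usize {
    s.chars().count()
}

/// Greedily packs `words` into rows no wider than `lim`, joined by single
/// spaces. A word wider than `lim` gets a row of its own.
fn pack_rows(words: &[&str], lim: usize) -> Vec<String> {
    let mut rows = Vec::new();
    let mut current_row = String::new();
    for word in words {
        // A separator is only needed when the row already has content.
        let sep_width = if current_row.is_empty() { 0 } else { 1 };
        let width = approx_width(&current_row) + sep_width + approx_width(word);
        if width <= lim {
            if sep_width > 0 {
                current_row.push(' ');
            }
            current_row.push_str(word);
            continue;
        }
        // The word does not fit: flush the row and reuse its buffer.
        if !current_row.is_empty() {
            rows.push(current_row.clone());
            current_row.clear();
        }
        current_row.push_str(word);
    }
    if !current_row.is_empty() {
        rows.push(current_row);
    }
    rows
}
```

Calling `clear()` after the `clone()` keeps the buffer's capacity, which is the point of the suggestion: only the finished rows are newly allocated.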
Thanks for the suggestion. 👍 amended the PR.
The AppVeyor failure is transient.
I have refactored each_split_within() to follow what is hopefully simpler logic. The test cases are passing, and I have added a new test case for multi-width characters.
Let me know whether this PR is useful for this issue or needs modifications.
Thanks,
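As a small illustration of why `len()` is the wrong measure for multi-width text, the sketch below compares byte length with `char` count using only the standard library. The function name `byte_len_and_char_count` is hypothetical. Neither number is the display width: each character in "日本語" is generally rendered two columns wide, which is the figure `UnicodeWidthStr::width` is meant to report.

```rust
/// Returns (byte length, number of `char`s) for `s`.
/// Neither value is the terminal display width.
fn byte_len_and_char_count(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For ASCII input such as "help" both values are 4, which is why the difference only shows up once non-ASCII descriptions are wrapped.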