-
Notifications
You must be signed in to change notification settings - Fork 126
fix(lsi): improve sentence and paragraph splitting in Summary #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -3,6 +3,8 @@ | |||||||||||||||||||||||||||||||||||||
| # License:: LGPL | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| class String | ||||||||||||||||||||||||||||||||||||||
| ABBREVIATIONS = %w[Mr Mrs Ms Dr Prof Jr Sr Inc Ltd Corp Co vs etc al eg ie].freeze | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def summary(count = 10, separator = ' [...] ') | ||||||||||||||||||||||||||||||||||||||
| perform_lsi split_sentences, count, separator | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
@@ -12,20 +14,38 @@ def paragraph_summary(count = 1, separator = ' [...] ') | |||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def split_sentences | ||||||||||||||||||||||||||||||||||||||
| split(/(\.|!|\?)/) # TODO: make this less primitive | ||||||||||||||||||||||||||||||||||||||
| return pragmatic_segment if defined?(PragmaticSegmenter) | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| split_sentences_regex | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def split_paragraphs | ||||||||||||||||||||||||||||||||||||||
| split(/(\n\n|\r\r|\r\n\r\n)/) # TODO: make this less primitive | ||||||||||||||||||||||||||||||||||||||
| split(/\r?\n\r?\n+/) | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| private | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def pragmatic_segment | ||||||||||||||||||||||||||||||||||||||
| PragmaticSegmenter::Segmenter.new(text: self).segment | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def split_sentences_regex | ||||||||||||||||||||||||||||||||||||||
| abbrev_pattern = ABBREVIATIONS.map { |a| "#{a}\\." }.join('|') | ||||||||||||||||||||||||||||||||||||||
| text = gsub(/\b(#{abbrev_pattern})/i) { |m| m.gsub('.', '<<<DOT>>>') } | ||||||||||||||||||||||||||||||||||||||
| text = text.gsub(/(\d)\.(\d)/, '\1<<<DOT>>>\2') | ||||||||||||||||||||||||||||||||||||||
| sentences = text.split(/(?<=[.!?])(?:\s+|(?=[A-Z]))/) | ||||||||||||||||||||||||||||||||||||||
| sentences.map { |s| s.gsub('<<<DOT>>>', '.') } | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| def perform_lsi(chunks, count, separator) | ||||||||||||||||||||||||||||||||||||||
| lsi = Classifier::LSI.new auto_rebuild: false | ||||||||||||||||||||||||||||||||||||||
| chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 } | ||||||||||||||||||||||||||||||||||||||
| chunks.each do |chunk| | ||||||||||||||||||||||||||||||||||||||
| stripped = chunk.strip | ||||||||||||||||||||||||||||||||||||||
| next if stripped.empty? || stripped.split.size == 1 | ||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||
| lsi << chunk | ||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
Comment on lines
40
to
+47
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. style: nested conditional - use early return instead
Suggested change
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! Prompt To Fix With AIThis is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:45
Comment:
**style:** nested conditional - use early return instead
```suggestion
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
next if stripped.empty? || stripped.split.size == 1
lsi << chunk
end
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise. |
||||||||||||||||||||||||||||||||||||||
| lsi.build_index | ||||||||||||||||||||||||||||||||||||||
| summaries = lsi.highest_relative_content count | ||||||||||||||||||||||||||||||||||||||
| summaries.select { |chunk| summaries.include?(chunk) }.map(&:strip).join(separator) | ||||||||||||||||||||||||||||||||||||||
| lsi.highest_relative_content(count).map(&:strip).join(separator) | ||||||||||||||||||||||||||||||||||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. style: redundant filter -
Suggested change
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! Prompt To Fix With AIThis is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 47:47
Comment:
**style:** redundant filter - `highest_relative_content` already returns valid summaries
```suggestion
lsi.highest_relative_content(count).map(&:strip).join(separator)
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise. |
||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
|
Comment on lines
40
to
50
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. style: calls
Suggested change
Prompt To Fix With AIThis is a comment left during a code review.
Path: lib/classifier/lsi/summary.rb
Line: 40:46
Comment:
**style:** calls `chunk.strip` twice per iteration - store result
```suggestion
def perform_lsi(chunks, count, separator)
lsi = Classifier::LSI.new auto_rebuild: false
chunks.each do |chunk|
stripped = chunk.strip
lsi << chunk unless stripped.empty? || stripped.split.size == 1
end
lsi.build_index
summaries = lsi.highest_relative_content count
summaries.map(&:strip).join(separator)
end
```
How can I resolve this? If you propose a fix, please make it concise. |
||||||||||||||||||||||||||||||||||||||
| end | ||||||||||||||||||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| require_relative '../test_helper' | ||
|
|
||
| class SummaryTest < Minitest::Test | ||
| def test_split_sentences_basic | ||
| text = 'Hello world. This is a test. How are you?' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 3, sentences.size | ||
| assert_equal 'Hello world.', sentences[0] | ||
| assert_equal 'This is a test.', sentences[1] | ||
| assert_equal 'How are you?', sentences[2] | ||
| end | ||
|
|
||
| def test_split_sentences_with_abbreviations | ||
| text = 'Dr. Smith went to the store. He bought milk.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 2, sentences.size | ||
| assert_equal 'Dr. Smith went to the store.', sentences[0] | ||
| assert_equal 'He bought milk.', sentences[1] | ||
| end | ||
|
|
||
| def test_split_sentences_with_mr_mrs | ||
| text = 'Mr. Jones met Mrs. Smith. They talked.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 2, sentences.size | ||
| assert_equal 'Mr. Jones met Mrs. Smith.', sentences[0] | ||
| assert_equal 'They talked.', sentences[1] | ||
| end | ||
|
|
||
| def test_split_sentences_with_decimals | ||
| text = 'The price is $3.50 per unit. That is expensive.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 2, sentences.size | ||
| assert_equal 'The price is $3.50 per unit.', sentences[0] | ||
| assert_equal 'That is expensive.', sentences[1] | ||
| end | ||
|
|
||
| def test_split_sentences_with_exclamation | ||
| text = 'Hello! How are you? I am fine.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 3, sentences.size | ||
| assert_equal 'Hello!', sentences[0] | ||
| assert_equal 'How are you?', sentences[1] | ||
| assert_equal 'I am fine.', sentences[2] | ||
| end | ||
|
|
||
| def test_split_sentences_with_inc_corp | ||
| text = 'Apple Inc. makes phones. Microsoft Corp. makes software.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 2, sentences.size | ||
| assert_equal 'Apple Inc. makes phones.', sentences[0] | ||
| assert_equal 'Microsoft Corp. makes software.', sentences[1] | ||
| end | ||
|
|
||
| def test_split_sentences_with_etc | ||
| text = 'We need apples, oranges, etc. for the party. Please bring them.' | ||
| sentences = text.split_sentences | ||
|
|
||
| assert_equal 2, sentences.size | ||
| assert_includes sentences[0], 'etc.' | ||
| end | ||
|
|
||
| def test_split_paragraphs_double_newline | ||
| text = "First paragraph.\n\nSecond paragraph." | ||
| paragraphs = text.split_paragraphs | ||
|
|
||
| assert_equal 2, paragraphs.size | ||
| assert_equal 'First paragraph.', paragraphs[0] | ||
| assert_equal 'Second paragraph.', paragraphs[1] | ||
| end | ||
|
|
||
| def test_split_paragraphs_windows_line_endings | ||
| text = "First paragraph.\r\n\r\nSecond paragraph." | ||
| paragraphs = text.split_paragraphs | ||
|
|
||
| assert_equal 2, paragraphs.size | ||
| assert_equal 'First paragraph.', paragraphs[0] | ||
| assert_equal 'Second paragraph.', paragraphs[1] | ||
| end | ||
|
|
||
| def test_split_paragraphs_multiple_newlines | ||
| text = "First paragraph.\n\n\n\nSecond paragraph." | ||
| paragraphs = text.split_paragraphs | ||
|
|
||
| assert_equal 2, paragraphs.size | ||
| end | ||
|
|
||
| def test_split_paragraphs_mixed_line_endings | ||
| text = "First.\r\n\r\nSecond.\n\nThird." | ||
| paragraphs = text.split_paragraphs | ||
|
|
||
| assert_equal 3, paragraphs.size | ||
| end | ||
|
|
||
| def test_summary_returns_string | ||
| text = 'This is sentence one. This is sentence two. This is sentence three.' | ||
| result = text.summary(2) | ||
|
|
||
| assert_instance_of String, result | ||
| end | ||
|
|
||
| def test_paragraph_summary_returns_string | ||
| text = "First paragraph with content.\n\nSecond paragraph with more content." | ||
| result = text.paragraph_summary(1) | ||
|
|
||
| assert_instance_of String, result | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: regex pattern
(?:\s+|(?=[A-Z]))is ambiguous - splits on capital letters without space (e.g., "Hello.World"), but then what about "Hello.world"? This creates inconsistent behavior.Prompt To Fix With AI