Commit e67a2ed

fix: Empty Page SubElements (#186)
- Changed page elements to be initialized with document text instead of page text.
- The bug caused all paragraphs/blocks/etc. after the first page to be empty.
- Introduced in https://togithub.com/googleapis/python-documentai-toolbox/pull/123
  - https://togithub.com/googleapis/python-documentai-toolbox/pull/123/files#diff-af92c6de8f8e84ca66d2fb9fa7e9bddb5bd644e944153bf7a78d35f47c05853eR251
- Added a new unit test that covers the issue.
- Discovered in https://togithub.com/GoogleCloudPlatform/generative-ai/issues/217
1 parent 43ba96e commit e67a2ed

3 files changed: +89723 -2 lines changed

packages/google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/wrappers/page.py

Lines changed: 4 additions & 2 deletions
@@ -122,7 +122,7 @@ def text(self):
         """
         if self._text is None:
             self._text = _text_from_layout(
-                layout=self.documentai_object.layout, text=self._page.text
+                layout=self.documentai_object.layout, text=self._page.document_text
             )
         return self._text
 

@@ -324,13 +324,15 @@ def _text_from_layout(layout: documentai.Document.Page.Layout, text: str) -> str
             Required. an element with layout fields.
         text (str):
             Required. UTF-8 encoded text in reading order
-            from the document.
+            of the `documentai.Document` containing the layout element.
 
     Returns:
         str:
             Text from a single element.
     """
 
+    # Note: `layout.text_anchor.text_segments` are indexes into the full Document text.
+    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document#textsegment
     return "".join(
         text[int(segment.start_index) : int(segment.end_index)]
         for segment in layout.text_anchor.text_segments
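
The change above depends on text segment offsets being relative to the full document text rather than to a single page. The following is a minimal sketch of why slicing page-only text yields empty strings for later pages. It does not use the Document AI client library: `Segment`, `text_from_segments`, and the sample strings are hypothetical stand-ins chosen for illustration, and only the slicing expression mirrors `_text_from_layout`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a Document AI text segment."""

    start_index: int
    end_index: int


def text_from_segments(segments: List[Segment], text: str) -> str:
    # Same slicing expression as `_text_from_layout` in the diff.
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in segments
    )


# Made-up two-page document. Offsets are relative to the full document text.
document_text = "Page one.\nPage two.\n"
page_two_text = "Page two.\n"

# A paragraph on page two spans characters 10..20 of the document.
page_two_paragraph = [Segment(start_index=10, end_index=20)]

# Slicing the full document text recovers the paragraph.
fixed = text_from_segments(page_two_paragraph, document_text)

# Slicing the page-only text runs past its end and returns nothing.
buggy = text_from_segments(page_two_paragraph, page_two_text)
```

In this sketch `fixed` holds the page-two paragraph while `buggy` is an empty string, which matches the symptom described in the commit message: elements after the first page came back empty because their offsets exceeded the length of the page text.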
