[textractprettyprinter] does not return the last row of a table when using get_text_from_layout_json #430

@abouberthe

I used the following code to extract information from documents, including text and tables:

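from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

# byte_img holds the image bytes; textract_client is a boto3 Textract client.
textract_json = call_textract(
    input_document=byte_img,
    features=[Textract_Features.TABLES, Textract_Features.LAYOUT, Textract_Features.FORMS],
    boto3_textract_client=textract_client,
)

# The result is indexed by page number, starting at 1.
layout = get_text_from_layout_json(textract_json, exclude_figure_text=False)

# Use the text of page 1, or an empty string if that page is missing.
if 1 in layout:
    full_text = layout[1]
else:
    full_text = ''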

However, when testing it on the attached document (document_anonyme_1.jpg), the resulting text output (document_anonymise_1.txt) is missing the last row of the table — specifically, the row that contains "COPYRIGHT EOT ..." does not appear.
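One way to narrow this down is to check whether the missing row is present in the raw Textract response but not attached to any layout element. The following is only a diagnostic sketch, not part of any of the libraries above: `find_lines_outside_layout` is a hypothetical helper, and it assumes the standard Textract response shape (`Blocks` with `BlockType`, `Id`, `Text`, and `CHILD` relationships) and that `LAYOUT_*` blocks list their lines as children. The sample data is made up for illustration.

```python
def find_lines_outside_layout(textract_json, needle):
    """Return texts of LINE blocks containing `needle` that no LAYOUT_* block lists as a child.

    Hypothetical helper for diagnosis; assumes the standard Textract response shape.
    """
    blocks = textract_json.get("Blocks", [])

    # Collect the IDs of every block referenced as a CHILD of a LAYOUT_* block.
    layout_child_ids = set()
    for block in blocks:
        if block.get("BlockType", "").startswith("LAYOUT_"):
            for rel in block.get("Relationships") or []:
                if rel.get("Type") == "CHILD":
                    layout_child_ids.update(rel.get("Ids", []))

    # Keep LINE blocks that match the search text (case-insensitive)
    # and are not referenced by any layout element.
    needle = needle.lower()
    return [
        block.get("Text", "")
        for block in blocks
        if block.get("BlockType") == "LINE"
        and needle in block.get("Text", "").lower()
        and block.get("Id") not in layout_child_ids
    ]


# Made-up sample: one line belongs to a layout table, the other does not.
sample = {
    "Blocks": [
        {"Id": "l1", "BlockType": "LINE", "Text": "Header row"},
        {"Id": "l2", "BlockType": "LINE", "Text": "COPYRIGHT EOT"},
        {
            "Id": "t1",
            "BlockType": "LAYOUT_TABLE",
            "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}],
        },
    ]
}
print(find_lines_outside_layout(sample, "copyright"))
```

Calling `find_lines_outside_layout(textract_json, "COPYRIGHT")` on the real response would show whether the row is detected as a line but left out of the layout elements. If it returns an empty list while the text is still missing from the output, the cause is more likely in how the layout text is assembled.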

Could you please help me resolve this issue?

For reference, I am using the following versions of the relevant packages:

amazon-textract-caller: 0.2.4

amazon-textract-prettyprinter: 0.1.10

amazon-textract-response-parser: 0.1.48

amazon-textract-textractor: 1.9.2

document_anonymise_1.txt
