I’m following this tutorial: PyPDFLoader | 🦜️🔗 LangChain
Image extraction from PDFs with Tesseract (or any other image parser) no longer works: the OCR text for embedded images is missing from the loader output.
How to reproduce
Use the example PDF from the tutorial: langchain/docs/docs/integrations/document_loaders/example_data/layout-parser-paper.pdf (master branch of langchain-ai/langchain on GitHub)
Run the following with langchain-community 0.3.27:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders.parsers import TesseractBlobParser

# Load one Document per page and OCR embedded images with Tesseract.
loader = PyPDFLoader(
    "layout-parser-paper.pdf",
    mode="page",
    images_inner_format="html-img",
    images_parser=TesseractBlobParser(),
)
docs = loader.load()

# Index 5 is page 6 of the paper, which contains Fig. 2.
print(docs[5].page_content)
The output I get contains only the page text, with no OCR result for the figure:
6 Z. Shen et al.
Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the co-
ordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum flexibility.
Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 different datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).
3.2 Layout Data Structures
A critical feature of LayoutParser is the implementation of a series of data
structures and operations that can be used to efficiently process and manipulate
the layout elements. In document image analysis pipelines, various post-processing
on the layout analysis model outputs is usually required to obtain the final
outputs. Traditionally, this requires exporting DL model outputs and then loading
the results into other pipelines. All model outputs from LayoutParser will be
stored in carefully engineered data types optimized for further processing, which
makes it possible to build an end-to-end document digitization pipeline within
LayoutParser. There are three key components in the data structure, namely
the Coordinate system, the TextBlock, and the Layout. They provide different
levels of abstraction for the layout data, and a set of APIs are supported for
transformations or operations on these classes.
In the tutorial's example, by contrast, the output for the same page ends with the following OCR segment:
<img alt="Coordinate
textblock
x-interval
JeAsaqul-A
Coordinate
+
Extra features
Rectangle
Quadrilateral
Block
Text
Block
Type
Reading
Order
layout
[ coordinatel textblock1 |
'
“y textblock2 , layout1 ]
A list of the layout elements
The same transformation and operation APIs src="#" />
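To check whether the OCR segment is missing on every page and not just this one, here is a minimal sketch that looks for the `<img` tag that `images_inner_format="html-img"` is expected to produce. The helper name `pages_with_img_tags` is my own and not part of LangChain; it works on plain strings so it can be tried without the loader.

```python
from typing import List


def pages_with_img_tags(page_texts: List[str]) -> List[int]:
    """Return the indices of pages whose text contains an HTML <img tag.

    Hypothetical helper, not a LangChain API: it only does a substring
    check on the page text produced by the loader.
    """
    return [i for i, text in enumerate(page_texts) if "<img" in text]


# With the loader from the reproduction above, this would be called as:
#   pages_with_img_tags([doc.page_content for doc in docs])
# An empty list would mean no page received any OCR output at all.
```

If the list is empty for the whole document, the image parser is probably not being invoked at all, as opposed to Tesseract failing on one particular figure.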
I suspect this has something to do with the LangChain version I'm using. Is that the case?
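For anyone comparing environments, here is a small sketch that prints the installed versions of the packages I assume are involved. The package list is my guess at what is relevant, not something taken from the tutorial. It uses only the standard library, plus a check for the `tesseract` binary on the PATH.

```python
import shutil
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed list of relevant distributions; adjust as needed.
PACKAGES = ["langchain-community", "langchain-core", "pypdf", "pytesseract", "Pillow"]

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")

# pytesseract also needs the tesseract executable itself.
print("tesseract binary:", shutil.which("tesseract") or "not found on PATH")
```

Posting this output alongside the reproduction should make it easier to tell whether the behaviour changed between releases.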