Image parsing with PyPDF loader not working as described in documentation

iuruoy-shao · August 1, 2025, 10:08pm

I’m following this tutorial: PyPDFLoader | 🦜️🔗 LangChain

Image extraction from PDFs using Tesseract (or any other image parser) no longer works: the OCR text for embedded images is missing from the page content.

How to reproduce

Use the example PDF from the tutorial found here: langchain/docs/docs/integrations/document_loaders/example_data/layout-parser-paper.pdf at master · langchain-ai/langchain · GitHub

Run the following with langchain-community 0.3.27:

from langchain_community.document_loaders.parsers import TesseractBlobParser
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    "layout-parser-paper.pdf",
    mode="page",
    images_inner_format="html-img",
    images_parser=TesseractBlobParser(),
)
docs = loader.load()
print(docs[5].page_content)

The output is:

6 Z. Shen et al.
Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the co-
ordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum ﬂexibility.
Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 diﬀerent datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).
3.2 Layout Data Structures
A critical feature of LayoutParser is the implementation of a series of data
structures and operations that can be used to eﬃciently process and manipulate
the layout elements. In document image analysis pipelines, various post-processing
on the layout analysis model outputs is usually required to obtain the ﬁnal
outputs. Traditionally, this requires exporting DL model outputs and then loading
the results into other pipelines. All model outputs from LayoutParser will be
stored in carefully engineered data types optimized for further processing, which
makes it possible to build an end-to-end document digitization pipeline within
LayoutParser. There are three key components in the data structure, namely
the Coordinate system, the TextBlock, and the Layout. They provide diﬀerent
levels of abstraction for the layout data, and a set of APIs are supported for
transformations or operations on these classes.

In the example provided in the tutorial, by contrast, the output also includes the following OCR segment at the end:

<img alt="Coordinate

textblock

x-interval

JeAsaqul-A

Coordinate
+

Extra features

Rectangle

Quadrilateral

Block
Text

Block
Type

Reading
Order

layout

[ coordinatel textblock1 |
&#x27;

“y textblock2 , layout1 ]

A list of the layout elements

The same transformation and operation APIs src="#" />
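To check every page rather than just page 5, something like the sketch below could be used. It assumes that the html-img format always starts an OCR segment with `<img alt=`, as in the tutorial excerpt above; `pages_with_ocr_segment` is a hypothetical helper name of my own, not part of LangChain.

```python
def pages_with_ocr_segment(page_contents):
    """Return the indices of pages whose text contains an html-img style tag.

    Assumption: an OCR segment produced with images_inner_format="html-img"
    begins with '<img alt=', as in the tutorial output quoted above.
    """
    return [i for i, text in enumerate(page_contents) if "<img alt=" in text]
```

With the loader from the reproduction, this would be called as `pages_with_ocr_segment([d.page_content for d in docs])`. An empty list would mean no page received any OCR text.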

I suspect this has something to do with the LangChain version used?
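To make the version question easier to answer, here is a minimal sketch for reporting what is installed in the environment. The package list is my assumption about which distributions are relevant (only langchain-community 0.3.27 is confirmed above), and `installed_versions` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed list of distributions that may affect PDF image parsing; adjust as needed.
PACKAGES = ["langchain-community", "langchain-core", "pypdf", "pytesseract"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this output with an environment where the tutorial output is reproduced might narrow down whether a version change is responsible.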
