Convert PDF pages to text with Python

A simple guide to extracting text from a PDF. This is an extension of the Convert PDF pages to JPEG with python post.

Objectives:
  1. Extract text from PDF
Required Tools:
  1. Poppler for Windows: Poppler is a PDF rendering library. It includes command-line utilities such as pdftoppm and pdftotext.
  2. Poppler for Mac: if Homebrew is already installed, run brew install poppler
  3. pdftotext: Python module that uses Poppler to convert PDF to text.
Steps:
  1. Install Poppler. On Windows, add “xxx/bin/” to the PATH environment variable.
  2. pip install pdftotext
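
Before moving on, it can help to confirm that both steps took effect. The sketch below is one way to check, assuming that Poppler's bin directory provides an executable named pdftotext and that the Python module is importable under the name pdftotext. The helper name check_setup is made up for this illustration. It only looks things up, so it does not prove that every Poppler library the module needs can actually be loaded.

```python
import importlib.util
import shutil


def check_setup(tool="pdftotext", module="pdftotext"):
    """Report whether a Poppler command-line tool and the Python module can be found.

    check_setup is a hypothetical helper for this post, not part of pdftotext.
    """
    return {
        # shutil.which returns the executable's full path, or None if it is not on PATH
        "tool_path": shutil.which(tool),
        # find_spec returns None when a top-level module is not installed
        "module_installed": importlib.util.find_spec(module) is not None,
    }


print(check_setup())
```

If tool_path comes back as None on Windows, the Poppler bin folder is probably not on PATH yet, or the terminal was opened before PATH was changed.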

Usage (sample code from the pdftotext GitHub repository)

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

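# Save all text to a txt file, with a blank line between pages.
# UTF-8 is set explicitly so non-ASCII characters are written safely on any platform.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(pdf))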
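
The sample above joins every page into one file. As the pdftotext README shows, the PDF object can also be iterated page by page, which makes it easy to keep pages apart. The following is a minimal sketch of that idea, not code from the pdftotext project: save_pages and extract_pages are hypothetical helper names, and the page_001.txt naming scheme is an assumption chosen for the example.

```python
from pathlib import Path


def extract_pages(pdf_path):
    """Return the text of each page of a PDF as a list of strings."""
    # Imported here so save_pages below can be used without pdftotext installed.
    import pdftotext

    with open(pdf_path, "rb") as f:
        return list(pdftotext.PDF(f))


def save_pages(pages, out_dir, stem="page"):
    """Write each page's text to its own UTF-8 file and return the file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        # Zero-padded numbers keep the files in page order when sorted by name.
        path = out / f"{stem}_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, save_pages(extract_pages("Target.pdf"), "pages") would write pages/page_001.txt, pages/page_002.txt, and so on.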

Further notes

https://github.com/jalan/pdftotext

See also:

Convert PDF pages to JPEG with python

One comment

Harshad Vyawahare says:

October 14, 2019 at 11:28 am

Installation on Windows is not straightforward. This blog post gives very good, detailed steps, and it would be worth linking to it from this post: https://coder.haus/2019/09/27/installing-pdftotext-through-pip-on-windows-10/
