A simple guide to extracting text from a PDF. This is an extension of the Convert PDF pages to JPEG with Python post.
- Objectives:
- Extract text from PDF
- Required Tools:
- Poppler for Windows: Poppler is a PDF rendering library. It includes the pdftoppm utility.
- Poppler for Mac: if Homebrew is already installed, run brew install poppler
- pdftotext: Python module that uses Poppler to convert PDF to text.
- Steps:
- Install Poppler. On Windows, add “xxx/bin/” to the PATH environment variable.
- Run pip install pdftotext
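Before running pip, it can help to confirm that the Poppler bin folder really is on PATH. The following is a minimal sketch using only the Python standard library. The helper names find_poppler_tool and report are made up for illustration. It only checks that the command-line utilities are visible from the current shell, and it does not guarantee that pip install pdftotext will build successfully.

```python
import shutil


def find_poppler_tool(name="pdftoppm"):
    """Return the full path of a Poppler utility if it is on PATH, else None."""
    return shutil.which(name)


def report():
    """Print whether the Poppler utilities can be found from this shell."""
    for tool in ("pdftoppm", "pdftotext"):
        location = find_poppler_tool(tool)
        print(f"{tool}: {location or 'not found on PATH'}")
```

Calling report() from a new terminal after editing PATH should list a location for each tool. If it prints "not found on PATH", the bin folder was probably not added correctly, or the terminal was opened before the change.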
Usage (sample code from the pdftotext GitHub repository)
import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
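The sample above joins every page into a single file. If one text file per page is more useful, a sketch along these lines may work. It relies on the pdftotext.PDF object being iterable and yielding one string per page, which is the same behaviour the join in the sample depends on. The function names save_pages and extract_pdf_pages and the page_001.txt naming scheme are assumptions made for illustration, not part of the pdftotext module.

```python
from pathlib import Path


def save_pages(pages, out_dir):
    """Write each page's text to out_dir/page_001.txt, page_002.txt, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        target = out_dir / f"page_{number:03d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


def extract_pdf_pages(pdf_path, out_dir):
    """Load a PDF with pdftotext and save one text file per page."""
    import pdftotext  # imported here so save_pages can be used without it

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return save_pages(pdf, out_dir)
```

For example, extract_pdf_pages("Target.pdf", "pages") would create a pages folder next to the script. Because save_pages accepts any iterable of strings, it can be tried with a plain list before pdftotext is installed.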
Further notes
See also:
The installation on Windows is not straightforward. Very good, detailed steps are given on this blog, and it would be worth including the link in this post: https://coder.haus/2019/09/27/installing-pdftotext-through-pip-on-windows-10/