pdf2text

PDF manipulation with Python

This post covers basic PDF manipulation for daily tasks using simple Python modules.

  1. Merging mulitple PDF
  2. Extract text from PDF
  3. Extract image from PDF

Merging PDF

from PyPDF2 import PdfFileMerger
pdfs = ['a.pdf', b.pdf]
merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("output.pdf")

Extract text from PDF

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

More information from “Convert PDF pages to text with python

Extract Image (JPEG) from PDF

 

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'

with tempfile.TemporaryDirectory() as path:
     images_from_path = convert_from_path(filename, output_folder=path, last_page=1, first_page =0)

base_filename  =  os.path.splitext(os.path.basename(filename))[0] + '.jpg'      

save_dir = 'your_saved_dir'

for page in images_from_path:
    page.save(os.path.join(save_dir, base_filename), 'JPEG')

More information from “Convert PDF pages to JPEG with python

Advertisement

Convert PDF pages to text with python

A simple guide to text from PDF. This is an extension of the Convert PDF pages to JPEG with python post

  1. Objectives:
      1. Extract text from PDF
  2. Required Tools:
      1. Poppler for windows— Poppler is a PDF rendering library . Include the pdftoppm utility
      2. Poppler for Mac — If HomeBrew already installed, can use brew install Poppler
      3. pdftotext— Python module. Wraps the poppler pdftotext utility to convert PDF to text.
  3. Steps:
      1. Install Poppler. For windows, Add “xxx/bin/” to env path
      2. pip install pdftotext

Usage (sample code from pdftotext github)

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

Further notes 

See also: