pdf2image

PDF manipulation with Python

This post covers basic PDF manipulation for daily tasks using simple Python modules.

  1. Merging mulitple PDF
  2. Extract text from PDF
  3. Extract image from PDF

Merging PDF

from PyPDF2 import PdfFileMerger
pdfs = ['a.pdf', b.pdf]
merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(pdf)

merger.write("output.pdf")

Extract text from PDF

import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))

More information from “Convert PDF pages to text with python

Extract Image (JPEG) from PDF

 

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'

with tempfile.TemporaryDirectory() as path:
     images_from_path = convert_from_path(filename, output_folder=path, last_page=1, first_page =0)

base_filename  =  os.path.splitext(os.path.basename(filename))[0] + '.jpg'      

save_dir = 'your_saved_dir'

for page in images_from_path:
    page.save(os.path.join(save_dir, base_filename), 'JPEG')

More information from “Convert PDF pages to JPEG with python

Advertisement

Convert PDF pages to JPEG with python

A simple guide to extract images (jpeg, png) from PDF.

  1. Objectives:
      1. Extract Images from PDF
  2. Required Tools:
      1. Poppler for windows— Poppler is a PDF rendering library . Include the pdftoppm utility
      2. Poppler for Mac — If HomeBrew already installed, can use brew install Poppler
      3. Pdf2image— Python module. Wraps the pdftoppm utility to convert PDF to a PIL Image object.
  3. Steps:
      1. Install Poppler. For windows, Add “xxx/bin/” to env path
      2. pip install pdf2image

Usage

import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'

with tempfile.TemporaryDirectory() as path:
     images_from_path = convert_from_path(filename, output_folder=path, last_page=1, first_page =0)

base_filename  =  os.path.splitext(os.path.basename(filename))[0] + '.jpg'      

save_dir = 'your_saved_dir'

for page in images_from_path:
    page.save(os.path.join(save_dir, base_filename), 'JPEG')

Further notes