This post covers basic PDF manipulation for daily tasks using simple Python modules.
- Merging multiple PDFs
- Extract text from PDF
- Extract image from PDF
Merging PDFs
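from PyPDF2 import PdfFileMerger  # renamed PdfMerger in later PyPDF2 releases

pdfs = ['a.pdf', 'b.pdf']
merger = PdfFileMerger()
# Append each input file in list order.
for pdf in pdfs:
    merger.append(pdf)
merger.write("output.pdf")
merger.close()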
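When the files to merge all live in one folder, the `pdfs` list can be built instead of typed by hand. The sketch below is one way to do that with the standard library. `collect_pdfs` and `merge_folder` are hypothetical helper names, not part of PyPDF2, and the code assumes that alphabetical filename order is the merge order you want.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the paths of the PDF files in `folder`, sorted alphabetically."""
    return sorted(str(p) for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def merge_folder(folder, output="output.pdf"):
    """Merge every PDF found in `folder` into `output` (hypothetical helper)."""
    # Imported here so collect_pdfs works even without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger  # PdfMerger in later PyPDF2 releases

    merger = PdfFileMerger()
    for pdf in collect_pdfs(folder):
        merger.append(pdf)
    merger.write(output)
    merger.close()
```

Write the merged file somewhere outside `folder`, otherwise a second run would pick up the previous output as an input.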
Extract text from PDF
import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
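A `pdftotext.PDF` object can be iterated page by page, which is what the `join` call above relies on. If one text file per page is more useful than a single dump, something like the following sketch would work. `save_pages` is a hypothetical helper and the `page_001.txt` naming scheme is an assumption. It accepts any iterable of strings, so the `pdf` object can be passed in directly.

```python
import os


def save_pages(pages, out_dir, prefix="page"):
    """Write each page's text to its own file and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # Number pages from 1 so the filenames match the page numbers in a viewer.
    for number, text in enumerate(pages, start=1):
        path = os.path.join(out_dir, f"{prefix}_{number:03d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        written.append(path)
    return written
```

With the `pdf` object loaded as above, `save_pages(pdf, "pages")` would write `pages/page_001.txt`, `pages/page_002.txt`, and so on.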
More information from “Convert PDF pages to text with python”
Extract Image (JPEG) from PDF
import os
import tempfile
from pdf2image import convert_from_path

filename = 'target.pdf'
save_dir = 'your_saved_dir'
base_filename = os.path.splitext(os.path.basename(filename))[0] + '.jpg'

with tempfile.TemporaryDirectory() as path:
    # Page numbers are 1-based; this converts only the first page.
    images_from_path = convert_from_path(filename, output_folder=path, first_page=1, last_page=1)
    # Save inside the with block, while the temporary files still exist.
    for page in images_from_path:
        page.save(os.path.join(save_dir, base_filename), 'JPEG')
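The snippet above writes every page to the same filename. That is fine for a single page, but earlier pages would be overwritten if more were converted. One possible fix, sketched below, is to number the output files. `page_image_names` is a hypothetical helper and the `_p1` suffix format is an arbitrary choice, not something pdf2image prescribes.

```python
import os


def page_image_names(filename, page_count, first_page=1, ext=".jpg"):
    """Return one output filename per page, e.g. target_p1.jpg, target_p2.jpg."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return [f"{stem}_p{number}{ext}"
            for number in range(first_page, first_page + page_count)]
```

In the save loop, the names can be paired with the converted pages using `zip(images_from_path, page_image_names(filename, len(images_from_path)))`, saving each page under its own name.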
More information from “Convert PDF pages to JPEG with python”

