
A Simple Example of Extracting Text from a PDF with Python

Author: somenzz   Updated: 2022-09-17   Category: Programming Languages

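Step 1: Install the libraries

1. tika: detects document types and extracts content from many file formats

2. wand: a simple ctypes-based ImageMagick binding

3. pytesseract: an OCR tool (a Python wrapper around Tesseract)

Create a virtual environment and install these tools: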

python -m venv venv
source venv/bin/activate
pip install tika wand pytesseract
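
The three packages are wrappers, so each one also depends on a program that pip does not install: tika starts a Java-based Tika server, wand loads ImageMagick (which relies on Ghostscript to rasterize PDF pages), and pytesseract calls the tesseract executable. A minimal sketch of a pre-flight check follows. The helper name missing_tools and the executable names are assumptions for a typical Linux or macOS setup and may differ on other platforms.

```python
import shutil

# Executable names are assumptions for a typical Linux/macOS install.
REQUIRED_TOOLS = {
    "java": "needed by tika to run the Tika server",
    "tesseract": "needed by pytesseract for OCR",
    "gs": "Ghostscript, used by ImageMagick to rasterize PDF pages",
}

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]

# Report anything that still has to be installed with the system package manager.
for name in missing_tools():
    print(f"missing: {name} ({REQUIRED_TOOLS[name]})")
```

An empty result only means the executables are on PATH; it does not verify their versions or the installed Tesseract language data.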

Step 2: Write the code

Suppose the PDF contains both text and images. The following code extracts the text layer directly:

import io
import pytesseract
import sys
 
from PIL import Image
from tika import parser
from wand.image import Image as wi
 
text_raw = parser.from_file("example.pdf")
print(text_raw['content'].strip())
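
parser.from_file returns a dictionary, and in tika-python its 'content' entry can be None when Tika finds no text (a scanned PDF, for example), in which case calling .strip() on it raises an error. The text layer also tends to contain long runs of blank lines. A small sketch of a cleanup step that handles both cases; clean_content is a hypothetical helper, not part of tika, and the dictionary below is a stand-in shaped like tika's return value:

```python
import re

def clean_content(parsed):
    """Return the text from a Tika result dict, or '' when there is none."""
    content = parsed.get("content") or ""
    # Collapse runs of three or more newlines into a single blank line.
    content = re.sub(r"\n{3,}", "\n\n", content)
    return content.strip()

# Stand-in for the dict returned by parser.from_file("example.pdf").
parsed = {"content": "\n\n\nTitle\n\n\n\nBody\n", "metadata": {}}
print(clean_content(parsed))
```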

That is not enough: we also need to recognize the text inside images:

def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    # lang is a Tesseract language code; 'deu' (German) is the default here
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    # Rasterize the PDF pages at the given DPI, then convert them to image_type
    pdf_file = wi(filename=from_file, resolution=resolution)
    image = pdf_file.convert(image_type)
    # Serialize each page to an in-memory image blob
    image_blobs = []
    for img in image.sequence:
        img_page = wi(image=img)
        image_blobs.append(img_page.make_blob(image_type))
    # Run OCR on each page image
    extract = []
    for img_blob in image_blobs:
        page_image = Image.open(io.BytesIO(img_blob))
        text = pytesseract.image_to_string(page_image, lang=lang)
        extract.append(text)
    # Print the recognized text line by line
    for item in extract:
        for line in item.split("\n"):
            print(line)

Putting the two together, the complete code is:

import io
import sys

from PIL import Image
import pytesseract
from wand.image import Image as wi
from tika import parser

def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    # OCR pass: rasterize each page and run Tesseract on it
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    pdf_file = wi(filename=from_file, resolution=resolution)
    image = pdf_file.convert(image_type)
    for img in image.sequence:
        img_page = wi(image=img)
        page_image = Image.open(io.BytesIO(img_page.make_blob(image_type)))
        text = pytesseract.image_to_string(page_image, lang=lang)
        for part in text.split("\n"):
            print(part)

def parse_text(from_file):
    # Text-layer pass: let Tika extract the embedded text
    print("-- Parsing text", from_file, "--")
    text_raw = parser.from_file(from_file)
    print("---------------------------------")
    # 'content' can be None when the PDF has no text layer
    print((text_raw['content'] or '').strip())
    print("---------------------------------")

if __name__ == '__main__':
    # Usage: python run.py <pdf> [lang]; lang falls back to 'deu' when omitted
    lang = sys.argv[2] if len(sys.argv) > 2 else 'deu'
    parse_text(sys.argv[1])
    extract_text_image(sys.argv[1], lang)
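
Reading sys.argv by position gives no help text and no validation. A sketch of the same command line built with argparse from the standard library; build_parser and the --resolution option are illustrative additions, not part of the original script:

```python
import argparse

def build_parser():
    """Command line: run.py <pdf> [lang] [--resolution DPI]."""
    p = argparse.ArgumentParser(
        description="Extract text from a PDF (text layer plus OCR).")
    p.add_argument("pdf", help="path to the PDF file")
    p.add_argument("lang", nargs="?", default="deu",
                   help="Tesseract language code (default: deu)")
    p.add_argument("--resolution", type=int, default=300,
                   help="DPI used when rasterizing pages (default: 300)")
    return p

# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that sys.argv is used.
args = build_parser().parse_args(["example.pdf", "chi_sim"])
print(args.pdf, args.lang, args.resolution)
```

In the script, the parsed values would then be handed to the two functions, for example parse_text(args.pdf) followed by extract_text_image(args.pdf, args.lang, resolution=args.resolution).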

Step 3: Run it

Suppose example.pdf mixes plain text with text embedded in an image.

Run the script from the command line like this:

python run.py example.pdf deu | xargs -0 echo > extract.txt
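
Both functions report their results with print, which is why the shell redirection is enough to produce extract.txt. If the text is needed inside Python instead, one option is to capture standard output with contextlib.redirect_stdout. A sketch under that assumption; capture_output is a hypothetical helper, and demo stands in for parse_text or extract_text_image:

```python
import contextlib
import io

def capture_output(func, *args, **kwargs):
    """Run func and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()

# Stand-in for parse_text / extract_text_image, which print their results.
def demo(name):
    print("-- Parsing text", name, "--")

text = capture_output(demo, "example.pdf")
print(len(text), "characters captured")
```

The captured string can then be written to a file with open(..., "w", encoding="utf-8") or processed further.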

The resulting extract.txt looks like this:

-- Parsing text example.pdf --
---------------------------------
Title pure text

Content pure text

    Slide 1
    Slide 2
---------------------------------
-- Parsing image example.pdf --
---------------------------------
Title pure text

Content pure text

Title in image

Text in image
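
In this output the text-layer lines show up again in the OCR pass, because OCR reads the whole rendered page. If a single merged result is wanted, one option is to keep each distinct line only once, text layer first. A sketch under that assumption; merge_passes is a hypothetical helper, and the exact-match comparison is a simplification, since real OCR output often differs from the text layer by a character or two:

```python
def merge_passes(text_layer, ocr_text):
    """Return the unique non-empty lines of both passes, text layer first."""
    seen = set()
    merged = []
    for line in text_layer.splitlines() + ocr_text.splitlines():
        key = " ".join(line.split()).lower()  # normalize whitespace and case
        if key and key not in seen:
            seen.add(key)
            merged.append(line.strip())
    return merged

text_layer = "Title pure text\n\nContent pure text\n\n    Slide 1\n    Slide 2"
ocr_text = "Title pure text\n\nContent pure text\n\nTitle in image\n\nText in image"
for line in merge_passes(text_layer, ocr_text):
    print(line)
```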

You may ask what to pass as the lang parameter for Simplified Chinese: pass 'chi_sim'. The language codes are listed in the official documentation:

https://github.com/tesseract-ocr/tessdoc/blob/main/Data-Files-in-different-versions.md
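
Tesseract also accepts several language codes joined with '+', which helps with documents that mix languages such as Simplified Chinese and English; each language needs its traineddata file installed. A minimal sketch, where lang_arg is a hypothetical helper that builds the string to pass as lang:

```python
def lang_arg(*codes):
    """Join Tesseract language codes into one lang string."""
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)

print(lang_arg("chi_sim", "eng"))
```

The result can be passed straight through, for example extract_text_image("example.pdf", lang_arg("chi_sim", "eng")).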

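Final words

A script that extracts text from a PDF is not complicated to write: tika, wand, and pytesseract do most of the work and give good results.

Original article: https://blog.csdn.net/somenzz/article/details/124440977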
