Batch-converting HTML to PDF with Python

wkhtmltopdf

Introduction

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.

In short, wkhtmltopdf and wkhtmltoimage are open-source command-line tools that convert HTML into PDF and into various image formats.

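Installation

Download: https://wkhtmltopdf.org/downloads.html
On macOS it can be installed directly with Homebrew; on other systems, pick the matching package from the download page.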

brew install Caskroom/cask/wkhtmltopdf

Usage

  • Download a precompiled binary or build from source
  • Create your HTML document that you want to turn into a PDF (or image)
  • Run your HTML document through the tool.
  • For example, if I really like the treatment Google has done to their logo today and want to capture it forever as a PDF:
    wkhtmltopdf http://google.com google.pdf

In short: download and install the tool, create the HTML file, then run the command.
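
The same command can be driven from Python, which is useful before moving on to a wrapper library. The following is a minimal sketch, assuming the wkhtmltopdf binary is on PATH; the helper names build_command and render are hypothetical and exist only for this illustration.

```python
import shutil
import subprocess


def build_command(source, output, binary="wkhtmltopdf"):
    """Return the argument list for: wkhtmltopdf <source> <output>."""
    return [binary, source, output]


def render(source, output, binary="wkhtmltopdf"):
    """Run wkhtmltopdf; raise CalledProcessError if it exits with an error."""
    # Fail early with a clear message when the binary is not installed.
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} was not found on PATH")
    subprocess.run(build_command(source, output, binary), check=True)
```

Calling render("http://google.com", "google.pdf") would then mirror the command shown above.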

Pdfkit

A JavaScript PDF generation library for Node and the browser.

Introduction

PDFKit is a PDF document generation library for Node and the browser that makes creating complex, multi-page, printable documents easy. It’s written in CoffeeScript, but you can choose to use the API in plain ‘ol JavaScript if you like. The API embraces chainability, and includes both low level functions as well as abstractions for higher level functionality. The PDFKit API is designed to be simple, so generating complex documents is often as simple as a few function calls.

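The description above is for the JavaScript PDFKit library, which only shares its name with the tool used here. The pdfkit package used in this post is a Python wrapper around wkhtmltopdf.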

Installation

pip install pdfkit

(npm install pdfkit installs the JavaScript PDFKit library instead, which is not needed for the Python code below.)

Supported inputs

pdfkit can build a PDF from three kinds of input:

  • URL
  • File
  • String

pdfkit.from_url('https://www.google.com.hk', 'out1.pdf')
pdfkit.from_file('123.html', 'out2.pdf')
pdfkit.from_string('Hello!', 'out3.pdf')
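
The three calls also accept optional settings. The sketch below is one possible way to pass wkhtmltopdf options and an explicit binary location, assuming the Python pdfkit package; the helper names build_options and html_to_pdf_bytes and the chosen option values are assumptions made for this example, not part of the original post.

```python
def build_options(page_size="A4", encoding="UTF-8"):
    """Return an options dict; keys are wkhtmltopdf flag names without the leading dashes."""
    return {"page-size": page_size, "encoding": encoding}


def html_to_pdf_bytes(html, wkhtmltopdf_path=None):
    """Render an HTML string and return the PDF as bytes instead of writing a file."""
    # Imported here so build_options can be used without pdfkit installed.
    import pdfkit

    config = None
    if wkhtmltopdf_path is not None:
        # Point pdfkit at a specific binary when it is not on PATH.
        config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    # Passing False as the output path makes pdfkit return the PDF data.
    return pdfkit.from_string(html, False, options=build_options(), configuration=config)
```

Setting the encoding option is a common way to avoid garbled non-ASCII text in the generated PDF.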

Code example

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@author: medivh
@file: html_to_pdf.py
@time: 2018/12/20
"""

import os
import threading

import pdfkit

# Source directory with the HTML files and destination directory for the PDFs.
src = '/Users/medivh/Downloads/tmp/new/'
desc = '/Users/medivh/Downloads/tmp/new-pdf/'

# Allow at most 10 conversions to run at the same time.
sem = threading.Semaphore(10)

# Create the destination directory if it does not exist yet.
os.makedirs(desc, exist_ok=True)


def ToPdf(filename):
    with sem:
        try:
            with open(src + filename, encoding="utf-8") as f:
                # Build the output name by swapping the extension for .pdf.
                pdf_name = desc + os.path.splitext(filename)[0] + '.pdf'
                pdfkit.from_file(f, pdf_name)
        except Exception:
            # Print the name of any file that failed to convert.
            print(filename)


if __name__ == '__main__':
    # One thread per file; the semaphore caps how many convert concurrently.
    threads = list()
    for i in os.listdir(src):
        t = threading.Thread(target=ToPdf, args=(i,))
        threads.append(t)

    for t in threads:
        t.daemon = True
        t.start()
    for t in threads:
        t.join()

    # Compare the number of source files with the number of generated PDFs.
    start = len(os.listdir(src))
    end = len(os.listdir(desc))
    print(start, end)
    if start == end:
        print('ok')
    else:
        print('no')
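
The script above starts one thread per file and relies on a semaphore to cap concurrency. An alternative, sketched here as a swapped-in technique rather than the author's method, is a fixed-size pool from concurrent.futures, which never creates more than max_workers threads. The names pdf_name_for and convert_all are hypothetical, and the converter is passed in as a parameter so that pdfkit.from_file (or a stand-in) can be supplied by the caller.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pdf_name_for(filename):
    """Swap the file extension for .pdf, e.g. 'a.html' -> 'a.pdf'."""
    return os.path.splitext(filename)[0] + ".pdf"


def convert_all(src_dir, dst_dir, convert, max_workers=10):
    """Convert every .html file in src_dir and return the names that failed.

    `convert` is any callable taking (input_path, output_path),
    for example pdfkit.from_file.
    """
    os.makedirs(dst_dir, exist_ok=True)
    names = sorted(n for n in os.listdir(src_dir) if n.lower().endswith(".html"))

    def convert_one(name):
        try:
            convert(os.path.join(src_dir, name),
                    os.path.join(dst_dir, pdf_name_for(name)))
            return None
        except Exception:
            # Report the failing file instead of stopping the whole batch.
            return name

    # The pool never runs more than max_workers conversions at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(convert_one, names))
    return [name for name in results if name is not None]
```

With the directories from the script, a call such as failed = convert_all(src, desc, pdfkit.from_file) would return the list of files that could not be converted.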

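Problems encountered

  • 'ascii' codec can't decode byte 0xb4 in position 11: ordinal not in range(128)
    • Fix: pass an explicit encoding when opening the file, with open(src + filename, encoding="utf-8")
  • Watch the number of files: with a large batch and no limit on the number of threads, the machine can freeze
    • Fix: cap concurrency with threading.Semaphore(10)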
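
The decoding error in the first item can be reproduced without pdfkit. The sketch below is an illustration under assumptions (the helper name read_html and the sample text are made up for this example): it writes a UTF-8 file containing non-ASCII characters, shows that reading it as ASCII fails, and shows that an explicit encoding="utf-8" succeeds.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read an HTML file using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    # Write a small file whose content is not representable in ASCII.
    with open(path, "w", encoding="utf-8") as f:
        f.write("<p>批量转换</p>")

    try:
        read_html(path, encoding="ascii")
        ascii_ok = True
    except UnicodeDecodeError:
        # Same class of error as the 'ascii' codec message above.
        ascii_ok = False

    text = read_html(path)  # decodes cleanly with the default utf-8
```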

References

http://pdfkit.org/
https://wkhtmltopdf.org/index.html