PDF Processing Workflow

Tools used: ImageMagick, qpdf, ocrmypdf, mat2.

These commands work well together with navi.

1. pre-processing

1.1 after scan: images \(\to\) pdf(s) or pdfs \(\to\) pdf

  1. Start by handling the batch of scanned images.

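# 1. To merge all images into a single PDF
## First name the files by sequence number, e.g. page-%d
## Then try different quality values (the lower the value, the smaller the file); 60-80 is usually enough. Then go straight to the OCR step.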
convert "*.jpg" -quality 100 out-100.pdf
convert "*.jpg" -quality 80 out-80.pdf
convert "*.jpg" -quality 60 out-60.pdf
convert "*.jpg" -quality 50 out-50.pdf

# 2. To convert each image into its own PDF
magick *.jpg +adjoin page-%d.pdf

# 3. If the scan produced multiple PDFs named by page number
## Merge the per-page PDFs into a single PDF
## page-*.pdf is the naming pattern of the per-page PDFs
qpdf --empty --pages $(for i in page-*.pdf; do echo $i 1-z; done) -- out.pdf
  2. Then compress or OCR the resulting PDF: see section 2.1 (processing \(\to\) compress pdf) or section 2.2 (processing \(\to\) ocr pdf).
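One caveat with the merge in case 3: the shell expands page-*.pdf in lexical order, so page-10.pdf lands before page-2.pdf when the page numbers are not zero-padded. A minimal sketch of a workaround follows. It assumes a `sort` that supports version ordering (`-V`, as in GNU coreutils) and file names without whitespace; the helper name `page_args` is made up for illustration.

```shell
# page_args FILE...: print "FILE 1-z" for every argument, in numeric page order.
# Hypothetical helper; it relies on `sort -V` (version sort), which is an
# assumption about the local sort implementation.
page_args() {
  printf '%s\n' "$@" | sort -V | while IFS= read -r f; do
    printf '%s 1-z\n' "$f"
  done
}

# Usage: the unquoted $(...) is split into words on purpose, exactly like
# the for-loop in case 3.
#   qpdf --empty --pages $(page_args page-*.pdf) -- out.pdf
```

Renaming the scans with zero-padded numbers (page-001.pdf, page-002.pdf, ...) avoids the problem without any helper.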

ref:

  1. Anatomy of the Command-line, imagemagick.org.

1.2 after download: grey pdf \(\to\) well-contrast pdf

## First extract a few pages (pp. 10-20) for testing
### Using input.pdf as the input file
qpdf input.pdf --pages . 10-20 -- test.pdf
## Lower the brightness and raise the contrast
### Using test.pdf from the previous step
### experimenting with multiple args
### 1. modulate & contrast
convert -modulate 50 test.pdf test-de.pdf && convert test-de.pdf -contrast -monochrome test-mono.pdf && rm test-de.pdf
convert -modulate 45 test.pdf test-de.pdf && convert test-de.pdf -contrast -monochrome test-mono.pdf && rm test-de.pdf
convert -modulate 40 test.pdf test-de.pdf && convert test-de.pdf -contrast -monochrome test-mono.pdf && rm test-de.pdf
### 2. colorspace & sharpen & contrast
### experimenting with multiple args & combinations
convert -density 150 test.pdf -brightness-contrast 5x25 -sharpen 0x1 test-mono.pdf
convert -density 300 test.pdf -colorspace gray -normalize -level 50%,51% -sharpen 0x1 test-mono.pdf
convert -density 300 test.pdf -colorspace gray -normalize -level 25%,26% -sharpen 0x1 test-mono.pdf
convert -density 300 test.pdf -colorspace gray -normalize -modulate 150 -sharpen 0x1 test-mono.pdf
convert -density 300 test.pdf -contrast -contrast -contrast -contrast -sharpen 0x1 test-mono.pdf
convert -density 300 test.pdf -contrast-stretch 15% -sharpen 0x.5 test-mono.pdf
### 3. Combine them, for example:
convert -density 300 test.pdf -brightness-contrast 5x25 -sharpen 0x1 test-mono.pdf && convert -density 300 -monochrome  test-mono.pdf -compress LZW test-mono-compressed-lzw.pdf
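The experiments above all write to the same test-mono.pdf, so each run overwrites the previous one. The sketch below prints one command per -modulate value, each with its own output name, so the results can be compared side by side. The helper name `modulate_sweep` and the `-modNN` output suffix are made up for illustration. It also folds the two-step "modulate, then contrast and monochrome" pipeline into a single convert call, which should give a similar but not necessarily identical result. File names without spaces are assumed.

```shell
# modulate_sweep INPUT VALUE...: print one convert command per -modulate value.
# Hypothetical helper; it only prints, so nothing runs until the output
# is piped to sh.
modulate_sweep() (
  input=$1
  shift
  for value in "$@"; do
    printf 'convert %s -modulate %s -contrast -monochrome %s\n' \
      "$input" "$value" "${input%.pdf}-mod${value}.pdf"
  done
)

# Review the commands first, then run them:
#   modulate_sweep test.pdf 50 45 40
#   modulate_sweep test.pdf 50 45 40 | sh
```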

ref:

  1. ImageMagick: Increase pdf scan contrast and sharpening, vielhuber.

1.3 after download: djvu \(\to\) well pdf

  1. First open the DjVu file in djview and export it as PDF (here you can set the “maximum image resolution” to 300 dpi, or enable “allow lossy JPEG compression”).

  2. Then compress or OCR the resulting PDF: see section 2.1 (processing \(\to\) compress pdf) or section 2.2 (processing \(\to\) ocr pdf).

2. processing

2.1 compress pdf: pdf \(\to\) compressed & monochromized pdf

  1. Compress the whole PDF directly.

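# compress, method 2
## Compare the four compression methods below at different densities (200-400)
## Extract 5 pages into input-test.pdf and test on that first, then apply the chosen settings to the whole PDF
### Higher density gives a sharper result; lower density gives a smaller file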
### full compress types: None, BZip, Fax, Group4, JPEG, JPEG2000, Lossless, LZW, RLE or Zip.
convert -density 300 -monochrome input-test.pdf -compress fax compressed-fax.pdf
convert -density 300 -monochrome input-test.pdf -compress Group4 compressed-group4.pdf
convert -density 300 -monochrome input-test.pdf -compress LZW compressed-lzw.pdf
convert -density 300 -monochrome input-test.pdf -compress Zip compressed-zip.pdf
  2. Then OCR the resulting PDF: see section 2.2 (processing \(\to\) ocr pdf).
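To pick among the four compressed outputs, it helps to list them by size. The sketch below uses only POSIX `wc` and `sort`; the helper name `size_report` is made up for illustration. It reports file size only, so visual quality still has to be checked by opening the files.

```shell
# size_report FILE...: print "BYTES<TAB>FILE", smallest file first.
# Hypothetical helper built on POSIX wc and sort.
size_report() {
  for f in "$@"; do
    # $(( ... )) strips the padding some wc implementations add
    printf '%d\t%s\n' "$(( $(wc -c < "$f") ))" "$f"
  done | sort -n
}

# Usage:
#   size_report compressed-*.pdf
```

The same helper works for comparing a file before and after mat2 in section 2.2.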

ref:

  1. Monochromization of PDF Files, geistlib.

  2. Annotated List of Command-line Options, -compress type, imagemagick.org.

  3. ImageMagick Tutorial, XahLee.

2.2 ocr pdf: pdf \(\to\) ocred pdf

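## Using compressed-lzw.pdf from the compression step as the example
### mat2 can sometimes shrink a PDF considerably (without any loss of resolution), so try it once before OCR
### This produces compressed-lzw.cleaned.pdf; compare the two file sizes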
mat2 -L compressed-lzw.pdf
## If the text mixes Chinese and English, use -l eng+chi_sim
### If there are many images
ocrmypdf -l eng --jbig2-lossy --optimize 3  --output-type pdf --clean --force-ocr compressed-lzw.cleaned.pdf compressed-lzw-ocred.pdf
### If there are no images
ocrmypdf -l eng --optimize 3  --output-type pdf --clean --force-ocr compressed-lzw.cleaned.pdf compressed-lzw-ocred.pdf
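The two invocations above differ only in the language list and in whether --jbig2-lossy is present. A small sketch that assembles the command from those choices is shown below. The helper name `ocr_cmd` and its argument order are made up for illustration; every flag is copied verbatim from the commands above, and file names without spaces are assumed. It prints the command instead of running it.

```shell
# ocr_cmd LANGS INPUT OUTPUT [lossy]: print the ocrmypdf command from this
# section. Hypothetical helper; pass "lossy" as the 4th argument to add
# --jbig2-lossy (the "many images" variant).
ocr_cmd() (
  langs=$1
  input=$2
  output=$3
  mode=${4:-}
  extra=
  if [ "$mode" = lossy ]; then
    extra='--jbig2-lossy '
  fi
  printf 'ocrmypdf -l %s %s--optimize 3 --output-type pdf --clean --force-ocr %s %s\n' \
    "$langs" "$extra" "$input" "$output"
)

# Usage:
#   ocr_cmd eng+chi_sim compressed-lzw.cleaned.pdf compressed-lzw-ocred.pdf lossy
#   ocr_cmd eng compressed-lzw.cleaned.pdf compressed-lzw-ocred.pdf | sh
```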

2.2.1 About the ocrmypdf --optimize option

Level: Comments

--optimize 0: Disables optimization.

--optimize 1: Enables lossless optimizations, such as transcoding images to more efficient formats. Also compresses other uncompressed objects in the PDF and enables the more efficient “object streams” within the PDF. (If --jbig2-lossy is issued, then lossy JBIG2 optimization is used. The decision to use lossy JBIG2 is separate from standard optimization settings.)

--optimize 2: All of the above, and enables lossy optimizations and color quantization.

--optimize 3: All of the above, and enables more aggressive optimizations and targets lower image quality.

2.2.2 About the ocrmypdf --jbig2-lossy option

# Most Linux users need to install the JBIG2 encoder manually
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install   # sudo is only needed for a system-wide install prefix
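After the install it is worth confirming that the encoder is actually on PATH before relying on --jbig2-lossy. The sketch below is a generic check; the helper name `missing_tools` is made up for illustration, and it assumes the jbig2enc build installs a binary named `jbig2`.

```shell
# missing_tools NAME...: print every command that is not found on PATH.
# Hypothetical helper; prints nothing when all commands are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: no output means everything was found.
#   missing_tools jbig2 ocrmypdf
```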

ref:

  1. PDF optimization, ocrmypdf.

  2. Installing the JBIG2 encoder, ocrmypdf.

3. post-processing

3.1 delete metadata

This step erases the OCR result, so apply it to a PDF that has not been OCRed.

# Main: https://0xacab.org/jvoisin/mat2
## Using compressed-lzw-ocred.pdf from the OCR step as the example
## Show the metadata
mat2 -s compressed-lzw-ocred.pdf
## Remove the metadata
### 1. Write a new PDF with the metadata removed
mat2 -L compressed-lzw-ocred.pdf
### 2. Remove the metadata in place
mat2 --inplace compressed-lzw-ocred.pdf
## Show the metadata of the cleaned PDF
mat2 -s compressed-lzw-ocred.cleaned.pdf

# Alternative: https://matweb.info/

3.2 upload to libgen

When uploading to libgen, using Tor Browser and the default account is recommended.

Also See:

  1. OCRmyPDF tutorial

  2. ImageMagick 6.0.6

  3. Examples of ImageMagick Usage - Legacy Version 6