pdf2djvu - Man Page
creates DjVu files from PDF files
Synopsis
pdf2djvu [{-o | --output} output-djvu-file] [option...] pdf-file...
pdf2djvu {-i | --indirect} index-djvu-file [option...] pdf-file...
Description
This program creates a DjVu file from one or more Portable Document Format files.
Options
pdf2djvu accepts the following options:
Document type, file names
- -o, --output=output-djvu-file
Generate a bundled multi-page document. Write the file into output-djvu-file instead of standard output.
- -i, --indirect=index-djvu-file
Generate an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same directory. The directory must exist and be writable.
- --page-id-template=template
Specifies the naming scheme for page identifiers. Consult the “Template Language” section for the template language description.
The default template is “p{page:04*}.djvu”.
For portability reasons, page identifiers:
- must consist only of lowercase ASCII letters, digits, _, +, - and dot,
- cannot start with a +, - or a dot,
- cannot contain two consecutive dots,
- must end with the .djvu or the .djv extension.
- --page-id-prefix=prefix
Equivalent to “--page-id-template=prefix{page:04*}.djvu”.
- --page-title-template=template
Specifies the template for page titles. Consult the “Template Language” section for the template language description.
The default template is “{label}”.
- --no-page-titles
Don't set page titles.
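The portability rules listed under --page-id-template can be checked mechanically. The following is a minimal Python sketch of such a check, not code taken from pdf2djvu; the function name is_portable_page_id is hypothetical.

```python
import re

# Allowed characters per the rules above:
# lowercase ASCII letters, digits, "_", "+", "-" and ".".
_ALLOWED = re.compile(r"[a-z0-9_+.\-]+")


def is_portable_page_id(page_id: str) -> bool:
    """Return True if page_id satisfies the four portability rules."""
    if not _ALLOWED.fullmatch(page_id):
        return False  # empty, or contains a disallowed character
    if page_id[0] in "+-.":
        return False  # must not start with "+", "-" or "."
    if ".." in page_id:
        return False  # must not contain two consecutive dots
    return page_id.endswith((".djvu", ".djv"))
```

For example, an identifier produced by the default template, such as p0001.djvu, passes this check, while P0001.djvu (uppercase letter) and p0001.pdf (wrong extension) do not.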
Resolution, page size
- -d, --dpi=resolution
Sets the desired resolution to resolution dots per inch. The default is 300 dpi. The allowed range is: 72 ≤ resolution ≤ 6000.
- --media-box
Use MediaBox to determine page size. CropBox is used by default.
- --page-size=widthxheight
Sets the preferred page size to width pixels × height pixels. The actual page size may be altered in order to respect aspect ratio and DjVu limitations on resolution. (This option takes precedence over -d/--dpi.)
- --guess-dpi
Try to guess native resolution by inspecting embedded images. Use with care.
Image quality
- --bg-slices=n+...+n, --bg-slices=n,...,n
Specifies the encoding quality of the IW44 background layer. This option is similar to the -slice option of c44. Consult the c44(1) manual page for details. The default is 72+11+10+10.
- --bg-subsample=n
Specifies the background subsampling ratio. The default is 3. Valid values are integers between 1 and 12, inclusive.
- --fg-colors=default
Try to preserve all the foreground layer colors. This is the default.
- --fg-colors=web
Reduce foreground layer colors to the web palette (216 colors). This option is not recommended.
- --fg-colors=n
Use GraphicsMagick to reduce number of distinct colors in the foreground layer to n. Valid values are integers between 1 and 4080. This option is not recommended.
- --fg-colors=black
Discard any color information from the foreground layer.
- --monochrome
Render pages as monochrome bitmaps. With this option, --bg-... and --fg-... options are not respected.
- --loss-level=n
Specifies the aggressiveness of the lossy compression. The default is 0 (lossless). Valid values are integers between 0 and 200, inclusive. This option is similar to the -losslevel option of cjb2; consult the cjb2(1) manual page for details. This option can be used only if the --monochrome option is also enabled.
- --lossy
Synonym for --loss-level=100.
- --anti-alias
Enable font and vector anti-aliasing. This option is not recommended.
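The value ranges and the --monochrome dependency described in this section can be summarized as a small pre-flight check. This is only a Python sketch of the documented constraints, not pdf2djvu's own validation; the function name check_image_options and its parameters are hypothetical, and a loss_level of None is assumed to mean that --loss-level was not given.

```python
import re
from typing import List, Optional


def check_image_options(bg_subsample: int = 3,
                        fg_colors: str = "default",
                        loss_level: Optional[int] = None,
                        monochrome: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means no problem."""
    problems = []
    if not 1 <= bg_subsample <= 12:
        problems.append("--bg-subsample must be between 1 and 12")
    if fg_colors not in ("default", "web", "black"):
        # Otherwise the value must be an integer n with 1 <= n <= 4080.
        is_number = re.fullmatch(r"[0-9]+", fg_colors) is not None
        if not (is_number and 1 <= int(fg_colors) <= 4080):
            problems.append("--fg-colors=n requires 1 <= n <= 4080")
    if loss_level is not None:
        if not 0 <= loss_level <= 200:
            problems.append("--loss-level must be between 0 and 200")
        if not monochrome:
            problems.append("--loss-level requires --monochrome")
    return problems
```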
Extraction
- --no-metadata
Don't extract the metadata.
By default:
- The following entries of the document information dictionary are extracted: Title, Author, Subject, Creator, Producer, CreationDate, ModDate. Timestamps are formatted according to RFC 3339[1], with date and time components separated by a single space.
- The XMP metadata is extracted (or created) and updated accordingly.
Note
If multiple input documents are specified, only metadata of the first one is taken into account.
- --verbatim-metadata
Keep the original metadata intact.
- --no-outline
Don't extract the document outline.
- --hyperlinks=border-avis
Make hyperlink borders always visible.
By default, a hyperlink border is visible only when the mouse is over the hyperlink.
- --hyperlinks=#RRGGBB
Force the specified border color for hyperlinks.
- --no-hyperlinks, --hyperlinks=none
Don't extract hyperlinks.
- --no-text
Don't extract the text.
- --words
Extract the text. Record the location of every word. This is the default.
- --lines
Extract the text. Record the location of every line, rather than every word.
- --crop-text
Extract no text outside the page boundary.
- --no-nfkc
Do not apply NFKC[2] normalization on the text, except for characters from the Alphabetic Presentation Forms block[3] (U+FB00–U+FB4F), which are normalized unconditionally.
The default is to apply NFKC normalization on all characters.
- --filter-text=command-line
Filter the text through the command-line. The provided filter must preserve whitespace, control characters and decimal digits.
This option implies --no-nfkc.
- -p, --pages=page-range
Specifies pages to convert. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Duplicate page numbers are not allowed. Pages are numbered from 1.
The default is to convert all pages.
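The page-range syntax accepted by -p/--pages can be illustrated with a short parser. This is a Python sketch written from the description above, not pdf2djvu's actual parser; the function name parse_page_range is hypothetical, and rejecting a descending sub-range such as 42-37 is an assumption.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a page-range string such as "17,37-42" into page numbers."""
    pages = []
    seen = set()
    for sub_range in spec.split(","):
        first, dash, last = sub_range.partition("-")
        start = int(first)  # raises ValueError on empty or non-numeric text
        end = int(last) if dash else start
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub_range!r}")
        for page in range(start, end + 1):
            if page in seen:
                raise ValueError(f"duplicate page number: {page}")
            seen.add(page)
            pages.append(page)
    return pages
```

With this sketch, "17,37-42" expands to pages 17 and 37 through 42, while "1-3,2" is rejected because page 2 appears twice.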
Performance
- -j, --jobs=n
Use n threads to perform conversion. The default is to use one thread.
- -j0, --jobs=0
Determine automatically how many threads to use to perform conversion.
Verbosity, help
- -v, --verbose
Display more informational messages while converting the file.
- -q, --quiet
Don't display informational messages while converting the file.
- --version
Output version information and exit.
- -h, --help
Display help and exit.
Environment
The following environment variables affect pdf2djvu on Unix systems:
- OMP_*
Details of runtime behavior with respect to parallelism can be controlled by several environment variables. Please refer to the OpenMP API specification[4] for details.
- TMPDIR
pdf2djvu makes heavy use of temporary files. It will store them in a directory specified by this variable. The default is /tmp.
Template Language
Template syntax
The template language is roughly modeled on the Python string formatting syntax[5].
A template is a piece of text which contains fields, surrounded by curly braces {}. Fields are replaced with appropriately formatted values when the template is evaluated. Moreover, {{ is replaced with a single { and }} is replaced with a single }.
Field syntax
Each field consists of a variable name, optionally followed by a shift, optionally followed by a format specification.
The shift is a signed (i.e. starting with a + or - character) integer.
The format specification consists of a colon, followed by a width specification.
The width specification is a decimal integer defining the minimum field width. If not specified, then the field width will be determined by the content. Preceding the width specification with a zero (0) character enables zero-padding.
The width specification is optionally followed by an asterisk (*) character, which increases the minimum field width to the width of the longest possible content of the variable.
Available variables
- dpage
Page number in the DjVu document.
- page, spage
Page number in the PDF document.
- label
Page label (logical page number) in the PDF document.
This variable is available only for page titles.
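To make the field syntax concrete, here is a minimal Python sketch of a template expander. It is an illustration of the rules described above, not pdf2djvu's implementation. The names expand_template, values and max_values are hypothetical. Several behaviors are assumptions: padding with spaces when zero-padding is not requested, computing the asterisk width from the largest value plus the shift, and inserting string values (such as a page label) unchanged.

```python
import re
from typing import Dict, Union

# Matches "{{", "}}", or a field of the form
# {name[+N|-N][:[0][width][*]]}.
_TOKEN = re.compile(
    r"\{\{|\}\}|"
    r"\{(?P<name>[a-z]+)(?P<shift>[+-][0-9]+)?"
    r"(?::(?P<zero>0)?(?P<width>[0-9]*)(?P<star>\*)?)?\}"
)


def expand_template(template: str,
                    values: Dict[str, Union[int, str]],
                    max_values: Dict[str, int]) -> str:
    """Expand fields in template; max_values holds the largest value
    each integer variable can take (used only for the "*" flag)."""
    def replace(match):
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        name = match.group("name")
        value = values[name]
        if isinstance(value, str):
            return value  # assumption: strings are inserted as-is
        shift = int(match.group("shift") or 0)
        width = int(match.group("width") or 0)
        if match.group("star"):
            # Widen to fit the longest possible content of the variable.
            width = max(width, len(str(max_values[name] + shift)))
        text = str(value + shift)
        return text.zfill(width) if match.group("zero") else text.rjust(width)

    # Stray single braces are left untouched; this sketch does not validate.
    return _TOKEN.sub(replace, template)
```

Under these assumptions, the default template “p{page:04*}.djvu” yields p0007.djvu for page 7 of a 120-page document, and p00007.djvu for page 7 of a 12345-page document, because the asterisk widens the field to five digits.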
Implementation Details
Layer separation algorithm
Unless the --monochrome option is on, pdf2djvu uses the following naive layer separation algorithm:
For each page, do the following:
- 1. Rasterize the page into a pixmap, in the usual manner.
- 2. Rasterize the page into another pixmap, omitting the following page elements:
  - text,
  - 1 bit-per-pixel raster images,
  - vector elements (except fills of large areas).
- 3. Compare both pixmaps, pixel by pixel:
  - If their colors match, classify the pixel as a part of the background layer.
  - Otherwise, classify the pixel as a part of the foreground layer.
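The pixel-by-pixel comparison step can be sketched in a few lines of Python. This is only an illustration of the comparison, not pdf2djvu's code: the two rasterizations are assumed to be already available as equally sized grids of RGB tuples, and the name separate_layers is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (red, green, blue)
Pixmap = List[List[Pixel]]     # rows of pixels


def separate_layers(full: Pixmap, reduced: Pixmap) -> List[List[bool]]:
    """Return a mask that is True where a pixel belongs to the foreground.

    full:    the page rasterized in the usual manner
    reduced: the page rasterized without text, 1 bit-per-pixel images
             and most vector elements
    """
    return [
        [full_px != reduced_px for full_px, reduced_px in zip(full_row, reduced_row)]
        for full_row, reduced_row in zip(full, reduced)
    ]
```

Pixels whose colors match in both pixmaps come out as False (background layer); all others come out as True (foreground layer).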
Bug Reports
If you find a bug in pdf2djvu, please report it at the issue tracker[6] or to the mailing list[7].
See Also
Notes
- 1. RFC 3339
https://www.ietf.org/rfc/rfc3339
- 2. NFKC
https://unicode.org/reports/tr15/
- 3. Alphabetic Presentation Forms block
https://unicode.org/charts/PDF/UFB00.pdf
- 4. OpenMP API specification
https://www.openmp.org/specifications/
- 5. Python string formatting syntax
https://docs.python.org/2/library/string.html#format-string-syntax
- 6. the issue tracker
https://github.com/jwilk/pdf2djvu/issues
- 7. the mailing list
https://groups.io/g/pdf2djvu