Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 25 additions & 73 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,102 +1,54 @@
# libpdfrip

libpdfrip is a high-performance PDF rendering and analysis tool built on top of **libpdfio** and the **Cairo 2D** graphics library. It provides efficient page rendering to PNG format and includes a content stream inspection utility for debugging and development.
libpdfrip is a C library for rendering PDF pages to PNG images using the Cairo graphics library and PDFio for PDF parsing.

## Features
## Purpose

* Render individual PDF pages directly to PNG.
* Configurable output resolution (DPI).
* Content stream analysis mode for inspecting PDF operator usage.
* Optional verbose logging for detailed diagnostics.
* Flexible output naming conventions to support automation and testing.
libpdfrip provides PDF page rendering functionality for applications that need to convert PDF documents to raster images. The library uses PDFio to parse PDF structure and Cairo to render vector graphics and text to PNG output.

## Dependencies
## Requirements

The following libraries and tools must be installed:
The following tools and libraries are required to build libpdfrip:

* C compiler (gcc or clang)
* make
* pkg-config
* libpdfio (development headers)
* cairo (development headers)
* freetype2 (development headers)
* libpng (development headers)
* PDFio
* Cairo
* FreeType2
* libpng

### Debian/Ubuntu Installation
### Installing tools on Debian/Ubuntu

```
Install the required packages with:

```bash
sudo apt-get install build-essential pkg-config libpdfio-dev libcairo2-dev libfreetype6-dev libpng-dev
```

## Building
## Building libpdfrip

```
To build libpdfrip from source:

```bash
git clone https://github.com/OpenPrinting/libpdfrip.git
cd libpdfrip
make
```

This produces:
The build produces the following executables:

* `pdf2cairo/pdf2cairo_main` – primary rendering and analysis tool
* `testpdf2cairo` – test runner

## Usage

```
./pdf2cairo/pdf2cairo_main [options] input.pdf
```

### Options

| Flag | Argument | Description |
| ----------- | -------------- | ------------------------------------------------------------------ |
| `--analyze` | | Analyze PDF content streams instead of rendering output. |
| `--help` | | Display usage information. |
| `-o` | `<output.png>` | Output PNG filename when rendering. |
| `-p` | `<pagenum>` | Page number to process (default: 1). |
| `-r` | `<dpi>` | Output resolution in DPI (default: 72). |
| `-t` | | Generate a temporary output filename (requires `-d`). |
| `-d` | `<directory>` | Output directory when using `-t`. |
| `-T` | | Generate a temporary filename inside `testfiles/renderer-output/`. |
| `-v` | | Enable verbose diagnostic output. |

### Examples

Render page 1 to PNG:

```
./pdf2cairo/pdf2cairo_main -o output.png document.pdf
```
* `pdf2cairo/pdf2cairo_main` - PDF rendering and analysis tool
* `testpdf2cairo` - test runner

Render page 5 at 300 DPI:
## Documentation

```
./pdf2cairo/pdf2cairo_main -p 5 -r 300 -o high-res.png document.pdf
```

Analyze page 2 content stream:

```
./pdf2cairo/pdf2cairo_main --analyze -p 2 document.pdf
```

## Testing

```
make test
```

Test output images are written to:

```
testfiles/renderer-output/
```
Detailed contributor documentation is available in the `docs/` directory, including background material on PDF internals and Form XObjects.

## Contributing

Contributions are welcomed. All pull requests must:
Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on submitting pull requests and reporting issues.

* Pass the existing test suite (`make test`).
* Follow the current code structure and formatting conventions.
## License

See [LICENSE](LICENSE) and [NOTICE](NOTICE) for license information.
187 changes: 187 additions & 0 deletions docs/understanding-pdfs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,187 @@
# Understanding PDFs (For Contributors)

This document provides background information on PDF internals relevant to understanding and contributing to the libpdfrip codebase. It focuses on the practical structure of PDF files as encountered by a PDF renderer, rather than on exhaustive coverage of the PDF specification.

---

## PDFs as Instruction Streams

A PDF file is not a static document format. Instead, it consists of a sequence of drawing instructions that are interpreted by a rendering engine to produce visual output. These instructions operate on a graphics state and are executed sequentially.

A PDF rendering engine processes commands such as:
- Moving the current drawing position
- Constructing vector paths
- Filling or stroking shapes
- Placing text or raster images

---

## Pages

A PDF document contains one or more pages. Each page consists of the following components:

- **Resources**: A collection of fonts, images, and reusable objects required by the page.
- **Content Stream**: A stream of PDF operators that describe how the page is rendered.
- **MediaBox**: The physical dimensions of the page in user space coordinates.

---

## Content Streams

The content stream contains the primary drawing commands for a page. It is a sequence of PDF operators and operands that modify the graphics state and generate output.

Example content stream:

```
10 20 m % Move to (10, 20)
100 20 l % Line to (100, 20)
100 100 l % Line to (100, 100)
10 100 l % Line to (10, 100)
h % Close path
S % Stroke (draw the outline)
```

The interpreter processes these operators sequentially and issues corresponding drawing commands to the underlying graphics library (Cairo in the case of libpdfrip).

---

## XObjects (External Objects)

XObjects are reusable content objects that can be referenced multiple times within a PDF document. This mechanism avoids duplication of drawing commands and reduces file size.

There are two primary types of XObjects:

- **Image XObjects**: Embedded raster images (JPEG, PNG, etc.)
- **Form XObjects**: Reusable vector graphics and text content

Form XObjects function analogously to subroutines in procedural programming—they are defined once and invoked as needed.

---

## Form XObjects: Detailed Specification

Form XObjects are prevalent in production PDF files. A thorough understanding of Form XObjects is essential for working with the libpdfrip interpreter.

### Definition

A Form XObject is a self-contained stream of PDF content that can be referenced and rendered multiple times. It encapsulates drawing commands along with the necessary resources (fonts, images, etc.) and transformation parameters.

Example use case: A company logo appearing on every page of a document. Rather than embedding the logo's drawing commands on each page, the logo is defined once as a Form XObject and referenced multiple times. This approach reduces file size and may improve rendering performance through caching.

### Structure of a Form XObject

A Form XObject is a PDF stream object with a dictionary containing the following key entries:

```
<<
/Type /XObject
/Subtype /Form % Identifies this as a Form XObject
/BBox [0 0 100 50] % Bounding box defining coordinate space
/Matrix [1 0 0 1 0 0] % Transformation matrix
/Resources << ... >> % Resource dictionary
>>
stream
% Drawing commands (identical in structure to page content streams)
1 0 0 rg % Set fill color to red
0 0 100 50 re % Define rectangle from (0,0) to (100,50)
f % Fill the rectangle
endstream
```

### Key Dictionary Entries

#### /Subtype

The `/Subtype` entry specifies the XObject type. The value `/Form` indicates a Form XObject, as distinct from `/Image` for Image XObjects. This entry is mandatory for distinguishing between XObject types.

#### /BBox (Bounding Box)

Syntax: `/BBox [x_min y_min x_max y_max]`

The bounding box defines the coordinate system for the Form XObject's content. All drawing operations within the Form are interpreted relative to this coordinate space.

Example: `/BBox [0 0 200 100]` establishes a coordinate space extending from (0, 0) to (200, 100).

#### /Matrix (Transformation Matrix)

Syntax: `/Matrix [a b c d e f]`

The transformation matrix is a 6-element array representing a 2D affine transformation. This matrix transforms the Form's coordinate space when the Form is rendered within a page or another Form XObject.

Default value: `[1 0 0 1 0 0]` (identity matrix, indicating no transformation)

The transformation matrix supports the following operations:
- **Scaling**: Adjusting the size of the Form content
- **Rotation**: Rotating the Form content
- **Translation**: Repositioning the Form content
- **Skewing**: Applying shear transformations (uncommon)

The matrix represents a standard 2D affine transformation as used in linear algebra and computer graphics.

#### /Resources

A Form XObject may include its own Resources dictionary, similar to a page:

```
/Resources <<
/Font << /F1 10 0 R >>
/XObject << /Image1 20 0 R >>
>>
```

This dictionary specifies the fonts, images, and nested Form XObjects required by the Form's content stream.

### The Do Operator: Invoking a Form XObject

A Form XObject is invoked using the `Do` operator. The Form must first be registered in the current Resources dictionary:

```
% Page's Resources dictionary:
/Resources <<
/XObject << /Logo 42 0 R >> % "Logo" references a Form XObject
>>

% Page's content stream:
q % Save graphics state
1 0 0 1 100 200 cm % Apply transformation (translate to 100, 200)
/Logo Do % Invoke the Form XObject named "Logo"
Q % Restore graphics state
```

When the renderer encounters a `Do` operator, it performs the following sequence:

1. **Save the current graphics state**: Preserves the state prior to executing the Form.
2. **Apply the Form's `/Matrix` transformation**: Transforms the coordinate space according to the Form's matrix.
3. **Activate the Form's `/Resources`**: Makes the Form's resources available during execution.
4. **Execute the Form's content stream**: Processes the drawing operators contained in the Form.
5. **Restore the previous graphics state**: Returns to the saved state.

This sequence is analogous to a function call in procedural programming, where the content stream serves as the function body.

### Common Use Cases for Form XObjects

Form XObjects are used in the following scenarios:

- **Repeated content**: Headers, footers, logos, watermarks
- **File size optimization**: Complex graphics defined once and referenced multiple times
- **Interactive PDF forms**: Form field appearances (buttons, checkboxes, text fields) are often implemented using Form XObjects
- **Content organization**: Logical structuring of content, including optional content groups (layers)

Support for Form XObjects is essential for correctly rendering the majority of real-world PDF documents.

---

## Form XObject Handling in libpdfrip

The libpdfrip interpreter implements Form XObject support through the following mechanism:

1. **Detection**: The interpreter identifies the `Do` operator in the content stream.
2. **Resolution**: The XObject name is resolved by looking up the name in the current Resources dictionary.
3. **Type checking**: The interpreter verifies that the XObject is a Form (as opposed to an Image).
4. **State management**: The current graphics state is saved.
5. **Transformation**: The Form's `/Matrix` is applied to the graphics state.
6. **Recursive execution**: The Form's content stream is processed recursively by the interpreter.
7. **State restoration**: The saved graphics state is restored.

This recursive processing model supports arbitrary nesting of Form XObjects within other Form XObjects. Proper state management is critical to correct rendering, as Form XObjects may modify the graphics state in ways that must not persist after the Form completes execution.