Files

PDF

Portable Document Format File Connector

Introduction

The Portable Document Format connector transforms static documents into dynamic, queryable data. It allows users to extract structured information from complex PDF files—such as reports, invoices, statements, or tables—using standard SQL, without manual data entry, copy-paste, or custom parsing scripts.

At its core, the connector treats each PDF as a virtual table. It uses configurable extraction algorithms to identify tabular regions on the page and infer rows and columns, even across multiple pages. When you query a PDF, Metaform reads the document’s text layout, reconstructs its logical table structure, and exposes each cell as a field in a structured dataset. With this approach, you can join PDFs to spreadsheets, databases, or JSON files just as you would any other source.

This connector is built for flexibility. For cleanly formatted reports, automatic detection may be enough. But for more complex documents—where headers span pages, columns shift, or numbers include formatting characters—you can fine-tune extraction using advanced options that control how pages are combined, how headers are interpreted, and which extraction algorithm (spreadsheet or stream) is applied.

In short, the PDF connector bridges the world of static documents and live data analysis—turning any table trapped in a PDF into something you can query, aggregate, and reason about.

Scenario

Imagine receiving state-by-state donation reports as multi-page PDF exports from an external system every month. At first glance, these reports look simple enough—rows of donors with addresses, cities, states, and contribution amounts. But once you open them, the challenges become obvious: formatting shifts from page to page, headers repeat inconsistently, numeric formats vary, and even city names appear with multiple spellings.

What you really need is a way to analyze all regions together—aggregate donations by state and city, rank markets by performance, and identify where contributions are strongest. Doing that manually, though—copying rows into spreadsheets, cleaning each page, and reconciling totals—is slow and error-prone.

The table below is a representative excerpt from one of these monthly reports. It reflects exactly how the external system exports its data. The PDFs you receive contain many pages of rows formatted just like this sample—structured, but locked inside a document format that wasn’t designed for analysis.

Last Name	First Name	Address	City	State	Zip	Date	Amount
Lidstad	Dick & Peg	62 Mississippi River Blvd N	Saint Paul	MN	55104	10/12/2012	60.00
Strom	Pam	1229 Hague Ave	St. Paul	MN	55104	9/12/2012	60.00
Seeba	Louise & Paul	1399 Sheldon St	Saint Paul	MN	55108	10/12/2012	60.00

This excerpt illustrates the core problem: the data is structured but trapped. It exists in rows and columns, but only visually. Until now, turning this into something you can query required manual labor.

Putting It Together

With Metaform, the friction described in the scenario simply disappears.

What once required hours of copy-and-paste cleanup becomes a single, repeatable SQL query you can run in seconds. Instead of wrestling with PDFs, you can immediately see which cities and states are contributing the most—and repeat the analysis every month without ever touching the source files again.

Once Metaform’s PDF connector is configured, each document is treated as a collection of structured records: rows and columns embedded inside a static PDF layout. The connector’s extraction parameters tell Metaform how to detect tables, interpret headers, normalize values, and produce a clean, typed schema that behaves exactly like a SQL table.

To convert this visually structured—but operationally unstructured—data into a queryable dataset, you can define an aggregation query like the following:

SELECT 
  State,
  City,
  SUM(
    CAST(
      REPLACE(REPLACE(Amount, '$', ''), ',', '') AS DOUBLE
    )
  ) AS `Total Amount`
FROM 
  table(
    pdf.`data.pdf` (
      type => 'pdf',
      combinePages => true,
      extractHeaders => true,
      extractionAlgorithm => 'spreadsheet'
    )
  )
WHERE 
  State IS NOT NULL 
  AND TRIM(State) <> ''
GROUP BY 
  State,
  City
ORDER BY 
  State ASC,
  `Total Amount` DESC;

Metaform then returns a clean, structured result—ready for dashboards, downstream analysis, or further enrichment:

Column	Type	Example Value
`State`	VARCHAR	`Virginia`
`City`	VARCHAR	`Charlottesville`
`Total Amount`	DOUBLE	`1000.00`

By treating PDFs as first-class, queryable datasets, Metaform turns document-driven reporting into a reliable, automated analytics workflow—no manual cleanup, no brittle exports, and no loss of fidelity across months or markets.

How It Works

Within the SQL query, each option in the table function tells Metaform how to interpret the structure of the PDF and transform it into a clean, queryable dataset:

Option	Description
`type`	Signals that the file should be processed by the PDF connector, enabling Metaform’s table-detection engine to interpret cells, rows, and column boundaries directly from the document.
`combinePages`	Consolidates all pages into a single logical table so multi-page summaries, repeated headers, and cross-page rows are handled as one unified dataset.
`extractHeaders`	Uses the first detected row as column names. Metaform then automatically types these fields (e.g., `State`, `City`, `Amount`) as standard SQL `VARCHAR` columns.
`extractionAlgorithm`	Selects the `spreadsheet` algorithm, optimized for grid-based layouts where text blocks align cleanly across rows and columns.

Inside the SQL query:

The nested REPLACE() calls strip currency symbols and thousands separators from the Amount column, preparing the values for numerical operations.
The CAST(... AS DOUBLE) ensures that those cleaned values are typed as numeric data, making them eligible for aggregation.
The GROUP BY and ORDER BY clauses summarize totals by location—grouping at the state and city level, then ordering states alphabetically and ranking cities by contribution amount.

Together, these components allow Metaform to transform a static, multi-page PDF into a fully typed relational dataset—ready for analysis with the same SQL you use everywhere else.

Working Example

To explore how Metaform handles real-world PDF reports—like the multi-page donation files in the scenario—walk through the steps below in order.

Each step introduces a key part of the workflow.

Installing the Connector

The PDF connector is provided as a standalone JSON descriptor that registers a new storage plugin with your Metaform instance. This descriptor declares the connector type, the file extensions it supports, and all available parsing and extraction parameters. Installing the connector is a one-time operation that loads these capabilities directly into Metaform’s storage subsystem.

To install the connector, run the command below for your operating system’s Docker Desktop environment.

# Download the connector JSON and pipe it directly into Metaform's connector API
curl -sSL https://docs.metaform.com/resources/examples/pdf-connector.json \
  | curl -X POST -H "Content-Type: application/json" -d @- http://localhost:8047/storage/pdf.json

# Download the connector JSON and pipe it directly into Metaform's connector API
curl -sSL https://docs.metaform.com/resources/examples/pdf-connector.json \
  | curl -X POST -H "Content-Type: application/json" -d @- http://localhost:8047/storage/pdf.json

# Download the connector JSON into a string
$json = Invoke-WebRequest `
  -Uri "https://docs.metaform.com/resources/examples/pdf-connector.json" `
  -UseBasicParsing |
  Select-Object -ExpandProperty Content

# Upload the JSON to Metaform's connector API
Invoke-RestMethod `
  -Uri "http://localhost:8047/storage/pdf.json" `
  -Method Post `
  -ContentType "application/json" `
  -Body $json

This command retrieves the connector definition and registers it with Metaform through the REST API. After registration, the pdf storage plugin appears in the Storage tab and becomes available for immediate use in queries—no service restart or additional configuration required.

Download the Data

With the connector installed, you can now retrieve the example PDF file.

Download it using the following URL:

https://docs.metaform.com/resources/examples/data.pdf

Save the file to a location accessible to your Metaform instance (for example, ~/data.pdf).

Review the Data

The example PDF encodes a set of donation records as structured rows, including fields such as name, address, occupation, employer, date, and contribution amount. Each row represents an individual or organizational contribution within a defined reporting period.

Key attributes—most notably State, City, and Amount—can be queried, grouped, and aggregated to support geographic or trend-based analysis.

Query the Data

With the connector installed and the example PDF downloaded, you can now execute the query in the Metaform Console to validate extraction and examine the resulting structure.

Copy the SQL statement below and paste it into the query field within the Query tab of the Metaform Console, then run the query to view the parsed dataset.

SELECT
  State,
  City,
  SUM(
    CAST(
      REPLACE(REPLACE(Amount, '$', ''), ',', '') AS DOUBLE
    )
  ) AS `Total Amount`
FROM
  table(
    pdf.`data.pdf` (
      type => 'pdf',
      combinePages => true,
      extractHeaders => true,
      extractionAlgorithm => 'spreadsheet'
    )
  )
WHERE
  State IS NOT NULL
  AND TRIM(State) <> ''
GROUP BY
  State,
  City
ORDER BY
  State ASC,
  `Total Amount` DESC;

What You Should See

If the connector is installed, the example PDF is available in Docker Desktop’s working directory, and the query executes successfully, Metaform will:

Detect the .pdf file through the registered PDF connector
Infer a clean schema from the document’s tabular layout
Return a three-column result set derived from the extracted data

At this point, you have validated the complete workflow: Metaform located the PDF, parsed its structure, and exposed the extracted table as a queryable dataset.

Advanced Configuration

You’ll often want greater control—such as defining how pages are combined, how table headers are interpreted, and which extraction algorithm best suits your document layout.

All of these advanced options are set at query time, allowing you to fine-tune how Metaform reads and structures PDF data without altering the source file itself.

The table below summarizes the complete set of available configuration parameters:

Option	Default	Description
combinePages	`false`	When `true`, all pages in the PDF are merged into a single logical table before extraction. This is ideal for multi-page reports where headers repeat or totals span across pages. When `false`, each page is processed separately.
extractHeaders	`false`	When `true`, the first detected row is treated as the column header and used to name output fields. When `false`, columns are automatically named as `field_1`, `field_2`, etc.
extractionAlgorithm	`basic`	Controls the algorithm used to detect tables. Use `spreadsheet` for PDFs with clearly defined rows and columns, or `basic` for text-based layouts where cell positions are irregular or column boundaries are inferred by spacing.
password	(optional)	Password to decrypt protected PDFs. If not provided, encrypted documents will be skipped or raise an error depending on the configuration.

We’re actively preparing more detailed documentation and will be adding it here shortly.

Microsoft Excel

Microsoft Excel File Connector

PSV

Pipe-Separated Values File Connector