Introduction
The Portable Document Format connector transforms static documents into dynamic, queryable data. It allows users to extract structured information from complex PDF files—such as reports, invoices, statements, or tables—using standard SQL, without manual data entry, copy-paste, or custom parsing scripts.
At its core, the connector treats each PDF as a virtual table. It uses configurable extraction algorithms to identify tabular regions on the page and infer rows and columns, even across multiple pages. When you query a PDF, Metaform reads the document’s text layout, reconstructs its logical table structure, and exposes each cell as a field in a structured dataset. With this approach, you can join PDFs to spreadsheets, databases, or JSON files just as you would any other source.
This connector is built for flexibility. For cleanly formatted reports, automatic detection may be enough. But for more complex documents—where headers span pages, columns shift, or numbers include formatting characters—you can fine-tune extraction using advanced options that control how pages are combined, how headers are interpreted, and which extraction algorithm (spreadsheet or stream) is applied.
In short, the PDF connector bridges the world of static documents and live data analysis—turning any table trapped in a PDF into something you can query, aggregate, and reason about.
Scenario
Imagine receiving state-by-state donation reports as multi-page PDF exports from an external system every month. At first glance, these reports look simple enough—rows of donors with addresses, cities, states, and contribution amounts. But once you open them, the challenges become obvious: formatting shifts from page to page, headers repeat inconsistently, numeric formats vary, and even city names appear with multiple spellings.
What you really need is a way to analyze all regions together—aggregate donations by state and city, rank markets by performance, and identify where contributions are strongest. Doing that manually, though—copying rows into spreadsheets, cleaning each page, and reconciling totals—is slow and error-prone.
The table below is a representative excerpt from one of these monthly reports. It reflects exactly how the external system exports its data. The PDFs you receive contain many pages of rows formatted just like this sample—structured, but locked inside a document format that wasn’t designed for analysis.
| Last Name | First Name | Address | City | State | Zip | Date | Amount |
|---|---|---|---|---|---|---|---|
| Lidstad | Dick & Peg | 62 Mississippi River Blvd N | Saint Paul | MN | 55104 | 10/12/2012 | 60.00 |
| Strom | Pam | 1229 Hague Ave | St. Paul | MN | 55104 | 9/12/2012 | 60.00 |
| Seeba | Louise & Paul | 1399 Sheldon St | Saint Paul | MN | 55108 | 10/12/2012 | 60.00 |
This excerpt illustrates the core problem: the data is structured but trapped. It exists in rows and columns, but only visually. Until now, turning this into something you can query required manual labor.
Putting It Together
With Metaform, the friction described in the scenario simply disappears.
What once required hours of copy-and-paste cleanup becomes a single, repeatable SQL query you can run in seconds. Instead of wrestling with PDFs, you can immediately see which cities and states are contributing the most—and repeat the analysis every month without ever touching the source files again.
Once Metaform’s PDF connector is configured, each document is treated as a collection of structured records: rows and columns embedded inside a static PDF layout. The connector’s extraction parameters tell Metaform how to detect tables, interpret headers, normalize values, and produce a clean, typed schema that behaves exactly like a SQL table.
To convert this visually structured—but operationally unstructured—data into a queryable dataset, you can define an aggregation query like the following:
SELECT
State,
City,
SUM(
CAST(
REPLACE(REPLACE(Amount, '$', ''), ',', '') AS DOUBLE
)
) AS `Total Amount`
FROM
table(
pdf.`data.pdf` (
type => 'pdf',
combinePages => true,
extractHeaders => true,
extractionAlgorithm => 'spreadsheet'
)
)
WHERE
State IS NOT NULL
AND TRIM(State) <> ''
GROUP BY
State,
City
ORDER BY
State ASC,
`Total Amount` DESC;
Metaform then returns a clean, structured result—ready for dashboards, downstream analysis, or further enrichment:
| Column | Type | Example Value |
|---|---|---|
State | VARCHAR | Virginia |
City | VARCHAR | Charlottesville |
Total Amount | DOUBLE | 1000.00 |
By treating PDFs as first-class, queryable datasets, Metaform turns document-driven reporting into a reliable, automated analytics workflow—no manual cleanup, no brittle exports, and no loss of fidelity across months or markets.
How It Works
Within the SQL query, each option in the table function tells Metaform how to interpret the structure of the PDF and transform it into a clean, queryable dataset:
| Option | Description |
|---|---|
type | Signals that the file should be processed by the PDF connector, enabling Metaform’s table-detection engine to interpret cells, rows, and column boundaries directly from the document. |
combinePages | Consolidates all pages into a single logical table so multi-page summaries, repeated headers, and cross-page rows are handled as one unified dataset. |
extractHeaders | Uses the first detected row as column names. Metaform then automatically types these fields (e.g., State, City, Amount) as standard SQL VARCHAR columns. |
extractionAlgorithm | Selects the spreadsheet algorithm, optimized for grid-based layouts where text blocks align cleanly across rows and columns. |
Inside the SQL query:
- The nested
REPLACE()calls strip currency symbols and thousands separators from theAmountcolumn, preparing the values for numerical operations. - The
CAST(... AS DOUBLE)ensures that those cleaned values are typed as numeric data, making them eligible for aggregation. - The
GROUP BYandORDER BYclauses summarize totals by location—grouping at the state and city level, then ordering states alphabetically and ranking cities by contribution amount.
Together, these components allow Metaform to transform a static, multi-page PDF into a fully typed relational dataset—ready for analysis with the same SQL you use everywhere else.
Working Example
To explore how Metaform handles real-world PDF reports—like the multi-page donation files in the scenario—walk through the steps below in order.
Each step introduces a key part of the workflow.
Installing the Connector
The PDF connector is provided as a standalone JSON descriptor that registers a new storage plugin with your Metaform instance. This descriptor declares the connector type, the file extensions it supports, and all available parsing and extraction parameters. Installing the connector is a one-time operation that loads these capabilities directly into Metaform’s storage subsystem.
To install the connector, run the command below for your operating system’s Docker Desktop environment.
# Download the connector JSON and pipe it directly into Metaform's connector API
curl -sSL https://docs.metaform.com/resources/examples/pdf-connector.json \
| curl -X POST -H "Content-Type: application/json" -d @- http://localhost:8047/storage/pdf.json
# Download the connector JSON and pipe it directly into Metaform's connector API
curl -sSL https://docs.metaform.com/resources/examples/pdf-connector.json \
| curl -X POST -H "Content-Type: application/json" -d @- http://localhost:8047/storage/pdf.json
# Download the connector JSON into a string
$json = Invoke-WebRequest `
-Uri "https://docs.metaform.com/resources/examples/pdf-connector.json" `
-UseBasicParsing |
Select-Object -ExpandProperty Content
# Upload the JSON to Metaform's connector API
Invoke-RestMethod `
-Uri "http://localhost:8047/storage/pdf.json" `
-Method Post `
-ContentType "application/json" `
-Body $json
This command retrieves the connector definition and registers it with Metaform through the REST API. After registration, the pdf storage plugin appears in the Storage tab and becomes available for immediate use in queries—no service restart or additional configuration required.
Download the Data
With the connector installed, you can now retrieve the example PDF file.
Download it using the following URL:
https://docs.metaform.com/resources/examples/data.pdf
Save the file to a location accessible to your Metaform instance (for example, ~/data.pdf).
Review the Data
The example PDF encodes a set of donation records as structured rows, including fields such as name, address, occupation, employer, date, and contribution amount. Each row represents an individual or organizational contribution within a defined reporting period.
Key attributes—most notably State, City, and Amount—can be queried, grouped, and aggregated to support geographic or trend-based analysis.
Query the Data
With the connector installed and the example PDF downloaded, you can now execute the query in the Metaform Console to validate extraction and examine the resulting structure.
Copy the SQL statement below and paste it into the query field within the Query tab of the Metaform Console, then run the query to view the parsed dataset.
SELECT
State,
City,
SUM(
CAST(
REPLACE(REPLACE(Amount, '$', ''), ',', '') AS DOUBLE
)
) AS `Total Amount`
FROM
table(
pdf.`data.pdf` (
type => 'pdf',
combinePages => true,
extractHeaders => true,
extractionAlgorithm => 'spreadsheet'
)
)
WHERE
State IS NOT NULL
AND TRIM(State) <> ''
GROUP BY
State,
City
ORDER BY
State ASC,
`Total Amount` DESC;
What You Should See
If the connector is installed, the example PDF is available in Docker Desktop’s working directory, and the query executes successfully, Metaform will:
- Detect the
.pdffile through the registered PDF connector - Infer a clean schema from the document’s tabular layout
- Return a three-column result set derived from the extracted data
At this point, you have validated the complete workflow: Metaform located the PDF, parsed its structure, and exposed the extracted table as a queryable dataset.
Advanced Configuration
You’ll often want greater control—such as defining how pages are combined, how table headers are interpreted, and which extraction algorithm best suits your document layout.
All of these advanced options are set at query time, allowing you to fine-tune how Metaform reads and structures PDF data without altering the source file itself.
The table below summarizes the complete set of available configuration parameters:
| Option | Default | Description |
|---|---|---|
| combinePages | false | When true, all pages in the PDF are merged into a single logical table before extraction. This is ideal for multi-page reports where headers repeat or totals span across pages. When false, each page is processed separately. |
| extractHeaders | false | When true, the first detected row is treated as the column header and used to name output fields. When false, columns are automatically named as field_1, field_2, etc. |
| extractionAlgorithm | basic | Controls the algorithm used to detect tables. Use spreadsheet for PDFs with clearly defined rows and columns, or basic for text-based layouts where cell positions are irregular or column boundaries are inferred by spacing. |
| password | (optional) | Password to decrypt protected PDFs. If not provided, encrypted documents will be skipped or raise an error depending on the configuration. |