Introduction

Architecture

Anatomy of the Platform

Metaform is designed from the ground up to combine data sources in situ while delivering high performance across disparate systems. Its architecture incorporates several core design elements that enable interactive query speeds at scale.


Distributed Engine

Metaform utilizes a distributed query execution model, enabling users to submit requests to any node within the cluster. Queries are parallelized across nodes, with each node handling a portion of the work. Adding nodes increases capacity, enabling the system to scale seamlessly for higher data volumes, more concurrent users, or faster performance.
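Metaform's internal scheduler is not shown here, but the scatter/gather pattern it describes can be sketched in a few lines. The function names (`distributed_sum`, `node_partial_sum`) and the thread-per-node model are illustrative assumptions, not Metaform APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def node_partial_sum(partition):
    """Work done by a single simulated node: aggregate its local rows."""
    return sum(partition)

def distributed_sum(rows, num_nodes=4):
    # The coordinator splits the data into one partition per node...
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]
    # ...each node computes a partial result in parallel...
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_partial_sum, partitions))
    # ...and the coordinator merges the partial results.
    return sum(partials)

print(distributed_sum(range(1, 101)))  # → 5050
```

Because the partial results combine associatively, raising `num_nodes` changes only how the work is divided, not the answer, which is what lets capacity scale by adding nodes.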


Columnar Execution

Metaform's in-memory data model is both hierarchical and columnar, which aligns naturally with modern data workloads. When data is stored in columnar formats such as Parquet, Metaform:

  • Skips reading columns not needed by the query, minimizing disk I/O.
  • Executes SQL directly on columnar data without row materialization.

This approach reduces memory usage, eliminates unnecessary computation, and accelerates analytical workloads.
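As a rough sketch of column pruning (the table layout and `project` helper below are hypothetical, not Metaform's actual storage format): when each column lives in its own contiguous array, a query touching two of four columns never has to read the other two.

```python
# A toy columnar table: one contiguous list per column.
table = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bo", "cy"],
    "amount":   [10.0, 25.5, 7.25],
    "notes":    ["a" * 1000, "b" * 1000, "c" * 1000],  # wide column a query can skip
}

def project(table, columns):
    """Read only the requested columns; other columns are never touched."""
    return {name: table[name] for name in columns}

result = project(table, ["order_id", "amount"])
print(result["amount"])  # → [10.0, 25.5, 7.25]
```

In a row-oriented layout, reading `amount` would drag the kilobyte-wide `notes` value through the I/O path for every record; in the columnar layout it is simply never loaded.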


Vectorization

Traditional engines process one value at a time. Metaform instead uses vectorized execution, where the CPU operates on record batches—arrays of values spanning multiple records. This model suits modern CPU architectures, keeping their deep pipelines busy and well utilized. The result is performance much closer to the hardware's theoretical limits than row-at-a-time processing can achieve.
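The contrast between the two models can be sketched as follows (the function names and the 8% tax computation are illustrative only; real vectorized engines operate on typed buffers with SIMD instructions, which plain Python does not expose):

```python
# Scalar model: one function call per value.
def scalar_add_tax(price):
    return price * 1.08

# Vectorized model: one call processes a whole record batch (an array of
# values), amortizing per-call overhead across every record in the batch.
def batch_add_tax(prices):
    return [p * 1.08 for p in prices]

prices = [100.0, 200.0, 50.0]   # a record batch
taxed = batch_add_tax(prices)   # one call for the whole batch
print(taxed)
```

The payoff in a real engine is larger than this sketch suggests: tight loops over contiguous arrays let the compiler emit SIMD code and keep branch predictors and caches warm, which per-value dispatch defeats.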


Runtime Compilation

Instead of interpreting queries, Metaform performs runtime code generation, compiling custom, optimized code for each query. This approach reduces overhead and ensures execution paths are tailored to the query structure, yielding significantly faster performance.
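Metaform's generated code is not public, but the idea can be sketched with Python's built-in `compile` and `exec`: instead of walking an expression tree for every row, emit source tailored to this exact query and compile it once up front. `compile_filter` is a hypothetical name for illustration:

```python
def compile_filter(column, op, value):
    """Generate and compile a row predicate specialized to one query."""
    source = (
        f"def query_filter(row):\n"
        f"    return row[{column!r}] {op} {value!r}\n"
    )
    namespace = {}
    # Compile the generated source once; every row then runs the
    # specialized function with no interpretation overhead per predicate.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["query_filter"]

rows = [{"amount": 5}, {"amount": 50}, {"amount": 500}]
flt = compile_filter("amount", ">", 10)   # code generated for this query
print([r for r in rows if flt(r)])  # → [{'amount': 50}, {'amount': 500}]
```

The compiled predicate contains only the comparison the query actually needs; a general-purpose interpreter would re-dispatch on the operator and operand types for every row.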


Optimistic Query Execution

Metaform uses an optimistic execution model, assuming that failures during short-lived queries are rare. Rather than adding checkpoints and recovery boundaries, Metaform reruns a query in the event of a failure.
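A minimal sketch of this rerun-on-failure policy, assuming a coordinator that simply restarts the whole query (the names `run_query` and `FlakyQuery` are illustrative, not Metaform APIs):

```python
def run_query(execute, max_attempts=3):
    """Optimistically execute with no checkpoints; rerun on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()          # the common case: it just succeeds
        except RuntimeError:
            if attempt == max_attempts:
                raise                 # persistent failure: surface the error
            # No partial state to recover — restart the query from scratch.

class FlakyQuery:
    """Simulated short-lived query that fails once, then succeeds."""
    def __init__(self):
        self.attempts = 0
    def __call__(self):
        self.attempts += 1
        if self.attempts == 1:
            raise RuntimeError("node lost mid-query")
        return 42

print(run_query(FlakyQuery()))  # → 42
```

The trade-off is deliberate: checkpointing would tax every query to protect against failures that, for short-lived interactive queries, are rare enough that an occasional full rerun is cheaper.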

Execution is fully pipelined, with all tasks scheduled simultaneously. Data flows through the pipeline in-memory as much as possible, spilling to disk only if memory overflows. This design reduces latency and maximizes throughput across large workloads.
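Pipelined execution can be sketched with Python generators, where scan, filter, and projection operators are all "live" at once and each row flows through in memory without materializing intermediate results (the operator names are illustrative; disk spilling is omitted from this sketch):

```python
def scan(rows):
    """Source operator: stream rows one at a time."""
    for row in rows:
        yield row

def filter_op(rows, predicate):
    """Pass through only rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project_op(rows, column):
    """Keep a single column of each surviving row."""
    for row in rows:
        yield row[column]

rows = [{"id": i, "amount": i * 10} for i in range(1, 6)]
# All three operators are scheduled at once; rows stream through the
# chain in memory rather than being materialized between stages.
pipeline = project_op(filter_op(scan(rows), lambda r: r["amount"] > 20), "id")
print(list(pipeline))  # → [3, 4, 5]
```

Because each stage pulls rows lazily from the one before it, the pipeline's memory footprint stays at roughly one row per operator; a real engine streams record batches the same way and falls back to disk only when a blocking operator (such as a large sort) overflows memory.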