Introduction
Arrow Extension
Apache Arrow is a language-independent columnar memory format for flat and hierarchical data. The Arrow ecosystem provides high-performance implementations for common data operations — including I/O for formats like Parquet, CSV, JSON, and Arrow IPC — in C++, Rust, Java, Python, and other languages.
This extension brings the Arrow Rust ecosystem into PHP via ext-php-rs. It exposes Arrow's native readers and writers through PHP streaming interfaces, letting PHP applications benefit from Rust-level performance without leaving the PHP runtime.
[!TIP] The recommended way to use this extension is through the parquet library, which provides a higher-level PHP API and automatically leverages the Arrow extension when it is loaded. You only need to use the classes documented here directly if you want low-level control over the Arrow reader/writer.
Current Scope
The first module exposed through this extension is Apache Parquet — a columnar storage format widely used in data engineering and analytics.
Planned Modules
The Arrow Rust crates offer additional I/O capabilities that are candidates for future exposure through this extension:
- CSV — high-performance CSV reading/writing with automatic type inference and Arrow-native batching
- JSON — Arrow-backed JSON line (JSONL/NDJSON) reading/writing with schema support
- IPC — Arrow's own binary streaming/file format for zero-copy data exchange between processes and languages
Features
- Read and write Apache Parquet files through PHP streaming interfaces
- Flat types: INT32, INT64, FLOAT, DOUBLE, BOOLEAN, STRING, BINARY, DATE32, TIMESTAMP
- Nested types: LIST, STRUCT, MAP (arbitrarily nested)
- Compression codecs: UNCOMPRESSED, SNAPPY, GZIP, ZSTD, LZ4_RAW, BROTLI
- Column projection for selective reads
- Configurable row group size, compression level, and writer version
- Columnar batch I/O for maximum throughput
Requirements
- PHP 8.3+
- Rust toolchain (rustc, cargo) — install from rustup.rs
- clang / libclang (for ext-php-rs bindgen)
- make
Installation
Using PIE (Recommended)
PIE is the modern PHP extension installer.
Prerequisites: install the Rust toolchain and clang on your system before running PIE.
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Ubuntu/Debian
sudo apt-get install build-essential clang libclang-dev
# macOS with Homebrew
brew install llvm
export LIBCLANG_PATH=$(brew --prefix llvm)/lib
Install the extension:
pie install flow-php/arrow-ext
Build from Source
# Install build dependencies (Ubuntu/Debian)
sudo apt-get install build-essential clang libclang-dev
# Install build dependencies (macOS)
brew install llvm
export LIBCLANG_PATH=$(brew --prefix llvm)/lib
# Build
cd src/extension/arrow-ext
make build
# Run tests
make test
# Install to system PHP
make install
Using Nix (Monorepo Development)
From the Flow PHP monorepo root:
# Default shell includes the pre-built arrow extension
nix-shell
php -m | grep arrow
# For extension development (Rust toolchain + PHP dev headers, no pre-built extension)
nix-shell --arg with-arrow-ext false --arg with-rust true
cd src/extension/arrow-ext
make build && make test
Loading the Extension
In php.ini
extension = arrow
During Development
php -d extension=./ext/modules/arrow.so your_script.php
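To confirm at runtime that the extension was actually picked up, a script can check for the module and one of its classes. This is a minimal sketch: it assumes the module registers under the name arrow (as the php -m check earlier suggests), and arrowExtensionAvailable() is a hypothetical helper name, not part of the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: reports whether the Arrow extension can be used.
 * Assumes the module name is "arrow" and that it exposes
 * Flow\Arrow\Parquet\Reader, as documented in this README.
 */
function arrowExtensionAvailable(): bool
{
    return extension_loaded('arrow')
        && class_exists('Flow\\Arrow\\Parquet\\Reader', false);
}

if (!arrowExtensionAvailable()) {
    // error_log() writes to stderr when running under the CLI.
    error_log('arrow extension is not loaded; check php.ini or pass -d extension=...');
}
```

A check like this is mostly useful in scripts that talk to the low-level classes directly and need a clear failure message when the module is missing.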
Usage
Implementing the Streaming Interfaces
The extension operates on two PHP interfaces for I/O. You must provide implementations for your storage backend.
Source (reading):
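<?php
use Flow\Arrow\RandomAccessFile;
class FileSource implements RandomAccessFile
{
    private readonly string $data;
    public function __construct(string $path)
    {
        // file_get_contents() returns false on failure; fail loudly
        // instead of silently exposing an empty source.
        $data = file_get_contents($path);
        if ($data === false) {
            throw new \RuntimeException("Unable to read {$path}");
        }
        $this->data = $data;
    }
    public function read(int $length, int $offset): string
    {
        return substr($this->data, $offset, $length);
    }
    public function size(): ?int
    {
        return strlen($this->data);
    }
}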
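The FileSource above keeps the whole file in memory. For larger files, a source that seeks within an open file handle may be preferable. The following is a sketch under assumptions, not the extension's own implementation: SeekableFileSource and the App namespace are hypothetical, and the guarded interface stub at the top exists only so the snippet also runs where the extension is not loaded.

```php
<?php
namespace Flow\Arrow;

// Stub for running this sketch without the extension; when the extension
// is loaded, its own interface already exists and this block is skipped.
if (!interface_exists(RandomAccessFile::class)) {
    interface RandomAccessFile
    {
        public function read(int $length, int $offset): string;
        public function size(): ?int;
    }
}

namespace App;

use Flow\Arrow\RandomAccessFile;

/** Hypothetical source that reads byte ranges from an open file handle. */
final class SeekableFileSource implements RandomAccessFile
{
    /** @var resource */
    private $fh;

    public function __construct(string $path)
    {
        $fh = @fopen($path, 'rb');
        if ($fh === false) {
            throw new \RuntimeException("Cannot open {$path} for reading");
        }
        $this->fh = $fh;
    }

    public function read(int $length, int $offset): string
    {
        if ($length <= 0 || fseek($this->fh, $offset) !== 0) {
            return '';
        }
        $buffer = '';
        // fread() may return fewer bytes than requested, so loop until
        // the requested length is reached or the file ends.
        while (strlen($buffer) < $length && !feof($this->fh)) {
            $chunk = fread($this->fh, $length - strlen($buffer));
            if ($chunk === false || $chunk === '') {
                break;
            }
            $buffer .= $chunk;
        }
        return $buffer;
    }

    public function size(): ?int
    {
        $stat = fstat($this->fh);
        return $stat === false ? null : $stat['size'];
    }

    public function __destruct()
    {
        if (is_resource($this->fh)) {
            fclose($this->fh);
        }
    }
}
```

Reads past the end of the file return a shorter (possibly empty) string, matching what substr() does in the in-memory version.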
Destination (writing):
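<?php
use Flow\Arrow\OutputStream;
class FileDestination implements OutputStream
{
    /** @var resource */
    private $fh;
    public function __construct(string $path)
    {
        // fopen() returns false on failure; fail here rather than on the first write.
        $fh = fopen($path, 'wb');
        if ($fh === false) {
            throw new \RuntimeException("Unable to open {$path} for writing");
        }
        $this->fh = $fh;
    }
    public function append(string $data): self
    {
        if (fwrite($this->fh, $data) === false) {
            throw new \RuntimeException('Failed to write to the output file');
        }
        return $this;
    }
    public function __destruct()
    {
        if (is_resource($this->fh)) {
            fclose($this->fh);
        }
    }
}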
Reading Parquet Files
<?php
use Flow\Arrow\Parquet\Reader;
$reader = new Reader(new FileSource('data.parquet'));
// Get schema and metadata
$schema = $reader->schema();
$metadata = $reader->metadata();
// Read row groups (with optional column projection)
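// readNextRowGroup() returns null once all row groups are read, so compare against null explicitly
while (($batch = $reader->readNextRowGroup(['id', 'name'])) !== null) {
    // $batch is ['column_name' => [values...], ...]
    foreach ($batch['id'] as $i => $id) {
        echo "$id: {$batch['name'][$i]}\n";
    }
}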
$reader->close();
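Because batches are columnar, code that wants to work row by row has to zip the columns back together. A small generator can do that. This is a sketch: batchToRows() is a hypothetical helper, and it assumes every column in a batch holds the same number of values.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: yields one associative row per index of a columnar
 * batch shaped like ['column' => [values...], ...].
 *
 * @param array<string, list<mixed>> $batch
 * @return \Generator<int, array<string, mixed>>
 */
function batchToRows(array $batch): \Generator
{
    if ($batch === []) {
        return;
    }
    // The first column determines the row count.
    $rowCount = count(reset($batch));
    for ($i = 0; $i < $rowCount; $i++) {
        $row = [];
        foreach ($batch as $column => $values) {
            $row[$column] = $values[$i] ?? null;
        }
        yield $row;
    }
}

$batch = ['id' => [1, 2], 'name' => ['Alice', null]];
foreach (batchToRows($batch) as $row) {
    echo $row['id'], ': ', $row['name'] ?? 'NULL', "\n";
}
```

Using a generator keeps only one row materialized at a time, so the row view adds little memory on top of the batch itself.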
Writing Parquet Files
<?php
use Flow\Arrow\Parquet\Writer;
$schema = [
    ['name' => 'id', 'type' => 'INT64', 'optional' => false],
    ['name' => 'name', 'type' => 'STRING', 'optional' => true],
];
$writer = new Writer(new FileDestination('output.parquet'), $schema, 'SNAPPY');
$writer->writeBatch([
    'id' => [1, 2, 3],
    'name' => ['Alice', 'Bob', null],
]);
$writer->close();
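Application data often arrives as a list of rows rather than as columns. One way to bridge the two is to pivot the rows into the columnar shape that writeBatch() takes. This is a sketch: rowsToBatch() is a hypothetical helper, not part of the extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pivots row-oriented records into a columnar batch.
 * Keys missing from a row become null, so those columns need
 * 'optional' => true in the schema.
 *
 * @param list<array<string, mixed>> $rows
 * @param list<string> $columns column names, in schema order
 * @return array<string, list<mixed>>
 */
function rowsToBatch(array $rows, array $columns): array
{
    $batch = array_fill_keys($columns, []);
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $batch[$column][] = $row[$column] ?? null;
        }
    }
    return $batch;
}

$batch = rowsToBatch(
    [
        ['id' => 1, 'name' => 'Alice'],
        ['id' => 2, 'name' => 'Bob'],
        ['id' => 3], // no name: stored as null
    ],
    ['id', 'name'],
);
// $batch now has the shape expected by $writer->writeBatch($batch).
```

Passing the column list explicitly keeps every column the same length even when some rows omit a key.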
Schema Definition
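The schema is an array of column definitions. Each column has a name, a type, and an 'optional' flag, which may be omitted. Nested types also take a children key. The table below lists the supported types and the PHP value each one is read as.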
| Type | PHP Read Value | Notes |
|---|---|---|
| BOOLEAN | bool | |
| INT32 | int | |
| INT64 | int | |
| FLOAT | float | |
| DOUBLE | float | |
| STRING | string | |
| BINARY | string (raw bytes) | |
| DATE32 | string (YYYY-MM-DD) | |
| TIMESTAMP | string (ISO 8601) | |
| LIST | array | Requires children key with 1 element |
| STRUCT | array (associative) | Requires children key with N elements |
| MAP | array (associative) | Requires children key with 2 elements (key + value) |
Nested schema example:
<?php
$schema = [
    ['name' => 'id', 'type' => 'INT64', 'optional' => false],
    ['name' => 'tags', 'type' => 'LIST', 'optional' => true, 'children' => [
        ['name' => 'element', 'type' => 'STRING', 'optional' => true],
    ]],
    ['name' => 'address', 'type' => 'STRUCT', 'optional' => true, 'children' => [
        ['name' => 'street', 'type' => 'STRING', 'optional' => true],
        ['name' => 'city', 'type' => 'STRING', 'optional' => true],
    ]],
    ['name' => 'metadata', 'type' => 'MAP', 'optional' => true, 'children' => [
        ['name' => 'key', 'type' => 'STRING', 'optional' => false],
        ['name' => 'value', 'type' => 'STRING', 'optional' => true],
    ]],
];
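The child-count rules for nested types are easy to get wrong in hand-written schemas, so it can help to check them in PHP before handing the schema to the Writer. This is a sketch: schemaProblems() is a hypothetical helper, it only checks names, types, and nesting, and treating STRUCT as "at least one child" is an assumption about what N elements means.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns a list of problems found in a schema array.
 * Child-count rules follow the type table in this section.
 *
 * @param array<int, mixed> $schema
 * @return list<string>
 */
function schemaProblems(array $schema, string $path = ''): array
{
    $problems = [];
    foreach ($schema as $index => $column) {
        $name = $column['name'] ?? null;
        $at = $path . (is_string($name) ? $name : "#{$index}");
        if (!is_string($name) || $name === '') {
            $problems[] = "{$at}: missing name";
        }
        $type = $column['type'] ?? null;
        if (!is_string($type)) {
            $problems[] = "{$at}: missing type";
            continue;
        }
        $children = $column['children'] ?? [];
        $count = is_array($children) ? count($children) : 0;
        $ok = match ($type) {
            'LIST' => $count === 1,
            'MAP' => $count === 2,
            'STRUCT' => $count >= 1, // assumption: N means one or more
            default => $count === 0, // flat types take no children
        };
        if (!$ok) {
            $problems[] = "{$at}: {$type} has {$count} children";
        }
        if (is_array($children)) {
            // Recurse so arbitrarily nested definitions are checked too.
            array_push($problems, ...schemaProblems($children, "{$at}."));
        }
    }
    return $problems;
}

$problems = schemaProblems([
    ['name' => 'id', 'type' => 'INT64', 'optional' => false],
    ['name' => 'tags', 'type' => 'LIST', 'optional' => true, 'children' => []],
]);
// $problems holds one entry, for the LIST column that has no child.
```

Returning a list of messages rather than throwing on the first problem makes it possible to report every issue in a schema at once.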
Writer Options
Options are passed as the fourth argument to the Writer constructor:
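<?php
use Flow\Arrow\Parquet\Writer;
// $stream is any OutputStream implementation; $schema is a schema array as described above
$writer = new Writer($stream, $schema, 'SNAPPY', [
    'ROW_GROUP_SIZE_BYTES' => 128 * 1024 * 1024,
    'WRITER_VERSION' => '2.0',
]);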
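Before constructing the Writer, the codec name and option values can be checked against what this README documents: the codec list under Features and the option keys in this section. This is a sketch: writerConfigProblems() is a hypothetical helper, rejecting unknown option keys is an assumption, and it does not know the codec-specific limits of COMPRESSION_LEVEL.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: checks compression and options against the values
 * documented in this README.
 *
 * @param array<string, mixed> $options
 * @return list<string> problems found (empty when everything looks valid)
 */
function writerConfigProblems(string $compression, array $options): array
{
    $codecs = ['UNCOMPRESSED', 'SNAPPY', 'GZIP', 'ZSTD', 'LZ4_RAW', 'BROTLI'];
    $problems = [];
    if (!in_array($compression, $codecs, true)) {
        $problems[] = "unknown compression codec: {$compression}";
    }
    foreach ($options as $key => $value) {
        $valid = match ($key) {
            'ROW_GROUP_SIZE_BYTES' => is_int($value) && $value > 0,
            'COMPRESSION_LEVEL' => is_int($value),
            'WRITER_VERSION' => in_array($value, ['1.0', '2.0'], true),
            default => false, // assumption: unknown keys are treated as mistakes
        };
        if (!$valid) {
            $problems[] = "invalid option: {$key}";
        }
    }
    return $problems;
}

$problems = writerConfigProblems('SNAPPY', [
    'ROW_GROUP_SIZE_BYTES' => 128 * 1024 * 1024,
    'WRITER_VERSION' => '2.0',
]);
// An empty $problems list means the configuration matches the documented values.
```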
| Option Key | Type | Description |
|---|---|---|
| ROW_GROUP_SIZE_BYTES | int | Maximum row group size in bytes |
| COMPRESSION_LEVEL | int | Compression level (codec-specific) |
| WRITER_VERSION | string | "1.0" or "2.0" |
API Reference
Interfaces
Flow\Arrow\RandomAccessFile
| Method | Parameters | Returns | Description |
|---|---|---|---|
| read | int $length, int $offset | string | Read $length bytes starting at $offset |
| size | - | ?int | Return total size in bytes, or null if unknown |
Flow\Arrow\OutputStream
| Method | Parameters | Returns | Description |
|---|---|---|---|
| append | string $data | self | Append data to the output stream |
Classes
Flow\Arrow\Parquet\Reader
| Method | Parameters | Returns | Description |
|---|---|---|---|
| __construct | RandomAccessFile $source, array $options = [] | - | Open a Parquet source for reading |
| schema | - | array | Return the file schema as nested arrays |
| metadata | - | array | Return file-level metadata (row count, row groups, key-value metadata) |
| readNextRowGroup | ?array $columns = null | ?array | Read next row group as columnar batch, or null when exhausted |
| close | - | void | Release resources |
Flow\Arrow\Parquet\Writer
| Method | Parameters | Returns | Description |
|---|---|---|---|
| __construct | OutputStream $stream, array $schema, string $compression = 'SNAPPY', array $options = [] | - | Open a Parquet destination for writing |
| writeBatch | array $batch | void | Write a columnar batch (['col' => [values...]]) |
| close | - | void | Flush and finalize the Parquet file |
Flow\Arrow\Parquet\Exception
Extends \RuntimeException. Thrown on all Parquet read/write errors originating from the Rust layer.
Error Handling
<?php
use Flow\Arrow\Parquet\Exception;
use Flow\Arrow\Parquet\Reader;
try {
    $reader = new Reader(new FileSource('data.parquet'));
    // readNextRowGroup() returns null once all row groups are read
    while (($batch = $reader->readNextRowGroup()) !== null) {
        // process batch
    }
    $reader->close();
} catch (Exception $e) {
    echo "Parquet error: " . $e->getMessage();
}
Development
Build Commands
make build # Build the extension
make test # Run PHPT tests
make install # Install to system PHP
make clean # Remove build artifacts
make rebuild # Full clean + build
Modifying the Extension
cd src/extension/arrow-ext
make rebuild
make test
Architecture
- Built with ext-php-rs, which generates PHP bindings from Rust code
- Uses Apache Arrow and Parquet Rust crates from the Arrow ecosystem
- All compression codecs compiled into the extension — no external PHP compression extensions needed
- PHP streaming interfaces (RandomAccessFile, OutputStream) called from Rust via ext-php-rs callbacks
- Columnar batch format aligns with Parquet's native columnar storage
- PIE-compatible via ext/config.m4 that delegates to cargo build