flow php

Partitioning

Partitioning divides data into logical groups based on column values, enabling more efficient processing of large datasets and reducing memory usage.

Basic Partitioning

partitionBy() - Partition by columns

<?php

use function Flow\ETL\DSL\{data_frame, from_array, col, to_parquet};

data_frame()
    ->read(from_array([
        ['date' => '2024-01-01', 'department' => 'sales', 'amount' => 100],
        ['date' => '2024-01-01', 'department' => 'marketing', 'amount' => 200],
        ['date' => '2024-01-02', 'department' => 'sales', 'amount' => 150],
        ['date' => '2024-01-02', 'department' => 'marketing', 'amount' => 250],
    ]))
    ->partitionBy('date') // Partition by date
    ->write(to_parquet(__DIR__ . '/output/sales.parquet'))
    ->run();

File structure:

output/
├── date=2024-01-01/
│   └── sales.parquet       # Contains 2 rows (sales + marketing for Jan 1)
└── date=2024-01-02/
    └── sales.parquet       # Contains 2 rows (sales + marketing for Jan 2)

Each partition creates a separate directory with the format column=value, and data files are written inside those directories.

Multi-Column Partitioning

<?php

data_frame()
    ->read(from_array([
        ['date' => '2024-01-01', 'department' => 'sales', 'amount' => 100],
        ['date' => '2024-01-01', 'department' => 'marketing', 'amount' => 200],
        ['date' => '2024-01-02', 'department' => 'sales', 'amount' => 150],
        ['date' => '2024-01-02', 'department' => 'marketing', 'amount' => 250],
    ]))
    ->partitionBy('date', 'department') // Partition by date AND department
    ->write(to_parquet(__DIR__ . '/output/sales.parquet'))
    ->run();

File structure with nested partitions:

output/
├── date=2024-01-01/
│   ├── department=sales/
│   │   └── sales.parquet       # 1 row: sales on Jan 1
│   └── department=marketing/
│       └── sales.parquet       # 1 row: marketing on Jan 1
└── date=2024-01-02/
    ├── department=sales/
    │   └── sales.parquet       # 1 row: sales on Jan 2
    └── department=marketing/
        └── sales.parquet       # 1 row: marketing on Jan 2

Each partition column creates an additional nesting level in the directory hierarchy.

Dropping Partitions

<?php

$dataFrame = data_frame()
    ->read($extractor)
    ->partitionBy('date')
    ->map($transformation)
    ->dropPartitions() // Remove partition information but keep data
    ->write($loader)
    ->run();

// Drop partitions AND partition columns
$dataFrame
    ->partitionBy('date')
    ->dropPartitions(dropPartitionColumns: true) // Also removes 'date' column
    ->run();

Performance Considerations

Choosing Partition Columns

<?php

// Good partitioning - balanced partition sizes
$dataFrame->partitionBy('date'); // Assuming data is spread across dates

// Bad partitioning - unbalanced partitions
$dataFrame->partitionBy('id'); // If IDs are unique, creates many tiny partitions

// Good partitioning - moderate cardinality
$dataFrame->partitionBy('department'); // Assuming reasonable number of departments

Contributors

Join us on GitHub external resource
scroll back to top