Introduction
Data Frame
A Data Frame is the core component of Flow PHP's ETL framework. It represents a structured collection of tabular data that can be processed, transformed, and loaded efficiently. Think of it as a programmable spreadsheet that can handle large datasets with minimal memory footprint.
Key Features
- Memory Efficient: Processes data in chunks using generators, avoiding memory exhaustion
- Lazy Evaluation: Operations are only executed when needed
- Immutable: Each transformation returns a new DataFrame instance
- Type Safe: Strict typing throughout with comprehensive schema support
- Chainable API: Fluent interface for building complex data pipelines
Understanding DataFrame Operations
DataFrame methods fall into two categories based on when they execute:
Lazy Operations
These methods build the processing pipeline without executing it immediately:
- Transformations: `filter()`, `map()`, `withEntry()`, `select()`, `drop()`, `rename()`
- Memory-intensive: `collect()`, `sortBy()`, `groupBy()`, `join()`, `cache()`
- Processing control: `batchSize()`, `limit()`, `offset()`, `partitionBy()`
Trigger Operations
These methods execute the entire pipeline and return results:
- Data retrieval: `get()`, `getEach()`, `fetch()`, `count()`
- Output operations: `run()`, `forEach()`, `printRows()`, `printSchema()`
- Schema inspection: `schema()`, `display()`
Important: Build your complete pipeline with lazy operations, then execute once with a trigger operation for optimal performance.
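For example, a minimal sketch of this pattern (the sample data and entry names are illustrative) builds several lazy steps and only touches the data when `fetch()` is called:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

// Lazy: nothing executes here, the pipeline is only being described.
$pipeline = data_frame()
    ->read(from_array([
        ['id' => 1, 'score' => 10],
        ['id' => 2, 'score' => 80],
    ]))
    ->filter(ref('score')->greaterThan(lit(50)))
    ->select('id');

// Trigger: the whole pipeline executes exactly once, here.
$rows = $pipeline->fetch();
```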
Creating DataFrames
DataFrames are created using the `data_frame()` DSL function and populated with data through extractors. The framework supports various data sources through adapter-specific extractors.
```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref, to_output};

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'John', 'age' => 30],
        ['id' => 2, 'name' => 'Jane', 'age' => 25],
        ['id' => 3, 'name' => 'Bob', 'age' => 35],
    ]))
    ->filter(ref('age')->greaterThan(lit(25)))
    ->select('id', 'name')
    ->write(to_output())
    ->run();
```
Note: Flow PHP supports many data sources through specialized adapters. See individual adapter documentation for specific extractor usage (CSV, JSON, Parquet, databases, APIs, etc.).
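As a loosely hedged illustration of the adapter pattern, reading from and writing to CSV files looks like the sketch below; the exact namespace of the `from_csv()` / `to_csv()` functions depends on the adapter version you have installed, so treat the first `use` statement as an assumption and check the CSV adapter documentation:

```php
<?php

// Assumption: recent CSV adapter versions expose function-based DSL helpers in
// this namespace; adjust the import to match your installed adapter.
use function Flow\ETL\Adapter\CSV\{from_csv, to_csv};
use function Flow\ETL\DSL\{data_frame, lit, ref};

data_frame()
    ->read(from_csv(__DIR__ . '/input.csv'))      // extractor provided by the CSV adapter
    ->filter(ref('age')->greaterThan(lit(25)))    // same lazy API as with from_array()
    ->write(to_csv(__DIR__ . '/output.csv'))      // loader provided by the CSV adapter
    ->run();
```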
Memory Management Best Practices
- Prefer Generator Methods: Use `get()`, `getEach()`, or `getEachAsArray()` over `fetch()` for large datasets (see the sketch after this list)
- Avoid Memory-Intensive Operations: Be cautious with `collect()`, `sortBy()`, `groupBy()`, and `join()` on large datasets
- Use Appropriate Batch Sizes: Start with 1000-5000 rows and adjust based on your memory constraints
- Monitor Memory Usage: Use `run(analyze: true)` to track memory consumption during development
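For instance, a minimal sketch of streaming results row by row (here `$largeDataset` stands in for any big input; in practice it would come from an adapter extractor):

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array};

// $largeDataset is a placeholder for a large input array or other data source.
$rows = data_frame()
    ->read(from_array($largeDataset))
    ->batchSize(1000)        // keep internal batches small
    ->getEachAsArray();      // generator: rows are yielded one at a time

foreach ($rows as $row) {
    // Process a single row; the full result set is never held in memory at once.
}
```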
Performance Optimization
- Push Operations to Data Source: When possible, perform filtering, sorting, and joins at the database/file level
- Minimize Data Movement: Apply filters early in the pipeline to reduce data volume (see the sketch after this list)
- Cache Strategically: Only cache expensive operations that will be reused multiple times
- Avoid Large Offsets: Use data source pagination instead of DataFrame `offset()` for large skips
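As a rough sketch of these guidelines (the input, entry names, and thresholds are illustrative), filtering early means the memory-intensive sort only sees the rows that survive the filter, and `cache()` is applied only on the assumption that the filtered result will be reused elsewhere:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

// $orders is a placeholder input; in practice this would be an adapter extractor.
data_frame()
    ->read(from_array($orders))
    ->filter(ref('amount')->greaterThan(lit(100))) // filter early: later steps see fewer rows
    ->cache()                                      // worthwhile only if this result is reused
    ->sortBy(ref('amount')->desc())                // memory-intensive step now works on less data
    ->limit(10)
    ->printRows();
```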
Component Documentation
For detailed information about specific DataFrame operations, see the following component documentation:
Core Operations
- Building Blocks - Understanding Rows, Entries, and basic data structures
- Select/Drop - Column selection and removal
- Rename - Column renaming strategies
- Map - Row transformations and data mapping
- Filter - Row filtering and conditions
Data Processing
- Join - DataFrame joining operations
- Group By - Grouping and aggregation operations
- Pivot - Transform data from long to wide format
- Sort - Data sorting
- Limit - Result limiting and pagination
- Offset - Skipping rows and pagination
- Until - Conditional processing termination
- Window Functions - Advanced analytical functions
Memory & Performance
- Batch Processing - Controlling batch sizes and memory collection
- Partitioning - Data partitioning for efficient processing
- Caching - Performance optimization through caching
- Data Retrieval - Methods for getting processed data
Data Quality & Validation
- Schema - Schema management and validation
- Constraints - Data integrity constraints and business rules
- Error Handling - Error management strategies
Output & Display
- Display - Data visualization and output