Introduction
Data Frame
A Data Frame is the core component of Flow PHP's ETL framework. It represents a structured collection of tabular data that can be processed, transformed, and loaded efficiently. Think of it as a programmable spreadsheet that can handle large datasets with minimal memory footprint.
Key Features
- Memory Efficient: Processes data in chunks using generators, avoiding memory exhaustion
- Lazy Evaluation: Operations are only executed when needed
- Immutable: Each transformation returns a new DataFrame instance
- Type Safe: Strict typing throughout with comprehensive schema support
- Chainable API: Fluent interface for building complex data pipelines
Understanding DataFrame Operations
DataFrame methods fall into two categories based on when they execute:
Lazy Operations
These methods build the processing pipeline without executing it immediately:
- Transformations: `filter()`, `map()`, `withEntry()`, `select()`, `drop()`, `rename()`
- Memory-intensive: `collect()`, `sortBy()`, `groupBy()`, `join()`, `cache()`
- Processing control: `batchSize()`, `limit()`, `offset()`, `partitionBy()`
Trigger Operations
These methods execute the entire pipeline and return results:
- Data retrieval: `get()`, `getEach()`, `fetch()`, `count()`
- Output operations: `run()`, `forEach()`, `printRows()`, `printSchema()`
- Schema inspection: `schema()`, `display()`
Important: Build your complete pipeline with lazy operations, then execute once with a trigger operation for optimal performance.
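For example, a minimal sketch of this pattern (the sample data and entry names are illustrative) builds several lazy steps and only touches the data when `fetch()` is called:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

// Lazy: nothing executes here, the pipeline is only being described.
$pipeline = data_frame()
    ->read(from_array([
        ['id' => 1, 'score' => 10],
        ['id' => 2, 'score' => 80],
    ]))
    ->filter(ref('score')->greaterThan(lit(50)))
    ->select('id');

// Trigger: the whole pipeline executes exactly once, here.
$rows = $pipeline->fetch();
```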
Creating DataFrames
DataFrames are created using the `data_frame()` DSL function and populated with data through extractors. The framework supports various data sources through adapter-specific extractors.
```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref, to_output};

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'John', 'age' => 30],
        ['id' => 2, 'name' => 'Jane', 'age' => 25],
        ['id' => 3, 'name' => 'Bob', 'age' => 35],
    ]))
    ->filter(ref('age')->greaterThan(lit(25)))
    ->select('id', 'name')
    ->write(to_output())
    ->run();
```
Note: Flow PHP supports many data sources through specialized adapters. See individual adapter documentation for specific extractor usage (CSV, JSON, Parquet, databases, APIs, etc.).
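As a loosely hedged illustration of the adapter pattern, reading from and writing to CSV files looks like the sketch below; the exact namespace of the `from_csv()` / `to_csv()` functions depends on the adapter version you have installed, so treat the first `use` statement as an assumption and check the CSV adapter documentation:

```php
<?php

// Assumption: recent CSV adapter versions expose function-based DSL helpers in
// this namespace; adjust the import to match your installed adapter.
use function Flow\ETL\Adapter\CSV\{from_csv, to_csv};
use function Flow\ETL\DSL\{data_frame, lit, ref};

data_frame()
    ->read(from_csv(__DIR__ . '/input.csv'))      // extractor provided by the CSV adapter
    ->filter(ref('age')->greaterThan(lit(25)))    // same lazy API as with from_array()
    ->write(to_csv(__DIR__ . '/output.csv'))      // loader provided by the CSV adapter
    ->run();
```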
Memory Management Best Practices
- Prefer Generator Methods: Use `get()`, `getEach()`, or `getEachAsArray()` over `fetch()` for large datasets (see the sketch after this list)
- Avoid Memory-Intensive Operations: Be cautious with `collect()`, `sortBy()`, `groupBy()`, and `join()` on large datasets
- Use Appropriate Batch Sizes: Start with 1000-5000 rows and adjust based on your memory constraints
- Monitor Memory Usage: Use `run(analyze: true)` to track memory consumption during development
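For instance, a minimal sketch of streaming results row by row (here `$largeDataset` stands in for any big input; in practice it would come from an adapter extractor):

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array};

// $largeDataset is a placeholder for a large input array or other data source.
$rows = data_frame()
    ->read(from_array($largeDataset))
    ->batchSize(1000)        // keep internal batches small
    ->getEachAsArray();      // generator: rows are yielded one at a time

foreach ($rows as $row) {
    // Process a single row; the full result set is never held in memory at once.
}
```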
Performance Optimization
- Push Operations to Data Source: When possible, perform filtering, sorting, and joins at the database/file level
- Minimize Data Movement: Apply filters early in the pipeline to reduce data volume (see the sketch after this list)
- Cache Strategically: Only cache expensive operations that will be reused multiple times
- Avoid Large Offsets: Use data source pagination instead of DataFrame `offset()` for large skips
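As a rough sketch of these guidelines (the input, entry names, and thresholds are illustrative), filtering early means the memory-intensive sort only sees the rows that survive the filter, and `cache()` is applied only on the assumption that the filtered result will be reused elsewhere:

```php
<?php

use function Flow\ETL\DSL\{data_frame, from_array, lit, ref};

// $orders is a placeholder input; in practice this would be an adapter extractor.
data_frame()
    ->read(from_array($orders))
    ->filter(ref('amount')->greaterThan(lit(100))) // filter early: later steps see fewer rows
    ->cache()                                      // worthwhile only if this result is reused
    ->sortBy(ref('amount')->desc())                // memory-intensive step now works on less data
    ->limit(10)
    ->printRows();
```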
Component Documentation
For detailed information about specific DataFrame operations, see the following component documentation:
Core Operations
- Building Blocks - Understanding Rows, Entries, and basic data structures
- Select/Drop - Column selection and removal
- Rename - Column renaming strategies
- Map - Row transformations and data mapping
- Filter - Row filtering and conditions
Data Processing
- Join - DataFrame joining operations
- Group By - Grouping and aggregation operations
- Pivot - Transform data from long to wide format
- Sort - Data sorting
- Limit - Result limiting and pagination
- Offset - Skipping rows and pagination
- Until - Conditional processing termination
- Window Functions - Advanced analytical functions
Memory & Performance
- Batch Processing - Controlling batch sizes and memory collection
- Partitioning - Data partitioning for efficient processing
- Caching - Performance optimization through caching
- Data Retrieval - Methods for getting processed data
Data Quality & Validation
- Schema - Schema management and validation
- Constraints - Data integrity constraints and business rules
- Error Handling - Error management strategies
Output & Display
- Display - Data visualization and output