flow php

Data Manipulation

DataFrame provides several methods for manipulating data structures and values within your datasets. These operations allow you to add, modify, cast, and clean data efficiently.

Type Casting with autoCast()

Automatically cast data types based on content analysis:

<?php

use function Flow\ETL\DSL\{data_frame, from_array, to_output};

data_frame()
    ->read(from_array([
        ['id' => '1', 'price' => '19.99', 'active' => 'true'],
        ['id' => '2', 'price' => '29.99', 'active' => 'false'],
        ['id' => '3', 'price' => '39.99', 'active' => 'true'],
    ]))
    ->autoCast() // Automatically cast strings to appropriate types
    ->write(to_output())
    ->run();

// Result: id becomes integer, price becomes float, active becomes boolean

Note: autoCast() analyzes data patterns and attempts to convert string values to more appropriate types like integers, floats, booleans, and dates. Use with caution on large datasets as it requires data analysis.

Adding Entries with withEntry()

Add new columns or modify existing ones using expressions:

<?php

use function Flow\ETL\DSL\{data_frame, from_array, col, lit, concat, to_output};

data_frame()
    ->read(from_array([
        ['first_name' => 'John', 'last_name' => 'Doe', 'salary' => 50000],
        ['first_name' => 'Jane', 'last_name' => 'Smith', 'salary' => 60000],
    ]))
    ->withEntry('full_name', concat(col('first_name'), lit(' '), col('last_name')))
    ->withEntry('annual_bonus', col('salary')->multiply(lit(0.1)))
    ->write(to_output())
    ->run();

Duplicating Rows

Create duplicate rows for testing or data expansion:

duplicateRow() - Duplicate Specific Row

<?php

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'Product A'],
        ['id' => 2, 'name' => 'Product B'],
        ['id' => 3, 'name' => 'Product C'],
    ]))
    ->duplicateRow(1) // Duplicate the second row (0-indexed)
    ->write(to_output())
    ->run();

// Result: Row with id=2 appears twice in the output

Removing Duplicates

Remove duplicate rows from your dataset:

<?php

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'Product A', 'category' => 'Electronics'],
        ['id' => 2, 'name' => 'Product B', 'category' => 'Books'],
        ['id' => 1, 'name' => 'Product A', 'category' => 'Electronics'], // Duplicate
        ['id' => 3, 'name' => 'Product C', 'category' => 'Electronics'],
        ['id' => 2, 'name' => 'Product B', 'category' => 'Books'], // Duplicate
    ]))
    ->dropDuplicates() // Remove all duplicate rows
    ->write(to_output())
    ->run();

// Result: Only unique rows remain

Selective Duplicate Removal

Remove duplicates based on specific columns:

<?php

use function Flow\ETL\DSL\{data_frame, from_array, to_output};

data_frame()
    ->read(from_array([
        ['id' => 1, 'name' => 'Product A', 'version' => 1],
        ['id' => 1, 'name' => 'Product A', 'version' => 2], // Same product, different version
        ['id' => 2, 'name' => 'Product B', 'version' => 1],
        ['id' => 3, 'name' => 'Product C', 'version' => 1],
    ]))
    ->dropDuplicates('id', 'name') // Remove duplicates based on id and name only
    ->write(to_output())
    ->run();

// Result: Keep first occurrence of each id/name combination

Contributors

Join us on GitHub external resource
scroll back to top