Flow PHP

DataFrame

Final: Yes

Methods

__construct() : mixed
aggregate() : self
autoCast() : self
batchSize() : self: Merge/Split Rows yielded by Extractor into batches of given size.
cache() : self: Start processing rows up to this moment and put each instance of Rows into previously defined cache.
collect() : self: Before transforming rows, collect them and merge into single Rows instance.
collectRefs() : self: Collects references to all entries used in this pipeline.
constrain() : self
count() : int
crossJoin() : self
display() : string
drop() : self: Drop given entries.
dropDuplicates() : $this
dropPartitions() : self: Drop all partitions from Rows, additionally when $dropPartitionColumns is set to true, partition columns are also removed.
duplicateRow() : self
fetch() : Rows: Be aware that fetch is not memory safe and will load all rows into memory.
filter() : self
filterPartitions() : self
filters() : self
forEach() : void
get() : Generator<string|int, Rows>: Yields each row as an instance of Rows.
getAsArray() : Generator<string|int, array<string|int, array<string|int, mixed>>>: Yields each row as an array.
getEach() : Generator<string|int, Row>: Yield each row as an instance of Row.
getEachAsArray() : Generator<string|int, array<string|int, mixed>>: Yield each row as an array.
groupBy() : GroupedDataFrame
join() : self
joinEach() : self
limit() : self
load() : self
map() : self
match() : self
mode() : $this: SaveMode defines how Flow should behave when writing to a file or files that already exist.
offset() : self: Skip given number of rows from the beginning of the dataset.
onError() : self
partitionBy() : self
pivot() : self
printRows() : void
printSchema() : void
rename() : self
renameAll() : self
renameAllLowerCase() : self
renameAllStyle() : self
renameAllUpperCase() : self
renameAllUpperCaseFirst() : self
renameAllUpperCaseWord() : self
renameEach() : self
reorderEntries() : self
rows() : self
run() : mixed
saveMode() : self: Alias for DataFrame::mode.
schema() : Schema
select() : self
sortBy() : self
transform() : self: Alias for DataFrame::with().
until() : self: The difference between filter and until is that filter will keep filtering rows until extractors finish yielding rows. Until will send a STOP signal to the Extractor when the condition is not met.
validate() : self
void() : self
with() : self
withEntries() : self
withEntry() : self
write() : self

__construct()


    public __construct(Pipeline $pipeline, Config|FlowContext $context) : mixed

Parameters

$pipeline : Pipeline
$context : Config|FlowContext

aggregate()


    public aggregate(AggregatingFunction ...$aggregations) : self

Parameters

$aggregations : AggregatingFunction

Return values

self

autoCast()


    public autoCast() : self

Return values

self

batchSize()

Merge/Split Rows yielded by Extractor into batches of given size.


    public batchSize(int<1, max> $size) : self

For example, when the Extractor yields one row at a time, this method merges them into batches of the given size before passing them to the next pipeline element. Similarly, when the Extractor yields batches of rows, this method splits them into smaller batches of the given size.

To merge all Rows into a single batch, use DataFrame::collect(). Setting the size to -1 or 0 is also described as having this effect, although the declared type int<1, max> does not admit those values.
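
The merge/split behaviour described above can be pictured with a small standalone sketch. It does not use Flow itself: `rebatch()` is a hypothetical helper written only for illustration, and Flow's actual implementation may differ.

```php
<?php

/**
 * Hypothetical helper, not part of Flow: regroups a stream of row batches
 * into batches of $size rows (the last batch may be smaller).
 *
 * @param iterable<array<int, mixed>> $batches
 */
function rebatch(iterable $batches, int $size): Generator
{
    if ($size < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }

    $buffer = [];
    foreach ($batches as $batch) {
        foreach ($batch as $row) {
            $buffer[] = $row;
            if (count($buffer) === $size) {
                yield $buffer; // emit a full batch
                $buffer = [];
            }
        }
    }

    if ($buffer !== []) {
        yield $buffer; // emit whatever rows remain
    }
}

// Source yielding one row at a time: rows are merged into batches of 2.
$merged = iterator_to_array(rebatch([[1], [2], [3], [4], [5]], 2), false);

// Source yielding one large batch: rows are split into batches of 2.
$split = iterator_to_array(rebatch([[1, 2, 3, 4, 5]], 2), false);
```

In both cases the downstream step sees the same batch shape, regardless of how the source grouped its rows.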

Parameters

$size : int<1, max>

Return values

self

cache()

Start processing rows up to this moment and put each instance of Rows into previously defined cache.


    public cache([null|string $id = null ][, int|null $cacheBatchSize = null ]) : self

Cache type can be set through ConfigBuilder. By default everything is cached in system tmp dir.

Important: the cache batch size can significantly affect performance when processing a large number of rows. A larger batch size increases memory consumption but reduces the number of IO operations. When not set, the batch size is taken from the last DataFrame::batchSize() call.

Parameters

$id : null|string = null
$cacheBatchSize : int|null = null

Return values

self

collect()

Before transforming rows, collect them and merge into single Rows instance.


    public collect() : self

This might lead to memory issues when processing a large number of rows; use with caution.

Return values

self

collectRefs()

Collects references to all entries used in this pipeline.


    public collectRefs(References $references) : self

(new Flow())
  ->read(From::chain())
  ->collectRefs($refs = refs())
  ->run();

Parameters

$references : References

Return values

self

constrain()


    public constrain(Constraint $constraint, Constraint ...$constraints) : self

Parameters

$constraint : Constraint
$constraints : Constraint

Return values

self

count()


    public count() : int

Return values

int

crossJoin()


    public crossJoin(self $dataFrame[, string $prefix = '' ]) : self

Parameters

$dataFrame : self
$prefix : string = ''

Return values

self

display()


    public display([int $limit = 20 ][, bool|int $truncate = 20 ][, Formatter $formatter = new AsciiTableFormatter() ]) : string

Parameters

$limit : int = 20: maximum number of rows to display
$truncate : bool|int = 20: when false or 0, columns are not truncated; otherwise they are truncated, by default to 20 characters
$formatter : Formatter = new AsciiTableFormatter()

Return values

string

drop()

Drop given entries.


    public drop(string|Reference ...$entries) : self

Parameters

$entries : string|Reference

Return values

self

dropDuplicates()


    public dropDuplicates(Reference|string ...$entries) : $this

Parameters

$entries : Reference|string

Return values

$this

dropPartitions()

Drop all partitions from Rows, additionally when $dropPartitionColumns is set to true, partition columns are also removed.


    public dropPartitions([bool $dropPartitionColumns = false ]) : self

Parameters

$dropPartitionColumns : bool = false

Return values

self

duplicateRow()


    public duplicateRow(mixed $condition, WithEntry ...$entries) : self

Parameters

$condition : mixed
$entries : WithEntry

Return values

self

fetch()

Be aware that fetch is not memory safe and will load all rows into memory.


    public fetch([int|null $limit = null ]) : Rows

If you want to safely iterate over Rows, use one of the following methods:

DataFrame::get() : \Generator
DataFrame::getAsArray() : \Generator
DataFrame::getEach() : \Generator
DataFrame::getEachAsArray() : \Generator
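
As a rough illustration of why the generator-based methods are safer, the following standalone sketch contrasts streaming with materializing. It does not call Flow: `generate_rows()` is a hypothetical stand-in for a row source, used only to show the difference in how many rows are held at once.

```php
<?php

/** Hypothetical row source standing in for a DataFrame; not part of Flow. */
function generate_rows(int $total): Generator
{
    for ($i = 1; $i <= $total; $i++) {
        yield ['id' => $i];
    }
}

// Streaming, in the spirit of getEachAsArray(): one row is held at a time.
$sum = 0;
foreach (generate_rows(1000) as $row) {
    $sum += $row['id'];
}

// Materializing, in the spirit of fetch(): every row is held in memory at once.
$all = iterator_to_array(generate_rows(1000), false);
$rowCount = count($all);
```

Passing a `$limit` to fetch() bounds how many rows are loaded, which is the main way to keep the materializing approach predictable.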

Parameters

$limit : int|null = null

Return values

Rows

filter()


    public filter(ScalarFunction $function) : self

Parameters

$function : ScalarFunction

Return values

self

filterPartitions()


    public filterPartitions(Filter|ScalarFunction $filter) : self

Parameters

$filter : Filter|ScalarFunction

Return values

self

filters()


    public filters(array<string|int, ScalarFunction> $functions) : self

Parameters

$functions : array<string|int, ScalarFunction>

Return values

self

forEach()


    public forEach([null|callable(Rows $rows): void $callback = null ]) : void

Parameters

$callback : null|callable(Rows $rows): void = null

get()

Yields each row as an instance of Rows.


    public get() : Generator<string|int, Rows>

Return values

Generator<string|int, Rows>

getAsArray()

Yields each row as an array.


    public getAsArray() : Generator<string|int, array<string|int, array<string|int, mixed>>>

Return values

Generator<string|int, array<string|int, array<string|int, mixed>>>

getEach()

Yield each row as an instance of Row.


    public getEach() : Generator<string|int, Row>

Return values

Generator<string|int, Row>

getEachAsArray()

Yield each row as an array.


    public getEachAsArray() : Generator<string|int, array<string|int, mixed>>

Return values

Generator<string|int, array<string|int, mixed>>

groupBy()


    public groupBy(string|Reference ...$entries) : GroupedDataFrame

Parameters

$entries : string|Reference

Return values

GroupedDataFrame

join()


    public join(self $dataFrame, Expression $on[, string|Join $type = Join::left ]) : self

Parameters

$dataFrame : self
$on : Expression
$type : string|Join = Join::left

Return values

self

joinEach()


    public joinEach(DataFrameFactory $factory, Expression $on[, string|Join $type = Join::left ]) : self

Parameters

$factory : DataFrameFactory
$on : Expression
$type : string|Join = Join::left

Return values

self

limit()


    public limit(int|null $limit) : self

Parameters

$limit : int|null

Return values

self

load()


    public load(Loader $loader) : self

Parameters

$loader : Loader

Return values

self

map()


    public map(callable(Row $row): Row $callback) : self

Parameters

$callback : callable(Row $row): Row

Return values

self

match()


    public match(Schema $schema[, null|SchemaValidator $validator = null ]) : self

Parameters

$schema : Schema

$validator : null|SchemaValidator = null

when null, StrictValidator gets initialized

Return values

self

mode()

SaveMode defines how Flow should behave when writing to a file or files that already exist.


    public mode(SaveMode $mode) : $this

For more details please see SaveMode enum.

Parameters

$mode : SaveMode

Return values

$this

offset()

Skip given number of rows from the beginning of the dataset.


    public offset(int<0, max>|null $offset) : self

When $offset is null, nothing happens (no rows are skipped).

Performance Note: DataFrame must iterate through and process all skipped rows to reach the offset position. For large offsets, this can impact performance as the data source still needs to be read and processed up to the offset point.
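
The performance note can be made concrete with a standalone sketch. It does not use Flow: `counting_source()` and `skip_rows()` are hypothetical helpers that only model a source being read and an offset step discarding rows.

```php
<?php

/** Hypothetical source that records how many rows were actually read. */
function counting_source(int $total, int &$read): Generator
{
    for ($i = 1; $i <= $total; $i++) {
        $read++;
        yield $i;
    }
}

/** Hypothetical offset step: discards the first $offset rows, or none when null. */
function skip_rows(iterable $rows, ?int $offset): Generator
{
    $skipped = 0;
    foreach ($rows as $row) {
        if ($offset !== null && $skipped < $offset) {
            $skipped++; // the row was still read from the source
            continue;
        }
        yield $row;
    }
}

$read = 0;
$result = iterator_to_array(skip_rows(counting_source(10, $read), 8), false);
// Only two rows come out, yet all ten were read from the source.
```

Under this model the cost of a large offset grows with the number of skipped rows, not with the number of rows returned.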

Parameters

$offset : int<0, max>|null

Return values

self

onError()


    public onError(ErrorHandler $handler) : self

Parameters

$handler : ErrorHandler

Return values

self

partitionBy()


    public partitionBy(string|Reference $entry, string|Reference ...$entries) : self

Parameters

$entry : string|Reference
$entries : string|Reference

Return values

self

pivot()


    public pivot(Reference $ref) : self

Parameters

$ref : Reference

Return values

self

printRows()


    public printRows([int|null $limit = 20 ][, int|bool $truncate = 20 ][, Formatter $formatter = new AsciiTableFormatter() ]) : void

Parameters

$limit : int|null = 20
$truncate : int|bool = 20
$formatter : Formatter = new AsciiTableFormatter()

printSchema()


    public printSchema([int|null $limit = 20 ][, SchemaFormatter $formatter = new ASCIISchemaFormatter() ]) : void

Parameters

$limit : int|null = 20
$formatter : SchemaFormatter = new ASCIISchemaFormatter()

rename()


    public rename(string $from, string $to) : self

Parameters

$from : string
$to : string

Return values

self

renameAll()


    public renameAll(string $search, string $replace) : self

Deprecated: use DataFrame::renameEach() with a RenameReplaceStrategy.

Parameters

$search : string
$replace : string

Return values

self

renameAllLowerCase()


    public renameAllLowerCase() : self

Deprecated: use DataFrame::renameEach() with a selected StringStyles.

Return values

self

renameAllStyle()


    public renameAllStyle(StringStyles|StringStyles|string $style) : self

Deprecated: use DataFrame::renameEach() with a selected Style.

Parameters

$style : StringStyles|StringStyles|string

Return values

self

renameAllUpperCase()


    public renameAllUpperCase() : self

Deprecated: use DataFrame::renameEach() with a selected Style.

Return values

self

renameAllUpperCaseFirst()


    public renameAllUpperCaseFirst() : self

Deprecated: use DataFrame::renameEach() with a selected Style.

Return values

self

renameAllUpperCaseWord()


    public renameAllUpperCaseWord() : self

Deprecated: use DataFrame::renameEach() with a selected Style.

Return values

self

renameEach()


    public renameEach(RenameEntryStrategy ...$strategies) : self

Parameters

$strategies : RenameEntryStrategy

Return values

self

reorderEntries()


    public reorderEntries([Comparator $comparator = new TypeComparator() ]) : self

Parameters

$comparator : Comparator = new TypeComparator()

Return values

self

rows()


    public rows(Transformer|Transformation $transformer) : self

Parameters

$transformer : Transformer|Transformation

Return values

self

run()


    public run([null|callable(Rows $rows, FlowContext $context): void $callback = null ][, Analyze|bool $analyze = false ]) : mixed

Parameters

$callback : null|callable(Rows $rows, FlowContext $context): void = null

$analyze : Analyze|bool = false

when set, run() returns a Report

saveMode()

Alias for DataFrame::mode.


    public saveMode(SaveMode $mode) : self

Parameters

$mode : SaveMode

Return values

self

schema()


    public schema() : Schema

Return values

Schema

select()


    public select(string|Reference ...$entries) : self

Parameters

$entries : string|Reference

Return values

self

sortBy()


    public sortBy(Reference ...$entries) : self

Parameters

$entries : Reference

Return values

self

transform()

Alias for DataFrame::with().


    public transform(Transformer|Transformation|Transformations|WithEntry $transformer) : self

Parameters

$transformer : Transformer|Transformation|Transformations|WithEntry

Return values

self

until()

The difference between filter and until is that filter will keep filtering rows until extractors finish yielding rows. Until will send a STOP signal to the Extractor when the condition is not met.


    public until(ScalarFunction $function) : self
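
The contrast between filter and until can be modelled in a standalone sketch. It does not use Flow: `extract_rows()`, `keep_matching()` and `take_until_failure()` are hypothetical helpers, and the early `return` only stands in for the STOP signal mentioned above.

```php
<?php

/** Hypothetical extractor that records how many rows it produced. */
function extract_rows(int $total, int &$produced): Generator
{
    for ($i = 1; $i <= $total; $i++) {
        $produced++;
        yield $i;
    }
}

/** filter-like step: drops non-matching rows but keeps reading the source. */
function keep_matching(iterable $rows, callable $condition): Generator
{
    foreach ($rows as $row) {
        if ($condition($row)) {
            yield $row;
        }
    }
}

/** until-like step: stops reading as soon as the condition is not met. */
function take_until_failure(iterable $rows, callable $condition): Generator
{
    foreach ($rows as $row) {
        if (!$condition($row)) {
            return; // stands in for the STOP signal
        }
        yield $row;
    }
}

$belowFour = static fn (int $row): bool => $row < 4;

$producedForFilter = 0;
$filtered = iterator_to_array(
    keep_matching(extract_rows(10, $producedForFilter), $belowFour),
    false
);

$producedForUntil = 0;
$stopped = iterator_to_array(
    take_until_failure(extract_rows(10, $producedForUntil), $belowFour),
    false
);
```

Both steps return the same rows here, but the filter-like step drains the whole source while the until-like step stops after the first row that fails the condition.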

Parameters

$function : ScalarFunction

Return values

self

validate()


    public validate(Schema $schema[, null|SchemaValidator $validator = null ]) : self

Deprecated: please use DataFrame::match() instead.

Parameters

$schema : Schema

$validator : null|SchemaValidator = null

when null, StrictValidator gets initialized

Return values

self

void()


    public void() : self

Return values

self

with()


    public with(Transformer|Transformation|Transformations|WithEntry $transformer) : self

Parameters

$transformer : Transformer|Transformation|Transformations|WithEntry

Return values

self

withEntries()


    public withEntries(array<int, WithEntry>|array<string, ScalarFunction|WindowFunction|WithEntry> $references) : self

Parameters

$references : array<int, WithEntry>|array<string, ScalarFunction|WindowFunction|WithEntry>

Return values

self

withEntry()


    public withEntry(Definition<string|int, mixed>|string $entry, ScalarFunction|WindowFunction $reference) : self

Parameters

$entry : Definition<string|int, mixed>|string
$reference : ScalarFunction|WindowFunction

Return values

self

write()


    public write(Loader $loader) : self

Parameters

$loader : Loader

Return values

self
