Flow Command Line Interface

Installation

composer require flow-php/cli:~0.22.0

In some cases, it might make sense to install the CLI globally:

composer global require flow-php/cli:~0.22.0

Now you can run the CLI using the flow command.

Docker

The Flow CLI application is also available as a Docker image:

docker run -v $(pwd):/flow-workspace -it ghcr.io/flow-php/flow:latest --version
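
If you prefer not to install the CLI locally, a small shell function can stand in for the flow command by forwarding its arguments to the Docker image. This is only a sketch built from the docker run invocation above. The wrapper function is not part of Flow, and it assumes that relative paths inside the mounted /flow-workspace directory resolve the same way they do on the host.

```shell
# Hypothetical wrapper (not shipped with Flow): forwards all arguments
# to the Docker image, mounting the current directory as the workspace.
# The -it flags are taken from the command above and require a terminal.
flow() {
    docker run -v "$(pwd)":/flow-workspace -it ghcr.io/flow-php/flow:latest "$@"
}

# Usage, once the function is defined in your shell session:
#   flow --version
```

Defining the wrapper as a function rather than an alias keeps argument forwarding explicit through "$@", so quoted file names with spaces are passed through unchanged.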

Commands

Config

All Flow CLI commands can be configured using the --config option. The option accepts a path to a PHP configuration file that returns a Config or ConfigBuilder instance.

.flow.php

<?php

use function Flow\ETL\DSL\config_builder;

return config_builder()
    ->id('execution-id');

flow read --config .flow.php orders.csv

One of the most common use cases is mounting a custom filesystem into the Flow fstab to access remote files through the CLI.

$ flow
Flow PHP - Data processing framework

Usage:
  command [options] [arguments]

Options:
  -h, --help            Display help for the given command. When no command is given display help for the list command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Available commands:
  completion             Dump the shell completion script
  help                   Display help for a command
  list                   List commands
 file
  file:convert           [convert] Read data from a file.
  file:read              [read] Read data from a file.
  file:rows:count        [count] Read data schema from a file.
  file:schema            [schema] Read data schema from a file.
 parquet
  parquet:read           [parquet:read:data] Read data from parquet file
  parquet:read:metadata  Read metadata from parquet file
 pipeline
  pipeline:run           [run] Execute ETL pipeline from a php/json file.

`file:convert` alias `convert`

$ flow file:convert --help
Description:
  Read data from a file.

Usage:
  file:convert [options] [--] <input-file> <output-file>
  convert

Arguments:
  input-file                                                         Path to a file that should be converted to another format.
  output-file                                                        Path where converted file should be saved.

Options:
      --input-file-format=INPUT-FILE-FORMAT                          File format. When not set file format is guessed from input file path extension
      --input-file-batch-size=INPUT-FILE-BATCH-SIZE                  Number of rows that are going to be read and displayed in one batch, when set to -1 whole dataset will be displayed at once [default: 100]
      --input-file-limit=INPUT-FILE-LIMIT                            Limit number of rows that are going to be used to infer file schema, when not set whole file is analyzed
      --output-file-format=OUTPUT-FILE-FORMAT                        File format. When not set file format is guessed from output file path extension
      --output-overwrite[=OUTPUT-OVERWRITE]                          When set output file will be overwritten if exists
      --schema-auto-cast[=SCHEMA-AUTO-CAST]                          When set Flow will try to automatically cast values to more precise data types, for example datetime strings will be casted to datetime type [default: false]
      --analyze[=ANALYZE]                                            Collect processing statistics and print them. [default: false]
      --config=CONFIG                                                Path to a local php file that MUST return instance of: Flow\ETL\Config
      --input-json-pointer=INPUT-JSON-POINTER                        JSON Pointer to a subtree from which schema should be extracted
      --input-json-pointer-entry-name                                When set, JSON Pointer will be used as an entry name in the schema
      --input-csv-header[=INPUT-CSV-HEADER]                          When set, CSV header will be used as a schema
      --input-csv-empty-to-null[=INPUT-CSV-EMPTY-TO-NULL]            When set, empty CSV values will be treated as NULL values
      --input-csv-separator=INPUT-CSV-SEPARATOR                      CSV separator character
      --input-csv-enclosure=INPUT-CSV-ENCLOSURE                      CSV enclosure character
      --input-csv-escape=INPUT-CSV-ESCAPE                            CSV escape character
      --output-csv-header[=OUTPUT-CSV-HEADER]                        When set, CSV header will be used as a schema
      --output-csv-new-line-separator=OUTPUT-CSV-NEW-LINE-SEPARATOR  When set, empty CSV values will be treated as NULL values
      --output-csv-separator=OUTPUT-CSV-SEPARATOR                    CSV separator character
      --output-csv-enclosure=OUTPUT-CSV-ENCLOSURE                    CSV enclosure character
      --output-csv-escape=OUTPUT-CSV-ESCAPE                          CSV escape character
      --output-csv-date-time-format=OUTPUT-CSV-DATE-TIME-FORMAT      DateTime format for CSV output
      --input-excel-header=INPUT-EXCEL-HEADER                        When set, Excel header will be used as a schema
      --input-excel-sheet-name=INPUT-EXCEL-SHEETNAME                 When set, Excel sheet name will be selected for reading
      --input-excel-offset=INPUT-EXCEL-OFFSET                        Offset to start reading from
      --input-xml-node-path=INPUT-XML-NODE-PATH                      XML node path to a subtree from which schema should be extracted, for example /root/element This is not xpath, just a node names separated by slash
      --input-xml-buffer-size=INPUT-XML-BUFFER-SIZE                  XML buffer size in bytes
      --input-parquet-columns=INPUT-PARQUET-COLUMNS                  Columns to read from parquet file (multiple values allowed)
      --input-parquet-offset=INPUT-PARQUET-OFFSET                    Offset to start reading from
  -h, --help                                                         Display help for the given command. When no command is given display help for the list command
  -q, --quiet                                                        Do not output any message
  -V, --version                                                      Display this application version
      --ansi|--no-ansi                                               Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                                               Do not ask any interactive question
  -v|vv|vvv, --verbose                                               Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
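
As a rough illustration of the options above, the helper below converts a file to CSV and saves it next to the original. The function name and the file name in the usage comment are hypothetical. Only the file:convert command and the --output-overwrite option come from the help output, and the sketch assumes that a bare --output-overwrite is enough to enable overwriting.

```shell
# Hypothetical helper: convert a supported input file to CSV.
# The output path is the input path with its last extension replaced by .csv,
# so Flow can guess both formats from the file extensions.
convert_to_csv() {
    flow file:convert "$1" "${1%.*}.csv" --output-overwrite
}

# Usage (orders.json is a placeholder file name):
#   convert_to_csv orders.json
```

When the extensions do not reflect the real formats, --input-file-format and --output-file-format from the option list can be passed explicitly instead of relying on extension guessing.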

`file:schema` alias `schema`

$ flow file:schema --help
Description:
  Read data schema from a file.

Usage:
  file:schema [options] [--] <input-file>
  schema

Arguments:
  input-file                                               Path to a file from which schema should be extracted.

Options:
      --input-file-format=INPUT-FILE-FORMAT                Source file format. When not set file format is guessed from source file path extension
      --input-file-limit=INPUT-FILE-LIMIT                  Limit number of rows that are going to be used to infer file schema, when not set whole file is analyzed
      --output-pretty                                      Pretty print schema
      --output-table                                       Pretty schema as ascii table
      --schema-auto-cast[=SCHEMA-AUTO-CAST]                When set Flow will try to automatically cast values to more precise data types, for example datetime strings will be casted to datetime type [default: false]
      --config=CONFIG                                      Path to a local php file that MUST return instance of: Flow\ETL\Config
      --input-json-pointer=INPUT-JSON-POINTER              JSON Pointer to a subtree from which schema should be extracted
      --input-json-pointer-entry-name                      When set, JSON Pointer will be used as an entry name in the schema
      --input-csv-header[=INPUT-CSV-HEADER]                When set, CSV header will be used as a schema
      --input-csv-empty-to-null[=INPUT-CSV-EMPTY-TO-NULL]  When set, empty CSV values will be treated as NULL values
      --input-csv-separator=INPUT-CSV-SEPARATOR            CSV separator character
      --input-csv-enclosure=INPUT-CSV-ENCLOSURE            CSV enclosure character
      --input-csv-escape=INPUT-CSV-ESCAPE                  CSV escape character
      --input-excel-header=INPUT-EXCEL-HEADER              When set, Excel header will be used as a schema
      --input-excel-sheet-name=INPUT-EXCEL-SHEETNAME       When set, Excel sheet name will be selected for reading
      --input-excel-offset=INPUT-EXCEL-OFFSET              Offset to start reading from
      --input-xml-node-path=INPUT-XML-NODE-PATH            XML node path to a subtree from which schema should be extracted, for example /root/element This is not xpath, just a node names separated by slash
      --input-xml-buffer-size=INPUT-XML-BUFFER-SIZE        XML buffer size in bytes
      --input-parquet-columns=INPUT-PARQUET-COLUMNS        Columns to read from parquet file (multiple values allowed)
      --input-parquet-offset=INPUT-PARQUET-OFFSET          Offset to start reading from
  -h, --help                                               Display help for the given command. When no command is given display help for the list command
  -q, --quiet                                              Do not output any message
  -V, --version                                            Display this application version
      --ansi|--no-ansi                                     Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                                     Do not ask any interactive question
  -v|vv|vvv, --verbose                                     Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Example:

$ flow schema orders.csv --output-table --schema-auto-cast
+------------+----------+----------+-------------+----------+
|       name |     type | nullable | scalar_type | metadata |
+------------+----------+----------+-------------+----------+
|   order_id |     uuid |    false |             |       [] |
| created_at | datetime |    false |             |       [] |
| updated_at | datetime |    false |             |       [] |
|   discount |   scalar |     true |      string |       [] |
|    address |     json |    false |             |       [] |
|      notes |     json |    false |             |       [] |
|      items |     json |    false |             |       [] |
+------------+----------+----------+-------------+----------+
7 rows

`file:analyze` alias `analyze`

$ flow file:analyze --help
Description:
  Analyze a file.

Usage:
  file:analyze [options] [--] <input-file>
  analyze

Arguments:
  input-file                                               Path to a file from which schema should be extracted.

Options:
      --input-file-format=INPUT-FILE-FORMAT                File format. When not set file format is guessed from source file path extension
      --input-file-batch-size=INPUT-FILE-BATCH-SIZE        Number of rows that are going to be read and displayed in one batch, when set to -1 whole dataset will be displayed at once [default: 1000]
      --input-file-limit=INPUT-FILE-LIMIT                  Limit number of rows that are going to be used to infer file schema, when not set whole file is analyzed
      --config=CONFIG                                      Path to a local php file that MUST return instance of: Flow\ETL\Config
      --input-json-pointer=INPUT-JSON-POINTER              JSON Pointer to a subtree from which schema should be extracted
      --input-json-pointer-entry-name                      When set, JSON Pointer will be used as an entry name in the schema
      --input-csv-header[=INPUT-CSV-HEADER]                When set, CSV header will be used as a schema
      --input-csv-empty-to-null[=INPUT-CSV-EMPTY-TO-NULL]  When set, empty CSV values will be treated as NULL values
      --input-csv-separator=INPUT-CSV-SEPARATOR            CSV separator character
      --input-csv-enclosure=INPUT-CSV-ENCLOSURE            CSV enclosure character
      --input-csv-escape=INPUT-CSV-ESCAPE                  CSV escape character
      --input-excel-header=INPUT-EXCEL-HEADER              When set, Excel header will be used as a schema
      --input-excel-sheet-name=INPUT-EXCEL-SHEETNAME       When set, Excel sheet name will be selected for reading
      --input-excel-offset=INPUT-EXCEL-OFFSET              Offset to start reading from
      --input-xml-node-path=INPUT-XML-NODE-PATH            XML node path to a subtree from which schema should be extracted, for example /root/element This is not xpath, just a node names separated by slash
      --input-xml-buffer-size=INPUT-XML-BUFFER-SIZE        XML buffer size in bytes
      --input-parquet-columns=INPUT-PARQUET-COLUMNS        Columns to read from parquet file (multiple values allowed)
      --input-parquet-offset=INPUT-PARQUET-OFFSET          Offset to start reading from
      --stats-schema[=STATS-SCHEMA]                        Prints schema of executed data transformation pipeline. [default: false]
      --stats-columns[=STATS-COLUMNS]                      Prints number of rows in dataset. [default: false]
  -h, --help                                               Display help for the given command. When no command is given display help for the list command
      --silent                                             Do not output any message
  -q, --quiet                                              Only errors are displayed. All other output is suppressed
  -V, --version                                            Display this application version
      --ansi|--no-ansi                                     Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                                     Do not ask any interactive question
  -v|vv|vvv, --verbose                                     Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

`file:read` alias `read`

$ flow read --help
Description:
  Read data from a file.

Usage:
  file:read [options] [--] <input-file>
  read

Arguments:
  input-file                                               Path to a file from which schema should be extracted.

Options:
      --input-file-format=INPUT-FILE-FORMAT                File format. When not set file format is guessed from source file path extension
      --input-file-batch-size=INPUT-FILE-BATCH-SIZE        Number of rows that are going to be read and displayed in one batch, when set to -1 whole dataset will be displayed at once [default: 100]
      --input-file-limit=INPUT-FILE-LIMIT                  Limit number of rows that are going to be used to infer file schema, when not set whole file is analyzed
      --output-truncate=OUTPUT-TRUNCATE                    Truncate output to given number of characters, when set to -1 output is not truncated at all [default: 20]
      --schema-auto-cast[=SCHEMA-AUTO-CAST]                When set Flow will try to automatically cast values to more precise data types, for example datetime strings will be casted to datetime type [default: false]
      --config=CONFIG                                      Path to a local php file that MUST return instance of: Flow\ETL\Config
      --input-json-pointer=INPUT-JSON-POINTER              JSON Pointer to a subtree from which schema should be extracted
      --input-json-pointer-entry-name                      When set, JSON Pointer will be used as an entry name in the schema
      --input-csv-header[=INPUT-CSV-HEADER]                When set, CSV header will be used as a schema
      --input-csv-empty-to-null[=INPUT-CSV-EMPTY-TO-NULL]  When set, empty CSV values will be treated as NULL values
      --input-csv-separator=INPUT-CSV-SEPARATOR            CSV separator character
      --input-csv-enclosure=INPUT-CSV-ENCLOSURE            CSV enclosure character
      --input-csv-escape=INPUT-CSV-ESCAPE                  CSV escape character
      --input-excel-header=INPUT-EXCEL-HEADER              When set, Excel header will be used as a schema
      --input-excel-sheet-name=INPUT-EXCEL-SHEETNAME       When set, Excel sheet name will be selected for reading
      --input-excel-offset=INPUT-EXCEL-OFFSET              Offset to start reading from
      --input-xml-node-path=INPUT-XML-NODE-PATH            XML node path to a subtree from which schema should be extracted, for example /root/element This is not xpath, just a node names separated by slash
      --input-xml-buffer-size=INPUT-XML-BUFFER-SIZE        XML buffer size in bytes
      --input-parquet-columns=INPUT-PARQUET-COLUMNS        Columns to read from parquet file (multiple values allowed)
      --input-parquet-offset=INPUT-PARQUET-OFFSET          Offset to start reading from
  -h, --help                                               Display help for the given command. When no command is given display help for the list command
  -q, --quiet                                              Do not output any message
  -V, --version                                            Display this application version
      --ansi|--no-ansi                                     Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                                     Do not ask any interactive question
  -v|vv|vvv, --verbose                                     Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
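
The default truncation of 20 characters often hides long values, so a small helper for previewing files can be handy. The sketch below uses only options listed in the help output above. The function name and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: print a file without truncating cell values.
# --output-truncate=-1 disables truncation (the help default is 20 characters).
# The optional second argument overrides the batch size (the help default is 100).
preview_file() {
    flow file:read "$1" --input-file-batch-size="${2:-100}" --output-truncate=-1
}

# Usage (orders.csv is a placeholder file name):
#   preview_file orders.csv
#   preview_file orders.csv 10
```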

`file:rows:count` alias `count`

$ flow count --help
Description:
  Read data schema from a file.

Usage:
  file:rows:count [options] [--] <input-file>
  count

Arguments:
  input-file                                               Path to a file from which schema should be extracted.

Options:
      --input-file-format=INPUT-FILE-FORMAT                Source file format. When not set file format is guessed from source file path extension
      --input-file-limit=INPUT-FILE-LIMIT                  Limit number of rows that are going to be used to infer file schema, when not set whole file is analyzed
      --config=CONFIG                                      Path to a local php file that MUST return instance of: Flow\ETL\Config
      --input-json-pointer=INPUT-JSON-POINTER              JSON Pointer to a subtree from which schema should be extracted
      --input-json-pointer-entry-name                      When set, JSON Pointer will be used as an entry name in the schema
      --input-csv-header[=INPUT-CSV-HEADER]                When set, CSV header will be used as a schema
      --input-csv-empty-to-null[=INPUT-CSV-EMPTY-TO-NULL]  When set, empty CSV values will be treated as NULL values
      --input-csv-separator=INPUT-CSV-SEPARATOR            CSV separator character
      --input-csv-enclosure=INPUT-CSV-ENCLOSURE            CSV enclosure character
      --input-csv-escape=INPUT-CSV-ESCAPE                  CSV escape character
      --input-excel-header=INPUT-EXCEL-HEADER              When set, Excel header will be used as a schema
      --input-excel-sheet-name=INPUT-EXCEL-SHEETNAME       When set, Excel sheet name will be selected for reading
      --input-excel-offset=INPUT-EXCEL-OFFSET              Offset to start reading from
      --input-xml-node-path=INPUT-XML-NODE-PATH            XML node path to a subtree from which schema should be extracted, for example /root/element This is not xpath, just a node names separated by slash
      --input-xml-buffer-size=INPUT-XML-BUFFER-SIZE        XML buffer size in bytes
      --input-parquet-columns=INPUT-PARQUET-COLUMNS        Columns to read from parquet file (multiple values allowed)
      --input-parquet-offset=INPUT-PARQUET-OFFSET          Offset to start reading from
  -h, --help                                               Display help for the given command. When no command is given display help for the list command
  -q, --quiet                                              Do not output any message
  -V, --version                                            Display this application version
      --ansi|--no-ansi                                     Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                                     Do not ask any interactive question
  -v|vv|vvv, --verbose                                     Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

`parquet:read:metadata`

$ flow parquet:read:metadata --help
Description:
  Read metadata from parquet file

Usage:
  parquet:read:metadata [options] [--] <file>

Arguments:
  file                  path to parquet file

Options:
      --columns         Display column details
      --row-groups      Display row group details
      --column-chunks   Display column chunks details
      --statistics      Display column chunks statistics details
      --page-headers    Display page headers details
  -h, --help            Display help for the given command. When no command is given display help for the list command
  -q, --quiet           Do not output any message
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

`pipeline:run`

$ flow pipeline:run --help
Description:
  Execute data processing pipeline from a php file.

Usage:
  pipeline:run [options] [--] <pipeline-file>
  run

Arguments:
  pipeline-file                        Path to a php/json with DataFrame definition.

Options:
      --analyze[=ANALYZE]              Collect processing statistics and print them. [default: false]
      --config=CONFIG                  Path to a local php file that MUST return instance of: Flow\ETL\Config
      --stats-schema[=STATS-SCHEMA]    Prints schema of executed data transformation pipeline. [default: false]
      --stats-columns[=STATS-COLUMNS]  Prints number of rows in dataset. [default: false]
  -h, --help                           Display help for the given command. When no command is given display help for the list command
      --silent                         Do not output any message
  -q, --quiet                          Only errors are displayed. All other output is suppressed
  -V, --version                        Display this application version
      --ansi|--no-ansi                 Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                 Do not ask any interactive question
  -v|vv|vvv, --verbose                 Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Help:
  pipeline-file argument must point to a valid php file that returns DataFrame instance.
  Make sure to not execute run() or any other trigger function.
  
  Example of pipeline.php:
  <?php
  return df()
      ->read(from_array([
          ['id' => 1, 'name' => 'User 01', 'active' => true],
          ['id' => 2, 'name' => 'User 02', 'active' => false],
          ['id' => 3, 'name' => 'User 03', 'active' => true],
      ]))
      ->collect()
      ->write(to_output());
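
The pipeline.php example in the help text omits its imports. The sketch below writes a complete file and runs it when the flow binary is installed. The use statement is an assumption: it expects df(), from_array() and to_output() to be available in the Flow\ETL\DSL namespace, the same namespace that config_builder() is imported from in the Config section.

```shell
# Write a complete pipeline file. The use statement is an assumption:
# it expects df(), from_array() and to_output() to live in Flow\ETL\DSL.
cat > pipeline.php <<'PHP'
<?php

use function Flow\ETL\DSL\{df, from_array, to_output};

// Return the DataFrame without calling run(); the CLI triggers execution.
return df()
    ->read(from_array([
        ['id' => 1, 'name' => 'User 01', 'active' => true],
        ['id' => 2, 'name' => 'User 02', 'active' => false],
        ['id' => 3, 'name' => 'User 03', 'active' => true],
    ]))
    ->collect()
    ->write(to_output());
PHP

# Execute the pipeline when the flow binary is available on PATH.
if command -v flow >/dev/null 2>&1; then
    flow pipeline:run pipeline.php
fi
```

The quoted heredoc delimiter ('PHP') keeps the shell from expanding anything inside the PHP source, so the file is written exactly as shown.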

`db:table:schema`

Prints a schema of a dataset from a database table.

Note: The database connection string can be passed through the FLOW_DB_CONNECTION_STRING environment variable; otherwise the command will ask for it. If you need a custom connection, use the --db-connection-file option, which should point to a PHP file that returns an instance of \Doctrine\DBAL\Connection.

$ flow db:table:schema --help
Description:
  Read data schema from a database table.

Usage:
  db:table:schema [options] [--] [<input-db-table>]

Arguments:
  input-db-table                               Table name for which we are going to generate schema.

Options:
      --output-php                             Print schema as PHP code
      --output-table                           Print schema as ascii table
      --output-ascii                           Print schema as ascii list
      --config=CONFIG                          Path to a local php file that MUST return instance of: Flow\ETL\Config
  -c, --db-connection-file=DB-CONNECTION-FILE  Path to file that returns and instance of \Doctrine\DBAL\Connection
  -h, --help                                   Display help for the given command. When no command is given display help for the list command
      --silent                                 Do not output any message
  -q, --quiet                                  Only errors are displayed. All other output is suppressed
  -V, --version                                Display this application version
      --ansi|--no-ansi                         Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                         Do not ask any interactive question
  -v|vv|vvv, --verbose                         Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Help:
  Database connection string can be passed through FLOW_DB_CONNECTION_STRING environment variable, otherwise command will ask for it.
  --db-connection-file option takes priority over FLOW_DB_CONNECTION_STRING environment.
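
To avoid the interactive prompt, the connection string can be supplied for a single invocation through the environment. The helper below is a sketch under stated assumptions: the function name is hypothetical, and the connection string format depends on your database and is not specified here.

```shell
# Hypothetical helper: print the schema of one table as an ASCII table.
# $1 is the connection string (its exact format is not covered here),
# $2 is the table name. The variable is set only for this one command.
describe_table() {
    FLOW_DB_CONNECTION_STRING="$1" flow db:table:schema "$2" --output-table
}

# Usage (both values are placeholders):
#   describe_table "$MY_CONNECTION_STRING" orders
```

Keep in mind that, per the help text, --db-connection-file takes priority over the environment variable when both are provided.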

`db:table:list`

$ flow db:table:list --help
Description:
  Print list of datasets from a database.

Usage:
  db:table:list [options]

Options:
      --db-namespace=DB-NAMESPACE              List of namespaces for which this command will list tables, multiple values allowed. When not set, all tables from all namespaces are listed. (multiple values allowed)
      --config=CONFIG                          Path to a local php file that MUST return instance of: Flow\ETL\Config
  -c, --db-connection-file=DB-CONNECTION-FILE  Path to file that returns and instance of \Doctrine\DBAL\Connection
  -h, --help                                   Display help for the given command. When no command is given display help for the list command
      --silent                                 Do not output any message
  -q, --quiet                                  Only errors are displayed. All other output is suppressed
  -V, --version                                Display this application version
      --ansi|--no-ansi                         Force (or disable --no-ansi) ANSI output
  -n, --no-interaction                         Do not ask any interactive question
  -v|vv|vvv, --verbose                         Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Help:
  Database connection string can be passed through FLOW_DB_CONNECTION_STRING environment variable, otherwise command will ask for it.
  --db-connection-file option takes priority over FLOW_DB_CONNECTION_STRING environment.

`schema:format`

Takes a JSON schema and prints it in another format, including PHP code
that can then be used directly to validate datasets.

$ flow schema:format --help
Description:
  Print a json schema in one of the available formats.

Usage:
  schema:format [options] [--] <input-schema-file>
  format

Arguments:
  input-schema-file     Path to a json with schema Flow.

Options:
      --output-php      Print schema as PHP code
      --output-table    Print schema as ascii table
      --output-ascii    Print schema as ascii list
      --config=CONFIG   Path to a local php file that MUST return instance of: Flow\ETL\Config
  -h, --help            Display help for the given command. When no command is given display help for the list command
      --silent          Do not output any message
  -q, --quiet           Only errors are displayed. All other output is suppressed
  -V, --version         Display this application version
      --ansi|--no-ansi  Force (or disable --no-ansi) ANSI output
  -n, --no-interaction  Do not ask any interactive question
  -v|vv|vvv, --verbose  Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
