Dataflow
https://github.com/sqlparser/sqlflow_public/blob/master/doc/get-started/introduction.md#2-the-entity-in-dataflow
Last updated
https://github.com/sqlparser/sqlflow_public/blob/master/doc/get-started/introduction.md#2-the-entity-in-dataflow
Last updated
SQLFlow generates data lineage by analyzing SQL queries and stored procedures.
The entity in the data lineage model includes table, column, function, relation and other entities . The combination of the entity and dataflow shows the lineage from one table/column to another.
A dataflow unit includes a source entity, a target entity and a dataflow type between them.
This is the dataflow generated for the above SQL query.
Source and target entity usually referes to table, view and other relations such as common table expression, result set generated during the execution of the query. It may also refers to a file in the HDFS system and etc.
There are two types of dataflow: direct and indirect.
The direct dataflow means the data of the target entity comes directly from the source entity.
In the above diagram, the data of RS-1.FirstName
comes from the Person.FirstName
directly.
An arrow is used to represent a direct dataflow in the diagram:
The indirect dataflow means the data of the target column does not come from the source column, but the data of the source column/table impact the result data of the target column.
A dotted line arrow is used to represent an indirect dataflow in the diagram:
The source column in the indirect dataflow usually appears in the following clause:
Where clause
Group by clause
Winddows function
Join condition
The value of COUNT() and SUM(SAL) is impacted by the value of column deptno
in the group by clause. So the indirect dataflows will be created like this:
For other indirect dataflows, we will discuss later.
The join
relationship build a link between 2 or more columns in the join condition. Striclty speaking, the relation is not a dataflow type.
A join relationship will be created after analzying the above SQL. It indicates a join relationship betwee tbl.key
and TT.key
.
When build dataflow between 2 entities: the source and target entity. They can be column to column, or, table to colum, or table to table.
This is the most usual case. Both entites in a dataflow are columns.
It is meant to say the total number of rows of a table impacts the value of a column when we say a table is impacting the value of a column. In most of cases, the column is derived from a COUNT() function.
A table to column dataflow is represented by using a RelationRows pseduo column. An indirect dataflow is built from scott.emp.RelationRows to RS-1.num_emp
Sometimes, there will be a dataflow between 2 tables. For example, in an alter table rename
SQL statement, a table to table dataflow will be created. Acutally, a table to table dataflow is represented by a column to column dataflow using the RelationRows
pseudo column.
A data lineage consists of lots of basic dataflow units.
The data lineage diagram:
The output also available in XML or JSON format .
xml code used in this article is generated by DataFlowAnalyzer tools
digram used in this article is generated by the Gudu SQLFlow Cloud version