Gudu SQLFlow Product Docs
  • 1. Introduction
    • What is Gudu SQLFlow?
      • What SQLFlow can do
      • Architecture Overview
    • Getting Started
      • Sign up a new account
        • Team Management
        • Delete My Account
        • Activate by entering a coupon
      • How to use SQLFlow
      • Different modes in Gudu SQLFlow
        • Query mode
        • Job mode
      • Basic Usage
      • Convert SQL to E-R Diagram
      • Colors in SQLFlow Diagram
      • Show call relationship
    • Installation
      • Version and Users
        • Cloud and On-Premise version
        • SQLFlow before Version 6
          • For older version SQLFlow under Linux
          • For older version SQLFlow under MacOS
          • For older version SQLFlow under Windows
      • Linux
      • MacOS
      • Windows
      • Docker
      • Clickhouse Installation
        • Clickhouse For CentOs
        • Clickhouse For Ubuntu/Debian/RHEL
      • Troubleshooting
      • Upgrade
      • Third Party Components
      • Renew License File
    • UI
      • SQLText Editor
      • Schema Explorer
      • Diagram Panel
      • Settings
      • Job Management
        • Job Sources
    • Dlineage Tool
      • Overview
      • Usage
        • Analyze data linege from SQL files
        • Analyze data linege from a database
        • Resolve the ambiguous columns in SQL query
        • Map the DataFlowAnalyzer and the settings on SQLFlow UI
        • Settings
      • Dataflow.xml structure
      • FAQ
  • 2. CONCEPTS
    • Data Lineage Basics
      • Dataflow
        • Relations generated by SQLFlow
      • Direct Dataflow
      • Indirect Dataflow
      • Aggregate function and Dataflow
      • Dataflow chain
    • Data Lineage Format Reference
  • 3. API Docs
    • Prerequisites
    • Using the Rest API
    • SQLFlow Rest API reference
      • User Interface
      • Generation Interface
        • /sqlflow
        • /sqlflow/selectedgraph/table_level_lineage
        • /sqlflow/selectedgraph/image
        • /sqlflow/graph
        • /sqlflow/graph/table_level_lineage
        • /sqlflow/graph/image
        • /sqlflow/downstreamGraph
        • /sqlflow/upstreamGraph
        • /sqlflow/erdiagramSelectGraph
        • /sqlflow/leftMostSourceTableGraph
      • Job Interface
        • /submitUserJob
        • /displayUserJobSummary
        • /displayUserJobsSummary
        • /exportLineageAsJson
        • /exportFullLineageAsJson
        • /exportLineageAsGraphml
        • /submitPersistJob
        • /displayUserLatestJobTableLevelLineage
      • Export Image
      • Export CSV
        • /sqlflow/exportFullLineageAsCsv
        • /job/exportFullLineageAsCsv
    • Swagger UI
    • Export the data lineage result
    • Python
      • Basic Usage
      • Advanced Usage
    • SQL Parser API
      • checkSyntax
  • 4. SQLFlow Widget
    • Widget Get started
    • Usages
    • Widget API Reference
  • 5. Databases
    • Database Objects
      • Azure
      • DB2
  • 6. SQLFlow-ingester
    • Introduction
      • SQLFlow-Exporter
      • SQLFlow-Extractor
      • SQLFlow-Submitter
    • Get Started
      • SQL Server
    • SQLFlow-Ingester Java API Usage
    • Understand the format of exported data
      • Oracle
      • Microsoft SQL Server
      • MySQL
      • PostgreSQL
    • List of Supported dbVendors
    • Git Repo
    • Third Party Components
  • 7. Reference
    • Lineage Model
      • Json Format Lineage Model
      • XML Format Lineage Model
      • Data Lineage Elements
    • Database Model
  • 8. other
    • FAQ
      • Handling Internal Database
      • Delete Your Account
      • Table Form Data Without Intermediates
      • Not all schema exported from Oracle
      • Lineage Customization
    • Roadmap
    • SQL Samples
      • Exchange table partition
      • Generate relationship for renamed table
      • Snowflake table function lineage detection
    • Change Logs
    • SQLFlow with Oracle XML functions
    • Major Organizations Utilizing SQLFlow
Powered by GitBook
On this page
  • RelationRows
  • 1 RelationsRows in target relation
  • 2 RelationRows in source relation
  • 3. Table level dataflow using RelationRows
  • Where and Group-By
  • 1. Columns in where clause
  • 2. COUNT()
  • 3. Other aggregate function
  • 4. Summary
  • References
  1. 2. CONCEPTS
  2. Data Lineage Basics

Indirect Dataflow

https://github.com/sqlparser/sqlflow_public/tree/master/doc/get-started

PreviousDirect DataflowNextAggregate function and Dataflow

Last updated 2 years ago

This article introduces some SQL elements that generate indirect dataflow. Indirect dataflow usually is generated from columns used in the where clause, group by clause, aggregate function and etc.

RelationRows

In order to create indirect dataflow between columns, we introduce a pseudo column: RelationRows.

RelationRows is a pseudo column of a relation used to represents the number of rows in a relation. As its name indicates, RelationRows is not a real column in the relation(table/resultset and etc). Usually, It is used to represents a dataflow between a column and a relation.

RelationRows pseudo column can be used in both the source and target relation.

1 RelationsRows in target relation

Take this SQL for example:

SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The total number of rows of the select list is impacted by the value of column sal in the where clause. So, an indirect dataflow is created like this:

scott.emp.sal -> indirect -> RS-1.RelationRows

2 RelationRows in source relation

Here is another sample:

SELECT count() totalNum, sum(sal) totalSal 
FROM   scott.emp 

The value of count() function and sum(sal) function is impacted by the number of rows in the scott.emp source table.

scott.emp.RelationRows -> indirect -> count()
scott.emp.RelationRows -> indirect -> sum(sal)

3. Table level dataflow using RelationRows

RelationRows is also used to represent table level dataflow.

alter table t2 rename to t3;

A table level dataflow is not built on a table, but on the pseudo column RelationRows like this:

t2.RelationRows -> direct -> t3.RelationRows

Build a table to table dataflow that using the RelationRows pseudo column for 2 reasons:

  • This pseudo column that used to represent a table to column dataflow will be used to generate a table to table dataflow later if user need a table-level lineage model.

  • If other columns in the same table are used in a column to column dataflow while this table itself is also in a table to table dataflow, then, this pseudo column will make it possible for a single table to includes both the column to column dataflow and table to table dataflow.

take this SQL for example

create view v1 as select f1 from t2;
alter table t2 rename to t3;

The first create view statement will generate a column-level dataflow between the table t2 and view v1,

t2.f1 -> direct -> RS-1.f1 -> direct -> v1.f1

while the second alter table statement will genereate a table-level dataflow between the table t2 and t3.

t2.RelationRows -> direct -> t3.RelationRows

As you can see, Table t2 involved in the column to column dataflow generated by the create view statement, It also involved in a table to table dataflow generated by the alter table statement. A single table t2 in the above digram shows that it includes both the column to column dataflow and a table to table dataflow.

Where and Group-By

1. Columns in where clause

Some of the columns in source tables in WHERE clause do not influence target columns but are crucial for the selected row set, so they should be saved for impact analyses, with an indirect dataflow to the target tables.

Take this SQL for example:

SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The total number of rows of the select list is impacted by the value of column sal in the where clause. We build an indirect dataflow for this relationship.

scott.emp.sal -> indirect -> RS-1.RelationRows

2. COUNT()

COUNT() function is an aggregate function that used to calculate the total number of rows of a relation.

2.1 where clause without group by clause

SELECT COUNT() num_emp
FROM scott.emp
where city=1

In above SQL, two indirect dataflows will be created, because the value of COUNT() is impacted by the city column in where clause and the total number of rows of scott.emp table.

scott.emp.city -> indirect -> COUNT()
scott.emp.RelationRow -> indirect -> COUNT()

2.2 where clause with group by clause

SELECT deptno, count() total_num
FROM scott.emp
where city=1
group by deptno

As you can see, besides the two indirect dataflows created in the previous SQL, a third indirect dataflow is created using the deptno in the group by clause.

scott.emp.city -> indirect -> COUNT()
scott.emp.Relations -> indirect -> COUNT()
scott.emp.deptno -> indirect -> COUNT()

3. Other aggregate function

When creating indirect dataflow, other aggregate functions such as SUM() works a little bit differently to the COUNT() function.

3.1 where clause with group by clause

SELECT deptno, SUM(SAL) sal_sum
FROM scott.emp
where city=1
group by deptno

aggregate function such as SUM() calculates the value from a record set determined by the columns used in the group by clause, so deptno column in the group by clause is used to create an indirect dataflow to SUM() function.

an indirect dataflow is created from deptno to SUM().

scott.emp.deptno -> indirect -> SUM()

RelationRows pseudo column will not be used to create an indirect dataflow if group by clause if presented.

3.2 where clause without group by clause

SELECT SUM(SAL) sal_sum
FROM scott.emp
where city=1

The above SQL means that the whole record set of the table will be used to calculate the value of SUM() function.

So two indirect dataflows will be created as below:

scott.emp.city -> indirect -> SUM()
scott.emp.RelationRows -> indirect -> SUM()

4. Summary

  • Columns in where clause always create an indirect dataflow to all aggregate functions used in the select list.

  • RelationRows pseudo column always create an indirect dataflow to COUNT() function, but only create an indirect dataflow to other aggregate functions such as SUM() when the group by clause is not used.

  • Columns in the group by clause always create an indirect dataflow to all aggregate functions used in the select list.

References

xml code used in this article is generated by tools

digram used in this article is generated by the

Gudu SQLFlow Cloud version
DataFlowAnalyzer