Relations generated by SQLFlow

https://github.com/sqlparser/sqlflow_public/blob/master/dbobjects_relationship.md

The main relation types in dataflow:

  1. column to column dataflow: the data of the target column comes from the source column (fdd).

  2. column to resultset (mainly the select list): the number of rows in the resultset is impacted by the source column (fdr).

  3. resultset to resultset: the number of rows in a source table in the from clause impacts the number of rows in the target select list (fdr).
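
To make these relation types concrete, here is a minimal sketch of how a dataflow edge could be modeled (in Python; the names and structure are illustrative assumptions, not SQLFlow's actual data model):

from dataclasses import dataclass
from enum import Enum

class RelationType(Enum):
    FDD = "fdd"  # data of the target comes directly from the source
    FDR = "fdr"  # the source impacts the number of rows of the target

@dataclass(frozen=True)
class Relation:
    source: str        # e.g. "scott.emp.empName"
    target: str        # e.g. "eName"
    type: RelationType

# a column-to-column edge of type fdd:
edge = Relation("scott.emp.empName", "eName", RelationType.FDD)
print(f"{edge.source} -> {edge.type.value} -> {edge.target}")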

Analyze dataflow in different SQL elements - part 1

  1. select list

  2. where clause

  3. function (case expression)

  4. group by(aggregate function)

  5. from clause

  6. handling of select * (not finished yet)

1. select list

SELECT a.empName "eName"
FROM scott.emp a

The data of the target column "eName" comes from scott.emp.empName, so we have a dataflow relation like this:

scott.emp.empName -> fdd -> "eName"

The result generated by the select list is called a resultset; it is like a virtual table that includes columns and rows.

2. where clause

SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000

The total number of rows in the select list is impacted by the value of the column sal in the where clause, so we have a dataflow relation like this:

sal -> fdr -> resultset.PseudoRows

As you can see, we introduced a new pseudo column, PseudoRows, to represent the number of rows in the resultset.

3. function

During dataflow analysis, functions play a key role. A function accepts arguments, which are usually columns, and generates a result that may be a scalar value or a set of values.

SELECT round(salary) AS sal FROM scott.emp

The relations of the round function in the above SQL:

scott.emp.salary -> fdd -> round(salary) -> fdd -> sal

4. group by and aggregate function

SELECT deptno, COUNT(*) num_emp, SUM(SAL) sal_sum
FROM scott.emp
WHERE city = 'NYC'
GROUP BY deptno

4.1

Since SUM() is an aggregate function, the deptno column in the group by clause is treated as an implicit argument of the SUM() function. However, the deptno column doesn't directly contribute a value to the SUM() function as the column SAL does, so the relation type is fdr:

scott.emp.deptno -> fdr -> SUM(SAL) -> fdd -> sal_sum

The columns in the having clause have the same relation as the columns in the group by clause, as mentioned above.

4.2

The value of the SUM() function is also affected by the total number of rows in the table scott.emp, so there is a relation like this:

scott.emp.PseudoRows -> fdr -> SUM(SAL) -> fdd -> sal_sum

The above rules apply to all aggregate functions, such as the COUNT(*) function in this SQL.

5. From clause

If the resultset of a subquery or CTE is used in the from clause of an upper-level statement, the impact on the lower-level resultset is transferred to the upper level.

WITH
  cteReports (EmpID, FirstName, LastName, MgrID, EmpLevel)
  AS
  (
    SELECT EmployeeID, FirstName, LastName, ManagerID, 1  -- resultset1
    FROM Employees
    WHERE ManagerID IS NULL
  )
SELECT
  FirstName + ' ' + LastName AS FullName, EmpLevel  -- resultset2
FROM cteReports 

In the CTE, there is an impact relation:

Employees.ManagerID -> fdr -> resultset1.PseudoRows

Since cteReports is used in the from clause of the upper-level statement, the impact carries on like this:

Employees.ManagerID -> fdr -> resultset1.PseudoRows -> fdd -> resultset2.PseudoRows

If we choose to ignore the intermediate resultset, the end-to-end dataflow is:

Employees.ManagerID -> fdr -> resultset2.PseudoRows

Handle the dataflow chain

Every relation in the SQL is picked up by the tool and connected together to show the whole dataflow chain. Sometimes we only need to see the end-to-end relations and can ignore all the intermediate relations.

If we need to convert a fully chained dataflow into an end-to-end dataflow, we may apply the following rules:

  1. A single dataflow chain with mixed relation types fdd and fdr.

    A -> fdd -> B -> fdr -> C -> fdd -> D

    The rule is: if any fdr relation appears in the chain, the end-to-end relation A -> D is considered to be of type fdr; otherwise, the end-to-end relation A -> D is fdd.

  2. If there are multiple chains from A -> D

    A -> fdd -> B1 -> fdr -> C1 -> fdd -> D
    A -> fdd -> B2 -> fdr -> C1 -> fdd -> D
    A -> fdd -> B3 -> fdd -> C3 -> fdd -> D

    The final relation should take an fdd-only chain if one exists: the first two chains collapse to fdr, but the third chain is pure fdd, so the end-to-end relation A -> D is fdd (see the sketch below).
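
A minimal sketch of these two rules in Python (assuming each chain is given as the list of its relation types; the function names are illustrative, not SQLFlow's API):

def collapse_chain(edge_types):
    # rule 1: a single chain is fdr if any edge in it is fdr, otherwise fdd
    return "fdr" if "fdr" in edge_types else "fdd"

def end_to_end(chains):
    # rule 2: with multiple chains from A to D, prefer a pure fdd chain if one exists
    collapsed = [collapse_chain(chain) for chain in chains]
    return "fdd" if "fdd" in collapsed else "fdr"

# the three example chains from A -> D above:
chains = [
    ["fdd", "fdr", "fdd"],  # A -> B1 -> C1 -> D
    ["fdd", "fdr", "fdd"],  # A -> B2 -> C1 -> D
    ["fdd", "fdd", "fdd"],  # A -> B3 -> C3 -> D
]
print(end_to_end(chains))  # prints: fdd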

Analyze dataflow in different SQL elements - part 2

1. case expression

SELECT
  CASE
    WHEN a.kamut = 1 AND b.teur IS NULL THEN 'no locks'
    WHEN a.kamut = 1 THEN b.teur
    ELSE 'locks'
  END teur
FROM tbl a LEFT JOIN TT b ON (a.key = b.key)

During dataflow analysis, a case expression is treated as a function. The columns used inside the case expression are treated like the arguments of a function. So for the above SQL, the following relations are discovered:

tbl.kamut -> fdd -> teur
TT.teur -> fdd -> teur

2. join condition

Columns in the join condition also affect the number of rows in the resultset of the select list, just like columns in the where clause do. So, the following relations are discovered in the above SQL:

tbl.key -> fdr -> resultset.PseudoRows
TT.key -> fdr -> resultset.PseudoRows

3. create view

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
WHERE sal > 1000

From this query, you can see how the column sal in the where clause impacts the number of rows in the top-level view vEmp.

scott.emp.sal -> fdr -> resultset1.PseudoRows -> fdr -> vEmp.PseudoRows

So, from an end-to-end point of view, there is an fdr relation between the column sal and the view vEmp:

scott.emp.sal -> fdr -> vEmp.PseudoRows

4. rename/swap table

alter table t2 rename to t3;

We also use PseudoRows to represent the relation when renaming a table; the relation type is fdd.

t2.PseudoRows -> fdd -> t3.PseudoRows

5. create external table (snowflake)

create or replace stage exttable_part_stage
  url='s3://load/encrypted_files/'
  credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
  encryption=(type='AWS_SSE_KMS' kms_key_id = 'aws/key');

create external table exttable_part(
 date_part date as to_date(split_part(metadata$filename, '/', 3)
   || '/' || split_part(metadata$filename, '/', 4)
   || '/' || split_part(metadata$filename, '/', 5), 'YYYY/MM/DD'),
 timestamp bigint as (value:timestamp::bigint),
 col2 varchar as (value:col2::varchar))
 partition by (date_part)
 location=@exttable_part_stage/logs/
 auto_refresh = true
 file_format = (type = parquet);

The data of the external table exttable_part comes from the stage exttable_part_stage:

exttable_part_stage (url='s3://load/encrypted_files/') -> fdd -> exttable_part(date_part,timestamp,col2) 

6. create external table (bigquery)

CREATE EXTERNAL TABLE dataset.CsvTable OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/path1.csv', 'gs://bucket/path2.csv']
);

The data of the external table dataset.CsvTable comes from the CSV files gs://bucket/path1.csv and gs://bucket/path2.csv:

file (uri='gs://bucket/path1.csv')  -> fdd ->  dataset.CsvTable
file (uri='gs://bucket/path2.csv')  -> fdd ->  dataset.CsvTable

7. build data lineage for the foreign key in the create table statement

CREATE TABLE masteTable
(
	masterColumn         varchar(3)  PRIMARY KEY
);

CREATE TABLE foreignTable
(
	foreignColumn1            varchar(3)  NOT NULL,
	foreignColumn2            varchar(3)  NOT NULL,
	FOREIGN KEY (foreignColumn1) REFERENCES masteTable(masterColumn),
	FOREIGN KEY (foreignColumn2) REFERENCES masteTable(masterColumn)
);

The data flow is:

masteTable.masterColumn -> fdd -> foreignTable.foreignColumn1
masteTable.masterColumn -> fdd -> foreignTable.foreignColumn2

8. Hive load data

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

The data flow is:

file (/tmp/pv_2008-06-08_us.txt)  ->  fdd -> page_view(date,country)

9. Hive INSERT OVERWRITE [LOCAL] DIRECTORY

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

The data flow is:

pv_gender_sum(*) ->  fdd ->  file ('/tmp/pv_gender_sum')

The meaning of the letters in fdd, fdr

f: dataflow, d: data value, r: record set.

The first letter is always f, the second letter represents the source, the third letter represents the target, and the fourth is reserved (a small decoding sketch follows the list below).

  • fdd: data of the source column will be used in the target column

  • fdr: data of the source column will impact the number of rows in the resultset of the select list, or will impact the result value of an aggregate function.
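
As a small illustration of this encoding, here is a hypothetical helper that decodes a relation-type code (the function and dictionary names are assumptions for illustration only):

KIND = {"d": "data value", "r": "record set"}

def describe(rel_type):
    # first letter is always f (dataflow); second and third letters
    # give the kind of the source and the target respectively
    assert rel_type[0] == "f"
    source = KIND[rel_type[1]]
    target = KIND[rel_type[2]]
    return f"dataflow from a {source} to a {target}"

print(describe("fdd"))  # dataflow from a data value to a data value
print(describe("fdr"))  # dataflow from a data value to a record set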
