GP - SQL Joins

Available Join Types?
Row Elimination
Minimize Data Movement
Join Implementation Seen in Query Plans

Join Types:
Inner Join

Resulting data set is obtained from joining two tables on a common column
Each row in left table is compared with each row in right table
Matching rows will be present in result set
Most commonly used
Also know has Simple Join or EQUI-Join

Left Outer Join

Returns all rows from the left table even if there is no matching rows in right table.
Will also get the matching rows
Rows that are not matched in the right table will not be included in the result set

Right Outer Join

This is similar to left outer join, just need to flip the join

Full Outer Join

Will return all rows from both the tables where there is a match
We will receive a NULL for rows which don’t match

Cross Join

Which is also called has Cartesian products
Result set will have all rows on the left table paired with all rows on the right table
Example: If we have 2 tables each containing 3 rows, the result set will be 9 rows
Often, this is used UN intentionally

To Test above said joins, kindly to use below sql(s);

Two Tables Client & Product

/* Create Table client */
CREATE TABLE client (Id integer PRIMARY KEY, Name text);

/* Insert few records into table client */
INSERT INTO client VALUES (1, 'val1');
INSERT INTO client VALUES (2, 'val2');
INSERT INTO client VALUES (3, 'val3');

/* Display all records in table client */
SELECT * FROM client;

/* Create Table product */
CREATE TABLE product (Id integer PRIMARY KEY, value text);
/* Insert few records into table product */
INSERT INTO product VALUES (0, 'a');
INSERT INTO product VALUES (1, 'b');
INSERT INTO product VALUES (2, 'c');

/* Display all records in table product */
SELECT * FROM product; 

Inner Join:
SELECT c.id, c.name, p.value FROM client c, product p WHERE c.id = p.id ORDER BY c.id ASC ;
Alternate way for inner join - ANSI Syntax for Joins
SELECT * FROM client JOIN product ON client.id = product.id; 

Left Outer Join
SELECT * FROM client LEFT OUTER JOIN product ON client.id = product.id ORDER BY client.id ASC ;

Right Outer Join
SELECT * FROM client RIGHT /*OUTER is implicit*/ JOIN product ON client.id = product.id 
ORDER BY client.id ASC ;

Full Outer Join
SELECT * FROM client FULL JOIN product ON client.id = product.id 
ORDER BY client.id ASC ;

To overcome columns appearing more number of times we can use Natural Full Outer Join, 
NATURAL key word helps us to eliminate that.
SELECT * FROM client NATURAL FULL JOIN product;

Row Elimination during Joins

Greenplum often has to use disk space to temporarily store data.
Query optimizer minimizes the amount of memory and disk space required by

Projecting (copying) only those columns that the query requires
Doing single-table set selections first (qualifying rows)
Eliminating rows early

Joins: Parallel Implementation

The core join algorithms will be the same as in non-distributed systems
Additional details: how to we partition data and still guarantee correctness in a distributed system?
For the most part, these details are the same, regardless of the join algorithm (e.g. Merge, Hash, and Nested Loop).

Row Redistribution / Motion during Joins

Try the max to avoid this situation
If you can do a co-located join, you do.
Between broadcast and redistributed joins, the optimizer looks at the cost of motion that gets introduced and choose the plan that it believes to be cheaper.
The optimizer depends on up-to-date statistics on the tables.

Co-Located Join

If the join key is the distributed key for both tables
We can guarantee the join can be handled locally on each segment without redistribution
This is the most efficient option
This is a key consideration when designing a schema
All equal keys are hashed to the same node, so join results will be correct.
This assumes that the datatypes of the join keys is the same so that the hash algorithm for the join keys will hash the keys to the same segment.

Broadcast join

Neither table is distributed by the join keys
One table is considerably smaller than the other table.
We may broadcast the smaller table
All segments see a complete copy of the smaller table
They perform a co-located join
Each row of the larger table can see all target rows in the smaller table, so join results will be correct.

Redistributed Join

One or both of the tables are not distributed on the join keys.
We can redistribute the table(s).
They perform a co-located join
All equal keys are hashed to the same node, so join results will be correct.

Join Implementations

This means that GPDB uses to join two tables
These are what you will see in query plans, sort merge, hash, and nested loop.

Sort Merge Join

Common when the join condition is based on an inequality operator, like <, <=, >, >=
List will not include <>
How to work on this, Sort the tables by join attribute, scan the two tables in parallel, combine matching rows to form join rows.

Hash Join

Build phase, scan smaller tables, creating in-memory hash table, In this hash tables, keys are going to be the join columns, values are the corresponding rows.
Probe phase, scan the larger tables and find the relevant rows from the smaller relation by looking in the hash table.

Nested Loop Join

Can be one of the most efficient types of join if we have index in the tables.
Basically for each row of the table in left, scan the right table to find a match (nested loop)
Typically if we don't have index, then this type join is inefficient.

N-tables

If we have more than two table in the join, then they are basically reduced to a series of two-table joins.
This is because the query engine can only work on two tables at a time
Final result must be built up by a tree of join steps, each with two inputs.

Search This Blog

gpdbguru

GP - SQL Joins

Comments

Post a Comment

Popular posts from this blog

GP - Kerberos errors and resolutions

How to set Optimizer at database level in greenplum