GP - SQL Joins
- Available Join Types?
- Row Elimination
- Minimize Data Movement
- Join Implementation Seen in Query Plans
Inner Join
- Resulting data set is obtained from joining two tables on a common column
- Each row in left table is compared with each row in right table
- Matching rows will be present in result set
- Most commonly used
- Also know has Simple Join or EQUI-Join
Left Outer Join
- Returns all rows from the left table even if there is no matching rows in right table.
- Will also get the matching rows
- Rows that are not matched in the right table will not be included in the result set
- Will return all rows from both the tables where there is a match
- We will receive a NULL for rows which don’t match
- Which is also called has Cartesian products
- Result set will have all rows on the left table paired with all rows on the right table
- Example: If we have 2 tables each containing 3 rows, the result set will be 9 rows
- Often, this is used UN intentionally
- Two Tables Client & Product
/* Create Table client */ CREATE TABLE client (Id integer PRIMARY KEY, Name text); /* Insert few records into table client */ INSERT INTO client VALUES (1, 'val1'); INSERT INTO client VALUES (2, 'val2'); INSERT INTO client VALUES (3, 'val3'); /* Display all records in table client */ SELECT * FROM client; /* Create Table product */ CREATE TABLE product (Id integer PRIMARY KEY, value text); /* Insert few records into table product */ INSERT INTO product VALUES (0, 'a'); INSERT INTO product VALUES (1, 'b'); INSERT INTO product VALUES (2, 'c'); /* Display all records in table product */ SELECT * FROM product; Inner Join: SELECT c.id, c.name, p.value FROM client c, product p WHERE c.id = p.id ORDER BY c.id ASC ; Alternate way for inner join - ANSI Syntax for Joins SELECT * FROM client JOIN product ON client.id = product.id; Left Outer Join SELECT * FROM client LEFT OUTER JOIN product ON client.id = product.id ORDER BY client.id ASC ; Right Outer Join SELECT * FROM client RIGHT /*OUTER is implicit*/ JOIN product ON client.id = product.id ORDER BY client.id ASC ; Full Outer Join SELECT * FROM client FULL JOIN product ON client.id = product.id ORDER BY client.id ASC ; To overcome columns appearing more number of times we can use Natural Full Outer Join, NATURAL key word helps us to eliminate that. SELECT * FROM client NATURAL FULL JOIN product;
Row Elimination during Joins
- Greenplum often has to use disk space to temporarily store data.
- Query optimizer minimizes the amount of memory and disk space required by
- Projecting (copying) only those columns that the query requires
- Doing single-table set selections first (qualifying rows)
- Eliminating rows early
Joins: Parallel Implementation
- The core join algorithms will be the same as in non-distributed systems
- Additional details: how to we partition data and still guarantee correctness in a distributed system?
- For the most part, these details are the same, regardless of the join algorithm (e.g. Merge, Hash, and Nested Loop).
Row Redistribution / Motion during Joins
- Try the max to avoid this situation
- If you can do a co-located join, you do.
- Between broadcast and redistributed joins, the optimizer looks at the cost of motion that gets introduced and choose the plan that it believes to be cheaper.
- The optimizer depends on up-to-date statistics on the tables.
Co-Located Join
- If the join key is the distributed key for both tables
- We can guarantee the join can be handled locally on each segment without redistribution
- This is the most efficient option
- This is a key consideration when designing a schema
- All equal keys are hashed to the same node, so join results will be correct.
- This assumes that the datatypes of the join keys is the same so that the hash algorithm for the join keys will hash the keys to the same segment.
Broadcast join
- Neither table is distributed by the join keys
- One table is considerably smaller than the other table.
- We may broadcast the smaller table
- All segments see a complete copy of the smaller table
- They perform a co-located join
- Each row of the larger table can see all target rows in the smaller table, so join results will be correct.
Redistributed Join
- One or both of the tables are not distributed on the join keys.
- We can redistribute the table(s).
- They perform a co-located join
- All equal keys are hashed to the same node, so join results will be correct.
Join Implementations
- This means that GPDB uses to join two tables
- These are what you will see in query plans, sort merge, hash, and nested loop.
Sort Merge Join
- Common when the join condition is based on an inequality operator, like <, <=, >, >=
- List will not include <>
- How to work on this, Sort the tables by join attribute, scan the two tables in parallel, combine matching rows to form join rows.
Hash Join
- Build phase, scan smaller tables, creating in-memory hash table, In this hash tables, keys are going to be the join columns, values are the corresponding rows.
- Probe phase, scan the larger tables and find the relevant rows from the smaller relation by looking in the hash table.
Nested Loop Join
- Can be one of the most efficient types of join if we have index in the tables.
- Basically for each row of the table in left, scan the right table to find a match (nested loop)
- Typically if we don't have index, then this type join is inefficient.
N-tables
- If we have more than two table in the join, then they are basically reduced to a series of two-table joins.
- This is because the query engine can only work on two tables at a time
- Final result must be built up by a tree of join steps, each with two inputs.
Comments
Post a Comment