Data Distribution and Partitioning
Table and Data Distribution
A technique for balancing workload across all nodes
- All database tables are distributed across all segments
- The distribution policy determines how the data in a table is distributed
- Hash distribution: one or more columns are used as the distribution key
  - Key values are hashed into "buckets", one for each segment
  - A unique key assures even distribution
  - When no key is specified, the primary key (or the first column) is used
- Random distribution: round-robin distribution of rows to the segments
  - Evenly distributes data, but may be less efficient at query time
ex1: CREATE TABLE tablex ... DISTRIBUTED BY (CustomerID);
- The hashing algorithm distributes the rows evenly across the segments
- If the distribution key has low cardinality, the data will be skewed across the segments
- In that case we are not fully utilizing the MPP architecture
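A minimal sketch of the two distribution clauses discussed above; the table and column names are illustrative, not from the original example:

-- Hash distribution on a high-cardinality key: rows hash evenly to the segments
CREATE TABLE customers_hash (
    customerid integer,
    name       text
) DISTRIBUTED BY (customerid);

-- Random (round-robin) distribution: even spread, but joins lose co-location
CREATE TABLE customers_random (
    customerid integer,
    name       text
) DISTRIBUTED RANDOMLY;

-- A low-cardinality key (e.g. a gender column) would pile rows onto a few segments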
Partitioning
- A technique for eliminating rows at query time
- Logically divides a large table into smaller parts
- Significantly improves query performance
- Facilitates database maintenance
- Two types:
Range Partitioning
List Partitioning
- Partitioning in Greenplum works using table inheritance and CHECK table constraints
- Partitioned tables also have distribution keys
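A sketch of the two partition types in Greenplum's classic partition DDL; the tables, columns, and partition boundaries below are illustrative assumptions:

-- Range partitioning: one child partition per month of 2016
CREATE TABLE sales_by_date (
    id         integer,
    trans_date date,
    amount     numeric(10,2)
) DISTRIBUTED BY (id)
PARTITION BY RANGE (trans_date)
(
    START (date '2016-01-01') INCLUSIVE
    END   (date '2017-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- List partitioning: one child partition per discrete value
CREATE TABLE sales_by_region (
    id     integer,
    region text,
    amount numeric(10,2)
) DISTRIBUTED BY (id)
PARTITION BY LIST (region)
(
    PARTITION asia   VALUES ('ASIA'),
    PARTITION europe VALUES ('EUROPE'),
    DEFAULT PARTITION other_regions
);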
Single Partition:
ex1: CREATE TABLE tablex ... DISTRIBUTED BY (CustomerID);
- This is also called an un-partitioned table
- In this example we are using hash distribution on CustomerID; the table also contains a color column
- The rows end up distributed evenly across all segments
- But if we query the data for one specific color, the query still uses all segments
- tablex must be scanned on every segment to filter out the specific color
- This adds extra time to query processing
- On a large table this becomes a time-consuming process
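The partitioned version of ex1 is not shown in the original; the following is a sketch of what it likely looks like, using list partitioning on the color column (the column definitions are assumptions):

CREATE TABLE tablex (
    customerid integer,
    color      text
) DISTRIBUTED BY (customerid)
PARTITION BY LIST (color)
(
    PARTITION blue  VALUES ('Blue'),
    PARTITION red   VALUES ('Red'),
    PARTITION green VALUES ('Green'),
    DEFAULT PARTITION other_colors
);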
- With the partitioning syntax above, the table is partitioned on the color column
- Now when we load the same data as in ex1, the rows are still distributed across the segments, but each row lands in the appropriate partition based on its color
- In this case only the partition that holds the `Blue` rows needs to be scanned
- The query planner will not scan partitions that do not contain the required data
- This kind of optimization (partition elimination) improves query performance, as the example below shows
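For instance, a query that filters on a single color touches only one partition; EXPLAIN can be used to confirm which partitions the planner scans (the query is illustrative):

EXPLAIN
SELECT *
FROM tablex
WHERE color = 'Blue';   -- only the 'Blue' list partition is scanned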
Consider partitioning when:
- You have a large (fact) table
- Query predicates have identifiable access patterns
  - e.g., WHERE trans_date >= '27/08/2016' AND trans_date < '28/08/2016'
- You need to maintain a "rolling window" of data (see the sketch after this list)
- Data can be divided into somewhat equal parts based on some defining criteria
- You are experiencing unsatisfactory performance
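A rolling window is typically maintained by dropping the oldest partition and adding a new one. This is a sketch assuming a monthly range-partitioned table named sales; exact partition maintenance DDL varies by Greenplum version:

-- Drop the partition holding the oldest month (identified here by a value in its range)
ALTER TABLE sales DROP PARTITION FOR (date '2015-09-01');

-- Add a partition for the newest month
ALTER TABLE sales ADD PARTITION
    START (date '2016-09-01') INCLUSIVE
    END   (date '2016-10-01') EXCLUSIVE;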
Bad distribution key (low cardinality):
CREATE TABLE faa.faa_otp_some_key
    (LIKE faa.faa_otp_load) DISTRIBUTED BY (GENDER);
INSERT INTO faa.faa_otp_some_key
    SELECT * FROM faa.faa_otp_load;
SELECT DISTINCT GENDER FROM faa.faa_otp_some_key;
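To see the resulting skew, count rows per segment using Greenplum's gp_segment_id system column (a quick check, not part of the original example):

SELECT gp_segment_id, count(*)
FROM faa.faa_otp_some_key
GROUP BY gp_segment_id
ORDER BY gp_segment_id;   -- with GENDER as the key, only a few segments hold rows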
Even distribution key:
CREATE TABLE faa.faa_otp_even_key
(LIKE faa.faa_otp_load) DISTRIBUTED BY (Flt_Year, Flt_Month, Flt_DayofMonth);
INSERT INTO faa.faa_otp_even_key
SELECT * FROM faa.faa_otp_load;
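Repeating the per-segment count against faa.faa_otp_even_key should now show roughly equal counts on every segment. Greenplum's gp_toolkit schema also provides skew views, sketched here:

-- Computed skew coefficients per table; lower values indicate a more even spread
SELECT * FROM gp_toolkit.gp_skew_coefficients;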
Table Partitioning:
non-partitioned table:
EXPLAIN
SELECT
    ...
FROM
    ...
WHERE flightdate >= '2009-11-01'
  AND flightdate < '2009-12-01'
ORDER BY Flt_Year;
partitioned table:
...
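The partitioned counterpart is elided above. A sketch of how it might be declared, assuming monthly range partitions on flightdate and that the LIKE clause supplies that column (table name and boundaries are assumptions); running the same EXPLAIN against it should show only the November 2009 partition being scanned:

CREATE TABLE faa.faa_otp_part
    (LIKE faa.faa_otp_load)
DISTRIBUTED BY (Flt_Year, Flt_Month, Flt_DayofMonth)
PARTITION BY RANGE (flightdate)
(
    START (date '2009-01-01') INCLUSIVE
    END   (date '2010-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);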
Summary:
Data Distribution
- Balancing workload across all segments
- Hash Distribution - use unique key(s)
- Random Distribution - round-robin row distribution
- Choosing a poor distribution key causes skew
Partitioning
- Eliminating rows at query time
- Significantly improves performance
- Range and List Partitioning
- Works in conjunction with Distribution