Data Distribution and Partitioning

Table and Data Distribution
Technique for balancing workload across all nodes
  1. All database tables are distributed across all Segments
  2. Distribution Policy determines how the data in the tables are distributed
Hash Distribution (Default)
  • ·One or more columns used as distribution key
Hashed into "Buckets" for each segment
  • · Unique keys assure even distribution
  • · When key is unspecified - uses Primary key or first column
Random Distribution
  • ·Round-Robin distribution of rows to Segments
  • ·Evenly distributes data but may be less efficient at query time
Distribution:
ex1: Create table tablex ... Distributed by CustomerID;
  • ·Hashing algorithm will help in distributing the columns evenly across the segments
ex2: Create Table tablex ... Distributed by Gender;
  • Distribution clause is set to lower cardinality
  • This will skew the segments
         Choosing a poor distribution value causes Skew

  • ·In this case we are not fully utilizing MPP architecture
Data Table Partitioning
  • Technique for eliminating rows at query time
  • Logically divides a large table into smaller parts
  • Significantly improves query performance
  • Facilitates database maintenance
  • Two Type
Range Partitioning
List Partitioning
  • Partitioning in Greenplum works using table inheritance and CHECK table constraints
  • Partitioned tables also have distribution keys
Partitions:
Single Partition:
ex1: Create Table tablex ... Distributed by CustomerID;
  • This is also called has un-partitioned table
  • In this example we are using hash distribution and column is against color
  • This ends up in distributing all columns evenly in all segments
Select * from tablex ... Where Color = `Blue`
  • But in this case if we are querying the data specific to a single color, this will end up utilizing all segments.
  • tablex in all segments needs to be scanned to filter the specific color
  • Which will add up additional time in query processing
  • In a large table this will be a time taking process
ex2: Create Table tablex ... Distributed by Customer ID Partition by Color;
  • Due to the above syntax, table is partitioned based on color column
  • Now when we load the same data as in ex1, data will be distributed across the segments, but into appropriate partition based on color
Select * from tableX ... Where Color = `Blue`;
  • In this case only the partition which has `Blue` color will be scanned for data
  • The query planner will not query the partition which don't have the required data
  • This kind of optimization will help in improving the query performance
When to Partition?
  • Have a large (fact) table
  • Query predicates have identifiable access patterns
  • e.g., WHERE trans_date>= '27/08/2016') AND trans_date< '28/08/2016'
  • Need to maintain a "rolling window" of data
  • Data can be divided into somewhat equal parts based on some defining criteria
  • Experiencing unsatisfactory performance
Demo:
Bad Distribution:
CREATE TABLE faa.faa_otp_some_key (LIKE faa.faa_otp_load) DISTRIBUTED BY (GENDER);
INSERT INTO faa.faa_opt_some_key SELECT * FROM faa.faa_opt_load;
SELECT DISTINCT (GENDER) FROM faa.faa_opt_some_key;

For Even Distribution key:
CREATE TABLE faa.faa_otp_even_key (LIKE faa.faa_otp_load) DISTRIBUTED BY (Flt_Year, Flt_Month, Flt_DayofMonth);
INSERT INTO faa.faa_otp_even_key SELECT * FROM faa.faa_otp_load;

Table Partitioning:
non-partitioned table:
EXPLAIN
SELECT
                ....
FROM
                ...
WHERE
                flightdate>= '2009-11-01'
AND       flightdate< '2009-12-01'
ORDER BY
                Flt_Year;

partitioned table:
...

Summary:
Data Distribution
  • Balancing workload across all segments
  • Hash Distribution - use unique key(s)
  • Random Distribution - round-robin rows distribution
  • Choosing a poor distribution key causes skew


Partitioning
  • Eliminating rows at query time
  • Significantly improves performance
  • Range and List Partitioning
  • Works in conjunction with Distribution

Comments

Popular posts from this blog

GP - Kerberos errors and resolutions

How to set Optimizer at database level in greenplum

GP - SQL Joins