Data Distribution and Partitioning
Table and Data Distribution
A technique for balancing workload across all nodes
- All database tables are distributed across all segments
- The distribution policy determines how the data in a table is distributed
- Hash distribution: one or more columns are used as the distribution key
  - Key values are hashed into "buckets", one for each segment
  - A unique key assures even distribution
  - When no key is specified, the primary key (or the first column) is used
- Random distribution: round-robin distribution of rows to the segments
  - Evenly distributes data, but may be less efficient at query time
ex1: CREATE TABLE tablex ... DISTRIBUTED BY (CustomerID);
- The hashing algorithm distributes the rows evenly across the segments
- If the distribution key has low cardinality, the data will be skewed across the segments
- In that case we are not fully utilizing the MPP architecture
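A minimal sketch of the two distribution clauses discussed above; the table and column names are illustrative, not from the original example:

-- Hash distribution on a high-cardinality key: rows hash evenly to the segments
CREATE TABLE customers_hash (
    customerid integer,
    name       text
) DISTRIBUTED BY (customerid);

-- Random (round-robin) distribution: even spread, but joins lose co-location
CREATE TABLE customers_random (
    customerid integer,
    name       text
) DISTRIBUTED RANDOMLY;

-- A low-cardinality key (e.g. a gender column) would pile rows onto a few segments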
Partitioning
- A technique for eliminating rows at query time
- Logically divides a large table into smaller parts
- Significantly improves query performance
- Facilitates database maintenance
- Two types:
Range Partitioning
List Partitioning
- Partitioning in Greenplum works using table inheritance and CHECK table constraints
- Partitioned tables also have distribution keys
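A sketch of the two partition types in Greenplum's classic partition DDL; the tables, columns, and partition boundaries below are illustrative assumptions:

-- Range partitioning: one child partition per month of 2016
CREATE TABLE sales_by_date (
    id         integer,
    trans_date date,
    amount     numeric(10,2)
) DISTRIBUTED BY (id)
PARTITION BY RANGE (trans_date)
(
    START (date '2016-01-01') INCLUSIVE
    END   (date '2017-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- List partitioning: one child partition per discrete value
CREATE TABLE sales_by_region (
    id     integer,
    region text,
    amount numeric(10,2)
) DISTRIBUTED BY (id)
PARTITION BY LIST (region)
(
    PARTITION asia   VALUES ('ASIA'),
    PARTITION europe VALUES ('EUROPE'),
    DEFAULT PARTITION other_regions
);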
Single Partition:
ex1: CREATE TABLE tablex ... DISTRIBUTED BY (CustomerID);
- This is also called an un-partitioned table
- In this example we are using hash distribution on CustomerID; the table also contains a color column
- The rows end up distributed evenly across all segments
- But if we query the data for one specific color, the query still uses all segments
- tablex must be scanned on every segment to filter out the specific color
- This adds extra time to query processing
- On a large table this becomes a time-consuming process
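The partitioned version of ex1 is not shown in the original; the following is a sketch of what it likely looks like, using list partitioning on the color column (the column definitions are assumptions):

CREATE TABLE tablex (
    customerid integer,
    color      text
) DISTRIBUTED BY (customerid)
PARTITION BY LIST (color)
(
    PARTITION blue  VALUES ('Blue'),
    PARTITION red   VALUES ('Red'),
    PARTITION green VALUES ('Green'),
    DEFAULT PARTITION other_colors
);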
- With the partitioning syntax above, the table is partitioned on the color column
- Now when we load the same data as in ex1, the rows are still distributed across the segments, but each row lands in the appropriate partition based on its color
- In this case only the partition that holds the `Blue` rows needs to be scanned
- The query planner will not scan partitions that do not contain the required data
- This kind of optimization (partition elimination) improves query performance, as the example below shows
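For instance, a query that filters on a single color touches only one partition; EXPLAIN can be used to confirm which partitions the planner scans (the query is illustrative):

EXPLAIN
SELECT *
FROM tablex
WHERE color = 'Blue';   -- only the 'Blue' list partition is scanned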
Consider partitioning when:
- You have a large (fact) table
- Query predicates have identifiable access patterns
  - e.g., WHERE trans_date >= '27/08/2016' AND trans_date < '28/08/2016'
- You need to maintain a "rolling window" of data (see the sketch after this list)
- Data can be divided into somewhat equal parts based on some defining criteria
- You are experiencing unsatisfactory performance
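A rolling window is typically maintained by dropping the oldest partition and adding a new one. This is a sketch assuming a monthly range-partitioned table named sales; exact partition maintenance DDL varies by Greenplum version:

-- Drop the partition holding the oldest month (identified here by a value in its range)
ALTER TABLE sales DROP PARTITION FOR (date '2015-09-01');

-- Add a partition for the newest month
ALTER TABLE sales ADD PARTITION
    START (date '2016-09-01') INCLUSIVE
    END   (date '2016-10-01') EXCLUSIVE;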
Bad distribution key (low cardinality):
CREATE TABLE faa.faa_otp_some_key
    (LIKE faa.faa_otp_load) DISTRIBUTED BY (GENDER);
INSERT INTO faa.faa_otp_some_key
    SELECT * FROM faa.faa_otp_load;
SELECT DISTINCT GENDER FROM faa.faa_otp_some_key;
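To see the resulting skew, count rows per segment using Greenplum's gp_segment_id system column (a quick check, not part of the original example):

SELECT gp_segment_id, count(*)
FROM faa.faa_otp_some_key
GROUP BY gp_segment_id
ORDER BY gp_segment_id;   -- with GENDER as the key, only a few segments hold rows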
Even distribution key:
CREATE TABLE faa.faa_otp_even_key
(LIKE faa.faa_otp_load) DISTRIBUTED BY (Flt_Year, Flt_Month, Flt_DayofMonth);
INSERT INTO faa.faa_otp_even_key
SELECT * FROM faa.faa_otp_load;
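Repeating the per-segment count against faa.faa_otp_even_key should now show roughly equal counts on every segment. Greenplum's gp_toolkit schema also provides skew views, sketched here:

-- Computed skew coefficients per table; lower values indicate a more even spread
SELECT * FROM gp_toolkit.gp_skew_coefficients;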
Table Partitioning:
non-partitioned table:
EXPLAIN
SELECT
    ...
FROM
    ...
WHERE flightdate >= '2009-11-01'
  AND flightdate < '2009-12-01'
ORDER BY Flt_Year;
partitioned table:
...
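The partitioned counterpart is elided above. A sketch of how it might be declared, assuming monthly range partitions on flightdate and that the LIKE clause supplies that column (table name and boundaries are assumptions); running the same EXPLAIN against it should show only the November 2009 partition being scanned:

CREATE TABLE faa.faa_otp_part
    (LIKE faa.faa_otp_load)
DISTRIBUTED BY (Flt_Year, Flt_Month, Flt_DayofMonth)
PARTITION BY RANGE (flightdate)
(
    START (date '2009-01-01') INCLUSIVE
    END   (date '2010-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);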
Summary:
Data Distribution
- Balancing workload across all segments
- Hash Distribution - use unique key(s)
- Random Distribution - round-robin row distribution
- Choosing a poor distribution key causes skew
Partitioning
- Eliminating rows at query time
- Significantly improves performance
- Range and List Partitioning
- Works in conjunction with Distribution