Introduction to Massively Parallel Processing (MPP) Databases

In an MPP database, data is partitioned across multiple nodes, with each node having its own hardware to process its data locally. All communication between nodes happens over a local network (the Interconnect). This is a `shared-nothing` architecture.
In this post I'll try to explain how MPP databases work by using GPDB as an example.

Details about Master/Standby Master, Interconnect switch & Segment nodes are provided in my other posts.
Let me start with a simple data set that I will use to demonstrate data partitioning in Greenplum.

employee_id     employee_name
1001            Jack
1002            Neil
1003            Erick
1004            Dickson
1005            Milo
1006            Peter
1007            Sam
1008            Vinc
1009            Max

As mentioned earlier, the main characteristic of an MPP database is data distribution. Data is distributed across the segment databases to achieve data and processing parallelism. This is done by creating a table with a DISTRIBUTED BY clause, which automatically spreads the data across the segment databases. In Greenplum you can use either hash or round-robin distribution.
In this example we will use hash distribution to spread the data across the segment databases. To get an even distribution of data across segments it makes sense to use the primary key as the distribution key.

CREATE TABLE employee
(
               employee_id integer,
               employee_name varchar(50)
)
DISTRIBUTED BY (employee_id);
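
If no natural distribution key is available, Greenplum can also distribute rows round-robin using the DISTRIBUTED RANDOMLY clause. A minimal sketch (the table name employee_rr is only for illustration):

CREATE TABLE employee_rr
(
               employee_id integer,
               employee_name varchar(50)
)
DISTRIBUTED RANDOMLY;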

Once the table is created, let's insert 9 records and see how the data is distributed based on the hash of employee_id.

INSERT INTO employee values (1001,'Jack');
INSERT INTO employee values (1002,'Neil');
INSERT INTO employee values (1003,'Erick');
INSERT INTO employee values (1004,'Dickson');
INSERT INTO employee values (1005,'Milo');
INSERT INTO employee values (1006,'Peter');
INSERT INTO employee values (1007,'Sam');
INSERT INTO employee values (1008,'Vinc');
INSERT INTO employee values (1009,'Max');

As the distribution key was employee_id, the data was hashed on that key and spread out evenly across the segment databases. Now let's query the database and see how distributed data helps query performance.
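
To see how the rows actually landed on the segments, you can query Greenplum's gp_segment_id system column, which records the segment each row is stored on (segment numbering starts at 0, and the exact spread depends on the number of segments in your cluster):

SELECT gp_segment_id, count(*)
FROM employee
GROUP BY gp_segment_id
ORDER BY gp_segment_id;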

Case #1: query data for one employee_id

SELECT * FROM employee WHERE employee_id = 1004;

When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent only to the segment database that holds the matching hash value (segment 1 in this example) for execution - remember, the master stores metadata about how each table's data is distributed. Once the query reaches segment database 1, it has to scan only 3 records to produce the result, compared to 9 records on a shared-everything database, making the query run faster.
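
You can confirm this behaviour with EXPLAIN; the plan should show the scan being executed on a single segment rather than on all of them (the exact plan text varies with the Greenplum version and the optimizer in use):

EXPLAIN SELECT * FROM employee WHERE employee_id = 1004;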

Case #2: query data for two employee_ids

SELECT * FROM employee WHERE employee_id IN (1001, 1009);

When this query is issued, the master parses it and builds a query plan. Since the query filters on the distribution key column, the plan is sent to segment databases 1 and 3 for execution - again, the master stores metadata about how each table's data is distributed. Once the query reaches segment databases 1 and 3, each parallel process has to scan only 3 records to get its part of the result. The data sets from both scans are returned to the master, which gathers them together to present to the user.
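
As a quick check, selecting gp_segment_id alongside the rows shows which segment returned each match (the actual segment numbers depend on how the hash maps keys in your cluster):

SELECT gp_segment_id, employee_id, employee_name
FROM employee
WHERE employee_id IN (1001, 1009);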

Points to be noted
  • Individual self-sufficient physical servers to store and process data.
  • Data is distributed physically on separate individual databases.
  • Shared Nothing Architecture.
