Introduction to Massively Parallel Processing (MPP) Database
In MPP databases data is partitioned across multiple nodes with each nodes having its own hardware to process the data locally. All communications happen via local network (Interconnect). This is a `shared-nothing` architecture.
In this post I'll try to explain how MPP databases work by using GPDB as an example.
Details about Master/Standby Master, Interconnect switch & Segment nodes are provided in my other posts.
Let me start with a simple data set that I will use to demonstrate data partitioning in greenplum.
employee_id employee_name 1001 Jack 1002 Neil 1003 Erick 1004 Dickson 1005 Milo 1006 Peter 1007 Sam 1008 Vinc 1009 Max
As mentioned earlier the main character tics of MPP database is data distribution. Data is distributed across each segment database to achieve data and processing parallelism. This is achieved by creating a database table with DISTRIBUTED BY clause. By using this clause data is automatically distributed across segment databases. In Greenplum you can either use hash or round-robin distribution.
In this example we will use hash distribution to distribute the data across segment databases. To get even distribution of data across segments it makes sense to use primary key as a distribution key.
CREATE TABLE employee ( employee_id integer; employee_name varchar(50) ) DISTRIBUTED BY (employee_id);
Once the table is created, lets insert 9 records and see how the data is distributed based on hashing of the employee_id.
INSERT INTO employee values (1001,’Jack’); INSERT INTO employee values (1002,’Neil’); INSERT INTO employee values (1003,’Erick′); INSERT INTO employee values (1004,’Dickson′); INSERT INTO employee values (1005,’Milo’); INSERT INTO employee values (1006,’Peter’); INSERT INTO employee values (1007,’Sam′); INSERT INTO employee values (1008,’Vinc’); INSERT INTO employee values (1009,’Max’);
As the distribution key was employee_id, data we hashed using the distribution key and spread out evenly on each of the segment database. Now let’s query the database and see how distributed data helps in query performance.
case #1 query data from one employee_id
SELECT * FROM employee WHERE employee_id = 1004;
When this query is issued, master will parse and build a query plan. Since the query is being filtered on a distribution key column the plan will be sent to segment 1 database for execution - remember master stores metadata about data distribution for each table. After the query is issued to segment database 1, it has to scan only 3 records to get to the result compared to 9 records on a share everything database making the query to run faster.
case #2 query data for 2 employee_id
SELECT * FROM employee WHERE employee_id IN (1001, 1009);
When this query is issued, master will parse and build a query plan. Since the query is being filter on a distribution key column the plan will be sent to segment 1 and segment 3 database for execution - remember master stores metadata about data distribution for each table. After the query is issued to segment database 1 and 3, each parallel process has to scan only 3 records to get the result. Data sets from both the scans is returned to the master. Master gathers the data together to present to the user.
Points to be noted
- Individual self-sufficient physical servers to store and process data.
- Data is distributed physically on separate individual databases.
- Shared Nothing Architecture.
Comments
Post a Comment