Greenplum Architecture - Brief
Pivotal Greenplum Database is a massively parallel
processing (MPP) database server with an architecture specially designed to
manage large-scale analytic data warehouses and business intelligence
workloads.
MPP (also known as a shared-nothing architecture) refers to
systems with two or more processors that cooperate to carry out an operation,
each processor with its own memory, operating system and disks. Greenplum uses
this high-performance system architecture to distribute the load of
multi-terabyte data warehouses, and can use all of a system's resources in
parallel to process a query.
Greenplum Database is based on PostgreSQL open-source
technology. It is essentially several PostgreSQL database instances acting
together as one cohesive database management system (DBMS). It is based on
PostgreSQL 8.2.15, and in most cases is very similar to PostgreSQL with regard
to SQL support, features, configuration options, and end-user functionality.
Database users interact with Greenplum Database as they would a regular
PostgreSQL DBMS.
The internals of PostgreSQL have been modified or
supplemented to support the parallel structure of Greenplum Database. For
example, the system catalog, optimizer, query executor, and transaction manager
components have been modified and enhanced to be able to execute queries
simultaneously across all of the parallel PostgreSQL database instances. The
Greenplum interconnect (the networking layer) enables communication between the
distinct PostgreSQL instances and allows the system to behave as one logical
database.
Greenplum Database also includes features designed to
optimize PostgreSQL for business intelligence (BI) workloads. For example,
Greenplum has added parallel data loading (external tables), resource
management, query optimizations, and storage enhancements, which are not found
in standard PostgreSQL. Many features and optimizations developed by Greenplum
make their way into the PostgreSQL community. For example, table partitioning
is a feature first developed by Greenplum, and it is now in standard
PostgreSQL.
Greenplum Database stores and processes large amounts of
data by distributing the data and processing workload across several servers or
hosts. Greenplum Database is an array of individual databases based upon
PostgreSQL 8.2 working together to present a single database image. The master
is the entry point to the Greenplum Database system. It is the database
instance to which clients connect and submit SQL statements. The master
coordinates its work with the other database instances in the system, called
segments, which store and process the data.
Greenplum Master
The Greenplum Database master is the entry point to the Greenplum
Database system, accepting client connections and SQL queries, and distributing
work to the segment instances.
Greenplum Database end-users interact with Greenplum
Database (through the master) as they would with a typical PostgreSQL database.
They connect to the database using client programs such as psql or application
programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides. The
global system catalog is the set of system tables that contain metadata about
the Greenplum Database system itself. The master does not contain any user
data; data resides only on the segments. The master authenticates client
connections, processes incoming SQL commands, distributes workloads among
segments, coordinates the results returned by each segment, and presents the
final results to the client program.
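The dispatch-and-gather role of the master can be illustrated with a toy model. This is only a sketch, not Greenplum's actual executor: the names `SEGMENT_ROWS`, `segment_partial_sum`, and `master_query` are hypothetical, and threads stand in for segment instances that would really be separate PostgreSQL processes on other hosts.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each inner list is the slice of rows held by one segment.
SEGMENT_ROWS = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1), ("north", 4)],
]

def segment_partial_sum(rows):
    """Runs on one segment: aggregate only the locally stored rows."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

def master_query(segments):
    """Runs on the master: dispatch work to every segment, then merge the results."""
    # The master holds no user data; it only fans the work out in parallel.
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial_sum, segments))
    # Combine the per-segment partial sums into the final answer for the client.
    final = {}
    for partial in partials:
        for region, amount in partial.items():
            final[region] = final.get(region, 0) + amount
    return final

totals = master_query(SEGMENT_ROWS)
print(sorted(totals.items()))
```

The point of the sketch is the division of labor: the segments do the bulk of the aggregation on their own data, and the master only combines small partial results.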
Greenplum Segments
Greenplum Database segment instances are independent
PostgreSQL databases that each store a portion of the data and perform the
majority of query processing.
When a user connects to the database via the Greenplum
master and issues a query, processes are created in each segment database to
handle the work of that query.
User-defined tables and their indexes are distributed across
the available segments in a Greenplum Database system; each segment contains a
distinct portion of data. The database server processes that serve segment data
run under the corresponding segment instances. Users interact with segments in
a Greenplum Database system through the master.
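One way to picture how rows end up on distinct segments is hash distribution on a key column. The sketch below is an assumption-laden illustration: `segment_for` and `distribute` are hypothetical helpers, and CRC32 is used only because it is deterministic in Python; it is not the hash function Greenplum uses.

```python
import zlib

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_index, num_segments):
    """Place each row on exactly one segment, chosen by hashing its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_index], num_segments)].append(row)
    return segments

# Hypothetical sample table: 1000 orders, distributed on the order id.
rows = [(order_id, f"customer-{order_id % 7}") for order_id in range(1000)]
segments = distribute(rows, key_index=0, num_segments=4)
print([len(s) for s in segments])  # row count held by each segment
```

Because the same key always hashes to the same segment, every row lives in exactly one place, and a key with many distinct values tends to spread rows fairly evenly.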
Segments run on servers called segment hosts. A segment
host typically runs two to eight Greenplum segments, depending on the
CPU cores, RAM, storage, network interfaces, and workloads. Segment hosts are
expected to be identically configured. The key to obtaining the best
performance from Greenplum Database is to distribute data and workloads evenly
across a large number of equally capable segments so that all segments begin
working on a task simultaneously and complete their work at the same time.
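The cost of uneven distribution can be made concrete with a little arithmetic. The following is a rough sketch under simplifying assumptions: `skew_ratio` and `elapsed_time` are hypothetical helpers, every segment is assumed to process rows at the same fixed rate, and the row counts are invented for illustration.

```python
def skew_ratio(rows_per_segment):
    """Busiest segment's row count divided by the average (1.0 means perfectly even)."""
    average = sum(rows_per_segment) / len(rows_per_segment)
    return max(rows_per_segment) / average

def elapsed_time(rows_per_segment, rows_per_second):
    """A parallel step finishes only when the most heavily loaded segment finishes."""
    return max(rows_per_segment) / rows_per_second

# Same one million rows, laid out two different ways across four segments.
even = [250_000, 250_000, 250_000, 250_000]
skewed = [700_000, 100_000, 100_000, 100_000]

for name, layout in (("even", even), ("skewed", skewed)):
    ratio = round(skew_ratio(layout), 2)
    seconds = elapsed_time(layout, rows_per_second=50_000)
    print(name, ratio, seconds)
```

In this model the skewed layout takes 14 seconds against 5 for the even one, even though the total amount of data is identical, because three segments sit idle while the fourth finishes.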
Greenplum Interconnect
The interconnect is the networking layer of the Greenplum
Database architecture.
The interconnect refers to the inter-process communication
between segments and the network infrastructure on which this communication
relies. The Greenplum interconnect uses a standard 10-Gigabit Ethernet
switching fabric.