gpcheckcat – Part 1

Why do I need to gpcheckcat?
gpcheckcat is a utility script run from the OS to check for catalog inconsistencies. Because Greenplum is a distributed computing platform, catalog inconsistencies can exist not only within the catalog on the master or an individual segment node, but also between the master catalog and the segment nodes. gpcheckcat contains multiple tests that run on the master, on the primary segments via utility mode, or both, and can be run during weekly or monthly maintenance. Although there is an option to generate a repair script, the current version of gpcheckcat is unable to automatically generate fixes for every issue it detects. Whenever gpcheckcat finds an issue, a DBA should review the generated output as well as any auto-generated repair script.

Options:
gpcheckcat [<option>][dbname]
-?
-B parallel : number of worker threads
-g dir : generate SQL to rectify catalog corruption, put it in dir
-p port : DB port number
-P passwd : DB password
-U uname : DB User Name
-v : Verbose
-A : All databases
-S option : shared table options (none,only)
-O : Online
-l : List all tests
-R test : run this particular test
-C catname : run cross consistency, FK and ACL tests for this catalog table

Lists of Tests
  • duplicate
  • missing_extraneous
  • inconsistent
  • foreign_key
  • acl
  • persistent
  • pgclass
  • namespace
  • distribution_policy
  • dependency
  • owner
  • part_integrity
  • part_constraint
  • duplicate_persistent
Of the tests listed above, some can be run in Online mode and some only in Offline mode.

Tests in Online Mode:
  • duplicate
  • missing_extraneous
  • inconsistent
  • foreign_key
  • acl
  • owner
  • part_integrity
  • part_constraint
  • duplicate_persistent
Tests in Offline Mode:
  • persistent
  • pgclass
  • namespace
  • distribution_policy
  • dependency
Test(s) in detail:

Duplicate: Check for duplicate entries
This test checks catalog tables for primary key violations, skipping master-only tables and catalog tables without a primary key. It queries each catalog table, grouping by the hidden gp_segment_id column and the primary key columns, then returns any duplicates based on the count.
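The shape of this check can be sketched as follows. This is not gpcheckcat's actual SQL; it assumes GPDB's internal gp_dist_random() function (which fetches rows from the segments rather than the master) and uses pg_class, whose hidden primary key is its oid column, as the example catalog table:

```sql
-- Illustrative sketch only: look for primary-key duplicates on each
-- segment by grouping on gp_segment_id plus the hidden oid column.
SELECT gp_segment_id, oid, count(*) AS dup_count
FROM gp_dist_random('pg_class')
GROUP BY gp_segment_id, oid
HAVING count(*) > 1;
```

Any row returned would indicate a primary key duplicated within a single segment's catalog.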

Missing_extraneous: Cross consistency check for missing or extraneous entries
This checks for missing or extra records that violate the primary key constraint in the catalog tables. Again, master-only tables and tables without a primary key are skipped by this test.

Inconsistent: Cross consistency check for master/segment inconsistency
This checks for tables where one segment has an inconsistent number of unique records compared to the other segments. Again, master-only tables and tables without a primary key are skipped. The test finds the number of unique records in the table across all segments and the master; the record count with the highest number of segments in agreement is declared 'clean', and any segment with a different record count is declared 'unclean' or in error.
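A minimal sketch of the per-segment counting involved, again assuming gp_dist_random() and using pg_type as an arbitrary example catalog table (not gpcheckcat's actual SQL):

```sql
-- Illustrative sketch only: record counts per segment for one catalog
-- table; a segment whose count disagrees with the majority would be
-- flagged as 'unclean'.
SELECT gp_segment_id, count(*) AS record_count
FROM gp_dist_random('pg_type')
GROUP BY gp_segment_id
UNION ALL
SELECT -1 AS gp_segment_id, count(*) FROM pg_type;  -- the master (-1)
```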

Foreign_key: Check foreign keys
Greenplum Database does not enforce foreign key constraints. However, in earlier versions of the database the planner could leverage foreign key definitions to produce query plans. In the current version of Greenplum (4.2.7 and beyond) this behavior is controlled by the gp_statistics_use_fkeys configuration parameter. By default gp_statistics_use_fkeys is set to OFF, meaning that any foreign key definitions are completely ignored. gpcheckcat will check foreign key dependencies on catalog and user tables, with the exception of the following master-only tables:
  • gp_relation_node
  • pg_description
  • pg_shdescription
  • pg_stat_last_operation
  • pg_stat_last_shoperation
  • pg_statistic
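One such catalog-level foreign key dependency can be sketched like this (illustrative only, not gpcheckcat's actual SQL): every pg_index row should point at an existing pg_class entry.

```sql
-- Illustrative sketch only: pg_index rows whose index relation is
-- missing from pg_class would be a broken catalog "foreign key".
SELECT i.indexrelid
FROM pg_index i
LEFT JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.oid IS NULL;
```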
acl: Cross consistency check for access control privileges
This checks access control lists for tables between the master and the segments. The SQL associated with this test compares the access control list defined for a table in the master catalog against the access control list defined for the same table on the segments. The query uses the @> (contains) operator and does a two-way comparison, first testing whether the master ACL is contained in the segment ACL and vice versa.
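The two-way containment comparison can be sketched as follows (illustrative only, not gpcheckcat's actual SQL; it assumes gp_dist_random() and ignores the NULL-ACL case, which the real test handles separately):

```sql
-- Illustrative sketch only: flag tables whose segment ACL and master
-- ACL do not contain each other (i.e. are not equal as sets).
SELECT m.relname, s.gp_segment_id,
       m.relacl AS master_acl, s.relacl AS segment_acl
FROM pg_class m
JOIN gp_dist_random('pg_class') s ON s.oid = m.oid
WHERE NOT (m.relacl @> s.relacl AND s.relacl @> m.relacl);
```

Containment is tested in both directions because `a @> b` alone would not catch privileges present on the segment but missing on the master.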

Persistent: Check persistent tables
This checks the primary-to-mirror synchronization process. The test leverages the gp_persistent_database_node, gp_persistent_filespace_node and gp_persistent_relation_node tables, all of which are used by the primary-to-mirror synchronization process. In particular, the persistent_state column in all three catalog tables is used to track the current status of a database object (e.g. 'create pending', 'drop pending', etc.) in relation to file system objects.

pg_class: Check for pg_class entries that do not have any corresponding pg_attribute entry
This checks for tables with no associated columns. The SQL for this test joins the pg_class table (which holds records for database objects like tables and indexes) against the pg_attribute table (which holds table column metadata).
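The join can be sketched as an anti-join (illustrative only, not gpcheckcat's actual SQL):

```sql
-- Illustrative sketch only: pg_class entries with no pg_attribute rows
-- at all; a real check would also filter on relkind.
SELECT c.oid, c.relname
FROM pg_class c
LEFT JOIN pg_attribute a ON a.attrelid = c.oid
WHERE a.attrelid IS NULL;
```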

namespace: Check for leaked temporary schema and missing schema definition
The concept behind Greenplum and PostgreSQL temporary tables is the creation of a private schema visible only to a user's current session. Temporary table schemas are named to a consistent standard: each schema name consists of the identifier 'pg_temp_' followed by the user's session ID.
Schemas for temporary tables should be automatically dropped when the underlying user session disconnects. Sometimes this process is interrupted, resulting in orphaned schemas on either the master or individual segments. The internal function gp_dist_random is used to fetch results for the pg_namespace catalog table from the segments rather than the master. These results are then combined with the results of the query against pg_namespace on the master via a SQL UNION clause. All results are left joined to the pg_stat_activity view via the session ID stored in the temporary table schema's name. Care must be taken when viewing the output from this test: orphaned schemas that exist on a segment but not the master may require DBA intervention to resolve.
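The overall shape of that query can be sketched as follows. This is illustrative only, not gpcheckcat's actual SQL; it assumes gp_dist_random() and that GPDB's pg_stat_activity exposes the session ID as a sess_id column, and it extracts the session ID as the text after the 8-character 'pg_temp_' prefix:

```sql
-- Illustrative sketch only: temp schemas (master UNION segments) whose
-- owning session no longer appears in pg_stat_activity.
SELECT nspname
FROM (
    SELECT nspname FROM pg_namespace
    UNION
    SELECT nspname FROM gp_dist_random('pg_namespace')
) n
WHERE nspname LIKE 'pg_temp_%'
  AND substring(nspname FROM 9)::int NOT IN
      (SELECT sess_id FROM pg_stat_activity);
```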

distribution_policy:  Check constraints on randomly distributed tables
This checks that unique constraints are not present on tables that have a random distribution policy. Similarly, unique constraints on hash-distributed tables must match the table's distribution columns. For this test gpcheckcat runs two queries: the first checks for the existence of any primary key or unique constraints on any randomly distributed table; the second checks that the distribution key columns of all hash-distributed tables are a subset of any unique or primary key column constraints.
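The first of those two queries can be sketched like this (illustrative only, not gpcheckcat's actual SQL; it assumes the GPDB 4.x gp_distribution_policy catalog, where a NULL attrnums marks a randomly distributed table):

```sql
-- Illustrative sketch only: randomly distributed tables that
-- nonetheless carry a unique or primary key index.
SELECT c.relname
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
JOIN pg_index i ON i.indrelid = c.oid
WHERE p.attrnums IS NULL
  AND (i.indisunique OR i.indisprimary);
```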

dependency: Check for dependency on non-existent objects
This checks for invalid dependencies for both database objects and cluster objects. This test queries both the pg_depend catalog table and pg_shdepend catalog table. In GPDB objects like roles are cluster wide and span multiple databases. Dependencies for these objects are stored in the pg_shdepend table. Dependencies within a single database, such as between a view and its underlying tables, are stored in the pg_depend table.
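One flavor of the pg_depend side of this check can be sketched as follows (illustrative only, not gpcheckcat's actual SQL):

```sql
-- Illustrative sketch only: pg_depend rows that reference a pg_class
-- object which no longer exists.
SELECT d.objid, d.refobjid
FROM pg_depend d
LEFT JOIN pg_class c ON c.oid = d.refobjid
WHERE d.refclassid = 'pg_class'::regclass
  AND c.oid IS NULL;
```

The real test repeats this pattern for other referenced catalogs, and runs the equivalent checks against pg_shdepend for cluster-wide objects.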

owner: Check table ownership that is inconsistent with master database
This checks GPDB tables which may have associated subsidiary objects, e.g. TOAST tables or append-only segment tables, where the DB role that owns those objects may not be consistent. This test has two parts:
1.  Check uses the gp_dist_random function to retrieve ownership information from the segment catalog tables for database tables with associated TOAST tables and append-only segments.  These results are then compared to the master.
2.  Check uses the gp_dist_random function to retrieve ownership information from the segments for types. These results are then compared to the master.
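The master-versus-segment comparison in both parts can be sketched like this (illustrative only, not gpcheckcat's actual SQL; it assumes gp_dist_random()):

```sql
-- Illustrative sketch only: tables whose owner on a segment differs
-- from the owner recorded in the master catalog.
SELECT m.relname, s.gp_segment_id,
       m.relowner AS master_owner, s.relowner AS segment_owner
FROM pg_class m
JOIN gp_dist_random('pg_class') s ON s.oid = m.oid
WHERE m.relowner <> s.relowner;
```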

part_integrity: Check pg_partition branch integrity, partition with OIDs, partition distribution policy
Partitions in GPDB are simply regular DB tables associated to the parent table. In the case of multi-level partitions each partition table would have a number of associated sub partition tables. Data is stored in the lowest level of partition tables and not on any higher level. This test actually has 3 parts:
1.  Checks for branch partition tables that contain data, by simply performing a SELECT COUNT(*) against the segments.
2.  Checks for partitioned tables created with OIDs. Object IDs are the integer primary keys used in the catalog for DB objects, and there are a finite number of OIDs available in any installation. For this reason, while it is possible to create tables where every row is assigned an OID, this should never be done in practice, since OID roll-over can only be resolved via dump, reinstall and import. In addition, OIDs should never be used with partition tables.
3.  Checks for partition tables whose distribution policy differs from that of their parent tables.
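Part 2 can be sketched like this (illustrative only, not gpcheckcat's actual SQL; it assumes GPDB's pg_partition catalog, whose parrelid column links a partitioned table to its pg_class entry):

```sql
-- Illustrative sketch only: partitioned tables created WITH OIDS,
-- detected via pg_class.relhasoids.
SELECT c.relname
FROM pg_class c
JOIN pg_partition p ON p.parrelid = c.oid
WHERE c.relhasoids;
```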

part_constraint: Check constraints on partitioned tables
This checks for table constraints that are defined on the root table of a partitioned table but that are not defined on the branches or leaves.

duplicate_persistent:
This combines the duplicate and persistent checks, adding a further filter on its catalog queries to limit the results to a single database.
