GP - User Questions

Q1. When Batch insert is running via JDBC, those are processed as individual inserts in DB. Is this expected behavior in Greenplum?
Answer: YES, This is an expected behavior, this is done to ensure atomicity. 

Q2. Does individual insert statement performance differ for AO row based vs AO columnar tables, and how? 
Answer:
Row base tables are more suitable for inserts / update than columnar as for as one insert only file accessed.
In columnar table each column has a separate file, so it will be a bit slower than row base table insert or any DML operation as each. 
Here in this case, when an insert is happening on a separate file will be created for each column insert and in Greenplum each column is a separate physical file on every segment.
Columnar is actually good for selects sorts of operations.
Row base tables provide much better performance.

Q3. Why don't I see any DDL queries in queries_history table?
Answer:
Yes, DDL queries are not captured in queries_history table. This is an expected behavior.


Q4. Is there a recommended column to use for DISTRIBUTED BY attribute of a table?

Answer:
Commonly used joined tables should be distributed on same key. This allows the Greenplum to perform the location operations hence increasing performance. An Integer column when possible. Hash joins works better on the integers hence  integer is better choice for distribution key. A column that does not allow NULL, or minimum NULL values. Null values will hash to the same distribution and potentially cause skew. Explicitly define a column or random distribution for all tables. Do not use the default as it may cause skew on segments. You should use a single best suited column that will distribute data across all segments evenly. You should distribute data on unique or primary key columns. Do not distribute on columns that will be used in the WHERE clause of a query. You should partition the data on the column that is used in WHERE clause. Do not distribute on dates or timestamps. Distributing data on date or timestamp may case poor system performance. The distribution key column data should contain unique values or very high cardinality. If a single column cannot achieve an even distribution, use a multi-column distribution key, but no more than two columns.  Additional columns in distribution require additional time in the hashing process. In case if a two-column distribution key cannot achieve an even distribution of data, use a random distribution. Multi-column distribution keys in most cases require motion operations to join tables, so they offer no advantages over a random distribution.


Q4. Is there a recommended way to define a timestamp column?

Answer: 
You can use both with timezone or without timezone, depends on use case.

Comments

Popular posts from this blog

How to set Optimizer at database level in greenplum

Changing the gpmon Password

IPMI Tool | IPMI Cheatsheet