It's all in database: Learnings from benchmarking PostgreSQL on TPC-H

After making its mark in the OLTP world, PostgreSQL is moving towards catering the needs of OLAP environment. Hence, the recent advancements in features like parallel query, declarative partitioning, and partition-wise joins and aggregates, etc.

Since the introduction of parallel query in v 9.6, there were attempts to prove its value through bench-marking it with TPC-H. In this regard, the very first post I could find is this one, straight from the committer of the feature. Later as new parallel operators were introduced, more intensive bench-marking was required to see the effect of the new operators. That is where my contribution came into picture.

It started with bench-marking the proposed patches on postgresql-hackers list using TPC-H for different scale factors. There were three main directions to this study,

Testing patches for their correctness and robustness
Analysing the combined effect of multiple parallel operators. There were many different things I and the team learned in the process, which I will talk about in this post
Finding if there is need for some more parallel operators or enhancement on existing ones

If you are here to learn about setting up a TPC-H database for yourself, please checkout the next section.

Lessons learned

Parallel scans

Create indexes! By default, TPC-H schema has only primary and foreign keys. In order to optimize your queries or employ a better plan you might need to add indexes. A simple yet easy to forget thing.
Tune the related parameters! Proper indexes not being used? How about lowering your random_page_cost and increasing effective_cache_size. By default the value of random_page_cost is 4 and sequential_page_cost is 1. This is to justify the costly operation of random fetching of tuples. But if there is enough cache to keep enough (or whole) of the index, then it is required that effective_cache_size is set to that value and random_page_cost is lowered to 1. Now, the query planner knows that there is enough memory and whole enough data will be available in cache, so it will not be a very random operation indeed.
Partial parallelism: Pay attention to which part of the operator is parallel and how much is its contribution. E.g. parallel bitmap heap scan, there is a limit to the improvement caused by parallel bitmap heap scan, because only the scan of bitmap heap is parallelised and not the bitmap index scan part. Now, if the contribution of of bitmap heap scan is negligible compared to its index scan, then it is likely that the query will not show much benefits with this shiny new parallel scan. This is also true for parallel aggregates, if the finalize aggregate is costlier part then improvements in partial aggregate will not help much

Parallel joins

Work_mem! It is the one of the most influential memory related parameter for planner. Using too less of it means telling planner that tasks involving memory like sorting, merge-join, hash-join, etc. would be costly. However, giving too much of it might create issues of consuming too much memory specially in case of parallel-query. Since work-mem is per node and query ends up using too much memory if multiple workers are working. Rule of thumb says to set it up something around 20% of the RAM size.
Parallel merge-join: It is during this analysis that we figured out that parallelism in merge join is next required thing. As we move towards higher scales, more joins were in need parallel merge join.
Shared parallel hash: Another interesting analysis was the performance improvement with shared parallel hash, in hash joins. In the initial versions of parallel hash join, each worker creates their own hash, hence requiring too much memory particularly in cases of selecitivity underestimation. Later, with v11, parallel hash join creates just one hash using all the workers and then each worker can use the same.

Other operators:

Gather-merge: Most of the TPC-H queries are having group by and/or order by hence sorting remains a costly operation for most of the queries. Since, there is no parallel sort at the first glance it appears that the queries with costly sort node at the top are likely to receive not much benefit with intra-query parallelism. However, the introduction of a new operator -- gather-merge at the top and parallel index scans at lower nodes, queries improved significantly. Gather-merge maintains the order of rows coming from lower nodes and hence avoids the requirement of sorting later.
Stalling: However, we also noticed that sometimes gather-merge makes the query slower as it has to wait for one or more workers to maintain the order, which in turn stalls other faster workers. Hence we sometimes witnesses no performance improvements in queries switching to parallel scans and joins and having gather-merge at the top.
init or sub plans: Notice if your query has sub query or init plans, if so query might not leverage benefits of parallel operators. In v9.6 no init or sub plan can use parallel query. With v10, init plans and uncorrelated sub plans can use parallel operators. Correlated sub queries can still not use parallel operators.
Queries via PL: Till v9.6, queries coming from procedural languages were not able to use parallel operators. We found this out and enabled them for parallel safe cases from v10 onward.
Planner issues: Remember to not blame parallel operators for incorrect estimation. Often in our analysis we found that some operators get picked up or dropped because of incorrect estimation.

Links to my parallel query related presentations at Postgres conferences in last few years,

What is TPC-H and how to set it up?

Disclaimer: You may totally skip this section if you are already aware of TPC-H and know how to set it up.

To be brief about this, TPC-H is industry wide accepted bench-marking standard for OLAP environment, or atleast it used to be, which is now replaced by more advanced version of it -- TPC-DS. To know more about TPC council and the different benchmarks they have, etc. you may read this.

Basically, when you download TPC-H from their aforementioned website, you get two tools - dbgen and qgen and a detailed readme with the instructions and meaning and intent of each query. Anyhow, if you chose to skip the readme, here is a brief overview. With their self explanatory names, dbgen is for generating the data and qgen is for generating queries. Interesting thing to know about this is the data is randomly generated for the tables and so do the parameter values for the 22 queries. So in each run of the qgen, you may get different queries, as they may have different parameters.

Next thing is to load this data to your postgres database. Now, for this I would recommend following,

Downloaded the tool from official TPC-H website
Generate the data for required scale factor
Convert them to csv files as described here
Use the script (tpch-load) provided here for loading the data. Other scripts provided there are also useful for adding keys, indexes, etc. you may use or skip them as per your own requirements
You may create the queries using qgen, which will substitute the parameters values with some random numbers based on your generated data.
And that's it! You are now ready to roll!

It's all in database

Sunday, 1 March 2020

Learnings from benchmarking PostgreSQL on TPC-H

Lessons learned

What is TPC-H and how to set it up?

No comments:

Post a Comment

Authentication monitoring in PostgreSQL