It's all in database: parallelism in postgresql


Sunday, 1 March 2020

Learnings from benchmarking PostgreSQL on TPC-H

After making its mark in the OLTP world, PostgreSQL is moving towards catering to the needs of OLAP environments; hence the recent advancements in features like parallel query, declarative partitioning, and partition-wise joins and aggregates.

Since the introduction of parallel query in v9.6, there have been attempts to prove its value by benchmarking it with TPC-H. In this regard, the very first post I could find is this one, straight from the committer of the feature. Later, as new parallel operators were introduced, more intensive benchmarking was required to see their effect. That is where my contribution came into the picture.

It started with benchmarking the patches proposed on the pgsql-hackers list using TPC-H at different scale factors. There were three main directions to this study:

Testing patches for their correctness and robustness
Analysing the combined effect of multiple parallel operators. The team and I learned many different things in the process, which I will talk about in this post
Finding out whether there is a need for more parallel operators or for enhancements to existing ones

If you are here to learn about setting up a TPC-H database for yourself, please check out the section "What is TPC-H and how to set it up?" below.

Lessons learned

Parallel scans

Create indexes! By default, the TPC-H schema has only primary and foreign keys. To optimize your queries or get a better plan, you might need to add indexes. A simple thing, yet easy to forget.
Tune the related parameters! Proper indexes not being used? How about lowering your random_page_cost and increasing effective_cache_size? By default, random_page_cost is 4 and seq_page_cost is 1, which reflects how much costlier a random page fetch is than a sequential one. But if there is enough cache to hold most (or all) of the index, then effective_cache_size should be set to that size and random_page_cost can be lowered to 1. Now the query planner knows that there is enough memory and that most of the data will be available in cache, so random access will not be that expensive after all.
Partial parallelism: Pay attention to which part of the operator is parallel and how much it contributes. Take parallel bitmap heap scan: there is a limit to the improvement it can bring, because only the bitmap heap scan is parallelised and not the bitmap index scan below it. If the contribution of the bitmap heap scan is negligible compared to its index scan, then the query is unlikely to show much benefit from this shiny new parallel scan. The same is true for parallel aggregates: if the finalize aggregate is the costlier part, then improvements in the partial aggregate will not help much.
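A minimal sketch of the first two points. The table and column (lineitem, l_shipdate) come from the standard TPC-H schema, but the index name, the choice of column, and the 8GB figure are assumptions for illustration only:

```sql
-- Hypothetical index on a commonly filtered TPC-H column.
CREATE INDEX idx_lineitem_shipdate ON lineitem (l_shipdate);
ANALYZE lineitem;

-- Session-level settings; 8GB is an assumed figure, size it to your own cache.
SET effective_cache_size = '8GB';
SET random_page_cost = 1;

-- Check whether the planner now picks the index.
EXPLAIN
SELECT count(*)
FROM lineitem
WHERE l_shipdate >= DATE '1995-01-01'
  AND l_shipdate <  DATE '1995-02-01';
```

Setting these with SET keeps the change local to the session, which is convenient while comparing plans before touching postgresql.conf.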

Parallel joins

work_mem! It is one of the most influential memory-related parameters for the planner. Setting it too low tells the planner that memory-hungry operations like sorts, merge joins, and hash joins will be costly. However, setting it too high can consume too much memory, especially with parallel query: work_mem applies per node in each process, so a query can end up using many times that amount when multiple workers are running. A rule of thumb is to set it to around 20% of the RAM size.
Parallel merge join: It was during this analysis that we figured out that parallelism in merge join was the next thing required. As we moved towards higher scale factors, more joins needed a parallel merge join.
Shared parallel hash: Another interesting analysis was the performance improvement from shared parallel hash in hash joins. In the initial versions of parallel hash join, each worker builds its own hash table, which requires too much memory, particularly in cases of selectivity underestimation. Since v11, parallel hash join builds a single shared hash table using all the workers, and every worker then uses that same table.
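To see how these settings interact on a join, something like the following can be tried. It is only a sketch: the values are assumed, and the query is a simplified join over the standard TPC-H orders and lineitem tables rather than one of the 22 benchmark queries.

```sql
-- Assumed values; work_mem applies to each sort/hash node in each process.
SET work_mem = '256MB';
SET max_parallel_workers_per_gather = 4;

-- Available from v11, where it is on by default.
SET enable_parallel_hash = on;

-- Look for "Parallel Hash Join" / "Parallel Hash" nodes and their memory usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.o_orderpriority, count(*)
FROM orders o
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
WHERE l.l_commitdate < l.l_receiptdate
GROUP BY o.o_orderpriority;
```

Re-running the same EXPLAIN with enable_parallel_hash set to off is a quick way to compare the shared hash table against the per-worker one.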

Other operators:

Gather-merge: Most of the TPC-H queries have group by and/or order by clauses, so sorting remains a costly operation for most of them. Since there is no parallel sort, at first glance it appears that queries with a costly sort node at the top are unlikely to benefit much from intra-query parallelism. However, with the introduction of a new operator, gather-merge, at the top and parallel index scans at lower nodes, queries improved significantly. Gather-merge preserves the order of rows coming from lower nodes and hence avoids the need to sort later.
Stalling: However, we also noticed that gather-merge sometimes makes the query slower, as it has to wait for one or more workers in order to maintain the ordering, which in turn stalls the faster workers. Hence we sometimes witnessed no performance improvement in queries that switched to parallel scans and joins with gather-merge at the top.
Init or sub plans: Check whether your query has sub queries or init plans; if so, it might not get the benefits of parallel operators. In v9.6, no init or sub plan can use parallel query. With v10, init plans and uncorrelated sub plans can use parallel operators. Correlated sub queries still cannot.
Queries via PL: Up to v9.6, queries coming from procedural languages were not able to use parallel operators. We found this out and enabled them for parallel-safe cases from v10 onward.
Planner issues: Remember not to blame parallel operators for incorrect estimation. Often in our analysis we found that some operators got picked or dropped because of incorrect estimates.
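A minimal sketch of how one might check the last points on a given query. The query is a simplified aggregate over the standard TPC-H lineitem table, and the statistics target of 1000 is an assumed value:

```sql
-- Refresh statistics, then compare the estimated "rows=" with the actual "rows=" on each node.
ANALYZE lineitem;

EXPLAIN (ANALYZE)
SELECT l_returnflag, l_linestatus, sum(l_quantity)
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-01'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

-- If a column is badly estimated, a larger statistics target may help (1000 is assumed).
ALTER TABLE lineitem ALTER COLUMN l_shipdate SET STATISTICS 1000;
ANALYZE lineitem;

-- To check whether Gather Merge is the part that stalls, disable it for the session (v10+)
-- and compare the timings of the same query.
SET enable_gathermerge = off;
```

Large gaps between estimated and actual row counts usually point at the planner's inputs rather than at the parallel operators themselves.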

Links to my parallel query related presentations at Postgres conferences in last few years,

What is TPC-H and how to set it up?

Disclaimer: You may totally skip this section if you are already aware of TPC-H and know how to set it up.

To be brief, TPC-H is an industry-wide accepted benchmarking standard for OLAP environments, or at least it used to be; it has now been replaced by a more advanced version, TPC-DS. To know more about the TPC council and the different benchmarks they have, you may read this.

Basically, when you download TPC-H from their aforementioned website, you get two tools, dbgen and qgen, and a detailed readme with the instructions and the meaning and intent of each query. If you choose to skip the readme, here is a brief overview. As their self-explanatory names suggest, dbgen generates the data and qgen generates the queries. An interesting thing to know is that the data for the tables is randomly generated, and so are the parameter values for the 22 queries. So each run of qgen may give you different queries, as they may have different parameters.

The next thing is to load this data into your Postgres database. For this, I would recommend the following:

Download the tool from the official TPC-H website
Generate the data for required scale factor
Convert them to csv files as described here
Use the script (tpch-load) provided here for loading the data. The other scripts provided there are also useful for adding keys, indexes, etc.; you may use or skip them as per your own requirements
Create the queries using qgen, which will substitute the parameter values with random values based on your generated data
And that's it! You are now ready to roll!
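If you prefer to load a table by hand rather than with the script, the load step can be sketched as below. It assumes the table already exists, that the converted file is '|'-delimited with the trailing delimiter removed, and that the file path (hypothetical here) is readable by the server process:

```sql
-- Hypothetical path; COPY reads the file on the database server.
COPY lineitem FROM '/path/to/tpch/lineitem.csv' WITH (FORMAT csv, DELIMITER '|');

-- Refresh planner statistics after the bulk load.
ANALYZE lineitem;
```

Running ANALYZE right after the load matters for benchmarking, since plans chosen on stale statistics can look very different.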

Friday, 1 March 2019

Using parallelism for queries from PL functions in PostgreSQL 10

Intra-query parallelism was introduced in PostgreSQL in version 9.6. The benefits of parallel scans and joins were talked about, and significant improvements in the benchmark queries at higher scale factors were highlighted. However, one area remained devoid of the benefits: queries from procedural language functions. Precisely, if you fire a query from a PL/pgSQL function, it cannot use parallel scans or joins, even though the query is capable of using them otherwise. Have a look at an example yourself,

-- creating and populating the table

create table foo (i int, j int);
insert into foo values (generate_series(1,500), generate_series(1,500));

-- for experimental purposes we are forcing parallelism by setting relevant parameters

set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
alter table foo set (parallel_workers = 4);
set max_parallel_workers_per_gather = 4;

-- executing the query as an SQL statement

explain analyse select * from foo where i <= 150;

Gather (cost=0.00..4.56 ...) (actual time=0.217..5.614 ...)
  Workers Planned: 4
  Workers Launched: 4
  -> Parallel Seq Scan on foo (cost=0.00..4.56 ...) (actual time=0.004..0.018 ...)
       Filter: (i <= 150)
       Rows Removed by Filter: 70

-- create a PL/pgSQL function to fire a query with the same filter

create or replace function total ()
returns integer as $total$
declare
    total integer;
begin
    select count(*) into total from foo where i <= 150;
    return total;
end;
$total$
language plpgsql;

-- executing the query from a PLpgSQL function in v 9.6

explain analyse select total();

Query Text: SELECT count(*) FROM foo where i <=150
Aggregate (cost=9.25..9.26 ...)
  -> Seq Scan on foo (cost=0.00..8.00 ...)
Query Text: explain analyse select total();
Result (cost=0.00..0.26 ...)
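The "Query Text:" lines above look like auto_explain log output, which is one way to see the plan of a query nested inside a function. A minimal sketch of a session setup that would produce this kind of output, assuming the auto_explain module is installed and you have the privileges to load it:

```sql
-- Load the auto_explain contrib module for this session (requires superuser).
LOAD 'auto_explain';

SET auto_explain.log_min_duration = 0;          -- log the plan of every statement
SET auto_explain.log_analyze = true;            -- include actual run times
SET auto_explain.log_nested_statements = true;  -- also log statements run inside functions
SET client_min_messages = log;                  -- show the LOG messages in the client too

SELECT total();
```

Without log_nested_statements, only the top-level statement is logged and the query inside the function stays invisible.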

To your relief the feature was then added in version 10. Have a look,

-- executing the query from a PLpgSQL function in v10

explain analyse select total();

Query Text: SELECT count(*) FROM foo where i <=150
Finalize Aggregate (cost=4.68..4.69 ...)
  -> Gather (cost=4.66..4.67 ...)
       Workers Planned: 4
       -> Partial Aggregate (cost=4.66..4.67 ...)
            -> Parallel Seq Scan on foo (cost=0.00..4.56 ...)
                 Filter: (i <= 150)

This extends the utility of parallel query to more realistic environments wherein queries are fired through functions and are not simple SQL statements.

Wednesday, 31 October 2018

Using parallel sequential scan in PostgreSQL

Parallel sequential scan is the first parallel access method in PostgreSQL and was introduced in version 9.6. Robert Haas, the committer of this feature and my colleague at EnterpriseDB, wrote an awesome blog on it, and there is another great blog by Amit Kapila, another PostgreSQL committer and colleague. Both of these blogs explain this access method, its design, usage, and related parameters.

Still, I could not help but notice that there are questions around the usage of this access method. Every now and then I see a complaint saying that parallel sequential scan is not getting selected or that it is degrading the performance of a query. So, I decided to write this blog to cover more practical scenarios and to focus specifically on its less talked-about aspect: where parallel sequential scan would not (and should not) improve performance.

Before diving into the details of parallel SeqScan, let's first understand the basic infrastructure and terminology related to it in PostgreSQL. The processes that run in parallel and scan the tuples of a relation are called parallel workers, or workers for short. There is one special process, the leader, which coordinates the workers and collects the output of the scan from each of them. The leader may or may not participate in scanning the relation, depending on its load in dividing and combining the work. End users can also control the leader's involvement in the scan through the boolean GUC parameter parallel_leader_participation.
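A minimal sketch of toggling the leader's participation. The table foo is assumed to be a small test table like the one used in the earlier post on this page (any table works), and on a table that small a parallel plan may only appear if the parallel cost parameters are lowered first:

```sql
-- parallel_leader_participation is available from v11 and defaults to on.
SET parallel_leader_participation = off;

-- With it off, only the workers scan foo; the leader just gathers their rows.
EXPLAIN (ANALYZE)
SELECT * FROM foo WHERE i <= 150;
```

Comparing the per-node "loops" count with the setting on and off shows whether the leader took part in the scan.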

Now, let's understand the concept of parallel scan in PostgreSQL by a simple example.

Let there be a table T (a int, b int) containing 100 tuples
Let's say we have two workers and one leader
Cost of scanning one tuple is 10
Cost of communicating a tuple from worker to leader is 20
Cost of dividing the tuples among workers is 30
For simplicity, let's assume that the leader gives 50 tuples to each worker

Now, let's analyse whether parallel scan will be faster than non-parallel scan:

Cost of SeqScan = 10*100 = 1000
Cost of Parallel SeqScan = 30 + (50 * 10) + (50 * 20) * 2 = 2530

Here we can see that although the cost of scanning the tuples is halved, the cost of combining the results is enough to make the overall cost of parallel SeqScan higher than that of non-parallel SeqScan.

Now, let's say we want to list only the tuples which have a > 80, and there are only 20 (say) such tuples. The cost of SeqScan will remain the same, but the cost of parallel SeqScan, assuming each worker finds 10 of those tuples, becomes:

Cost of Parallel SeqScan = 30 + (50 * 10) + (10 * 20) * 2 = 930

Hence, parallel SeqScan is likely to improve the performance of queries that scan a large amount of data of which only a few tuples satisfy the selection criteria. To generalise this,

Cost of SeqScan = Cost of scanning one tuple * number of tuples

Cost of parallel SeqScan = Cost of dividing the work among workers + cost of work done by a worker + cost of combining the work from workers

Let's dive into it a bit more,

Cost of dividing the work among workers is fairly constant, depending on the relation size

Cost of combining the work from workers = cost of communicating the selected tuples from each worker to the leader

Cost of work done by a worker = cost of scanning a tuple * number of tuples the respective worker has received

Now we can see that the cost of combining the work depends on the number of tuples each worker selects and sends to the leader. So, for queries where all or almost all of the tuples are in the final result, we will be paying more than for the non-parallel flavour: first for scanning each tuple and second for combining it into the final result.
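The same arithmetic can be written as a small query, which makes it easy to try other numbers. This is only a sketch of the toy model above, not of the planner's real costing. The column names are made up, and it follows the worked example in counting one worker's share of the scan (the workers scan concurrently) plus the communication cost for every worker:

```sql
WITH params AS (
    -- Toy figures from the example: 100 tuples, 2 workers, unit costs 10 / 20 / 30.
    SELECT 100 AS total_tuples,
           2   AS workers,
           10  AS scan_cost,
           20  AS comm_cost,
           30  AS divide_cost
),
scenarios (label, selected_tuples) AS (
    VALUES ('all tuples returned', 100),
           ('only a > 80 returned', 20)
)
SELECT s.label,
       p.scan_cost * p.total_tuples AS seqscan_cost,
       p.divide_cost
         + p.scan_cost * (p.total_tuples / p.workers)                  -- one worker's share of the scan
         + p.comm_cost * (s.selected_tuples / p.workers) * p.workers   -- every worker sends its matches
         AS parallel_seqscan_cost
FROM params p
CROSS JOIN scenarios s;
```

Changing selected_tuples shows where the parallel plan starts to win in this model: the fewer tuples that survive the filter, the smaller the communication term.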

In PostgreSQL, the cost of dividing the work among workers is given by parallel_setup_cost, the cost of communicating a tuple from worker to leader is given by parallel_tuple_cost, and the number of workers is upper-bounded by the GUC max_parallel_workers_per_gather.

So, if you are using a system with multiple high-frequency processors, lowering parallel_setup_cost and parallel_tuple_cost will help parallel scans get selected. If there are not many other processes running in parallel, increasing max_parallel_workers_per_gather can bring in more parallel workers to improve query performance. Another point to note is that the number of workers is further capped by max_worker_processes.
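A minimal sketch of trying these parameters in a session. The values set here are assumptions for experimentation, not recommendations, and foo stands for any table you want to test with. Note that max_worker_processes can only be changed with a server restart, so it is only inspected here:

```sql
-- Current limits (defaults: max_worker_processes = 8, max_parallel_workers_per_gather = 2).
SHOW max_worker_processes;
SHOW max_parallel_workers_per_gather;

-- Assumed values, lower than the defaults of 1000 and 0.1 respectively.
SET parallel_setup_cost = 100;
SET parallel_tuple_cost = 0.01;

-- Allow up to 4 workers under each Gather node for this session.
SET max_parallel_workers_per_gather = 4;

-- See whether a parallel plan is chosen now.
EXPLAIN SELECT * FROM foo WHERE i <= 150;
```

In the EXPLAIN ANALYZE output, "Workers Launched" can be lower than "Workers Planned" when the server-wide worker limit has been reached.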