Tuesday 20 August 2024

Authentication monitoring in PostgreSQL

Suppose you want to log all the login attempts made to your PostgreSQL server. One way would be to read your log files, extract all the information related to login attempts, and transfer it to some other file for further use. This works, but it poses challenges: reading those long files consumes time and memory, and it incurs a lot of I/O, which can be a serious issue when you are on the cloud. During my time working for Zalando, we came across this problem and decided to write a Postgres extension to handle it efficiently.

And that is the motivation behind the development of this extension -- pg_auth_mon. To use it, include it in shared_preload_libraries and then run CREATE EXTENSION pg_auth_mon on your server. Once the extension is in place, all the login attempts to your server are recorded in a view called pg_auth_mon.
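As a minimal sketch of the workflow (the exact columns of the view vary by extension version, so treat the output shape as illustrative):

```sql
-- pg_auth_mon must already be listed in shared_preload_libraries
-- (which requires a server restart) before this will succeed:
CREATE EXTENSION pg_auth_mon;

-- later, inspect the collected login statistics: one row per user,
-- plus a single blank-username row aggregating attempts with
-- usernames that do not exist in the database
SELECT * FROM pg_auth_mon;
```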

Now, a caveat to remember here is that all the information in this view is lost at server restart.

If you are interested in understanding what happens inside the extension, here is a brief description. At login time, prev_client_auth_hook redirects control to the extension. There, in the function auth_monitor, the details of the connection -- the authentication method, user name, etc. -- which are available in the Port data structure, are read and saved into the view. It also records the time of the last login attempt, which can help in distinguishing an attempted security attack from a genuine mistake by a user.

The underlying data structure is a hash table, which makes it easy to check whether a username has logged in before. The extension also logs attempts made with incorrect usernames, i.e. usernames that do not exist in the database. To be mindful of space, all such attempts are counted in a single row whose username column is left blank. The drawback of this approach is that you cannot know which incorrect usernames were tried by an attacker (if any).

The latest version also handles the case where a user has been renamed since the last login attempt: an additional field, login_name_at_last_attempt, provides the last username used.

Sunday 1 March 2020

Learnings from benchmarking PostgreSQL on TPC-H


After making its mark in the OLTP world, PostgreSQL is moving towards catering to the needs of OLAP environments. Hence the recent advancements in features like parallel query, declarative partitioning, and partition-wise joins and aggregates.

Since the introduction of parallel query in v9.6, there have been attempts to prove its value by benchmarking it with TPC-H. The very first post I could find in this regard is this one, straight from the committer of the feature. Later, as new parallel operators were introduced, more intensive benchmarking was required to see their effect. That is where my contribution came into the picture.

It started with benchmarking the patches proposed on the pgsql-hackers list using TPC-H at different scale factors. There were three main directions to this study:
  1. Testing patches for their correctness and robustness
  2. Analysing the combined effect of multiple parallel operators -- the team and I learned many different things in the process, which I will talk about in this post
  3. Finding whether more parallel operators, or enhancements to existing ones, are needed

If you are here to learn about setting up a TPC-H database for yourself, please check out the section "What is TPC-H and how to set it up?" below.

Lessons learned

  • Parallel scans
  1. Create indexes! By default, the TPC-H schema has only primary and foreign keys. To optimize your queries or get a better plan, you might need to add indexes. A simple yet easy-to-forget thing.
  2. Tune the related parameters! Proper indexes not being used? Try lowering random_page_cost and increasing effective_cache_size. By default random_page_cost is 4 and seq_page_cost is 1, to reflect the higher cost of fetching tuples randomly. But if there is enough cache to hold most (or all) of the index, then set effective_cache_size accordingly and lower random_page_cost towards 1. The planner then knows that enough of the data will be in cache, so the fetches will not be all that random after all.
  3. Partial parallelism: pay attention to which part of an operator is parallel and how much it contributes. E.g. in a parallel bitmap heap scan, only the heap scan is parallelised, not the bitmap index scan, so there is a limit to the improvement it can bring. If the contribution of the heap scan is negligible compared to the index scan, the query is unlikely to show much benefit from this shiny new parallel scan. The same holds for parallel aggregates: if the finalize-aggregate step is the costlier part, then improvements in the partial aggregate will not help much.
  • Parallel joins
  1. work_mem! It is one of the most influential memory-related parameters for the planner. Setting it too low tells the planner that memory-intensive tasks like sorting, merge joins, and hash joins are costly. However, setting it too high can consume too much memory, especially with parallel query: work_mem applies per node, and a query can end up using a multiple of it when several workers are running. A rule of thumb says to set it to something around 20% of the RAM size.
  2. Parallel merge join: it was during this analysis that we figured out that parallelism in merge join was the next required thing. As we moved to higher scale factors, more joins were in need of a parallel merge join.
  3. Shared parallel hash: another interesting finding was the performance improvement from a shared hash table in parallel hash joins. In the initial version of parallel hash join, each worker built its own copy of the hash table, requiring too much memory, particularly in cases of selectivity underestimation. From v11, parallel hash join builds just one hash table using all the workers, which each worker then shares.
  • Other operators:
    1. Gather-merge: most TPC-H queries have GROUP BY and/or ORDER BY, so sorting remains a costly operation for most of them. Since there is no parallel sort, at first glance queries with a costly sort node at the top seem unlikely to benefit much from intra-query parallelism. However, with the introduction of a new operator -- gather-merge -- at the top and parallel index scans at the lower nodes, such queries improved significantly. Gather-merge maintains the order of rows coming from the lower nodes and hence avoids the need to sort later.
    2. Stalling: however, we also noticed that gather-merge sometimes makes a query slower: to maintain the order it has to wait for one or more workers, which stalls the other, faster workers. Hence we sometimes witnessed no performance improvement in queries that switched to parallel scans and joins with gather-merge at the top.
    3. Init and sub plans: check whether your query has sub plans or init plans; if so, it might not leverage parallel operators. In v9.6 no init or sub plan can use parallel query. From v10, init plans and uncorrelated sub plans can use parallel operators; correlated sub queries still cannot.
    4. Queries via PLs: until v9.6, queries coming from procedural languages could not use parallel operators. We found this out and enabled them for parallel-safe cases from v10 onward.
    5. Planner issues: remember not to blame parallel operators for incorrect estimation. Often in our analysis we found that an operator was picked or dropped because of incorrect estimates.
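To put the tuning advice from the lessons above in one place, here is an illustrative session-level sketch; the actual values are assumptions and depend entirely on your hardware and workload:

```sql
-- Illustrative starting points for a TPC-H style analytical workload
SET effective_cache_size = '24GB';   -- roughly the memory available for caching
SET random_page_cost = 1.1;          -- data mostly cached or on fast storage
SET work_mem = '256MB';              -- careful: applies per node, per worker
SET max_parallel_workers_per_gather = 4;
```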
Links to my parallel-query-related presentations at Postgres conferences over the last few years,

What is TPC-H and how to set it up?

Disclaimer: you may skip this section entirely if you are already aware of TPC-H and know how to set it up.

To be brief, TPC-H is the industry-accepted benchmarking standard for OLAP environments -- or at least it used to be, being superseded by a more advanced benchmark, TPC-DS. To know more about the TPC council and the different benchmarks they maintain, you may read this.

Basically, when you download TPC-H from the aforementioned website, you get two tools -- dbgen and qgen -- and a detailed readme with instructions and the meaning and intent of each query. If you choose to skip the readme, here is a brief overview. As their self-explanatory names suggest, dbgen generates the data and qgen generates the queries. An interesting thing to know: the table data is randomly generated, and so are the parameter values for the 22 queries. So each run of qgen may give you different queries, as they may have different parameters.

Next, load this data into your Postgres database. For this I would recommend the following,
  1. Download the tool from the official TPC-H website
  2. Generate the data for the required scale factor
  3. Convert it to CSV files as described here
  4. Use the script (tpch-load) provided here for loading the data. The other scripts provided there are also useful for adding keys, indexes, etc.; use or skip them as per your requirements
  5. Create the queries using qgen, which substitutes the parameter values with random values based on your generated data
  6. And that's it! You are now ready to roll!
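As a sketch, the loading step can also be done directly with COPY (the table names follow the TPC-H schema; the paths and delimiter here are assumptions and depend on how you converted the dbgen output):

```sql
-- dbgen's raw .tbl files are '|'-delimited (and carry a trailing
-- delimiter that may need stripping first); adjust FORMAT/DELIMITER
-- if you converted them to plain CSV
COPY region   FROM '/path/to/region.tbl'   WITH (FORMAT csv, DELIMITER '|');
COPY nation   FROM '/path/to/nation.tbl'   WITH (FORMAT csv, DELIMITER '|');
COPY orders   FROM '/path/to/orders.tbl'   WITH (FORMAT csv, DELIMITER '|');
COPY lineitem FROM '/path/to/lineitem.tbl' WITH (FORMAT csv, DELIMITER '|');
```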

Saturday 7 December 2019

Interesting aspects of logical replication at a glance

PostgreSQL has an amazing feature called logical replication, which enables you to replicate a set of tables from one server to another. Being a relatively new feature, it is less mature and needs many improvements. Nevertheless, that should not stop us from enjoying its advantages, so here is a post highlighting some interesting aspects of the feature.

No matter how similar it might sound to streaming replication, it is quite a different thing. It solves a different problem altogether, and in some sense it even overcomes the limitations of streaming replication. Some of the important aspects of logical replication are as follows,

  • Writes at the secondary
Unlike in streaming replication, the secondary can serve as a normal server; in particular, you can also perform inserts and updates on it. Also, unlike streaming replication, logical replication can be set up for one or more tables only. This comes in handy when we want to set up additional servers for load sharing.
  • Schema
The schema is not automatically copied to the secondary when you start the replication; you have to create the tables on the secondary before you start the subscription.
Schema changes on the primary are likewise not replicated via logical replication. To handle such changes, one can pause the replication, make the necessary changes on the secondary, and then resume the replication.
  • Attribute comparison
Attributes of a table are matched by name. A different column order in the target table is allowed, and the data types may also differ as long as the text representation of the data can be converted to the target type. The target table can also have additional columns not provided by the published table; these are filled with their default values. This makes it easy to also optimize the schema while migrating your database from one environment to another.
  • Sequences
As of PostgreSQL 12, sequences used in tables are not replicated to the secondary server. That is, the values of a sequence-backed column are copied to the secondary, but the sequence's next value, max value, etc. remain unaffected on the secondary.

To get around this, one can manually advance the next value of the sequence on the secondary to some high value. The interesting part is deciding which high value to choose; a good guess is based on the normal rate of insertions on the table and the max value on the primary. The important thing is to choose a value such that the sequenced column does not conflict with values coming from the primary.
  •  Privileges
Currently, permissions on the tables are checked only at the time the subscription is created and never afterwards. So be careful if permissions change on a table that has a subscription.
  • Partition tables
The current version of logical replication allows only normal tables to be replicated, so if you have a partitioned table, the root table cannot be replicated. However, you may create a publication for the child tables individually and they will be replicated accordingly.
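Putting the pieces above together, a minimal setup looks like this (the publication/subscription names, table names, and connection string are illustrative):

```sql
-- On the primary (publisher):
CREATE PUBLICATION my_pub FOR TABLE orders, customers;

-- On the secondary (subscriber), after creating matching tables:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary dbname=shop user=repl'
    PUBLICATION my_pub;

-- To pause replication around a schema change, then resume:
ALTER SUBSCRIPTION my_sub DISABLE;
-- ... apply the schema change on the subscriber ...
ALTER SUBSCRIPTION my_sub ENABLE;
```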

Saturday 23 November 2019

Which size fits me?

If you have ever used PostgreSQL and wondered about the size of your tables, this post might help. A PostgreSQL table is no simple table: it has many things associated with it, like indexes and TOAST tables, plus other important structures like the free space map and the visibility map. PostgreSQL provides multiple options for measuring the size of a table, and there are interesting distinctions among them.

\dt+ table-name

This is one of the most common ways to find the size of a table. In your psql client you may simply type it, and it returns information about the table, including its size in kB. Internally, this size is the same as
select pg_size_pretty(pg_relation_size('table-name')); -- for versions older than 9.0
select pg_size_pretty(pg_table_size('table-name')); -- for later versions
This brings us to our next function.
pg_table_size

This is the size of the table including its heap, any associated TOAST table, the free space map, and the visibility map -- the actual size of the relation on disk. Note that it does not include the size of any indexes associated with the table.

pg_relation_size

This measures the disk space used by one particular fork -- main, init, fsm, or vm. pg_relation_size('table-name') is the same as pg_relation_size('table-name', 'main'). So, if you want to measure the space used by a single fork, this is the function to call.

Now, what if we want to know the table size including all of its indexes? For that we move to the next function.

pg_total_relation_size

This function gives you the total size of the relation (as in pg_table_size) plus the size of all the indexes associated with it.
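The distinctions above can be seen side by side in one query (the table name is illustrative; pg_indexes_size returns just the index portion of the total):

```sql
SELECT pg_size_pretty(pg_relation_size('my_table'))       AS main_fork_only,
       pg_size_pretty(pg_table_size('my_table'))          AS heap_toast_fsm_vm,
       pg_size_pretty(pg_indexes_size('my_table'))        AS indexes_only,
       pg_size_pretty(pg_total_relation_size('my_table')) AS table_plus_indexes;
```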

This is a good time to introduce an interesting extension in this regard.

pgstattuple

If you are a little more into table sizes and how PostgreSQL stores data, you might have come across this extension. After installing it, run
select * from pgstattuple('table-name');
and it gives you several size-related metrics, the first of which is table_len. This table_len is the same as the output of \dt+.

Which one is for you?

Now that you know about the different size functions, let's find out which suits your purpose best.

Are you trying to find the bloat size in your table?

pgstattuple is a good place to get that. It gives you the number of dead tuples, the total size of dead tuples, free space, etc., so it gives a clearer picture of your table's actual and bloat size. There is also a lighter version called pgstattuple_approx; with fewer locks and less accuracy, it can serve well when you are not bothered about the last MB.

Are you interested in only finding the size of an index?

An interesting thing to remember is that PostgreSQL indexes are also stored as relations, so you may very well use pg_relation_size or pg_table_size with the index name as the argument to get the size of the index alone.

Additionally, pgstattuple has a pgstatindex function for this, so you may also use select * from pgstatindex('index-name'). The differences between them are the same as explained above for the corresponding relation functions.

Are you trying to find out the size of only the toast table(s) associated with your table?

If you know the name of the TOAST table you are interested in, you may use any of the above-mentioned functions as required. If you are struggling to find the associated TOAST table, try this,
select relname from pg_class where oid = (select reltoastrelid from pg_class where relname = 'table-name');
As a shortcut, the name of your TOAST table has the format pg_toast.pg_toast_<oid_of_parent_table>.

Have fun with all your size endeavors!

Friday 1 March 2019

Using parallelism for queries from PL functions in PostgreSQL 10

Intra-query parallelism was introduced in PostgreSQL 9.6. The benefits of parallel scans and joins were much talked about, and significant improvements in the benchmark queries at higher scale factors were highlighted. However, one area remained devoid of the benefits -- queries from procedural language functions. Precisely: if you fire a query from a PL/pgSQL function, it cannot use parallel scans or joins for that query, even though the query is capable of using them otherwise. Have a look at an example yourself,

-- creating and populating the table
create table foo (i int, j int);
insert into foo values (generate_series(1,500), generate_series(1,500));
-- for experimental purposes we are forcing parallelism by setting relevant parameters
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
alter table foo set (parallel_workers = 4);
set max_parallel_workers_per_gather = 4;
-- executing the query as an SQL statement
 explain analyse select * from foo where i <= 150; 
 Gather  (cost=0.00..4.56 ...) (actual time=0.217..5.614 ...)
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Seq Scan on foo  (cost=0.00..4.56 ...) (actual time=0.004..0.018 ...)
         Filter: (i <= 150)
         Rows Removed by Filter: 70
-- create a PL/pgSQL function that fires the same query
create or replace function total ()
returns integer as $total$
declare total integer;
begin
     select count(*) into total from foo where i <= 150;
     return total;
end;
$total$
language plpgsql;

-- executing the query from a PLpgSQL function in v 9.6
explain analyse select total(); 
Query Text: SELECT count(*) FROM foo where i <=150
Aggregate  (cost=9.25..9.26 ...)
  ->  Seq Scan on foo  (cost=0.00..8.00 ...)
Query Text: explain analyse select total();
Result  (cost=0.00..0.26 ...)
To your relief, the feature was added in version 10. Have a look,

-- executing the query from a PLpgSQL function in v10
explain analyse select total(); 
Query Text: SELECT count(*) FROM foo where i <=150
Finalize Aggregate  (cost=4.68..4.69 ...)
  ->  Gather  (cost=4.66..4.67 ...)
        Workers Planned: 4
        ->  Partial Aggregate  (cost=4.66..4.67 ...)
              ->  Parallel Seq Scan on foo  (cost=0.00..4.56 ...)
                    Filter: (i <= 150)
This extends the utility of parallel query to more realistic environments wherein queries are fired through functions and are not simple SQL statements.


Thursday 1 November 2018

My experience at PGConf Europe 2018

This year was my first time at PGConf Europe; like many other firsts it was special, hence the blog.

Let's start with some basics: PostgreSQL conferences are held on a somewhat regional basis. There are many of them -- PGConf India, PGConf USA, PGConf Europe, PGConf Asia -- and then there are other one-day events called PGDays. Coming back to PGConf Europe 2018, it was organised from 23-26 October at the Lisbon Marriott, Lisbon.

My talk, 'Parallel Query in PG: how not to (mis)use it?', was scheduled in the first slot of the last day. So I had enough time to analyse and study the audience and prepare accordingly. But first things first...

The conference started with a one-day training session on 22 October; one has to buy separate tickets for the training and the conference. You get free registration for the conference only if you are a speaker. I was not part of the training session, hence will not discuss it. This was my day to rest and try Portuguese cuisine.

The next day was the start of the conference. It was opened by Magnus Hagander covering the logistics and introducing us to the conference halls, etc.; must say it was an entertaining start. Next was the keynote by Paul Ramsey, which was my first comprehensive introduction to PostGIS. Further, there was a nice snack buffet arranged in the lobby, and this was my time to get to know more people, the most exciting part of any conference. I happened to catch Tom Lane!

Henceforth, I was forced to take some difficult decisions, like which talk to attend, since there were three parallel sessions going on. There was such a variety of areas covered in the conference, and most of them had amazing presentations, that it made me greedy and hate the idea of parallel sessions.

To keep the discussion short, I enjoyed being exposed to some new areas and uses of Postgres: the challenges of using Postgres on cloud, multi-column indexes, pluggable storage, benchmarking, efficient query planning in the latest PG, and new and old features of Postgres. I cannot go ahead without a couple of sentences about zheap -- the project I am currently part of. People were excited about it and there were all sorts of curiosities around it.

The conference ended with a set of sponsor talks; I liked the one by Marc Linster -- not because it was EDB's sponsored talk, but because it was a truly awesome talk covering the journey of Postgres over the last two decades. The final closing talk was yet another entertainment by Magnus and Dave Page, with Tomas Vondra in a special appearance as Slony, though I could not help seeing him as one GIANT BLUE elephant. It was really scary! Finally, the speakers were given a gift and there was one big group selfie.

To end the blog: it was a great experience to find the faces behind the names you see every now and then on the hackers list. The amazing food and nice weather of Lisbon added to the overall fun.

P.S. Here are the slides of my talk.

Wednesday 31 October 2018

Using parallel sequential scan in PostgreSQL

Parallel sequential scan is the first parallel access method in PostgreSQL, introduced in version 9.6. The committer of this feature and my colleague at EnterpriseDB, Robert Haas, wrote an awesome blog on it, and there is another great blog by another PostgreSQL committer and colleague, Amit Kapila. Both blogs explain this access method, its design, usage, and related parameters.

Still, I could not help noticing curiosities around the usage of this access method. Every now and then I would see a complaint that parallel sequential scan is not getting selected, or that it degrades the performance of a query. So I decided to write this blog to cover more practical scenarios, focusing specifically on its less-talked-about aspect -- where parallel sequential scan would (or should) not improve performance.

Before diving into the details of parallel SeqScan, let's first understand the basic infrastructure and terminology in PostgreSQL. The processes that run in parallel and scan the tuples of a relation are called parallel workers, or workers for short. One special worker, the leader, coordinates and collects the output of the scan from each of the workers. The leader may or may not participate in scanning the relation, depending on its load in the dividing and combining work. End users can control the leader's involvement with the boolean GUC parameter parallel_leader_participation.

Now, let's understand the concept of parallel scan in PostgreSQL by a simple example.
  • Let there be a table T (a int, b int) containing 100 tuples
  • Let's say we have two workers and one leader
  • The cost of scanning one tuple is 10
  • The cost of communicating a tuple from worker to leader is 20
  • The cost of dividing the tuples among workers is 30
  • For simplicity, assume the leader gives 50 tuples to each worker
Now, let's analyse whether the parallel scan will be faster than the non-parallel scan,

Cost of SeqScan = 10*100 = 1000
Cost of Parallel SeqScan = 30 + (50 * 10)  + (50 * 20) * 2 = 2530

Here we can see that although the cost of scanning the tuples is halved, the cost of combining the results is enough to make the overall cost of parallel SeqScan higher than that of non-parallel SeqScan.

Now, say we want to list only the tuples with a > 80, and there are only 20 (say) such tuples. The cost of SeqScan remains the same, but the cost of parallel SeqScan becomes,

Cost of Parallel SeqScan = 30 + (50 * 10) + (10 * 20) * 2 = 930

Hence, parallel SeqScan is likely to improve the performance of queries that scan a large amount of data of which only a few tuples satisfy the selection criteria. To generalise this,

Cost of SeqScan = cost of scanning one tuple * number of tuples
Cost of Parallel SeqScan = cost of dividing the work among workers
                         + cost of combining the work from all workers
                         + cost of the scan done by a worker (the workers scan in parallel)

Let's dive into it a bit more,

The cost of dividing the work among workers is fairly constant for a given relation size.
The cost of combining the work from workers = the cost of communicating the selected tuples from each worker to the leader.
The cost of work done by a worker = the cost of scanning a tuple * the number of tuples that worker received.

Now we can see that the cost of combining the work depends on the number of tuples returned by each worker. For queries where all or almost all of the tuples make it into the final result, we end up paying more than in the non-parallel flavour: first to scan each tuple, and then again to communicate it into the final result.

In PostgreSQL, the cost of dividing the work among workers is given by parallel_setup_cost, the cost of communicating a tuple from worker to leader by parallel_tuple_cost, and the number of workers is upper-bounded by the GUC max_parallel_workers_per_gather.

So, if you are on a system with multiple high-frequency processors, lowering parallel_setup_cost and parallel_tuple_cost will help in the selection of parallel scans. If there are not many processes running in parallel, increasing max_parallel_workers_per_gather can leverage more parallel workers to improve query performance. Another point to note is that the number of workers is further capped by max_worker_processes.
