This blog post is a summary of a production issue I had to investigate some time ago with people from Oslandia. Since it's quite unusual, I wanted to share it along with the methodology I used, in case it can help anyone running into the same kind of problem. It's also a good opportunity to say that upgrading to a newer PostgreSQL version is almost always a good idea.
The problem
The initial performance issue report contained enough information to know that something strange was happening.
The database is PostgreSQL 9.3.5 (yes, missing some minor version updates), and obviously quite a few major versions behind. The configuration also had some quite unusual settings. The most relevant hardware and settings are:
Server:

- CPU: 40 cores, 80 threads with hyperthreading enabled
- RAM: 128 GB

PostgreSQL:

- `shared_buffers`: 16 GB
- `max_connections`: 1500
The high `shared_buffers`, especially given the quite old PostgreSQL version, is a good candidate for further investigation. The `max_connections` is also quite high, but unfortunately the software vendor claims that using a connection pooler isn't supported, so most of the connections are idle. This isn't great because it implies some overhead to acquire a snapshot, but there are enough CPUs to handle quite a lot of connections.
The main problem was that the same queries could sometimes be dramatically slower. The following trivial example was provided:
So while the "good" timing is already a little bit slow (though with 1500 connections that's not surprising), the "bad" timing is more than 100x slower, for a very simple query.
Another example of a trivial query on production data was provided, but with some more information. Here's an anonymized version:
There was also quite some instrumentation data on the O/S side, showing that neither the disk, CPU nor RAM were exhausted, and no interesting message in `dmesg` or any system log.
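The exact instrumentation output isn't reproduced here, but it was the usual kind of system-level check; roughly along these lines (the commands below are an illustration, not necessarily the exact ones used):

```bash
# quick look at disk, CPU and memory pressure, plus recent kernel messages
iostat -x 1 5        # per-device utilization and latency
vmstat 1 5           # run queue, memory, swap and CPU usage
free -m              # overall memory usage
dmesg | tail -n 50   # OOM killer, I/O errors, ...
```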
What do we know?
For the first query, we see that the inner index scan average time increases from 0.001 ms to 0.6 ms:
With a high `shared_buffers` setting and an old PostgreSQL version, a common issue is a slowdown when the dataset is larger than the `shared_buffers`, due to the clock-sweep algorithm used to evict buffers.
However, the second query shows that the same thing is happening while all the blocks are in `shared_buffers`. So this cannot be a buffer eviction problem caused by a too-high `shared_buffers` setting, or any disk latency issue.
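As a side note, a simple way to check that a query only hits blocks already in `shared_buffers` is `EXPLAIN (ANALYZE, BUFFERS)`, available in 9.3: if the buffer counters only show "shared hit" and no "read", nothing came from disk. A minimal illustration (the table and query are made up for the example):

```bash
psql -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM some_table WHERE id = 42;"
# look for lines like "Buffers: shared hit=..." with no "read=..." part
```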
While some PostgreSQL configuration settings could be improved, none of them seems to explain this exact behavior. It's likely that modifying them would improve the situation, but we need more information to understand what's happening here exactly and avoid any further performance issue.
Any wild guess?
Since the simple explanations have been discarded, it’s necessary to think about lower level explanations.
If you have followed the enhancements in recent PostgreSQL versions, you have probably noticed quite a few optimizations around scalability and locking. If you want more information, there are plenty of blog entries about them, for instance this great article.
On the kernel side, given the high number of connections, the most likely explanation is TLB exhaustion.
In any case, in order to confirm any theory we need to use very specific tools.
Deeper analysis: TLB exhaustion
Without going into too much detail, you need to know that each process has an area of kernel memory used to store the page table entries (the PTE), which map the virtual addresses the process uses to the real physical addresses in RAM. This area is usually not big, because a process usually doesn't access gigabytes of data in RAM. But since PostgreSQL relies on multiple processes accessing a big chunk of shared memory, each process will have an entry for every 4 kB page (the default page size) of the shared buffers it has accessed. So you can end up with quite a lot of memory used for the PTE, and even with overall mappings that address way more than the total physical memory available on the server.
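As a back-of-the-envelope check, assuming the usual 8 bytes per page table entry on x86-64 (my assumption, not a figure from the original investigation):

```bash
# 16 GB of shared_buffers, 4 kB pages, 8 bytes per entry, result in MB
echo $(( 16 * 1024 * 1024 * 1024 / 4096 * 8 / 1024 / 1024 ))
# => 32, i.e. roughly 32 MB of PTE for a backend that touched every buffer
```

Multiplied by 1500 connections, that's in the same ballpark as the 45 GB measured on the production server below.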
You can find the size of the PTE at the O/S level by looking at the `VmPTE` entry in the process status. You can also check the `RssShmem` entry to know how many shared memory pages are mapped. For instance:
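The original output isn't reproduced here, but both fields can be read directly from /proc (the PID is a placeholder for an actual backend process, and `RssShmem` is only present on reasonably recent kernels):

```bash
# PTE size and amount of resident shared memory for a given backend
grep -E '^(VmPTE|RssShmem)' /proc/<backend_pid>/status
```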
This process didn't access a lot of buffers, so the PTE is small. If we try with a process which has accessed all the buffers of an 8 GB `shared_buffers`:
It’s 16 MB used for the PTE! Multiplying that with the number of connections, and you end up with gigabytes of memory used for the PTE. Obviously, this wont’ fit in the TLB. As a consequence, the processes will have a lot of TLB miss every time they need to access a page in memory, drastically increasing the latency.
On the system that had the performance issue, with 16 GB of shared buffers and 1500 long-lived connections, the total memory used by the combined PTEs was around 45 GB! A rough approximation can be done with this small script:
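The exact script isn't shown here, but a possible version, summing the `VmPTE` entries of every postgres process, could look like this:

```bash
# sum the VmPTE entries (reported in kB) of all postgres processes, print the total in MB
for pid in $(pgrep -x postgres); do
    grep '^VmPTE:' /proc/$pid/status 2>/dev/null
done | awk '{total += $2} END {printf "Total VmPTE: %d MB\n", total / 1024}'
```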
NOTE: This will compute the memory used for the PTE of all postgres processes. If you have multiple clusters on the same machine and you want per-cluster information, you need to adapt this command to only match the processes whose ppid is your cluster's postmaster pid.
This is evidently the culprit here. Just to be sure, let's look at what `perf` shows when the performance slowdown occurs, and when it doesn't.
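The exact perf invocation isn't part of the original report; a typical way to capture this kind of system-wide profile with call graphs would be something like:

```bash
# sample all CPUs for 30 seconds with call graphs, then display the report
perf record -a -g -- sleep 30
perf report --sort symbol
```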
Here are the functions consuming more than 2% of CPU time, as reported by perf when everything is fine:
Nothing really interesting here; the system is not saturated at all. Now, when the problem occurs:
We can see that `s_lock`, the postgres function that waits on a spinlock, consumes almost 9% of the CPU time. But this is PostgreSQL 9.3, and lightweight locks (transient internal locks) were still implemented using spinlocks (they now use atomic operations).
If we look a little bit more at the `s_lock` calls:
99% of the `s_lock` calls are indeed due to lightweight locks. This indicates a general slowdown and high contention. But this is just a consequence of the real problem, the next top consumer function.
With almost 5% of the CPU time, `smaps_pte_entry`, a kernel function doing the translation for a single entry, shows the real problem. This function is supposed to be extremely fast, and shouldn't even appear in a perf record! It means that very often, when a process wants to access a page in RAM, it has to wait to get its real address. And waiting for an address translation means a lot of pipeline stalls: processors have longer and longer pipelines, and those stalls totally ruin the benefits of this kind of architecture. As a result, a good proportion of the CPU time is simply wasted waiting for addresses. This certainly explains the extreme slowdown, and the lack of higher-level counters able to account for it.
The solution
Multiple solutions are possible to solve this problem.
The usual answer is to ask PostgreSQL to allocate the `shared_buffers` in huge pages. Indeed, with 2 MB pages instead of 4 kB ones, the memory needed for the PTE automatically drops by a factor of 512. This would be an easy and huge win. Unfortunately, huge pages support was only added in version 9.4, and upgrading wasn't even an option, since the software vendor claims that only version 9.3 is supported.
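For reference, on 9.4 and later the huge pages setup roughly boils down to the following (the number of pages is illustrative and has to be sized so that shared_buffers, plus some extra, actually fits):

```bash
# reserve 2 MB huge pages for a 16 GB shared_buffers (16 GB / 2 MB = 8192, plus some slack)
sysctl -w vm.nr_hugepages=8500

# then in postgresql.conf:
#   huge_pages = try    # or 'on' to refuse to start without huge pages
```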
Another way to reduce the PTE size is to reduce the number of connections, which is quite high here and would probably also improve overall performance. Unfortunately, again, the vendor claims that connection poolers aren't supported, and the customer needed that many connections.
So the only remaining solution was to reduce `shared_buffers`. After some tries, the highest value that avoided the extreme slowdown was 4 GB. Fortunately, PostgreSQL was still able to deliver quite good performance with this smaller dedicated cache.
If software vendors read this post, please understand that when people ask for compatibility with newer PostgreSQL versions, or with a connection pooler, they have very good reasons. There are usually very few behavior changes in newer versions, and they're all documented!