work_mem?
The work memory, or work_mem, is one of the most complicated settings to configure. It can be used for various purposes: it’s mainly used when sorting data or building hash tables, but it can also be used by set-returning functions backed by a tuplestore, like generate_series() for instance. Moreover, each node of a query can use this amount of memory. Set this parameter too low and a lot of temporary files will be used; set it too high and you may encounter errors, or even an Out Of Memory (OOM) kill, depending on your OS configuration.
I’ll focus here on the memory needed when sorting data, to help you understand how much memory is required for PostgreSQL to run a sort operation entirely in memory.
Truth is out
I sometimes hear people assume that there is a correlation between the size of the temporary files generated and the amount of memory that would have been needed to perform the same sort entirely in memory. It’s unfortunately wrong: you can’t make any assumption about the value of work_mem based only on the size of a sort’s temporary file.
That’s because when the data to be sorted don’t fit in the allowed memory, PostgreSQL falls back to different algorithms, external sort or external merge, which have a totally different space usage. On top of that, with the external merge algorithm a smaller temporary file can be reused multiple times, leading to less disk usage and better performance. If you want more details on this, the relevant source code lives in tuplesort.c and logtape.c. As a brief introduction, the header of tuplesort.c says:
[…] This module handles sorting of heap tuples, index tuples, or single Datums (and could easily support other kinds of sortable objects, if necessary). It works efficiently for both small and large amounts of data. Small amounts are sorted in-memory using qsort(). Large amounts are sorted using temporary files and a standard external sort algorithm.
See Knuth, volume 3, for more than you want to know about the external sorting algorithm. We divide the input into sorted runs using replacement selection, in the form of a priority tree implemented as a heap (essentially his Algorithm 5.2.3H), then merge the runs using polyphase merge, Knuth’s Algorithm 5.4.2D. The logical “tapes” used by Algorithm D are implemented by logtape.c, which avoids space wastage by recycling disk space as soon as each block is read from its “tape”. […]
NOTE: It’s an extract from the 9.5 version of this README. External sorts now use a quicksort algorithm rather than replacement selection.
It can be easily verified. First, let’s create a table and add some data:
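Something along these lines should do (this is not necessarily the exact data set from the original test, just an assumed equivalent: a table with an integer id and a short, variable-length val column):

```sql
-- Assumed schema and data set: 100,000 narrow rows with a variable-length column.
-- Note: no explicit ANALYZE is run here, as mentioned later in the article.
CREATE TABLE sort_test (id integer, val text);
INSERT INTO sort_test
    SELECT i, 'line ' || i
    FROM generate_series(1, 100000) i;
```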
To sort all these rows, 7813kB is needed (more details later). Let’s see the EXPLAIN ANALYZE with work_mem set to 7813kB and 7812kB:
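The statements were presumably something like this (the exact query and table name are assumptions; any sort of the whole table behaves the same way):

```sql
SET work_mem TO '7813kB';
EXPLAIN ANALYZE SELECT * FROM sort_test ORDER BY id;

SET work_mem TO '7812kB';
EXPLAIN ANALYZE SELECT * FROM sort_test ORDER BY id;
```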
So, 7813kB are needed, but if we lack only 1kB, the temporary file size is 2432kB.
You can also activate the trace_sort parameter to have some more information:
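For example (trace_sort is available in standard builds; raising client_min_messages lets you see the LOG messages directly in your session):

```sql
SET trace_sort TO on;
SET client_min_messages TO log;   -- show the sort's LOG messages in the session
SET work_mem TO '7812kB';
EXPLAIN ANALYZE SELECT * FROM sort_test ORDER BY id;
```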
With these data, 28 tapes are used.
So, how do I know how much work_mem is needed?
First, you need to know that all the data will be allocated through PostgreSQL’s allocator, AllocSet. If you want to know more about it, I recommend reading the excellent articles Tomas Vondra wrote on this topic: Introduction to memory contexts, Allocation set internals and palloc overhead examples.
The needed information here is that the allocator adds some overhead: each allocated chunk has a fixed overhead of 16B, and the requested size (not counting that 16B) is rounded up to the next 2^N size. So if you ask for 33B, 80B will actually be used: 16B of overhead plus 64B, the next 2^N size.
The work_mem will be used to store every row, and some more information.
For each row to sort, a fixed amount of 24B of memory is used. This is the size of a SortTuple, the structure that is actually sorted. These entries are allocated as a single block, so the overhead is only 24B in total (the fixed 16B, plus 8B to reach the closest 2^N size).
The first part of the formula is therefore (n being the number of tuples sorted):
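It was presumably something like this (my reconstruction):

```
first_part = 24B * n
```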
Then, you have to know that PostgreSQL preallocates this array for 1024 rows (1024 × 24B, about 24kB), so you’ll never see a memory consumption of only 2 or 3kB.
Each SortTuple then references a MinimalTuple, which is basically a tuple without the system metadata (xmin, xmax…), or an IndexTuple if the tuples come from an index scan. This structure is allocated separately for each tuple, so the overhead can be pretty big. Both structures are 6B long, but need to be aligned, which represents 16B per tuple.
These structures also contain the entire row; the size depends on the table definition and on the content of the variable-length columns.
The second part of the formula is therefore:
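Presumably along these lines (my reconstruction; round_up_pow2() is just my shorthand for the power-of-two rounding described above, and data_size is the average size of a row’s data):

```
second_part = n * (16B + round_up_pow2(16B + data_size))
```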
We can now estimate how much memory is needed:
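Putting the two parts together (still a sketch, ignoring the small fixed overhead of the array itself):

```
total ≈ 24B * n  +  n * (16B + round_up_pow2(16B + data_size))
```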
Testing the formula
Let’s check this on our table. It contains two fields, id and val. id is an integer, so it uses 4B. The val column is variable length. First, let’s figure out the estimated average row size:
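With the planner statistics, this could look like the following (assuming the table from the earlier example and that autovacuum has already analyzed it):

```sql
-- Sum of the per-column average widths, as estimated by ANALYZE.
SELECT sum(avg_width) AS estimated_row_size
  FROM pg_stats
 WHERE tablename = 'sort_test';
```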
Just to be sure, as I didn’t do any ANALYZE on the table:
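A direct measurement could look like this (it scans the whole table, so it’s slower than using the statistics):

```sql
-- Average size of the actual data, column by column.
SELECT avg(pg_column_size(id) + pg_column_size(val)) AS measured_row_size
  FROM sort_test;
```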
So, the average row size is approximately 14B. PostgreSQL showed the same estimation in the previous EXPLAIN plan: the reported width was 14.
NOTE: It’s better to rely on pg_statistic, because it’s faster and doesn’t consume resources. Also, if you have large fields, they’ll be TOASTed, and only a pointer will be stored in work_mem, not the entire field.
We add the 16B overhead for the MinimalTuple structure and get 30B, which leads to an allocated space of 32B.
Finally, the table contains 100,000 tuples, so we can now compute the memory needed:
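Here’s a back-of-the-envelope version of that computation (my own reconstruction using the figures above; the last line is a rough allowance for the over-allocation of the SortTuple array as it grows by doubling, plus the remaining allocator bookkeeping):

```
  100,000 * 24B           -- SortTuple entries                      = 2,400,000B
+ 100,000 * (16B + 32B)   -- chunk overhead + rounded MinimalTuple  = 4,800,000B
+ array growth and allocator bookkeeping                            ≈   800,000B
--------------------------------------------------------------------------------
                                                                    ≈ 8,000,000B ≈ 7813kB
```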
We now find the 7813kB I announced earlier!
This is a very simple example. If you sort only a subset of the rows, the estimate can be too high or too low, since the rows you actually sort may not match the average size.
Also, note that if the data length of a row exceeds 8kB (not counting the TOASTed data), the allocated size won’t be rounded up to the next 2^N size.
Wait, what about NULLs?
Yes, this formula was way too simple…
The formula assumes you don’t have any NULL fields, so it computes the maximum estimated memory needed.
A NULL field won’t consume space for data, obviously, but will add a bit in a bitmap stored in the MinimalTuple.
If at least one field of a tuple is NULL, the bitmap will be created. Its size is:
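My reconstruction of it, based on the usual one-bit-per-attribute null bitmap, rounded up to whole bytes:

```
bitmap_size = ceil(number_of_attributes / 8) bytes
```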
So, if a tuple has 3 integer fields, and two of them are NULL, the data size will not be 16B but:
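Roughly (my arithmetic, starting from the 16B figure above and ignoring alignment padding):

```
16B - 2 * 4B + ceil(3 / 8)B = 8B + 1B = 9B
```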
You can then try to refine the estimate using the NULL fraction statistics of each attribute, available in pg_stats (the null_frac column).
For the lazy ones
Here’s a simple query that will do the maths for you. It assumes:
- only fields from a single table are sorted
- there are no NULLs
- all the rows will be sorted
- statistics are accurate
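A sketch of such a query (my reconstruction, not necessarily the exact query from the original; it assumes the sort_test table used above and applies the formula described earlier):

```sql
-- Estimated work_mem needed to sort the whole table in memory:
--   reltuples * 24B for the SortTuple array, plus one palloc chunk per tuple
--   (16B chunk header + (row data + 16B MinimalTuple structure) rounded up
--    to the next power of two).
SELECT pg_size_pretty(
           c.reltuples::bigint * 24
         + c.reltuples::bigint
           * (16 + power(2, ceil(log(2, s.row_size + 16))))
       ) AS estimated_work_mem
  FROM pg_class c
  JOIN (SELECT tablename, sum(avg_width) AS row_size
          FROM pg_stats
         WHERE tablename = 'sort_test'
         GROUP BY tablename) s
    ON s.tablename = c.relname
 WHERE c.relname = 'sort_test';
```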
Conclusion
Now, you know the basics to estimate the amount of memory you need to sort your data.
A minimal example was presented here for a better understanding; things get really complicated when you sort not just all the rows of a single table but the result of joins and filters.
I hope you’ll have fun tuning work_mem on your favorite cluster. But don’t forget, work_mem is used for more than just sorting tuples!