I quite often hear people being disappointed by how much space PostgreSQL wastes for each row it stores. I'll try to show here some tricks to minimize this effect and allow more efficient storage.
If you don't have tables with more than a few hundred million rows, it's likely that you haven't had an issue with this.
For each row stored, PostgreSQL will store additional data for its own needs. This is documented here. The documentation says:
| Field       | Type            | Length  | Description                                           |
|-------------|-----------------|---------|-------------------------------------------------------|
| t_xmin      | TransactionId   | 4 bytes | insert XID stamp                                      |
| t_xmax      | TransactionId   | 4 bytes | delete XID stamp                                      |
| t_cid       | CommandId       | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac) |
| t_xvac      | TransactionId   | 4 bytes | XID for VACUUM operation moving a row version         |
| t_ctid      | ItemPointerData | 6 bytes | current TID of this or newer row version              |
| t_infomask2 | uint16          | 2 bytes | number of attributes, plus various flag bits          |
| t_infomask  | uint16          | 2 bytes | various flag bits                                     |
| t_hoff      | uint8           | 1 byte  | offset to user data                                   |
That's 23 bytes on most architectures (t_cid and t_xvac overlap, so you only store one of them).
You can see some of these fields as hidden columns present on any table, by adding them in the SELECT part of a query, or look for negative attribute numbers in the pg_attribute catalog:
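For instance (a quick sketch, assuming a table named t):

```sql
-- Hidden system columns can be selected explicitly:
SELECT xmin, xmax, cmin, cmax, ctid, * FROM t LIMIT 1;

-- Or listed via their negative attribute numbers:
SELECT attname, attnum, attlen
FROM pg_attribute
WHERE attrelid = 't'::regclass AND attnum < 0;
```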
If you compare to the previous table, you can see that not all of these columns are stored on disk. Obviously, PostgreSQL doesn't store the table's oid in each row: it's added afterwards, while constructing a tuple.
How costly is it?
As the overhead is fixed, it becomes more and more negligible as the row size grows. If you only store a single int column (4 bytes), each row will need 23 bytes of overhead plus 4 bytes of data, i.e. 27 bytes in total.
So that's roughly 85% overhead, pretty horrible.
On the other hand, if you store 5 integer, 3 bigint and 2 text columns (let's say ~80B on average each), each row will need 5×4 + 3×8 + 2×80 = 204 bytes of data, plus 23 bytes of overhead.
That's "only" around 10% overhead.
So, how do we minimize this overhead?
The idea is to store the same data with fewer records. How to do that? By aggregating data in arrays. The more records you put in a single array, the less overhead you have. And if you aggregate enough data, you can benefit from transparent compression thanks to the TOAST mechanism.
Let's try with a table containing a single integer column and 10M rows:
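A minimal sketch that should reproduce the test (the table name t is an assumption):

```sql
CREATE TABLE t (id integer);

INSERT INTO t
SELECT generate_series(1, 10000000);

-- Check the resulting on-disk size
SELECT pg_size_pretty(pg_table_size('t'));
```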
The user data should only need 10M * 4B, i.e. around 38MB, while this table consumes 348MB. Inserting the data takes around 23 seconds.
NOTE: If you do the maths, you'll find out that the overhead is slightly more than 32B per row, not 23B. This is because each block also has some overhead, plus NULL handling and alignment issues. If you want more information on this, I recommend watching this presentation.
Let’s compare with aggregated versions of the same data:
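Something like this, grouping roughly 5 consecutive values per array (the table name t_agg is an assumption):

```sql
CREATE TABLE t_agg (ids integer[]);

-- Integer division groups roughly 5 consecutive values together
INSERT INTO t_agg
SELECT array_agg(i)
FROM generate_series(1, 10000000) i
GROUP BY i / 5;

SELECT pg_size_pretty(pg_table_size('t_agg'));
```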
This will insert 5 elements per row. I’ve done the same test with 20, 100, 200 and 1000 elements per row. Results are below:
NOTE: The size for 1000 elements per row is a little higher than for the lower values. This is because it's the only one big enough to be TOAST-ed, but not big enough to be compressed, so we see a little TOAST overhead here.
So far so good, we can see quite good improvements, both in size and in INSERT time, even for very small arrays. Let's see the impact on retrieving rows. I'll try to retrieve all the rows, then only one row with an index scan (for the tests, I've used EXPLAIN ANALYZE to minimize the time spent representing the data in psql):
To properly index this array, we need a GIN index. To get all the values back from the aggregated data, we need to unnest() the arrays, and be a little more creative to get a single record:
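For instance (index and table names are assumptions):

```sql
-- GIN index on the array column
CREATE INDEX t_agg_idx ON t_agg USING gin (ids);

-- Get all the values back
SELECT unnest(ids) FROM t_agg;

-- Get a single value: find the containing row with the GIN index,
-- then filter the unnested elements
SELECT v
FROM (SELECT unnest(ids) AS v
      FROM t_agg
      WHERE ids @> ARRAY[42]) s
WHERE v = 42;
```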
Here’s the chart comparing index creation time and index size:
The GIN index is a little more than twice the size of the btree index; if I add the table size, the total size is almost the same as without aggregation. That's not a big issue since this example is naive: we'll see later how to avoid using a GIN index to keep the total size low. The index is also much slower to build, meaning that INSERTs will be slower too.
Here’s the chart comparing the time to retrieve all rows and a single row:
Getting all the rows is probably not an interesting example, but it's worth noting that as soon as the array contains enough elements, it starts to be faster than the same operation on the original table. We also see that getting only one element is much faster than with the btree index, thanks to GIN efficiency. It's not tested here, but since only btree indexes are sorted, if you need to get a lot of data sorted, using a GIN index will require an extra sort, which will be way slower than a simple btree index scan.
A more realistic example
Now that we've seen the basics, let's see how to go further: aggregating more than one column, and avoiding the extra disk space (and slowdown at write time) of a GIN index. For this, I'll present how PoWA stores its data.
For each datasource collected, two tables are used: one for the historical, aggregated data, and one for the current data. These tables store data in a custom type instead of plain columns. Let's look at the tables related to pg_stat_statements.
The custom type is basically all the counters present in pg_stat_statements, plus the timestamp associated with this record:
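A simplified sketch of what such a type can look like (only a few of the pg_stat_statements counters are shown, and the names are approximations of the actual PoWA schema):

```sql
CREATE TYPE powa_statements_history_record AS (
    ts timestamp with time zone,
    calls bigint,
    total_time double precision,
    rows bigint,
    shared_blks_hit bigint,
    shared_blks_read bigint
    -- ... and so on for the remaining pg_stat_statements counters
);
```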
The table for current data stores the pg_stat_statements unique identifier (queryid, dbid, userid), and a record of counters:
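A sketch of such a table (names are approximations of the actual PoWA schema):

```sql
CREATE TABLE powa_statements_history_current (
    queryid bigint NOT NULL,
    dbid oid NOT NULL,
    userid oid NOT NULL,
    record powa_statements_history_record NOT NULL
);
```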
The table for aggregated data contains the same unique identifier, an array of records and some special fields:
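A sketch of the aggregated table (names are approximations of the actual PoWA schema):

```sql
CREATE TABLE powa_statements_history (
    queryid bigint NOT NULL,
    dbid oid NOT NULL,
    userid oid NOT NULL,
    coalesce_range tstzrange NOT NULL,
    records powa_statements_history_record[] NOT NULL,
    mins_in_range powa_statements_history_record NOT NULL,
    maxs_in_range powa_statements_history_record NOT NULL
);
```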
We also store the timestamp range (coalesce_range) containing all the aggregated counters in the row, and the minimum and maximum values of each counter in two dedicated records. These extra fields don't consume too much space, and allow very efficient indexing and computation, based on the data access pattern of the related application.
This table is used to know how much resources a query consumed on a given time range. The GiST index won't be too big, since it only indexes two small values per X aggregated counters, and it will efficiently find the rows matching a given queryid and time range.
Then, computing the resources consumed can be done efficiently, since the pg_stat_statements counters are strictly monotonic. The algorithm would be:
- if the row's time range is entirely contained in the asked time range, we only need to compute the delta of the summary records: maxs_in_range.counter - mins_in_range.counter
- if not (which can only concern the two boundary rows for each queryid), we unnest the array, filter out the records that aren't in the asked time range, keep the first and last values, and compute, for each counter, the maximum minus the minimum.
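The first case could be sketched like this (table and column names are approximations of the PoWA schema; the queryid and time range are placeholders):

```sql
-- Rows entirely contained in the asked range: the delta is simply
-- the difference between the per-row summary records
SELECT queryid,
       sum((maxs_in_range).calls - (mins_in_range).calls) AS calls
FROM powa_statements_history
WHERE queryid = 123
  AND coalesce_range <@ tstzrange('2024-01-01', '2024-01-02')
GROUP BY queryid;
```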
NOTE: Actually, the PoWA interface always unnests all the records contained in the asked time interval, since the interface is designed to show the evolution of these counters on a relatively small time range, but with great precision. Fortunately, unnesting the arrays is not that expensive, especially compared to the disk space saved.
And here's the size needed for the aggregated and non-aggregated values. For this, I let PoWA generate 12,331,366 records (configuring a snapshot every 5 seconds for some hours, with the default aggregation of 100 records per row), and used a btree index on (queryid, ((record).ts)) to simulate the index present on the aggregated table:
Pretty efficient, right?
There are some limitations with aggregating records. If you do this, you can't enforce constraints such as foreign keys or unique constraints. This approach is therefore suited to non-relational data, such as counters or metadata.
Using custom types also allows some nice things, like defining custom operators. For instance, the release 3.1.0 of PoWA provides two operators for each custom type defined:
- the - operator, to get the difference between two records
- the / operator, to get the difference per second
You can therefore quite easily write this kind of query:
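For instance, something like this, computing per-second rates between consecutive snapshots (the queryid is a placeholder, and the table and column names are approximations of the PoWA schema):

```sql
-- The custom / operator turns the difference between two consecutive
-- records into per-second rates
SELECT record / lag(record) OVER (ORDER BY (record).ts) AS per_sec
FROM powa_statements_history_current
WHERE queryid = 123;
```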
If you're interested in how to implement such operators, you can look at the PoWA implementation.
You now know the basics needed to work around the per-tuple overhead. Depending on your needs and the specifics of your data, you should find a way to aggregate your data, maybe adding some extra columns, to keep nice performance and save some disk space.