Selectivity estimation error is one of the main causes of bad query plans. It's quite straightforward to compute those estimation errors using EXPLAIN (ANALYZE), either manually or with the help of explain.depesz.com (or other similar tools), but until now there was no tool available to get this information automatically and globally. Version 2 of pg_qualstats fixes that. Many thanks to Oleg Bartunov for the original idea!
Note: If you don't know the pg_qualstats extension, you may want to read my previous article about it.
The problem
There can be many causes for this issue: outdated statistics, complex predicates, non-uniform data… But whatever the reason is, if the optimizer doesn't have an accurate idea of how much data each predicate will filter, the result is the same: a bad query plan, which can lead to longer query execution times.
To illustrate the problem, I'll use a simple test case, deliberately built to fool the optimizer.
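As a rough sketch, it could look like the following, where the table name (pgqs), the number of rows and the exact data distribution are assumptions; the important property is that val1 and val2 each match 50% of the rows on their own, but are never 0 at the same time:

```sql
-- Hypothetical test case: val1 and val2 are mutually exclusive,
-- but each predicate alone matches ~50% of the rows
CREATE TABLE pgqs AS
    SELECT i % 2 AS val1, (i + 1) % 2 AS val2
    FROM generate_series(1, 50000) i;
ANALYZE pgqs;

EXPLAIN SELECT * FROM pgqs WHERE val1 = 0 AND val2 = 0;
```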
Here postgres thinks that the query will emit 12500 tuples, while in reality none will be emitted. If you're wondering how postgres came up with that number, the explanation is simple. When multiple independent clauses are AND-ed (overlapping range predicates can be merged) and no extended statistics are available (see below for more about that), postgres will simply multiply each clause's selectivity. This is done in clauselist_selectivity_simple, in src/backend/optimizer/path/clausesel.c.
In this case, each predicate will independently filter approximately 50% of the table, as we can see in the pg_stats view.
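For reference, assuming the pgqs table from the sketch above, those per-column statistics can be checked with a query like this:

```sql
-- Per-column statistics gathered by ANALYZE (table name assumed)
SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'pgqs';
```

Both columns should report two distinct values, each with a frequency close to 0.5.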
So when using both clauses, the estimate is 25% of the table, since postgres
doesn’t know by default that both values are mutually exclusive.
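With a 50000-row table and two predicates each estimated to match 50% of it, that's 50000 × 0.5 × 0.5 = 12500 estimated rows, the figure seen above.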
Continuing with this artificial test case, let's see what happens if we add a join on top of it. For instance, joining the table to itself on the val1 column only. For clarity, I'll use t1 for the table on which I'm applying the mutually exclusive predicates, and t2 for the joined table.
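Again as a rough sketch, assuming the pgqs table from above, such a query could look like:

```sql
-- Hypothetical self-join: t1 gets the mutually exclusive predicates,
-- t2 is only joined on val1
EXPLAIN SELECT *
FROM pgqs t1
JOIN pgqs t2 ON t2.val1 = t1.val1
WHERE t1.val1 = 0 AND t1.val2 = 0;
```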
Postgres thinks that this join will emit 313 million rows, while obviously no rows will be emitted. And this is a good example of how bad assumptions can lead to an inefficient plan.
Here Postgres can deduce that the val1 = 0 predicate can also be applied to t2. So how should it join two relations, one that should emit 25000 tuples and the other that should emit 12500 tuples, with no index available? A nested loop is not a bad choice, as neither relation is really big. As no index is available, postgres also chooses to materialize the inner relation, meaning storing it in memory, to make the join more efficient. As it tries to limit memory consumption as much as possible, the smallest relation is materialized, and that's the mistake here.
Indeed, postgres will read the whole table twice: once to get every row matching the val1 = 0 predicate for the outer relation, and once to find all the rows to be materialized. If the opposite had been done, as it probably would have been if the estimates were more realistic, the table would only have been read once.
In this case, as the dataset isn't big and quite artificial, a better plan wouldn't drastically change the execution time. But keep in mind that with real production environments, it could mean choosing a nested loop assuming that there'll be only a couple of rows to loop on, while in reality the backend will spend minutes or even hours looping over millions of rows, and another plan would have been orders of magnitude quicker.
Detecting the problem
pg_qualstats 2 will now compute the selectivity estimation error, both as a ratio and a raw number, and will keep track for each predicate of the minimum, maximum and mean values, along with the standard deviation. This makes it quite simple to detect problematic quals!
After executing the last query, here's what the pg_qualstats view will return:
NOTE: qualid is an identifier shared when multiple quals are AND-ed (NULL otherwise), and qualnodeid is a per-qual identifier.
We see here that when used alone, the qual pgqs.val = ? doesn't show any selectivity estimation problem, as the ratio (mean_ratio) is very close to 1 and the raw number (mean_num) is quite low. On the other hand, when combined with AND pgqs.val2 = ?, pg_qualstats reports a significant estimation error. That's a very strong sign that those columns are functionally dependent.
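For illustration, a query like the following could list the worst offenders; note that the underlying column names (mean_err_estimate_ratio and mean_err_estimate_num, corresponding to the mean_ratio and mean_num values discussed above) are assumptions based on pg_qualstats 2.x:

```sql
-- List the quals with the largest mean estimation error
-- (column names assumed from pg_qualstats 2.x)
SELECT qualid, qualnodeid, mean_err_estimate_ratio, mean_err_estimate_num
FROM pg_qualstats
ORDER BY mean_err_estimate_ratio DESC NULLS LAST
LIMIT 10;
```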
If for example a qual alone shows issues, it could be a sign of outdated statistics, or that the sample size isn’t big enough.
Also, if you have the pg_stat_statements extension installed, pg_qualstats will give you the query identifier for each predicate. With that and a bit of SQL, you can for instance find the queries with a long average execution time that contain quals whose selectivity estimation is off by a factor of 10 or more.
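A hedged sketch of such a query, assuming the pg_qualstats view exposes queryid and mean_err_estimate_ratio columns, and using pg_stat_statements' mean_exec_time (named mean_time before PostgreSQL 13):

```sql
-- Statements with a long mean execution time containing at least one qual
-- whose selectivity estimate is off by a factor of 10 or more
-- (pg_qualstats column names are assumptions)
SELECT DISTINCT s.queryid, s.mean_exec_time, s.query
FROM pg_stat_statements s
JOIN pg_qualstats q USING (queryid)
WHERE q.mean_err_estimate_ratio >= 10
ORDER BY s.mean_exec_time DESC
LIMIT 10;
```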
Interlude: Extended statistics
If you're wondering how to solve the issue I just explained, the solution is very easy since extended statistics were introduced in PostgreSQL 10, assuming of course that you know that's the root issue. Create extended statistics on the related columns, perform an ANALYZE and you're done!
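For instance, with the table and column names from the earlier sketch:

```sql
-- Build extended statistics on the correlated columns (names assumed from
-- the earlier sketch), then refresh the statistics
CREATE STATISTICS pgqs_val1_val2 ON val1, val2 FROM pgqs;
ANALYZE pgqs;
```

On PostgreSQL 12 and later this also builds multivariate MCV lists, which is what captures the mutual exclusion of this particular test case.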
If you want more details on extended statistics, I recommend looking at the slides from Tomas Vondra’s excellent talk on this subject.
Going further
Tracking the quals in every single query executed is of course quite expensive, and would significantly impact performance for any non-datawarehouse workload. That's why pg_qualstats has an option, pg_qualstats.sample_rate, to sample the queries that will be processed. This setting is by default set to 1 / max_connections, which makes the overhead quite negligible, but don't be surprised if you don't see any qual reported after running only a few queries!
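For instance, to sample every query while testing (assuming here that a value of 1 means 100% of queries and that the setting can be changed with ALTER SYSTEM):

```sql
-- Sample every query while experimenting; revert or tune down for production
ALTER SYSTEM SET pg_qualstats.sample_rate = 1;
SELECT pg_reload_conf();
```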
But if you're instead only interested in the quals that have bad selectivity estimates, for instance to detect this class of problems rather than missing indexes, there are two new options available for that:
- pg_qualstats.min_err_estimate_ratio
- pg_qualstats.min_err_estimate_num
Those options are cumulative and can be changed at any time, and they will limit the quals that pg_qualstats stores to the ones whose selectivity estimation ratio and/or raw number is higher than what you asked for. Although those options will help reduce the performance overhead, they can of course be combined with pg_qualstats.sample_rate if needed.
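For example, to only store quals whose estimate is off by a factor of at least 10 or by at least 1000 rows (the threshold values here are purely illustrative):

```sql
-- Only keep quals with a large estimation error; thresholds are arbitrary
ALTER SYSTEM SET pg_qualstats.min_err_estimate_ratio = 10;
ALTER SYSTEM SET pg_qualstats.min_err_estimate_num = 1000;
SELECT pg_reload_conf();
```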
Conclusion
After introducing the new global index advisor, this article presented a class of problems that DBAs frequently face, and how to detect and solve them.
I believe that these two new features in pg_qualstats will greatly help PostgreSQL database administration. Also, external tools that aim to solve related issues, such as pg_plan_advsr or AQO, could benefit from pg_qualstats, as they could directly get the exact data they need to perform analysis and optimize queries!