Formerly we applied a functional dependency "a => b with dependency degree f" using the formula
P(a,b) = P(a) * [f + (1-f)*P(b)]
This leads to the possibility that the combined selectivity P(a,b) could exceed P(b), which is not ideal. The addition of support for IN and OR clauses (commits 8f321bd16c and ccaa3569f5) would seem to make this more likely, since the user-supplied values in such clauses are not necessarily compatible with the functional dependency.
Mitigate this by using the formula
P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
instead, which guarantees that the combined selectivity is less than each column's individual selectivity. Logically, this is modifies the part of the formula that accounts for dependent rows to handle cases where P(a) > P(b), whilst not changing the second term which accounts for independent rows.
Additionally, this refactors the way that functional dependencies are applied, so now dependencies_clauselist_selectivity() estimates both the implying clauses and the implied clauses for each functional dependency (formerly only the implied clauses were estimated), and now all clauses for each attribute are taken into account (formerly only one clause for each implied attribute was estimated). This removes the previously built-in assumption that only equality clauses will be seen, which is no longer true, and opens up the possibility of applying functional dependencies to more general clauses.
Patch by me, reviewed by Tomas Vondra.
Discussion: https://postgr.es/m/CAEZATCXaNFZyOhR4XXAfkvj1tibRBEjje6ZbXwqWUB_tqbH%3Drw%40mail.gmail.com Discussion: https://postgr.es/m/20200318002946.6dvblukm3cfmgir2%40development
87779aa474 Prevent functional dependency estimates from exceeding column estimates.
src/backend/statistics/dependencies.c | 349 ++++++++++++++++++++++----------
src/test/regress/expected/stats_ext.out | 18 +-
src/test/regress/sql/stats_ext.sql | 8 +-
3 files changed, 266 insertions(+), 109 deletions(-)