Fix replica backward scan race condition

Enterprise / PostgreSQL - Peter Geoghegan [bowt.ie] - 3 August 2020 22:54 UTC

It was possible for the logic used by backward scans (which must reason about concurrent page splits/deletions in its own peculiar way) to become confused when running on a replica. Concurrent replay of a WAL record that describes the second phase of page deletion could cause _bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed to adhere to the same locking protocol that we use on the primary, which is obviously wrong once you consider these two disparate functions together. This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left() from observing inconsistencies between the deletion's target page and its original sibling pages when running on a replica. This is true even though the second phase of page deletion is supposed to work as a single atomic action. Queries running on replicas raised "could not find left sibling of block %u in index %s" can't-happen errors when they went back to their scan's "original" page and observed that the page has not been marked deleted (even though it really was concurrently deleted).

There is no evidence that this actually happened in the real world. The issue came to light during unrelated feature development work. Note that _bt_walk_left() is the only code that cares about the difference between a half-dead page and a fully deleted page that isn't also exclusively used by nbtree VACUUM (unless you include contrib/amcheck code). It seems very likely that backward scans are the only thing that could become confused by the inconsistency. Even amcheck's complex bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target, and right sibling pages in that order before releasing any locks (just like _bt_unlink_halfdead_page()). This is the simplest possible approach. There doesn't seem to be any opportunity to be more clever about lock acquisition in the REDO routine, and it hardly seems worth the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling links with only an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was rejected back in January because it clashed with btree_xlog_unlink_page()'s lax approach to locking pages. It now seems likely that the real problem was with btree_xlog_unlink_page(), not the patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev

9a9db08ae4 Fix replica backward scan race condition.
src/backend/access/nbtree/README | 18 ++++++++
src/backend/access/nbtree/nbtxlog.c | 88 +++++++++++++++++++++++--------------
2 files changed, 72 insertions(+), 34 deletions(-)

Upstream: git.postgresql.org


  • Share