A critical failure in some of the end-of-recovery actions before the end-of-recovery record is written can cause PostgreSQL to react inconsistently with the rest of the cluster in the event of a crash before the final record is written. Two such failures are for example an error while processing a two-phase state files or when operating on recovery.conf. With this commit, the failures are still considered FATAL, but the write of the timeline history file is delayed as much as possible so as the window between the moment the file is written and the end-of-recovery record is generated gets minimized. This way, in the event of a crash or a failure, the new timeline decided at promotion will not seem taken by other nodes in the cluster. It is not really possible to reduce to zero this window, hence one could still see failures if a crash happens between the history file write and the end-of-recovery record, so any future code should be careful when adding new end-of-recovery actions. The original report from Magnus Hagander mentioned a renamed recovery.conf as original end-of-recovery failure which caused a timeline to be seen as taken but the subsequent processing on the now-missing recovery.conf cause the startup process to issue stop on FATAL, which at follow-up startup made the system inconsistent because of on-disk changes which already happened.
Processing of two-phase state files still needs some work as corrupted entries are simply ignored now. This is left as a future item and this commit fixes the original complain.
cbc55da556 Rework order of end-of-recovery actions to delay timeline history write
src/backend/access/transam/xlog.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)