... instead of applying banning credits.
There have been a couple of cases where recovery repeatedly takes just over 2 minutes to fail. Therefore, banning credits expire between failures and a continuously problematic node is never banned, resulting in endless recoveries. This is because it takes 2 applications of banning credits before a node is banned, which generally involves 2 recovery failures.
The recovery helper makes up to 3 attempts to recover each database during a single run. If a node causes 3 failures then this is really equivalent to 3 recovery failures in the model that existed before the recovery helper added retries. In that case the node would have been banned after 2 failures.
So, instead of applying banning credits to the "most failing" node, simply ban it directly from the recovery helper.
If multiple nodes are causing recovery failures then this can cause a node to be banned more quickly than it might otherwise have been, even pre-recovery-helper. However, 90 seconds (i.e. 3 failures) is a long time to be in recovery, so banning earlier seems like the best approach.
27df4f002a5 ctdb-recovery: Ban a node that causes recovery failure
ctdb/server/ctdb_recovery_helper.c | 46 +++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 15 deletions(-)