ctdb-recovery: Ban a node that causes recovery failure

Enterprise / Samba - Martin Schwenke [meltin.net] - 5 November 2018 05:52 EST

... instead of applying banning credits.

There have been a couple of cases where recovery repeatedly takes just over 2 minutes to fail. As a result, banning credits expire between failures and a continuously problematic node is never banned, resulting in endless recoveries. A node is only banned after 2 applications of banning credits, which generally means 2 recovery failures.

The recovery helper makes up to 3 attempts to recover each database during a single run. If a node causes 3 failures then this is effectively equivalent to 3 recovery failures in the model that existed before the recovery helper added retries. In that model the node would have been banned after 2 failures.

So, instead of applying banning credits to the "most failing" node, simply ban it directly from the recovery helper.

If multiple nodes are causing recovery failures then this can cause a node to be banned more quickly than it might otherwise have been, even pre-recovery-helper. However, 90 seconds (i.e. 3 failures) is a long time to be in recovery, so banning earlier seems like the best approach.

BUG: https://bugzilla.samba.org/show_bug.cgi?id=13670

27df4f002a5 ctdb-recovery: Ban a node that causes recovery failure
ctdb/server/ctdb_recovery_helper.c | 46 +++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 15 deletions(-)

Upstream: gitweb.samba.org

