Failure Detection in Distributed Systems
๐๐ฎ๐ถ๐น๐๐ฟ๐ฒ ๐ฑ๐ฒ๐๐ฒ๐ฐ๐๐ถ๐ผ๐ป is a fundamental building block for ensuring fault tolerance in large scale distributed systems. A ๐ณ๐ฎ๐ถ๐น๐๐ฟ๐ฒ ๐ฑ๐ฒ๐๐ฒ๐ฐ๐๐ผ๐ฟ is a process that responds to questions asking whether a given process has failed. The difficulty of making a reliable failure detector is due to the ๐ฝ๐ฎ๐ฟ๐๐ถ๐ฎ๐น๐น๐ ๐๐๐ป๐ฐ๐ต๐ฟ๐ผ๐ป๐ผ๐๐ ๐ป๐ฎ๐๐๐ฟ๐ฒ of distributed systems. In a synchronous system there is a bound on message delivery time so it is therefore easily achievable but in asynchronous systems things are much harder, in this context it is possible for a failure detector to be accurate or live, but not both. In general, in asynchronous systems we have no strict timing assumptions and itโs difficult to determine whether a process has failed or is simply taking a long time for execution, so many of the algorithms have probabilistic assumptions.