paxos.pl -- A Replicated Data Store
This module provides a replicated data store that is coordinated using a
variation on Lamport's Paxos concensus protocol. The original method is
described in his paper entitled, "The Part-time Parliament", which was
published in 1998. The algorithm is tolerant of non-Byzantine failure.
That is late or lost delivery or reply, but not senseless delivery or
reply. The present algorithm takes advantage of the convenience offered
by multicast to the quorum's membership, who can remain anonymous and
who can come and go as they please without effecting Liveness or Safety
properties.
Paxos' quorum is a set of one or more attentive members, whose processes
respond to queries within some known time limit (< 20ms), which includes
roundtrip delivery delay. This property is easy to satisfy given that
every coordinator is necessarily a member of the quorum as well, and a
quorum of one is permitted. An inattentive member (e.g. one whose
actions are late or lost) is deemed to be "not-present" for the purposes
of the present transaction and consistency cannot be assured for that
member. As long as there is at least one attentive member of the quorum,
then persistence of the database is assured.
Each member maintains a ledger of terms along with information about
when they were originally recorded. The member's ledger is
deterministic. That is to say that there can only be one entry per
functor/arity combination. No member will accept a new term proposal
that has a line number that is equal-to or lower-than the one that is
already recorded in the ledger.
Paxos is a three-phase protocol:
1: A coordinator first prepares the quorum for a new proposal by
broadcasting a proposed term. The quorum responds by returning the
last known line number for that functor/arity combination that is
recorded in their respective ledgers.
2: The coordinator selects the highest line number it receives,
increments it by one, and then asks the quorum to finally accept the
new term with the new line number. The quorum checks their respective
ledgers once again and if there is still no other ledger entry for
that functor/arity combination that is equal-to or higher than the
specified line, then each member records the term in the ledger at
the specified line. The member indicates consent by returning the
specified line number back to the coordinator. If consent is withheld
by a member, then the member returns a nack
instead. The
coordinator requires unanimous consent. If it isn't achieved then the
proposal fails and the coordinator must start over from the
beginning.
3: Finally, the coordinator concludes the successful negotiation by
broadcasting the agreement to the quorum in the form of a
paxos_changed(Key,Value)
event. This is the only event that
should be of interest to user programs.
For practical reasons, we rely on the partially synchronous behavior
(e.g. limited upper time bound for replies) of broadcast_request/1 over
TIPC to ensure Progress. Perhaps more importantly, we rely on the fact
that the TIPC broadcast listener state machine guarantees the atomicity
of broadcast_request/1 at the process level, thus obviating the need for
external mutual exclusion mechanisms.
Note that this algorithm does not guarantee the rightness of the value
proposed. It only guarantees that if successful, the value proposed is
identical for all attentive members of the quorum.
- author
- - Jeffrey Rosenwald (JeffRose@acm.org)
- See also
- -
tipc_broadcast.pl
, udp_broadcast.pl
- license
- - BSD-2
- paxos_initialize(+Options) is det
- Initialize this Prolog process as a paxos node. The initialization
requires an initialized and configured TIPC, UDP or other broadcast
protocol. Calling this initialization may be omitted, in which case
the equivant of
paxos_initialize([])
is executed lazily as part of
the first paxos operation. Defined options:
- node(?NodeID)
- When instantiated, this node rejoins the network with the given
node id. A fixed node idea should be used if the node is
configured for persistency and causes the new node to receive
updates for keys that have been created or modified since the
node left the network. If NodeID is a variable it is unified
with the discovered NodeID.
NodeID must be a small non-negative integer as these identifiers
are used in bitmaps.
- paxos_get_admin(+Name, -Value) is semidet[private]
- paxos_set_admin(+Name, +Value) is semidet[private]
- Set administrative keys. We use a wrapper such that we can hide the
key identity.
- node(?NodeId)[private]
- quorum(?Bitmap)[private]
- dead(?Bitmap)[private]
- failed(?Bitmap)[private]
- failed(?NodeId, ?LastTried, ?Score)[private]
- Track our identity as well as as the status of our peers in the
network. NodeId is a small integer. Multiple NodeIds are combined in
a Bitmap.
- node/1 is our identity.
- quorum/1 is the set of members of the quorum
- failed/1 is the set of members for which the last message was
not confirmed.
- failed/3 tracks individual failed nodes. If accumulates failures
until the node is marked dead.
- dead/1 is the set of members that is considered dead.
- paxos_assign_node(+Options) is det[private]
- Assign a node for this paxos instance. If node is given as an
option, this is the node id that is used. Otherwise the network is
analysed and the system selects a new node.
- paxos_rejoin[private]
- Re-join the network. Tasks:
- Remove myself from the dead list if I'm on there
- Tell the replicators we lost everything.
- paxos_leave is det[private]
- paxos_leave(+Node) is det[private]
- Leave the network. The predicate paxos_leave/0 is called from
at_halt/1 to ensure the node is deleted as the process dies. The
paxos_leave/1 version is called to discard other nodes if they
repeatedly did not respond to queries.
- update_failed(+Action, +Quorum, +Alive) is det[private]
- We just sent the Quorum a message and got a reply from the set
Alive.
- Arguments:
-
is | - one of set , get or replicate and indicates the
intended action. |
- life_quorum(-Quorum, -LifeQuorum) is det[private]
- Find the Quorum and the living nodes from the Quorum. This is the
set for which we wait. If the LifeQuorum is not a majority we
address the whole Quorum.
- To be done
- - At some point in time we must remove a node from the quorum.
- paxos_property(?Property)
- True if Property is a current property for the paxos network.
Defined properties are:
- node(?NodeID)
- quorum(?NodeBitmap)
- failed(?NodeBitmap)
- paxos_message(?Message)[private]
- Handle inbound actions from our peers. Defines values for Message
are:
- prepare(+Key, -Node, -Gen, +Value)
- A request message to set Key to Value. Returns the current
generation at which we have a value or
0
for Gen and the
our node id for Node.
- accept(+Key, -Node, +Gen, -GenA, +Value)
- A request message to set Key to Value if Gen is newer than
the generation we have for Key. In that case GenA is Gen.
Otherwise we reject using GenA =
nack
.
- changed(+Key, +Gen, +Value, +Acceptors)
- The leader got enough accepts for setting Key to Value at Gen.
Acceptors is the set of nodes that accepted this value.
- learn(+Key, -Node, +Gen, -GenA, +Value)
- Request message peforming phase one for replication to learner
nodes.
- learned(+Key, +Gen, +Value, +Acceptors)
- Phase two of the replication. Confirm the newly learned knowledge.
- retrieve(+Key, -Node, -Gen, -Value)
- A request message to retrieve our value for Key. Also provides
our node id and the generation.
- forget(+Nodes)
- Forget the existence of Nodes.
- node(-Node, -Quorum, -Dead)
- Get my view about the network. Node is the (integer) node id of
this node, Quorum is the idea of the quorum and Dead is the idea
about non-responsive nodes.
- To be done
- - : originally the changed was handled by a get and when not
successful with a new set, named paxos_audit. I don't really see
why we need this.
- paxos_set(+Term) is semidet
- Equivalent to
paxos_key(Term,Key)
, pasox_set(Key,Term)
. I.e., Term
is a ground compound term and its key is the name/arity pair. This
version provides compatibility with older versions of this library.
- paxos_set(+Key, +Value) is semidet
- paxos_set(+Key, +Value, +Options) is semidet
- negotiates to have Key-Value recorded in the ledger for each of the
quorum's members. This predicate succeeds if the quorum unanimously
accepts the proposed term. If no such entry exists in the Paxon's
ledger, then one is silently created. paxos_set/1 will retry the
transaction several times (default: 20) before failing. Failure is
rare and is usually the result of a collision of two or more writers
writing to the same term at precisely the same time. On failure, it
may be useful to wait some random period of time, and then retry the
transaction. By specifying a retry count of zero, paxos_set/2 will
succeed iff the first ballot succeeds.
On success, paxos_set/1 will also broadcast the term
paxos_changed(Key,Value)
, to the quorum.
Options processed:
- retry(Retries)
- is a non-negative integer specifying the number of retries that
will be performed before a set is abandoned. Defaults to the
setting
max_sets
(20).
- timeout(+Seconds)
- Max time to wait for the forum to reply. Defaults to the
setting
response_timeout
(0.020, 20ms).
- Arguments:
-
Term | - is a compound that may have unbound variables. |
- To be done
- - If the Value is already current, should we simply do nothing?
- collect(+Quorum, :Stop, ?Node, ?Template, ?Message, -Result, -NodeSet) is semidet[private]
- Perform a broadcast request using Message. Node and Template share
with Message and extract the replying node and the result value from
Message. Result is the list of instantiations for Template received
and NodeSet is the set (bitmask) of Node values that replies, i.e.
|NodeSet| is
length(Result)
. The transfer stops if all members of
the set Quorum responded or the configured timeout passed.
- paxos_quorum_ask(?Template, +Message, -Result, +Options)
- Ask the paxos forum for their opinion. This predicate is not really
part of the paxos protocol, but reuses notably the quorum
maintenance mechanism of this library for asking questions to the
quorum (cluster). Message is the message to be asked. Result is a
list of copies of Template from the quorum. Options:
- timeout(+Seconds)
- Max time to wait for a reply. Default is the setting
response_timeout
.
- node(?Node)
- Can be used to include the replying node into Template.
- quorum(?Quorum)
- Set/query the interviewed quorum as a bitmask
- paxos_get(?Term) is semidet
- Equivalent to
paxos_key(Term,Key)
, pasox_get(Key,Term)
. I.e., Term
is a compound term and its key is the name/arity pair. This version
provides compatibility with older versions of this library.
- paxos_get(+Key, +Value) is semidet
- paxos_get(+Key, +Value, +Options) is semidet
- unifies Term with the entry retrieved from the Paxon's ledger. If no
such entry exists in the member's local cache, then the quorum is
asked to provide a value, which is verified for consistency. An
implied paxos_set/1 follows. This predicate succeeds if a term
with the same functor and arity exists in the Paxon's ledger, and
fails otherwise.
Options processed:
- retry(Retries)
- is a non-negative integer specifying the number of retries that
will be performed before a set is abandoned. Defaults to the
setting
max_gets
(5).
- timeout(+Seconds)
- Max time to wait for the forum to reply. Defaults to the
setting
response_timeout
(0.020, 20ms).
- Arguments:
-
Term | - is a compound. Any unbound variables are unified with
those provided in the ledger entry. |
- paxos_key(+Term, -Key) is det[private]
- Compatibility to allow for paxos_get/1, paxos_set/1, etc. The key of
a compound term is a term
'$c'(Name,Arity)
. Note that we do not
use Name/Arity
and X/Y
is naturally used to organize keys as
hierachical paths.
- start_replicator[private]
- Start or signal the replicator thread that there may be outstanding
replication work. This is the case if
- The union of quorum and learners was extended, and thus
all data may need to be replicated to the new members.
- A paxos_set/3 was not fully acknowledged.
- paxos_replicate_key(+Nodes:bitmap, ?Key, +Options) is det
- Replicate a Key to Nodes. If Key is unbound, a random key is
selected.
- timeout(+Seconds)
- Max time to wait for the forum to reply. Defaults to the
setting
response_timeout
(0.020, 20ms).
- paxos_on_change(?Term, :Goal) is det
- paxos_on_change(?Key, ?Value, :Goal) is det
- Executes the specified Goal when Key changes. paxos_on_change/2
listens for
paxos_changed(Key,Value)
notifications for Key, which
are emitted as the result of successful paxos_set/3 transactions.
When one is received for Key, then Goal is executed in a separate
thread of execution.
- Arguments:
-
Term | - is a compound, identical to that used for
paxos_get/1. |
Goal | - is one of:
- a callable atom or term, or
- the atom
ignore , which causes monitoring for Term to be
discontinued.
|
- node(-Node) is det[private]
- Get the node ID for this paxos node.
- quorum(-Quorum) is det[private]
- Get the current quorum as a bitmask
- paxos_message(+PaxOS, +TimeOut, -BroadcastMessage) is det[private]
- Transform a basic PaxOS message in a message for the broadcasting
service. This predicate is hooked by paxos_message_hook/3 with the
same signature.
- Arguments:
-
TimeOut | - is one of - or a time in seconds. |
- paxos_ledger_hook(+Action, ?Key, ?Gen, ?Value, ?Holders)[multifile]
- Hook called for all operations on the ledger. Defined actions are:
- current
- Enumerate our ledger content.
- get
- Get a single value from our ledger.
- create
- Create a new key in our ledger.
- accept
- Accept a new newly proposed value for a key. Failure causes
the library to send a NACK message.
- set
- Final acceptance of Ken@Gen, providing the holders that accepted
the new value.
- learn
- Accept new keys in a new node or node that has been offline for
some time.
- ledger_current(?Key, ?Gen, ?Value, ?Holders) is nondet[private]
- True when Key is a known key in my ledger.
- ledger(+Key, -Gen, -Value) is semidet[private]
- True if the ledger has Value associated with Key at generation Gen.
Note that if the value is not yet acknowledged by the leader we
should not use it.
- ledger_create(+Key, +Gen, +Value) is det[private]
- Create a new Key-Value pair at generation Gen. This is executed
during the preparation phase.
- ledger_update(+Key, +Gen, +Value) is semidet[private]
- Update Key to Value if the current generation is older than Gen.
This reflects the accept phase of the protocol.
- ledger_update_holders(+Key, +Gen, +Holders) is det[private]
- The leader acknowledged that Key@Gen represents a valid new
- ledger_learn(+Key, +Gen, +Value) is semidet[private]
- We received a learn event.
- ledger_forget(+Nodes) is det[private]
- Remove Nodes from all ledgers. This is executed in a background
thread.
- c_element(+NewList, +Old, -Value)[private]
- A Muller c-element is a logic block used in asynchronous logic. Its
output assumes the value of its input iff all of its inputs are
identical. Otherwise, the output retains its original value.
- arg_union(+Arg, +ListOfTerms, -Set) is det[private]
- Get all the nth args from ListOfTerms and do a set union on the
result.
Re-exported predicates
The following predicates are exported from this file while their implementation is defined in imported modules or non-module files loaded by this module.
- paxos_on_change(?Term, :Goal) is det
- paxos_on_change(?Key, ?Value, :Goal) is det
- Executes the specified Goal when Key changes. paxos_on_change/2
listens for
paxos_changed(Key,Value)
notifications for Key, which
are emitted as the result of successful paxos_set/3 transactions.
When one is received for Key, then Goal is executed in a separate
thread of execution.
- Arguments:
-
Term | - is a compound, identical to that used for
paxos_get/1. |
Goal | - is one of:
- a callable atom or term, or
- the atom
ignore , which causes monitoring for Term to be
discontinued.
|
- paxos_set(+Key, +Value) is semidet
- paxos_set(+Key, +Value, +Options) is semidet
- negotiates to have Key-Value recorded in the ledger for each of the
quorum's members. This predicate succeeds if the quorum unanimously
accepts the proposed term. If no such entry exists in the Paxon's
ledger, then one is silently created. paxos_set/1 will retry the
transaction several times (default: 20) before failing. Failure is
rare and is usually the result of a collision of two or more writers
writing to the same term at precisely the same time. On failure, it
may be useful to wait some random period of time, and then retry the
transaction. By specifying a retry count of zero, paxos_set/2 will
succeed iff the first ballot succeeds.
On success, paxos_set/1 will also broadcast the term
paxos_changed(Key,Value)
, to the quorum.
Options processed:
- retry(Retries)
- is a non-negative integer specifying the number of retries that
will be performed before a set is abandoned. Defaults to the
setting
max_sets
(20).
- timeout(+Seconds)
- Max time to wait for the forum to reply. Defaults to the
setting
response_timeout
(0.020, 20ms).
- Arguments:
-
Term | - is a compound that may have unbound variables. |
- To be done
- - If the Value is already current, should we simply do nothing?
- paxos_get(+Key, +Value) is semidet
- paxos_get(+Key, +Value, +Options) is semidet
- unifies Term with the entry retrieved from the Paxon's ledger. If no
such entry exists in the member's local cache, then the quorum is
asked to provide a value, which is verified for consistency. An
implied paxos_set/1 follows. This predicate succeeds if a term
with the same functor and arity exists in the Paxon's ledger, and
fails otherwise.
Options processed:
- retry(Retries)
- is a non-negative integer specifying the number of retries that
will be performed before a set is abandoned. Defaults to the
setting
max_gets
(5).
- timeout(+Seconds)
- Max time to wait for the forum to reply. Defaults to the
setting
response_timeout
(0.020, 20ms).
- Arguments:
-
Term | - is a compound. Any unbound variables are unified with
those provided in the ledger entry. |
Undocumented predicates
The following predicates are exported, but not or incorrectly documented.
- paxos_admin_key(Arg1, Arg2)