14
pages
English
Documents
2013
Obtenez un accès à la bibliothèque pour le consulter en ligne En savoir plus
Découvre YouScribe et accède à tout notre catalogue !
Découvre YouScribe et accède à tout notre catalogue !
14
pages
English
Documents
2013
Obtenez un accès à la bibliothèque pour le consulter en ligne En savoir plus
Fault Tolerance via Idempotence
G. Ramalingam and Kapil Vaswani
Microsoft Research, India
grama,kapilv@microsoft.com
Abstract ing comes with its own pitfalls, such as process failures, imperfect
messaging, asynchrony and concurrency.Building distributed services and applications is challenging due
Consider the prototypical bank account transfer service into the pitfalls of distribution such as process and communication
Fig. 1. The goal of the service is to transfer money between bankfailures. A natural solution to these problems is to detect potential
accounts, potentially in different banks. If the accounts belong tofailures, and retry the failed computation and/or resend messages.
different banks, ensuring that the transfer executes as an atomicEnsuring correctness in such an environment requires distributed
(distributed) transaction is usually not feasible, and the natural wayservices and applications to be idempotent.
of expressing this computation is as a workflow [10, 20] consistingIn this paper, we study the inter-related aspects of process fail-
of two steps, a debit followed by a credit.ures, duplicate messages, and idempotence. We first introduce a
What if the process executing the workflow fails in between thesimple core language (based on-calculus) inspired by modern dis-
debit and credit steps? A natural solution is to detect this failuretributed computing platforms. This language formalizes the notions
and ensure that a different process completes the remaining stepsof a service, duplicate requests, process failures, data partitioning, 1of the workflow. A challenging aspect of realizing this solutionand local atomic transactions that are restricted to a single store.
is figuring out whether the original process failed before or afterWe then formalize a desired (generic) correctness criterion for
completing a particular step (either debit or credit). If not doneapplications written in this language, consisting of idempotence
carefully, the debit or credit step may be executed multiple times,(which captures the desired safety properties) and failure-freedom
leading to further correctness concerns. Services often rely on a the progress properties).
central workflow manager to manage process failures during theWe then propose language support in the form of a monad that
workflow (using distributed transactions).automatically ensures failfree idempotence. A key characteristic of
Now consider a (seemingly) different problem. Messages sentour implementation is that it is decentralized and does not require
between the client initiating the transfer and the service may be lost.distributed coordination. We show that the language support can
The only option for a client, when it does not receive a responsebe enriched with other useful constructs, such as compensations,
within some reasonable time, is to resend its request. Yet the clientwhile retaining the coordination-free decentralized nature of the
does not want the transfer to occur twice!implementation.
In this paper, we study process and communication failures inWe have implemented the idempotence monad (and its variants)
the context of workflows. The seemingly different problems causedin F# and C# and used our implementation to build realistic appli-
by process and communication failures are, in fact, inter-related.cations on Windows Azure. We find that the monad has low runtime
Idempotence, a correctness criterion that requires the system to tol-overheads and leads to more declarative applications.
erate duplicate requests, is the key to handling both communication
Categories and Subject Descriptions D.4.5 [Operating Systems]: and process failures efficiently. Idempotence, when combined with
Reliability—Fault-tolerance; C.2.4 [Computer-Communication Net- retry, gives us the essence of a workflow, a fault tolerant composi-
works]: Distributed Systems—Client/server, Distributed applica- tion of atomic actions, for free without the need for distributed co-
tions ordination. In the transfer example, a fault tolerant account trans-
fer can be implemented without a central workflow manager if the
General Terms Reliability, Languages, Design debit and credit steps can be designed to be idempotent,
Keywords fault tolerance, idempotence, workflow, transaction, Formalizing Failfree Idempotence. In this paper, we introduce a
monad simple core language , inspired by contemporary cloud plat-FAIL
forms. This formalizes process failure, duplicate requests,
1. Introduction partitioned data, and local transactions. A local transaction pro-
vides ACID guarantees but is restricted to access data within a sin-Distributed computing is becoming mainstream. Several modern
gle partition (typically a single server). Computations in areFAILplatforms offer virtualized distributed systems at low entry cost
like workflows, but without any fault-tolerance guarantees for thewith the promise of scaling out on demand. But distributed comput-
composition (i.e., the computation may fail between transactions).
We then formalize a generic correctness criterion for applica-
tions written in . A simple, powerful and tempting criterion isFAIL
that an application’s behavior in the presence of duplicate requests
Permission to make digital or hard copies of all or part of this work for personal or and process failures should be indistinguishable from its behav-
classroom use is granted without fee provided that copies are not made or distributed
ior in the absence of duplicate requests and failures. We formal-for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
1to lists, requires prior specific permission and/or a fee. In general, detecting failures perfectly in an asynchronous, message pass-
POPL’13, January 23–25, 2013, Rome, Italy. ing system is impossible [8]. Conservative failure detection can also lead to
cCopyright 2013 ACM 978-1-4503-1832-7/13/01. . . $10.00 the same problem of duplicated computation.let process (request) = workflow monad [2, 4]. We find that the core business logic in these
match requestwith applications can be declaratively expressed using the monad. Our
j (“getBalance”, (branch, account))! evaluation shows that performance overheads of using the monad
atomic branchflookupg over hand-coded implementations are statistically insignificant.
j (“transfer”, (fromBranch, fromAccount, toBranch, toAccount, amt)!
The rest of the paper is organized as follows. In Section 2, we
atomic fromBranchf
introduce a language and formalize duplicate requests andFAILupdate fromAccount ((lookup fromAccount) amt)
process failures. We formalize what it means for a applicationFAILg;
to correctly tolerate duplicate requests and failures. In Section 3,atomic toBranchf
update toAccount ((lookup toAccount) + amt) we present the idempotence monad and show how it can be used to
g; tolerate duplicate requests as well as process failures. In Section 4,
“Transfer complete.” we describe extensions of the idempotence construct. In 5,
we evaluate the idempotence monad and our implementation from
the perspective of expressiveness, benefits and overheads. Section 6Figure 1: A banking service example, in syntactically sugared
discusses related work. , that is neither idempotent nor fault-tolerant.FAIL
2. Failfree Idempotence
ize a slightly weaker, but more appropriate, correctness criterion, In this section we present a language that distils essential el-FAIL
namely failure-freedom modulo message duplication. Informally, ements of distributed computing platforms such as Windows Azure
this criterion permits the system to send duplicate responses. This and formalize the concept of failfree idempotence.
weakening is appropriate from the perspective of composition: if
the recipient of the responses can also tolerate duplicate messages, 2.1 The LanguageFAIL
then the sender is freed of the obligation to send the response ex-
Informal Overview. A programe represents a service thatFAILactly once.
receives input requests and produces output responses. An input
Automating Idempotence. Next, we address the problem of auto- requestv is processed by creating an agent to evaluateev. When
0 0matically ensuring idempotence for a service. We present our solu- the evaluation of e v terminates, producing a value v , v is sent
tion as a monad, the monad. We then show that idem- back as the response. Multiple input requests can be processed con-
potence, when coupled with a simple retry mechanism, provides a currently, and their evaluation can be interleaved. Shared, mutable,
“free” solution to the problem of tolerating process failures, guaran- persistent data is stored in tables.
teeing failure-freedom modulo message duplication. We then pro- Agents. An agent has its own internal state (captured by local
pose dedicated language support for idempotent computations. variables of the code). An agent may fail at any point in time. A
failure models problems such as hardware failure, software crashes
Decentralized Idempotence and Workflow. The idea underlying and reboots . Data stored in tables is persistent and is unaffected by
the idempotence monad is conceptually simple, but tedious to im- agent failures.
plement manually (i.e., without the monad).