Why Cassandra doesn’t need vector clocks
By –
September 2, 2013 |
One of the notable features of Amazon’s 2007 Dynamo paper was the use of vector clocks for conflict resolution.
However, the two most prominent systems designed by engineers who worked on Dynamo — Cassandra and DynamoDB — both avoid the use of vector clocks in favor of finer-grained updates. To understand why, let’s first look at the problem that vector clocks solve.
Resolving conflicts with vector clocks
The original Dynamo, like the open-source Voldemort and Riak, was a key/value database. Thus, objects would need to be serialized in a format such as json. For example, I might have a user object with key jbellis and value of {'email': 'jbellis@example.com', 'phone': '555-5555'}. We’ll call this initial value V0.
Next, suppose we update the email address, changing the value to V1 of {'email': 'jbellis@illustration.com', 'phone': '555-5555'}. Some failure causes this to only be written to one replica. Later, we update the phone number, but we read from a different replica so we start from the original value V0 (with the original email address), so we write V2 {'email': 'jbellis@example.com', 'phone': '444-4444'}.
(Note that failure — whether actual machine failure, network failure, or even load shedding — can cause “conflicting” updates even with a single client and no concurrency.)
Since our object values are opaque blobs to this system, a naive last-write-wins conflict resolution policy will result in discarding the V1 email address change in favor of the V2 phone number update. This is why it’s so easy to lose data using last-write-wins conflict resolution in a key/value system like Riak.
Vector clocks solve this problem by allowing the database to push conflict resolution back out to the client. Skipping a lot of details, the database would retain both V1 and V2, and when a client next reads key jbellis, it would return both versions and tell the client, “you figure out what you want the value to be now.” The client can then deserialize the objects and merge the separately updated fields without data loss to the intended value of {'email': 'jbellis@illustration.com', 'phone': '444-4444'}.
Problems with vector clocks
I see three main problems with a key/value database like Dynamo and its first-generation open-source derivatives:
- Performance: as I alluded to earlier this year, updating a single field in an object stored in a key/value database requires three steps: read and deserialize the exiting object, update the desired field, and serialize and write the resulting object as a new value. Updating an object in Cassandra requires only communicating the changed fields, no more.
- Siblings — multiple versions generated by conflicting updates — are difficult to deal with in practice, to the point that Riak makes last-write-wins the default despite the high potential for data loss.
- Vector clocks are good at helping clients with simple merges like the above user object, but it’s important to understand that vector clocks only tell you that a conflict occurred, and not how to resolve it; as Basho put it, even with perfect implementation you can’t have perfect information about causality in an open system without unbounded information growth. This is why Cassandra and later Riak both had to go beyond vector clocks when implementing counters.
Cassandra’s solution
People who have been burned by last-write-wins in other systems are justifiably nervous when approaching Cassandra. But Cassandra breaks a row up into columns that can be updated independently. Here’s what that looks like for our example:
CREATE TABLE users ( username text PRIMARY KEY, email text, phone text ); INSERT INTO users (username, email, phone) VALUES ('jbellis', 'jbellis@example.com', '555-5555'); UPDATE users SET email = 'jbellis@illustration.com' WHERE username = 'jbellis'; UPDATE users SET phone = '444-4444' WHERE username = 'jbellis';
This way, the storage engine can resolve changes to email and phone columns automatically. Conversely, if there are concurrent changes to a single field, only one will be retained, which is also what we want. (Cassandra extends this fine-grained conflict resolution to Collection elements as well.)
Thus, clock synchronization is nice to have in a Cassandra cluster but not critical; timestamps are only used to pick a “winning” update within a single column or collection element. (A timestamp tie will also result in a deterministic, commutative result.) Lightweight transactions are available when linearizability is important.
What Cassandra gives up here is the ability to create custom behavior for updates-based-on-existing-values. However, as counters illustrate, vector clocks are often inadequate for this as well. Usually you end up having to store the entire version history, which Cassandra supports by clustering with uuids.
Summary
Cassandra addresses the problem that vector clocks were designed to solve by breaking up documents/objects/rows into units of data that can be updated and merged independently. This allows Cassandra to offer improved performance and simpler application design.