Paper Review: Dynamo: Amazon’s Highly Available Key-value Store


TLDR

Rating: 4 out of 5.

Very easy to read.

Direct link to the paper.

Interesting takeaways

  • Novel approach to conflict resolution: Most data stores resolve conflicts at write time, which can mean rejecting a write. Dynamo instead accepts every write and defers conflict resolution to reads, so writes are never rejected.
  • Quality: Dynamo sets an impressively high bar by measuring latency at the 99.9th percentile instead of the mean or median. This standard is hard to achieve in practice and shows Amazon’s commitment to excellence.
  • Data-driven culture: The objective analysis, based on empirical data at every stage, instills confidence. It’s a simple yet effective approach: run the experiment, gather the data, and make informed decisions.

Areas that could be clearer

Symmetry: Every node should have the same responsibilities as its peers; no node should take on extra roles.

This point had me scratching my head and wondering how Dynamo handles distributed systems’ issues, such as leadership elections, conflict resolution, tiebreakers, etc.

Tidbits from interesting sections

I skipped a few sections because they didn’t contain interesting insights (e.g., the historical background).

3.3 Dynamo design considerations

  • Always writeable
  • Nodes are all trusted
  • No hierarchical namespaces or complex relational schemas
  • Latency sensitive

4.4 Data versioning

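The onus of reconciliation falls on application developers, who are expected to know that the store is eventually consistent and to handle divergent versions themselves. Dynamo uses vector clocks to capture causality between versions. To keep a clock from growing without bound, Dynamo truncates it once it passes a size threshold, dropping the oldest entry. Truncation could in principle hurt reconciliation, because ancestry between versions can no longer be derived accurately, but the problem has not surfaced in production and so has not been thoroughly investigated.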
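To make the versioning idea concrete, here is a minimal sketch of vector clock comparison and truncation. It is not Dynamo's implementation: the function names, the node names (Sx, Sy, Sz), and the `last_updated` map used to decide which entry is oldest are all assumptions made for illustration.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node name -> update counter


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with the coordinating node's counter bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if every counter in b is matched or exceeded in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify the causal relationship between two versions."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a supersedes b"
    if b_covers_a:
        return "b supersedes a"
    return "concurrent"  # neither descends from the other: the reader must reconcile


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, applied once the client has reconciled two versions."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def truncate(clock: VectorClock, last_updated: Dict[str, int], max_entries: int) -> VectorClock:
    """Keep only the most recently updated entries (hypothetical policy details)."""
    if len(clock) <= max_entries:
        return dict(clock)
    newest = sorted(clock, key=lambda node: last_updated[node], reverse=True)[:max_entries]
    return {node: clock[node] for node in newest}


d2 = increment(increment({}, "Sx"), "Sx")   # two writes coordinated by Sx
d3 = increment(d2, "Sy")                    # a later write handled by Sy
d4 = increment(d2, "Sz")                    # a parallel write handled by Sz
print(compare(d3, d2))                      # a supersedes b
print(compare(d3, d4))                      # concurrent
d5 = increment(merge(d3, d4), "Sx")         # reconciled version written via Sx

# Truncating d3 down to one entry drops Sx, so its ancestry to d2 is lost.
short = truncate(d3, {"Sx": 1, "Sy": 2}, max_entries=1)
print(compare(short, d2))                   # concurrent
```

The last two lines show the potential risk: after truncation, a version that really does descend from `d2` looks concurrent with it, so the reader is asked to reconcile a conflict that never existed.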

4.6 Handling failures: Hinted Handoff

Sloppy quorum: all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hash ring. When an intended replica is down, the write goes to a stand-in node together with a hint naming the intended host, and the stand-in hands the data over once that host returns online.
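The flow can be sketched in a few lines. This is a toy model under stated assumptions, not Dynamo's code: `Node`, `sloppy_write`, and `deliver_hints` are hypothetical names, the ring walk is passed in as a ready-made list, and hinted replicas live in a simple per-node list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    alive: bool = True
    store: Dict[str, str] = field(default_factory=dict)
    # Hinted replicas held on behalf of others: (intended owner, key, value).
    hints: List[Tuple[str, str, str]] = field(default_factory=list)


def sloppy_write(ring_walk: List[Node], n: int, key: str, value: str) -> List[Node]:
    """Write to the first n healthy nodes met while walking the ring.

    ring_walk lists nodes in the order encountered from the key's position.
    """
    intended = ring_walk[:n]
    intended_names = {node.name for node in intended}
    down_owners = [node for node in intended if not node.alive]
    healthy = [node for node in ring_walk if node.alive][:n]
    for node in healthy:
        if node.name in intended_names:
            node.store[key] = value
        else:
            # Stand-in: keep the replica aside with a hint naming the real owner.
            owner = down_owners.pop(0)
            node.hints.append((owner.name, key, value))
    return healthy


def deliver_hints(holder: Node, nodes_by_name: Dict[str, Node]) -> None:
    """Hand hinted replicas back to owners that have recovered."""
    remaining = []
    for owner_name, key, value in holder.hints:
        owner = nodes_by_name[owner_name]
        if owner.alive:
            owner.store[key] = value
        else:
            remaining.append((owner_name, key, value))
    holder.hints = remaining


a, b, c, d = Node("A", alive=False), Node("B"), Node("C"), Node("D")
nodes = {node.name: node for node in (a, b, c, d)}

written = sloppy_write([a, b, c, d], n=3, key="cart:42", value="book")
print([node.name for node in written])  # ['B', 'C', 'D']
print(d.hints)                          # [('A', 'cart:42', 'book')]

a.alive = True
deliver_hints(d, nodes)
print(a.store)                          # {'cart:42': 'book'}
```

The write succeeds even though A is down, which is the availability benefit; the cost is that A's copy is stale until D delivers the hint.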

6.0: Balancing between availability, durability, and consistency using the W and R parameters.

Conventional wisdom holds that durability and availability go hand in hand, but this is not necessarily true for Dynamo. For instance, increasing W shrinks the vulnerability window for durability, but it may also increase the probability of rejecting requests (thereby decreasing availability) because more storage hosts must be alive to process a write request.
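A back-of-the-envelope calculation shows the trade-off. The sketch below assumes a strict quorum, independent node failures, and a made-up per-node availability of 0.9; none of these numbers come from the paper, and sloppy quorum softens the effect in practice. The `quorum_overlaps` helper encodes the usual R + W > N condition for read and write sets to intersect.

```python
from math import comb


def p_at_least(k: int, n: int, p_up: float) -> float:
    """Probability that at least k of n independent nodes are up."""
    return sum(comb(n, i) * p_up**i * (1 - p_up) ** (n - i) for i in range(k, n + 1))


def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set."""
    return r + w > n


N = 3
P_UP = 0.9  # assumed availability of a single node, for illustration only

for w in range(1, N + 1):
    print(f"W={w}: write succeeds with probability {p_at_least(w, N, P_UP):.3f}")

print(quorum_overlaps(N, r=2, w=2))  # True
print(quorum_overlaps(N, r=1, w=1))  # False
```

Under these assumptions the chance of accepting a write drops from 0.999 at W=1 to 0.729 at W=3, while each acknowledged write sits on more replicas.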

A typical SLA requires 99.9% of read and write requests to execute within 300 ms. Meeting it is an impressive feat for two reasons:

  • Dynamo runs on multiple nodes across data centres connected via high-speed network links.
  • Dynamo runs on standard commodity hardware components that have far less I/O throughput than high-end enterprise servers.
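To see why a 99.9th percentile target is so much stricter than a mean or median target, consider a small sketch with synthetic latencies. The sample values are invented for illustration, and `percentile` is a simple nearest-rank helper, not anything from the paper.

```python
import math
import statistics
from typing import List


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile: a value with roughly q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def meets_sla(samples: List[float], q: float, limit_ms: float) -> bool:
    """True if the q-th percentile latency is within the limit."""
    return percentile(samples, q) <= limit_ms


# Synthetic data: 9,980 fast requests and 20 slow ones (0.2% of traffic).
latencies_ms = [10.0] * 9_980 + [500.0] * 20

print(statistics.mean(latencies_ms))      # about 10.98
print(statistics.median(latencies_ms))    # 10.0
print(percentile(latencies_ms, 99.9))     # 500.0
print(meets_sla(latencies_ms, 99.9, 300)) # False
```

The mean and median both look healthy, yet the service misses a 300 ms target at the 99.9th percentile because 2 in every 1,000 requests are slow.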

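6.3: Divergent versions

Divergent versions are rare: 99.94% of requests saw exactly one version, so they had no conflict to resolve. When versions did diverge, the cause was usually concurrent writes from bots (automated clients) rather than failures or human users.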
