Designing Data-Intensive Applications

14 June 2023

Martin Kleppmann's Designing Data-Intensive Applications is an O'Reilly book with a boar on the cover and, not surprisingly, it is sometimes a bit of a bore to read. When he keeps it engaging with real-world examples things flow smoothly, but when he gets bogged down in discussing special cases and theoretical problems (e.g. the chapter on consistency and consensus) the text can become an effective sedative. Such is the price of really going in depth on a subject, I suppose.

Kleppmann starts off by explaining what he means by "reliable, scalable, and maintainable systems". Reliable means that a system is fault tolerant. Scalable means that it continues to perform when loads increase. And finally, maintainable means that the system is set up in such a way that operations and engineering can find and repair problems and deploy new features easily.

Next up is a discussion of data models, storage systems, and query languages. The storage systems run the gamut from unconnected document databases to hyperconnected graph databases, with the commonly used relational databases falling somewhere in the middle. SQL is, of course, the most widely used query language, but others are mentioned as well.

Storage and retrieval are the next topic, with the opposing constraints of OLTP (transactions) versus OLAP (analytics) at the heart of the discussion. I remember this was the crux of our problem at PatientFocus when we started to scale, and building a proper data warehouse was the only way we could achieve both.

A discussion of encoding data and dataflow follows. Rolling upgrades are introduced for multinode systems, whereby changes propagate across the nodes gradually, allowing for rollback if problems arise. Since multiple versions are running at the same time, backward and forward compatibility of the changes is a must. The big three data formats (JSON, XML, CSV) were also discussed alongside binary and language-specific formats. Several modes of dataflow were then presented, from databases to REST APIs, RPC, and asynchronous message brokers.

Replication was the opening topic as Kleppmann moved into distributed data. The purposes of replication include high availability, disconnected operation, reduced latency, and scalability. To achieve these goals there are several approaches, from single-leader, through multi-leader, to leaderless replication. The choice depends on a tradeoff between consistency and fault tolerance. Several approaches are also available for achieving consistency in the face of replication lag: read-after-write consistency, monotonic reads, and consistent prefix reads. Conflict resolution in the case of concurrent writes was then discussed.

The next topic was partitioning, to which there are two main approaches: key-range partitioning and hash partitioning. There is also a tradeoff between local and global secondary indexes, which make writes and reads easier, respectively.
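To make the contrast concrete, here is a minimal Python sketch of the two schemes (my own illustration, not code from the book; the partition count and range boundaries are made up). Key-range partitioning assigns a key to the partition whose range contains it, which keeps adjacent keys together and makes range scans cheap, while hash partitioning spreads keys evenly across partitions.

```python
# Toy comparison of key-range vs. hash partitioning over four partitions.
import bisect
import hashlib

N_PARTITIONS = 4

# Hypothetical split points an administrator or rebalancer might choose:
# partition 0 = keys < "g", 1 = ["g", "n"), 2 = ["n", "t"), 3 = keys >= "t"
RANGE_BOUNDARIES = ["g", "n", "t"]

def key_range_partition(key: str) -> int:
    """Partition whose key range contains the key; adjacent keys stay together."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

def hash_partition(key: str) -> int:
    """Partition derived from a stable hash of the key; spreads keys evenly."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_PARTITIONS

for key in ["apple", "apricot", "kiwi", "zucchini"]:
    print(f"{key}: range -> {key_range_partition(key)}, hash -> {hash_partition(key)}")
```

Under the range scheme "apple" and "apricot" land on the same partition, so a scan over keys starting with "a" touches a single node; under the hash scheme they may well end up on different partitions, which is exactly the tradeoff being described.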
Worrying about infrastructure problems while programming has led to the idea of transactions, which provide a layer to deal with the issues of the previous chapters so the application doesn't have to. For simple applications they're overkill, but in complex applications they significantly reduce the potential for errors. Concurrency control can be dealt with by several different isolation levels: read committed, snapshot isolation, and serializable. The last is the only one which protects application developers against all the potential race conditions (dirty reads, dirty writes, read skew, write skew, lost updates, and phantom reads). Methods for implementing serializable transactions were then discussed, including executing transactions in serial order, two-phase locking, and serializable snapshot isolation.

The topic of transactions was initially limited to issues arising on a single machine, so the natural follow-up topic was distributed systems. This section mostly discussed the failures that can occur and how they might be detected and mitigated. No single node can make a decision on its own, though; it must communicate with others over an unreliable network to reach consensus. The chapter on consistency and consensus is probably the densest part of the book.

Batch processing is the opening topic in the discussion of derived data and begins with a history lesson on how it evolved from Unix to MapReduce. Assuming a distributed system, there are two main problems that arise in batch processing: partitioning and fault tolerance. The advantage of batch processing is that it can always be run over the same data again.

Kleppmann discusses stream processing next, although he revisits ideas discussed in earlier chapters. He compares two types of message brokers acting as middlemen between producers and consumers: those that destroy messages once consumed (AMQP/JMS-style) and those that do not (log-based). The difficulties of handling time and order between multiple producers and consumers are discussed at length. A very useful analogy is also presented: the state of a database is the sum, or integral, of the stream of changes, while the stream is the delta, or derivative, of the state. He concludes by discussing joins on streams and the problems that arise from them.

Looking into his crystal ball from 2017, Kleppmann lastly tries to predict future developments, and the book just kind of fizzles out. This is a book I wish I had read 5 or 6 years ago, but it still took me over a year of stops and starts to finish.
Last altered 25 June 2023 by Bradley James Wogsland.
Copyright © 2023 Bradley James Wogsland. All rights reserved.