A distributed system is a set of computers (nodes) that communicate over a network to accomplish some task together.

Martin Kleppmann’s Course

Notes from Martin Kleppmann’s Distributed Systems course. He also publishes a set of course notes on his teaching site.

How do we share data amongst different concurrent entities?

  • Recommended Reading
    • “Distributed Systems” by van Steen & Tanenbaum: implementation-detail heavy, more practical
    • “Introduction to Reliable and Secure Distributed Programming” (2nd ed) by Cachin, Guerraoui & Rodrigues: theory heavy
    • “Designing Data-Intensive Applications” by Kleppmann: more oriented toward distributed databases
    • “Operating Systems: Concurrent and Distributed Software Design” by Bacon & Harris (Addison-Wesley): links to operating systems

Why distributed?

  • Some tasks are inherently distributed: sending a message from your phone to your friend’s phone
  • Reliability: even if one node fails, the system as a whole keeps functioning
  • Performance: get data from a nearby node rather than one centralized server halfway around the world
  • Solve bigger problems: some amounts of data can’t fit on just one machine

Why not distributed?

  • Communication may fail (and we might not even know it has failed)
  • Processes may crash (and we might not know)
  • All of this can happen nondeterministically
  • Thus we need to think about fault tolerance
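
To make the "we might not even know it has failed" point concrete, here is a minimal sketch in Python. It is a toy in-process simulation, not real networking: `Fault`, `Receiver`, and `send` are hypothetical names invented for illustration, and a return value of `None` stands in for the sender's timeout expiring. It shows three different faults that all look the same from the sender's side.

```python
from enum import Enum
from typing import Optional


class Fault(Enum):
    """Hypothetical fault modes injected into a single request/response."""
    NONE = "no fault"
    REQUEST_LOST = "request lost in the network"
    NODE_CRASHED = "receiver crashed before handling"
    RESPONSE_LOST = "response lost in the network"


class Receiver:
    """Hypothetical remote node that counts the requests it has handled."""

    def __init__(self) -> None:
        self.handled = 0

    def handle(self, message: str) -> str:
        self.handled += 1
        return f"ack: {message}"


def send(receiver: Receiver, message: str, fault: Fault) -> Optional[str]:
    """Return the reply, or None to model the sender's timeout expiring."""
    if fault in (Fault.REQUEST_LOST, Fault.NODE_CRASHED):
        return None  # the receiver never processes the message
    reply = receiver.handle(message)
    if fault is Fault.RESPONSE_LOST:
        return None  # the receiver did the work, but the sender never hears back
    return reply


if __name__ == "__main__":
    for fault in Fault:
        receiver = Receiver()
        reply = send(receiver, "hello", fault)
        observed = "timeout" if reply is None else reply
        print(f"{fault.value:35} sender sees: {observed:12} handled: {receiver.handled}")
```

In all three faulty cases the sender only observes a timeout, yet in the lost-response case the receiver has already handled the request. That ambiguity is one reason blindly retrying is not always safe, and why fault tolerance has to be designed in rather than assumed.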