Notes: Reliable, Maintainable & Scalable Applications
source: Designing Data-Intensive Applications by Martin Kleppmann (Chapter 1)
For many applications today, data rather than raw CPU power is the limiting factor. The problem gets harder when data is:
- high volume
- complex
- changing in real time
Thinking about Data Systems
Lots of tools (like databases, queues, caches) are types of data systems, so it’s a pretty broad category. They’re important to think about, though, because applications are usually just a collection of data systems.
When building a product, it’s important to consider both functional and nonfunctional requirements.
- functional requirements describe what a system should do
- nonfunctional requirements include security, reliability, scalability and maintainability
Data-intensive applications are typically assembled from five standard building blocks:
- Databases
- Caches
- Search Indexes
- Stream Processing
- Batch Processing
Reliability
Reliable systems tolerate user error, abuse and large loads.
fault vs failure:
- Faults are when a component in a system deviates from its specifications
- Failures are when a whole system stops providing the required service
We want fault-tolerant (resilient) systems, as opposed to fault-free systems. Faults are to some degree inevitable, and when a system isn’t fault tolerant, it just fails.
Security is an exception to this. There the goal is prevention rather than tolerance, because a breach can’t be undone after the fact.
Hardware Faults
These happen all the time, especially when more machines are involved. Solve that problem using redundancy and backups. If we have multiple machines for a given system, the system should (ideally) tolerate one or more machines going down.
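Here is a minimal sketch of redundancy in code, under some loud assumptions: replicas are modeled as plain functions, and the names `read_with_failover`, `ReplicaDown` and `ServiceUnavailable` are made up for illustration. It shows the fault vs failure distinction: one replica going down is a fault the system absorbs, and only losing every replica turns into a failure.

```python
class ReplicaDown(Exception):
    """A single replica cannot serve the request (a fault)."""


class ServiceUnavailable(Exception):
    """No replica could serve the request (a failure)."""


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    faults = 0
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            faults += 1  # tolerate the fault and try the next machine
    raise ServiceUnavailable(
        f"{faults} of {len(replicas)} replicas down for {key!r}"
    )


def broken_replica(key):
    """Hypothetical replica whose disk has died."""
    raise ReplicaDown("disk failure")


def healthy_replica(key):
    """Hypothetical replica that answers normally."""
    return f"value-for-{key}"


# One fault, no failure: the second replica answers.
result = read_with_failover([broken_replica, healthy_replica], "user:1")
```

If every replica in the list raises `ReplicaDown`, the call raises `ServiceUnavailable` instead, which is the point where a fault has become a failure.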
Software Errors
Software errors are harder to anticipate, and tend to be more likely to cause system failures.
- they can affect shared resources more easily
- they can also cause cascading failures
Lots of software bugs arise when assumptions about the environment turn out to be wrong. There’s no quick fix, but partial solutions include:
- thinking ahead
- unit tests, integration tests
- monitoring
Human Errors
Humans are really unreliable. To get around this, use well-designed UIs, sandboxes, failing hard/loudly, integration/unit tests, detailed monitoring and good processes.
Scalability
When measuring scalability, a rough way to think of it is: (resources)/(load) = performance
So if you want to investigate scalability, you can do it in a couple of ways:
- increase load but keep resources the same. See how that affects performance
- increase load, increase resources to keep performance the same. Keep track of what you need to change
Determining Load
In Twitter’s case, the key load parameter is not just the number of users but how followers are distributed across them. What’s the best way to design for a user with 30 million followers? What about dealing with millions of users, each with about 70 followers? The algorithm (to deliver tweets to followers’ timelines, like mail to mailboxes) that scales well depends on the type of user.
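One way to picture that trade-off is a toy timeline service that mixes the two delivery strategies the book describes: push each tweet into followers’ cached timelines when it’s posted (fan-out on write), or look tweets up when a timeline is read (fan-out on read). This is only a sketch, not Twitter’s implementation. The class name, the in-memory dicts, and the tiny `CELEBRITY_THRESHOLD` are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import count

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real system would tune this


class TimelineService:
    def __init__(self):
        self.followers = defaultdict(set)    # author -> users following them
        self.following = defaultdict(set)    # user -> authors they follow
        self.tweets = defaultdict(list)      # author -> [(seq, author, text)]
        self.home_cache = defaultdict(list)  # user -> [(seq, author, text)]
        self._seq = count(1)                 # stand-in for a timestamp

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def post(self, author, text):
        tweet = (next(self._seq), author, text)
        self.tweets[author].append(tweet)
        if not self.is_celebrity(author):
            # Fan-out on write: one cache insert per follower.
            for user in self.followers[author]:
                self.home_cache[user].append(tweet)

    def home_timeline(self, user):
        # A set drops duplicates if an author became a celebrity after
        # some of their tweets had already been pushed to caches.
        merged = set(self.home_cache[user])
        # Fan-out on read: fetch celebrity tweets at query time.
        for author in self.following[user]:
            if self.is_celebrity(author):
                merged.update(self.tweets[author])
        return [(author, text) for _, author, text in sorted(merged, reverse=True)]


service = TimelineService()
service.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    service.follow(fan, "star")  # "star" now counts as a celebrity
service.post("bob", "hi")        # pushed into alice's cache
service.post("star", "hello")    # stored once, merged in on read
timeline = service.home_timeline("alice")
```

Posting is cheap for "star" (one write instead of one per follower), while reads stay cheap for everyone who follows only ordinary users.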
Measuring Performance
There are lots of good metrics for performance, like throughput and response time. It’s best to look at response times as percentiles rather than as an average, since the average hides outliers. The median (50th percentile) shows what a typical request sees, while the 99th percentile shows how bad things get for the slowest requests, which is often more instructive.
- reason for this: queueing delays, head-of-line blocking. Bottlenecks in your system can exacerbate outliers
- Even when you’re running things in parallel, the slowest call determines the response time for the whole request
- Shorthand for 99th percentile: P99. 99.9th percentile -> P999, 50th percentile -> P50
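To make the percentile idea concrete, here is a small sketch using the nearest-rank method (one of several common percentile definitions, picked here for simplicity). The sample data and both function names are invented for illustration. The second function assumes that backend calls are independent, which real systems only approximate.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def chance_of_slow_request(parallel_calls, slow_fraction=0.01):
    """Probability that at least one of several independent backend calls
    is slow, when each call is slow with probability slow_fraction."""
    return 1 - (1 - slow_fraction) ** parallel_calls


# 98 fast requests and 2 slow ones (made-up numbers, in milliseconds).
response_times_ms = [100] * 98 + [2000] * 2
mean_ms = sum(response_times_ms) / len(response_times_ms)  # 138.0
p50 = percentile(response_times_ms, 50)                    # 100
p99 = percentile(response_times_ms, 99)                    # 2000
```

The mean (138 ms) describes no request anyone actually experienced, while P50 and P99 show the typical case and the tail separately. `chance_of_slow_request(100)` comes out at roughly 0.63, so a request that waits on 100 parallel calls hits a P99-slow call more often than not.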
latency vs response time
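Response time (how long it takes for a user to get a response) is what the client sees. It includes the latency (the amount of time a request waits to be handled) of every system involved, plus the time spent actually processing the request and any network delays.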
Strategies for Scaling
The architecture you use to scale depends on your system, and you typically have a mix of the following:
- scaling up (vertical scaling): using a more powerful machine
  - this is often the simpler option for stateful services
- scaling out (horizontal scaling): using more machines
  - this is better suited to stateless services
Elastic Systems: scale automatically when load changes. This is opposed to simpler systems, where humans anticipate changes in load and scale the system manually. Elastic systems are not necessary for a lot of applications, but they’re better for unpredictable workloads.
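The core decision an elastic system makes can be sketched as a single function: given the current load, how many machines should be running? Everything here is an assumption for illustration (the function name, requests per second as the load measure, the 20% headroom and the instance limits). Real autoscalers also smooth their inputs and rate-limit changes.

```python
def desired_instances(load_rps, capacity_per_instance_rps,
                      min_instances=1, max_instances=20, headroom_percent=20):
    """Return how many instances to run for the observed load.

    Integer arithmetic keeps the ceiling division exact.
    """
    numerator = load_rps * (100 + headroom_percent)
    denominator = 100 * capacity_per_instance_rps
    needed = -(-numerator // denominator)  # ceiling division on integers
    return max(min_instances, min(max_instances, needed))


# 1000 requests/s, each instance handles 100 requests/s, 20% headroom.
instances = desired_instances(1000, 100)  # 12
```

A manually scaled system is the same calculation done by a human ahead of time. The elastic version just re-runs it whenever the measured load changes.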
Maintainability
You can determine maintainability based on how happy your developers are. You make systems maintainable to reduce headaches. Three design principles help:
- operability
- simplicity
- evolvability
Operability
Ops teams do a lot (monitoring system health, dealing with issues, security patches, maintenance, capacity planning…) so good data systems should make life easier for operations teams. Good tools for this include:
- runtime visibility tools
- automation
- documentation
- redundancy
- predictability
Simplicity
Complexity causes all sorts of problems that lead to software bugs.
- Accidental Complexity isn’t inherent to the problem that a piece of software is solving; it comes only from the implementation
- Abstraction: a tool for removing accidental complexity. Hide the implementation details behind a clean, simple interface (a facade).
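As a small illustration of hiding implementation behind a facade, the sketch below wraps a cache and a database (two of the data systems from earlier) in one class. Callers only see `get` and `put`, and the caching policy stays an internal detail. The class name and the dict standing in for a database are assumptions, not a real library API.

```python
class UserStore:
    """Facade over a cache plus a backing database."""

    def __init__(self, database):
        self._database = database  # any dict-like store (assumption)
        self._cache = {}
        self.db_reads = 0          # exposed only to show the cache working

    def get(self, user_id):
        """Return the profile, reading the database only on a cache miss."""
        if user_id in self._cache:
            return self._cache[user_id]
        self.db_reads += 1
        profile = self._database[user_id]  # raises KeyError if unknown
        self._cache[user_id] = profile
        return profile

    def put(self, user_id, profile):
        """Write through: update the database and the cache together."""
        self._database[user_id] = profile
        self._cache[user_id] = profile


database = {1: "ada"}
store = UserStore(database)
first = store.get(1)   # cache miss, reads the database
second = store.get(1)  # cache hit, no database read
```

Swapping the cache for something else, or removing it, changes nothing for callers, which is what keeps this complexity from leaking into the rest of the application.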
Evolvability
Most systems, obviously, change over time. Use Agile working patterns to provide a framework for change. Simplicity and operability also help a lot, since simple, well-understood systems are easier to modify.