Book notes for the textbook Designing Data-Intensive Applications by Martin Kleppmann
Chapter 1: Reliable, Scalable, Maintainable Applications
Data-intensive applications: the bottleneck is the data (its amount, complexity, or speed of change), not CPU power, unlike compute-intensive applications.
Three main concerns that are important in most software systems:
* Reliability: The system should continue to work correctly even in the face of adversity.
* Scalability: As the system grows, there should be reasonable ways of dealing with that growth.
* Maintainability: Over time, many different people will work on the system, and they should all be able to work on it productively.
Reliability
Fault vs failure: a fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
The Netflix Chaos Monkey is an example of an approach to designing fault-tolerant systems: deliberately induce faults so that the fault-tolerance machinery is continually exercised and tested.
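A toy sketch of the idea (not the real Chaos Monkey; the "instances" here are just local processes invented for illustration): spin up a few workers and deliberately kill one at random. A fault-tolerant design should survive this.

```python
import random
import time
from multiprocessing import Process

def worker() -> None:
    # Stand-in for one running instance of a service.
    while True:
        time.sleep(0.1)

if __name__ == "__main__":
    # Start four fake "instances" (daemon=True so they exit with the script).
    instances = [Process(target=worker, daemon=True) for _ in range(4)]
    for proc in instances:
        proc.start()

    # The chaos-monkey step: deliberately terminate one instance at random.
    # A fault-tolerant deployment should keep serving users regardless.
    victim = random.choice(instances)
    victim.terminate()
    print(f"killed pid {victim.pid}; surviving instances must cover for it")
```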
It’s preferable to tolerate faults rather than prevent them. However, in security matters prevention is better, since a breach cannot be undone.
Software faults
The bugs that cause software faults often lie dormant until they are triggered by unusual circumstances, which reveal that the software was making some assumption about its environment that is no longer true.
Preventing these kinds of faults is difficult, and there is no single solution.
Human Errors
According to a reference in the book, humans cause 10-15% of software errors. There are several things we can do to reduce them: well-designed abstractions that minimise opportunities for error, sandbox environments where people can experiment safely, thorough testing, quick and easy recovery (e.g. fast rollbacks), and detailed monitoring.
How important is reliability?
It is obviously crucial in safety-critical applications. There are situations where we may choose to sacrifice reliability to reduce development or operational cost (e.g. a prototype, or a service with narrow profit margins), but we should be conscious of when we are cutting corners.
Scalability
Describing Load
When discussing scale, we must describe load with load parameters. Which parameters matter is particular to each application. For example, in the book's Twitter case study, a key load parameter was the distribution of followers per user, because delivering a tweet means fanning it out to every follower (see the sketch below).
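A minimal sketch of why follower distribution is the load parameter, assuming the fan-out-on-write approach the book describes (the user data below is made up):

```python
from collections import defaultdict

# Fan-out on write: when a user tweets, push the tweet into a cached home
# timeline for each follower. The cost of one tweet is proportional to the
# author's follower count, so the distribution of followers per user
# determines the write load.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # made-up data
home_timelines = defaultdict(list)

def post_tweet(author: str, text: str) -> None:
    for follower in followers.get(author, []):
        home_timelines[follower].append(f"{author}: {text}")

post_tweet("alice", "hello")
print(home_timelines["carol"])  # ['alice: hello']
```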
Describing Performance
Load affects performance. We can investigate how performance changes as a load parameter increases (keeping resources fixed), or how much we need to increase resources to keep performance the same.
In some systems, such as batch processing systems, it's the throughput that matters, whereas in online systems it's usually the response time.
We usually describe response times using percentiles rather than the mean. High percentiles of response times (tail latencies) are important because they directly affect users' experience of the service.
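A quick sketch of computing percentiles from measured response times using the nearest-rank method (the sample timings are made up):

```python
def percentile(response_times_ms, p):
    """Return the p-th percentile (0-100) by nearest rank on sorted samples."""
    samples = sorted(response_times_ms)
    k = round(p / 100 * (len(samples) - 1))
    return samples[k]

# Made-up response times in ms; real numbers would come from request logs.
times = [12, 15, 14, 200, 16, 13, 18, 950, 17, 14, 15, 16, 13, 14, 400, 15]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(times, p)} ms")
```

Note how the median (p50) barely moves while the high percentiles are dominated by the few slow outliers.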
Amazon describes response-time requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 in 1,000 requests. This is because the customers with the slowest requests are often the most valuable: they have the most data on their accounts because they have made many purchases.
On the other hand, optimising for high percentiles has diminishing returns. Often the high response times are due to things outside of our control.
Queueing delays often account for a large proportion of the response time at high percentiles. A server only needs to be processing a few slow requests to hold up everything behind them (head-of-line blocking): even if the queued requests would be quick to process, they have to wait. This is also why load tests should keep sending requests independently of response times, rather than waiting for each request to complete before sending the next; waiting artificially keeps the queues short and hides this effect.
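A sketch of the open-loop testing idea, assuming a toy service where roughly 1% of requests are randomly slow (all numbers invented): the load generator fires requests on a fixed schedule even while earlier ones are still outstanding.

```python
import random
import threading
import time

def send_request() -> None:
    # Stand-in for an HTTP call; about 1% of requests are slow (made up).
    time.sleep(random.choices([0.01, 2.0], weights=[99, 1])[0])

# Open loop: fire a request every 10 ms, each on its own thread, without
# waiting for earlier responses. A closed loop (send, wait, send) would let
# slow responses throttle the client and hide queueing delays.
for _ in range(200):
    threading.Thread(target=send_request).start()
    time.sleep(0.01)
```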
On a related note, if serving a request requires multiple backend calls, the overall response time is at least as long as the slowest of those calls. The more calls a request makes, the higher the chance of hitting at least one slow call, and thus a higher proportion of end-user requests end up slow. This is called tail latency amplification.
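A back-of-the-envelope illustration, assuming slow calls occur independently with probability p:

```python
# If each backend call is slow with probability p, a request that fans out
# to n calls in parallel is slow whenever at least one of them is slow:
#     P(slow request) = 1 - (1 - p)^n
p = 0.01  # assume 1% of individual backend calls are slow
for n in (1, 10, 100):
    print(f"{n:>3} backend calls -> {1 - (1 - p) ** n:.0%} of requests affected")
```

With 100 backend calls per request, roughly 63% of end-user requests are affected even though only 1% of individual calls are slow.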