The running time of an operation on a data structure is not always obvious without making assumptions about the current state of that data structure.
As a programmer and an algorist, I find myself computing the operating cost of devices that I intend to purchase. On a recent shopping trip, I was in search of a printer with simple requirements: it must be a laser printer that prints pages in grayscale. Once my range of options has been narrowed, it comes down to selecting the most cost-effective option: which printer has the lowest cost per page printed?
What is interesting here is that the cost of each page printed is not necessarily independent of the printer's upfront cost. While the cost of the paper itself is independent of the printer, we must factor in the cost of the toner cartridge as well as the cost of the drum, both of which depend on the printer. Consequently, the true operational cost of a printer must be computed with respect to the cost of the toner cartridge as well as the cost of the drum replacements. This type of analysis can be seen as amortized analysis (or depreciation in accounting terms).
Amortized analysis, in Computer Science terms, is a technique for determining the asymptotic behaviors of sequences of operations.
This post is primarily concerned with amortized analysis as a technique for computing operating costs that vary with the state of a system. We will briefly cover complexity analysis, then discuss amortized analysis, and finally apply these techniques to printers as a physical system.
Complexity Analysis
Complexity analysis is concerned with the running time of algorithms. To formalize this, let $T(x)$ be the running time of an algorithm $x$. Now, consider an abstract problem, $P$. If there are two algorithms, $A$ and $B$, that can be used to solve problem $P$, then complexity analysis is the process of computing $T(A)$ and $T(B)$ in order to answer interesting questions such as the following:
- Which algorithm can solve $P$ quicker? $T(A) > T(B)$ or $T(A) < T(B)$?
- What is the minimum running time to solve $P$? Solve for $T(Z)$ with algorithm $Z$ where $T(Z) < T(Y)$ for all algorithms $Y$.
Balanced binary search trees (BSTs) are archetypal examples for complexity analysis, with $O(\log n)$ insertions, removals, and searches. The general assumption that should be noted here is that the bound on each of these operations depends only on the number of elements $n$, not on the sequence of operations that produced the current state. Complexity analysis becomes more complex once the cost of an operation begins to vary with the state of the data structure.
Amortized Analysis
Amortized analysis is primarily concerned with the running time of sequences of operations on data structures. While standard complexity analysis is concerned with static instances of problems, amortized analysis takes the state of a data structure into consideration. This is useful for bounding the average cost per operation over a sequence, for example in databases such as Cassandra. Unlike probabilistic average-case analysis, the bound holds for every sequence of operations, with no assumption about the distribution of inputs.
Consider the cost of insertions in a dynamically sized array. The cost varies depending on the state of the array. Since arrays are fixed-size, contiguous blocks of memory, they must be resized when the fixed-size block reaches its capacity.
For example, an array with capacity $n=3$ has constant-time, $O(1)$ insertions until the fourth element is inserted. Upon the fourth element's insertion, we must double the capacity of the array by allocating a new array and copying over all $n$ existing values. Hence, the running time of an insertion is $O(n)$ every time the array capacity is reached. This makes per-operation analysis difficult.
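To make the varying cost concrete, here is a minimal sketch of a doubling array that reports how many element writes each insertion performs. The class name `DynamicArray` and the cost model (one unit per element written or copied) are assumptions made for illustration, not a description of any particular library's implementation.

```python
class DynamicArray:
    """Toy growable array that counts element writes (hypothetical cost model)."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.size = 0
        self.slots = [None] * capacity

    def append(self, value):
        """Insert value and return the number of element writes it took."""
        cost = 0
        if self.size == self.capacity:
            # Full: allocate a block twice as large and copy every element.
            new_slots = [None] * (2 * self.capacity)
            for i in range(self.size):
                new_slots[i] = self.slots[i]
                cost += 1
            self.slots = new_slots
            self.capacity *= 2
        # Writing the new value itself always costs one unit.
        self.slots[self.size] = value
        self.size += 1
        return cost + 1


arr = DynamicArray(capacity=3)
costs = [arr.append(i) for i in range(8)]
print(costs)  # [1, 1, 1, 4, 1, 1, 7, 1]
```

Most insertions cost one write, while the fourth and seventh pay for copying the whole array first.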
Accounting Method
To analyze the amortized running time of insertions into a dynamically sized array, we must spread the cost of a resize over the insertions that precede it. The accounting method enables us to perform such amortized analysis.
The accounting method begins by noting that a standard insertion is $O(1)$ and that the cost of a resize is $O(n)$. Since we want to express the running time in terms of a standard insertion alone, we overcharge each standard insertion and bank the surplus as credit to pay for a resize later. This is actually fairly simple. To illustrate, we will use dollars as the unit of cost.
Let the charge for every insertion be \$3. \$1 covers the cost of the insertion itself by definition, and the other \$2 are banked as credit for the next resize. Suppose the array has just been resized to capacity $2N$, holds $N$ elements, and has no credit left. The next resize occurs after $N$ more insertions, which have banked \$2N, and it must copy all $2N$ elements at \$1 each. The banked credit pays for the resize exactly, so every operation has been paid for and the balance never goes negative. (A charge of \$2 would cover the first resize of an array that starts empty, but not later ones, because each resize copies the old elements as well as the newly inserted ones.) Consequently, we have shown that every insertion can be paid for with a constant charge.
Thus, every insertion is $O(1)$ amortized, even though an individual insertion can cost $O(n)$ in the worst case.
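The bookkeeping can be checked with a small simulation. This is a sketch under stated assumptions: the function name `simulate_bank` is hypothetical, the array doubles when full, each element written or copied costs \$1, and the per-insertion charge defaults to \$3, a value chosen here so that the balance stays non-negative across repeated doublings.

```python
def simulate_bank(num_insertions, charge=3, initial_capacity=1):
    """Return (total_actual_cost, lowest_balance) for a doubling array.

    Each insertion is charged `charge` dollars. Writing one element,
    whether it is the new value or a copied one, costs 1 dollar.
    """
    capacity, size = initial_capacity, 0
    balance = 0      # credit banked so far
    lowest = 0       # lowest balance ever observed
    total_cost = 0   # sum of actual costs
    for _ in range(num_insertions):
        actual = 1
        if size == capacity:
            actual += size  # resize: copy every existing element
            capacity *= 2
        size += 1
        balance += charge - actual
        lowest = min(lowest, balance)
        total_cost += actual
    return total_cost, lowest


total, lowest = simulate_bank(1000)
print(total / 1000)  # average actual cost per insertion
print(lowest >= 0)   # True when the charge always covers the resizes
```

The charge is a parameter, so other values can be tried to see whether the balance ever dips below zero.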
Operational Cost of Printers
The operational cost of a printer can be computed likewise. We know that after the upfront cost of a printer, there is an operating cost associated with printing individual pages. Intuitively, the cost to print a page is at least the cost of the page itself. However, we also know that we must replace the toner every $X$ pages and the drum every $Y$ pages. As with the dynamically sized array, we can compute the average cost of a page, including toner and drum replacements, by amortized analysis.
We will now consider a complete example. Say that a stack of 500 pages costs approximately \$10, a toner cartridge that is good for 3,000 pages costs approximately \$50, and a drum that is good for 10,000 pages costs approximately \$75.
The accounting method is much easier to apply here. To amortize the cost of the toner and drum over each page printed, we simply divide the cost of each item by the number of pages that it is good for. The amortized operational cost of a printer per page would then be

$$
\frac{\$10}{500} + \frac{\$50}{3000} + \frac{\$75}{10000} \approx \$0.02 + \$0.0167 + \$0.0075
$$
The average cost is thus approximately \$0.044 per page. Had we used the paper cost alone, \$0.02 per page, our absolute error of \$0.024 would be greater than the paper cost itself! Estimating the cost per page from paper alone therefore understates it by more than half.
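The same arithmetic can be written as a short script, which makes it easy to compare printers with different consumables. The helper name `cost_per_page` is hypothetical, and the prices and page yields are the approximate figures from the example above.

```python
def cost_per_page(consumables):
    """Sum price / yield over (price_in_dollars, pages_covered) pairs."""
    return sum(price / pages for price, pages in consumables)


paper = (10.0, 500)     # $10 per 500 sheets
toner = (50.0, 3000)    # $50 per 3,000 pages
drum = (75.0, 10000)    # $75 per 10,000 pages

paper_only = cost_per_page([paper])
full = cost_per_page([paper, toner, drum])

print(f"paper only: ${paper_only:.4f} per page")
print(f"amortized:  ${full:.4f} per page")
print(f"difference: ${full - paper_only:.4f} per page")
```

Swapping in the toner and drum figures for another printer gives a per-page cost that can be compared directly.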
Conclusion
The primary point to learn here is that the true operational cost of a device such as a printer cannot be accurately estimated from the primary resource consumed (paper) alone. Expensive replacement parts (toner and drums) must be factored in through amortized analysis in order to achieve an accurate estimate of operational costs.
Similarly, data structures should not be judged by the worst-case complexity of a single operation alone, especially when the expensive operations happen infrequently relative to the cheap ones.