16-5 Offline caching
Modern computers use a cache to store a small amount of data in a fast memory. Even though a program may access large amounts of data, by storing a small subset of the main memory in the cache (a small but faster memory), overall access time can greatly decrease. When a computer program executes, it makes a sequence $\langle r_1, r_2, \ldots, r_n \rangle$ of $n$ memory requests, where each request is for a particular data element. For example, a program that accesses 4 distinct elements $\{a, b, c, d\}$ might make the sequence of requests $\langle d, b, d, b, d, a, c, d, b, a, c, b \rangle$. Let $k$ be the size of the cache. When the cache contains $k$ elements and the program requests the $(k + 1)$st element, the system must decide, for this and each subsequent request, which $k$ elements to keep in the cache. More precisely, for each request $r_i$, the cache-management algorithm checks whether element $r_i$ is already in the cache. If it is, then we have a cache hit; otherwise, we have a cache miss. Upon a cache miss, the system retrieves $r_i$ from the main memory, and the cache-management algorithm must decide whether to keep $r_i$ in the cache. If it decides to keep $r_i$ and the cache already holds $k$ elements, then it must evict one element to make room for $r_i$. The cache-management algorithm evicts data with the goal of minimizing the number of cache misses over the entire sequence of requests.
Typically, caching is an online problem. That is, we have to make decisions about which data to keep in the cache without knowing the future requests. Here, however, we consider the offline version of this problem, in which we are given in advance the entire sequence of $n$ requests and the cache size $k$, and we wish to minimize the total number of cache misses.
We can solve this offline problem by a greedy strategy called furthest-in-future, which chooses to evict the item in the cache whose next access in the request sequence comes furthest in the future.
a. Write pseudocode for a cache manager that uses the furthest-in-future strategy. The input should be a sequence $\langle r_1, r_2, \ldots, r_n \rangle$ of requests and a cache size $k$, and the output should be a sequence of decisions about which data element (if any) to evict upon each request. What is the running time of your algorithm?
b. Show that the offline caching problem exhibits optimal substructure.
c. Prove that furthest-in-future produces the minimum possible number of cache misses.
a. The procedure $\text{CACHE-MANAGER}$ is a generic procedure, which initializes the cache by calling $\text{INITIALIZE-CACHE}$ and then calls $\text{ACCESS}$ on each requested data element in turn. The inputs are a sequence $R = \langle r_1, r_2, \ldots, r_n \rangle$ of memory requests and a cache size $k$.
CACHE-MANAGER(R, k)
  INITIALIZE-CACHE(R, k)
  for i = 1 to n
    ACCESS(r[i])
The running time of $\text{CACHE-MANAGER}$ of course depends heavily on how $\text{ACCESS}$ is implemented. We have several choices for how to implement the greedy strategy outlined in the problem. A straightforward way is as follows: when processing request $r_i$, scan through requests $r_{i + 1}, \ldots, r_n$ to find which of the at most $k$ elements currently in the cache, together with $r_i$ itself, has its next access furthest in the future, and evict this element (where "evicting" $r_i$ means declining to cache it). Because each scan takes $O(n)$ time, each request entails $O(k)$ scans, and there are $n$ requests, the running time of this straightforward approach is $O(kn^2)$.
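The straightforward $O(kn^2)$ strategy above can be sketched in Python. This is a minimal sketch, not the pseudocode that follows; the function name and the use of a set for the cache are our own choices.

```python
def furthest_in_future(requests, k):
    """Simulate the straightforward furthest-in-future strategy.

    Returns the number of cache misses for the request sequence,
    scanning forward on every miss as described in the text.
    """
    cache = set()
    misses = 0
    n = len(requests)
    for i, r in enumerate(requests):
        if r in cache:
            continue                      # cache hit
        misses += 1
        if len(cache) < k:
            cache.add(r)                  # empty slot available
            continue

        # Among the cached elements and r itself, find the one whose
        # next access lies furthest in the future.
        def next_access(x):
            for j in range(i + 1, n):
                if requests[j] == x:
                    return j
            return float('inf')          # never accessed again

        victim = max(cache | {r}, key=next_access)
        if victim != r:                   # cache r only if it is not the victim
            cache.discard(victim)
            cache.add(r)
    return misses
```

On the example sequence $\langle d, b, d, b, d, a, c, d, b, a, c, b \rangle$ with $k = 3$, this simulation incurs 5 misses: the 4 compulsory first accesses plus one more.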
Instead, we describe an asymptotically faster algorithm, which uses a red-black tree to check whether a given element is currently in the cache, a max-priority queue to retrieve the data element with the furthest access time, and a hash table (resolving collisions by chaining) to map data elements to integer indices. We assume that the data elements can be linearly ordered, so that it makes sense to put them into a red-black tree and a max-priority queue. The following procedure $\text{INITIALIZE-CACHE}$ creates and initializes some global data structures that are used by $\text{ACCESS}$.
INITIALIZE-CACHE(R, k)
  let T be a new red-black tree
  let P be a new max-priority queue
  let H be a new hash table
  ind = 1
  for i = 1 to n
    j = HASH-SEARCH(r[i])
    if j == NIL
      HASH-INSERT(r[i], ind)
      let S[ind] be a new linked list
      j = ind
      ind = ind + 1
    append i to S[j]
In the above procedure, the variables have the following meanings:
• The red-black tree $T$ has at most $k$ nodes and holds the distinct data elements that are currently in the cache. We assume that the red-black tree procedures are modified to keep track of the number of nodes currently in the tree, and that the procedure $\text{TREE-SIZE}$ returns this value. Because the red-black tree $T$ has at most $k$ nodes, we can insert into, delete from, or search in it in $O(\lg k)$ worst-case time.
• Each element in the max-priority queue $P$ has two attributes: $key$ is the next access time of a data element in the cache, and $value$ is that data element itself. The priority queue uses $key$ as the key, and $value$ is satellite data. Like the red-black tree $T$, the max-priority queue contains only elements currently in the cache. We need to maintain $T$ and $P$ separately, however, because $T$ is keyed on the data elements and $P$ is keyed on access times. Using a max-heap to implement $P$, we can extract the maximum element or insert a new element in $O(\lg k)$ time, and we can find the maximum element in $\Theta(1)$ time.
• The hash table $H$ is a dictionary or a map, which maps each data element to a unique integer. This integer is used to index the linked lists, which are described next. We assume that the $\text{HASH-INSERT}$ procedure uses the table-expansion technique of Section 17.4.1 to keep the hash table's load factor at most some constant $\alpha$. In this way, the amortized cost per insertion is $\Theta(1)$, and under the assumption of simple uniform hashing, Theorems 11.1 and 11.2 give an average-case search time of $\Theta(1)$.
• For every distinct data element $r_i$, we create a linked list $S_{ind}$ (where $ind$ is obtained through the hash table) holding the indices in the input sequence where $r_i$ occurs. For example, if the input sequence is $\langle d, b, d, b, d, a, c, d, b, a, c, b \rangle$, then we create four linked lists: $S_1$ for $a$, $S_2$ for $b$, $S_3$ for $c$, and $S_4$ for $d$. $S_1$ holds the indices where $a$ is accessed, and so $S_1 = \langle 6, 10 \rangle$. Similarly, $S_2 = \langle 2, 4, 9, 12 \rangle$, $S_3 = \langle 7, 11 \rangle$ and $S_4 = \langle 1, 3, 5, 8 \rangle$.
For each data element $r_i$, we first check whether there is already a linked list associated with $r_i$ and create a new linked list if not. We retrieve the linked list associated with $r_i$ and append $i$ to it, indicating that an access to $r_i$ occurs at access $i$.
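The preprocessing performed by $\text{INITIALIZE-CACHE}$ amounts to building, for each distinct element, the list of positions at which it is requested. A minimal Python sketch, using a dictionary in place of the hash table $H$ and Python lists in place of the linked lists $S$ (the function name is ours):

```python
from collections import defaultdict

def build_access_lists(requests):
    """Map each distinct element to the 1-based indices where it occurs,
    mirroring what INITIALIZE-CACHE stores in the lists S."""
    occurrences = defaultdict(list)
    for i, r in enumerate(requests, start=1):
        occurrences[r].append(i)
    return dict(occurrences)
```

For the example sequence above, this produces exactly the lists $S_1, \ldots, S_4$ from the text: `{'d': [1, 3, 5, 8], 'b': [2, 4, 9, 12], 'a': [6, 10], 'c': [7, 11]}`.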
ACCESS(r[i])
  // Compute the next access time for r[i].
  ind = HASH-SEARCH(r[i])
  time = ∞
  delete the head of S[ind]
  if S[ind] is not empty
    time = head of S[ind]
  // Check to see whether r[i] is currently in the cache.
  if TREE-SEARCH(T.root, r[i]) != NIL
    print "cache hit"
    // r[i]'s next access time has changed: update its key in P
    // (each node of T stores a handle to its element's entry in P).
    INCREASE-KEY(P, the entry for r[i] in P, time)
  elseif TREE-SIZE(T) < k
    // Insert r[i] into an empty slot in the cache.
    let z be a new node for T
    z.key = r[i]
    RB-INSERT(T, z)
    let event be a new object for P
    event.key = time
    event.value = r[i]
    INSERT(P, event)
    print "cache miss, inserted" r[i] "in empty slot"
  else
    event = MAXIMUM(P)
    if event.key ≤ time
      // r[i] itself has the furthest next access: do not cache it.
      print "cache miss, no data element evicted"
    else
      // Evict the element whose next access is furthest in the future.
      print "cache miss, evict data element" event.value
      event = EXTRACT-MAX(P)
      RB-DELETE(T, TREE-SEARCH(T.root, event.value))
      event.key = time
      event.value = r[i]
      INSERT(P, event)
      let z be a new node for T
      z.key = r[i]
      RB-INSERT(T, z)
The procedure $\text{ACCESS}$ takes an input $r_i$ and decides which element, if any, to evict from the cache. The first if condition properly sets time to the next access time of $r_i$: the head of the linked list associated with $r_i$ contains $i$; we remove this element from the list, and the new head (if any) contains the index of the next access to $r_i$. Then we check whether $r_i$ is already in the cache. On a cache hit, the next access time of $r_i$ has changed, and so its key in the max-priority queue must be updated to the new time. If $r_i$ is not in the cache, we check whether we can store $r_i$ in an empty slot. If there are no empty slots, we may have to evict the element with the furthest access time. We retrieve that element from the max-priority queue and compare its next access time with that of $r_i$. If $r_i$'s next access is sooner, we evict the element with the furthest access time from the cache (deleting it from both the tree and the priority queue) and insert $r_i$ into the tree and the priority queue; otherwise $r_i$ itself has the furthest next access, and we do not cache it at all.
Under the assumption of simple uniform hashing, the average-case running time of $\text{ACCESS}$ is $O(\lg k)$, since it performs a constant number of operations on the red-black tree, the priority queue, and the hash table. Because $\text{INITIALIZE-CACHE}$ runs in $\Theta(n)$ average-case time, the average-case running time of $\text{CACHE-MANAGER}$ is $O(n\lg k)$.
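For comparison, here is a self-contained Python sketch of the faster approach. Python's `heapq` provides only a min-heap with no increase-key operation, so this sketch negates keys to simulate a max-heap and refreshes a cached element's key lazily, pushing a fresh entry on each hit and discarding stale entries when they surface at the top; a plain set stands in for the red-black tree. All names are ours, and we assume $k \ge 1$.

```python
import heapq

def fif_cache_fast(requests, k):
    """Furthest-in-future simulation using a lazily updated max-heap.
    Returns the number of cache misses."""
    n = len(requests)
    INF = n + 1                      # stands in for "never accessed again"
    # Next-occurrence lists, as built by the preprocessing phase.
    occ = {}
    for i, r in enumerate(requests):
        occ.setdefault(r, []).append(i)
    ptr = {r: 0 for r in occ}        # next unread position in occ[r]

    def next_access(r, i):
        # Index of r's first occurrence after position i (monotone scan).
        lst, p = occ[r], ptr[r]
        while p < len(lst) and lst[p] <= i:
            p += 1
        ptr[r] = p
        return lst[p] if p < len(lst) else INF

    cache = set()
    heap = []          # entries (-next_access, element); stale entries allowed
    current_key = {}   # element -> its true current next-access time
    misses = 0
    for i, r in enumerate(requests):
        t = next_access(r, i)
        if r in cache:
            # Cache hit: refresh r's key by pushing a new entry;
            # the old entry becomes stale and is skipped later.
            current_key[r] = t
            heapq.heappush(heap, (-t, r))
            continue
        misses += 1
        if len(cache) < k:
            cache.add(r)
            current_key[r] = t
            heapq.heappush(heap, (-t, r))
            continue
        # Pop stale entries until the top reflects a current key.
        while True:
            key, x = heap[0]
            if x in cache and current_key[x] == -key:
                break
            heapq.heappop(heap)
        if -key <= t:
            continue               # r itself is furthest in future: skip it
        heapq.heappop(heap)        # evict x, the furthest-in-future element
        cache.discard(x)
        del current_key[x]
        cache.add(r)
        current_key[r] = t
        heapq.heappush(heap, (-t, r))
    return misses
```

Lazy invalidation trades the $O(\lg k)$ bound for an amortized $O(\lg n)$ per request (the heap can hold one entry per request), which is the usual compromise when the priority queue at hand lacks INCREASE-KEY.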
b. To show that the problem exhibits optimal substructure, we define the subproblem $(C, i)$ as the contents of the cache just before the $i$th request, where $C$ is a subset of the set of input data elements containing at most $k$ of them. A solution to $(C, i)$ is a sequence of decisions that specifies which element to evict (if any) for each request $i, i + 1, \ldots, n$. An optimal solution to $(C, i)$ is a solution that minimizes the number of cache misses.
Let $S$ be an optimal solution to $(C, i)$. Let $S'$ be the subsolution of $S$ for requests $i + 1, i + 2, \ldots, n$. If a cache hit occurs on the $i$th request, then the cache remains unchanged. If a cache miss occurs, then the $i$th request results in the contents of the cache changing to $C'$ (possibly with $C' = C$ if no element was evicted). In either case, let $C'$ denote the cache contents just before the $(i+1)$st request. We claim that $S'$ is an optimal solution to $(C', i + 1)$. Why? If $S'$ were not an optimal solution to $(C', i + 1)$, then there would exist another solution $S''$ to $(C', i + 1)$ that makes fewer cache misses than $S'$. Combining $S''$ with the decision of $S$ at the $i$th request, we would obtain another solution that makes fewer cache misses than $S$, which contradicts our assumption that $S$ is an optimal solution to $(C, i)$.
Suppose the $i$th request results in a cache miss. Let $P_C$ be the set of all cache states that can be reached from $C$ through a single decision of the cache manager. The set $P_C$ contains up to $k + 1$ states: $k$ of them arising from the different elements of the cache being evicted, and one arising from the decision to evict no element. For example, if $C = \{r_1, r_2, r_3\}$ and the requested data element is $r_4$, then
$$P_C = \{\{r_1, r_2, r_3\}, \{r_1, r_2, r_4\}, \{r_1, r_3, r_4\}, \{r_2, r_3, r_4\}\}.$$ Let $miss(C, i)$ denote the minimum number of cache misses for $(C, i)$. We can state a recurrence for $miss(C, i)$ as
$$ miss(C, i) = \begin{cases} 0 & \text{if $i = n$ and $r_n \in C$}, \\ 1 & \text{if $i = n$ and $r_n \notin C$}, \\ miss(C, i + 1) & \text{if $i < n$ and $r_i \in C$}, \\ 1 + \min\limits_{C' \in P_C} \{miss(C', i + 1)\} & \text{if $i < n$ and $r_i \notin C$}. \end{cases} $$
Thus, we conclude that the problem exhibits optimal substructure.
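The recurrence for $miss(C, i)$ can be checked directly with a memoized implementation over cache states. This is a sketch for tiny instances only (the state space is exponential in general); the function names are ours, and indices here are 0-based, so the base case $i = n$ of the recurrence becomes `i == n - 1`.

```python
from functools import lru_cache

def min_misses(requests, k):
    """Evaluate the recurrence miss(C, i) by memoized recursion.
    requests must be a tuple; the cache state C is a frozenset."""
    n = len(requests)

    @lru_cache(maxsize=None)
    def miss(C, i):
        r = requests[i]
        hit = r in C
        if i == n - 1:
            return 0 if hit else 1
        if hit:
            return miss(C, i + 1)
        # Successor states P_C: leave the cache unchanged, or (if there
        # is room) add r, or evict one element to make room for r.
        succ = [C]
        if len(C) < k:
            succ.append(C | {r})
        else:
            succ.extend((C - {x}) | {r} for x in C)
        return 1 + min(miss(D, i + 1) for D in succ)

    return miss(frozenset(), 0)
```

On the example sequence with $k = 3$, the minimum is 5 misses, matching what furthest-in-future achieves, as part (c) proves it must.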
c. To prove that the furthest-in-future strategy yields an optimal solution, we show that the problem exhibits the greedy-choice property. Combined with the optimal-substructure property from part (b), the greedy-choice property will prove that furthest-in-future produces the minimum possible number of cache misses.
We use the definitions of subproblem, solution, and optimal solution from part (b). Since we will be comparing different solutions, let us define $C_{Ai}$ as the state of the cache for solution $A$ just before the $i$th request. The following theorem is the key.
Theorem (Greedy-choice property)
Let $A$ be some optimal solution to $(C, i)$. Let $b$ be the element in $C_{Ai} \cup \{r_i\}$ whose next access at the time of the $i$th request is furthest in the future, say at time $m$. Then we can construct another solution $A'$ to $(C, i)$ that has the following properties:
1. On the $i$th request, $A'$ evicts $b$.
2. For $i + 1 \le j \le m$, the caches $C_{Aj}$ and $C_{A'j}$ differ by at most one element. If they differ, then $b$ is always the element in $C_{Aj}$ that is not in $C_{A'j}$. Equivalently, if $C_{Aj}$ and $C_{A'j}$ differ, we can write $C_{Aj} = D_j \cup \{b\}$ and $C_{A'j} = D_j \cup \{x\}$, where $D_j$ is a set of size $k - 1$ and $x \ne b$ is some data element.
3. For requests $i, \ldots, m - 1$, if $A$ has a cache hit, then $A'$ has a cache hit.
4. $C_{Aj} = C_{A'j}$ for $j > m$.
5. For requests $i, \ldots, m$, the number of cache misses produced by $A'$ is at most the number of cache misses produced by $A$.
Proof
If $A$ evicts $b$ at request $i$, then the proof of the theorem is trivial. Therefore, suppose $A$ evicts data element $a$ on request $i$, where $a \ne b$. We will prove the theorem by constructing $A'$ inductively for each request.
1. At request $i$, construct $A'$ to evict $b$ instead of $a$.

2. We proceed by induction on $j$, where $i + 1 \le j \le m$. The construction for property 1 establishes the base case, because $C_{A, i + 1}$ and $C_{A', i + 1}$ differ by just one element, and $b$ is the element in $C_{A, i + 1}$ that is not in $C_{A', i + 1}$. For the induction step, suppose that property 2 holds for some request $j$, where $i + 1 \le j < m$. If $A$ does not evict any element or evicts an element in $D_j$, then construct $A'$ to make the same decision on request $j$ as $A$ makes. If $A$ evicts $b$ on request $j$, then construct $A'$ to evict $x$ and keep the same element as $A$ keeps, namely $r_j$. This construction preserves property 2 for $j + 1$. Note that this construction might sometimes insert duplicate elements into the cache; this situation can easily be dealt with by introducing a dummy element for $x$.
3. Suppose that $A$ has a cache hit for request $j$, where $i \le j \le m - 1$, so that $r_j \in C_{Aj}$. If $C_{Aj} = C_{A'j}$, then $A'$ trivially has a cache hit as well. Otherwise, $C_{Aj} = D_j \cup \{b\}$, and since $r_j \ne b$ (no request before the $m$th is for $b$), we have $r_j \in D_j$. Thus $r_j \in C_{A'j}$, and $A'$ has a cache hit, too.
4. By property 2, the cache $C_{Am}$ differs from $C_{A'm}$ by at most one element, with $b$ being the element in $C_{Am}$ that might not be in $C_{A'm}$. If $C_{Am} = C_{A'm}$, then construct $A'$ to make the same decision on request $m$ as $A$ makes. Otherwise, $C_{Am} \ne C_{A'm}$ and $b \in C_{Am}$. Construct $A'$ to evict $x$ and bring $b$ into the cache on request $m$. Since the $m$th request is for element $b$ and $b \in C_{Am}$, $A$ has a cache hit, so that it does not evict any element. Thus, we can ensure that $C_{A, m + 1} = C_{A', m + 1}$. From the $(m + 1)$st request on, $A'$ simply makes the same decisions as $A$.
5. By property 3, for requests $i, \ldots, m - 1$, whenever we have a cache hit for $A$, we also have a cache hit for $A'$. Thus, we have to concern ourselves with only the $m$th request. If $A$ has a cache miss on the $m$th request, we are done. Otherwise, $A$ has a cache hit on the $m$th request, and we will prove that there exists at least one request $j$, where $i + 1 \le j \le m - 1$, such that the $j$th request results in a cache miss for $A$ and a cache hit for $A'$. Because $A$ evicts data element $a$ on request $i$, our construction of $A'$ gives $C_{A', i + 1} = D_{i + 1} \cup \{a\}$. The $m$th request is for data element $b$. Because none of the requests $i + 1, \ldots, m - 1$ is for $b$, $A$ could not have evicted $b$ and brought it back, and because $A$ has a cache hit on the $m$th request, $b \in C_{Am}$. Therefore, $A$ did not evict $b$ in any of requests $i, \ldots, m - 1$, and so, by our construction, $A'$ never evicts $x = a$. Furthermore, since $b$'s next access, at time $m$, is the furthest in the future among the elements of $C_{Ai} \cup \{r_i\}$ and $a$ belongs to this set, a request for $a$ must occur at least once before the $m$th request. Consider the first such request. At this request, $A$ has a cache miss (it evicted $a$ on request $i$ and has not brought it back) and $A'$ has a cache hit.
The above theorem and the optimal-substructure property proved in part (b) imply that furthest-in-future produces the minimum number of cache misses.