# 9.3 Selection in worst-case linear time

## 9.3-1

In the algorithm $\text{SELECT}$, the input elements are divided into groups of $5$. Will the algorithm work in linear time if they are divided into groups of $7$? Argue that $\text{SELECT}$ does not run in linear time if groups of $3$ are used.

For groups of $7$, the algorithm still works in linear time. The number of elements greater than $x$ (and similarly, the number less than $x$) is at least

$$4\Bigg(\Bigg\lceil \frac{1}{2} \Big\lceil \frac{n}{7} \Big\rceil\Bigg\rceil - 2\Bigg) \ge \frac{2n}{7} - 8,$$

and the recurrence becomes

$$T(n) \le T(\lceil n/7 \rceil) + T(5n/7 + 8) + O(n),$$

which can be shown to be $O(n)$ by substitution, as for the groups of $5$ case in the text.

For groups of $3$, however, the algorithm no longer works in linear time. The number of elements greater than $x$, and the number of elements less than $x$, is at least

$$2\Bigg(\Bigg\lceil \frac{1}{2} \Big\lceil \frac{n}{3} \Big\rceil\Bigg\rceil - 2\Bigg) \ge \frac{n}{3} - 4,$$

and the recurrence becomes

$$T(n) \le T(\lceil n / 3 \rceil) + T(2n / 3 + 4) + O(n),$$

which does not have a linear solution.

We can prove that the worst-case time for groups of $3$ is $\Omega(n\lg n)$. We do so by deriving a recurrence for a particular case that takes $\Omega(n\lg n)$ time.

In counting up the number of elements greater than $x$ (and similarly, the number less than $x$), consider the particular case in which there are exactly $\Big\lceil \frac{1}{2} \Big\lceil \frac{n}{3} \Big\rceil\Big\rceil$ groups with medians $\ge x$ and in which the "leftover" group does contribute 2 elements greater than $x$. Then the number of elements greater than $x$ is exactly $2\Big(\Big\lceil \frac{1}{2} \Big\lceil \frac{n}{3} \Big\rceil\Big\rceil - 1\Big) + 1$ (the $-1$ discounts $x$'s group, as usual, and the $+1$ is contributed by $x$'s group) $= 2\lceil n / 6 \rceil - 1$, and the recursive step for elements $\le x$ has $n - (2 \lceil n / 6 \rceil - 1) \ge n - (2(n / 6 + 1) - 1) = 2n / 3 - 1$ elements. Observe also that the $O(n)$ term in the recurrence is really $\Theta(n)$, since the partitioning in step 4 takes $\Theta(n)$ (not just $O(n)$) time. Thus, we get the recurrence

\begin{aligned} T(n) & \ge T(\lceil n / 3 \rceil) + T(2n / 3 - 1) + \Theta(n) \\ & \ge T(n / 3) + T(2n / 3 - 1) + \Theta(n), \end{aligned}

from which you can show that $T(n) \ge cn\lg n$ by substitution. You can also see that $T(n)$ is nonlinear by noticing that each level of the recursion tree sums to $n$.

[In fact, any odd group size $\ge 5$ works in linear time.]

## 9.3-2

Analyze $\text{SELECT}$ to show that if $n \ge 140$, then at least $\lceil n / 4 \rceil$ elements are greater than the median-of-medians $x$ and at least $\lceil n / 4 \rceil$ elements are less than $x$.

\begin{aligned} \frac{3n}{10} - 6 & \ge \lceil \frac{n}{4} \rceil \\ \frac{3n}{10} - 6 & \ge \frac{n}{4} + 1 \\ 12n - 240 & \ge 10n + 40 \\ n & \ge 140. \end{aligned}

## 9.3-3

Show how quicksort can be made to run in $O(n\lg n)$ time in the worst case, assuming that all elements are distinct.

A modification to quicksort that allows it to run in $O(n\lg n)$ time in the worst case uses the deterministic $\text{PARTITION}$ algorithm that was modified to take an element to partition around as an input parameter.

$\text{SELECT}$ takes an array $A$, the bounds $p$ and $r$ of the subarray in $A$, and the rank $i$ of an order statistic, and in time linear in the size of the subarray $A[p..r]$ it returns the ith smallest element in $A[p..r]$.

 1 2 3 4 5 6 7 BEST-CASE-QUICKSORT(A, p, r) if p < r i = floor((r - p + 1) / 2) x = SELECT(A, p, r, i) q = PARTITION(x) BEST-CASE-QUICKSORT(A, p, q - 1) BEST-CASE-QUICKSORT(A, q + 1, r) 

For an $n$-element array, the largest subarray that $\text{BEST-CASE-QUICKSORT}$ recurses on has $n / 2$ elements. This situation occurs when $n = r - p + 1$ is even; then the subarray $A[q + 1..r]$ has $n / 2$ elements, and the subarray $A[p..q - 1]$ has $n / 2 - 1$ elements.

Because $\text{BEST-CASE-QUICKSORT}$ always recurses on subarrays that are at most half the size of the original array, the recurrence for the worst-case running time is $T(n) \le 2T(n / 2) + \Theta(n) = O(n\lg n)$.

## 9.3-4 $\star$

Suppose that an algorithm uses only comparisons to find the $i$th smallest element in a set of $n$ elements. Show that it can also find the $i - 1$ smaller elements and $n - i$ larger elements without performing additional comparisons.

Create a graph with $n$ vertices and draw a directed edge from vertex $i$ to vertex $j$ if the $i$th and $j$th elements of the array are compared in the algorithm and we discover that $A[i] \ge A[j]$. Observe that $A[i]$ is one of the $i - 1$ smaller elements if there exists a path from $x$ to $i$ in the graph, and $A[i]$ is one of the $n - i$ larger elements if there exists a path from $i$ to $x$ in the graph. Every vertex $i$ must either lie on a path to or from $x$ because otherwise the algorithm can't distinguish between $i \le x$ and $i \ge x$. Moreover, if a vertex $i$ lies on both a path to $x$ and a path from $x$, then it must be such that $x \le A[i] \le x$,so $x = A[i]$. In this case, we can break ties arbitrarily.

## 9.3-5

Suppose that you have a "black-box" worst-case linear-time median subroutine. Give a simple, linear-time algorithm that solves the selection problem for an arbitrary order statistic.

We assume that are given a procedure $\text{MEDIAN}$ that takes as parameters an array $A$ and subarray indices $p$ and $r$, and returns the value of the median element of $A[p..r]$ in $O(n)$ time in the worst case.

Given $\text{MEDIAN}$, here is a linear-time algorithm $\text{SELECT}'$ for finding the $i$th smallest element in $A[p..r]$. This algorithm uses the deterministic $\text{PARTITION}$ algorithm that was modified to take an element to partition around as an input parameter.

  1 2 3 4 5 6 7 8 9 10 11 SELECT'(A, p, r, i) if p == r return A[p] x = MEDIAN(A, p, r) q = PARTITION(x) k = q - p + 1 if i == k return A[q] else if i < k return SELECT'(A, p, q - 1, i) else return SELECT'(A, q + 1, r, i - k) 

Because $x$ is the median of $A[p..r]$, each of the subarrays $A[p..q - 1]$ and $A[q + 1..r]$ has at most half the number of elements of $A[p..r]$. The recurrence for the worst-case running time of $\text{SELECT}'$ is $T(n) \le T(n / 2) + O(n) = O(n)$.

## 9.3-6

The $k$th quantiles of an $n$-element set are the $k - 1$ order statistics that divide the sorted set into $k$ equal-sized sets (to within $1$). Give an $O(n\lg k)$-time algorithm to list the $k$th quantiles of a set.

Pre-calculate the positions of the quantiles in $O(k)$, we use the $O(n)$ select algorithm to find the $\lfloor k / 2 \rfloor$th position, after that the elements are divided into two sets by the pivot the $\lfloor k / 2 \rfloor$th position, we do it recursively in the two sets to find other positions. Since the maximum depth is $\lceil \lg k \rceil$, the total running time is $O(n\lg k)$.

  1 2 3 4 5 6 7 8 9 10 PARTITION(A, p, r) x = A[r] i = p for k = p to r if A[k] < x i = i + 1 swap A[i] with A[k] i = i + 1 swap a[i] with a[r] return i 
 1 2 3 4 RANDOMIZED-PARTITION(A, p, r) x = RANDOM(p, r) swap A[x] with A[r] return PARTITION(A, p, r) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 RANDOMIZED-SELECT(A, p, r, i) while true if p == r return p, A[p] q = RANDOMIZED-PARTITION(A, p, r) k = q - p + 1 if i == k return q, A[q] if i < k r = q else p = q + 1 i = i - k 
  1 2 3 4 5 6 7 8 9 10 k-QUANTITLES-SUB(A, p, r, pos, f, e, quantiles) if f + 1 > e return mid = (f + e) / 2 q, val = RANDOMIZED-SELECT(A, p, r, pos[mid) quantiles[mid] = val k = q - p + 1 for i = mid + 1 to e pos[i] = pos[i] - k k-QUANTILES-SUB(A, q + 1, r, pos, mid + 1, e, quantiles) 
  1 2 3 4 5 6 7 8 9 10 11 k-QUANTITLES(A, k) num = A.size() / k mod = A.size() % k pos = num[1..k] for i = 1 to mod pos[i] = pos[i] + 1 for i = 1 to k pos[i] = pos[i] + pos[i - 1] quantiles = [1..k] k-QUANTITLES-SUB(A, 0, A.length, pos, 0, pos.size(), quantiles) return quantiles 

## 9.3-7

Describe an $O(n)$-time algorithm that, given a set $S$ of $n$ distinct numbers and a positive integer $k \le n$, determines the $k$ numbers in $S$ that are closest to the median of $S$.

Find the median in $O(n)$; create a new array, each element is the absolute value of the original value subtract the median; find the $k$th smallest number in $O(n)$, then the desired values are the elements whose absolute difference with the median is less than or equal to the $k$th smallest number in the new array.

## 9.3-8

Let $X[1..n]$ and $Y[1..n]$ be two arrays, each containing $n$ numbers already in sorted order. Give an $O(\lg n)$-time algorithm to find the median of all $2n$ elements in arrays $X$ and $Y$.

Let's start out by supposing that the median (the lower median, since we know we have an even number of elements) is in $X$. Let's call the median value $m$, and let's suppose that it's in $X[k]$. Then $k$ elements of $X$ are less than or equal to $m$ and $n - k$ elements of X are greater than or equal to m. We know that in the two arrays combined, there must be $n$ elements less than or equal to $m$ and $n$ elements greater than or equal to $m$, and so there must be $n - k$ elements of $Y$ that are less than or equal to $m$ and $n - (n - k)=k$ elements of $Y$ that are greater than or equal to $m$.

Thus, we can check that $X[k]$ is the lower median by checking whether $Y[n - k] \le X[k] \le Y[n - k + 1]$. A boundary case occurs for $k = n$. Then $n - k = 0$, and there is no array entry $Y[0]$; we only need to check that $X[n] \le Y[1]$.

Now, if the median is in $X$ but is not in $X[k]$, then the above condition will not hold. If the median is in $X[k']$ , where $k' < k$, then $X[k]$ is above the median, and $Y[n - k + 1] < X[k]$. Conversely, if the median is in $X[k'']$, where $k'' > k$, then $X[k]$ is below the median, and $X[k] < Y[n - k]$.

Thus, we can use a binary search to determine whether there is an $X[k]$ such that either $k < n$ and $X[n - k] \le Y[k] \le X[n - k + 1]$ or $k = n$ and $X[k] \le Y[n - k + 1]$; if we find such an $X[k]$, then it is the median. Otherwise, we know that the median is in $Y$, and we use a binary search to find a $Y[k]$ such that either $k < n$ and $X[n - k] \le Y[k] \le X[n - k + 1]$ or $k = n$ is the median. Since each binary search takes $O(\lg n)$ time, we spend a total of $O(\lg n)$ time.

Here's how we write the algorithm in pseudocode:

 1 2 3 4 5 6 TWO-ARRAY-MEDIAN(X, Y) n = X.length // n also equals Y.length median = FIND-MEDIAN(X, Y, n, 1, n) if median == NOT-FOUND median = FIND-MEDIAN(Y, X, n, 1, n) return median 
  1 2 3 4 5 6 7 8 9 10 11 FIND-MEDIAN(A, B, n, low, hign) if low > high return NOT-FOUND else k = floor((low + high) / 2) if k == n and A[n] ≤ B[1] return A[n] else if k < n and B[n - k] ≤ A[k] ≤ B[n - k + 1] return A[k] else A[k] > B[n - k + 1] return FIND-MEDIAN(A, B, n, low, k - 1) else return FIND-MEDIAN(A, B, n, k + 1, high) 

## 9.3-9

Professor Olay is consulting for an oil company, which is planning a large pipeline running east to west through an oil field of $n$ wells. The company wants to connect a spur pipeline from each well directly to the main pipeline along a shortest route (either north or south), as shown in Figure 9.2. Given the $x$- and $y$-coordinates of the wells, how should the professor pick the optimal location of the main pipeline, which would be the one that minimizes the total length of the spurs? Show how to determine the optimal location in linear time.

In order to find the optimal placement for Professor Olay's pipeline, we need only find the median(s) of the $y$-coordinates of his oil wells, as the following proof explains.

Claim

The optimal $y$-coordinate for Professor Olay's east-west oil pipeline is as follows:

• If $n$ is even, then on either the oil well whose $y$-coordinate is the lower median or the one whose $y$-coordinate is the upper median, or anywhere between them.
• If $n$ is odd, then on the oil well whose $y$-coordinate is the median.

Proof

We examine various cases. In each case, we will start out with the pipeline at a particular $y$-coordinate and see what happens when we move it. We'll denote by $s$ the sum of the north-south spurs with the pipeline at the starting location, and $s'$ will denote the sum after moving the pipeline.

We start with the case in which n is even. Let us start with the pipeline somewhere on or between the two oil wells whose $y$-coordinates are the lower and upper medians. If we move the pipeline by a vertical distance $d$ without crossing either of the median wells, then $n / 2$ of the wells become $d$ farther from the pipeline and $n / 2$ become $d$ closer, and so $s' = s + dn / 2 - dn / 2 = s$; thus, all locations on or between the two medians are equally good.

Now suppose that the pipeline goes through the oil well whose $y$-coordinate is the upper median. What happens when we increase the $y$-coordinate of the pipeline by $d > 0$ units, so that it moves above the oil well that achieves the upper median? All oil wells whose $y$-coordinates are at or below the upper median become d units farther from the pipeline, and there are at least $n / 2 + 1$ such oil wells (the upper median, and every well at or below the lower median). There are at most $n / 2 - 1$ oil wells whose $y$-coordinates are above the upper median, and each of these oil wells becomes at most d units closer to the pipeline when it moves up. Thus, we have a lower bound on $s'$ of $s' \ge s + d(n / 2 + 1) - d(n / 2 - 1) = s + 2d > s$. We conclude that moving the pipeline up from the oil well at the upper median increases the total spur length. A symmetric argument shows that if we start with the pipeline going through the oil well whose $y$-coordinate is the lower median and move it down, then the total spur length increases.

We see, therefore, that when $n$ is even, an optimal placement of the pipeline is anywhere on or between the two medians.

Now we consider the case when $n$ is odd. We start with the pipeline going through the oil well whose $y$-coordinate is the median, and we consider what happens when we move it up by $d > 0$ units. All oil wells at or below the median become $d$ units farther from the pipeline, and there are at least $(n + 1) / 2$ such wells (the one at the median and the $(n - 1) / 2$ at or below the median. There are at most $(n - 1) / 2$ oil wells above the median, and each of these becomes at most d units closer to the pipeline. We get a lower bound on $s'$ of $s' \ge s + d(n + 1) / 2 - d(n - 1) / 2 = s + d > s$, and we conclude that moving the pipeline up from the oil well at the median increases the total spur length. A symmetric argument shows that moving the pipeline down from the median also increases the total spur length, and so the optimal placement of the pipeline is on the median. (claim)

Since we know we are looking for the median, we can use the linear-time median-finding algorithm.