12-3 Average node depth in a randomly built binary search tree

In this problem, we prove that the average depth of a node in a randomly built binary search tree with $n$ nodes is $O(\lg n)$. Although this result is weaker than that of Theorem 12.4, the technique we shall use reveals a surprising similarity between the building of a binary search tree and the execution of $\text{RANDOMIZED-QUICKSORT}$ from Section 7.3.

We define the total path length $P(T)$ of a binary tree $T$ as the sum, over all nodes $x$ in $T$, of the depth of node $x$, which we denote by $d(x, T)$.

a. Argue that the average depth of a node in $T$ is

$$\frac{1}{n} \sum_{x \in T} d(x, T) = \frac{1}{n} P(T).$$

Thus, we wish to show that the expected value of $P(T)$ is $O(n\lg n)$.

b. Let $T_L$ and $T_R$ denote the left and right subtrees of tree $T$, respectively. Argue that if $T$ has $n$ nodes, then

$$P(T) = P(T_L) + P(T_R) + n - 1.$$

c. Let $P(n)$ denote the average total path length of a randomly built binary search tree with n nodes. Show that

$$P(n) = \frac{1}{n} \sum_{i = 0}^{n - 1} (P(i) + P(n - i - 1) + n - 1).$$

d. Show how to rewrite $P(n)$ as

$$P(n) = \frac{2}{n} \sum_{k = 1}^{n - 1} P(k) + \Theta(n).$$

e. Recalling the alternative analysis of the randomized version of quicksort given in Problem 7-3, conclude that $P(n) = O(n\lg n)$. At each recursive invocation of quicksort, we choose a random pivot element to partition the set of elements being sorted. Each node of a binary search tree partitions the set of elements that fall into the subtree rooted at that node.

f. Describe an implementation of quicksort in which the comparisons to sort a set of elements are exactly the same as the comparisons to insert the elements into a binary search tree. (The order in which comparisons are made may differ, but the same comparisons must occur.)

a. The total path length $P(T)$ is defined as $\sum_{x \in T} d(x, T)$. Dividing both quantities by $n$ gives the desired equation.

b. For any node $x$ in $T_L$, we have $d(x, T_L) = d(x, T) - 1$, since the distance to the root of $T_L$ is one less than the distance to the root of $T$. Similarly, for any node $x$ in $T_R$, we have $d(x, T_R) = d(x, T) - 1$. Thus if $T$ has $n$ nodes, we have

$$P(T) = P(T_L) + P(T_R) + n - 1,$$

since each of the $n$ nodes of $T$ (except the root) is in either $T_L$ or $T_R$.

c. If $T$ is a randomly built binary search tree, then the root is equally likely to be any of the $n$ elements in the tree, since the root is the first element inserted. It follows that the number of nodes in subtree $T_L$ is eqaully likely to be any integer in the set $\{0, 1, \ldots, n - 1\}$. The definition of $P(n)$ as the average total path length of a randomly built binary search tree, along with part (b), gives us the recurrence

$$P(n) = \frac{1}{n} \sum_{i = 0}^{n - 1} (P(i) + P(n - i - 1) + n - 1).$$

d. Since $P(0) = 0$, and since for $k = 1, 2, \ldots, n - 1$, each term $P(k)$ in the summation appears once as $P(i)$ and once as $P(n - i - 1)$, we can rewrite the equation from part (c) as

$$P(n) = \frac{2}{n} \sum_{k = 1}^{n - 1} P(k) + \Theta(n).$$

e. Observe that if, in the recurrence $\text{(7.6)}$ in part (c) Problem 7-3, we replace $\text E[T(\cdot)]$ by $P(\cdot)$ and we replace $q$ by $k$, we get almost the same recurrence as in part (d) of Problem 12-3. The remaining difference is that in Problem 12-3(d), the summation starts at $1$ rather than $2$. Observe, however, that a binary tree with just one node has a total path length of $0$, so that $P(1) = 0$. Thus, we can rewrite the recurrence in Problem 12-3(d) as

$$P(n) = \frac{2}{n} \sum_{k = 2}^{n - 1} P(k) + \Theta(n)$$

and use the same technique as was used in Problem 7-3 to solve it.

We start by solving part (d) of Problem 7-3: showing that

$$\sum_{k = 2}^{n - 1} k\lg k \le \frac{1}{2}n^2\lg n - \frac{1}{8}n^2.$$

Following the hint in Problem 7-3(d), we split the summation into two parts:

$$\sum_{k = 2}^{n - 1} k\lg k = \sum_{k = 2}^{\lceil n / 2\rceil - 1} k\lg k + \sum_{k = \lceil n / 2\rceil}^{n - 1} k\lg k.$$

The $\lg k$ in the first summation on the right is less than $\lg(n / 2) = \lg n - 1$, and the $\lg k$ in the second summation is less than $\lg n$. Thus,

$$ \begin{aligned} \sum_{k = 2}^{n - 1} k\lg k & < (\lg n - 1) \sum_{k = 2}^{\lceil n / 2\rceil - 1} k + \lg n \sum_{k = \lceil n / 2\rceil}^{n - 1} k \\ & = \lg n \sum_{k = 2}^{n - 1} k - \sum_{k = 2}^{\lceil n / 2 \rceil - 1} k \\ & \le \frac{1}{2} n(n - 1)\lg n - \frac{1}{2}\Big(\frac{n}{1} - 1\Big) \frac{n}{2} \\ & \le \frac{1}{2} n^2\lg n - \frac{1}{8} n^2 \end{aligned} $$

if $n \ge 2$.

Now we show that the recurrence

$$P(n) = \frac{2}{n} \sum_{k = 2}^{n - 1} P(k) + \Theta(n)$$

has the solution $P(n) = O(n\lg n)$. We use the substitution method. Assume inductively that $P(n) \le an\lg n + b$ for some positive constants $a$ and $b$ to be determined. We can pick $a$ and $b$ sufficiently large so that $an\lg n + b \ge P(1)$. Then, for $n > 1$, we have by substitution

$$ \begin{aligned} P(n) & = \frac{2}{n} \sum_{k = 2}^{n - 1} P(k) + \Theta(n) \\ & \le \frac{2}{n} \sum_{k = 2}^{n - 1} (ak\lg k + b) + \Theta(n) \\ & = \frac{2a}{n} \sum_{k = 2}^{n - 1} k\lg k + \frac{2b}{n} (n - 2) + \Theta(n) \\ & \le \frac{2a}{n} \Big(\frac{1}{2} n^2\lg n - \frac{1}{8}n^2\Big) + \frac{2b}{n}(n - 2) + \Theta(n) \\ & \le an\lg n - \frac{a}{4}n + 2b + \Theta(n) \\ & = an\lg n + b + \Big(\Theta(n) + b - \frac{a}{4}n\Big) \\ & \le an\lg n + b, \end{aligned} $$

since we can choose $a$ large enough so that $\frac{a}{4}n$ dominates $\Theta(n) + b$. Thus, $P(n) = O(n\lg n)$.

f. We draw an analogy between inserting an element into a subtree of a binary search tree and sorting a subarray in quicksort. Observe that once an element $x$ is chosen as the root of a subtree $T$, all elements that will be inserted after $x$ into $T$ will be compared to $x$. Similarly, observe that once an element $y$ is chosen as the pivot in a subarray $S$, all other elements in $S$ will be compared to $y$. Therefore, the quicksort implementation in which the comparisons are the same as those made when inserting into a binary search tree is simply to consider the pivots in the same order as the order in which the elements are inserted into the tree.