9-2 Weighted median

For $n$ distinct elements $x_1, x_2, \ldots, x_n$ with positive weights $w_1, w_2, \ldots, w_n$ such that $\sum_{i = 1}^n w_i = 1$, the weighted (lower) median is the element $x_k$ satisfying

$$\sum_{x_i < x_k} w_i < \frac{1}{2}$$

and

$$\sum_{x_i > x_k} w_i \le \frac{1}{2}.$$

For example, if the elements are $0.1, 0.35, 0.05, 0.1, 0.15, 0.05, 0.2$ and each element equals its weight (that is, $w_i = x_i$ for $i = 1, 2, \ldots, 7$), then the median is $0.1$, but the weighted median is $0.2$.

a. Argue that the median of $x_1, x_2, \ldots, x_n$ is the weighted median of the $x_i$ with weights $w_i = 1 / n$ for $i = 1, 2, \ldots, n$.

b. Show how to compute the weighted median of $n$ elements in $O(n\lg n)$ worstcase time using sorting.

c. Show how to compute the weighted median in $\Theta(n)$ worst-case time using a linear-time median algorithm such as $\text{SELECT}$ from Section 9.3.

The post-office location problem is defined as follows. We are given $n$ points $p_1, p_2, \ldots, p_n$ with associated weights $w_1, w_2, \ldots, w_n$. We wish to find a point $p$ (not necessarily one of the input points) that minimizes the sum $\sum_{i = 1}^n w_i d(p, p_i)$, where $d(a, b)$ is the distance between points $a$ and $b$.

d. Argue that the weighted median is a best solution for the $1$-dimensional postoffice location problem, in which points are simply real numbers and the distance between points $a$ and $b$ is $d(a, b) = |a - b|$.

e. Find the best solution for the $2$-dimensional post-office location problem, in which the points are $(x,y)$ coordinate pairs and the distance between points $a = (x_1, y_1)$ and $b = (x_2, y_2)$ is the Manhattan distance given by $d(a, b) = |x_1 - x_2| + |y_1 - y_2|$.

a. The median $x$ of the elements $x_1, x_2, \ldots, x_n$, is an element $x = x_k$ satisfying $|\{x_i: 1\le i\le n \text{ and } x_i < x\}| \le n / 2$ and $|\{x_i: 1 \le i \le n \text{ and } x_i > x\}| \le n / 2$. If each element $x_i$ is assigned a weight $x_i = 1 / n$, then we get

$$ \begin{aligned} \sum_{x_i < x} w_i & = \sum_{x_i < x} \frac{1}{n} \\ & = \frac{1}{n} \cdot \sum_{x_i < x} 1 \\ & = \frac{1}{n} \cdot |{x_i: 1\le i\le n\text{ and } x_i < x}| \\ & \le \frac{1}{n} \cdot \frac{n}{2} \\ & = \frac{1}{2}, \end{aligned} $$

and

$$ \begin{aligned} \sum_{x_i > x} w_i & = \sum_{x_i > x} \frac{1}{n} \\ & = \frac{1}{n} \cdot \sum_{x_i > x} 1 \\ & = \frac{1}{n} \cdot |{x_i: 1\le i\le n\text{ and } x_i > x}| \\ & \le \frac{1}{n} \cdot \frac{n}{2} \\ & = \frac{1}{2}, \end{aligned} $$

which proves that $x$ is also the weighted median of $x_1, x_2, \ldots, x_n$ with weights $x_i = 1 / n$, for $i = 1, 2, \ldots, n$.

b. We first sort the $n$ elements into increasing order by $x_i$ values. Then we scan the array of sorted $x_i$'s, starting with the smallest element and accumulating weights as we scan, until the total exceeds $1 / 2$. The last element, say $x_k$, whose weight caused the total to exceed $1 / 2$, is the weighted median. Notice that the total weight of all elements smaller than $x_k$ is less than $1 / 2$, because $x_k$ was the first element that caused the total weight to exceed $1 / 2$. Similarly, the total weight of all elements larger than $x_k$ is also less than $1 / 2$, because the total weight of all the other elements exceeds $1 / 2$.

The sorting phase can be done in $O(n\lg n)$ worst-case time (using merge sort or heapsort), and the scanning phase takes $O(n)$ time. The total running time in the worst case, therefore, is $O(n\lg n)$.

c. We find the weighted median in $\Theta(n)$ worst-case time using the $\Theta(n)$ worst-case median algorithm in Section 9.3. (Although the first paragraph of the section only claims an $O(n)$ upper bound, it is easy to see that the more precise running time of $\Theta(n)$ applies as well, since steps 1, 2, and 4 of $\text{SELECT}$ actually take $\Theta(n)$ time.)

The weighted-median algorithm works as follows. If $n \le 2$, we just return the brute-force solution. Otherwise, we proceed as follows. We find the actual median $x_k$ of the $n$ elements and then partition around it. We then compute the total weights of the two halves. If the weights of the two halves are each strictly less than $1 / 2$, then the weighted median is $x_k$ . Otherwise, the weighted median should be in the half with total weight exceeding $1 / 2$. The total weight of the "light" half is lumped into the weight of $x_k$ , and the search continues within the half that weighs more than $1 / 2$. Here's pseudocode, which takes as input a set $X = \{x_1, x_2, \ldots, x_n\}$:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
WEIGHTED-MEDIAN(X)
    if n == 1
        return x[1]
    else if n == 2
        if w[1]  w[2]
            return x[1]
        else return x[2]
    else find the median x[k] of X = {x[1], x[2], ..., x[n]}
        partition the set X around x[k]
        compute W[L] = sum(x[i] < x[k]) w[i] and W[G] = sum(x[i] > x[k]) w[i]
        if W[L] < 1 / 2 and W[G] < 1 / 2
            return x[k]
        else if W[L] > 1 / 2
            w[k] = w[k] + W[G]
            X' = {x[i]  X: x[i]  x[k]}
            return WEIGHTED-MEDIAN(X')
        else w[k] = w[k] + W[L]
            X' = {x[i]  X: x[i]  x[k]}
            return WEIGHTED-MEDIAN(X')

The recurrence for the worst-case running time of $\text{WEIGHTED-MEDIAN}$ is $T(n) = T(n / 2 + 1) + \Theta(n)$, since there is at most one recursive call on half the number of elements, plus the median element $x_k$, and all the work preceding the recursive call takes $\Theta(n)$ time. The solution of the recurrence is $T (n) = \Theta(n)$.

d. Let the $n$ points be denoted by their coordinates $x_1, x_2, \ldots, x_n$, let the corresponding weights be $w_1, w_2, \ldots, w_n$, and let $x = x_k$ be the weighted median. For any point $p$, let $f(p) = \sum_{i = 1}^n w_i |p - x_i|$; we want to find a point $p$ such that $f(p)$ is minimum. Let $y$ be any point (real number) other than $x$. We show the optimality of the weighted median $x$ by showing that $f(y) - f(x) \ge 0$. We examine separately the cases in which $y > x$ and $x > y$. For any $x$ and $y$, we have

$$ \begin{aligned} f(y) - f(x) & = \sum_{i = 1}^n w_i |y - x_i| - \sum_{i = 1}^n w_i |x - x_i| \\ & = \sum_{i = 1}^n w_i (|y - x_i| - |x - x_i|). \end{aligned} $$

When $y > x$, we bound the quantity $|y - x_i| - |x - x_i|$ from below by examining three cases:

  1. $x < y \le x_i$: Here, $|x - y| + |y - x_i| = |x - x_i|$ and $|x - y| = y - x$, which imply that $|y - x_i| - |x - x_i| = -|x - y| = x - y$.
  2. $x < x_i \le y$: Here, $|y - x_i| \ge 0$ and $|x_i - x| \le y - x$, which imply that $|y - x_i| - |x - x_i| \ge -(y - x) = x - y$.
  3. $x_i \le x < y$: Here, $|x - x_i| + |y - x| = |y - x_i|$ and $|y - x| = y - x$, which imply that $y - x_i| - |x - x_i| = |y - x| = y - x$.

Separating out the first two cases, in which $x < x_i$, from the third case, in which $x \ge x_i$, we get

$$ \begin{aligned} f(y) - f(x) & = \sum_{i = 1}^n w_i(|y - x_i| - |x - x_i|) \\ & \ge \sum_{x < x_i} w_i (x - y) + \sum_{x \ge x_i} w_i (y - x) \\ & = (y - x) \Bigg(\sum_{x \ge x_i} w_i - \sum_{x < x_i} w_i \Bigg). \end{aligned} $$

The property that $\sum_{x_i < x} w_i < 1 / 2$ implies that $\sum_{x \ge x_i} w_i \ge 1 / 2$. This fact, combined with $y - x > 0$ and $\sum_{x < x_i} w_i \le 1 / 2$, yields that $f(y) - f(x) \ge 0$.

When $x > y$, we bound the quantity $|y - x_i| - |x - x_i|$ from below by examining three cases:

  1. $x_i \le y < x$: Here, $|y - x_i| + |x - y| = |x - x_i|$ and $|x - y| = x - y$, which imply that $|y - x_i| - |x - x_i| = -|x - y| = y - x$.
  2. $y \le x_i < x$: Here, $|y - x_i| \ge 0$ and $|x - x_i| \le x - y$, which imply that $|y - x_i| - |x - x_i| \ge -(x - y) = y - x$.
  3. $y < x \le x_i$: Here, $|x - y| + |x - x_i| = |y - x_i|$ and $|x - y| = x - y$, which imply that $y - x_i| - |x - x_i| = |x - y| = x - y$.

Separating out the first two cases, in which $x > x_i$, from the third case, in which $x \le x_i$, we get

$$ \begin{aligned} f(y) - f(x) & = \sum_{i = 1}^n w_i(|y - x_i| - |x - x_i|) \\ & \ge \sum_{x > x_i} w_i(y - x) + \sum_{x \ge x_i} w_i(x - y) \\ & = (x - y) \Bigg(\sum_{x \le x_i} w_i - \sum_{x > x_i} w_i \Bigg). \end{aligned} $$

The property that $sum_{x_i > x} w_i \le 1 / 2$ implies that $\sum_{x \le x_i} w_i > 1 / 2$. This fact, combined with $x - y > 0$ and $\sum_{x > x_i} w_i < 1 / 2$, yields that $f(y) - f(x) > 0$.

e. We are given $n$ $2$-dimensional points $p_1, p_2, \ldots, p_n$, where each $p_i$ is a pair of real numbers $p_i = (x_i, y_i)$, and positive weights $w_1, w_2, \ldots, w_n$. The goal is to find a point $p = (x, y)$ that minimizes the sum

$$f(x, y) = \sum_{i = 1}^n w_i(|x - x_i| + |y - y_i|).$$

We can express the cost function of the two variables, $f(x, y)$, as the sum of two functions of one variable each: $f(x, y) = g(x) + h(y)$, where $g(x) = \sum_{i = 1}^n w_i |x - x_i|$, and $h(y) = \sum_{i = 1}^n w_i |y - y_i|$. The goal of finding a point $p = (x, y)$ that minimizes the value of $f(x, y)$ can be achieved by treating each dimension independently, because $g$ does not depend on $y$ and $h$ does not depend on $x$. Thus,

$$ \begin{aligned} \min_{x, y} f(x, y) & = \min_{x, y} (g(x) + h(y)) \\ & = \min_x \Big(\min_y(g(x) + h(y))\Big) \\ & = \min_x \Big(g(x) + \min_y h(y)\Big) \\ & = \min_x g(x) + \min_y h(y). \end{aligned} $$

Consequently, finding the best location in $2$ dimensions can be done by finding the weighted median $x_k$ of the $x$-coordinates and then finding the weighted median $y_j$ of the $y$-coordinates. The point $(x_k, y_j)$ is an optimal solution for the $2$-dimensional post-office location problem.