Official website.
Homework 1
Problem 1 (Expectation as the optimal estimator)
Let $X$ be a random variable with finite expectation. Show that the function $f(a) = \mathbb{E}(X - a)^2$ is minimized at $a = \mathbb{E}X$.
$$\mathbb{E}(X-a)^2=\mathbb{E}X^2-2a\,\mathbb{E}X+a^2=(a-\mathbb{E}X)^2+\mathbb{E}X^2-(\mathbb{E}X)^2,$$
and since the last two terms do not depend on $a$, the function is minimized exactly at $a=\mathbb{E}X$.
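A quick numerical sanity check of this fact (a minimal sketch using NumPy; the exponential distribution is just an arbitrary choice of a random variable with finite second moment):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)      # arbitrary X with E X = 2

a_grid = np.linspace(0.0, 4.0, 401)
f_vals = [np.mean((x - a) ** 2) for a in a_grid]  # empirical f(a) = E (X - a)^2

print(a_grid[int(np.argmin(f_vals))], x.mean())   # both should be close to E X = 2
```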
Problem 2 (Expectation of random vectors)
Let $X$ and $Y$ be random vectors in $\mathbb{R}^n$.
(a)
(Linearity) Prove that for any constants $a, b\in \mathbb{R}$, we have
$$\mathbb{E}(aX + bY) = a\,\mathbb{E}(X) + b\,\mathbb{E}(Y).$$
Working coordinatewise and using linearity of expectation for scalar random variables,
$$\mathbb{E}(aX+bY)=\bigl(\mathbb{E}(aX_i+bY_i)\bigr)_{i=1}^n=\bigl(a\,\mathbb{E}(X_i)+b\,\mathbb{E}(Y_i)\bigr)_{i=1}^n=a\,\mathbb{E}(X)+b\,\mathbb{E}(Y).$$
(b)
(Multiplicativity) Prove that if $X$ and $Y$ are independent, then
$$\mathbb{E}\langle X, Y\rangle = \langle \mathbb{E}X, \mathbb{E}Y\rangle.$$
This generalizes the identity $\mathbb{E}(XY) = (\mathbb{E}X)(\mathbb{E}Y)$ for independent random variables.
Since $X$ and $Y$ are independent, each coordinate $X_i$ is independent of $Y_i$, so
$$\mathbb{E}\langle X, Y\rangle=\mathbb{E}\Bigl(\sum_{i=1}^n X_iY_i\Bigr)=\sum_{i=1}^n \mathbb{E}X_i\,\mathbb{E}Y_i=\langle \mathbb{E}X,\mathbb{E}Y\rangle.$$
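A minimal numerical check of both parts (a sketch; the particular distributions and constants below are arbitrary choices, not part of the problem):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 200_000
X = rng.normal(loc=1.0, scale=2.0, size=(N, n))   # E X = (1, ..., 1)
Y = rng.exponential(scale=1.5, size=(N, n))       # independent of X, E Y = (1.5, ..., 1.5)

a, b = 3.0, -2.0
print((a * X + b * Y).mean(axis=0))               # linearity: each entry ≈ a*1.0 + b*1.5 = 0
print(np.mean(np.sum(X * Y, axis=1)))             # multiplicativity: ≈ <EX, EY> = n * 1.0 * 1.5 = 7.5
```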
Problem 3 (Expectation as optimal estimator: random vectors)
Let $X$ be a random vector with finite expectation. Show that the function $f(a) = \mathbb{E}\Vert X - a\Vert_2^2$ is minimized at $a = \mathbb{E}X$.
$$\mathbb{E}\Vert X-a\Vert^2_2=\mathbb{E}\langle X-a,X-a\rangle=\sum_{i=1}^n \mathbb{E}(X_i-a_i)^2.$$
By Problem 1, the $i$-th summand is minimized at $a_i=\mathbb{E}X_i$, and the summands depend on disjoint coordinates of $a$, so the sum is minimized at $a=\mathbb{E}X$.
Problem 4 (A variance-like identity)
For a random vector $X$ in $\mathbb{R}^n$, define $V(X) = \mathbb{E}\Vert X - \mathbb{E}X\Vert_2^2$.
(a)
Prove that
$$V(X) = \mathbb{E}\Vert X\Vert_2^2 - \Vert \mathbb{E}X\Vert_2^2.$$
$$\begin{aligned}
V(X)&=\mathbb{E}\Vert X-\mathbb{E}X\Vert^2_2=\mathbb{E}\langle X-\mathbb{E}X,X-\mathbb{E}X\rangle=\mathbb{E}\sum_{i=1}^n (X_i-\mathbb{E}X_i)^2\\
&=\sum_{i=1}^n \bigl(\mathbb{E}X_i^2-(\mathbb{E}X_i)^2\bigr)=\mathbb{E}\Vert X\Vert^2_2-\Vert\mathbb{E}X\Vert_2^2.
\end{aligned}$$
(b)
Prove that
$$V(X) = \frac{1}{2}\mathbb{E}\Vert X - X_0\Vert_2^2,$$
where $X_0$ is an independent copy of $X$.
Since $X_0$ has the same distribution as $X$ and is independent of it, $\mathbb{E}(X_i-X_{0,i})^2 = 2\,\mathbb{E}X_i^2 - 2\,\mathbb{E}(X_iX_{0,i})$ and $\mathbb{E}(X_iX_{0,i}) = (\mathbb{E}X_i)^2$, so
$$\frac{1}{2}\mathbb{E}\Vert X-X_0\Vert_2^2=\frac{1}{2}\sum_{i=1}^n\mathbb{E}(X_i-X_{0,i})^2=\sum_{i=1}^n\bigl(\mathbb{E}X_i^2-(\mathbb{E}X_i)^2\bigr)=\mathbb{E}\Vert X\Vert^2_2-\Vert\mathbb{E}X\Vert_2^2=V(X),$$
where the last equality is part (a).
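Both identities are easy to check by simulation (a sketch; the gamma distribution below is an arbitrary choice with independent coordinates):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 4, 400_000
X = rng.gamma(shape=2.0, scale=1.0, size=(N, n))    # samples of X
X0 = rng.gamma(shape=2.0, scale=1.0, size=(N, n))   # samples of an independent copy X_0

EX = X.mean(axis=0)
print(np.mean(np.sum((X - EX) ** 2, axis=1)))              # V(X) = E || X - E X ||_2^2
print(np.mean(np.sum(X ** 2, axis=1)) - np.sum(EX ** 2))   # part (a): E ||X||^2 - ||E X||^2
print(0.5 * np.mean(np.sum((X - X0) ** 2, axis=1)))        # part (b): (1/2) E || X - X_0 ||^2
```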
Problem 5 (Additivity of variance)
(a)
Check that if $X$ and $Y$ are independent random vectors in $\mathbb{R}^n$ with zero means, then
$$\mathbb{E}\Vert X+Y\Vert_2^2=\mathbb{E}\Vert X\Vert_2^2+\mathbb{E}\Vert Y\Vert_2^2.$$
$$\mathbb{E}\Vert X+Y\Vert^2_2=\sum_{i=1}^n \mathbb{E}(X_i+Y_i)^2=\sum_{i=1}^n \bigl(\mathbb{E}X_i^2+2\,\mathbb{E}X_i\,\mathbb{E}Y_i+\mathbb{E}Y_i^2\bigr)=\mathbb{E}\Vert X\Vert_2^2+\mathbb{E}\Vert Y\Vert_2^2,$$
where we used independence ($\mathbb{E}(X_iY_i)=\mathbb{E}X_i\,\mathbb{E}Y_i$) and the fact that the means are zero.
(b)
Prove that if random vectors $X_1,\cdots,X_k$ in $\mathbb{R}^n$ are independent, then
$$V(X_1 +\cdots + X_k) = V(X_1) +\cdots + V(X_k),$$
where $V(\cdot)$ is defined as $V(X) = \mathbb{E}\Vert X - \mathbb{E}X\Vert_2^2$.
Using part (a) of Problem 4 and independence (which gives $\mathbb{E}(X_{j,i}X_{l,i})=\mathbb{E}X_{j,i}\,\mathbb{E}X_{l,i}$ for $j\neq l$),
$$\begin{aligned}
V(X_1+X_2+\cdots+X_k)&=\mathbb{E}\Bigl\Vert \sum_{j=1}^k X_j\Bigr\Vert_2^2-\Bigl\Vert \mathbb{E}\sum_{j=1}^k X_j\Bigr\Vert_2^2\\
&=\mathbb{E}\sum_{i=1}^n \Bigl(\sum_{j=1}^k X_{j,i}\Bigr)^2-\sum_{i=1}^n \Bigl(\sum_{j=1}^k \mathbb{E}X_{j,i}\Bigr)^2\\
&=\sum_{i=1}^n \Bigl(\sum_{j=1}^k\mathbb{E}X_{j,i}^2+2\sum_{j<l} \mathbb{E}X_{j,i}\,\mathbb{E}X_{l,i}-\sum_{j=1}^k (\mathbb{E}X_{j,i})^2-2\sum_{j<l} \mathbb{E}X_{j,i}\,\mathbb{E}X_{l,i}\Bigr)\\
&=\sum_{j=1}^k \sum_{i=1}^n \bigl(\mathbb{E}X_{j,i}^2-(\mathbb{E}X_{j,i})^2\bigr)=\sum_{j=1}^k V(X_j).
\end{aligned}$$
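A quick simulation of the additivity claim (a sketch; the distributions below are arbitrary independent choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, N = 3, 4, 300_000
Xs = [rng.normal(loc=j, scale=j + 1.0, size=(N, n)) for j in range(k)]  # independent vectors X_1,...,X_k

def V(samples):
    m = samples.mean(axis=0)
    return np.mean(np.sum((samples - m) ** 2, axis=1))  # empirical V(X) = E ||X - E X||^2

print(V(sum(Xs)), sum(V(X) for X in Xs))                # the two numbers should be close
```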
Homework 2
Problem 1 (Balancing vectors)
Let $x_1,\cdots,x_n$ be an arbitrary set of unit vectors in $\mathbb{R}^n$. Prove that there exist $\varepsilon_1,\cdots,\varepsilon_n\in \lbrace -1, 1\rbrace$ such that
$$\Vert \varepsilon_1x_1+\cdots+\varepsilon_nx_n\Vert_2\leq \sqrt{n}.$$
Choose $\varepsilon_1,\cdots,\varepsilon_n$ to be independent symmetric random signs, i.e. $\mathbb{P}[\varepsilon_i=1]=\mathbb{P}[\varepsilon_i=-1]=\frac{1}{2}$, so that $\mathbb{E}[\varepsilon_i]=0$. Compute
$$\mathbb{E}\Bigl\Vert \sum_{i=1}^n \varepsilon_ix_i\Bigr\Vert_2^2=\sum_{i,j=1}^n \mathbb{E}[\varepsilon_i\varepsilon_j]\langle x_i,x_j\rangle=\sum_{i=1}^n \Vert x_i\Vert_2^2=n,$$
since $\mathbb{E}[\varepsilon_i\varepsilon_j]=0$ for $i\neq j$ (independence and zero mean) and $\mathbb{E}[\varepsilon_i^2]=1$. Hence there exists a realization of the signs with $\Vert\sum_{i=1}^n \varepsilon_ix_i\Vert_2^2\leq n$, i.e. $\Vert\sum_{i=1}^n \varepsilon_ix_i\Vert_2\leq \sqrt{n}$.
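The probabilistic argument is easy to see numerically (a sketch; the unit vectors are generated at random here, but any unit vectors would do):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=(n, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # n arbitrary unit vectors in R^n (rows)

trials = 10_000
eps = rng.choice([-1.0, 1.0], size=(trials, n))  # random sign patterns
norms = np.linalg.norm(eps @ x, axis=1)          # || sum_i eps_i x_i || for each pattern
print(np.mean(norms ** 2), n)                    # average squared norm ≈ n
print(norms.min() <= np.sqrt(n))                 # very likely some sampled pattern already meets the bound
```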
Problem 2 (Random vectors have norm $\approx \sqrt{n}$)
Let $X$ be a standard normal random vector in $\mathbb{R}^n$, where $n\geq C_1$.
(a)
Check that
$$\mathbb{E}\Vert X\Vert_2^2 = n \quad\text{and}\quad \text{Var}(\Vert X\Vert_2^2) = 2n.$$
Since each coordinate satisfies $\mathbb{E}X_i^2=1$,
$$\mathbb{E}\Vert X\Vert_2^2=\sum_{i=1}^n\mathbb{E}X_i^2=n.$$
$$\begin{aligned}
\text{Var}(\Vert X\Vert_2^2)&=\mathbb{E}\Vert X\Vert_2^4-(\mathbb{E}\Vert X\Vert_2^2)^2
=\mathbb{E}\Bigl[\Bigl(\sum_{i=1}^n X_i^2\Bigr)^2\Bigr]-n^2\\
&=\mathbb{E}\Bigl[\sum_{i=1}^n X_i^4+2\sum_{i<j}X_i^2X_j^2\Bigr]-n^2
=3n+2\cdot\frac{n(n-1)}{2}-n^2=2n,
\end{aligned}$$
using $\mathbb{E}X_i^4=3$ and independence of the coordinates.
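A short Monte Carlo check of these two moments (sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 100, 50_000
sq_norms = np.sum(rng.normal(size=(N, n)) ** 2, axis=1)  # ||X||_2^2 for N independent draws of X
print(sq_norms.mean(), n)                                # ≈ n
print(sq_norms.var(), 2 * n)                             # ≈ 2n
```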
(b)
Conclude that
$$|\Vert X\Vert_2^2 - n|\leq C_2 \sqrt{n}\quad\text{with probability at least }0.99.$$
By Chebyshev's inequality,
$$\mathbb{P}\bigl[|\Vert X\Vert_2^2-n|\leq C_2\sqrt{n}\bigr]\geq 1-\mathbb{P}\bigl[|\Vert X\Vert_2^2-n|\geq C_2\sqrt{n}\bigr]\geq 1-\frac{\text{Var}(\Vert X\Vert_2^2)}{C_2^2n}=1-\frac{2}{C_2^2}.$$
Let $C_2=\sqrt{200}$, so the probability is at least $0.99$.
(c)
Deduce that
$$\frac{1}{2}\sqrt{n}\leq \Vert X\Vert_2\leq 2\sqrt{n}\quad\text{with probability at least }0.99.$$
Using the conclusion of (b), with probability at least $0.99$ we have
$$\sqrt{n-C_2\sqrt{n}}\leq \Vert X\Vert_2\leq \sqrt{n+C_2\sqrt{n}}.$$
Upper bound: $\sqrt{n+C_2\sqrt{n}}\leq 2\sqrt{n}$ holds iff $n\geq (\frac{C_2}{3})^2$.
Lower bound: $\sqrt{n-C_2\sqrt{n}}\geq \frac{1}{2}\sqrt{n}$ holds iff $n\geq (\frac{4C_2}{3})^2$.
Both conditions are satisfied once $n\geq (\frac{4C_2}{3})^2$, which is guaranteed by choosing the constant $C_1$ accordingly.
(d)
Prove the tighter bound:
$$|\Vert X\Vert_2 - \sqrt{n}|\leq C_3\quad\text{with probability at least }0.99.$$
On the event of part (b), which has probability at least $0.99$,
$$|\Vert X\Vert_2-\sqrt{n}|=\frac{|\Vert X\Vert_2^2-n|}{\Vert X\Vert_2+\sqrt{n}}\leq \frac{C_2\sqrt{n}}{\sqrt{n}}=C_2,$$
so we may take $C_3=C_2$.
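A quick empirical check of the two concentration claims (sketch; $C_2=\sqrt{200}$ as chosen above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 400, 10_000
norms = np.linalg.norm(rng.normal(size=(N, n)), axis=1)    # ||X||_2 for N independent draws

C2 = np.sqrt(200)
print(np.mean(np.abs(norms ** 2 - n) <= C2 * np.sqrt(n)))  # part (b): fraction should exceed 0.99
print(np.mean(np.abs(norms - np.sqrt(n)) <= C2))           # part (d): fraction should exceed 0.99
```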
Problem 3 (Random vectors are almost orthogonal)
Let $X$ and $Y$ be independent standard normal random vectors in $\mathbb{R}^n$, where $n\geq C_3$.
(a)
Check that
$$\mathbb{E}\langle X, Y\rangle^2 = n.$$
$$\mathbb{E}\langle X,Y\rangle^2=\mathbb{E}\Bigl[\Bigl(\sum_{i=1}^n X_iY_i\Bigr)^2\Bigr]=\mathbb{E}\Bigl[\sum_{i=1}^n X_i^2Y_i^2+2\sum_{i<j}X_iY_iX_jY_j\Bigr]=n,$$
since $\mathbb{E}[X_i^2Y_i^2]=\mathbb{E}X_i^2\,\mathbb{E}Y_i^2=1$ and the cross terms have zero expectation by independence.
(b)
Deduce that
$$\langle X, Y\rangle^2\leq C_4n\quad\text{with probability at least }0.99.$$
By Markov's inequality,
$$\mathbb{P}\bigl[\langle X,Y\rangle^2\leq C_4n\bigr]\geq 1-\mathbb{P}\bigl[\langle X,Y\rangle^2\geq C_4n\bigr]\geq 1-\frac{\mathbb{E}\langle X,Y\rangle^2}{C_4n}=1-\frac{1}{C_4}.$$
Let $C_4=100$, so the probability is at least $0.99$.
(c)
Denote by $\theta$ the angle between the vectors $X$ and $Y$. Prove that
$$\Bigl|\theta-\frac{\pi}{2}\Bigr|\leq \frac{C_5}{\sqrt{n}}\quad\text{with probability at least }0.97.$$
Use Problem 3(b) and Problem 2(c).
By the union bound, the events $\langle X,Y\rangle^2\leq C_4n$, $\Vert X\Vert_2\geq \frac{\sqrt{n}}{2}$ and $\Vert Y\Vert_2\geq \frac{\sqrt{n}}{2}$ hold simultaneously with probability at least $1-3\times0.01=0.97$, and on this event
$$\cos^2(\theta)=\frac{\langle X,Y\rangle^2}{\Vert X\Vert_2^2\Vert Y\Vert_2^2}\leq \frac{C_4n}{(\frac{\sqrt{n}}{2})^2(\frac{\sqrt{n}}{2})^2}=\frac{16C_4}{n},
\qquad\text{i.e.}\quad |\cos(\theta)|\leq \frac{4\sqrt{C_4}}{\sqrt{n}}.$$
Now analyse $|\theta-\pi/2|$. Since $\theta\in[0,\pi]$, we have $|\frac{\pi}{2}-\theta|\leq \frac{\pi}{2}$, and $|\sin x|\geq \frac{2}{\pi}|x|$ for $|x|\leq\frac{\pi}{2}$, so on the same event
$$\Bigl|\frac{\pi}{2}-\theta\Bigr|\leq \frac{\pi}{2}\Bigl|\sin\Bigl(\frac{\pi}{2}-\theta\Bigr)\Bigr|=\frac{\pi}{2}|\cos(\theta)|\leq \frac{2\pi\sqrt{C_4}}{\sqrt{n}}=:\frac{C_5}{\sqrt{n}}.$$
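The near-orthogonality is easy to see by simulation (a sketch; the last line estimates how $\sqrt{n}\,|\theta-\pi/2|$ is actually distributed, which is much tighter than the crude constant $C_5$ above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, N = 400, 10_000
X = rng.normal(size=(N, n))
Y = rng.normal(size=(N, n))

cos_theta = np.sum(X * Y, axis=1) / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))
dev = np.abs(np.arccos(cos_theta) - np.pi / 2)      # |theta - pi/2| for each pair

C5 = 2 * np.pi * np.sqrt(100)                       # the constant derived above with C_4 = 100
print(np.mean(dev <= C5 / np.sqrt(n)))              # should comfortably exceed 0.97
print(np.quantile(dev * np.sqrt(n), 0.97))          # empirical 97%-quantile of sqrt(n)|theta - pi/2|
```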
Homework 3
Problem 1 (Small ball probabilities)
Let $X_1,\cdots,X_N$ be non-negative independent random variables. Assume that the PDF (probability density function) of each $X_i$ is uniformly bounded by $1$.
(a)
Check that each $X_i$ satisfies
$$\mathbb{P}\lbrace X_i\leq \varepsilon\rbrace\leq \varepsilon\quad\text{for all }\varepsilon > 0.$$
$$\mathbb{P}\lbrace X_i\leq \varepsilon\rbrace=\int_0^{\varepsilon} f_{X_i}(x)\,{\rm d}x\leq \int_0^{\varepsilon} {\rm d}x=\varepsilon.$$
(b)
Show that the MGF (moment generating function) of each $X_i$ satisfies
$$\mathbb{E}\exp(-tX_i)\leq \frac{1}{t}\quad\text{for all }t > 0.$$
$$\mathbb{E}\exp(-tX_i)=\int_0^{\infty} \exp(-tx)\,f_{X_i}(x)\,{\rm d}x\leq \int_0^{\infty}\exp(-tx)\,{\rm d}x=\frac{1}{t}.$$
(c)
Deduce that averaging increases the strength of (a) dramatically. Namely, show that
$$\mathbb{P}\Bigl\lbrace \frac{1}{N}\sum_{i=1}^N X_i\leq \varepsilon\Bigr\rbrace\leq (C\varepsilon)^N\quad\text{for all } \varepsilon > 0.$$
Use the MGF (Chernoff) method: for any $t>0$,
$$\mathbb{P}\Bigl\lbrace \sum_{i=1}^N X_i\leq N\varepsilon\Bigr\rbrace=\mathbb{P}\Bigl\lbrace e^{-t\sum_{i=1}^N X_i}\geq e^{-tN\varepsilon}\Bigr\rbrace\leq e^{tN\varepsilon}\,\mathbb{E}\bigl[e^{-t\sum_{i=1}^N X_i}\bigr]=e^{tN\varepsilon}\prod_{i=1}^N \mathbb{E}\bigl[e^{-tX_i}\bigr]\leq t^{-N}e^{tN\varepsilon},$$
using Markov's inequality, independence, and part (b). Let $f(t)=t^{-N}e^{tN\varepsilon}$; then
$$\frac{\mathrm{d}\ln f(t)}{\mathrm{d}t}=-\frac{N}{t}+N\varepsilon=0\implies t=\frac{1}{\varepsilon},$$
and plugging in $t=\frac{1}{\varepsilon}$ gives
$$\mathbb{P}\Bigl\lbrace \frac{1}{N}\sum_{i=1}^N X_i\leq \varepsilon\Bigr\rbrace\leq (e\varepsilon)^N,$$
so the claim holds with $C=e$.
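A small Monte Carlo illustration with Uniform$(0,1)$ variables, whose density is bounded by $1$ (a sketch; the particular $N$ and $\varepsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
N, trials, eps = 5, 1_000_000, 0.1
means = rng.uniform(size=(trials, N)).mean(axis=1)  # (1/N) sum X_i for Uniform(0,1) variables
print(np.mean(means <= eps))                        # empirical small-ball probability (≈ 2.6e-4 here)
print((np.e * eps) ** N)                            # the bound (e*eps)^N ≈ 1.5e-3 dominates it
```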
Problem 2 (Boosting randomized algorithms)
Imagine we have an algorithm for solving some decision problem (for example, the algorithm may answer the question: “is there a helicopter in a given image?”). Suppose each time the algorithm runs, it gives the correct answer independently with probability $\frac{1}{2} + \delta$ for some small margin $\delta\in (0, 1/2)$. In other words, the algorithm does just marginally better than a random guess.
To improve the confidence, the following “boosting” procedure is often used. Run the algorithm $N$ times, and take the majority vote. Show that the new algorithm gives the correct answer with probability at least $1 - 2\exp(-c\delta^2 N)$. This is good because the confidence rapidly (exponentially!) approaches $1$ as $N$ grows.
Let $X_i$ be the indicator that the $i$-th run is wrong, so $\mathbb{P}\lbrace X_i=1\rbrace=\frac{1}{2}-\delta$ and $\mathbb{P}\lbrace X_i=0\rbrace=\frac{1}{2}+\delta$, and set $S_N=\sum_{i=1}^N X_i$ with $\mu=\mathbb{E}S_N=N(\frac{1}{2}-\delta)$.
The majority vote fails only if $S_N\geq \frac{N}{2}$ (counting ties as failures). By Chernoff's inequality and the Taylor expansion $\ln(1-2\delta)\leq -2\delta-2\delta^2$,
$$\begin{aligned}
\mathbb{P}\Bigl\lbrace S_N\geq \frac{N}{2}\Bigr\rbrace&\leq e^{-\mu}\Bigl(\frac{2e\mu}{N}\Bigr)^{N/2}=e^{-N(1/2-\delta)}\bigl(e(1-2\delta)\bigr)^{N/2}=e^{N\delta}(1-2\delta)^{N/2}\\
&=e^{N(\delta+\frac{1}{2}\ln(1-2\delta))}\leq e^{N(\delta+\frac{1}{2}(-2\delta-2\delta^2))}=e^{-N\delta^2},
\end{aligned}$$
so the majority vote is correct with probability
$$\mathbb{P}\Bigl\lbrace S_N<\frac{N}{2}\Bigr\rbrace\geq 1-e^{-N\delta^2}\geq 1-2\exp(-c\delta^2N)\quad\text{with }c=1.$$
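A simulation of the boosting procedure (a sketch; $\delta$, $N$ and the number of trials are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
delta, N, trials = 0.05, 500, 10_000
wrong = rng.random(size=(trials, N)) < (0.5 - delta)  # indicator that a single run is wrong
majority_wrong = wrong.sum(axis=1) >= N / 2           # majority vote fails (ties counted as failures)
print(majority_wrong.mean())                          # empirical failure probability of the vote
print(np.exp(-delta ** 2 * N))                        # the bound e^{-N delta^2} derived above
```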
Problem 3 (Optimality of Chernoff’s inequality)
(a)
Prove the following useful inequalities for binomial coefficients:
$$\left(\frac{n}{m}\right)^m\leq \binom{n}{m}\leq \sum_{k=0}^m \binom{n}{k}\leq \left(\frac{en}{m}\right)^m$$
for all integers $m\in [1, n]$.
The first inequality:
$$\frac{n}{m}\leq \frac{n-i}{m-i}\ \text{ for all }i=0,\cdots,m-1
\implies \left(\frac{n}{m}\right)^m\leq \frac{n}{m}\cdot\frac{n-1}{m-1}\cdots\frac{n-m+1}{1}=\binom{n}{m}.$$
The second inequality is immediate, since $\binom{n}{m}$ is one of the (non-negative) terms of the sum.
The third inequality:
$$\sum_{k=0}^m \binom{n}{k}\leq \left(\frac{en}{m}\right)^m\iff \sum_{k=0}^m \binom{n}{k}\left(\frac{m}{n}\right)^m\leq e^m=\sum_{k=0}^{\infty} \frac{m^k}{k!}.$$
It suffices to show, for each $0\leq k\leq m$, that
$$\binom{n}{k}\left(\frac{m}{n}\right)^m\leq \frac{m^k}{k!}\iff \frac{n!\,m^m}{(n-k)!\,n^m}\leq m^{k}.$$
Indeed, since $\frac{m}{n}\leq 1$ and $k\leq m$, and since $\frac{n-i}{n}\leq 1$ for each $i$,
$$\frac{n!\,m^m}{(n-k)!\,n^m}\leq \frac{n!\,m^k}{(n-k)!\,n^k}=m^k\prod_{i=0}^{k-1}\frac{n-i}{n}\leq m^k
\implies \sum_{k=0}^m \binom{n}{k}\left(\frac{m}{n}\right)^m\leq \sum_{k=0}^{m} \frac{m^k}{k!}\leq e^m
\implies \sum_{k=0}^m \binom{n}{k}\leq \left(\frac{en}{m}\right)^m.$$
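These inequalities are easy to verify exhaustively for small parameters (a sketch using exact integer binomials from the standard library):

```python
from math import comb, e

def check(n, m):
    lower = (n / m) ** m
    middle = comb(n, m)
    partial_sum = sum(comb(n, k) for k in range(m + 1))
    upper = (e * n / m) ** m
    return lower <= middle <= partial_sum <= upper

print(all(check(n, m) for n in range(1, 60) for m in range(1, n + 1)))  # expect True
```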
(b)
Show that Chernoff's inequality is almost optimal. Namely, if $S_N$ is a binomial random variable with mean $\mu$, that is $S_N\sim \text{Binom}(N, \mu/N)$, show that
$$\mathbb{P}\lbrace S_N\geq t\rbrace\geq e^{-\mu}(\mu/t)^t$$
for any integer $t\in \lbrace 1,\cdots,N\rbrace$ such that $t\geq \mu$.
Use Problem 3(a). For any such $t$,
$$\begin{aligned}
\mathbb{P}\lbrace S_N\geq t\rbrace&=\sum_{i=t}^N \binom{N}{i}(\mu/N)^i(1-\mu/N)^{N-i}
\geq \sum_{i=t}^N \Bigl(\frac{N}{i}\Bigr)^i(\mu/N)^i(1-\mu/N)^{N-i}\\
&=\sum_{i=t}^N (\mu/i)^i(1-\mu/N)^{N-i}\geq (\mu/t)^t(1-\mu/N)^{N-t}\geq(\mu/t)^t\left(\frac{N-\mu}{N}\right)^{N-\mu},
\end{aligned}$$
where the first inequality uses $\binom{N}{i}\geq (N/i)^i$ from part (a), the second keeps only the $i=t$ term, and the third uses $N-t\leq N-\mu$ together with $1-\mu/N\leq 1$. It remains to show that $\left(\frac{N-\mu}{N}\right)^{N-\mu}\geq e^{-\mu}$, which is equivalent to
$$\left(\frac{N-\mu}{N}\right)^N\geq \left(\frac{N-\mu}{eN}\right)^{\mu}
\iff (N-\mu)\ln(N-\mu)+\mu \ln N+\mu\geq N\ln N.$$
Let $f(\mu)=(N-\mu)\ln(N-\mu)+\mu\ln N+\mu$ for $0\leq\mu<N$; then
$$f'(\mu)=-1-\ln(N-\mu)+\ln N+1=\ln N-\ln(N-\mu)\geq 0\implies f(\mu)\geq f(0)=N\ln N.$$
Q.E.D.
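A quick exact check of this lower bound (a sketch; $N$ and $\mu$ are arbitrary choices, and the tail is computed exactly with integer binomials):

```python
from math import comb, exp

def upper_tail(N, p, t):
    # exact P{S_N >= t} for S_N ~ Binom(N, p)
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(t, N + 1))

N, mu = 100, 5.0
p = mu / N
print(all(upper_tail(N, p, t) >= exp(-mu) * (mu / t) ** t for t in range(5, N + 1)))  # expect True
```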
Problem 4 (The lower tail in Chernoff’s inequality)
Let $X_i$ be independent Bernoulli random variables with parameters $p_i$. Consider their sum $S_N = \sum_{i=1}^N X_i$ and denote its mean by $\mu = \mathbb{E}S_N$. Then, for any $t <\mu$, we have
$$\mathbb{P}\lbrace S_N\leq t\rbrace\leq e^{-\mu}\Bigl(\frac{e\mu}{t}\Bigr)^t.$$
Use the MGF method together with Markov's inequality: for any $\lambda>0$,
$$\begin{aligned}
\mathbb{P}\lbrace S_N\leq t\rbrace&=\mathbb{P}\lbrace \exp(-\lambda S_N)\geq \exp(-\lambda t)\rbrace\leq \exp(\lambda t)\,\mathbb{E}\exp(-\lambda S_N)\\
&=\exp(\lambda t)\prod_{i=1}^N \mathbb{E}\exp(-\lambda X_i)=\exp(\lambda t)\prod_{i=1}^N \bigl(1-p_i+p_ie^{-\lambda}\bigr)\\
&=\exp(\lambda t)\prod_{i=1}^N \bigl(p_i(e^{-\lambda}-1)+1\bigr)\leq \exp(\lambda t)\prod_{i=1}^N \exp\bigl(p_i(e^{-\lambda}-1)\bigr)\\
&=\exp(\lambda t)\exp\bigl(\mu(e^{-\lambda}-1)\bigr)=\exp(\lambda t+\mu e^{-\lambda}-\mu),
\end{aligned}$$
where the last inequality uses $1+x\leq e^x$.
Let $f(\lambda)=\lambda t+\mu e^{-\lambda}$; then
$$f'(\lambda)=t-\mu e^{-\lambda}=0\implies \lambda=\ln(\mu/t)>0\implies \min_{\lambda>0} f(\lambda)=f(\ln(\mu/t))=t\ln(\mu/t)+t.$$
Therefore,
$$\mathbb{P}\lbrace S_N\leq t\rbrace\leq \exp\Bigl(\min_{\lambda>0} f(\lambda)-\mu\Bigr)=\exp\bigl(t\ln(\mu/t)+t-\mu\bigr)=e^{-\mu}\left(\frac{e\mu}{t}\right)^t.$$
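As with the upper tail, the bound can be checked exactly for small parameters (a sketch; $N$, $p$ and the range of $t$ are arbitrary illustrative choices, with all $p_i$ equal):

```python
from math import comb, exp

def lower_tail(N, p, t):
    # exact P{S_N <= t} for S_N ~ Binom(N, p)
    return sum(comb(N, i) * p ** i * (1 - p) ** (N - i) for i in range(0, t + 1))

N, p = 200, 0.3
mu = N * p                                                     # = 60
bound_ok = all(lower_tail(N, p, t) <= exp(-mu) * (exp(1) * mu / t) ** t for t in range(1, int(mu)))
print(bound_ok)                                                # expect True
```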