Elementary matrix
- a matrix obtained from the identity matrix by a single elementary row operation
- types
- Row-switching transformations
- The inverse of this matrix is itself
- The determinant is -1
- Row-multiplying transformations
- The inverse of this matrix can be made by replacing the changed element with its reciprocal
- The determinant is $m$ (the row multiplier); the inverse has determinant $1/m$
- Row-addition transformations
- The inverse of this matrix can be made by negating (multiplying by $-1$) the added off-diagonal element.
- The determinant is 1.
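For example (with ad-hoc labels for the three types), the $3 \times 3$ elementary matrices that swap rows 1 and 2, multiply row 2 by $m$, and add $m$ times row 1 to row 3 are
\[E_{\text{swap}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
E_{\text{scale}} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & m & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
E_{\text{add}} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ m & 0 & 1 \end{pmatrix}\]
with $\det E_{\text{swap}} = -1$, $\det E_{\text{scale}} = m$, and $\det E_{\text{add}} = 1$.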
Linearly dependent
- at least one of the vectors in a set can be defined as a linear combination of the others
Linearly independent
- no vector in a set can be written as a linear combination of the others.
- $Ax = 0$ has no solution other than $x = 0$, where the columns of $A$ are the vectors of the set.
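For example, $\{(1, 2), (2, 4)\}$ is linearly dependent because $(2, 4) = 2 \cdot (1, 2)$, while $\{(1, 0), (0, 1)\}$ is linearly independent because $a(1, 0) + b(0, 1) = (0, 0)$ forces $a = b = 0$.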
Ordinary least squares
Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model.
\[X\boldsymbol{\beta} = \boldsymbol{y}\]
Usually this system is overdetermined, so there is no exact solution $\boldsymbol{\beta}$.
Instead, we want to find ${\hat \beta}$ such that
\[{\hat {\boldsymbol{\beta}}} = \operatorname{arg\,min}_{\boldsymbol{\beta}} {\lVert X\boldsymbol{\beta} - \boldsymbol{y}\rVert^2}\]
To find the critical point we calculate the gradient value of the squared error term $S = \lVert X\boldsymbol{\beta} - \boldsymbol{y}\rVert^2$.
\[\begin{aligned}
\nabla_{\boldsymbol{\beta}}S & = \nabla_{\boldsymbol{\beta}} \lVert X\boldsymbol{\beta} - \boldsymbol{y}\rVert^2 \\
& = \nabla_{\boldsymbol{\beta}} (X\boldsymbol{\beta} - \boldsymbol{y})^{\mathsf {T}}(X\boldsymbol{\beta} - \boldsymbol{y}) \\
& =\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}} - \boldsymbol{y}^{\mathsf {T}})(X\boldsymbol{\beta} - \boldsymbol{y})\\
& =\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}X\boldsymbol{\beta} - \boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y} - \boldsymbol{y}^{\mathsf {T}}X\boldsymbol{\beta} + \boldsymbol{y}^{\mathsf {T}}\boldsymbol{y})\\
& =\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}X\boldsymbol{\beta} - \boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y} - (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y})^{\mathsf {T}} + \boldsymbol{y}^{\mathsf {T}}\boldsymbol{y})\\
& =\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}X\boldsymbol{\beta} - 2 \boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y} + \boldsymbol{y}^{\mathsf {T}}\boldsymbol{y})\\
& =\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}X\boldsymbol{\beta} - 2 \boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y})\\
& =2 X^{\mathsf {T}}X\boldsymbol{\beta} - 2 X^{\mathsf {T}}\boldsymbol{y}
\end{aligned}\]
In the step above, $\boldsymbol{y}^{\mathsf {T}}X\boldsymbol{\beta}$ is a scalar, so it equals its transpose $\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y}$, which is why the two middle terms combine into $-2\boldsymbol{\beta}^{\mathsf {T}}X^{\mathsf {T}}\boldsymbol{y}$. Set $\nabla_{\boldsymbol{\beta}}S = 0$ to find the critical point.
\[\begin{aligned}
\nabla_{\boldsymbol{\beta}}S = 2 X^{\mathsf {T}}X\boldsymbol{\beta} - 2 X^{\mathsf {T}}\boldsymbol{y} = 0 \\
X^{\mathsf {T}}X\boldsymbol{\beta} = X^{\mathsf {T}}\boldsymbol{y} \\
\boldsymbol{\beta} = (X^{\mathsf {T}}X)^{-1} X^{\mathsf {T}}\boldsymbol{y} \\
\end{aligned}\]
So, provided $X^{\mathsf {T}}X$ is invertible,
\[{\hat {\boldsymbol{\beta}}} = (X^{\mathsf {T}}X)^{-1} X^{\mathsf {T}}\boldsymbol{y}.\]
Remarks
- $X^{\mathsf {T}}X$ is called the moment matrix (the Gram matrix of the columns of $X$).
- If $X$ has multicollinearity, the coefficient estimates ${\hat {\boldsymbol{\beta}}}$ may change erratically in response to small changes in the model or the data.
- If $X$ has perfect multicollinearity, the moment matrix is not invertible.
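A minimal NumPy sketch of the closed-form solution above; the data and variable names are made up for illustration, and `np.linalg.lstsq` is included only as a cross-check (it is the numerically preferred way to solve least-squares problems).

```python
import numpy as np

# Toy data: y = 1 + 2x plus noise (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix X with an intercept column.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: X^T X beta = X^T y
# (solve the linear system instead of forming the inverse explicitly).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with the library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)    # approximately [1, 2]
print(beta_lstsq)  # agrees with beta_hat
```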
orthogonal set
- A set of vectors such that
- nonzero
- mutually orthogonal
- $a_i \cdot a_j = 0$ (where $i \neq j$)
orthonormal set
- A set of vectors such that
- all vectors in the set are mutually orthogonal
- and all are of unit length.
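For example, $\{(1, 1), (1, -1)\}$ is an orthogonal set since $(1, 1) \cdot (1, -1) = 0$; dividing each vector by its length $\sqrt{2}$ gives the orthonormal set $\{\tfrac{1}{\sqrt{2}}(1, 1), \tfrac{1}{\sqrt{2}}(1, -1)\}$.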
standard basis
The set of unit vectors pointing along the coordinate axes of Euclidean space; for example, $(1, 0)$ and $(0, 1)$ in $\mathbb{R}^2$.
det(A)
The determinant expresses the signed $n$-dimensional volume of the $n$-dimensional parallelepiped spanned by the columns (or rows) of $A$.
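For example, in two dimensions
\[\det \begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc\]
is the signed area of the parallelogram spanned by the columns $(a, c)$ and $(b, d)$.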
diagonal matrix
- does not need to be a square matrix
- represents a scaling along the coordinate axes
symmetric matrix
- eigenvalues are all real numbers.
- eigenvectors (for distinct eigenvalues) are orthogonal, so an orthonormal eigenbasis can be chosen
- scaling in mutually perpendicular directions
- can be decomposed as $QDQ^{\mathsf {T}}$
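A small worked example of the $QDQ^{\mathsf {T}}$ decomposition: the symmetric matrix below has eigenvalues 3 and 1 with orthonormal eigenvectors $\tfrac{1}{\sqrt{2}}(1, 1)$ and $\tfrac{1}{\sqrt{2}}(1, -1)$, so
\[\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
= \underbrace{\tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}}_{Q}
\underbrace{\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}}_{D}
\underbrace{\tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}}_{Q^{\mathsf {T}}}\]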
skew-symmetric matrix (anti-symmetric matrix)
\[A^{\mathsf {T}} = -A\]
- eigenvectors are orthogonal
- eigenvalues are purely imaginary or zero
orthogonal matrix
\[Q^{\mathsf {T}} Q = QQ^{\mathsf {T}} = I\]
- square matrix
- length preserving or isometric
- unitary transformation (rotation or reflection or rotoreflection)
- eigenvalues have absolute value 1 and can be complex; for a 3-D rotation, the real eigenvector with eigenvalue 1 is the rotation axis
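For example, the 2-D rotation matrix
\[R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}\]
satisfies $R^{\mathsf {T}}R = I$ and $\lVert Rx \rVert = \lVert x \rVert$, and its eigenvalues are the complex numbers $e^{\pm i\theta}$, both of absolute value 1.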
conjugate transpose
\[A^\ast={\overline {A^{\mathsf {T}}}}\]
Hermitian matrix
\[A=A^\ast\]
- self-adjoint matrix
- The eigenvalues are all real.
- Eigenvectors belonging to distinct eigenvalues are orthogonal.
- (Spectral theorem) There exists a unitary matrix $U$ that diagonalizes a Hermitian matrix $A$: $A = U \Lambda U^\ast$ with $\Lambda$ real and diagonal.
unitary matrix
Column vectors form an orthonormal set in $\mathbb{C}^n$.
- $U$ is unitary if and only if $U^\ast U = UU^\ast = I$.
- equivalently, the conjugate transpose is the inverse: $U^\ast = U^{-1}$
- Given two complex vectors $x$ and $y$, multiplication by a unitary matrix $U$ preserves their inner product; that is, $⟨Ux, Uy⟩ = ⟨x, y⟩$.
normal matrix
\[A^\ast A = AA^\ast\]
- A is normal if and only if A possesses a complete orthonormal set of eigenvectors.
positive definite
\[x^{\mathsf {T}}Ax > 0 \quad \text{for all } x \neq 0\]
- for symmetric matrices
- eigenvalues are all real and greater than zero
- can be seen as $x^{\mathsf {T}}Ax = x \cdot (Ax)$
- which means $A$ does not change the direction of $x$ too much
- the angle between $x$ and $Ax$ is strictly less than 90 degrees
- In eigendecomposition of A, D consists of positive values.
positive semi-definite
\[x^{\mathsf {T}}Ax \ge 0 \quad \text{for all } x\]
- for symmetric matrices
- eigenvalues are all real and greater than or equal to zero
- can be seen as $x^{\mathsf {T}}Ax = x \cdot (Ax)$
- which means $A$ does not change the direction of $x$ too much
- the angle between $x$ and $Ax$ is at most 90 degrees
- In eigendecomposition of A, D consists of non-negative values.
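A NumPy sketch that classifies a symmetric matrix by the signs of its eigenvalues; the matrices and the tolerance are arbitrary illustrative choices.

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)   # eigvalsh: real eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    return "indefinite"

print(definiteness(np.array([[2.0, 1.0], [1.0, 2.0]])))  # positive definite (eigenvalues 3, 1)
print(definiteness(np.array([[1.0, 1.0], [1.0, 1.0]])))  # positive semi-definite (eigenvalues 2, 0)
print(definiteness(np.array([[1.0, 2.0], [2.0, 1.0]])))  # indefinite (eigenvalues 3, -1)
```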
QR Factorization
\[A = QR\]
- Q: orthogonal matrix
- R: upper triangular matrix
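A minimal NumPy example (the matrix is random, just for illustration); note that NumPy's default is the reduced QR, so $Q$ has orthonormal columns rather than being square.

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
Q, R = np.linalg.qr(A)                   # reduced QR: Q is 4x3, R is 3x3

print(np.allclose(Q.T @ Q, np.eye(3)))   # True: columns of Q are orthonormal
print(np.allclose(np.triu(R), R))        # True: R is upper triangular
print(np.allclose(Q @ R, A))             # True: A = QR
```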
eigenvalues and eigenvectors
\[Ax = \lambda x\]
- $\lambda_1 + \lambda_2 + \cdots + \lambda_n = \operatorname{tr}(A)$
- $\lambda_1 \lambda_2 \cdots \lambda_n = \operatorname{det}(A)$
- Eigenvectors are nonzero.
- Eigenvalues can be zero.
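A NumPy check of the trace and determinant identities above (the matrix is an arbitrary example with real eigenvalues 5 and 2):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print(np.isclose(eigenvalues.sum(), np.trace(A)))         # True: sum of eigenvalues = tr(A)
print(np.isclose(eigenvalues.prod(), np.linalg.det(A)))   # True: product of eigenvalues = det(A)

# Each column of `eigenvectors` satisfies A x = lambda x.
for lam, x in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ x, lam * x))                    # True
```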
eigendecomposition
\[A = Q \Lambda Q^{-1}\]
- A: square matrix
- A must have as many linearly independent eigenvectors as $\dim(A)$
- reduces computation for matrix powers: $A^k = Q \Lambda^k Q^{-1}$, so only element-wise powers of the eigenvalues are needed (see the sketch below)
(special cases)
For real symmetric matrices, $Q$ can be chosen orthogonal, so
\[A = Q \Lambda Q^{\mathsf {T}}\]
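A NumPy sketch of the computational point above: once $A = Q \Lambda Q^{-1}$ is known, $A^k = Q \Lambda^k Q^{-1}$ needs only element-wise powers of the eigenvalues (the matrix is an arbitrary diagonalizable example).

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals, Q = np.linalg.eig(A)            # A = Q diag(eigvals) Q^{-1}

k = 10
A_pow = Q @ np.diag(eigvals ** k) @ np.linalg.inv(Q)      # A^k via the eigendecomposition

print(np.allclose(A_pow, np.linalg.matrix_power(A, k)))   # True
```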
SVD
SVD amounts to finding an orthogonal (or unitary) matrix $V$ whose image under $M$ can still be written as an orthogonal matrix times a diagonal matrix: $MV = U\Sigma$, i.e. $M = U\Sigma V^{\mathsf {T}}$ (or $U\Sigma V^\ast$ in the complex case).
- $M$
- the $m \times n$ matrix being decomposed
- $V$
- $n \times n$ matrix
- right singular vectors (as columns)
- consists of the orthonormal eigenvectors of $M^\ast M$ ($= M^{\mathsf {T}}M$ in the real case)
- orthogonal matrix (a rotation or reflection)
- $U$
- left singular vectors (as columns)
- $m \times m$
- consists of the orthonormal eigenvectors of $MM^\ast$ ($= MM^{\mathsf {T}}$ in the real case)
- orthogonal matrix (a rotation or reflection)
- $\Sigma$
- $m \times n$
- diagonal; the non-zero entries (the singular values) are the square roots of the non-zero eigenvalues of either $M^\ast M$ or $MM^\ast$
- etc.
- for symmetric positive (semi-)definite matrices, the singular value decomposition and the eigendecomposition are the same.
- $M^\ast M = V\Sigma^2V^\ast$
- $MM^\ast = U\Sigma^2U^\ast$
- SVD always exists
- applications: see reduced/truncated SVD and PCA below
reduced SVD
truncated SVD
- keeps only the $k$ largest singular values and the corresponding singular vectors (see the sketch below).
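A NumPy sketch of the full and truncated SVD (the matrix is random, just for illustration):

```python
import numpy as np

M = np.random.default_rng(0).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(M, full_matrices=True)   # U: 5x5, s: singular values, Vt: 3x3

# Reconstruct M by embedding the singular values into a 5x3 Sigma.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, M))             # True: M = U Sigma V^T

# Truncated SVD: keep only the k largest singular values (a rank-k approximation of M).
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(M_k.shape)                                  # (5, 3), but of rank k
```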
PCA
- analysis on $M^\ast M = V\Sigma^2V^\ast$
- The data is supposed to be centered before applying PCA.
- How to do dimensionality reduction usually
- Standardize the d-dimensional dataset.
- Construct the covariance matrix.
- Decompose the covariance matrix into its eigenvectors and eigenvalues.
- Sort the eigenvalues by decreasing order to rank the corresponding eigenvectors.
- Select k eigenvectors which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
- Construct a projection matrix W from the “top” k eigenvectors.
- Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
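A minimal NumPy sketch of the steps above; the data is synthetic and the variable names are ad hoc.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # 100 samples, d = 3 features
X = X @ np.array([[1.0, 0.5, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.2]])       # introduce correlation and different scales

# 1. Center the data (standardize as well if the features have different units).
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix.
cov = np.cov(X_centered, rowvar=False)

# 3-5. Eigendecomposition, then sort eigenvalues in decreasing order and pick the top k.
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]
k = 2
W = eigvecs[:, order[:k]]                 # 6. projection matrix from the top-k eigenvectors

# 7. Project the data onto the new k-dimensional feature subspace.
X_pca = X_centered @ W
print(X_pca.shape)                        # (100, 2)
```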
Gramian matrix
- https://en.wikipedia.org/wiki/Gramian_matrix
- is the Hermitian matrix of inner products, whose entries are given by $G_{ij} = \langle v_i, v_j \rangle$
- where $v_1, …, v_n$ are
- usually columns of a matrix V
- in an inner product space.
- positive semi-definite (positive definite if and only if the vectors are linearly independent)
- applications
- If the given vectors are centered random variables, the Gramian matrix
- is equal to $A^{\mathsf {T}} A$, where the vectors are the columns of $A$
- and is proportional to the sample covariance matrix, with the scaling determined by the number of samples in each vector (see the check below)
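A NumPy check of the covariance remark above, using synthetic centered columns:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 2))            # columns = two random variables, 1000 samples each
A = A - A.mean(axis=0)                    # center each column

gram = A.T @ A                            # Gramian matrix of the columns of A
cov = np.cov(A, rowvar=False)             # sample covariance matrix

print(np.allclose(gram / (A.shape[0] - 1), cov))   # True: Gram / (n - 1) = covariance
```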
- isometry
- a mapping that preserves distances