Empirical observations of L_{2,d} spectral radii and other functions

In this post, we shall make a few experimental observations about the L_{2,d}-spectral radius and L_{2,d}-spectral radius dimensionality reduction that give evidence that these notions will eventually have a deep mathematical theory behind them and will also become more applicable to machine learning. I have not been able to prove these empirical observations, so one should consider this post to be about experimental mathematics. Hopefully these experimental results will eventually be backed up by formal theoretical mathematics. The reader is referred to this post for definitions.

This post is currently a work in progress, so I will be updating it with more information. Let M_d(\mathbb{C})^{r++}\subseteq M_d(\mathbb{C})^r be the collection of all tuples (A_1,\dots,A_r) where A_1\otimes\overline{A_1}+\dots+A_r\otimes\overline{A_r} is not nilpotent. Given matrices A_1,\dots,A_r\in M_n(\mathbb{C}), define a function L_{A_1,\dots,A_r;d}:M_d(\mathbb{C})^{r++}\rightarrow[0,\infty) by letting L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=\frac{\rho(A_1\otimes\overline{X_1}+\dots+A_r\otimes\overline{X_r})}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}}.
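
As a sanity check on this definition, here is a minimal numpy sketch (an illustration of mine, not the code behind the experiments below) that evaluates L_{A_1,\dots,A_r;d} directly by forming the Kronecker products; the gradient ascent used later in this post is more efficient than this brute-force eigenvalue computation.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def L(As, Xs):
    """Evaluate L_{A_1,...,A_r;d}(X_1,...,X_r) directly from the definition.
    As: list of n x n complex matrices; Xs: list of d x d complex matrices."""
    num = spectral_radius(sum(np.kron(A, np.conj(X)) for A, X in zip(As, Xs)))
    den = spectral_radius(sum(np.kron(X, np.conj(X)) for X in Xs)) ** 0.5
    return num / den
```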

Given matrices A_1,\dots,A_r\in M_n(\mathbb{C}), we say that (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an L_{2,d}-spectral radius dimensionality reduction (abbreviated L_{2,d}-SRDR) of (A_1,\dots,A_r) if (X_1,\dots,X_r) locally maximizes L_{A_1,\dots,A_r;d}. We say that an L_{2,d}-SRDR (X_1,\dots,X_r) is optimal if (X_1,\dots,X_r) globally maximizes L_{A_1,\dots,A_r;d}.

We say that (X_1,\dots,X_r) is projectively similar to (Y_1,\dots,Y_r) if there is some \lambda\in\mathbb{C}^\times and some invertible C\in M_d(\mathbb{C}) where Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r. Let \simeq denote the equivalence relation where we set (X_1,\dots,X_r)\simeq(Y_1,\dots,Y_r) precisely when (X_1,\dots,X_r) and (Y_1,\dots,Y_r) are projectively similar. Let [X_1,\dots,X_r] denote the equivalence class of (X_1,\dots,X_r) with respect to \simeq.

Observation: (apparent unique local optimum; global attractor) Suppose that (X_1,\dots,X_r),(Y_1,\dots,Y_r) are L_{2,d}-SRDRs of (A_1,\dots,A_r) obtained by gradient ascent on the domain M_d(\mathbb{C})^{r++}. Then (X_1,\dots,X_r),(Y_1,\dots,Y_r) are usually projectively similar. Suppose now that \mathcal{X} is the random variable whose value is the equivalence class [X_1,\dots,X_r] of an L_{2,d}-SRDR (X_1,\dots,X_r) trained by some variant of gradient ascent. Then our computations indicate that the random variable \mathcal{X} typically has low entropy even in the cases where \mathcal{X} is a non-constant random variable.

Example: The above observation does not always hold, as L_{A_1,\dots,A_r;d} may have multiple non-projectively similar local maxima. For example, suppose that A_i= \begin{bmatrix} B_i & 0 \\ 0 & C_i\end{bmatrix} for 1\leq i\leq r where each B_i and each C_i is a d\times d-matrix. Then L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=\max(L_{B_1,\dots,B_r;d}(X_1,\dots,X_r),L_{C_1,\dots,C_r;d}(X_1,\dots,X_r)). Suppose now that L_{B_1,\dots,B_r;d}(B_1,\dots,B_r)>L_{C_1,\dots,C_r;d}(B_1,\dots,B_r) and L_{C_1,\dots,C_r;d}(C_1,\dots,C_r)>L_{B_1,\dots,B_r;d}(C_1,\dots,C_r). Then (B_1,\dots,B_r),(C_1,\dots,C_r) are usually non-projectively similar local maxima for the function L_{A_1,\dots,A_r;d}. Of course, (A_1,\dots,A_r) does not generate the full matrix algebra M_{2d}(\mathbb{C}), but there is some neighborhood U of (A_1,\dots,A_r) where if (D_1,\dots,D_r)\in U, then L_{D_1,\dots,D_r;d} will have at least two non-projectively similar local maxima. This sort of scenario is not as contrived as one may think, since (D_1,\dots,D_r) could represent two clusters with few or weak connections between them (for example, in natural language processing, each language could form its own cluster).

Observation: (height and breadth correlation for local maxima) Suppose that \mathcal{X} is the probability distribution of the (real, complex, or quaternionic) equivalence classes [X_1,\dots,X_r] of L_{2,d}-SRDRs obtained from (A_1,\dots,A_r) using gradient ascent. Then the random variables P(\mathcal{X}) (mapping an outcome to the probability of observing that outcome) and L_{A_1,\dots,A_r;d}(\mathcal{X}) tend to have a positive linear correlation. The probability P([X_1,\dots,X_r]) is typically maximized when L_{A_1,\dots,A_r;d}(X_1,\dots,X_r) is globally maximized rather than just locally maximized.

Observation: (preservation of symmetry) Suppose that A_1,\dots,A_r are Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite) and (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r). Then (X_1,\dots,X_r) is usually projectively similar to some tuple (Y_1,\dots,Y_r) where each Y_j is Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite).

Observation: (near functoriality) Suppose that e<d<n. Suppose that A_1,\dots,A_r\in M_n(\mathbb{C}) generate M_n(\mathbb{C}). Let X_1,\dots,X_r be an L_{2,e}-SRDR of A_1,\dots,A_r. Let B_1,\dots,B_r be an L_{2,d}-SRDR of A_1,\dots,A_r. Let C_1,\dots,C_r be an L_{2,e}-SRDR of B_1,\dots,B_r. Then (C_1,\dots,C_r) is usually not projectively similar to (X_1,\dots,X_r), but [C_1,\dots,C_r] is usually close to [X_1,\dots,X_r].

Observation: (Suboptimal real LSRDRs) Suppose that (A_1,\dots,A_r)\in M_n(\mathbb{R})^r. Let L_{A_1,\dots,A_r;d}^R:M_d(\mathbb{R})^{r++}\rightarrow[0,\infty) be the restriction of L_{A_1,\dots,A_r;d} to the domain M_d(\mathbb{R})^{r++}. A local maximum (X_1,\dots,X_r) for L_{A_1,\dots,A_r;d}^R shall be called a real LSRDR. Gradient ascent for the function L_{A_1,\dots,A_r;d}^R often produces more local maxima which are not global maxima than the complex version L_{A_1,\dots,A_r;d} does. These suboptimal local optima are tuples (X_1,\dots,X_r) where if A_1\otimes X_1+\dots+A_r\otimes X_r has eigenvalues \lambda_1,\dots,\lambda_{nd} with |\lambda_1|\leq\dots\leq|\lambda_{nd}|, then \lambda_{nd-1},\lambda_{nd} are complex conjugates. In this case, \rho(A_1\otimes X_1+\dots+A_r\otimes X_r)=|\lambda_{nd}|=|\lambda_{nd-1}|, but L_{A_1,\dots,A_r;d}^R(X_1,\dots,X_r) is not as large as it could be since the gradient ascent process must make the absolute values of two eigenvalues |\lambda_{nd-1}|,|\lambda_{nd}| large instead of simply making the absolute value of a single eigenvalue large. Fortunately, the most efficient way of computing real LSRDRs also naturally avoids the case when |\lambda_{nd}|=|\lambda_{nd-1}|. To compute LSRDRs, one needs to compute the gradient of the expression \rho(A_1\otimes X_1+\dots+A_r\otimes X_r), and the most efficient way to compute the gradient of the spectral radius of a matrix X uses the dominant left and right eigenvectors of X. The dominant left and right eigenvectors of X can be computed using a power iteration, but the power iteration technique (unless generalized) only effectively computes the dominant left and right eigenvectors of X if X has a single eigenvalue \lambda where |\lambda|>|\mu| whenever \mu is an eigenvalue different from \lambda. Therefore, a variation of gradient ascent that uses the dominant left and right eigenvectors to approximate the gradient of L_{A_1,\dots,A_r;d}^R will be inaccurate whenever A_1\otimes X_1+\dots+A_r\otimes X_r has multiple dominant eigenvalues. In practice, the gradient ascent process will still converge, but it will converge to a tuple (X_1,\dots,X_r) where A_1\otimes X_1+\dots+A_r\otimes X_r has a single dominant eigenvalue. In this case, the gradient ascent process avoids suboptimal real LSRDRs. With all that said, there are still examples (such as with MPO word embeddings) of real matrices (A_1,\dots,A_r) where training a complex LSRDR achieves better results than training a real LSRDR and where if (X_1,\dots,X_r) is a complex LSRDR, then there are C,\gamma where \gamma CX_jC^{-1} is a real matrix for 1\leq j\leq r.
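
To make the role of the dominant eigenvectors concrete, here is a minimal sketch (my own illustration, not the code used for these experiments) of the power-iteration step and of the resulting gradient of the spectral radius; it assumes the dominant eigenvalue is simple, which is exactly the situation that the gradient ascent described above tends to stay in.

```python
import numpy as np

def dominant_eigentriple(M, iters=1000):
    """Power iteration for the dominant eigenvalue and the dominant right and
    left eigenvectors of M.  This is only reliable when M has a single
    eigenvalue of strictly largest absolute value (the case discussed above)."""
    n = M.shape[0]
    u = np.random.randn(n) + 1j * np.random.randn(n)   # right eigenvector estimate
    w = np.random.randn(n) + 1j * np.random.randn(n)   # left eigenvector estimate
    for _ in range(iters):
        u = M @ u
        u /= np.linalg.norm(u)
        w = M.conj().T @ w
        w /= np.linalg.norm(w)
    lam = (w.conj() @ (M @ u)) / (w.conj() @ u)
    return lam, u, w

def spectral_radius_gradient(M):
    """Gradient of rho(M) with respect to real perturbations of the entries of M,
    using d(lambda)/dM = conj(w) u^T / (w^* u) for a simple dominant eigenvalue."""
    lam, u, w = dominant_eigentriple(M)
    dlam = np.outer(w.conj(), u) / (w.conj() @ u)
    return np.real(np.conj(lam / abs(lam)) * dlam)
```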

If A_1,\dots,A_r\in M_n(\mathbb{C}), then define a function T_{A_1,\dots,A_r;d}:\{(R,S)\in M_{d,n}(\mathbb{C})\times M_{n,d}(\mathbb{C})\mid(RA_1S,\dots,RA_rS)\in M_d(\mathbb{C})^{r++}\}\rightarrow[0,\infty) by setting T_{A_1,\dots,A_r;d}(R,S)=L_{A_1,\dots,A_r;d}(RA_1S,\dots,RA_rS).

We say that (R,S) is an L_{2,d}-SRDR projector if (R,S) is a local maximum for T_{A_1,\dots,A_r;d}(R,S). We say that (R,S) is a strong L_{2,d}-SRDR projector if (RA_1S,\dots,RA_rS) is an L_{2,d}-SRDR.

Observation: If (R,S) is an L_{2,d}-SRDR projector, then there is usually a constant \lambda\in\mathbb{C}^\times with RS=\lambda 1_d and a projection \pi:\mathbb{C}^n\rightarrow\mathbb{C}^n of rank d with SR=\lambda \pi. L_{2,d}-SRDR projectors are usually strong L_{2,d}-SRDR projectors. If R_1,R_2\in M_{d,n}(\mathbb{C}),S_1,S_2\in M_{n,d}(\mathbb{C}), then set [[R_1,S_1]]=[[R_2,S_2]] precisely when there are \mu,\nu\in\mathbb{C}^\times and an invertible Q\in M_d(\mathbb{C}) where R_2=\mu QR_1,S_2=\nu S_1Q^{-1}. If (R_1,S_1) and (R_2,S_2) are L_{2,d}-SRDR projectors, then we typically have [[R_1,S_1]]=[[R_2,S_2]]. In fact, R_1S_2R_2S_1=\theta 1_d for some \theta\in\mathbb{C}^\times, and if we set Q=R_2S_1, then R_2=\mu QR_1,S_2=\nu S_1Q^{-1} for some \mu,\nu\in\mathbb{C}^\times.

We say that an L_{2,d}-SRDR projector (R,S) is constant factor normalized if RS=1_d. If (R_1,S_1),(R_2,S_2) are two constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then typically R_1S_2R_2S_1=1_d, and if we set Q=R_2S_1, then R_2=QR_1,S_2=S_1Q^{-1}. Furthermore, we typically have S_1R_1=S_2R_2=(S_1R_1)^2=(S_2R_2)^2. We shall call the projection matrix S_1R_1 an L_{2,d}-LSRDR projection operator of (A_1,\dots,A_r).

Observation: Let (R,S) be a constant factor normalized L_{2,d}-SRDR projector of (A_1,\dots,A_r), and suppose that X_j=RA_jS for 1\leq j\leq r. Let P=SR and suppose that P is a projection matrix. Let U_R be a dominant eigenvector of \Gamma(A_1,\dots,A_r;X_1,\dots,X_r) and let U_L be a dominant eigenvector of \Gamma(A_1,\dots,A_r;X_1,\dots,X_r)^*=\Gamma(A_1^*,\dots,A_r^*;X_1^*,\dots,X_r^*). Then in our experiments, the matrices RU_R,U_RS^*,S^*U_L,U_LR can all be written as positive semidefinite matrices multiplied by a complex constant.

The eigenvalues of \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP) are all simple eigenvalues.

Let G_0,H_0 be random complex n\times n positive semidefinite matrices. Let H_{k+1}=\frac{\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H_k)}{\|\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H_k)\|} and let G_{k+1}=\frac{\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G_k)}{\|\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G_k)\|} for k\geq 0. Then the sequence (H_k)_k converges to a positive semidefinite matrix H and the sequence (G_k)_k converges to a positive semidefinite matrix G. The positive semidefinite matrices G,H do not depend on the initialization G_0,H_0. There exist complex constants \mu_H,\mu_G where U_RS^*=\mu_H\cdot H and U_LR=\mu_G\cdot G.

Here, G=GP=P^*G=P^*GP,H=PH=HP^*=PHP^*, and \text{Im}(G)=\text{Im}(P^*)=\ker(P)^\perp and \text{Im}(H)=\text{Im}(P)=\ker(P^*)^\perp.

Let G_0^\sharp,H_0^\sharp be random complex n\times n positive semidefinite matrices. Let H^\sharp_{k+1}=\frac{\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)(H^\sharp_k)}{\|\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)(H^\sharp_k)\|} and let G_{k+1}^\sharp=\frac{\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)^*(G^\sharp_k)}{\|\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)^*(G^\sharp_k)\|} for k\geq 0. Then G_k^\sharp\rightarrow G,H_k^\sharp\rightarrow H.

The matrix H is the dominant eigenvector of the superoperator \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP). The matrix H is the dominant eigenvector of the superoperator \Phi(PA_1P,\dots,PA_rP). The matrix G is the dominant eigenvector of the superoperator \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*. The matrix G is the dominant eigenvector of the superoperator \Phi(PA_1P,\dots,PA_rP)^*. Furthermore, we have \rho(\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP))=\rho(\Phi(PA_1P,\dots,PA_rP)). These eigenvectors correspond to real positive eigenvalues: \rho(\Phi(PA_1P,\dots,PA_rP))\cdot H=\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H)=\Phi(PA_1P,\dots,PA_rP)(H) and \rho(\Phi(PA_1P,\dots,PA_rP))\cdot G=\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G)=\Phi(PA_1P,\dots,PA_rP)^*(G).
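
For readers who want to reproduce the H and G iterations, here is a sketch. It assumes (as I read the definitions post) that \Gamma(A_1,\dots,A_r;B_1,\dots,B_r) is the superoperator Y\mapsto A_1YB_1^*+\dots+A_rYB_r^* and that \Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r); if the definitions post uses a different convention, adjust accordingly.

```python
import numpy as np

def Gamma(As, Bs):
    """Matrix (column-stacking convention) of the superoperator
    Y -> A_1 Y B_1^* + ... + A_r Y B_r^*, assumed to be Gamma(A_1,...,A_r;B_1,...,B_r)."""
    return sum(np.kron(np.conj(B), A) for A, B in zip(As, Bs))

def Phi(As):
    """Matrix of the completely positive map Y -> sum_j A_j Y A_j^*."""
    return Gamma(As, As)

def psd_power_iteration(As, Bs, iters=3000, seed=0):
    """The H_k recursion above: start from a random PSD matrix and repeatedly
    apply Gamma(A_1,...,A_r;B_1,...,B_r), normalizing at each step."""
    rng = np.random.default_rng(seed)
    n = As[0].shape[0]
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    H = Z @ Z.conj().T                                  # random PSD initialization
    G = Gamma(As, Bs)
    for _ in range(iters):
        H = (G @ H.reshape(-1, order='F')).reshape(n, n, order='F')
        H /= np.linalg.norm(H)
    return H
```

For example, psd_power_iteration(As, [P @ A @ P for A in As]) approximates H, while the adjoint iteration for G uses Gamma(As, Bs).conj().T in place of G.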

Application: LSRDRs may be used to compress data that can be represented as a collection of matrices. Suppose that (A_1,\dots,A_r) is a collection of matrices. Let (R,S) be a constant factor normalized L_{2,d}-SRDR projector of (A_1,\dots,A_r). Let X_j=RA_jS for 1\leq j\leq r. Then the matrices (A_1,\dots,A_r) can be approximately recovered from (R,S) and (X_1,\dots,X_r). The matrix SX_jR=SRA_jSR is an approximation for the matrix A_j which we shall call an LSRDR principal component (LSRDRPC) for A_j. We observe that if (R_1,S_1),(R_2,S_2) are two constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then since typically S_1R_1=S_2R_2, we have S_1R_1A_jS_1R_1=S_2R_2A_jS_2R_2. In other words, the LSRDRPC for A_j is typically unique, or at the very least, there will be few LSRDRPCs for A_j.
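
A sketch of the reconstruction step (assuming R, S, and the X_j have already been produced by some LSRDR training routine, which is not shown here):

```python
def lsrdr_principal_components(As, R, S):
    """Return the LSRDRPCs S R A_j S R described above.  Assumes (R, S) is a
    constant factor normalized L_{2,d}-SRDR projector, so that P = S R is the
    associated rank-d projection and R A_j S are the compressed matrices."""
    P = S @ R
    return [P @ A @ P for A in As]
```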

Observation: If (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r), then there is usually a strong L_{2,d}-SRDR projector (R,S) where X_j=RA_jS for 1\leq j\leq r.

Observation: If (X_1,\dots,X_r) are self-adjoint, then whenever (R,S) is a constant factor normalized L_{2,d}-SRDR projector of (X_1,\dots,X_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (X_1,\dots,X_r) with S_1=R_1^* and R_1R_1^*=S_1^*S_1=1_d. Furthermore, if (R_2,S_2) is another L_{2,d}-SRDR projector of (X_1,\dots,X_r) with S_2=R_2^*,R_2R_2^*=S_2^*S_2=1_d, then Q=R_1S_2 is a unitary matrix, S_2R_2=S_1R_1, and \text{Im}(S_1)=\text{Im}(S_2)=\ker(R_1)^\perp=\ker(R_2)^\perp. If \iota:\text{Im}(S_1)\rightarrow\mathbb{C}^n is the inclusion mapping and j:\mathbb{C}^n\rightarrow\text{Im}(S_1) is the orthogonal projection, then [[j,\iota]] is an L_{2,d}-SRDR projector.

Observation: If (A_1,\dots,A_r) are real square matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) where R_1,S_1 are real matrices. Furthermore, if (R_1,S_1),(R_2,S_2) are real constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then there is a real invertible matrix Q where R_2=QR_1,S_2=S_1Q^{-1}.

Observation: If (A_1,\dots,A_r) are real symmetric matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) where R_1,S_1 are real matrices with R_1=S_1^*. In fact, if (R_1,S_1),(R_2,S_2) are real constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r) with R_1=S_1^*,R_2=S_2^*, then there is a real orthogonal matrix Q with R_2=QR_1,S_2=S_1Q^{-1}.

Observation: If (A_1,\dots,A_r) are complex symmetric matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) with R_1=S_1^T. Furthermore, if (R_1,S_1),(R_2,S_2) are constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r) with R_1=S_1^T,R_2=S_2^T, and Q=R_1S_2, then QQ^T=1_d.

Application: Dimensionality reduction of various manifolds. Let \mathbb{CQ}^{n-1} denote the quotient complex manifold (\mathbb{C}^n\setminus\{0\})/\simeq where we set \mathbf{x}\simeq\mathbf{y} precisely when \mathbf{x}=\mathbf{y} or \mathbf{x}=-\mathbf{y}. One can use our observations about LSRDRs to perform dimensionality reduction of data in \mathbb{RP}^{n-1}\times(0,\infty),\mathbb{CP}^{n-1}\times(0,\infty),\mathbb{CQ}^{n-1}, and we can also use dimensionality reduction to map data in \mathbb{C}^n and \mathbb{R}^n to data in \mathbb{C}^d and \mathbb{R}^d respectively.

Let \mathcal{A}_n be the collection of all n\times n positive semidefinite rank-1 real matrices. Let \mathcal{B}_n be the collection of all n\times n positive semidefinite rank-1 complex matrices. Let \mathcal{C}_n be the collection of all n\times n symmetric rank-1 complex matrices. The sets \mathcal{A}_n,\mathcal{B}_n,\mathcal{C}_n can be put into canonical one-to-one correspondences with \mathbb{RP}^{n-1}\times(0,\infty),\mathbb{CP}^{n-1}\times(0,\infty),\mathbb{CQ}^{n-1} respectively. Suppose that (A_1,\dots,A_r)\in M_n(\mathbb{C})^r and (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an L_{2,d}-SRDR of (A_1,\dots,A_r). If (A_1,\dots,A_r)\in\mathcal{A}_n^r, then [X_1,\dots,X_r]\cap\mathcal{A}_d^r\neq\emptyset, and every element of [X_1,\dots,X_r]\cap\mathcal{A}_d^r can be considered as a dimensionality reduction of (A_1,\dots,A_r). Similarly, if (A_1,\dots,A_r)\in\mathcal{B}_n^r, then [X_1,\dots,X_r]\cap\mathcal{B}_d^r\neq\emptyset, and if (A_1,\dots,A_r)\in\mathcal{C}_n^r, then [X_1,\dots,X_r]\cap\mathcal{C}_d^r\neq\emptyset.

L_{2,d}-SRDRs can be used to produce dimensionality reductions that map subsets W\subseteq K^n to K^d where K\in\{\mathbb{R},\mathbb{C},\mathbb{H}\}.

Suppose that a_1,\dots,a_r\in\mathbb{C}^n. Then let A_k=a_ka_k^* for 1\leq k\leq r. Then there is usually some d-dimensional subspace V\subseteq\mathbb{C}^n where if \iota:V\rightarrow\mathbb{C}^n is the inclusion mapping and j:\mathbb{C}^n\rightarrow V is the orthogonal projection, then (j,\iota) is a constant factor normalized L_{2,d}-SRDR projector for (A_1,\dots,A_r). In this case, (j(a_1),\dots,j(a_r)) is a dimensionality reduction of the tuple (a_1,\dots,a_r).

Suppose that a_1,\dots,a_r\in\mathbb{C}^n. Then let A_k=a_ka_k^T for 1\leq k\leq r. Suppose that (R,S) is a constant factor normalized L_{2,d}-SRDR projector with respect to (A_1,\dots,A_r). Then (R(a_1),\dots,R(a_r)) is a dimensionality reduction of (a_1,\dots,a_r).
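
As a sketch of the vector pipeline in the last two paragraphs (my own illustration; it assumes the rank-d projection P = SR has already been obtained from an LSRDR projector, for example by gradient ascent on T_{A_1,\dots,A_r;d}):

```python
import numpy as np

def reduce_vectors(vectors, P, d):
    """Send each a_k in C^n to its d coordinates in an orthonormal basis of Im(P),
    where P = S R is the rank-d projection coming from an LSRDR projector.  In the
    Hermitian rank-1 case this realizes the map j above; in the symmetric rank-1
    case one would instead apply R directly to each a_k."""
    U, s, _ = np.linalg.svd(P)
    basis = U[:, :d]                        # columns: orthonormal basis of Im(P)
    return [basis.conj().T @ a for a in vectors]
```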

One may need to normalize the data a_1,\dots,a_r\in K^n in the above paragraphs before taking a dimensionality reduction. Suppose that b_1,\dots,b_r\in K^n. Let \mu=(b_1+\dots+b_r)/r be the mean of (b_1,\dots,b_r). Let a_k=b_k-\mu for 1\leq k\leq r. Suppose that (T(a_1),\dots,T(a_r)) is a dimensionality reduction of (a_1,\dots,a_r). Then (T(a_1)+\mu,\dots,T(a_r)+\mu) is a normalized dimensionality reduction of (b_1,\dots,b_r).

Application: (clustering for neural networks with bilinear layers) Suppose that V,W are finite dimensional real inner product spaces. We say that a function f:V\rightarrow W is a quadratic form if g\circ f:V\rightarrow\mathbb{R} is a quadratic form for each g\in W^*. We say that a linear operator A:V\rightarrow V\otimes W is symmetric if (1_V\otimes w)A is a symmetric linear operator for each w\in W^*. The quadratic forms f:V\rightarrow W can be put into a one-to-one correspondence with the symmetric linear operators A:V\rightarrow V\otimes W by the relation f(v)=(v^T\otimes 1_W)Av for v\in V.

Suppose that f:V\rightarrow W is a quadratic form and A:V\rightarrow V\otimes W is a linear mapping such that f(v)=(v^T\otimes 1_W)Av for v\in V. Suppose furthermore that X:U\rightarrow U\otimes W is an LSRDR of A. Let R,S be the linear operators where X=(R\otimes 1_W)AS and suppose that RS=1_U. Then let P=SR. Then P is an orthogonal projection operator, and the space \text{Im}(P) is a cluster in V. Furthermore, let U_R be the dominant eigenvector of \Gamma(A;X) and let U_L be the dominant eigenvector of \Gamma(A;X)^*. Then we may assume that the operators U_R\cdot S^T,U_L\cdot R are positive semidefinite with trace 1. Then whenever v\in V,\|v\|=1, the values \langle U_R\cdot S^T(v),v\rangle and \langle U_L\cdot R(v),v\rangle are measures of how well the vector v belongs to the cluster \text{Im}(P).

Observation: (Random real matrices) Suppose that d\cdot r/n\approx 0 and U_1,\dots,U_r\in M_n(\mathbb{R}) are random real matrices; for simplicity assume that the entries in U_1,\dots,U_r are independent and normally distributed with mean 0 and variance 1. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}\sqrt{n}}\approx 1. In particular, \rho_{2,d}(U_1,\dots,U_r)/\sqrt{n}\approx 1.

Observation: (Random complex matrices) Suppose now that d\cdot r/n\approx 0 and U_1,\dots,U_r\in M_n(\mathbb{C}) are random complex matrices; for simplicity assume that the entries in U_1,\dots,U_r are independent Gaussian random variables with mean 0, where the real part and the imaginary part of each entry each have variance 1/2. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}\sqrt{n}}\approx 1 and \rho_{2,d}(U_1,\dots,U_r)/\sqrt{n}\approx 1.
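
These observations are easy to check numerically; here is a Monte Carlo sketch for the real Gaussian case (an illustration of mine, with n, d, r chosen so that d\cdot r/n is small):

```python
import numpy as np

def gaussian_ratio(n=200, d=2, r=3, seed=0):
    """Estimate rho(U_1 (x) X_1 + ... + U_r (x) X_r) /
    (rho(X_1 (x) conj(X_1) + ... + X_r (x) conj(X_r))^{1/2} * sqrt(n))
    for i.i.d. standard Gaussian real U_j and a fixed random complex tuple X_j.
    The observation above says this should be close to 1 when d*r/n is small."""
    rng = np.random.default_rng(seed)
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    Us = [rng.standard_normal((n, n)) for _ in range(r)]
    Xs = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(r)]
    num = sr(sum(np.kron(U, X) for U, X in zip(Us, Xs)))
    den = sr(sum(np.kron(X, np.conj(X)) for X in Xs)) ** 0.5
    return num / (den * np.sqrt(n))
```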

Observation: (Random permutation matrices) Suppose that d\cdot r/n\approx 0 and \phi:S_n\rightarrow GL(V) is the standard (n-1)-dimensional irreducible representation of S_n. Let g_1,\dots,g_r\in S_n be selected uniformly at random. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(\phi(g_1)\otimes X_1+\dots+\phi(g_r)\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}}\approx 1 and \rho_{2,d}(\phi(g_1),\dots,\phi(g_r))\approx 1.

Observation: (Random orthogonal or unitary matrices) Suppose that d\cdot r/n\approx 0, and let (U_1,\dots,U_r) be a collection of n\times n unitary matrices selected at random according to the Haar measure, or let (U_1,\dots,U_r) be a collection of n\times n orthogonal matrices selected at random according to the Haar measure. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes \overline{X_r})^{1/2}}\approx 1 and \rho_{2,d}(U_1,\dots,U_r)\approx 1.

Suppose that X is a complex matrix with eigenvalues \lambda_1,\dots,\lambda_n with |\lambda_1|\geq\dots\geq|\lambda_n|. Then define the spectral gap ratio of X as |\lambda_1|/|\lambda_2|.
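
In code, this is a one-liner (a helper of mine, reused in the checks below):

```python
import numpy as np

def spectral_gap_ratio(M):
    """|lambda_1| / |lambda_2|, where |lambda_1| >= |lambda_2| >= ... are the
    absolute values of the eigenvalues of M."""
    mags = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
    return mags[0] / mags[1]
```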

Observation: Let 1\leq d\leq n. Let A_1,\dots,A_r\in M_n(\mathbb{C}) be random matrices according to some distribution. Suppose that R\in M_{d,n}(\mathbb{C}),S\in M_{n,d}(\mathbb{C}) are matrices where RS is positive semidefinite (for example, (R,S) could be a constant factor normalized LSRDR projector of (A_1,\dots,A_r)). Let X=A_1\otimes\overline{RA_1S}+\dots+A_r\otimes\overline{RA_rS}. Suppose that the eigenvalues of X are \lambda_1,\dots,\lambda_{nd} with |\lambda_1|\geq\dots\geq|\lambda_{nd}|. Then the spectral gap ratio of X is much larger than the spectral gap ratio of random matrices that follow the circular law. However, the eigenvalues (\lambda_2,\dots,\lambda_{nd}) with the dominant eigenvalue \lambda_1 removed still follow the circular law (or the semicircular or elliptical law depending on the choice of distribution of A_1,\dots,A_r). The value \lambda_1/|\lambda_1| will be approximately 1. The large spectral gap ratio for X makes the version of gradient ascent that we use to compute LSRDRs work well; the power iteration algorithm for computing dominant eigenvectors converges more quickly when there is a large spectral gap ratio, and the dominant eigenvectors of a matrix are needed to calculate the gradient of the spectral radius of that matrix.

Observation: Suppose that d/n is near zero. Suppose that U_1,\dots,U_r\in M_n(\mathbb{C}) are independent random complex matrices. Suppose furthermore that the entries in each U_i are independent, Gaussian, and have variance 1/(2\cdot n). Then \rho_{2,d}(U_1,\dots,U_r)\approx 1+\sqrt{rd/n}.

Observation: (increased regularity in high dimensions) Suppose that A_1,\dots,A_r are n\times n-complex matrices. Suppose that 1\leq d<e\leq n. Suppose that (W_1,\dots,W_r),(X_1,\dots,X_r) are L_{2,d}-SRDRs of (A_1,\dots,A_r) trained by gradient ascent while (Y_1,\dots,Y_r),(Z_1,\dots,Z_r) are L_{2,e}-SRDRs of (A_1,\dots,A_r) trained by gradient ascent. Then it is more likely for (Y_1,\dots,Y_r) to be projectively similar to (Z_1,\dots,Z_r) than it is for (W_1,\dots,W_r) to be projectively similar to (X_1,\dots,X_r). In general, (Y_1,\dots,Y_r) will be better behaved than (W_1,\dots,W_r).

Application: (Iterated LSRDRs) Suppose that 1\leq n_1<\dots<n_s=n. Let A_1,\dots,A_r be n\times n-complex matrices. Suppose that X_{s,j}=A_j for 1\leq j\leq r. Suppose furthermore that (X_{t,1},\dots,X_{t,r}) is an L_{2,n_t}-SRDR of (X_{t+1,1},\dots,X_{t+1,r}) whenever 1\leq t<s, so that each X_{t,j} is an n_t\times n_t-matrix. Then we shall call (X_{1,1},\dots,X_{1,r}) an iterated LSRDR of (A_1,\dots,A_r). One advantage of iterated LSRDRs over LSRDRs is that an iterated LSRDR (X_{1,1},\dots,X_{1,r}) of (A_1,\dots,A_r) where each X_{1,j} is an n_1\times n_1-matrix is more likely to be well-behaved than an LSRDR (X_1,\dots,X_r) of (A_1,\dots,A_r) where each X_j is an n_1\times n_1-matrix. While iterated LSRDRs may seem to require more intensive computation than ordinary LSRDRs, it will take fewer rounds of gradient ascent to compute (X_{t,1},\dots,X_{t,r}) from (X_{t+1,1},\dots,X_{t+1,r}) than it will take to compute the LSRDR (X_1,\dots,X_r).

Observation: (Increased regularity with more matrices) Suppose that 2\leq r<s. Let A_1,\dots,A_s\in M_n(\mathbb{C}). Suppose that (X_1,\dots,X_r) is an LSRDR of (A_1,\dots,A_r) while (Y_1,\dots,Y_s) is an LSRDR of (A_1,\dots,A_s). Then the LSRDR (Y_1,\dots,Y_s) is generally better behaved than the LSRDR (X_1,\dots,X_r). For example, if all the matrices A_1,\dots,A_s are real, then it is more likely that one can find \lambda,C where \lambda CY_1C^{-1},\dots,\lambda CY_sC^{-1} are all real matrices than it is to find \lambda,C where \lambda CX_1C^{-1},\dots,\lambda CX_rC^{-1} are all real matrices.

Application: (Appending matrix products) Suppose that (A_1,\dots,A_r) are n\times n-complex matrices and 1\leq d<n. Suppose that (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an LSRDR of (A_1,\dots,A_r). If the LSRDR (X_1,\dots,X_r) is poorly behaved, then one may modify (A_1,\dots,A_r) to obtain a better behaved LSRDR.

For example, if \gamma=(a_1,\dots,a_t)\in\{1,\dots,r\}^t for some t\geq 1, then set A_\gamma=A_{a_1}\dots A_{a_t}. Suppose that \alpha_1,\dots,\alpha_s are non-negative real numbers. Then let V=\bigcup_{j=1}^s\{1,\dots,r\}^j. Then let (Y_\gamma)_{\gamma\in V} be an LSRDR of (\alpha_{|\gamma|}\cdot A_\gamma)_{\gamma\in V}. Then the LSRDR (Y_\gamma)_{\gamma\in V} typically behaves better than the LSRDR (X_1,\dots,X_r). While each round of gradient ascent for computing the LSRDR (Y_\gamma)_{\gamma\in V} takes longer than each round for computing the LSRDR (X_1,\dots,X_r), the gradient ascent will converge in fewer rounds for (Y_\gamma)_{\gamma\in V} than it will for (X_1,\dots,X_r).

Block preservation

Suppose that V_1,\dots,V_q are real or complex inner product spaces. Let V=V_1\oplus\dots\oplus V_q. Let A_j:V\rightarrow V be a linear operator for 1\leq j\leq r. Suppose that for each 1\leq j\leq r, there are u_j,v_j\in\{1,\dots,q\} where if u,v\in\{1,\dots,q\}, x\in V_u,y\in V_v, and \langle A_jx,y\rangle\neq 0, then u=u_j,v=v_j. Let P:V\rightarrow V be an LSRDR projection operator of (A_1,\dots,A_r). Then we may typically find projection operators P_i:V_i\rightarrow V_i for 1\leq i\leq q with P=P_1\oplus\dots\oplus P_q.

Suppose now that f:\{1,\dots,n\}^2\rightarrow\mathbb{C} is a function. For i,j\in\{1,\dots,n\}, let A_{i,j} be the n\times n matrix whose (i,j)-th entry is f(i,j) and whose other entries are zero. Let a_{i,j}=|f(i,j)|^2 for i,j\in\{1,\dots,n\}, and let A=(a_{i,j})_{1\leq i\leq n,1\leq j\leq n}. Let P\in M_n(\mathbb{C}) be an L_{2,d}-LSRDR projection operator of (A_{i,j})_{1\leq i\leq n,1\leq j\leq n}. If T\subseteq\{1,\dots,n\}, then let A|_T=(b_{i,j})_{1\leq i\leq n,1\leq j\leq n} be the matrix where b_{i,j}=a_{i,j} whenever i,j\in T and b_{i,j}=0 otherwise. If T\subseteq\{1,\dots,n\}, then let \pi_T be the n\times n-diagonal matrix whose i-th diagonal entry is 1 for i\in T and 0 for i\not\in T. Then we typically have P=\pi_T where T is the subset of \{1,\dots,n\} with |T|=d for which \rho(A|_T) is maximized.
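
Here is a brute-force sketch (an illustration of mine, only feasible for small n) of the subset that the LSRDR projection is observed to select:

```python
import itertools
import numpy as np

def best_subset_by_spectral_radius(A, d):
    """Return the size-d subset T of {0,...,n-1} maximizing rho(A|_T), i.e. the T
    with P = pi_T in the observation above (0-indexed here)."""
    n = A.shape[0]
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    def restricted(T):
        B = np.zeros_like(A)
        idx = np.ix_(T, T)
        B[idx] = A[idx]
        return B
    return max(itertools.combinations(range(n), d),
               key=lambda T: sr(restricted(list(T))))
```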

Application: Given a matrix Q, let \sum Q denote the sum of all the entries in Q. Suppose that B is a real n\times n-matrix. Then we may use LSRDRs to find a subset T\subseteq \{1,\dots,n\} with |T|=d where \sum(B|_T) is maximized.

Let M_{n,d} be the n\times n-matrix where each entry is 1/d, and let M_d=M_{d,d}. Let X be a real d\times d-matrix. Then it is not too difficult to show that \lim_{\delta\rightarrow 0}\frac{1}{\delta}(\rho(M_d+\delta X)-\rho(M_d))=\sum X/d.
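
A quick numerical check of this limit (a sketch of mine):

```python
import numpy as np

def check_derivative(d=4, delta=1e-6, seed=0):
    """Compare (rho(M_d + delta*X) - rho(M_d))/delta with sum(X)/d for a random d x d X."""
    rng = np.random.default_rng(seed)
    M = np.full((d, d), 1.0 / d)                        # M_d: every entry equals 1/d
    X = rng.standard_normal((d, d))
    sr = lambda A: np.max(np.abs(np.linalg.eigvals(A)))
    return (sr(M + delta * X) - sr(M)) / delta, X.sum() / d
```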

Suppose that B is an n\times n-real matrix and that there is a unique T\subseteq\{1,\dots,n\} with |T|=d for which \sum B|_T is maximized. Then whenever \delta is sufficiently small and A=M_{n,d}+\delta B, we typically have P=\pi_T.

Quaternions

We say that a 2d\times 2d complex matrix A=(a_{i,j})_{i,j} is a quaternionic complex matrix if a_{2i,2j}=\overline{a_{2i-1,2j-1}} and a_{2i,2j-1}=-\overline{a_{2i-1,2j}} whenever i,j\in\{1,\dots,d\}. Suppose that each A_j is an n\times n quaternionic complex matrix for 1\leq j\leq r. Suppose that d is an even integer with d\leq n and suppose that (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r). Then there is often some invertible C\in M_d(\mathbb{C}) and \lambda\in\mathbb{C}^\times where if we set Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r, then Y_1,\dots,Y_r are quaternionic complex matrices. Furthermore, if (R,S) is an L_{2,d}-SRDR projector, then there is some L_{2,d}-SRDR projector (R_1,S_1) where [[R,S]]=[[R_1,S_1]] but where R_1,S_1 are quaternionic complex matrices.

We observe that if A_1,\dots,A_r\in M_n(\mathbb{C}) are quaternionic complex matrices and R\in M_{d,n}(\mathbb{C}),S\in M_{n,d}(\mathbb{C}) are rectangular matrices with RS=1_d, then the matrix A_1\otimes\overline{RA_1S}+\dots+A_r\otimes\overline{RA_rS} will still have a single dominant eigenvalue with a spectral gap ratio significantly above 1, in contrast with the fact that A_1,\dots,A_r each have a spectral gap ratio of 1 since the eigenvalues of A_1,\dots,A_r come in conjugate pairs. In particular, if X_1,\dots,X_r\in M_d(\mathbb{C}) are matrices obtained during the process of maximizing L_{A_1,\dots,A_r;d}, then the matrix A_1\otimes\overline{X_1}+\dots+A_r\otimes\overline{X_r} will have a spectral gap ratio that is significantly above 1; this ensures that gradient ascent will not have any problems in optimizing L_{A_1,\dots,A_r;d}(X_1,\dots,X_r).

Gradient ascent on subspaces

Let n be a natural number. Let \mathcal{H}_n,\mathcal{S}_n,\mathcal{Q}_n be the sets of all n\times n Hermitian matrices, complex symmetric matrices, and quaternionic complex matrices, respectively. Suppose that \mathcal{C}\in\{\mathcal{H},\mathcal{S},\mathcal{Q}\}. Suppose that (A_1,\dots,A_r)\in\mathcal{C}_n^r. Then let (X_1,\dots,X_r)\in\mathcal{C}_d^r be a tuple such that L_{A_1,\dots,A_r;d}|_{\mathcal{C}_d^r}(X_1,\dots,X_r) is locally maximized. Suppose that (Y_1,\dots,Y_r)\in M_d(\mathbb{C})^r is a tuple where L_{A_1,\dots,A_r;d}(Y_1,\dots,Y_r) is locally maximized. Then we usually have L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=L_{A_1,\dots,A_r;d}(Y_1,\dots,Y_r). In this case, we would say that (X_1,\dots,X_r) is an alternate domain L_{2,d}-spectral radius dimensionality reduction (ADLSRDR).

Homogeneous non-commutative polynomials

Suppose that p(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree n. Define a fitness function M_{p,d} by letting M_{p,d}(A_1,\dots,A_r)=\frac{\rho(p(A_1,\dots,A_r))^{1/n}}{\rho_2(A_1,\dots,A_r)} whenever (A_1,\dots,A_r)\in M_d(\mathbb{C})^r. We say that (A_1,\dots,A_r) is an L_{2,d}-spectral radius dimensionality reduction of p(x_1,\dots,x_r) if M_{p,d}(A_1,\dots,A_r) is locally maximized.

If p is a tensor of degree 2, then \max M_{p,d}(A_1,\dots,A_r)=\|p\|_2 where \|p\|_2 denotes the Frobenius norm of p. One should take an LSRDR of an n-tensor p only if n\geq 3 since there are plenty of linear algebraic tools that one can use to analyze 2-tensors.

Suppose that p(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree n with random coefficients. Let (X_1,\dots,X_r) be an LSRDR of p(x_1,\dots,x_r). Then whenever q(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree m where n does not divide m, we typically have \text{Tr}(q(X_1,\dots,X_r))\approx 0. Let \mathcal{E}=\Phi(X_1,\dots,X_r). The eigenvalues of \mathcal{E} will be symmetric under rotation by 2\pi/n. Whenever the coefficients of p are random complex numbers and n=3, the superoperator \mathcal{E} has no real eigenvalues except for a dominant eigenvalue and zero eigenvalues. In the case when the coefficients of p are random real numbers and n=3, the superoperator \mathcal{E} may have other real eigenvalues, but these non-zero, non-dominant real eigenvalues will tend to have multiplicity 2.

A simple calculation shows that the above observation about the traces of q(X_1,\dots,X_r) and the above observations about the spectrum of \mathcal{E} are equivalent.

Observation: The trace of a completely positive superoperator is always a non-negative real number. In particular, \text{Tr}(\Phi(A_1,\dots,A_r))=|\text{Tr}(A_1)|^2+\dots+|\text{Tr}(A_r)|^2.

Observation: Suppose that A_{1,1},\dots,A_{1,r_1},\dots,A_{s,1},\dots,A_{s,r_s} are d\times d real or complex matrices. Then

\text{Tr}(\Phi(A_{1,1},\dots,A_{1,r_1})\dots\Phi(A_{s,1},\dots,A_{s,r_s}))=\sum_{i_1\in\{1,\dots,r_1\},\dots,i_s\in\{1,\dots,r_s\}}|\text{Tr}(A_{1,i_1}\dots A_{s,i_s})|^2.

Therefore \text{Tr}(\Phi(A_{1,1},\dots,A_{1,r_1})\dots\Phi(A_{s,1},\dots,A_{s,r_s}))=0 if and only if \text{Tr}(A_{1,i_1}\dots A_{s,i_s})=0 whenever i_1\in\{1,\dots,r_1\},\dots,i_s\in\{1,\dots,r_s\}.
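
This identity is easy to verify numerically; here is a sketch (again assuming the column-stacking matrix representation of \Phi used in the earlier superoperator sketch):

```python
import numpy as np
from itertools import product
from functools import reduce

def Phi_matrix(As):
    """Matrix (column-stacking convention) of Y -> sum_j A_j Y A_j^*."""
    return sum(np.kron(np.conj(A), A) for A in As)

def check_trace_identity(groups):
    """groups = [[A_{1,1},...,A_{1,r_1}], ..., [A_{s,1},...,A_{s,r_s}]].
    Returns both sides of Tr(Phi(A_{1,*}) ... Phi(A_{s,*}))
    = sum over (i_1,...,i_s) of |Tr(A_{1,i_1} ... A_{s,i_s})|^2."""
    lhs = np.trace(reduce(np.matmul, [Phi_matrix(g) for g in groups]))
    rhs = sum(abs(np.trace(reduce(np.matmul, choice))) ** 2
              for choice in product(*groups))
    return lhs, rhs
```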

LSRDRs of homogeneous non-commutative polynomials are typically well behaved even when the polynomial has very high degree. For example, let k\geq 0 and let (p_{i,j})_{1\leq i\leq k,1\leq j\leq r} be a system of homogeneous non-commutative polynomials in x_1,\dots,x_r, each of degree s. Then for 1\leq i\leq k, let p_i(x_1,\dots,x_r)=(p_{i,1}(x_1,\dots,x_r),\dots,p_{i,r}(x_1,\dots,x_r)). Let p=p_k\circ\dots\circ p_1. Then there are homogeneous non-commutative polynomials q_1,\dots,q_r where p(x_1,\dots,x_r)=(q_1(x_1,\dots,x_r),\dots,q_r(x_1,\dots,x_r)). If (X_1,\dots,X_r),(Y_1,\dots,Y_r) are L_{2,d}-SRDRs of q_1 trained with different initializations, then we may typically find some \lambda\in\mathbb{C}^\times and some invertible C with Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r. This is despite the fact that q_1 has degree s^k, which could be very large.

One can use LSRDRs to map tensors in V_1\otimes\dots\otimes V_n to other objects in several ways. Let V=V_1\oplus\dots\oplus V_n. Then since V_1\otimes\dots\otimes V_n\subseteq V^{\otimes n}, we may consider a tensor v in V_1\otimes\dots\otimes V_n as an element of V^{\otimes n}, and we may take our LSRDR of the tensor v. Suppose that \text{Dim}(V_j)=r_j. Then we can represent a tensor v\in V_1\otimes\dots\otimes V_n as a homogeneous non-commutative polynomial p(x_{1,1},\dots,x_{1,r_1};\dots;x_{n,1},\dots,x_{n,r_n}) consisting of a linear combination of monomials of the form x_{1,i_1}\dots x_{n,i_n}. Define a fitness function M:M_d(K)^{r_1+\dots+r_n}\rightarrow[-\infty,\infty) by setting

M(A_{1,1},\dots,A_{1,r_1};\dots;A_{n,1},\dots,A_{n,r_n})=\log(\rho(p(A_{1,1},\dots,A_{1,r_1};\dots;A_{n,1},\dots,A_{n,r_n})))/n-\log(m(\|\sum_{k=1}^{r_1}A_{1,k}A_{1,k}^*\|_p,\dots,\|\sum_{k=1}^{r_n}A_{n,k}A_{n,k}^*\|_p))/2

where m is some mean. Then the fitness function M empirically has a unique local maximum value.

Tensors to collections of matrices

Not only are we able to obtain tuples of matrices from tensors, but we may also obtain tensors from tuples of matrices, and go back and forth between tensors and tuples of matrices. Suppose that K is either the field of real numbers or the field of complex numbers. Let X_{1,1},\dots,X_{1,r_1};\dots;X_{n,1},\dots,X_{n,r_n}\in M_d(K). Then an LSRDR (\alpha_{i_1,\dots,i_n})_{i_1,\dots,i_n}\in K^{r_1}\otimes\dots\otimes K^{r_n} of (X_{1,i_1}\dots X_{n,i_n})_{i_1,\dots,i_n} is typically unique up to a scalar multiple, and if K is the field of complex numbers but each X_{i,j} is a real matrix, then (\alpha_{i_1,\dots,i_n})_{i_1,\dots,i_n} is typically a real tensor multiplied by a non-zero complex scalar.

Suppose that V_1,\dots,V_n are finite dimensional inner product spaces over the field K. Suppose that e_{i,1},\dots,e_{i,d_i} is a basis for V_i. Let (m_0,\dots,m_n) be a sequence of positive integers with m_0=m_n=1. Suppose that X_{i,j} is an m_{i-1}\times m_i matrix over the field K whenever 1\leq i\leq n and 1\leq j\leq d_i. Then define the tensor T((X_{i,j})_{i,j})=\sum_{i_1,\dots,i_n}e_{1,i_1}\otimes\dots\otimes e_{n,i_n}\cdot X_{1,i_1}\dots X_{n,i_n}. If \mathbf{v}\in V_1\otimes\dots\otimes V_n, the quantity \|T((X_{i,j})_{i,j})-\mathbf{v}\| is locally minimized, and \mathbf{u}=T((X_{i,j})_{i,j}), then we say that \mathbf{u} is a matrix table to tensor dimensionality reduction of \mathbf{v} of type (m_0,\dots,m_n). Our computer experiments suggest that the matrix table to tensor dimensionality reduction of \mathbf{v} of type (m_0,\dots,m_n) is generally unique. Furthermore, if K=\mathbb{C} and \mathbf{v} is an \mathbb{R}-linear combination of the basis vectors e_{1,i_1}\otimes\dots\otimes e_{n,i_n}, then in our computer experiments, the matrix table to tensor dimensionality reduction of type (m_0,\dots,m_n) is also an \mathbb{R}-linear combination of the basis vectors e_{1,i_1}\otimes\dots\otimes e_{n,i_n}.
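
A sketch of the evaluation map T (my own illustration; X[i][j] plays the role of X_{i+1,j+1}, and each matrix product below is 1\times 1 because m_0=m_n=1):

```python
import numpy as np
from itertools import product
from functools import reduce

def matrix_table_to_tensor(X):
    """X is a list of lists of matrices: X[i][j] is the m_i x m_{i+1} matrix
    X_{i+1,j+1} (0-indexed), with m_0 = m_n = 1.  Returns the tensor
    T((X_{i,j})_{i,j}) whose (i_1,...,i_n) coefficient is X_{1,i_1} ... X_{n,i_n}."""
    dims = [len(Xi) for Xi in X]
    T = np.zeros(dims, dtype=complex)
    for idx in product(*[range(d) for d in dims]):
        T[idx] = reduce(np.matmul, [X[i][idx[i]] for i in range(len(X))])[0, 0]
    return T
```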

Suppose that \mathbf{v} is a sparse tensor whose entries are integers and K=\mathbb{R}. Suppose furthermore that \mathbf{u}_1,\dots,\mathbf{u}_z are matrix table to tensor dimensionality reductions of \mathbf{v}. Then the values \|\mathbf{u}_i-\mathbf{v}\|^2,\|\mathbf{u}_i-\mathbf{u}_j\|^2,\|\mathbf{u}_i\|^2 are often positive integers. It is sometimes but not always the case that each entry in \mathbf{u}_i is an integer or at least a rational number. It is sometimes the case that

  1. there is a partition of \{1,\dots,z\} into sets A,B where \mathbf{u}_i=\mathbf{u}_j precisely when i=j or i,j\in B,
  2. \|\mathbf{u}_i\|^2 is always a positive integer,
  3. there is a positive integer m where \|\mathbf{u}_i-\mathbf{u}_j\|^2=m whenever i\in A,j\in B,
  4. if i\in B, then all of the entries in \mathbf{u}_i are integers,
  5. if i\in A, then most but not all of the entries in \mathbf{u}_i are integers,
  6. if i\in A, then the tensor \mathbf{u}_i has repeated non-integer entries, and
  7. one can find non-integer entries \alpha,\beta in \mathbf{u}_i where \alpha-\beta is a non-zero integer.

Suppose that \mathbf{v} is a random tensor in V_1\otimes\dots\otimes V_n. Then one can often find a matrix table to tensor dimensionality reduction \mathbf{u} of \mathbf{v} of type (m_0,\dots,m_n) such that \mathbf{v}-\mathbf{u} is its own matrix table to tensor dimensionality reduction of type (m_0,\dots,m_n).

LSRDRs to learn data

LSRDRs may be adapted to solve classification tasks. Suppose that K\in\{\mathbb{R},\mathbb{C},\mathbb{H}\}.

Suppose that r,q,d are natural numbers. Suppose that 1<p\leq\infty. Let n_1,\dots,n_q be natural numbers. Suppose that A_{i,j}\in M_{n_i}(K) for i\in\{1,\dots,q\},j\in\{1,\dots,r\}. We say that a collection X_1,\dots,X_r\in M_d(K) of matrices is an L_{2,d}-spectral radius similarity log-expected maximizer (LSRSLEM) of degree p and dimension d if X_1,\dots,X_r locally maximizes the fitness

\frac{1}{q}\sum_{i=1}^q\log(\rho(\Gamma(A_{i,1},\dots,A_{i,r};X_1,\dots,X_r)))-\log(\|X_1X_1^*+\dots+X_rX_r^*\|_p)/2.
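
A sketch of this fitness (my illustration; it assumes \|\cdot\|_p is the Schatten p-norm, which is how I read the definition, and uses the same \Gamma convention as the earlier superoperator sketch):

```python
import numpy as np

def Gamma(As, Bs):
    """Superoperator Y -> sum_j A_j Y B_j^* as a matrix (column-stacking)."""
    return sum(np.kron(np.conj(B), A) for A, B in zip(As, Bs))

def schatten_norm(M, p):
    """Schatten p-norm of M (operator norm when p = infinity)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

def lsrslem_fitness(A, Xs, p=2.0):
    """Fitness of (X_1,...,X_r) as defined above; A[i][j] is the matrix A_{i,j}."""
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    avg = sum(np.log(sr(Gamma(Ai, Xs))) for Ai in A) / len(A)
    penalty = np.log(schatten_norm(sum(X @ X.conj().T for X in Xs), p)) / 2
    return avg - penalty
```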

Application: LSRSLEMs may be used to solve machine learning classification problems. Suppose that Z is a universal set. Suppose that we are given a training data set D_0\subseteq Z, and we would like the machine learning algorithm to recognize objects that are similar to the elements of D_0. Then one would first construct a function h:Z\rightarrow\bigcup_{n=1}^\infty M_n(K)^r. One then constructs an LSRSLEM X_1,\dots,X_r of (h(x))_{x\in D_0}. Then a higher value of \log(\rho(\Gamma(h(z);X_1,\dots,X_r))) indicates that the value z is more similar to the training data set D_0.

Observation: If each A_{i,j} is nearly positive semidefinite, and (X_1,\dots,X_r),(Y_1,\dots,Y_r) are LSRSLEMs of the same degree and dimension of (A_{i,j})_{1\leq i\leq q,1\leq j\leq r}, then (X_1,\dots,X_r),(Y_1,\dots,Y_r) typically have the same fitness level.

Observation: Suppose that (X_1,\dots,X_r) is an LSRSLEM. Then one can typically find subspaces V_1,\dots,V_s of the vector space K^d with K^d=V_1\oplus\dots\oplus V_s such that each V_i is invariant with respect to each X_j and each X_j^*. In other words, there is a unitary matrix U along with m_1,\dots,m_s with m_1+\dots+m_s=d where each UX_jU^* is a block diagonal matrix whose diagonal blocks are matrices X_{1,j},\dots,X_{s,j} of sizes m_1,\dots,m_s. Furthermore, the matrices X_{i,1},\dots,X_{i,r} typically generate the algebra M_{m_i}(K).

Application: LSRSLEMs may be used to recognize objects in a data set, but LSRSLEMs may also be used to partition data sets into clusters and produce representatives of these clusters. Suppose that X_1,\dots,X_r is an LSRSLEM of (A_{i,j})_{i,j} and each X_j is a block diagonal matrix with diagonal blocks X_{1,j},\dots,X_{s,j}.

Our clustering function f:\{1,\dots,q\}\rightarrow\{1,\dots,s\} will simply be the function where f(i) is the value that maximizes the quantity \log(\rho(\Gamma(A_{i,1},\dots,A_{i,r};X_{f(i),1},\dots,X_{f(i),r}))). Our clustering function can be used to assign tuples (A_1,\dots,A_r) that are not contained in our original q data values to these s clusters. Our extended clustering function F will be the partial function where if (A_1,\dots,A_r)\in M_n(K)^r for some n, then F(A_1,\dots,A_r)=j whenever \log(\rho(\Gamma(A_1,\dots,A_r;X_{j,1},\dots,X_{j,r}))) is maximized.

An LSRDR of the cluster (X_{j,1},\dots,X_{j,r}) serves as a representative for the j-th cluster.

Iterated LSRSLEM clustering

One can use iterated LSRSLEMs to partition data into clusters. For simplicity, assume that r=d.

Suppose that v_{i,j} is a random positive real number for each (i,j)\in\{1,\dots,q\}\times\{1,\dots,r\}. Let 1<p<\infty. Then we shall construct v_{i,j,n} whenever n\geq 0,1\leq i\leq q,1\leq j\leq r using recursion on n, starting with v_{i,j,0}=v_{i,j}.

From (v_{i,j,n})_{i,j}, we construct an LSRSLEM (X_{1,n},\dots,X_{r,n}) of (v_{i,j,n})_{i,j} of degree p and dimension d. We then let (v_{i,1,n+1},\dots,v_{i,r,n+1}) be a dominant eigenvector with norm 1 of v_{i,1,n}X_{1,n}+\dots+v_{i,r,n}X_{r,n} whenever 1\leq i\leq q. Then there is an equivalence relation \simeq on \{1,\dots,q\} where for all \epsilon>0, there is some N where if n\geq N, \alpha=(v_{i,1,n},\dots,v_{i,r,n}), and \beta=(v_{j,1,n},\dots,v_{j,r,n}), then \|\alpha-\beta\|<\epsilon whenever i\simeq j and |\langle \alpha,\beta\rangle|<\epsilon otherwise. The equivalence relation \simeq does not typically depend on the choices made when training the LSRSLEMs (X_{1,n},\dots,X_{r,n}) for n\geq 0.

The stability of the iterations v_{i,j,n} also suggests that our method of iterating LSRSLEMs will not be able to replicate the capabilities of deep neural networks.

Multiple tensors

Compositional LSRDRs

Suppose that r_1,\dots,r_l, n_1,\dots,n_l,d_1,\dots,d_l are positive integers. Suppose that A_{i,j} is an n_i\times n_{i+1}-matrix over the field K (here the index i+1 is taken modulo l) whenever 1\leq i\leq l,1\leq j\leq r_i. Then we say that (X_{i,j})_{i,j}\in \prod_{i=1}^l M_{d_i,d_{i+1}}(K)^{r_i} is a compositional LSRDR (abbreviated CLSRDR) of (A_{i,j})_{i,j} with dimensions (d_1,\dots,d_l) if the following quantity, known as the fitness, is locally maximized:

\log(\rho\big(\Gamma(A_{1,1},\dots,A_{1,r_1};X_{1,1},\dots,X_{1,r_1})\dots\Gamma(A_{l,1},\dots,A_{l,r_l};X_{l,1},\dots,X_{l,r_l})\big))-\log(\rho\big(\Phi(X_{1,1},\dots,X_{1,r_1})\dots\Phi(X_{l,1},\dots,X_{l,r_l})\big))/2.

Observation: Two CLSRDRs of (A_{i,j})_{i,j} often have the same fitness levels.

Observation: Suppose that r_1,\dots,r_l, n_1,\dots,n_l,d_1,\dots,d_l,e_1,\dots,e_l are positive integers with \min(n_i,d_i)=\min(n_i,e_i) for all i. Then a CLSRDR with dimensions (d_1,\dots,d_l) will typically have the same fitness level as a CLSRDR with dimensions (e_1,\dots,e_l).

Observation: Suppose that (X_{i,j})_{i,j} is a CLSRDR of (A_{i,j})_{i,j}. Then one can often find two sequences (R_1,\dots,R_l),(S_1,\dots,S_l) of matrices where X_{i,j}=R_iA_{i,j}S_i for all i,j. Furthermore, for each i, there is a non-zero constant \lambda_i where R_{i+1}S_i=\lambda_i\cdot I. We say that the CLSRDR (X_{i,j})_{i,j} is constant factor normalized if R_{i+1}S_i=I for all i. In this case, the matrix S_iR_{i+1} will be a projection matrix that we may denote by P_i.

Let U_{R,1},\dots,U_{R,l},U_{L,1},\dots,U_{L,l} denote the matrices where U_{R,i} is the dominant eigenvector of \Delta_i=\Gamma(A_{i+1,1},\dots,A_{i+1,r_{i+1}};X_{i+1,1},\dots,X_{i+1,r_{i+1}})\dots\Gamma(A_{l,1},\dots,A_{l,r_l};X_{l,1},\dots,X_{l,r_l})\Gamma(A_{1,1},\dots,A_{1,r_1};X_{1,1},\dots,X_{1,r_1})\dots\Gamma(A_{i,1},\dots,A_{i,r_i};X_{i,1},\dots,X_{i,r_i}) and where U_{L,i} is the dominant eigenvector of the adjoint \Delta_i^*. Then the matrices R_{i+1}U_{R,i},U_{R,i}S_i^*,S_i^*U_{L,i+1},U_{L,i}R_i are generally positive semidefinite up to a constant factor. If we train the CLSRDR twice, then each time we will typically get the same values of U_{R,i}S_i^*/\text{Tr}(U_{R,i}S_i^*) and U_{L,i}R_i/\text{Tr}(U_{L,i}R_i).

Suppose furthermore that A_{i_1,j}=A_{i_2,j} for all i_1,i_2,j. Then S_{i_1}R_{i_1+1}=S_{i_2}R_{i_2+1} whenever the CLSRDR (X_{i,j})_{i,j} is constant factor normalized, and the positive semidefinite matrices U_{R,i}S_i^*/\text{Tr}(U_{R,i}S_i^*),U_{L,i}R_i/\text{Tr}(U_{L,i}R_i) do not depend on the value of i.

WHAT IF ALL BUT ONE ENTRY IS ZERO? BLOCK MATRICES. UNDER CONSTRUCTION.

Other fitness functions

Let X,Y be finite dimensional real or complex inner product spaces. Suppose that \mu is a Borel probability measure with compact support on X\times Y. Now define a fitness function f_\mu:L(X,Y)\rightarrow[-\infty,\infty) by letting f_\mu(A)=\int \log(|\langle Ax,y\rangle|)d\mu(x,y)-\log(\|A\|_2).

If \nu is a Borel probability measure with compact support on X, then define a fitness function f_\nu:L(X)\rightarrow[-\infty,\infty) by letting f_\nu(A)=\int \log(|\langle Ax,x\rangle|)d\nu(x)-\log(\|A\|_2).

If X,Y are real inner product spaces, then \{A\in L(X,Y)\mid f_\mu(A)>-\infty\},\{A\in L(X)\mid f_\nu(A)>-\infty\} will generally be disconnected spaces, so gradient ascent with a random initialization has no hope of finding an A\in L(X,Y) or A\in L(X) that maximizes f_\mu or f_\nu respectively. However, f_\nu(A)>-\infty whenever A is positive definite, and gradient ascent seems to always produce the same local optimum value f_\nu(A) for the function f_\nu when A is initialized to be a positive semidefinite matrix. If A,B are linear operators that maximize f_\nu, then we generally have A=\pm B.

If X is a complex inner product space, then gradient ascent seems to always produce the same locally optimal value f_\nu(A) for the fitness function f_\nu, and if A,B are matrices that maximize the fitness function f_\nu, then we generally have A=\lambda B for some \lambda\in\mathbb{C}^\times. For some \lambda\in\mathbb{C}^\times, the locally optimal operator \lambda A will generally have positive eigenvalues, but the operator \lambda A itself will not be positive semidefinite. If X=\mathbb{C}^n and \nu is supported on \mathbb{R}^n, then for each operator A that optimizes f_\nu, there is some \lambda\in\mathbb{C}^\times where \lambda A[\mathbb{R}^n]\subseteq\mathbb{R}^n, and \lambda A is up to sign the same matrix found by real gradient ascent with positive definite initialization.

The problem of optimizing f_\mu and f_\nu is useful for constructing MPO word embeddings and other objects for machine learning similar to LSRDR dimensionality reductions. While f_\nu tends to have a unique (up to a constant factor) local optimum, the fitness function f_\mu does not generally have a unique local optimum; if X,Y are complex inner product spaces, then f_\mu has few local optima in the sense that it is not unusual for two local optima A,B of f_\mu to satisfy A=\lambda B for some constant \lambda. In this case, the matrix A that globally optimizes f_\mu is most likely also the matrix that has the largest basin of attraction with respect to various types of gradient ascent.

MPO word embeddings: Suppose that N is a matrix embedding potential, A is a finite set, and a_1\dots a_n\in A^*. Suppose that f,g:A\rightarrow M_d(\mathbb{C}) are MPO word pre-embeddings with respect to the matrix embedding potential N. Then there are often constants \lambda_a\in S^1 for a\in A where f(a)=\lambda_ag(a) for a\in A. Furthermore, there is often a matrix R along with constants \mu_a\in S^1 for a\in A where \mu_a Rf(a)R^{-1} is a real matrix for each a\in A.

Rank-1 matrix valued graph embeddings

Applying pseudodeterminism to test randomness/pseudorandomness

The uniqueness of the local optima of the fitness functions (pseudodeterminism) in the machine learning algorithms in this post is useful for testing for randomness/pseudorandomness in cryptography. Here is a non-exhaustive list of ways that one can use pseudodeterministic machine learning algorithms to test data sets for randomness/pseudorandomness, to measure the quantity of randomness/pseudorandomness in a data set, and to obtain confidence levels for the non-randomness/pseudorandomness of an object.

Application: Let A be a finite set. Let 1<p\leq\infty and let d be a positive integer. Let (a_1,\dots,a_n)\in A^n. Define our fitness function F_p:M_d(\mathbb{R})^A\setminus\{\mathbf{0}\}\rightarrow[-\infty,\infty) by setting F_p((X_a)_{a\in A})=\log(\rho(X_{a_1}\dots X_{a_n}))/n-\log(\|\sum_{a\in A}X_aX_a^*\|_p). If F_p((X_a)_{a\in A}) is locally maximized, then a lower value of F_p((X_a)_{a\in A}) indicates that the sequence (a_1,\dots,a_n) is more random/pseudorandom.
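
A sketch of this test statistic (my illustration; Xs maps each symbol to a d\times d real matrix, and \|\cdot\|_p is again read as the Schatten p-norm):

```python
import numpy as np
from functools import reduce

def schatten_norm(M, p):
    """Schatten p-norm of M (operator norm when p = infinity)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

def F_p(Xs, seq, p=2.0):
    """Evaluate F_p((X_a)_{a in A}) for the sequence seq = (a_1,...,a_n),
    where Xs is a dict mapping each symbol a to the matrix X_a."""
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    word = reduce(np.matmul, [Xs[a] for a in seq])
    gram = sum(X @ X.T for X in Xs.values())
    return np.log(sr(word)) / len(seq) - np.log(schatten_norm(gram, p))
```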

Application: Suppose that A is a finite set, n is a natural number, and R\subseteq A^n is a subset. Let 1<p\leq\infty and let d be a positive integer. Define a function L:M_d(\mathbb{R})^{A\times n}\setminus\{\mathbf{0}\}\rightarrow[-\infty,\infty) by setting L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) =\big(\sum_{(r_1,\dots,r_n)\in R}\log(\rho(X_{r_1,1}+\dots+X_{r_n,n}))/(n\cdot |R|)\big)-\log(\|\sum_{a\in A}\sum_{k=1}^nX_{a,k}X_{a,k}^\ast\|_p)/2.

If L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) is locally maximized, then a lower value of L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) indicates that the set R is more random/pseudorandom.

Best quantum channel approximation to a relation.

Let U,V be complex Euclidean spaces. Let f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq U\times V be such that each u_j,v_j is a unit vector. Then we say that a quantum channel \mathcal{E}:L(U)\rightarrow L(V) is a best Choi rank r approximation to f if the quantity \sum_{k=1}^n-\log(\langle\mathcal{E}(u_ku_k^*)v_k,v_k\rangle) is locally minimized subject to the condition that \mathcal{E} is a quantum channel of Choi rank at most r.
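
To make the objective concrete, here is a sketch (my illustration) that evaluates the quantity being minimized for a channel given in Kraus form; a channel of Choi rank at most r is exactly one that admits a Kraus representation with at most r operators (trace preservation, \sum_j K_j^*K_j=1_U, is assumed but not enforced by the sketch).

```python
import numpy as np

def choi_rank_r_loss(Ks, pairs):
    """Evaluate sum_k -log(<E(u_k u_k^*) v_k, v_k>) for the channel
    E(X) = sum_j K_j X K_j^* with Kraus operators Ks (len(Ks) <= r), where
    pairs = [(u_1, v_1), ..., (u_n, v_n)] consists of unit vectors."""
    loss = 0.0
    for u, v in pairs:
        Eu = sum(K @ np.outer(u, u.conj()) @ K.conj().T for K in Ks)
        loss -= np.log(np.real(v.conj() @ Eu @ v))
    return loss
```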

Empirical observation: Best Choi rank r approximations to relations tend to be unique in the sense that the gradient descent process tends to produce the same best Choi rank r approximation when we run the gradient descent multiple times.

Suppose that \mathcal{E} is a best Choi rank r approximation to f=\{(u_1,v_1),\dots,(u_n,v_n)\}.

For each j, let u_j^\sharp be a unit dominant eigenvector of \mathcal{E}(u_ju_j^*). Let f^\sharp=\{(u^\sharp_1,v_1),\dots,(u^\sharp_n,v_n)\}. Then we typically have f^{\sharp\sharp\sharp}=f^{\sharp\sharp}. More generally, if N is a neighborhood of f^\sharp of small diameter and g\in N, then we typically have g^{\sharp\sharp}=g^\sharp, and if \mathcal{F} is a best Choi rank r approximation to g^\sharp, then \mathcal{F} is typically the identity mapping.

Afterword

Even though I have published this post, it is not yet complete since I would like to add information about functions that are related to, and satisfy similar properties to, the functions of the form L_{A_1,\dots,A_r;d}. I also want the community to actually say something and communicate back before I add more information to this post.
