Empirical observations of L_{2,d} spectral radii and other functions

In this post, we shall make a few experimental observations about the L_{2,d}-spectral radius and L_{2,d}-spectral radius dimensionality reduction that give evidence that these notions will eventually have a deep mathematical theory behind them and will also become more applicable to machine learning. I have not been able to prove these empirical observations, so one should consider this post to be about experimental mathematics. Hopefully these experimental results will eventually be backed up by formal theoretical mathematics. The reader is referred to this post for definitions.

This post is currently a work in progress, so I will be updating it with more information. Let M_d(\mathbb{C})^{r++}\subseteq M_d(\mathbb{C})^r be the collection of all tuples (A_1,\dots,A_r) where A_1\otimes\overline{A_1}+\dots+A_r\otimes\overline{A_r} is not nilpotent. Given matrices A_1,\dots,A_r\in M_n(\mathbb{C}), define a function L_{A_1,\dots,A_r;d}:M_d(\mathbb{C})^{r++}\rightarrow[0,\infty) by letting L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=\frac{\rho(A_1\otimes\overline{X_1}+\dots+A_r\otimes\overline{X_r})}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}}.
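
As a sanity check on this definition, here is a minimal numpy sketch (an illustration of mine, not the code behind the experiments below) that evaluates L_{A_1,\dots,A_r;d} directly by forming the Kronecker products; the gradient ascent used later in this post is more efficient than this brute-force eigenvalue computation.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def L(As, Xs):
    """Evaluate L_{A_1,...,A_r;d}(X_1,...,X_r) directly from the definition.
    As: list of n x n complex matrices; Xs: list of d x d complex matrices."""
    num = spectral_radius(sum(np.kron(A, np.conj(X)) for A, X in zip(As, Xs)))
    den = spectral_radius(sum(np.kron(X, np.conj(X)) for X in Xs)) ** 0.5
    return num / den
```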

Given matrices A_1,\dots,A_r\in M_n(\mathbb{C}), we say that (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an L_{2,d}-spectral radius dimensionality reduction (abbreviated L_{2,d}-SRDR) of (A_1,\dots,A_r) if (X_1,\dots,X_r) locally maximizes L_{A_1,\dots,A_r;d}. We say that an L_{2,d}-SRDR (X_1,\dots,X_r) is optimal if (X_1,\dots,X_r) globally maximizes L_{A_1,\dots,A_r;d}.

We say that (X_1,\dots,X_r) is projectively similar to (Y_1,\dots,Y_r) if there is some \lambda\in\mathbb{C}^\times and some invertible C\in M_d(\mathbb{C}) where Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r. Let \simeq denote the equivalence relation where we set (X_1,\dots,X_r)\simeq(Y_1,\dots,Y_r) precisely when (X_1,\dots,X_r) and (Y_1,\dots,Y_r) are projectively similar. Let [X_1,\dots,X_r] denote the equivalence class of (X_1,\dots,X_r) with respect to \simeq.

Observation: (apparent unique local optimum; global attractor) Suppose that (X_1,\dots,X_r),(Y_1,\dots,Y_r) are L_{2,d}-SRDRs of (A_1,\dots,A_r) obtained by gradient ascent on the domain M_d(\mathbb{C})^{r++}. Then (X_1,\dots,X_r),(Y_1,\dots,Y_r) are usually projectively similar. Suppose now that \mathcal{X} is the random variable whose value is the equivalence class [X_1,\dots,X_r] of an L_{2,d}-SRDR (X_1,\dots,X_r) trained by some variant of gradient ascent. Then our computations indicate that the random variable \mathcal{X} typically has low entropy even in the cases where \mathcal{X} is a non-constant random variable.

Example: The above observation does not always hold, as L_{A_1,\dots,A_r;d} may have multiple non-projectively similar local maxima. For example, suppose that A_i= \begin{bmatrix} B_i & 0 \\ 0 & C_i\end{bmatrix} for 1\leq i\leq r where each B_i and each C_i is a d\times d-matrix. Then L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=\max(L_{B_1,\dots,B_r;d}(X_1,\dots,X_r),L_{C_1,\dots,C_r;d}(X_1,\dots,X_r)). Suppose now that L_{B_1,\dots,B_r;d}(B_1,\dots,B_r)>L_{C_1,\dots,C_r;d}(B_1,\dots,B_r) and L_{C_1,\dots,C_r;d}(C_1,\dots,C_r)>L_{B_1,\dots,B_r;d}(C_1,\dots,C_r). Then (B_1,\dots,B_r),(C_1,\dots,C_r) are usually non-projectively similar local maxima for the function L_{A_1,\dots,A_r;d}. Of course, (A_1,\dots,A_r) does not generate the full matrix algebra M_{2d}(\mathbb{C}), but there is some neighborhood U of (A_1,\dots,A_r) where if (D_1,\dots,D_r)\in U, then L_{D_1,\dots,D_r;d} will have at least two non-projectively similar local maxima. This sort of scenario is not as contrived as one may think, since (D_1,\dots,D_r) could represent two clusters with few or weak connections between them (for example, in natural language processing, each language could form its own cluster).

Observation: (height and breadth correlation for local maxima) Suppose that \mathcal{X} is the probability distribution of the (real, complex, or quaternionic) equivalence classes [X_1,\dots,X_r] of L_{2,d}-SRDRs obtained from (A_1,\dots,A_r) using gradient ascent. Then the random variables P(\mathcal{X}) (mapping an outcome to the probability of observing that outcome) and L_{A_1,\dots,A_r;d}(\mathcal{X}) tend to have a positive linear correlation. The probability P([X_1,\dots,X_r]) is typically maximized when L_{A_1,\dots,A_r;d}(X_1,\dots,X_r) is globally maximized rather than just locally maximized.

Observation: (preservation of symmetry) Suppose that A_1,\dots,A_r are Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite) and (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r). Then (X_1,\dots,X_r) is usually projectively similar to some tuple (Y_1,\dots,Y_r) where each Y_j is Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite).

Observation: (near functoriality) Suppose that e<d<n. Suppose that A_1,\dots,A_r\in M_n(\mathbb{C}) generate M_n(\mathbb{C}). Let X_1,\dots,X_r be an L_{2,e}-SRDR of A_1,\dots,A_r. Let B_1,\dots,B_r be an L_{2,d}-SRDR of A_1,\dots,A_r. Let C_1,\dots,C_r be an L_{2,e}-SRDR of B_1,\dots,B_r. Then (C_1,\dots,C_r) is usually not projectively similar to (X_1,\dots,X_r), but [C_1,\dots,C_r] is usually close to [X_1,\dots,X_r].

Observation: (Suboptimal real LSRDRs) Suppose that (A_1,\dots,A_r)\in M_n(\mathbb{R})^r. Let L_{A_1,\dots,A_r;d}^R:M_d(\mathbb{R})^{r++}\rightarrow[0,\infty) be the restriction of L_{A_1,\dots,A_r;d} to the domain M_d(\mathbb{R})^{r++}. A local maximum (X_1,\dots,X_r) for L_{A_1,\dots,A_r;d}^R shall be called a real LSRDR. Gradient ascent for the function L_{A_1,\dots,A_r;d}^R often produces more local maxima which are not global maxima than the complex version L_{A_1,\dots,A_r;d} does. These suboptimal local optima are tuples (X_1,\dots,X_r) where if A_1\otimes X_1+\dots+A_r\otimes X_r has eigenvalues \lambda_1,\dots,\lambda_{nd} with |\lambda_1|\leq\dots\leq|\lambda_{nd}|, then \lambda_{nd-1},\lambda_{nd} are complex conjugates. In this case, \rho(A_1\otimes X_1+\dots+A_r\otimes X_r)=|\lambda_{nd}|=|\lambda_{nd-1}|, but L_{A_1,\dots,A_r;d}^R(X_1,\dots,X_r) is not as large as it could be since the gradient ascent process must make the absolute values of two eigenvalues |\lambda_{nd-1}|,|\lambda_{nd}| large instead of simply making the absolute value of a single eigenvalue large. Fortunately, the most efficient way of computing real LSRDRs also naturally avoids the case when |\lambda_{nd}|=|\lambda_{nd-1}|. To compute LSRDRs, one needs to compute the gradient of the expression \rho(A_1\otimes X_1+\dots+A_r\otimes X_r), and the most efficient way to compute the gradient of the spectral radius of a matrix X uses the dominant left and right eigenvectors of X. The dominant left and right eigenvectors of X can be computed using a power iteration, but the power iteration technique (unless generalized) only effectively computes the dominant left and right eigenvectors of X if X has a single eigenvalue \lambda where |\lambda|>|\mu| whenever \mu is an eigenvalue different from \lambda. Therefore, a variation of gradient ascent that uses the dominant left and right eigenvectors to approximate the gradient of L_{A_1,\dots,A_r;d}^R will be inaccurate whenever A_1\otimes X_1+\dots+A_r\otimes X_r has multiple dominant eigenvalues. In practice, the gradient ascent process will still converge, but it will converge to a tuple (X_1,\dots,X_r) where A_1\otimes X_1+\dots+A_r\otimes X_r has a single dominant eigenvalue. In this case, the gradient ascent process avoids suboptimal real LSRDRs. With all that said, there are still examples (such as with MPO word embeddings) of real matrices (A_1,\dots,A_r) where training a complex LSRDR achieves better results than training a real LSRDR and where if (X_1,\dots,X_r) is a complex LSRDR, then there are C,\gamma where \gamma CX_jC^{-1} is a real matrix for 1\leq j\leq r.
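
To make the role of the dominant eigenvectors concrete, here is a minimal sketch (my own illustration, not the code used for these experiments) of the power-iteration step and of the resulting gradient of the spectral radius; it assumes the dominant eigenvalue is simple, which is exactly the situation that the gradient ascent described above tends to stay in.

```python
import numpy as np

def dominant_eigentriple(M, iters=1000):
    """Power iteration for the dominant eigenvalue and the dominant right and
    left eigenvectors of M.  This is only reliable when M has a single
    eigenvalue of strictly largest absolute value (the case discussed above)."""
    n = M.shape[0]
    u = np.random.randn(n) + 1j * np.random.randn(n)   # right eigenvector estimate
    w = np.random.randn(n) + 1j * np.random.randn(n)   # left eigenvector estimate
    for _ in range(iters):
        u = M @ u
        u /= np.linalg.norm(u)
        w = M.conj().T @ w
        w /= np.linalg.norm(w)
    lam = (w.conj() @ (M @ u)) / (w.conj() @ u)
    return lam, u, w

def spectral_radius_gradient(M):
    """Gradient of rho(M) with respect to real perturbations of the entries of M,
    using d(lambda)/dM = conj(w) u^T / (w^* u) for a simple dominant eigenvalue."""
    lam, u, w = dominant_eigentriple(M)
    dlam = np.outer(w.conj(), u) / (w.conj() @ u)
    return np.real(np.conj(lam / abs(lam)) * dlam)
```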

If A_1,\dots,A_r\in M_n(\mathbb{C}), then define a function T_{A_1,\dots,A_r;d}:\{(R,S)\in M_{d,n}(\mathbb{C})\times M_{n,d}(\mathbb{C})\mid(RA_1S,\dots,RA_rS)\in M_d(\mathbb{C})^{r++}\}\rightarrow[0,\infty) by setting T_{A_1,\dots,A_r;d}(R,S)=L_{A_1,\dots,A_r;d}(RA_1S,\dots,RA_rS).

We say that (R,S) is an L_{2,d}-SRDR projector if (R,S) is a local maximum for T_{A_1,\dots,A_r;d}(R,S). We say that (R,S) is a strong L_{2,d}-SRDR projector if (RA_1S,\dots,RA_rS) is an L_{2,d}-SRDR.

Observation: If (R,S) is an L_{2,d}-SRDR projector, then there is usually a constant \lambda\in\mathbb{C}^\times with RS=\lambda 1_d and a projection \pi:\mathbb{C}^n\rightarrow\mathbb{C}^n of rank d with SR=\lambda \pi. L_{2,d}-SRDR projectors are usually strong L_{2,d}-SRDR projectors. If R_1,R_2\in M_{d,n}(\mathbb{C}),S_1,S_2\in M_{n,d}(\mathbb{C}), then set [[R_1,S_1]]=[[R_2,S_2]] precisely when there are \mu,\nu\in\mathbb{C}^\times and an invertible Q\in M_d(\mathbb{C}) where R_2=\mu QR_1,S_2=\nu S_1Q^{-1}. If (R_1,S_1) and (R_2,S_2) are L_{2,d}-SRDR projectors, then we typically have [[R_1,S_1]]=[[R_2,S_2]]. In fact, R_1S_2R_2S_1=\theta 1_d for some \theta\in\mathbb{C}^\times, and if we set Q=R_2S_1, then R_2=\mu QR_1,S_2=\nu S_1Q^{-1} for some \mu,\nu\in\mathbb{C}^\times.

We say that an L_{2,d}-SRDR projector (R,S) is constant factor normalized if RS=1_d. If (R_1,S_1),(R_2,S_2) are two constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then typically R_1S_2R_2S_1=1_d, and if we set Q=R_2S_1, then R_2=QR_1,S_2=S_1Q^{-1}. Furthermore, we typically have S_1R_1=S_2R_2=(S_1R_1)^2=(S_2R_2)^2. We shall call the projection matrix S_1R_1 an L_{2,d}-LSRDR projection operator of (A_1,\dots,A_r).

Observation: Let (R,S) be a constant factor normalized L_{2,d}-SRDR projector of (A_1,\dots,A_r), and suppose that X_j=RA_jS for 1\leq j\leq r. Let P=SR and suppose that P is a projection matrix. Let U_R be a dominant eigenvector of \Gamma(A_1,\dots,A_r;X_1,\dots,X_r) and let U_L be a dominant eigenvector of \Gamma(A_1,\dots,A_r;X_1,\dots,X_r)^*=\Gamma(A_1^*,\dots,A_r^*;X_1^*,\dots,X_r^*). Then in our experiments, the matrices RU_R,U_RS^*,S^*U_L,U_LR can all be written as positive semidefinite matrices multiplied by a complex constant.

The eigenvalues of \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP) are all simple eigenvalues.

Let G_0,H_0 be random complex n\times n positive semidefinite matrices. Let H_{k+1}=\frac{\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H_k)}{\|\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H_k)\|} and let G_{k+1}=\frac{\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G_k)}{\|\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G_k)\|} for k\geq 0. Then the sequence (H_k)_k converges to a positive semidefinite matrix H and the sequence (G_k)_k converges to a positive semidefinite matrix G. The positive semidefinite matrices G,H do not depend on the initialization G_0,H_0. There exist complex constants \mu_H,\mu_G where U_RS^*=\mu_H\cdot H and U_LR=\mu_G\cdot G.

Here, G=GP=P^*G=P^*GP,H=PH=HP^*=PHP^*, and \text{Im}(G)=\text{Im}(P^*)=\ker(P)^\perp and \text{Im}(H)=\text{Im}(P)=\ker(P^*)^\perp.

Let G_0^\sharp,H_0^\sharp be random complex n\times n positive semidefinite matrices. Let H^\sharp_{k+1}=\frac{\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)(H^\sharp_k)}{\|\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)(H^\sharp_k)\|} and let G_{k+1}^\sharp=\frac{\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)^*(G^\sharp_k)}{\|\Gamma(PA_1P,\dots,PA_rP;PA_1P,\dots,PA_rP)^*(G^\sharp_k)\|} for k\geq 0. Then G_k^\sharp\rightarrow G,H_k^\sharp\rightarrow H.

The matrix H is the dominant eigenvector of the superoperator \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP). The matrix H is the dominant eigenvector of the superoperator \Phi(PA_1P,\dots,PA_rP). The matrix G is the dominant eigenvector of the superoperator \Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*. The matrix G is the dominant eigenvector of the superoperator \Phi(PA_1P,\dots,PA_rP)^*. Furthermore, we have \rho(\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP))=\rho(\Phi(PA_1P,\dots,PA_rP)). These eigenvectors correspond to real positive eigenvalues: \rho(\Phi(PA_1P,\dots,PA_rP))\cdot H=\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)(H)=\Phi(PA_1P,\dots,PA_rP)(H) and \rho(\Phi(PA_1P,\dots,PA_rP))\cdot G=\Gamma(A_1,\dots,A_r;PA_1P,\dots,PA_rP)^*(G)=\Phi(PA_1P,\dots,PA_rP)^*(G).
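
For readers who want to reproduce the H and G iterations, here is a sketch. It assumes (as I read the definitions post) that \Gamma(A_1,\dots,A_r;B_1,\dots,B_r) is the superoperator Y\mapsto A_1YB_1^*+\dots+A_rYB_r^* and that \Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r); if the definitions post uses a different convention, adjust accordingly.

```python
import numpy as np

def Gamma(As, Bs):
    """Matrix (column-stacking convention) of the superoperator
    Y -> A_1 Y B_1^* + ... + A_r Y B_r^*, assumed to be Gamma(A_1,...,A_r;B_1,...,B_r)."""
    return sum(np.kron(np.conj(B), A) for A, B in zip(As, Bs))

def Phi(As):
    """Matrix of the completely positive map Y -> sum_j A_j Y A_j^*."""
    return Gamma(As, As)

def psd_power_iteration(As, Bs, iters=3000, seed=0):
    """The H_k recursion above: start from a random PSD matrix and repeatedly
    apply Gamma(A_1,...,A_r;B_1,...,B_r), normalizing at each step."""
    rng = np.random.default_rng(seed)
    n = As[0].shape[0]
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    H = Z @ Z.conj().T                                  # random PSD initialization
    G = Gamma(As, Bs)
    for _ in range(iters):
        H = (G @ H.reshape(-1, order='F')).reshape(n, n, order='F')
        H /= np.linalg.norm(H)
    return H
```

For example, psd_power_iteration(As, [P @ A @ P for A in As]) approximates H, while the adjoint iteration for G uses Gamma(As, Bs).conj().T in place of G.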

Application: LSRDRs may be used to compress data that can be represented as a collection of matrices. Suppose that (A_1,\dots,A_r) is a collection of matrices. Let (R,S) be a constant factor normalized L_{2,d}-SRDR projector of (A_1,\dots,A_r). Let X_j=RA_jS for 1\leq j\leq r. Then the matrices (A_1,\dots,A_r) can be approximately recovered from (R,S) and (X_1,\dots,X_r). The matrix SX_jR=SRA_jSR is an approximation for the matrix A_j which we shall call an LSRDR principal component (LSRDRPC) for A_j. We observe that if (R_1,S_1),(R_2,S_2) are two constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then since typically S_1R_1=S_2R_2, we have S_1R_1A_jS_1R_1=S_2R_2A_jS_2R_2. In other words, the LSRDRPC for A_j is typically unique, or at the very least, there will be few LSRDRPCs for A_j.
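
A sketch of the reconstruction step (assuming R, S, and the X_j have already been produced by some LSRDR training routine, which is not shown here):

```python
def lsrdr_principal_components(As, R, S):
    """Return the LSRDRPCs S R A_j S R described above.  Assumes (R, S) is a
    constant factor normalized L_{2,d}-SRDR projector, so that P = S R is the
    associated rank-d projection and R A_j S are the compressed matrices."""
    P = S @ R
    return [P @ A @ P for A in As]
```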

Observation: If (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r), then there is usually a strong L_{2,d}-SRDR projector (R,S) where X_j=RA_jS for 1\leq j\leq r.

Observation: If (X_1,\dots,X_r) are self-adjoint, then whenever (R,S) is a constant factor normalized L_{2,d}-SRDR projector of (X_1,\dots,X_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (X_1,\dots,X_r) with S_1=R_1^* and R_1R_1^*=S_1^*S_1=1_d. Furthermore, if (R_2,S_2) is another L_{2,d}-SRDR projector of (X_1,\dots,X_r) with S_2=R_2^*,R_2R_2^*=S_2^*S_2=1_d, then Q=R_1S_2 is a unitary matrix, S_2R_2=S_1R_1, and \text{Im}(S_1)=\text{Im}(S_2)=\ker(R_1)^\perp=\ker(R_2)^\perp. If \iota:\text{Im}(S_1)\rightarrow\mathbb{C}^n is the inclusion mapping and j:\mathbb{C}^n\rightarrow\text{Im}(S_1) is the orthogonal projection, then [[j,\iota]] is an L_{2,d}-SRDR projector.

Observation: If (A_1,\dots,A_r) are real square matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) where R_1,S_1 are real matrices. Furthermore, if (R_1,S_1),(R_2,S_2) are real constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r), then there is a real invertible matrix Q where R_2=QR_1,S_2=S_1Q^{-1}.

Observation: If (A_1,\dots,A_r) are real symmetric matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) where R_1,S_1 are real matrices with R_1=S_1^*. In fact, if (R_1,S_1),(R_2,S_2) are real constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r) with R_1=S_1^*,R_2=S_2^*, then there is a real orthogonal matrix Q with R_2=QR_1,S_2=S_1Q^{-1}.

Observation: If (A_1,\dots,A_r) are complex symmetric matrices, then whenever (R,S) is an L_{2,d}-SRDR projector of (A_1,\dots,A_r), we usually have [[R,S]]=[[R_1,S_1]] for some L_{2,d}-SRDR projector (R_1,S_1) of (A_1,\dots,A_r) with R_1=S_1^T. Furthermore, if (R_1,S_1),(R_2,S_2) are constant factor normalized L_{2,d}-SRDR projectors of (A_1,\dots,A_r) with R_1=S_1^T,R_2=S_2^T, and Q=R_1S_2, then QQ^T=1_d.

Application: Dimensionality reduction of various manifolds. Let \mathbb{CQ}^{n-1} denote the quotient complex manifold (\mathbb{C}^n\setminus\{0\})/\simeq where we set \mathbf{x}\simeq\mathbf{y} precisely when \mathbf{x}=\mathbf{y} or \mathbf{x}=-\mathbf{y}. One can use our observations about LSRDRs to perform dimensionality reduction of data in \mathbb{RP}^{n-1}\times(0,\infty),\mathbb{CP}^{n-1}\times(0,\infty),\mathbb{CQ}^{n-1}, and we can also use dimensionality reduction to map data in \mathbb{C}^n and \mathbb{R}^n to data in \mathbb{C}^d and \mathbb{R}^d respectively.

Let \mathcal{A}_n be the collection of all n\times n positive semidefinite rank-1 real matrices. Let \mathcal{B}_n be the collection of all n\times n positive semidefinite rank-1 complex matrices. Let \mathcal{C}_n be the collection of all n\times n symmetric rank-1 complex matrices. The sets \mathcal{A}_n,\mathcal{B}_n,\mathcal{C}_n can be put into canonical one-to-one correspondences with \mathbb{RP}^{n-1}\times(0,\infty),\mathbb{CP}^{n-1}\times(0,\infty),\mathbb{CQ}^{n-1} respectively. Suppose that (A_1,\dots,A_r)\in M_n(\mathbb{C})^r and (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an L_{2,d}-SRDR of (A_1,\dots,A_r). If (A_1,\dots,A_r)\in\mathcal{A}_n^r, then [X_1,\dots,X_r]\cap\mathcal{A}_d^r\neq\emptyset, and every element of [X_1,\dots,X_r]\cap\mathcal{A}_d^r can be considered as a dimensionality reduction of (A_1,\dots,A_r). Similarly, if (A_1,\dots,A_r)\in\mathcal{B}_n^r, then [X_1,\dots,X_r]\cap\mathcal{B}_d^r\neq\emptyset, and if (A_1,\dots,A_r)\in\mathcal{C}_n^r, then [X_1,\dots,X_r]\cap\mathcal{C}_d^r\neq\emptyset.

L_{2,d}-SRDRs can be used to produce dimensionality reductions that map subsets W\subseteq K^n to K^d where K\in\{\mathbb{R},\mathbb{C},\mathbb{H}\}.

Suppose that a_1,\dots,a_r\in\mathbb{C}^n. Then let A_k=a_ka_k^* for 1\leq k\leq r. Then there is usually some d-dimensional subspace V\subseteq\mathbb{C}^n where if \iota:V\rightarrow\mathbb{C}^n is the inclusion mapping and j:\mathbb{C}^n\rightarrow V is the orthogonal projection, then (j,\iota) is a constant factor normalized L_{2,d}-SRDR projector for (A_1,\dots,A_r). In this case, (j(a_1),\dots,j(a_r)) is a dimensionality reduction of the tuple (a_1,\dots,a_r).

Suppose that a_1,\dots,a_r\in\mathbb{C}^n. Then let A_k=a_ka_k^T for 1\leq k\leq r. Suppose that (R,S) is a constant factor normalized L_{2,d}-SRDR projector with respect to (A_1,\dots,A_r). Then (R(a_1),\dots,R(a_r)) is a dimensionality reduction of (a_1,\dots,a_r).
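
As a sketch of the vector pipeline in the last two paragraphs (my own illustration; it assumes the rank-d projection P = SR has already been obtained from an LSRDR projector, for example by gradient ascent on T_{A_1,\dots,A_r;d}):

```python
import numpy as np

def reduce_vectors(vectors, P, d):
    """Send each a_k in C^n to its d coordinates in an orthonormal basis of Im(P),
    where P = S R is the rank-d projection coming from an LSRDR projector.  In the
    Hermitian rank-1 case this realizes the map j above; in the symmetric rank-1
    case one would instead apply R directly to each a_k."""
    U, s, _ = np.linalg.svd(P)
    basis = U[:, :d]                        # columns: orthonormal basis of Im(P)
    return [basis.conj().T @ a for a in vectors]
```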

One may need to normalize the data a_1,\dots,a_r\in K^n in the above paragraphs before taking a dimensionality reduction. Suppose that b_1,\dots,b_r\in K^n. Let \mu=(b_1+\dots+b_r)/r be the mean of (b_1,\dots,b_r). Let a_k=b_k-\mu for 1\leq k\leq r. Suppose that (T(a_1),\dots,T(a_r)) is a dimensionality reduction of (a_1,\dots,a_r). Then (T(a_1)+\mu,\dots,T(a_r)+\mu) is a normalized dimensionality reduction of (b_1,\dots,b_r).

Application: (clustering for neural networks with bilinear layers) Suppose that V,W are finite dimensional real inner product spaces. We say that a function f:V\rightarrow W is a quadratic form if g\circ f:V\rightarrow\mathbb{R} is a quadratic form for each g\in W^*. We say that a linear operator A:V\rightarrow V\otimes W is symmetric if (1_V\otimes w)A is a symmetric linear operator for each w\in W^*. The quadratic forms f:V\rightarrow W can be put into a one-to-one correspondence with the symmetric linear operators A:V\rightarrow V\otimes W by the relation f(v)=(v^T\otimes 1_W)Av for v\in V.

Suppose that f:V\rightarrow W is a quadratic form and A:V\rightarrow V\otimes W is a linear mapping such that f(v)=(v^T\otimes 1_W)Av for v\in V. Suppose furthermore that X:U\rightarrow U\otimes W is an LSRDR of A. Let R,S be the linear operators where X=(R\otimes 1_W)AS and suppose that RS=1_U. Then let P=SR. Then P is an orthogonal projection operator, and the space \text{Im}(P) is a cluster in V. Furthermore, let U_R be the dominant eigenvector of \Gamma(A;X) and let U_L be the dominant eigenvector of \Gamma(A;X)^*. Then we may assume that the operators U_R\cdot S^T,U_L\cdot R are positive semidefinite with trace 1. Then whenever v\in V,\|v\|=1, the values \langle U_R\cdot S^T(v),v\rangle and \langle U_L\cdot R(v),v\rangle are measures of how well the vector v belongs to the cluster \text{Im}(P).

Observation: (Random real matrices) Suppose that d\cdot r/n\approx 0 and U_1,\dots,U_r\in M_n(\mathbb{R}) are random real matrices; for simplicity assume that the entries in U_1,\dots,U_r are independent and normally distributed with mean 0 and variance 1. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}\sqrt{n}}\approx 1. In particular, \rho_{2,d}(U_1,\dots,U_r)/\sqrt{n}\approx 1.

Observation: (Random complex matrices) Suppose now that d\cdot r/n\approx 0 and U_1,\dots,U_r\in M_n(\mathbb{C}) are random complex matrices; for simplicity assume that the entries in U_1,\dots,U_r are independent Gaussian random variables with mean 0, where the real part and the imaginary part of each entry each have variance 1/2. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}\sqrt{n}}\approx 1 and \rho_{2,d}(U_1,\dots,U_r)/\sqrt{n}\approx 1.
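
These observations are easy to check numerically; here is a Monte Carlo sketch for the real Gaussian case (an illustration of mine, with n, d, r chosen so that d\cdot r/n is small):

```python
import numpy as np

def gaussian_ratio(n=200, d=2, r=3, seed=0):
    """Estimate rho(U_1 (x) X_1 + ... + U_r (x) X_r) /
    (rho(X_1 (x) conj(X_1) + ... + X_r (x) conj(X_r))^{1/2} * sqrt(n))
    for i.i.d. standard Gaussian real U_j and a fixed random complex tuple X_j.
    The observation above says this should be close to 1 when d*r/n is small."""
    rng = np.random.default_rng(seed)
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    Us = [rng.standard_normal((n, n)) for _ in range(r)]
    Xs = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(r)]
    num = sr(sum(np.kron(U, X) for U, X in zip(Us, Xs)))
    den = sr(sum(np.kron(X, np.conj(X)) for X in Xs)) ** 0.5
    return num / (den * np.sqrt(n))
```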

Observation: (Random permutation matrices) Suppose that d\cdot r/n\approx 0 and \phi:S_n\rightarrow GL(V) is the standard (n-1)-dimensional irreducible representation of S_n. Let g_1,\dots,g_r\in S_n be selected uniformly at random. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(\phi(g_1)\otimes X_1+\dots+\phi(g_r)\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes\overline{X_r})^{1/2}}\approx 1 and \rho_{2,d}(\phi(g_1),\dots,\phi(g_r))\approx 1.

Observation: (Random orthogonal or unitary matrices) Suppose that d\cdot r/n\approx 0, and let (U_1,\dots,U_r) be a collection of n\times n unitary matrices selected at random according to the Haar measure, or let (U_1,\dots,U_r) be a collection of n\times n orthogonal matrices selected at random according to the Haar measure. Then whenever X_1,\dots,X_r\in M_d(\mathbb{C}), we have \frac{\rho(U_1\otimes X_1+\dots+U_r\otimes X_r)}{\rho(X_1\otimes\overline{X_1}+\dots+X_r\otimes \overline{X_r})^{1/2}}\approx 1 and \rho_{2,d}(U_1,\dots,U_r)\approx 1.

Suppose that X is a complex matrix with eigenvalues \lambda_1,\dots,\lambda_n with |\lambda_1|\geq\dots\geq|\lambda_n|. Then define the spectral gap ratio of X as |\lambda_1|/|\lambda_2|.
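
In code, this is a one-liner (a helper of mine, reused in the checks below):

```python
import numpy as np

def spectral_gap_ratio(M):
    """|lambda_1| / |lambda_2|, where |lambda_1| >= |lambda_2| >= ... are the
    absolute values of the eigenvalues of M."""
    mags = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
    return mags[0] / mags[1]
```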

Observation: Let 1\leq d\leq n. Let A_1,\dots,A_r\in M_n(\mathbb{C}) be random matrices according to some distribution. Suppose that R\in M_{d,n}(\mathbb{C}),S\in M_{n,d}(\mathbb{C}) are matrices where RS is positive semidefinite (for example, (R,S) could be a constant factor normalized LSRDR projector of (A_1,\dots,A_r)). Let X=A_1\otimes\overline{RA_1S}+\dots+A_r\otimes\overline{RA_rS}. Suppose that the eigenvalues of X are \lambda_1,\dots,\lambda_{nd} with |\lambda_1|\geq\dots\geq|\lambda_{nd}|. Then the spectral gap ratio of X is much larger than the spectral gap ratio of random matrices that follow the circular law. However, the eigenvalues (\lambda_2,\dots,\lambda_{nd}) with the dominant eigenvalue \lambda_1 removed still follow the circular law (or the semicircular or elliptical law depending on the choice of distribution of A_1,\dots,A_r). The value \lambda_1/|\lambda_1| will be approximately 1. The large spectral gap ratio for X makes the version of gradient ascent that we use to compute LSRDRs work well; the power iteration algorithm for computing dominant eigenvectors converges more quickly when there is a large spectral gap ratio, and the dominant eigenvectors of a matrix are needed to calculate the gradient of the spectral radius of that matrix.

Observation: Suppose that d/n is near zero. Suppose that U_1,\dots,U_r\in M_n(\mathbb{C}) are independent random complex matrices. Suppose furthermore that the entries in each U_i are independent, Gaussian, and have variance 1/(2\cdot n). Then \rho_{2,d}(U_1,\dots,U_r)\approx 1+\sqrt{rd/n}.

Observation: (increased regularity in high dimensions) Suppose that A_1,\dots,A_r are n\times n-complex matrices. Suppose that 1\leq d<e\leq n. Suppose that (W_1,\dots,W_r),(X_1,\dots,X_r) are L_{2,d}-SRDRs of (A_1,\dots,A_r) trained by gradient ascent while (Y_1,\dots,Y_r),(Z_1,\dots,Z_r) are L_{2,e}-SRDRs of (A_1,\dots,A_r) trained by gradient ascent. Then it is more likely for (Y_1,\dots,Y_r) to be projectively similar to (Z_1,\dots,Z_r) than it is for (W_1,\dots,W_r) to be projectively similar to (X_1,\dots,X_r). In general, (Y_1,\dots,Y_r) will be better behaved than (W_1,\dots,W_r).

Application: (Iterated LSRDRs) Suppose that 1\leq n_1<\dots<n_s=n. Let A_1,\dots,A_r be n\times n-complex matrices. Suppose that X_{s,j}=A_j for 1\leq j\leq r. Suppose furthermore that (X_{t,1},\dots,X_{t,r}) is an L_{2,n_t}-SRDR of (X_{t+1,1},\dots,X_{t+1,r}) whenever 1\leq t<s, so that each X_{t,j} is an n_t\times n_t-matrix. Then we shall call (X_{1,1},\dots,X_{1,r}) an iterated LSRDR of (A_1,\dots,A_r). One advantage of iterated LSRDRs over LSRDRs is that an iterated LSRDR (X_{1,1},\dots,X_{1,r}) of (A_1,\dots,A_r) where each X_{1,j} is an n_1\times n_1-matrix is more likely to be well-behaved than an LSRDR (X_1,\dots,X_r) of (A_1,\dots,A_r) where each X_j is an n_1\times n_1-matrix. While iterated LSRDRs may seem to require more intensive computation than ordinary LSRDRs, it will take fewer rounds of gradient ascent to compute (X_{t,1},\dots,X_{t,r}) from (X_{t+1,1},\dots,X_{t+1,r}) than it will take to compute the LSRDR (X_1,\dots,X_r).

Observation: (Increased regularity with more matrices) Suppose that 2\leq r<s. Let A_1,\dots,A_s\in M_n(\mathbb{C}). Suppose that (X_1,\dots,X_r) is an LSRDR of (A_1,\dots,A_r) while (Y_1,\dots,Y_s) is an LSRDR of (A_1,\dots,A_s). Then the LSRDR (Y_1,\dots,Y_s) is generally better behaved than the LSRDR (X_1,\dots,X_r). For example, if all the matrices A_1,\dots,A_s are real, then it is more likely that one can find \lambda,C where \lambda CY_1C^{-1},\dots,\lambda CY_sC^{-1} are all real matrices than it is to find \lambda,C where \lambda CX_1C^{-1},\dots,\lambda CX_rC^{-1} are all real matrices.

Application: (Appending matrix products) Suppose that (A_1,\dots,A_r) are n\times n-complex matrices and 1\leq d<n. Suppose that (X_1,\dots,X_r)\in M_d(\mathbb{C})^r is an LSRDR of (A_1,\dots,A_r). If the LSRDR (X_1,\dots,X_r) is poorly behaved, then one may modify (A_1,\dots,A_r) to obtain a better behaved LSRDR.

For example, if \gamma=(a_1,\dots,a_t)\in\{1,\dots,r\}^t for some t\geq 1, then set A_\gamma=A_{a_1}\dots A_{a_t}. Suppose that \alpha_1,\dots,\alpha_s are non-negative real numbers. Then let V=\bigcup_{j=1}^s\{1,\dots,r\}^j. Then let (Y_\gamma)_{\gamma\in V} be an LSRDR of (\alpha_{|\gamma|}\cdot A_\gamma)_{\gamma\in V}. Then the LSRDR (Y_\gamma)_{\gamma\in V} typically behaves better than the LSRDR (X_1,\dots,X_r). While each round of gradient ascent for computing the LSRDR (Y_\gamma)_{\gamma\in V} takes longer than each round for computing the LSRDR (X_1,\dots,X_r), the gradient ascent will converge in fewer rounds for (Y_\gamma)_{\gamma\in V} than it will for (X_1,\dots,X_r).

Block preservation

Suppose that V_1,\dots,V_q are real or complex inner product spaces. Let V=V_1\oplus\dots\oplus V_q. Let A_j:V\rightarrow V be a linear operator for 1\leq j\leq r. Suppose that for each 1\leq j\leq r, there are u_j,v_j\in\{1,\dots,q\} where if u,v\in\{1,\dots,q\}, x\in V_u,y\in V_v, and \langle A_jx,y\rangle\neq 0, then u=u_j,v=v_j. Let P:V\rightarrow V be an LSRDR projection operator of (A_1,\dots,A_r). Then we may typically find projection operators P_i:V_i\rightarrow V_i for 1\leq i\leq q with P=P_1\oplus\dots\oplus P_q.

Suppose now that f:\{1,\dots,n\}^2\rightarrow\mathbb{C} is a function. For i,j\in\{1,\dots,n\}, let A_{i,j} be the n\times n matrix whose (i,j)-th entry is f(i,j) and whose other entries are zero. Let a_{i,j}=|f(i,j)|^2 for i,j\in\{1,\dots,n\}, and let A=(a_{i,j})_{1\leq i\leq n,1\leq j\leq n}. Let P\in M_n(\mathbb{C}) be an L_{2,d}-LSRDR projection operator of (A_{i,j})_{1\leq i\leq n,1\leq j\leq n}. If T\subseteq\{1,\dots,n\}, then let A|_T=(b_{i,j})_{1\leq i\leq n,1\leq j\leq n} be the matrix where b_{i,j}=a_{i,j} whenever i,j\in T and b_{i,j}=0 otherwise. If T\subseteq\{1,\dots,n\}, then let \pi_T be the n\times n-diagonal matrix whose i-th diagonal entry is 1 for i\in T and 0 for i\not\in T. Then we typically have P=\pi_T where T is the subset of \{1,\dots,n\} with |T|=d for which \rho(A|_T) is maximized.
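
Here is a brute-force sketch (an illustration of mine, only feasible for small n) of the subset that the LSRDR projection is observed to select:

```python
import itertools
import numpy as np

def best_subset_by_spectral_radius(A, d):
    """Return the size-d subset T of {0,...,n-1} maximizing rho(A|_T), i.e. the T
    with P = pi_T in the observation above (0-indexed here)."""
    n = A.shape[0]
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    def restricted(T):
        B = np.zeros_like(A)
        idx = np.ix_(T, T)
        B[idx] = A[idx]
        return B
    return max(itertools.combinations(range(n), d),
               key=lambda T: sr(restricted(list(T))))
```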

Application: Given a matrix Q, let \sum Q denote the sum of all the entries in Q. Suppose that B is a real n\times n-matrix. Then we may use LSRDRs to find a subset T\subseteq \{1,\dots,n\} with |T|=d where \sum(B|_T) is maximized.

Let M_{n,d} be the n\times n-matrix where each entry is 1/d, and let M_d=M_{d,d}. Let X be a real d\times d-matrix. Then it is not too difficult to show that \lim_{\delta\rightarrow 0}\frac{1}{\delta}(\rho(M_d+\delta X)-\rho(M_d))=\sum X/d.
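
A quick numerical check of this limit (a sketch of mine):

```python
import numpy as np

def check_derivative(d=4, delta=1e-6, seed=0):
    """Compare (rho(M_d + delta*X) - rho(M_d))/delta with sum(X)/d for a random d x d X."""
    rng = np.random.default_rng(seed)
    M = np.full((d, d), 1.0 / d)                        # M_d: every entry equals 1/d
    X = rng.standard_normal((d, d))
    sr = lambda A: np.max(np.abs(np.linalg.eigvals(A)))
    return (sr(M + delta * X) - sr(M)) / delta, X.sum() / d
```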

Suppose that B is an n\times n-real matrix and that there is a unique T\subseteq\{1,\dots,n\} with |T|=d for which \sum B|_T is maximized. Then whenever \delta is sufficiently small and A=M_{n,d}+\delta B, we typically have P=\pi_T.

Quaternions

We say that a 2d\times 2d complex matrix A=(a_{i,j})_{i,j} is a quaternionic complex matrix if a_{2i,2j}=\overline{a_{2i-1,2j-1}} and a_{2i,2j-1}=-\overline{a_{2i-1,2j}} whenever i,j\in\{1,\dots,d\}. Suppose that each A_j is an n\times n quaternionic complex matrix for 1\leq j\leq r. Suppose that d is an even integer with d\leq n and suppose that (X_1,\dots,X_r) is an L_{2,d}-SRDR of (A_1,\dots,A_r). Then there is often some invertible C\in M_d(\mathbb{C}) and \lambda\in\mathbb{C}^\times where if we set Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r, then Y_1,\dots,Y_r are quaternionic complex matrices. Furthermore, if (R,S) is an L_{2,d}-SRDR projector, then there is some L_{2,d}-SRDR projector (R_1,S_1) where [[R,S]]=[[R_1,S_1]] but where R_1,S_1 are quaternionic complex matrices.

We observe that if A_1,\dots,A_r\in M_n(\mathbb{C}) are quaternionic complex matrices and R\in M_{d,n}(\mathbb{C}),S\in M_{n,d}(\mathbb{C}) are rectangular matrices with RS=1_d, then the matrix A_1\otimes\overline{RA_1S}+\dots+A_r\otimes\overline{RA_rS} will still have a single dominant eigenvalue with a spectral gap ratio significantly above 1, in contrast with the fact that A_1,\dots,A_r each have a spectral gap ratio of 1 since the eigenvalues of A_1,\dots,A_r come in conjugate pairs. In particular, if X_1,\dots,X_r\in M_d(\mathbb{C}) are matrices obtained during the process of maximizing L_{A_1,\dots,A_r;d}, then the matrix A_1\otimes\overline{X_1}+\dots+A_r\otimes\overline{X_r} will have a spectral gap ratio that is significantly above 1; this ensures that gradient ascent will not have any problems in optimizing L_{A_1,\dots,A_r;d}(X_1,\dots,X_r).

Gradient ascent on subspaces

Let n be a natural number. Let \mathcal{H}_n,\mathcal{S}_n,\mathcal{Q}_n be the sets of all n\times n Hermitian matrices, complex symmetric matrices, and quaternionic complex matrices, respectively. Suppose that \mathcal{C}\in\{\mathcal{H},\mathcal{S},\mathcal{Q}\}. Suppose that (A_1,\dots,A_r)\in\mathcal{C}_n^r. Then let (X_1,\dots,X_r)\in\mathcal{C}_d^r be a tuple such that L_{A_1,\dots,A_r;d}|_{\mathcal{C}_d^r}(X_1,\dots,X_r) is locally maximized. Suppose that (Y_1,\dots,Y_r)\in M_d(\mathbb{C})^r is a tuple where L_{A_1,\dots,A_r;d}(Y_1,\dots,Y_r) is locally maximized. Then we usually have L_{A_1,\dots,A_r;d}(X_1,\dots,X_r)=L_{A_1,\dots,A_r;d}(Y_1,\dots,Y_r). In this case, we would say that (X_1,\dots,X_r) is an alternate domain L_{2,d}-spectral radius dimensionality reduction (ADLSRDR).

Homogeneous non-commutative polynomials

Suppose that p(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree n. Define a fitness function M_{p,d} by letting M_{p,d}(A_1,\dots,A_r)=\frac{\rho(p(A_1,\dots,A_r))^{1/n}}{\rho_2(A_1,\dots,A_r)} whenever (A_1,\dots,A_r)\in M_d(\mathbb{C})^r. We say that (A_1,\dots,A_r) is an L_{2,d}-spectral radius dimensionality reduction of p(x_1,\dots,x_r) if M_{p,d}(A_1,\dots,A_r) is locally maximized.

If p is a tensor of degree 2, then \max M_{p,d}(A_1,\dots,A_r)=\|p\|_2 where \|p\|_2 denotes the Frobenius norm of p. One should take an LSRDR of an n-tensor p only if n\geq 3 since there are plenty of linear algebraic tools that one can use to analyze 2-tensors.

Suppose that p(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree n with random coefficients. Let (X_1,\dots,X_r) be an LSRDR of p(x_1,\dots,x_r). Then whenever q(x_1,\dots,x_r) is a homogeneous non-commutative polynomial of degree m where n does not divide m, we typically have \text{Tr}(q(X_1,\dots,X_r))\approx 0. Let \mathcal{E}=\Phi(X_1,\dots,X_r). The eigenvalues of \mathcal{E} will be symmetric under rotation by 2\pi/n. Whenever the coefficients of p are random complex numbers and n=3, the superoperator \mathcal{E} has no real eigenvalues except for a dominant eigenvalue and zero eigenvalues. In the case when the coefficients of p are random real numbers and n=3, the superoperator \mathcal{E} may have other real eigenvalues, but these non-zero, non-dominant real eigenvalues will tend to have multiplicity 2.

A simple calculation shows that the above observation about the traces of q(X_1,\dots,X_r) and the above observations about the spectrum of \mathcal{E} are equivalent.

Observation: The trace of a completely positive superoperator is always a non-negative real number. In particular, \text{Tr}(\Phi(A_1,\dots,A_r))=|\text{Tr}(A_1)|^2+\dots+|\text{Tr}(A_r)|^2.

Observation: Suppose that A_{1,1},\dots,A_{1,r_1},\dots,A_{s,1},\dots,A_{s,r_s} are d\times d real or complex matrices. Then

\text{Tr}(\Phi(A_{1,1},\dots,A_{1,r_1})\dots\Phi(A_{s,1},\dots,A_{s,r_s}))=\sum_{i_1\in\{1,\dots,r_1\},\dots,i_s\in\{1,\dots,r_s\}}|\text{Tr}(A_{1,i_1}\dots A_{s,i_s})|^2.

Therefore \text{Tr}(\Phi(A_{1,1},\dots,A_{1,r_1})\dots\Phi(A_{s,1},\dots,A_{s,r_s}))=0 if and only if \text{Tr}(A_{1,i_1}\dots A_{s,i_s})=0 whenever i_1\in\{1,\dots,r_1\},\dots,i_s\in\{1,\dots,r_s\}.
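
This identity is easy to verify numerically; here is a sketch (again assuming the column-stacking matrix representation of \Phi used in the earlier superoperator sketch):

```python
import numpy as np
from itertools import product
from functools import reduce

def Phi_matrix(As):
    """Matrix (column-stacking convention) of Y -> sum_j A_j Y A_j^*."""
    return sum(np.kron(np.conj(A), A) for A in As)

def check_trace_identity(groups):
    """groups = [[A_{1,1},...,A_{1,r_1}], ..., [A_{s,1},...,A_{s,r_s}]].
    Returns both sides of Tr(Phi(A_{1,*}) ... Phi(A_{s,*}))
    = sum over (i_1,...,i_s) of |Tr(A_{1,i_1} ... A_{s,i_s})|^2."""
    lhs = np.trace(reduce(np.matmul, [Phi_matrix(g) for g in groups]))
    rhs = sum(abs(np.trace(reduce(np.matmul, choice))) ** 2
              for choice in product(*groups))
    return lhs, rhs
```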

LSRDRs of homogeneous non-commutative polynomials are typically well behaved even when the polynomial has very high degree. For example, let k\geq 0 and let (p_{i,j})_{1\leq i\leq k,1\leq j\leq r} be a system of homogeneous non-commutative polynomials in x_1,\dots,x_r, each of degree s. Then for 1\leq i\leq k, let p_i(x_1,\dots,x_r)=(p_{i,1}(x_1,\dots,x_r),\dots,p_{i,r}(x_1,\dots,x_r)). Let p=p_k\circ\dots\circ p_1. Then there are homogeneous non-commutative polynomials q_1,\dots,q_r where p(x_1,\dots,x_r)=(q_1(x_1,\dots,x_r),\dots,q_r(x_1,\dots,x_r)). If (X_1,\dots,X_r),(Y_1,\dots,Y_r) are L_{2,d}-SRDRs of q_1 trained with different initializations, then we may typically find some \lambda\in\mathbb{C}^\times and some invertible C with Y_j=\lambda CX_jC^{-1} for 1\leq j\leq r. This is despite the fact that q_1 has degree s^k, which could be very large.

One can use LSRDRs to map tensors in V_1\otimes\dots\otimes V_n to other objects in several ways. Let V=V_1\oplus\dots\oplus V_n. Then since V_1\otimes\dots\otimes V_n\subseteq V^{\otimes n}, we may consider a tensor v in V_1\otimes\dots\otimes V_n as an element of V^{\otimes n}, and we may take our LSRDR of the tensor v. Suppose that \text{Dim}(V_j)=r_j. Then we can represent a tensor v\in V_1\otimes\dots\otimes V_n as a homogeneous non-commutative polynomial p(x_{1,1},\dots,x_{1,r_1};\dots;x_{n,1},\dots,x_{n,r_n}) consisting of a linear combination of monomials of the form x_{1,i_1}\dots x_{n,i_n}. Define a fitness function M:M_d(K)^{r_1+\dots+r_n}\rightarrow[-\infty,\infty) by setting

M(A_{1,1},\dots,A_{1,r_1};\dots;A_{n,1},\dots,A_{n,r_n})=\log(\rho(p(A_{1,1},\dots,A_{1,r_1};\dots;A_{n,1},\dots,A_{n,r_n})))/n-\log(m(\|\sum_{k=1}^{r_1}A_{1,k}A_{1,k}^*\|_p,\dots,\|\sum_{k=1}^{r_n}A_{n,k}A_{n,k}^*\|_p))/2

where m is some mean. Then the fitness function M empirically has a unique local maximum value.

Tensors to collections of matrices

Not only are we able to obtain tuples of matrices from tensors, but we may also obtain tensors from tuples of matrices, and go back and forth between tensors and tuples of matrices. Suppose that K is either the field of real numbers or the field of complex numbers. Let X_{1,1},\dots,X_{1,r_1};\dots;X_{n,1},\dots,X_{n,r_n}\in M_d(K). Then an LSRDR (\alpha_{i_1,\dots,i_n})_{i_1,\dots,i_n}\in K^{r_1}\otimes\dots\otimes K^{r_n} of (X_{1,i_1}\dots X_{n,i_n})_{i_1,\dots,i_n} is typically unique up to a scalar multiple, and if K is the field of complex numbers but each X_{i,j} is a real matrix, then (\alpha_{i_1,\dots,i_n})_{i_1,\dots,i_n} is typically a real tensor multiplied by a non-zero complex scalar.

Suppose that V_1,\dots,V_n are finite dimensional inner product spaces over the field K. Suppose that e_{i,1},\dots,e_{i,d_i} is a basis for V_i. Let (m_0,\dots,m_n) be a sequence of positive integers with m_0=m_n=1. Suppose that X_{i,j} is an m_{i-1}\times m_i matrix over the field K whenever 1\leq i\leq n and 1\leq j\leq d_i. Then define the tensor T((X_{i,j})_{i,j})=\sum_{i_1,\dots,i_n}e_{1,i_1}\otimes\dots\otimes e_{n,i_n}\cdot X_{1,i_1}\dots X_{n,i_n}. If \mathbf{v}\in V_1\otimes\dots\otimes V_n, the quantity \|T((X_{i,j})_{i,j})-\mathbf{v}\| is locally minimized, and \mathbf{u}=T((X_{i,j})_{i,j}), then we say that \mathbf{u} is a matrix table to tensor dimensionality reduction of \mathbf{v} of type (m_0,\dots,m_n). Our computer experiments suggest that the matrix table to tensor dimensionality reduction of \mathbf{v} of type (m_0,\dots,m_n) is generally unique. Furthermore, if K=\mathbb{C} and \mathbf{v} is an \mathbb{R}-linear combination of the basis vectors e_{1,i_1}\otimes\dots\otimes e_{n,i_n}, then in our computer experiments, the matrix table to tensor dimensionality reduction of type (m_0,\dots,m_n) is also an \mathbb{R}-linear combination of the basis vectors e_{1,i_1}\otimes\dots\otimes e_{n,i_n}.
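
A sketch of the evaluation map T (my own illustration; X[i][j] plays the role of X_{i+1,j+1}, and each matrix product below is 1\times 1 because m_0=m_n=1):

```python
import numpy as np
from itertools import product
from functools import reduce

def matrix_table_to_tensor(X):
    """X is a list of lists of matrices: X[i][j] is the m_i x m_{i+1} matrix
    X_{i+1,j+1} (0-indexed), with m_0 = m_n = 1.  Returns the tensor
    T((X_{i,j})_{i,j}) whose (i_1,...,i_n) coefficient is X_{1,i_1} ... X_{n,i_n}."""
    dims = [len(Xi) for Xi in X]
    T = np.zeros(dims, dtype=complex)
    for idx in product(*[range(d) for d in dims]):
        T[idx] = reduce(np.matmul, [X[i][idx[i]] for i in range(len(X))])[0, 0]
    return T
```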

Suppose that \mathbf{v} is a sparse tensor whose entries are integers and K=\mathbb{R}. Suppose furthermore that \mathbf{u}_1,\dots,\mathbf{u}_z are matrix table to tensor dimensionality reductions of \mathbf{v}. Then the values \|\mathbf{u}_i-\mathbf{v}\|^2,\|\mathbf{u}_i-\mathbf{u}_j\|^2,\|\mathbf{u}_i\|^2 are often positive integers. It is sometimes but not always the case that each entry in \mathbf{u}_i is an integer or at least a rational number. It is sometimes the case that

  1. there is a partition of \{1,\dots,z\} into sets A,B where \mathbf{u}_i=\mathbf{u}_j precisely when i=j or i,j\in B,
  2. \|\mathbf{u}_i\|^2 is always a positive integer,
  3. there is a positive integer m where \|\mathbf{u}_i-\mathbf{u}_j\|^2=m whenever i\in A,j\in B,
  4. if i\in B, then all of the entries in \mathbf{u}_i are integers,
  5. if i\in A, then most but not all of the entries in \mathbf{u}_i are integers,
  6. if i\in A, then the tensor \mathbf{u}_i has repeated non-integer entries, and
  7. one can find non-integer entries \alpha,\beta in \mathbf{u}_i where \alpha-\beta is a non-zero integer.

Suppose that \mathbf{v} is a random tensor in V_1\otimes\dots\otimes V_n. Then one can often find a matrix table to tensor dimensionality reduction \mathbf{u} of \mathbf{v} of type (m_0,\dots,m_n) such that \mathbf{v}-\mathbf{u} is its own matrix table to tensor dimensionality reduction of type (m_0,\dots,m_n).

LSRDRs to learn data

LSRDRs may be adapted to solve classification tasks. Suppose that K\in\{\mathbb{R},\mathbb{C},\mathbb{H}\}.

Suppose that r,q,d are natural numbers. Suppose that 1<p\leq\infty. Let n_1,\dots,n_q be natural numbers. Suppose that A_{i,j}\in M_{n_i}(K) for i\in\{1,\dots,q\},j\in\{1,\dots,r\}. We say that a collection X_1,\dots,X_r\in M_d(K) of matrices is an L_{2,d}-spectral radius similarity log-expected maximizer (LSRSLEM) of degree p and dimension d if X_1,\dots,X_r locally maximizes the fitness

\frac{1}{q}\sum_{i=1}^q\log(\rho(\Gamma(A_{i,1},\dots,A_{i,r};X_1,\dots,X_r)))-\log(\|X_1X_1^*+\dots+X_rX_r^*\|_p)/2.
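
A sketch of this fitness (my illustration; it assumes \|\cdot\|_p is the Schatten p-norm, which is how I read the definition, and uses the same \Gamma convention as the earlier superoperator sketch):

```python
import numpy as np

def Gamma(As, Bs):
    """Superoperator Y -> sum_j A_j Y B_j^* as a matrix (column-stacking)."""
    return sum(np.kron(np.conj(B), A) for A, B in zip(As, Bs))

def schatten_norm(M, p):
    """Schatten p-norm of M (operator norm when p = infinity)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

def lsrslem_fitness(A, Xs, p=2.0):
    """Fitness of (X_1,...,X_r) as defined above; A[i][j] is the matrix A_{i,j}."""
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    avg = sum(np.log(sr(Gamma(Ai, Xs))) for Ai in A) / len(A)
    penalty = np.log(schatten_norm(sum(X @ X.conj().T for X in Xs), p)) / 2
    return avg - penalty
```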

Application: LSRSLEMs may be used to solve machine learning classification problems. Suppose that Z is a universal set. Suppose that we are given a training data set D_0\subseteq Z, and we would like the machine learning algorithm to recognize objects that are similar to the elements of D_0. Then one would first construct a function h:Z\rightarrow\bigcup_{n=1}^\infty M_n(K)^r. One then constructs an LSRSLEM X_1,\dots,X_r of (h(x))_{x\in D_0}. Then a higher value of \log(\rho(\Gamma(h(z);X_1,\dots,X_r))) indicates that the value z is more similar to the training data set D_0.

Observation: If each A_{i,j} is nearly positive semidefinite, and (X_1,\dots,X_r),(Y_1,\dots,Y_r) are LSRSLEMs of the same degree and dimension of (A_{i,j})_{1\leq i\leq q,1\leq j\leq r}, then (X_1,\dots,X_r),(Y_1,\dots,Y_r) typically have the same fitness level.

Observation: Suppose that (X_1,\dots,X_r) is an LSRSLEM. Then one can typically find subspaces V_1,\dots,V_s of the vector space K^d with K^d=V_1\oplus\dots\oplus V_s such that each V_i is invariant with respect to each X_j and each X_j^*. In other words, there is a unitary matrix U along with m_1,\dots,m_s with m_1+\dots+m_s=d where each UX_jU^* is a block diagonal matrix whose diagonal blocks are matrices X_{1,j},\dots,X_{s,j} of sizes m_1,\dots,m_s. Furthermore, the matrices X_{i,1},\dots,X_{i,r} typically generate the algebra M_{m_i}(K).

Application: LSRSLEMs may be used to recognize objects in a data set, but LSRSLEMs may also be used to partition data sets into clusters and produce representatives of these clusters. Suppose that X_1,\dots,X_r is an LSRSLEM of (A_{i,j})_{i,j} and each X_j is a block diagonal matrix with diagonal blocks X_{1,j},\dots,X_{s,j}.

Our clustering function f:\{1,\dots,q\}\rightarrow\{1,\dots,s\} will simply be the function where f(i) is the value that maximizes the quantity \log(\rho(\Gamma(A_{i,1},\dots,A_{i,r};X_{f(i),1},\dots,X_{f(i),r}))). Our clustering function can be used to assign tuples (A_1,\dots,A_r) that are not contained in our original q data values to these s clusters. Our extended clustering function F will be the partial function where if (A_1,\dots,A_r)\in M_n(K)^r for some n, then F(A_1,\dots,A_r)=j whenever \log(\rho(\Gamma(A_1,\dots,A_r;X_{j,1},\dots,X_{j,r}))) is maximized.

An LSRDR of the cluster (X_{j,1},\dots,X_{j,r}) serves as a representative for the j-th cluster.

Iterated LSRSLEM clustering

One can use iterated LSRSLEMs to partition data into clusters. For simplicity, assume that r=d.

Suppose that v_{i,j} is a random positive real number for each (i,j)\in\{1,\dots,q\}\times\{1,\dots,r\}. Let 1<p<\infty. Then we shall construct v_{i,j,n} whenever n\geq 0,1\leq i\leq q,1\leq j\leq r using recursion on n, starting with v_{i,j,0}=v_{i,j}.

From (v_{i,j,n})_{i,j}, we construct an LSRSLEM (X_{1,n},\dots,X_{r,n}) of (v_{i,j,n})_{i,j} of degree p and dimension d. We then let (v_{i,1,n+1},\dots,v_{i,r,n+1}) be a dominant eigenvector with norm 1 of v_{i,1,n}X_{1,n}+\dots+v_{i,r,n}X_{r,n} whenever 1\leq i\leq q. Then there is an equivalence relation \simeq on \{1,\dots,q\} where for all \epsilon>0, there is some N where if n\geq N, \alpha=(v_{i,1,n},\dots,v_{i,r,n}), and \beta=(v_{j,1,n},\dots,v_{j,r,n}), then \|\alpha-\beta\|<\epsilon whenever i\simeq j and |\langle \alpha,\beta\rangle|<\epsilon otherwise. The equivalence relation \simeq does not typically depend on the choices made when training the LSRSLEMs (X_{1,n},\dots,X_{r,n}) for n\geq 0.

The stability of the iterations v_{i,j,n} also suggests that our method of iterating LSRSLEMs will not be able to replicate the capabilities of deep neural networks.

Multiple tensors

Compositional LSRDRs

Suppose that r_1,\dots,r_l, n_1,\dots,n_l,d_1,\dots,d_l are positive integers. Suppose that A_{i,j} is an n_i\times n_{i+1}-matrix over the field K (here the index i+1 is taken modulo l) whenever 1\leq i\leq l,1\leq j\leq r_i. Then we say that (X_{i,j})_{i,j}\in \prod_{i=1}^l M_{d_i,d_{i+1}}(K)^{r_i} is a compositional LSRDR (abbreviated CLSRDR) of (A_{i,j})_{i,j} with dimensions (d_1,\dots,d_l) if the following quantity, known as the fitness, is locally maximized:

\log(\rho\big(\Gamma(A_{1,1},\dots,A_{1,r_1};X_{1,1},\dots,X_{1,r_1})\dots\Gamma(A_{l,1},\dots,A_{l,r_l};X_{l,1},\dots,X_{l,r_l})\big))-\log(\rho\big(\Phi(X_{1,1},\dots,X_{1,r_1})\dots\Phi(X_{l,1},\dots,X_{l,r_l})\big))/2.

Observation: Two CLSRDRs of (A_{i,j})_{i,j} often have the same fitness levels.

Observation: Suppose that r_1,\dots,r_l, n_1,\dots,n_l,d_1,\dots,d_l,e_1,\dots,e_l are positive integers with \min(n_i,d_i)=\min(n_i,e_i) for all i. Then a CLSRDR with dimensions (d_1,\dots,d_l) will typically have the same fitness level as a CLSRDR with dimensions (e_1,\dots,e_l).

Observation: Suppose that (X_{i,j})_{i,j} is a CLSRDR of (A_{i,j})_{i,j}. Then one can often find two sequences (R_1,\dots,R_l),(S_1,\dots,S_l) of matrices where X_{i,j}=R_iA_{i,j}S_i for all i,j. Furthermore, for each i, there is a non-zero constant \lambda_i where R_{i+1}S_i=\lambda_i\cdot I. We say that the CLSRDR (X_{i,j})_{i,j} is constant factor normalized if R_{i+1}S_i=I for all i. In this case, the matrix S_iR_{i+1} will be a projection matrix that we may denote by P_i.

Let U_{R,1},\dots,U_{R,l},U_{L,1},\dots,U_{L,l} denote the matrices where U_{R,i} is the dominant eigenvector of \Delta_i=\Gamma(A_{i+1,1},\dots,A_{i+1,r_{i+1}};X_{i+1,1},\dots,X_{i+1,r_{i+1}})\dots\Gamma(A_{l,1},\dots,A_{l,r_l};X_{l,1},\dots,X_{l,r_l})\Gamma(A_{1,1},\dots,A_{1,r_1};X_{1,1},\dots,X_{1,r_1})\dots\Gamma(A_{i,1},\dots,A_{i,r_i};X_{i,1},\dots,X_{i,r_i}) and where U_{L,i} is the dominant eigenvector of the adjoint \Delta_i^*. Then the matrices R_{i+1}U_{R,i},U_{R,i}S_i^*,S_i^*U_{L,i+1},U_{L,i}R_i are generally positive semidefinite up to a constant factor. If we train the CLSRDR twice, then each time we will typically get the same values of U_{R,i}S_i^*/\text{Tr}(U_{R,i}S_i^*) and U_{L,i}R_i/\text{Tr}(U_{L,i}R_i).

Suppose furthermore that A_{i_1,j}=A_{i_2,j} for all i_1,i_2,j. Then S_{i_1}R_{i_1+1}=S_{i_2}R_{i_2+1} whenever the CLSRDR (X_{i,j})_{i,j} is constant factor normalized, and the positive semidefinite matrices U_{R,i}S_i^*/\text{Tr}(U_{R,i}S_i^*),U_{L,i}R_i/\text{Tr}(U_{L,i}R_i) do not depend on the value of i.

WHAT IF ALL BUT ONE ENTRY IS ZERO? BLOCK MATRICES. UNDER CONSTRUCTION.

Other fitness functions

Let X,Y be finite dimensional real or complex inner product spaces. Suppose that \mu is a Borel probability measure with compact support on X\times Y. Now define a fitness function f_\mu:L(X,Y)\rightarrow[-\infty,\infty) by letting f_\mu(A)=\int \log(|\langle Ax,y\rangle|)d\mu(x,y)-\log(\|A\|_2).

If \nu is a Borel probability measure with compact support on X, then define a fitness function f_\nu:L(X)\rightarrow[-\infty,\infty) by letting f_\nu(A)=\int \log(|\langle Ax,x\rangle|)d\nu(x)-\log(\|A\|_2).

If X,Y are real inner product spaces, then \{A\in L(X,Y)\mid f_\mu(A)>-\infty\},\{A\in L(X)\mid f_\nu(A)>-\infty\} will generally be disconnected spaces, so gradient ascent with a random initialization has no hope of finding an A\in L(X,Y) or A\in L(X) that maximizes f_\mu or f_\nu respectively. However, f_\nu(A)>-\infty whenever A is positive definite, and gradient ascent seems to always produce the same local optimum value f_\nu(A) for the function f_\nu when A is initialized to be a positive semidefinite matrix. If A,B are linear operators that maximize f_\nu, then we generally have A=\pm B.

If X is a complex inner product space, then gradient ascent seems to always produce the same locally optimal value f_\nu(A) for the fitness function f_\nu, and if A,B are matrices that maximize the fitness function f_\nu, then we generally have A=\lambda B for some \lambda\in\mathbb{C}^\times. For some \lambda\in\mathbb{C}^\times, the locally optimal operator \lambda A will generally have positive eigenvalues, but the operator \lambda A itself will not be positive semidefinite. If X=\mathbb{C}^n and \nu is supported on \mathbb{R}^n, then for each operator A that optimizes f_\nu, there is some \lambda\in\mathbb{C}^\times where \lambda A[\mathbb{R}^n]\subseteq\mathbb{R}^n, and \lambda A is up to sign the same matrix found by real gradient ascent with positive definite initialization.

The problem of optimizing f_\mu and f_\nu is useful for constructing MPO word embeddings and other objects for machine learning similar to LSRDR dimensionality reductions. While f_\nu tends to have a unique (up to a constant factor) local optimum, the fitness function f_\mu does not generally have a unique local optimum; if X,Y are complex inner product spaces, then f_\mu has few local optima in the sense that it is not unusual for two local optima A,B of f_\mu to satisfy A=\lambda B for some constant \lambda. In this case, the matrix A that globally optimizes f_\mu is most likely also the matrix that has the largest basin of attraction with respect to various types of gradient ascent.

MPO word embeddings: Suppose that N is a matrix embedding potential, A is a finite set, and a_1\dots a_n\in A^*. Suppose that f,g:A\rightarrow M_d(\mathbb{C}) are MPO word pre-embeddings with respect to the matrix embedding potential N. Then there are often constants \lambda_a\in S^1 for a\in A where f(a)=\lambda_ag(a) for a\in A. Furthermore, there is often a matrix R along with constants \mu_a\in S^1 for a\in A where \mu_a Rf(a)R^{-1} is a real matrix for each a\in A.

Rank-1 matrix valued graph embeddings

Applying pseudodeterminism to test randomness/pseudorandomness

The uniqueness of the local optima of the fitness functions (pseudodeterminism) in the machine learning algorithms in this post is useful for testing for randomness/pseudorandomness in cryptography. Here is a non-exhaustive list of ways that one can use pseudodeterministic machine learning algorithms to test data sets for randomness/pseudorandomness, to measure the quantity of randomness/pseudorandomness in a data set, and to obtain confidence levels for the non-randomness/pseudorandomness of an object.

Application: Let A be a finite set. Let 1<p\leq\infty and let d be a positive integer. Let (a_1,\dots,a_n)\in A^n. Define our fitness function F_p:M_d(\mathbb{R})^A\setminus\{\mathbf{0}\}\rightarrow[-\infty,\infty) by setting F_p((X_a)_{a\in A})=\log(\rho(X_{a_1}\dots X_{a_n}))/n-\log(\|\sum_{a\in A}X_aX_a^*\|_p). If F_p((X_a)_{a\in A}) is locally maximized, then a lower value of F_p((X_a)_{a\in A}) indicates that the sequence (a_1,\dots,a_n) is more random/pseudorandom.
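
A sketch of this test statistic (my illustration; Xs maps each symbol to a d\times d real matrix, and \|\cdot\|_p is again read as the Schatten p-norm):

```python
import numpy as np
from functools import reduce

def schatten_norm(M, p):
    """Schatten p-norm of M (operator norm when p = infinity)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

def F_p(Xs, seq, p=2.0):
    """Evaluate F_p((X_a)_{a in A}) for the sequence seq = (a_1,...,a_n),
    where Xs is a dict mapping each symbol a to the matrix X_a."""
    sr = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    word = reduce(np.matmul, [Xs[a] for a in seq])
    gram = sum(X @ X.T for X in Xs.values())
    return np.log(sr(word)) / len(seq) - np.log(schatten_norm(gram, p))
```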

Application: Suppose that A is a finite set, n is a natural number, and R\subseteq A^n is a subset. Let 1<p\leq\infty and let d be a positive integer. Define a function L:M_d(\mathbb{R})^{A\times n}\setminus\{\mathbf{0}\}\rightarrow[-\infty,\infty) by setting L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) =\big(\sum_{(r_1,\dots,r_n)\in R}\log(\rho(X_{r_1,1}+\dots+X_{r_n,n}))/(n\cdot |R|)\big)-\log(\|\sum_{a\in A}\sum_{k=1}^nX_{a,k}X_{a,k}^\ast\|_p)/2.

If L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) is locally maximized, then a lower value of L((X_{a,k})_{a\in A,k\in\{1,\dots,n\}}) indicates that the set R is more random/pseudorandom.

Best quantum channel approximation to a relation.

Let U,V be complex Euclidean spaces. Let f=\{(u_1,v_1),\dots,(u_n,v_n)\}\subseteq U\times V be such that each u_j,v_j is a unit vector. Then we say that a quantum channel \mathcal{E}:L(U)\rightarrow L(V) is a best Choi rank r approximation to f if the quantity \sum_{k=1}^n-\log(\langle\mathcal{E}(u_ku_k^*)v_k,v_k\rangle) is locally minimized subject to the condition that \mathcal{E} is a quantum channel of Choi rank at most r.
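
To make the objective concrete, here is a sketch (my illustration) that evaluates the quantity being minimized for a channel given in Kraus form; a channel of Choi rank at most r is exactly one that admits a Kraus representation with at most r operators (trace preservation, \sum_j K_j^*K_j=1_U, is assumed but not enforced by the sketch).

```python
import numpy as np

def choi_rank_r_loss(Ks, pairs):
    """Evaluate sum_k -log(<E(u_k u_k^*) v_k, v_k>) for the channel
    E(X) = sum_j K_j X K_j^* with Kraus operators Ks (len(Ks) <= r), where
    pairs = [(u_1, v_1), ..., (u_n, v_n)] consists of unit vectors."""
    loss = 0.0
    for u, v in pairs:
        Eu = sum(K @ np.outer(u, u.conj()) @ K.conj().T for K in Ks)
        loss -= np.log(np.real(v.conj() @ Eu @ v))
    return loss
```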

Empirical observation: Best Choi rank r approximations to relations tend to be unique in the sense that the gradient descent process tends to produce the same best Choi rank r approximation when we run the gradient descent multiple times.

Suppose that \mathcal{E} is a best Choi rank r approximation to f=\{(u_1,v_1),\dots,(u_n,v_n)\}.

For each j, let u_j^\sharp be a unit dominant eigenvector of \mathcal{E}(u_ju_j^*). Let f^\sharp=\{(u^\sharp_1,v_1),\dots,(u^\sharp_n,v_n)\}. Then we typically have f^{\sharp\sharp\sharp}=f^{\sharp\sharp}. More generally, if N is a neighborhood of f^\sharp of small diameter and g\in N, then we typically have g^{\sharp\sharp}=g^\sharp, and if \mathcal{F} is a best Choi rank r approximation to g^\sharp, then \mathcal{F} is typically the identity mapping.

Afterword

Even though I have published this post, it is not yet complete since I would like to add information about functions that are related to, and satisfy similar properties to, the functions of the form L_{A_1,\dots,A_r;d}. I also want the community to actually say something and communicate back before I add more information to this post.
