MPO optimized word embeddings: mapping tokens to matrices

In this post, we will not talk about cryptography, but we will talk about another application and generalization of the $L_{2,d}$ -spectral radius to producing word embeddings and graph embeddings for machine learning applications. We shall call these word embeddings matrix product optimized (MPO) word embeddings. These MPO word embeddings and graph embeddings do not require any neural networks to produce at the moment, but once these word and graph embeddings are produced, they can be processed using neural networks. Even though the MPO word embeddings are not produced using neural networks, they have features and advantages that no other word embedding algorithm currently has. For example, MPO word embeddings can be used to evaluate the local grammatical correctness of text and produce text with the help of evolutionary computation or simulated annealing. MPO word embeddings are also the only kind of word embedding that we know of that map words to matrices, and matrix-valued word embeddings enjoy plenty of advantages over vector-valued word embeddings. MPO word embeddings also behave much more mathematically than other word embedding algorithms; for example, even when training an MPO word embedding over the field of complex numbers, the resulting word embedding (if everything goes right) will be a real matrix valued function. MPO word embeddings also also more consistent in the sense that the resulting MPO word embedding does not depend on the initialization. Any initialization would converge to a nearby MPO word embedding.

In artificial intelligence, and more specifically, natural language processing (NLP), one would want to convert words, characters, punctuation, or other tokens into something that computers can work with such as vectors in a vector space. A word embedding is typically a function from a set of tokens $A$ to a finite dimensional real vector space $V$ . With a word embedding, AI systems will therefore be able to process vectors rather than text, but we may be able to improve upon the idea of mapping tokens to elements in a vector space by instead mapping tokens to real or complex matrices. Spaces of real or complex matrices have more structure than simply vector spaces since spaces of matrices are equipped with matrix multiplication, the Frobenius inner product, several orthogonal or unitary invariant submultiplicative norms, many matrix decompositions such as the singular value decomposition, et cetera, and we shall see that this additional structure of $M_d(K)$ is very useful for word embeddings in NLP. Matrix multiplication can also be exploited by matrix neural networks [GGW] which are neural networks whose layers are of the form $X\mapsto\sigma(AXB+C)+D$ where $\sigma$ is an elementwise activation function and $A,B,C,D$ are learnable matrices of the appropriate dimensions.

Since tokens in natural language often have multiple related and unrelated meanings, a word embedding should be designed so that these meanings can be easily distinguished from each other. While vectors are suitable for encoding a single meaning of a token, matrices are much better suited for encoding multiple meanings of a token since each row (or column) of a matrix could represent a particular meaning of the word.

Suppose that $X$ is real or complex matrix. Then by applying a singular value decomposition, we obtain $X=\sum_{k=1}^r\sigma_k u_kv_k^*$ where $\sigma_1\geq\dots\geq\sigma_r\geq 0$ are the singular values of $X$ , and $(u_1,\dots,u_r),(v_1,\dots,v_r)$ are two orthonormal bases. As usual, the matrix $X$ could represent a token while each rank-1 matrix $u_kv_k^*$ could represent a specific meaning of the token and the singular values $\sigma_k$ represent the prevalence of such a meaning of the token. We observe that with MPO word embeddings, the singular value decomposition does indeed decompose a matrix representing words into a sum of rank-1 matrices each representing a specific meaning of the word.

Now assume that $X$ is an $r\times r$ -positive semidefinite matrix. Then the singular value decomposition reduces to the the spectral theorem, and we have $u_k=v_k$ for all $k$ , and we can write $X=\sum_{k=1}^{r}\sigma_k u_ku_k^*$ where $\sigma_1\geq\dots\geq\sigma_r\geq 0$ are the singular values of $X$ . In this case, $X$ could represent a token while each vector $u_j$ could represent an orthogonal meaning of the token while $\sigma_j$ represents the prevalence of this meaning. In terms of quantum information theory, a token is represented by a mixed state while the specific meanings of the token are represented as pure states, so the token is a superposition of all of its possible meanings.

In this post, $A$ will be a set that represents the set of all tokens, and $K$ denotes either the field of real or complex numbers. If $f:A\rightarrow M_d(K)$ is a function, then define $\overline{f}:A^*\rightarrow M_d(K)$ by letting $\overline{f}(a_1\dots a_n)=f(a_1)\dots f(a_n)$ whenever $a_1,\dots,a_n\in A$ .

The general idea behind MPO word embeddings is that given a string $a_1\dots a_n$ over the alphabet $A$ (the string $a_1\dots a_n$ is the corpus that the MPO word embedding trains with) and a natural number $d$ , our MPO word embedding of $a_1\dots a_n$ shall be a function $f:A\rightarrow M_d(K)$ that maximizes the product $\overline{f}(a_1\dots a_n)$ . Since one can make $\overline{f}(a_1\dots a_n)$ arbitrarily large by increasing the value of $f$ , we need to normalize the quantity $\overline{f}(a_1\dots a_n)$ . We therefore want for the MPO word embeddings to maximize an expression that looks like $\frac{\|\overline{f}(a_1\dots a_n)\|^{1/n}}{N(f(a)_{a\in A})}$ . The function $N$ should be defined so that $\frac{\|\overline{f}(a_1\dots a_n)\|^{1/n}}{N(f(a)_{a\in A})}$ is bounded and cannot be optimized simply by making $f$ large. The function $N(a_1,\dots,a_r)$ needs to be also designed so that in a trained MPO word embedding, the matrices $f(a)$ will be sufficiently far away from each other. In other words, if the matrices $f(a)$ are similar to each other, then the denominator $N((f(a))_{a\in A})$ should be large, but if the matrices $f(a)$ are distributed evenly, then the value $N(f(a))_{a\in A}$ should be small. In a trained MPO word embedding, since the matrices $f(a)$ are sufficiently far away from each other and relatively small and since $\overline{f}(a_1\dots a_n)$ relatively large, the value $\overline{f}(a_1\dots a_n)$ must be as large as possible, so the matrices $f(a_1),\dots,f(a_n)$ need to be compatible with its neighboring matrices.

To illustrate how the size of the matrix product $A_1\dots A_n$ equates with the compatibility of each of the matrices $A_1,\dots,A_n$ with its neighboring matrices, suppose that $A_1,\dots,A_n$ are rank-1 matrices so that for all $i$ , we have a factorization $A_i=u_iv_i^*$ for some $u_i,v_i$ . Then $\text{Tr}(A_1\dots A_n)=\text{Tr}(u_1v_1^*\dots u_nv_n^*)=\text{Tr}(v_1^*u_2\dots v_n^*u_1)=\langle u_2,v_1\rangle\dots\langle u_1,v_n\rangle$ . Therefore, $\frac{|\text{Tr}(A_1\dots A_n)|}{\|A_1\|_\infty\dots\|A_n\|_\infty}=1$ if and only if for each $i$ , there is some $\lambda_i\in K^\times$ where $u_{i+1}=\lambda_iv_i$ where addition is taken modulo $n$ . Furthermore, $(\frac{|\text{Tr}(A_1\dots A_n)|}{\|A_1\|_\infty\dots\|A_n\|_\infty})^{1/n}$ is a measure of compatibility between the matrices of the form $A_i$ and its neighboring matrices.

We say that a function $N:M_d(K)^+\rightarrow[0,\infty)$ is a matrix embedding potential if whenever $A_1,\dots,A_r,B_1,\dots,B_s\in M_d(K)$ , we have

$N(A_1,\dots,A_r)=N(A_{\sigma(1)},\dots,A_{\sigma(r)})$ for each permutation $\sigma$ ,
$N(\lambda A_1,\dots,\lambda A_r)=|\lambda|\cdot N(A_1,\dots,A_r)$ whenever $\lambda\in K$ ,
$N(UA_1U^*,\dots,UA_rU^*)=N(A_1,\dots,A_r)$ whenever $K=\mathbb{C}$ and $U$ is unitary or $K=\mathbb{R}$ and $U$ is orthogonal, and
there is some $\alpha>0$ where if $A_1=\dots=A_r$ , then $N(A_1,\dots,A_r,B_1,\dots B_s)=N(A_1\cdot r^\alpha,B_1,\dots,B_s)$ . The constant $\alpha>0$ shall be called the amplification degree.

If $A_1,\dots,A_r\in M_d(K)$ , then define the superoperator $\Phi(A_1,\dots,A_r):M_d(K)\rightarrow M_d(K)$ by letting $\Phi(A_1,\dots,A_r)(X)=A_1XA_1^*+\dots+A_rXA_r^*$ . We say that a matrix embedding potential $N$ is superoperator invariant if $N(A_1,\dots,A_r)=N(B_1,\dots,B_r)$ whenever $\Phi(A_1,\dots,A_r)=\Phi(B_1,\dots,B_r).$ We say that a matrix embedding potential $N$ is torus invariant if $N(\lambda_1A_1,\dots,\lambda_rA_r)=N(A_1,\dots,A_r)$ whenever $|\lambda_1|=\dots=|\lambda_r|=1.$ For example, every superoperator invariant matrix embedding potential is also torus invariant. We say that a matrix embedding potential $N(A_1,\dots,A_r)$ is a polarizer if it is not torus invariant.

Recall that $\rho(A)$ denotes the spectral radius of $A$ and $\|A\|_p$ denotes the Schatten $p$ -norm of $A$ .

We have the following examples of matrix embedding potentials:

Example 0: $N_0(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$ is a superoperator invariant matrix embedding potential.

Example 1: $N_{1,p}(A_1,\dots,A_r)=\|A_1A_1^*+\dots+A_rA_r^*\|_p^{1/2}$ is another example of a torus invariant matrix embedding potential for $1\leq p\leq\infty$ (we should choose $p>1$ to ensure that the MPO word embedding is not a constant function).

Example 2: $N_{2,p,q}(A_1,\dots,A_r)=\|(A_1A_1^*)^q+\dots+(A_rA_r^*)^q\|_p^{1/2}$ is a torus invariant matrix embedding potential whenever $1\leq p\leq\infty$ and $1/2\leq q\leq\infty$ . As before, for MPO word embeddings, we must select $p>1$ to ensure that the matrix embedding potential is not a constant function.

Example 3: $N_{3,p,q,\alpha,\beta}(A_1,\dots,A_r)=\|\alpha(A_1A_1^*)^q+\dots+\alpha(A_rA_r^*)^q+\beta(A_1A_1^*)^q+\dots+\beta(A_rA_r^*)^q\|_p^{1/2}\|$ is a torus invariant matrix embedding potential whenever $1\leq p\leq\infty$ and $1/2\leq q\leq\infty$ . We will leave it to the reader to generalize this example to obtain other torus invariant matrix embedding potentials.

Example 4: $M_{0,p}(A_1,\dots,A_r)=\|A_1+\dots+A_r\|_p$ is a polarizer for $1\leq p\leq\infty$ .

Example 5: $M_1(A_1,\dots,A_r)=\rho(A_1+\dots+A_r)$ is a polarizer.

If $N$ is a matrix embedding potential, $a_1\dots a_n\in A^*$ , and $f:A\rightarrow M_d(K)$ is a function, then define $L^A(N,a_1\dots a_n,f)=\frac{\rho(\overline{f}(a_1\dots a_n))^{1/n}}{N((f(a))_{a\in A})}$ and define $L(N,a_1\dots a_n,f)=\frac{\rho(\overline{f}(a_1\dots a_n))^{1/n}}{N((f(a_1),\dots,f(a_n))))}.$ We observe that if $u,v\in K^d$ , then $L(N,a_1\dots a_n,f)$ can be approximated with the expression $\frac{(u^*\overline{f}(a_1\dots a_n)v)^{1/n}}{N((f(a_1),\dots,f(a_n))))}$ or with $\frac{\|\overline{f}(a_1\dots a_n)\|^{1/n}}{N((f(a_1),\dots,f(a_n))))}$ for any matrix norm $\|\cdot\|.$

We say that a function $f:A\rightarrow M_d(K)$ is an amplified MPO word pre-embedding for the matrix embedding potential $N$ and corpus $a_1\dots a_n$ if the quantity $L^A(N,a_1\dots a_n,f)$ is a local maximum with respect to the function $f$ . We say that $f:A\rightarrow M_d(K)$ is a normalized MPO word pre-embedding for $N$ and $a_1\dots a_n$ if the quantity $L(N,a_1\dots a_n,f)$ is a local maximum with respect to the function $f$ . Let $\alpha$ denote the amplification degree of the matrix embedding potential $N$ . Suppose that the token $a$ appears $n_a$ times in the string $a_1\dots a_n$ and $f,g:A\rightarrow M_d(K)$ are functions such that $g(a)=n_a^\alpha\cdot f(a)$ for $a\in A$ . Then $g$ is an amplified MPO word pre-embedding if and only if $f$ is a normalized MPO word pre-embedding for $N$ and $a_1\dots a_n$ . Unless otherwise specified, we shall assume that each MPO word pre-embedding is a normalized MPO word pre-embedding.

A final rotation–

Suppose that $f,g:A\rightarrow M_d(K)$ are functions and $f$ is an MPO word pre-embeddings for the matrix embedding potential $N$ and corpus $a_1\dots a_n$ , and there are constants $\lambda_a\in Z(K)$ with $g(a)=\lambda_a f(a)$ for $a\in A$ . Then $g$ will also be an MPO word pre-embedding. We may therefore select the free parameters $\lambda_a\in Z(K)$ for optimal performance.

A naive approach to selecting these parameters $\lambda_a$ would be to select some select some invariant $I$ (like the trace or determinant) such that $I(g(a))$ is a real number for all $a\in A$ . Since the matrices $a_1,\dots,a_n$ can be approximated by matrices of a low rank, we should not let $I$ be the determinant. We do not recommend simply letting $I$ be the trace either because the mapping $a\mapsto a\cdot\frac{|\text{tr}(a)|}{\text{tr}(a)}$ has a singularity of high dimension. In fact, the union of the zero set of $I$ with the set of points where $a\mapsto \frac{|I(a)|}{I(a)}$ is discontinuous is always has large dimension, so the function $a\mapsto a\cdot\frac{|I(a)|}{I(a)}$ always has a large singularity. If $Z$ is a subset of $\mathbb{RP}^n$ and $\iota:\mathbb{RP}^n\setminus Z\rightarrow S^n$ is a continuous function where $\iota([x])=x$ or $\iota([x])=-x$ whenever $x\in S^n,[x]\in\mathbb{RP}^n\setminus Z$ , then $Z$ must intersect each loop $\ell:[0,1]\rightarrow\mathbb{RP}^n$ with $\ell(0)=\ell(1)$ which is not homotopy equivalent to the null loop. In other words, the set $Z$ is necessarily a large set. One can make a similar observation for the complex projective spaces.

A more reasonable choice would be to select the constants $\lambda_a\in Z(K)$ for $a\in A$ so that the function $g$ maximizes or minimizes some quantity. One disadvantage of this approach is that in order to apply gradient ascent or descent effectively to optimize the fitness of $g$ , the field $K$ should be the field of complex numbers. Fortunately, if $K=\mathbb{R}$ , one can ensure that the sign of $g(a)$ is correct by using a proper initialization, and if $K=\mathbb{R}$ , then we shall call any MPO word pre-embedding an MPO word embedding; unfortunately, we are not able to perform this technique when $K=\mathbb{H}$ .

Suppose that $K=\mathbb{C}$ and $f:A\rightarrow M_d(K)$ is an MPO word pre-embedding with respect to the matrix embedding potential $N$ and corpus $a_1\dots a_n$ . If $N$ is a polarizer, then we say that $f$ is an MPO word pre-embedding. Suppose now that $N$ is torus invariant and $M$ is a polarizer. Let $\lambda_a\in Z(K)$ for $a\in A$ , and define a mapping $g:A\rightarrow M_d(K)$ by letting $g(a)=\lambda_a\cdot f(a)$ for $a\in A$ . Then we say that $g$ is an MPO word embedding with respect to the polarizer $M$ if $L(M,a_1\dots a_n,g)$ is locally maximized with respect to $(\lambda_a)_{a\in A}$ .

The MPO word embeddings with respect to the polarizer $M_{0,2}$ can not only be described as a function that maximizes the norm of the sum, but the word embeddings equivalently minimize the spring energy and maximize the real part of the sum of the dot product. Let $I$ be a finite totally ordered set.

We observe that $\|\sum_{i\in I}x_i\|_2^2=\sum_{i\in I}\|x_i\|_2^2+2\sum_{i<j}\text{Re}\langle x_i,x_j\rangle$ and

$\sum_{i<j}\|x_i-x_j\|_2^2=(2(|I|-1)\sum_{i}\|x_i\|_2^2)-\sum_{i<j}2\text{Re}\langle x_i,x_j\rangle.$

Observation: Let $A$ be a finite set. Let $H$ be a complex Hilbert space. Let $\simeq$ be the equivalence relation on $H$ where we set $x\simeq y$ if and only if there is some $\lambda$ with $|\lambda|=1$ and $\lambda x=y.$ Suppose that $u_a\in H/\simeq$ for $a\in A.$ Let $T$ be the collection of all $(y_a)_{a\in A}$ where $[y_a]=u_a$ for $a\in A$ .

The norm of the sum $\|\sum_{a\in A}y_a\|$ is locally maximized for $(y_a)_{a\in A}\in T$ .

2. The sum $\sum_{i<j}\text{Re}(\langle y_i,y_j\rangle)$ is locally maximized for $(y_a)_{a\in A}\in T$ .

3. The spring energy $\sum_{i<j}\|y_i-y_j\|_2^2$ is locally minimized for $(y_a)_{a\in A}\in T$ .

Positive MPO word embeddings-If $f:A\rightarrow M_d(K)$ is an MPO word pre-embedding, then we shall say that the mapping $a\mapsto (f(a)^*f(a),f(a)f(a)^*)$ is a positive MPO word embedding. We observe that if $g:A\rightarrow M_d(K)$ is another MPO word pre-embedding where $g(a)=\lambda_af(a)$ for $a\in A$ , then $latex $(f(a)^*f(a),f(a)f(a)^*)=(g(a)^*g(a),g(a)g(a)^*)$ . It is therefore easier to train a positive MPO word embedding than to train an MPO word embedding. Since the mapping $f(a)\mapsto (f(a)^*f(a),f(a)f(a)^*)$ is injective almost everywhere, the MPO word pre-embedding $f$ can be recovered from the positive MPO word embedding $a\mapsto(f(a)^*f(a),f(a)f(a)^*)$ . We believe that machine learning models may have an easier time working with positive MPO word embeddings than with MPO word embeddings.

Reduced rank– Computer experiments indicate that if $f:A\rightarrow M_d(K)$ is an MPO word embedding, then the matrices of the form $f(a)$ can be approximated by matrices of low rank. If we bound the rank of the matrix $f(a)$ by some value $c_a$ where $c_a\leq d$ , then we can reduce the amount of computation required to compute an approximate MPO word embedding. To do this, we can set $f(a)=g(a)h(a)$ where $g(a)\in M_{d,c_a}(K),h(a)\in M_{c_a,d}(K)$ for $a\in A$ and compute the MPO word embedding while only needing to keep $g,h$ in memory in order to save on computational resources.

Basic text generation and evaluation: We say that a string $a_1\dots a_r$ is locally grammatically correct of window size $N$ if $a_{i+1}\dots a_{i+N}$ is a a substring of tokens of grammatically correct text whenever $0\leq i\leq r-N$ . One can use MPO word embeddings to evaluate whether a string of words is similar to substrings of the corpus and are likely to be locally grammatically correct. By using simulated annealing or evolutionary computation to maximize a coefficient rating the local grammatical correctness, one can automatically generate locally grammatically correct strings and sometimes complete sentences.

If $f:A\rightarrow M_d(K)$ is a function, then let $\|\cdot\|$ be a submultiplicative orthogonal invariant matrix norm. Then let $f_{\|\cdot\|}:A\rightarrow M_d(K)$ be the function defined by letting $f_{\|\cdot\|}(a)=\frac{f(a)}{\|f(a)\|}$ whenever $a\in A$ . Let $a_1\dots a_r\in A^*$ . Then $\|\overline{f_{\|\cdot\|}}(a_1\dots a_r)\|$ is a real number in the interval $[0,1]$ . Define $NM(f)(b_1\dots b_r)=\|\overline{f_{\|\cdot\|}}(b_1\dots b_r)\|^{1/(r-1)}$ . Then $NM(f)(b_1\dots b_r)$ will always be a real number in the interval $[0,1]$ . If $NM(f)(a_1\dots a_r)\geq MN(f)(b_1\dots b_s)$ , then the string $a_1\dots a_r$ is more likely to be locally grammatically correct text than the string $b_1\dots b_s$ . One can use techniques such as simulated annealing to maximize $NM(f)(b_1\dots b_r)$ and when $NM(f)(b_1\dots b_r)$ is maximized, the string $b_1\dots b_r$ will resemble grammatically correct text. We will leave it to the reader to figure out how neural networks trained to produce the CBOW word embedding may be used to evaluate the local grammatical correctness of text. It should therefore be unsurprising to see a word embedding be used to easily evaluate the local grammatical correctness of text.

Distance between words: Given a word embedding $f:A\rightarrow M_d(K)$ , one defines the absolute cosine similarity between $a,b\in A$ as $\frac{|\langle f(a),f(b)\rangle|}{\|f(a)\|_2\cdot\|f(b)\|_2}$ . In our experience, if the matrix embedding potential is the function $N_0$ and the dimension $d$ is too large, then words that appear similar to humans typically do not have very high absolute cosine similarity. Fortunately, by using a more suitable matrix embedding potential such as $N_{1,p}$ or $N_{2,p,q}$ for carefully selected constants $p,q$ , words that appear similar to humans will have a high absolute cosine similarity. The MPO word embeddings with the matrix embedding potential $N_0$ can still be used to generate locally grammatically correct strings of tokens even though MPO word embeddings with the matrix embedding potential $N_0$ often fail to give words that humans regard as similar a high absolute cosine similarity. We believe that the matrix embedding potential $N_0$ behaves poorly since if $f:A\rightarrow M_d(K)$ is an amplified MPO word embedding with respect to $N_0$ , then there is some matrix $C$ and other amplified MPO word embedding $g:A\rightarrow M_d(K)$ with respect to $N_0$ where $g(a)=C\cdot f(a)\cdot C^{-1}$ for $a\in A$ and where $\sum_{a\in A}g(a)g(a)^*=1_d$ . The requirement for $f$ to be similar to a function $g$ that uses all dimensions of $K^d$ equally is too stringent and needs to be relaxed.

Contextual embeddings: Suppose that $f:A\rightarrow M_d(K)$ is an MPO word embedding. Suppose now that $b_1\dots b_m\in A^*$ is a string. Let $u$ be a row vector of length $d$ and let $v$ be a column vector of length $d$ . Then let $u_0=u$ , and let $u_i=u_{i-1}\cdot f(b_i)$ for $1\leq i\leq m$ . Then we say that $(u_1,\dots,u_m)$ is an MPO left contextual embedding of the string $b_1\dots b_m$ with starting vector $u$ and we say that $u_i$ is an MPO left contextual embedding for position $i$ of $b_1\dots b_m$ for $0\leq i\leq m$ . Similarly, let $v_{m+1}=v$ and let $v_{i}=f(b_i)\cdot v_{i+1}$ for $1\leq i\leq m$ . Then we say that $(v_1,\dots,v_m)$ is an MPO right contextual embedding of the string $b_1\dots b_m$ with respect to the ending vector $v$ , and we say that $v_i$ is an MPO right context vector for position $i$ whenever $1\leq i\leq m+1$ .

We can obtain a more sophisticated contextual embedding using the singular value decomposition. Let $b_1\dots b_m\in A^*$ . Suppose that $f(b_j)$ has singular value decomposition $\sum_{k=1}^d\sigma_ku_kv_k^*$ . Suppose that the MPO left context vector for the $j-1$ -th position is $\alpha^*$ and the MPO right context vector for the $j+1$ -th position is $\beta$ . Then the loss-free context vector for the $j$ -th position is $(\sigma_k,\langle u_k,\alpha\rangle,\langle \beta,v_k\rangle,u_k,v_k)_{k=1}^d$ . The loss-free context vector for the $j$ -th position not only encodes the matrix $f(b_j)$ , but it also encodes the context of $b_j$ so that the machine can determine the particular meaning of the token $b_j$ .

The $L_{2,d}$ -spectral radius of a collection of matrices $(A_1,\dots,A_r)$ is the maximum value of the form $\frac{\rho(A_1\otimes X_1+\dots+A_r\otimes X_r)}{\rho(\Phi(X_1,\dots,X_r))^{1/2}}$ . We say that a collection $(X_1,\dots,X_r)\in M_d(\mathbb{C})^r$ is an $L_{2,d}$ -spectral radius dimensionality reduction LSRDR if $(X_1,\dots,X_r)\mapsto\frac{\rho(A_1\otimes X_1+\dots+A_r\otimes X_r)}{\rho(\Phi(X_1,\dots,X_r))^{1/2}}$ is locally maximized.

Suppose that $a_1\dots a_n\in A^*$ . For each $a\in A$ , let $B_a$ be the $n\times n$ -matrix where all the entries are zero except for the entries of the form $(i,i+1)$ where $a_i=a$ . Then $\rho(\overline{f}(a_1\dots a_n))^{1/n}=\rho(\sum_{a\in A}f(a)\otimes B_a).$ Therefore, $f:A\rightarrow M_d(\mathbb{C})^r$ is an amplified MPO word pre-embedding for the sequence $a_1\dots a_n$ with respect to the matrix embedding potential $N_0$ if and only if $f$ is an LSRDR for the system of matrices $(B_a)_{a\in A}$ . One should therefore consider the notion of the MPO word embedding as an adaptation of the notion of the $L_{2,d}$ -spectral radius dimensionality reduction.

Graph embeddings- MPO word embeddings can be used to construct graph embeddings in multiple ways. From any directed or undirected graph, one can obtain a Markov chain simply by taking a random walk on such a graph, and from such a Markov chain, one can obtain sequences and then obtain MPO word embeddings from these sequences.

Let $X,A$ be finite sets. Suppose that $G:X\rightarrow A$ is a surjective function and let $(\mathcal{X}_n)_{n=0}^{\infty}$ be a Markov chain with underlying set $X$ . Then for all $N>0$ , we have $G(\mathcal{X}_0)\dots G(\mathcal{X}_N)\in A^*$ , so $G(\mathcal{X}_0)\dots G(\mathcal{X}_N)$ induces an MPO word embedding $f:A\rightarrow M_d(K)$ which we shall call an MMPO (Markov matrix product optimized) word embedding induced by the Markov chain $(\mathcal{X}_n)_{n=0}^{\infty}$ and mapping $f:X\rightarrow A$ . If $X=A$ , $G$ is the identity function, and $X$ is the set of vertices of a graph, and the Markov chain $(\mathcal{X}_n)_{n=0}^{\infty}$ is a random walk on this graph, then the MMPO word embeddings induced by $(\mathcal{X}_n)_{n=0}^{\infty}$ and $G$ are graph embeddings.

We may also use MPO word embeddings to produce graph embeddings without needing to compute random walks on graphs. Let $V$ be a set. Let $E=\{\{u,v\}\mid u,v\in V\}$ . and let $w:E\rightarrow [0,1]$ be a function with $\sum_{e\in E}w(e)=1$ which we shall call a weight function. Now consider the Markov chain $(\mathcal{X}_n)_{n=0}^{\infty}$ where the state space of each $\mathcal{X}_n$ consists of pairs $(\{u,v\},u)$ where $u,v\in V$ . If $\mathcal{X}_n=(\{u,v\},u)$ , then set $\mathcal{X}_{n+1}=(\{u,v\},v)$ with probability $1-\epsilon$ and set $\mathcal{X}_{n+1}=(\{a,b\},a)$ with probability $w(\{a,b\})\cdot\epsilon/2$ whenever $a\neq b$ , and set $\mathcal{X}_{n+1}=(\{a\},a)$ with probability $w(\{a\})\cdot\epsilon.$

Let $m,q:V\rightarrow[0,\infty)$ be two functions ( $m,q$ may either be fixed during training or trainable), and let $N$ be a matrix embedding potential. If $f:V\rightarrow M_d(K)$ is a function, then let $L(N,m,w,f)=\frac{\prod_{\{u,v\}\in E}\rho(f(u)f(v))^{w(\{u,v\})}}{N((m(v)\cdot f(v))_{v\in V})}.$ We say that a function $f$ is a limit induced MPO pre-embedding if $f\mapsto L(N,m,w,f)$ is locally maximized. If $N$ is a polarizer, then we shall say that $f$ is a limit induced MPO pre-embedding. Suppose that $N$ is torus invariant, and $M$ is a polarizer. Suppose that $f$ is a limit induced MPO pre-embedding with respect to the matrix embedding potential $N$ . Then let $\lambda_v\in S^1$ for $v\in V$ . Define $g:V\rightarrow M_d(K)$ by letting $g(v)=\lambda_v\cdot f(v)$ for $v\in V$ . Then we say that $g:V\rightarrow M_d(K)$ is a limit induced MPO embedding with respect to the polarizer $M$ if the parameters of the form $\lambda_v$ are chosen so that $M((q(v)\cdot g(v))_{v\in V})$ is locally maximized.

For connected bipartite graphs, we can generalize the notion of a limit induced MPO pre-embedding so that each vertex is associated with a non-square rectangular matrix instead of a square matrix. In this generalization, these non-square rectangular matrices are naturally matrices that cannot be approximated with matrices of lower rank, and since these non-square rectangular matrices are already of maximum rank, one does not need to approximate these matrices with matrices of lower rank.

Suppose that $(V,E)$ is a bipartite graph with bipartition $A,B$ . Suppose that $d,e$ are natural numbers with $e\leq d$ . Then let $f:A\rightarrow M_{d,e}(K),g:B\rightarrow M_{e,d}(K)$ be functions. Then we say that $(f,g)$ is a rectangular bipartite limit induced MPO (RBLIMPO) pre-embedding with respect to the function $N$ if the following quantity is maximized: $\frac{\prod_{a\in A,b\in B,\{a,b\}\in E}\rho(f(a)g(b))^{1/(2\cdot|E|)}}{N(f,g)}.$

Suppose now that $N$ is torus invariant. Then we say that an RBLIMPO pre-embedding $(f,g)$ is an RBLIMPO embedding if $\lambda_a=1,\mu_b=1$ for $a\in A,b\in B$ is a local maximum for the function $T:S_1^A\times S_1^B\rightarrow[0,\infty)$ defined by $T((\lambda_a)_{a\in A},(\mu_b)_{b\in B})=\rho(\sum_{a\in A}\lambda_a f(a)\sum_{b\in B}\mu_b g(b)).$

Crossing near singularities (added 3/5/2023)

When the underlying field is the field of real numbers, MPO word embeddings will behave poorly and gradient ascent will not find a good local optimum with an improper initalization. In order to better illustrate the behavior of real MPO word embeddings, we shall set the fitness function $L:M_d(\mathbb{R})^A\rightarrow[-\infty,\infty)$ to be $L(f)=\frac{(|u\overline{f}(a_1\dots a_n)v|)^{1/n}}{N(a_1,\dots,a_n)}$ for some normalizer $N$ and string $a_1\dots a_n\in A^*$ ; $L$ approximates a fitness function that we normally use for MPO word embeddings. Let $Z$ be the set of all functions $f:A\rightarrow M_d(K)$ where $u\overline{f}(a_1\dots a_n)v=0$ . In other words, $Z$ is the set of all $f:A\rightarrow M_d(K)$ where $L(f)=0$ or where $L(f)$ is undefined. If $K=\mathbb{R}$ , then $M_d(K)\setminus Z$ has many components, and each of these components has a local maximum. Furthermore, by using gradient ascent with a small enough learning rate, one will reach a local maximum in the same component as the initialization, but the component of a poor initialization may not contain a decent word embedding. Furthermore, if the initialization is random, then the trained word embedding will be in a random component which is is suboptimal. One can ensure that the initialization is in the proper component by initially setting each entry in each matrix to be positive. In the case when $K=\mathbb{R}$ , after fully training an MPO word pre-embedding with an initialization with only positive entries, one does not need to further train the MPO word pre-embedding to obtain an MPO word embedding since the matrices of the form $f(a)$ will already have the correct sign and will not need to be multiplied by $\pm 1$ to obtain an MPO word embedding. Every real MPO word pre-embedding whose initialization consists of matrices with only positive entries shall be called an MPO word embedding.

By setting $K=\mathbb{C}$ or $K=\mathbb{H}$ , one can avoid this problem of singularities since the set $Z$ will have real codimension 2 and real codimension 4 in the case when $K=\mathbb{H}$ .

Future posts- Future posts on this site will be about other machine learning algorithms that are similar to the MPO word embedding algorithm and the $L_{2,d}$ -spectral radius. We will also make a post about computing gradients of the spectral radius and other quantities so that computers can use gradient ascent to compute MPO word embeddings, LSRDRs and other objects. In future posts, we shall analyze the performance of MPO word embeddings and other algorithms.

Open directions-

Much research on MPO word embeddings and related ideas is still needed. Here are a few questions about such research.

How would natural language models use matrix-valued, projective, or complex-valued data produced by MPO word embeddings?

How well do MPO word embeddings perform on large collections of text? The largest collection of text which I have used to train a word embedding was the Bible.

What are some other examples of matrix-valued word embeddings? There are probably other kinds of matrix-valued word embeddings that still need to be discovered. These other embeddings will probably be more interpretable and safe than our current word embeddings. One may even be able to modify our current language models so that they use matrix-valued word embeddings instead of vector-valued word embeddings. I conjecture that language models that use matrix-valued word embeddings are more interpretable and safer than corresponding language models that use vector-valued word embeddings.

Post edited 11/26/2023

References:

[GGW] Gao, Junbin and Guo, Yi and Wang, Zhiyong. Matrix Neural Networks. 2016

Share this:

Related

Leave a comment Cancel reply