In this post, we will not talk about cryptography, but we will talk about another application and generalization of the $L_{2,d}$-spectral radius: producing word embeddings and graph embeddings for machine learning applications. We shall call these word embeddings matrix product optimized (MPO) word embeddings. These MPO word embeddings and graph embeddings do not currently require any neural networks to produce, but once these word and graph embeddings are produced, they can be processed using neural networks. Even though the MPO word embeddings are not produced using neural networks, they have features and advantages that no other word embedding algorithm currently has. For example, MPO word embeddings can be used to evaluate the local grammatical correctness of text and to produce text with the help of evolutionary computation or simulated annealing. MPO word embeddings are also the only kind of word embedding that we know of that maps words to matrices, and matrix-valued word embeddings enjoy plenty of advantages over vector-valued word embeddings. MPO word embeddings also behave much more mathematically than other word embeddings; for example, even when training an MPO word embedding over the field of complex numbers, the resulting word embedding (if everything goes right) will be a real matrix-valued function. MPO word embeddings are also more consistent in the sense that the resulting MPO word embedding does not depend on the initialization; any initialization will converge to a nearby MPO word embedding.
In artificial intelligence, and more specifically, natural language processing (NLP), one would want to convert words, characters, punctuation, or other tokens into something that computers can work with, such as vectors in a vector space. A word embedding is typically a function from a set of tokens to a finite dimensional real vector space. With a word embedding, AI systems will therefore be able to process vectors rather than text, but we may be able to improve upon the idea of mapping tokens to elements in a vector space by instead mapping tokens to real or complex matrices. Spaces of real or complex matrices have more structure than plain vector spaces since spaces of matrices are equipped with matrix multiplication, the Frobenius inner product, several orthogonal or unitary invariant submultiplicative norms, many matrix decompositions such as the singular value decomposition, et cetera, and we shall see that this additional structure is very useful for word embeddings in NLP. Matrix multiplication can also be exploited by matrix neural networks [GGW], which are neural networks whose layers are of the form $X\mapsto\phi(UXV+B)$ where $\phi$ is an elementwise activation function and $U,V,B$ are learnable matrices of the appropriate dimensions.
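As a concrete sketch, here is such a layer in Python, assuming the layer form $\phi(UXV+B)$ with an elementwise activation $\phi$ and learnable matrices $U$, $V$, $B$; the shapes below are chosen only for illustration.

```python
import numpy as np

def matrix_layer(X, U, V, B, activation=np.tanh):
    """One layer of a matrix neural network: multiply the input matrix X by a
    learnable matrix U on the left and V on the right, add a bias matrix B,
    and apply an elementwise activation function."""
    return activation(U @ X @ V + B)

# Example: map a 4x5 input matrix to a 3x2 output matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
U = rng.standard_normal((3, 4))   # left weight matrix
V = rng.standard_normal((5, 2))   # right weight matrix
B = rng.standard_normal((3, 2))   # bias matrix
print(matrix_layer(X, U, V, B).shape)  # (3, 2)
```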
Since tokens in natural language often have multiple related and unrelated meanings, a word embedding should be designed so that these meanings can be easily distinguished from each other. While vectors are suitable for encoding a single meaning of a token, matrices are much better suited for encoding multiple meanings of a token since each row (or column) of a matrix could represent a particular meaning of the word.
Suppose that $X$ is a $d\times d$ real or complex matrix. Then by applying a singular value decomposition, we obtain $X=\sum_{k=1}^{d}\sigma_k u_k v_k^*$ where $\sigma_1\geq\dots\geq\sigma_d\geq 0$ are the singular values of $X$ and $(u_1,\dots,u_d),(v_1,\dots,v_d)$ are two orthonormal bases. The matrix $X$ could represent a token while each rank-1 matrix $\sigma_k u_k v_k^*$ could represent a specific meaning of the token, and the singular value $\sigma_k$ represents the prevalence of that meaning of the token. We observe that with MPO word embeddings, the singular value decomposition does indeed decompose a matrix representing a word into a sum of rank-1 matrices, each representing a specific meaning of the word.
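The following short computation illustrates this decomposition numerically; the token matrix here is just a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))            # matrix representing a token

U, s, Vh = np.linalg.svd(X)                # X = sum_k s[k] * outer(U[:, k], Vh[k, :])
rank_one_meanings = [s[k] * np.outer(U[:, k], Vh[k, :]) for k in range(len(s))]

# Each rank-1 term plays the role of one "meaning" of the token, and the
# singular value s[k] measures how prevalent that meaning is.
print(np.allclose(X, sum(rank_one_meanings)))  # True
```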
Now assume that $X$ is a $d\times d$ positive semidefinite matrix. Then the singular value decomposition reduces to the spectral theorem: we have $u_k=v_k$ for all $k$, and we can write $X=\sum_{k=1}^{d}\sigma_k u_k u_k^*$ where $\sigma_1,\dots,\sigma_d$ are the singular values (equivalently, the eigenvalues) of $X$. In this case, $X$ could represent a token while each vector $u_k$ could represent an orthogonal meaning of the token and $\sigma_k$ represents the prevalence of this meaning. In terms of quantum information theory, a token is represented by a mixed state while the specific meanings of the token are represented as pure states, so the token is a superposition of all of its possible meanings.
In this post, $A$ will be a set that represents the set of all tokens, and $K$ denotes either the field of real numbers or the field of complex numbers. If $d$ is a natural number and $f:A\to M_d(K)$ is a function, then we extend $f$ to strings over $A$ by letting $f(a_1\dots a_n)=f(a_1)\cdots f(a_n)$ whenever $a_1,\dots,a_n\in A$.
The general idea behind MPO word embeddings is that given a string $a_1\dots a_n$ over the alphabet $A$ (the string $a_1\dots a_n$ is the corpus that the MPO word embedding trains with) and a natural number $d$, our MPO word embedding of $a_1\dots a_n$ shall be a function $f:A\to M_d(K)$ that maximizes the size of the product $f(a_1)\cdots f(a_n)$. Since one can make this product arbitrarily large simply by rescaling the matrices $f(a)$, we need to normalize the quantity. We therefore want the MPO word embedding to maximize an expression that looks like the size of the product $f(a_1)\cdots f(a_n)$ divided by a normalizing denominator. The denominator should be defined so that the resulting quotient is bounded and cannot be optimized simply by making the matrices $f(a)$ large. The denominator also needs to be designed so that, in a trained MPO word embedding, the matrices $f(a)$ for $a\in A$ will be sufficiently far away from each other. In other words, if the matrices $f(a)$ are similar to each other, then the denominator should be large, but if the matrices $f(a)$ are distributed evenly, then the denominator should be small. In a trained MPO word embedding, since the matrices $f(a)$ are sufficiently far away from each other and relatively small while the product $f(a_1)\cdots f(a_n)$ is relatively large, each matrix $f(a_i)$ must be compatible with its neighboring matrices.
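As a rough illustration of this kind of objective (not the exact fitness function used in this post), the following sketch divides the norm of the matrix product along the corpus by a normalizer built from a placeholder potential applied to each token matrix; the function names and the particular normalizer are assumptions made only for this illustration.

```python
import numpy as np
from collections import Counter

def normalized_product_fitness(f, corpus, potential):
    """Hypothetical sketch of a normalized objective: the (geometrically
    averaged) norm of the matrix product along the corpus, divided by a
    normalizer built from a potential applied to each token's matrix.
    Rescaling every f[a] by the same constant leaves the value unchanged."""
    n = len(corpus)
    product = np.eye(f[corpus[0]].shape[0])
    for token in corpus:
        product = product @ f[token]
    numerator = np.linalg.norm(product, ord=2) ** (1.0 / n)
    denominator = 1.0
    for token, count in Counter(corpus).items():
        denominator *= potential(f[token]) ** (count / n)
    return numerator / denominator

# Toy example: random 4x4 token matrices and the Frobenius norm as a
# stand-in potential.
rng = np.random.default_rng(0)
corpus = list("abcabcab")
f = {t: rng.standard_normal((4, 4)) / 2 for t in set(corpus)}
print(normalized_product_fitness(f, corpus, np.linalg.norm))
```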
To illustrate how the size of the matrix product corresponds to the compatibility of each of the matrices $f(a_i)$ with its neighboring matrices, suppose that $f(a_1),\dots,f(a_n)$ are rank-1 matrices, so that for all $i$ we have a factorization $f(a_i)=u_i v_i^*$ for some vectors $u_i,v_i$. Then $f(a_1)\cdots f(a_n)=\left(\prod_{i=1}^{n-1}\langle v_i,u_{i+1}\rangle\right)u_1 v_n^*$, and the trace of this product is $\prod_{i}\langle v_i,u_{i+1}\rangle$ where addition in the index is taken modulo $n$. Therefore, this quantity is non-zero if and only if for each $i$, the inner product $\langle v_i,u_{i+1}\rangle$ is non-zero, where addition is taken modulo $n$. Furthermore, $|\langle v_i,u_{i+1}\rangle|$ is a measure of compatibility between the matrix $f(a_i)$ and its neighboring matrices.
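We can check this collapse numerically by comparing the trace of a product of random rank-1 matrices with the corresponding cyclic product of inner products; the code below is only a small sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
us = [rng.standard_normal(d) for _ in range(n)]
vs = [rng.standard_normal(d) for _ in range(n)]

# Product of the rank-1 matrices u_j v_j^T.
P = np.eye(d)
for u, v in zip(us, vs):
    P = P @ np.outer(u, v)

# The trace of the product collapses to a product of inner products
# <v_j, u_{j+1}>, with the index j+1 taken modulo n.
overlaps = [vs[j] @ us[(j + 1) % n] for j in range(n)]
print(np.isclose(np.trace(P), np.prod(overlaps)))  # True
```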
We say that a function is a matrix embedding potential if whenever
, we have
for each permutation
,
whenever
,
whenever
and
is unitary or
and
is orthogonal, and
there is some
where if
, then
. The constant
shall be called the amplification degree.
If , then define the superoperator
by letting
. We say that a matrix embedding potential
is superoperator invariant if
whenever
We say that a matrix embedding potential
is torus invariant if
whenever
In particular, every superoperator invariant matrix embedding potential is also torus invariant. We say that a matrix embedding potential
is a polarizer if it is not torus invariant.
Recall that $\rho(X)$ denotes the spectral radius of $X$ and $\|X\|_p$ denotes the Schatten $p$-norm of $X$ (a short computational sketch of these quantities follows the examples below).
We have the following examples of matrix embedding potentials:
Example 0: is a superoperator invariant matrix embedding potential.
Example 1: is another example of a torus invariant matrix embedding potential for
(we should choose
to ensure that the MPO word embedding is not a constant function).
Example 2: is a torus invariant matrix embedding potential whenever
and
. As before, for MPO word embeddings, we must select
to ensure that the MPO word embedding is not a constant function.
Example 3: is a torus invariant matrix embedding potential whenever
and
. We will leave it to the reader to generalize this example to obtain other torus invariant matrix embedding potentials.
Example 4: is a polarizer for
.
Example 5: is a polarizer.
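For reference, the spectral radius and the Schatten norms recalled above can be computed as in the following sketch (the function names are ours).

```python
import numpy as np

def spectral_radius(X):
    """rho(X): the largest absolute value of an eigenvalue of X."""
    return np.max(np.abs(np.linalg.eigvals(X)))

def schatten_norm(X, p):
    """Schatten p-norm of X: the l^p norm of the vector of singular values."""
    return np.sum(np.linalg.svd(X, compute_uv=False) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
print(spectral_radius(X), schatten_norm(X, 2))  # Schatten 2-norm = Frobenius norm
```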
If is a matrix embedding potential,
, and
is a function, then define
and define
We observe that if
, then
can be approximated with the expression
or with
for any matrix norm $\|\cdot\|$.
We say that a function is an amplified MPO word pre-embedding for the matrix embedding potential
and corpus
if the quantity
is a local maximum with respect to the function
. We say that
is a normalized MPO word pre-embedding for
and
if the quantity
is a local maximum with respect to the function
. Let
denote the amplification degree of the matrix embedding potential
. Suppose that the token
appears
times in the string
and
are functions such that
for
. Then
is an amplified MPO word pre-embedding if and only if
is a normalized MPO word pre-embedding for
and
. Unless otherwise specified, we shall assume that each MPO word pre-embedding is a normalized MPO word pre-embedding.
A final rotation- Suppose that are functions and
is an MPO word pre-embedding for the matrix embedding potential
and corpus
, and there are constants
with
for
. Then
will also be an MPO word pre-embedding. We may therefore select the free parameters
for optimal performance.
A naive approach to selecting these parameters would be to select some invariant
(like the trace or determinant) such that
is a real number for all
. Since the matrices
can be approximated by matrices of a low rank, we should not let
be the determinant. We do not recommend simply letting
be the trace either because the mapping
has a singularity of high dimension. In fact, the union of the zero set of
with the set of points where
is discontinuous always has large dimension, so the function
always has a large singularity. If
is a subset of
and
is a continuous function where
or
whenever
, then
must intersect each loop
with
which is not homotopy equivalent to the null loop. In other words, the set
is necessarily a large set. One can make a similar observation for the complex projective spaces.
A more reasonable choice would be to select the constants for
so that the function
maximizes or minimizes some quantity. One disadvantage of this approach is that in order to apply gradient ascent or descent effectively to optimize the fitness of
, the field
should be the field of complex numbers. Suppose that
is an MPO word pre-embedding with respect to the matrix embedding potential
and corpus
. If
is a polarizer, then we say that
is an MPO word embedding. Suppose now that
is torus invariant and
is a polarizer. Let
for
, and define a mapping
by letting
for
. Then we say that
is an MPO word embedding with respect to the polarizer
if
is locally maximized with respect to
.
The MPO word embeddings with respect to the polarizer can not only be described as functions that maximize the norm of a sum; these word embeddings equivalently minimize the spring energy and maximize the real part of the sum of the inner products (a numeric check of the identity behind this equivalence follows the observation below). Let
be a finite totally ordered set.
We observe that and
Observation: Let be a finite set. Let
be a complex Hilbert space. Let
be the equivalence relation on
where we set
if and only if there is some
with
and
Suppose that
for
Let
be the collection of all
where
for
.
1. The norm of the sum
is locally maximized for
.
2. The sum is locally maximized for
.
3. The spring energy is locally minimized for
.
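Here is a quick numeric check of the identity behind this equivalence, assuming the spring energy is the sum of squared pairwise distances (our reading of the term): when the individual norms are held fixed, minimizing that energy is the same as maximizing the norm of the sum, which in turn is the same as maximizing the real part of the sum of the inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
ys = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))

norm_of_sum_sq = np.linalg.norm(ys.sum(axis=0)) ** 2
sum_of_norms_sq = np.sum(np.linalg.norm(ys, axis=1) ** 2)
spring_energy = sum(np.linalg.norm(ys[j] - ys[k]) ** 2
                    for j in range(n) for k in range(n))

# Identity: sum_{j,k} ||y_j - y_k||^2 = 2*n*sum_j ||y_j||^2 - 2*||sum_j y_j||^2.
# With the individual norms ||y_j|| held fixed, minimizing the spring energy is
# therefore the same as maximizing the norm of the sum (equivalently, the real
# part of the sum of the inner products <y_j, y_k>).
print(np.isclose(spring_energy, 2 * n * sum_of_norms_sq - 2 * norm_of_sum_sq))
```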
Reduced rank– Computer experiments indicate that if is an MPO word embedding, then the matrices of the form
can be approximated by matrices of low rank. If we bound the rank of the matrix
by some value
where
, then we can reduce the amount of computation required to compute an approximate MPO word embedding. To do this, we can set
where
for
and compute the MPO word embedding while only needing to keep
in memory in order to save on computational resources.
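A minimal sketch of this low-rank parametrization (with hypothetical factor names): each token matrix is stored as a product of two thin factors, so only $2dr$ entries per token are kept in memory instead of $d^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4                  # ambient dimension and rank bound, r << d
tokens = ["the", "cat", "sat"]

# Store each token matrix as a product of two thin factors f(a) = X_a @ Y_a,
# so only 2*d*r numbers are kept per token instead of d*d.
factors = {a: (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for a in tokens}

def embed(token):
    X_a, Y_a = factors[token]
    return X_a @ Y_a          # the rank of f(token) is at most r

print(np.linalg.matrix_rank(embed("cat")) <= r)  # True
```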
Basic text generation and evaluation: We say that a string is locally grammatically correct of window size
if
is a substring of tokens of grammatically correct text whenever
. One can use MPO word embeddings to evaluate whether a string of words is similar to substrings of the corpus and is likely to be locally grammatically correct. By using simulated annealing or evolutionary computation to maximize a coefficient rating the local grammatical correctness, one can automatically generate locally grammatically correct strings and sometimes complete sentences.
If is a function, then let
be a submultiplicative orthogonal invariant matrix norm. Then let
be the function defined by letting
whenever
. Let
. Then
is a real number in the interval
. Define
. Then
will always be a real number in the interval
. If
, then the string
is more likely to be locally grammatically correct text than the string
. One can use techniques such as simulated annealing to maximize
and when
is maximized, the string
will resemble grammatically correct text. We will leave it to the reader to figure out how neural networks trained to produce CBOW word embeddings may be used to evaluate the local grammatical correctness of text; it should therefore be unsurprising to see a word embedding used to easily evaluate the local grammatical correctness of text.
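As a rough sketch of this kind of evaluation, one can score each window of tokens by the norm of the matrix product divided by the product of the norms of the individual matrices; this ratio lies in $[0,1]$ by submultiplicativity, and the window scores can then be aggregated (the geometric-mean aggregation below is just one plausible choice) and maximized with simulated annealing or evolutionary computation.

```python
import numpy as np

def window_score(f, window):
    """Submultiplicativity gives ||f(a_1)...f(a_k)|| <= ||f(a_1)||...||f(a_k)||,
    so this ratio lies in [0, 1]; larger values suggest the window looks more
    like the windows seen in the training corpus."""
    product = f[window[0]]
    denominator = np.linalg.norm(f[window[0]], ord=2)
    for token in window[1:]:
        product = product @ f[token]
        denominator *= np.linalg.norm(f[token], ord=2)
    return np.linalg.norm(product, ord=2) / denominator

def text_score(f, tokens, k=3):
    """Aggregate the window scores over all windows of size k (a geometric mean
    is used here as one plausible choice); a candidate string can then be
    improved by simulated annealing or evolutionary search on this score."""
    scores = [window_score(f, tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return float(np.exp(np.mean(np.log(scores))))

# Toy example with random token matrices.
rng = np.random.default_rng(0)
f = {t: rng.standard_normal((4, 4)) for t in "abc"}
print(text_score(f, list("abcab")))
```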
Distance between words: Given a word embedding , one defines the absolute cosine similarity between
as
. In our experience, if the matrix embedding potential is the function
and the dimension
is too large, then words that appear similar to humans typically do not have very high absolute cosine similarity. Fortunately, by using a more suitable matrix embedding potential such as
or
for carefully selected constants
, words that appear similar to humans will have a high absolute cosine similarity. The MPO word embeddings with the matrix embedding potential
can still be used to generate locally grammatically correct strings of tokens even though MPO word embeddings with the matrix embedding potential
often fail to give words that humans regard as similar a high absolute cosine similarity. We believe that the matrix embedding potential
behaves poorly since if
is an amplified MPO word embedding with respect to
, then there is some matrix
and another amplified MPO word embedding
with respect to
where
for
and where
. The requirement for
to be similar to a function
that uses all dimensions of
equally is too stringent and needs to be relaxed.
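Assuming the absolute cosine similarity between token matrices is taken with respect to the Frobenius inner product (the natural choice given the structure discussed at the beginning of this post), it can be computed as in the following sketch.

```python
import numpy as np

def abs_cosine_similarity(A, B):
    """Absolute cosine similarity between two token matrices, computed with the
    Frobenius inner product <A, B> = trace(A^* B) and the Frobenius norm."""
    return abs(np.vdot(A, B)) / (np.linalg.norm(A) * np.linalg.norm(B))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))
print(abs_cosine_similarity(A, B))   # a value in [0, 1]
```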
Contextual embeddings: Suppose that is an MPO word embedding. Suppose now that
is a string. Let
be a row vector of length
and let
be a column vector of length
. Then let
, and let
for
. Then we say that
is an MPO left contextual embedding of the string
with starting vector
and we say that
is an MPO left context vector for position
of
for
. Similarly, let
and let
for
. Then we say that
is an MPO right contextual embedding of the string
with respect to the ending vector
, and we say that
is an MPO right context vector for position
whenever
.
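A sketch of these contextual embeddings in code; the renormalization at each step is an assumption made here to keep the context vectors bounded.

```python
import numpy as np

def left_context_vectors(f, tokens, u):
    """MPO left contextual embeddings: push a starting row vector u through the
    token matrices from left to right, renormalizing at each step."""
    contexts, v = [], u / np.linalg.norm(u)
    for t in tokens:
        v = v @ f[t]
        v = v / np.linalg.norm(v)
        contexts.append(v)
    return contexts

def right_context_vectors(f, tokens, w):
    """MPO right contextual embeddings: push an ending column vector w through
    the token matrices from right to left, renormalizing at each step."""
    contexts, v = [], w / np.linalg.norm(w)
    for t in reversed(tokens):
        v = f[t] @ v
        v = v / np.linalg.norm(v)
        contexts.append(v)
    return list(reversed(contexts))

# Toy example.
rng = np.random.default_rng(0)
d = 4
f = {t: rng.standard_normal((d, d)) for t in "abc"}
lefts = left_context_vectors(f, list("abcab"), rng.standard_normal(d))
rights = right_context_vectors(f, list("abcab"), rng.standard_normal(d))
print(lefts[2], rights[2])
```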
We can obtain a more sophisticated contextual embedding using the singular value decomposition. Let . Suppose that
has singular value decomposition
. Suppose that the MPO left context vector for the
-th position is
and the MPO right context vector for the
-th position is
. Then the loss-free context vector for the
-th position is
. The loss-free context vector for the
-th position not only encodes the matrix
, but it also encodes the context of
so that the machine can determine the particular meaning of the token
.
The $L_{2,d}$-spectral radius of a collection of matrices
is the maximum value of the form
. We say that a collection
is an
$L_{2,d}$-spectral radius dimensionality reduction (LSRDR) if
is locally maximized.
Suppose that . For each
, let
be the
-matrix where all the entries are zero except for the entries of the form
where
. Then
Therefore,
is an amplified MPO word pre-embedding for the sequence
with respect to the matrix embedding potential
if and only if
is an LSRDR for the system of matrices
. One should therefore consider the notion of the MPO word embedding as an adaptation of the notion of the $L_{2,d}$-spectral radius dimensionality reduction.
Graph embeddings- MPO word embeddings can be used to construct graph embeddings in multiple ways. From any directed or undirected graph, one can obtain a Markov chain simply by taking a random walk on such a graph, and from such a Markov chain, one can obtain sequences and then obtain MPO word embeddings from these sequences.
Let be finite sets. Suppose that
is a surjective function and let
be a Markov chain with underlying set
. Then for all
, we have
, so
induces an MPO word embedding
which we shall call an MMPO (Markov matrix product optimized) word embedding induced by the Markov chain
and mapping
. If
,
is the identity function, and
is the set of vertices of a graph, and the Markov chain
is a random walk on this graph, then the MMPO word embeddings induced by
and
are graph embeddings.
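A minimal sketch of this construction: generate a random walk on a graph and treat the walk as a corpus, so that the trained token matrices serve as a graph embedding. The adjacency-list format and the function name are only for illustration.

```python
import numpy as np

def random_walk_corpus(adjacency, length, rng):
    """Generate a token sequence by taking a random walk on a graph; the walk
    can then be fed to the MPO training procedure exactly like a text corpus,
    so the learned token matrices serve as a graph embedding."""
    walk = [int(rng.choice(list(adjacency)))]
    for _ in range(length - 1):
        walk.append(int(rng.choice(adjacency[walk[-1]])))
    return walk

# Example: a random walk on a 4-cycle.
adjacency = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = np.random.default_rng(0)
print(random_walk_corpus(adjacency, 10, rng))
```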
We may also use MPO word embeddings to produce graph embeddings without needing to compute random walks on graphs. Let be a set. Let
, and let
be a function with
which we shall call a weight function. Now consider the Markov chain
where the state space of each
consists of pairs
where
. If
, then set
with probability
and set
with probability
whenever
, and set
with probability
Let be two functions (
may either be fixed during training or trainable), and let
be a matrix embedding potential. If
is a function, then let
We say that a function
is a limit induced MPO pre-embedding if
is locally maximized. If
is a polarizer, then we shall say that
is a limit induced MPO embedding. Suppose that
is torus invariant, and
is a polarizer. Suppose that
is a limit induced MPO pre-embedding with respect to the matrix embedding potential
. Then let
for
. Define
by letting
for
. Then we say that
is a limit induced MPO embedding with respect to the polarizer
if the parameters of the form
are chosen so that
is locally maximized.
For connected bipartite graphs, we can generalize the notion of a limit induced MPO pre-embedding so that each vertex is associated with a non-square rectangular matrix instead of a square matrix. In this generalization, the non-square rectangular matrices are naturally of maximal rank, so they cannot be approximated with matrices of lower rank, and there is no need to do so.
Suppose that is a bipartite graph with bipartition
. Suppose that
are natural numbers with
. Then let
be functions. Then we say that
is a rectangular bipartite limit induced MPO (RBLIMPO) pre-embedding with respect to the function
if the following quantity is maximized:
Suppose now that is torus invariant. Then we say that an RBLIMPO pre-embedding
is an RBLIMPO embedding if
for
is a local maximum for the function
defined by
Crossing near singularities (added 3/5/2023)-
The fitness functions for MPO word embeddings are pathological when the underlying field is the field of real numbers, in a way that hampers the ability of gradient ascent to find a good local optimum. In order to better illustrate the behavior of MPO word embeddings, we shall set the fitness function to be
for some normalizer
and string
;
approximates a fitness function that we normally use for MPO word embeddings. Let
be the set of all functions
where
. In other words,
is the set of all
where
or where
is undefined. If
, then
has many components, and each of these components has a local maximum. Furthermore, by using gradient ascent with a small enough learning rate, one will reach a local maximum in the same component as the initialization, but it is not likely that the initialization will be in a component that contains a decent word embedding. In practice, this situation is not as unfortunate as it may appear to be since the learning rate for gradient ascent is usually high enough to move from one component to another after each iteration of the gradient ascent. Furthermore, if one uses stochastic gradient ascent to train a word embedding, then this problem of singularities is not as applicable since different batches will produce different singularities. By setting
or
, one can to a great extent avoid this problem of singularities since the set
will have real codimension 2 and real codimension 4 in the case when
.
Future posts- Future posts on this site will be about other machine learning algorithms that are similar to the MPO word embedding algorithm and the $L_{2,d}$-spectral radius. We will also make a post about computing gradients of the spectral radius and other quantities so that computers can use gradient ascent to compute MPO word embeddings, LSRDRs, and other objects. In future posts, we shall analyze the performance of MPO word embeddings and other algorithms.
Open directions-
Much research on MPO word embeddings and related ideas is still needed. Here are a few questions about such research.
How would natural language models use matrix-valued, projective, or complex-valued data produced by MPO word embeddings?
How well do MPO word embeddings perform on large collections of text? The largest collection of text which I have used to train a word embedding was the Bible.
What are some other examples of matrix-valued word embeddings? There are probably other kinds of matrix-valued word embeddings that still need to be discovered. These other embeddings will probably be more interpretable and safe than our current word embeddings.
References:
[GGW] Junbin Gao, Yi Guo, and Zhiyong Wang. Matrix Neural Networks. 2016.