In this post, we shall make a few experimental observations about the -spectral radius and dimensionality reduction that give evidence that the -spectral radius and dimensionality reduction will eventually have a deep mathematical theory behind it and also become more applicable to machine learning. I have not been able to prove these empirical observations, so one should consider this post to be about experimental mathematics. Hopefully these experimental results will eventually be backed up by formal proofs. The reader is referred to this post for definitions.
This post is currently a work in progress, so I will be updating it with more information.

Let be the collection of all tuples where is not nilpotent. Given matrices , define a function by letting . Given matrices , we say that is an -spectral radius dimensionality reduction of if locally maximizes . We say that an -spectral radius dimensionality reduction is optimal if globally maximizes .
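As a reading aid, here is a minimal numerical sketch (in Python) of the kind of fitness function being maximized. The exact definition is deferred to the linked post and is not reproduced here, so the specific formula below, a spectral radius of a sum of Kronecker products normalized by the corresponding quantity for the reduced tuple, is my assumption about its general form rather than the official definition.

import numpy as np

def spectral_radius(M):
    # Largest absolute value of an eigenvalue.
    return np.max(np.abs(np.linalg.eigvals(M)))

def lsrdr_fitness(As, Xs):
    # Assumed form of the fitness: one maximizes
    #   rho(sum_j A_j (x) conj(X_j)) / rho(sum_j X_j (x) conj(X_j))**(1/2)
    # over tuples (X_1, ..., X_r) of d x d matrices, e.g. by gradient ascent.
    num = spectral_radius(sum(np.kron(A, X.conj()) for A, X in zip(As, Xs)))
    den = spectral_radius(sum(np.kron(X, X.conj()) for X in Xs))
    return num / np.sqrt(den)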
We say that is projectively similar to if there is some and some where for . Let denote the equivalence relation where we set precisely when and are projectively similar. Let denote the equivalence relation with respect to .
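Checking projective similarity exactly requires producing the constant and the similarity, so in practice it is easier to compare similarity-invariant quantities. The sketch below assumes the relation has the form X_j = lam * R Y_j R^(-1) and uses normalized trace invariants, which gives a necessary condition only (a heuristic test, not a proof of similarity).

import numpy as np

def trace_gram(Xs):
    # Pairwise trace invariants tr(X_i X_j), normalized by tr(X_1 X_1)
    # (assumes tr(X_1 X_1) is nonzero).
    G = np.array([[np.trace(Xi @ Xj) for Xj in Xs] for Xi in Xs])
    return G / G[0, 0]

def plausibly_projectively_similar(Xs, Ys, tol=1e-6):
    # If X_j = lam * R Y_j R^(-1) for all j, the normalized invariants agree,
    # so disagreement certifies that the tuples are not projectively similar.
    return np.allclose(trace_gram(Xs), trace_gram(Ys), atol=tol)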
Observation: (apparent unique local optimum; global attractor) Suppose that are -SRDRs of obtained by gradient ascent on the domain . Then are usually projectively similar.
Example: The above observation does not always hold, as may have multiple non-projectively similar local maxima. For example, suppose that for where each and each is a -matrix. Then . Suppose now that and . Then are usually non-projectively similar local maxima for the function . Of course, does not generate the algebra , but there is some neighborhood of where if , then will have at least two non-projectively similar local maxima. This sort of scenario is not as contrived as one may think, since could represent two clusters with few or weak connections between them (for example, in natural language processing, each language could represent its own cluster).
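To make the two-cluster example concrete, here is a small numerical sketch; it reuses the hypothetical lsrdr_fitness from the earlier sketch, and the block sizes and number of matrices are arbitrary choices. The two diagonal blocks give two competing reductions of the block-diagonal tuple, and in the scenario described above these tend to be non-projectively-similar local maxima.

import numpy as np

rng = np.random.default_rng(0)
r, d = 2, 3
Bs = [rng.standard_normal((d, d)) for _ in range(r)]
Cs = [rng.standard_normal((d, d)) for _ in range(r)]

# Block-diagonal matrices modeling two clusters with no connections between them.
As = [np.block([[B, np.zeros((d, d))],
                [np.zeros((d, d)), C]]) for B, C in zip(Bs, Cs)]

# Evaluate the (assumed) fitness at the two candidate d x d reductions.
print(lsrdr_fitness(As, Bs), lsrdr_fitness(As, Cs))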
Observation: (preservation of symmetry) Suppose that are Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite) and is an -SRDR of . Then is usually projectively similar to some tuple where each is Hermitian (resp. real, real symmetric, complex symmetric, positive semidefinite).
Observation: (near functoriality) Suppose that generates . Let be an -SRDR of . Let be an -SRDR of . Let be an -SRDR of . Then is usually not projectively similar to , but is usually close to .
Observation: (complex supremacy) Suppose that . Let be the restriction of to the domain . Then gradient ascent for the function usually produces many local maxima which are not global maxima. Therefore, in order to maximize , it is best to first extend to the complex-valued function and obtain a local maximum for . One should then be able to find some which is projectively similar to ; in this case, will be a local maximum for that is optimal or at least higher than a local maximum that one would obtain by gradient ascent on .
If , then define a function by setting . We say that is an -SRDR projector if is a local maximum for . We say that is a strong -SRDR projector if is an -SRDR.
Observation: If is an -SRDR projector, then there is usually a constant with and a projection of rank with . -SRDR projectors are usually strong -SRDR projectors. If , then set precisely when there are where . If and are -SRDR projectors, then we typically have . In fact, for some , and if we set , then for some .
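The precise formula in the observation above is missing from the extracted text, but the claim is that the two maps making up an SRDR projector compose to something close to a constant multiple of a rank-d projection. The sketch below is one way to test that numerically; the composition order S @ R and the trace-based estimate of the constant are my assumptions.

import numpy as np

def near_constant_times_projection(R, S, d, tol=1e-6):
    # Test whether S @ R is approximately lam * P for a rank-d projection P.
    # If M = lam * P with rank(P) = d, then tr(M) = lam * d, so estimate lam
    # from the trace and check idempotence of M / lam.
    M = S @ R
    lam = np.trace(M) / d
    P = M / lam
    return np.allclose(P @ P, P, atol=tol), lam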
We say that a -SRDR projector is constant factor normalized if . If are two constant factor normalized -SRDR projectors of , then and where if we set , then .
Observation: If is an -SRDR of , then there is usually a strong -SRDR projector where for .
Observation: If are self-adjoint, then whenever is a constant factor normalized -SRDR projector of , we usually have for some -SRDR projector of with and where . Furthermore, if is another -SRDR projector of with , then is a unitary matrix, and , and . If is the inclusion mapping, and is the orthogonal projection, then is an -SRDR projector.
Observation: If are real square matrices, then whenever is a -SRDR projector of , we usually have for some -SRDR projector of where are real matrices. Furthermore, if are real constant factor normalized -SRDR projectors of , then there is a real matrix where .
Observation: If are real symmetric matrices, then whenever is a -SRDR projector of , we usually have for some -SRDR projector of where are real matrices with . In fact, if are real constant factor normalized -SRDR projectors of with , there is a real orthogonal matrix with .
Observation: If are complex symmetric matrices, then whenever is a -SRDR projector of , we usually have for some -SRDR projector of and where . Furthermore, if are constant factor normalized -SRDR projectors of , and , then .
Application: Dimensionality reduction of various manifolds. Let denote the quotient complex manifold where we set precisely when or . One can use our observations about LSRDRs to perform dimensionality reduction of data in , and we can also use dimensionality reduction to map data in and to data in and respectively.
Let be the collection of all positive semidefinite rank-1 real matrices. Let be the collection of all positive semidefinite rank-1 complex matrices. Let be the collection of all symmetric rank-1 complex matrices. The sets can be put into a canonical one-to-one correspondence with respectively. Suppose that and is an -SRDR of . If , then , and every element of can be considered as a dimensionality reduction of . Similarly, if , then , and if , then .
-SRDRs can be used to produce dimensionality reductions that map subsets to where .
Suppose that . Then let for . Then there is some subspace where if are the inclusion and projection mappings, then is a constant factor normalized -SRDR projector for . In this case, is a dimensionality reduction of the tuple .
Suppose that . Then let for . Suppose that is a constant factor normalized -SRDR projector with respect to . Then is a dimensionality reduction of .
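My reading of the two paragraphs above is that each data vector is first embedded as a rank-1 matrix and the resulting tuple of matrices is then reduced with an SRDR projector. The helper below only shows that embedding step; the function name and the choice of outer product for each matrix class are illustrative assumptions.

import numpy as np

def rank_one_tuple(vectors, kind="real"):
    # Embed data vectors as rank-1 matrices before dimensionality reduction.
    #   kind='real'      -> v v^T        (real symmetric, positive semidefinite)
    #   kind='hermitian' -> v v^*        (complex positive semidefinite)
    #   kind='symmetric' -> v v^T over C (complex symmetric)
    if kind == "hermitian":
        return [np.outer(v, v.conj()) for v in vectors]
    return [np.outer(v, v) for v in vectors]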
One may need to normalize the data in the above paragraphs before taking a dimensionality reduction. Suppose that . Let be the mean of . Let . Suppose that is a dimensionality reduction of . Then is a normalized dimensionality reduction of .
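The normalization step appears to be ordinary mean-centering (the exact formula is missing from the extracted text, so treat this as my reading): subtract the mean of the data before forming the rank-1 matrices and reducing.

import numpy as np

def center(vectors):
    # Subtract the sample mean from each data vector before the rank-1 embedding.
    V = np.asarray(vectors)
    return V - V.mean(axis=0)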
Observation: (Random real matrices) Suppose that and are random real matrices; for simplicity assume that the entries in are independent and normally distributed with mean and variance . Then whenever , we have . In particular, .
Observation: (Random complex matrices) Suppose now that and are random complex matrices; for simplicity assume that the entries in are independent and Gaussian with mean , where the real part of each entry in has variance and the imaginary part of each entry in has variance . Then and .
Observation: (Random permutation matrices) Suppose that and is the standard irreducible representation. Let be selected uniformly at random. Then and .
Observation: (Random orthogonal or unitary matrices) Suppose that , and let be a collection of unitary matrices selected at random according to the Haar measure, or let be a collection of orthogonal matrices selected at random according to the Haar measure. Then and .
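For readers who want to reproduce these random-matrix experiments, here is one way to sample the ensembles mentioned above. The matrix size, seed, and normalizations are arbitrary choices of mine, and the permutation sample uses the full n-dimensional permutation representation rather than the standard irreducible one.

import numpy as np
from scipy.stats import ortho_group, unitary_group

rng = np.random.default_rng(0)
n = 50

real_gaussian = rng.standard_normal((n, n))                          # iid real Gaussian entries
complex_gaussian = (rng.standard_normal((n, n))
                    + 1j * rng.standard_normal((n, n))) / np.sqrt(2)  # iid complex Gaussian entries
permutation = np.eye(n)[rng.permutation(n)]                          # uniformly random permutation matrix
orthogonal = ortho_group.rvs(n, random_state=0)                      # Haar-random orthogonal matrix
unitary = unitary_group.rvs(n, random_state=0)                       # Haar-random unitary matrix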
Suppose that is a complex matrix with eigenvalues with . Then define the spectral gap ratio of as .
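Under my reading of this definition (the ratio of the two largest eigenvalue magnitudes), the computation is just:

import numpy as np

def spectral_gap_ratio(A):
    # |lambda_1| / |lambda_2| with the eigenvalues sorted by decreasing absolute value.
    mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return mags[0] / mags[1]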
Observation: Let . Let be random matrices according to some distribution. Suppose that are matrices where is positive semidefinite (for example, could be a constant factor normalized LSRDR projector of ). Let . Suppose that the eigenvalues of are with . Then the spectral gap ratio of is much larger than the spectral gap ratio of random matrices that follow the circular law. However, the eigenvalues with the dominant eigenvalue removed still follow the circular law (or the semicircular or elliptical law, depending on the choice of distribution of ). The value will be approximately . The large spectral gap ratio for makes the version of gradient ascent that we use to compute LSRDRs work well; the power iteration algorithm for computing dominant eigenvectors converges more quickly when there is a large spectral gap ratio, and the dominant eigenvectors of a matrix are needed to calculate the gradient of the spectral radius of that matrix.
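Since the observation leans on this convergence property, here is a plain power-iteration sketch. The error after k steps decays roughly like (|lambda_2| / |lambda_1|)^k, which is why a large spectral gap ratio makes the dominant eigenvector, and hence the gradient of the spectral radius, cheap to compute.

import numpy as np

def dominant_eigenpair(M, iters=500, tol=1e-12, seed=0):
    # Power iteration for the dominant eigenvalue/eigenvector of a complex matrix.
    # Converges quickly when the spectral gap ratio |lambda_1| / |lambda_2| is large.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0]) + 1j * rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = M @ v
        v = w / np.linalg.norm(w)
        new_lam = v.conj() @ M @ v      # Rayleigh-quotient estimate of the eigenvalue
        if abs(new_lam - lam) < tol * max(1.0, abs(new_lam)):
            break
        lam = new_lam
    return lam, v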
Observation: (increased regularity in high dimensions) Suppose that are -complex matrices. Suppose that . Suppose that are -SRDRs of trained by gradient ascent, while are -SRDRs of . Then it is more likely for to be projectively similar to than it is for to be projectively similar to . In general, will be better behaved than .
Quaternions
We say that a complex matrix is a quaternionic complex matrix if and whenever . Suppose that each is an quaternionic complex matrix for . Suppose that is an even integer with and suppose that is a -SRDR of . Then there is often some and where if we set for , then are quaternionic complex matrices. Furthermore, if is an -SRDR projector, then there is some LSRDR projector where but where are quaternionic complex matrices.
We observe that if are quaternionic complex matrices and are rectangular matrices with , then the matrix will still have a single dominant eigenvalue with a spectral gap ratio significantly above 1, in contrast with the fact that each have a spectral gap ratio of , since the eigenvalues of come in conjugate pairs. In particular, if are matrices obtained during the process of training , then the matrix will have a spectral gap ratio that is significantly above ; this ensures that gradient ascent will not have any problems in optimizing .
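The definition of a quaternionic complex matrix is missing from the extracted text; assuming the usual block model (a 2n x 2n complex matrix of the form [[A, B], [-conj(B), conj(A)]]), the following sketch verifies the fact used above: such a matrix is similar to its complex conjugate, so its spectrum is closed under conjugation and its spectral gap ratio equals 1.

import numpy as np

def random_quaternionic(n, seed=0):
    # 2n x 2n complex matrix of the assumed quaternionic block form [[A, B], [-conj(B), conj(A)]].
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return np.block([[A, B], [-B.conj(), A.conj()]])

n = 4
Q = random_quaternionic(n)
J = np.block([[np.zeros((n, n)), np.eye(n)], [-np.eye(n), np.zeros((n, n))]])
# J Q J^(-1) equals conj(Q), so Q is similar to its conjugate and its eigenvalues come in conjugate pairs.
print(np.allclose(J @ Q @ np.linalg.inv(J), Q.conj()))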
Gradient ascent on subspaces
Let be a natural number. Let be the set of all -Hermitian matrices, complex symmetric matrices, and complex quaternionic matrices. Suppose that . Suppose that . Then let be a tuple such that is locally maximized. Suppose that is a tuple where is locally maximized. Then we usually have . In this case, we would say that is an alternate domain -spectral radius dimensionality reduction (ADLSRDR).
Other fitness functions
Let be a finite-dimensional real or complex inner product space. Suppose that is a Borel probability measure with compact support on . Now define a fitness function by letting .

If is a Borel probability measure with compact support on , then define a fitness function by letting .
If are real inner product spaces, then will generally be disconnected spaces, so gradient ascent with a random initialization will have no hope of finding an or that maximizes respectively. However, whenever is positive definite, and gradient ascent seems to always produce the same local optimum value for the function when is initialized to be a positive semidefinite matrix. If are linear operators that maximize , then we generally have .
If is a complex inner product space, then gradient ascent seems to always produce the same local optimum value for the fitness function , and if are matrices that maximize the fitness function , then we generally have for some . The locally optimal operator will generally have positive eigenvalues for some , but the operator itself will not be positive semidefinite. If and is supported on , then for each operator that optimizes , there is some where , and is, up to sign, the same matrix found by real gradient ascent with a positive definite initialization.
The problem of optimizing is useful for constructing MPO word embeddings and other objects for machine learning similar to LSRDR dimensionality reductions. While tends to have a local optimum that is unique up to a constant factor, the fitness function does not generally have a unique local optimum; if are complex inner product spaces, then has few local optima in the sense that it is not unusual for there to be a constant where such that locally optimize . In this case, the matrix that globally optimizes is most likely also the matrix with the largest basin of attraction with respect to various types of gradient ascent.
MPO word embeddings: Suppose that is a matrix embedding potential, is a finite set, and . Suppose that are MPO word pre-embeddings with respect to the matrix embedding potential . Then there are often constants for where for . Furthermore, there is often a matrix along with constants for where is a real matrix for .
Even though I have published this post, it is not yet complete, since I would like to add information about functions related to and which satisfy properties similar to those of the functions of the form .