The Bigraphical Lasso
Introduction
When fitting Gaussian models to data, we usually make an independence assumption across data points and fit the covariance matrix by maximum likelihood. The number of parameters in the covariance matrix can be reduced by factor analysis like structures [see e.g. @Tipping:probpca99] or by constraining the inverse covariance (or precision) to be sparse [e.g. @Banerjee:model2008]. A sparse precision matrix defines a Gauss Markov random field relationship which is conveniently represented by a weighted undirected graph. Nodes which are not neighbors in the graph are conditionally independent given all other nodes. Models specified in this way can learn conditional independence structures between features.
An alternative Gaussian modeling approach was introduced by @Lawrence:unifying12, who showed that spectral dimensionality reduction methods have an interpretation as sparse graphical models where the independence assumption is across data features, and the parameters of the covariance are fitted by maximum likelihood (or in the case of local linear embeddings [@Roweis:lle00] by maximizing a pseudolikelihood). This assumption leads to much better determined parameters in the case where the number of features is greater than the number of data points (the so called large $p$, small $n$ case).
The choice of feature independence or data point independence is a model choice issue, but both choices are in fact a simplification of a more general framework that aims to estimate the conditional independence relationships between both features and data points. It is this type of model that we address in this paper. Specifically, we want to build a sparse graph that interrelates both features and data points. For instance, we might have a data set that is a video. Here the data points are the frames of the video and the data features are the pixels in the video. Let’s assume that the ordering of the video frames and the neighborhood structure between pixels has somehow been lost. A potential learning task would be to learn both the temporal structure of the data and the spatial structure of the interrelated pixels. We successfully solve this task for a simple video from the COIL data set in Section [sec:coildata].
An alternative motivating data example could be gene expression data, where we might wish to extract a genetic network from the gene expression values whilst explaining correlations between samples (such as close genetic relationships, or related experiments) with a separate network. Such data are more naturally represented by matrix-variate distributions.
Graphical Lasso and Matrix Variate Normal
The graphical lasso [GLasso, @Friedman:sparse08; @Banerjee:model2008] is a computationally efficient penalised likelihood algorithm for learning sparse structures of conditional dependencies or Gaussian Markov random fields (GMRF) over features of iid vector-variate Gaussian samples [@Lauritzen:graphical96].
The matrix variate normal [@Dawid:MatrixVariate81; @Gupta:MatrixVariate99] is a Gaussian density which can be applied to a matrix through first taking a vectorized (vec) representation1 of the matrix samples $\mb{X} \in \Realspace^{n \times p}$ and assuming the covariance matrix has the form of a Kronecker product between two covariance matrices, separately associated with the rows and columns of the data. The Kronecker product assumption for the covariance implies that the precision matrix is also a Kronecker product, which is formed from the Kronecker product of the precision matrices associated with the rows and columns ($\bPsi \otimes \bTheta$).
One approach to applying sparse graphical models to matrix data is to combine the Kronecker product structured matrix variate normal with the graphical Lasso. @Dutilleul:MLE99 used a flip-flop approach for maximum likelihood estimation of the parameters of the matrix-normal and much later @Zhang:Learning10 used it for MAP estimation with sparsity penalties on the precision matrices. More recently, @Leng:Sparse12 applied the SCAD penalty [@Fan:Variable01] as well as the Lasso in the likelihood function of the matrix-normal. @Tsiligkaridis:Convergence13 analyzed the convergence of Kronecker GLasso under asymptotic conditions as well as simulations that show significant speedups over GLasso and MLE.
However, whilst the Kronecker-product structure arises naturally when considering matrix-normals (Kronecker-normals), it is relatively dense when it comes to the dependencies it suggests between the rows. More precisely, if $\Psi_{ij}$ in $\bPsi \otimes \bTheta$ is non-zero (for example, corresponding to an edge between samples $i$ and $j$ in the design matrix $\mb{X}$) then many edges between features of sample $i$ and sample $j$ (as many as in $\bTheta$) will also be active. A sparser structure would benefit situations where the connection between a feature of some sample and a different feature of any other sample is of no interest or redundant, simply because a same-feature dependency between different samples would suffice to establish a cross-sample dependency.
The Bigraphical Lasso
In this paper, we introduce the bigraphical Lasso (BiGLasso), a model for matrix-variate data that preserves their column/row structure and, like the Kronecker-product based matrix normal, simultaneously learns two graphs, one over rows and one over columns of the matrix samples. The model is trained in a flip-flop fashion, so the number of Lasso regressions reduces to $\mathcal{O}(n+p)$. The model preserves the matrix structure by using a novel Kronecker sum structure for the precision matrix, $(\bPsi \otimes \mb{I}) + (\mb{I} \otimes \bTheta)$, instead of the Kronecker product.
This structure enjoys enhanced sparsity in comparison to the conventional Kronecker-product structure of matrix-normals.
In the context of regression models, the Kronecker-sum avoids the conditional independence between responses of multi-output Gaussian processes that a Kronecker-product covariance induces, an effect known in various literatures as the cancellation of inter-task transfer, or autokrigeability.
When operating on adjacency matrices, the Kronecker-sum is also known in algebraic graph theory as the Cartesian product of graphs and is arguably the most prominent of graph products [@Sabidussi:graph59; @Chung:Spectral96; @Imrich:TopicsInGT08]. This endows the output of the BiGLasso with a more intuitive and interpretable graph decomposition of the induced Gaussian random field (GRF), see figure [fig:graph_products].
Enhanced Sparsity
For a matrix density $\lambda \in [0,1]$ of both precision matrices the Kronecker-sum has $\mathcal{O}(\lambda n p (n+p))$ non-zeros, whereas the Kronecker-product has $\mathcal{O}(\lambda n^2 p^2)$ non-zeros.
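To illustrate this count numerically, here is a minimal sketch; the sizes $n$, $p$ and the density are arbitrary choices, and the random sparse matrices only stand in for the sparsity patterns of $\bPsi$ and $\bTheta$:

```python
import scipy.sparse as sp

n, p, density = 30, 40, 0.1          # arbitrary sizes and matrix density

# Random sparsity patterns standing in for Psi (n x n) and Theta (p x p).
Psi = sp.random(n, n, density=density, random_state=0, format="csr")
Theta = sp.random(p, p, density=density, random_state=1, format="csr")

ks = sp.kron(Psi, sp.identity(p)) + sp.kron(sp.identity(n), Theta)  # Kronecker sum
kp = sp.kron(Psi, Theta)                                            # Kronecker product

print("Kronecker-sum  non-zeros:", ks.nnz)   # <= nnz(Psi)*p + nnz(Theta)*n
print("Kronecker-prod non-zeros:", kp.nnz)   # = nnz(Psi) * nnz(Theta)
```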
Better Information Transfer
Kronecker product forms have a known weakness, referred to in the Gaussian process (GP) literature as the cancellation of inter-task transfer: @Bonilla:multitask08 [§2.3] showed that the predictive mean of a multi-output GP with a noise-free Kronecker-product covariance2 and the same inputs conditioned across tasks (a conditioning structure referred to as a block design) uncouples the outputs of the different tasks, that is, the posterior factorises and thus the outputs are computed independently. The key of this proof lies in the factorisable property of the inverse Kronecker-product, $(\bPsi \otimes \bTheta)^{-1} = \bPsi^{-1} \otimes \bTheta^{-1}$. This property does not apply under the presence of additive noise, hence the outputs remain coupled. This result first arose in geostatistics under the name of autokrigeability [@Wackernagel:Geostats03] and is also discussed for covariance functions by @OHagan:MarkovCovMatrices98. @Zellner:Seemingly62, @Binkley:note88 pointed out how the consideration of the correlation between regression equations leads to a gain in efficiency.
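The identity at the heart of this argument is easy to verify numerically. The sketch below uses arbitrary small positive-definite matrices, and also checks that the inverse of a Kronecker sum does not factorise in the same way:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """A random symmetric positive-definite matrix."""
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

Psi, Theta = random_spd(3), random_spd(4)

# (Psi x Theta)^{-1} = Psi^{-1} x Theta^{-1}  (x = Kronecker product)
lhs = np.linalg.inv(np.kron(Psi, Theta))
rhs = np.kron(np.linalg.inv(Psi), np.linalg.inv(Theta))
print(np.allclose(lhs, rhs))                 # True

# The Kronecker sum admits no such factorisation in general.
ks = np.kron(Psi, np.eye(4)) + np.kron(np.eye(3), Theta)
print(np.allclose(np.linalg.inv(ks), rhs))   # False
```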
In a similar vein, econometrics offers models of seemingly unrelated regressions [SUR, @Zellner:Seemingly62], a form of generalised least squares that allows a different set of regressors for each response. The problem reduces to ordinary least squares (OLS) when the same covariates are used across the outputs (block design). With a block design, OLS would pass on a potential gain in efficiency by disregarding correlations between responses. Nonetheless, the distribution of the maximum-likelihood estimators does not factorize, regardless of conditioning design. In contrast to SUR, a block design on a multi-output GP with a noise-free Kronecker-product covariance induces the stronger effect of conditional independence over the outputs. These two factorisations are very different and in general do not coincide.
The same property that allows for a simple flip-flop approach also negates the merit of exploiting any correlations between different outputs; once the outputs are coupled with additive noise to enable inter-task transfer, the flip-flop approach is no longer straightforward. @stegle:efficient11 addressed this issue by adding iid noise to a Kronecker-product covariance (a low-rank factor for confounding effects and a sparse-inverse factor for inter-sample dependencies) and exploiting identities of the $\textrm{vec}(.)$ notation for efficient computation within the matrix-normal model.
To summarize our contributions, contrary to existing approaches that use the Kronecker-product structure, the Kronecker-sum does not cancel the inter-task transfer. Our algorithm maintains the simplicity of the flip-flop with a simple trick of transposing the matrix-variate (samples become features and vice versa). At the same time, the equivalent Cartesian factorization of graphs provides a more parsimonious interpretation of the induced Markov network.
The rest of this paper is structured as follows. We describe the matrix-normal model with the Kronecker-sum inverse-covariance in §[sec:model]. In §[sec:algorithm], we present the BiGLasso algorithm for learning the parameters of the Kronecker-sum inverse-covariance. We present some simulations in comparison to a recent Kronecker-normal model of @Leng:Sparse12 in §[sec:simulations] and an application to an example from the COIL dataset in §[sec:coildata]. We conclude in §[sec:conclusion].
Matrix-normal model with the Kronecker-sum structure
To motivate our model, consider the case where matrix-variate data $\mb{Y}$ are sampled iid from a matrix-normal distribution (matrix-Gaussian). This is a natural generalisation of the Gaussian distribution towards tensor support3. This distribution can be reparametrized such that the support is over vectorised representations of random matrices, \(\textrm{vec}(\mb{Y}) \sim \Normal{\mb{0}}{\bTheta^{-1}_p \otimes \bPsi^{-1}_n}.\)
Under the assumption that the Kronecker-product precision $\bPsi_n \otimes \bTheta_p$ is sparse, the SMGM estimator [@Leng:Sparse12] for the precision matrices $\bPsi_n, \bTheta_p$ can be computed iteratively by minimizing \(\begin{split} L(\bTheta_p, \bPsi_n) = \frac{1}{Nnp} \sum_{i=1}^{N} \tr{\mb{Y}_i \bTheta_p \mb{Y}^\top_i \bPsi_n} - \frac{1}{n} \log \verts{\bPsi_n} \\ - \frac{1}{p} \log \verts{\bTheta_p} + \lambda_1 || \bPsi_n ||_1 + \lambda_2 || \bTheta_p ||_1, \end{split}\) where $\mb{Y}_i$ is the $i$-th matrix sample, $N$ is the sample size and $\lambda_1, \lambda_2$ the regularization parameters.
Let $\mb{Y} \in \Realspace^{n \times p}$ be a random matrix. If its rows are generated as iid samples from $\Normal{\mb{0}}{ \bSigma_p}$, then the sampling distribution of the sufficient statistic $\mb{Y}^{\top}\mb{Y}$ is $Wishart (n, \bSigma_p)$ with $n$ degrees of freedom and scale matrix $\bSigma_p$. Similarly, if the columns are generated as iid samples from $\Normal{\mb{0}}{ \bGamma_n}$, then the sampling distribution is $Wishart(p, \bGamma_n)$.
From a maximum-entropy point of view, we can constrain these second-order moments in a model for both the features and the datapoints of a design matrix. To do so, we combine these sufficient statistics in a model for the entire matrix $\mb{Y}$ as \(p(\mb{Y}) ~ \propto ~ \expo{ -\tr{\bPsi_n \mb{Y}\mb{Y}^\top} -\tr{\bTheta_p \mb{Y}^\top\mb{Y}}} ~,\) where $\bPsi_n \in \Realspace^{n \times n}$ and $\bTheta_p \in \Realspace^{p \times p}$ are positive definite matrices. This is equivalent to a joint factorized Gaussian distribution (see §3 in supplementary material) for the $n \times p$ entries of $\mb{Y}$, with a precision matrix of the form \(\bOmega ~\triangleq~ \bPsi_n \oplus \bTheta_p ~=~ \bPsi_n \otimes \mb{I}_p + \mb{I}_n \otimes \bTheta_p ~,\) where $\otimes$ is the Kronecker-product and $\oplus$ the Kronecker-sum operator. Thus, \(\omega_{ij, kl} ~=~ \psi_{i,k} \delta_{j,l} + \delta_{i,k} \theta_{j,l} ~,\) for $i,k \in \{1,\dots,n\}$ and $j, l \in \{1,\dots,p\}$. As an immediate benefit of this parameterization, while the full covariance matrix has $\mathcal{O}(n^2 p^2)$ entries, these are governed in our model by only $\mathcal{O}(n^2 + p^2)$ parameters.
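A small NumPy check of this construction (a sketch with arbitrary positive-definite $\bPsi_n$ and $\bTheta_p$; the entries of $\mb{Y}$ are ordered row-major so that the joint index matches the $\bPsi_n \otimes \mb{I}_p + \mb{I}_n \otimes \bTheta_p$ block structure):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 4

def random_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

Psi, Theta = random_spd(n), random_spd(p)

# Kronecker-sum precision over the n*p entries of Y.
Omega = np.kron(Psi, np.eye(p)) + np.kron(np.eye(n), Theta)

# Check omega_{ij,kl} = psi_{ik} delta_{jl} + delta_{ik} theta_{jl} on one entry.
i, j, k, l = 0, 2, 1, 2
assert np.isclose(Omega[i * p + j, k * p + l],
                  Psi[i, k] * (j == l) + (i == k) * Theta[j, l])

# Draw one matrix sample Y with this precision (only O(n^2 + p^2) free parameters).
y = rng.multivariate_normal(np.zeros(n * p), np.linalg.inv(Omega))
Y = y.reshape(n, p)     # rows = datapoints, columns = features
```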
Given data in the form of some design matrix $\mb{Y}$, the BiGLasso estimates sparse matrices by putting $\ell_1$ penalties on $\bTheta_p$ and $\bPsi_n$. The convex optimization problem is \(\label{equ:biglasso_problem} \begin{split} \underset{\bTheta_p, \bPsi_n}{\textrm{min}} \Big\{ & n~\tr{\bTheta_p \mb{S}} + p~\tr{\bPsi_n \mb{T}} - \ln \verts{\bPsi_n \oplus \bTheta_p} \\ &+ \lambda \norm{\bTheta_p}_1 + \gamma \norm{\bPsi_n}_1 \Big\}~, \end{split}\) \(\label{equ:empiricalCovs} \textrm{where} \quad \mb{S} \triangleq \tfrac{1}{n}\mb{Y}^\top\mb{Y} \quad \textrm{and} \quad \mb{T} \triangleq \tfrac{1}{p}\mb{Y}\mb{Y}^\top\) are empirical covariances across the samples and features respectively. A solution simultaneously estimates two graphs – one over the columns of $\mb{Y}$, corresponding to the sparsity pattern of $\bTheta_p$, and another over the rows of $\mb{Y}$, corresponding to the sparsity pattern of $\bPsi_n$. Note that since $\omega_{ii,jj} = \psi_{ii} + \theta_{jj}$, the diagonals of $\bTheta_p$ and $\bPsi_n$ are not identifiable (though we could restrict the inverses to correlation matrices). However, this does not affect the estimation of the graph structure (locations of zeros).
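For concreteness, the objective above can be evaluated directly. The sketch below is a naive implementation (it forms the full $np \times np$ Kronecker sum, which is only sensible for small $n$, $p$), with arbitrary test inputs:

```python
import numpy as np

def biglasso_objective(Psi, Theta, Y, lam, gam):
    """Penalised negative log-likelihood of the Kronecker-sum model (up to constants)."""
    n, p = Y.shape
    S = Y.T @ Y / n                      # p x p empirical covariance (across samples)
    T = Y @ Y.T / p                      # n x n empirical covariance (across features)
    Omega = np.kron(Psi, np.eye(p)) + np.kron(np.eye(n), Theta)   # Kronecker sum
    _, logdet = np.linalg.slogdet(Omega)
    penalty = lam * np.abs(Theta).sum() + gam * np.abs(Psi).sum()
    return n * np.trace(Theta @ S) + p * np.trace(Psi @ T) - logdet + penalty

# Arbitrary test inputs.
rng = np.random.default_rng(2)
n, p = 5, 4
Y = rng.standard_normal((n, p))
print(biglasso_objective(np.eye(n), np.eye(p), Y, lam=0.1, gam=0.1))
```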
A penalized likelihood algorithm for the BiGLasso
A note on notation
If $\mb{M}$ is an $np \times np$ matrix written in terms of $p \times p$ blocks, as \(\mb{M} = \mat{ \mb{M}_{11} &\dots &\mb{M}_{1n} \\ \vdots &\ddots &\vdots \\ \mb{M}_{n1} &\dots &\mb{M}_{nn} } ~,\) then $\textrm{tr}_p(\mb{M})$ is the $n \times n$ matrix of traces of such blocks4: \(\textrm{tr}_p(\mb{M}) = \mat{ \tr{\mb{M}_{11}} &\dots &\tr{\mb{M}_{1n}} \\ \vdots &\ddots &\vdots \\ \tr{\mb{M}_{n1}} &\dots &\tr{\mb{M}_{nn}} } ~.\) We alternate between optimizing over $\bPsi_n$ while holding $\bTheta_p$ fixed and optimizing over $\bTheta_p$ while holding $\bPsi_n$ fixed. First we consider the case where there is no regularization. From \eqref{equ:biglasso_problem}, the first step of the optimization problem reduces to \(\label{equ:biglasso_minPsi} \underset{\bPsi_n}{\textrm{min}} \Big\{ p~\tr{\bPsi_n \mb{T}} ~-~ \ln \verts{\bPsi_n \oplus \bTheta_p} \Big\} ~.\)
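The blockwise trace used throughout the derivation below is a one-liner in NumPy (a sketch; the reshape assumes $\mb{M}$ is stored as a dense $np \times np$ array with the block ordering above):

```python
import numpy as np

def blockwise_trace(M, n, p):
    """tr_p(M): the n x n matrix whose (i,k) entry is the trace of the (i,k) p x p block of M."""
    blocks = M.reshape(n, p, n, p)          # axes 1 and 3 index within-block rows/columns
    return np.trace(blocks, axis1=1, axis2=3)

# Sanity check against footnote 4: tr_np(M) recovers the ordinary trace.
rng = np.random.default_rng(0)
n, p = 3, 4
M = rng.standard_normal((n * p, n * p))
assert np.isclose(blockwise_trace(M, 1, n * p)[0, 0], np.trace(M))
assert np.isclose(np.trace(blockwise_trace(M, n, p)), np.trace(M))
```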
Section 2 in the supplementary material shows how to take the gradient of \eqref{equ:biglasso_minPsi} with respect to $\bPsi_n$. Combining (5) and (6) of the supplementary material we obtain the stationary point: \(\mb{T} - \tfrac{1}{2p} \mb{T} \circ \mb{I} = \tfrac{1}{p}\textrm{tr}_p (\mb{W}) - \tfrac{1}{2p}\textrm{tr}_p (\mb{W}) \circ \mb{I} ~,\) where we define $\mb{W} \triangleq (\bPsi_n \oplus \bTheta_p)^{-1}$. We partition $\mb{V} \triangleq \frac{1}{p}~\textrm{tr}_p (\mb{W})$ as \(\mb{V} = \mat{v_{11} &\mb{v}^\top_{1\sm1} \\ \mb{v}_{1\sm1} & \mb{V}_{\sm1 \sm1}} ~,\) where $\mb{v}_{1\sm1}$ is a vector of size $n-1$ and $\mb{V}_{\sm 1 \sm 1}$ is an $(n-1) \times (n-1)$ matrix. Despite the complex form of the stationarity condition, only the lower-left block of its partition will be of use: \(\begin{aligned} \label{equ:stationary_point} \nonumber \mb{t}_{1\sm1} &= \tfrac{1}{p}\textrm{tr}_p (\mb{W}_{1\sm1}) = \mb{v}_{1\sm1}, ~\textrm{and also from \eqref{equ:empiricalCovs}}, \\ \quad \mb{t}_{1\sm1} &= (t_{21}, \dots,t_{n1})^\top = \tfrac{1}{p}(\mb{y}_2^\top \mb{y}_1,\, \dots, \mb{y}_n^\top \mb{y}_1)^\top . \end{aligned}\) Similarly, we partition $\mb{W}$ into blocks: \(\mb{W} = \mat{\mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1}}~,\) where $\mb{W}_{11}$ is a $p \times p$ matrix and $\mb{W}_{1\sm1}$ is a $p(n-1) \times p$ matrix. Then from the bottom-left block of \(\label{equ:WOmega} \begin{split} &\mb{W}\bOmega = \\ &\mat{ \mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1} } \mat{ \psi_{11}\mb{I}_p + \bTheta_p &\dots &\psi_{1n}\mb{I}_p \\ \vdots &\ddots &\vdots \\ \psi_{n1}\mb{I}_p &\dots &\psi_{nn}\mb{I}_p + \bTheta_p } \\ &= \mb{I}_{n} \otimes \mb{I}_p ~, \end{split}\) we get \(\nonumber \mb{W}_{1\sm1} (\psi_{11}\mb{I}_p + \bTheta_p) + \mb{W}_{\sm1\sm1} (\bpsi_{1\sm1} \otimes \mb{I}_p) = \mb{0}_{n-1} \otimes \mb{I}_p\) \(\label{equ:W1_not1} \mb{W}_{1\sm1} + \mb{W}_{\sm1\sm1}\mat{ (\psi_{11}\mb{I}_p + \bTheta_p)^{-1} \psi_{21} \\ \vdots \\ (\psi_{11}\mb{I}_p + \bTheta_p)^{-1}\psi_{n1}} = \mb{0}_{n-1} \otimes \mb{I}_p\) \(\begin{split} \nonumber \mb{W}_{1\sm1} ~+~ &\mb{W}_{2\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{-1} \psi_{21} ~+\dots~\\ ~\dots+~ &\mb{W}_{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{-1} \psi_{n1} = \mb{0}_{n-1} \otimes \mb{I}_p ~, \end{split}\) with $\mb{0}_{n-1}$ the vector of $n-1$ zeros. According to the stationary point in \eqref{equ:stationary_point}, taking the blockwise trace $\textrm{tr}_p(.)$ of both sides gives the equation \(p ~ \mb{t}_{1\sm1} + \mb{A}_{\sm 1 \sm1} \bpsi_{1\sm1} = \mb{0}_{n-1}, \quad \textrm{where} \quad \mb{A}^\top_{\sm 1 \sm1} \triangleq \mat{ \textrm{tr}_p \left\{ \mb{W}_{2\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{-1} \right\}^\top \\ \vdots \\ \textrm{tr}_p \left\{ \mb{W}_{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{-1} \right\}^\top } ~.\)
By imposing an $\ell_1$ penalty on $\bpsi_{1\sm1}$, this problem reduces to a Lasso regression.
After estimating $\bpsi_{1\sm1}$, we compute $\mb{W}_{1\sm1}$ by substituting into \eqref{equ:W1_not1}. It remains to compute $\mb{W}_{11}$. This follows from \eqref{equ:WOmega}, which gives \(\mb{W}_{11} = (\mb{I} - \mb{W}_{1\sm1}^\top (\bpsi_{1\sm1} \otimes \mb{I})) (\psi_{11}\mb{I} + \bTheta_p)^{-1} ~.\) This algorithm iteratively estimates columns of $\bPsi_n$ and $\mb{W}$ in this manner. The procedure for estimating $\bTheta_p$, for fixed $\bPsi_n$, becomes directly parallel to the above simply by transposing the design matrix (samples become features and vice-versa) and applying the same steps. Algorithm 1 outlines the BiGLasso.
Algorithm 1: BiGLasso

Input: $\mb{Y}, \lambda, \gamma$ and initial estimates of $\bPsi_n$ and $\bTheta_p$.

$\mb{T} \leftarrow p^{-1}\mb{Y}\mb{Y}^\top$

Estimate $\bPsi_n$: for each $i = 1, \dots, n$,

- Partition $\bPsi_n$ into $\psi_{ii}$, $\bpsi_{i \sm i}$ and $\bPsi_{\sm i \sm i}$.
- Find a sparse solution of $p~\mb{t}_{i \sm i} + \mb{A}_{\sm i \sm i} \bpsi_{i \sm i} = \mb{0}_{n-1}$ with *Lasso* regression.
- Substitute $\bpsi_{i \sm i}$ into \eqref{equ:W1_not1} to compute $\mb{W}_{i \sm i}$.
- $\mb{W}_{ii} \leftarrow \left( \mb{I} - \mb{W}_{i \sm i}^\top (\bpsi_{i \sm i} \otimes \mb{I}) \right) (\psi_{ii}\mb{I} + \bTheta_p)^{-1}$

Estimate $\bTheta_p$: proceed as if estimating $\bPsi_n$, with input $\mb{Y}^\top, \lambda, \gamma$.
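The Lasso step of Algorithm 1 can be handed to an off-the-shelf solver. The sketch below rewrites $p\,\mb{t}_{i \sm i} + \mb{A}_{\sm i \sm i} \bpsi_{i \sm i} = \mb{0}$ as an $\ell_1$-penalised least-squares problem and solves it with scikit-learn; the arrays `A` and `t` are hypothetical stand-ins for the quantities computed from $\mb{W}$ and $\bTheta_p$ above, and the exact scaling of the penalty relative to scikit-learn's objective is our own choice rather than part of the published algorithm:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_column_update(A, t, p, gamma):
    """Sparse solution of  p * t + A @ psi = 0  in the l1-penalised least-squares sense.

    A : (n-1, n-1) array and t : (n-1,) array are stand-ins for the blocks
    A_{-i,-i} and t_{i,-i} described in the text.
    """
    model = Lasso(alpha=gamma, fit_intercept=False, max_iter=10_000)
    model.fit(A, -p * t)        # regress the target -p*t on the columns of A
    return model.coef_          # sparse estimate of psi_{i,-i}

# Toy usage with random inputs (illustration only).
rng = np.random.default_rng(0)
m, p, gamma = 6, 10, 0.1
psi = lasso_column_update(rng.standard_normal((m, m)), rng.standard_normal(m), p, gamma)
print(psi)
```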
In our experiments we treat $\lambda$ and $\gamma$ as the same parameter and the precision matrices $\bPsi_n$ and $\bTheta_p$ are initialized as identity matrices. The empirical mean matrix is removed from each dataset.
Simulations
To empirically assess the efficiency of BiGLasso, we generate the datasets described below from centered Gaussians with Kronecker-product (KP) and Kronecker-sum (KS) precision matrices. We run the BiGLasso and the SMGM (Sparse Matrix Graphical Model) of @Leng:Sparse12, a flip-flop extension of GLasso for Kronecker-product matrix-normals using the $\ell_1$ penalty. More precisely, one iteration of SMGM consists of a GLasso iteration for fitting the row-precision matrix with a fixed column-precision and another GLasso iteration for fitting the column-precision matrix with a fixed row-precision. The $\bTheta_p$ and $\bPsi_n$ precision matrices in both cases are generated in accordance with [§4, @Leng:Sparse12]; namely, as one of the following $d \times d$ blocks ($d$ being either $p$ or $n$) of increasing density (a generating sketch follows the list):
- $\mb{A}_1$: Inverse AR(1) (auto-regressive process of order 1), such that $\mb{A}_1 = \mb{B}^{-1}$ with $B_{ij} = 0.7^{|i-j|}$.
- $\mb{A}_2$: AR(4) with $A_{ij} = I(|i-j|=0) ~+~ 0.4I(|i-j|=1) ~+~ 0.2I(|i-j|=2) ~+~ 0.2I(|i-j|=3) ~+~ 0.1I(|i-j|=4)$, $I(.)$ being the indicator function.
- $\mb{A}_3 = \mb{B} + \delta\mb{I}$, where for each pair $B_{ij} = B_{ji}$, $i\neq j$, $P(B_{ij}=0.5) = 0.9$ and $P(B_{ij}=0) = 0.1$. The diagonal is zero and $\delta$ is chosen such that the condition number of $\mb{A}_3$ is $d$. Since the condition number is $k(\mb{A}_3) = d = \frac{\lambda_1 + \delta}{\lambda_d + \delta}$, the ratio of largest to smallest eigenvalue, it follows that $\delta = \frac{d \lambda_d - \lambda_1}{1 - d}$.
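The following sketch generates the three block types as described above; it is our reading of the text, and the seeding and eigenvalue-based choice of $\delta$ are implementation details:

```python
import numpy as np

def block_A1(d):
    """Inverse AR(1): A1 = B^{-1} with B_ij = 0.7^|i-j|."""
    idx = np.arange(d)
    return np.linalg.inv(0.7 ** np.abs(idx[:, None] - idx[None, :]))

def block_A2(d):
    """Banded AR(4)-style block with the weights 1, 0.4, 0.2, 0.2, 0.1."""
    dist = np.abs(np.arange(d)[:, None] - np.arange(d)[None, :])
    A = np.zeros((d, d))
    for k, w in zip(range(5), [1.0, 0.4, 0.2, 0.2, 0.1]):
        A[dist == k] = w
    return A

def block_A3(d, seed=0):
    """A3 = B + delta*I, with random off-diagonals and delta set so that cond(A3) = d."""
    rng = np.random.default_rng(seed)
    B = np.zeros((d, d))
    iu = np.triu_indices(d, k=1)
    B[iu] = 0.5 * (rng.random(len(iu[0])) < 0.9)   # 0.5 w.p. 0.9, else 0 (as stated in the text)
    B = B + B.T                                     # symmetric, zero diagonal
    eig = np.linalg.eigvalsh(B)                     # ascending: eig[0] = lambda_d, eig[-1] = lambda_1
    delta = (d * eig[0] - eig[-1]) / (1 - d)
    return B + delta * np.eye(d)
```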
Figures [fig:BiGLassoVsSMGM_onKSdata] and [fig:BiGLassoVsSMGM_onKPdata] show the recall $= \frac{\#\{ \widehat{\Omega}_{ij} \neq 0 ~\&~ \Omega_{ij} \neq 0\}}{\#\{ \Omega_{ij} \neq 0 \}}$ (or true-positive rate) and the precision $= \frac{\#\{ \widehat{\Omega}_{ij} \neq 0 ~\&~ \Omega_{ij} \neq 0\}}{\#\{ \widehat{\Omega}_{ij} \neq 0 \}}$ across 50 replications, to assess the $\widehat{\bOmega}$ estimates under various setups. Each box shows a particular setup that varies in block combination ($\mb{A}_1,\mb{A}_2,\mb{A}_3$), in block sizes ($n,p$), in the sample size $N$ generated from the matrix-normal and in the structure (KS or KP) used to generate the sample. Each curve in a box is the solution path of a replication in precision-recall space for a range of regularization settings $\lambda = 5^x$, for $x \in [-6,-2]$ interpolated 10 times. The blocks are arranged such that the overall density of the structured precision matrices increases from left to right.
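These scores can be computed directly from the estimated and true precision supports (a sketch; restricting to off-diagonal entries is our choice, since the diagonals are not identifiable in the Kronecker-sum parameterization):

```python
import numpy as np

def support_precision_recall(Omega_hat, Omega, tol=1e-8):
    """Precision and recall of the recovered (off-diagonal) support of Omega."""
    est = np.abs(Omega_hat) > tol
    tru = np.abs(Omega) > tol
    off = ~np.eye(Omega.shape[0], dtype=bool)     # ignore the unidentifiable diagonal
    tp = np.sum(est & tru & off)
    recall = tp / max(np.sum(tru & off), 1)       # true-positive rate
    precision = tp / max(np.sum(est & off), 1)
    return precision, recall
```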
We note that since blocks A1 and A2 have a fixed form, for such combinations each curve is a different sample from the same graph structure. Only A3 is random, so in combinations involving A3 each curve has a different random A3 and consequently a different sample. At a glance this has little effect, as most of the curves are not widely spread; it amounts to sampling from a joint distribution over A3 structures and data, with the A3 structure marginalized out.
Figure [fig:BiGLassoVsSMGM_onKSdata] also compares BiGLasso against the results of SMGM (using the Lasso penalty) on data simulated from the matrix-normal with KS structures, while Figure [fig:BiGLassoVsSMGM_onKPdata] visualizes the simulations under KP structures. @Leng:Sparse12 had also run comparisons against the MLE method of @Dutilleul:MLE99 (an unpenalized variant of SMGM), ridge-SMGM (SMGM with an $\ell_2$ penalty instead of $\ell_1$) and the GLasso of @Friedman:sparse08 (on vectorized samples from $\Normal{\mb{0}}{ \bPsi_n \otimes \bTheta_p}$, i.e. ignoring the matrix structure). The SMGM consistently outperformed all of these methods, so for brevity we compare only against the SMGM.
Judging by the empirical distributions of these solution paths (50 for each model in each box), it is no surprise that the intrinsically denser SMGM tends to have low precision (many false positives) for smaller values of $\lambda$. Conversely, BiGLasso tends to have low recall (many false negatives) due to its intrinsically sparser structure.
Block A3 is the only randomized sparse structure, whereas A1 and A2 are more “artificial”: they model an inverse AR(1) and an AR(4) respectively and yield banded precision matrices. Of interest is the observation that the largest effect of the increase in sample size $(10 \rightarrow 100)$ occurs on the A3/A3 combination (rightmost column of boxes). More precisely, in Figure [fig:BiGLassoVsSMGM_onKSdata] we note the difference from box (1,6) to (2,6) and from (3,6) to (4,6). The effect of sample size is pronounced: with a sufficiently large sample size $N$, BiGLasso starts to recover the structure exactly, while SMGM occupies lower regions in general.
In Figure [fig:BiGLassoVsSMGM_onKPdata], since the data generation process uses Kronecker-product structures, the SMGM is expected to outperform our method. Indeed, for lower-density structures the recovery rate of the SMGM is consistently better than that of BiGLasso, and recovery can be almost exact for the SMGM on the A1/A1 combination. However, as the overall density increases, the performance of the two methods becomes more balanced. Again, for combinations involving A3, larger sample sizes benefit BiGLasso more.
In summary, KP-simulated data proved harder to tackle for both methods than KS-simulated data. These simulations have shown that the BiGLasso consistently outperforms the SMGM on KS simulations, with the possibility of exact recovery for large sample sizes. On KP simulations the comparison is less clear, but the BiGLasso proves more practical for denser Kronecker-product structures and the SMGM more practical for sparser structures.
An Example from the COIL Dataset
In this section we perform a small-scale video analysis of a rotating rubber duck from the COIL dataset5. The video consists of gray-scale images, see Figure [fig:rotating_duck].
The goal is twofold: to recover the conditional dependency structure over the frames and the structure over the pixels. For simplicity, we reduced the resolution of each frame and sub-sampled the frames (at a ratio of 1:2). After vectorizing the frames (stacking their columns into $81 \times 1$ vectors) and arranging them into a design matrix $\mb{X}$, the resulting single "datapoint" that BiGLasso has to learn from is $36 \times 81$ (number of frames $\times$ vectorized frame length). Unlike our previous simulations, where we had many matrix samples, here the challenge is to learn from this single matrix ($N = 1$).
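The preprocessing amounts to a few lines (a sketch; `frames` is a hypothetical array of grey-scale COIL frames, strided downsampling stands in for whatever resizing was actually used, and the 9 × 9 target resolution matches the 81-pixel vectors above):

```python
import numpy as np

def build_design_matrix(frames, size=9, subsample=2):
    """Stack subsampled, low-resolution frames into a (#frames x #pixels) design matrix.

    frames : array of shape (num_frames, H, W), hypothetical grey-scale video.
    """
    frames = frames[::subsample]                       # keep every 2nd frame (ratio 1:2)
    h, w = frames.shape[1:]
    rows = np.linspace(0, h - 1, size).astype(int)     # crude strided downsampling
    cols = np.linspace(0, w - 1, size).astype(int)
    small = frames[:, rows][:, :, cols]                # (num_frames, size, size)
    X = small.transpose(0, 2, 1).reshape(len(small), -1)   # column-wise vectorization of each frame
    return X - X.mean(axis=0)          # centre each pixel across frames (one reading of the centring in the text)

# e.g. frames = np.stack([...])        # 72 grey-scale images of the rotating duck
# X = build_design_matrix(frames)      # -> 36 x 81 design matrix
```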
Despite the big loss in resolution, the principal component (PCA) subspace of the rotating duck remains smooth, see Figure [fig:duck_manifold]. Being a time series, the video is expected to resemble a 1D manifold, "homeomorphic" to the one recovered by PCA in Figure [fig:duck_manifold], so we applied the BiGLasso to the reduced images.
Indeed, the left panel of Figure [fig:frameNet_pixNet] shows the row-precision parameter of BiGLasso capturing a manifold-like structure in which the first and last frames join, as expected of a $360^{\circ}$ rotation. The model recovered the temporal manifold structure; in other words, we could use it to connect the dots in Figure [fig:duck_manifold] had the true order of the frames been unknown (or randomly permuted).
The right panel of Figure [fig:frameNet_pixNet] shows the conditional dependency structure over the pixels. This figure shows strong dependencies at intervals of 9, the height of a frame column (due to the column-wise ordering of the pixels), so pixels that are 9 indices apart are horizontal neighbors in a frame.
This is expected, as neighboring pixels are more likely to be conditionally dependent.
A more intuitive picture of the induced Markov network is shown in Figure [fig:pixNet_visualisation]. A Gaussian graphical model can be naturally interpreted as a system of springs, where the off-diagonal entries of the inverse-covariance represent the negative stiffness of the springs. Therefore, by the colorbar, a negative color represents an "attracting" spring between those two pixels and a positive color represents a "repulsing" spring. Naturally, in the frames network almost all non-zero elements are negative.
Conclusion
There is a need for models to accommodate the growing complexity of dependency structures. We are concerned with conditional dependencies, as encoded by the inverse-covariance of a matrix-normal density. In high-dimensional cases the induced Markov network structures can be approximated by factorisations such as the tensor product (Kronecker-product of precision matrices). In this work, we motivated a novel application of the Cartesian factorization of graphs (Kronecker-sum of precision matrices) as a more parsimonious and interpretable structure for inter-sample and inter-feature conditional dependencies. In the context of multi-output GPs, the Kronecker-product cancels any transfer (that is, it ignores any correlations) between outputs (tasks) under a block design with a noise-free covariance. This is not the case with the Kronecker-sum, due to its additive form. We introduced the bigraphical Lasso (BiGLasso), an algorithm for the simultaneous point-estimation of the structures of two Gaussian graphical models: one over the rows of a matrix-sample and the other over its columns. This was demonstrated to good effect through simulations as well as a toy example from the COIL dataset.
An obvious extension, exploiting the associativity of the Cartesian product, would be the modeling of datasets organised into three- or higher-dimensional arrays (amounting to GRFs over higher-order tensors), with dependencies across any subset of the array dimensions.
One of the appealing features of the Kronecker-sum of precision matrices is the preservation of inter-task transfer, thereby leading to potential applications on Kronecker-sums of kernels for multi-output Gaussian processes.
Finally, we feel that the literature on the Cartesian product of graphs, largely unknown to machine learning, deserves a thorough study, towards modeling and algorithmic advances in probabilistic graphical modeling.
1. Vectorization of a matrix involves converting the matrix to a vector by stacking its columns.
2. One factor for inter-task covariances and one for inter-point covariances.
3. A vector is an order-1 tensor, a matrix is an order-2 tensor, and so on.
4. In a sense, this generalizes the conventional trace operator, as $\textrm{tr}_{np}(\mb{M}) = \tr{\mb{M}}$.
5. http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php