[edit]
The Bigraphical Lasso
Abstract
Cite this Paper
Related Material
Introduction
When fitting Gaussian models to data, we usually make an independence assumption across data points and fit the covariance matrix by maximum likelihood. The number of parameters in the covariance matrix can be reduced by factor analysis like structures [see e.g. @Tipping:probpca99] or by constraining the inverse covariance (or precision) to be sparse [e.g. @Banerjee:model2008]. A sparse precision matrix defines a Gauss Markov random field relationship which is conveniently represented by a weighted undirected graph. Nodes which are not neighbors in the graph are conditionally independent given all other nodes. Models specified in this way can learn conditional independence structures between features.
An alternative Gaussian modeling approach was introduced by @Lawrence:unifying12, who showed that spectral dimensionality reduction methods have an interpretation as sparse graphical models where the independence assumption is across data features, and the parameters of the covariance are fitted by maximum likelihood (or in the case of local linear embeddings [@Roweis:lle00] by maximizing a pseudolikelihood). This assumption leads to much better determined parameters in the case where the number of features is greater than the number of data points (the so called large $p$, small $n$ case).
The choice of feature independence or data point independence is a model choice issue, but both choices are in fact a simplification of a more general framework that aims to estimate the conditional independence relationships between both features and data points. It is this type of model that we address in this paper. Specifically we want to build a sparse graph that interrelates both features and data points. For instance, we might have a data set that is a video. Here the data points are the frames of the video and the data features are the pixels in the video. Let’s assume that the ordering of the video frames and the neighborhood structure between pixels has somehow been lost. A potential learning task would be to learn both the temporal structure of the data and the spatial structure of the inter related pixels. We successfully solve this task for a simple video from the COIL data set in Section [sec:coildata].
An alternative motivating data example could be gene expression data, where we might wish to extract a genetic network from the gene expression values whilst explaining correlations between samples (such as close genetic relationships, or related experiments) with a separate network. Such data are more naturally represented by matrixvariate distributions.
Graphical Lasso and Matrix Variate Normal
The graphical lasso [GLasso, @Friedman:sparse08; @Banerjee:model2008] is a computationally efficient penalised likelihood algorithm for learning sparse structures of conditional dependencies or Gaussian Markov random fields (GMRF) over features of iid vectorvariate Gaussian samples [@Lauritzen:graphical96].
The matrix variate normal [@Dawid:MatrixVariate81; @Gupta:MatrixVariate99] is a Gaussian density which can be applied to a matrix through first taking a vectorized (vec) representation^{1} of the matrix samples $\mb{X} \in \Realspace^{n \times p}$ and assuming the covariance matrix has the form of a Kronecker product between two covariance matrices, separately associated with the rows and columns of the data. The Kronecker product assumption for the covariance implies that the precision matrix is also a Kronecker product, which is formed from the Kronecker product of the precision matrices associated with the rows and columns ($\bPsi \otimes \bTheta$).
One approach to applying sparse graphical models to matrix data is to combine the Kronecker product structured matrix variate normal with the graphical Lasso. @Dutilleul:MLE99 used a flipflop approach for maximum likelihood estimation of the parameters of the matrixnormal and much later @Zhang:Learning10 used it for MAP estimation with sparsity penalties on the precision matrices. More recently, @Leng:Sparse12 applied the SCAD penalty [@Fan:Variable01] as well as the Lasso in the likelihood function of the matrixnormal. @Tsiligkaridis:Convergence13 analyzed the convergence of Kronecker GLasso under asymptotic conditions as well as simulations that show significant speedups over GLasso and MLE.
However, whilst the Kroneckerproduct structure arises naturally when considering matrixnormals (Kroneckernormals), it is relatively dense when it comes to the dependencies it suggests between the rows. More precisely, if $\Psi_{ij}$ in $\bPsi \otimes \bTheta$ is nonzero (for example, corresponding to an edge between samples $i$ and $j$ in the design matrix $\mb{X}$) then many edges between features of sample $i$ and sample $j$ (as many as in $\bTheta$) will also be active. A sparser structure would benefit situations where the connection between a feature of some sample and a different feature of any other sample is of no interest or redundant, simply because a samefeature dependency between different samples would suffice to establish a crosssample dependency.
The Bigraphical Lasso
In this paper, we introduce the bigraphical Lasso (BiGLasso), an model for matrixvariate data that preserves their column/row structure and, like the Kronecker product based matrix normal, simultaneously learns two graphs, one over rows and one over columns of the matrix samples. The model is trained in a flipflop fashion, so the number of Lasso regressions reduces to $\mathcal{O}(n+p)$. However, the model preserves the matrix structure by using a novel Kronecker sum structure for the precision matrix, $(\bPsi \otimes \mb{I}) + (\mb{I} \otimes \bTheta)$ instead of the Kronecker product.
This structure enjoys enhanced sparsity in comparison to the conventional Kroneckerproduct structure of matrixnormals.
In the context of regression models, the Kroneckersum prevents the conditional independence between responses of multioutput Gaussian processes, a property known in various literatures as cancellation of intertask transfer or autokrigeability.
When operating on adjacency matrices, the Kroneckersum is also known in algebraic graph theory as the Cartesian product of graphs and is arguably the most prominent of graph products [@Sabidussi:graph59; @Chung:Spectral96; @Imrich:TopicsInGT08]. This endows the output of the BiGLasso with a more intuitive and interpretable graph decomposition of the induced Gaussian random field (GRF), see figure [fig:graph_products].
Enhanced Sparsity
For a matrix density $\lambda \in [0,1]$ of both precision matrices the Kroneckersum has $\mathcal{O}(\lambda n p (n+p))$ nonzeros, whereas the Kroneckerproduct has $\mathcal{O}(\lambda n^2 p^2)$ nonzeros.
Better Information Transfer
Kronecker product forms have a known weakness, referred to in the Gaussian process (GP) literature as the cancellation of intertask transfer: @Bonilla:multitask08 [§2.3] showed that the predictive mean of a multioutput GP with a noisefree Kroneckerproduct covariance^{2} and the same inputs conditioned across tasks (a conditioning structure referred to as a block design) uncouples the outputs of the different tasks, that is, the posterior factorises and thus the outputs are computed independently. The key of this proof lies in the factorisable property of the inverse Kroneckerproduct, $(\bPsi \otimes \bTheta)^{1} = \bPsi^{1} \otimes \bTheta^{1}$. This property does not apply under the presence of additive noise, hence the outputs remain coupled. This result first arose in geostatistics under the name of autokrigeability [@Wackernagel:Geostats03] and is also discussed for covariance functions by @OHagan:MarkovCovMatrices98. @Zellner:Seemingly62, @Binkley:note88 pointed out how the consideration of the correlation between regression equations leads to a gain in efficiency.
In a similar vein from econometrics, are models of seemingly unrelated regressions [SUR, @Zellner:Seemingly62], a form of general least squares that allows for a different set of regressors for each response. The problem reduces to ordinary least squares (OLS) when the same covariates are used across the outputs (block design). With a block design, OLS would pass on a potential gain in efficiency by disregarding correlations between responses. Nonetheless, the distribution of the maximumlikelihood estimators does not factorize, regardless of conditioning design. In contrast to SUR, a block design on a multioutput GP with a noisefree Kroneckerproduct covariance induces the stronger effect of conditional independence over the outputs. These two factorisations are very different and in general do not coincide.
The same property that allows for a simple flipflop approach also negates the merit of exploiting any correlations between different outputs, but by coupling them with additive noise to enable intertask transfer, flipflop is no longer straightforward. @stegle:efficient11 addressed this issue by adding iid noise to a Kroneckerproduct covariance — a lowrank factor for confounding effects and a sparseinverse factor for intersample dependencies — and exploiting identities of the $\textrm{vec(.)}$ notation for efficient computation within the matrixnormal model.
To summarize our contributions, contrary to existing approaches that use the Kroneckerproduct structure, the Kroneckersum does not cancel the intertask transfer. Our algorithm maintains the simplicity of the flipflop with a simple trick of transposing the matrixvariate (samples become features and vice versa). At the same time, the equivalent Cartesian factorization of graphs provides a more parsimonious interpretation of the induced Markov network.
The rest of this paper is structured as follows. We describe the matrixnormal model with the Kroneckersum inversecovariance in §[sec:model]. In §[sec:algorithm], we present the BiGLasso algorithm for learning the parameters of the Kroneckersum inversecovariance. We present some simulations in comparison to a recent Kroneckernormal model of @Leng:Sparse12 in §[sec:simulations] and an application to an example from the COIL dataset in §[sec:coildata]. We conclude in §[sec:conclusion].
Matrixnormal model with the Kroneckersum structure
To motivate our model, consider the case where matrixvariate data $\mb{Y}$ are sampled iid from a matrixnormal distribution (matrixGaussian). This is a natural generalisation of the Gaussian distribution towards tensor support^{3}. This distribution can be reparametrized such that the support is over vectorised representations of random matrices, \(\textrm{vec}(\mb{Y}) \sim \Normal{\mb{0}}{\bTheta^{1}_p \otimes \bPsi^{1}_n}.\)
Under the assumption that $\bSigma_p \otimes \bGamma_n$ is sparse, the SMGM estimator for the precision matrices $\bPsi_n, \bTheta_p$ can be computed iteratively by minimizing \(\begin{split} L(\bTheta_p, \bPsi_n) = \frac{1}{Nnp} \sum_{i=1}^{N} \tr{\mb{Y}_i \bTheta_p \mb{Y}^\top_i \bPsi_n}  \frac{1}{n} \log \verts{\bPsi_n} \\  \frac{1}{p} \log \verts{\bTheta_p} + \lambda_1  \bPsi_n _1 + \lambda_2  \bTheta_p _1, \end{split}\) where $\mb{Y}_i$ is the $i$th matrix sample, $N$ is the sample size and $\lambda_1, \lambda_2$ the regularization parameters.
Let $\mb{Y} \in \Realspace^{n \times p}$ be a random matrix. If its rows are generated as iid samples from $\Normal{\mb{0}}{ \bSigma_p}$, then the sampling distribution of the sufficient statistic $\mb{Y}^{\top}\mb{Y}$ is $Wishart (n, \bSigma_p)$ with $n$ degrees of freedom and scale matrix $\bSigma_p$. Similarly, if the columns are generated as iid samples from $\Normal{\mb{0}}{ \bGamma_n}$, then the sampling distribution is $Wishart(p, \bGamma_n)$.
In a maximum entropy point of view we can constraint these secondorder moments in a model both for the features and the datapoints of a design matrix. To do so, we combine these sufficient statistics in a model for the entire matrix $\mb{Y}$ as \(p(\mb{Y}) ~ \propto ~ \expo{ \tr{\bPsi_n \mb{Y}\mb{Y}^\top} \tr{\bTheta_p \mb{Y}^\top\mb{Y}}} ~,\) where $\bPsi_n \in \Realspace^{n \times n}$ and $\bTheta_p \in \Realspace^{p \times p}$ are positive definite matrices. This is equivalent to a joint factorized Gaussian distribution (see §3 in supplementary material) for the $n \times p$ entries of $\mb{Y}$, with a precision matrix of the form \(\bOmega ~\triangleq~ \bPsi_n \oplus \bTheta_p ~=~ \bPsi_n \otimes \mb{I}_p + \mb{I}_n \otimes \bTheta_p ~,\) where $\otimes$ is the Kroneckerproduct and $\oplus$ the Kroneckersum operator. Thus, \(\omega_{ij, kl} ~=~ \psi_{i,k} \delta_{j,l} + \delta_{i,k} \theta_{j,l} ~,\) for $i,k \in {1,\dots,n}$ and $j, l \in {1,\dots,p}$. As an immediate benefit of this parameterization, while the full covariance matrix has $\mathcal{O}(n^2 p^2)$ entries, these are governed in our model by only $\mathcal{O}(n^2 + p^2)$ parameters.
Given data in the form of some design matrix $\mb{Y}$, the BiGLasso estimates sparse matrices by putting $\ell_1$ penalties on $\bTheta_p$ and $\bPsi_n$. The convex optimization problem is \(\label{equ:biglasso_problem} \begin{split} \underset{\bTheta_p, \bPsi_n}{\textrm{min}} \Big\{ & n~\tr{\bTheta_p \mb{S}} + p~\tr{\bPsi_n \mb{T}}  \ln \verts{\bPsi_n \oplus \bTheta_p} \\ &+ \lambda \norm{\bTheta_p}_1 + \gamma \norm{\bPsi_n}_1 \Big\}~, \end{split}\) \(\label{equ:empiricalCovs} \textrm{where} \quad \mb{S} \triangleq \tfrac{1}{n}\mb{Y}^\top\mb{Y} \quad \textrm{and} \quad \mb{T} \triangleq \tfrac{1}{p}\mb{Y}\mb{Y}^\top\) are empirical covariances across the samples and features respectively. A solution simultaneously estimates two graphs – one over the columns of $\mb{Y}$, corresponding to the sparsity pattern of $\bTheta_p$, and another over the rows of $\mb{Y}$, corresponding to the sparsity pattern of $\bPsi_n$. Note that since $\omega_{ii,jj} = \psi_{ii} + \theta_{jj}$, the diagonals of $\bTheta_p$ and $\bPsi_n$ are not identifiable (though we could restrict the inverses to correlation matrices). However, this does not affect the estimation of the graph structure (locations of zeros).
A penalized likelihood algorithm for the BiGLasso
A note on notation
If $\mb{M}$ is an $np \times np$ matrix written in terms of $p \times p$ blocks, as \(\mb{M} = \mat{ \mb{M}_{11} &\dots &\mb{M}_{1n} \\ \vdots &\ddots &\vdots \\ \mb{M}_{n1} &\dots &\mb{M}_{nn} } ~,\) then $\textrm{tr}_p(\mb{M})$ is the $n \times n$ matrix of traces of such blocks^{4}: \(\textrm{tr}_p(\mb{M}) = \mat{ \tr{\mb{M}_{11}} &\dots &\tr{\mb{M}_{1n}} \\ \vdots &\ddots &\vdots \\ \tr{\mb{M}_{n1}} &\dots &\tr{\mb{M}_{nn}} } ~.\) We alternate between optimizing over $\bPsi_n$ while holding $\bTheta_p$ fixed and optimizing over $\bTheta_p$ while holding $\bPsi_n$ fixed. First we consider the case where there is no regularization. From , the first step of the optimization problem is reduced to \(\label{equ:biglasso_minPsi} \underset{\bPsi_n}{\textrm{min}} \Big\{ p~\tr{\bPsi_n \mb{T}} ~~ \ln \verts{\bPsi_n \oplus \bTheta_p} \Big\} ~.\)
Section 2 in the supplementary material shows how to take the gradient of with respect to $\bPsi_n$. Combining (5) and (6) of the supplementary material we obtain the stationary point: \(\mb{T}  \tfrac{1}{2p} \mb{T} \circ \mb{I} = \tfrac{1}{p}\textrm{tr}_p (\mb{W})  \tfrac{1}{2p}\textrm{tr}_p (\mb{W}) \circ \mb{I} ~,\) where we define $\mb{W} \triangleq (\bPsi_n \oplus \bTheta_p)^{1}$. We partition $\mb{V} \triangleq \frac{1}{p}~\textrm{tr}p (\mb{W})$ as \(\mb{V} = \mat{v_{11} &\mb{v}^\top_{1\sm1} \\ \mb{v}_{1\sm1} & \mb{V}_{\sm1 \sm1}} ~,\) where $\mb{v}{1\sm1}$ is a vector of size $n1$ and $\mb{V}{\sm 1 \sm 1}$ is a $(n1) \times (n1)$ matrix. Despite the complex form of the stationarity condition, only the lowerleft block of its partition will be of use: \(\begin{aligned} \label{equ:stationary_point} \nonumber \mb{t}_{1\sm1} &= \tfrac{1}{p}\textrm{tr}_p (\mb{W}_{1\sm1}) = \mb{v}_{1\sm1}, ~\textrm{and also from \eqref{equ:empiricalCovs}}, \\ \quad \mb{t}_{1\sm1} &= (t_{21}, \dots,t_{n1})^\top = \tfrac{1}{p}(\mb{y}_2^\top \mb{y}_1,\, \dots, \mb{y}_n^\top \mb{y}_1)^\top . \end{aligned}\) Similarly, we partition $\mb{W}$ into blocks: \(\mb{W} = \mat{\mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1}}~,\) where $\mb{W}{11}$ is a $p \times p$ matrix and $\mb{W}{1\sm1}$ is a $p(n1) \times p$ matrix. Then from the bottomleft block of \(\label{equ:WOmega} \begin{split} &\mb{W}\bOmega = \\ &\mat{ \mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1} } \mat{ \psi_{11}\mb{I}_p + \bTheta_p &\dots &\psi_{in}\mb{I}_p \\ \vdots &\ddots &\vdots \\ \psi_{n1}\mb{I}_p &\dots &\psi_{nn}\mb{I}_p + \bTheta_p } \\ &= \mb{I}_{n} \otimes \mb{I}_p ~, \end{split}\) we get \(\nonumber \mb{W}_{1\sm1} (\psi_{11}\mb{I}_p + \bTheta_p) + \mb{W}_{\sm1\sm1} (\bpsi_{1\sm1} \otimes \mb{I}_p) = \mb{0}_{n1} \otimes \mb{I}_p\) \(\label{equ:W1_not1} \mb{W}_{1\sm1} + \mb{W}_{\sm1\sm1}\mat{ (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{21} \\ \vdots \\ (\psi_{11}\mb{I}_p + \bTheta_p)^{1}\psi_{n1}} = \mb{0}_{n1} \otimes \mb{I}_p\) \(\begin{split} \nonumber \mb{W}_{1\sm1} ~+~ &\mb{W}_{2\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{21} ~+\dots~\\ ~\dots+~ &\mb{W}_{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{n1} = \mb{0}_{n1} \otimes \mb{I}_p ~, \end{split}\) with $\mb{0}{n1}$ as the vector of $n1$ zeros. According to the stationary point in , taking the blockwise trace $\textrm{tr}p(.)$ of both sides, gives the equation: \(p ~ \mb{t}_{1\sm1} + \mb{A}_{\sm 1 \sm1} \bpsi_{1\sm1} = \mb{0}_{n1},\quad \textrm{where}\\\) $\mb{A}^\top{\sm 1 \sm1} \triangleq \mat{ \textrm{tr}p \left{ \mb{W}{2\sm1} (\psi_{11}\mb{I}p + \bTheta_p)^{1} \right}^\top \ \vdots \ \textrm{tr}_p \left{ \mb{W}{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \right}^\top }$.
By imposing an $\ell_1$ penalty on $\bpsi_{1\sm1}$, this problem reduces to a Lasso regression.
After estimating $\psi_{1\sm1}$, we compute $\mb{W}{1\sm1}$ by substituting into . It remains to compute $\mb{W}{11}$. This follows from , which gives \(\mb{W}_{11} = (\mb{I}  \mb{W}_{1\sm1}^\top (\bpsi_{1\sm1} \otimes \mb{I})) (\psi_{11}\mb{I} + \bTheta_p)^{1} ~.\) This algorithm iteratively estimates columns of $\bPsi_n$ and $\mb{W}$ in this manner. The procedure for estimating $\bTheta_p$, for fixed $\bPsi_n$, becomes directly parallel to the above simply by transposing the design matrix (samples become features and viceversa) and applying the algorithm. Algorithm 1 outlines the BiGLasso.
[!htbp]
Input: $\mb{Y}, \lambda, \gamma$ and initial estimates of $\bPsi_n$ and $\bTheta_p$ $\mb{T} \leftarrow p^{1}\mb{Y}\mb{Y}^\top$ # Estimate $\bPsi_n$ : Partition $\bPsi_n$ into $\psi_{ii}, \bpsi_{i \sm i}$ and $\bPsi_{\sm i \sm i}$. Find a sparse solution of $~ p ~ \mb{t}{i \sm i} + \mb{A}{\sm i \sm i} \bpsi_{i \sm i} = \mb{0}{n1}$ with *Lasso* regression. Substitute $\bpsi{i \sm i}$ into to compute $\mb{W}{i \sm i}$. $\mb{W}{ii} \leftarrow \left( \mb{I}  \mb{W}{i \sm i}^\top (\bpsi{i \sm i} \otimes \mb{I}) \right) (\psi_{ii}\mb{I} + \bTheta_p)^{1}$ # Estimate $\bTheta_p$ : Proceed as if estimating $\bPsi_n$ with input $\mb{Y}^\top, \lambda, \gamma$.
In our experiments we treat $\lambda$ and $\gamma$ as the same parameter and the precision matrices $\bPsi_n$ and $\bTheta_p$ are initialized as identity matrices. The empirical mean matrix is removed from each dataset.
Simulations
[ht]
[ht]
To empirically assess the efficiency of BiGLasso, we generate the datasets described below from centered Gaussians with Kroneckerproduct (KP) and Kroneckersum (KS) precision matrices. We run the BiGLasso and SMGM (Sparse Matrix Graphical Model) of @Leng:Sparse12, a flipflop extension of GLasso for Kroneckerproduct matrixnormals using the $\ell_1$ penalty. More precisely, one iteration of SMGM consists of a GLasso iteration for fitting the rowsprecision matrix with a fixed columnprecision and another GLasso iteration for fitting the columnsprecision matrix with a fixed rowsprecision. The $\bTheta_p$ and $\bPsi_n$ precision matrices in both cases are generated in accordance to [§4, @Leng:Sparse12]; namely, as either of the following $d \times d$ blocks ($d$ being either $p$ or $n$) of increasing density:

$\mb{A}1$: Inverse AR(1) (autoregressive process) such that $\mb{A}_1 = \mb{B}^{1}$ with $B{ij} = 0.7^{ij}$.

$\mb{A}2$: AR(4) with $A{ij} = I(ij=0) ~+~ 0.4I(ij=1) ~+~ 0.2I(ij=2) ~+~ 0.2I(ij=3) ~+~ 0.1I(ij=4)$, $I(.)$ being the indicator function.

$\mb{A}3 = \mb{B} + \delta\mb{I}$, where for each $B{ij} = B_{ji}, i\neq j$, $P(B_{ij}=0.5) = 0.9$ and $P(B_{ij}=0) = 0.1$. The diagonal is zero and $\delta$ is chosen such that the condition number of $\mb{A}_3$ is d. Since the condition number is $k(\mb{A}_3) = d = \frac{\lambda_1 + \delta}{\lambda_d + \delta}$, the ratio of largesttosmallest eigenvalue, then $\delta = \frac{d \lambda_d  \lambda_1}{1  d}$.
Figures [fig:BiGLassoVsSMGM_onKSdata], [fig:BiGLassoVsSMGM_onKPdata] show the recall $=\frac{# { \widehat{\Omega}{ij} \neq 0 ~\&~ \Omega{ij} \neq 0}}{# { \Omega_{ij} \neq 0 }}$ (or truepositive rate) and precision $ = \frac{# { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 0}}{# { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 0} + # { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 1}}$ across 50 replications to assess the $\widehat{\bOmega}$ estimates under various setups. Each box shows a particular setup that varies in block combination ($\mb{A}_1,\mb{A}_2,\mb{A}_3$), in block sizes ($n,p$), in sample size N generated from the matrixnormal and by the structure used (KS or KP) to generate the sample. Each curve in a box is the solutionpath of a replication in precisionrecall space for a range of regularization settings $\lambda = 5^x$, for $x \in [6,2]$ interpolated 10 times. The blocks are arranged such that the overall density of the structured precision matrices increases from left to right.
We note that since blocks A1,A2 have a fixed form, for such combinations each curve is a different sample from the same graph structure. Only A3 is random so in combinations involving A3, each curve has a different random A3 and consequently generates a different sample. At a glance this has little effect, as most of the curves are not too spread and amounts to sampling from a joint distribution over A3structures and data where the A3structure is marginalized out.
Figures [fig:BiGLassoVsSMGM_onKSdata] and [fig:BiGLassoVsSMGM_onKPdata] also compare against the results of SMGM (using the Lasso penalty) on data simulated from the matrixnormal with KS structures. @Leng:Sparse12 had also ran comparisons against the MLE method of @Dutilleul:MLE99 (an unpenalized variant of SMGM), ridgeSMGM (SMGM with an $\ell_2$ penalty instead of $\ell_1$) and the GLasso of @Friedman:sparse08 (on vectorized samples from $\Normal{\mb{0}}{ \bPsi_n \otimes \bTheta_p}$, i.e. ignoring the matrix structure). They consistently outperformed all of these methods, so for brevity we compare only against the SMGM. Similarly, Figure [fig:BiGLassoVsSMGM_onKPdata] visualizes the simulations under KP structures.
By the empirical distributions of these solutionpaths (50 for each model in each box), it is no surprise that the intrinsically denser SMGM tends to have low precision (many falsepositives) for smaller values of $\lambda$. On the contrary, BiGLasso tends to have low recall (many falsenegatives) due to its intrinsically sparser structure.
Block A3 is the only randomized sparse structure whereas A1 and A2 are more “artificial” as they respectively model an inverseAR(1) and AR(4) and they yield banded precision matrices. Of interest is the observation that the largest effect of the increase in sample size $(10 \rightarrow 100)$ seems to occur on the A3/A3 combination (right end column of boxes). More precisely in Figure [fig:BiGLassoVsSMGM_onKSdata], we note the difference from box (1,6) to (2,6) and from (3,6) to (4,6). The sample size is very effective: with sufficiently large sample size N, BiGLasso starts to recover exactly and SMGM occupies lower regions in general.
In Figure [fig:BiGLassoVsSMGM_onKPdata], since the data generation process uses Kroneckerproduct structures, the SMGM is expected to outperform our method. Indeed for lowerdensity structure, the recovery rate of the SMGM seems consistently better than BiGLasso. and recovery can be almost exact for the SMGM for combination A1/A1. However, as the overall density increases, the performance of BiGLasso is balanced. Again, for combinations involving A3, larger sample sizes benefit BiGLasso more.
In summary, KPsimulated data proved harder to tackle for both methods than KSgenerated data. These simulations have shown that the BigLasso consistently outperforms the SMGM on KSsimulations, with the possibility of exact recovery on large sample sizes. On KPsimulations the comparison is less clear, but the BiGLasso proves more practical for denser Kroneckerproduct structures and the SMGM more practical for sparser structures.
An Example from the COIL Dataset
In this section we perform a minor video analysis of a rotating rubber duck from the COIL dataset^{5}. The video consists of grayscaled images, see Figure [fig:rotating_duck].
[!htbp]
The goal is on two fronts: to recover the conditional dependency structure over the frames and the structure over the pixels. For simplicity, we reduced the resolution of each frame and subsampled the frames (at a ratio 1:2). After vectorizing the frames (stacking their columns into $81 \times 1$ vectors) and arranging them into a design matrix $\mb{X}$, the resulting single ``datapoint” that BiGLasso has to learn from is $36\times 81$ (#frames $\times$ vectorized frame length). Unlike our previous simulations where we had many matrixsamples, here the challenge is to learn from this single matrix $(N = 1)$.
Despite the big loss in resolution, the principal component (PCA) subspace of the rotating duck seems to remain smooth, see Figure [fig:duck_manifold]. Being a timeseries, the video is expected to resemble a 1D manifold, ``homeomorphic” to the one recovered by PCA shown in figure [fig:duck_manifold], so we applied the BiGLasso on the reduced images.
\
Indeed, the left panel of figure [fig:frameNet_pixNet] shows the rowprecision parameter of BiGLasso capturing a manifoldlike structure where the first and last frames join, as expected of a $360^{\circ}$ rotation. The model recovered the temporal manifold structure, or in other words, we could use it to connect the dots in Figure [fig:duck_manifold] in case the true order of the frames was unknown (or randomly given to us).
The right panel of Figure [fig:frameNet_pixNet] shows the conditional dependency structure over the pixels. This figure shows strong dependencies at intervals of 9 — that is, roughly in line with the size of a frame (due to the columnwise ordering of the pixels).
This is expected, as neighboring pixels are more likely to be conditionally dependent.
A more intuitive picture of the induced Markov network is shown in Figure [fig:pixNet_visualisation]. A Gaussian graphical model can be naturally interpreted as a system of springs, where the offdiagonal entries of the inversecovariance represent the negative stiffness of the springs. Therefore by the colorbar, a negativecolor represents an ``attracting” spring between those two pixels and a positivecolour represents a ``repulsing” spring. Naturally, in the frames network almost all nonzero elements are negative.
Conclusion
There is a need for models to accommodate the growing complexity of dependency structures. We are concerned with conditional dependencies, as encoded by the inversecovariance of a matrixnormal density. In highdimensional cases the Markov network structures induced by a graph could be approximated by factorisations such as the tensorproduct (Kroneckerproduct of precision matrices). In this work, we motivated a novel application of the Cartesian factorization of graphs (Kroneckersum of precision matrices), as a more parsimonious and interpretable structure for intersample and interfeature conditional dependencies. In the context of multioutput GPs, the Kroneckerproduct cancels any transfer (that is, ignoring any correlations) between outputs (tasks) when a block design with a noisefree covariance. This is not the case with the Kroneckersum due to its additive form. We introduced the bigraphical Lasso (BiGLasso), an algorithm for the simultaneous pointestimation of the structures of two Gaussian graphical models: one over the rows of a matrixsample and the other over its columns. This was demonstrated to good effect through simulations as well as a toy example from the COIL dataset.
An obvious extension that would exploit the associativity of the Cartesian product, would be the modeling of datasets organised into 3 or higherdimensional arrays (amounting to GRFs over higherorder tensors) with dependencies across any subset of the array dimensions.
One of the appealing features of the Kroneckersum of precision matrices is the preservation of intertask transfer, thereby leading to potential applications on Kroneckersums of kernels for multioutput Gaussian processes.
Finally we feel that the — largely unknown to machine learning — literature on the Cartesian product of graphs deserves a thorough study, towards modeling and algorithmic advances in probabilistic graphical modeling.

Vectorization of a matrix involves converting the matrix to a vector by stacking the columns of the matrix. ↩

One factor for intertask one for interpoint covariances. ↩

A vector is an order1 tensor, a matrix is an order2 tensor and so on. ↩

In a sense, this generalizes the conventional trace operator as $\textrm{tr}_{np}(\mb{M}) = \tr{\mb{M}}$. ↩

http://www.cs.columbia.edu/CAVE/software/softlib/coil20.php ↩
Introduction
When fitting Gaussian models to data, we usually make an independence assumption across data points and fit the covariance matrix by maximum likelihood. The number of parameters in the covariance matrix can be reduced by factor analysis like structures [see e.g. @Tipping:probpca99] or by constraining the inverse covariance (or precision) to be sparse [e.g. @Banerjee:model2008]. A sparse precision matrix defines a Gauss Markov random field relationship which is conveniently represented by a weighted undirected graph. Nodes which are not neighbors in the graph are conditionally independent given all other nodes. Models specified in this way can learn conditional independence structures between features.
An alternative Gaussian modeling approach was introduced by @Lawrence:unifying12, who showed that spectral dimensionality reduction methods have an interpretation as sparse graphical models where the independence assumption is across data features, and the parameters of the covariance are fitted by maximum likelihood (or in the case of local linear embeddings [@Roweis:lle00] by maximizing a pseudolikelihood). This assumption leads to much better determined parameters in the case where the number of features is greater than the number of data points (the so called large $p$, small $n$ case).
The choice of feature independence or data point independence is a model choice issue, but both choices are in fact a simplification of a more general framework that aims to estimate the conditional independence relationships between both features and data points. It is this type of model that we address in this paper. Specifically we want to build a sparse graph that interrelates both features and data points. For instance, we might have a data set that is a video. Here the data points are the frames of the video and the data features are the pixels in the video. Let’s assume that the ordering of the video frames and the neighborhood structure between pixels has somehow been lost. A potential learning task would be to learn both the temporal structure of the data and the spatial structure of the inter related pixels. We successfully solve this task for a simple video from the COIL data set in Section [sec:coildata].
An alternative motivating data example could be gene expression data, where we might wish to extract a genetic network from the gene expression values whilst explaining correlations between samples (such as close genetic relationships, or related experiments) with a separate network. Such data are more naturally represented by matrixvariate distributions.
Graphical Lasso and Matrix Variate Normal
The graphical lasso [GLasso, @Friedman:sparse08; @Banerjee:model2008] is a computationally efficient penalised likelihood algorithm for learning sparse structures of conditional dependencies or Gaussian Markov random fields (GMRF) over features of iid vectorvariate Gaussian samples [@Lauritzen:graphical96].
The matrix variate normal [@Dawid:MatrixVariate81; @Gupta:MatrixVariate99] is a Gaussian density which can be applied to a matrix through first taking a vectorized (vec) representation^{1} of the matrix samples $\mb{X} \in \Realspace^{n \times p}$ and assuming the covariance matrix has the form of a Kronecker product between two covariance matrices, separately associated with the rows and columns of the data. The Kronecker product assumption for the covariance implies that the precision matrix is also a Kronecker product, which is formed from the Kronecker product of the precision matrices associated with the rows and columns ($\bPsi \otimes \bTheta$).
One approach to applying sparse graphical models to matrix data is to combine the Kronecker product structured matrix variate normal with the graphical Lasso. @Dutilleul:MLE99 used a flipflop approach for maximum likelihood estimation of the parameters of the matrixnormal and much later @Zhang:Learning10 used it for MAP estimation with sparsity penalties on the precision matrices. More recently, @Leng:Sparse12 applied the SCAD penalty [@Fan:Variable01] as well as the Lasso in the likelihood function of the matrixnormal. @Tsiligkaridis:Convergence13 analyzed the convergence of Kronecker GLasso under asymptotic conditions as well as simulations that show significant speedups over GLasso and MLE.
However, whilst the Kroneckerproduct structure arises naturally when considering matrixnormals (Kroneckernormals), it is relatively dense when it comes to the dependencies it suggests between the rows. More precisely, if $\Psi_{ij}$ in $\bPsi \otimes \bTheta$ is nonzero (for example, corresponding to an edge between samples $i$ and $j$ in the design matrix $\mb{X}$) then many edges between features of sample $i$ and sample $j$ (as many as in $\bTheta$) will also be active. A sparser structure would benefit situations where the connection between a feature of some sample and a different feature of any other sample is of no interest or redundant, simply because a samefeature dependency between different samples would suffice to establish a crosssample dependency.
The Bigraphical Lasso
In this paper, we introduce the bigraphical Lasso (BiGLasso), an model for matrixvariate data that preserves their column/row structure and, like the Kronecker product based matrix normal, simultaneously learns two graphs, one over rows and one over columns of the matrix samples. The model is trained in a flipflop fashion, so the number of Lasso regressions reduces to $\mathcal{O}(n+p)$. However, the model preserves the matrix structure by using a novel Kronecker sum structure for the precision matrix, $(\bPsi \otimes \mb{I}) + (\mb{I} \otimes \bTheta)$ instead of the Kronecker product.
This structure enjoys enhanced sparsity in comparison to the conventional Kroneckerproduct structure of matrixnormals.
In the context of regression models, the Kroneckersum prevents the conditional independence between responses of multioutput Gaussian processes, a property known in various literatures as cancellation of intertask transfer or autokrigeability.
When operating on adjacency matrices, the Kroneckersum is also known in algebraic graph theory as the Cartesian product of graphs and is arguably the most prominent of graph products [@Sabidussi:graph59; @Chung:Spectral96; @Imrich:TopicsInGT08]. This endows the output of the BiGLasso with a more intuitive and interpretable graph decomposition of the induced Gaussian random field (GRF), see figure [fig:graph_products].
Enhanced Sparsity
For a matrix density $\lambda \in [0,1]$ of both precision matrices the Kroneckersum has $\mathcal{O}(\lambda n p (n+p))$ nonzeros, whereas the Kroneckerproduct has $\mathcal{O}(\lambda n^2 p^2)$ nonzeros.
Better Information Transfer
Kronecker product forms have a known weakness, referred to in the Gaussian process (GP) literature as the cancellation of intertask transfer: @Bonilla:multitask08 [§2.3] showed that the predictive mean of a multioutput GP with a noisefree Kroneckerproduct covariance^{2} and the same inputs conditioned across tasks (a conditioning structure referred to as a block design) uncouples the outputs of the different tasks, that is, the posterior factorises and thus the outputs are computed independently. The key of this proof lies in the factorisable property of the inverse Kroneckerproduct, $(\bPsi \otimes \bTheta)^{1} = \bPsi^{1} \otimes \bTheta^{1}$. This property does not apply under the presence of additive noise, hence the outputs remain coupled. This result first arose in geostatistics under the name of autokrigeability [@Wackernagel:Geostats03] and is also discussed for covariance functions by @OHagan:MarkovCovMatrices98. @Zellner:Seemingly62, @Binkley:note88 pointed out how the consideration of the correlation between regression equations leads to a gain in efficiency.
In a similar vein from econometrics, are models of seemingly unrelated regressions [SUR, @Zellner:Seemingly62], a form of general least squares that allows for a different set of regressors for each response. The problem reduces to ordinary least squares (OLS) when the same covariates are used across the outputs (block design). With a block design, OLS would pass on a potential gain in efficiency by disregarding correlations between responses. Nonetheless, the distribution of the maximumlikelihood estimators does not factorize, regardless of conditioning design. In contrast to SUR, a block design on a multioutput GP with a noisefree Kroneckerproduct covariance induces the stronger effect of conditional independence over the outputs. These two factorisations are very different and in general do not coincide.
The same property that allows for a simple flipflop approach also negates the merit of exploiting any correlations between different outputs, but by coupling them with additive noise to enable intertask transfer, flipflop is no longer straightforward. @stegle:efficient11 addressed this issue by adding iid noise to a Kroneckerproduct covariance — a lowrank factor for confounding effects and a sparseinverse factor for intersample dependencies — and exploiting identities of the $\textrm{vec(.)}$ notation for efficient computation within the matrixnormal model.
To summarize our contributions, contrary to existing approaches that use the Kroneckerproduct structure, the Kroneckersum does not cancel the intertask transfer. Our algorithm maintains the simplicity of the flipflop with a simple trick of transposing the matrixvariate (samples become features and vice versa). At the same time, the equivalent Cartesian factorization of graphs provides a more parsimonious interpretation of the induced Markov network.
The rest of this paper is structured as follows. We describe the matrixnormal model with the Kroneckersum inversecovariance in §[sec:model]. In §[sec:algorithm], we present the BiGLasso algorithm for learning the parameters of the Kroneckersum inversecovariance. We present some simulations in comparison to a recent Kroneckernormal model of @Leng:Sparse12 in §[sec:simulations] and an application to an example from the COIL dataset in §[sec:coildata]. We conclude in §[sec:conclusion].
Matrixnormal model with the Kroneckersum structure
To motivate our model, consider the case where matrixvariate data $\mb{Y}$ are sampled iid from a matrixnormal distribution (matrixGaussian). This is a natural generalisation of the Gaussian distribution towards tensor support^{3}. This distribution can be reparametrized such that the support is over vectorised representations of random matrices, \(\textrm{vec}(\mb{Y}) \sim \Normal{\mb{0}}{\bTheta^{1}_p \otimes \bPsi^{1}_n}.\)
Under the assumption that $\bSigma_p \otimes \bGamma_n$ is sparse, the SMGM estimator for the precision matrices $\bPsi_n, \bTheta_p$ can be computed iteratively by minimizing \(\begin{split} L(\bTheta_p, \bPsi_n) = \frac{1}{Nnp} \sum_{i=1}^{N} \tr{\mb{Y}_i \bTheta_p \mb{Y}^\top_i \bPsi_n}  \frac{1}{n} \log \verts{\bPsi_n} \\  \frac{1}{p} \log \verts{\bTheta_p} + \lambda_1  \bPsi_n _1 + \lambda_2  \bTheta_p _1, \end{split}\) where $\mb{Y}_i$ is the $i$th matrix sample, $N$ is the sample size and $\lambda_1, \lambda_2$ the regularization parameters.
Let $\mb{Y} \in \Realspace^{n \times p}$ be a random matrix. If its rows are generated as iid samples from $\Normal{\mb{0}}{ \bSigma_p}$, then the sampling distribution of the sufficient statistic $\mb{Y}^{\top}\mb{Y}$ is $Wishart (n, \bSigma_p)$ with $n$ degrees of freedom and scale matrix $\bSigma_p$. Similarly, if the columns are generated as iid samples from $\Normal{\mb{0}}{ \bGamma_n}$, then the sampling distribution is $Wishart(p, \bGamma_n)$.
In a maximum entropy point of view we can constraint these secondorder moments in a model both for the features and the datapoints of a design matrix. To do so, we combine these sufficient statistics in a model for the entire matrix $\mb{Y}$ as \(p(\mb{Y}) ~ \propto ~ \expo{ \tr{\bPsi_n \mb{Y}\mb{Y}^\top} \tr{\bTheta_p \mb{Y}^\top\mb{Y}}} ~,\) where $\bPsi_n \in \Realspace^{n \times n}$ and $\bTheta_p \in \Realspace^{p \times p}$ are positive definite matrices. This is equivalent to a joint factorized Gaussian distribution (see §3 in supplementary material) for the $n \times p$ entries of $\mb{Y}$, with a precision matrix of the form \(\bOmega ~\triangleq~ \bPsi_n \oplus \bTheta_p ~=~ \bPsi_n \otimes \mb{I}_p + \mb{I}_n \otimes \bTheta_p ~,\) where $\otimes$ is the Kroneckerproduct and $\oplus$ the Kroneckersum operator. Thus, \(\omega_{ij, kl} ~=~ \psi_{i,k} \delta_{j,l} + \delta_{i,k} \theta_{j,l} ~,\) for $i,k \in {1,\dots,n}$ and $j, l \in {1,\dots,p}$. As an immediate benefit of this parameterization, while the full covariance matrix has $\mathcal{O}(n^2 p^2)$ entries, these are governed in our model by only $\mathcal{O}(n^2 + p^2)$ parameters.
Given data in the form of some design matrix $\mb{Y}$, the BiGLasso estimates sparse matrices by putting $\ell_1$ penalties on $\bTheta_p$ and $\bPsi_n$. The convex optimization problem is \(\label{equ:biglasso_problem} \begin{split} \underset{\bTheta_p, \bPsi_n}{\textrm{min}} \Big\{ & n~\tr{\bTheta_p \mb{S}} + p~\tr{\bPsi_n \mb{T}}  \ln \verts{\bPsi_n \oplus \bTheta_p} \\ &+ \lambda \norm{\bTheta_p}_1 + \gamma \norm{\bPsi_n}_1 \Big\}~, \end{split}\) \(\label{equ:empiricalCovs} \textrm{where} \quad \mb{S} \triangleq \tfrac{1}{n}\mb{Y}^\top\mb{Y} \quad \textrm{and} \quad \mb{T} \triangleq \tfrac{1}{p}\mb{Y}\mb{Y}^\top\) are empirical covariances across the samples and features respectively. A solution simultaneously estimates two graphs – one over the columns of $\mb{Y}$, corresponding to the sparsity pattern of $\bTheta_p$, and another over the rows of $\mb{Y}$, corresponding to the sparsity pattern of $\bPsi_n$. Note that since $\omega_{ii,jj} = \psi_{ii} + \theta_{jj}$, the diagonals of $\bTheta_p$ and $\bPsi_n$ are not identifiable (though we could restrict the inverses to correlation matrices). However, this does not affect the estimation of the graph structure (locations of zeros).
A penalized likelihood algorithm for the BiGLasso
A note on notation
If $\mb{M}$ is an $np \times np$ matrix written in terms of $p \times p$ blocks, as \(\mb{M} = \mat{ \mb{M}_{11} &\dots &\mb{M}_{1n} \\ \vdots &\ddots &\vdots \\ \mb{M}_{n1} &\dots &\mb{M}_{nn} } ~,\) then $\textrm{tr}_p(\mb{M})$ is the $n \times n$ matrix of traces of such blocks^{4}: \(\textrm{tr}_p(\mb{M}) = \mat{ \tr{\mb{M}_{11}} &\dots &\tr{\mb{M}_{1n}} \\ \vdots &\ddots &\vdots \\ \tr{\mb{M}_{n1}} &\dots &\tr{\mb{M}_{nn}} } ~.\) We alternate between optimizing over $\bPsi_n$ while holding $\bTheta_p$ fixed and optimizing over $\bTheta_p$ while holding $\bPsi_n$ fixed. First we consider the case where there is no regularization. From , the first step of the optimization problem is reduced to \(\label{equ:biglasso_minPsi} \underset{\bPsi_n}{\textrm{min}} \Big\{ p~\tr{\bPsi_n \mb{T}} ~~ \ln \verts{\bPsi_n \oplus \bTheta_p} \Big\} ~.\)
Section 2 in the supplementary material shows how to take the gradient of with respect to $\bPsi_n$. Combining (5) and (6) of the supplementary material we obtain the stationary point: \(\mb{T}  \tfrac{1}{2p} \mb{T} \circ \mb{I} = \tfrac{1}{p}\textrm{tr}_p (\mb{W})  \tfrac{1}{2p}\textrm{tr}_p (\mb{W}) \circ \mb{I} ~,\) where we define $\mb{W} \triangleq (\bPsi_n \oplus \bTheta_p)^{1}$. We partition $\mb{V} \triangleq \frac{1}{p}~\textrm{tr}p (\mb{W})$ as \(\mb{V} = \mat{v_{11} &\mb{v}^\top_{1\sm1} \\ \mb{v}_{1\sm1} & \mb{V}_{\sm1 \sm1}} ~,\) where $\mb{v}{1\sm1}$ is a vector of size $n1$ and $\mb{V}{\sm 1 \sm 1}$ is a $(n1) \times (n1)$ matrix. Despite the complex form of the stationarity condition, only the lowerleft block of its partition will be of use: \(\begin{aligned} \label{equ:stationary_point} \nonumber \mb{t}_{1\sm1} &= \tfrac{1}{p}\textrm{tr}_p (\mb{W}_{1\sm1}) = \mb{v}_{1\sm1}, ~\textrm{and also from \eqref{equ:empiricalCovs}}, \\ \quad \mb{t}_{1\sm1} &= (t_{21}, \dots,t_{n1})^\top = \tfrac{1}{p}(\mb{y}_2^\top \mb{y}_1,\, \dots, \mb{y}_n^\top \mb{y}_1)^\top . \end{aligned}\) Similarly, we partition $\mb{W}$ into blocks: \(\mb{W} = \mat{\mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1}}~,\) where $\mb{W}{11}$ is a $p \times p$ matrix and $\mb{W}{1\sm1}$ is a $p(n1) \times p$ matrix. Then from the bottomleft block of \(\label{equ:WOmega} \begin{split} &\mb{W}\bOmega = \\ &\mat{ \mb{W}_{11} &\mb{W}^\top_{1\sm1} \\ \mb{W}_{1\sm1} & \mb{W}_{\sm1 \sm1} } \mat{ \psi_{11}\mb{I}_p + \bTheta_p &\dots &\psi_{in}\mb{I}_p \\ \vdots &\ddots &\vdots \\ \psi_{n1}\mb{I}_p &\dots &\psi_{nn}\mb{I}_p + \bTheta_p } \\ &= \mb{I}_{n} \otimes \mb{I}_p ~, \end{split}\) we get \(\nonumber \mb{W}_{1\sm1} (\psi_{11}\mb{I}_p + \bTheta_p) + \mb{W}_{\sm1\sm1} (\bpsi_{1\sm1} \otimes \mb{I}_p) = \mb{0}_{n1} \otimes \mb{I}_p\) \(\label{equ:W1_not1} \mb{W}_{1\sm1} + \mb{W}_{\sm1\sm1}\mat{ (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{21} \\ \vdots \\ (\psi_{11}\mb{I}_p + \bTheta_p)^{1}\psi_{n1}} = \mb{0}_{n1} \otimes \mb{I}_p\) \(\begin{split} \nonumber \mb{W}_{1\sm1} ~+~ &\mb{W}_{2\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{21} ~+\dots~\\ ~\dots+~ &\mb{W}_{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \psi_{n1} = \mb{0}_{n1} \otimes \mb{I}_p ~, \end{split}\) with $\mb{0}{n1}$ as the vector of $n1$ zeros. According to the stationary point in , taking the blockwise trace $\textrm{tr}p(.)$ of both sides, gives the equation: \(p ~ \mb{t}_{1\sm1} + \mb{A}_{\sm 1 \sm1} \bpsi_{1\sm1} = \mb{0}_{n1},\quad \textrm{where}\\\) $\mb{A}^\top{\sm 1 \sm1} \triangleq \mat{ \textrm{tr}p \left{ \mb{W}{2\sm1} (\psi_{11}\mb{I}p + \bTheta_p)^{1} \right}^\top \ \vdots \ \textrm{tr}_p \left{ \mb{W}{n\sm1} (\psi_{11}\mb{I}_p + \bTheta_p)^{1} \right}^\top }$.
By imposing an $\ell_1$ penalty on $\bpsi_{1\sm1}$, this problem reduces to a Lasso regression.
After estimating $\psi_{1\sm1}$, we compute $\mb{W}{1\sm1}$ by substituting into . It remains to compute $\mb{W}{11}$. This follows from , which gives \(\mb{W}_{11} = (\mb{I}  \mb{W}_{1\sm1}^\top (\bpsi_{1\sm1} \otimes \mb{I})) (\psi_{11}\mb{I} + \bTheta_p)^{1} ~.\) This algorithm iteratively estimates columns of $\bPsi_n$ and $\mb{W}$ in this manner. The procedure for estimating $\bTheta_p$, for fixed $\bPsi_n$, becomes directly parallel to the above simply by transposing the design matrix (samples become features and viceversa) and applying the algorithm. Algorithm 1 outlines the BiGLasso.
[!htbp]
Input: $\mb{Y}, \lambda, \gamma$ and initial estimates of $\bPsi_n$ and $\bTheta_p$ $\mb{T} \leftarrow p^{1}\mb{Y}\mb{Y}^\top$ # Estimate $\bPsi_n$ : Partition $\bPsi_n$ into $\psi_{ii}, \bpsi_{i \sm i}$ and $\bPsi_{\sm i \sm i}$. Find a sparse solution of $~ p ~ \mb{t}{i \sm i} + \mb{A}{\sm i \sm i} \bpsi_{i \sm i} = \mb{0}{n1}$ with *Lasso* regression. Substitute $\bpsi{i \sm i}$ into to compute $\mb{W}{i \sm i}$. $\mb{W}{ii} \leftarrow \left( \mb{I}  \mb{W}{i \sm i}^\top (\bpsi{i \sm i} \otimes \mb{I}) \right) (\psi_{ii}\mb{I} + \bTheta_p)^{1}$ # Estimate $\bTheta_p$ : Proceed as if estimating $\bPsi_n$ with input $\mb{Y}^\top, \lambda, \gamma$.
In our experiments we treat $\lambda$ and $\gamma$ as the same parameter and the precision matrices $\bPsi_n$ and $\bTheta_p$ are initialized as identity matrices. The empirical mean matrix is removed from each dataset.
Simulations
[ht]
[ht]
To empirically assess the efficiency of BiGLasso, we generate the datasets described below from centered Gaussians with Kroneckerproduct (KP) and Kroneckersum (KS) precision matrices. We run the BiGLasso and SMGM (Sparse Matrix Graphical Model) of @Leng:Sparse12, a flipflop extension of GLasso for Kroneckerproduct matrixnormals using the $\ell_1$ penalty. More precisely, one iteration of SMGM consists of a GLasso iteration for fitting the rowsprecision matrix with a fixed columnprecision and another GLasso iteration for fitting the columnsprecision matrix with a fixed rowsprecision. The $\bTheta_p$ and $\bPsi_n$ precision matrices in both cases are generated in accordance to [§4, @Leng:Sparse12]; namely, as either of the following $d \times d$ blocks ($d$ being either $p$ or $n$) of increasing density:

$\mb{A}1$: Inverse AR(1) (autoregressive process) such that $\mb{A}_1 = \mb{B}^{1}$ with $B{ij} = 0.7^{ij}$.

$\mb{A}2$: AR(4) with $A{ij} = I(ij=0) ~+~ 0.4I(ij=1) ~+~ 0.2I(ij=2) ~+~ 0.2I(ij=3) ~+~ 0.1I(ij=4)$, $I(.)$ being the indicator function.

$\mb{A}3 = \mb{B} + \delta\mb{I}$, where for each $B{ij} = B_{ji}, i\neq j$, $P(B_{ij}=0.5) = 0.9$ and $P(B_{ij}=0) = 0.1$. The diagonal is zero and $\delta$ is chosen such that the condition number of $\mb{A}_3$ is d. Since the condition number is $k(\mb{A}_3) = d = \frac{\lambda_1 + \delta}{\lambda_d + \delta}$, the ratio of largesttosmallest eigenvalue, then $\delta = \frac{d \lambda_d  \lambda_1}{1  d}$.
Figures [fig:BiGLassoVsSMGM_onKSdata], [fig:BiGLassoVsSMGM_onKPdata] show the recall $=\frac{# { \widehat{\Omega}{ij} \neq 0 ~\&~ \Omega{ij} \neq 0}}{# { \Omega_{ij} \neq 0 }}$ (or truepositive rate) and precision $ = \frac{# { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 0}}{# { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 0} + # { \widehat{\Omega}{ij} = 0 ~\&~ \Omega{ij} = 1}}$ across 50 replications to assess the $\widehat{\bOmega}$ estimates under various setups. Each box shows a particular setup that varies in block combination ($\mb{A}_1,\mb{A}_2,\mb{A}_3$), in block sizes ($n,p$), in sample size N generated from the matrixnormal and by the structure used (KS or KP) to generate the sample. Each curve in a box is the solutionpath of a replication in precisionrecall space for a range of regularization settings $\lambda = 5^x$, for $x \in [6,2]$ interpolated 10 times. The blocks are arranged such that the overall density of the structured precision matrices increases from left to right.
We note that since blocks A1,A2 have a fixed form, for such combinations each curve is a different sample from the same graph structure. Only A3 is random so in combinations involving A3, each curve has a different random A3 and consequently generates a different sample. At a glance this has little effect, as most of the curves are not too spread and amounts to sampling from a joint distribution over A3structures and data where the A3structure is marginalized out.
Figures [fig:BiGLassoVsSMGM_onKSdata] and [fig:BiGLassoVsSMGM_onKPdata] also compare against the results of SMGM (using the Lasso penalty) on data simulated from the matrixnormal with KS structures. @Leng:Sparse12 had also ran comparisons against the MLE method of @Dutilleul:MLE99 (an unpenalized variant of SMGM), ridgeSMGM (SMGM with an $\ell_2$ penalty instead of $\ell_1$) and the GLasso of @Friedman:sparse08 (on vectorized samples from $\Normal{\mb{0}}{ \bPsi_n \otimes \bTheta_p}$, i.e. ignoring the matrix structure). They consistently outperformed all of these methods, so for brevity we compare only against the SMGM. Similarly, Figure [fig:BiGLassoVsSMGM_onKPdata] visualizes the simulations under KP structures.
By the empirical distributions of these solutionpaths (50 for each model in each box), it is no surprise that the intrinsically denser SMGM tends to have low precision (many falsepositives) for smaller values of $\lambda$. On the contrary, BiGLasso tends to have low recall (many falsenegatives) due to its intrinsically sparser structure.
Block A3 is the only randomized sparse structure whereas A1 and A2 are more “artificial” as they respectively model an inverseAR(1) and AR(4) and they yield banded precision matrices. Of interest is the observation that the largest effect of the increase in sample size $(10 \rightarrow 100)$ seems to occur on the A3/A3 combination (right end column of boxes). More precisely in Figure [fig:BiGLassoVsSMGM_onKSdata], we note the difference from box (1,6) to (2,6) and from (3,6) to (4,6). The sample size is very effective: with sufficiently large sample size N, BiGLasso starts to recover exactly and SMGM occupies lower regions in general.
In Figure [fig:BiGLassoVsSMGM_onKPdata], since the data generation process uses Kroneckerproduct structures, the SMGM is expected to outperform our method. Indeed for lowerdensity structure, the recovery rate of the SMGM seems consistently better than BiGLasso. and recovery can be almost exact for the SMGM for combination A1/A1. However, as the overall density increases, the performance of BiGLasso is balanced. Again, for combinations involving A3, larger sample sizes benefit BiGLasso more.
In summary, KPsimulated data proved harder to tackle for both methods than KSgenerated data. These simulations have shown that the BigLasso consistently outperforms the SMGM on KSsimulations, with the possibility of exact recovery on large sample sizes. On KPsimulations the comparison is less clear, but the BiGLasso proves more practical for denser Kroneckerproduct structures and the SMGM more practical for sparser structures.
An Example from the COIL Dataset
In this section we perform a minor video analysis of a rotating rubber duck from the COIL dataset^{5}. The video consists of grayscaled images, see Figure [fig:rotating_duck].
[!htbp]
The goal is on two fronts: to recover the conditional dependency structure over the frames and the structure over the pixels. For simplicity, we reduced the resolution of each frame and subsampled the frames (at a ratio 1:2). After vectorizing the frames (stacking their columns into $81 \times 1$ vectors) and arranging them into a design matrix $\mb{X}$, the resulting single ``datapoint” that BiGLasso has to learn from is $36\times 81$ (#frames $\times$ vectorized frame length). Unlike our previous simulations where we had many matrixsamples, here the challenge is to learn from this single matrix $(N = 1)$.
Despite the big loss in resolution, the principal component (PCA) subspace of the rotating duck seems to remain smooth, see Figure [fig:duck_manifold]. Being a timeseries, the video is expected to resemble a 1D manifold, ``homeomorphic” to the one recovered by PCA shown in figure [fig:duck_manifold], so we applied the BiGLasso on the reduced images.
\
Indeed, the left panel of figure [fig:frameNet_pixNet] shows the rowprecision parameter of BiGLasso capturing a manifoldlike structure where the first and last frames join, as expected of a $360^{\circ}$ rotation. The model recovered the temporal manifold structure, or in other words, we could use it to connect the dots in Figure [fig:duck_manifold] in case the true order of the frames was unknown (or randomly given to us).
The right panel of Figure [fig:frameNet_pixNet] shows the conditional dependency structure over the pixels. This figure shows strong dependencies at intervals of 9 — that is, roughly in line with the size of a frame (due to the columnwise ordering of the pixels).
This is expected, as neighboring pixels are more likely to be conditionally dependent.
A more intuitive picture of the induced Markov network is shown in Figure [fig:pixNet_visualisation]. A Gaussian graphical model can be naturally interpreted as a system of springs, where the offdiagonal entries of the inversecovariance represent the negative stiffness of the springs. Therefore by the colorbar, a negativecolor represents an ``attracting” spring between those two pixels and a positivecolour represents a ``repulsing” spring. Naturally, in the frames network almost all nonzero elements are negative.
Conclusion
There is a need for models to accommodate the growing complexity of dependency structures. We are concerned with conditional dependencies, as encoded by the inversecovariance of a matrixnormal density. In highdimensional cases the Markov network structures induced by a graph could be approximated by factorisations such as the tensorproduct (Kroneckerproduct of precision matrices). In this work, we motivated a novel application of the Cartesian factorization of graphs (Kroneckersum of precision matrices), as a more parsimonious and interpretable structure for intersample and interfeature conditional dependencies. In the context of multioutput GPs, the Kroneckerproduct cancels any transfer (that is, ignoring any correlations) between outputs (tasks) when a block design with a noisefree covariance. This is not the case with the Kroneckersum due to its additive form. We introduced the bigraphical Lasso (BiGLasso), an algorithm for the simultaneous pointestimation of the structures of two Gaussian graphical models: one over the rows of a matrixsample and the other over its columns. This was demonstrated to good effect through simulations as well as a toy example from the COIL dataset.
An obvious extension that would exploit the associativity of the Cartesian product, would be the modeling of datasets organised into 3 or higherdimensional arrays (amounting to GRFs over higherorder tensors) with dependencies across any subset of the array dimensions.
One of the appealing features of the Kroneckersum of precision matrices is the preservation of intertask transfer, thereby leading to potential applications on Kroneckersums of kernels for multioutput Gaussian processes.
Finally we feel that the — largely unknown to machine learning — literature on the Cartesian product of graphs deserves a thorough study, towards modeling and algorithmic advances in probabilistic graphical modeling.

Vectorization of a matrix involves converting the matrix to a vector by stacking the columns of the matrix. ↩

One factor for intertask one for interpoint covariances. ↩

A vector is an order1 tensor, a matrix is an order2 tensor and so on. ↩

In a sense, this generalizes the conventional trace operator as $\textrm{tr}_{np}(\mb{M}) = \tr{\mb{M}}$. ↩

http://www.cs.columbia.edu/CAVE/software/softlib/coil20.php ↩