- © 2013 American Society of Plant Biologists. All rights reserved.

## Abstract

Plant behaviors across levels of cellular organization, from biochemical components to tissues and organs, relate and reflect growth habitats. Quantification of the relationship between behaviors captured in various phenotypic characteristics and growth habitats can help reveal molecular mechanisms of plant adaptation. The aim of this article is to introduce the power of using statistics originally developed in the field of geographic variability analysis together with prominent network models in elucidating principles of biological organization. We provide a critical systematic review of the existing statistical and network-based approaches that can be employed to determine patterns of covariation from both uni- and multivariate phenotypic characteristics in plants. We demonstrate that parameter-independent network-based approaches result in robust insights about phenotypic covariation. These insights can be quantified and tested by applying well-established statistics combining the network structure with the phenotypic characteristics. We show that the reviewed network-based approaches are applicable from the level of genes to the study of individuals in a population of *Arabidopsis thaliana*. Finally, we demonstrate that the patterns of covariation can be generalized to quantifiable biological principles of organization. Therefore, these network-based approaches facilitate not only interpretation of large-scale data sets, but also prediction of biochemical and biological behaviors based on measurable characteristics.

## INTRODUCTION

The behaviors of biochemical components, from DNA and proteins to metabolites, in biological entities across all levels of cellular organization, from single cells to whole organisms, partly depend on the space in which they operate and interact. Advances in the development of novel (semi-)automated technologies have facilitated high-throughput quantification not only of biochemical components, resulting in particular molecular phenotypes (e.g., gene expression, protein abundances, and metabolite levels), but also complex physiological traits (e.g., flowering time and yield) and morphological properties (e.g., cell and leaf shape and shoot size) in plants. The problem of identifying the extent to which spatial constraints shape plant behaviors, within and across individuals, resulting in particular genetic makeup and varying phenotypes captured in transcriptomics, proteomics, and metabolomics data sets, is particularly relevant in the study of plants as sessile organisms. Resolving this problem requires analyses of patterns of covariation in characteristics of interrelated biochemical/biological entities. For instance, one may investigate the relation between geographic distances and genetic differences, the relationship between covariation of expression levels of genes and their functional characterization, as well as the relation of chromosomal gene locations, chromatin remodeling, or DNA coiling and gene expression levels (Zupancic et al., 2001; Marshall, 2002; Blanco et al., 2008; Ha et al., 2011; Sobetzko et al., 2012).

To analyze the patterns of covariation in characteristics of studied entities (e.g., individuals of a population and molecular components), it is necessary to specify a measure quantifying the distance between two entities. The distance measure can be used to determine which entities are to be treated as related. The related entities can be represented by a network of nodes, denoting the entities, and edges, representing the relatedness. One can then investigate covariation for a given set of characteristics only over related entities. Moreover, due to the network abstraction, covariation can be examined not only in a geographic space, but also a more general setting.

It then becomes apparent that the networks and any patterns of covariation are expected to depend on the sampled entities and the distances between them. Since, for biological entities, and especially plants, spatial location usually determines the growth habitat to which they are exposed, it is important to examine if a pattern of covariation can be robustly found across diverse environmental conditions typical for the habitat. Moreover, characteristics of biochemical and biological entities depend on not only the spatial aspects but also the timing of physiological and developmental events (e.g., flowering time), which are often tightly regulated. Therefore, the determined patterns of covariation should be investigated for robustness with respect to the time-dependent behavior of the analyzed entities.

The identification of a biologically meaningful and robust pattern of covariation in characteristics of biochemical and biological entities renders it possible to predict the behavior of entities based on their measurable characteristics. Thus, such a pattern of covariation can be regarded as a principle of biological organization. For instance, a robust pattern of genetic diversity can be used to specify the likeliest geographic location of an individual specified only by its genetic makeup. Analogously, a robust pattern in a given molecular function of genes with similar expression levels can be employed to characterize genes of unknown function. Therefore, approaches for determining patterns of covariation can find wide application across different fields in biology, from molecular physiology and genetics to ecology.

Here, we aim to provide a critical systematic review of the existing statistical and network-based approaches for revealing and further investigating patterns of covariation. First, we show how one can use measured characteristics of entities to arrive at distance matrices. We then demonstrate that comparison of the resulting distance matrices with classical statistical approaches often does not result in robust patterns of covariation across various characteristics. We in turn systematically review the approaches that extract networks from the distance matrices and focus on the properties they must satisfy to ensure robustness of any pattern of covariation. We also present a detailed overview of the statistics, originating in geostatistics and geographic variability analysis, that can be employed to quantify network-based covariation of characteristics in uni- and multivariate settings. In addition, we illustrate how these statistics relate to other well-studied network properties. On the level of biological entities, we apply these methods to show how multivariate high-throughput data can be integrated across 92 diverse *Arabidopsis thaliana* accessions to reveal relations between molecular factors and geographic location, thus providing insights in local adaptation. On the molecular level, with the help of these methods, we show how similarities in gene expression reflect the function annotation in *Arabidopsis*, thus providing a quantifiable principle useful in predicting functional characterization.

## FROM HIGH-THROUGHPUT DATA TO DISTANCE MATRICES

For the purpose of illustrating the main concepts of the reviewed methods, we assume that there are *n* entities. Each entity, *i*, 1 ≤ *i* ≤ *n*, is described by two properties with corresponding data profiles, denoted by *X _{i}* and

*Y*. The data profiles

_{i}*X*and

_{i}*Y*are collections of

_{i}*p*and

*q*characteristics, represented as vectors. Gathering all

*X*and

_{i}*Y*data profiles over all

_{i}*n*entities results in two matrices,

*X*and

*Y*, respectively. For instance, if the entities are

*Arabidopsis*accessions, the data matrix

*X*may consist of the longitude and latitude of their geographic origin, while the data matrix

*Y*may be given by the metabolic levels or single-nucleotide polymorphisms (SNPs) of each accession. On the other hand, if the entities are

*Arabidopsis*genes, the data matrix

*X*may consist of their expression levels across various experiments, while the data matrix

*Y*may be given by the characterization of genes’ function annotation as terms of a chosen ontology (e.g., MapMan [Thimm et al., 2004] or GO [Harris et al., 2004]) for the considered genes; similarly, if the entities are metabolites,

*X*and

*Y*may include the levels under same experimental scenarios in

*Arabidopsis*and tomato (

*Solanum lycopersicum*), respectively. Therefore, the particular characteristics gathered in the data matrices

*X*and

*Y*would depend on the biological question.

For a pair of data profiles (vectors), *X _{i}* and

*X*, from the entities

_{j}*i*and

*j*, 1 ≤

*i*,

*j*≤

*n*, a distance measure

*μ*results in a number, denoted by

*μ*(

*X*,

_{i}*X*), quantifying the distance between the two data profiles. A distance measure

_{j}*μ*is symmetric if its value does not depend on the order of the data profiles, for example,

*μ*(

*X*,

_{i}*X*) =

_{j}*μ*(

*X*,

_{j}*X*). In the following, we assume that the distance measure

_{i}*μ*is such that higher values denote larger distances. The Euclidean distance and modifications of Pearson correlation coefficient are commonly used distance measures.

## COMPARISON OF DISTANCE MATRICES

Equipped with the concept of a distance measure, there are two possible approaches to investigate the relationship between the data matrices *X* and *Y* regarding the distances of the included data profiles. In the first approach, one relies on applying two (not necessarily different) distance measures, *μ _{X}* and

*μ*, on the data profiles in the matrices

_{Y}*X*and

*Y*, resulting in two distance matrices,

*D*and

_{X}*D*, respectively. One can test the congruence between the distance matrices using the Mantel test or

_{Y}*R*coefficient, or determine an empirical variogram.

_{V}The Mantel test, often used in ecological studies (Reynolds, 2003; Cushman and Landguth, 2010), quantifies the correlation between two matrices over the same set of entities, as is the case here. Let and denote the distances between the data profiles of the entities *i* and *j* in *X* and *Y*, respectively. The Mantel correlation is given by the following expression (Mantel, 1967):where and denote the means, while and denote the standard deviations of *D _{X}* and

*D*, respectively. Like Pearson correlation coefficient,

_{Y}*r*takes values in the range [−1,1] whose statistical significance can be estimated empirically by permutation test (Smouse et al., 1986). However, it also shares the same disadvantages with Pearson correlation that presence of outliers may alter not only the value but also the sign of correlation (Gravetter and Wallnau, 2010).

_{M}The *R _{V}* coefficient characterizes the congruence between two matrices over the same set of entities

*n*. It is given by the normalized scalar product of the two matrices, ranging in the interval [0,1], whose statistical significance can be determined analytically (Robert and Escoufier, 1976). With both statistics, the presence of a pattern of spatial covariation may be severely obscured by considering the covariation between pairs of entities that are not necessarily related (i.e., are at large distance), leading to nonrobust findings. To illustrate, let us consider as entities 20

*Arabidopsis*accessions with geographic origin in Germany, so that

*X*contains their longitude and latitude and

*Y*gathers the levels for 49 metabolites measured under near-optimal growth condition (see Supplemental Data Set 1 online). We next generate the distance matrices

*D*and

_{X}*D*from the geographic locations and the

_{Y}*z*-normalized metabolite profiles, respectively. The resulting Mantel correlation coefficient indicates an apparently positive but nonsignificant correlation. In addition, the

*R*coefficient for

_{V}*X*and

*Y*shows a small but nonsignificant congruence between the two matrices (Figure 1, inset).

## FROM DISTANCE MATRICES TO VARIOGRAMS

Another technique of choice when analyzing covariation in space is based on the empirical variogram that quantifies how distances in a given property vary with spatial separation. Given the two distance matrices *D _{X}* and

*D*, let

_{Y}*N*(

*k*) denote the set of pairs of entities

*i*and

*j*, 1 ≤

*i*,

*j*≤

*n*, such that , and let |

*N*(

*k*)| be the number of such (

*i*,

*j*) pairs. The empirical variogram is then defined as follows (Clark, 1979):where

*k*varies in the range of

*D*. In practice, instead of determining

_{X}*γ*(

*k*) for individual values of

*k*, first the distances in

*D*are binned and the expression above is applied on pairs of entities whose differences lies in a corresponding bin. For instance, on the example of the 20

_{X}*Arabidopsis*accession with four bins, each covering a range of 50 miles (mi), as shown in Figure 1, we observe that from the first to the third bin, there is a slight increase in the mean

*γ*, indicating that accessions further apart are more variable in their metabolic phenotype than those closer together. This behavior is lost for pairs of accessions having a large distance (fourth bin). By examining the results when eight bins, each covering a range of 25 mi in Figure 1, it appears that the mean in the first bin is larger compared with that of the second, suggesting that the relation between geographic distance and metabolic profiles is more diverse even for geographically close accessions. As with any procedure that depends on binning, as shown in the example, the results are sensitive with respect to the chosen size of the bin. Although there are optimal statistical designs for robust estimation of variograms (Cressie, 1993), this is of little value when dealing with a priori sampled and well-characterized

*Arabidopsis*accessions.

Platt et al. (2010) used the concept of the variogram on *X* gathering the geographic location of 5707 plants and *Y* given by 135 SNPs spread across the genome, with *μ _{X}* given by the Euclidean distance and two versions of

*μ*: the fraction of nonmatching alleles and the fraction of those belonging to the same haplogroup. To account for the drawbacks of using different binning sizes, Platt et al. examined the variograms with three sizes (i.e., 0.5, 10, and 150 km) for the North American and Eurasian accessions. By fitting linear and exponential models to the obtained empirical variograms without any smoothing, nonstandard in the field of geostatistics (Clark, 1979), the authors concluded that

_{Y}*Arabidopsis*exhibits a measure of isolation by distance. However, we emphasize that this claim and its robustness are not statistically and quantitatively supported with the applied approach.

## FROM DISTANCE MATRICES TO PROXIMITY-BASED NETWORKS

In the second approach, one first applies a distance measure *μ _{X}* on the data matrix

*X*and employs the resulting distance matrix

*D*to define which entities are to be considered as related. This procedure generates a network

_{X}*G*, in which the set of nodes,

*V*(

*G*), represents the entities and edges (links), in the set,

*E*(

*G*), denote the pairs of entities which are related. The problem of detecting patterns of covariation then becomes one of investigating how the values in

*Y*vary only over the neighbors (i.e., the entities whose corresponding nodes are connected by an edge in the network and not over all pairs of entities). In the following, we describe and illustrate some of the most prominently used classes of networks that capture different notions of relatedness.

The notion of relatedness will depend on the number of sampled entities as well as the type of data in *X* describing their positions. For instance, what may be considered related in geographic terms, where *X* gathers the geographic location of an entity, does not necessarily coincide with relatedness of *X* in terms of gene or metabolic regulation. Nevertheless, the relation of relatedness must satisfy a set of necessary properties in order to be applicable in a statistical setting: (1) no entity is related to itself, (2) if the entity *i* is related to the entity *j*, so is *j* to *i*, and (3) the relatedness between *i* and *j* is unique, in the sense that it does not depend on the choice of parameter values. These properties translate into the following network characteristics: (1) there are no loop edges on single nodes, (2) the edges are undirected, describing symmetric relationships, and (3) the network is unique. An additional property may also include the connectedness of the network, whereby any two nodes are connected by a path.

There are certainly several relations that satisfy these properties and operate on the distance matrix *D _{X}*, resulting in different classes of networks. In the closest pairs (CP) network, a single edge is established between two entities

*i*and

*j*if their distance is the smallest one over all pairs of entities. Since this network includes a single edge, it is disconnected in all cases except when there exists an entity that is equidistant and closest to all other entities. While the problem of finding the CPs is computationally interesting, the CP network does not capture relatedness over all entities. For the 20

*Arabidopsis*accessions in Figure 1, the CP network is visualized in Figure 2A. In the nearest neighbors (NN) network, two entities

*i*and

*j*are connected by an edge if there exists no entity

*l*for which the distance between

*i*and

*l*is smaller than the distance between

*i*and

*j*(i.e., ). However, entity

*j*being the nearest to entity

*i*does not in general imply that

*i*is the nearest to

*j*, so this relation is not symmetric (Figure 2B). The NN network is in general disconnected and includes the CP network. Moreover, the generalization of the NN network to consider the

*k*NNs does not suffer only from asymmetry, but also depends on the value of the parameter

*k*, which violates the uniqueness property (see Figure 2B for

*k*= 3). A heuristic resembling the

*k*NN network, and suffering the same drawbacks, was recently employed by Anastasio et al. (2011) to determine

*Arabidopsis*accessions with erroneous data on their origin based on their discrepancy from the isolation by distance observed by Platt et al. (2010).

The minimum spanning tree (MST) network spans all entities (i.e., the network is connected), so that the sum of the distances between the entities connected by an edge is the smallest over all connected networks (Figure 2C). The MST network is unique if all distances are distinct and contains the NN network as a subnetwork. Let *d _{i}* denote the smallest of the distances between entity

*i*and any other entity. In the sphere of influence (SI) network, the entities

*i*and

*j*are connected by an edge if the spheres with radii

*d*and

_{i}*d*centered in

_{j}*i*and

*j*, respectively, intersect in more than one point (Figure 2A). Like the MST network, the SI network also contains the NN network. While the SI relation is symmetric and defines a unique network, the resulting network is not necessarily connected (e.g., five components in Figure 2A).

The concept of relative neighborhood (RN) guarantees that the network is unique and connected. In an RN network, two entities *i* and *j* are connected by an edge if for any other entity *l*, *l* ≠ *i*, *j*, holds (i.e., *l* does not lie in the intersection of the two spheres centered at *i* and *j* of radius ) (Toussaint, 1980) (Figure 2C). Since the MST network is contained in the RN network, the latter is connected (Jaromczyk and Toussaint, 1992). Moreover, for a given distance matrix *D _{X}*, the RN network is unique. The Gabriel neighborhood of two entities

*i*and

*j*is defined as the smallest sphere through

*i*and

*j*(Figure 2C). Since the sphere is of radius , the Gabriel neighborhood is contained in the RN. Like in the RN network, an edge is established between

*i*and

*j*in the Gabriel neighborhood network if the corresponding neighborhood is empty. Delaunay triangulation (DT) network is partitioning of the space into simplices in such a way that a simplex is a part of DT if the sphere through its nodes, representing the entities, contains no other nodes (Figure 2D). When the number of entities that lie on a sphere is larger than the dimension of the space, the DT network is not unique (de Berg et al., 2008). It has been shown that the RN network is contained in the DT network. For further extensions and discussion of the properties for these classes of networks, we direct the interested reader to Veltkamp (1992) and Aldous and Shun (2010).

To introduce the next class of networks, we assume that each entity is associated with a subset of entities. The subset associated with entity *i* can be determined based on the distribution/ranking of distances in . In the class of proximity networks then two entities *i* and *j* are connected by an edge if the subset of *i* contains *j* and vice versa (Mutwil et al., 2011). Moreover, in the class of intersection networks, two entities *i* and *j* are adjacent if the intersection of the associated subsets is nonempty (Karoński et al., 1999). The proximity networks can be regarded as a symmetric variant of the NN network, while the intersection networks can be viewed as a complement of the empty neighborhood in the definitions above.

## STATISTICS FOR CHARACTERISTICS OF NETWORK ENTITIES

The statistics capturing patterns of covariation of a property must explicitly consider the underlying relatedness of investigated entities. To discern such patterns, the statistics quantify how the property’s level for each entity covaries with those of its neighbors. They can be global, as in the case of the Moran’s *I* (Moran, 1950) and global *G* (Getis and Ord, 1992), or can take into account local effects, such as the Geary’s *C* (Geary, 1954) and local Moran’s *I* (also known as Anselin’s local indicators of spatial association; Anselin, 1995).

For now, let us assume that we are in a univariate setting, where each entity, *i*, 1 ≤ *i* ≤ *n*, is quantified by a single numeric value, *y _{i}*, for a given characteristics. To test for spatial clustering, referring to a range of correlation values, one usually uses Moran’s

*I*or global

*G*statistics. Moran’s

*I*is defined by the following expression:where

*w*denotes the weight of the edge (

_{ij}*i*,

*j*) ∈

*E*(

*G*) and takes a value of zero if the network does not include this edge. Usually,

*w*= 1 or

_{ij}*w*= 1/

_{ij}*k*(

*i*), where

*k*(

*i*) is the degree of the node representing the entity

*i*in

*G*. If

*w*= 1, then Moran’s

_{ij}*I*is equivalent to the Pearson correlation of a modified vector with the assumption of network homogeneity (see Supplemental Methods 1 online) and, interestingly, corresponds to the (weighted) degree-degree correlation (Newman, 2002). Therefore, its values are in the range [-1,1] and can be interpreted in the same way as correlations. Positive values indicate that the characteristic exhibits a clustered pattern, negative values suggest dispersal, and zero denotes homogenous distribution of values.

Here, we calculate Moran’s *I* for each metabolite (see Supplemental Table 1 online) in the 20 *Arabidopsis* accessions whose neighborhood structure is given by the RN network in Figure 2C. Glc, threhalose, and erythritol all show significant positive Moran’s *I* (0.67, 0.42, and 0.52, P value < 0.05), while Phe, Ile, and Glc demonstrate dispersal, random, and clustered behavior, respectively (Figure 3). Although useful in providing hints for some metabolic processes associated to local adaptation, these findings neglect the fact that metabolite levels arise from metabolic processes interconnected in a complex network.

The global *G* statistics (also known as Getis-Ord General *G*) can be used to test the existence of concentrated high or low values in a given network and are defined as follows:

This expression includes the assumption that for every edge (*i*, *j*) ∈ *E*(*G*), 0 ≤ *w _{ij}* ≤ 1, the value of global

*G*is in the range [0,1]. The equivalent

*z*-score, obtained by analytically determined mean and variance of the global

_{G}*G*statistics, can be used to assess the presence of clustering: Statistically significant values whose

*z*-score is positive indicates that high values cluster together, while negative values support the concentration of small values. The global

_{G}*G*statistics for the 20 accessions in Figure 2C indicate that in the case of trehalose and Asn, high values cluster together with

*w*= 1 if the two entities

_{ij}*i*and

*j*are connected, and

*w*= 0, otherwise (see Supplemental Table 1 online).

_{ij}A global statistic that is more sensitive to local spatial clustering is Geary’s *C* statistic, which is closely related to the inverse of Moran's *I*. It can be seen as the network equivalent of a variogram since

The values for this statistic are in the range [0,2], where 1 indicates lack of spatial clustering. Values smaller than 1 are interpreted as an indicator of clustering, while values larger than 1 suggest dispersal. An example of different values for Geary’s C is shown in Figure 3 based on analyzing the behavior of Phe, Ile, and Glc for the *Arabidopsis* accessions. The results for the other metabolites are similar to those obtained from Moran’s *I*.

Moran’s *I* and global *G* have their local counterparts [i.e., in a modified form, they can be used to assess clustering behavior with respect to a given property in a neighborhood *N*(*i*) of a given entity *i*]. This results in the following:

Like the global statistics, they can be transformed into *z*-scores to help interpret the local behavior of the characteristic.

## FROM UNI- TO MULTIVARIATE SETTINGS

The statistics described above are suitable for application when a single characteristic of an entity is observed. With the advances in high-throughput technologies, biological entities are often described by vector profiles including different system level molecular phenotypes (e.g., transcriptomic, proteomic, and metabolomic). To render the described statistics useful in a multivariate setting, we employ the distance measure *μ _{Y}*, which is used on the data matrix

*Y*but only for the entities which are related (i.e., entities whose corresponding nodes are connected by an edge) (Kleessen et al., 2012). Given a network

*G*, generated based on

*D*

_{Y}, in the simplest scenario, we determine the weight

*θ*of each edge (

_{ij}*i*,

*j*) ∈

*E*(

*G*) as . Alternatively, one may consider combining the geographic distance with the weight

*θ*. Irrespective of the scenario, each node is characterized by the mean of the weights of edges incident on it; in other words, each node

_{ij}*i*is assigned a weight,

*θ*, such that , where

_{i}*k*(

*i*) denotes the degree (number of neighbors) of the node

*i*. The statistics reviewed above can then be used with

*θ*instead of

_{i}*y*.

_{i}## BRIDGING THE GAPS BETWEEN LEVELS OF BIOLOGICAL ORGANIZATION: TWO ILLUSTRATIONS

In the following, we apply the systematically reviewed network-based approaches together with the most widely used statistic in two scenarios. In the first, we investigate the geographic distribution of multivariate molecular phenotypes. In the second, we examine the relation between coexpressed genes, captured in a proximity network, and their functional characterizations.

### Geographic Distribution of Multivariate Molecular Phenotypes

It was recently discovered that genetically similar *Arabidopsis* accessions are found closer to each other, suggesting a robust isolation by distance (Platt et al., 2010; Anastasio et al., 2011; Horton et al., 2012). However, these findings were obtained by relating the genetic and geographic distances either across all accession pairs or via parameter-dependent neighborhood structures (e.g., *k*NN-like networks) restricting the analysis to genotypic variation. The multivariate method allows for analyzing the relation of multivariate high-throughput data to the geographic origin of the accessions (Kleessen et al., 2012).

In the following, we use the RN network on 92 accessions with data sets including a collection of SNPs (Anastasio et al., 2011; Horton et al., 2012), flowering phenotypes (Atwell et al., 2010), and metabolic phenotypes consisting of levels for 49 metabolites. We chose the RN network model because it results in a unique network that does not depend on any parameters (unlike the *k*NN network model), it is sparser than other unique network models, and it results in a connected network. The metabolic phenotypes were obtained in three ex situ growth conditions: OpN (12 h light/12 h dark) in which nitrogen fertilization allows close to optimal growth; LiN (12 h light/12 h dark) in which growth is limited by nitrogen; and LiC (8 h light/12 h dark) with high nitrogen supply, with carbon-limited growth. Investigation of the results from the network-based approach shows a robust isolation by distance at the level of metabolic, flowering phenotypes, but not at the level of genetic variability for the analyzed accessions. Nevertheless, the isolation-by-distance model was confirmed also on SNP data using a larger set of 170 accessions (Horton et al., 2012), suggesting that the robustness of the isolation by distance based on SNPs depends on the number of sampled individuals.

Here, we expand these findings by focusing on chemical compound classes of metabolites and their role in biochemical processes including carbohydrates, organic acids, central amino acids, minor amino acids, miscellaneous, composition, as well as photorespiration, nitrogen assimilation, and amino acid metabolism (see Supplemental Methods 1 and Supplemental Table 2 online). Furthermore, we investigate the relation between geography and four phenotypes containing characteristics grouped according to their association with flowering (23 traits), defense (23), ion concentrations, named ionomics (18), and development (43). We focus on the later three categories, covering 39, 26, and 40 of the 92 accessions, respectively (see Supplemental Table 3 online).

The relationship between the metabolic, defense-related, ionomics, and developmental phenotypes of the neighboring accessions is investigated using Moran’s *I* statistics on the RN network. We chose to focus on Moran’s *I*, first, because it is connected to (and abstracts) network topology, as described above, and secondly, the claims made based on this statistic are of global character. The values of Moran’s *I* indicate positive relations with geographic distance for the four phenotypes (Figure 4; see Supplemental Tables 4 and 5 online), suggesting a robust isolation by distance. The results for the metabolite classes corroborate the claim that not all metabolic processes response in the same way in the three growth conditions. For instance, the metabolites involved in nitrogen assimilation have a high Moran’s *I* in LiC and OpN with a high nitrogen supply, but the smallest in LiN where growth is limited by nitrogen (Figure 4). Moreover, metabolites participating in photorespiration show larger value for the Moran’s *I* in LiC compared with those in LiN and OpN. These findings suggest that the metabolic response to particular growth conditions is affected by geographic distance and reflect the adaptation to specific environments. The defense-related, ionomics, and developmental phenotypes demonstrate a similar behavior, with a larger Moran’s *I* value for the ionomics phenotypes.

### Proximity Network-Based Statistics Validate the Guilt-by-Association Principle in Gene Networks

The guilt-by-association principle (GBA) postulates that two biological entities with similar quantitative behavior have similar functions (Eisen et al., 1998). This principle has been used for developing methods for automated prediction of gene functions based on gene expression profiles (Oliver, 2000; Lee et al., 2010; Zhou et al., 2002). While arguments have been provided both in favor of the general applicability of GBA in the case of networks obtained only using the Pearson correlation, referred to as relevance networks (Wolfe et al., 2005; Kinoshita and Obayashi, 2009), as well as questioning its validity (Gillis and Pavlidis, 2011, 2012). Interestingly, there exists no rigorous statistical assessment for the validity of this principle with respect to spatial clustering of biological functions in genome-wide coexpression networks. Such quantification will help determine which gene functions indeed follow the GBA principle, ultimately resulting in more accurate and more specific predictions.

Here, we illustrate how proximity networks extracted from gene expression data in combination with distance measures, quantifying the similarity of gene function annotation and the statistics for testing spatial clustering can be used to assess the GBA principle. The resulting quantities can be further interpreted in a network setting. To this end, we extracted the proximity network from a compendium of publically available microarray experiments in *Arabidopsis*, following the approach of Mutwil et al. (2011) (see Supplemental Methods 1 online).

In the extracted network, we focus on subsets of genes described by one of the six high-level categories from MapMan (Thimm et al., 2004), including regulation, hormones, cell wall, cellular response, primary metabolism, and secondary metabolism (Usadel et al., 2005). To assess the spatial clustering in a multivariate setting, the *θ _{i}* node weights are calculated from the semantic similarity of the gene neighbors in the proximity network (Figure 5). The semantic similarity was determined based on information-theoretic distance measures considering ancestral MapMan bins, as proposed by Klie and Nikoloski (2012) (see Supplemental Methods 1 online). The values for the Moran’s

*I*presented in Table 1 indicate that the there is increasing clustering of genes with similar function starting from regulation to cellular response and secondary metabolism. These findings can be interpreted further with respect to the behavior of genes in each category: The genes involved in secondary metabolism, involving more linear pathways, show exceptionally high congruence with respect to both coexpression and gene function (Figure 5). Moreover, the genes involved in primary metabolism and cell wall–related functions, covering tightly transcriptionally regulated pathways, also show high values for Moran’s

*I*. The broader categories of genes involved in regulation and crosstalk between pathways are, expectedly, showing the smallest Moran’s

*I*, suggesting a weak relation between gene expression and characterization of their functions.

## FUTURE TECHNICAL AND MODELING CHALLENGES

The presented modeling approach aims at revealing patterns of covariation in uni- and multivariate characteristics of biochemical and biological entities embedded in space. This approach involves testing hypothesis based on global or local statistics that allow making claims about spatial clustering of the investigated multidimensional traits.

Here, we argued that to guarantee robustness of the claims, it is desirable that the spatial embedding of the entities is parameter-free and, thus, unique for the sampled entities. While there are network classes that could be used robustly to investigate molecular, physiological, and morphological characteristics, the challenge of determining objective criteria for selecting a network model remains. One possibility for overcoming this challenge would be simultaneously to investigate the enumerated statistics from the sparsest to the densest network model ensuring uniqueness. The statistics on networks of different density can in turn be used to determine the level of spatial proximity at which dramatic shift in the statistics is observed. One may then further examine the relevant properties of the neighbors enforced by a particular network class that may lead to relevant biological insights.

The reviewed network models applied in a geographic setting rely on the airline distance between two entities. However, geographic space is three dimensional and incorporates natural barriers, like mountains or bodies of water, with variable degree of relevance to different species. Moreover, geographic locations have other properties that vary in time due to climate effects. An interesting challenge is to extend the existing models to account for barriers given terrain descriptions as well as similarity of geographically proximal locations which share climate-related properties. In such a setting, the investigations of high-throughput data would shed light on the relation of geographic segregation and climate with the underlying molecular mechanisms of local adaptation.

Finally, the proposed approach for analysis of multidimensional data profiles simultaneously considers all traits and might not identify a pattern for the entire collection of traits. However, this does not imply that there are no patterns of covariation in a subset of traits. Automated selection of subsets of traits for which patterns are observed remains one of the key challenges. As with other statistical approaches, conclusions drawn from marrying off covariation measures and spatial embedding of entities is expected to depend on the sampling scheme for the entities. Therefore, the behavior of the network models and statistics in a bootstrap setting remains to be further investigated. Providing methods that could address these issues would further contribute to the resolving the grand challenge of identifying biologically meaningful principles that can be used to predict the behavior of biochemical and biological entities based on their high-throughput readouts.

### Supplemental Data

The following materials are available in the online version of this article.

**Supplemental Table 1.**Moran’s*I*, Geary’s*C*, and Global*G*Statistics for 49 Metabolites of 20*Arabidopsis*Accessions.**Supplemental Table 2.**Assignment of Metabolites to Metabolite Classes.**Supplemental Table 3.**List of Accessions Used in the Analyses of Metabolic, Defense-Related, Ionomics, and Developmental Phenotypes.**Supplemental Table 4.**Moran’s*I*Statistic of 92*Arabidopsis*Accessions for Different Metabolite Classes.**Supplemental Table 5.**Moran’s*I*Statistic of 92*Arabidopsis*Accessions for Defense-Related, Ionomics, and Developmental Phenotypes.**Supplemental Methods 1.**Moran’s I and Pearson Correlation, Analysis of Metabolite Classes and Arabidopsis' Gene Coexpression Network.**Supplemental Data Set 1.**Metabolic Profiles and Geographic Location of 20 Accessions with Geographic Origin in Germany in 12-h-Light/12-h-Dark Photoperiod under Nitrogen Supply Allowing Close to Optimal Growth (OpN).

## Acknowledgments

We thank Alisdair R. Fernie and Roosa Laitinen from the Max Planck Institute of Molecular Plant Physiology for insightful discussions on the article.

## AUTHOR CONTRIBUTIONS

Z.N. designed the study. S.Kle. and S.Kli. performed the analyses. All authors discussed the results and wrote the article.

## Footnotes

↵[W] Online version contains Web-only data.

- Received February 27, 2013.
- Revised April 30, 2013.
- Accepted May 16, 2013.
- Published June 7, 2013.