Genetic expression programs

GTEx resources are valuable tools for exploring the impact of genetic variation on complex traits and diseases. Please note that since the GTEx program is no longer supported by the Common Fund, the program website is being maintained as an archive and will not be updated on a regular basis.

Highlights of the Genotype-Tissue Expression (GTEx) Program's major accomplishments include establishing a comprehensive catalog of genetic variants that affect gene expression across multiple tissues, which lets the research community evaluate tissue-specific gene expression and regulation.

Genetic variants that influence how genes behave are called expression quantitative trait loci (eQTLs).

In addition, log-transformation renders fold-changes of 0. This is undesirable given the current resolution of scRNA-Seq data, because the former change can frequently be attributed to noise, while we have much greater confidence in the biological significance of the latter.

Finally, addition of components in log expression space corresponds to multiplication in raw expression space. We believe additivity to be a more appropriate model than multiplicativity for most GEPs, and therefore do not log-transform the data prior to running cNMF. We also do not mean-center the genes, so as to preserve the non-negativity of the expression data, which is a requirement for NMF.
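The additivity argument can be checked with a few lines of arithmetic. The sketch below uses made-up contributions of two programs to a single gene (the values and variable names are illustrative assumptions, not taken from any dataset).

```python
import math

# Hypothetical raw expression of one gene contributed by two programs.
identity_part = 8.0
activity_part = 2.0

# Additive model in raw expression space: the contributions sum.
raw_sum = identity_part + activity_part

# Adding the same two components in log space instead...
log_sum = math.log(identity_part) + math.log(activity_part)

# ...is equivalent to multiplying them in raw space.
raw_product = math.exp(log_sum)

print(raw_sum)      # 10.0
print(raw_product)  # approximately 16.0
```

A model fit on log-transformed data would therefore combine programs multiplicatively, which is the behavior the text argues against.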

Note that we do not perform any cell count normalization I. This is because cells with more counts can contribute more information to the model. Technical variation in transcript abundances across cells is captured in the usage matrix rather than the component matrix. However, for the Tasic et al. See below for details.

We use the non-negative matrix factorization implemented in scikit-learn. R replicates of NMF are run on the same normalized dataset with the same number of components K but with different randomly selected seeds, resulting in R instances of usage matrices U^(r) (N cells x K programs) and program matrices G^(r) (K programs x H genes).
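The replicate step can be sketched as follows. This is a minimal illustration only: `nmf_multiplicative` is a hypothetical helper that stands in for scikit-learn's NMF (it uses plain Lee-Seung multiplicative updates so the example needs only NumPy), and the toy Poisson data, K = 3 and R = 5 are arbitrary assumptions.

```python
import numpy as np

def nmf_multiplicative(X, k, seed, n_iter=200, eps=1e-10):
    """Tiny NMF via multiplicative updates, reducing ||X - U G||_F."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    U = rng.random((n_cells, k)) + eps   # usage matrix, N cells x K programs
    G = rng.random((k, n_genes)) + eps   # program matrix, K programs x H genes
    for _ in range(n_iter):
        G *= (U.T @ X) / (U.T @ U @ G + eps)
        U *= (X @ G.T) / (U @ G @ G.T + eps)
    return U, G

# Hypothetical toy data: 60 cells x 20 genes of non-negative counts.
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(60, 20)).astype(float)

K, R = 3, 5
# Same data, same K, different random seed for each replicate.
replicates = [nmf_multiplicative(X, K, seed=s) for s in range(R)]
usages = [U for U, _ in replicates]     # R matrices, each N x K
programs = [G for _, G in replicates]   # R matrices, each K x H
```

Because each replicate starts from a different random initialization, the R program matrices generally differ, which is what the consensus step is meant to reconcile.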

For each replicate r, the rows of G^(r) are l2-normalized. The component matrices from each replicate are then concatenated vertically into a single RK x H dimensional matrix, G, where each row is a component from one replicate. Components with high mean Euclidean distance from their L nearest neighbors are then filtered out.
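A minimal sketch of the normalization, concatenation and outlier filter described above. The function name `stack_and_filter`, the toy matrices, and the `n_neighbors` and `threshold` values are illustrative assumptions; the values used in the original analysis are not stated here.

```python
import numpy as np

def stack_and_filter(programs, n_neighbors, threshold):
    # l2-normalize the rows of each replicate's program matrix.
    normed = [G / np.linalg.norm(G, axis=1, keepdims=True) for G in programs]
    # Concatenate vertically into an (R*K) x H matrix.
    stacked = np.vstack(normed)
    # Pairwise Euclidean distances between all components.
    diff = stacked[:, None, :] - stacked[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row; column 0 is the zero self-distance, so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:n_neighbors + 1]
    mean_nn_dist = nearest.mean(axis=1)
    # Keep components whose neighbors are close; drop the rest as outliers.
    keep = mean_nn_dist < threshold
    return stacked[keep], mean_nn_dist

# Three hypothetical replicates of K = 2 programs over H = 4 genes.
# The last row of the third replicate is a deliberate outlier.
base = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
programs = [base.copy(),
            base + 0.01,
            np.array([[1.0, 0.02, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 1.0]])]

filtered, mean_nn_dist = stack_and_filter(programs, n_neighbors=1, threshold=0.5)
print(filtered.shape)  # (5, 4)
```

The full pairwise distance matrix is fine for a toy example; for many replicates a nearest-neighbor index would be the more economical choice.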

We find this to be an appropriate default setting. Next, the rows of G^f are clustered using KMeans with the Euclidean distance metric and the same number of clusters K as the number of components for the NMF runs.

Each cluster of replicate components is then collapsed down to a single consensus vector by taking the median value for each gene across the components in the cluster. This defines a K x H consensus programs matrix G^c, where the c superscript denotes consensus.
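The median collapse can be sketched as below. The cluster labels are assumed to come from a KMeans run (here they are simply written out by hand), and the component values and the name `consensus_programs` are made up for illustration.

```python
import numpy as np

def consensus_programs(components, labels, k):
    """Collapse each cluster of replicate components to its per-gene median."""
    return np.vstack([np.median(components[labels == c], axis=0)
                      for c in range(k)])

# Five hypothetical replicate components over three genes.
components = np.array([[1.0, 0.0, 0.2],
                       [0.9, 0.1, 0.0],
                       [1.1, 0.0, 0.1],
                       [0.0, 1.0, 0.5],
                       [0.0, 0.8, 0.7]])
labels = np.array([0, 0, 0, 1, 1])   # hypothetical KMeans assignments

G_c = consensus_programs(components, labels, k=2)   # K x H consensus matrix
print(G_c)
```

Using the median rather than the mean makes each consensus entry less sensitive to an occasional poorly converged replicate within a cluster.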

The merged GEP components are then l1-normalized. We concatenate all these coefficients into a consensus usage matrix U^c. With this normalized consensus usage matrix fixed, final program estimates can be computed in the desired units and for all genes, including genes that were not initially included among the over-dispersed set.

To convert the estimated programs to TPM units and to obtain program vectors spanning the full set of input genes, we refit against the matrix of TPM values, T. Note that the TPM matrix T is calculated using a raw count matrix C that includes all genes, even those that were filtered out for falling below the count threshold. We repeat this for all M genes in the filtered count matrix and combine all these coefficients into a consensus program matrix.
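A minimal sketch of the refit with the usage matrix held fixed. It makes several assumptions: `counts_to_tpm` and `refit_programs` are hypothetical helper names, the refit uses simple multiplicative updates on the program matrix only (a stand-in for whatever solver the original analysis used), and the toy count and usage matrices are made up so that an exact solution exists.

```python
import numpy as np

def counts_to_tpm(C):
    """Scale each cell (row) of a raw count matrix to sum to one million."""
    return C / C.sum(axis=1, keepdims=True) * 1e6

def refit_programs(T, U, n_iter=500, eps=1e-10):
    """Estimate non-negative programs G with usage U held fixed, so T ~ U G."""
    rng = np.random.default_rng(0)
    G = rng.random((U.shape[1], T.shape[1])) + eps
    for _ in range(n_iter):
        # Only G is updated; U never changes.
        G *= (U.T @ T) / (U.T @ U @ G + eps)
    return G

# Hypothetical raw counts for 4 cells x 3 genes.
C = np.array([[60.0, 30.0, 10.0],
              [20.0, 30.0, 50.0],
              [40.0, 30.0, 30.0],
              [52.0, 30.0, 18.0]])
# Hypothetical consensus usage, rows summing to 1 (4 cells x 2 programs).
U_c = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5],
                [0.8, 0.2]])

T = counts_to_tpm(C)
G_tpm = refit_programs(T, U_c)   # 2 programs x 3 genes, in TPM units
print(np.round(G_tpm))
```

Because the usage rows sum to 1 and each row of T sums to one million, each recovered program in this toy case also sums to roughly one million, so it can be read as a TPM profile.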

We identify marker genes (genes that are statistically associated with each GEP) using multiple least squares regression of normalized (z-scored) gene expression against the consensus GEP usage matrix. This amounts to finding the genes that have higher than average expression in cells that use a specific GEP. We compute the z-score of the TPM profile of each gene across cells.

We regress against z-scored expression values rather than the un-normalized expression values so that the coefficients will be comparable between genes expressed on different scales. For discrete clustering methods, the usage matrix is a binary indicator matrix containing a 1 for the cluster column each cell is assigned to, and a 0 for all other columns.
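The regression can be sketched with NumPy's least squares solver. The helper name `marker_coefficients` and the toy data are assumptions, and the fit is written without an intercept term, which is a simplification of this sketch. The example uses the binary indicator usage matrix described above for discrete clustering, in which case each coefficient reduces to the mean z-score of the gene within that cluster.

```python
import numpy as np

def marker_coefficients(T, U):
    # z-score each gene (column) of the expression matrix across cells.
    Z = (T - T.mean(axis=0)) / T.std(axis=0)
    # Ordinary least squares of Z on the usage matrix, all genes at once.
    coef, *_ = np.linalg.lstsq(U, Z, rcond=None)
    return coef   # K programs x H genes

# Hypothetical expression for 4 cells x 2 genes.
T = np.array([[10.0, 2.0],
              [12.0, 4.0],
              [2.0, 10.0],
              [4.0, 12.0]])
# Binary indicator usage: cells 0-1 in cluster 0, cells 2-3 in cluster 1.
U = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

coef = marker_coefficients(T, U)
print(coef)
```

Gene 0 gets a positive coefficient for cluster 0 and a negative one for cluster 1, and gene 1 shows the reverse, so the two genes are comparable even if their raw scales differed.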

Identifying marker genes through multivariate regression in this fashion, rather than through separate tests for each GEP, reduces the risk of confounding that can occur when GEPs tend to be expressed in the same cells. For example, if an activity GEP is predominantly expressed in cells of a specific cell-type, it avoids misattributing activity genes to the identity program of that cell-type and vice versa. We note that because gene-expression data is not normally distributed, the residuals of the regression will not be normal, which violates an assumption of OLS regression.

However, the coefficient estimates will still be unbiased even if normality is violated. In practice, we do not use the p-values of the regressions at any point in our analysis, as those can be inaccurate due to non-normality.

We recommend testing for gene-set enrichment on regression coefficients directly as we discuss below rather than setting thresholds on regression P-values. Determining the number of components K to use for cNMF is an important but challenging step without a simple approach that can work for all datasets and applications. We use two diagnostic plots to help guide this decision.

The first plot shows the stability of the solution as captured by the silhouette score and the Frobenius reconstruction error as a function of K as described previously in Alexandrov et al. However, unlike in Alexandrov et al. We compute the Frobenius error using the consensus NMF solution but without any outlier filtering.

We also use the Euclidean distance on l2-normalized components as the metric for the silhouette score rather than cosine distance. Silhouette score is calculated using scikit-learn. We parallelized the individual factorization steps over cores on a multi-core virtual machine using GNU Parallel (Tange).

As another approach to confirm the appropriateness of our choice of K, we use scree plots, which depict the proportion of variance explained per principal component (Figure 2—figure supplement 3, Figure 3—figure supplement 1, Figure 4—figure supplement 1).

This is motivated by the fact that choosing the optimal number of principal components and choosing the number of NMF components can both be framed as estimating the rank for a low-dimensional representation of the input data.

Because principal components are orthogonal to each other, and loadings of NMF components can never be negative, K principal components will always span a larger sub-space than K NMF components. This suggests that the optimal number of NMF components will likely not be smaller than the optimal number of PCs.

The scree plot is a commonly used tool to estimate the number of principal components and we use it to help guide the number of NMF components as well. We note that these two plots merely provide a general aid for the choice of K and we considered the biological interpretability of factors found from several choices of K before proceeding. We do not recommend necessarily using the maximum stability solution of the error vs. Given the uncertainty of the choice of K, we confirmed that the conclusions of this manuscript are robust to this decision.
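The quantity shown on a scree plot can be computed directly from a singular value decomposition. The sketch below is one common way to do it, under the assumption that genes are centered but not rescaled; the helper name `variance_explained` and the simulated rank-2 data are illustrative only.

```python
import numpy as np

def variance_explained(X):
    """Proportion of variance explained by each principal component of X."""
    Xc = X - X.mean(axis=0)                    # center each gene (column)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2
    return var / var.sum()

# Hypothetical data: a rank-2 signal over 10 genes plus a little noise.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 2))
B = rng.normal(size=(2, 10))
X = A @ B + 0.01 * rng.normal(size=(100, 10))

ratios = variance_explained(X)
print(np.round(ratios, 4))
```

Plotting `ratios` against component index gives the scree plot; in this toy case the curve drops sharply after the second component, which is the elbow one would look for.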

For each step below the selected K, approximately a single GEP was lost, but for choices above the selected K, components approximately matching the original K programs I. This suggests that cNMF yields relatively stable solutions for a moderate range of K values.

For LDA, we used the batch algorithm and all other parameters as defaults. We defined the consensus estimates across replicates in the same way as for cNMF but with a slight modification for ICA. For Louvain clustering, we used 14 principal components to compute distances between cells and used nearest neighbors to define the KNN graph. We chose 14 principal components based on the fact that the data was simulated based on a dimensional basis and, therefore, the biological variation in the data can be captured by 14 PCs and subsequent components correspond to noise.

This choice is also supported by the elbow of the scree plot in Figure 2—figure supplement 3. We used nearest neighbors for the clustering as this is a relatively large number to minimize variance but it is still smaller than the smallest discrete population 0.

To evaluate the accuracy of these various methods, we first calculated the z-score coefficient for associating each gene with each program as described above.

We then calculated sensitivity and false discovery rate (FDR) for each threshold on those coefficients and plotted those as an ROC curve, except with FDR on the x-axis instead of false positive rate. Genes with a fold-change between 1 and 2 were ignored for this evaluation. We used the z-score regression coefficients identified as above as input for a one-sided Mann-Whitney U test with tie correction, comparing the median of genes in each geneset to those of genes not in the geneset.
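The sensitivity-versus-FDR curve described above can be sketched in plain Python. The function name, the scores and the ground-truth labels are hypothetical; a gene counts as "called" for a program when its coefficient is at or above the threshold.

```python
def sensitivity_fdr_curve(scores, is_true):
    """For each score threshold, return (FDR, sensitivity) of genes scoring >= it."""
    n_true = sum(is_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        called = [t for s, t in zip(scores, is_true) if s >= threshold]
        true_pos = sum(called)
        sensitivity = true_pos / n_true
        fdr = (len(called) - true_pos) / len(called)
        points.append((fdr, sensitivity))
    return points

# Hypothetical regression coefficients and whether each gene truly belongs
# to the program in the simulation.
scores = [3.0, 2.5, 2.0, 1.0, 0.5]
is_true = [True, True, False, True, False]

curve = sensitivity_fdr_curve(scores, is_true)
for fdr, sensitivity in curve:
    print(round(fdr, 3), round(sensitivity, 3))
```

Each tuple is one point on the curve, with FDR as the x-coordinate and sensitivity as the y-coordinate.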

We first floored all negative coefficients to equal zero prior to the test. Coefficients less than 0 indicate genes that are expressed at higher levels in cells that do not use the GEP (all other things equal) than in cells that do. We floor these values so that variation in genes that are not directly part of a GEP (which can make up the majority of genes) does not substantially impact the Mann-Whitney statistic for that GEP. However, we obtained the clustering and unnormalized data by request from the authors.
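The flooring step and the one-sided test can be sketched without external libraries. This is an illustration under stated assumptions: `mann_whitney_greater` is a hypothetical helper using the normal approximation with tie correction and no continuity correction, and the coefficient values are made up. A statistics library's Mann-Whitney routine would normally be used instead.

```python
import math

def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test (x tends to be larger than y).

    Returns (U statistic for x, p-value from the normal approximation).
    """
    pooled = sorted(x + y)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    # Average rank for each distinct value (tied values share a mean rank).
    rank_of, tie_term, i = {}, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2))      # P(Z >= z)
    return u, p

def floor_at_zero(values):
    return [max(v, 0.0) for v in values]

# Hypothetical coefficients for genes inside and outside one gene set.
coefs_in_set = [2.1, 1.4, 0.9, -0.3]
coefs_outside = [0.2, -0.5, -1.1, 0.0, -0.2, 0.4]

u, p = mann_whitney_greater(floor_at_zero(coefs_in_set),
                            floor_at_zero(coefs_outside))
print(u, round(p, 4))
```

Flooring creates many ties at zero, which is why the tie correction in the variance matters for this use of the test.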

We sought to understand how variability between sample replicates and batches would impact the results of cNMF. We therefore considered how GEP usage varies across replicates in the primary datasets analyzed in this manuscript, as well as in a previously published scRNA-Seq dataset of human pancreatic islets with a noted batch effect (Baron et al.).

First, we analyzed the aggregate GEP usage of cells in organoid replicates in the Quadrato et al. For this purpose, we defined the aggregate GEP profile of a replicate as the sum of the GEP usage of all cells derived from that replicate.

The visual cortex data showed relative uniformity of GEP usage across mouse replicates, with the only clear pattern being the expected association between depolarization-induced GEPs and mice treated with the stimulus (Figure 5—figure supplement 1a, left). By contrast, there was significant variability between organoids in the Quadrato et al. This variability was discussed in the original manuscript and validated using immunohistochemistry, and thus represents true biological signal that we would hope for cNMF to discern.

We also considered whether any GEPs could be attributed to just one or a small number of replicates, which could suggest that they are not reproducible within the experiment. We therefore looked at what percentage of the aggregate usage of a GEP derived from cells in each replicate. We found that each GEP contributed to cells from multiple independent replicates in both datasets (Figure 5—figure supplement 1, right panels).
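Both summaries used in this analysis, the aggregate GEP usage per replicate and the share of each GEP's usage contributed by each replicate, can be sketched as follows. The usage values, replicate labels and the helper name `aggregate_usage` are hypothetical.

```python
import numpy as np

def aggregate_usage(U, replicate_ids):
    """Sum GEP usage over the cells of each replicate (replicates x K)."""
    reps = sorted(set(replicate_ids))
    ids = np.array(replicate_ids)
    agg = np.vstack([U[ids == r].sum(axis=0) for r in reps])
    return reps, agg

# Hypothetical usage for 4 cells x 2 GEPs, with each cell's replicate label.
U = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8],
              [0.5, 0.5]])
replicate_ids = ["a", "a", "b", "b"]

reps, agg = aggregate_usage(U, replicate_ids)
# Share of each GEP's total usage that comes from each replicate;
# every column sums to 1.
share = agg / agg.sum(axis=0, keepdims=True)
print(reps)
print(share)
```

A GEP whose column in `share` is concentrated in a single row would be the kind of replicate-specific program this check is meant to flag.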

Furthermore, each organoid GEP was the maximum contributing GEP for a cell in at least six distinct organoid replicates, and each visual cortex GEP was the maximum contributor for a cell in at least 10 distinct mouse replicates. This supports our conclusion that the inferred GEPs represent reproducible signals within the primary organoid and visual cortex datasets. We also analyzed a human pancreatic islet scRNA-Seq dataset where variability between four donors resulted in more substantial batch effects, to see how that would impact the behavior of cNMF (Baron et al.).

Applied to this dataset of 10, cells, cNMF identified 16 GEPs that corresponded well with the cell-type clusters described in the initial publication (Figure 5—figure supplement 2). However, many of the cell types that were missed by cNMF were only distinguished through iterative sub-clustering in the initial publication, which we did not attempt. One potential contributor to the batch effect could be that donors 1 and 3 were male and donors 2 and 4 were female.

But in general, the fact that cNMF is discerning multiple GEPs for the same cell-types suggests that technical sources of variation such as batch-effect can confound the identification of identity and activity GEPs. To avoid incorporating variation between batches into the inferred GEPs for datasets containing significant batch-effect, batch-effect correction can be performed prior to running cNMF.

All of the analyzed real datasets are publicly available and the relevant GEO accession codes are included in the manuscript. In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses.

A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. The non-negative matrix factorization method for the analysis of RNA-seq data presented in this paper, consensus-NMF (cNMF), addresses the identification of cell types in the context of other non-cell-type-specific activities.

A strength of the study is that results are not only compared with other methods, but also applied to both synthetic and real datasets. A number of key elements of the algorithm are not described in sufficient detail to make it clear and reproducible. Add equations to explain how to achieve the results and explain how parameters were selected.

It is a balance between stability and error. Currently, the choice of the number of factors, K, when applying the method to the simulated and the real datasets was not convincing. Showing how derived GEPs look under different choices of K would be essential. In addition, it is also unfair to compare with other unsupervised methods such as PCA if K cannot be automatically selected.

The authors can suggest an automatic solution for choosing K. How does the method control batch effects? It would be informative to show the variance explained by the overall model and each factor in both simulated and other real datasets (preferably with a larger K) to put the activity GEPs into perspective.

We have significantly expanded our discussion of how cNMF allows marker gene identification and how this compares to differential expression approaches that follow hard-clustering.

Given that cNMF does not assign cells to discrete groups, we cannot use a standard differential expression approach for identifying marker genes, and therefore adapted a solution to our setting. We identify marker genes by fitting multivariate linear regression models of z-scored gene expression profiles against GEP usage.

This approach generalizes the commonly used t-test to settings where cells have continuous weights for each GEP, rather than binary assignments. This is now described in explicit mathematical notation as per essential revision 2 below. Our decision to identify marker genes with multivariate linear regression as specified above is motivated by several key considerations.

The coefficients all represent changes relative to the average expression across all cells. Including both the identity and activity GEP as covariates avoids this confounding and strengthens our testing approach.

We do not use the P-values of the regressions at any point in our analysis as those can be inaccurate due to non-normality. In the revised manuscript, we have formalized the presentation of the method using detailed mathematical notation intertwined with the text description of the method. We agree that this will increase the precision of the method description and will provide a clearer description for mathematically inclined readers.

In the sections below, we expand on these additions and describe where they can be found in the main text. We acknowledge that our simulation of cell-types at uniform proportions is a simplification of the biological reality, where cell-type frequencies can vary over multiple orders of magnitude. We did so to simplify our benchmarking analysis, as it allows us to treat all identity programs as replicates of each other for evaluating inference accuracy.

If we had alternatively simulated cell-types with variable proportions, this simplification would not have been possible because GEPs of rare cell-types would, everything else being equal, be harder to infer than those of common cell-types. It is important to ensure that the simplification of simulating cell-types at uniform proportions does not change our overall conclusions about the applicability or comparative performance of the methods.

Therefore, in the revised text, we describe additional simulations where cell-type proportions matched those of a representative real biological dataset the Hrvatin et al.

This analysis is described in the last paragraph of the simulation section of the Results (Figure 2—figure supplement 8).

So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).
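How many symbols of a fixed-length gene are actually expressed can be computed by reading the gene left to right and tracking how many arguments are still unfilled. The sketch below assumes a small function set (`+`, `-`, `*`, `/` with arity 2 and `Q` for square root with arity 1) and terminals `a` and `b`; the genes shown are made-up examples with a head of length 3 and a tail of length 4.

```python
# Arity of each function symbol; any other symbol is treated as a terminal.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}

def k_expression_length(gene):
    """Number of leading symbols of `gene` used by its expression tree."""
    needed = 1      # nodes still required to complete the tree (the root)
    used = 0
    while needed > 0:
        symbol = gene[used]
        used += 1
        # A symbol fills one open slot and opens `arity` new ones.
        needed += ARITY.get(symbol, 0) - 1
    return used

print(k_expression_length("aQ+abab"))  # 1: the first element is a terminal
print(k_expression_length("+a*bbab"))  # 5: a tree of intermediate size
print(k_expression_length("+*-abab"))  # 7: every head symbol has maximum arity
```

The remaining symbols of each gene are non-coding in that individual, but they stay in the genome and can become expressed after a later modification.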

It is evident from the examples above that any modification made in the genome, no matter how profound, always results in a structurally correct program.

Now this is unique to GEP and is obviously at the heart of its superior performance: for instance, in the simple replicator system of GP, most modifications result in syntactically invalid programs (imagine, for instance, what would happen if a terminal in a GP tree is replaced by a function), which is why most GP implementations rely exclusively on inefficient tree-specific crossover to create genetic diversity [4].

So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the size of the head, are chosen a priori. Each gene codes for a sub-ET and the sub-ETs interact with one another, forming a more complex multi-subunit expression tree. For the sake of simplicity, in the linear representation above, the start of each K-expression is always given by position 0; the end of each K-expression, though, is only evident upon construction of the corresponding sub-ET.

As shown in the Figure above , the first K-expression ends at position 1; the second at position 7; and the last at position Thus, the multigenic chromosomes of GEP contain multiple K-expressions of different sizes, each one of them coding for a structurally and functionally unique sub-ET.

Obviously, these sub-ETs or sub-programs must interact with one another and, in the canonical GEP system, they interact through special functions, the so-called linking functions. These functions link the sub-ETs together one after the other in an orderly fashion. For instance, all three sub-ETs shown in the Figure above can be linked by addition (the linking functions are shown in gray). Note that the final program represented in the Figure above could also be linearly encoded as a single K-expression.
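Linking by addition can be sketched by evaluating each gene's sub-ET separately and summing the results. Everything concrete here is an assumption for illustration: the function set, the three-gene chromosome (not the one in the Figure), the gene length of 7, and the input values for the terminals `a` and `b`.

```python
import math
import operator

# Function symbols mapped to (arity, implementation); others are terminals.
FUNCTIONS = {"+": (2, operator.add), "-": (2, operator.sub),
             "*": (2, operator.mul), "Q": (1, math.sqrt)}

def eval_gene(gene, env):
    """Evaluate one gene written as a breadth-first K-expression."""
    # Pass 1: walk level by level, recording each node's child positions.
    children = []
    next_child = 1
    i = 0
    while i < next_child:
        arity = FUNCTIONS[gene[i]][0] if gene[i] in FUNCTIONS else 0
        children.append(list(range(next_child, next_child + arity)))
        next_child += arity
        i += 1
    # Pass 2: evaluate from the last expressed node back to the root.
    values = [None] * len(children)
    for i in reversed(range(len(children))):
        symbol = gene[i]
        if symbol in FUNCTIONS:
            func = FUNCTIONS[symbol][1]
            values[i] = func(*(values[c] for c in children[i]))
        else:
            values[i] = env[symbol]
    return values[0]

def eval_chromosome(chromosome, gene_length, env):
    genes = [chromosome[i:i + gene_length]
             for i in range(0, len(chromosome), gene_length)]
    # Linking function: addition of all sub-ET values.
    return sum(eval_gene(g, env) for g in genes)

# Three hypothetical genes: a + b*b, sqrt(a*b), and a - b.
chromosome = "+a*bbab" + "Q*abbab" + "-ababab"
result = eval_chromosome(chromosome, gene_length=7, env={"a": 4.0, "b": 9.0})
print(result)  # 86.0
```

Replacing `sum` with another associative operation, such as a product, would give a different linking function without changing how the individual genes are decoded.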

These smaller building blocks are separated from one another and are therefore free to evolve independently, allowing for the creation of concrete new units that might prove handy in a new situation. Structurally, the Dc comes after the tail, has a length t equal to the length of the tail, and is composed of the symbols used to represent the random constants.

Therefore, another region besides the head and the tail with defined boundaries and its own alphabet is created in the gene. Then the? The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array.

For instance, for the 10-element zero-based array:

These eQTLs control the behavior of genes like a thermostat regulates the temperature of a home. Researchers also discovered that eQTLs act in different ways.

This website provides a resource for the many researchers who are exploring the human genome. Understanding how the eQTLs change gene behavior in different tissues can help us understand how diseases develop in people.

This knowledge, in turn, may help us develop new therapies and treatments, improving our health overall. As of December , GTEx finished enrollment of the additional donors, for a total of donors. Analysis of the samples and data will continue for another 18 months.

Over 30, samples have been collected. In fall of , information on gene expression for over donors was released to the scientific community through the database of Genotypes and Phenotypes (dbGaP). Additionally, the new version of the GTEx Genome Browser has been launched and features new visualization tools.

In , The National Institutes of Health awarded eight new grants to researchers to use tissues donated to GTEx to explore how human genes are expressed and regulated in different tissues.

In , the GTEx Consortium published its final set of studies analyzing genotype data from approximately post-mortem donors and approximately 17, RNA-seq samples across 54 tissue sites and 2 cell lines, with adequate power to detect Expression Quantitative Trait Loci in 48 tissues.


