We will therefore aim to combine all of these datasets and construct a coherent atlas of cellular phenotypes in the human brain, based on our best-practice methods and metrics. While this goal is ambitious, our preliminary data indicate that it is achievable, and we welcome the inclusion of additional neuronal datasets that become available during the project period. After integrating all datasets, we will perform a single clustering analysis using a graph-based community detection algorithm, which we expect to substantially boost our ability to detect rare and subtle neuronal cell states.
We will aim to deliver: 1) a catalogue of cell states and genetic markers that is robust to differences in technology, alongside rare cell types that are found in only a subset of datasets; 2) a systematic comparison of transcriptomic profiles for the same cell state across technologies, with a particular focus on identifying gene sets that are enriched in either scRNA-seq or scNuc-seq experiments; 3) Jupyter notebooks with reproducible workflows, laying out a clear roadmap for similar analyses in diverse tissues. We will assess the performance of our methods using the benchmarks developed above, and will also leverage longstanding collaborations with neuroscientists (Tom Maniatis, Gord Fishell, Steve McCarroll, Evan Macosko) to assist in interpretation and exploration of our findings.
Aim 3 (Rahul): Integrate neuronal datasets across species, mapping human cell types to their mouse counterparts
HCA aims to identify hundreds to thousands of cell types in the human brain based on molecular and spatial characterization, but functional characterization or perturbation of these cells cannot be performed in humans. Understanding the human genome faces similar challenges, and comparative genomics represents an invaluable tool to identify conserved signals, highlight important differences, and map human sequences onto tractable model systems. We propose that cross-species analysis will play a similarly essential role for HCA, and will be a crucial step in connecting a catalog of cell types to deeper biological understanding.
We recently published the first integrated analysis of human and mouse pancreatic atlases, identifying ten shared cell types despite significant evolutionary divergence. We will extend this analysis here to create an integrated atlas of the mammalian nervous system, leveraging landmark neuronal datasets in the adult mouse (Zeisel, Allen, Macosko). Our successful alignment of pancreatic islet cells demonstrates that this goal is feasible, even when only a subset of transcriptomic markers is shared across species. We will also rigorously test strategies for mapping gene ontologies. Our deliverables will be as stated in Aim 2, but here we will also focus on reporting the best transcriptomic markers that are shared between species, potentially enabling the construction of murine Cre driver lines for functional characterization. In addition, we expect that our initial characterization of cell states that are shared across species, or unique to either, will be of significant value to the neuroscience community, and will begin to establish the power of applying lessons from comparative genomics to HCA.
Aim 3 (John):
The approaches outlined in Aim 1 focus on improving the performance of data integration methods when combining multiple datasets generated from the same underlying population of cells (e.g., a specific tissue), either using different technologies or with cells collected from different individuals. In developmental biology, and when comparing tissues across species, the assumption that we are considering the same underlying population of cells does not hold. For example, in tissue differentiation, cells collected at different stages of development will consist of a mix of common cell types (e.g., precursor populations present at different time points), transitional populations present only at specific time points and, ultimately, terminally differentiated cells.
At present, technological limitations mean that cells are sampled sequentially, so constructing pseudotemporal differentiation trajectories for many biological processes requires combining information across batches. This applies both in developmental biology (e.g., sampling different stages of early mouse development) and when modelling the development of human cell types in vitro using organoid-based systems. To integrate data collected from sequential stages of development, we propose to jointly learn the biological manifold while correcting for batch effects.
To this end, we propose to extend the Mutual Nearest Neighbor approach by employing a more formal factor analysis framework. Specifically, we will assume that variability in the expression profiles within the combined dataset (i.e., considering cells from all timepoints) can be explained by a series of hidden factors that we want to infer. Each factor will be “active” for a given set of genes that co-vary consistently across the entire dataset.
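As a point of reference for the starting approach, the following is a minimal sketch of the pair-finding step behind mutual nearest neighbours, not the proposed factor analysis extension. The function name `mutual_nearest_neighbors`, the choice of Euclidean distance, the value of `k`, and the simulated two-batch data are all assumptions made for illustration.

```python
import numpy as np

def mutual_nearest_neighbors(batch_a, batch_b, k=3):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    are each among the other's k nearest neighbours (Euclidean distance)."""
    # Pairwise squared distances between batches, shape (n_a, n_b)
    diff = batch_a[:, None, :] - batch_b[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    n_a, n_b = dist.shape
    # For each cell in A, the indices of its k nearest cells in B
    nn_ab = np.argsort(dist, axis=1)[:, :min(k, n_b)]
    # For each cell in B, the indices of its k nearest cells in A
    nn_ba = np.argsort(dist.T, axis=1)[:, :min(k, n_a)]
    # Boolean masks over (cell in A, cell in B)
    a_to_b = np.zeros((n_a, n_b), dtype=bool)
    a_to_b[np.arange(n_a)[:, None], nn_ab] = True
    b_to_a = np.zeros((n_a, n_b), dtype=bool)
    b_to_a[nn_ba, np.arange(n_b)[:, None]] = True
    # A pair is mutual only if it appears in both neighbour lists
    mutual = a_to_b & b_to_a
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(mutual))]

# Hypothetical example: two cell populations measured in two batches,
# with batch B shifted by a constant offset standing in for a batch effect.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
batch_a = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
batch_b = np.vstack([c + 1.0 + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=3)
```

In this toy setting, the pairs link cells from the same population across batches, and the differences between paired cells could then be used to estimate a batch correction.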
To disentangle batch effects from biological signal, we will assume that batch effects apply to large numbers of genes and will thus be captured by dense factors with large numbers of active genes (including housekeeping genes). Importantly, we will also assume that technical effects are orthogonal between pairs of batches and, crucially, that they are always orthogonal to the biological signal of interest. To recover this biologically meaningful signal, we will identify informative factors, which will generally have a smaller number of active genes. Additionally, prior information, corresponding to the stages when samples are collected, can help guide the choice of factors.
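The distinction between dense and sparse factors can be illustrated with a small sketch that works on a matrix of factor loadings (genes by factors). It assumes the loadings have already been inferred. The function name `classify_factors`, both thresholds, and the simulated loadings are hypothetical choices for illustration, not part of the proposed model.

```python
import numpy as np

def classify_factors(loadings, active_threshold=0.1, dense_cutoff=0.5):
    """Label each factor 'dense' or 'sparse' by its fraction of active genes.

    loadings: array of shape (n_genes, n_factors).
    A gene counts as active for a factor if |loading| > active_threshold.
    """
    active = np.abs(loadings) > active_threshold
    fraction_active = active.mean(axis=0)
    labels = np.where(fraction_active > dense_cutoff, "dense", "sparse")
    return fraction_active, labels

# Hypothetical loadings: one factor touching nearly every gene (batch-like)
# and one factor restricted to 15 genes (candidate biological signal).
rng = np.random.default_rng(1)
n_genes = 200
dense_factor = rng.normal(loc=1.0, scale=0.2, size=n_genes)
sparse_factor = np.zeros(n_genes)
sparse_factor[:15] = rng.normal(loc=2.0, scale=0.2, size=15)
loadings = np.column_stack([dense_factor, sparse_factor])

fraction_active, labels = classify_factors(loadings)
```

In practice, a hard cutoff of this kind would likely be replaced by sparsity-inducing priors within the factor model, with sample collection stage as additional guidance.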
Aim 3 (Oli):
The methods developed in Aim 2 allow the integration of different scRNA-seq datasets by modelling shared sources of gene expression covariation. Here, we seek to extend these methods to additional single-cell technologies, most notably expression assays that deliver spatially resolved expression levels, a critical component of the HCA.
An important limitation of conventional scRNA-seq of dissociated populations of cells is that the natural tissue context of the cells is lost. Complementary data from spatial profiling methods are indispensable for filling these gaps, allowing single-cell RNA-seq profiles to be placed into the context of tissue coordinates. While the generation of these datasets is already underway in different contexts and is a major component of several HCA projects, there is a lack of computational strategies for integrating spatial expression data and scRNA-seq datasets. To address this, we will extend the methods derived in Aim 2 to account for spatial information about the cells. These approaches will provide new insights into spatial components of gene expression variation, including predictions of spatially coordinated expression across cells from scRNA-seq. At the core of our approach will be the development of new factor models that use spatial Gaussian process priors on the inferred factors. We have recently proposed one of the first methods for modelling spatially resolved expression datasets using this class of model (Svensson et al., 2017). By connecting these different models, it will be possible to infer factors that explain co-expression clusters with and without a spatial underpinning, which can be integrated with scRNA-seq data from dissociated cells.
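To make the idea of a spatial Gaussian process prior concrete, the sketch below draws one factor from such a prior over a set of tissue coordinates. The squared-exponential kernel, the length scale, the jitter term, and the simulated coordinates are all assumptions chosen for illustration; the proposal does not commit to a specific covariance function.

```python
import numpy as np

def squared_exponential_kernel(coords, length_scale=1.0, variance=1.0):
    """Covariance between cells as a function of their spatial distance.

    coords: array of shape (n_cells, n_dims) with tissue coordinates.
    Nearby cells get covariance close to `variance`; distant cells near 0.
    """
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

# Hypothetical tissue section: 50 cells at random 2D positions.
rng = np.random.default_rng(2)
coords = rng.uniform(0.0, 10.0, size=(50, 2))
K = squared_exponential_kernel(coords, length_scale=2.0)

# Draw one spatially smooth factor from the prior N(0, K).
# A small jitter on the diagonal keeps the Cholesky factorisation stable.
chol = np.linalg.cholesky(K + 1e-6 * np.eye(len(coords)))
spatial_factor = chol @ rng.standard_normal(len(coords))
```

A factor without a spatial underpinning would correspond to replacing `K` with a diagonal covariance, so that the factor value of each cell is independent of its neighbours.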
-- more here --
5. Dissemination of Methods, Collaboration with CZI, Commitment to Sharing
Our laboratories have been at the forefront of methods development for single-cell data analysis and integration. In 2015, the Marioni and Satija groups independently published the first analytical methods to integrate scRNA-seq datasets with in situ hybridization databases, enabling the inference of a cell's spatial localization based on its gene expression. All groups have also created and maintained powerful, widely used, and fully open-source scRNA-seq analytical toolkits, scran (Marioni), scater (Stegle) and Seurat (Satija), demonstrating our deep commitment to fully sharing analytical methods with the community.
BRIEF PROJECT SUMMARY (250 words; currently exactly 250)