Integrating information across multiple datasets will maximise the power of the Human Cell Atlas (HCA), but doing so is a substantial computational challenge. Even for individual tissues, the HCA will not be constructed from a single dataset, but will draw on multiple technologies and on samples from different individuals. The use of multiple technologies is particularly important for single-cell RNA-sequencing (scRNA-seq), where comparative studies have consistently found that no single approach is optimal, with each technology having distinct strengths and weaknesses.
Given this, we address a fundamental question for the HCA: how can we integrate a diverse community effort to construct a coherent atlas of human cell types? We propose that powerful machine-learning techniques based on 'joint manifold learning', often used in the 'alignment' of massive imaging datasets to recognize shared high-dimensional features, can be used to recognize shared cellular phenotypes across single-cell datasets.
This proposal will establish a new collaboration between the Satija Lab at the New York Genome Center and the Marioni and Stegle Labs at EMBL/EBI, all of whom have expertise in computational integration for scRNA-seq but have not previously worked together. We will collaboratively develop a set of methods and best practices for data integration, alongside novel metrics and benchmarks that are of importance and value to the community. Using these approaches, we will integrate seven human neuronal scRNA-seq datasets, combining data from shallow, deep, cytoplasmic, and nuclear scRNA-seq technologies. Finally, we will extend these approaches to integrate datasets generated across different developmental timepoints, or even different species.
 
COLLABORATIVE NETWORK (500 words - 300 below)
Our laboratories have been at the forefront of methods development for single-cell data analysis and integration. From the very first methods for identifying highly variable genes from single-cell RNA-sequencing data (Brennecke et al., 2013), through modelling and removing confounding variables (Buettner et al., 2015; Buettner et al., 2016) and integrating spatial information with scRNA-seq measurements (Achim et al., 2015; Satija et al., 2015), to developing machine-learning methods for interpreting single-cell variation (Angermueller et al., 2017), [RAHUL / OLI: DO YOU WANT TO HIGHLIGHT SOMETHING ELSE HERE??], our groups have consistently pioneered new methods. Importantly, these methods are all open source and widely used by the community: for example, the scran (Marioni), scater (Stegle) and Seurat (Satija) packages are all complete open-source scRNA-seq analysis toolkits, demonstrating our deep commitment to sharing methods with the wider community.
Additionally, over the past few months, all three of our groups have been working on methods for integrating multiple datasets: either different scRNA-seq datasets (Marioni, Satija) or multiple single-cell omics experiments (Stegle). With this RFA comes the opportunity to expand these methods so that they are suitable for the exceptionally complex data generated by the HCA while, at the same time, developing standards and resources that will be of high utility to the wider scientific community. 
The collaborative network will operate through shared online tools (Slack channels / Google Docs / GitHub repos), where members from each group will share code and ideas as they are generated. This immediate interaction will ensure that we can home in on the most appropriate solutions quickly. Additionally, we will organise frequent in-person meetings (at least three times a year) where the scientists working on each aspect of the project will meet one another and interact. Such gatherings play an important role not only in pushing forward the science but also in building a strong and interactive community of computational biologists.
KEYWORDS
Single-cell biology
Data integration
Statistics
Machine learning
Collaboration
Open-source
SIGNIFICANT PUBLICATIONS RELATED TO THIS PROPOSAL (FIVE PER GROUP)
John:
Achim et al., Nat Biotechnol., 2015
Haghverdi et al., Biorxiv, 2017
Lun, Bach, Marioni, Genome Biol., 2016
Lun, McCarthy, Marioni, F1000R, 2017
Scialdone et al., Nature, 2016
Oli:
Buettner et al., Nat Biotechnol., 2015
Buettner et al., Biorxiv, 2016
Angermueller et al., Genome Biol., 2017
Svensson et al., Biorxiv, 2017
Angermueller et al., Nat Methods, 2016