Describe any limitations or absence of existing cyberinfrastructure, and/or specific technical advancements in cyberinfrastructure (e.g. advanced computing, data infrastructure, software infrastructure, applications, networking, cybersecurity), that must be addressed to accomplish the identified research challenge(s).

2.1 Cyberinfrastructure in the Paleogeosciences: Trends and Needs

\label{cyberinfrastructure-in-the-paleogeosciences-trends-and-needs}
In the paleogeosciences, the emerging cyberinfrastructure model is a distributed, federated network of resources and services, each curating and advancing a particular kind of knowledge. This system is consistent with the distributed nature of geoscientific expertise: geochronological data should be curated by geochronologists, species names by paleontologists, and so on. Within this structure, focal nodes serve as disciplinary loci for data stewardship and mobilization, and as resources for sharing best practices and standards across subdisciplines. This structure has emerged organically, with key support from NSF Geoinformatics and EarthCube, and comprises at least five major interacting components:
Community Curated Data Repositories (CCDRs). CCDRs have repeatedly emerged as geoscientists unite to gather and share data in response to common, broad-scale research questions (Section 1A). CCDRs usually begin as individual or small-team efforts, then mature into community resources with established data models, standards, and governance systems. Because describing the history of Earth's climates and biodiversity requires coordinated effort, CCDRs prevail in paleoclimatology, paleobiology, and paleoecology (e.g., Neotoma Paleoecology DB, Paleobiology Database, LinkedEarth, SedDB/EarthChem, NOAA Paleoclimatology, MorphoBank, VertNet). New CCDRs emerge as proxies mature, e.g., the recent call for IsoBank (Pauli et al. 2017). Key challenges include reducing data friction through better integration (Sect. 2.3), encouraging community input, and, most of all, sustainability (Sect. 3.1).
Museums and Sample Repositories curate physical specimens (rock samples, drill cores, fossils, biological materials) and their digital representations. Examples include LacCore/CSDCO, IODP, and marine core repositories at Columbia University, the University of Rhode Island, and Oregon State. Recent initiatives have focused on digitizing collections (iDigBio, iDigPaleo), developing persistent and unique sample identifiers (IGSNs), and establishing provenance systems that link samples to measurements (Open Core Data).
Integration and Networking Activities. The EarthCube initiative has provided an essential push, and essential resources, to build a networked federation of interconnected CCDRs and sample repositories. Current efforts include Open Core Data (linked data standards for continental and marine scientific drilling data), the Earth Life Consortium (an umbrella organization for Neotoma, PBDB, and other paleobiological CCDRs, also linking to modern biodiversity databases), ePANDDA (linking specimen holdings in the Paleobiology Database with the museum digitization efforts of iDigPaleo), LinkedEarth (a linked data standard for paleoclimatic data; McKay et al. 2016), and Flyover Country (a popular app for viewing geological data during air or ground travel, now used by ~600 people each day for informal learning and winner of a recent "Vizzie" award).
Individual geoscientists. Most scientific data curation is still done by individual geoscientists in their research labs, with data stored on desktop computers and local servers. Most geoscientists store and record data using flat file formats (TXT, CSV, XLS) or the workflow software associated with their instrumentation systems. Many subdisciplines have no established metadata standards, data reduction standards, or community data repositories. Enormous effort is spent converting small amounts of data from one format to another (by some estimates, up to 80% of total project effort); a sketch of this routine conversion work follows this list.
Scientific Literature and Unstructured Data. A vast amount of paleogeoscientific data is available only through the published literature. These data are highly unstructured and not readily amenable to broad-scale synthesis. Arguably, over 100 years of geological research has mainly succeeded in transferring information from one vast and dimly accessible archive (the geologic record) to another (the published literature), which is better, but far from ideal. We need better systems for mining this resource.
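The following minimal sketch illustrates the routine format-conversion work described under Individual geoscientists above. The file name, column names, and units are hypothetical; real instrument exports vary widely, which is precisely the source of the friction.

```python
# A minimal sketch of routine flat-file conversion work
# (hypothetical file and column names, for illustration only).
import json
import pandas as pd

# Instrument export: extra header rows, ad hoc column names, no units.
raw = pd.read_csv("ms_runs_2017_final_v3.csv", skiprows=4)

# Hand-built mapping from one lab's column names to shared conventions.
tidy = raw.rename(columns={
    "DepthCM": "depth",        # cm below core top
    "d18O_vpdb": "delta_18O",  # per mil, VPDB scale
    "AgeBP": "age",            # calibrated years BP
})[["depth", "delta_18O", "age"]]

tidy.to_csv("tidy_core_data.csv", index=False)

# Record the metadata the flat file never carried, as a sidecar file.
metadata = {"depth_units": "cm", "age_units": "cal yr BP",
            "d18O_reference": "VPDB",
            "source": "ms_runs_2017_final_v3.csv"}
with open("tidy_core_data.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Every lab writes some variant of this script for every instrument and every repository it works with; structured vocabularies and shared workflows (Sect. 2.3) would eliminate most of this duplicated effort.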

2.2 Paleogeoscience Cyberinfrastructure: Priorities for the Next Decade

\label{paleogeoscience-cyberinfrastructure-priorities-for-the-next-decade}
Based on the above, we argue that for our community, the most productive scientific return on NSF cyberinfrastructure investment over the next decade will come from distributed, meso-scale funding in the following six priority areas:
  1. Reduce data friction by developing scientific workflows, structured vocabularies, semantic frameworks, and data-tagging systems to pass data and metadata seamlessly within and among community resources (Sect. 2.3).
  2. Develop automated data-mining systems for extracting information from unstructured data in the scientific literature (Sect. 2.4).
  3. Support the long-term sustainability of existing community cyberinfrastructure resources (Sect. 3.1) and the grassroots development of community informatics resources for sub-disciplines that lack data-sharing systems (Sect. 3.2).
  4. Launch funded data mobilization campaigns to unlock existing data relevant to high-priority scientific research questions (Sect. 3.3).
  5. Develop and train a distributed scientific workforce, encompassing both early-career scientists and current practitioners (Sect. 3.4).
  6. Establish a National Center for Paleodata Synthesis to coordinate activities among individual geoscientists and the federation of CCDRs and sample repositories, promote community best practices and data standards, and develop education and scientific workforce training initiatives (Sect. 3.5).

2.3 Reduce Data Friction

\label{reduce-data-friction}
We need to build the systems and standards necessary to pass data within primary data-generation scientific workflows (from field collection to laboratory measurement to publication and archival) and among the emerging federation of data repositories that facilitate downstream data integration and synthesis. Researchers must be able to access data at any point in the stream of data generation, and to see its provenance, the effects of any models applied to it, and its subsequent interpretations. Repositories must be interconnected, and scientists must be able to iteratively re-evaluate and annotate data. Clear provenance is essential for linking records across resources.
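The W3C PROV data model is one standard way to record such provenance chains. The minimal sketch below, using the Python prov package, records that an age model was derived from raw measurements by a fitting step; the identifiers are hypothetical.

```python
# A minimal provenance sketch using the W3C PROV data model
# (Python "prov" package; identifiers are hypothetical).
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

raw = doc.entity("ex:core-XYZ-raw-measurements")   # lab output
fit = doc.activity("ex:age-model-fitting")         # the processing step
model = doc.entity("ex:core-XYZ-age-model")        # derived data product

doc.used(fit, raw)               # the fitting step consumed the raw data
doc.wasGeneratedBy(model, fit)   # and produced the age model,
doc.wasDerivedFrom(model, raw)   # so the model is traceable to its source

print(doc.get_provn())           # human-readable PROV-N serialization
```

Records like these, attached at each step of the data-generation stream, are what let a downstream user trace any synthesized value back to the sample and measurement it came from.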
Web architectural approaches are federated, scalable, and fault-tolerant; all are desirable properties for a distributed paleodata network. Common permanent and persistent identifiers, such as DOIs, IGSNs, and ORCIDs, combined with linked open data and semantic frameworks, are needed to pass data among community-supported resources. The development and adoption of well-documented data vocabularies would improve the ability to move across data repositories and disciplinary divides, clearly identify the assumptions surrounding data objects, and simplify translational activities, supporting the development of the semantic web. In particular, the development of an ontology of geologic time (e.g., OWL Time) should be a priority. These frameworks can then be extended into new and underserved data communities, accelerating the mobilization of large volumes of long-tail data.
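To make this concrete, the minimal sketch below (using the Python rdflib library) links a sample IGSN, a dataset DOI, an author ORCID, and an OWL-Time interval into a small linked-data graph. All identifiers are hypothetical, and the choice of Dublin Core terms is illustrative rather than a proposed standard.

```python
# A minimal linked-data sketch (rdflib; all identifiers hypothetical).
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

TIME = Namespace("http://www.w3.org/2006/time#")  # W3C OWL-Time

g = Graph()
g.bind("time", TIME)
g.bind("dcterms", DCTERMS)

sample = URIRef("http://igsn.org/ABC123XYZ")               # IGSN: physical sample
dataset = URIRef("https://doi.org/10.5555/example")        # DOI: the measurements
author = URIRef("https://orcid.org/0000-0000-0000-0000")   # ORCID: the researcher
interval = URIRef("https://doi.org/10.5555/example#span")  # time span covered

g.add((dataset, DCTERMS.source, sample))      # measurements derive from the sample
g.add((dataset, DCTERMS.creator, author))     # and are attributed to a person
g.add((dataset, DCTERMS.temporal, interval))  # and cover a geologic time interval
g.add((interval, RDF.type, TIME.ProperInterval))

print(g.serialize(format="turtle"))
```

Because each node is a resolvable identifier, any repository in the federation can attach further statements (a calibration, a taxonomic revision, an annotation) without coordinating schema changes with the others.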
Several national and international efforts in this area are already underway, e.g., RDA, ESIP, EarthCube, and Mozilla Science. We need mechanisms for sustained engagement between these efforts and paleogeoscientists (Sect. 3.5).

2.4 Machine Reading Systems to Facilitate Creation of Structured Knowledge Bases From Unstructured Data

\label{machine-reading-systems-to-facilitate-creation-of-structured-knowledge-bases-from-unstructured-data}
The ability to algorithmically and repeatedly interrogate the scientific literature, en masse, for the purpose of locating and extracting the data needed to address broad-scale research questions would revolutionize efforts towards building a fully realized, data-constrained model of the evolving Earth-Life system. Current efforts to aggregate, organize, and synthesize paleogeoscientific data rely heavily on manual literature-based data compilation, which is labor-intensive, costly, and rate-limiting.
Machine reading and learning systems are rapidly advancing and hold promise as scientific research tools for literature-based data compilation (Ré et al. 2014; Mallory et al. 2015; De Sa et al. 2016a, b; Peters et al. 2014, 2017). EarthCube's GeoDeepDive is a major step in this direction, with a digital library of over 3 million documents from multiple partners and >1 million CPU hours invested in parsing and annotating these documents. The next step is to build comprehensive digital libraries of published scientific documents, along with the high-capacity computing infrastructure that enables scientists to search the literature and dynamically create structured research syntheses.
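To illustrate the flavor of such applications, the minimal sketch below extracts radiocarbon ages from sentences of text with a regular expression. The pattern and sentences are hypothetical; production systems such as those built on GeoDeepDive combine NLP annotations, dictionaries, and statistical inference rather than hand-written rules alone.

```python
# A minimal sketch of rule-based extraction of radiocarbon ages from
# text (illustrative only; real machine reading pipelines are far richer).
import re

AGE_PATTERN = re.compile(
    r"(?P<age>\d{1,3}(?:,\d{3})+|\d+)"   # e.g. "12,450"
    r"\s*(?:±|\+/-)\s*(?P<error>\d+)"    # e.g. "± 60"
    r"\s*(?:14C|radiocarbon)\s*yr\s*BP"  # e.g. "14C yr BP"
)

sentences = [
    "Basal peat returned an age of 12,450 ± 60 14C yr BP.",
    "The overlying tephra remains undated at this site.",
]

for sentence in sentences:
    for match in AGE_PATTERN.finditer(sentence):
        age = int(match.group("age").replace(",", ""))
        error = int(match.group("error"))
        print(f"{age} ± {error} 14C yr BP  <-  {sentence}")
```

Even this toy example shows the payoff: a structured (age, uncertainty, source sentence) record that can flow into a CCDR, at a tiny fraction of the cost of manual compilation.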