Describe any limitations or absence of existing
cyberinfrastructure, and/or specific technical advancements in
cyberinfrastructure (e.g. advanced computing, data infrastructure,
software infrastructure, applications, networking, cybersecurity), that
must be addressed to accomplish the identified research challenge(s).
2.1 Cyberinfrastructure in the Paleogeosciences: Trends and Needs
\label{cyberinfrastructure-in-the-paleogeosciences-trends-and-needs}
In the paleogeosciences, the emerging cyberinfrastructure model is a
distributed, federated network of resources and services, each curating
and advancing a particular kind of knowledge. This system is consistent
with the distributed nature of geoscientific expertise: geochronological
data should be curated by geochronologists, species names by
paleontologists, etc. Within this structure, focal nodes serve as
disciplinary loci for data stewardship and mobilization, and as
resources for sharing best practices and standards across
subdisciplines. This structure has emerged organically, with key support
from NSF Geoinformatics and EarthCube, and comprises at least five major
interacting components:
Community Curated Data Repositories (CCDRs). CCDRs have repeatedly
emerged as geoscientists unite to gather and share data in response to
common, broad-scale research questions (Section 1A). CCDRs usually begin
as individual or small-team efforts, then mature into community
resources with established data models, standards, and governance
systems. Because describing the history of Earth’s climates and
biodiversity requires coordinated effort, CCDRs prevail in
paleoclimatology, paleobiology, and paleoecology (e.g., Neotoma
Paleoecology DB, Paleobiology Database, LinkedEarth, SedDB/EarthChem,
NOAA Paleoclimatology, MorphoBank, VertNet). New CCDRs emerge as proxies
mature, e.g. the recent call for IsoBank (Pauli et al. 2017). Key
challenges include reducing data friction through better integration
(Sect. 2.3), encouraging community input, and, most of all,
sustainability (Sect. 3.1).
Museums and Sample Repositories curate physical specimens (rock samples,
drill cores, fossils, biological materials) and their digital
representations. Examples include LacCore/CSDCO, IODP, and the marine
core repositories at Columbia University, the University of Rhode
Island, and Oregon State University. Recent initiatives have focused on
digitizing collections (iDigBio, iDigPaleo), developing persistent and
unique sample identifiers (IGSNs), and establishing provenance systems
that link samples to measurements (Open Core Data).
Integration and Networking Activities. The EarthCube initiative has
provided essential momentum and resources for building a networked
federation of interconnected CCDRs and sample repositories. Current
efforts include Open Core Data (linked data standards for continental
and ocean drilling data), the Earth Life Consortium (an umbrella
organization for Neotoma, PBDB, and other paleobiological CCDRs, with
links to modern biodiversity databases), ePANDDA (linking the specimen
holdings in the Paleobiology Database with the museum digitization
efforts of iDigPaleo), LinkedEarth (a linked data standard for
paleoclimatic data; McKay et al. 2016), and Flyover Country (a popular
app for viewing geological data during air or ground travel, now used by
~600 people each day for informal learning and winner of a recent
“Vizzie” award).
Individual geoscientists. Most scientific data curation is still done by
individual geoscientists in their research labs, with data stored on
desktop computers and local servers. Most geoscientists store and record
data using flat file formats (TXT, CSV, XLS) or the workflow software
associated with their instrumentation systems. Many subdisciplines have
no established metadata standards, data reduction standards, or
community data repositories. Huge effort is spent converting small
amounts of data from one format to another (by some estimates, up to 80%
of total project effort).
Scientific Literature and Unstructured Data. A vast amount of
paleogeoscientific data is available only through the published
literature. These data are highly unstructured and not readily amenable
to broad-scale synthesis. Arguably, over 100 years of geological
research has mainly succeeded in transferring information from one vast
and dimly accessible archive (the geologic record) to another (the
published literature): an improvement, but far from ideal. We need
better systems for mining this resource.
2.2 Paleogeoscience Cyberinfrastructure: Priorities for the Next Decade
\label{paleogeoscience-cyberinfrastructure-priorities-for-the-next-decade}
Based on the above, we argue that the most productive scientific return
on NSF cyberinfrastructure investments for our community over the next
decade will come from distributed, meso-scale investments in the
following six priority areas:
Reduce data friction by developing scientific workflows, structured
vocabularies, semantic frameworks, and data-tagging systems to pass data
and metadata seamlessly within and among community resources
(Sect. 2.3).

Develop automated data-mining systems for extracting information from
unstructured data in the scientific literature (Sect. 2.4).

Support the long-term sustainability of existing community
cyberinfrastructure resources (Sect. 3.1) and the grassroots development
of community informatics resources for subdisciplines that lack
data-sharing systems (Sect. 3.2).

Launch funded data mobilization campaigns to unlock existing data
relevant to high-priority scientific research questions (Sect. 3.3).

Develop and train a distributed scientific workforce, including both
early-career scientists and current practitioners (Sect. 3.4).

Establish a National Center for Paleodata Synthesis to coordinate
activities among individual geoscientists and the federation of CCDRs
and sample repositories, promote community best practices and data
standards, and develop education and scientific workforce training
initiatives (Sect. 3.5).
2.3 Reduce Data Friction
\label{reduce-data-friction}
We need to build the systems and standards necessary to pass data within
primary data-generation workflows (from field collection to laboratory
measurement to publication and archiving) and among the emerging
federation of data repositories that facilitate downstream data
integration and synthesis. Researchers must be able to access data at
any point in the stream of data generation and see its provenance, the
effects of models on the data, and its subsequent interpretations.
Repositories must be interconnected, and scientists must be able to
iteratively re-evaluate and annotate data. Clear provenance is essential
for linking records across resources.
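As a minimal sketch of what such provenance links could look like (using
the W3C PROV-O vocabulary and the Python rdflib package; all identifiers
below are hypothetical placeholders, not records in any real
repository), the fragment below ties a derived dataset back to the lab
activity and physical sample that produced it:

\begin{verbatim}
# A minimal provenance sketch using the W3C PROV-O vocabulary via the
# Python rdflib package. All identifiers (example.org URIs, the IGSN
# value) are hypothetical placeholders, not real repository records.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import PROV, RDF

EX = Namespace("http://example.org/paleo/")

g = Graph()
g.bind("prov", PROV)

sample   = URIRef("http://igsn.org/XXX0000001")  # core sample (placeholder IGSN)
counting = EX["pollen-counting-run-42"]          # lab activity that produced the data
dataset  = EX["pollen-counts-site-17"]           # the derived dataset

g.add((sample, RDF.type, PROV.Entity))
g.add((dataset, RDF.type, PROV.Entity))
g.add((counting, RDF.type, PROV.Activity))

# Link the derived dataset to the activity and to the physical sample,
# so a downstream repository can trace the record back to its origin.
g.add((dataset, PROV.wasGeneratedBy, counting))
g.add((counting, PROV.used, sample))
g.add((dataset, PROV.wasDerivedFrom, sample))

print(g.serialize(format="turtle"))
\end{verbatim}

Once such statements exist, any repository in the federation can answer
“where did this measurement come from?” with a standard query rather
than an email to the original analyst.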
Web architectural approaches are federated, scalable, and
fault-tolerant: all desirable properties for a distributed paleodata
network. Common permanent and persistent identifiers, such as DOIs,
IGSNs, and ORCID iDs, combined with linked open data and semantic
frameworks, are needed to pass data among community-supported resources.
Developing and adopting well-documented data vocabularies would improve
the ability to move across data repositories and disciplinary divides,
clearly identify the assumptions surrounding data objects, and simplify
translational activities, supporting the development of the semantic
web. In particular, the development of an ontology of geologic time
(e.g., OWL-Time) should be a priority. These frameworks can then be
extended into new and underserved data communities, accelerating the
mobilization of large volumes of long-tail data.
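To make this concrete, the sketch below shows what a linked-data record
for a single dated sample might look like, combining a DOI, an IGSN, an
ORCID iD, and OWL-Time terms in one JSON-LD document; the property
choices and identifier values are illustrative assumptions, not an
established community standard:

\begin{verbatim}
import json

# Illustrative JSON-LD record for a dated sample. Identifier values are
# placeholders; the point is that DOIs, IGSNs, ORCID iDs, and OWL-Time
# terms give every linked resource a resolvable, machine-readable identity.
record = {
    "@context": {
        "schema": "http://schema.org/",
        "time": "http://www.w3.org/2006/time#",
    },
    "@id": "https://doi.org/10.xxxx/example-dataset",   # dataset DOI (placeholder)
    "@type": "schema:Dataset",
    "schema:creator": {
        "@id": "https://orcid.org/0000-0000-0000-0000"  # ORCID iD (placeholder)
    },
    "schema:about": {
        "@id": "http://igsn.org/XXX0000001",            # sample IGSN (placeholder)
        # An OWL-Time interval expressing the sample's age range (yr BP).
        "time:hasTime": {
            "@type": "time:Interval",
            "time:hasBeginning": {
                "time:inTimePosition": {"time:numericPosition": 12500}
            },
            "time:hasEnd": {
                "time:inTimePosition": {"time:numericPosition": 11500}
            },
        },
    },
}

print(json.dumps(record, indent=2))
\end{verbatim}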
Several national and international efforts in this area are already
underway, e.g. RDA, ESIP, EarthCube, and Mozilla Science. We need
mechanisms for better sustained engagement between these efforts and
paleogeoscientists (Sect. 3.5).
2.4 Machine Reading Systems to Facilitate Creation of Structured
Knowledge Bases From Unstructured Data
\label{machine-reading-systems-to-facilitate-creation-of-structured-knowledge-bases-from-unstructured-data}
The ability to algorithmically and repeatedly interrogate the scientific
literature, en masse, for the purpose of locating and extracting
the data needed to address broad-scale research questions would
revolutionize efforts towards building a fully realized,
data-constrained model of the evolving Earth-Life system. Current
efforts to aggregate, organize, and synthesize paleogeoscientific data
rely heavily on manual literature-based data compilation, which is
labor-intensive, costly, and rate-limiting.
Machine reading and learning systems are rapidly advancing and hold
promise as scientific research tools for literature-based data
compilation (Ré et al. 2014; Mallory et al. 2015; De Sa et al. 2016a, b;
Peters et al. 2014, 2017). EarthCube’s GeoDeepDive is a major step in
this direction, with a digital library of over 3 million documents from
multiple partners and >1 million CPU hours invested in parsing and
annotating them. The next step is to grow this effort into all-inclusive
digital libraries of published scientific documents, backed by the
high-capacity computing infrastructure that enables scientists to search
the literature and dynamically create structured research syntheses.
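As a toy illustration of the kind of extraction involved (vastly simpler
than GeoDeepDive’s actual parsing and machine-learning pipeline, and not
drawn from it), the sketch below pulls radiocarbon-age mentions out of
raw text with a single regular expression; production systems layer
document parsing, entity linking, and statistical inference on top of
steps like this:

\begin{verbatim}
import re

# Toy extractor for radiocarbon-age mentions such as "12,340 ± 60 14C yr BP".
# Real machine-reading pipelines use full NLP parsing and statistical
# models; this regular expression only sketches the idea.
AGE_PATTERN = re.compile(
    r"(?P<age>\d{1,3}(?:,\d{3})*|\d+)\s*(?:±|\+/-)\s*(?P<error>\d+)"
    r"\s*(?:14\s*C\s*)?yr\s*B\.?P\.?",
    re.IGNORECASE,
)

def extract_ages(text):
    """Return (age, error) pairs, in years BP, found in a text snippet."""
    matches = []
    for m in AGE_PATTERN.finditer(text):
        age = int(m.group("age").replace(",", ""))
        error = int(m.group("error"))
        matches.append((age, error))
    return matches

snippet = ("Charcoal from the basal unit yielded an age of "
           "12,340 ± 60 14C yr BP, while the upper peat dated "
           "to 9,870 ± 45 yr BP.")
print(extract_ages(snippet))  # [(12340, 60), (9870, 45)]
\end{verbatim}

Even this naive pattern shows why structure matters: once ages are
tuples rather than prose, they can be filtered, calibrated, and merged
across thousands of papers.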