Any other relevant aspects, such as organization, process,
learning and workforce development, access, and sustainability, that
need to be addressed; or any other issues that NSF should consider.
As practicing geoscientists who work at the science-informatics
interface, it is our firm belief that the critical barrier to big-data
science in the paleogeosciences is not technology but people.
Sustained investment in human capital is essential, and has several
dimensions.
3.1 Sustainability
Sustained Support for Community Cyberinfrastructure. NSF has an
excellent system for seeding and building cyberinfrastructure (funded
workshops to build initial connections, RCNs to develop networks, 3-year
grants to develop resources), but NSF lacks a good system for long-term
sustenance of paleodata cyberinfrastructure. Many of the
cyberinfrastructure resources (PBDB, Neotoma, LacCore, and precursors)
mentioned here have existed for decades and have achieved sustainability
by being closely linked to research priorities. But their existence is
always precarious and depends on funding in 3-year grant cycles, which
is a major barrier to bringing in new data and colleagues. When we ask
colleagues to participate in these initiatives, by contributing data or
lending expertise, a question that always comes up is sustainability.
In other words: ‘As a hard-working geoscientist with many demands on my
time, should I invest any effort in assisting with a resource that may
not exist in three years?’
The lack of long-term funding is incongruous with CCDRs’ increasing role
as data repositories for NSF-sponsored research, thereby fulfilling
Federal mandates for public data sharing. Cyberinfrastructure is
infrastructure, and should be supported with long-term (5-10 year)
investments, contingent upon satisfactory support of community research
priorities.
Sustainability: Human Capital. Much infrastructure focuses on
‘hard-capital’ resources: ships, planes, satellites, core repositories,
etc. In cyberinfrastructure, human capital is paramount, and comprises
(1) trained domain experts for data acquisition and quality control, and
(2) IT experts who design and maintain databases and the software
interfaces for data entry and retrieval. Through hard experience, we
have learned that exceedingly few experts have the necessary joint
training in the geosciences and data sciences. Our data are complex,
with substantial embedded knowledge. Our successes and advances have
depended on a few individuals who have, through on-the-job work,
acquired the necessary crossover training.
In drafting this letter, one contributor [Emile-Geay] wrote:
“Right now, I have an outstanding postdoc who is doing more
useful work for our project than 5 technicians… She’d love to
continue working for the LinkedEarth project, but the lack of sustained
funding means that she will probably have to move on.” Another
[Peters] noted: “My guy is brilliant for
PBDB/Macrostrat/GeoDeepDive. But he didn’t ‘come that way.’ He has a BA
in Political Science. He… acquired experience of high value on
this job with my group.”
The technology underlying much paleogeoscientific
cyberinfrastructure (servers, APIs, etc.) is low-cost and resilient to
funding interruptions. It is the loss of key individuals and their
embedded knowledge that cripples cyberinfrastructure initiatives. We are
continually at risk of losing talented data scientists because of
funding lapses or low pay relative to skill sets. Academic salaries are
much lower than industry salaries for many technical staff. Without
adequate reward for these mission-critical crossover data scientists and
geoscientists, who combine deep disciplinary knowledge with key
cyberinfrastructure skills, we risk losing irreplaceable expertise and
undermining the sustainability of current cyberinfrastructure efforts.
3.2 Bottom-Up Development of Community Informatics Resources
Many disciplines do not yet have agreed-upon CCDRs or minimal metadata
standards. These communities should be encouraged to self-organize
through the established NSF mechanisms of workshop grants, RCNs, and
seed grants. The new EarthRates RCN is a good step in this direction.
Emerging initiatives should be partnered with established initiatives,
to avoid reinventing the wheel and to encourage adoption of common
best practices. Clear documentation and support for domain scientists
must be generated to bridge disciplinary gaps, particularly within
fields with less of a background in informatics. Clear assessment
guidelines for new developments must be supported, so that new
infrastructure or data models can be evaluated, and best practices can
be embedded at the earliest development stages.
3.3 Data Mobilization and Improvement Campaigns
Data mobilization campaigns are vital, given the prevalence of dark and
incompletely published data (Sect 1.2). We need to provide
resources to people to upload their data from in-house computers to
community databases.
In the paleogeosciences, the same pattern repeats over and over: A
global data synthesis is launched to target a particular time interval
or broad-scale scientific question, often with workshop support from NSF
or PAGES (e.g. Climates of the last two millennia; Pliocene data-model
syntheses). Scientists contribute their individual data files. Conveners
and contributors quickly discover massive heterogeneity in the
individual spreadsheet data files. The project stalls, scales back
ambitions, or takes years to complete. After publication, results are
often not readily re-usable because few resources were invested in
proper data publication.
We need a new model in which research synthesis projects are explicitly
combined with data mobilization campaigns. Research teams should apply
for data mobilization funds, in which they identify a critical
scientific problem that would benefit from mobilization and synthesis of
existing data. Funding should include workshop support and support for
postdocs, grad students, or technicians to receive data from
participants and upload them to community databases that conform to
recognized standards. These funds could be awarded and workshops run through NSF or
a designated synthesis center (Sect 3.5). Data mobilization
efforts should prioritize the foundational ‘raw’ data (e.g. radiocarbon
dates, geochemical measurements, fossil occurrences) and secondarily the
‘derived’ data (e.g. age models, temperature reconstructions) that
are built from the raw data. The paleogeosciences could adopt the
EarthScope terminology of Level 0 (raw, unprocessed data), Level 1
(quality-controlled data), Level 2 (low-level derived products), Level 3
(mid-level integrated products), and Level 4 (high-level integrated
products)
(http://www.usarray.org/files/docs/pubs/ES_Data_Portal.pdf)
and prioritize data mobilization from the bottom up.
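As a rough illustration of bottom-up prioritization, the sketch below orders a mobilization backlog by data level. It is a minimal example under stated assumptions, not an EarthScope or NSF tool: the `DataLevel` enum mirrors the five levels named above, while the `Dataset` class, the `mobilization_order` function, the example dataset names, and the specific level assigned to each example are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataLevel(IntEnum):
    """Data levels as named in the text (lower value = closer to raw)."""
    RAW = 0                 # Level 0: raw, unprocessed
    QUALITY_CONTROLLED = 1  # Level 1: quality-controlled data
    LOW_DERIVED = 2         # Level 2: low-level derived products
    MID_INTEGRATED = 3      # Level 3: mid-level integrated products
    HIGH_INTEGRATED = 4     # Level 4: high-level integrated products


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record describing one dataset awaiting mobilization."""
    name: str
    level: DataLevel


def mobilization_order(datasets):
    """Return datasets sorted bottom-up: raw data first, derived products last."""
    # sorted() is stable, so datasets at the same level keep their input order.
    return sorted(datasets, key=lambda d: d.level)


# Illustrative entries only; the level chosen for each derived product
# is an assumption made for this sketch.
backlog = [
    Dataset("temperature reconstruction", DataLevel.MID_INTEGRATED),
    Dataset("radiocarbon dates", DataLevel.RAW),
    Dataset("age model", DataLevel.LOW_DERIVED),
    Dataset("fossil occurrences", DataLevel.RAW),
]

for ds in mobilization_order(backlog):
    print(f"Level {int(ds.level)}: {ds.name}")
```

Using an ordered enum keeps the priority rule explicit: a campaign that tags each incoming file with a level can sort its backlog so that raw measurements are uploaded before the products derived from them.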
3.4 Scientific Workforce Development
We need targeted initiatives to better train our scientific workforce in
best practices in data handling and synthesis. This includes community
development platforms such as GitHub, scientific workflow systems, and
efforts towards transparent, reproducible science. Training is needed at
all levels, including undergraduate, graduate, and refresher training
for early-career and mid-career scientists. Delivery options include
IGERT-style graduate training programs, YouTube, and summer workshops
such as Software Carpentry and community coding events (e.g. the
EarthCube-sponsored Cyber4Paleo event: http://cyber4paleo.github.io).
We need to rebuild undergraduate and graduate courses in the geosciences
to emphasize data science, with more emphasis on scientific programming
and coding practices, hierarchical Bayesian statistics, and
geovisualization. Our ultimate goal should be to build the next
generation of geoscientists who transform science through their work at
the science/informatics interface.
3.5 National Centers for Paleodata and Synthesis (NCPDS)
A key idea in all of the above is that resources should be distributed. It would be a
grave mistake to try to create a single highly centralized data center
in the paleogeosciences. Given the heterogeneity of our data and
dispersal of communities, we envision a federated ecosystem of
resources, each serving its respective community with sustained
support and integrated management. Management should be through a
coordinating office that facilitates standards adoption, provenancing,
and other tools for data sharing across CCDRs, promotes community best
practices, and leads scientific workforce training initiatives. This
Center would facilitate connections to existing organizations
such as ESIP, Mozilla Science, and ICSU, and would coordinate activities with
EarthCube and NSF directorates. Possible models include the coordinating
office for the Long-Term Ecological Research (LTER) Network, the NSF
Centers for Synthesis (NCEAS, NESCENT, SESYNC), and the USGS Powell
Center.
References
Crucifix, M. Traditional and novel approaches to palaeoclimate modelling. Quaternary Science Reviews 57, 1–16 (2012).
De Sa, C., Ratner, A., Ré, C., Shin, J., Wang, F., Wu, S. and Zhang, C.,
2016a. DeepDive: declarative knowledge base construction. ACM SIGMOD
Record, 45(1), pp. 60-67.
De Sa, C., Ratner, A., Ré, C., Shin, J., Wang, F., Wu, S. and Zhang, C.,
2016b. Incremental knowledge base construction using DeepDive. The VLDB
Journal, pp. 1-25.
Hargreaves, J. C., J. D. Annan, M. Yoshimori, and A. Abe-Ouchi. 2012.
Can the Last Glacial Maximum constrain climate sensitivity?
Geophysical Research Letters 39:L24702.
Mallory, E.K., Zhang, C., Ré, C. and Altman, R.B., 2015. Large-scale
extraction of gene interactions from full text literature using
DeepDive. Bioinformatics, 32(1), pp. 106-113.
McKay, N. P. and Emile-Geay, J.: Technical note: The Linked Paleo Data
framework – a common tongue for paleoclimatology, Clim. Past, 12,
1093-1100, doi:10.5194/cp-12-1093-2016, 2016.
National Research Council. Understanding Earth’s Deep Past: Lessons for
Our Climate Future. (National Academies Press, 2011).
National Research Council. New Research Opportunities in the Earth
Sciences. (National Academies Press, 2011).
National Research Council. Abrupt Impacts of Climate Change:
Anticipating Surprises. (National Academy of Sciences, 2013).
Noren, A. et al. Cyberinfrastructure for Paleogeoscience: Executive
Summary. (University of Minnesota, Minneapolis, MN, 2013).
PAGES 2k Consortium. Continental-scale temperature variability during
the past two millennia. Nature Geoscience 6, 339-346,
doi:10.1038/ngeo1797 (2013).
Pauli, J. N., S. D. Newsome, J. A. Cook, C. Harrod, S. A. Steffan, C. J.
O. Baker, M. Ben-David, D. Bloom, G. J. Bowen, T. E. Cerling, C. Cicero,
C. Cook, M. Dohm, P. S. Dharampal, G. Graves, R. Gropp, K. A. Hobson, C.
Jordan, B. MacFadden, S. Pilaar Birch, J. Poelen, S. Ratnasingham, L.
Russell, C. A. Stricker, M. D. Uhen, C. T. Yarnes, and B. Hayden. 2017.
Opinion: Why we need a centralized repository for isotopic data. Proceedings of the National Academy of Sciences 114:2997-3001.
Peters, S. E., C. Zhang, M. Livny, and C. Ré. 2014. A machine reading
system for assembling synthetic paleontological databases. PLoS One
9(12):e113523. doi:10.1371/journal.pone.0113523.
Peters, S. E., Husson, J. M. and Wilcots, J.W. 2017. The rise and fall
of stromatolites in shallow marine environments. Geology. In press.
doi:10.1130/G38931.1.
Ré, C., Sadeghian, A.A., Shan, Z., Shin, J., Wang, F., Wu, S. and Zhang,
C., 2014. Feature engineering for knowledge base construction. arXiv
preprint arXiv:1407.6439.
Singer, B. et al. Bringing Geochronology into the EarthCube Framework.
(University of Wisconsin, Madison, WI, 2013).
Transitions Report. 2012. TRANSITIONS: The Changing Earth-Life
System-Critical Information for Society from the Deep Past.
(http://www.sepm.org/CM_Files/ConfSumRpts/TRANSITIONSfinal.pdf, 2012)
Williams, J. W., J. L. Blois, J. L. Gill, L. M. Gonzales, E. C. Grimm,
A. Ordonez, B. Shuman, and S. Veloz. 2013. Model systems for a no-analog
future: Species associations and climates during the last deglaciation.
Annals of the New York Academy of Sciences 1297:29-43.