Any other relevant aspects, such as organization, process, learning and workforce development, access, and sustainability, that need to be addressed; or any other issues that NSF should consider.
As practicing geoscientists who work at the science-informatics interface, we firmly believe that the critical barrier to big-data science in the paleogeosciences is not technology but people. Sustained investment in human capital is essential and has several dimensions.

3.1 Sustainability

\label{sustainability}
Sustained Support for Community Cyberinfrastructure. NSF has an excellent system for seeding and building cyberinfrastructure (funded workshops to build initial connections, RCNs to develop networks, 3-year grants to develop resources), but NSF lacks a good system for long-term sustenance of paleodata cyberinfrastructure. Many of the cyberinfrastructure resources (PBDB, Neotoma, LacCore, and precursors) mentioned here have existed for decades and have achieved sustainability by being closely linked to research priorities. But their existence is always precarious and depends on funding in 3-year grant cycles, which is a major barrier to bringing in new data and colleagues. When we ask colleagues to participate in these initiatives, by contributing data or lending expertise, the question of sustainability always comes up: ‘As a hard-working geoscientist with many demands on my time, should I invest any effort in assisting with a resource that may not exist in three years?’
The lack of long-term funding is incongruous with the increasing role of CCDRs as data repositories for NSF-sponsored research, a role through which they fulfill Federal mandates for public data sharing. Cyberinfrastructure is infrastructure, and should be supported with long-term (5-10 year) investments, contingent upon satisfactory support of community research priorities.
Sustainability: Human Capital. Much infrastructure focuses on ‘hard-capital’ resources: ships, planes, satellites, core repositories, etc. In cyberinfrastructure, human capital is paramount, and comprises (1) trained domain experts for data acquisition and quality control, and (2) IT experts who design and maintain databases and the software interfaces for data entry and retrieval. Through hard experience, we have learned that exceedingly few experts have the necessary joint training in the geosciences and data sciences. Our data are complex, with substantial embedded knowledge. Our successes and advances have depended on a few individuals who have, through on-the-job work, acquired the necessary crossover training.
In drafting this letter, one contributor [Emile-Geay] wrote: “Right now, I have an outstanding postdoc who is doing more useful work for our project than 5 technicians… She’d love to continue working for the LinkedEarth project, but the lack of sustained funding means that she will probably have to move on.” Another [Peters] noted: “My guy is brilliant for PBDB/Macrostrat/GeoDeepDive. But he didn’t ‘come that way.’ He has a BA in Political Science. He… acquired experience of high value on this job with my group.”
The technology associated with much paleogeoscientific cyberinfrastructure is low-cost and resilient to funding interruptions – servers, APIs, etc. It is the loss of key individuals and their embedded knowledge that cripples cyberinfrastructure initiatives. We are continually at risk of losing talented data scientists because of funding lapses or low pay relative to skill sets. Academic salaries are much lower than industry salaries for many technical staff. Without adequate reward of these mission-critical crossover data scientists and geoscientists, with deep disciplinary knowledge and key cyberinfrastructure skills, we risk losing irreplaceable expertise and the sustainability of current cyberinfrastructure efforts.

3.2 Bottom-Up Development of Community Informatics Resources

\label{bottom-up-development-of-community-informatics-resources}
Many disciplines do not yet have agreed-upon CCDRs or minimal metadata standards. These communities should be encouraged to self-organize through the established NSF mechanisms of workshop grants, RCNs, and seed grants. The new EarthRates RCN is a good step in this direction. Emerging initiatives should be partnered with established initiatives, to avoid reinventing the wheel and to encourage adoption of common best practices. Clear documentation and support for domain scientists must be provided to bridge disciplinary gaps, particularly in fields with less background in informatics. Clear assessment guidelines for new developments must also be supported, so that new infrastructure or data models can be evaluated and best practices can be embedded at the earliest development stages.

3.3 Data Mobilization and Improvement Campaigns

\label{data-mobilization-and-improvement-campaigns.}
Data mobilization campaigns are vital, given the prevalence of dark and incompletely published data (Sect 1.2). We need to give people the resources to upload their data from in-house computers to community databases.
In the paleogeosciences, the same pattern repeats over and over: A global data synthesis is launched to target a particular time interval or broad-scale scientific question, often with workshop support from NSF or PAGES (e.g. Climates of the last two millennia; Pliocene data-model syntheses). Scientists contribute their individual data files. Conveners and contributors quickly discover massive heterogeneity in the individual spreadsheet data files. The project stalls, scales back its ambitions, or takes years to complete. After publication, results are often not readily re-usable because few resources were invested in proper data publication.
We need a new model in which research synthesis projects are explicitly combined with data mobilization campaigns. Research teams should apply for data mobilization funds by identifying a critical scientific problem that would benefit from mobilization and synthesis of existing data. Funding should include workshop support and support for postdocs, grad students, or technicians to receive data from participants and upload them to community databases conforming to recognized standards. These funds could be awarded and workshops run through NSF or a designated synthesis center (Sect 3.5). Data mobilization efforts should prioritize the foundational ‘raw’ data (e.g. radiocarbon dates, geochemical measurements, fossil occurrences) and secondarily the ‘derived’ data (e.g. age models, temperature reconstructions) that are built from the raw data. The paleogeosciences could adopt the EarthScope terminology of Level 0 (raw, unprocessed), Level 1 (quality-controlled data), Level 2 (low-level derived products), Level 3 (mid-level integrated products), and Level 4 (high-level integrated products) (http://www.usarray.org/files/docs/pubs/ES_Data_Portal.pdf) and prioritize data mobilization from the bottom up.
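To make the bottom-up prioritization concrete, the following is a minimal sketch in Python, offered as an illustration only. The class names, the function `mobilization_order`, and the example dataset names are all hypothetical; only the five level descriptions come from the text above. It shows how a mobilization campaign might queue incoming datasets so that raw data are handled before derived products.

```python
from dataclasses import dataclass
from enum import IntEnum


class DataLevel(IntEnum):
    """Data levels as listed in the text (Level 0 through Level 4)."""
    RAW = 0                    # raw, unprocessed
    QUALITY_CONTROLLED = 1     # quality-controlled data
    LOW_LEVEL_DERIVED = 2      # low-level derived products
    MID_LEVEL_INTEGRATED = 3   # mid-level integrated products
    HIGH_LEVEL_INTEGRATED = 4  # high-level integrated products


@dataclass(frozen=True)
class Dataset:
    """A contributed dataset; the fields are illustrative assumptions."""
    name: str
    level: DataLevel


def mobilization_order(datasets):
    """Sort datasets bottom-up: lowest level first.

    sorted() is stable, so datasets at the same level keep their input order.
    """
    return sorted(datasets, key=lambda d: d.level)


# Hypothetical contributions, listed in the order they were received.
contributions = [
    Dataset("temperature reconstruction", DataLevel.MID_LEVEL_INTEGRATED),
    Dataset("radiocarbon dates", DataLevel.RAW),
    Dataset("age model", DataLevel.LOW_LEVEL_DERIVED),
    Dataset("fossil occurrences", DataLevel.RAW),
]

for ds in mobilization_order(contributions):
    print(f"Level {int(ds.level)}: {ds.name}")
```

Assigning a level to a real dataset is a judgment call for domain experts; the level assignments in this example are assumptions made for illustration.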

3.4 Scientific Workforce Development

\label{scientific-workforce-development}
We need targeted initiatives to better train our scientific workforce in best practices in data handling and synthesis. These practices include the use of community development platforms such as GitHub, scientific workflow systems, and efforts towards transparent, reproducible science. Training is needed at all levels: undergraduate, graduate, and refresher training for early-career and mid-career scientists. Delivery options include IGERT-style graduate training programs, YouTube videos, summer workshops (e.g. Software Carpentry), and community coding events (e.g. the EarthCube-sponsored Cyber4Paleo event: http://cyber4paleo.github.io). We need to rebuild undergraduate and graduate courses in the geosciences to emphasize data science, with more emphasis on scientific programming and coding practices, hierarchical Bayesian statistics, and geovisualization. Our ultimate goal should be to build the next generation of geoscientists who transform science through their work at the science/informatics interface.

3.5 National Centers for Paleodata and Synthesis (NCPDS)

\label{national-centers-for-paleodata-and-synthesis-ncpds}
A key idea in all of the above is distribution. It would be a grave mistake to try to create a single highly centralized data center in the paleogeosciences. Given the heterogeneity of our data and the dispersal of our communities, we envision a federated ecosystem of resources, each serving its respective community with sustained support and integrated management. Management should be through a coordinating office that facilitates standards adoption, provenancing, and other tools for data sharing across CCDRs, promotes community best practices, and leads scientific workforce training initiatives. This coordinating office would help facilitate connections to other existing organizations such as ESIP, Mozilla Science, and ICSU, and would coordinate activities with EarthCube and NSF directorates. Possible models include the coordinating office for the Long-Term Ecological Research (LTER) Network, the NSF Centers for Synthesis (NCEAS, NESCENT, SESYNC), and the USGS Powell Center.

References

Michel Crucifix. Traditional and novel approaches to palaeoclimate modelling. Quaternary Science Reviews 57, 1–16 (2012).
De Sa, C., Ratner, A., Ré, C., Shin, J., Wang, F., Wu, S. and Zhang, C., 2016a. Deepdive: declarative knowledge base construction. ACM SIGMOD Record, 45(1), pp. 60-67.
De Sa, C., Ratner, A., Ré, C., Shin, J., Wang, F., Wu, S. and Zhang, C., 2016b. Incremental knowledge base construction using DeepDive. The VLDB Journal, pp. 1-25.
Hargreaves, J. C., J. D. Annan, M. Yoshimori, and A. Abe-Ouchi. 2012. Can the Last Glacial Maximum constrain climate sensitivity? Geophysical Research Letters 39:L24702.
Mallory, E.K., Zhang, C., Ré, C. and Altman, R.B., 2015. Large-scale extraction of gene interactions from full text literature using DeepDive. Bioinformatics, 32(1), pp. 106-113.
McKay, N. P. and Emile-Geay, J.: Technical note: The Linked Paleo Data framework – a common tongue for paleoclimatology, Clim. Past, 12, 1093-1100, doi:10.5194/cp-12-1093-2016, 2016.
National Research Council. Understanding Earth’s Deep Past: Lessons for Our Climate Future. (National Academies Press, 2011).
National Research Council. New Research Opportunities in the Earth Sciences. (National Academies Press, 2011).
National Research Council. Abrupt Impacts of Climate Change: Anticipating Surprises. (National Academy of Sciences, 2013).
Noren, A. et al. Cyberinfrastructure for Paleogeoscience: Executive Summary. (University of Minnesota, Minneapolis, MN, 2013).
PAGES 2k Consortium. Continental-scale temperature variability during the past two millennia. Nature Geoscience 6, 339-346, doi:10.1038/ngeo1797 (2013).
Pauli, J. N., S. D. Newsome, J. A. Cook, C. Harrod, S. A. Steffan, C. J. O. Baker, M. Ben-David, D. Bloom, G. J. Bowen, T. E. Cerling, C. Cicero, C. Cook, M. Dohm, P. S. Dharampal, G. Graves, R. Gropp, K. A. Hobson, C. Jordan, B. MacFadden, S. Pilaar Birch, J. Poelen, S. Ratnasingham, L. Russell, C. A. Stricker, M. D. Uhen, C. T. Yarnes, and B. Hayden. 2017. Opinion: Why we need a centralized repository for isotopic data. Proceedings of the National Academy of Sciences 114:2997-3001.
Peters, S.E., C. Zhang, M. Livny, and C. Ré. 2014. A machine reading system for assembling synthetic paleontological databases. PLoS One 9(12) e113523. doi:10.1371/journal.pone.0113523
Peters, S. E., Husson, J. M. and Wilcots, J.W. 2017. The rise and fall of stromatolites in shallow marine environments. Geology. In press. doi:10.1130/G38931.1.
Ré, C., Sadeghian, A.A., Shan, Z., Shin, J., Wang, F., Wu, S. and Zhang, C., 2014. Feature engineering for knowledge base construction. arXiv preprint arXiv:1407.6439.
Singer, B. et al. Bringing Geochronology into the EarthCube Framework. (University of Wisconsin, Madison, WI, 2013).
Transitions Report. 2012. TRANSITIONS: The Changing Earth-Life System-Critical Information for Society from the Deep Past. (http://www.sepm.org/CM_Files/ConfSumRpts/TRANSITIONSfinal.pdf, 2012)
Williams, J. W., J. L. Blois, J. L. Gill, L. M. Gonzales, E. C. Grimm, A. Ordonez, B. Shuman, and S. Veloz. 2013. Model systems for a no-analog future: Species associations and climates during the last deglaciation. Annals of the New York Academy of Sciences 1297:29-43.