As shown in Figure 1, the first step in a simulation workflow typically
involves defining the configuration of the atoms (or more generally,
particles) in the system. The mBuild Python
library42,43 has been developed to be a general,
customizable tool for constructing arbitrarily complex system
configurations in a programmatic fashion (i.e., scriptable). Key to the
mBuild library is its underlying Compound data structure. A Compound is
a general “container” that can describe effectively anything: an atom,
a collection of atoms, a molecule, a generic point particle, a
collection of Compounds, operations on the underlying Compounds and/or
data, etc. Compounds can be duplicated, rotated, translated, scaled,
etc. to construct a system. Compounds can also contain information
regarding connections between the atoms, by defining either fixed Bonds
within a Compound or by adding Ports that allow connections to be made
between separate Compounds. Ports define both location and orientation
of a connection; in atomistic systems, the number of Ports and their
locations are typically representative of the underlying chemistry. For
example, Figure 2 shows Python code that defines a CH2 moiety with two C-H Bonds and two Ports. In order to create a connection
between two Compounds, a user simply states which Ports should connect
and mBuild automatically performs translations and reorientations,
creating a new (composite) Compound (see Klein et al.42 for more details). This allows complex
systems to be built up from smaller, interchangeable pieces that can be connected, following the concept of generative
modeling.42 This design approach allows for
declaratively expressing repetitive structures, such as polymer chains
and planar tilings (as used in Figure 2) and also allows significant
modifications to system structure/chemistry to be made with only minimal
changes to the initialization routines.
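The idea of composing a system from nested, connectable pieces can be sketched in a few lines of plain Python. The sketch below is a toy stand-in, not the mBuild API: the class `ToyCompound`, the helpers `make_ch2`, `connect`, and `build_chain`, the port labels, and all coordinates are hypothetical and chosen only for illustration. It also handles translation only, whereas mBuild additionally reorients the Compounds being joined.

```python
from dataclasses import dataclass, field


def _shift(xyz, by):
    """Return the point `xyz` displaced by the vector `by`."""
    return tuple(a + b for a, b in zip(xyz, by))


@dataclass
class ToyCompound:
    """Hypothetical hierarchical container; not mBuild's Compound class."""
    name: str
    pos: tuple = (0.0, 0.0, 0.0)                 # meaningful for leaf particles
    children: list = field(default_factory=list)
    bonds: list = field(default_factory=list)    # pairs of leaf particles
    ports: dict = field(default_factory=dict)    # port label -> position

    def add(self, child):
        self.children.append(child)
        return child

    def particles(self):
        """Return all leaf particles, however deeply they are nested."""
        if not self.children:
            return [self]
        return [p for child in self.children for p in child.particles()]

    def translate(self, by):
        """Shift this compound, its ports, and everything it contains."""
        self.pos = _shift(self.pos, by)
        self.ports = {label: _shift(xyz, by) for label, xyz in self.ports.items()}
        for child in self.children:
            child.translate(by)


def make_ch2():
    """A CH2 unit: one carbon, two hydrogens, two C-H bonds, two ports."""
    ch2 = ToyCompound("CH2")
    carbon = ch2.add(ToyCompound("C"))
    for y in (0.1, -0.1):                        # arbitrary toy coordinates
        hydrogen = ch2.add(ToyCompound("H", pos=(0.0, y, 0.0)))
        ch2.bonds.append((carbon, hydrogen))
    ch2.ports = {"up": (-0.075, 0.0, 0.0), "down": (0.075, 0.0, 0.0)}
    return ch2


def connect(fixed, moved, fixed_port, moved_port):
    """Translate `moved` so that the two named ports coincide."""
    target = fixed.ports[fixed_port]
    source = moved.ports[moved_port]
    moved.translate(tuple(t - s for t, s in zip(target, source)))


def build_chain(n_units):
    """Generate a linear chain by repeatedly attaching identical CH2 units."""
    chain = ToyCompound("chain")
    previous = chain.add(make_ch2())
    for _ in range(n_units - 1):
        unit = chain.add(make_ch2())
        connect(previous, unit, "down", "up")
        # Record the new C-C bond (the carbon is each unit's first child).
        chain.bonds.append((previous.children[0], unit.children[0]))
        previous = unit
    return chain


chain = build_chain(3)
carbon_x = [p.pos[0] for p in chain.particles() if p.name == "C"]
```

Changing the repeat unit or the chain length requires editing only `make_ch2` or the argument to `build_chain`, which is the property the generative approach is meant to provide.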
3.2. Foyer
After a system configuration is initialized, the interactions between
all constituents must be defined before a system can be simulated (as
shown in Figure 1), i.e., the force field must be applied to the system.
The Foyer library44 has been developed as a general
tool for applying force fields to molecular systems (i.e., atom-typing)
that provides a standardized approach to defining chemical context and
atom-typing rules22,45. In Foyer, the force field
parameters and the rules that dictate parameter usage are stored
together in a standardized XML file, separate from the code used to
evaluate them. Usage rules are encoded by using a combination of a
SMARTS-based annotation scheme, which defines the chemical context
associated with a given parameter, and overrides that define rule
precedence. SMARTS is a language designed for describing molecular
patterns,46 thus allowing information about the bonded
environment of an atom to be efficiently and clearly encoded in a format
that is both human and machine readable. For example, the chemical
context of a terminal methyl group (-CH3) in an alkane
can be expressed as [C;X4](C)(H)(H)H. In this annotation, [C;X4]
indicates that the atom of interest is a carbon (C) with 4 total bonds
(X4), and (C)(H)(H)H identifies the atoms at the other end of those 4 bonds (1 carbon, 3
hydrogens). Figure 3 shows a snippet from the Foyer XML force field file
demonstrating how these usage rules can be encoded, using select
parameters from the OPLS-AA force field (see Klein et
al.22 for more details). By separating the usage rules
and parameters from the software used to evaluate them, the Foyer
library does not need to change if changes are made to a force field
file. This allows novel and “custom”
force fields to be implemented without writing new software, which simplifies
the process of disseminating and evolving force fields and increases
the reproducibility of work by making clear not just which force field was
used, but how it was applied to the system. A complementary approach not
requiring SMARTS and overrides is to make molecule-specific XML files
available (e.g., via webpages such as http://trappe.oit.umn.edu).
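The interplay between chemical-context rules and overrides described above can be illustrated with a small, self-contained sketch. This is not Foyer's implementation and does not parse SMARTS: each rule is reduced to an element, a total bond count, and minimum neighbor counts, and the rule names (`C_generic`, `CH3_alkane`, `CH2_alkane`, `H_alkane`) and the helper functions are hypothetical. The `CH3_alkane` entry plays the role of the [C;X4](C)(H)(H)H pattern discussed in the text.

```python
from collections import Counter

# Hypothetical rule table. Each rule gives an element, a total bond count
# (None means "any"), minimum neighbor counts, and the rules it overrides.
RULES = [
    {"name": "C_generic", "element": "C", "degree": None,
     "neighbors": {}, "overrides": []},
    {"name": "CH3_alkane", "element": "C", "degree": 4,
     "neighbors": {"C": 1, "H": 3}, "overrides": ["C_generic"]},
    {"name": "CH2_alkane", "element": "C", "degree": 4,
     "neighbors": {"C": 2, "H": 2}, "overrides": ["C_generic"]},
    {"name": "H_alkane", "element": "H", "degree": 1,
     "neighbors": {"C": 1}, "overrides": []},
]


def matches(rule, atom, elements, neighbors):
    """Check whether the bonded environment of `atom` satisfies `rule`."""
    if elements[atom] != rule["element"]:
        return False
    bonded = neighbors[atom]
    if rule["degree"] is not None and len(bonded) != rule["degree"]:
        return False
    found = Counter(elements[n] for n in bonded)
    return all(found[el] >= count for el, count in rule["neighbors"].items())


def assign_type(atom, elements, neighbors):
    """Collect every matching rule, then drop those that another match overrides."""
    candidates = [r for r in RULES if matches(r, atom, elements, neighbors)]
    overridden = {name for r in candidates for name in r["overrides"]}
    remaining = [r["name"] for r in candidates if r["name"] not in overridden]
    if len(remaining) != 1:
        raise ValueError(f"atom {atom}: expected one type, found {remaining}")
    return remaining[0]


# Propane, CH3-CH2-CH3: atoms 0-2 are carbons, atoms 3-10 are hydrogens.
elements = ["C", "C", "C"] + ["H"] * 8
bond_list = [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5),
             (1, 6), (1, 7), (2, 8), (2, 9), (2, 10)]
neighbors = {i: [] for i in range(len(elements))}
for a, b in bond_list:
    neighbors[a].append(b)
    neighbors[b].append(a)

types = [assign_type(i, elements, neighbors) for i in range(len(elements))]
```

Every carbon matches the generic rule as well as a specific one; the override entries are what resolve the ambiguity, so adding a new specific rule never requires editing the matching code.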
3.3. General Molecular Simulation Object (GMSO)
With a system initialized and parameterized, the information in the
system topology must be written to a file for a simulation engine. While
the information required by different simulation engines is, generally
speaking, the same, the structure and format of the data file(s) passed
to simulation engines is typically unique to the engine itself.
Generating these files accurately, especially for a wide range of unique
simulation engines, can be non-trivial. The current version of MoSDeF
relies upon the open-source utilities parmed47 and
OpenMM48,49 to store this information; these tools,
along with native MoSDeF code, include parsers to generate syntactically
correct data files. In this approach, a single simulation topology can
be used to generate input files for a variety of simulation engines,
allowing different engines and methodologies (e.g., MC and MD) to be
applied to the same system. While effective, these backend codes do not
have general support for the breadth of simulation engines and force
fields we aim to include. To this end, the General Molecular Simulation
Object (GMSO) has been under development with the goal of becoming the de facto backend data structure of MoSDeF. GMSO is intended
to serve as a general container for all of the relevant system
information (e.g., the fully parameterized system), stored in a
simulation engine agnostic way. GMSO is designed with interoperability
and support for various functional forms as first-class features. For
example, GMSO builds upon the idea of the Foyer XML data file, shown in Figure
3, but provides further metadata; this includes encoding the functional
forms of the potentials in the force field (those that can be expressed
in computer algebraic inputs) using the sympy Python library. GMSO is
also structured to make it easier to add data file writers, allowing
GMSO support to be extended and customized. Because GMSO supports
user-defined analytic equations for force field components, it
is future-proofed against new developments in force fields, such as those
being pursued by several of the authors.
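A minimal sketch of an engine-agnostic container with pluggable writers is given below. It is not the GMSO data model: the dictionary layout, the `register_writer` decorator, and the two output formats are assumptions made for illustration, and neither format is an input file for any particular simulation engine. The functional form is stored here as a plain string as a stand-in for the symbolic (sympy) expressions mentioned above, and the parameter values are placeholders.

```python
import json

# Registry mapping a format name to the function that writes it.
WRITERS = {}


def register_writer(fmt):
    """Decorator that adds a writer function to the registry under `fmt`."""
    def decorator(func):
        WRITERS[fmt] = func
        return func
    return decorator


@register_writer("xyz")
def write_xyz(topology):
    """Coordinates only: atom count, comment line, one line per site."""
    lines = [str(len(topology["sites"])), topology["name"]]
    for site in topology["sites"]:
        x, y, z = site["position"]
        lines.append(f"{site['element']} {x:.3f} {y:.3f} {z:.3f}")
    return "\n".join(lines) + "\n"


@register_writer("json")
def write_json(topology):
    """Full dump, including the functional forms and their parameters."""
    return json.dumps(topology, indent=2, sort_keys=True)


def export(topology, fmt):
    """Look up the writer for `fmt` and apply it to the same topology."""
    try:
        writer = WRITERS[fmt]
    except KeyError:
        raise ValueError(f"no writer registered for {fmt!r}") from None
    return writer(topology)


# Hypothetical, fully parameterized system stored independently of any engine.
topology = {
    "name": "argon dimer",
    "sites": [
        {"element": "Ar", "position": [0.0, 0.0, 0.0], "type": "Ar_lj"},
        {"element": "Ar", "position": [0.38, 0.0, 0.0], "type": "Ar_lj"},
    ],
    "potentials": {
        "Ar_lj": {
            "expression": "4*epsilon*((sigma/r)**12 - (sigma/r)**6)",
            "parameters": {"epsilon": 1.0, "sigma": 0.34},  # placeholder values
            "independent_variable": "r",
        }
    },
}

xyz_text = export(topology, "xyz")
json_text = export(topology, "json")
```

Supporting another output format in this sketch means writing one more decorated function; the stored topology and the existing writers are left untouched.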
3.4. Computational Screening and Automation using MoSDeF
Since all the functions of MoSDeF are scriptable, when combined with a
workflow management tool such as signac/signac-flow21,
it is relatively straightforward to perform computational screening of the
properties of systems by looping over chemistries and/or conditions and
calculating relevant properties from the simulations. The MoSDeF/signac
combination has been used to screen the impact on nanolubrication
properties of end-group chemistry of self-assembled alkylsilane tethers
on amorphous silica surfaces23, leading to a
machine-learning-derived model connecting end-group cheminformatic
descriptors with tribological properties of interest. In another
example50, the diffusivities of ions in organic
solvents were screened for 22 different solvents, revealing a pattern in
this large data set (ion diffusivity proportional to solvent
diffusivity) that was in contrast with previous, primarily experimental
findings (ion diffusivity proportional to solvent dipole moment). The
computational screening findings were confirmed in subsequent
experimental studies utilizing quasi-elastic neutron
scattering51 and NMR52.
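The screening pattern described above, looping over chemistries and conditions and recording one result per combination, can be sketched with the standard library alone. The sketch does not use the signac API; the function names, the parameter axes, and their values are hypothetical, and `run_workflow` is a placeholder for the build, parameterize, simulate, and analyze steps.

```python
import hashlib
import itertools
import json


def statepoint_id(statepoint):
    """Derive a stable identifier from the parameters that define a run."""
    encoded = json.dumps(statepoint, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(statepoint):
    """Placeholder for build -> parameterize -> simulate -> analyze."""
    return {"status": "not run", **statepoint}


# Hypothetical screening axes; the names and values are illustrative only.
terminal_groups = ["methyl", "hydroxyl", "carboxyl"]
chain_lengths = [8, 12, 16]
temperatures_K = [298.0, 350.0]

results = {}
for group, length, temperature in itertools.product(
        terminal_groups, chain_lengths, temperatures_K):
    statepoint = {"terminal_group": group, "chain_length": length,
                  "temperature_K": temperature}
    results[statepoint_id(statepoint)] = run_workflow(statepoint)
```

Keying each result by an identifier derived from its parameters keeps the bookkeeping independent of loop order, so axes can be extended later without renaming existing runs.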
3.5. Expanding MoSDeF
As noted earlier, the genesis of MoSDeF was a series of NSF grants to
Vanderbilt PIs Cummings, McCabe, Iacovella, and
Ledeczi34–36. A recent collaborative NSF
grant53 has funded groups at
Michigan (Glotzer and Anderson), Notre Dame (Maginn), Minnesota
(Siepmann), Delaware (Jayaraman), Houston (Palmer), Wayne State
(Potoff), and Boise State (Jankowski) to work together to
expand MoSDeF’s capabilities, including the collaborative design and
development of the aforementioned GMSO backend. This collaboration is
resulting in increasing integration with HOOMD-blue, the MC
codes Cassandra and GOMC, and the first-principles MD/MC code CP2K;
additionally, MoSDeF has been integrated more closely with Michigan’s
signac workflow management tools. In the case of Cassandra, for example,
MoSDeF’s existing utilities, together with additional capabilities
resulting from the Vanderbilt/Notre Dame collaboration, have reduced the complexity
of setting up a simulation from 9 steps (including 3
requiring user editing of files) to a single Python script;
this, in turn, has enabled computational screening with Cassandra. Other
groups, including Houston, Boise State, and Delaware, are focusing on
developing modules to implement complex workflows and analyses involved
in phase equilibrium calculations and construction of intricate
molecular models. Building the modules around the MoSDeF framework will
enable these workflows to be performed in a reproducible fashion with a
variety of widely used simulation engines.
An example of the capabilities enabled by this collaboration is given in
the Supplementary Information (SI). Inspired by the honoree of this
special issue, Keith Gubbins, in the SI we report the use of five
different simulation codes (the open-source MC codes Cassandra and GOMC,
the open-source MD codes LAMMPS and GROMACS, and the open-source first-principles
MD code CP2K) to repeat calculations reported by Striolo et al.54 on the adsorption of water into carbon
slit pores. The latter were groundbreaking simulations for their time
and the paper has been cited ~200 times (Google
Scholar). The paper reported adsorption/desorption isotherms,
demonstrating the hysteresis seen in experiment, as well as density
profiles and orientational structure of the water adsorbed in carbon
slit pores. The Striolo et al. simulations were performed using
in-house codes; thus, they are almost impossible to reproduce in detail.
In the SI, we show that we can reproduce the adsorption/desorption
isotherms reported by Striolo et al. to within an acceptable
degree using Cassandra and GOMC; more importantly, we show that by using
the MoSDeF tools to create the simulations, we can easily test multiple
engines and obtain excellent agreement between the two different
MC codes. Using GEMC in both Cassandra and GOMC,
we establish the number of water molecules in the pore at a given
external pressure. We then perform NVT (constant number of
molecules, volume and temperature) simulations using multiple codes. We
find remarkable agreement for the water structure inside the pore
between the MC engines Cassandra and GOMC and MD engines LAMMPS and
GROMACS. The use of MoSDeF (mBuild to build the simulation systems and
Foyer to apply the force fields) is essential to obtaining
consistency between these calculations. The first-principles MD code
CP2K, with interactions described on the fly via Kohn-Sham density
functional theory, produces similar, but not identical, results for water
structure, thereby allowing us to identify differences in
water-substrate interactions. The fact that one can move the simulated
system between all of these codes fairly effortlessly, thanks to the use
of the MoSDeF tools and its meta-level abstraction of the concept of
molecular simulation, is a very significant step forward for the
simulation community. Moreover, the SI contains all the instructions
needed for the reader to download and run all the utilities and codes
needed to reproduce the reported calculations exactly, hence qualifying
these as TRUE simulations.33
4. Conclusions
For several decades, the open-software movement has been making its
presence felt in the chemical engineering community. Open-source
software offers many advantages over proprietary codes. First, open-source codes are
universally available and do not contain any hidden parameters. This
makes verification of results published using these codes much more
feasible than for proprietary codes. Indeed, some scholarly journals
have taken the position of considering for publication only manuscripts
in which molecular modeling calculations were performed using
open-source codes or source code that is made available to reviewers.
Second, open-source codes are available at no cost, which means that the
codes can be downloaded and used by researchers throughout the world,
removing barriers for scientific progress. Third, open-source codes
typically attract a community of users and/or developers, so that bugs
are discovered and eliminated quickly, often overnight; in the case of
proprietary software, bugs are typically only fixed during update
cycles, which may be months apart, or may even go unnoticed, since the
code cannot be inspected by users. The downside of open-source software
is that, since there is no revenue stream in the usual sense (sale of
software), the sustainability of an open-source code over decades can be
questionable. However, codes can reach a level of usage such that the
effort to maintain and improve the code is taken on by the user
community; LAMMPS has arguably reached this position. Also, for some
open-source codes there is an alternative revenue stream. For example,
Red Hat is the biggest contributor and supporter of the open-source
Linux operating system. It makes money by writing, selling, and
supporting business-oriented middleware that runs within Linux, as well
as selling consulting services to companies switching to Linux for their
enterprise software. The commercial Scienomics MAPS platform for
materials and process simulations embeds some of the open-source MD and
MC codes, such as LAMMPS, Cassandra, and MCCCS-Towhee. Enthought, Inc.
is a software company based in Austin, Texas, that develops and markets
scientific and analytic computing solutions using primarily the Python
programming language; its commercial activities underwrite the widely
used open-source SciPy (Scientific Python) package.
We dedicate this Perspective to our colleague, mentor, and friend, Keith
Gubbins. The authors of this article wish to express their deep
gratitude to Keith for all he has done for our community. We wish him
many more years of productive science.
Acknowledgements
The preparation of this Perspective article has been supported by
National Science Foundation grants OAC-1835874 to Vanderbilt University,
OAC-1835612 to the University of Michigan, OAC-1835630 to the University
of Notre Dame, OAC-1835067 to the University of Minnesota, OAC-1835613
to the University of Delaware, OAC-1835593 to Boise State University,
OAC-1835713 to Wayne State University, and OAC-1835560 to the University
of Houston.