Results and Discussion

The first part of our curated perovskite amine database consists of 184 amines that correspond to ammonium cations reported in the literature, which we call “existing perovskite amines”. A structural similarity search on PubChem, followed by further screening, yields an additional 264 “potential perovskite amines”, i.e., amines that are structurally similar to the existing ones. In total, the curated perovskite amine database contains 448 amine structures. The full database is tabulated in the supplementary material. The main reason for expanding the database is to make full use of data on amines that have been tested for bioactivity or toxicity, regardless of whether they have been studied as perovskite amines. Including more amines in the analysis may make it easier to identify toxicity trends and relate them to amine structure.
The introduction of artificial intelligence into the creation of the Amine Atlas and the toxicity screening of amine chemistries involves two computations: the MinHash fingerprint, up to six bonds (MHFP6), and Uniform Manifold Approximation and Projection (UMAP). MHFP6 is an improved version of the extended connectivity fingerprint (ECFP)27 that lowers the dimensionality needed to describe detailed molecular substructures and improves the performance of nearest neighbor searches.30 The MHFP6 fingerprint has been used in recently published chemistry databases31,32 and in a data visualization tool33 designed for big-data settings. MinHash is a locality-sensitive hashing (LSH) scheme that applies a family of hash functions to the substrings in the molecular shingling and stores the minimum hash value produced by each function in a set. These sets of minimum hash values can be indexed by an LSH algorithm for approximate nearest neighbor (ANN) search, which alleviates the curse of dimensionality.30 MinHash thus allows chemical structures to be indexed in extremely sparse Jaccard (Tanimoto) space, whose metric is more appropriate for fingerprint-based similarity calculations.30
UMAP, in turn, is a recently developed non-linear dimensionality reduction algorithm28 that has been used to analyze various types of scientific data, mainly in the biological sciences, including genome aggregation34, single-cell mass flow cytometry35, and single-cell RNA sequencing (scRNA-seq)35-37. It is a manifold learning method that aims to preserve both the local and the global structure of high-dimensional data while minimizing information loss. UMAP builds a k-nearest-neighbor (KNN) graph of the data in the high-dimensional space and then estimates a low-dimensional coordinate system that reproduces the same graph structure, so that the edge connectivity of the high-dimensional graph is preserved in the low-dimensional space. Compared with the more frequently used t-distributed stochastic neighbor embedding (t-SNE) algorithm, which has limited capability to represent the global structure of the data, UMAP has been found to retain both local and global structure by simultaneously capturing the small differences and the continuity between data subsets.
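The MinHash idea described above can be illustrated with a minimal standard-library sketch. It is not the MHFP6 implementation: character trigrams of a SMILES string stand in for the circular substructure shingling that MHFP6 uses, the hash family is built from salted BLAKE2b digests, and all function names are hypothetical. The point it shows is that the fraction of matching positions in two MinHash signatures estimates the Jaccard (Tanimoto) similarity of the underlying shingle sets.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of length-k substrings of text (a toy shingling)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def seeded_hash(seed: int, item: str) -> int:
    """One member of a hash family: a 64-bit integer from salted BLAKE2b."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], n_hashes: int = 256) -> list[int]:
    """Keep, for each hash function, the minimum hash value over all items."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    return [min(seeded_hash(seed, s) for s in items) for seed in range(n_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| for comparison."""
    return len(a & b) / len(a | b)


# Toy inputs: SMILES strings of phenethylamine and benzylamine.
phenethylamine = shingles("NCCc1ccccc1")
benzylamine = shingles("NCc1ccccc1")
```

Comparing `estimated_jaccard(minhash_signature(phenethylamine), minhash_signature(benzylamine))` with `exact_jaccard(phenethylamine, benzylamine)` shows how closely the fixed-length signature tracks the set similarity; the estimate tightens as `n_hashes` grows, which is the role played by the permutation number in MHFP6.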
At the highest level of classification, the amines are categorized into aliphatic amines (cyclic and noncyclic), heterocyclic aromatic amines, and other aromatic amines, including phenylalkyl amines and anilines. Combining this classification with the results of UMAP applied to the MHFP6 fingerprints of the perovskite amines, the clustering of these amine classes can be observed on the Amine Atlas. The best clustering is obtained when the number of MHFP permutations, the UMAP number of neighbors, and the UMAP minimum distance are set to 2048, 50, and 0.25, respectively. With this combination of parameters, the main classes are well separated from each other on the Amine Atlas (Figure 2), and the same parameters are used for all Amine Atlas plots below.
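To make the role of these settings more concrete, the sketch below shows the neighbor-search step that the UMAP number-of-neighbors parameter controls: for each fingerprint, the k closest fingerprints are found under the MinHash distance, taken here as the fraction of signature positions that differ. This is a brute-force, standard-library illustration under stated assumptions, not the UMAP implementation; `AtlasParams`, `minhash_distance`, and `knn_graph` are hypothetical names, and the toy signatures are far shorter than 2048 entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtlasParams:
    """Settings reported in the text (the container itself is hypothetical)."""
    n_permutations: int = 2048  # length of each MHFP signature
    n_neighbors: int = 50       # UMAP: neighbors per point in the KNN graph
    min_dist: float = 0.25      # UMAP: minimum spacing in the embedding (unused here)


def minhash_distance(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions at which two equal-length signatures differ."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def knn_graph(signatures: list[list[int]], n_neighbors: int) -> dict[int, list[int]]:
    """Map each point index to the indices of its nearest neighbors.

    Brute force, O(n^2) distance evaluations; ties are broken by index.
    """
    k = max(0, min(n_neighbors, len(signatures) - 1))
    graph: dict[int, list[int]] = {}
    for i, sig_i in enumerate(signatures):
        distances = [
            (minhash_distance(sig_i, sig_j), j)
            for j, sig_j in enumerate(signatures)
            if j != i
        ]
        distances.sort()
        graph[i] = [j for _, j in distances[:k]]
    return graph


# Made-up toy signatures of length 4, for illustration only.
toy_signatures = [
    [1, 2, 3, 4],
    [1, 2, 3, 9],
    [1, 2, 8, 9],
    [7, 7, 7, 7],
]
toy_graph = knn_graph(toy_signatures, n_neighbors=2)
```

A larger number of neighbors makes each point's neighborhood span more of the data set, which favors global structure in the embedding, while a smaller value emphasizes local detail. The minimum distance only affects how tightly points are packed in the low-dimensional layout, so it does not enter this step.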
For each amine class, the Amine Atlas can display further classifications as subclasses. The subclasses of heterocyclic aromatic amines are shown in Figure 3. This class separates clearly into amines based on common nitrogen-containing aromatic rings (pyrrole, imidazole, pyridine, and thiazole) and amines based on the sulfur-containing thiophene ring. No overlap is observed between the clusters, which may reflect the effectiveness of the MHFP6 fingerprint in capturing the characteristics of common aromatic compounds.
Similarly, the subclasses of phenylalkyl amines are well separated on the Amine Atlas (Figure 4). This figure shows the power of UMAP in capturing both the local and the global structure of the data. Here, UMAP captures subtle differences between subclasses, such as those with the same carbon number, by placing them in different clusters, e.g., 1-phenylethylamines (C6H5-C(C)NH3) and phenylethylamines (C6H5-CCNH3). At the same time, UMAP shows the continuity between closely related subclasses by placing them in adjacent positions, such as the benzylamines (C6H5-CNH3) and phenylethylamines (C6H5-CCNH3), whose alkyl chains differ in length by one carbon.
Because of the structural complexity of branched alkyl chains, some clusters of noncyclic aliphatic amines are less well organized (Figure 5). A trend is nevertheless visible among the amines with linear alkyl chains, such as the linear diamine (purple) and linear monoamine (orange) subclasses, where the length of the alkyl chain decreases along the UMAP-1 axis. In addition, amines that carry functional groups other than amine groups (dark green) lie far from the unsubstituted amines (purple and orange).
One important purpose of this study is to screen the relative hazard of the amines used in 2D and 3D perovskite synthesis, identifying which are the most hazardous and which are less so. We retrieved the toxicity data of perovskite amines from the PubChem Bioassay Database23,26, an open-access repository holding bioactivity and toxicity data of small molecules that are cross-linked to their chemical structures stored in the PubChem Compound Database22. Using our programming tools, we compiled a list of PubChem Bioassays that focus on the toxicity of chemicals and include perovskite amines as test substances; the complete list of assays is provided in the supplementary material. Examples of the toxicity effects and the corresponding assay identifiers (AIDs) are shown in Table 1.
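The programming tools used for this retrieval are not detailed here. As one possible route, PubChem's PUG REST interface provides an assay summary for a compound identified by its CID, and the sketch below builds that request and filters the result for a chosen set of AIDs. The helper names are hypothetical, and the response layout assumed in `rows_as_dicts` (a `Table` with `Columns` and `Row` entries, and columns named `AID` and `Activity Outcome`) should be checked against the current PUG REST documentation before use.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assay_summary_url(cid: int) -> str:
    """Build the PUG REST URL for the assay summary of one compound."""
    return f"{PUG_REST}/compound/cid/{cid}/assaysummary/JSON"


def fetch_assay_summary(cid: int, timeout: float = 30.0) -> dict:
    """Download and decode the assay summary (requires network access)."""
    with urlopen(assay_summary_url(cid), timeout=timeout) as response:
        return json.load(response)


def rows_as_dicts(summary: dict) -> list[dict]:
    """Pair each row's cells with the column names.

    Assumed layout: {"Table": {"Columns": {"Column": [...]},
                               "Row": [{"Cell": [...]}, ...]}}
    """
    table = summary["Table"]
    columns = table["Columns"]["Column"]
    return [dict(zip(columns, row["Cell"])) for row in table["Row"]]


def outcomes_for_assays(summary: dict, toxicity_aids: set[int]) -> dict[int, list[str]]:
    """Collect the reported activity outcomes for the selected assays."""
    outcomes: dict[int, list[str]] = {}
    for row in rows_as_dicts(summary):
        aid = int(row["AID"])
        if aid in toxicity_aids:
            outcomes.setdefault(aid, []).append(row["Activity Outcome"])
    return outcomes
```

In practice, `fetch_assay_summary` would be called once per CID in the amine database and the result passed to `outcomes_for_assays` together with the AIDs of the selected toxicity assays. Keeping the download separate from the parsing makes the filtering step easy to check offline.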
Table 1. Examples of selected PubChem Bioassays and the toxicity effects they study