Molecular crystals are a class of solids comprising molecular building blocks bound by van der Waals (vdW) interactions. They are used as functional materials for various applications, including organic electronics and photovoltaics, non-linear optics, and, most prominently, pharmaceuticals, because most drugs are marketed as solid forms of the active ingredient. Owing to the weak nature of vdW interactions, the same molecule may crystallize in several different structures, known as polymorphs. Polymorphs may be very close in energy and yet possess markedly different physical and chemical properties. For device applications, the crystal structure may affect the electronic and optical properties. For pharmaceuticals, the crystal structure may affect the dissolution rate and thus the drug's bioavailability. The ability to predict all the possible polymorphs of a particular molecule and their properties is therefore critically important. Molecular crystal structure prediction is extremely challenging because it requires searching a high-dimensional space with quantum mechanical accuracy. To predict the structure of molecular crystals, we develop the genetic algorithm (GA) code, GAtor, and its associated structure generation package, Genarris.
Genarris: a random structure generator for molecular crystals
Genarris is a random structure generator for molecular crystals, which can be used for seeding crystal structure prediction algorithms, for generating datasets to train machine learning models, or for crystal structure prediction by random sampling. MPI-based parallelization facilitates the seamless sequential execution of user-defined workflows. The workflow of Genarris is illustrated below. Genarris starts by estimating the unit cell volume based on the single molecule structure, using a machine-learned model trained on experimental structures. Then, structures are generated in all space groups compatible with the molecular point group symmetry and the requested number of molecules per unit cell, including space groups with molecules occupying special Wyckoff positions. A hierarchical structure check procedure detects unphysical close contacts efficiently and accurately. Special intermolecular distance settings have been implemented for strong hydrogen bonds. Once a “raw pool” is generated, down-selection may be performed by executing user-defined sequences of clustering and selection based on energy and/or diversity considerations.
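The idea behind a hierarchical close-contact check can be sketched as follows. This is a minimal illustration under stated assumptions, not the Genarris implementation: the function names, the scaling factor `sr`, and its default value are hypothetical, periodic images and the special hydrogen-bond settings are ignored, and the radii are Bondi van der Waals values for a few elements. A cheap bounding-sphere test is applied first, and the pairwise atomic distances are computed only if the two molecules could possibly touch.

```python
import math

# Bondi van der Waals radii in angstroms (illustrative subset of elements).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def _centroid_and_radius(mol):
    """Return the centroid of a molecule and the radius of its bounding sphere."""
    n = len(mol)
    centroid = tuple(sum(pos[i] for _, pos in mol) / n for i in range(3))
    radius = max(math.dist(centroid, pos) for _, pos in mol)
    return centroid, radius


def has_close_contact(mol_a, mol_b, sr=0.85):
    """Return True if any atom pair is closer than sr * (r_i + r_j).

    mol_a, mol_b: lists of (element, (x, y, z)) tuples, Cartesian angstroms.
    sr: hypothetical scaling factor applied to the sum of vdW radii.
    Periodic images are NOT considered in this sketch.
    """
    # Stage 1: cheap pre-screen with bounding spheres. If the spheres are
    # separated by at least the largest possible cutoff, no pair can clash.
    centroid_a, radius_a = _centroid_and_radius(mol_a)
    centroid_b, radius_b = _centroid_and_radius(mol_b)
    max_cutoff = sr * 2.0 * max(VDW_RADII.values())
    if math.dist(centroid_a, centroid_b) - radius_a - radius_b >= max_cutoff:
        return False

    # Stage 2: exact pairwise check with element-specific cutoffs.
    for elem_a, pos_a in mol_a:
        for elem_b, pos_b in mol_b:
            cutoff = sr * (VDW_RADII[elem_a] + VDW_RADII[elem_b])
            if math.dist(pos_a, pos_b) < cutoff:
                return True
    return False
```

The pre-screen never changes the result; it only avoids the quadratic pairwise loop for molecules that are far apart, which is where most of the savings would come from when screening many candidate structures.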
GAtor: a massively parallel genetic algorithm (GA) for molecular crystal structure prediction
GAs rely on the evolutionary principle of survival of the fittest to perform global optimization. The target property is mapped onto a fitness function, and structures with a high fitness have an increased probability to “mate” and propagate their structural “genes”. The process repeats iteratively until an optimum is found. GAtor has three special features: (1) a variety of crossover and mutation operators, designed for molecular crystals, balance exploration and exploitation by breaking or preserving space group symmetries; (2) evolutionary niching helps overcome initial pool bias and selection bias; (3) massive parallelization is achieved by spawning several GA instances that only interact through a shared population. The recommended best practice for crystal structure prediction is to run GAtor several times with different settings. The figure below demonstrates how the experimental structure of tricyano-1,4-dithiino[c]-isothiazole (TCS3) was generated in seven GAtor runs with different settings via different evolutionary routes, starting from initial pool structures.
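The mapping from a target property to a fitness and the fitness-weighted choice of parents can be sketched as follows. This is a generic illustration, not GAtor's code: the normalized energy fitness, the roulette-wheel selection scheme, and the function names are assumptions chosen for clarity.

```python
import random


def energy_fitness(energies):
    """Map energies to fitness in [0, 1]; the lowest energy gets fitness 1."""
    e_min, e_max = min(energies), max(energies)
    if e_max == e_min:
        # Degenerate pool: every structure is equally fit.
        return [1.0] * len(energies)
    return [(e_max - e) / (e_max - e_min) for e in energies]


def select_parents(population, energies, rng):
    """Pick two distinct parents with probability proportional to fitness."""
    weights = energy_fitness(energies)
    indices = list(range(len(population)))
    first = rng.choices(indices, weights=weights, k=1)[0]

    # Exclude the first parent so a structure does not mate with itself.
    rest = [i for i in indices if i != first]
    rest_weights = [weights[i] for i in rest]
    if sum(rest_weights) == 0:
        # random.choices rejects all-zero weights; fall back to a uniform pick.
        second = rng.choice(rest)
    else:
        second = rng.choices(rest, weights=rest_weights, k=1)[0]
    return population[first], population[second]


# Usage with placeholder structure labels and made-up relative energies.
rng = random.Random(0)
parents = select_parents(["a", "b", "c"], [-10.0, -5.0, 0.0], rng)
```

In a full GA loop, the selected parents would then be passed to a crossover or mutation operator, and the relaxed child would be added to the shared population before the next selection.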
A machine-learned model for molecular crystal volume estimation
The first step in a crystal structure prediction workflow is to estimate the volume of the molecular crystal based on the single molecule’s structure to define the search space. To this end, we have developed a machine-learned (ML) model. The success of ML models for physical systems hinges on a good choice of descriptors that represent the salient features of the systems being studied. Our model is based on two descriptors: the volume enclosed by the packing-accessible surface and molecular topological fragments. The volume enclosed by the packing-accessible surface accounts for the presence of voids and sterically hindered regions, as well as for the effect of conformational changes. The molecular topological fragments capture the bonding environments of the atoms in the molecule and the intermolecular interactions they may form. The model is trained on data extracted from the Cambridge Structural Database (CSD). Including both geometric and chemical features produces an accurate model with robust performance for unseen data.
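How one geometric descriptor and a set of fragment counts might be combined into a single prediction can be sketched as follows. The linear functional form, the function names, the fragment labels, and the weights in the usage example are all assumptions made for illustration; they are not the fitted model or its parameters.

```python
def build_features(surface_volume, fragment_counts, fragment_vocab):
    """Concatenate the geometric descriptor with per-fragment counts.

    surface_volume: volume enclosed by the packing-accessible surface.
    fragment_counts: dict mapping a fragment label to its count in the molecule.
    fragment_vocab: ordered list of fragment labels known to the model.
    """
    return [surface_volume] + [
        float(fragment_counts.get(label, 0)) for label in fragment_vocab
    ]


def predict_volume(features, weights, bias=0.0):
    """Evaluate a linear model (an assumed form): bias + sum_k w_k * x_k."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Usage with placeholder numbers; the labels and weights are hypothetical.
vocab = ["OH", "NH2"]
x = build_features(100.0, {"OH": 2}, vocab)
volume = predict_volume(x, weights=[1.5, -3.0, 4.0])
```

In this picture, the weight on the surface volume sets the overall scale, while the fragment weights act as chemistry-dependent corrections, for example for groups that tend to pack more tightly through directional interactions.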
Evolutionary niching in GAtor
Typically, genetic algorithms for crystal structure prediction use an energy-based fitness function, which assigns a higher fitness to structures with lower energy. In addition to energy-based fitness, evolutionary niching has been implemented in GAtor to perform multimodal optimization by simultaneously evolving several sub-populations. Machine learning is used to dynamically cluster the population by structural similarity. A cluster-based fitness function is then used to steer the GA towards promising under-sampled regions of the configuration space. This reduces initial population and selection biases (evolutionary drift) and improves the GA performance. An example is shown here for 1,3-dibromo-2-chloro-5-fluorobenzene. Energy-based fitness preferentially samples a basin that contains layered structures. Evolutionary niching enhances sampling in the region of the experimental structure, which has a zigzag packing motif.
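One common way to build a cluster-based fitness is fitness sharing, in which the energy-based fitness of each structure is divided by the size of its cluster. The sketch below assumes this form for illustration; the function name and the sharing rule are assumptions, and the cluster labels are taken as given by whichever clustering method is used.

```python
from collections import Counter


def cluster_fitness(energy_fitness, cluster_labels):
    """Penalize structures in crowded clusters (fitness sharing).

    energy_fitness: list of energy-based fitness values, one per structure.
    cluster_labels: list of cluster ids, one per structure, same order.
    """
    if len(energy_fitness) != len(cluster_labels):
        raise ValueError("one cluster label is required per structure")
    sizes = Counter(cluster_labels)
    return [f / sizes[c] for f, c in zip(energy_fitness, cluster_labels)]


# Two structures share cluster 0; the lone member of cluster 1 is promoted.
shared = cluster_fitness([1.0, 0.8, 0.6], [0, 0, 1])
```

With these placeholder values, the structure in the sparsely populated cluster ends up with the highest shared fitness even though its energy-based fitness was the lowest, which is the mechanism that steers selection toward under-sampled regions.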