Improving Solvation Energy Predictions Using The SMD Solvation Method and Semiempirical Electronic Structure Methods

The PM6 implementation in the GAMESS program is extended to elements requiring d -integrals and interfaced with the conducter-like polarized continuum model (C-PCM) of solvation, including gradients. The accuracy of aqueous solvation energies computed using AM1, PM3, PM6, and DFTB and the SMD continuum solvation model is tested using the MNSOL data set. The errors in SMD solvation energies predicted using NDDO-based methods is considerably larger than when using DFT and HF, with RMSE values of 3.4-5.9 (neutrals) and 6-15 kcal/mol (ions) compared to 2.4 and ca 5 kcal/mol for HF/6-31G(d). For the NDDO-based methods the errors are especially large for cations and considerably higher than the corresponding COSMO results, which suggests that the NDDO/SMD results can be improved by re-parameterizing the SMD parameters focusing on ions. We found the best results are obtained by changing only the radii for hydrogen, carbon, oxygen, nitrogen, and sulfur and this leads to RMSE values for PM3 (neutrals: 2.8/ions: ca 5 kcal/mol), PM6 (4.7/ca 5 kcal/mol), and DFTB (3.9/ca 5 kcal/mol) that are more comparable to HF/6-31G(d) (2.4/ca 5 kcal/mol). Though the radii are optimized to reproduce aqueous solvation energies, they also lead more accurate predictions for other polar solvents such as DMSO, acetonitrile, and methanol, while the improvements for non-polar solvents are negligible.


Introduction
Accurate yet computationally efficient models of aqueous solvation represents an important challenge to molecular modeling.Continuum solvation models such as the polarized continuum model (PCM), [1] the conductor-like screening model (COSMO), [2] and the universal SMx models [3] offer a computational efficient model of solvation for molecules treated with electronic structure methods.While the accuracy of these and related methods have been studied extensively using DFT and wavefunction-based methods, [4,5,6,7,8,9] there has been relatively little work on corresponding studies with semiempirical methods.[2,10,11,12,13] In this paper we study the accuracy of using AM1 [14], PM3 [15], PM6 [16], and DFTB [17] together with the SMD [9] continuum solvation model using the MNSOL data set [18,19] and present new parameters that increase the accuracy considerably for polar solvents.We chose the SMD method over methods with systemdependent radii, since SMD generally applicable to cases, such as transition states, where such a parameterization can become ill defined.This manuscript is organized as follows.After a presentation of our computational methodology we discuss the accuracy of the semiempirical SMD calculations using the radii optimized for ab initio methods.We then compare several sets of reoptimized radii and select the best set.We show that the new radii lead to more accurate pKa predictions as well as more accurate solvation free energies for DMSO, acetonitrile, and methanol.Finally, we present a summary and an outlook based on the results presented in the manuscript.

Computational Details
For reference we use the MNSOL [18,19] data set, which contains experimental solvation free energies of neutral and singly-charged molecules with atom-types of H, C, N, O, F, Si, P, S, Cl, Br, and I.Here we focus mainly on the aqueous solvation subset of that data set.Because iodine is not supported by 6-31G(d) we removed compounds test2018, test4001-4 and test4007-9 from the reference data set.Compounds c091 and i091 are radicals and are not included because PM6 is only implemented for RHF in GAMESS.[20] Our final reference set of molecules then consists of 522 aqueous solvation free energies of which 81 are for anions and 60 are for cations and we refer to this as the MNSOL data set.
When parameterizing SMD Coulomb radii, we use the a subset of the MNSOL data set (the SMD data set) which was also used by Marenich et al. [9].This subset consists of 384 solvation free energies, of which 59 are for anions and 52 are for cations.Following the original SMD implementation we used mono-aquo microsolvated species for small ions with atoms carrying large (partial) charges.The SMD method is re-parameterized using the gas-phase structures optimized at the respective levels of theory.
The integration of d-integrals for semiempirical methods and interface to the conducterlike polarizable continuum model (C-PCM), [21,22] is implemented in a locally modified version of GAMESS using integral code donated by James Stewart.The PM6/PCM interface follows the implementation of Steinmann et al. [23] except that we use the seminumerical gradient approach for the PCM gradient that is also used for the gas phase gradient.The DFTB calculations are done using the DFTB/PCM interface developed by Nishimoto [24] in GAMESS and using version 3ob-3-1 of the 3OB DFTB parameter set [17,25,26,27].All SMD [9] calculations were carried out with the GAMESS program, and COSMO [2] calculations were done with MOPAC2016 [28].
The original SMD parameterization was done with the Gauss-Bonet [29] tessellation scheme, while the default tessellation scheme in GAMESS is FIXPVA [30].While this difference in tessellation scheme has a negligible effect on the solvation free energies of neutral molecules, it can have a rather larger effect for ions.Thus, for calculations involving the original SMD radii we use the Gauss-Bonet tessellation scheme (mthall=1 in the $tescav input group), while we use the more numerically stable FIXPVA scheme when optimizing the radii.The original SMD parameterization was also done with integralequation-formalism PCM (IEF-PCM) [31] while we chose the C-PCM method as it is computationally significantly more efficient for semiempirical calculations.COSMO is used with AM1, PM3 and PM6 as implemented in the MOPAC program [28].The null model is constructed by setting all the predicted solvation free energies to the average of all the reference energies and the error bars reflect 95% confidence limits [32,33,34].
The pKa predictions are performed as described by Jensen et al. [35] using the same data set, i.e. a modified version of the one used by Eckert and Klamt [36].Following Jensen et al. [35], cefadroxil has been removed due to proton transfer and piroxicam has also been removed since we discovered that the wrong tautomeric form was used.

Results and Discussion
Table 1 and Figure 1 shows the computed solvation free energy results obtained using HF/6-31G(d), AM1, PM3, PM6 and DFTB using the implicit solvation models SMD and COSMO using the MNSOL data set described in the Computational Details section.We compute the root mean square error (RMSE), mean signed error (MSE) and mean unsigned error (MUE) in order to properly quantify the accuracy of our results.We present the combined results (labeled "all") for all molecules in the data set, but also present individual observations on neutral and singly charged (both anions and cations) species.

Optimizing SMD Coulomb Radii
Based on the results in the previous subsection we optimize the Coulomb radii for PM3, PM6, and DFTB independently, while leaving the non-polar solvation free energy contribution unchanged.We use the Nelder-Mead simplex method [37] to optimize the radii using the SMD data set with the SciPy [38] package using a custom optimization script.[39] Optimization of the radii is done by minimizing the RMSE with respect to the radii.
Initial tests showed that the solvation free energy does not vary greatly with a change in radii for neutral molecules and the flatness of the energy surface caused problems for the Nelder-Mead method.Furthermore, the error for the ions is considerably larger than for the neutral molecules.Thus, only radii of the ions are optimized but when testing the performance of the new radii neutral molecules are included.This means that the optimum parameters obtained for the ions will not necessarily lead to the lowest possible error for neutral molecules.We therefore test the effect of optimizing a subset of the radii but include all molecules when testing the overall performance.These results are presented in Tables 2 and 3 and in Figure 2.
For PM3 the results are relatively insensitive to the type of atoms chosen with RMSE * values of 4.4 to 4.7 kcal/mol (Table 2 and Figure 2).(Note that the RMSE value are computed using different numbers of charged species so the RMSE * does not necessarily decrease when more parameters are optimized.)Similarly, the RMSE values computed using all the molecules in the MNSOL data set range from 2.9 to 3.2 kcal/mol as presented in Table 3.The cations are most sensitive to the choice of atom types with an RMSE range of 5.0 to 6.2 kcal/mol and the HCON-and HCONS-optimized radii set giving the best, and nearly identical, accuracy.For anions, the RMSE for the HCON radii (4.2 kcal/mol) is lower than that for HCONS (4.5 kcal/mol) and nearly identical to the lowest RMSE observed (4.1 kcal/mol).The HCON-optimized radii are therefore the best compromise.
For PM6, the RMSE * for HCO (3.2 kcal/mol) is about 1 kcal/mol lower than for the other radii set.However, the RMSE values computed using all molecules are very similar with a range of 3.6 to 3.8 kcal/mol, with HCO and HCON tied for first place.HCO does lead to the lowest error for neutral molecules and anions, but also an RMSE that is significantly larger for cations (6.2 kcal/mol), where PM6-HCONS performs best.
When testing the PM6-HCON and PM6-HCONS radii on pKa prediction (see below) we found that several solution-phase geometry optimizations failed because some X-H bonds broke and the proton started moving into the solvent.This indicated that the H radius might be too small, leading to overpolarization of the H atom due to the solvent, and we tested several larger H radii for PM6-HCONS since the H radius was larger than for PM6-HCON.We found that the overpolarization problem disappeared with a H radius of 0.6 Å and that using this value has a negligible 0.1 kcal/mol effect of the accuracy of the solvation energies (HCONSh in Table 3).Thus, we use PM6-HCONSh radii set going forward.  2. HCO indicates that hydrogen, carbon, and oxygen radii parameters are changed (i.e.second and sixth entry in Table 2.The reference set is split into full set (black), neutral (green), cations (red), and anions (blue).From the results discussed above and presented in Table 2, optimizing for atom types beyond HCON showed no significant increase in accuracy for either PM3 or PM6.Therefore the same procedure (optimizing HCON radii) was applied to DFTB.For DFTB the RMSE values of the full set of molecules, the subset of cation and the subset of anions goes from 4.1, 4.9 and 6.7 kcal/mol to 3.3, 2.9 and 4.4 kcal/mol, respectively.
In summary, we find that optimizing the SMD radii in combination with either PM3 or PM6 gives appreciable decreases in RMSE accuracy ≈ 7 kcal/mol in both cases compared to the reference results.We also find that no significant improvement is obtained when attempting to optimize the radii of other atoms than what was done with HCON subset of radii.For this subset, DFTB also saw an increase in accuracy for ions of about ≈ 6 kcal/mol.Because of this, we found the best choice was change the radii for hydrogen, carbon, oxygen and nitrogen (HCON) for PM3 and DFTB methods, For PM6 we optimized including sulfur and changed the radius of hydrogen to 0.6 Å to remove overpolarization problems.These methods are denoted as SMD † going forward.

Comparison with Other Solvation Methods
Using the new parameters for PM3/SMD †, PM6/SMD † and DFTB/SMD †, we compare the obtained RMSE with other approaches to obtain solvent free energies such as HF/6-31G(d)/SMD (one of the levels of theory used in the original SMD parameterization), PM3/COSMO and PM6/COSMO for the MNSOL data set (Table 4 and Figure 3).The overall accuracy of PM3/SMD † is now similar to HF/SMD, although the RMSE value is 0.4 and 0.6 kcal/mol higher for neutral molecules and cations, while the RMSE is 0.7 kcal/mol lower for anions.For PM6 the overall RMSE is 1.4 kcal/mol higher than HF, and the RMSE for neutral molecules is 2.3 kcal/mol higher.On the other hand, the RMSE for cations is only 0.1 kcal/mol higher, while the RMSE for anions is 0.8 kcal/mol lower compared to HF.Finally, for DFTB the overall RMSE value (4.1 kcal/mol) is in between those for PM3 and PM6 as is the RMSE values for neutral molecules (3.9 kcal/mol).But for cations the RMSE (3.7 kcal/mol) is significantly lower than the other methods while the RMSE for anions (5.1 kcal/mol) is very similar.
While the solvation free energies of neutral molecules can generally be measured with an accuracy of ca 0.2 kcal/mol, the corresponding solvation free energies of ions are inferred from other experimental values with a typical accuracy about 3.0 kcal/mol.[40] Thus, the accuracy of the semiempirical SMD † is for all intents and purposes identical to HF/SMD for ions, but significantly worse for neutral ions in the case of PM6 and DFTB.The source of this error is most likely due to a poorer description of the electrostatic properties compared to HF.For example, the error in dipole moments for HCNO compounds is larger for PM6 than for PM3.[16] Table 4 also lists values for PM3/COSMO and PM6/COSMO as implemented in MOPAC for comparison.This COSMO implementation only evaluates the polar part of the solvation free energy and will not be as accurate as properly parameterized COSMO-RS [5] values.For PM3, the SMD † results are significantly lower than the corresponding COSMO results, while for PM6 there is only a significant improvement for cations.

pKa Prediction
Part of the impetus for this work were the relatively large error in pKa values predicted using PM3/COSMO compared to PM3/SMD with the original SMD radii observed by Jensen et al. [35], so we use the new SMD radii to compute the pKa value using the same data set.The data is presented in Figure 4 and Table 5 and shows that the RMSE drops from 1.5±0.3 to 0.9±0.3pH units for PM3/SMD † and 2.0±0.4 to 1.6±0.3/0.4 pH units for DFTB/SMD †.In both cases the improvement is primarily a result of the decrease in mean error (ME), i.e the original radii led to a fairly systematic underestimation of the pKa values that has now been removed by the reparameterization.For DFTB the increased accuracy is also due to a larger number of low error predictions as is evident in

Non-aqueous Solvents
The new radii were optimized explicitly for water, so it is not a priori clear whether they offer more accurate results for other solvents.Table 6 shows the RMSE values for all, neutral, cations, and anions for the other polar solvents in the MNSOL data set, acetonitrile, DMSO and methanol for the original SMD radii and the corresponding re-optimized radii ( †).In addition, corresponding combined RMSE values are given for the remaining 88 mostly non-polar solvents in the MNSOL data set, for which only data is available for neutral molecules.It is clear from this data that the new radii sets lowers the error of the predicted solvation free energies for the other polar solvents and has a negligible effect on the accuracy for the non-polar solvents.
As for acetonitrile and DMSO, the improvement is greatest for the ions where the RMSE decreases by 3-12 kcal/mol.The improvement is more modest (0.5-3 kcal/mol) for the neutral molecules.For methanol, which contains only ions, the RMSE is similarly decreased with 2-13 kcal/mol.
We thus recommend that the new radii optimized for water also be used for other polar solvents when doing semiempirical SMD calculations.For non-polar solvents either set of radii can be used.

Conclusion
The PM6 method in GAMESS was extended to elements requiring d-integrals and interfaced with the polarized continuum model (PCM) of solvation, including semi-numerical gradients.However, the accuracy of aqueous solvation energies computed using AM1, PM3, PM6, and DFTB and the SMD continuum solvation model was tested using a subset of molecules from the MNSOL data set which showed that the errors in SMD solvation energies predicted using NDDO-based methods was considerably larger than when using DFT and HF, with RMSE values of 5.0 to 8.6 kcal/mol compared to 3.4 kcal/mol for HF/6-31G(d).For the NDDO-based methods the errors were especially large for cations and, in the case of PM6, also for anions, with RMSE values of 10.1 to 14.8 kcal/mol in comparison with to 5.0 to 5.7 kcal/mol for HF/6-31G(d)/SMD.Corresponding COSMO results (where the maximum RMSE is 7.5 kcal/mol) suggested that the NDDO/SMD results could be improved by re-parameterizing the SMD Coulomb radii for the NDDO methods.
The fact that the NDDO/SMD errors are largest for ions suggest that the problem is with the polar solvation energy, so we focus on optimizing the values of the Coulomb radii while leaving all parameters associated with the non-polar solvation free energy unchanged.We optimize the radii only for the ionic species on a subset of the MNSOL data set previously used to parametrize RHF/6-31G(d)/SMD, but include neutral molecules when testing how well SMD with the new Coulomb radii perform.We found the best results are obtained by changing only the radii for hydrogen, carbon, oxygen, nitrogen, and sulfur and this leads to RMSE values for PM3 (neutrals: 2.8/ions: ca 5 kcal/mol), PM6 (4.7/ca 5 kcal/mol), and DFTB (3.9/ca 5 kcal/mol) that are more comparable to HF/6-31G(d) (2.4/ca 5 kcal/mol).We note that the SMD parameterization is done using mono-aquo microsolvated species for small ions with atoms carrying large (partial) charges, while the MNSOL data set also contains data for the non-microsolvated equivalents.Comparison to the HF/6-31G(d) MSEs reported by Marenich et al. [8] reported for the selectively microsolvated ions indicates that the inclusion of non-microsolvated species leads an MSE increase of about 1 kcal/mol.Though the radii are optimized to reproduce aqueous solvation energies, they also lead more accurate predictions for other polar solvents such as DMSO, acetonitrile, and methanol, while the improvement for non-polar solvents are negligible.

Figure
Figure 1: RMSE (in kcal/mol) of SMD aqueous solvation energies computed using HF/6-31G(d), AM1, PM3, PM6, and DFTB and the original SMD Coulomb radii for the MNSOL data set.Corresponding RMSE values for the COSMO method are included for comparison.The MNSOL data set is further split into the full set of molecules (black), neutral molecules (green), cations (red), and anions (blue).
2 10.5 30.0 COSMO-predicted solvation free energies have RMSE values for the full set of 5.2, 4.8 and 5.0 kcal/mol for AM1, PM3 and PM6 respectively, which are quite similar to the corresponding SMD values for AM1 and PM3.However, the accuracy of COSMO for the ions are significantly better, especially for cations, with RMSE values of 7.3, 7.4 and 7.3 kcal/mol for cations and 7.5, 6.8, 4.5 kcal/mol for anions.The COSMO results show that is is possible to significantly improve the accuracy of the NDDO/SMD calculations and the comparatively high errors for ions indicate that the focus should be on the polar part of the solvation free energy.

Figure 2 :
Figure 2: RMSE values (in kcal/mol) computed using SMD subset of the MNSOL data set and the Coulomb radii set shown in Table2.HCO indicates that hydrogen, carbon, and oxygen radii parameters are changed (i.e.second and sixth entry in Table2.The reference set is split into full set (black), neutral (green), cations (red), and anions (blue).

Figure 3 :
Figure 3: RMSE (in kcal/mol) of aqueous solvation energies computed with SMD using HF/6-31G(d), PM3, PM6, and DFTB for the MNSOL dataset.Note that † indicates that the Radii has been reoptimized.Corresponding RMSE values for the COSMO method are included for comparison.The dataset is split into full set (black), neutral (green), cations (red), and anions (blue).

Table 2 :
Optimized Coulomb radii (in Å) for PM3, PM6 and DFTB for different atom types and original SMD radii parameters for comparison.RMSE SMD denotes the RMSE using the original parameters and RMSE * denotes RMSE after optimization.The RMSE is given in kcal/mol and is only calculated for the subset of ions in the SMD subset containing those specific atom types.

Table 3 :
RMSE, MUE and MSE values (in kcal/mol) computed using the SMD subset of the MNSOL data set and the Coulomb radii set shown in Table2.HCO indicates that hydrogen, carbon, and oxygen radii parameters are changed (i.e.second and sixth row in Table2.)

Table 4 :
RMSE, MUE and MSE (in kcal/mol) of aqueous solvation energies computed with SMD using HF/6-31G(d), PM3, PM6, and DFTB for the MNSOL dataset.Note that † indicates that the Radii has been reoptimized.Corresponding RMSE values for the COSMO method are included for comparison.The reference set is split into full set, neutral, cations, and anions.

Table 5 :
Root-Mean-Square-Error (RMSE), man error (ME) and Pearson correlation (r) for predicted pKa values relative to experiment, together with their statistical uncertainty (95% confidence limits)

Table 6 :
RMSE (in kcal/mol) of non-aqueous solvation energies computed with SMD using HF/6-31G(d), PM3, PM6, and DFTB for the MNSOL dataset.Note that † indicates that the radii has been reoptimized.Corresponding RMSE values for the COSMO method are included for comparison.The reference set is split into full set, neutral, cations, and anions.