Safer fitting through regularization
Last time a simple multiple linear regression (MLR) model was seriously overfitted to molecular solubility data. This time the concept of regularization will be tested.
Recall that the MLR model was of the form:
y = b1*x1 + b2*x2 + … + bi*xi + b0
Here, each x corresponds to the absence or presence of a particular bit in the Morgan fingerprint, and the coefficients b1 to bi were fit to the solubility data in y. With 1024 features and only 1311 molecular samples, the model was seriously over-fit and had no predictive power in cross-validation.
Regularization is the concept of penalizing “complex” solutions. So what is a complex solution, and what is a simple one? A complex solution may be one with many large coefficients of opposing signs, whereas a simple solution could be seen as one with many small contributions from low coefficients. This preference can be built into the fit itself.
The coefficients in the MLR are fit so that the sum of squared errors between the predicted y and the measured y is as low as possible. The average squared error is usually used as the cost function that is optimized in the machine learning model. By adding an extra term to the cost function that penalizes “complex” solutions, it is possible to regularize the fit so that over-fitting is avoided. For MLR, the sum of squared coefficients, multiplied by a parameter alpha that controls the amount of regularization, is usually added to the cost function.
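As an illustration, the regularized cost could be sketched like this in Python (a sketch of the idea only; scikit-learn's Ridge actually minimizes the sum, rather than the mean, of squared errors plus the same penalty, and does not penalize the intercept b0):

import numpy as np

def ridge_cost(coefficients, intercept, X, y, alpha):
    #Average squared error between predicted and measured y
    errors = X.dot(coefficients) + intercept - y
    mse = np.mean(errors**2)
    #Penalty: alpha times the sum of the squared coefficients
    penalty = alpha * np.sum(coefficients**2)
    return mse + penalty

The larger alpha is, the more the fit will favour small coefficients over a low error on the training data.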
This regularized MLR is available in Scikit-learn as Ridge Regression. Below, the effect of varying the regularization parameter alpha is explored in cross-validation of the previously generated data.
The X matrix and y vector were produced in the same way as in the last blog post; a minimal sketch of that setup is included below for reference.
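The sketch is an assumption of that setup rather than the original code: the file name soldata.csv and the column names smiles and solubility are placeholders, the features are 1024-bit Morgan fingerprints of radius 2 computed with RDKit, and the train/test split parameters are arbitrary.

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.model_selection import train_test_split

#Hypothetical file and column names - adjust to the actual dataset
data = pd.read_csv('soldata.csv')
mols = [Chem.MolFromSmiles(smiles) for smiles in data['smiles']]
#1024 bit Morgan fingerprints (radius 2) as the feature matrix
X = np.array([list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)) for mol in mols])
y = data['solubility'].values
#Hold out part of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

With X_train, y_train, X_test and y_test in place, a tuning of the regularization can be implemented in Python like this.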
#Imports needed for the tuning below
import numpy as np
from sklearn import linear_model

#prepare scoring lists
fitscores = []
predictscores = []
#prepare a log spaced list of alpha values to test
alphas = np.logspace(-1, 4, num=25)
#Iterate through alphas and fit with Ridge Regression
for alpha in alphas:
    estimator = linear_model.Ridge(alpha=alpha)
    estimator.fit(X_train, y_train)
    #Score (R^2) on both the training set and the held-out test set
    fitscores.append(estimator.score(X_train, y_train))
    predictscores.append(estimator.score(X_test, y_test))
#Plot the scores as a function of alpha
import matplotlib.pyplot as plt
ax = plt.gca()
ax.set_xscale('log')
#Green: training set score, Blue: test set score
ax.plot(alphas, fitscores, 'g')
ax.plot(alphas, predictscores, 'b')
plt.xlabel('alpha')
plt.ylabel('R^2 score')
plt.show()
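#Side note (an assumption, not part of the original workflow): scikit-learn's
#RidgeCV can also pick alpha by internal cross-validation on the training set.
#The alpha grid mirrors the one scanned above.
from sklearn.linear_model import RidgeCV
cv_estimator = RidgeCV(alphas=np.logspace(-1, 4, num=25))
cv_estimator.fit(X_train, y_train)
print("alpha selected by RidgeCV:", cv_estimator.alpha_)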
#Build a final model with a chosen amount of regularization
estimator = linear_model.Ridge(alpha=10)
estimator.fit(X_train, y_train)
#Predict the test set
y_pred = estimator.predict(X_test)
#Root mean square error and standard deviation of the prediction error
rms = (np.mean((y_test - y_pred)**2))**0.5
s = np.std(y_test - y_pred)
print(rms, s)
The quickly built model has an RMSE around 1.15 with a similar standard deviation of the error. Obviously, the simple linear modelling done here has some way to go compared with the results obtained with neural networks using molecular weight and E-state indices (1). Molecular weight and E-state indices (rdkit.Chem.EState.Fingerprinter) can also be calculated with RDKit, so maybe this will be attempted another time…
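For reference, a minimal sketch of how those descriptors can be computed with RDKit (the example SMILES is arbitrary, and the modelling itself is left for another time):

from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.EState import Fingerprinter

mol = Chem.MolFromSmiles('c1ccccc1O')  #phenol as an example molecule
mw = Descriptors.MolWt(mol)
#FingerprintMol returns two arrays: E-state atom type counts and summed E-state indices
estate_counts, estate_sums = Fingerprinter.FingerprintMol(mol)
print(mw, estate_sums)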
Best Regards