Abstract
Introduction: The prediction of plasma protein binding (ppb) is of paramount importance in the pharmacokinetics characterization of drugs, as it causes significant changes in volume of distribution, clearance and drug half life. This study utilized Quantitative Structure – Activity Relationships (QSAR) for the prediction of plasma protein binding. Methods: Protein binding values for 794 compounds were collated from literature. The data was partitioned into a training set of 662 compounds and an external validation set of 132 compounds. Physicochemical and molecular descriptors were calculated for each compound using ACD labs/logD, MOE (Chemical Computing Group) and Symyx QSAR software packages. Several data mining tools were employed for the construction of models. These included stepwise regression analysis, Classification and Regression Trees (CART), Boosted trees and Random Forest. Results: Several predictive models were identified; however, one model in particular produced significantly superior prediction accuracy for the external validation set as measured using mean absolute error and correlation coefficient. The selected model was a boosted regression tree model which had the mean absolute error for training set of 13.25 and for validation set of 14.96. Conclusion: Plasma protein binding can be modeled using simple regression trees or multiple linear regressions with reasonable model accuracies. These interpretable models were able to identify the governing molecular factors for a high ppb that included hydrophobicity, van der Waals surface area parameters, and aromaticity. On the other hand, the more complicated ensemble method of boosted regression trees produced the most accurate ppb estimations for the external validation set.