Hepatitis C virus (HCV) is responsible for a variety of life-threatening human diseases, including chronic hepatitis, fibrosis, liver cirrhosis and hepatocellular carcinoma (HCC). Computational study of protein-protein interactions between human and HCV could accelerate the discovery of antiviral drugs for HCV therapy and might help optimize treatment procedures for HCV infections. In this study, we constructed a prediction model for protein-protein interactions between HCV and human using features generated from pseudo amino acid compositions, and carried out the analysis at two levels: feature categories and individual features. In brief, the extra-trees algorithm was first used for feature selection, and SVM was then used to build the classification model. The most suitable model for each category and each feature was then selected by comparing the SVM model with three ensemble learning algorithms: Random Forest, Adaboost, and Xgboost. According to our results, among the four categories, profile-based features were the most suitable for building predictive models; the AUC of the model constructed with the Xgboost algorithm reached 92.66% on the independent data set. Moreover, among the 17 features, Distance-based Residue, Physicochemical Distance Transformation and Profile-based Physicochemical Distance Transformation performed much better than the others. The AUC of the Adaboost classifier constructed with Profile-based Physicochemical Distance Transformation reached 93.74% on the independent data set. Taken together, we propose a model with improved prediction capacity for protein-protein interactions between human and HCV, which could provide a practical reference for further experimental investigation into HCV-related diseases.
Communicated by Ramaswamy H. Sarma.
Keywords: Adaboost; HCV; SVM; Xgboost; protein-protein interaction; random forest.
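The workflow summarized in the abstract (extra-trees feature selection, an SVM classifier, comparison against ensemble learners by AUC on held-out data) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the data are synthetic stand-ins for the sequence-derived feature vectors, the hyperparameters are arbitrary, feature selection is applied in front of every classifier, and scikit-learn is used throughout. Xgboost is left out to avoid an extra dependency; an `xgboost.XGBClassifier` could be added to the same dictionary if that package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for interaction feature vectors (assumption: the real
# inputs would be features computed from HCV-human protein pairs).
X, y = make_classification(
    n_samples=600, n_features=60, n_informative=12, random_state=0
)

# Hold out an independent test set, mirroring the abstract's evaluation setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def with_selection(classifier):
    """Chain scaling, extra-trees feature selection, and a classifier."""
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=200, random_state=0)
    )
    return make_pipeline(StandardScaler(), selector, classifier)


# Candidate models; hyperparameters are illustrative, not the study's.
candidates = {
    "SVM": with_selection(SVC(kernel="rbf", probability=True, random_state=0)),
    "Random Forest": with_selection(
        RandomForestClassifier(n_estimators=200, random_state=0)
    ),
    "Adaboost": with_selection(
        AdaBoostClassifier(n_estimators=200, random_state=0)
    ),
}

# Fit each pipeline and score it by AUC on the held-out set.
auc_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    positive_prob = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, positive_prob)

best_model = max(auc_scores, key=auc_scores.get)

for name, auc in auc_scores.items():
    print(f"{name}: AUC = {auc:.4f}")
print(f"Best model on the held-out set: {best_model}")
```

Wrapping the selector and classifier in one pipeline means feature selection is fitted on the training split only, so the held-out AUC is not inflated by information from the test samples.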