TY - GEN
T1 - An AutoML Approach for Predicting Risk of Progression to Active Tuberculosis based on Its Association with Host Genetic Variations
AU - Dou, Wanying
AU - Liu, Yihang
AU - Liu, Zehai
AU - Yerezhepov, Dauren
AU - Kozhamkulov, Ulan
AU - Akilzhanova, Ainur
AU - Dib, Omar
AU - Chan, Chee Kai
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/10/29
Y1 - 2021/10/29
N2 - Tuberculosis (TB) is a worldwide health challenge. Mycobacterium tuberculosis (M.tb) is capable of evading the host immune system, which can lead to tuberculosis infection. Household contacts (HHCs) of TB cases have a higher risk of infection. Novel predictive techniques to identify high-risk, TB-susceptible groups are needed. Susceptibility to tuberculosis is associated with host genetic variations. This research work uses the TPOT AutoML tool to model the relationship between genetic variations and TB infection status mathematically. Machine learning was employed to predict the risk of progression to active tuberculosis based on associated host genetic variation. Among the three adopted configurations ("TPOT Default", "TPOT sparse", and "TPOT N"), "TPOT Default" and "TPOT sparse" produced the same best performance, both reaching a training CV score of 0.816 and a testing accuracy of 0.625. Different gene variants identified using this approach were found to contribute differently to TB infection, as reflected in the feature importance of the classifier. The feature importance of the random forest classifier pipeline in "TPOT sparse" was adopted. The top ten contributing genes were also submitted to Enrichr for gene pathway enrichment analysis. The identified enriched pathways have been shown to be key to TB infection.
AB - Tuberculosis (TB) is a worldwide health challenge. Mycobacterium tuberculosis (M.tb) is capable of evading the host immune system, which can lead to tuberculosis infection. Household contacts (HHCs) of TB cases have a higher risk of infection. Novel predictive techniques to identify high-risk, TB-susceptible groups are needed. Susceptibility to tuberculosis is associated with host genetic variations. This research work uses the TPOT AutoML tool to model the relationship between genetic variations and TB infection status mathematically. Machine learning was employed to predict the risk of progression to active tuberculosis based on associated host genetic variation. Among the three adopted configurations ("TPOT Default", "TPOT sparse", and "TPOT N"), "TPOT Default" and "TPOT sparse" produced the same best performance, both reaching a training CV score of 0.816 and a testing accuracy of 0.625. Different gene variants identified using this approach were found to contribute differently to TB infection, as reflected in the feature importance of the classifier. The feature importance of the random forest classifier pipeline in "TPOT sparse" was adopted. The top ten contributing genes were also submitted to Enrichr for gene pathway enrichment analysis. The identified enriched pathways have been shown to be key to TB infection.
KW - Genetic Variation
KW - Machine Learning
KW - Tuberculosis
UR - http://www.scopus.com/inward/record.url?scp=85124338318&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85124338318&partnerID=8YFLogxK
U2 - 10.1145/3498731.3498743
DO - 10.1145/3498731.3498743
M3 - Conference contribution
AN - SCOPUS:85124338318
T3 - ACM International Conference Proceeding Series
SP - 82
EP - 88
BT - ICBBS 2021 - Proceedings of 2021 10th International Conference on Bioinformatics and Biomedical Science
PB - Association for Computing Machinery
T2 - 10th International Conference on Bioinformatics and Biomedical Science, ICBBS 2021
Y2 - 29 October 2021 through 31 October 2021
ER -