
Establishment of early diagnosis models for cervical precancerous lesions using large-scale cervical cancer screening datasets



Human papillomavirus (HPV) DNA testing is used in cervical cancer screening as an effective cancer prevention strategy. The HPV viral load reported by different assays has attracted increasing attention for its potential value in diagnosing disease and tracking its progression.


In this study, three HPV testing datasets were assessed and compared: Hybrid Capture 2 (HC2; n = 31,954), Aptima HPV E6E7 (n = 3269) and HPV Cobas 4800 (n = 13,342). Logistic regression models for diagnosing early cervical lesions were established on each of the three datasets and compared. The best combination of variables (viral load, VL, plus bacterial vaginosis, BV) and the best dataset (HC2) were then used to establish six machine learning models. The models were evaluated and compared, and the best-performing model was validated.


Viral load was significantly correlated with cervical lesion stage in all three datasets. Viral load and bacterial vaginosis formed the best variable combination for the logistic regression models, and models based on the HC2 dataset outperformed those based on the other two datasets. Among the machine learning methods, Xgboost produced the highest AUC values: 0.915, 0.9529, 0.9557, and 0.9614 for diagnosing ASCUS higher, ASC-H higher, LSIL higher, and HSIL higher staged cervical lesions, respectively, indicating acceptable accuracy of the selected diagnostic model.


Our study demonstrates that HPV viral load and BV status are significantly associated with the early stages of cervical lesions. The best-performing models can serve as a useful tool to help diagnose cervical lesions early.


Cervical cancer is the second most severe cancer in women worldwide: in 2018, 570,000 women were diagnosed and 311,365 women died, despite the worldwide use of early screening for the disease or for the presence of human papillomavirus (HPV) [1]. An estimated 44.4 million cervical cancer cases will be diagnosed globally over the period 2020–2069 [2]. Commonly used screening methods include the HPV test, the thin prep cytological test (TCT), and co-testing with HPV and TCT [3]. TCT has a lower false-positive rate and a higher false-negative rate than the HPV test, whereas the HPV test may lead to more unnecessary referrals to colposcopy [4]. As more HPV and TCT co-testing programmes were applied and compared [5,6,7,8], the WHO changed its cervical cancer screening guideline and listed the HPV DNA test as the first recommended screening method.

Currently, HPV test results are generally reported qualitatively as HPV positive or negative, based on the cut-off value of the assay used. However, accumulated HPV screening data show that HPV viral load could add valuable information as a screening triage marker. For example, Manawapat-Klopfer et al. identified a significant correlation of HPV viral load and integration status with high-grade squamous intraepithelial lesion (HSIL) [9]. Dong et al. found that the 10-year cumulative incidence rate of cervical intraepithelial neoplasia (CIN2+) was associated with cytological lesions and viral load, and recommended viral load as a triage marker for women positive for non-16/18 high-risk HPV (hrHPV) [10]. A recent study also indicated that HPV viral load was positively correlated with cervical lesion grade, based on the cervical cancer screening results of 8556 women [11]. Beyond its potential as a triage marker, HPV viral load is also a potential indicator of disease progression: cervical cancer patients with a high HPV viral load had a significantly lower 15-year survival rate, a more advanced International Federation of Gynaecology and Obstetrics (FIGO) stage, and an increased recurrence rate [12]. However, inconsistent conclusions across studies about the triage and predictive value of viral load have restrained its use in clinical settings [13]. One likely reason for the inconsistency is that different diagnostic laboratories use different methods, as shown by a few small HPV viral load studies based on Hybrid Capture 2 (HC2) [14], Aptima E6E7 [15], and Cobas 4800 [16].

In this study, we retrospectively compared our cervical cancer screening results from the 3 HPV testing platforms (HC2, Aptima E6E7, and HPV Cobas 4800) together with the accompanying TCT results. A model for predicting different levels of cervical lesions was established by integrating potential cervical cancer risk factors, such as HPV infection status, HPV viral load, age, bacterial vaginosis, and fungus infection.

Materials and methods

Patients and data collection

In total, 48,565 individuals were tested by both TCT and one of the 3 HPV testing methods (31,954 individuals tested by HC2, 3269 individuals tested by Aptima E6E7, and 13,342 individuals tested by Cobas 4800) from 2016 to 2019 in our laboratory, a CAP- and ISO15189-accredited reference laboratory in Guangzhou, China (Fig. 1). The cases were collected in three datasets, named Dataset HC2, Dataset E6E7, and Dataset Cobas, respectively. The institutional review board of KingMed Diagnostics approved the study with code 022.

Fig. 1 Flow chart diagram of study design and data analysis procedures

HPV testing

The HC2 assay (Hybrid Capture 2 High-Risk HPV DNA Test, Digene Corporation, Gaithersburg, MD, USA) detects 13 hrHPV subtypes (HPV 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59 and 68) and provides an HPV positive or negative result by comparing the reading value with the cutoff value, RLU/CO > 1.0. The Aptima HPV assay (Hologic, Marlborough, MA, USA) targets E6E7 mRNA expression of 14 hrHPV subtypes (HPV 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 66 and 68) using transcription mediated amplification (TMA). The Roche Cobas 4800 HPV DNA assay (Pleasanton, CA, USA) is a real-time PCR-based assay that detects HPV16, HPV18, and 12 other hrHPV subtypes (HPV31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 66, and 68).

TCT testing-liquid-based cytology

Collected specimens were automatically processed and converted to cytological specimens using the ThinPrep method from Hologic (Bedford, MA, USA) [17]. Prepared specimens were evaluated independently by at least 2 certified cytopathologists. Results were classified as: negative for intraepithelial lesion or malignancy (NILM); atypical squamous cells of undetermined significance (ASCUS); atypical squamous cell cannot exclude high-grade squamous intraepithelial lesion (ASC-H); low-grade squamous intraepithelial lesion (LSIL); high-grade squamous intraepithelial lesion (HSIL) [18]. Patients with a diagnosis of AGUS or cervical cancer were excluded from the study because too few such individuals were identified. BV and fungal infections were determined by pathologists from the TCT results.

Data processing

Each of the 3 HPV platform datasets was divided into two datasets: an all cases dataset (ACD) and a dataset with only HPV positive cases (POS). HPV viral load values were taken from the value reported by each method: RLU/CO for HC2, S/CO for Aptima E6E7, and PCR cycle number for Cobas 4800.
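The split into ACD and POS can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Case` record, its field names, and the helper functions are hypothetical and are not taken from the study's pipeline; the only value drawn from the text is the HC2 cutoff of RLU/CO > 1.0.

```python
from dataclasses import dataclass

# Cutoff quoted in the text for HC2: RLU/CO > 1.0 is reported as HPV positive.
HC2_CUTOFF = 1.0


@dataclass
class Case:
    """Hypothetical record for one screened individual (illustrative fields)."""
    viral_load: float   # RLU/CO for an HC2 result
    cytology: str       # "NILM", "ASCUS", "ASC-H", "LSIL" or "HSIL"
    hpv_positive: bool


def make_hc2_case(rlu_co: float, cytology: str) -> Case:
    """Build a case and derive HPV status from the HC2 cutoff."""
    return Case(viral_load=rlu_co, cytology=cytology,
                hpv_positive=rlu_co > HC2_CUTOFF)


def split_acd_pos(cases):
    """Return (all cases dataset, HPV positive cases dataset)."""
    acd = list(cases)
    pos = [c for c in acd if c.hpv_positive]
    return acd, pos


cases = [
    make_hc2_case(0.4, "NILM"),
    make_hc2_case(12.5, "LSIL"),
    make_hc2_case(250.0, "HSIL"),
]
acd, pos = split_acd_pos(cases)
print(len(acd), len(pos))
```

For Aptima E6E7 and Cobas 4800 the same structure would apply with S/CO or Ct as the viral load field; their positivity cutoffs are not stated in the text and are therefore left out.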

Risk factors selection and model establishment

Each original dataset was divided into 2 datasets: a training dataset containing 80% of the cases and a validation dataset containing the remaining 20%. The synthetic minority over-sampling technique (SMOTE), as implemented in the DMwR package, was applied to balance the data before model establishment. Pearson's correlation coefficient was used to determine the association of viral load, age, HPV infection status, BV, and fungus infection with the cytology diagnostic stages (ASCUS higher, ASC-H higher, LSIL higher, HSIL higher). Different combinations of the significantly correlated variables were then used for logistic regression model analysis, and models were compared using the area under the curve (AUC) of each receiver operating characteristic (ROC) curve. Besides logistic regression, five more machine learning methods (decision tree, Xgboost, random forest, support vector machines (SVM), and neural net) were applied to build models using the Rattle package with default parameters.
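To make the 80/20 split and the over-sampling step concrete, the following is a minimal, standard-library Python sketch. It is not the DMwR implementation used in the study: `smote_like` is a simplified, hypothetical version of the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), and all function names, parameters, and sample values are assumptions made for illustration.

```python
import random


def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows and split them into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (simplified SMOTE).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Illustrative (viral load, BV) pairs for a minority class; values are made up.
minority = [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0), (4.0, 1.0)]
new_points = smote_like(minority, n_new=6)
train, validation = train_validation_split(list(range(100)))
print(len(train), len(validation), len(new_points))
```

Note that plain interpolation turns a binary factor such as BV into fractional values; production SMOTE variants handle categorical features separately, which this sketch does not attempt.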


Data sets characteristics and comparisons

All diagnostic results and related information are summarized in Table 1. Across the three platforms, the overall HPV positive detection rate was 46.64% (22,654/48,565): 59.10% (18,878/31,954) by HC2, 25.52% (3406/13,342) by Cobas 4800, and 11.31% (370/3269) by Aptima E6E7. Of the TCT results, NILM represented about 80% of the cases assayed, followed by LSIL (14%), ASCUS (7%), HSIL (3%), and ASC-H (2.6%). The proportions of cases in the different TCT stages were similarly distributed among all 3 platform datasets (Additional file 1: Supplemental Fig. 1).
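The detection rates follow directly from the counts quoted above. The short Python sketch below recomputes them from those counts; the dictionary layout and function name are illustrative assumptions, and the recomputed percentages may differ from the reported ones in the second decimal place because of rounding.

```python
# (HPV positive cases, total cases) per platform, as quoted in the text.
counts = {
    "HC2": (18878, 31954),
    "Cobas 4800": (3406, 13342),
    "Aptima E6E7": (370, 3269),
}


def detection_rate(positive: int, total: int) -> float:
    """Positive detection rate as a percentage."""
    return 100.0 * positive / total


total_positive = sum(p for p, _ in counts.values())
total_cases = sum(t for _, t in counts.values())
overall_rate = detection_rate(total_positive, total_cases)

for platform, (positive, total) in counts.items():
    print(f"{platform}: {detection_rate(positive, total):.2f}%")
print(f"Overall: {overall_rate:.2f}% ({total_positive}/{total_cases})")
```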

Table 1 Demographic data of patients collected in the three datasets

Viral loads showed an increasing trend with advancing cytology stage in each of the 3 HPV datasets (Fig. 2 and Additional file 1: Supplemental Fig. 2). In HC2 ACD, viral load values differed significantly between every pair of stages except ASCUS and ASC-H. Compared with the other two platform datasets, the HC2 dataset showed more significant differences between TCT stages, in both the ACD and the positive dataset. For the Cobas assay, the Ct value was used as the viral load value, and the three types of HPV positive cases were shown separately: other type HPV (HPV OT), HPV16, and HPV18.

Fig. 2 Distribution of viral load value with cervical lesion stages of the three platform ACDs. a HC2. b E6E7. c Cobas

Correlations between variable factors

Correlation analysis was carried out to analyse the relationship between each pair of factors (Additional file 1: Supplemental Table 2). We observed the following: (1) viral load was significantly correlated with cervical lesion stage in all 3 datasets; (2) age was significantly correlated with cervical lesion stage in the HC2 and Cobas datasets; (3) viral load was significantly correlated with BV infection in HC2 ACD and E6E7 ACD, but not in the POS of E6E7 and Cobas; (4) fungus infection was significantly correlated with age, but not with viral load or BV, in all three platform datasets; (5) BV and age were not significantly correlated in most datasets, the exception being HC2 POS. The detailed results are shown in Additional file 1: Supplemental Tables 2 and 3.
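As a reminder of what the pairwise analysis computes, here is a small Python sketch of Pearson's correlation coefficient. The ordinal coding of cytology stages and the sample values are assumptions made purely for illustration (the text does not state how stages were encoded), and the sketch omits the significance test.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Assumed ordinal coding, following the stage order used in the text.
STAGE_CODE = {"NILM": 0, "ASCUS": 1, "ASC-H": 2, "LSIL": 3, "HSIL": 4}

# Made-up viral load values, one per case, with the matching cytology stage.
viral_loads = [0.5, 3.0, 20.0, 45.0, 300.0]
stages = ["NILM", "ASCUS", "ASC-H", "LSIL", "HSIL"]
r = pearson_r(viral_loads, [STAGE_CODE[s] for s in stages])
print(round(r, 3))
```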

Table 2 Performance summary of models established by six ML methods in terms of PPV, NPV, Sensitivity, Specificity, Accuracy, Precision
Table 3 AUC value of the best two models established by Xgboost with test dataset analysis

Logistic regression models build on different factor combinations

A logistic regression model was established for each test dataset, with every precancerous stage and higher as a diagnostic endpoint. Different combinations of the risk factors viral load, BV, and age were used to build the regression equation. The AUC value of each model and the pairwise comparisons between variable combinations are summarized in Additional file 1: Supplemental Table. To address data imbalance, SMOTE was applied to balance the data for each cervical lesion stage. The results showed that: (1) models built on HC2 ACD and POS performed significantly better than models built on the other two platform datasets (Additional file 1: Supplemental Table 5); (2) HC2 POS and ACD models with HPV viral load and bacterial vaginosis as variables performed significantly better than models using viral load (VL) only or VL with age (Additional file 1: Supplemental Table 6). ROC curves of the ACD models for each platform are shown in Fig. 3. Model performance varied with the cervical lesion stage used as the diagnostic endpoint. HC2 models performed best (AUC = 0.9467) with LSIL higher as the endpoint. E6E7 (AUC = 0.9341) and Cobas OT (AUC = 0.9038) models performed best with ASC-H higher as the endpoint, whereas Cobas 16 models performed best (AUC = 0.9915) with HSIL higher as the endpoint. In summary, models built on the HC2 platform with BV and VL as variables performed better than models built on the other two platform datasets.
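The "stage and higher" endpoints and the form of a two-variable logistic model can be illustrated as follows. This is a sketch under assumptions: the stage ordering follows the order in which the endpoints are listed in the text, and the intercept and coefficients are placeholder values, not the fitted parameters of the study's models.

```python
import math

# Stage order assumed from the order of the endpoints in the text.
STAGE_ORDER = ["NILM", "ASCUS", "ASC-H", "LSIL", "HSIL"]


def endpoint_label(stage: str, threshold: str) -> int:
    """Return 1 if `stage` is at or above `threshold` (e.g. "LSIL higher")."""
    return int(STAGE_ORDER.index(stage) >= STAGE_ORDER.index(threshold))


def logistic_probability(vl: float, bv: int,
                         intercept: float, beta_vl: float, beta_bv: float) -> float:
    """Predicted probability from a logistic model with VL and BV as variables."""
    z = intercept + beta_vl * vl + beta_bv * bv
    return 1.0 / (1.0 + math.exp(-z))


# Binarize a few cases against the "ASC-H higher" endpoint.
labels = [endpoint_label(s, "ASC-H") for s in ["NILM", "ASCUS", "ASC-H", "HSIL"]]
print(labels)

# Placeholder coefficients, chosen only to show the call; not fitted values.
p = logistic_probability(vl=10.0, bv=1, intercept=-3.0, beta_vl=0.2, beta_bv=0.5)
print(round(p, 3))
```

One binary label vector is produced per endpoint, so four separate models (ASCUS higher, ASC-H higher, LSIL higher, HSIL higher) would be fitted on the same features.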

Fig. 3 ROC curve of logistic regression model established by VL and BV variables using all data sets of the three platforms. a HC2. b E6E7. c Cobas_OT. d Cobas_16

Establishment and comparison of machine learning models

To establish the best model for diagnosing early cervical lesion stages, six machine learning methods were applied to HC2 ACD and POS with VL and BV as variables. AUC, PPV, NPV, accuracy, sensitivity, and specificity were analysed to evaluate model performance (Table 2), and the methods were compared with one another (Additional file 1: Supplemental Table 7). Xgboost models had the highest AUC values of the six methods in both ACD and POS: 0.915, 0.953, 0.956, and 0.961 in ACD and 0.860, 0.910, 0.924, and 0.929 in POS for ASCUS higher, ASC-H higher, LSIL higher, and HSIL higher, respectively. The ROC curves of the Xgboost models for each diagnostic endpoint are shown in Fig. 4. The AUC values differed significantly between ACD and POS. In HC2 ACD, the Xgboost models had sensitivities of 0.826 (ASCUS higher), 0.914 (ASC-H higher), 0.925 (LSIL higher), and 0.952 (HSIL higher) and specificities of 0.838 (ASCUS higher), 0.845 (ASC-H higher), 0.849 (LSIL higher), and 0.838 (HSIL higher). The sensitivity and specificity of the Xgboost models were significantly lower in HC2 POS than in ACD (sensitivity, P = 0.007; specificity, P = 0.05).
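Since AUC is the main criterion for comparing the models, a compact way to see what it measures may help. The Python sketch below computes AUC through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The scores and labels are made up, and this pairwise version is written for clarity rather than speed; it is not the routine used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for cases at or above the endpoint and 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up model scores for four cases.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example_auc)
```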

Fig. 4 ROC curves of the models built by the six machine learning methods using the HC2 dataset. a ASCUS higher. b ASC-H higher. c LSIL higher. d HSIL higher

Validation of the best HC2 models

To further validate the Xgboost model, we collected a new batch of HC2 HPV testing data, consisting of 3932 NILM, 148 ASCUS, 28 ASC-H, 62 LSIL, and 15 HSIL patients, and evaluated the models on the all cases and positive datasets. The results are summarized in Table 3. On this new set of HC2 results, the Xgboost diagnostic models predicted the cytologic stage of the patient with acceptable AUC values: 0.8200 for ASCUS higher, 0.9385 for ASC-H higher, 0.9413 for LSIL higher, and 0.9293 for HSIL higher in the test ACD model, and 0.7176, 0.7285, 0.7210, and 0.7336 for the same endpoints in the test positive dataset. The ACD model performed better than the positive dataset model, with specificity ranging from 0.9547 to 0.9577 and sensitivity ranging from 0.5020 to 0.6484.
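The sensitivity, specificity, PPV, and NPV values reported for the models all derive from a confusion matrix at a chosen decision threshold. The following Python sketch shows those definitions; the predictions and labels are invented for illustration and the function names are assumptions, not part of the study's code.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives from binary sequences."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics; assumes no denominator is zero."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Invented predictions against invented endpoint labels.
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(predicted, actual)
metrics = diagnostic_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print({name: round(value, 3) for name, value in metrics.items()})
```

In a screening population dominated by NILM cases, as in the validation batch above, PPV in particular depends strongly on prevalence, which is one reason to read it alongside sensitivity and specificity.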


The mean HPV VL in each cytology stage increased with the severity of the cervical lesion grade, consistent with previous findings and supporting the reliability of our conclusion [10, 19]. However, the associations of subtype-specific HPV VL with cervical lesions have been inconsistent across studies. Luo Hongxue reported that the viral load of HPV16/18 could be used as a triage marker for HPV-positive women, while Dong Li's research found that it could not [10, 14]. The disagreement might be caused by methodological limitations of the studies or by genuinely different viral load distributions of each HPV subtype in different populations. Our comparison of platforms within a single study, which is seldom done, showed that although the VL trend was similar among platforms, the distribution of viral load in each disease stage and the correlation coefficients among factors still differed. This indicates that different methods have different detection ranges and therefore reflect the true viral load of a sample differently. A method with a broader detection range and a lower limit of detection should therefore be preferred for viral load studies.

The cervical microbiome has been found to be affected by HPV infection [20], and the presence of BV has been reported to be associated with HPV infection and persistence [21, 22]. BV and another factor, multiple sexual partners, have been combined to predict CIN/CC status [23]. We observed a significant association between BV and the HSIL cytologic stage in our HC2 dataset, consistent with previous reports [24, 25]. These results support our model comparison, which indicated that BV and VL are the two factors that give the models the best accuracy. Although BV status in our data was retrieved from cytologic diagnosis results, the finding also points to the potential of DNA test assays or tools that detect both factors at the same time and collect information usable for cervical lesion prediction. A method for simultaneously detecting HPV infection and the microbiome of cervical samples has been developed in another study [26], which underlines the value of detecting both factors in preventing cervical cancer development. Many factors can affect cervical cancer development, and the correlations among them are not fully understood, so further exploration is necessary. The correlation analysis of risk factors in our study found more significant correlations in specific population groups, which suggests that different models with different specific factors might be established in the future to obtain more accurate results for clinical application.

Of the 3 HPV test platforms, Cobas 4800 is the only one that can differentiate HPV16, HPV18, and HPV OT, enabling us to analyse the correlations between the viral loads of HPV subtypes and the severity of the cervical lesions caused by HPV. Our results showed that viral load in cases with HPV16 infection increased more markedly with advancing cervical lesion stage than in cases with HPV18 or HPV OT, consistent with a previous report [27]. If true correlations between the viral loads of HPV subtypes and the cervical lesions caused by these viruses can be demonstrated, it might become possible to diagnose people with similar conditions accurately using viral load and other variables, without necessarily referring them to pathologists [28].

This study indicated that: (1) HPV viral load values generated by the HC2 platform are better suited to diagnostic model establishment than those from the other two platforms, Aptima E6E7 and Cobas; (2) sample balancing with SMOTE improved model performance on our unbalanced datasets, which came from cervical cancer screening and contained a significantly higher percentage of normal samples than abnormal samples. Similar results have been reported showing that pre-processing datasets with SMOTE can improve model accuracy by avoiding the bias caused by imbalanced datasets [29]. The AUC values of other diagnostic models for diagnosing CIN2+ were reported as 0.895 and 0.64 in Tuerxun's study and Xiao's study, respectively [30, 31]. By comparison, the AUC value of our model for HSIL prediction is 0.9293.

In summary, our results provide valuable information for evaluating HPV viral load in clinical diagnostic applications. We also showed that it is feasible to predict the cytological stage with a diagnostic model based on viral load and other factors, which is especially relevant in areas lacking sufficient pathology resources. Cervical cancer mainly occurs in low-income countries, which often lack high-quality clinical resources, including clinicians and equipment. Our model, with its accurate diagnostic prediction, therefore supports clinical application in such settings. However, because HPV test methods differ significantly, more studies are needed to standardize the best way of diagnosing with models. Based on our study, the PCR-free method might be a better choice in this scenario. Furthermore, future studies combining patient information, cervical cancer screening results, colposcopy diagnoses, and management information should be carried out to evaluate the application value of our model.


Using clinical laboratory cervical cancer screening datasets, and after evaluating the optimal dataset, machine learning method, and variables, we defined early diagnostic models for four cervical lesion stages. This is the first study to use BV and HPV VL to predict the cytological diagnosis of cervical lesions, and the accuracy of the prediction was shown to be superior to that of other clinical characteristics. Furthermore, machine learning models built on HPV VL and BV, especially the Xgboost model, demonstrated excellent performance in identifying cervical precancerous lesions at different stages. These promising findings support early diagnosis of cervical lesions in clinical applications, especially in scenarios with limited pathology resources.

Availability of data and materials

To protect patient privacy, the data cannot be made publicly available but can be obtained from Shihui Yu or Bo Meng upon reasonable request.



Abbreviations

HPV: Human papillomavirus
HC2: Hybrid Capture 2
TCT: Thin prep cytologic test
BV: Bacterial vaginosis
VL: Viral load
ACD: All cases dataset
POS: HPV positive cases dataset
TMA: Transcription mediated amplification
AUC: Area under the curve
ROC: Receiver operating characteristic
PPV: Positive prediction value
NPV: Negative prediction value
SVM: Support vector machines
Xgboost: Extreme gradient boosting
SMOTE: Synthetic minority over-sampling technique
NILM: Negative for intraepithelial lesion or malignancy
ASCUS: Atypical squamous cells of undetermined significance
ASC-H: Atypical squamous cell cannot exclude high-grade squamous intraepithelial lesion
LSIL: Low-grade squamous intraepithelial lesion
HSIL: High-grade squamous intraepithelial lesion


  1. Bruni L AG, Serrano B, Mena M, Collado JJ, Gómez D, Muñoz J, Bosch FX, de Sanjosé S. Human papillomavirus and related diseases in the world. ICO/IARC Information Centre on HPV and Cancer; 2019.

  2. Simms KT, Steinberg J, Caruana M, Smith MA, Lew JB, Soerjomataram I, et al. Impact of scaled up human papillomavirus vaccination and cervical screening and the potential for global elimination of cervical cancer in 181 countries, 2020–99: a modelling study. Lancet Oncol. 2019;20(3):394–407.


  3. Xie F, Zhang L, Zhao D, Wu X, Wei M, Zhang X, et al. Prior cervical cytology and high-risk HPV testing results for 311 patients with invasive cervical adenocarcinoma: a multicenter retrospective study from China’s largest independent operator of pathology laboratories. BMC Infect Dis. 2019;19(1):962.


  4. Wright TC, Stoler MH, Behrens CM, Sharma A, Zhang G, Wright TL. Primary cervical cancer screening with human papillomavirus: end of study results from the ATHENA study using HPV as the first-line screening test. Gynecol Oncol. 2015;136(2):189–97.


  5. Zorzi M, Del Mistro A, Farruggio A, de’Bartolomeis L, Frayle-Salamanca H, Baboci L, et al. Use of a high-risk human papillomavirus DNA test as the primary test in a cervical cancer screening programme: a population-based cohort study. BJOG: Int J Obstet Gynaecol. 2013;120(10):1260–7 (discussion 7-8).


  6. Isidean SD, Mayrand MH, Ramanakumar AV, Gilbert L, Reid SL, Rodrigues I, et al. Human papillomavirus testing versus cytology in primary cervical cancer screening: end-of-study and extended follow-up results from the Canadian cervical cancer screening trial. Int J Cancer. 2016;139(11):2456–66.


  7. US Preventive Services Task Force, Curry SJ, Krist AH, Owens DK, Barry MJ, Caughey AB, et al. Screening for cervical cancer: US Preventive Services Task Force recommendation statement. JAMA. 2018;320(7):674–86.


  8. Thomsen LT, Kjaer SK, Munk C, Frederiksen K, Ornskov D, Waldstrom M. Clinical performance of human papillomavirus (HPV) testing versus cytology for cervical cancer screening: results of a large Danish implementation study. Clin Epidemiol. 2020;12:203–13.


  9. Manawapat-Klopfer A, Wang L, Haedicke-Jarboui J, Stubenrauch F, Munk C, Thomsen LT, et al. HPV16 viral load and physical state measurement as a potential immediate triage strategy for HR-HPV-infected women: a study in 644 women with single HPV16 infections. Am J Cancer Res. 2018;8(4):715–22.


  10. Dong L, Wang MZ, Zhao XL, Feng RM, Hu SY, Zhang Q, et al. Human papillomavirus viral load as a useful triage tool for non-16/18 high-risk human papillomavirus positive women: a prospective screening cohort study. Gynecol Oncol. 2018;148(1):103–10.


  11. Luo H, Belinson JL, Du H, Liu Z, Zhang L, Wang C, et al. Evaluation of viral load as a triage strategy with primary high-risk human papillomavirus cervical cancer screening. J Low Genit Tract Dis. 2017;21(1):12–6.


  12. Cao M, Wang Y, Wang D, Duan Y, Hong W, Zhang N, et al. Increased high-risk human papillomavirus viral load is associated with immunosuppressed microenvironment and predicts a worse long-term survival in cervical cancer patients. Am J Clin Pathol. 2020;153(4):502–12.


  13. Malagón T, Louvanto K, Ramanakumar AV, Koushik A, Coutlée F, Franco EL. Viral load of human papillomavirus types 16/18/31/33/45 as a predictor of cervical intraepithelial neoplasia and cancer by age. Gynecol Oncol. 2019;155(2):245–53.


  14. Luo H, Du H, Belinson JL, Wu R. Evaluation of alternately combining HPV viral load and 16/18 genotyping in secondary screening algorithms. PLoS ONE. 2019;14(7): e0220200.


  15. Utaipat U, Siriaunkgul S, Supindham T, Saokhieo P, Chaidaeng B, Wongthanee A, et al. Association of cytologic grade of anal “Pap” smears with viral loads of human papillomavirus types 16, 18, and 52 detected in the same specimens from men who have sex with men. J Clin Virol: Off Publ Pan Am Soc Clin Virol. 2016;85:48–55.


  16. Álvarez-Argüelles ME, de Oña-Navarro M, Rojo-Alba S, Torrens-Muns M, Junquera-Llaneza ML, Antonio-Boga J, et al. Quantification of human papilloma virus (HPV) DNA using the Cobas 4800 system in women with and without pathological alterations attributable to the virus. J Virol Methods. 2015;222:95–102.


  17. Zheng B, Austin RM, Liang X, Li Z, Chen C, Yan S, et al. Bethesda System reporting rates for conventional Papanicolaou tests and liquid-based cytology in a large Chinese, College of American Pathologists-certified independent medical laboratory: analysis of 1394389 Papanicolaou test reports. Arch Pathol Lab Med. 2015;139(3):373–7.


  18. Sekine J, Nakatani E, Hideshima K, Iwahashi T, Sasaki H. Diagnostic accuracy of oral cancer cytology in a pilot study. Diagn Pathol. 2017;12(1):27.


  19. Basu P, Muwonge R, Mittal S, Banerjee D, Ghosh I, Panda C, et al. Implications of semi-quantitative HPV viral load estimation by Hybrid capture 2 in colposcopy practice. J Med Screen. 2016;23(2):104–10.


  20. Shannon B, Yi TJ, Perusini S, Gajer P, Ma B, Humphrys MS, et al. Association of HPV infection and clearance with cervicovaginal immunology and the vaginal microbiota. Mucosal Immunol. 2017;10(5):1310–9.


  21. Gillet E, Meys JF, Verstraelen H, Bosire C, De Sutter P, Temmerman M, et al. Bacterial vaginosis is associated with uterine cervical human papillomavirus infection: a meta-analysis. BMC Infect Dis. 2011;11:10.


  22. Kero K, Rautava J, Syrjänen K, Grenman S, Syrjänen S. Association of asymptomatic bacterial vaginosis with persistence of female genital human papillomavirus infection. Eur J Clin Microbiol Infect Dis: Off Publ Eur Soc Clin Microbiol. 2017;36(11):2215–9.


  23. Huang Y, Wu X, Lin Y, Li W, Liu J, Song B. Multiple sexual partners and vaginal microecological disorder are associated with HPV infection and cervical carcinoma development. Oncol Lett. 2020;20(2):1915–21.


  24. Dahoud W, Michael CW, Gokozan H, Nakanishi AK, Harbhajanka A. Association of bacterial vaginosis and human papilloma virus infection with cervical squamous intraepithelial lesions. Am J Clin Pathol. 2019;152(2):185–9.


  25. Suehiro TT, Malaguti N, Damke E, Uchimura NS, Gimenes F, Souza RP, et al. Association of human papillomavirus and bacterial vaginosis with increased risk of high-grade squamous intraepithelial cervical lesions. Int J Gynecol Cancer: Off J Int Gynecol Cancer Soc. 2019;29(2):242–9.


  26. Quan L, Dong R, Yang W, Chen L, Lang J, Liu J, et al. Simultaneous detection and comprehensive analysis of HPV and microbiome status of a cervical liquid-based cytology sample using Nanopore MinION sequencing. Sci Rep. 2019;9(1):19337.


  27. Luo X, Donnelly CR, Gong W, Heath BR, Hao Y, Donnelly LA, et al. HPV16 drives cancer immune escape via NLRX1-mediated degradation of STING. J Clin Investig. 2020;130(4):1635–52.


  28. Fu Xi L, Schiffman M, Ke Y, Hughes JP, Galloway DA, He Z, et al. Type-dependent association between risk of cervical intraepithelial neoplasia and viral load of oncogenic human papillomavirus types other than types 16 and 18. Int J Cancer. 2017;140(8):1747–56.


  29. Xie C, Du R, Ho JW, Pang HH, Chiu KW, Lee EY, et al. Effect of machine learning re-sampling techniques for imbalanced datasets in (18)F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients. Eur J Nucl Med Mol Imaging. 2020;47(12):2826–35.


  30. Tuerxun G, Yukesaier A, Lu L, Aierken K, Mijiti P, Jiang Y, et al. Evaluation of careHPV, cervista human papillomavirus, and hybrid capture 2 methods in diagnosing cervical intraepithelial neoplasia grade 2+ in Xinjiang Uyghur women. Oncologist. 2016;21(7):825–31.


  31. Keyuan Z, editor Evaluation of the diagnostic accuracy of HC2 in detecting high-grade cervical intraepithelial neoplasia; 2009.



Not applicable.


Guangzhou KingMed Transformative Medicine Institute Co., Ltd., Guangzhou, Guangdong, China. State Key Laboratory of Respiratory Disease, Guangdong-Hong Kong-Macao Joint Laboratory of Respiratory Infectious Disease. The Science and Technology Planning Project of Guangdong Province, Grant No. 2019B121205010.

Author information

Authors and Affiliations



SHY and BM conceptualized this study. BM wrote the draft. GBL did the data analysis, prepared the figures, and edited the manuscript. ZYZ, BWZ, YYX, and CL reviewed and edited the manuscript. MYL, HRW, and YLS retrieved the data. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shihui Yu.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the institutional review board of KingMed Diagnostics with code 022.

Consent for publication

All the patients have signed an informed consent form.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Other details of this study.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Meng, B., Li, G., Zeng, Z. et al. Establishment of early diagnosis models for cervical precancerous lesions using large-scale cervical cancer screening datasets. Virol J 19, 177 (2022).
