Genetic variability of human papillomavirus type 39 based on E6, E7 and L1 genes in Southwest China

Background Human papillomavirus type 39 associated with genital intraepithelial neoplasia and invasive cancers, has a high prevalence in Southwest China. HPV E6, E7 are two main papillomavirus oncoproteins, closely relate to the function of HPV immortalization, cell transformation, and carcinogenesis. L1 is the major capsid protein, can reflect the replication status of the virus in cells and the progression of cervical lesions. The purpose of this study is to reveal the prevalence of HPV 39 and the genetic polymorphisms of HPV39 based on E6, E7 and L1 gene in southwest China. Methods Cell samples were collected by cervical scraped for HPV detecting and typing, and HPV39 positive samples were selected out. Important E6, E7 and L1 genes of HPV39 were sequenced and analyzed for the study of HPV39 genetic polymorphisms. Phylogenetic trees were constructed by Maximum-likelihood and Kimura 2-parameters methods in Molecular Evolutionary Genetics Analysis version 6.0. The selection pressures of E6, E7 and L1 genes were estimated by Datamonkey web server. The secondary and three-dimensional structure of HPV39 E6, E7 proteins were created by sopma server and SWISS-MODEL software. Results 344 HPV39 positive samples were selected from 5718 HPV positive cell samples. Among HPV39 E6-E7 sequences, 20 single nucleotide mutations were detected, including 10 non-synonymous and 10 synonymous mutations; 26 single nucleotide mutations were detected in HPV39 L1 sequences, including 7 non-synonymous and 19 synonymous mutations respectively. 11 novel variants of HPV39 E6-E7 (5 in E6 and 6 in E7) and 14 novel variants of HPV39 L1 were identified in this study. A-branch was the most frequent HPV39 lineage in southwest China during our investigation. Selective pressure analysis showed that codon sites 26, 87, 151 in E6 and 75, 180, 222, 272, 284, 346, 356 in L1 were positively selected sites, as well as codon sites 45, 138, 309, 381 were negative selection sites in L1 gene, E7 has neither positive selection sites nor negative selection sites. A certain degree of secondary and three-dimensional structure dislocation was existed due to the non-synonymous mutations. Conclusions Amino acid substitution affected the secondary and three-dimensional structure of HPV39, and resulting in the differences of carcinogenic potential and biological functions as well as the immune response due to the antigen epitopes difference, the antigen epitopes with stronger adaptability in Southwest will be screened out based on the above research results for the later vaccine development. And gene polymorphism of HPV39 in Southwest China may improve the effectiveness of clinical test and vaccine design, specifically for women in Southwest China.


Introduction
Cervical cancer (CC) is the fourth leading cancer in women worldwide. The incidence of CC in developing countries is almost as twice as in developed countries, 17.8% and 9.0% respectively [1]. According to the report by the International Agency for Research on Cancer in 2018, the CC infection rate among Chinese women is 10.7 per 100,000, just below the global rate of 13.1 per 100,000, and increasing, that due to the neglect of screening [2]. Obviously, this cancer has seriously threatened the life and health of Chinese women.
HPV is a small, non-enveloped and double stranded DNA (dsDNA) virus, approximately 8,000 base pairs (bp) which contains early region (E6, E7, E1, E2, E4, and E5) and late region (L1, L2) [3]. High-risk human papilloma virus (HR-HPV) persistent infection is the leading cause of CC. E6 and E7 can inhibit the tumor suppressor pathways of p53 and pRb (retinoblastoma) respectively, that resulting in immortalization, cell transformation, and carcinogenesis, are two main papillomavirus oncoproteins [4][5][6][7]. L1 is the major capsid protein, which can reflect the replication status of virus in cells, the detection of L1 capsid protein expression can predict the progression and degeneration of cervical lesions [8].
Globally, HPV16 and HPV18 are the most common HR-HPV, persistence infection associated with those causing 70% of CC and pre-cancerous cervical lesions [12]. HPV39 belongs to α-7, has no available commercial vaccines, due to its low infection rate [13]. Nevertheless, the infection rate of different HPV genotypes varies in different geographical regions and different ages [1]. For example, in southern China, HPV infections occurs mainly in women under 30 years old, and HPV52, HPV16, HPV58, HPV18 and HPV39 are the most common HR-HPV subtypes [14]. In western China, HPV16, HPV52, HPV58, HPV33 and HPV18 are the most prevalence HR-HPV subtypes among patients with cervical intraepithelial neoplasia (CIN), HPV39 ranking eighth [15]. In our research, the detection rate of HPV39 is higher than that of HPV18 (Fig. 1). Nowadays, HPV18 has been extensively studied. While, the data available on HPV39 is still limited, especially in China. Therefore, the genetic polymorphisms, lineage phylogeny and positive selections of HPV39 based on E6, E7 and L1 genes in southwest China from a large amount of samples are sorely needed for detection of HPV-related diseases and development of HPV vaccine, as well as increased the attention to the China CC related α-7 HPV (HPV18, 39, 45, 59, 68, 70, 85 and 97).

PCR amplification and Variant identification
The primers of HPV39 E6, E7 and L1 were designed by PRIMER 5.0 based on the HPV39 reference sequences (GeneBank number: M62849) (  Tables 2 and 3 respectively.

Sequence analysis Genetic polymorphisms and phylogenetic analysis of HPV39 E6/E7/L1 genes
The E6, E7 and L1 sequences obtained in this study were aligned and compared to M62849 by MEGA 6 software. Nucleotide sequences were converted to amino acid sequence by MEGA 6, in order to determine the amino acid changes caused by nucleotide changes. Maximumlikelihood trees for HPV39 isolates were constructed by 807 bp E6-E7 and 1518 bp L1 nucleotide sequences respectively with Kimura's two-parameter model in MEGA 6. The data was bootstrap resampled 500 times for tree topology evaluation.

Selective pressure analysis of HPV39 E6/E7/L1 genes
The selection pressure of HPV39 E6, E7 and L1 were predicted by Datamokey online analysis software [16,17]. The normalized posterior mean of the dN-dS difference and the Bayesian posterior probability (dN > dS) of the positive selection for each codon position were obtained. A Bayes factor greater than 50 indicates that a site is positively selected.

The secondary and three-dimensional structure of HPV39 E6/ E7 protein
The secondary structure of main carcinogenic protein HPV E6/E7 reference sequence and the amino acid sequence with mutation were predicted by sopma server (self-optimized prediction method) [18], to analyze whether the non-synonymous mutation cause changes in the protein secondary structure or not. Sopma is a server that uses five independent methods (Levin homologous prediction method, gor method, dual prediction method, PhD method and sopma method) to predict and synthesize a "consistent prediction result".
The three-dimensional spatial structure formed by the interaction between secondary structure. The three-dimensional structure of HPV39 E6/E7 reference and mutation sequence were constructed from scratch to understand the effect of non-synonymous mutation. I-tasser (iterative thread optimization) [19] is a hierarchical method for protein structure and function prediction, it identifies the structure template from PDB by means of lomets, and constructs the full-length atom model by segment assembly simulation based on    iterative template, then, the three-dimensional model is reprogrammed into thread through protein functional database biolip to get the target protein sequence, thirdly optimized prediction model and minimized energy by Swiss-Pdb Viewer [20] to evaluate and verify the accuracy of HPV39 E6/E7 reference and mutation sequence generation model. The model was visualized, and the three-dimensional structure of reference and mutation sequence were compared to mark the amino acid mutation site for analyzing whether the gene mutation affected on the three-dimensional structure of the protein or not by PyMOL 2.2.0 [21].
10 positively selected codon sites and 4 negatively selected codon sites were identified (Tables 6 and 7). The secondary and three-dimensional structure of HPV39 E6/E7 protein Three E6 variants, 39pe02, 39pe04, 39pe12 and five E7 variants, 39pe02, 39pe04, 39pe05, 39pe09, 39pe10 were detected in our research. The secondary structure of E6 mutation sequence showed that Q26H, L51P and L151R appeared in random crimp (C), and A87G appeared in α-helix (H). Compared with reference sequence M62849, the secondary structure of L151P, L151R and Q26H changed from H to C, and the amino acids substitution also influenced the secondary structure around them (H shortened but C increased), A87G did not change the secondary structure. For HPV39 E7, R2H and D57E appeared in C, D14E, R87Q, Q88E and Q91K appeared in H, and there was no difference with M62849 secondary structure (Figs. 5 and 6).
The three-dimensional structure comparison of E6, E7 reference and mutant protein were shown in Figs. 7 and 8. For E6, the comparison chart of M62849, 39PE02 and 39PE12, L151P/R located in the surface and c-terminal of E6 protein 3D structure, the 151th amino acid substitution may lead the protein more closely to the environment; The comparison chart of M62849 and 39PE04, A87G amino acid substation located in helix, as well as Q26H, they are all exposed to the environment and near to the zinc granule. For E7, the comparison chart of M62849 and HPV39PE02, Q88E located in helix and exposed; The protein structure of M62849 and 39PE04 are difference in 57, 87, 91 position, D57E located in coil and exposed, R87Q and Q91K are both in helix and inside of the protein; The comparison chart of M62849 and HPV39PE10, Q91K is located inside of E7 protein; The comparison between M62849 and 39PE10, D14E located in helix and exposed, while Q88E located in the helix and buried. Non-synonymous mutations somewhat changed the trend of E6/E7 and carboxyl end structural disorder, especially E7.

Discussion
HR-HPV persistent infection closely related to the occurrence of cervical precancerous lesions and CC [22], its carcinogenicity is mainly realized by E6, E7 oncoproteins encoded by HPV E6, E7 genes, that lead to genetic instability and cell growth uncontrolled [23,24]. L1 major capsid proteins can be used to guide the design of prophylactic vaccines and induce the synthesis of high levels neutralizing antibodies major components, virus-like particles (VLP) [25]. HPV prevalence in southwest China during 2014-2019 was studied, demonstrating the significance of HPV39. Considering the relatively high prevalence of HPV39 in southwest China, we selected HPV39 samples to explore the gene polymorphism of E6, E7 and L1.
In this study, 344 HPV39 positive specimens were detected, accounting for 6.01% of the total positive population, higher than that of in Beijing (2.72%), ranking seventh in HR HPV infection rate and first in α-7 HPV; In the past five years, the positive rate of HPV39 was significantly increased, except 2016-2017, all suggested that the prevalence of HPV39 is relatively high in Southwest China and the environment of Southwest China is conducive to the survival of HPV39, so more attention should be paid to the prevention and control of this subtype. The HPV39 multiple infection rate is higher than that of in single infection, may due to the limitation of sample quantity, the different detection categories, and accuracy in different detection methods. Previous studies showed that HPV39 and HPV68 had the most mixed infection [26], but in our study, was HPV39 and HPV6 (18.49%), the probability of mixed infection with HPV68 was 9.11%, ranking fourth, above may due to HPV mixed infection prevalence in different regions was slightly different. Age is an important HPV infection affect factor, previous study showed that 20-24 and 50-54 years old are the two HPV detection peaks age group. The first peak may due to the relatively active sexual lives of women in this age group. The change of female hormone secretion may result in the second positive peak, with the physiological and immune dysfunction, that leads to the resurrection of latent HPV infection and reduced ability to self-return after HPV infection [3]. In our study, HPV39 infection rate was the highest in 18-28 years old, consistent with the previous study results, indicating that age is an important HPV infection rate affecting factor. However, the second peak did not appear in Southwest China, that possibly because the above two infection peaks were the statistical analysis results of all HPV-positive patients, while this study only focuses on the statistical analysis of HPV39 positive patients, fewer age 50-54 participants participate in this study and the high age group is not the HPV39 susceptible population, resulting in the peak age of HPV39 positive patients aged 50-54 years old lacking. HPV E6, E7 and L1 gene of 344 genital secretion samples were sequenced to analyze the gene variation, the distribution of HPV39 mutant and the effect of non-synonymous mutation on protein structure and function in this area. Our study showed that E7 (3.33%) had the highest mutation rate, followed by L1 (1.71%) and E6 (1.68%), indicating that E6 may be a more suitable target for HPV therapeutic vaccine than E7 and L1; Inconsistent with previous research results that E7 is a more suitable target for HPV therapeutic vaccines [13,27], that may due to the target of HPV therapeutic vaccine is different in different HPV types and the limitation of sample size. In the HPV39 E6, E7 and L1 sequence, 29 (5 in E6, 8 in E7 and 16 in L1) mutation sequences were detected, comparing with reference sequence of E6 (mk340878-mk340882), E7 (mk340883-mk340890) and L1(mk344658-mk344673). The maximum variation of E6, E7 and L1 gene in the positive samples was 1.05% (5/477), 1.52% (5/330) and 0.92% (14/1518) respectively, inconsistent with previously reported in other regions (0.63%, 0.91%, 1.12%), suggesting that the variation in HPV39 E6, E7 and L1 might be related to the distribution of regions and populations with regional characteristics.
HPV39 E6, E7 and L1 gene belongs to A1, A2 and B lineages, and A pedigree outstand (94.45% for E6-E7;  reported previously in China. In this study, 39pe04 and 39pl04 two B-pedigree mutation sequences were isolated and pathogenicity of B-pedigree was stronger than that of A-pedigree, suggesting that more attention should be paid to the prevalence and prevention of B-pedigree HPV39 in the future. Positive site selection may affect the virus infection efficiency and immunogenicity. The positive selection sites in E6 are 26, 87, 151, and in L1 are 75, 180, 222, 272, 284, 346, 356, no positive selection sites in E7 were detected, the positive selection sites may play an important role in improving the survival ability of HPV39 (e.g., resistance to immune response) and adapting to evolution. The codons 45, 138, 309 and 381 in L1 are negative selection sites, no mutation site was detected in this study, suggesting that mutation at these sites may harmful to the virus survival and evolution, so more attention should be pay to the mutation effect of these sites in the future research.
E6 and E7 are main carcinogenic protein and plays an important role in cervical lesions. Amino acid replacement may affect the carcinogenic potential of E6 and E7 protein and the biological functions of protein were mainly determined by its secondary and three-dimensional structure. The secondary structures of L151P (T558C), L151R (T558G) and Q26H (G596A) in E6 were changed from H to C by comparing with the reference sequence. These amino acid substitutions also have some effect on the secondary structures around them (H shortened and C increased). 145-149 were PDZ domaincontaining combined regional that was the target of E6 proteins for cellular transformation, crystal structures reveal that 152R-R-R-E-T-Q-V158 of E6 peptide are involved in PDZ recognition [28,29]. The regions in the half of carboxy terminal is mainly involved in the p53 binding [30], the L151P, L151R of HPV39 located in the surfaces of HPV E6 and close to the PDZ binding, recognition region that probability influence the function of P53 binding and degradation. E6 has four Cys-X-X-Cys repetitions, respectively located amino acid residues 32-35, 65-68, 105-108 and 138-141, Cys-X-X-Cys Fig. 7 Comparison of HPV39 E6 protein 3D structure prediction models Fig. 8 Comparison of HPV39 E7 protein 3D structure prediction models motifs have been suggested to coordinate metal binding and are frequently found in nucleic acid binding proteins [31]. A87G and Q26H close to the Zinc, those mutation may influence the protein binding and degradation. Meanwhile 26, 87, 151 are the positive selection sites in E6 and play an important role in improving the survival ability of HPV39, the positive selection sites screening results also verified the effects of the above mutations on protein function to some extent. For HPV39 E7, R2H and D57E appeared in C, D14E, R87Q, Q88E and Q91K appeared in H, and there was no difference with M62849 secondary structure. HPV E7 protein was divided into three main domains CR (conserved region)1, aa 1-15; CR2, aa 16-37; and CR3, aa 38-98, CR1 and CR2 mediate binding of viral oncoproteins to overlapping cellular proteins, including the product of retinoblastoma gene (pRB), cyclin A and cyclin E [32,33]. 21-29 associated with the binding of retinoblastoma (Rb) tumor suppressor, since E7 acts as the major immortalizing protein through Rb/E2F pathway, amino acid alternations in Rb binding domain may lead to the change of E7′s ability to immortalize cells among variants [34]. The CR3 domain contains two "Cys-X-X-Cys" sequences located at amino acid residues 58-61 and 91-94 participating in Zn binding, are the most important regions associated with the transforming ability of E7 protein [32,35]. The zinc finger structure at the C-terminal was reported essential for HPV16 E7 to stabilize its structure and biological function, the disruption of the 91Cys-X-X-Cys94 motif that lies towards the carboxyl terminus of E7 protein appeared to result in a greatly impaired transforming ability and reduced transactivation [32,35]. The mutations in 91th amino-acid caused the loss of the zinc-binding CR3 domain of E7, resulting in a highly unstable protein that can be quickly degraded, and generating more epitopes to be presented in the MHCI pathway, so does the Q91H [36]. R2H located in metal binding motifs, the transformation of E7 has been demonstrated to be eliminated by the mutation in 2th amino acid [35]. D14E located in CR1, mediate protein binding; D57E, R87Q, Q88E close to 58Cys-X-X-Cys61 and 91Cys-X-X-Cys94, that participating in Zn binding, the amnio acid substitution may influence the transforming ability of the E7 protein. These non-synonymous mutations slightly change the E6/E7 amino terminal and the trend of the carboxyl end structure disorder, A certain degree of secondary and threedimensional structure dislocation were existed between the reference and mutant sequence. Structure changes may lead to the different binding ability to the host p53, RB protein and other potential proteins, that may affect the pathogenicity of HPV39. Therefore, it is necessary to study the function of these non-synonymous mutations furtherly.
This study enrichment the data of HPV39 gene polymorphism and evolution in Chinese population, as well as provide a molecular biological basis for the future research analysis of HPV39 E6/E7/L1 and the design of therapeutic vaccine. However, due to the sample size limitation, the result of this study still has some shortcoming. Therefore, it is necessary to carry out multiregion and large sample population research, conduct a series of cell experiments to verify the above mutation sites function, explore the relationship between HPV39 gene polymorphism and genital lesions as well as its action mechanism.

Conclusion
HPV39 is a high prevalence HPV type in Southwest China, which is prone to mixed infection with other HPV types, especially with HPV6. 18-28 years old is the highest HPV infection rate age group. A pedigree HPV39 E6, E7 and L1 gene is dominant in southwest china, two B-pedigree mutant sequences were isolated and its pathogenicity was stronger than A-pedigree. Positive selection sites may affect the virus infection efficiency, immunogenicity and improve the survival ability of HPV39 as well as adaptation to evolution. Negative selection sites suggesting that mutation at these sites may harmful to the virus survival and evolution. A certain degree of dislocation was existed between the protein secondary and three-dimensional structure of the reference and mutant sequence. The changes of structure may lead to the different binding ability to the host p53 protein and other potential proteins, thus affect the pathogenicity of HPV39.