HIV-1 diversity in an antiretroviral treatment naïve cohort from Bushbuckridge, Mpumalanga Province, South Africa

South Africa has a generalized and explosive HIV/AIDS epidemic with the largest number of people infected with HIV-1 in the world. Molecular investigations of HIV-1 diversity can help enhance interventions to contain and combat the HIV/AIDS epidemic. However, many studies of HIV-1 diversity in South Africa tend to be limited to the major metropolitan centers and their surrounding provinces. Hardly any studies of HIV diversity have been undertaken in Mpumalanga Province, and this study sought to investigate the HIV-1 diversity in this province, as well as establish the occurrence and extent of transmitted antiretroviral drug resistance mutations. HIV-1 gag p24, pol p10 and p66/p51, pol p31 and env gp41 gene fragments from 43 participants were amplified and sequenced. Quality control on the sequences was carried out using the LANL QC online tool. HIV-1 subtype was preliminarily assigned using the REGA 3.0 and jpHMM online tools. Subtype for the pol gene fragment was further designated using the SCUEAL online tool. Phylogenetic trees were inferred using the Maximum Likelihood methods in MEGA version 6. HIV-1 antiretroviral drug resistance mutations were determined using the Stanford database. Phylogenetic analysis using Maximum Likelihood methods indicated that all sequences in the study clustered with HIV-1 subtype C. The exception was one putative subtype BC unique recombinant form. Antiretroviral drug resistance mutations K103N and E138A were also detected, indicating possible transmission of antiretroviral drug resistance mutations. The phylogenetic analysis of the HIV sequences revealed that, by 2009, patients in Bushbuckridge, Mpumalanga, were predominantly infected with HIV-1 subtype C. However, the generalized, explosive nature of the HIV/AIDS epidemic in South Africa, in the context of extensive mobility by South Africans who inhabit rural areas, renders the continued molecular monitoring and surveillance of the epidemic imperative.


Background
Human immunodeficiency virus (HIV), the etiological agent of acquired immunodeficiency syndrome (AIDS), was first isolated more than 30 years ago [1]. By 2013, an estimated 35 million people were living with HIV-1 globally, of whom 24.7 million were living in sub-Saharan Africa [2]. During this time period, the HIV-1 prevalence in South Africa was 12.2% (6.4 million people), with 469 000 new infections occurring, suggesting that the epidemic is not only generalized, but also explosive [3].
The HIV-1 epidemic in South Africa is characterized by limited subtype diversity with subtype C accounting for the majority of infections [4,5]. Other non-C subtypes, particularly subtypes B and D, have also been identified [6][7][8] as well as the occasional unique recombinant forms (URFs) [9][10][11][12][13][14][15]. Molecular epidemiological investigations in South Africa have largely focused on provinces with major metropolitan centers such as Johannesburg in Gauteng, Cape Town in the Western Cape and Durban in Kwa-Zulu Natal. No subtype information is available for the Eastern Cape, North West and Northern Cape provinces and limited information is available for the Free State, Limpopo and Mpumalanga Provinces. HIV-1 prevalence in South Africa is also characterized by extreme heterogeneity and there is considerable variation in prevalence amongst the different provinces and districts in each province [16]. The highest prevalence is in Kwa-Zulu Natal with the lowest in the Western Cape Province. South Africa not only has a generalized and explosive HIV/AIDS epidemic, its impact also varies significantly in terms of race, age, gender, and between regions of the country, with poor, young, African women in rural Kwa-Zulu Natal bearing a disproportionate burden of HIV infection [16].
The overall HIV prevalence in Mpumalanga in 2012 was 35.6% [16]. The province consists of 3 districts: Ehlanzeni, Nkangala and Gert Sibande. The Bushbuckridge Local Municipality in the Ehlanzeni District in Mpumalanga Province is a predominantly rural, impoverished area, with only 14% of the adult population employed and over 85% of households living below the household subsistence level. Half of males and 14% of females between the ages of 25 and 59 are long-term migrant workers and provide a source of remittances, which comprise the largest proportion of the income of the population of Bushbuckridge [17].
Molecular investigations of HIV diversity can help enhance interventions to contain and combat the HIV-1 epidemic. In this study, we investigated, for the first time, HIV-1 diversity in Bushbuckridge, Mpumalanga, as well as the possible occurrence and extent of transmitted antiretroviral drug resistance mutations.

Study population and RNA extraction
In preparation for HIV prevention trials, a cohort was developed for enrollment. Ethics approval was obtained from the Human Research Ethics Committees (HRECs) of the University of the Witwatersrand (M061129) and Stellenbosch University (N11/02/054), following internationally recognized guidelines. The entry point for this cohort was via a free voluntary counseling and testing service. After HIV testing, individuals were offered the opportunity to be part of the pre-screening cohort. Both HIV negative and HIV positive individuals were allowed to join the cohort in preparation for preventative and therapeutic HIV vaccine trials. Fifty-one samples were obtained with informed consent as part of this pre-screening protocol from 43 HIV positive participants in Bushbuckridge, Mpumalanga (Figure 1). RNA was extracted from stored plasma samples using a QIAamp MinElute Virus Spin Kit in a QIAcube automated extractor (QIAGEN, Dusseldorf, Germany), according to the manufacturer's instructions. RNA samples were stored at −70°C until used.
Figure 1 Geographical location of samples collected in this study. The South African map with 9 provinces is indicated and the Bushbuckridge local municipality in the Ehlanzeni district of Mpumalanga is enlarged. The "Maputo corridor" or N4 trunk roadway is highlighted in blue.

Reverse transcriptase polymerase chain reaction (RT-PCR) of HIV-1 gene fragments
Four genomic regions were targeted for amplification: the gag p24 region (HXB2 nucleotides 1248 to 1707); a part of the pol gene that includes the Protease (PR) and a partial segment of the Reverse Transcriptase (RT) region (HXB2 nucleotides 2114 to 3335); the Integrase (IN) region (HXB2 nucleotides 4202 to 5096); and the partial env gp41 region (HXB2 nucleotides 7877 to 8282). PCR amplification and purification were done using previously described primers and methods for the partial gag, pol integrase (IN) and env [18] genes. The partial pol PR/RT gene was also amplified using primers and a method previously described [19,20]. Briefly, cDNA synthesis and first round PCR amplification were done with the Access-RT PCR system (Promega, Wisconsin, USA), while second round nested PCR amplification was done with the GoTaq DNA polymerase system (Promega, Wisconsin, USA). The oligonucleotide primers used in the amplification of the gene fragments are listed in Table 1.
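As a quick check on the targeted regions, the expected length of each fragment can be derived from the HXB2 coordinates quoted above. The sketch below is illustrative only: it assumes the coordinates are 1-based and inclusive, and the names `REGIONS`, `fragment_length` and `LENGTHS` are hypothetical helpers, not part of the study's workflow.

```python
# HXB2 nucleotide coordinates as quoted in the text.
# Assumption: coordinates are 1-based and inclusive at both ends.
REGIONS = {
    "gag p24": (1248, 1707),
    "pol PR/RT": (2114, 3335),
    "pol IN": (4202, 5096),
    "env gp41": (7877, 8282),
}


def fragment_length(start, end):
    """Return the number of nucleotides spanned by an inclusive interval."""
    if end < start:
        raise ValueError("end coordinate must not precede start coordinate")
    return end - start + 1


# Expected length of each targeted fragment, in nucleotides.
LENGTHS = {name: fragment_length(*coords) for name, coords in REGIONS.items()}

if __name__ == "__main__":
    for name, length in LENGTHS.items():
        print(f"{name}: {length} nt")
```

Under this assumption the fragments span 460, 1222, 895 and 406 nucleotides respectively. The final alignments can be shorter than these spans once positions with insufficient site coverage are removed.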

Sequencing of HIV-1 gene fragments
The cycle sequencing reactions of the partial gene fragments were done with the Big Dye® Terminator v 3.1 Cycle Sequencing Kit (Applied BioSystems, Foster City, CA, USA) and run on an ABI Prism 3130xl Genetic Analyzer (Applied Biosystems, Foster City, CA, USA), according to the manufacturer's instructions. Both strands were sequenced using overlapping primers. Sequencher v 5.1 (Gene Codes Corporation, Ann Arbor, MI, USA) was used to assemble the trace data into contiguous fragments, which were then verified, edited and saved as text files for subsequent analysis. All sequences were checked for quality assurance using the Los Alamos HIV-1 Sequence Quality Analysis tool (http://www.hiv.lanl.gov/content/sequence/QC/index.html) before further analyses and submission to GenBank.
The phylogenetic trees for the different HIV-1 genetic fragments were inferred using ML methods implemented in MEGA version 6 [25]. To find the most appropriate evolutionary model for phylogenetic inference, we used Model Selection (ML) as implemented in MEGA [25]. For each model, the BIC score (Bayesian Information Criterion), the AICc value (Akaike Information Criterion, corrected), the Maximum Likelihood value (lnL), and a number of different parameters were presented. Models with the lowest BIC scores were considered to describe the substitution pattern the best [25]. For the partial pol PR/RT region, the Integrase (IN) region and the partial env gp41 region, the BIC, AICc and lnL scores indicated that the General Time Reversible model of evolution with Gamma distribution and invariant rate among sites (GTR + G + I) was the best model. For the gag region, the lnL method indicated the use of the GTR + G + I model, whereas BIC and AICc indicated the use of the TN93 + G + I model. All nucleotide positions in the alignments with less than 95% site coverage were eliminated; that is, fewer than 5% alignment gaps, missing data, and ambiguous bases were allowed at any position. The reliability of the inferred trees was evaluated using bootstrap resampling (n = 100), and branches with a bootstrap value of 70% or greater were considered reliable [26].
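The disagreement between lnL and the information criteria for the gag region follows from how the criteria penalize extra parameters. The sketch below uses the standard textbook formulas for BIC and AICc; the log-likelihoods, parameter counts and alignment length are hypothetical values chosen for illustration, not the study's results, and taking the sample size `n` to be the number of alignment sites is an assumption that may differ from what MEGA does internally.

```python
import math


def bic(ln_l, k, n):
    """Bayesian Information Criterion: -2 lnL + k ln(n)."""
    return -2.0 * ln_l + k * math.log(n)


def aicc(ln_l, k, n):
    """Akaike Information Criterion with small-sample correction."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return -2.0 * ln_l + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def best_model(models, n, criterion):
    """Return the name of the model with the lowest criterion score."""
    return min(models, key=lambda m: criterion(m["lnL"], m["k"], n))["name"]


# HYPOTHETICAL values for illustration only (not taken from the study).
# k counts all free parameters, including branch lengths.
CANDIDATES = [
    {"name": "TN93+G+I", "lnL": -5003.0, "k": 80},
    {"name": "GTR+G+I", "lnL": -5000.0, "k": 83},
]
N_SITES = 450  # assumed sample size: number of alignment positions

BEST_BY_LNL = max(CANDIDATES, key=lambda m: m["lnL"])["name"]
BEST_BY_BIC = best_model(CANDIDATES, N_SITES, bic)
BEST_BY_AICC = best_model(CANDIDATES, N_SITES, aicc)

if __name__ == "__main__":
    print("highest lnL:", BEST_BY_LNL)
    print("lowest BIC:", BEST_BY_BIC)
    print("lowest AICc:", BEST_BY_AICC)
```

With these illustrative numbers the richer GTR + G + I model has the higher likelihood, but its three extra parameters cost more than the likelihood gain under both BIC and AICc, so the simpler TN93 + G + I model is preferred by the penalized criteria.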

GenBank accession numbers
GenBank accession numbers of the gag sequences were KM218392 to KM218428; pol sequences, KM218448 to KM218460; integrase sequences, KM218429 to KM218447 and for the env sequences, KM218357 to KM218391.
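The accession ranges above also show how many sequences were deposited per region. The following sketch expands each range and counts its members; it assumes every range is contiguous and shares a single letter prefix, and `ACCESSION_RANGES`, `expand_range` and `COUNTS` are hypothetical names introduced here for illustration.

```python
import re

# Accession ranges exactly as listed in the text.
ACCESSION_RANGES = {
    "gag": ("KM218392", "KM218428"),
    "pol": ("KM218448", "KM218460"),
    "integrase": ("KM218429", "KM218447"),
    "env": ("KM218357", "KM218391"),
}

_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_range(first, last):
    """List every accession from first to last, inclusive.

    Assumption: both accessions share the same letter prefix and the
    numeric parts form a contiguous block.
    """
    m_first = _ACCESSION.fullmatch(first)
    m_last = _ACCESSION.fullmatch(last)
    if m_first is None or m_last is None:
        raise ValueError("unrecognized accession format")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accession prefixes differ")
    prefix = m_first.group(1)
    width = len(m_first.group(2))
    low, high = int(m_first.group(2)), int(m_last.group(2))
    if high < low:
        raise ValueError("range is reversed")
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


# Number of deposited sequences per genomic region.
COUNTS = {region: len(expand_range(*r)) for region, r in ACCESSION_RANGES.items()}

if __name__ == "__main__":
    for region, count in COUNTS.items():
        print(f"{region}: {count} sequences")
    print("total:", sum(COUNTS.values()))
```

Under these assumptions the ranges correspond to 37 gag, 13 pol, 19 integrase and 35 env sequences, 104 in total.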

Demographic information
The demographic and clinical information of the cohort, together with the subtyping, are summarized in Table 2.
The study involved 51 plasma samples, collected from 43 participants in Bushbuckridge, between February and July 2009. Forty samples were collected at the recruitment visit and 11 samples at visit one. Only one sample per participant was included in the study. All participants, except for 0064A and 0206A, were female and none were on HIV-1 antiretroviral treatment. The average age of the cohort was 26.7 years and ranged from 16 to 41 years. The CD4 lymphocyte count ranged from 105 to 1263 with an average of 450.

PCR amplification, sequence data and quality assurance

Preliminary subtype analysis using online tools
The REGA and jpHMM online tools were used to assign subtypes to all the sequences and to detect possible recombinant forms. REGA 3.0 assigned all gag, pol PR/RT, pol IN, and env sequences to subtype C, except for env 0143A, which was assigned subtype B. Similar results were obtained with jpHMM, with the exception of the IN region of 0193A, which was assigned as a CK recombinant form.
The SCUEAL subtyping of the pol PR/RT and IN gene fragments revealed that all the PR/RT and IN sequences were HIV-1 subtype C. Six of the sequences (18.75%) were intra-subtype C recombinant forms (Table 3).

ML Phylogenetic inference
Model Selection (ML) using the BIC, implemented in MEGA, indicated the use of the (GTR + G + I) model for the pol and env regions and the use of the TN93 + G + I model for the gag region (Additional files 1, 2, 3 and 4: Table S1, Table S2, Table S3 and Table S4. Maximum Likelihood fits of 24 different nucleotide substitution models for gag, pol PR/RT, pol IN and env gp41, respectively). ML phylogenetic trees were inferred from the multiple sequence alignments, and branches with a bootstrap value of 70% or greater were considered reliable. None of the sub-genomic regions supported a monophyletic South African lineage.
In the gag ML tree (Figure 2A and B) all the sequences clustered within subtype C. Except for slight differences in the bootstrap values, there were no differences in the gag tree topologies inferred with either the GTR + G + I or TN93 + G + I models. Interestingly, the 2 outliers to the main subtype C cluster, 0042A and 0143A, were possible intra-subtype C recombinants in the pol region. Sequence 0119A had a long branch and 3 sets of sequences, 0189A/0203A, 0064A/190A and 0085A/0101A, clustered closely together. This may indicate either PCR contamination or that these samples are epidemiologically linked.
The ML phylogenetic tree for the pol PR/RT gene comprised 49 sequences and all the Mpumalanga sequences clustered with HIV-1 subtype C (Figure 3). The ML phylogenetic tree for the IN region contained 55 sequences and all Mpumalanga sequences clustered with HIV-1 subtype C reference sequences (Figure 4). Sequence 0098A clustered as an outlier to subtype C and SCUEAL indicated that the sequence is an intra-subtype C recombinant with 3 breakpoints. Sequence 0193A had a long branch and jpHMM indicated a possible CK recombinant form.
The ML phylogenetic tree for the env gp41 contained 74 sequences and all sequences, except for 0143A, clustered with HIV-1 subtype C sequences (Figure 5). Sequence 0143A clustered with subtype B in the env region and as an outlier to subtype C in the gag region. SCUEAL indicated that 0143A was an intra-subtype C recombinant in the pol region. This is the first indication of a putative unique BC recombinant sequence in Bushbuckridge, Mpumalanga.

HIV-1 antiretroviral drug resistance mutations
Although the participants were from an antiretroviral treatment naïve cohort, some antiretroviral drug resistance mutations were detected (Table 4). The NNRTI mutation K103N detected on the 0143A sequence causes high-level resistance to nevirapine (NVP) and efavirenz (EFV). The NNRTI mutation E138A, detected on the 0143A sequence, is a polymorphism that may contribute to reduced etravirine (ETR) and rilpivirine (RPV) susceptibility in combination with other NNRTI-resistance mutations. The K101E mutation found on the 0189A sequence causes intermediate resistance to NVP and low-level resistance to EFV, ETR, and RPV. No major PI mutations were detected in the Bushbuckridge, Mpumalanga sequences. The T74S minor PI mutation occurs in 5% of untreated persons with subtype C viruses and is associated with reduced nelfinavir (NFV) susceptibility [28][29][30][31].
E157Q is an integrase polymorphic accessory mutation that is weakly selected in patients receiving raltegravir (RAL) and causes low level resistance to RAL and elvitegravir (EVG). L74I is an accessory mutation for integrase.
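Mutation names such as K103N encode the reference amino acid, the codon position and the observed amino acid. The sketch below shows, under simplifying assumptions, how such names can be parsed and how they could be called from a pair of pre-aligned amino acid fragments. It is not the Stanford database algorithm; `parse_mutation`, `call_mutations` and the three-residue fragments are hypothetical illustrations, and `REPORTED_NOTES` only restates the interpretations given in this section.

```python
import re


def parse_mutation(code):
    """Split a name like 'K103N' into (reference, position, observed)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"unrecognized mutation name: {code!r}")
    reference, position, observed = match.groups()
    return reference, int(position), observed


def call_mutations(reference, query, first_position):
    """Name every difference between two pre-aligned amino acid fragments.

    Assumptions: both fragments have equal length and are already aligned;
    gaps ('-') and unknown residues ('X') in the query are skipped.
    """
    if len(reference) != len(query):
        raise ValueError("fragments must be aligned to equal length")
    calls = []
    for offset, (ref_aa, qry_aa) in enumerate(zip(reference, query)):
        if qry_aa != ref_aa and qry_aa not in "-X":
            calls.append(f"{ref_aa}{first_position + offset}{qry_aa}")
    return calls


# Interpretations restated from the text of this section.
REPORTED_NOTES = {
    "K103N": "NNRTI; high-level resistance to NVP and EFV",
    "E138A": "NNRTI polymorphism; may reduce ETR and RPV susceptibility",
    "K101E": "NNRTI; intermediate NVP, low-level EFV, ETR and RPV resistance",
    "T74S": "minor PI mutation; reduced NFV susceptibility",
    "E157Q": "integrase accessory; low-level RAL and EVG resistance",
    "L74I": "integrase accessory mutation",
}

# TOY fragments for illustration only (not participant data): three
# residues assumed to start at reverse transcriptase position 101.
TOY_CALLS = call_mutations("KKK", "KKN", first_position=101)

if __name__ == "__main__":
    for call in TOY_CALLS:
        print(call, "->", REPORTED_NOTES.get(call, "not described in this section"))
```

In practice, resistance interpretation depends on a curated, regularly updated mutation list, so a lookup table like the one above would only serve as a reading aid for the results reported here.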

Discussion
The investigation of the HIV subtype diversity of samples obtained from a cohort in Bushbuckridge, Mpumalanga revealed, first, that the HIV-1 from these samples belongs almost entirely to HIV-1 subtype C, with one BC recombinant; second, that the way in which the sequences derived from these samples cluster in phylogenetic trees suggests there have been multiple introductions of HIV-1 into Bushbuckridge; and third, that the prevalence of antiretroviral drug resistance mutations and drug resistance-associated polymorphisms in Bushbuckridge is extremely low.
Figure 3 Phylogenetic analysis of the partial pol gene, using MEGA 6. The evolutionary history was inferred by using the ML method based on the GTR model. The tree with the highest log likelihood (−9574.7386) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 1.1121)). The rate variation model allowed for some sites to be evolutionarily invariable.

Bushbuckridge HIV epidemic is predominantly subtype C, with one BC URF
The fact that the HIV samples from Bushbuckridge, Mpumalanga, belong almost entirely to subtype C is consistent not only with the explosive HIV-1 epidemic in southern Africa, but also with its very limited subtype diversity. HIV-1 subtype C is the most common subtype, accounting for the majority of HIV infections in southern Africa [4,5], while subtype B is responsible for infections in MSM [9,12]. One putative subtype BC unique recombinant form was detected. This indicates that BC URFs are found not only in the Western Cape Province [15], but also in Mpumalanga Province.

Multiple introductions of HIV-1 into Bushbuckridge
The fact that HIV-1 subtype C sequences from South Africa tend to intermingle with HIV-1 subtype C sequences from Botswana, Malawi and Zambia suggests they may have a common evolutionary origin [32,33]. The possibility of an underlying common evolutionary origin of isolates in southern Africa is consistent with the history of the population dynamics of the southern African region. While the HIV-1 subtype C isolates from Brazil and Ethiopia tend to cluster separately, the fact that the subtype C isolate from India tends to cluster with the subtype C isolates from southern Africa [34,35] can be explained by the historical connections between the Indian subcontinent and southern Africa, which arise from the roles of both regions as former British colonial territories.
Countries in southern Africa in which adult national HIV prevalence rates exceeded 15% in 2007 were all linked by the migrant labor system. This system, which underpinned the population dynamics of both South Africa and the broader southern African region, was critical in shaping the patterns of population mobility and integration that characterize the entire region. The migrant labor system was integral to the development and structure of the South African economy and apartheid. Botswana, Lesotho, Namibia, South Africa, Swaziland, Zambia, and Zimbabwe were all historically linked through the migrant labor system that brought men from as far as Zambia and Malawi to the mines, initially on the Reef and subsequently elsewhere in the country [36][37][38][39]. Migrants are more vulnerable to HIV infection than people who hardly move, both in southern Africa and in other African countries [40][41][42]. A 1985 survey of workers in the gold mines originating from the entire southern African region found HIV prevalence to be very low among South African miners, but among Malawian miners prevalence was already at 3% [36]. High infection levels are being found in Gaza province in Mozambique, where large numbers of migrants working in South Africa originate [43]. Before and after independence, foreign migrant workers also crossed borders to work in mines in Namibia, Botswana, Zambia, and Zimbabwe [44][45][46].
Many of the countries in southern Africa with explosive HIV/AIDS epidemics are also landlocked, which means that the region's road transport networks not only link these landlocked countries to the ports in Durban, Richards Bay and Maputo, but also facilitate the rapid spread of HIV in the region by ensuring that the sexual networks that drive the epidemic transcend national boundaries. The Ehlanzeni District in Mpumalanga Province straddles the Maputo Corridor, a major trade route which connects the Gauteng, Limpopo, and Mpumalanga provinces of South Africa with Maputo, the capital of Mozambique, which also has a major port. In Mozambique, HIV is spreading more rapidly in provinces linked by major transport routes to Malawi, South Africa and Zimbabwe. High infection rates have been found in Sofala province, which is traversed by Zimbabwe's main export route [43].
The peculiarly explosive HIV-1 epidemic in southern Africa could also stem from the unique biological properties of subtype C. HIV-1 subtype C has an additional NF-κB binding site in the long terminal repeat (LTR), a prematurely truncated Rev protein, a 5-amino-acid insertion in Vpu, and a more active, catalytically efficient protease, which may influence viral gene expression and alter the transmissibility and pathogenesis of subtype C isolates [31][47][48][49][50][51][52]. These unique biological properties, including those related to viral entry and pathogenesis such as the CCR5 and non-syncytium-inducing phenotype, may account for the explosive epidemic of HIV-1 subtype C in southern Africa [53][54][55]. However, the additional NF-κB site in HIV-1 subtype C may be biologically inactive, and enhanced activity of these individual functions may still not be sufficient to overcome the decreased replicative capacity of the CCR5-tropic non-syncytium-inducing phenotype [53].
Figure 4 Phylogenetic analysis of the integrase gene, using MEGA 6. The evolutionary history was inferred by using the ML method based on the GTR model. The tree with the highest log likelihood (−7480.4899) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.3186)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 42.6830% sites). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 55 nucleotide sequences and there were a total of 849 positions in the final dataset.

Drug resistance mutations and polymorphisms
Combination antiretroviral therapy can suppress HIV-1 replication to undetectable levels, with concomitant significant improvements in clinical outcomes. However, suboptimal suppression of HIV-1 replication can result in the emergence of drug resistant virus strains. HIV-1 isolates that have acquired mutations conferring reduced susceptibility to antiretroviral drugs can be transmitted, potentially limiting options for first line therapy in untreated individuals [56]. The proportion of patients without prior antiretroviral therapy who are infected with a virus resistant to at least one antiretroviral drug in Australia, Europe, Japan and the United States of America is 10% to 17%, while data from between 2006 and 2010 suggest that transmitted antiretroviral drug resistance among those starting antiretroviral treatment in low- and middle-income countries is increasing [2]. South Africa has the largest antiretroviral treatment program in the world. Besides its unprecedented scale, the antiretroviral treatment programme in South Africa is also being rolled out rapidly, such that while only 833 653 adults and 86 270 children were on antiretroviral treatment through the public sector in South Africa by the end of 2009, the number of those on treatment by 2012 had increased to 2 010 340 adults and 140 541 children [2,3,16].
Figure 5 Phylogenetic analysis of the partial env gene, using MEGA 6. The evolutionary history was inferred by using the ML method based on the GTR model. The tree with the highest log likelihood (−7290.5638) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. A discrete Gamma distribution was used to model evolutionary rate differences among sites (5 categories (+G, parameter = 0.6134)). The rate variation model allowed for some sites to be evolutionarily invariable ([+I], 32.1349% sites). The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. The analysis involved 74 nucleotide sequences and there were a total of 402 positions in the final dataset.
While the HIV-1 sequences used in this study are derived from treatment-naïve participants from Bushbuckridge, Mpumalanga, the K103N antiretroviral drug resistance mutation was detected. This suggests that the participants concerned may either have undergone antiretroviral treatment or have been infected with antiretroviral drug resistant strains [28][29][30][31]. The E138A mutation, which is selected for by rilpivirine and etravirine, must also be a transmitted mutation. Neither rilpivirine nor etravirine is part of the first and second line ART regimens in South Africa, while etravirine is part of the third line regimen. Patients in Mpumalanga only started receiving third line ART in 2013.

Limitations of this study
The limitations of the study include a relatively small sample size; unsuccessful amplification of the partial pol PR/RT subgenomic region for up to 71% of the samples; the use of partial gene regions to assign viral subtypes, potentially allowing recombinant viruses to be missed; and the use of direct population sequencing, which may fail to detect minority-population viruses and can therefore lead to an underestimation of viral diversity and drug resistance mutations.