gp46 Molecular characterization
Before the newly obtained gp46 sequences were analyzed for mutations and molecular characteristics, two reference datasets were constructed. The first, called the “Cosmopolita dataset”, was composed of 27 gp46 sequences from countries other than Brazil that were already available in the NCBI Nucleotide Sequence Database (GenBank). These 27 sequences carried no subtype information and originated from French Guiana, South America (n = 7); Martinique, Caribbean (n = 1); Gabon, Central Africa (n = 12); and Guadeloupe, Caribbean (n = 7). The second, called the “Brazilian dataset”, was composed of 42 gp46 sequences from Brazil that were also already available in GenBank. These 42 sequences are classified as subtype a (Cosmopolitan) and originated from different Brazilian geographic regions: Salvador, Northeast (n = 21); Londrina, South (n = 5); and São Paulo, Southeast (n = 11); the remaining five had no information on geographic origin.
Each dataset was aligned separately with the Clustal X software, the alignments were edited manually using the GeneDoc program, and the edited alignments were used to generate a single consensus sequence per dataset with the BioEdit software. The consensus sequence from the “Cosmopolita dataset” was called the “Cosmopolita reference” and the consensus sequence from the “Brazilian dataset” was called the “Brazilian reference”. These consensus sequences comprise the most frequent nucleotide variant at each position in previously published gp46 sequences from Brazil and elsewhere, and they were used as the reference sequences to identify possible mutations in the 146 newly generated gp46 sequences.
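As an illustration of the consensus step, the following minimal Python sketch derives a majority-rule consensus from an already aligned set of sequences and lists the positions at which a query sequence differs from it. It is not the BioEdit procedure itself: the function names `consensus` and `differences`, the alphabetical tie-breaking rule, and the treatment of gap characters as ordinary symbols are assumptions made here for illustration only.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of equal-length aligned sequences.

    Assumption of this sketch: ties are broken alphabetically and gap
    characters are counted like any other symbol.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        top = max(counts.values())
        # Most frequent symbol in this column; alphabetical order on ties.
        result.append(min(sym for sym, c in counts.items() if c == top))
    return "".join(result)


def differences(reference, query):
    """Return (1-based position, reference symbol, query symbol) for each mismatch."""
    if len(reference) != len(query):
        raise ValueError("reference and query must have the same length")
    return [(i + 1, r, q)
            for i, (r, q) in enumerate(zip(reference, query))
            if r != q]
```

Applied to a real alignment, `differences(consensus(dataset), new_sequence)` would return the candidate mutation positions that then require manual inspection.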
Genetic distances were measured within two distinct groups: gp46 sequences from HC and from HAM/TSP HTLV-1-infected individuals. Distance matrices were computed under the Tamura-Nei model as implemented in the MEGA 3.0 package, and standard errors were estimated by bootstrap analysis (1000 replicates). Mutations and polymorphisms were identified manually by inspecting the alignment in the BioEdit software.
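For readers unfamiliar with the Tamura-Nei correction, the pairwise distance can be sketched in Python as follows. This is a minimal illustration of the published Tamura-Nei (1993) formula, not the MEGA implementation: the function name `tamura_nei_distance`, the pairwise deletion of ambiguous or gapped sites, and the estimation of base frequencies from the two compared sequences are assumptions of this sketch, and the bootstrap standard error is not included.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def tamura_nei_distance(seq1, seq2, freqs=None):
    """Tamura-Nei (1993) distance between two aligned nucleotide sequences.

    Sites where either sequence holds anything other than A, C, G or T
    are skipped (pairwise deletion; an assumption of this sketch).
    """
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    if freqs is None:
        # Assumption: base frequencies are pooled over the two sequences.
        bases = [base for pair in pairs for base in pair]
        freqs = {b: bases.count(b) / len(bases) for b in "ACGT"}
    g_a, g_c, g_g, g_t = (freqs[b] for b in "ACGT")
    if min(g_a, g_c, g_g, g_t) <= 0:
        raise ValueError("all four base frequencies must be positive")
    g_r, g_y = g_a + g_g, g_c + g_t

    # Proportions of purine transitions, pyrimidine transitions, transversions.
    p1 = sum({a, b} == PURINES for a, b in pairs) / n
    p2 = sum({a, b} == PYRIMIDINES for a, b in pairs) / n
    q = sum((a in PURINES) != (b in PURINES) for a, b in pairs) / n

    k1 = 2 * g_a * g_g / g_r
    k2 = 2 * g_c * g_t / g_y
    k3 = 2 * (g_r * g_y - g_a * g_g * g_y / g_r - g_c * g_t * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    if min(w1, w2, w3) <= 0:
        raise ValueError("sequences too divergent for the Tamura-Nei correction")
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)
```

With equal base frequencies and equal rates for the two transition classes, this expression reduces to the Kimura two-parameter distance, which offers a simple check on the arithmetic.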
To test whether amino acid substitutions within the gp46 sequences were favored by natural selection, positive selection was assessed using six different codon-based maximum-likelihood substitution models. All models were implemented in the HyPhy program, and the ω and p values were estimated by maximum-likelihood optimization. Under the M3 model, sites with a posterior probability exceeding 90% and an ω value > 1 were labeled “positive selection sites”. Finally, likelihood ratio test (LRT) analysis was used to determine (1) whether selection pressure was heterogeneous among sites and (2) whether positively selected sites were present [23, 36].
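The LRT arithmetic can be sketched as follows: twice the difference in log-likelihood between two nested models is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters. This is a minimal Python illustration, not the HyPhy output. The function names are hypothetical, and the model pairings mentioned in the comments (M0 versus M3 with 4 degrees of freedom, M1 versus M2 and M7 versus M8 with 2 each) are the comparisons commonly used with these models and are an assumption here, since the pairings are not spelled out in the text above. The chi-square tail probability uses the closed form that holds for even degrees of freedom only.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only.

    Uses the closed form exp(-x/2) * sum_{i<df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this sketch supports positive even df only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p value) for a nested model comparison.

    Assumed pairings (not stated in the text): M0 vs M3 with df = 4,
    M1 vs M2 and M7 vs M8 with df = 2.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(max(stat, 0.0), df)
```

A comparison would be called as `likelihood_ratio_test(lnl_m0, lnl_m3, df=4)`, where `lnl_m0` and `lnl_m3` stand for the maximized log-likelihoods reported by the fitting program.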
Epitope prediction was carried out for the HC and HAM/TSP consensus sequences against 27 HLA-I alleles (HLA A1101, HLA A26, HLA B1510, HLA B4402, HLA A01, HLA A0201, HLA A2402, HLA A5101, HLA A03, HLA B2705, HLA A6801, HLA B08, HLA B0702, HLA B2709, HLA B1402, HLA B1501, HLA B18, HLA B37, HLA B3801, HLA B3901, HLA B3902, HLA B4001, HLA B4101, HLA B4501, HLA B4701, HLA B4901, HLA B5101) and 6 HLA-II alleles (HLA DRB1 0101, HLA DRB1 0401, HLA DRB1 0301, HLA DRB1 1501, HLA DRB1 0701, HLA DRB1 1101), using the online bioinformatics tool SYFPEITHI (http://www.syfpeithi.de/Scripts/MHCServer.dll/EpitopePrediction.htm). This tool uses an algorithm that predicts peptide sequences with the potential to bind one or more HLA-I and HLA-II molecules. For each predicted epitope, it reports the peptide sequence, the HLA molecule to which it is specific, and the HLA binding score.
To investigate the possible influence of the described mutations on the gp46 protein, physico-chemical analysis was performed using Network Protein Sequence Analysis (NPSA) (http://npsa-pbil.ibcp.fr/) [38–42], and potential protein domains were analyzed using the GeneDoc software and the Prosite tool, as previously described.
Finally, SWISS-MODEL (http://swissmodel.expasy.org/), a fully automated protein structure homology-modeling server, was used to infer the possible influence of the amino acid changes on protein secondary structure.
All env nucleotide sequences previously deposited in GenBank and used in this study are listed below with their corresponding accession numbers: [L26585, L26586, L33265, L33266, AF091494-AF091500, AF092065, L76041-L76049, L76052-L76054, L76056, L76058, L76060, DQ007189-DQ007209, HM770426-HM770440, U81865-U81869, AF077209].
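Because several accession numbers are given as ranges, a short Python sketch such as the one below can expand them into individual identifiers before retrieval. The function name `expand_accessions` is hypothetical, and the sketch assumes that every range shares one letter prefix and one digit width, which holds for the list above. Expanding the list yields 69 accession numbers, matching the 27 + 42 sequences of the two datasets.

```python
import re

LISTED = (
    "L26585, L26586, L33265, L33266, AF091494-AF091500, AF092065, "
    "L76041-L76049, L76052-L76054, L76056, L76058, L76060, "
    "DQ007189-DQ007209, HM770426-HM770440, U81865-U81869, AF077209"
)

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accessions(text):
    """Expand a comma-separated accession list, unfolding 'X001-X005' ranges."""
    result = []
    for item in (part.strip() for part in text.split(",")):
        if not item:
            continue
        if "-" not in item:
            result.append(item)
            continue
        start, end = (side.strip() for side in item.split("-", 1))
        first, last = ACCESSION.fullmatch(start), ACCESSION.fullmatch(end)
        if not first or not last:
            raise ValueError(f"unrecognized accession range: {item}")
        # Assumption: both ends share the prefix and the number of digits.
        if first.group(1) != last.group(1) or len(first.group(2)) != len(last.group(2)):
            raise ValueError(f"inconsistent accession range: {item}")
        prefix, width = first.group(1), len(first.group(2))
        low, high = int(first.group(2)), int(last.group(2))
        result.extend(f"{prefix}{n:0{width}d}" for n in range(low, high + 1))
    return result


accessions = expand_accessions(LISTED)
```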