Structural similarity-based predictions of protein interactions between HIV-1 and Homo sapiens
© Doolittle and Gomez. 2010
Received: 12 February 2010
Accepted: 28 April 2010
Published: 28 April 2010
In the course of infection, viruses such as HIV-1 must enter a cell, travel to sites where they can hijack host machinery to transcribe their genes and translate their proteins, assemble, and then leave the cell again, all while evading the host immune system. Thus, successful infection depends on the pathogen's ability to manipulate the biological pathways and processes of the organism it infects. Interactions between HIV-encoded and human proteins provide one means by which HIV-1 can connect into cellular pathways to carry out these survival processes.
We developed and applied a computational approach to predict interactions between HIV and human proteins based on structural similarity of 9 HIV-1 proteins to human proteins having known interactions. Using functional data from RNAi studies as a filter, we generated over 2000 interaction predictions between HIV proteins and 406 unique human proteins. Additional filtering based on Gene Ontology cellular component annotation reduced the number of predictions to 502 interactions involving 137 human proteins. We find numerous known interactions as well as novel interactions showing significant functional relevance based on supporting Gene Ontology and literature evidence.
Understanding the interplay between HIV-1 and its human host will help in understanding the viral lifecycle and the ways in which this virus is able to manipulate its host. The results shown here provide a potential set of interactions that are amenable to further experimental manipulation as well as potential targets for therapeutic intervention.
Pathogen invasion and survival require that the pathogen interact with and manipulate its host. Human immunodeficiency virus type 1 (HIV-1) encodes only 15 proteins and must therefore rely on the host cell's machinery to accomplish vital tasks such as the transport of viral components through the cell and the transcription of viral genes [1, 2]. HIV-1 infects human cells by binding to CD4 and a coreceptor, fusing with the cell membrane and uncoating the virion core in the cytoplasm. The genomic RNA is then reverse transcribed and the DNA enters the nucleus as part of a viral pre-integration complex (PIC) containing both viral and host proteins. Afterwards, the viral DNA is inserted into the host genome by viral integrase (IN). The integrated provirus is transcribed by host RNA polymerase II from a promoter located in the provirus long terminal repeat (LTR), and the RNA is exported to the cytoplasm [1, 2]. Host machinery translates HIV-1 mRNA, and several of the resulting proteins are transported to the cell membrane to be packaged into the virion along with the genomic RNA and multiple host proteins. The virus then buds from the cell and undergoes a maturation process, which enables it to infect other cells. Throughout this process, host proteins play an indispensable role.
To understand the interface through which the pathogen connects with and manipulates its host requires knowledge of the molecular points of interaction between them. Specifically, knowledge of the protein interactions between pathogen and host is of particular value. While the prediction of protein interactions within species such as S. cerevisiae and H. sapiens has been pursued for some time, it is only recently that host-pathogen interactions have come under greater scrutiny. Indeed, computational approaches are of significant value in the host-pathogen context as large-scale experimental characterization of these interactions is non-trivial [3–6].
As a result of the need for computational approaches, several recent methods have been developed and applied to host-pathogen interactions, suggesting additional potential interactions in different host-pathogen systems. For instance, Dyer et al. predicted interactions between P. falciparum and human proteins using statistics about domains involved in within-species interactions. Also focusing on malaria, Lee and colleagues generated predictions based on interactions between orthologous proteins from eukaryotes. In the context of HIV-human interactions, at least two computational methods have been applied. In the first study, Tastan et al. used a computational approach based on the random forest method to predict protein interactions using features taken from human proteins and the human interactome. In the second study, Evans et al. predicted possible interactions using short sequence motifs conserved in both HIV-1 and human proteins.
While of value, most approaches have not utilized the significant amount of protein structure information that is increasingly available. Specifically, rapid progress in structure determination technologies has led to the deposition of large numbers of protein structures into the Protein Data Bank, with over 60,000 protein structures currently deposited. In combination with documented protein-protein interactions, the use of protein structure information provides another means for the prediction of possible protein interactions [12–14]. The central premise in such approaches is that, given a set of proteins with defined structures and associated interactions, proteins with similar structures or substructures will tend to share interaction partners. In the context of host-pathogen interactions, Davis et al. used homology modeling to ascertain potential protein interactions for pathogens responsible for several tropical diseases. Unfortunately, despite their potential value, such computational structure approaches have not been widely applied to the problem of predicting host-pathogen interactions.
[Table: HIV-1 protein structures (representation of HIV-1 proteins). Recovered column labels: PDB chains in Dali; PDB structures in Dali.]
Having identified which specific HIV-1 and human proteins have high structural similarity, we extracted all known interactions for human proteins from the Human Protein Reference Database, which contains over 37,000 documented protein interactions. Again, the central premise is that given a network of protein interactions, proteins with similar structures or substructures will tend to have similar interaction partners. Thus, our hypothesis is that HIV-1 proteins having similar structure to one or more human proteins are also likely to participate in the same set of protein interactions (Figure 1). Under these assumptions, we directly mapped HIV-1 proteins to their high-similarity matches within this network.
To reduce the number of predictions and provide an additional line of functional evidence for interactions and their possible biological relevance, we filtered these results using two types of datasets on host proteins involved in HIV-1 infection, collectively referred to as "Literature Filters" hereafter. The first type represents host proteins that have been shown to impair HIV-1 infection or replication when knocked down by siRNA or shRNA. Three genome-scale siRNA screens have been conducted in HeLa or 293T cells [19–21]. A fourth study with a similar goal was conducted using shRNA in Jurkat T-cells, a more realistic model of HIV-1 infection. Each of the four screens found over 250 host proteins involved in HIV-1 infection. Remarkably, very little overlap exists between these studies, perhaps due to differences in methods, including the cell lines and stages of the HIV-1 life cycle investigated.
The second type of data used to filter predictions is literature data identifying human proteins present in the HIV-1 virion. During budding, host proteins from both the cell surface and the cytoplasm, including some involved in the cytoskeleton, signal transduction, metabolism, and chaperones, may be incorporated into the virion. While some of these proteins may be taken up by the budding virus simply by chance, others are known to be specifically incorporated into the virion and may play key roles in the viral life cycle or pathogenesis. For example, TSG101 may be incorporated due to its interaction with Gag, and facilitates budding [23, 24].
[Table: Summary of predicted interactions (prediction results summary). Recovered labels: Before CC filter; After CC filter; Similar human proteins; Predicted human binding partners; Percent true positive.]
We visually examined some of the structural similarities that led to predictions that were already known. SMN2 is structurally similar to integrase (IN) (Figure 3A, Additional File 1) and both SMN2 and IN are known to interact with SIP1 (Gemin2) [18, 25]. SIP1, part of the large SMN complex involved in the assembly of snRNPs, may also be part of the pre-integration complex during HIV-1 infection and may aid viral reverse transcription. There are also several predicted interactions between IN and host proteins that interact with SMN2 that have not yet been tested (Additional File 1). The structural similarities shown in Figure 3B-D also led to predictions of known interactions, even though only parts of the proteins are structurally similar.
Taking localization into account, gp41 has many more predicted interactors than any other HIV-1 protein. This is most likely due to the relatively large number of GO cellular component terms that were annotated to gp41 and also relevant to the host cell. Since gp41 is known to be found in more parts of the cell than other HIV-1 proteins, a larger number of human proteins were able to meet the co-localization criterion.
The interaction predictions made by this method are specific for structures, and we note that different structures for a single protein may lead to different predictions about its interactions. Therefore, some information is lost if predictions are described at a gene level. Nevertheless, it may be of interest to consider interactions on a gene basis (See Additional File 5 for the mapping of HIV-1 IDs). When counted according to the HIV-1 protein node names and human target Entrez Gene IDs, we made 883 interaction predictions, 56 of which were true positives according to HHPID and PIG. Following CC filtering, we had 22 true positive predictions among 265 total predictions (~10% of known true positives). While these results tend to suggest higher rates of predictive accuracy when using our method, we report our more conservative Uniprot-based accuracy values as our best estimates.
Interestingly, all of the significantly enriched molecular function GO terms relate to GTP binding or hydrolysis (Figure 5B). GTPases are involved in a number of host processes that HIV-1 may take advantage of, including nuclear transport and cytoskeletal rearrangements that facilitate viral entry and cellular motility. Statins, a class of drugs that lowers cholesterol levels in the blood, have also been shown to inhibit HIV-1 infection by preventing viral fusion with the cell membrane through a mechanism that involves inhibition of Rho GTPases. In addition, p115-RhoGEF inhibits HIV-1 gene expression through the activation of RhoA. Furthermore, both Rho and Rho kinase play a role in the cellular motility that allows HIV-1 infected monocytes to cross the blood-brain barrier to cause HIV-1 encephalitis.
Actin microfilaments of the cytoskeleton are regulated by actin-binding proteins as well as Rho family small GTPases including Rho, Rac, and Cdc42. IN, RT, and gp41 were all predicted to interact with RhoA, Rac1, and Cdc42 (Figure 4). We found that gp41 has regions of structural similarity with many cytoskeleton related proteins, including erythrocytic spectrin alpha (SPTA1), erythrocytic spectrin beta (SPTB), alpha actinin 4 (ACTN4), alpha actinin 2 (ACTN2), moesin (MSN), Rho-associated coiled-coil containing protein kinase 1 (ROCK1), and arfaptin 2 (ARFIP2). IN resembles NCK adaptor proteins 1 and 2 (NCK1/2), dynactin 1 (DCTN1), and RAS GTPase activating protein 1 (RASA1), among others (Additional File 4). The cytoskeleton has been suggested to be manipulated by HIV-1 during virion fusion, assembly, and budding. HIV-1 movement through the cell can be blocked by drugs that cause depolymerization of microtubules and actin filaments. Actin has also been found within HIV-1 virions, and is incorporated through binding with NC. Thus, our predictions may aid further investigation into the ways in which HIV-1 manipulates the cytoskeleton.
By integrating a variety of high-quality functional data sets in the Literature Filter, we created a smaller interaction map that has the potential to provide a physical interaction context for a number of experimental findings. As an example, retroviral budding is known to involve members of the endosomal sorting complexes (ESCRTs). The ESCRT complexes normally induce the formation of multivesicular bodies in the endosome, but can be recruited to the plasma membrane by Gag to aid in viral budding. Many members of the ESCRT machinery appear in our results, including VPS4A, STAM2, EEA1, RAB5A, and TSG101. Early endosomal autoantigen 1 (EEA1) is recruited to early endosomes by Rab5 and phosphatidylinositol 3-phosphate. Our results show that gp41 and Gag p2 may interact with RAB5A, since they are structurally similar to EEA1 (Figure 4, Additional Files 1 and 3). EEA1 contains a FYVE domain and colocalizes with human hepatocyte growth factor-regulated tyrosine kinase substrate (Hrs) protein [33, 34]. Gp41 is also known to interact with AP1G2, an important component of clathrin-coated vesicles. AP1G2 interacts with RAB5A and provides further support for the possibility that gp41 interacts physically with RAB5A, but through a potentially different structural motif. The Gag p6 protein is a known mimic of Hrs, and like Hrs can recruit TSG101, which is required for the formation of multivesicular bodies (MVBs) and viral budding. Gag p2, as well as a model of gp41, show structural similarity to the human protein CEP55, which recruits TSG101 to the thin membrane that separates the daughter cells, where it is needed for the final separation of two cells. Our results suggest that gp41, IN, and the p2 region of Gag may all be able to interact with TSG101 (Figure 4, Additional File 4). Overall, interaction predictions are supported by a variety of studies implicating host mechanisms of vesicle formation in HIV-1 infection.
To further assess our predictions, we determined how many known interactions, curated within either HHPID or PIG, could have possibly been predicted using our method and the available data. First, in order for our approach to suggest a possible HIV-human interaction, the HIV protein must be represented among the crystal structures from PDB that are included in the Dali Database. In addition, any host factors predicted to interact with HIV-1 must have at least 1 known interaction with another human protein, and to be considered further, each of these must also have representative structures within Dali. Finally, in this work we included only those proteins that have been implicated in playing a role in HIV-1 infection through RNAi studies or studies of the protein composition of the virion. Since we removed any human target proteins that did not pass the Literature Filter, we did not make predictions for human proteins not mentioned in previous studies.
[Table: recovered labels: Before CC filter; After CC filter; Predicted true positives; Possible true interactions; Accuracy of random predictions.]
A few predictions were shared between all methods. For our results before CC filtering, we found that there were 9 interactions predicted by all three methods (Figure 6A). Of these, four were determined to be true positives in our results: RT and MAPK1, gp41 and LCK, gp41 and PTPRC, and IN and PRKCH. The other five interactions (RT and PIN1, p2 and MAPK1, p2 and YWHAZ, gp41 and PLK1, gp41 and MAPK1, gp41 and CLTC, IN and XPO1, and IN and YWHAZ) are not known to occur, and may be good candidates for further investigation since they were predicted by three diverse methods. After we filtered our predictions by shared cellular components, three predictions were still common between all three studies: gp41 and LCK, gp41 and PLK1, and IN and XPO1, one of which is a known interaction (Figure 6B). In summary, although few predictions were shared by all three studies, a large proportion of them are already known to occur, suggesting that the others may be worthy of high priority in future experimental efforts.
We have generated a map of potential protein-protein interactions between HIV-1 and its human host. The computational methodology used to create this map is based on the assumption that proteins with similar structures will share similar interaction partners. Thus HIV-1 proteins having a structure similar to one or more human proteins may potentially "plug in" to the host protein interactome at these points, providing the interface through which manipulation of downstream host processes can occur. From previous literature, many human proteins are known to play some role in HIV-1 infection. However, in most cases the nature of this role is unknown. Here, we provide specific predictions of how these human proteins may influence viral infection, namely by interacting with certain HIV-1 proteins.
In principle, our approach is applicable to any host-pathogen system with known protein structures. HIV-1 has a small proteome, with most of its protein structures at least partially determined. In addition, HIV-1 also has a large set of identified interactions that can be used for model validation. While few pathogens currently have such rich data sets, continued progress in protein structure determination will help to remedy such deficiencies.
Identification of points of modulation between a host and pathogen requires multiple lines of evidence. Computational methods can help integrate these data, providing a promising avenue for the discovery of novel host-pathogen interactions mediated by structural similarities as well as enhancing our understanding of functional relationships characterized through modern screening methods such as siRNA. Knowledge of the protein interaction network between the pathogen and human will not only further our basic understanding of pathogen survival mechanisms, but may also provide clinical targets to combat infectious disease.
We used the Dali Database for structure comparisons (downloaded in January 2009), and the Human Protein Reference Database (HPRD), HHPID and PIG for protein interactions (downloaded February, July and June 2009, respectively) [16–18, 25, 38]. The literature sources and various databases used each have their own system of identifiers. PDB codes obtained from Dali were mapped to their corresponding taxonomy and Uniprot accessions using data from the SIFTS initiative [11, 39, 40]. Other identifier mappings were carried out using DAVID Gene ID Conversion or Uniprot ID mapping [41–43]. Network diagrams were created in Cytoscape. Images of protein structures were created in MacPyMol.
We used the Dali database to ascertain structural similarity. Dali compares the 3D structural coordinates of two PDB entries by alignment of alpha carbon distance matrices, allowing for differences in domain order, and produces a structural similarity score [11, 16, 17]. The Dali Database includes structural comparisons where proteins from PDB90, a subset of the PDB where no two proteins share more than 90% sequence similarity, were used as queries against the full PDB. For this study, we took into consideration all human proteins that were listed in the Dali database as being similar to an HIV-1 protein (NCBI Taxonomy ID: 11676) and having a z score above 2.0, with the HIV-1 protein being either the query or the hit. We refer to these human proteins as "HIV-similar" proteins. No proteins of unknown structure were considered.
We found known interactions between HIV-similar proteins and target human proteins, using data from the HPRD database, which contains literature curated interactions between pairs of human proteins. For each HIV-similar protein, we predict that the target proteins, which are known to both interact with the HIV-similar protein and pass the Literature Filter, might also interact with the corresponding HIV-1 protein. Therefore, interactions between the HIV-similar and the human target proteins were mapped directly to the corresponding HIV protein.
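As a rough illustration of this selection rule, the Python sketch below keeps HIV-1/human structure pairs whose z score exceeds 2.0, with the HIV-1 protein on either side of the comparison. The `DaliHit` record layout and the chain identifiers are hypothetical stand-ins, not the Dali Database's actual file format; only the taxonomy ID 11676 and the cutoff come from the text, and 9606 is the NCBI Taxonomy ID for Homo sapiens.

```python
from typing import NamedTuple

HIV1_TAXID = 11676   # NCBI Taxonomy ID for HIV-1 (from the text)
HUMAN_TAXID = 9606   # NCBI Taxonomy ID for Homo sapiens
Z_CUTOFF = 2.0       # Dali z-score threshold (from the text)


class DaliHit(NamedTuple):
    """One pairwise structural comparison (hypothetical record layout)."""
    query_chain: str
    query_taxid: int
    hit_chain: str
    hit_taxid: int
    z_score: float


def hiv_similar_pairs(hits):
    """Return the set of (hiv_chain, human_chain) pairs with z above the cutoff."""
    pairs = set()
    for h in hits:
        if h.z_score <= Z_CUTOFF:
            continue  # "above 2.0" is read as a strict inequality
        if h.query_taxid == HIV1_TAXID and h.hit_taxid == HUMAN_TAXID:
            pairs.add((h.query_chain, h.hit_chain))
        elif h.query_taxid == HUMAN_TAXID and h.hit_taxid == HIV1_TAXID:
            # HIV-1 protein was the hit rather than the query.
            pairs.add((h.hit_chain, h.query_chain))
    return pairs
```

Storing each pair in a fixed (HIV-1, human) order means a comparison found in both directions is counted once.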
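The mapping step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, argument layout, and the use of plain string identifiers are all hypothetical, and human interactions are treated as undirected.

```python
from collections import defaultdict


def predict_interactions(hiv_similar, human_ppi, literature_set):
    """Transfer interactions of HIV-similar human proteins to HIV-1 proteins.

    hiv_similar    : dict mapping an HIV-1 protein to its HIV-similar human proteins
    human_ppi      : iterable of (a, b) human-human interactions (undirected)
    literature_set : human proteins that pass the Literature Filter
    """
    partners = defaultdict(set)
    for a, b in human_ppi:
        partners[a].add(b)
        partners[b].add(a)

    predictions = set()
    for hiv_protein, similars in hiv_similar.items():
        for similar in similars:
            for target in partners.get(similar, ()):
                if target in literature_set:
                    predictions.add((hiv_protein, target))
    return predictions
```

For example, with toy identifiers, `predict_interactions({"H": {"S"}}, [("S", "T1"), ("S", "T2")], {"T1"})` yields only the pair `("H", "T1")`, because `T2` does not pass the filter.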
To reduce the number of predictions as well as add information from functional studies, predictions were filtered based on previous implication of the human protein's involvement in the HIV-1 infection process. One criterion was presence of the host protein in the HIV-1 virion. Host proteins known to be incorporated into or onto HIV-1 during budding were taken from several literature sources [23, 24, 47]. The presence of host proteins in or on HIV-1 may be a result of specific recruitment and serve a functional role, may result from localization of the protein near the site of budding, or may simply occur by chance. Predicted interactions between HIV-1 proteins and human proteins that are incorporated into the HIV-1 virion were retained. In addition, any human protein that is incorporated into the virion and is itself structurally similar to an HIV-1 protein was also included as a possible interaction.
Another filtering criterion was the host protein's essentiality for HIV-1 infection. Recently, several large-scale experiments using siRNA or shRNA knockdowns to identify host proteins involved in the HIV-1 life cycle have been published [19–22]. The probe ids of the genes implicated by Yeung et al. were mapped to their Entrez Gene IDs using the appropriate Affymetrix annotation file http://www.affymetrix.com/products_services/arrays/specific/hgu133plus.affx#1_4. This filter is referred to as the "Literature Filter." Host proteins that were implicated in at least one of these studies as having a possible role in HIV-1 infection or replication, and which are also known to interact with an HIV-similar protein, were predicted to interact with an HIV-1 protein in the final predicted network.
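A minimal Python sketch of how the two criteria could be combined follows. It assumes each screen and the virion list are available as collections of gene identifiers; the function names and data layout are hypothetical.

```python
def literature_filter(rnai_screens, virion_proteins):
    """Human proteins implicated by at least one RNAi screen or found in the virion."""
    passing = set(virion_proteins)
    for screen in rnai_screens:
        passing |= set(screen)
    return passing


def virion_similar_predictions(hiv_similar, virion_proteins):
    """Extra predictions: virion proteins that are themselves HIV-similar.

    hiv_similar maps an HIV-1 protein to its structurally similar human proteins.
    """
    virion = set(virion_proteins)
    return {
        (hiv_protein, similar)
        for hiv_protein, similars in hiv_similar.items()
        for similar in similars
        if similar in virion
    }
```

Taking the union rather than the intersection of the screens follows the "at least one of these studies" wording, which matters given how little the screens overlap.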
To create a smaller and potentially more reliable list for further experimental validation, we further filtered the predictions based on shared sub-cellular localization. The Cellular Component (CC) Filtered dataset contains interaction predictions where the two proteins share Gene Ontology (GO) cellular component annotation. Pairs of HIV-1 and human proteins predicted to interact were only included in this dataset if both proteins were annotated by DAVID as being present in the same cellular compartment [41, 42]. Pairs with only the terms "cell" and "cell part" in common were excluded due to a large number of such pairs and the relative lack of specificity of these high level terms.
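The co-localization rule could look like the following Python sketch. It assumes GO cellular component terms have already been retrieved for every protein and stored in a dictionary; that layout and the function name are hypothetical, while the two excluded terms come from the text.

```python
UNINFORMATIVE_TERMS = {"cell", "cell part"}  # high-level terms excluded in the text


def cc_filter(predictions, cc_terms):
    """Keep pairs that share at least one informative cellular component term.

    predictions : iterable of (hiv_protein, human_protein) pairs
    cc_terms    : dict mapping a protein to its set of GO cellular component terms
    """
    kept = set()
    for hiv_protein, human_protein in predictions:
        shared = cc_terms.get(hiv_protein, set()) & cc_terms.get(human_protein, set())
        if shared - UNINFORMATIVE_TERMS:
            kept.add((hiv_protein, human_protein))
    return kept
```

Proteins with no annotation are dropped by this sketch, since an empty term set can never produce a shared term.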
Since within Dali there may be multiple PDB structures representing the same protein, there is some redundancy in the interaction predictions. In certain cases, multiple PDB structures for the same HIV-1 protein were found to be similar to multiple PDB structures for an HIV-similar protein, leading to the same interaction predictions. Therefore, the predictions were counted as unique pairs of Uniprot accessions. In addition, for ease of viewing the predicted interactome, each node representing an HIV-1 protein is labeled with the protein name while each human protein is represented by Entrez Gene ID. To determine the correct mapping of PDB codes to HIV-1 proteins, the molecule name associated with each PDB chain was searched for keywords indicating the protein, with ambiguous cases treated on an individual basis. For example, PDB molecule names containing the word "capsid" but not "nucleocapsid" were assigned to the node "capsid." Furthermore, molecule names indicating polyproteins, such as those containing the phrase "gag-pol," were checked individually to determine which specific part of the polyprotein was represented by the entry. Two PDB structures were found to represent more than one mature HIV-1 protein: 1l6nA contains both capsid and matrix, while 1bajA contains capsid and p2 [48, 49]; these structures are represented as "capsid, matrix" and "capsid, p2" respectively. When counting predictions at the gene level, we considered pairs of HIV-1 node names and human target Entrez Gene IDs.
To determine which predictions are true positives, PIG and HHPID entries for the predicted human interactors were examined to see if they contained the HIV-1 protein they were predicted to interact with [25, 38]. These interaction databases consist of PPIs curated from the literature. HHPID uses keywords to characterize the different types of interactions listed in this database. Since this work attempts to predict physical interactions, only entries with keywords representing direct interactions were included. The Uniprot accession associated with the HIV-1 protein PDB entry, in the case of PIG, or the Entrez Gene ID mapped to that Uniprot accession, in the case of HHPID, was checked to see if it was present among the already known interactions of the human protein.
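The keyword rule and the de-duplication step can be sketched in Python as follows. Only the "capsid"/"nucleocapsid" rule and the special handling of "gag-pol" come from the text; the function names, the `None` return value for cases needing individual inspection, and the identifier layout are assumptions made for illustration.

```python
def assign_node(molecule_name):
    """Assign a PDB molecule name to an HIV-1 node, or None if it needs manual review."""
    name = molecule_name.lower()
    if "gag-pol" in name:
        return None  # polyprotein: checked individually in the text
    if "capsid" in name and "nucleocapsid" not in name:
        return "capsid"
    return None  # keyword rules for other proteins are not shown here


def unique_uniprot_pairs(structure_predictions, chain_to_uniprot):
    """Collapse structure-level predictions to unique Uniprot accession pairs.

    structure_predictions : iterable of (hiv_pdb_chain, human_uniprot) pairs
    chain_to_uniprot      : dict mapping an HIV-1 PDB chain to its Uniprot accession
    """
    pairs = set()
    for hiv_chain, human_accession in structure_predictions:
        hiv_accession = chain_to_uniprot.get(hiv_chain)
        if hiv_accession is not None:
            pairs.add((hiv_accession, human_accession))
    return pairs
```

Checking for "gag-pol" before "capsid" keeps a polyprotein entry that happens to mention the capsid from being assigned automatically.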
The Gene Ontology (GO) provides a system of terms to consistently describe and annotate gene products. GO term enrichment was performed using the DAVID Functional Annotation Chart tool. The GO is organized as a tree structure, with terms becoming more specific as distance from the root increases. Therefore, to avoid very general and uninformative GO terms, only those that are found at least 5 steps removed from the overall root of GO were considered. The p-values were corrected for multiple testing using the Bonferroni procedure and transformed by taking the -log10 for easier visualization.
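For readers unfamiliar with the correction, the Python sketch below applies the textbook Bonferroni adjustment (multiply each p-value by the number of tests, capped at 1) followed by the -log10 transform. It is an illustration under that assumption; how DAVID counts the number of tests internally may differ, and the function name and input layout are hypothetical.

```python
import math


def bonferroni_neg_log10(p_values):
    """Return {term: (adjusted_p, -log10(adjusted_p))} for raw enrichment p-values.

    p_values : dict mapping a GO term to its uncorrected p-value
    """
    n_tests = len(p_values)
    transformed = {}
    for term, p in p_values.items():
        adjusted = min(1.0, p * n_tests)
        # A p-value of exactly zero has no finite logarithm.
        score = float("inf") if adjusted == 0 else -math.log10(adjusted)
        transformed[term] = (adjusted, score)
    return transformed
```

After the transform, larger values indicate stronger enrichment, and a term whose adjusted p-value reaches 1 scores zero.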
Two forms of computational validation of the method were conducted. As it is not possible to predict all known interactions due to lack of protein structures, as well as other factors, we first determined the largest set of known interactions that it is theoretically possible to predict using our approach. To do this, we first determined the sets of all proteins that could be considered. This includes the set of all HIV-1 proteins in Dali (HIV set), the set of all human proteins that are represented in both Dali and HPRD (possible HIV-similar set), and the set of all human proteins in HPRD that are known to interact with at least one protein in the possible HIV-similar set as well as pass the literature filter (possible target set). Next, every pairwise combination of proteins in the HIV set and the possible target set was checked to see if it represented a known interaction curated in HHPID or PIG. The resulting number of true interactions that could have been found by the method was compared to the number of true positives that were actually found, both before and after filtering by cellular components.
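The upper bound described here can be sketched in Python as below. The three input sets correspond to the HIV set, the possible HIV-similar set, and the Literature Filter from the text; the function name, the tuple representation of interactions, and the toy identifiers in the usage note are assumptions.

```python
def possible_true_interactions(hiv_set, possible_similar, human_ppi,
                               literature_set, known):
    """Known HIV-human interactions that the method could in principle recover.

    hiv_set          : HIV-1 proteins with structures in Dali
    possible_similar : human proteins present in both Dali and HPRD
    human_ppi        : iterable of (a, b) human-human interactions from HPRD
    literature_set   : human proteins passing the Literature Filter
    known            : set of (hiv_protein, human_protein) curated interactions
    """
    possible_targets = set()
    for a, b in human_ppi:
        if a in possible_similar:
            possible_targets.add(b)
        if b in possible_similar:
            possible_targets.add(a)
    possible_targets &= set(literature_set)

    return {
        (hiv_protein, target)
        for hiv_protein in hiv_set
        for target in possible_targets
        if (hiv_protein, target) in known
    }
```

Dividing the number of true positives actually predicted by the size of this set gives the fraction of reachable known interactions that the method recovered.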
In the second approach, actual prediction results were compared to predictions based on randomly selected HIV-human protein pairs. The HIV-1 proteins were chosen from the 69 Uniprot accessions represented at least once by structures in the Dali Database. For human proteins, two different sets of human Uniprot proteins were created, one containing all the proteins in HPRD, and the other containing the subset of human proteins that also passed the Literature Filter. The set of all human proteins in HPRD consisted of 8582 proteins and was used to see the accuracy of purely random predictions, while the second set of 830 proteins was used to observe the effect of the Literature Filter.
Since the structural similarity step was omitted, the predictions based on a human protein being similar to an HIV-1 protein and incorporated into the virion could not be simulated with the random selection procedure. We found that if we excluded this class of predictions from our real results, the number of unique predictions made was reduced to 2139, but all 62 true positives were still included. Therefore, we randomly selected 2139 pairs of HIV-1 proteins and human proteins from the entire HPRD, and a second set of 2139 pairs of HIV-1 proteins and Literature Filtered human proteins for evaluation. Next, any known interactions between the randomly chosen pairs were found using HHPID and PIG. Additionally, both the unfiltered and Literature Filtered random predictions were then subjected to the CC Filter to gauge the improvement due to this step of the method. The CC Filter reduced the number of predictions to a variable degree, depending on how many of the random predictions were annotated with the same GO cellular component term. The entire procedure was repeated 1000 times. The mean and standard error of the mean for each of the four variously filtered random prediction sets was calculated using R. The distributions of random predictions after Literature Filtering were approximately normal, so one-sided single sample t-tests were performed to determine if the method performed significantly better than random. In addition, we performed Wilcoxon signed-rank tests that do not make assumptions about normality. When comparing our results to random predictions that had undergone the same filtering steps, either the Literature Filter or both the Literature and CC Filters, the p-values were less than 2.2e-16 for all statistical tests. In addition, even when performing the randomization procedure 10000 times, none of the randomly selected interaction sets had a true positive rate higher than that observed in our results, suggesting a p-value of no greater than 0.0001.
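The randomization can be sketched in Python as follows. Several details are assumptions rather than the authors' procedure (which was carried out in R): pairs are drawn without replacement within a trial, the true positive rate is computed before any CC filtering, and the function names are hypothetical. The p-value bound mirrors the reasoning in the text, where zero exceedances in n trials suggests p no greater than 1/n.

```python
import random


def random_tp_rates(hiv_proteins, human_proteins, known, n_pairs, n_trials, seed=0):
    """True positive rate of n_pairs random HIV-human pairs, repeated n_trials times.

    hiv_proteins, human_proteins : lists of unique identifiers
    known                        : set of (hiv_protein, human_protein) curated pairs
    """
    rng = random.Random(seed)
    n_human = len(human_proteins)
    total = len(hiv_proteins) * n_human
    rates = []
    for _ in range(n_trials):
        # Draw distinct pair indices, then decode each index into a pair.
        picks = rng.sample(range(total), n_pairs)
        pairs = {(hiv_proteins[i // n_human], human_proteins[i % n_human])
                 for i in picks}
        rates.append(len(pairs & known) / n_pairs)
    return rates


def empirical_p_upper_bound(random_rates, observed_rate):
    """Fraction of random trials at least as good as observed, floored at 1/n."""
    hits = sum(rate >= observed_rate for rate in random_rates)
    return max(hits, 1) / len(random_rates)
```

Sampling indices from `range(total)` avoids materializing every possible HIV-human pair, which keeps each trial cheap even when the human protein set has thousands of entries.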
To compare our predictions to those made by Evans et al. and Tastan et al., we found the intersection of the prediction sets, counted by HIV-1 protein name and human Entrez Gene ID [9, 10]. Since each study used different names for the HIV-1 proteins, we had to map the naming schemes to each other to find common predictions. For example, Evans et al.'s "CA" and "GAG" and Tastan et al.'s "gag_capsid" and "gag_pr55" were mapped to our "capsid." Proteins for which we made no predictions, such as Rev, were not mapped to anything in our results, but were converted between Evans et al. and Tastan et al. to find overlap between these two studies.
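A short Python sketch of this name harmonization is given below. The four source names mapped to "capsid" come from the example in the text; the identity entry, the function names, and the choice to drop predictions for unmapped protein names are assumptions made for illustration.

```python
# Names used by each study for the capsid node (from the example in the text),
# plus an identity entry so predictions already using the node name pass through.
NAME_MAP = {
    "CA": "capsid",
    "GAG": "capsid",
    "gag_capsid": "capsid",
    "gag_pr55": "capsid",
    "capsid": "capsid",
}


def normalize(predictions, name_map):
    """Rewrite (hiv_name, entrez_gene_id) pairs to a common HIV-1 naming scheme."""
    return {(name_map[hiv_name], gene_id)
            for hiv_name, gene_id in predictions
            if hiv_name in name_map}


def shared_predictions(*prediction_sets):
    """Pairs present in every prediction set; empty if no sets are given."""
    if not prediction_sets:
        return set()
    return set.intersection(*(set(p) for p in prediction_sets))
```

Because several source names collapse onto one node, normalizing before intersecting can merge predictions that looked distinct in the original studies.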
Funding for this work was provided by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-09-0049 and through the Curriculum of Bioinformatics and Computational Biology. Financial support for these studies was also provided, in part, by the United States Environmental Protection Agency grant (RD833825). However, the research described in this article has not been subjected to the Agency's peer review and policy review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.