We developed a methodology for predicting INI susceptibility by applying linear regression to a clonal genotype-phenotype database. Our modeling approach differs from most other genotypic INI resistance interpretation systems in that it provides a quantitative FC prediction. A particular advantage of our model is that predictions can be directly interpreted as a weighted sum of mutations and interaction pairs. We have made our RAL second order linear regression model available as a fillable PDF form in Additional file 2, so that it can be used for rapid prediction of RAL susceptibility.
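To make the weighted-sum interpretation concrete, the following minimal Python sketch evaluates a second order model for a given genotype. It assumes the model is fitted on the log10 FC scale. The intercept, the weights, and the names `predict_log_fc` and `predict_fc` are hypothetical placeholders chosen for illustration, not the fitted values or interface of the RAL model in Additional file 2.

```python
from itertools import combinations

# Hypothetical weights on the log10(FC) scale; NOT the fitted RAL model values.
INTERCEPT = 0.0
MAIN_EFFECTS = {"155H": 1.0, "143R": 1.25, "97A": 0.1}
# Interaction terms are keyed by a frozenset so the order of the pair is irrelevant.
INTERACTIONS = {frozenset({"155H", "97A"}): 0.5}


def predict_log_fc(mutations):
    """Return the predicted log10(FC) as intercept + main effects + pair effects."""
    present = set(mutations)
    log_fc = INTERCEPT
    # First order part: one weight per mutation present in the genotype.
    log_fc += sum(MAIN_EFFECTS.get(m, 0.0) for m in present)
    # Second order part: one weight per mutation pair present in the genotype.
    for a, b in combinations(sorted(present), 2):
        log_fc += INTERACTIONS.get(frozenset({a, b}), 0.0)
    return log_fc


def predict_fc(mutations):
    """Back-transform the log10 prediction to a fold change."""
    return 10 ** predict_log_fc(mutations)
```

With these placeholder weights, `predict_log_fc(["155H", "97A"])` is the sum of the two main effects and the pair term, so each contribution to the predicted FC can be read off separately. Mutations without a weight contribute nothing, and mixtures with wild type are not handled in this sketch.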

Previously, we described a computationally feasible technique for developing parsimonious linear regression models on large genotype-phenotype datasets for the identification of novel HIV-1 drug resistance associated mutations [28]. In this article, as the number of patients failing INI treatment was limited, our primary objective was to develop a methodology for training a linear regression model on a relatively small dataset. We increased the quality of the correlative genotype-phenotype data by taking multiple clones for each of the clinical isolates [26], which allowed us to model the resistance contribution of IN mutations or mutation pairs more accurately. Moreover, to avoid overfitting, we generated an INI model by consensus linear regression modeling, using a GA for selection of IN mutations [29, 30]. Multiple clones taken from the same patient largely confirmed the independence of the RAL resistance pathways 143, 148 and 155 [24, 31, 32]. For one patient, previously described in [33], four clones were picked containing both 143C and 155H. Mutation 143C was found to have a low prevalence in the clonal database. In [34] a transition from 143C to 143R was suggested, and in our RAL linear model 143R had a larger contribution towards resistance than 143C. 143G, another resistance associated variant at position 143 that has been described in [35, 36], was also selected for our linear model. Our approach is still limited to detecting resistance associated mutations or combinations of mutations that are present in the training dataset. This limitation was in part overcome by including site-directed mutants in the analysis, which we consider valuable for improving the generalizability of the model.

We evaluated the performance of the RAL linear model on an unseen population dataset. For RAL, the additive first order model performed equally well overall as the second order model, which accounted for synergism or antagonism. However, for an individual sample (T97A) with secondary mutation 97A, found in the absence of a primary mutation, a discordance was seen between the first and second order linear models: using a biological cutoff of 2, it was scored resistant by the first order model and susceptible by the second order model. In two other samples (T97A/T, Y143R; E92E/Q, T97A/T, N155H) where primary mutations 143R or 155H occurred together with 97A (in mixture with wild type), the increased resistance conferred by the combinations 143C/R & 97A [37] or 155H & 97A was accounted for in the second order model by interaction terms. Because the second order model explicitly includes combination effects, we consider it more useful than the first order model. All interaction terms in the second order model were found to be synergistic. A high concordance in RAL resistance call was seen between the linear model and the publicly available genotypic algorithms: Stanford, Rega and ANRS. However, major discordances were observed for samples without a primary mutation and containing mutation 157Q or 121Y. For the discordance involving 157Q, already discussed in [38], four clinical isolates (E157Q) from different patients were called Susceptible by the linear model, Stanford and Rega, but Resistant by ANRS. For the discordance involving 121Y, one clinical isolate (A91T, F121Y) was called Resistant by the linear model and ANRS, Intermediate resistant by Stanford, but Susceptible by Rega. According to [11], the in vivo selection of 121Y has not yet been reported. In the current study, one patient in the unseen dataset had indeed developed the 121Y mutation. However, as 121Y was not observed in any of the patient derived clones used for training the linear model, we had made seven site-directed mutant clones for the clonal genotype-phenotype database, confirming the in vitro effect of 121Y [7] on RAL resistance. As a result, 121Y could be, and was, selected for the linear model, and contributed to the FC prediction of the two clinical isolates from the aforementioned patient. Note that the genotype of these isolates also contained the rare mutation 91T, which has not been associated with RAL resistance but contributed to resistance in the RAL linear model. From the unseen data, it seems that 91T may be a background mutation that is currently overweighted in the linear model. However, more samples are needed to be conclusive about 91T.
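The discordance for the sample carrying only 97A can be illustrated with a small Python sketch. In a first order model, the weight of 97A has to absorb part of the combination effect, whereas a second order model can move that effect into an interaction term. All weights below are hypothetical values on an assumed log10 FC scale, chosen only to reproduce the qualitative pattern described above; they are not the fitted coefficients. The function names and the choice to call a sample resistant when FC is strictly above the cutoff are also assumptions.

```python
BIOLOGICAL_CUTOFF = 2.0  # FC cutoff of 2, as used in the text

# Hypothetical log10(FC) weights; NOT the fitted model coefficients.
FIRST_ORDER = {"97A": 0.4, "155H": 1.0}
SECOND_ORDER_MAIN = {"97A": 0.1, "155H": 1.0}
SECOND_ORDER_PAIRS = {("155H", "97A"): 0.5}


def fc_first_order(mutations):
    """Additive model: sum the main effects and back-transform to FC."""
    return 10 ** sum(FIRST_ORDER.get(m, 0.0) for m in set(mutations))


def fc_second_order(mutations):
    """Main effects plus a term for every interaction pair fully present."""
    present = set(mutations)
    total = sum(SECOND_ORDER_MAIN.get(m, 0.0) for m in present)
    total += sum(w for pair, w in SECOND_ORDER_PAIRS.items()
                 if present.issuperset(pair))
    return 10 ** total


def call(fc, cutoff=BIOLOGICAL_CUTOFF):
    """Assumed rule: resistant when the predicted FC exceeds the cutoff."""
    return "resistant" if fc > cutoff else "susceptible"
```

With these placeholder weights, `call(fc_first_order(["97A"]))` and `call(fc_second_order(["97A"]))` disagree, while both models call the combination `["155H", "97A"]` resistant.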

Other rare mutations in the RAL linear model that needed closer inspection were 72L and 84L, as they are currently undescribed and contributed to resistance in the second and first order model, respectively. Remarkably, 72L and 84L co-occurred in the clonal genotypes of nine clinical isolates derived from a single patient (72L alone appeared in one other clinical isolate). In the clones of this patient the secondary mutations 74M, 92Q and 151I were also found, in the absence of any primary mutations, and the measured RAL FCs were above the biological cutoff (42.9–77.4). Thus, although 72L and/or 84L are potential RAL resistance associated mutations, it is possible that resistance for this patient is explained by a more complex synergistic interaction between 74M, 92Q and 151I. Note that mutation pair 74M & 151I was selected for the RAL second order linear model, which already indicates that INI resistance can develop through interacting secondary mutations, in the absence of a primary mutation. Moreover, interactions between mutations are expected to become more important in elucidating genotype-INI susceptibility phenotype relationships once several INIs are co-administered.

The R^{2} performance of the RAL linear model on population data was lower on unseen than on seen data. This difference was acceptable, as the unseen dataset contained more clinical isolates without any of the primary RAL resistance mutations in their genotype (82.5% *vs.* 45.0%), and the measurement error of the phenotypic assay was relatively larger for low FC values.

In the described approach, ordinary least squares (OLS) regression was used without taking into account the correlation between the genotype-phenotype data of clones from the same clinical isolate or site-directed mutant. One way to account for such correlation is to replace OLS by a linear mixed model with the mutations and mutation pairs of the RAL second order linear model (Figure 3) as *fixed effects*, and the clinical isolate/site-directed mutant as random factor. With this mixed model, the predictive performance in terms of R^{2} changed from 0.80 to 0.82 on the external validation set and from 0.78 to 0.79 on the population unseen dataset. Such a minor change was not unexpected since OLS parameter estimates are known to be unbiased, even when the correlation structure is neglected [39]. Nevertheless, for future work it could be beneficial to use a mixed model instead of OLS for the GA models to improve the selection of the mutations and mutation pairs.
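One possible way to write such a mixed model is sketched below. The notation is ours and is introduced only for illustration: y_{ij} denotes the (assumed log-transformed) FC of clone j from clinical isolate or site-directed mutant i, x_{ijk} indicates the presence of mutation k in that clone, M and P are the sets of selected mutations and mutation pairs, and a random intercept b_i is assumed as the simplest form of the random factor.

$$
y_{ij} = \beta_0 + \sum_{k \in M} \beta_k \, x_{ijk} + \sum_{(k,l) \in P} \beta_{kl} \, x_{ijk} \, x_{ijl} + b_i + \varepsilon_{ij},
\qquad b_i \sim N(0, \sigma_b^{2}), \quad \varepsilon_{ij} \sim N(0, \sigma^{2})
$$

Setting \sigma_b^{2} = 0 recovers the OLS model, in which all clones are treated as independent observations. The shared term b_i is what induces correlation between clones from the same isolate or mutant.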

In conclusion, RAL resistance could be estimated using linear regression modeling, which produced results that were generally consistent with those obtained for samples analyzed by the Stanford, Rega and ANRS algorithms or the online prediction tool geno2pheno. The quality of the INI susceptibility models is improved by developing the models on a clonal genotype-phenotype database and using a GA consensus approach. A phenotype predicted by a quantitative linear model is interpretable and informative about the effect of combinations of mutations on INI resistance. The linear regression modeling approach allows reliable models to be generated for INIs once viral isolates have been obtained during or after selective pressure of these INIs, even for relatively small numbers of patients.