Lymphocyte–monocyte–neutrophil index: a predictor of severity of coronavirus disease 2019 patients produced by sparse principal component analysis

It is important to distinguish coronavirus disease 2019 (COVID-19) patients in severe condition from those with moderate disease, so more effective predictors are needed. Clinical indicators of COVID-19 patients from two independent cohorts (training data: Hefei Cohort, 82 patients; validation data: Nanchang Cohort, 169 patients) were reviewed retrospectively. Sparse principal component analysis (SPCA) was performed on the Hefei Cohort and prediction models were derived. Prediction results were evaluated by receiver operating characteristic (ROC) curve and decision curve analysis (DCA) in both cohorts. SPCA of the Hefei Cohort revealed that the first 13 principal components (PCs) accounted for 80.8% of the total variance of the original data. PC1 and PC12 were significantly associated with disease severity, with odds ratios of 4.049 and 3.318, respectively. They were used to construct a prediction model, named Model-A. In disease severity prediction, Model-A gave the best prediction efficiency, with an area under the curve (AUC) of 0.867 and 0.835 in the Hefei and Nanchang Cohorts, respectively. A simplified version of Model-A, named the LMN index, gave prediction efficiency comparable to that of classical clinical markers, with an AUC of 0.837 and 0.800 in the training and validation cohorts, respectively. According to DCA, Model-A performed slightly better than the other markers, and the LMN index performed similarly to albumin or the neutrophil-to-lymphocyte ratio. Prediction models produced by SPCA showed robust disease severity prediction efficiency for COVID-19 patients and have potential for clinical application.

Possible risk factors for progression to severe illness include, but are not limited to, older age and pre-existing chronic medical conditions such as lung disease, heart failure, and cerebrovascular disease [2]. Clinically, the main features of severe COVID-19 include fever, leukopenia, lymphopenia, thrombocytopenia, increased C-reactive protein, and cytokine abnormalities [3][4][5][6]. Lactate dehydrogenase, interleukin 6, and D-dimer have also been reported as risk factors for progression to severe status [7]. Many clinical laboratory markers could therefore be used to predict the severity of COVID-19, and it is challenging to make use of so many laboratory indicators in clinical diagnosis and treatment.
Therefore, clinical characteristics and dozens of laboratory markers of 82 COVID-19 patients from the First Affiliated Hospital of University of Science and Technology of China were analyzed retrospectively and Sparse Principal Component Analysis (SPCA) was performed to examine the correlation between these markers and extract relevant features. Then the prediction models for disease severity were constructed based on logistic regression using the principal components (PCs) produced by SPCA. Prediction efficiency of these models was assessed and compared with classical blood markers. Furthermore, an independent cohort including 169 COVID-19 patients from the First Affiliated Hospital of Nanchang University was used as a validation dataset and prediction efficiency of these models was also evaluated.

Patient enrollment
In this study, 82 patients (Hefei Cohort) with confirmed COVID-19 admitted to the First Affiliated Hospital of University of Science and Technology of China from January 23, 2020 to March 3, 2020 were enrolled. The independent cohort comprised 169 COVID-19 patients (Nanchang Cohort) admitted to the First Affiliated Hospital of Nanchang University from January 23, 2020 to March 10, 2020. This study was approved by the Ethics Committee of the First Affiliated Hospital of University of Science and Technology of China and the Ethics Committee of the First Affiliated Hospital of Nanchang University. According to the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (Trial version 7) [8], released by the National Health Commission & State Administration of Traditional Chinese Medicine, all patients were confirmed using fluorescent reverse transcription PCR and divided into a severe group and a mild group. Adult patients meeting any of the following criteria were classified as severe type: (1) respiratory distress (≥ 30 breaths/min); (2) oxygen saturation ≤ 93% at rest; (3) arterial partial pressure of oxygen (PaO2)/fraction of inspired oxygen (FiO2) ≤ 300 mmHg (1 mmHg = 0.133 kPa). Cases in which chest imaging showed obvious lesion progression of > 50% within 24-48 h were managed as severe cases.
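The severe-type criteria above amount to an "any of" rule, which can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function name, argument names, and the handling of missing measurements (skipped rather than imputed) are assumptions made for the example, not part of the protocol.

```python
from typing import Optional


def is_severe(resp_rate: Optional[float] = None,
              spo2_rest: Optional[float] = None,
              pao2_fio2_mmhg: Optional[float] = None,
              lesion_progression_over_50pct: bool = False) -> bool:
    """Return True if any severe-type criterion is met.

    Missing measurements (None) are skipped; this is an assumption
    of the sketch, not something the protocol specifies.
    """
    # (1) Respiratory distress: >= 30 breaths/min
    if resp_rate is not None and resp_rate >= 30:
        return True
    # (2) Oxygen saturation <= 93% at rest
    if spo2_rest is not None and spo2_rest <= 93:
        return True
    # (3) PaO2/FiO2 <= 300 mmHg
    if pao2_fio2_mmhg is not None and pao2_fio2_mmhg <= 300:
        return True
    # Imaging: > 50% lesion progression within 24-48 h, managed as severe
    return lesion_progression_over_50pct


print(is_severe(resp_rate=22, spo2_rest=96, pao2_fio2_mmhg=380))  # no criterion met
print(is_severe(resp_rate=22, spo2_rest=92))                      # criterion (2) met
```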

Data collection
After all patients had been discharged from hospital, except for one who died one day after admission, their clinical data were reviewed retrospectively, including demographic data, medical history, complete blood counts, blood biochemistry, coagulation indices, infection-related indices, and myocardial markers. Routine blood tests, clinical chemistry markers, coagulation function and T lymphocyte typing were performed on a Mindray 6900 hematology analyzer, a Beckman 5800 automated biochemistry analyzer, a Succeed SF8000 hemagglutinin analyzer and a BD FACSCalibur flow cytometer, respectively. Infection-related indices and myocardial markers were measured on a Roche cobas e601 automated electrochemiluminescence immunoassay analyzer. Because all patients underwent several rounds of laboratory testing, results from three time points during hospitalization were collected: the first test upon hospitalization, a mid-term test during hospitalization, and the last laboratory test before discharge.

Statistical analysis
Statistical analysis was performed using the R software, version 3.6.3. Continuous variables were expressed as the median with interquartile range and analyzed using the Wilcoxon signed-rank test or the Pearson correlation test. Categorical variables were presented as numbers (percentages) and analyzed using the chi-squared test or Fisher's exact test. Repeated-measures data from different time points were compared by repeated-measures analysis of variance. Multivariate logistic regression analysis was used to identify risk factors for disease progression.

Sparse principal component analysis (SPCA) and model evaluation
SPCA was performed using the R software package sparsepca (https://github.com/erichson/spca) [9]. Continuous clinical variables of the Hefei Cohort, including age and all of the above laboratory indicators, were used. The data were centered and scaled by subtracting the mean of each variable and dividing by its standard deviation, so that all variables had unit variance. In the SPCA process, the controlling parameter alpha was varied from 0.0001 to 0.002 in steps of 0.0001 for better variable selection, and for each alpha value the cumulative variance and the number of variables selected in the top principal components (PCs) were calculated. PCs produced by SPCA were then subjected to multivariate logistic regression for disease severity prediction. The prediction models using PCs were evaluated using the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) was calculated. The accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were also calculated. To assess clinical net benefit [10], decision curve analysis was performed using the rmda package (http://mdbrown.github.io/rmda/).
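The evaluation measures named above can be illustrated with a minimal Python sketch. It is not the R code used in the study: the function names, the toy scores, and the 0.5 cutoff are assumptions made for the example. The AUC is computed with the rank-based (Mann-Whitney) definition, i.e. the probability that a randomly chosen severe case scores higher than a randomly chosen mild case, with ties counted as one half.

```python
def auc(scores, labels):
    """Rank-based AUC; labels are 1 (severe) or 0 (mild)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _ratio(num, den):
    # Undefined ratios (empty denominator) are reported as NaN
    return num / den if den else float("nan")


def classification_metrics(scores, labels, cutoff):
    """Confusion-matrix metrics when score >= cutoff is called severe."""
    pred = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(pred, labels) if p == 0 and y == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(labels)),
    }


# Toy data for illustration only (not patient data)
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
print(classification_metrics(toy_scores, toy_labels, cutoff=0.5))
```

The cutoff strongly affects every metric except the AUC, which is one reason the AUC is reported alongside the threshold-dependent measures.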

Independent cohort validation
The 169 COVID-19 patients of the Nanchang Cohort were used as an independent validation cohort. Using the scaled clinical markers of each patient, the PCs of each patient were calculated from the corresponding PC loading matrix derived from the Hefei Cohort. The resulting prediction models were then used to predict disease severity in this independent cohort, and the prediction efficiency was estimated using ROC analysis. The sensitivity, specificity, PPV, NPV and accuracy of each marker were also calculated. The clinical net benefit was evaluated using decision curve analysis.
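Computing validation-cohort PCs from training-cohort loadings is a scale-then-project operation, sketched below in plain Python. The text does not state whether the Nanchang data were scaled with their own means and standard deviations or with those of the Hefei Cohort; the sketch assumes the training-cohort statistics are reused, which is a common choice but an assumption here. All function names and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def fit_scaler(train_rows):
    """Column means and sample standard deviations of the training data."""
    cols = list(zip(*train_rows))
    return [mean(c) for c in cols], [stdev(c) for c in cols]


def scale(rows, means, sds):
    """Center and scale each variable: (x - mean) / sd."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)]
            for row in rows]


def project(scaled_rows, loadings):
    """PC scores = scaled data times loading matrix.

    loadings is a list of rows, one per variable, each holding that
    variable's loading on every component (n_variables x n_components).
    """
    n_comp = len(loadings[0])
    return [[sum(x * loadings[j][k] for j, x in enumerate(row))
             for k in range(n_comp)]
            for row in scaled_rows]


# Toy example: two variables, one component (values are made up)
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, sds = fit_scaler(train)
toy_loadings = [[0.6], [-0.8]]
validation = [[3.0, 10.0]]
print(project(scale(validation, means, sds), toy_loadings))
```

With a sparse loading matrix most entries are zero, so each PC score depends on only a few markers of each patient.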

Results of Sparse principal component analysis (SPCA)
When disease severity was predicted by multivariate logistic regression using the clinical laboratory indicators directly, the model fit did not converge. Because clinical laboratory markers are often correlated with each other, we used SPCA to reduce the dimensionality of the data and extract a small number of PCs that explain these dozens of markers.
Using the sparsepca package, SPCA was performed on the 44 clinical variables, and the alpha parameter was varied from 0.0001 to 0.002 in steps of 0.0001. In these SPCA models, the cumulative variance of the first 13 PCs was greater than 80% of the total variance. For the model at each alpha, the cumulative variance of the first 13 PCs was summed and the number of variables selected in the first 13 PCs was counted (Fig. 1a). As alpha increased, the cumulative variance decreased gradually and the number of variables fell sharply. At an alpha of 0.0015, the first 13 PCs accounted for 80.8% of the cumulative variance of the original data and the number of variables selected in the 13 PCs was only 30. Based on the variance-sparsity trade-off [11], the SPCA model with an alpha of 0.0015 was used for further analysis.
The patient distribution and variable loadings from SPCA with an alpha of 0.0015 are shown in Fig. 1b,c. Mildly and severely ill COVID-19 patients were separated along the PC1 direction (X-axis) in the patient distribution plot (Fig. 1b). Each PC depends on fewer than five clinical variables. An additional table (Additional file 1: Table S1) shows this in more detail.
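The bookkeeping behind this alpha scan can be sketched as follows. The sketch does not fit SPCA itself (the study used the sparsepca R package for that); it only shows how the alpha grid, the count of selected variables, and the cumulative variance ratio could be computed once a loading matrix and per-component variances are available. The function names and the small loading matrix are hypothetical.

```python
def alpha_grid(start=0.0001, stop=0.002, step=0.0001):
    """Alpha values from start to stop inclusive, rounded to 4 decimals."""
    n = round((stop - start) / step) + 1
    return [round(start + i * step, 4) for i in range(n)]


def n_selected_variables(loadings, tol=1e-12):
    """Number of variables with a non-zero loading on at least one PC.

    loadings is n_variables x n_components.
    """
    return sum(1 for row in loadings if any(abs(v) > tol for v in row))


def cumulative_variance_ratio(component_variances, total_variance, k):
    """Summed variance of the first k components over the total variance."""
    return sum(component_variances[:k]) / total_variance


grid = alpha_grid()
print(len(grid), grid[0], grid[-1])

# Made-up sparse loadings: 4 variables, 2 components
toy_sparse_loadings = [
    [0.7, 0.0],
    [0.0, 0.0],   # never selected
    [-0.7, 0.0],
    [0.0, 1.0],
]
print(n_selected_variables(toy_sparse_loadings))
```

For standardized data the total variance equals the number of variables, so in the setting described above `total_variance` would be 44. Because sparse components are not guaranteed to be uncorrelated, a simple sum of component variances should be read as an approximation.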
Next, the 13 PCs were subjected to multivariate logistic regression for disease progression prediction. Using both stepwise logistic regression and logistic regression with an L1 penalty (glmnet package, https://cran.r-project.org/package=glmnet), two of the 13 PCs were finally selected for the prediction model: the first PC (PC1) and the 12th PC (PC12) showed significant association with the disease severity classification (Table 2). This model was named Model-A for further analysis.
According to the PC loading matrix (Additional file 1: Table S1) and the variable loading plots of SPCA (Fig. 1c), PC1 depends on NEU%, LYM%, LYM, and MONO, while PC12 depends only on DD and LDH. Because NEU%, LYM%, LYM, and MONO can all be obtained from one routine blood test and PC1 accounted for 17.8% of the total variance, Model-A was further simplified to PC1 alone, which was named the Lymphocyte-Monocyte-Neutrophil index, abbreviated as LMN index.
The relationships of Model-A and the LMN index with clinical variables were assessed. Both showed significant correlation with CD8+ lymphocyte counts (Fig. 2a, b). Higher Model-A probabilities and LMN indices were observed in patients with comorbidities and in older patients (Fig. 2c-f). Furthermore, Model-A probabilities and LMN indices at different time points during hospitalization were investigated, and both decreased significantly as treatment took effect and before discharge (P < 0.001, Fig. 3a, b). Patients with mild and severe status showed clearly different trends (P < 0.001) in Model-A probability and LMN index. Both the Model-A probabilities and the LMN indices of mildly ill patients fell sharply (Fig. 3a, b, green lines), while those of severely ill patients declined slowly (Fig. 3a, b, red lines).
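A logistic model on two PCs maps scores to a probability through the logistic function, and each coefficient is the natural logarithm of the corresponding odds ratio. The sketch below uses the odds ratios reported in the abstract (4.049 for PC1, 3.318 for PC12) and assumes they are expressed per unit of PC score. The intercept is not given in the text, so it is left as a required argument; the function name and example values are hypothetical, and this is not the fitted Model-A.

```python
import math

# Odds ratios as reported in the abstract; assumed to be per unit PC score
OR_PC1 = 4.049
OR_PC12 = 3.318


def model_a_probability(pc1, pc12, intercept):
    """Logistic probability of severe disease from PC1 and PC12.

    The intercept is NOT reported in the text and must be supplied;
    any value used with this function is purely illustrative.
    """
    logit = intercept + math.log(OR_PC1) * pc1 + math.log(OR_PC12) * pc12
    return 1.0 / (1.0 + math.exp(-logit))


# With a zero intercept, a one-unit rise in PC1 multiplies the odds by OR_PC1
p0 = model_a_probability(0.0, 0.0, intercept=0.0)
p1 = model_a_probability(1.0, 0.0, intercept=0.0)
print(p0, p1 / (1.0 - p1))
```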

Prediction efficiency evaluation
ROC analysis was then used to estimate the disease severity classification performance of Model-A and the LMN index. The AUC and accuracy of Model-A for predicting disease severity of COVID-19 patients were 0.867 and 0.726 in the Hefei Cohort (Table 3). The corresponding values for the LMN index were 0.837 and 0.793, respectively (Table 3).
Because several laboratory markers are classical predictors of disease severity, we also compared the prediction results of these markers, which are summarized in Table 3. Model-A showed the best performance, and the LMN index showed robust prediction performance compared with classical predictors, including the neutrophil-to-lymphocyte ratio (NLR), which is a promising predictor of severe COVID-19 [12,13]. To assess the clinical net benefit of Model-A and the LMN index, we also performed decision curve analysis (Fig. 4a). The curves of all the markers were intertwined; Model-A gave slightly greater net benefit, while the LMN index showed performance similar to that of albumin and NLR.
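Decision curve analysis plots net benefit against the threshold probability. A minimal Python sketch of the standard net benefit formula, net benefit = TP/n - (FP/n) × pt/(1 - pt), is given below for orientation. The study used the rmda R package; the function names and toy data here are assumptions made for the example.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as high risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy predicted risks and outcomes (not patient data)
toy_probs = [0.9, 0.7, 0.4, 0.2]
toy_outcomes = [1, 0, 1, 0]
for pt in (0.2, 0.5):
    print(pt,
          net_benefit(toy_probs, toy_outcomes, pt),
          net_benefit_treat_all(toy_outcomes, pt))
```

A model adds clinical value at a given threshold only if its net benefit exceeds both the treat-all reference and zero (treat none), which is why curves that intertwine indicate similar clinical usefulness.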

Independent cohort validation
To validate the severity prediction efficiency of Model-A and the LMN index, the laboratory indicators of the Nanchang Cohort (Additional file 1: Table S2) were scaled, and PC1 and PC12 of each patient were calculated using the PC loading matrix of the Hefei Cohort. Model-A probabilities and LMN indices were then predicted. ROC analysis was used to estimate the severity prediction efficiency; the AUC and accuracy of Model-A in the Nanchang Cohort for disease progression prediction were 0.835 and 0.757, respectively (Table 4). The AUC and accuracy of the LMN index were 0.800 and 0.740 in this independent cohort. As in the training cohort (Table 3), Model-A gave the best efficiency and the LMN index showed prediction efficiency comparable to that of classical predictors (Table 4), such as NLR and albumin. In decision curve analysis (Fig. 4b), all the curves were intertwined; Model-A gave slightly better performance, while the LMN index showed performance similar to that of albumin and NLR.

Discussion
Since the outbreak of COVID-19, the number of patients worldwide has increased drastically, which has put massive pressure on the health care system of every country. To save as many lives as possible, more resources should be focused on severely ill patients. Several studies have sought predictors of COVID-19 disease progression, such as the neutrophil-to-lymphocyte ratio [12,13], thrombocytopenia [5], DD, and IL-6 [7]. Dozens of other laboratory indicators are also used for disease severity prediction. In the present study, we used SPCA to extract principal components of laboratory indicators. In the SPCA model with an alpha of 0.0015, the first 13 PCs accounted for 80.8% of the total variance of the 44 clinical variables. Using logistic regression, Model-A based on PC1 and PC12 was derived and showed the best prediction efficiency in the training cohort (Hefei Cohort, AUC = 0.867) as well as in the independent validation cohort (Nanchang Cohort, AUC = 0.835). Because PC1, which depends on routine blood test markers, accounted for 17.8% of the total variance, Model-A was further simplified to the LMN index, which predicts disease severity using PC1 alone. The LMN index also showed satisfactory prediction efficiency in the Hefei Cohort (AUC = 0.837) as well as in the independent Nanchang Cohort (AUC = 0.800). In decision curve analysis, Model-A showed slightly better performance in both the Hefei Cohort and the Nanchang Cohort, and the LMN index performed comparably to albumin and NLR.
In the clinical laboratory, panels of test items are very common, and the indicators within these panels are often correlated with each other. For example, in routine blood examination, neutrophil counts are usually negatively related to lymphocyte counts, and in liver function examination, serum ALT usually changes in parallel with AST. This feature of laboratory markers is called collinearity and could enhance diagnostic accuracy. However, the collinearity of these laboratory markers makes it difficult for traditional multivariate statistical analysis to include all the significant indicators. This is why PCA was used in this study: it can extract distinct PCs, as combinations of the original variables, from a group of highly correlated variables [14,15]. Furthermore, a controlling parameter alpha was introduced into PCA for better variable selection, which gives the so-called SPCA [16,17]. In this study, alpha was varied from 0.0001 to 0.002, and when alpha was set to 0.0015, the 13 PCs accounted for 80.8% of the total variance of the 44 clinical variables and depended on only 30 variables. This SPCA model thus balanced variance and sparsity [11] and could represent the original 44 variables. In addition, the sparsepca package [9] used in the current study implements a recently published SPCA method that offers some immediate improvements over previously proposed SPCA algorithms, such as a much faster and more scalable algorithm and robustness to outliers.
In the disease severity prediction model Model-A, PC1 depends on four clinical markers (NEU%, MONO, LYM%, and LYM), while PC12 depends only on DD and LDH. Several previous studies have confirmed the relationship between decreased LYM and increased NEU in severely ill COVID-19, SARS and MERS patients [18][19][20][21][22][23]. In this study, both the count and the percentage of lymphocytes were important in disease progression. Several studies have also confirmed that severe COVID-19 is usually accompanied by higher DD [7,19] and LDH [7,19,23]. Model-A may therefore represent the inflammation, coagulation, and metabolic status of COVID-19 patients. Alterations in the inflammatory response, the coagulation system and metabolism, and even hypoxia, in COVID-19 patients could result in a change in Model-A probability. This may be why Model-A, which combines all these markers, gave the best performance in ROC and decision curve analysis for disease severity prediction. Furthermore, monocytes may also play a role in disease progression, which has received little attention before and needs further investigation. Because the numeric ranges of different variables vary widely and variables with larger numeric values would dominate the analysis, monocytes, which were significant in this study, have rarely been considered in previous studies. The process of data standardization is therefore critical in multivariate analysis.
Furthermore, we found that Model-A and the LMN index were significantly associated with age, comorbidity status and CD8+ T cells. It is particularly important that Model-A probability and the LMN index changed significantly during hospitalization and decreased markedly as treatment took effect. They also decreased more sharply in mildly ill patients than in severely ill patients. In 7 of 28 severely ill patients, Model-A first rose and then fell, whereas only 7 of 54 mildly ill patients showed the same trend. These findings suggest that continuous surveillance of Model-A and the LMN index during treatment may be of particular clinical importance. The prediction efficiency of Model-A and the LMN index for disease severity is encouraging. Model-A showed the best prediction efficiency in both the training cohort and the independent validation cohort. The LMN index gave AUCs of 0.837 and 0.800 in the training and validation data, respectively, and performed better than classical markers including LYM%, NEU%, CRP, and IL6. Even compared with NLR, which has recently been reported as a promising predictor of inflammation or severity, the LMN index showed better prediction value in the independent cohort (AUC: 0.800 vs. 0.784). Although the AUC of the LMN index in the training cohort was smaller than those of Alb and NLR, in the clinical setting the LMN index, which depends on more variables, may be more robust than Alb or NLR, both of which are more susceptible to physiological and pathological changes.
The NPV of Model-A was 1 in the training cohort and 0.888 in the validation cohort, so COVID-19 patients with a Model-A probability below the cutoff point may have little probability of developing severe disease. The LMN index also showed a high NPV in both the training and validation cohorts. With these evaluation approaches, health care staff could monitor current COVID-19 patients and focus earlier on the cases at higher risk of progression, which may help to save more lives.
To evaluate the net benefit of the models, decision curve analysis was also performed, and Model-A again showed slightly better performance than Alb and NLR. According to a previous study [24], even a small improvement is still an improvement, so Model-A may bring net benefit for patients. In DCA, the LMN index showed performance comparable to that of Alb and NLR.
Finally, Model-A showed the best prediction efficiency for disease severity of COVID-19 patients, and the LMN index, which depends on four routine blood test markers, is very economical for clinical application. Both therefore have potential for clinical use in COVID-19 treatment and possibly in the treatment of other diseases. This use of SPCA for extracting clinical variables may also suggest a new direction for the application of SPCA.
Our study also has some weaknesses. Clinical characteristics other than laboratory markers, which are also risk factors for disease progression, were not considered in this study. More clinical characteristics should be included in model training in the future. In addition, the sample size was small, which may have affected the statistical results, and bias may exist in the data standardization process, model training and cut-point selection. In the future, enrolling more patients to optimize these processes should yield a more accurate prediction model.

Conclusions
In this study, using SPCA for feature selection and dimensionality reduction, the prediction model Model-A and the LMN index were derived; both showed significant association with clinical outcomes and robust disease severity prediction efficiency in COVID-19 patients. Model-A and the LMN index may have potential for clinical application and may help with patient classification so as to save more lives.