a The P value from the t test of the PSA values between the prostate cancer and control groups.
Data Processing and Statistical Analysis
We used Mirex as the internal standard because it does not occur in human urine. The relative intensity of each VOC peak could then be normalized against that of Mirex, allowing semi-quantitative statistical analysis of VOCs.
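The normalization described above can be illustrated with a minimal Python sketch. The function name, the dictionary layout, and the peak values are hypothetical and chosen only for illustration; the study itself reports using R, and this is not the authors' code.

```python
def normalize_to_internal_standard(peak_intensities, standard_name="Mirex"):
    """Express each VOC peak as a ratio to the internal-standard peak.

    `peak_intensities` maps compound name -> raw peak intensity for ONE sample.
    The layout and names are assumptions made for this sketch.
    """
    standard = peak_intensities.get(standard_name)
    if standard is None or standard <= 0:
        raise ValueError(
            f"internal standard {standard_name!r} is missing or non-positive"
        )
    # Drop the standard itself; every other peak becomes a relative intensity.
    return {
        name: intensity / standard
        for name, intensity in peak_intensities.items()
        if name != standard_name
    }


# Made-up peak intensities for a single urine sample (arbitrary units).
sample = {"Mirex": 2000.0, "VOC_A": 500.0, "VOC_B": 0.0, "VOC_C": 3000.0}
relative = normalize_to_internal_standard(sample)
print(relative)  # {'VOC_A': 0.25, 'VOC_B': 0.0, 'VOC_C': 1.5}
```

A VOC that is absent from a sample keeps a relative intensity of zero, which is the source of the zero inflation handled later in the screening step.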
Over 9000 different VOCs were detected in the urine samples, resulting in a high-dimensional modeling problem. To streamline the analysis, we first removed the VOCs that were observed in less than 3% of the entire population. The remaining variables were screened by testing the difference in each VOC between the PCa-positive and control groups. The Wilcoxon rank-sum test was used because it can accommodate the zero inflation among many VOCs. Heat maps were generated to visualize the significant VOCs (P < .05) in the PCa-positive and control groups. Applying a liberal cutoff of 0.2 to the P values, over 800 VOCs remained for model development. To deal with this p >> n scenario (ie, the number of VOCs is much greater than the number of samples), we fit regularized logistic regression models18 with the LASSO19 penalty, and 10-fold cross-validation was used to select the optimal tuning parameter. The final logistic model was then evaluated via the receiver operating characteristic (ROC) curve and other performance measures on the basis of jackknife prediction,20 which helps alleviate the over-optimism induced by variable selection. Furthermore, the Firth approach was taken to fit the final logistic model in order to achieve bias reduction for the small sample scenario and to deal with the nearly complete separation seen in the data.21 Another R package, OptimalCutpoints, was used to determine the optimal cut point for the diagnostic model corresponding to the maximum Youden Index.22 All statistical analyses were performed using the open-source statistical computing software R.23
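As a rough illustration of the screening step (prevalence filter followed by a per-VOC Wilcoxon rank-sum test), the following Python sketch uses only the standard library. It is not the authors' R code: the function names and toy data are hypothetical, and the P value is computed with a tie-corrected normal approximation without continuity correction, which is an assumption of this sketch and may differ slightly from what an R routine reports.

```python
import math


def prevalence(values):
    """Fraction of samples in which the VOC was detected (value > 0)."""
    return sum(v > 0 for v in values) / len(values)


def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(x, y):
    """Two-sided rank-sum P value via a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    w = sum(midranks(pooled)[:n1])  # rank sum of the first group
    mean = n1 * (n + 1) / 2
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t**3 - t for t in counts.values())
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # all values identical: no evidence of a difference
        return 1.0
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def screen(vocs, is_case, min_prevalence=0.03, p_cutoff=0.2):
    """Keep VOCs detected in >= min_prevalence of samples with P < p_cutoff."""
    kept = {}
    for name, values in vocs.items():
        if prevalence(values) < min_prevalence:
            continue
        cases = [v for v, c in zip(values, is_case) if c]
        controls = [v for v, c in zip(values, is_case) if not c]
        p = rank_sum_p(cases, controls)
        if p < p_cutoff:
            kept[name] = p
    return kept


# Toy data: 5 cases followed by 5 controls; all values are made up.
is_case = [True] * 5 + [False] * 5
vocs = {
    "VOC_A": [0.9, 1.1, 1.3, 0.8, 1.2, 0.0, 0.0, 0.1, 0.0, 0.2],
    "VOC_B": [0.0] * 10,  # never detected, removed by the prevalence filter
    "VOC_C": [0.5, 0.0, 0.3, 0.0, 0.4, 0.4, 0.0, 0.5, 0.3, 0.0],
}
kept = screen(vocs, is_case)
print(sorted(kept))  # ['VOC_A']
```

Because the test works on ranks, the many zero values simply become one large tie group, which is why a rank-based test copes with zero-inflated intensities.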
The study analyzed VOCs in urine samples collected from patients to develop the urine metabolome-based PCa diagnosis model. All VOCs were identified using an existing National Institute of Standards and Technology library, and significant VOCs were selected based on their occurrence and relative quantity in the urine. One mL of each urine sample was extracted by stir bar sorptive extraction, and the VOCs were then analyzed by GC/MS. The quantity of each VOC was normalized to the internal standard, Mirex, by taking the ratio between the signal of the compound and that of Mirex.
A total of 9144 potential VOCs were detected in urine from 108 patients (55 PCa-positive patients and 53 controls). Using the Wilcoxon test, 254 VOCs were found to be positively related to PCa, and 282 VOCs were negatively associated at P < .05 (Figure 1A). To avoid missing important predictors for PCa prevalence, a relatively large threshold (P < .20) was applied to screen variables for further development of the regression model, resulting in a total of 850 potential VOCs. After L1 regularization (ie, LASSO regression), the final logistic model contained 11 VOCs (listed in Table 2), and a heat map was generated to show their differentiation in PCa-positive and PCa-negative patients (Figure 1B). On the basis of predicted probabilities from the final model obtained via jackknife cross-validation, the area under the ROC curve (AUC) was 0.92, with a sensitivity of 0.96 and a specificity of 0.80, as shown in Figure 2A. As a comparison, we also built a logistic model with PSA only, which yielded an AUC of 0.54. This PSA-only model resulted in a sensitivity of 0.60 and a specificity of 0.42 when applied to the test data (Figure 2B). Testing the performance of the PCa diagnostic model using an external cohort of 75 patients (53 patients with PCa and 22 PCa-negative patients) yielded an AUC of 0.86 (Figure 3), with a sensitivity of 87% and a specificity of 77%.
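For readers less familiar with the performance measures reported above, the following Python sketch shows how an AUC and a Youden-optimal cut point (with its sensitivity and specificity) can be computed from predicted probabilities. The scores and labels are made up, the function names are hypothetical, and the code does not reproduce the OptimalCutpoints package or the study's results; a sample is assumed to be called positive when its score is at or above the cut point.

```python
def auc(scores, labels):
    """AUC as the chance that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutpoint(scores, labels):
    """Return (cut, sensitivity, specificity) maximizing J = sens + spec - 1.

    Every observed score is tried as a cut point; a sample is predicted
    positive when score >= cut (a convention assumed for this sketch).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(s >= cut and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < cut and y == 0 for s, y in zip(scores, labels))
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cut, sens, spec)
    return best[1:]


# Made-up predicted probabilities: 5 cases (label 1) then 5 controls (label 0).
scores = [0.9, 0.8, 0.7, 0.6, 0.25, 0.5, 0.3, 0.2, 0.1, 0.35]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

toy_auc = auc(scores, labels)
cut, sens, spec = youden_cutpoint(scores, labels)
print(toy_auc)          # 0.88
print(cut, sens, spec)  # 0.6 0.8 1.0
```

In the study these measures were computed on jackknife (leave-one-out) predictions rather than on in-sample fitted values, so that each patient's score comes from a model fitted without that patient.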