Non-invasive detection of silicosis based on array sensing and pattern recognition

Silicosis is a fibrotic lung disease caused by inhalation of silica dust, and its early and accurate diagnosis remains a challenge. We aimed to assess the performance of a nanofiber sensor array combined with pattern recognition for the prompt, noninvasive detection of silicosis. A total of 210 silicosis cases and 430 non-silicosis controls were enrolled in a cross-sectional study. Exhaled breath was analysed by a portable analytical system incorporating an array of 16 organic nanofiber sensors. Classification models were built with a Deep Neural Network and eXtreme Gradient Boosting. Linear Discriminant Analysis was used for dimensionality reduction and data visualization. Receiver operating characteristic (ROC) curves, accuracy, sensitivity and specificity were used to evaluate the models. Results: an AUC of 99.3%, an accuracy of 96.0%, a sensitivity of 94.1%, and a specificity of 96.3% were achieved on the test set. Silicosis cases presented breath patterns different from those of healthy controls, and classification based on these patterns was highly consistent with the experts' diagnosis. Breath analysis performed with the sensor array and pattern recognition is expected to provide quick, stable recognition of silicosis. This paper compares different feature forms and algorithms on data sets collected over long time periods, providing a reference for breath-based silicosis diagnosis schemes.


Introduction
Silicosis is one of the most critical occupational diseases worldwide [1,2]; it is incurable and becomes less treatable in its late stages [3]. Early detection of silicosis is critical for improving life expectancy, as prompt intervention significantly improves prognosis. At present, diagnosis relies mainly on X-ray imaging [4], which is of limited use for early detection [5,6], or on invasive biopsies.
In recent years, breath analysis has shown potential for detecting diseases (e.g., cancers) by monitoring changes in the volatile organic compounds (VOCs) released in human breath or skin emanations [7,8,9]. However, most breath studies have relied on conventional bench-top analytical systems such as gas chromatography-mass spectrometry (GC-MS). These instruments are complicated and cumbersome, require skilled operators, and are unsuitable as point-of-care tools. To address this challenge, electronic noses (e-noses) based on chemical sensor arrays have been developed and have shown great potential for the quick diagnosis of diseases [10,11]. In 2018, Yang et al. performed breath tests using the Cyranose™ to detect asbestosis, achieving an accuracy of 70.0% in cross-validation with a sensitivity of 66.7% [12]. Despite the relatively small cohort, that study showed the potential of a non-invasive tool for screening pneumoconiosis. In this context, we used a portable system to discriminate silicosis patients from healthy miners via a sensor array and pattern recognition in a cohort of 640 subjects. To assess and improve model effectiveness, we explored four forms of data features in combination with two algorithms. Some exploratory analysis was also carried out to provide a reference for potential clinical use.

Study design and subjects
In this single-center, cross-sectional study, breath patterns were obtained from 640 miners with an e-nose. The case group comprised 118 silicosis patients and 92 suspected silicosis patients; 430 miners with no lung disease served as healthy controls. Breath samples were collected with a standardized process covering behaviour requirements, breath collection and pretreatment to reduce interference. Extreme Gradient Boosting (XGBoost) and Deep Neural Network (DNN) were used to build classification models with four forms of features. Linear discriminant analysis (LDA) was used to visualize the classification performance. In model construction, 40% of the samples were set aside as the test set. The main research process is shown in Figure 1.
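The 40% hold-out described above can be illustrated with a minimal sketch. The paper does not state how the split was drawn, so the per-class (stratified) sampling, the fixed seed, and the function name `stratified_split` are assumptions made here for illustration only.

```python
import random


def stratified_split(labels, test_fraction=0.4, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class.

    Stratification and the seed are illustrative assumptions, not the
    procedure reported in the study.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Cohort sizes from the study: 210 cases (label 1) and 430 controls (label 0).
labels = [1] * 210 + [0] * 430
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 384 256
```

Under these assumptions the test set would hold 84 cases and 172 controls, preserving the case-to-control ratio of the full cohort.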
Behaviour requirements: subjects were asked not to eat, smoke, or take medicine for 10 hours prior to breath collection. They were also asked not to eat onions, garlic or other strong-smelling food, and not to enter dusty environments, for two days prior to collection. Subjects stayed in a naturally ventilated environment and did not exercise within an hour before sampling. Subjects rinsed their mouths with purified saline and then with distilled water before sampling.
Figure 1. Schematic of study processes

Breath sampling and pretreatment
The breath collection site was well-ventilated over the whole analysis period to maintain clean ambient air as the reference. The collection system included a mouthpiece, tubing, and a Tedlar™ sample bag, all of which were made of polytetrafluoroethylene. All materials were disposable, and the gas path was kept as streamlined and short as possible.
The collected breath was stored in Teflon bags and pretreated by the self-made system shown in Figure 2. The exhaled breath was analyzed along with the ambient air for 2 minutes.
Figure 2. Pre-processing flow diagram

Data preprocessing
Min-Max normalization was performed in Python 3.7.1 (PyCharm 2021.1 x64). Four forms of features were extracted: the median value of the breath exposure, the median difference between the ambient air and breath data, the Pearson's correlation coefficient (PCC) between the ambient air and breath data, and a combination of the two best-performing features (median and PCC).
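A minimal sketch of how such features might be computed is given here. Several details are assumptions, because the text does not specify them: each sensor is taken to yield one ambient-air trace and one breath trace of equal length, the median difference is taken as the breath median minus the air median, and the PCC is computed per sensor between the two traces. The helper names and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, median


def min_max(values):
    """Scale values linearly to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant trace; 0.0 is a convention chosen here
    return cov / sqrt(var_x * var_y)


def sensor_features(air, breath):
    """Median, median difference and PCC for one sensor (assumed definitions)."""
    med = median(breath)
    return {
        "median": med,
        "median_diff": med - median(air),
        "pcc": pearson(air, breath),
    }


def combined_vector(air_traces, breath_traces):
    """Concatenate per-sensor medians and PCCs into one feature vector."""
    feats = [sensor_features(a, b) for a, b in zip(air_traces, breath_traces)]
    return [f["median"] for f in feats] + [f["pcc"] for f in feats]


# Hypothetical two-sensor recording (made-up numbers, not study data).
air = [[1.0, 1.1, 1.0, 1.2], [2.0, 2.0, 2.1, 2.1]]
breath = [[1.5, 1.7, 1.5, 1.9], [2.4, 2.3, 2.2, 2.1]]
print(combined_vector(air, breath))
print(min_max([3.0, 5.0, 4.0]))
```

For the 16-sensor array, the combined vector in this sketch would have 32 entries per subject. Whether normalization was applied to the raw traces or to the extracted features is not stated, so the two steps are kept as separate helpers.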

Results
In total, 210 consecutive silicosis cases and 430 non-silicosis controls were enrolled in this cross-sectional study. Two-dimensional LDA projections of the median breath data are shown in Figure 3. In the LDA plot, points of the same group clustered together. The points of "Confirmed silicosis" and "Suspected silicosis" could not be clearly separated from each other by a straight line, but both were separated from those of "Healthy control" along axis X1. The distributions of "Healthy control" and "Confirmed silicosis" overlapped little, while some "Suspected silicosis" points fell within the "Healthy control" region.
Figure 3. Two-dimensional projection images with the median of breath data using linear discriminant analysis
In model construction, 40% of the samples were set aside as the test set. Classification results for DNN and XGBoost with the four forms of extracted features are shown in Table 1, for both the training and test sets. For the optimal classifier, all four evaluation indices on the test set exceed 90%; sensitivity is the weakest of them, especially in the models using DNN.
Receiver operating characteristic (ROC) curves for the test-set models built with different algorithms and data features are shown in Figure 4, which visualizes the AUC. The ROC analysis is consistent with the results in Table 1. In both Table 1 and Figure 4, models using the combined median and PCC feature, and models using the XGBoost algorithm, outperformed the others.
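The four evaluation indices used above follow from standard definitions, which the sketch below spells out. The 0.5 decision threshold, the function names, and the example labels and scores are illustrative assumptions, not values from the study. The AUC is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen control, which equals the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = case).

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up classifier scores for three cases and five controls.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Because sensitivity and specificity depend on the chosen threshold while the AUC does not, moving the threshold trades one against the other without changing the ROC curve itself.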

Discussion
In both the training and test sets, silicosis cases were recognized with high accuracy. The best classifier used XGBoost with the combined median and PCC feature, achieving an AUC of 99.3%. DNN is a powerful deep learning classification algorithm, but it did not perform well in this study, probably because the number of samples or the data dimensionality was well below the level at which multilayer networks and big-data methods are effective. In terms of feature extraction, the PCC not only contains exposure information on the exhaled breath and the air baseline, but also uses their relationship to compensate for static characteristics of the sensors, such as signal drift. The combined median and PCC feature can therefore perform well on data sets spanning long time periods. An appropriate balance between sensitivity and specificity is expected to be found for specific application cases.
In addition, identifying subjects with suspected silicosis appears feasible, because these cases were easily distinguished from healthy controls in their breathprints, as indicated by the LDA plots. LDA is a supervised dimensionality reduction technique. It projects the data onto a lower-dimensional space, choosing the projection direction that gives the best class separation. Its goal is to minimize intra-class variance and maximize inter-class variance, which makes it especially suitable for data sets containing samples of different categories.
Furthermore, it is notable that the subjects in this study covered wide age ranges, various smoking habits and even some benign lesions, yet the classifiers still showed stable and strong performance in differentiating the two groups, suggesting that these demographic variables may have less influence on breathprints than silicosis status does. This result indicates that breath analysis may be applicable to screening diverse subjects. In this study, the population was not divided into the three categories named above for classification, which limits what can be concluded about early screening and warning with this technology. We made this choice because the three sets would have differed greatly in sample size.
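The LDA objective described above can be made concrete with a simplified sketch: Fisher's discriminant for two classes and two features, where the projection direction is w = Sw^-1 (m_b - m_a), with Sw the pooled within-class scatter matrix and m_a, m_b the class means. This is a reduced illustration only. The study projected three groups onto two axes, and the function names and data points here are hypothetical.

```python
def mean_vector(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, centre):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around centre."""
    sxx = sxy = syy = 0.0
    for x, y in points:
        dx, dy = x - centre[0], y - centre[1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    return sxx, sxy, syy


def fisher_direction(class_a, class_b):
    """Direction maximizing between-class over within-class variance."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # pooled within-class scatter
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Multiply the inverse of [[sxx, sxy], [sxy, syy]] by the mean difference.
    return ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def project(points, w):
    """Project 2-D points onto direction w, giving one score per point."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Made-up two-feature samples, not study data.
controls = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.8, 2.1)]
cases = [(3.0, 3.5), (3.4, 3.1), (2.9, 3.9), (3.3, 3.6)]

w = fisher_direction(controls, cases)
print(project(controls, w))
print(project(cases, w))
```

In this toy example every projected control score lies below every projected case score, which is the kind of separation along one axis that the LDA plots are described as showing for healthy controls versus silicosis cases.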

Conclusion
Breath analysis was investigated as a method for diagnosing silicosis with an array of chemical sensors.
This study developed a breath testing system that showed excellent performance in a single-center study. Our investigation supports the hypothesis that silicosis cases share similarities expressed in their breath patterns, and that these patterns are distinct from those of healthy miners. This study serves as a proof of concept that breath analysis with cross-reactive sensor arrays and pattern recognition can enable silicosis screening.
We thank the participants for volunteering in this study, and the doctors and nurses from the local Institute of Occupational Disease Prevention for helping organize the sampling of the subjects.