Automated Early Phase Breast Cancer Detection using Hybrid Machine Learning Algorithms

Breast cancer is the most common cancer among women. It occurs when some breast cells begin to grow abnormally. The national average for 2022 is 100.4 cases per 100,000 people, with a large number of women being diagnosed with breast cancer. The objective is to design a prediction system that can predict breast cancer at an early stage using a set of attributes selected from a critical dataset. The Wisconsin dataset from Kaggle is used for this experiment. The goal of this work is to predict breast cancer using hybrid machine learning methodologies that combine SVM and PCA. ML algorithms could help to predict cancer, and early detection of this disease would help to slow down the progression of other diseases. In this paper, we implement a hybrid of PCA and SVM and optimize the SVM with k-fold cross-validation to predict breast cancer at early stages with high accuracy. The goal is to raise the fraction of breast cancers detected at an early stage and to reduce error rates with maximum precision, in a sustainable way.


Introduction
Worldwide, breast cancer affects many women. The World Health Organization estimates that in 2020 almost 2 million women worldwide received a breast cancer diagnosis and 685,000 of them died from the disease, accounting for about 15% of all cancer-related deaths in women. When cancer is detected early, outcomes improve markedly. Therefore, there is always a chance that patients will receive a diagnosis early. Cancer is formed by abnormal growth of fatty and fibrous tissues. Numerous techniques for making an accurate diagnosis of breast cancer have been suggested in light of this. Using a hybrid machine learning technique, a dataset was used to build a fully automatic classification and prediction system for breast cancer.
In this study, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from UC Irvine's machine learning repository is used to identify and categorize two types of cells: benign (non-cancerous) and malignant (cancerous). If the cancer is discovered while it is in the benign stage, successful outcomes can be attained. If the tumour does not spread to other body tissues or organs, it is regarded as benign. If a tumour's cells are able to invade neighbouring tissues, it is regarded as malignant. Therefore, in our paper, we combined PCA and hybrid PCA to find useful components and reduce the dimensions of the data. A Support Vector Machine (SVM), a model that may detect the disease at an early stage, was employed and provided an accuracy of 96%.
The technology revealed a number of risk variables, including cell shape, normal nucleoli, clump thickness, mitoses, bland chromatin, and bare nuclei. It is therefore crucial to use preprocessing techniques to clean up the data before identifying breast cancer. To improve accuracy, the data is typically pre-processed to remove noisy and redundant records and inaccuracies. Using PCA to find the valuable components of the data and reduce its dimensions is a common machine learning strategy that can help to improve the classification model's accuracy by removing redundant or irrelevant features. SVM is a powerful algorithm that is frequently employed for classification tasks and has shown promising results in the detection of breast cancer.
Pre-processing techniques are essential for cleaning up the data and removing noisy or redundant data and errors, which increases the classification model's accuracy. These preprocessing methods include outlier elimination, feature scaling, and data normalization. In addition to the risk factors stated above, other factors such as age, family history, and exposure to environmental factors may also affect the development of breast cancer. Overall, the application of machine learning approaches to breast cancer diagnosis is a promising area of research that has the potential to improve diagnostic precision and efficacy, resulting in better patient outcomes.

Literature survey
The following section describes existing approaches. To improve clustering accuracy and processing time, Hao Quan Lin and Zhengzhou J et al. [1] coupled conventional SOM neural networks with K-means methods. Because the conventional method is unable to reduce running time while keeping accuracy high, the hybrid algorithm improves both running speed and accuracy. The ideal number of clusters was determined using an elbow chart. The Wisconsin Breast Cancer Dataset (WBCD), which includes 569 cases and 32 patient features, was used for this experiment. The findings indicate that SOM and K-Means neural networks perform better in terms of accuracy (95%) and running speed (4.62), making them more suitable for breast cancer prediction.
By combining ensemble learning and Bayes' theorem, Sam Khozama and Ali M. Mayya [2] developed a novel range-based breast cancer prediction model. The goal was to forecast BC on a range of 0% to 100%. The confusion matrix was used in this study to evaluate several parameters. The BCSC dataset, which contains 67632 records and 13 risk variables, was employed in this method. A weighting method is used with this model as well. AdaBoost provides the distinctive ability to express the cancer outcome as a percentage rather than a predetermined binary outcome. It is necessary to compute the updated BCSC dataset. The study concluded that the dataset could be used to develop a Bayesian hyperparameter optimization technique for the ensemble learning model and that the model reached a high accuracy of 91.33%.
For early BC identification, Muawia A. Elsadig [3] recommended using machine learning (ML). The goal was to use seven different classifiers to identify breast cancer at an early stage. The seven classifiers are MLP, RF, DT, NB, SVM, LR, and KNN. The Relief feature selection algorithm was applied in this strategy. For each experiment, the random sampling technique was used 20 times, yielding accurate and realistic findings. The Orange Data Mining software was used for all of the trials. The study found that, compared with the other classifiers, the SVM had the highest accuracy (97.3%). It is unclear whether the results were validated using cross-validation or an external dataset.
Rami Racal [4] investigated four algorithms in order to predict the outcomes of breast cancer using distinct datasets: Logistic Regression, SVM, Random Forest, and KNN. The goal was to distinguish between benign and malignant cases, and the classification methods were parametrized to attain high accuracy. A data mining tool was used to assess the performance. The four datasets used for the study were the Wisconsin breast cancer dataset, the Haberman's Survival dataset, the SEER database, and the Wisconsin Diagnostic Breast Cancer dataset. According to the findings, SVM outperformed all other models (97%).
An automated technique that uses patient health records to forecast the likelihood of developing breast cancer was introduced by Shawni Dutta and her team [5]. In this study, gradient boosting algorithms were used and compared against a variety of other algorithms, including K-Nearest Neighbour (K-NN), Naïve Bayes, Support Vector Machine (SVM), and AdaBoost classifiers, as well as Decision Tree (DT) and Random Forest (RF) classifiers. Accuracy, F1-score, the mean square error, the Cohen-Kappa score, and the Matthews correlation coefficient (MCC) were used to gauge overall performance. The study concluded that the gradient boosting method, which had an accuracy of 97.3%, performed better than all other algorithms and aided in the early detection of breast cancer.
The efficiency of classification algorithms such as KNN, SVM, LR, and NB was compared by Gaurav Singh [6]. The purpose was to examine the accuracy of various ML classification approaches, namely support vector machine, k-nearest neighbour, logistic regression, and Gaussian Naive Bayes, in predicting breast cancer in its early stages. The dataset was divided into training and testing sets, with 80% and 20% of the data respectively. The WBCD data from the UCI Machine Learning Repository was used. The results demonstrate that KNN outperforms the other algorithms, providing an accuracy of 99%.
Muhammad Umer [7] proposed an ensemble learning-based voting classifier that blends logistic regression and stochastic gradient descent classifiers for the precise detection of cancer patients. The voting classifier combines logistic regression with a variety of advanced variables. The effectiveness of ML and the suggested ensemble model were examined using CNN features. In this case, the dataset was obtained from a single source. The findings for the other variables, however, show that using a voting classifier with complex features leads to the best classification accuracy of 100%.
Authors [13] highlighted the significance of ML in prediction, pattern recognition, and error reduction across diverse fields, emphasizing the impact of AI across a broad range of domains.
Using ML and DL algorithms, Krishna Mridha [8] and Anjaneyulu [11] examined a few widely used evaluation techniques for the early identification of breast cancer. ML and DL algorithms were used, and their precision was compared. The work focused on the number of models that were assessed on the dataset. The WBCD dataset, which contains 31 features assessed via breast mass fine needle aspiration (FNA), was used in the trial. ANNs have more parameters that need to be trained. The dynamic architecture and parameter update procedure result in a higher computational cost and a longer training period. Results indicate that the KNN model has the lowest accuracy (91.22%), while the ANN model has the best accuracy (99.73%).
Authors [14] suggested data mining techniques to predict disease prevalence based on symptoms in healthcare data. Appropriate prediction helps healthcare organizations avoid drug shortages and ensures timely treatment of patients. The paper [15] discussed the role of Intelligent Decision Support Systems (IDSS) in healthcare monitoring, especially for heart disease.
To produce accurate forecasts, Shudipti Rani Mondal and her team [9] compared several data mining methods employing classification. The goal was to improve diagnosis through the use of effective techniques such as LR, SVC, KNN, RF, and DT. Deep learning and machine learning approaches for numerical datasets were used to extract hidden features. The most effective method was identified based on tenfold cross-validation. The experiment made use of the WBCD dataset, which was downloaded from the UCI machine learning repository. Results show that, with a prediction rate of 99.2% for breast cancer, logistic regression offers the greatest accuracy.
The paper [12] explores distinct ML applications in predicting heart attacks using patient health records. It compares Random Forest and CNN methods, and the findings showed that Random Forest performed better in terms of accuracy.
Apoorva V, Yogish H K, and Chayadevi ML [10] used machine learning and deep learning ideas to diagnose and cure breast cancer early. The purpose was to forecast and study cancer databases and to improve accuracy by using convolutional neural networks on an image dataset with features extracted from images of breast mass. For the numerical dataset, the Decision Tree (CART), K-Nearest Neighbour (KNN), Naive Bayes, and Support Vector Machine (SVM) algorithms were used. SVM was selected for additional analysis and prediction because the results demonstrate that it provides high accuracy (96%). The CNN, which was built using a sequential API, produced the results for the image data, and the image was predicted to be cancerous. The existing methodologies outlined above are summarized in Table 1.

Problem statement and objectives
Breast cancer mortality rates are increasing due to inadequate knowledge of its early stages. In 2020, 2.3 million women were diagnosed with the disease, and 685,000 died. 7.8 million people had been diagnosed by 2020, highlighting the importance of early detection and prevention. Breast cancer can occur at any age, regardless of gender. Rural areas have higher rates of breast cancer than urban areas, and women with low vitamin D are also at greater risk. Encouraging people to take care of their health is crucial to reducing the mortality rate.
This paper aims to predict early breast cancer stages using sustainable hybrid machine learning algorithms. It selects the best algorithms and converts them to hybrid models. Two kinds of algorithm, supervised and unsupervised, are used to achieve this. The early stage of breast cancer is predicted using classification algorithms chosen for their accuracy, which leads to sustainable systems.

Proposed method
This paper deals with the creation of a sustainable hybrid machine learning model to predict breast cancer at an early stage. To fulfil this requirement, we developed a hybrid model with PCA and SVM and optimized the SVM with k-fold cross-validation to obtain better accuracy and to build sustainable systems. The best algorithms are selected and converted into hybrid models. Two kinds of algorithm are used, supervised and unsupervised, with the early stage predicted using classification algorithms. The model is then tested using a variety of performance metrics, including confusion matrices, F1-score, accuracy, recall, and precision. The final ROC curve gives a clear comparison between the traditional SVM and the hybrid model. This breast cancer diagnosis system mainly consists of 6 modules, namely data collection, preprocessing, feature extraction, hybrid model, training, and results.

Modules and its description
The modules below describe the functioning and procedure of the model at each stage of the process.

Data collection
The WBCD dataset is a popular machine learning and data analysis resource for determining whether breast cancer cells are malignant or benign. It contains 569 instances of breast mass, with 357 labelled benign and 212 labelled malignant. To use it, obtain the publicly available dataset and load it in Python, for example with a library such as Pandas.
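As a minimal sketch of this step, the snippet below loads a copy of the data into a Pandas DataFrame. It assumes the copy bundled with scikit-learn (`load_breast_cancer`) rather than the Kaggle file used in the experiment, so column names and label encoding may differ from the authors' data. The helper name `load_wdbc` and the `diagnosis` column are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def load_wdbc() -> pd.DataFrame:
    """Return the bundled dataset as a DataFrame with a readable label column."""
    bunch = load_breast_cancer(as_frame=True)
    df = bunch.frame.copy()
    # scikit-learn encodes target 0 as malignant and 1 as benign.
    df["diagnosis"] = df["target"].map({0: "malignant", 1: "benign"})
    return df


df = load_wdbc()
class_counts = df["diagnosis"].value_counts().to_dict()
print(df.shape)       # rows, columns (30 features + target + diagnosis)
print(class_counts)   # number of benign and malignant instances
```

Checking the class counts right after loading is a cheap way to confirm that the file matches the 357 benign and 212 malignant instances described above.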

Data Preprocessing
Data pre-processing for breast cancer diagnosis systems involves cleaning and transforming raw data, handling missing values, scaling features, selecting relevant features, and addressing class imbalance. PCA, an unsupervised dimensionality reduction technique, significantly affects system performance by transforming high-dimensional data into a lower-dimensional space.
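Feature scaling matters here because PCA is driven by variance, so a feature measured on a large scale would otherwise dominate the components. The sketch below shows one common form of scaling, z-score standardization, in plain Python. The function name and the toy values are illustrative assumptions; the paper does not state which scaling method was used.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so that the division below never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy rows with two features on very different scales (hypothetical values).
toy = [[10.0, 0.10], [14.0, 0.30], [18.0, 0.20]]
scaled = standardize_columns(toy)
for row in scaled:
    print([round(v, 3) for v in row])
```

After standardization every column has zero mean and unit standard deviation, so no single measurement dominates the later PCA step purely because of its units.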

Feature extraction
Feature extraction is critical in machine learning, especially for large datasets with many features or variables. Principal Component Analysis is a widely used technique for identifying relevant features for accurate tumour classification. By retaining the important principal components, the dimensionality of the dataset can be reduced while most of the information needed is kept. SVM and PCA, together with k-fold cross-validation, were used to create a sustainable hybrid model for breast cancer diagnosis. This approach improves accuracy by reducing the complexity of the data. In k-fold cross-validation, the data is divided into k subsets; the model is trained on k-1 subsets and validated on the remaining subset.
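The combination of PCA, SVM, and k-fold cross-validation described above could be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' configuration: the bundled scikit-learn copy of the dataset, 95% retained variance, an RBF kernel with default hyperparameters, and k = 10 are all choices made here for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling and PCA sit inside the pipeline so that they are re-fitted on the
# training folds only, which keeps validation data out of the fitting step.
hybrid = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # assumption: keep 95% of variance
    SVC(kernel="rbf"),                          # assumption: RBF kernel, defaults
)

# Assumption: k = 10, with stratification to preserve the class ratio per fold.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(hybrid, X, y, cv=folds, scoring="accuracy")
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Placing the scaler and PCA inside the pipeline is a deliberate design choice: fitting them on the full dataset before splitting would let information from the validation fold influence the model.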

Model training
Model training is a crucial step in machine learning; it uses the pre-processed data to predict whether a tumour is cancerous or non-cancerous. In the breast cancer diagnosis system, SVM and a hybrid model were used, with a training split of 0.25 and a testing split of 0.75. The model learned from the training data and made predictions on the testing data.
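A hold-out comparison between a plain SVM and a PCA plus SVM model might look like the sketch below. The 25% test fraction, the 10 retained components, and the default SVM hyperparameters are assumptions for illustration; the split fractions stated in the text can be substituted through `test_size`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Stratify so both splits keep the benign/malignant ratio of the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    # Assumption: 10 principal components; the paper does not give a number.
    "pca+svm": make_pipeline(StandardScaler(), PCA(n_components=10), SVC()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                    # learn from training data
    predictions = model.predict(X_test)            # predict on unseen data
    results[name] = accuracy_score(y_test, predictions)
    print(f"{name}: test accuracy = {results[name]:.3f}")
```

A single split gives a noisy estimate, so differences of a fraction of a percent between the two models should be confirmed with cross-validation before drawing conclusions.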

Result
The final stage involves presenting a diagnosis using visualization techniques to make the results clear. The model uses SVM with k-fold cross-validation and reaches 96% accuracy. Combining PCA with SVM improves the results. Although the machine learning models achieve 96% accuracy, they should be used to support medical professionals' diagnoses and treatment decisions, taking sensitivity, specificity, and potential biases into account.

Description of the dataset
The Breast Cancer Wisconsin (Diagnostic) Dataset, created by Dr. William H, contains ten real-valued features for analyzing breast mass lesions. These features help to classify samples as malignant or benign, describe properties such as symmetry, and aid in breast cancer diagnosis. Understanding these features can lead to more accurate, sustainable, and reliable classification models for breast cancer diagnosis. The dataset contains 569 breast cancer patient samples, categorized as malignant or benign.

Experimental results
The confusion matrix shown in Figure 2 is a table of predicted outcomes against actual values. It is used in machine learning to measure key performance metrics such as precision, recall, the AUC-ROC curve, and accuracy.
The true positive rate of the model is 62.24%, which represents correctly predicted outcomes. The false negative rate of the model is 0.70%, which represents incorrectly predicted outcomes. False negatives are Type II errors. Some of the standard performance measures of the confusion matrix can now be applied to the table. Fig. 3 depicts the classifier's performance results. The ROC curve displays a binary classifier's performance as its threshold changes, plotting TPR against FPR, and the AUC summarizes the curve as a single value. Traditional SVM achieves 98.8% accuracy, while k-folds achieves 99.7% accuracy.
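To make the ROC construction concrete, the sketch below sweeps the decision threshold over a handful of scores and accumulates the area with the trapezoidal rule. The labels and scores are hypothetical values chosen for illustration, not results from the experiment, and the function names are not taken from the paper.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from the highest score down; each distinct score acts
    # as one threshold.
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = malignant) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

In this toy case eight of the nine positive/negative pairs are ranked correctly, which is why the area comes out at 8/9: the AUC can be read as the probability that a randomly chosen malignant sample scores higher than a randomly chosen benign one.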

Significance of proposed work
This study investigates the value and benefits of PCA with SVM and k-fold cross-validation for early breast cancer prediction. Machine learning techniques enhance early detection, diagnosis, and treatment by spotting trends and forecasting outcomes.

Principal component analysis (PCA)
PCA is an unsupervised dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space. Retaining the most important principal components reduces the dimensionality of the dataset while keeping most of the information needed for classification.
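The dimensionality reduction that PCA performs can be written compactly. The notation below is an assumption introduced for this illustration and does not appear in the paper: $x_i \in \mathbb{R}^d$ are the $n$ feature vectors, $S$ is their sample covariance matrix, and $k$ is the number of retained components.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
S &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i-\bar{x})(x_i-\bar{x})^{\top}, \\
S\,w_j &= \lambda_j w_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, &
z_i &= W_k^{\top}(x_i-\bar{x}), \quad W_k = [\,w_1 \;\cdots\; w_k\,].
\end{aligned}
```

The fraction of variance kept by the first $k$ components is $\sum_{j=1}^{k}\lambda_j \big/ \sum_{j=1}^{d}\lambda_j$, which is one common way of choosing $k$.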

Support vector machines (SVM)
SVM is a strong machine learning technique for classification and regression applications that excels at binary classification tasks such as breast cancer detection. It offers high accuracy, robustness to noise, and the ability to handle high-dimensional data.
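For reference, the standard soft-margin formulation of a linear SVM is given below. The symbols are assumptions for this illustration rather than notation from the paper: $x_i$ are feature vectors, $y_i \in \{-1, +1\}$ are class labels (for example benign and malignant), $\xi_i$ are slack variables, and $C > 0$ is the penalty parameter that trades margin width against training errors.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\dots,n, \\
\text{prediction}\quad & \hat{y}(x) = \operatorname{sign}(w^{\top}x + b).
\end{aligned}
```

Kernel SVMs replace the inner product $w^{\top}x$ with a kernel expansion over the support vectors, which allows non-linear decision boundaries without changing the structure of the optimization problem.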

Advantages of the model
A hybrid model effectively handles missing data and imbalanced datasets by incorporating multiple models. PCA reduces noise and outliers, improving model robustness. Combining SVM with PCA for breast cancer prediction offers improved accuracy, interpretability, robustness to noise, and reduced training time. With the ROC curve and k-fold cross-validation, the performance of the SVM and its ability to generalize to new data can be assessed, reducing the risk of overfitting.

Conclusion and future enhancement
To build a sustainable hybrid algorithm for breast cancer prediction, this study combines the benefits of PCA (principal component analysis) with SVM (support vector machine). The Wisconsin Breast Cancer dataset (WBCD) is cleaned, and the PCA algorithm is applied to reduce dimensionality. SVM is optimized using k-folds, resulting in 96.5% accuracy. The hybrid SVM algorithm outperforms traditional SVM, with a 90% improvement in accuracy, precision, recall, and F1-score. This study concludes that hybrid machine learning algorithms are better than traditional approaches at accurately predicting breast cancer in its early stages, potentially saving millions of lives. The proposed methodologies provide accurate results on the Wisconsin Breast Cancer dataset. Future enhancements should involve applying different variables and testing on different datasets to ensure consistent results. Mammographic data, such as mammographic images, could be used to identify cancerous cells. Sustainable computer-aided diagnosis can improve accuracy and time efficiency in detecting breast cancer. This approach makes it easier to detect disease and provides accurate results in less time.

E3S Web of Conferences 430, 01035 (2023), ICMPC 2023, https://doi.org/10.1051/e3sconf/202343001035
• Accuracy: Accuracy is a typical performance statistic used in machine learning to evaluate the success of a classification model. It computes the proportion of examples in a dataset that are correctly classified by the model.
• Precision: Precision is a machine learning metric that measures how well a classification model performs. It computes the proportion of true positives (predicted as positive and actually positive) among all positive predictions made by the model.
• Recall: Recall is a machine learning performance statistic that quantifies the proportion of actual positive cases that a model correctly detected.
• F1-score: The F1-score is a popular statistic in machine learning for assessing the performance of a classification model. It combines a model's precision and recall into a single value, their harmonic mean.

• K-fold cross-validation: In machine learning, the K-fold cross-validation technique is often used for model evaluation and hyperparameter tuning. The original dataset is partitioned into k folds of equal size.
Figures 4, 5, 6, 7, and 8 show the graphs for accuracy, recall of k-folds, precision score, F1-score, and the ROC curve.
Fig. 7. F1-score of K-folds.
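The accuracy, precision, recall, and F1-score definitions listed above can be computed directly from the four confusion-matrix counts. The sketch below does this in plain Python; the labels and predictions are hypothetical values for illustration, not the experiment's data, and the function names are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        # F1 is the harmonic mean of precision and recall.
        "f1": 2 * precision * recall / denominator if denominator else 0.0,
    }


# Hypothetical labels (1 = malignant) and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

In a diagnostic setting recall deserves particular attention, because a false negative means a malignant case was missed, whereas a false positive usually leads to further testing.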

Table 1. Summary of existing approaches.