Network Intrusion Detection using ML Techniques for Sustainable Information System

Network intrusion detection is a vital element of cybersecurity, focusing on the identification of malicious activities within computer networks. With the increasing complexity of cyber-attacks and the vast volume of network data being generated, traditional intrusion detection methods are becoming less effective. In response, machine learning has emerged as a promising solution to enhance the accuracy and efficiency of intrusion detection. This abstract provides an overview of the use of machine learning techniques in intrusion detection and the associated benefits. The base paper explores various machine learning algorithms employed for intrusion detection and evaluates their performance. Its findings indicate that machine learning algorithms significantly improve intrusion detection accuracy compared to traditional methods, achieving an accuracy rate of approximately 90 percent. However, the previous work faced computational challenges because the algorithm it used was time-consuming when processing datasets. In this paper, we propose the use of more efficient algorithms to process datasets, resulting in reduced processing time and increased precision compared to other algorithms, in support of sustainability. This approach proves particularly useful when computational resources are limited or when the relationship between features and target variables is relatively straightforward.


Introduction
The first line of defense against security attacks is intrusion detection. Therefore, studies are paying more attention to security solutions including firewalls, intrusion detection systems (IDS), unified threat management (UTM), and intrusion prevention systems (IPS). An IDS gathers data, analyses it for potential security breaches, and then detects attacks on a range of systems and network sources. A network-based IDS performs two types of analysis on the data packets that move across a network.
Anomaly-based detection continues to be a key area of research, yet it is still far behind signature-based detection today. Anomaly-based intrusion detection is difficult because it must deal with novel attacks for which there is no prior knowledge to identify the anomaly. Since the system must be able to distinguish between benign traffic and malicious or abnormal traffic, researchers have studied machine learning approaches over the past few years.
Current technology is also immature and inefficient. Anomaly-based network IDS have not achieved the same level of commercial success as signature-based network IDS and have not been as widely adopted by technology-based organizations worldwide. Because of this, anomaly-based detection is currently a crucial area of study and development in the field of IDS. Important problems still need to be resolved before anomaly-based intrusion detection systems are widely deployed. Moreover, the literature comparing what intrusion detection achieves with different supervised machine learning algorithms is limited today. The operation of an IDS is depicted in Fig. 1.
A network intrusion detection system, or NIDS, keeps track of network activity and spots suspicious activity in a computer network. Its main objective is to assist in the detection of network-based intrusions and attacks. It inspects the content and other pertinent information in network packets, which are small units of data that move across a network, to find aberrant or unexpected behavior. To monitor the traffic, a NIDS can be installed as an appliance or as software at key network nodes such as network gateways, firewalls, or switches. As the number of IoT devices has grown, machine learning techniques have emerged as a viable approach to improve the accuracy and efficiency of intrusion detection in support of a sustainable information system.
NIDS improvements and difficulties - The goal of a Network Intrusion Detection System (NIDS) is to find suspicious activity in computer networks. One of the primary phases in a NIDS that employs machine learning is training the model on a training dataset. The main preparation steps are as follows: assign numerical values to categorical variables through label encoding; use feature selection to choose the essential attributes that accurately characterize the data; scale features that differ in magnitude, range, and units to make them uniform; and divide the dataset into training and test sets to evaluate the model and optimize its performance.
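The preparation steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' pipeline: the records, the column names (protocol, duration, bytes, flag) and the 67/33 split are assumptions made for illustration, and dropping zero-variance columns is only a stand-in for a real feature selection method. In practice, scikit-learn's LabelEncoder, MinMaxScaler, and train_test_split cover the same steps.

```python
import random
from statistics import pvariance

# Toy connection records; the column names are illustrative assumptions.
records = [
    {"protocol": "tcp", "duration": 12.0, "bytes": 4800.0, "flag": 1.0, "label": "normal"},
    {"protocol": "udp", "duration": 0.5, "bytes": 120.0, "flag": 1.0, "label": "normal"},
    {"protocol": "tcp", "duration": 300.0, "bytes": 90000.0, "flag": 1.0, "label": "attack"},
    {"protocol": "icmp", "duration": 0.1, "bytes": 64.0, "flag": 1.0, "label": "attack"},
    {"protocol": "tcp", "duration": 45.0, "bytes": 15000.0, "flag": 1.0, "label": "normal"},
    {"protocol": "udp", "duration": 2.0, "bytes": 700.0, "flag": 1.0, "label": "attack"},
]


def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# 1. Label encoding of the categorical feature and of the target.
protocol_codes, protocol_map = label_encode([r["protocol"] for r in records])
targets, target_map = label_encode([r["label"] for r in records])

# 2. Feature selection stand-in: drop columns that never vary.
columns = {
    "protocol": [float(c) for c in protocol_codes],
    "duration": [r["duration"] for r in records],
    "bytes": [r["bytes"] for r in records],
    "flag": [r["flag"] for r in records],
}
selected = {name: col for name, col in columns.items() if pvariance(col) > 0.0}

# 3. Scaling so that all retained features share the range [0, 1].
scaled = {name: min_max_scale(col) for name, col in selected.items()}

# 4. Train/test split with a fixed seed for reproducibility.
feature_names = sorted(scaled)
rows = [[scaled[name][i] for name in feature_names] for i in range(len(records))]
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
cut = int(0.67 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]
X_train = [rows[i] for i in train_idx]
y_train = [targets[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [targets[i] for i in test_idx]
```

Fitting the scaler on the training portion only, and then applying it to the test portion, would avoid leaking test statistics into training; the sketch scales everything at once purely for brevity.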
Intruder identification is the most fundamental IDS task. It is beneficial for conducting forensic investigations into incidents, and applying the necessary fixes enables the detection of future attack attempts directed at specific people or resources. False alarms from intrusion detection can occasionally come from things like a broken network interface or the emailing of attack descriptions or signatures. The fundamental component of an IDS is a sensor (an analysis engine) that is responsible for detecting intrusions. The sensor contains the mechanisms for making decisions about intrusions. Sensors receive raw data from three main information sources (Fig. 2): the IDS's own knowledge base, syslog, and audit trails. Syslog may include file system settings, user authorizations, and other information. This data serves as the foundation for subsequent decision-making.

Fig. 2. A sample IDS. The arrow width is proportional to the amount of information flowing between system components [6]
The sensor is combined with an event generator, a data gathering component (Fig. 3). The collection strategy is controlled by the event generator policy, which specifies the filtering method for event notification information. The event generator produces a set of events that adhere to the policy; these events could be network packets or system log events. This set, along with the policy data, may be kept outside or on a protected system. In some situations, such as when event data streams are sent straight to the analyzer, no data storage is used at all. This applies in particular to network packets.

Fig. 3. IDS components [7]
The agent structure approach, which is used by many advanced systems, organises small independent modules on a per-host basis across the protected network [8]. The agent's job is to monitor and filter all activity taking place within the protected region. Depending on the strategy used, the agent may also perform an initial analysis or even take remedial action. For the identification of distributed attacks in particular, a distributed IDS (DIDS) can make use of more advanced analysis tools [9]. Another relevant role of the agent is its ability to move across many physical locations. This is a key consideration when establishing protection measures against various sorts of attacks [10]. Agent-based IDS solutions also employ less complex procedures for modifying the response policy [11]. Additionally, filters for data selection and aggregation may be included [12], [10].

Literature Survey
The Internet has ingrained itself into modern life. Current internet-based information processing systems are exposed to a variety of risks, which result in a variety of damages and losses. Information security reduces the risks associated with confidentiality, integrity, and availability, the three basic security objectives. Intrusion detection systems (IDS) are the most crucial among the available security mechanisms since they successfully fend off outside threats. IDSs offer a line of defense that thwarts attacks on computer systems through the Internet. According to their methods of detection, IDSs are typically divided into two categories: anomaly detection systems and misuse detection systems. Anomaly intrusion detection identifies deviations from established normal usage patterns as intrusions. On the other hand, techniques for detecting misuse of permissions are effective. Most IDSs operate in two phases: a preprocessing phase and an intrusion detection phase. Intelligent IDSs are regarded as intelligent computer programs, running on either a host or a network, that can analyze their environment and act in a variety of ways in order to increase detection accuracy.
H. Song et al.'s study [3] explores the relationship between macro-level opportunity indicators and cyber-theft victimization. The study found a relationship between structural factors, such as unemployment and non-urban population, and the location of internet users. It also found a positive correlation between state-level counts of cyber-theft victimization and the percentage of users who access the internet only at home. The theoretical implications of these findings are examined.
According to a model proposed by P. Alaei and F. Noorbehbahani [4], as the internet has grown and more people around the world have gained access to online media, cybercrime has likewise become more frequent. An IDS, on the other hand, automates the network monitoring procedure. Building an IDS is extremely difficult because of the streaming nature of computer network data. The researchers suggested using online classification on datasets as a way to solve this issue. The suggested approach consists of two categories of operations, namely offline and online. Using the NSL-KDD standard dataset, the suggested technique is compared with the incremental naive Bayesian classifier.
M. Sabre [5] presented a method for evaluating an IDS based on gauging the effectiveness of its components. A traffic generator and attacks based on Kali Linux (BackTrack) and Metasploit 3 were then used to test it. The findings indicate a strong correlation between the properties of these components and IDS performance.

Definitions
1. Intrusion: An unauthorized attempt to access or compromise a computer system.
2. NIDS: A security system that monitors network traffic and detects malicious activities within a computer network.
3. Anomaly-based detection: A technique used by NIDS to identify abnormal network behavior that deviates from statistical models.
4. False positive: A situation where normal network activity is classified as suspicious or anomalous.
5. False negative: A situation where a NIDS fails to detect an actual intrusion or attack.
6. Packet: A unit of data transmitted over a network, consisting of a header (routing and control information) and a payload (the actual data).
7. VPN: A Virtual Private Network is a technology that provides encrypted communication over a public network, often used to protect network traffic from unauthorized access.
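The false positive and false negative definitions can be made concrete by counting them against ground-truth labels and deriving the usual detection metrics from those counts. The following is a minimal sketch; the label strings ("attack", "normal") and the toy predictions are assumptions made for illustration and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive and truth != positive:
            fp += 1  # normal activity flagged as an intrusion
        elif pred != positive and truth == positive:
            fn += 1  # actual intrusion that went undetected
        else:
            tn += 1
    return tp, fp, tn, fn


def detection_metrics(tp, fp, tn, fn):
    """Derive standard metrics, guarding against empty denominators."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical ground truth and predictions for eight connections.
y_true = ["normal", "normal", "attack", "attack", "normal", "attack", "normal", "normal"]
y_pred = ["normal", "attack", "attack", "normal", "normal", "attack", "normal", "normal"]

counts = confusion_counts(y_true, y_pred)
metrics = detection_metrics(*counts)
print(counts, metrics)
```

On an imbalanced dataset, accuracy alone can look high even when recall is poor, which is why the false alarm rate and recall are reported separately here.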

User characteristics
Understanding the user characteristics is important for NIDS deployment, as it helps in defining user roles, access privileges, and responsibilities. It also helps in effective collaboration between the various stakeholders involved in NIDS management, incident response, and network security.
Network Administrators: Network administrators are responsible for the overall management of the network infrastructure on which the NIDS is deployed. They should have a solid understanding of the network architecture, protocols, and security measures.
System users: System users are the individuals or entities who use the network protected by the NIDS. Users should have a basic knowledge of how to provide input to the product so that it can predict the network condition.
System analysts: They play a critical role in analyzing the alerts and reports generated by the NIDS, so they should be able to distinguish between legitimate network activities and potential intrusions.
Auditors: They evaluate the overall security posture of the NIDS deployment. They should possess in-depth knowledge of network security, intrusion detection principles, and machine learning techniques so that they can assess its effectiveness and adherence to security standards.

Data Analysis
Exploratory Data Analysis (EDA) is a key step in building machine learning models. EDA helps to evaluate the datasets and to summarize their statistical features [18][19][20][21]. The workflow of the proposed methodology is represented in Fig. 4.

Dataset
Computational Resources: Random Forest algorithms can be computationally intensive when dealing with large datasets. Building a Random Forest model for NIDS requires substantial computational resources, including memory and processing power.
Hyperparameter tuning: Random Forest models have hyperparameters that have to be tuned carefully for optimal performance. Finding the right combination can be time-consuming and requires considerable skill in machine learning techniques.
Imbalanced datasets: If the NIDS dataset is imbalanced, with a significant disparity between the numbers of normal and anomalous instances, the imbalance must be handled properly, for example through resampling techniques or class weighting.
Feature Selection: Irrelevant and redundant features can negatively impact the model's performance or increase training and detection times; hence pre-processing and feature selection techniques are needed to optimize the feature set used by the Random Forest model. The importance of the features is presented in Figure 5.
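The two remedies named for imbalanced datasets, class weighting and resampling, can be sketched without any external library. The weighting heuristic below (n_samples / (n_classes * class_count)) is the one scikit-learn uses for its "balanced" class-weight option; the 90/10 label distribution and the function names are assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(labels, seed=0):
    """Return sample indices in which every class reaches the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        resampled.extend(idxs)
        # Draw extra indices with replacement until the class is at target size.
        resampled.extend(rng.choices(idxs, k=target - len(idxs)))
    return resampled


# Hypothetical label distribution: 90 normal records, 10 attacks.
labels = ["normal"] * 90 + ["attack"] * 10

weights = balanced_class_weights(labels)
resampled_indices = random_oversample(labels)
resampled_counts = Counter(labels[i] for i in resampled_indices)
print(weights, resampled_counts)
```

Oversampling should be applied to the training split only; duplicating records before splitting would place copies of the same record in both the training and the test set.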

Assumptions and Dependencies
In the development of the project, we must make some assumptions that may not be completely true, and these assumptions should be carefully considered while implementing the NIDS using Random Forest. These assumptions and dependencies play a major role in shaping the functionality and limitations of intrusion detection systems.
Availability of labeled data: Random Forest models require labeled training data, where each instance is assigned a class label indicating whether it is normal or malicious. We must assume that a sufficient amount of labeled data is available for training the model.
Assumption of independent features: Random Forest assumes that the features used for training the model are independent or weakly correlated. This assumption allows the Random Forest algorithm to make unbiased and accurate predictions. Figure 6 shows the correlations among the features.
Homogeneity of Network Traffic: The Random Forest model assumes that the network characteristics are stable and do not change with time. Sudden changes in the network may reduce the model's ability to generalize effectively.
Assumption of Feature Importance Hierarchy: The Random Forest model assigns importance scores to features based on the accuracy of classification. It is assumed that features with higher importance scores contribute more to accurate classification.
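The weak-correlation assumption can be checked before training by computing pairwise Pearson correlations and flagging strongly correlated feature pairs. This is a minimal sketch, not the computation behind Figure 6: the feature names (src_bytes, dst_bytes, duration), the values, and the 0.9 threshold are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.9):
    """List feature pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical feature columns for five connections.
features = {
    "src_bytes": [100.0, 200.0, 300.0, 400.0, 500.0],
    "dst_bytes": [210.0, 390.0, 610.0, 790.0, 1010.0],
    "duration": [5.0, 1.0, 4.0, 2.0, 3.0],
}

flagged = highly_correlated_pairs(features)
print(flagged)
```

One feature from each flagged pair is a candidate for removal during feature selection, since it adds little information that the other does not already carry.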

Design
The main aim of our paper is to detect intrusions, notify the user whenever there is an intrusion attempt, and alert the security system. We used the Random Forest algorithm to classify the data packets in the network traffic as normal or suspicious. We use feature selection to choose the best features for classifying the data. Our model supports multiclass problems. The chance of overfitting is lower because of the ensemble of decision trees. This application will be very helpful for detecting intrusions on a daily basis.
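The design described above (multiclass Random Forest classification with feature selection) might look roughly as follows in scikit-learn. This is a sketch under stated assumptions rather than the authors' implementation: synthetic three-class data from make_classification stands in for labeled traffic records, and the hyperparameters (100 trees, balanced class weights, top five features) are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for labeled traffic records.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=3,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An ensemble of decision trees; class_weight="balanced" offsets class imbalance.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

# Rank features by impurity-based importance and keep the five highest.
ranked = sorted(
    enumerate(model.feature_importances_), key=lambda pair: pair[1], reverse=True
)
top_features = [index for index, _ in ranked[:5]]
print(f"accuracy={accuracy:.3f}, top features={top_features}")
```

The model could then be retrained on the selected columns only (X_train[:, top_features]) to reduce training and detection time, at the cost of re-validating accuracy on the reduced feature set.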

IDS configuration module
The first step in the IDS configuration process is to identify the requirements for the system. This may involve identifying the types of threats the system should detect, the network topology, and the operating environment. The next step involves configuring the network sensors that will collect data from the network. This may involve setting up network taps, configuring network switches to mirror traffic, or deploying specialized network sensors. Once the network sensors are in place, the next step is to define the IDS rules that will be used to detect potential threats. This may involve creating custom rules, using predefined rule sets, or leveraging machine learning algorithms to detect anomalies. The final step is to configure the alerting and reporting capabilities of the IDS. This may involve defining the severity levels of alerts, configuring alert notifications, and setting up reporting features.

IDS rule module
The first step in the IDS rule module is to define the set of rules that will be used to detect potential threats. These rules may be based on known attack signatures, traffic patterns, or other indicators of suspicious activity. Once the rules are defined, the system prioritizes them based on their relevance to the organization's security policies and the severity of the potential threat. The system then applies the rules to the network traffic captured by the IDS. Each rule is evaluated against the traffic to identify any matches or violations. When a potential threat is detected, the system generates an alert. The alert includes details about the type of threat, the source and destination of the traffic, and the severity of the threat.
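The rule workflow described above (define rules, prioritize them, evaluate each against traffic, and emit alerts) can be sketched as follows. The Packet, Rule, and Alert types, the two example rules, and the addresses are hypothetical and exist only for illustration; a production IDS would use a dedicated rule language and parsed protocol fields.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dst_port: int
    payload: bytes


@dataclass(frozen=True)
class Rule:
    name: str
    severity: int  # higher value means a more severe threat
    matches: Callable[[Packet], bool]


@dataclass(frozen=True)
class Alert:
    rule: str
    severity: int
    src: str
    dst: str


# Hypothetical rule set: one port-based rule and one payload signature.
RULES = [
    Rule("telnet-access", 2, lambda p: p.dst_port == 23),
    Rule("sql-injection-pattern", 5, lambda p: b"' OR 1=1" in p.payload),
]


def evaluate(packets, rules):
    """Apply rules in descending severity order and collect alerts."""
    ordered = sorted(rules, key=lambda r: r.severity, reverse=True)
    alerts = []
    for packet in packets:
        for rule in ordered:
            if rule.matches(packet):
                alerts.append(Alert(rule.name, rule.severity, packet.src, packet.dst))
    return alerts


packets = [
    Packet("10.0.0.5", "10.0.0.1", 80, b"GET /index.html"),
    Packet("10.0.0.9", "10.0.0.1", 80, b"id=1' OR 1=1 --"),
    Packet("10.0.0.9", "10.0.0.2", 23, b"login"),
]

alerts = evaluate(packets, RULES)
for alert in alerts:
    print(alert)
```

Each alert carries the threat type (the rule name), the source and destination of the traffic, and the severity, mirroring the alert contents listed in the text.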

IDS network traffic capture module
The first step in the network traffic capture module is to identify the capture points where network traffic will be monitored. This may involve deploying sensors at key points in the network, such as switches, routers, and firewalls. Once the capture points are identified, the next step is to collect network traffic from those points. This may involve configuring the sensors to capture traffic passively or actively redirecting traffic to the sensors for analysis. Captured network traffic is then analyzed for potential threats. This may involve using signature-based detection methods to identify known threats or machine learning algorithms to detect anomalies.

IDS process core module
The first step in this module is to receive the alerts generated by the other modules, such as the network traffic capture module and the IDS rule module. Once alerts are received, the system prioritizes them based on the severity of the potential threat and their relevance to the organization's security policies. The system then correlates the alerts to identify any patterns or relationships between them. This may involve looking for common sources, destinations, or methods of attack.
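The prioritization and correlation steps of the process core module can be illustrated with a short sketch. The alert fields (source, destination, severity, rule), the example values, and the choice to correlate by common source are assumptions made for illustration; correlation by destination or attack method would follow the same pattern.

```python
from collections import defaultdict

# Hypothetical alerts as they might arrive from the other modules.
alerts = [
    {"source": "10.0.0.9", "destination": "10.0.0.1", "severity": 5, "rule": "sql-injection-pattern"},
    {"source": "10.0.0.7", "destination": "10.0.0.3", "severity": 1, "rule": "ping-sweep"},
    {"source": "10.0.0.9", "destination": "10.0.0.2", "severity": 2, "rule": "telnet-access"},
    {"source": "10.0.0.9", "destination": "10.0.0.4", "severity": 3, "rule": "port-scan"},
]


def prioritize(alerts):
    """Order alerts so that the most severe ones are handled first."""
    return sorted(alerts, key=lambda a: a["severity"], reverse=True)


def correlate_by_source(alerts, min_alerts=2):
    """Group alerts that share a source and keep groups of at least min_alerts."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["source"]].append(alert)
    return {src: grp for src, grp in groups.items() if len(grp) >= min_alerts}


queue = prioritize(alerts)
incidents = correlate_by_source(queue)

for source, related in incidents.items():
    rules = [a["rule"] for a in related]
    print(f"{source}: {len(related)} related alerts ({', '.join(rules)})")
```

Grouping several low-severity alerts from one source can reveal a multi-stage attack that no single alert would indicate on its own.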

IDS result module
The first step in the IDS result module is to receive the alerts generated by the IDS rule module. The system filters the alerts based on their severity, source, destination, and other relevant criteria to reduce false positives and prioritize the most critical alerts. The system then correlates alerts that are related to the same incident or attack to provide a comprehensive view of the security event. The system analyzes the filtered and correlated alerts to identify the nature and scope of the security event. This analysis may involve examining packet payloads, identifying patterns of activity, and performing threat intelligence lookups for a sustainable information system.

About Software
Python has emerged as a highly favored programming language for the implementation of machine learning algorithms. Its versatility and extensive library support have contributed a great deal to its adoption in this domain. It provides libraries such as NumPy and pandas, which facilitate efficient data handling and manipulation, and scikit-learn, which offers a vast collection of machine learning algorithms and utilities, further enhancing Python's capabilities. Jupyter Notebook is an open-source web application that enables users to code and document in a single file. Its interactive coding environment enables seamless experimentation and visualization of machine learning workflows. Jupyter Notebooks also provide an intuitive interface for documenting and sharing these workflows, enabling collaboration and reproducibility.

Results
The values of the metrics obtained by applying the Logistic Regression and Random Forest Classifier algorithms are shown in Figure 7. Table 1 shows the test cases applied and their status.

Table 1. Test cases, expected and actual results
Our study proposes a system that uses a random forest algorithm, which works well with numerical and categorical features. Our model is intrinsically suited for multiclass problems, whereas previous ones supported only two-class problems. It is computationally less expensive and needs a small training dataset to predict the result. By combining many trees into one ensemble model, Random Forests minimize the high variance of a flexible model like a decision tree. Our system is robust to outliers and noisy data. It uses an ensemble of decision trees, which is less sensitive to anomalies than the SVM and ANN algorithms of the previous project. Random Forest allows us to identify the most relevant features for prediction. Feature selection helps in understanding the relationships between different attributes of a dataset. Random Forests are easier to interpret than ANNs. Using built-in methods, it can handle missing data through surrogate splits and by imputing missing values based on other features. Our project is highly scalable and does not face much difficulty in handling large datasets. Compared to complex ANNs and SVMs, Random Forest is less prone to overfitting due to its ensemble nature and techniques like bagging and random feature selection. Random Forests can be trained relatively quickly due to their parallelizable nature.
In the era of the Internet of Things (IoT) and Big Data, it was projected that the number of connected devices would surpass 26 billion by the year 2020. This rapid growth in connected devices is also anticipated to result in a corresponding increase in cybersecurity challenges.
In conclusion, achieving a high accuracy of 99.76% with a Random Forest classifier for network intrusion detection is a good start, but it is important to consider other metrics, perform thorough testing, and evaluate the real-world implications of the model's performance for a sustainable information system. Several other metrics, such as F1-score, recall, precision, AUC-ROC, and the confusion matrix, can help us understand the true performance of the model.

Future Scope
The future scope of this project could involve several areas of improvement, such as false alarm rate, response time, encrypted packets, and unbalanced datasets.
False Alarm Rate: Considering the impact of false positives and false negatives is vital. False positives create unnecessary alarm, while false negatives can miss actual attacks. Network administrators must cautiously investigate and rule out non-harmful events to address this issue effectively.
Response Time: There is a delay between the discovery of a new threat and the application of its corresponding signature to the IDS. During this lag period, the IDS is unable to detect and mitigate the new threat promptly. To minimize this issue, efforts should be made to reduce the lag time between threat discovery and signature deployment.
Encrypted packets: Most intrusion detection devices do not inspect encrypted packets, which can potentially enable undetected intrusions into a network, particularly within VPNs. These intrusions may go unnoticed until more substantial network breaches occur. This underlines the importance of implementing additional security measures to mitigate the risk of undetected intrusions within encrypted VPN traffic.