Estimating Passenger Demand Using Machine Learning Models: A Systematic Review

This article investigates machine learning models used to estimate passenger demand. These models have the potential to provide valuable insights into passenger trip behaviour and other inferences. Research on estimating passenger demand with machine learning models, and the methodologies used, is fragmented. To synthesise these studies, this paper conducts a systematic review of machine learning models for estimating passenger demand. The review investigates how passenger demand is estimated using machine learning models. A comprehensive search strategy is conducted across the three main online publishing databases, locating 911 unique records. Relevant record titles, abstracts, and publication information are screened, leaving 102 articles. The articles are then evaluated against the eligibility requirements. This procedure yields 21 full-text papers for data extraction. Three thematic research questions, covering passenger data collection techniques, passenger demand interventions, and intervention performance, are reviewed in detail. The results of this study suggest that mobility records, LSTM-based models, and performance metrics play a critical role in conducting passenger demand prediction studies. Model evaluation was mostly restricted to three performance metrics, which calls for improved evaluation metrics. Furthermore, the review identified an overreliance on the long short-term memory (LSTM) model for estimating passenger demand; minimising the limitations of the LSTM model will therefore generally improve estimation models. In addition, an adequate training set is crucial to avoid overfitting, and it is advisable to consider multiple metrics for a more comprehensive evaluation.


Introduction
Getting to a location to participate in activities such as work, recreation, and socialisation is a necessity for human survival, and transport enables these activities to be carried out. Individuals use private transportation, usually self-owned vehicles, while the public uses a system of shared transportation units provided by operators. Because more people patronise public transport, its impact is immediately felt when it is effective. The demand for transportation services continues to increase as a result of increasing urbanisation. In London, more than two billion passenger trips were made in 2009 (W. Wang et al., 2011). Increasing passenger demand threatens the safety and quality of transportation services. Public transport operators must optimise operations by accurately estimating passenger demand, fleet requirements, and income (Hänseler et al., 2017). Estimating passenger demand is the foundation of an efficient transportation system, and it is challenging if operators do not use modern technologies and mathematical models to manage transportation (Sbai & Ghadi, 2018). The deployment of new transportation-related technology has resulted in an exponential increase in the availability of data on passenger movements. Furthermore, recent strides in machine learning research have resulted in numerous applications of machine learning (Hillel et al., 2021).

Previous passenger demand estimation
Passenger demand estimation studies aim to forecast the number of people who will use a particular transportation system or service in the future. These studies are crucial for transportation planners to determine the appropriate level of service and guide decisions about infrastructure investments and transportation policies. One popular method for estimating passenger demand is the gravity model, which is widely used in transportation planning. The model assumes that the number of trips between two locations is proportional to the product of their populations and inversely proportional to the distance between them (Becker et al., 2018). A study in Madrid focused on the demand for interurban rail travel. Its analysis is based on the estimation of disaggregated Nested Logit models using information from travellers, and it examines the competition between newly built high-speed trains and other modes. A cost-benefit analysis was conducted to inform investment decisions. In addition, the study analyses the response of demand to various policy scenarios for high-speed train services, using willingness to pay for improved levels of service as an indicator (Román et al., 2010). In general, passenger demand estimation studies are critical for transportation planning and can provide valuable information on the factors that influence travel behaviour.
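The gravity model's trip formula can be sketched as a small function. The zone populations, distance, and parameter values below are illustrative assumptions, not figures from any cited study:

```python
def gravity_trips(pop_i, pop_j, distance, k=1.0, beta=1.0):
    """Simple gravity model: estimated trips between zones i and j are
    proportional to the product of their populations and inversely
    proportional to (a power of) the distance between them."""
    return k * pop_i * pop_j / distance ** beta

# With beta = 1, doubling the distance between two zones halves the estimate.
near = gravity_trips(50_000, 80_000, distance=10)
far = gravity_trips(50_000, 80_000, distance=20)
```

In practice, k and the distance-decay exponent beta are calibrated against observed trip counts rather than fixed a priori.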

Background
There are ten different types of systematic review. This review adopted an effectiveness review. Effectiveness is the degree to which an intervention, when applied appropriately, produces the desired outcome. When an effectiveness review is adopted, the PICO framework is recommended to develop a review question (Munn et al., 2018).

PICO framework for Question Development
The PICO process uses a case scenario from which a question is constructed. P represents the characteristics of the population, I the study intervention, C the comparator, and O the outcome. The case statement is as follows: to operate public transport successfully, operators should be able to anticipate passenger demand and infer journeys. However, operations are negatively affected by the absence of automatic data collection in the service. Table 1 shows the PICO framework.

The Question
Therefore, the question developed for the review is as follows: what machine learning models can be used to estimate travel demand in various data collection systems?

Search Strategy
The purpose of the search strategy is to find papers useful for the review. The Crossref, Scopus, and Google Scholar search engines are used to search for articles, providing comprehensive coverage of the field. The same search procedure is used on each database. This review focuses on papers with trip inferences; therefore, only papers with titles directly related to trip inference are included.
The following initial phrases are tested: travel pattern, trip inference, passenger density, travel demand, and origin-destination. The phrases are limited to the title, which helps the researcher filter out irrelevant papers. To select papers that discuss machine learning techniques, only papers with one or more selected phrases related to machine learning are selected.
The following initial phrases are tested: machine learning, neural network, decision tree, computer vision, artificial intelligence, random forest, boosting, support vector, deep vision, and image processing. The initial search was not limited to a specific time frame. The Google Scholar search for papers containing at least one of the above machine learning-related phrases alongside at least one of the trip-inference phrases, across all relevant fields, returned 17,700 results as of August 2022. The search in the Scopus database returned more than 1,240 results, whilst Crossref returned 1,730 results.

Selection Criteria
Setting inclusion and exclusion criteria is a standard and required practice when designing systematic research protocols. It helps the researcher produce reliable and repeatable results. The following inclusion criteria determine whether articles found in the search are included in the study (Booth Andrew et al., 2016):

Inclusion Criteria:
1. Studies that employ one or more machine learning techniques to predict trips or demand.
2. Studies on the estimation of passenger demand in transport modes.
3. Studies that investigate passenger densities from deep vision systems, Google data, or other non-intrusive methods.
4. Studies that investigate passenger data from collection systems.

Exclusion Criteria:
The following exclusion criteria determine whether articles found in the search are excluded from the study:
1. Papers or documents published before 2018.
2. Studies in peer-reviewed journals or conference proceedings not written in English.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline is used for paper selection. First, duplicates are eliminated from the search records. Subsequently, the title and abstract of the articles are checked against the eligibility criteria.
Finally, the remaining full-text articles are assessed for eligibility (Liberati et al., 2009). A paper with multiple datasets and methodologies is treated as a separate study for analysis.

Data extraction strategy
A list of attributes is created to extract relevant data from each study without bias, limiting subjectivity in the data extraction process. The attributes are specific, objective, and quantifiable or categorical. The studies are carefully reviewed, and every attribute is noted and tallied. The attributes are shown in Table 2.

Study selection
The following search terms are used to carry out the search strategy outlined in Section 3.1.
Scopus: (TITLE ("trip inference" OR "origin destination" OR "travel pattern" OR "passenger density" OR "travel demand") AND TITLE-ABS-KEY ("machine learning" OR "neural network" OR "decision tree" OR "computer vision" OR "random forest" OR "boosting" OR "support vector" OR "deep vision" OR "image processing"))
Google Scholar: ("trip inference" OR "origin destination" OR "travel pattern" OR "passenger density" OR "travel demand") AND ("machine learning" OR "neural network" OR "decision tree" OR "computer vision" OR "random forest" OR "boosting" OR "support vector" OR "deep vision" OR "image processing").
Crossref: ("trip inference" OR "origin destination" OR "travel pattern" OR "passenger density" OR "travel demand") AND ("machine learning" OR "neural network" OR "decision tree" OR "computer vision" OR "random forest" OR "boosting" OR "support vector" OR "deep vision" OR "image processing").
Due to the restriction on search length in Google Scholar, this search is divided into two separate searches, with the results combined. The search was carried out on 20/12/2019 on all three databases. Figure 1 shows a PRISMA flow chart of the study selection process.
A total of 2,095 records were returned from the search: 346 from Crossref, 635 from Scopus, and 1,114 from Google Scholar. Duplicates are then removed, leaving 911 records to be screened. The total number of records after removing duplicates exceeds the results obtained from two of the databases, showing that some results from Crossref and Scopus were not returned by the Google Scholar search. The remaining records are then screened against the eligibility criteria outlined in Section 2.4.2. During the selection, 809 articles are excluded as irrelevant based on their title and abstract.
The full text is obtained for the remaining 102 articles for further review. Of these, another 81 are excluded based on the selection criteria. This process leaves 21 selected articles for data extraction.

Selected papers
The final number of papers used for data extraction is 21. Table 3 provides a unique identifier for each paper and its year of publication. Table 4 provides details of all the journals and conferences/proceedings from which the papers were selected. The articles come from a wide spread of publications, with a total of 13 different journals and 8 different conferences featured.

Table 4. Summary of published sources.

Figure 1 shows the distribution of the articles from 2019 to 2022. There is a recent spike in interest in publishing on demand estimation using machine learning models: 48% of the articles were published in 2022 alone, while the remaining 52% appeared before 2022. This demonstrates a rising research interest in demand estimation using machine learning models.

What are the methods for collecting passenger data?
The following sections present an overview of the methods for collecting passenger data used in the selected articles.

Question 1.a From what type of transport system has passenger data been collected?
Based on the responses to Question 1.a, the transportation systems are grouped into taxis, buses, and metro/rail. 62% of the transportation systems were taxis, 19% were buses, and 19% were metro/rail systems.

Question 1.b What device/instrumentation was used to collect the data?
From the review, 57% of the data were recovered from mobility records or databases, 19% from smart cards, 19% from automatic fare collection systems, and 5% from image-captured records. All studies conducted in a metro/rail system collected information from smart cards except (Han et al., 2022a; Q. . Researchers resorted to smart cards and automatic fare collection systems because they generate the desired data at less cost, requiring minimal installation, personnel, and man-hours (Ait-Ali & Eliasson, 2022).

Question 1.c What type of information was collected for the study?
The information collected from devices or through instruments is shown in Table 5. 19% of the studies used the tap-in tap-out information from smart cards for analysis. Smart cards have a large data volume capacity, broad coverage, and high authenticity, and their immutable IDs allow the researcher to obtain long-term travel information about passengers, which offers the potential to mine travel patterns and predictions. Therefore, researchers prefer tap-in tap-out information because it considers complete trip attributes in addition to real-time prediction (Ye & Ma, 2022; Z. Zhao et al., 2018). A significant number of studies, about 43%, used information on boarding and exiting; this information came from mobility records/databases or smart cards. Passenger request information accounted for about 24% and usually contained the request ID, pick-up time, pick-up coordinates, drop-off time, and drop-off coordinates (Liang et al., 2019). S11, S12, S18, S19, and S20 combine passenger requests with vehicle GPS, meteorological data, or a coded road network; such studies constituted about 25% of the total.
Information was classified into passenger-orientated information and transport unit-orientated information. Passenger-orientated studies are those where the primary data used for the analysis were passenger data, while vehicle-orientated studies based their analysis on the transport unit. Table 5 shows the information collected and the orientation of each study. 90% of the studies were in the passenger-orientated category and 4% were transport unit-orientated; S12 was the only study to cover both categories. Data from vehicle-orientated studies were applied to taxi services, while data from passenger-orientated studies were mostly applied to metro, rail, and bus services. Researchers were able to access vehicular data because it was available in open-source datasets from the operators (Huang et al., 2022).

The availability of passenger data on smart cards influenced the focus of transport system studies. In transport systems such as metro, rail, and bus systems, the desired passenger data were obtained through smart cards, which include information such as travel patterns, frequency of use, and duration of trips. These data are analysed to gain insight into passenger behaviour and preferences, which informs decisions related to the design and operation of these systems. As a result, studies on these transport systems tend to be passenger-orientated, as they are based on readily available data.

The nature, size, period, and temporal aggregation of the collected data are shown in Table 6. For most studies, the data consisted of transactional requests or smart card data, because transactional data were less costly to obtain than manual surveys. Furthermore, smart card systems and service requests were readily available at the locations where the studies were carried out (Sun et al., 2020). The size of the datasets ranged from 10,000 to more than 700 million trips.
The size of the datasets correlates with the spatial area under study. For S5, 694,000 records were collected on 18 routes and 1,781 bus stops, while S6 had about 769 million records covering 42,000 bus stops and 325 subway stations. S1, S2, S6, S7, S13, and S15 collected more than ten million records from complete networks. S4, S5, S16, S19, and S20 collected fewer than 10 million transaction records; these studies were carried out on selected routes or a transport service with a limited time frame for data collection.
Although time is continuous, various time intervals were adopted for the studies, and this choice of interval is an essential part of each study. S19 did not state its temporal aggregation and was excluded from the temporal aggregation analysis. Data were temporally aggregated at 5 min, 10 min, 15 min, 20 min, 25 min, 30 min, 60 min, and 24 hours; S1 and S7 used 24 hours and 10 minutes, respectively. 21% and 37% of the studies used 15 min and 60 min, respectively, for temporal aggregation of the data. S21 used all categories of aggregation. The studies adopted varying time intervals based on population dynamics, keeping the interval relatively stable so as not to conceal the time-varying laws of passenger demand. The review showed that 15 min and 60 min were the preferred time intervals for data aggregation (Giraldo-Forero et al., 2019; Yang et al., 2022).
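Aggregation into fixed intervals of this kind can be sketched in a few lines. The timestamps and the 15-minute default below are invented for illustration, not taken from any reviewed study:

```python
from collections import Counter
from datetime import datetime

def aggregate_demand(timestamps, interval_min=15):
    """Count trip records per fixed time bin (here 15-minute intervals,
    one of the two intervals preferred in the reviewed studies).
    interval_min is assumed to divide 60 evenly."""
    bins = Counter()
    for ts in timestamps:
        floored = ts.replace(minute=(ts.minute // interval_min) * interval_min,
                             second=0, microsecond=0)
        bins[floored] += 1
    return dict(bins)

# Five toy trip records in the morning peak.
trips = [datetime(2022, 8, 1, 8, m) for m in (2, 7, 14, 16, 31)]
demand = aggregate_demand(trips)   # {08:00: 3, 08:15: 1, 08:30: 1}
```

A coarser interval (e.g. 60 min) smooths out noise at the cost of hiding short-lived demand peaks, which is the trade-off the reviewed studies balance.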
The duration of data collection ranged from 5 to 365 days. Most studies collected their data over 30 days, while the data for S6, S11, S12, S13, S14, and S19 were collected over 120 days. These data were readily available from automatic collection systems and open-source databases, which made collection easier; researchers therefore resorted to collecting long-duration data for their studies.

What functions have been used to determine passenger demands and where?
The following sections present an overview of the interventions used to determine passenger demand in the selected articles.

Question 2.a Which intervention and class of intervention was used in the study?
Based on the responses to Question 2.a, the interventions were categorised according to the type of model and the specific models employed. The known types of models are supervised, unsupervised, semi-supervised, and reinforcement learning. The identified model classes were further divided into subtypes. Table 7 shows the interventions, the type of model, and comments. 76% of the studies conducted the research using a supervised regression model; the remainder was shared between unsupervised clustering models and unsupervised dimension reduction models at 19% and 5%, respectively. Forecasting traffic demand using deep neural networks has attracted widespread interest, and most studies that used neural network models combined several machine learning models as their intervention (Han et al., 2022b). Recurrent neural networks are suitable for forecasting studies; however, they struggle to use information from the distant past and perform poorly on long-term dependencies (Liyanage et al., 2022). Therefore, S3, S5, S8, S9, S10, S11, S12, S13, S14, S15, and S20 combined long short-term memory (LSTM) models with other models for their studies. Hybrid LSTM studies constituted 73% of the supervised regression model studies. LSTM models are generally identified as outperforming RNNs in time-series data forecasting (Yeon et al., 2019); therefore, the LSTM-based model is preferred for prediction and series forecasting.
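To make the LSTM gating mechanism concrete, the sketch below implements a single scalar LSTM forward step in plain Python. The toy weights and demand series are invented for illustration; the reviewed studies would use a deep learning framework with learned, vector-valued parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One forward step of a scalar LSTM cell. The forget (f), input (i),
    and output (o) gates control what the cell state c retains, adds, and
    exposes -- the mechanism that lets LSTMs keep information from the
    distant past better than plain RNNs. `w` maps each gate to
    (input weight, recurrent weight, bias); all values here are toys."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])
    c = f * c_prev + i * g      # long-term cell state
    h = o * math.tanh(c)        # short-term hidden output
    return h, c

weights = {gate: (0.5, 0.5, 0.0) for gate in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for demand in (0.2, 0.4, 0.9):  # a toy normalised demand series
    h, c = lstm_step(demand, h, c, weights)
```

The hybrid studies cited above pair this recurrent core with other components (e.g. convolutional or attention layers) to capture spatial structure alongside the temporal dynamics.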

Question 2.b What was the unit of analysis?

S1 uses travel patterns to improve the accuracy of passenger travel information prediction; it achieved its objectives by analysing the degree of accuracy of the proposed model. S2 and S7 analyse demand predictions by aggregating data into different time zones and updated transactions; aggregating the data allows the researcher to manage the abundance of results. S3, S8, and S15 analysed short-term demand prediction and the effect of weather conditions, such as air temperature and the air quality index. This research worked from the passenger flow correlation and showed that the population is attracted to the same type of facility; furthermore, the shorter the distance between bus stops, or the more bus routes serving a stop, the higher the degree of correlation of tourist flow. S4 analyses a spatiotemporal dynamic time-warping test in which the correlation between zones was evaluated based on short-term passenger demand data. S5 analyses predictions of short-term data at varying temporal aggregations. S6 analyses the variability of travel patterns; to investigate this variability, probability density functions were developed, and curves representing different groups of passengers were fitted and evaluated. S9, S11, S12, S13, S14, and S16 analyse the number of passenger demands in each region; S11 also incorporates an objective function in its analysis. S10 analyses the prediction of passenger demand over a fixed temporal interval. The spatial analysis of S17 and S21 considers the requested demand and attributes its intensity to land use; they also perform a temporal analysis in which demand behaviour is examined over time, with the observed demand in specified periods critically analysed. S18 generates various clusters and uses the elbow method to find the optimised number of clusters.
Due to spatial restrictions, a high number of clusters is also used to enable the study to determine hotspots. S19 analysed the distance travelled by passengers; the study also used an evaluation metric and the performance of the model for its analysis. S20 analyses the correlation between travel frequency and travel time by comparing them across different time intervals. It should be noted that S4, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S20, and S21 analyse the performance of the model implemented in the study and compare it with others. The objectives of the individual studies are summarised below:

S9: Obtain mobility patterns to predict passenger demands from one region to another.
S10: Predict passenger demand.
S11: Predict passenger travel demands from one region to another.
S12: Improve prediction accuracy to capture the characteristics of urban travel demand.
S13: Predict the temporal variability of taxi demand between region pairs.
S14: Investigate origin-destination-based demand prediction.
S15: Predict short-term travel demand based on historical data and other information.
S16: Exploit heterogeneous information to learn the evolutionary patterns of ridership.
S17: Present spatio-temporal and temporal analysis of demand flow in taxi requests.
S18: Determine hotspots by measuring the clusters' h-index.
S19: Predict the travel destination depending only on the departure time and coordinates.
S20: Predict passenger travel demands, as well as manage taxi operations and scheduling.
S21: Reflect the inherent time-space correlations and complexity of the passenger flow.

Question 2.c
In what geographical areas have these studies been conducted? Figure 2 shows the geographical locations of the selected studies. Most studies were conducted in Asia (60%) and the United States (30%), which can be attributed to the open data policies in these regions: the data required for these studies are readily available and in the appropriate format. S18 did not state the location where the research was conducted. Public transportation in Africa is usually unregulated, and its mode of operation presents a challenge for collecting information on passenger demand (Kumar & Barrett, 2008).

How is the performance of the intervention determined?
The following sections present how intervention performance has been determined in the selected articles.

Question 3.a What is the data split ratio used in the studies?
To ensure that a model can generalise to unseen data, it is trained on a diverse dataset. Data are split into sets to identify and correct overfitting issues, thereby improving the overall performance of the machine learning model. 83% of the selected articles used a training split of between 70% and 80%. S3, S10, S13, S17, S19, and S20 used more than 80% of their data as a training set and did not consider a validation set. S2, S4, S12, S14, and S16, with training sets below 80%, further split the training set into training and validation sets. Splitting a dataset into training and testing sets is an important step in machine learning, as it helps to evaluate the performance of a model. The 80/20 data split is a common choice because it strikes a good balance between having enough data to train a model and having enough data to test it. Choosing a small test set presents a risk that the model will be unable to generalise well to new data, while models can learn faster with a large training set, reducing the overall training time. There is no hard and fast rule on data splitting; different split ratios may be more appropriate depending on the specific problem, the amount of data available, and the complexity of the model. It is good practice to experiment with different data split ratios and evaluate the model's performance before making a final decision (Aurélien Géron, 2019; Raschka & Mirjalili, 2019).

Question 3.b What performance metrics were used?

Performance metrics are useful for evaluating models because they provide a standardised way to compare them. Table 9 shows the performance metrics used to evaluate the models in the studies. Three discrete performance metrics were used in the selected studies: the root mean square error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE) were the metrics most commonly used to evaluate the models.
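The split strategies discussed above can be sketched as follows. The 70/10/20 proportions and the helper name are illustrative choices within the reported ranges, not a prescription from any reviewed study:

```python
import random

def train_val_test_split(records, train=0.7, val=0.1, seed=42):
    """Shuffle records and split them into training, validation, and test
    sets (70/10/20 here, within the 70-80% training range most of the
    reviewed studies adopted)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for repeatability
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = train_val_test_split(list(range(1000)))
# 700 / 100 / 200 records
```

For time-series demand data, a chronological split (training on earlier periods, testing on later ones) is often preferable to random shuffling, since it mirrors how the model would be used in deployment.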
They provide a quantitative measure of how well a model performs in predicting the output variable compared to the actual values. RMSE measures the average difference between predicted and actual values, but gives a higher weight to larger errors; it is useful when larger errors should be penalised more heavily. MAE measures the average difference between predicted and actual values, with equal weight given to all errors; it is useful when all errors should be given equal importance. MAPE measures the average percentage difference between predicted and actual values; it is useful for evaluating the accuracy of a model in terms of percentage error, which helps in situations where the magnitude of the error matters. S1, S2, S6, S7, S9, S13, S14, S19, and S21 employed different metrics or combined them with the commonly used ones. S1 used only the prediction accuracy as a metric, where the prediction accuracy equals the number of correctly predicted travel records divided by the total number of travel records; a prediction was considered correct when the difference between the predicted time and the actual value was less than 1 hour. For S2, the Pearson correlation coefficient (PCC) was added to MAE and RMSE as a metric. S2 and S21 adopted PCC because the relationships are linear and the variables are quantitative, normally distributed, and free of outliers. S6 adopted the F1 score, precision, and recall as its metrics. Recall and precision are used to evaluate the performance of a classification or information retrieval model. Recall measures the percentage of relevant instances that were correctly retrieved by the model out of all relevant instances in the dataset; it evaluates the completeness of the results produced by the model. Precision measures the percentage of retrieved instances that are relevant out of all retrieved instances.
It evaluates the accuracy of the results produced by the model. The F1 score, the harmonic mean of precision and recall, was then applied; it represents both precision and recall symmetrically in one metric. S7 adopted the mean error (ME), a fitness function, and the mean squared error (MSE) as performance metrics for its model. MSE measures the precision of the estimates, since it reflects the magnitude of the error to expect: a smaller MSE indicates that the estimates are more precise and less variable, while a larger MSE suggests the opposite. The MSE values were associated with the neural network training effect, where a lower value indicated better training. The fitness function was adopted because the study used a genetic algorithm; the function guides the optimisation process, and the reciprocal of the mean squared error served as the fitness function. ME sums the deviations (the differences between measured and true values) and divides the result by the sample size. S9 adopted RMSE and the symmetric mean absolute percentage error (SMAPE), which measures accuracy based on relative errors; the study justified the choice of SMAPE because it is not scale-sensitive. S13 and S14, in addition to MAE and RMSE, included the R2 score as a performance metric, which indicates how well the data fit the regression model. The R2 score is a statistical measure in a regression model that determines the proportion of variance in the dependent variable explained by the independent variables; the value derived from this metric is independent of context because it measures the model's goodness of fit. MAE, RMSE, and MAPE were the metrics featured most prominently in the selected papers, appearing in 57%, 67%, and 52% of the studies, respectively.
This suggests that these metrics are standard in model evaluation and machine learning studies. The other metrics were selected based on the preferences and objectives of each study.
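The metrics above can be implemented in a few lines of plain Python. The actual/predicted values below are invented solely to illustrate the calculations:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: penalises larger errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: weights all errors equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined if any actual is 0)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def smape(actual, predicted):
    """Symmetric MAPE (one common formulation): scale-insensitive."""
    return 100 * sum(2 * abs(p - a) / (abs(a) + abs(p))
                     for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 120, 90]      # e.g. observed demand per interval
predicted = [110, 115, 95]   # model output
# rmse ≈ 7.07, mae ≈ 6.67, mape ≈ 6.57%
```

Because each metric emphasises a different aspect of error (magnitude, scale, relative size), reporting several of them together, as many of the reviewed studies do, gives a more complete picture of model performance.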

Conclusion
The review investigates how passenger demand is estimated from collection systems using machine learning models. It examines three thematic research questions that cover passenger data collection techniques, passenger demand interventions, and intervention performance. A comprehensive search strategy is carried out across the three main online publishing databases, locating 911 unique records. Relevant record titles, abstracts, and publication information are screened, leaving 102 articles. The articles are then evaluated against the eligibility requirements, yielding 21 full-text papers for data extraction. Regarding data collection techniques, approximately 57% of the data were recovered from mobility records or open databases, because these data were readily available, in an appropriate format, covered a substantial period, and were less expensive to obtain. Traffic data were mostly aggregated in multiples of 15-minute intervals because this provides sufficient time resolution to capture changes in traffic patterns while minimising noise; it also makes the aggregated data more manageable and easier to compare and analyse. 83% of the articles split the training set between 70% and 80%. Splitting a dataset prevents overfitting and allows the performance of a model to be evaluated. MAE, RMSE, and MAPE featured as performance metrics in the selected articles, appearing in 57%, 67%, and 52% of the studies, respectively, which suggests that these metrics are standard in model evaluation and machine learning studies. Despite the exhaustive nature of this review, the following limitations must be acknowledged. Only three Internet databases were used to search for relevant materials, and the review did not evaluate grey or unpublished literature. However, it is expected that the selected papers cover the new methodologies.
The approach to the review is intended to be as objective as possible. Data extraction and discussion were carried out by the first author under the supervision of the co-authors. All results have been double-checked, but the authors are responsible for any residual inaccuracies. The results of this study suggest that mobility records, LSTM-based models, and performance metrics play a critical role in passenger demand prediction studies. In addition, the review identified an overreliance on the long short-term memory (LSTM) model for estimating passenger demand; minimising the limitations of the LSTM model will therefore generally improve estimation models. Additionally, an adequate training set is crucial to avoid overfitting, and it is advisable to consider multiple metrics for a more comprehensive evaluation.