RFM-AR Model for Customer Segmentation using K-Means Algorithm

Competition in business is getting tougher, and business people are required to carry out various strategies and innovations in order to compete. Business actors should not only focus on transaction convenience and product-centric strategies, but also need to carry out customer-centric strategies. Segmentation is part of a customer-centric strategy: it identifies groups of customers with similar characteristics. Previous studies on customer segmentation mostly used the RFM (Recency, Frequency, Monetary) model together with clustering methods. This research adds AR (Age, Return) to the model. The method used is CRISP-DM (Cross Industry Standard Process for Data Mining) with a combination of the RFM-AR model and K-Means clustering. The result of this research is a clustering model with 3 types of customer clusters with different characteristics. Determining the best number of clusters with the elbow method can produce the same number of clusters K on different amounts of data. The optimal K value for each RFM-AR variable is K=2. Customers are divided into 3 grades: high, middle and low.


Introduction
Changes in customer needs encourage changes in the marketing sector, so business people are required to carry out various strategies and innovations in order to compete with their competitors. Marketing strategy can be renewed by utilizing transaction data and applying data mining technology to analyze customer needs in order to deliver higher-quality value in selling products and services [1]. Currently, modern companies should not only focus on ease of transactions and on methods that give priority to products (product centric), but also need to implement strategies that prioritize the customer (customer centric) [2]. Segmentation is part of a customer-centric strategy: it identifies groups of customers with similar characteristics.
In order to develop a marketing strategy, segmentation is used to group customers into several clusters based on their loyalty criteria. One of the initial steps in developing a business model is segmenting the customer base [3]. Customer segmentation is the division of a customer base into uniform subgroups, each of which is then regarded as a separate marketing audience. Customer segmentation can be used to quantify customer value, allowing businesses to identify clients who generate significant revenues and those who do not [4].
The RFM (Recency, Frequency, Monetary) model is a widely used paradigm for segmenting consumer behavior [5]. The RFM model is a behavior-based model that analyses consumer behavior and then generates forecasts based on the behavior in the data. The RFM model's advantage lies in its reliance on a small number of observable and objective variables [6].
In general, the criteria used are as follows: Recency is how recently a customer made a transaction, Frequency is the intensity of transactions, or how often customers make transactions, and Monetary is the total value of the transactions made by customers [7]. RFM has been widely used in customer segmentation research, including the identification of potential customers based on purchasing behavior [8]-[10], prediction of potential customer demand [11], identification of cellular sales termination [12], and identification of special services for customers [13].
The use of data mining techniques allows organizations to turn stored data into new knowledge. Clustering is a data mining technique which divides the data in a set into several groups according to the similarity of their characteristics [14]. One of the clustering algorithms that is widely applied and minimizes clustering errors for points in Euclidean space is K-Means [15]. The K-Means algorithm has also been used extensively in customer segmentation research with good accuracy [8], [16]-[18]. This study adds 2 criteria to the model, namely AR (Age, Return). Age is the customer's subscription time, that is, how long it has been since the customer started making transactions, and Return is the average distance (median) between a customer's transactions. The purpose of this study is therefore to find the pattern of customer purchases through information on customer characteristics and values, based on the clusters formed from the RFM-AR model variables by applying the K-Means algorithm.

Data
The data utilized in this research is sales transaction data from July 2020 to November 2020 originating from PT. Literata Lintas Media, a retail business that sells various books and stationery. The data obtained is stored in .xlsx format to facilitate processing. The sample data consists of 8,594 rows with 8 columns. The sales transaction table is presented in Figure 1. The data is processed using the Python 3.7 programming language in Jupyter Notebook or Google Colab.

Research Method
The approach employed in this study is CRISP-DM (Cross Industry Standard Process for Data Mining), a standard procedure for problem solving in data mining research [19]. The series of stages carried out in this research can be seen in Figure 2.

Business understanding
The business understanding stage focuses on understanding the problem from a business point of view. This knowledge is then translated into a definition of the data mining problem, after which plans and strategies are determined to achieve the goals. At this point, an understanding of the background and goals of the business processes involved in product sales is necessary. Meanwhile, the objective of data mining in this study is to explore the knowledge contained in the data.

Data understanding
The data understanding stage provides an analytical foundation for the study by summarizing the data and identifying potential issues contained in it. Several processes need to be carried out to detect subsets that serve as a reference for forming hypotheses about hidden information. In this process, data exploration is carried out using the Exploratory Data Analysis (EDA) method, applying simple arithmetic and graphical techniques to summarize the observational data in order to facilitate understanding of the data [20].

Data preparation
The data preparation (data preprocessing) phase covers the activities needed to construct, from the raw data, the data set that will be processed at the modeling stage. This stage can be repeated several times [21]. It involves selecting tables, records, and data attributes, as well as transforming, cleaning and labeling the data so that it can be used as input for the modeling stage.

Modelling
At this stage, various modeling techniques are selected and applied, and some of their parameters are adjusted to obtain optimal values. In general, several different techniques can be applied to the same data mining problem. In this study, the modeling was carried out using the RFM-AR model with the K-Means algorithm. The clustering process using the K-Means algorithm is shown by the flowchart in Figure 3.

Preprocessing
In this study, data preprocessing was carried out to prepare the raw data according to data mining procedures so that it can be used as input at the modeling stage [24]. The preprocessing stage includes the transformation, cleaning and labeling of the transaction data. The transformation process changes integer data types into object and datetime types according to the needs of time series analysis, as can be seen in Figure 4a. Data cleansing is applied to remove unnecessary data or fill in missing values [25]. Before handling missing data, whether by deleting columns or imputing missing values, it is necessary to identify which columns are affected. After that, the correlation between the columns with missing values was checked. The selection of columns, records and attributes is adjusted to the needs of the analysis; the process can be seen in Figure 4b.
After re-identifying the columns that contain missing values, a relationship was found between the columns, so omitting the affected records would significantly reduce the quality of the knowledge contained in the dataset. Data cleansing is done to overcome noise in the data. Therefore, the forward fill method is applied to fill in the missing values in rows that are NaN (Not a Number) or NaT (Not a Time) by propagating the last valid observation forward, so that the quality of the data does not change. The results after the forward fill can be seen in Figure 4c. Random labeling is used to add a new column containing a random ID for each customer, which makes customers easier to identify in numeric form. The labeling process aims to facilitate customer analysis by giving each customer a unique identifier. A sample of the labeled data can be seen in Figure 4d.
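The forward fill and random labeling steps described above can be illustrated with a minimal, dependency-free sketch. The column names (`customer`, `date`, `amount`), the helper functions and the sample rows are hypothetical and are not taken from the study's data; in pandas the same fill is available as `DataFrame.ffill()`.

```python
import random

def forward_fill(rows, columns):
    """Replace None in the given columns with the last valid value seen above."""
    last_valid = {}
    filled = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if new_row.get(col) is None:
                # A gap at the very top stays None: nothing to carry forward yet.
                new_row[col] = last_valid.get(col)
            else:
                last_valid[col] = new_row[col]
        filled.append(new_row)
    return filled

def label_customers(rows, name_col="customer", id_col="customer_id", seed=0):
    """Give every distinct customer name a unique integer ID in random order."""
    names = sorted({row[name_col] for row in rows if row[name_col] is not None})
    ids = list(range(1, len(names) + 1))
    random.Random(seed).shuffle(ids)
    mapping = dict(zip(names, ids))
    return [dict(row, **{id_col: mapping.get(row[name_col])}) for row in rows]

# Hypothetical rows: the customer and date appear only on the first line of
# each invoice, so the following lines have gaps (None stands in for NaN/NaT).
rows = [
    {"customer": "Toko A", "date": "2020-07-01", "amount": 15000},
    {"customer": None, "date": None, "amount": 4000},
    {"customer": "Toko B", "date": "2020-07-03", "amount": 22000},
    {"customer": None, "date": None, "amount": 1100},
]

filled = forward_fill(rows, ["customer", "date"])
labeled = label_customers(filled)
for row in labeled:
    print(row)
```

Forward fill is only appropriate when a missing cell really does inherit the value above it, as in the invoice layout assumed here; otherwise it would attach transactions to the wrong customer.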

Exploratory data analysis
Exploratory Data Analysis (EDA) is used to explore the potential knowledge contained in the dataset so that it can enrich the assumptions made during the analysis and when drawing research conclusions. The outputs of the EDA process are as follows. The data distribution in Figure 5a shows no significant difference from the median (50%), so it can be concluded that the number of transactions is not too diverse and that there are no anomalies between customers, because the distribution contains no negative (-) values. According to the EDA findings, there were 1,551 customers and 1,318 types of products sold in the period July-November 2020. The best-selling products were Joyko, Kiky and FBC. In addition, the number of sales transactions tends to fluctuate, as shown in Figure 5b.
In the period July-November 2020 there are transactions with a value of <= 0, so it can be assumed that these transactions are discounts on certain products. The data and diagrams for the minus transactions are presented in Figure 5c and 5d. The chart of products sold in the transaction period July-November 2020 can be seen in Figure 5e. In the EDA stage, the number of new customers each month can be determined, and a label is provided to distinguish between new customers and regular customers in the period July-November 2020. Figure 5f shows that the number decreased significantly in the period July-October 2020, whereas in November 2020 there was an increase. The decline that occurred in July-October can be the subject of further analysis in order to choose the right strategy for dealing with it. In addition to the number of new customers every month, at this stage the researcher analyzes the retention rate of the transaction data. Retention rate is the level or average number of existing customers who continue to buy products in the July-November 2020 period. Based on the chart presented in Figure 5g, it is concluded that the number of new subscribers continued to increase in July-October 2020, but experienced a significant decline in November. The last analysis carried out in the EDA phase determines the average distance between customer transactions in the period July-November 2020. The average distance between customer transactions is shown in Figure 5h.
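The monthly new-customer count and the retention rate can be sketched as below. The text does not give the exact retention formula, so this sketch assumes one common definition: the share of the previous active month's customers who buy again in the current month. The function name, the input layout and the sample transactions are hypothetical.

```python
from collections import defaultdict
from datetime import date

def monthly_customer_stats(transactions):
    """transactions: iterable of (customer_id, date).

    Returns {(year, month): {"active", "new", "retention"}}.
    Months without any transaction are skipped, so "previous month"
    means the previous month that has data.
    """
    active = defaultdict(set)
    for customer_id, day in transactions:
        active[(day.year, day.month)].add(customer_id)

    stats = {}
    seen_before = set()
    previous = None
    for month in sorted(active):
        current = active[month]
        new_customers = current - seen_before
        if previous:
            # Assumed definition: returning share of last month's customers.
            retention = len(current & previous) / len(previous)
        else:
            retention = None  # no earlier month to compare against
        stats[month] = {
            "active": len(current),
            "new": len(new_customers),
            "retention": retention,
        }
        seen_before |= current
        previous = current
    return stats

# Hypothetical transactions for three customers.
transactions = [
    (1, date(2020, 7, 3)),
    (2, date(2020, 7, 15)),
    (1, date(2020, 8, 2)),
    (3, date(2020, 8, 20)),
    (3, date(2020, 9, 5)),
]

stats = monthly_customer_stats(transactions)
for month, values in stats.items():
    print(month, values)
```

A customer counts as new in the first month in which they appear and as a regular customer afterwards, which matches the new/regular label described above.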

Recency
The recency value is determined from the date of the previous transaction or from the span of time between the previous and the current transaction. The greatest value is assigned to the customer with the most recent transaction date and the smallest value to the customer whose last transaction lies furthest in the past. The recency unit can be seconds, minutes, hours, days or other time units. Figure 6a shows the recency results: the highest recency value is 152.35, for the customer with ID 1416, who has the most recent purchase transaction. The smallest recency value is 0.00, for the customer with ID 648, whose last purchase lies furthest in the past. It can therefore be expected that the customer with ID 1416 is an old customer who is still making purchase transactions, whereas the customer with ID 648 only makes transactions occasionally and is not a regular customer. Figure 6b is the customer recency distribution chart, which shows that transaction recency varies considerably between customers. The largest recency values are in the range 140-160, with a value of 120-140. Meanwhile, the recency values with the highest intensity are in the range 0-60, with an average value range of 0-50. It can be concluded that there are fewer regular customers than new buyers or buyers who only buy occasionally over a long period of time.

Frequency
The frequency value is obtained from the intensity of customer transactions, that is, how often customers shop. The frequency value is discrete; the largest value is given to customers who shop frequently and vice versa. Figure 7a shows the frequency results, with the highest frequency value of 28 for the customer with ID 983.
The lowest frequency value is 1, for the majority of customers, who have a low purchase transaction intensity. Figure 7b is the customer frequency distribution chart, which shows that the frequency with the highest number of customers is in the range 0-5, while the largest frequency values are in the range 25-30, with the number of subscribers in the range 0-1.

Monetary
The monetary value is obtained from the nominal amount of each customer's transactions during the research period, July-November 2020. The largest monetary value is given to customers who spend a large amount of money on purchase transactions, while the smallest value is given to customers who make purchases with a small nominal amount. Figure 8a shows the monetary results, with the highest monetary value of 41,782,900 for the customer with ID 63 and the lowest monetary value of 1,100 for some customers. Figure 8b is the customer monetary distribution chart, which shows that the smaller the nominal transaction amount, the greater the number of purchases made by customers. Meanwhile, there tend to be fewer customers who spend more money on transactions.

Age (Days)
The age (days) value is obtained from how long a customer has been a subscriber. The age value is large if the subscription time is long, and tends to be small if the customer has only just subscribed. Figure 9a shows the age (days) results, with the highest age value of 152.35 for the customer with ID 1416, who also has the most recent purchase transaction. The smallest value is 0.00, for the customer with ID 648, who has a short subscription time. It can therefore be assumed that the customer with ID 1416 is an old customer. Figure 9b is the customer age (days) distribution chart, which shows that the age (days) data is quite varied and that the number of customers who have been subscribed for a long time is greater than the number who have just subscribed. This is represented on the right and left sides of the chart.

Return (Days)
The return (days) value is obtained from the average distance (median) between transactions made by a customer. The return (days) value is large if the renewal intensity of the transactions is close, and small if it lies further in the past. Figure 10a shows the return (days) results, with the highest return (days) value of 40.94 for the customer with ID 987, with a close average (median) distance between transactions. The lowest value is 0.00, with an average (median) distance between transactions that is quite far. Figure 10b is the chart of the distribution of customer return (days), which shows that for large distances between transactions the number of customers tends to be small, although in the 35-40 range it starts to increase slowly. Meanwhile, the number of customers with small distances between transactions tends to be higher.
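The five RFM-AR variables described in the preceding sections can be derived from a transaction list roughly as follows. The exact formulas are not stated in the text, so every definition here is an assumption: recency is measured in days from the start of the observation window to the customer's last transaction (so that a larger value means a more recent purchase, as described above), age is the span in days between a customer's first and last transaction, and return is the literal median gap in days between consecutive transactions, set to 0 for a single transaction. The function name, field names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

SECONDS_PER_DAY = 86400

def days_between(start, end):
    """Elapsed time from start to end as a fractional number of days."""
    return (end - start).total_seconds() / SECONDS_PER_DAY

def rfm_ar(transactions, window_start):
    """transactions: iterable of (customer_id, timestamp, amount),
    one record per purchase transaction."""
    by_customer = defaultdict(list)
    for customer_id, timestamp, amount in transactions:
        by_customer[customer_id].append((timestamp, amount))

    features = {}
    for customer_id, records in by_customer.items():
        times = sorted(t for t, _ in records)
        # Gaps between consecutive purchases, in days.
        gaps = [days_between(a, b) for a, b in zip(times, times[1:])]
        features[customer_id] = {
            "recency": days_between(window_start, times[-1]),
            "frequency": len(times),
            "monetary": sum(amount for _, amount in records),
            "age_days": days_between(times[0], times[-1]),
            "return_days": median(gaps) if gaps else 0.0,
        }
    return features

# Hypothetical data: customer 1 buys three times, customer 2 only once.
window_start = datetime(2020, 7, 1)
transactions = [
    (1, datetime(2020, 7, 1), 10000),
    (1, datetime(2020, 7, 11), 5000),
    (1, datetime(2020, 8, 10), 20000),
    (2, datetime(2020, 7, 1), 1100),
]

features = rfm_ar(transactions, window_start)
for customer_id, row in features.items():
    print(customer_id, row)
```

Under these assumed definitions a customer with a single purchase on the first day of the window gets 0.00 for recency, age and return, while a customer who has bought throughout the window gets equal, large recency and age values.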

Clustering RFM-AR with K-Means
Clustering is an unsupervised learning method: in contrast to classification, it has no training stage on labeled data. The algorithm divides the data points into groups based on the distance between points, so that the groups reflect patterns in the data. The objective is to minimize the distance between the points of each cluster and the cluster's centroid. The best number of clusters is determined by comparing the results for different numbers of clusters and looking for the point at which the curve forms an angle, known as the Elbow method [26]. The results for each cluster value are represented in the form of a graph as a source of information. The comparison is based on the SSE (Sum of Squared Errors) value. The results of the cluster trial on customer recency are presented in Figure 11a, with K in the range 1-10 and a maximum of 1,000 iterations. The graph shows that the SSE decreases most sharply for the first K values and then decreases slowly until it stabilizes. In the graph, the SSE decreases significantly from K=1 to K=2, forming an elbow at K=2, so it can be concluded that the ideal K value is K=2.
Furthermore, the results of the cluster test on customer frequency are presented in Figure 11b, with K in the range 1-10 and a maximum of 1,000 iterations. In the graph, the SSE decreases significantly from K=1 to K=2, forming an elbow at K=2, so it can be concluded that the ideal K value is K=2. The results of the cluster test on customer monetary are presented in Figure 11c with the same settings, namely K in the range 1-10 and a maximum of 1,000 iterations. Again, the SSE decreases significantly from K=1 to K=2, forming an elbow at K=2, so it can be concluded that the ideal K value is K=2. Finally, because the graphs produce homogeneous values, the results of the cluster test on customer age (days) and return (days) are presented together in Figure 11d, with the same settings of K in the range 1-10 and a maximum of 1,000 iterations. In the graph, the SSE decreases significantly from K=1 to K=2, forming an elbow at K=2, so it can be concluded that the ideal K value is K=2.
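The elbow procedure applied to a single variable can be sketched with a small, dependency-free K-Means (Lloyd's algorithm) on scalar values. This is an illustration under stated assumptions, not the study's implementation: the function names, the random initialisation from distinct data values, the number of restarts and the sample values are all hypothetical. Only the SSE criterion and the maximum of 1,000 iterations come from the text.

```python
import random

def assign(values, centroids):
    """Index of the nearest centroid for every value."""
    return [
        min(range(len(centroids)), key=lambda j: (v - centroids[j]) ** 2)
        for v in values
    ]

def kmeans_1d(values, k, max_iter=1000, seed=0):
    """Lloyd's algorithm on scalars. Returns (centroids, labels, sse).
    Requires at least k distinct values."""
    rng = random.Random(seed)
    centroids = [float(c) for c in rng.sample(sorted(set(values)), k)]
    for _ in range(max_iter):
        labels = assign(values, centroids)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    labels = assign(values, centroids)
    sse = sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))
    return centroids, labels, sse

def elbow_curve(values, k_values, max_iter=1000, restarts=10):
    """Lowest SSE over several random restarts for each candidate K."""
    return {
        k: min(kmeans_1d(values, k, max_iter, seed)[2] for seed in range(restarts))
        for k in k_values
    }

# Hypothetical variable with two well-separated groups of customers.
values = [1, 2, 3, 2, 1, 50, 52, 51, 49, 53]
curve = elbow_curve(values, range(1, 5))
for k, sse in curve.items():
    print(k, round(sse, 2))
```

On data like this the SSE drops sharply from K=1 to K=2 and only marginally afterwards, which is the elbow shape read off the graphs. With a library such as scikit-learn, the same quantity is exposed as the `inertia_` attribute of a fitted `KMeans` model.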

K-Means Scoring
In this study, scoring is employed to assess the level of customer loyalty using the RFM-AR variables. To ensure the accuracy of the scoring process, descriptive statistics are first used to analyze the data distribution of each variable, thus guarding against anomalies in the data. The results of the descriptive statistics on the Recency variable are presented in Figure 12a, with the highest score being 9 and the lowest 2, based on the values of the Recency, Frequency, Monetary, Age (Days) and Return (Days) variables. Furthermore, based on the scores obtained, customers are grouped into grades A (high), B (middle) and C (low). The results of grouping by grade are presented in Figure 12b.
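One way to turn per-variable clusters into an overall score and a grade is sketched below. How the per-variable scores are derived is not spelled out in the text, so this sketch assumes that clusters are ranked by their centroid (lowest centroid scores 0) and that the five scores are summed. The grade cut-offs use the score ranges reported in the Result and Discussion section (B for 4-7, C for 3 and below), with A inferred as 8 and above. All function names, centroids and scores are hypothetical.

```python
def rank_clusters(centroids):
    """Map each cluster index to an ordinal score: 0 for the cluster with
    the lowest centroid, 1 for the next, and so on. K-Means cluster indices
    are arbitrary, so they must be ordered before they can act as scores."""
    order = sorted(range(len(centroids)), key=lambda j: centroids[j])
    return {cluster: score for score, cluster in enumerate(order)}

def overall_score(variable_scores):
    """Sum the per-variable scores of one customer."""
    return sum(variable_scores.values())

def grade(score, a_min=8, b_min=4):
    """Assumed cut-offs: A for 8 and above, B for 4-7, C otherwise."""
    if score >= a_min:
        return "A"
    if score >= b_min:
        return "B"
    return "C"

# Hypothetical centroids from clustering the monetary variable:
# cluster 0 holds the biggest spenders, cluster 1 the smallest.
monetary_rank = rank_clusters([250000.0, 12000.0, 90000.0])

# Hypothetical customer whose monetary value fell into cluster 0; the
# other per-variable scores are taken as given.
customer = {
    "recency": 2,
    "frequency": 1,
    "monetary": monetary_rank[0],
    "age_days": 2,
    "return_days": 1,
}
customer_score = overall_score(customer)
customer_grade = grade(customer_score)
print(customer_score, customer_grade)
```

The number of clusters per variable in this example is illustrative only; the cut-offs are passed as parameters so that they can be adjusted to whatever score range the actual scoring produces.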

K-Means Customer Segmentation
Customer segmentation is the process of partitioning customers into homogeneous subsets. The clustering process is based on the similarity of the data's properties. In addition, the distance between clusters determines the position of a data point. Figure 13a shows the result of clustering on frequency and monetary: there tend to be few customers with large monetary values and high frequency values, while there tend to be more customers with moderate frequency and monetary values that are not too large. Figure 13b shows the result of clustering on recency and monetary: customers with large monetary values tend to have short distances between transactions. Customers in the low monetary category make transactions more intensively. This can be seen from the red plot, which tends to dominate.
In addition, for the middle category, with monetary values that are not too large, the distance between transactions is greater than in the high category and smaller than in the low category; however, the middle category also contains customers who have large monetary values but moderate recency values.
Figure 13c shows the result of clustering on age (days) and monetary: customers with large monetary values tend to have age (days) values associated with less intense shopping. Customers in the low monetary category have quite varied age (days) values. This can be seen from the red plot, which tends to dominate. In addition, the middle category, with monetary values that are not too large, also has quite varied age (days) values, although it contains customers who have large monetary values but moderate age (days) values.

Result And Discussion
The result of this study is a clustering model with 3 types of customer clusters with different characteristics. The elbow method, which determines the best number of clusters, may produce the same number of clusters K for different amounts of data. The optimal K value for each RFM-AR variable is K=2. Customers are divided into 3 grades:
Grade A: There are 13 customers with grade A. Customers in this segment are considered very profitable for the company because they spend a lot of money on their purchases. They have high recency, high frequency, high monetary, high age and high return values. Therefore, to prevent customers in this segment from becoming marketing targets for rivals, the company needs to improve its retention strategy by providing the best service.
Grade B: Grade B is the segment with the highest number of customers, reaching 891 customers. These customers are in the score range 4-7, so their combinations of Recency, Frequency, Monetary, Age and Return values are very varied. The combination of RFM and AR values in grade B can be categorized as moderate. Customers in this segment are potential customers. Therefore, a marketing strategy is needed so that customers in this segment increase their purchase loyalty at PT. Literata Lintas Media.
Grade C: Grade C is the segment with 645 customers. These customers are in the score range 0-3, so their combination of Recency, Frequency, Monetary, Age and Return values is low. The combination of RFM and AR values in grade C can be categorized as low. Customers in this segment are those who lack sufficient potential.

Fig. 5. Result of EDA: (a) Transaction data distribution, (b) Monthly sales chart, (c) Monthly product sales chart, (d) Monthly new customer chart, (e) Monthly retention rate chart, (f) Chart of distance distribution between transactions

Fig. 6. Result of recency: (a) Sample of recency data results, (b) Chart of customer recency distribution

Fig. 7. Result of frequency: (a) Sample of frequency data results, (b) Chart of customer frequency distribution

Fig. 8. Result of monetary: (a) Sample of monetary data results, (b) Chart of customer monetary distribution

Fig. 9. Result of age: (a) Sample of age (days) data results, (b) Chart of customer age (days) distribution

Fig. 10. Result of return (days): (a) Sample of return (days) data results, (b) Chart of customer return (days) distribution

Fig. 12. Result of K-Means scoring: (a) Data scoring (b) Sample of data grouping by grade