Energy Utilization and Transmission Delay Optimization in Cox VANET networks

Among the challenges in VANETs is power control, which enables better performance in terms of Signal to Interference plus Noise Ratio (SINR), Energy Efficiency (EE) and Energy Utilization (EU). Another important point to address is the transmission delay, which should be kept under an acceptable threshold. To this end, we model the network as a Cox line-point process and the transmitter as an M/D/1 queueing server. To solve this tradeoff problem, we use machine learning techniques, in particular the Q-learning algorithm. It is shown via simulations that, through our algorithm, the vehicular transmitter is able to learn its transmit power in an autonomous way and achieve better performance in terms of energy utilization rate, system waiting time and area spectral efficiency.


Introduction
Recently, vehicular ad hoc networks (VANETs) have become one of the most active research fields. VANETs are the cornerstone of intelligent transportation systems (ITSs). They allow vehicles and road side units (RSUs) to exchange important information such as traffic status, road conditions and accident occurrences, in order to warn drivers of potential risks [1]. These networks are a sub-class of mobile ad hoc networks (MANETs) that consider vehicles as mobile nodes and provide communications between vehicles, i.e. vehicle to vehicle (V2V), and between vehicles and RSUs, i.e. vehicle to infrastructure (V2I) [2]. However, these networks have their own characteristics compared to MANETs. Even though nodes are mobile as in MANETs, vehicles are expected to move along the road topology instead of moving randomly [3]. Thus, VANETs have a special spatial geometry where nodes are located exclusively in the roadway lanes. Recent models based on stochastic geometry have been proposed to represent road layouts and vehicle locations [4,5]. Due to their essentially linear nature, roads can be modeled using a Poisson Line Process (PLP) in R², which can be derived from a Poisson point process where, instead of points, undirected lines (road lanes) are randomly distributed on the plane [6]. Thus, the roadways are modeled as a network of lines distributed on the plane according to a PLP.
One of the challenging topics in VANETs is power control, defined as the mechanism that allows each node to adapt its transmission power to achieve better communication without disturbing other communicating nodes. Many power control schemes have been proposed in the literature; see for instance [7-10], which tackle the problem of selecting the transmit power of vehicles. In particular, the paper [11] proposed a power control algorithm that finds the optimum transmission power by adding a power tuning feedback beacon to each beacon message transmission: the sender estimates the distance to the receiver and predicts the required transmission power. Although the algorithm optimizes the transmission power, it generates additional transmission delay due to the control messages exchanged.

* e-mail: adil.boumaalif@gmail.com
** e-mail: zytoune.ouadoudi@uit.ac.ma
Another key performance indicator in VANETs is the message transmission delay, which should always remain under an acceptable threshold. Hence, many works have been proposed to deal with this issue [12-14]. The key idea is to propose techniques that handle the transmission delay at the network or MAC layer. In [15], the authors proposed a routing protocol that minimizes the end-to-end delay by selecting the appropriate relay nodes. In [16], the authors proposed to use machine learning to enhance routing in order to reduce data transmission latency. The papers [17,18] surveyed multi-layer techniques and MAC protocols, respectively, for improving the message transmission delay.
Both of these issues can be addressed with reinforcement learning (RL). One of the most famous RL algorithms is Q-learning, an off-policy control algorithm, i.e. it does not depend on the policy the agent uses to navigate the environment [19].
In this paper, we propose a reinforcement learning based algorithm for transmission power control that aims to optimize the transmission energy while guaranteeing that transmitted messages arrive within a limited delay. The rest of this paper is organized as follows. Section II describes the system model and the problem formulation. Section III details the reinforcement learning algorithm used to solve the optimization problem defined in Section II. Section IV presents numerical and simulation results and, finally, Section V concludes this paper and gives directions for future work.

System Model and Problem Formulation

Cox VANET Network
We consider a Poisson line process (PLP) φ_l with line density µ_l to model the road system. Hence, the density of the equivalent point process of φ_l is λ_l = µ_l/π. Besides, the locations of the nodes on each line form a 1D PPP with density λ_v = λ_n + λ_r, where λ_n is the density of vehicular nodes and λ_r is the density of RSUs. We thus obtain a Cox line-point process φ_c, as illustrated in Fig. 1. We consider that only a fraction p of the nodes, i.e. a density pλ_v, can transmit in each time slot, where 0 ≤ p ≤ 1. Without loss of generality, we consider a typical link, where the receiver is located at the origin and the corresponding transmitter is located at a distance d from the receiver on the same line.
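The construction above can be sketched numerically. The snippet below samples one realization of a Cox line-point process in a disc: lines are drawn as a PLP (each line parameterised by its angle θ and signed distance r from the origin) and nodes are placed on each line as a 1D PPP with density λ_v. The normalisation of the mean number of lines hitting the disc (here taken as µ_l · 2R) is an assumption of this sketch, as the exact constant depends on the PLP convention used.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cox_vanet(mu_l, lam_v, radius):
    """Sample a Cox line-point process in a disc of the given radius.

    Lines: a PLP, each line given by an angle theta and a signed
    distance r from the origin. Nodes: a 1D PPP of density lam_v on
    the chord of each line inside the disc.
    """
    # Mean number of lines hitting the disc: mu_l * 2 * radius
    # (normalisation assumed for this sketch).
    n_lines = rng.poisson(mu_l * 2 * radius)
    thetas = rng.uniform(0, np.pi, n_lines)
    rs = rng.uniform(-radius, radius, n_lines)
    nodes = []
    for theta, r in zip(thetas, rs):
        # Half-length of the chord of this line inside the disc.
        half = np.sqrt(max(radius**2 - r**2, 0.0))
        n_pts = rng.poisson(lam_v * 2 * half)
        t = rng.uniform(-half, half, n_pts)  # positions along the line
        normal = np.array([np.cos(theta), np.sin(theta)])
        direction = np.array([-np.sin(theta), np.cos(theta)])
        nodes.extend(r * normal + ti * direction for ti in t)
    return np.array(nodes).reshape(-1, 2)

pts = sample_cox_vanet(mu_l=0.05, lam_v=0.1, radius=100.0)
```

Every sampled node lies at r·normal + t·direction with r² + t² ≤ R², so all points fall inside the disc by construction.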
We assume that all links are affected by Rayleigh fading, with fading powers exponentially distributed with mean 1, and we consider a standard path loss model with exponent 2 < α < 6. We further assume that all nodes transmit with the same power P_t. Then, as proven by Chetlur in [10], the success probability of the typical link, P_s = P(SINR > β), admits a closed-form expression, where β is the SINR threshold and N_0 is the noise power.
In this work, we are interested in the special case where the density of lines in the network is very sparse, i.e. λ_l → 0. In this regime, the success probability of the typical link converges to that of a 1D PPP. Based on this result, we define the Energy Utilization Rate (EU_r) as the ratio of the energy consumption per unit of time to the transmission rate, assuming that each message is transmitted in one second [20]. Another important metric is the area spectral efficiency (ASE), defined as the average number of successfully transmitted bits per unit time per unit bandwidth per unit area in the network. Assuming that all transmitted symbols are drawn from Gaussian codebooks, it is given by [10]:

ASE = λ P_s log_2(1 + β)  [bits/s/Hz/km²]   (4)

where λ is the average number of active transmitting nodes per unit area. As we are interested in the special case in which our model converges to a 1D PPP scenario, λ = pλ_v. Since the transmission probability p impacts the overall spectral efficiency in a tradeoff manner, there is an optimum value p* that maximizes the ASE of the network in the 1D PPP case.
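Equation (4) can be evaluated directly once the success probability is known. In the minimal sketch below, P_s is passed in as a value (its closed form is omitted in this section and depends on p), and the example numbers (P_s = 0.8) are assumptions for illustration; p = 0.3, λ_v = 0.1 and β = 9 dB are the values used later in the simulations.

```python
import numpy as np

def area_spectral_efficiency(p, lam_v, p_s, beta):
    """ASE = lambda * P_s * log2(1 + beta), eq. (4), with lambda = p * lam_v.

    p_s is the link success probability P(SINR > beta); its closed
    form is not reproduced here, so it is supplied by the caller.
    beta is the SINR threshold in linear scale.
    """
    lam = p * lam_v  # density of active transmitters per unit area
    return lam * p_s * np.log2(1.0 + beta)

beta_lin = 10 ** (9 / 10)                 # 9 dB -> linear (~7.94)
ase = area_spectral_efficiency(0.3, 0.1, 0.8, beta_lin)
```

The optimum p* could then be found by sweeping p, provided P_s is re-evaluated for each p since the interference grows with the density of active transmitters.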

Queueing Model
In our model, we represent the typical transmitter as a single-server system whose queue follows an M/D/1 model, where arrivals are determined by a Poisson process and job service times are fixed (deterministic) [21]. We now define the principal metrics of this queueing system according to our system model. Let λ_a denote the arrival rate and µ the service rate; in our case µ = P_s, and the utilization of the server is ρ = λ_a/µ. Another important metric used in our study is the average waiting time in the system, denoted by ω, which for an M/D/1 queue is given by:

ω = 1/µ + ρ/(2µ(1 − ρ))
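The M/D/1 waiting time above is straightforward to compute; the sketch below implements it, with a stability guard for ρ ≥ 1 (the example arrival rate λ_a = 0.4 is chosen for illustration only, so that the queue is stable for P_s = 0.8).

```python
def md1_waiting_time(lam_a, p_s):
    """Mean time in system for an M/D/1 queue with service rate mu = P_s.

    Pollaczek-Khinchine result for deterministic service:
        omega = 1/mu + rho / (2 * mu * (1 - rho)),  rho = lam_a / mu.
    The queue is stable only for rho < 1.
    """
    mu = p_s
    rho = lam_a / mu
    if rho >= 1.0:
        raise ValueError("unstable queue: lam_a must be < mu")
    return 1.0 / mu + rho / (2.0 * mu * (1.0 - rho))

# Example: lam_a = 0.4 messages/s, P_s = 0.8 -> rho = 0.5, omega = 1.875 s
omega = md1_waiting_time(0.4, 0.8)
```

Since µ = P_s here, lowering the transmit power lowers P_s, which raises ρ and hence ω; this is exactly the tradeoff formalized in the next subsection.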

Problem Formulation
Based on this system model, our objective is to minimize the energy utilization rate while ensuring that the system waiting time always remains under a predefined threshold and, implicitly, that the transmit power stays under a maximum value. To achieve this goal, an optimization problem must be solved in which EU_r is minimized subject to ω ≤ τ and a maximum transmit power, where τ is the system waiting time threshold and P_t = (P_t,1, ..., P_t,i, ..., P_t,N) is the set of node transmit powers.
Clearly, when the transmit power of the transmitting node decreases, the energy utilization rate decreases as well, while the system waiting time increases and may exceed the predefined limit. On the other hand, to satisfy the waiting time constraint, the transmit power should always remain above a certain value. To find the optimal transmit power, a Q-learning based power control algorithm is used.

Q-learning energy/delay optimization algorithm
We use a simple Q-learning algorithm to solve the optimization problem defined in the previous section. The agent here is the transmitting node, which wants to learn the transmit power that minimizes the energy utilization rate while satisfying the constraint on the system waiting time. The other elements of the Q-learning algorithm are defined as follows.

State: we define the state as:

Action: the action of each agent consists of a set of transmit power levels. It is denoted by A = (P_t,1, ..., P_t,i, ..., P_t,N), where every agent has N power levels.
In this paper we use an ε-greedy strategy to choose actions based on the current Q-value estimate, described as follows:
• choose an action randomly from the action space with probability ε,
• choose the action a = arg max_{a∈A} Q(s, a) with probability 1 − ε.

Reward: the reward function reflects the learning objectives of our algorithm, so we define the reward as:

Our power control strategy is described in Algorithm 1. At the beginning, we associate with each vehicular transmitter a set of feasible transmit power levels A. Then, during the learning process, we use the ε-greedy technique to select the next action to execute. The selected action is performed and the related reward is obtained; the Q-table is then updated. This process is repeated until the maximum number of iterations is reached.
Algorithm 1: Q-learning algorithm
1. Initialization:
   for each state S_i ∈ S and each action a_j ∈ A do
       initialize Q(S_i, a_j) to zero
   end for
   evaluate the starting state S_i ∈ S
2. Learning:
   while MaxIteration not reached do
       choose a_j ∈ A using the ε-greedy policy based on Q
       take action a_j and observe the immediate reward R_t and the next state S'
       Q(S_i, a_j) ← Q(S_i, a_j) + α [R_t + γ max_{a∈A} Q(S', a) − Q(S_i, a_j)]
       S_i ← S'
   end while
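The loop above can be sketched in a few lines. Since the paper's exact state and reward definitions are not reproduced in the text, the sketch below uses a single-state toy environment and a placeholder reward that simply favours an assumed optimum near −30 dBm; everything else (the power levels, α = 0.1, γ = 0.9, ε = 0.1) follows the simulation parameters given later.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: actions = discrete power levels (dBm), one state.
actions_dbm = np.arange(-60, 30, 10)       # -60 dBm .. 20 dBm, step 10
n_actions = len(actions_dbm)
alpha, gamma, eps = 0.1, 0.9, 0.1          # learning rate, discount, epsilon
Q = np.zeros((1, n_actions))               # Q-table: 1 state x N actions

def reward(a):
    """Placeholder reward (assumption): peaks at -30 dBm, standing in
    for the EU_r / waiting-time tradeoff of the paper."""
    return -abs(actions_dbm[a] + 30) / 10.0

s = 0
for _ in range(5000):
    # epsilon-greedy action selection
    if rng.random() < eps:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(Q[s]))
    r = reward(a)
    s_next = 0                             # single-state toy environment
    # Q-learning update: off-policy TD target uses max over next actions
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

best_dbm = int(actions_dbm[int(np.argmax(Q[0]))])
```

With this placeholder reward, the greedy policy settles on the assumed optimum (−30 dBm); in the paper's setting the reward would instead be driven by EU_r and the ω ≤ τ constraint.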

Simulation Parameters
In this work, we consider a VANET formed by vehicles and RSUs with density λ_v = 0.1, located on a road map with density λ_l → 0. We assume a transmission probability p = 0.3, a distance between the typical nodes d = 10 m, a path loss exponent α = 3, an SINR threshold β = 9 dB, and a noise power N_0 = −100 dB.
The Q-learning parameters are as follows: the learning rate is α = 0.1, the discount factor is γ = 0.9 and the greedy parameter is ε = 0.1. We define the action space as a vector of discrete transmit power levels from −60 dBm to 20 dBm with a step of 10 dBm. The parameters of the queueing model are λ_a = 10 messages/s and an authorized system waiting time τ = 3 s.
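Since the energy computations require linear powers while the action space is specified in dBm, the conversion is worth making explicit; the sketch below builds the action space and converts it via P[W] = 10^((P[dBm] − 30)/10).

```python
import numpy as np

# Action space: -60 dBm to 20 dBm in steps of 10 dBm (9 levels),
# converted to linear watts for the energy utilization computations.
levels_dbm = np.arange(-60, 30, 10)
levels_w = 10 ** ((levels_dbm - 30) / 10)

# e.g. 20 dBm corresponds to 0.1 W and 0 dBm to 1 mW.
```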

Results and Discussion
In Fig. 2, we plot the Q values as a function of the learning iterations. The Q values become nearly constant after about 4000 iterations, which means that our algorithm converges at that point, so the obtained results can be considered optimal. In Fig. 3, we plot the energy utilization rate against the SINR threshold. As depicted, EU_r increases with β: more transmit power is needed to achieve the required SINR when the threshold grows.
Fig. 4 gives the evolution of the system waiting time with respect to the SINR threshold. We can observe that, even though ω varies as a function of β, it always remains under the predefined value τ = 3 s.
Finally, Fig. 5 shows the variation of the area spectral efficiency versus the SINR threshold. We observe that the ASE increases as β increases, with some small fluctuations. β appears twice in the ASE expression, in the log function and in P_s; the observed fluctuations are due to the discrete steps of the transmit power.

Conclusion
In this paper, we presented a Q-learning based technique to jointly optimize the transmission energy usage and the message latency in a VANET. The network is modeled as a Cox process driven by a Poisson line process, and message transmission as an M/D/1 queue in which each packet waiting to be transmitted is stored in a FIFO buffer. Based on the transmission success probability, we select the transmission power that minimizes the energy usage rate while guaranteeing that the system waiting time always remains under a predetermined value. Numerical studies showed the effectiveness of our proposal. As future work, we plan to improve the learning method to make our proposal scalable.