Decision-Making in Autonomous Driving
Whether it is getting cut off by a Tesla or watching a Waymo taxi drive by with no one at the wheel, self-driving cars are becoming ever more present in today's society. Driving involves many diverse interactions on the road, requiring full attention to the people and obstructions in the surrounding environment. This raises the question of how exactly an autonomous vehicle can replicate the decision-making of the human brain while driving. The key to tackling this challenge is the Markov decision process.
Authors: Mara Andronache, Linda Salvalaggio, Joshua Tan
Decision-making problem
In an autonomous driving system, the three core components are environment perception, decision-making, and dynamic control. The goal of decision-making is to understand the environment and predict the movements of surrounding entities. This is a complex task which involves planning the route and determining the actions needed to follow it, all whilst responding to a dynamic and random traffic environment. The Markov decision process (MDP) is a probabilistic decision-making method which takes various elements of the traffic environment into account while considering safety, efficiency, and comfort. This mathematical formulation allows a computer to approach decision and control problems with uncertain behaviours.
Quantifying the scenario
A Markov decision process operates on the 5-tuple: $$(\mathcal{S}, \mathcal{A}, r(s, a, s'), p(s, a, s'), \gamma)$$ To simplify, only the decision-making car (ego car) and one vehicle of interest will be represented. The traffic environment is represented by the environment state space \(\mathcal{S}\). A state \(s \in \mathcal{S}\) contains variables which represent the environmental information:
- \(\Delta x_{lon}\) – how far the other vehicle is ahead of the ego car
- \(x_{lat}^{ego}, x_{lat}^{veh}\) – lateral positions of ego car and other vehicle
- \(v_{lon}^{ego}, v_{lon}^{veh}\) – longitudinal velocities of ego car and other vehicle
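To make the state representation concrete, here is a minimal Python sketch of such a state; the field names and example values are illustrative only and are not taken from any particular implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """One environment state s in the state space S (illustrative fields)."""
    dx_lon: float      # how far the other vehicle is ahead of the ego car
    x_lat_ego: float   # ego car lateral position
    x_lat_veh: float   # other vehicle lateral position
    v_lon_ego: float   # ego car longitudinal velocity
    v_lon_veh: float   # other vehicle longitudinal velocity

# Example: the other vehicle is 15 m ahead, both cars are in the right lane
# (lateral position 0), ego travelling at 20 m/s, other vehicle at 15 m/s.
s = State(dx_lon=15.0, x_lat_ego=0.0, x_lat_veh=0.0,
          v_lon_ego=20.0, v_lon_veh=15.0)
```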

Environment transition model
Since the state space \(\mathcal{S}\) takes discrete values, but the real world has a continuous state space \(\mathcal{S}_c\), a random variable \(I_\mathcal{S}\colon \mathcal{S}_c \to \mathcal{S}\) is used to connect real-world continuous states to discretized states. The inverse correspondence \(I_\mathcal{S}^{-1}\colon \mathcal{S} \to \mathcal{S}_c\) thus associates to each discrete state \(s \in \mathcal{S}\) a collection of continuous states in \(\mathcal{S}_c\). The probability that the car, starting in an initial state \(s\) and following trajectory \(a\), ends in a final state \(s'\) is given by \(p(s, a, s')\). If the car's initial state \(s\) is given and a trajectory \(a\) is predetermined, the conditional probability \(p(s' \mid s, a)\) of the transition to the final state \(s'\) is approximated using the Monte Carlo method. This method empirically draws samples in the state space and, given a trajectory \(a\), compares them with the prediction model \(p(s, a, s')\) to estimate the conditional probability \(p(s' \mid s, a)\).
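A rough sketch of this estimation procedure is given below. The helper functions sample_continuous_state (drawing from \(I_\mathcal{S}^{-1}(s)\)), simulate (the prediction model), and discretize (mapping back to \(\mathcal{S}\)) are hypothetical placeholders for the real environment model; only the counting logic is the point here.

```python
from collections import Counter

def estimate_transition_probs(s, a, sample_continuous_state, simulate,
                              discretize, n_samples=1000):
    """Monte Carlo estimate of the conditional probability p(s' | s, a).

    sample_continuous_state(s): draws a continuous state from I_S^{-1}(s)
    simulate(sc, a):            prediction model; returns the continuous state
                                reached after executing trajectory a from sc
    discretize(sc):             maps a continuous state back to a state in S
    """
    counts = Counter()
    for _ in range(n_samples):
        sc = sample_continuous_state(s)    # sample a continuous state matching s
        sc_next = simulate(sc, a)          # roll the trajectory a forward
        counts[discretize(sc_next)] += 1   # tally the resulting discrete state
    # Relative frequencies approximate p(s' | s, a) for every observed s'.
    return {s_next: count / n_samples for s_next, count in counts.items()}
```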
Environment reward model
For a transition from \(s\) to \(s'\) under action \(a\), there is a corresponding reward \(r(s, a, s')\), which can be seen as the assignment or deduction of points. The reward takes into account safety, efficiency, comfort, and the rules of the road. A large positive reward is fixed for when the ego car overtakes the other vehicle, which is a good final state. A large negative reward is fixed for when the ego car enters a bad final state, which includes colliding with the other vehicle, driving off-road, and being overtaken by the other car. When none of these final states is reached, the reward is given by
\[r = \sum_{i = 1}^{t/\Delta t} \Delta t \left( f_{com}\, a_{lon} + f_{ove}\, v_{lat} + f_{obe}\, (!\mathbb{I}_{\mathcal{S}_{right}}) \right)\]
The negative factor \(f_{com}\) punishes large accelerations \(a_{lon}\) to improve comfort and fuel economy. The positive overtaking factor \(f_{ove}\) multiplies the lateral velocity \(v_{lat}\) to encourage overtaking. The traffic-law penalty \(f_{obe}\) is a negative value that is only activated by the indicator \(!\mathbb{I}_{\mathcal{S}_{right}}\) when the car's state is not in the right lane, encouraging the car to stay in the right lane.
Because the execution of an action is needed to infer the environment model, an action \(a\) is decomposed into several steps of step time \(\Delta t\). The final reward is obtained by scaling these per-step terms by \(\Delta t\) and summing over all the steps of the action's implementation.
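As an illustration, the reward for a single action could be computed roughly as follows. The numerical weights are made-up placeholders, and the absolute value on \(a_{lon}\) reflects one possible reading of "punishing large accelerations".

```python
# Illustrative reward parameters -- placeholder values, not taken from the source.
R_GOOD = 100.0   # large positive reward: ego car overtakes the other vehicle
R_BAD = -100.0   # large negative reward: collision, driving off-road, being overtaken
F_COM = -0.1     # comfort factor (negative: punishes large accelerations)
F_OVE = 0.5      # overtaking factor (positive: rewards lateral velocity)
F_OBE = -1.0     # traffic-law penalty (negative: active outside the right lane)

def reward(steps, dt, overtook=False, failed=False):
    """Reward for one action decomposed into steps of length dt.

    steps: iterable of (a_lon, v_lat, in_right_lane) tuples, one per step.
    """
    if overtook:          # good final state
        return R_GOOD
    if failed:            # bad final state
        return R_BAD
    r = 0.0
    for a_lon, v_lat, in_right_lane in steps:
        r += dt * (F_COM * abs(a_lon)              # comfort and fuel economy
                   + F_OVE * v_lat                 # encourage overtaking
                   + F_OBE * (not in_right_lane))  # stay in the right lane
    return r

# Example: a 2-second action split into 1-second steps, ego drifting left to overtake.
print(reward([(0.5, 1.0, True), (0.2, 1.0, False)], dt=1.0))
```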
Markov decision process
The Markov decision process uses a policy \(\pi(s)\) to select actions \(a\) for each state \(s\) in order to maximize the long-term reward \(V(s)\), the value function of the policy. This policy follows the Bellman optimality equation to balance immediate and future goals. The optimal policy is given by
\[\pi^*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^*(s') \right]\]
where \(r(s,a,s')\) is the reward for transitioning from \(s\) to \(s'\) via action \(a\), \(p(s' \mid s, a)\) is the probability of this transition, and \(\gamma\) is the discount factor that prioritizes near-term rewards. This optimization problem is solved with dynamic programming to find the optimal action \(a\), based on the reward of all possible outcomes and the value of subsequent maneuvers, all weighted by the probability of each transition.
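Assuming the discrete states, actions, transition probabilities \(p(s' \mid s, a)\), and rewards \(r(s, a, s')\) have been tabulated (for example with the Monte Carlo procedure above), the dynamic-programming step can be sketched as plain value iteration. This is a generic solver, not necessarily the exact one used in a production autonomous driving stack.

```python
def value_iteration(states, actions, p, r, gamma=0.95, tol=1e-6):
    """Solve the Bellman optimality equation by value iteration.

    p[(s, a)]: dict {s_next: probability} for every state-action pair.
    r(s, a, s_next): reward for the transition.
    Returns the optimal value function V* and the optimal policy pi*.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Best expected return over all actions from state s.
            best = max(
                sum(prob * (r(s, a, s2) + gamma * V[s2])
                    for s2, prob in p[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:   # stop once the values have converged
            break
    # Extract pi*(s) = argmax_a E[r(s, a, s') + gamma * V*(s')].
    pi = {
        s: max(actions,
               key=lambda a: sum(prob * (r(s, a, s2) + gamma * V[s2])
                                 for s2, prob in p[(s, a)].items()))
        for s in states
    }
    return V, pi
```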
In order to manage situations like overtaking, avoiding collisions, and upholding lane discipline, the decision-making process depends on a real-time assessment of the surroundings. For example:
- Safety: When the vehicle approaches a slower-moving vehicle, the policy evaluates \(r(s, a, s’)\) to determine whether to decelerate or overtake. States that lead to collisions are heavily penalized with a large negative value.
- Efficiency: A reward \(r(s, a, s’)\) which is proportional to \(v_{lon}^{ego}\) incentivizes maintaining or increasing speed when safe.
- Rule Compliance: The system adds a penalty \(f_{obe}\) when the ego car stays too long in the overtaking lane, encouraging lane discipline.
The optimal policy adapts dynamically to traffic conditions. For instance, the policy chooses \(a_{lon} \to 0\) to prevent needless acceleration if \(\Delta x_{lon} < 0\) (the ego car is ahead). However, if \(\Delta x_{lon} > 10 \) and \(v_{lon}^{veh} < v_{lon}^{ego}\), the policy prioritizes overtaking by setting \(v_{lat} > 0\) temporarily to switch lanes.
When applied to dynamic environments, the MDP-derived policy achieves real-time responsiveness. For example, if a vehicle in front of the ego car suddenly slows down, the ego car adjusts \(v_{lon}^{ego}\) accordingly while considering the lateral speed \(v_{lat}\) for a possible lane change. If a high-speed vehicle overtakes, the policy maintains \(v_{lat} = 0\) to hold its position in the current lane, ensuring safety.
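In a running system, the resulting policy would be queried once per planning cycle, roughly as in the hypothetical loop below; observe, discretize, execute, and is_terminal are placeholders for the perception, discretization, control, and final-state checks described above.

```python
def drive(pi, observe, discretize, execute, is_terminal, max_cycles=1000):
    """Closed-loop use of the MDP policy pi: sense, decide, act, repeat."""
    for _ in range(max_cycles):
        continuous_state = observe()       # environment perception
        s = discretize(continuous_state)   # map onto the discrete state space S
        if is_terminal(s):                 # overtook, collided, off-road, overtaken
            break
        a = pi[s]                          # decision-making: optimal action for s
        execute(a)                         # dynamic control carries out the trajectory
```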
Thanks to this approach, autonomous vehicles are able to handle intricate and unexpected traffic conditions while maintaining optimal driving behaviour, which is all powered by probabilistic reasoning.