
Reinforcement learning with modified exploration strategy for mobile robot path planning

Published online by Cambridge University Press:  11 May 2023

Nesrine Khlif*
Affiliation:
Laboratory of Robotics, Informatics and Complex Systems (RISC lab - LR16ES07), ENIT, University of Tunis EL Manar, Le BELVEDERE, Tunis, Tunisia
Khraief Nahla
Affiliation:
Laboratory of Robotics, Informatics and Complex Systems (RISC lab - LR16ES07), ENIT, University of Tunis EL Manar, Le BELVEDERE, Tunis, Tunisia
Belghith Safya
Affiliation:
Laboratory of Robotics, Informatics and Complex Systems (RISC lab - LR16ES07), ENIT, University of Tunis EL Manar, Le BELVEDERE, Tunis, Tunisia
*
Corresponding author: Nesrine Khlif; Email: nesrine.khlif@etudiant-enit.utm.tn

Abstract

Despite the remarkable developments of recent years, path planning remains a difficult part of mobile robot navigation. Applying artificial intelligence to mobile robotics is a challenge in its own right, and reinforcement learning (RL) is one of the most widely used families of algorithms in robotics. The exploration-exploitation dilemma is a central challenge for the performance of RL algorithms: too much exploration lowers the cumulative reward, while too much exploitation can trap the agent in a local optimum. This paper proposes a new path planning method for mobile robots based on Q-learning with an improved exploration strategy. In addition, a comparative study of the Boltzmann distribution and $\epsilon$-greedy policies is presented. Simulations confirm the better performance of the proposed method in terms of execution time, path length, and cost function.

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press

1. Introduction

Motivation

The biggest challenge currently facing mobile robots is the development of intelligent navigation systems. Autonomous navigation is a research axis that enables robots to move without human intervention. In general, the task of navigating a mobile robot is to move freely in the environment without colliding with obstacles. To perform autonomous navigation tasks, mobile robots must implement a series of functions, including trajectory planning, localization, mapping, trajectory tracking, obstacle avoidance, object classification, and more. The purpose of autonomous navigation is to enable the robot to perceive the environment, plan and track trajectories in free space, and control its motion. Fig. 1 presents the mobile robot navigation system:

Figure 1. Mobile robot navigation system.

This system can be summarized by three main functions. The first is perception, where the robot gathers information about the environment via its sensors and extracts the data needed to perceive and map it. The second is path planning, which takes the map and finds an optimal path. The last is control, where the robot executes instructions and interacts with its environment.

Context

Moving from one place to another is not a trivial task for a mobile robot. In autonomous robotics, path planning is a difficult part of robot navigation. The problem consists in finding an optimal path from a start position to a target position while avoiding obstacles, so an efficient path planning algorithm is needed. Several approaches have been proposed to solve the path planning problem; they can be divided into two main categories according to the environmental information available, global and local path planning, as summarized in ref. [Reference Pei and An1]. Much recent research applies reinforcement learning (RL) algorithms to mobile robot path planning, a relatively new direction that requires no prior knowledge of the environment. RL is an artificial intelligence (AI) technique in which an agent interacts with an environment, choosing one of the available actions in each possible state. Q-learning is one of the RL algorithms that has been successfully applied in many domains, including the path planning problem of mobile robots; the article [Reference Pei and An1] surveys the most recent Q-learning algorithms that address this problem.

Problem statement

One of the most challenging aspects of Q-learning algorithms is the policy, which selects the actions that the robot can take in the current state. This policy must balance exploiting the experience already gained and exploring new information about the environment [Reference Fruit2]. The exploration-exploitation dilemma is a distinctive challenge for the performance of RL algorithms: too much exploration causes a lower accumulated reward, while too much exploitation can trap the agent in a local optimum. Many Q-learning algorithms for mobile robot path planning rely on various exploration strategies to address this problem. These strategies can be divided into two main categories: directed and undirected exploration [Reference Tijsma, Drugan and Wiering3].

Contribution

A new exploration strategy that combines $\epsilon$ -greedy exploration with Boltzmann exploration to learn the optimal path in an unknown environment is proposed in this article. The proposed method aims to combine the advantages of each policy. A comparison between $\epsilon$ -greedy, Boltzmann, and the proposed method is also presented. Simulations with the classical Q-learning algorithm demonstrate the better performance of the proposed method in terms of execution time, cost function, and path length, metrics that are closely tied to energy consumption in the robotics literature.

Paper organization

The paper is structured as follows: Related work is reviewed in Section 2. Section 3 presents background on RL, Q-learning, path planning for mobile robots, and exploration strategies. The exploration strategies used, the problem formulation, and the proposed modified Q-learning are described in Section 4. The simulation results are presented and discussed in Section 5. Finally, a conclusion is drawn in Section 6.

2. Related Studies

The exploration-exploitation problem has been studied in detail for a long time; refs. [Reference McFarlane4–Reference Wiering and Schmidhuber7] present surveys of exploration strategies in RL. In addition, refs. [Reference Tijsma, Drugan and Wiering3] and [Reference Koroveshi and Ktona8] conduct comparative studies of different methods to assess the performance of each. The authors in ref. [Reference Koroveshi and Ktona8] study different exploration strategies and compare their effect on performance to find the best strategy for an intelligent tutoring system. They compare four strategies: random, greedy, epsilon-greedy, and Boltzmann (softmax). The results show that during the training phase, the random and greedy strategies performed worst, with a negative reward in every episode, which means they chose the worst action most of the time. On the other hand, the $\epsilon$ -greedy and Boltzmann strategies performed best during the training phase, with the Boltzmann strategy obtaining slightly higher rewards. Ref. [Reference Tijsma, Drugan and Wiering3] compares four exploration strategies on random stochastic mazes (UCB-1, softmax, $\epsilon$ -greedy, and pursuit); the results of this work show that softmax has the best performance and $\epsilon$ -greedy the worst.

In ref. [Reference Li, Xu and Zuo9], an improved Q-learning using a new strategy is proposed. Ref. [Reference Liu, Zhou, Ren and Sun10] also proposes a modified Q-learning that combines the $\epsilon$ -greedy and Boltzmann methods in unknown environments and formulates the problem as a nondeterministic Markov decision process (MDP); the algorithm is validated on a simple grid-based model without obstacles, and the simulation results confirm its effectiveness in solving the path planning problem. Kim and Lee [Reference Kim and Lee11] run many experiments with different values of $\epsilon$ to improve Q-learning in real time and compare learning rates according to the choice of action policy; the results show the best performance when $\epsilon$ = 0. The authors in ref. [Reference Hester, Lopes and Stone12] propose a new algorithm (Learning Exploration Online) that receives a set of exploration strategies and learns which strategy is best according to the situation. Ref. [Reference Tokic, Dillmann, Beyerer, Hanebeck and Schultz13] proposes an adaptive $\epsilon$ -greedy method and evaluates it on a multi-armed bandit task. This method adjusts the exploration parameter of $\epsilon$ -greedy according to the value function. The paper also conducts a comparative study of the new method against the Boltzmann distribution and the classical $\epsilon$ -greedy; the results show that the proposed method is more robust over a wide range of parameter settings while still achieving acceptable performance. The survey in ref. [Reference Susan, Maziar, Harsh, Hoof and Doina14] proposes a taxonomy of RL exploration strategies based on the information an agent uses when choosing an exploration action. The authors divide exploration techniques into two broad categories: “reward-free exploration,” which selects actions entirely at random, and “reward-based exploration,” which uses information related to reward signals and can be further subdivided into “memory-free” (considering only environmental states) and “memory-based” (considering additional information about the agent’s interaction history with the environment). This survey also discusses these exploration methods and identifies their strengths and limitations.

3. Background

3.1. Reinforcement learning

RL is a very dynamic area of machine learning in terms of both theory and application. RL algorithms are inspired by animal and human learning and connect an agent to its environment via perception and action [Reference Mehta15]; the agent interacts with its environment without any explicit supervision, receiving a positive reward when it takes a “good” action and a negative reward when it takes a “bad” one. RL is based on the MDP, which is described by ref. [Reference Sutton and Barto16] as follows:

  • State s and State space S

  • Action a and Action space A

  • Reward function R(s,a)

  • P is the state transition probability for $s \in S$ and $a \in A$ .

  • $\gamma \in [0,1)$ is the discount factor which describes how the agent considers the future reward.

  • Policy ( $\pi$ ): The strategy that the agent takes to estimate next action based on the current state.

  • Value function ($V$): $V_{\pi }(s)$ is defined as the expected long-term return of the current state s under policy $\pi$ .

  • Q-value ($Q$): The Q-value is similar to the value $V$, but it takes an extra parameter: $Q_{\pi }(s, a)$ refers to the long-term return of the current state s, taking action a under policy $\pi$ .

In RL, the agent uses its past experiences to learn the optimal policy. Learning this policy involves estimating a value function, the Q-value: $Q_{\pi }(s,a)$ denotes the expected accumulated discounted future reward $R_{t}$ obtained by selecting action “a” in state “s” and thereafter following policy $\pi$ . The return value can be expressed as follows [Reference Tijsma, Drugan and Wiering3]:

(1) \begin{equation} Q_{\pi }(s,a)=E_{\pi }\Big [R_{t}\mid s_{t}=s, a_{t}=a \Big ] \end{equation}
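Here, the discounted return $R_{t}$ denotes the sum of future rewards discounted by $\gamma$ , following the standard definition of ref. [Reference Sutton and Barto16]:

\begin{equation*} R_{t}=\sum _{k=0}^{\infty }\gamma ^{k}\, r_{t+k+1} \end{equation*}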

3.2. Q-learning

Q-learning is one of the most widely used RL algorithms. It is an off-policy temporal difference learning technique [Reference Sutton and Barto16]; in an off-policy algorithm, the agent learns the value function from actions taken under a different policy. The authors of ref. [Reference Miljković, Mitić, Lazarević and Babić17] summarize the concept of the Q-learning algorithm.

If the agent visits all state-action pairs infinitely often, Q-learning converges to the optimal Q-function, and the agent obtains a set of state-action pairs with maximum cumulative reward. The basic Q-learning update rule of the value function at time step $t+1$ is given by the formula [Reference Tijsma, Drugan and Wiering3]:

(2) \begin{equation} Q(s_{t},a_{t})= (1-\alpha )\, Q(s_{t},a_{t})+\alpha \big (r_{t}+\gamma \,\underset{a}{\text{max}}\, Q(s_{t+1},a)\big ) \end{equation}

Here $0 \leq \alpha \leq 1$ is the learning rate, which controls the learning speed, and $\gamma$ is the discount factor. A pseudo-code of the general Q-learning algorithm is presented in Algorithm 1:

Algorithm 1 : The pseudo-code of general Q-learning algorithm

An essential part of Q-learning in finding an optimal policy is the selection of the action $a_{t}$ in the state $s_{t}$ : there is a tradeoff between choosing the currently expected optimal action and choosing a different action in the hope that it will yield a higher cumulative reward in the future. Step four of Algorithm 1 is the focus of the contribution of this article. To study this problem, we construct an unknown environment with many obstacles containing a goal state; the task of the mobile robot is to find an optimal path in the shortest possible time with fast convergence.
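To make the structure of Algorithm 1 concrete, the following minimal Python sketch shows a generic tabular Q-learning loop built around the update in Eq. (2). The env object with reset() and step() methods and the select_action policy are illustrative assumptions standing in for the environment and exploration strategy described later, not code from the original implementation.

import numpy as np

def q_learning(env, select_action, n_states, n_actions,
               episodes=1000, alpha=0.5, gamma=0.9):
    """Generic tabular Q-learning loop (sketch).
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done);
    select_action(Q, state) implements the exploration policy (step four of Algorithm 1)."""
    Q = np.zeros((n_states, n_actions))              # Q-table: states x actions
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = select_action(Q, state)         # exploration/exploitation step
            next_state, reward, done = env.step(action)
            # Update rule of Eq. (2)
            Q[state, action] = (1 - alpha) * Q[state, action] + \
                alpha * (reward + gamma * np.max(Q[next_state]))
            state = next_state
    return Q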

3.3. Path planning for mobile robot

Path planning is a difficult point in robot navigation. After specifying the environment, the robot must move while avoiding obstacles. An excellent planning technique saves a lot of time and effort and avoids collisions with obstacles. Over the past two decades, several planning techniques have been proposed, each with its own advantages and disadvantages compared with the others. The different path planning methods can be divided into classical control and intelligent planning.

In ref. [Reference Anis, Hachemi, Imen, Sahar, Adel, Mohamed-Foued, Maram, Omar and Yasir18], the authors classify the different path planning classes for mobile robots. According to the environmental knowledge, the nature of the environment, and the method of solving the problem, path planning can be divided into three categories, as shown in Fig. 2.

Figure 2. Path planning categories.

3.3.1. Path planning classification based on environmental knowledge

According to the environmental knowledge, path planning for mobile robots can be divided into two categories: global and local path planning. In the first category, the robot has prior knowledge of the environment, modeled as a map, and needs to know everything about the environment before the planning algorithm begins; such path planning is called global path planning [Reference Anis, Hachemi, Imen, Sahar, Adel, Mohamed-Foued, Maram, Omar and Yasir18]. Examples of these algorithms include visibility graphs, potential fields, Voronoi graphs [Reference Masehian and Amin-Naseri19], A-star (A*) [Reference Zhang and Li20], and Dijkstra [Reference Fusic, Ramkumar and Hariharan21]. The Rapidly exploring Random Tree (RRT) is also a global path planning method with many modified versions; for example, ref. [Reference Dönmez, Kocamaz and Dirik22] presents a bidirectional rapidly exploring random tree (Bi-RRT) and conducts several experiments to evaluate its performance.

The second class is local path planning, where the environment is unknown and the robot has no prior knowledge. Therefore, mobile robots must analyze their environment as they move and make decisions in real time based on that analysis. AI-based methods are considered local planning methods.

3.3.2. Path planning classification based on nature of environment

The path planning problem can be solved in either static or dynamic environments. A static environment does not vary; the start position, goal position, and obstacles are fixed over time. As an example of path planning in a static environment, ref. [Reference Choueiry, Owayjan, Diab and Achkar23] improves a genetic algorithm for path planning in a static environment.

However, in a dynamic environment, the locations of obstacles and the goal may vary during the search process. In general, path planning in dynamic environments is more complex than in static environments [Reference Hosseininejad and Dadkhah24], and path planning approaches designed for static environments are not appropriate for the dynamic problem.

3.3.3. Path planning classification based on problem-solving method

Path planning algorithms are divided into exact algorithms and heuristic algorithms according to the method of solving the path planning problem. The goal of a heuristic algorithm is to produce, in a reasonable time, a solution that is good enough to solve the given problem; this solution may not be the best of all solutions, or it may only be close to the exact one. An exact algorithm is guaranteed to find an optimal solution if a feasible solution exists, or to prove that none exists; however, on bigger and more complex problems this guarantee makes exact algorithms slow.

Robot path planning can also be classified according to the number of robots performing the same task in the same workspace. In many applications, multiple mobile robots collaborate in the same environment; this problem is called multi-robot path planning. Its goal is to find obstacle-free paths for a team of mobile robots, allowing the group to navigate together in the same workspace to achieve its goals while ensuring that no two robots collide when following their respective trajectories. As an example of the multi-robot path planning problem, we cite ref. [Reference Bae, Kim, Kim, Qian and Lee25].

3.4. Exploration strategies

This section contains a catalog of commonly used exploration strategies. The techniques outlined below can be divided into two categories: undirected and directed. Undirected exploration techniques are driven by randomness, independent of the exploration history of the learning process. The simplest example is a random walk, where the agent performs random actions [Reference Pang, Song, Zhang and Yang26]; $\epsilon$ -greedy [Reference Kim and Lee11] and the Boltzmann distribution [Reference Pan, Cai, Meng and Chen27] are other examples of undirected exploration techniques. In contrast, directed exploration strategies use the agent’s history in the environment to influence which parts of the environment it will explore further; examples of basic directed exploration strategies are counter-based exploration, error-based exploration, and recency-based exploration [Reference Thrun5]. Figure 3 summarizes all the exploration techniques mentioned in refs. [Reference McFarlane4, Reference Thrun5].

There is also an algorithm called upper confidence bound (UCB), an intelligent, directed, counter-based exploration strategy that performs well in the RL multi-armed bandit problem [Reference Mahajan, Teneketzis, Hero, Castañón, Cochran and Kastella28]. In this paper, we study the $\epsilon$ -greedy and Boltzmann distribution strategies.

4. Modified exploration strategy for path planning

In this section, we propose a path planning method for mobile robots based on RL with an improved action-selection strategy. In the classical Q-learning algorithm, the mobile agent observes the current state and chooses an action according to the proposed method combining $\epsilon$ -greedy and the Boltzmann distribution. Then, the agent performs the action and observes the accumulated reward.

4.1. Epsilon-greedy strategy

In Q-learning, we choose an action based on a reward; the agent always chooses the optimal action and therefore generates the largest possible reward for a given state. In $\epsilon$ -greedy action selection, the agent uses both exploitation, to exploit prior knowledge, and exploration, to find new options. The aim is to strike a balance between exploration and exploitation [Reference Kim, Watanabe, Nishide and Gouko29]. $\epsilon$ -greedy exploration is an exploration strategy in RL that performs an exploration action with probability $\epsilon$ and an exploitation action with probability $1-\epsilon$ .

The greedy strategy selects actions according to Eq. (3):

(3) \begin{equation} a_{t}=\underset{a}{\text{argmax}}\, Q(s_{t},a) \end{equation}

Now let us look at the implementation of the greedy strategy, which is a simple algorithm to understand and implement; the pseudo-code below represents the principle of the algorithm.

Algorithm 2 : Algorithm of ϵ-greedy policy
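As a complement to Algorithm 2, the following minimal Python sketch shows $\epsilon$ -greedy selection over a tabular Q-table; the function name and the NumPy array layout (one row per state, one column per action) are illustrative assumptions rather than the paper's code.

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.01, rng=None):
    """With probability epsilon take a random action (exploration),
    otherwise take the greedy action of Eq. (3) (exploitation)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:                  # exploration branch
        return int(rng.integers(n_actions))     # uniformly random action
    return int(np.argmax(Q[state]))             # exploitation branch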

Figure 3. Exploration-exploitation methods.

4.2. Boltzmann distribution strategy

A smooth transition between exploration and exploitation can be obtained with the Boltzmann (softmax) strategy, which uses the Boltzmann distribution function [Reference Asadi and Littman30] to assign a probability P to each action. $P(s_{t}, a)$ has the following form:

(4) \begin{equation} P(s_{t}, a) = \frac{\text{exp}\big (Q(s_{t}, a)/ T\big )}{\sum _{a'\in A} \text{exp}\big (Q(s_{t}, a')/ T\big )} \end{equation}

where T is the temperature, which controls the degree of randomness in the choice of the action with the highest Q-value. In the limit $T \rightarrow 0$ , the agent does not explore at all, and as $T \rightarrow \infty$ , the agent selects actions at random. Using softmax exploration with intermediate values of T, the agent most likely still selects the best action, but the other actions are ranked by probability instead of being chosen uniformly at random.
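A corresponding sketch of softmax (Boltzmann) selection is shown below, under the same assumptions about the Q-table layout; the Q-values are shifted by their maximum before exponentiation, a standard numerical-stability trick that leaves the probabilities of Eq. (4) unchanged.

import numpy as np

def boltzmann(Q, state, T=0.02, rng=None):
    """Sample an action with probability proportional to exp(Q/T), as in Eq. (4)."""
    rng = rng or np.random.default_rng()
    q = Q[state]
    weights = np.exp((q - q.max()) / T)         # shift by max for numerical stability
    probs = weights / weights.sum()             # Boltzmann probabilities over actions
    return int(rng.choice(len(q), p=probs))     # sample an action index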

4.3. Proposed method

In this paper, we propose a new approach to solve the exploration-exploitation dilemma by combining the Boltzmann distribution with the $\epsilon$ -greedy strategy. The new strategy has the advantage of avoiding local optima and speeding up the entire Q-learning algorithm. The strategy is described in Algorithm 3:

Algorithm 3 : Algorithm of proposed policy

This Q-learning-based enhanced strategy for solving mobile robot path planning problems in complex environments with many obstacles is a novel approach. We evaluate the method in terms of the number of steps, the speed of environment exploration, the ability to find the optimal trajectory, and the maximization of the accumulated reward.

Our approach consists of a tradeoff between the two exploration strategies, governed by the $\epsilon$ of the $\epsilon$ -greedy policy and the probability P of the Boltzmann distribution. To implement the strategy, we build an algorithm with two modules, environment creation and application of the Q-learning algorithm; an illustrative sketch of the combined policy is given at the end of this section.

In the environment creation module, we simulate the following blocks:

  • Create environments, obstacles, start and end positions.

  • Define a list of possible actions.

  • Create a function that returns the list of states and the reward value and predicts the next state.

In the Q-learning module, we simulate the following blocks:

  • Define Q-table with actions in columns and states in rows

  • Define an exploration strategy for choosing actions

  • Apply the Q-learning algorithm with the Bellman equation for path planning

In the last step of the path planning algorithm, we apply the Q-learning algorithm to the created environment, recording the number of steps taken by the mobile robot, the reward function, and the path found. Finally, all results are presented graphically.
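The exact combination rule is the one given in Algorithm 3. Purely as an illustration of how such a tradeoff can be coded, the sketch below (reusing the epsilon_greedy and boltzmann functions sketched earlier) replaces the uniform exploration branch of $\epsilon$ -greedy with Boltzmann sampling; this is an assumed variant for illustration, not necessarily the precise rule of Algorithm 3.

import numpy as np

def combined_policy(Q, state, epsilon=0.01, T=0.02, rng=None):
    """Illustrative epsilon-greedy/Boltzmann hybrid: explore by sampling from
    the Boltzmann distribution, exploit by taking the greedy action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:                  # exploration branch: Boltzmann sampling
        return boltzmann(Q, state, T=T, rng=rng)
    return int(np.argmax(Q[state]))             # exploitation branch: greedy action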

5. Simulation Results

A mobile agent navigates in a two-dimensional virtual environment with many obstacles. The main objective is to travel from the start state to the goal state while avoiding obstacles, so the agent must learn to navigate to the goal state. The methodology used in the experiments is as follows:

Episode. Following its strategy, the agent traverses the environment until it reaches the goal state. Each episode yields the cumulative reward and the total number of steps required.

Run. A full run of k episodes for a given combination of parameters (we use 1000 episodes in our simulations). A run produces a list of results for each episode.

Test. Testing of the exploitation-exploration methods ( $\epsilon$ -greedy, Boltzmann distribution, and the proposed method) in the environment, recording the best results for each episode.

Comparative study. Comparison of results obtained with each method.

5.1. Simulation

Simulations were conducted in a virtual environment with obstacles, where the mobile agent is an object that moves around and studies the environment to find a path to a goal point. The program code is written in Python 3. We run the Q-learning algorithm with different parameters and different action selection strategies. In all simulations, the total number of episodes was set to 1000.

The discount factor determines how much weight is given to future rewards in the value function. Constraining it to be less than 1 is a mathematical trick that makes infinite sums finite, which helps prove the convergence of the algorithm. A discount factor $\gamma$ = 0 results in state/action values that represent only the immediate reward, making the agent completely myopic, while a discount factor close to $\gamma$ = 1 results in an agent that evaluates each of its actions according to the sum of all its future rewards [Reference François-Lavet, Fonteneau and Ernst31]. The influence of the discount factor on convergence depends on whether the task is continual or episodic; in a continual task, $\gamma$ must lie in [0, 1). Since most actions have short-lived effects, $\gamma$ should be chosen as high as possible, so we set $\gamma$ = 0.9.

A common problem in Q-learning projects is the choice of the learning rate. The learning rate, or step size, is a hyper-parameter that determines to what extent newly acquired information overrides old information.

With a well-configured learning rate, the model learns to approximate the value function as well as possible. Generally, as noted in ref. [Reference Kim, Watanabe, Nishide and Gouko32], a factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). Unfortunately, we cannot analytically calculate the optimal learning rate for a given model; instead, we have to find a good learning rate through trial and error. The traditional default value for the learning rate lies between 0.1 and 0.9, and we fix it at $\alpha$ = 0.5.

The parameters used in our experiments are summarized in Table I.

Table I. Initial parameter of simulation.
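For reference, the parameter values stated in this section can be gathered into a single configuration, as in the illustrative snippet below (the dictionary name and keys are our own, not part of the original code):

PARAMS = {
    "episodes": 1000,   # total number of episodes per run
    "gamma": 0.9,       # discount factor
    "alpha": 0.5,       # learning rate
    "epsilon": 0.01,    # epsilon-greedy exploration parameter (Section 5.1.1)
    "T": 0.02,          # Boltzmann temperature (Section 5.1.2)
}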

5.1.1. Epsilon-greedy

The epsilon ( $\epsilon$ ) parameter introduces randomness into the algorithm and forces the agent to try different actions, which prevents it from getting stuck in local optima. When epsilon is set to 0, the agent never explores and always exploits the knowledge it already has. Conversely, setting epsilon to 1 forces the algorithm to always perform random actions and never exploit past knowledge. Usually, epsilon is chosen as a small value close to 0, so we choose $\epsilon$ = 0.01. The results of mobile robot path planning based on $\epsilon$ -greedy action selection are shown in Fig. 4.

Figure 4. Results of Q-learning based $\epsilon$ -greedy action selection.

5.1.2. Boltzmann distribution

The convergence of the algorithm is highly dependent on the parameter T, so special attention should be paid to its range and rate of change. We choose T = 0.02 according to ref. [Reference Sichkar33], in which the author works with this value and the results are obtained in virtual environments. Figure 5 shows the results of Q-learning path planning using the Boltzmann distribution.

Figure 5. Results of Q-learning based Boltzmann distribution action selection.

5.1.3. Proposed strategy

To evaluate the performance of the proposed algorithm, the state decisions, reward function, and cost function are of primary interest during the implementation of the Q-learning algorithm for obtaining the optimal path. Figure 6 presents the results of our proposed algorithm.

5.2. Discussions

For each simulation, we plot the number of steps per episode, the cost function per episode, the paths found, and the cumulative reward per episode. The results show that the three methods work well and complete the task of finding paths for the mobile agent (Figs. 4(d), 5(d), 6(c)). The graphs in Figs. 4(b), 5(b), and 6(b) show that the number of steps decreases faster for the proposed method than for the other methods; the shortest path to the goal consists of 42 steps for both $\epsilon$ -greedy and the proposed strategy and 48 steps for the Boltzmann distribution. The graphs in Figs. 4(d), 5(a), and 6(a) illustrate the difference in performance between the methods: compared with the Boltzmann distribution and the proposed method, $\epsilon$ -greedy incurs more learning cost, which means that the $\epsilon$ -greedy policy needs more steps to reach the goal. Figures 4(c), 5(c), and 6(b) show the reward signal: at the beginning of learning, the reward is −1, which means that the robot did not reach the goal. Initially, the robot makes bad moves, which is normal because the RL system starts with no prior information about the environment. The reward then fluctuates between +1 and −1 with a high switching frequency. We find that the penalty (r = −1) decreases as learning proceeds, and the reward at the end of learning stabilizes at +1, showing that the system has learned the desired behavior.

Comparing the graphs, we find that $\epsilon$ -greedy takes more time to learn and does not stabilize until the end, whereas the proposed method and the Boltzmann distribution reach the goal almost simultaneously, although the proposed method stabilizes faster. If we compare the graphs of the $\epsilon$ -greedy and Boltzmann distribution strategies, we can see that $\epsilon$ -greedy shows the slowest performance improvement and takes more time to reach the goal, even though it finds the shortest path. The proposed method takes advantage of each strategy, combining the shortest path of $\epsilon$ -greedy with the speed of the Boltzmann distribution, so it performs better than the other two strategies.

The results of each algorithm are shown in Table II. The first row shows the $\epsilon$ -greedy results, the second row the results obtained using the Boltzmann distribution, and the third row the results of the algorithm proposed in this paper.

Table II. Results of the simulation.

Figure 6. Results of Q-learning based on proposed method.

The simulation of the reward function in different episodes shows that at the beginning the robot cannot reach the goal and the reward function is 0, as shown in Fig. 7(a). Increasing the number of steps then allows the robot to learn the location of the obstacles and the structure of the environment; for example, in episode 200, shown in Fig. 7(b), the robot takes over 1750 steps. After episode 300, shown in Fig. 7(c), the robot can reach the goal and the reward function becomes 1, but the path is still long, around 300 steps; the robot keeps exploiting and exploring its knowledge to shorten the way to the goal. By episode 600, shown in Fig. 7(d), the optimal trajectory of the robot is about 40 steps.

Figure 7. Reward function for different episodes.

6. Conclusion and Future Work

As AI becomes more and more popular, more and more scientists are developing and studying RL. The choice of action is very important for finding the optimal path. Therefore, in this article, we propose a novel path planning method for mobile robots based on RL (Q-learning). The proposed method combines the $\epsilon$ -greedy and Boltzmann distribution strategies into an improved action-selection strategy.

The simulations are carried out in a two-dimensional virtual environment with many obstacles. The results of the modified method show that the self-learning phase is fast (learning stage) and that the optimal path is found with less execution time. The reward function indicates that the learning process has three stages: learning the environment, reaching the goal, and finding the best way to reach the goal. The newly proposed strategy makes the first phase faster.

The proposed work is part of our research project and opens the door for future work in the field of intelligent path planning and control of mobile robots. However, the proposed algorithm was only tested in a simulation environment and not in a real one. In future work, we will conduct experiments and debugging in a real environment with a real mobile robot. Using intelligent methods for the control of mobile robots will also be of great value in our future work, inspired by ref. [Reference Dönmez, Kocamaz and Dirik34], which uses a visual servoing go-to-goal behavior controller to control a differential drive mobile robot for a static target, and ref. [Reference Mitić and Miljković35], in which the authors propose a novel intelligent approach for mobile robots using neural networks and a learning-from-demonstration framework. We are now developing a Deep Q-learning algorithm for path planning and plan to use this study to compare the experimental parameters of both.

Author contributions

Khlif Nesrine carried out the algorithm’s implementation in Python, the Q-learning algorithm, and the comparison. The simulation and data capture were accomplished by Khlif Nesrine and Khraief Nahla. The first draft of the manuscript was written by Khlif Nesrine, and both authors revised the manuscript. Management of the study was performed by Khraief Nahla. All studies and simulations were supervised by Belghith Safya. All authors read and approved the final manuscript.

Financial support

This research received no specific grant from any funding agency, commercial, or not-for-profit sectors.

Competing interest

The authors declare no conflicts of interest exist.

Ethical approval

There are no human or animal subjects in this study. No ethical approval is required.

References

Pei, M., An, H., Liu, B. and Wang, C., “An Improved Dyna-Q Algorithm for Mobile Robot Path Planning in Unknown Dynamic Environment,” IEEE Transactions on Systems, Man, and Cybernetics: Systems 52(7), 4415–4425 (2022). doi: 10.1109/TSMC.2021.3096935.
Fruit, R., Exploration-Exploitation Dilemma in Reinforcement Learning Under Various Forms of Prior Knowledge, Artificial Intelligence [cs.AI], Université de Lille 1, Sciences et Technologies; CRIStAL UMR 9189 (2019). tel-02388395v2.
Tijsma, A. D., Drugan, M. M. and Wiering, M. A., “Comparing Exploration Strategies for Q-Learning in Random Stochastic Mazes,” In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece (2016) pp. 1–8. doi: 10.1109/SSCI.2016.7849366.
McFarlane, R., A Survey of Exploration Strategies in Reinforcement Learning (McGill University, Montreal, QC, Canada, 2018).
Thrun, S. B., Efficient Exploration in Reinforcement Learning, Technical Report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University (1992).
Thrun, S. B., Efficient Exploration in Reinforcement Learning, Technical Report (1992).
Wiering, M. and Schmidhuber, J., “Efficient Model-Based Exploration,” In: Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior (SAB98), From Animals to Animats 5, Switzerland (1998) pp. 223–228.
Koroveshi, J. and Ktona, A., “A Comparison of Exploration Strategies Used in Reinforcement Learning for Building an Intelligent Tutoring System,” In: Proceedings of the 4th International Conference on Recent Trends and Applications in Computer Science and Information Technology (RTA-CSIT), Tirana, Albania (2021).
Li, S., Xu, X. and Zuo, L., “Dynamic Path Planning of a Mobile Robot with Improved Q-Learning Algorithm,” In: 2015 IEEE International Conference on Information and Automation, Lijiang, China (2015) pp. 409–414. doi: 10.1109/ICInfA.2015.7279322.
Liu, X., Zhou, Q., Ren, H. and Sun, C., “Reinforcement Learning for Robot Navigation in Nondeterministic Environments,” In: 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), Nanjing, China (2018) pp. 615–619. doi: 10.1109/CCIS.2018.8691217.
Kim, H. and Lee, W., “Real-Time Path Planning Through Q-learning’s Exploration Strategy Adjustment,” In: 2021 International Conference on Electronics, Information, and Communication (ICEIC), Jeju, Korea (2021) pp. 1–3. doi: 10.1109/ICEIC51217.2021.9369749.
Hester, T., Lopes, M. and Stone, P., “Learning Exploration Strategies in Model-Based Reinforcement Learning,” In: 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013) (2013).
Tokic, M., “Adaptive ϵ-Greedy Exploration in Reinforcement Learning Based on Value Differences,” In: KI 2010: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 6359, Dillmann, R., Beyerer, J., Hanebeck, U. D. and Schultz, T. (eds.) (Springer, Berlin, Heidelberg, 2010). doi: 10.1007/978-3-642-16111-7.
Susan, A., Maziar, G., Harsh, S., Hoof, H. and Doina, P., A Survey of Exploration Methods in Reinforcement Learning (2021). arXiv preprint. doi: 10.48550/ARXIV.2109.00157.
Mehta, D., “State-of-the-art reinforcement learning algorithms,” Int. J. Eng. Res. Technol. 8(12), 717–722 (2019).
Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, 352 (2015), 113–138.
Miljković, Z., Mitić, M., Lazarević, M. and Babić, B., “Neural network reinforcement learning for visual control of robot manipulators,” Expert Syst. Appl. 40(5), 1721–1736 (2013). doi: 10.1016/j.eswa.2012.09.010.
Anis, K., Hachemi, B., Imen, Ch., Sahar, T., Adel, A., Mohamed-Foued, S., Maram, A., Omar, Ch. and Yasir, J., “Introduction to Mobile Robot Path Planning,” In: Robot Path Planning and Cooperation: Foundations, Algorithms and Experimentation (Springer International Publishing, Saudi Arabia, 2018). doi: 10.1007/978-3-319-77042-0.
Masehian, E. and Amin-Naseri, M., “A Voronoi diagram-visibility graph-potential field compound algorithm for robot path planning,” J. Robot. Syst. 21(6), 275–300 (2004). doi: 10.1002/rob.20014.
Zhang, L. and Li, Y., “Mobile Robot Path Planning Algorithm Based on Improved A Star,” In: Journal of Physics: Conference Series, vol. 1848, 2021 4th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2021), Sanya, China (January 29–31, 2021).
Fusic, S. J., Ramkumar, P. and Hariharan, K., “Path Planning of Robot Using Modified Dijkstra Algorithm,” In: 2018 National Power Engineering Conference (NPEC), Madurai, India (2018) pp. 1–5. doi: 10.1109/NPEC.2018.8476787.
Dönmez, E., Kocamaz, A. F. and Dirik, M., “Bi-RRT Path Extraction and Curve Fitting Smooth with Visual Based Configuration Space Mapping,” In: International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Turkey (2017) pp. 1–5. doi: 10.1109/IDAP.2017.8090214.
Choueiry, S., Owayjan, M., Diab, H. and Achkar, R., “Mobile Robot Path Planning Using Genetic Algorithm in a Static Environment,” In: 2019 Fourth International Conference on Advances in Computational Tools for Engineering Applications (ACTEA), Beirut, Lebanon (2019) pp. 1–6. doi: 10.1109/ACTEA.2019.8851100.
Hosseininejad, S. and Dadkhah, C., “Mobile robot path planning in dynamic environment based on cuckoo optimization algorithm,” Int. J. Adv. Robot. Syst. 16(2), 1729881419839575 (2019). doi: 10.1177/1729881419839575.
Bae, H., Kim, G., Kim, J., Qian, D. and Lee, S., “Multi-robot path planning method using reinforcement learning,” Appl. Sci. 9(15), 3057 (2019). doi: 10.3390/app9153057.
Pang, B., Song, Y., Zhang, C. and Yang, R., “Effect of random walk methods on searching efficiency in swarm robots for area exploration,” Appl. Intell. 51(7), 5189–5199 (2021). doi: 10.1007/s10489-020-02060-0.
Pan, L., Cai, Q., Meng, Q., Chen, W. et al., “Reinforcement Learning with Dynamic Boltzmann Softmax Updates,” In: IJCAI’20: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan (2021) pp. 1992–1998. doi: 10.24963/ijcai.2020/272.
Mahajan, A. and Teneketzis, D., “Multi-Armed Bandit Problems,” In: Foundations and Applications of Sensor Management, Hero, A. O., Castañón, D. A., Cochran, D. and Kastella, K. (eds.) (Springer, Boston, MA, 2008). doi: 10.1007/978-0-387-49819-5.
Asadi, K. and Littman, M. L., “An Alternative Softmax Operator for Reinforcement Learning,” In: Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia (2017) pp. 243–252.
François-Lavet, V., Fonteneau, R. and Ernst, D., “How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies,” In: NIPS 2015 Deep Reinforcement Learning Workshop (2016). doi: 10.48550/arXiv.1512.02011.
Brownlee, J., How to Configure the Learning Rate When Training Deep Learning Neural Networks, In: Deep Learning Performance (Machine Learning Mastery, 2019).
Kim, C. H., Watanabe, K., Nishide, S. and Gouko, M., “Epsilon-Greedy Babbling,” In: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), Lisbon, Portugal (2017) pp. 227–232. doi: 10.1109/DEVLRN.2017.8329812.
Sichkar, V. N., “Reinforcement Learning Algorithms in Global Path Planning for Mobile Robot,” In: 2019 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia (2019) pp. 1–5. doi: 10.1109/ICIEAM.2019.8742915.
Dönmez, E., Kocamaz, A. F. and Dirik, M., “A vision-based real-time mobile robot controller design based on Gaussian function for indoor environment,” Arab. J. Sci. Eng. 43(12), 7127–7142 (2018). doi: 10.1007/s13369-017-2917-0.
Mitić, M. and Miljković, Z., “Neural network learning from demonstration and epipolar geometry for visual control of a nonholonomic mobile robot,” Soft Comput. 18(5), 1011–1025 (2014). doi: 10.1007/s00500-013-1121-8.