Developing Decision-Making Algorithm for Unmanned Vessel Navigation Using Markov Processes

Abstract: In this study, the autonomous decision-making architecture of unmanned vessel navigation has been formulated. The aim of this study is to advance mathematical methods in the ship transportation field, with relevance to collision avoidance scenarios. Safely navigating a vessel at sea requires seafarers to make appropriate decisions at the appropriate time. In our model, we do not input the appropriate action order based on a seafarer's experience. Instead, the model scores each step's reward by its action behaviour and learns how to avoid obstacles by itself. By defining the decision timing, state, and reward, and by digitizing the seafarer's decision, we establish a reinforcement learning algorithm based on Markov decision processes. In model training under a single-factor influence, the vessel tends to change course with the most appropriate action behaviour, which is almost consistent with decision-making behaviour based on actual experience at sea.


Introduction
In the past few decades, increasing attention has been paid to Markov decision process (MDP) algorithms, partly due to the success of self-driving car research using reinforcement learning methods [1]. With the rapid development of automatic control, the Internet of Things (IoT), big data, state awareness, telecommunication, and other navigation technologies, the technical feasibility of smart ships has increased broadly [2]. In particular, the unmanned vessel yet to be christened, Yara Birkeland, is expected to begin its voyage in 2018 [3]. Large-scale merchant transportation by unmanned vessels has obvious advantages. For example, a fleet of unmanned container ships would maximize the capacity of storage space by excluding the bridge and the living area for onboard seafarers, thereby maximizing the cargo volume and improving transportation efficiency.
Moreover, many of the facilities on board serve the onboard seafarers, such as life-saving equipment, firefighting apparatus, pollution prevention equipment, and living facilities. In the absence of seafarers, such equipment will not be required, which reduces the weight of the ship and its energy consumption, lowers the construction and operating costs, and increases the ship's cargo capacity. In addition, the main causes of maritime accidents are human-based factors, such as inadequate decision-making, operational negligence, deficient emergency response, and other seafarer factors.
In an unmanned vessel, ship maneuvering is conducted primarily through automatic decision-making and remote monitoring by personnel working under better working conditions at a shore side control station [4]. Thus, the impact of human-based factors is reduced.
However, many challenges remain. One of the key problems of maritime navigation is the current trend towards human-centered decision-making systems. The process of seafarers maneuvering a ship at sea entails enacting the appropriate decision-making procedures at the appropriate time. Reinforcement learning has proven capable of developing learning models that are effective in planning [5]. In prior research, the authors used the Robot Operating System as a tool, extended Markov Decision Making (MDM), and supported decision-making methodologies based on MDPs [6]. The aim of the MDPs is to provide an action set of decision-making for the onboard cycle.
Given the complex maritime environment, predetermination is the basic requirement for safe navigation of unmanned vessels. This study consists of three sections. The first section clarifies the concept of MDPs, focusing on how to use MDPs to optimize the decision-making procedure when no seafarer is on the bridge. In the second section, we use MDPs to formulate a mathematical model and classify the different elements of the navigation behaviour state. The third section concludes this study by presenting an algorithm for a very simple demonstration. The purpose of this paper is to build an automatic decision cycle model that increases the decision-making efficiency and safety performance of unmanned ships. The experiments conducted in this article also show that the performance of autonomous decision-making can be improved. With model training under a single-factor influence, the vessel tends to change course with the most appropriate maneuver behaviour, which is almost consistent with decision-making behaviour based on an expert's actual experience at sea.

Mathematical Description of the Navigation State
Generally, in a vessel navigating at sea, the officer of the watch learns to recognize the state of the navigation environment through equipment and look-out experience. The scenario in this study describes a case without any seafarers on the bridge; therefore, the vessel recognizes the surrounding environment only with devices such as radar, camera, and the automatic identification system (AIS). Within the scope of this study, we assume that the data obtained by all other devices has been integrated into a similar state framework (radar screen or electronic chart display). As shown in Fig. 1, a crossing encounter situation has been constructed between the vessel on the port side and the own ship. According to the International Regulations for Preventing Collisions at Sea (COLREGS), the own ship has no obligation to take action to avoid the target vessel, but it should always pay attention to the changing state. Further, there is an obstacle (oil rig) ahead on the starboard side of the own ship, which must be passed at more than a safe distance. This constitutes one of the most common navigational environments at sea. The state changes with the surrounding environment and with the navigational state of the own ship. Each state change of the model requires a decision that weighs the next safe action behaviour against the risk probability of exercising that action behaviour. Therefore, the own ship must comprehensively predict and decide the action change for each state to determine the optimal maneuver policy. There are two main ways to change the action behaviour of the own ship: change the course by steering (turn to port side 'P' or turn to starboard side 'S') or change the speed (accelerate 'A' or decelerate 'D'). Because each action change can have a different amplitude, each can be regarded as a discrete vector set, and the entire control tensor consists of the free combination of the two action sets: $T_s = \{A, D, P, S, AP, AS, DP, DS\}$.
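The control set can also be enumerated programmatically. The following is a minimal Python sketch, assuming that the combined actions are formed by pairing one speed action with one course action; the names `SPEED_ACTIONS`, `COURSE_ACTIONS`, and `build_action_set` are hypothetical and do not come from the study.

```python
from itertools import product

# Single-axis actions named in the text.
SPEED_ACTIONS = ("A", "D")   # accelerate, decelerate
COURSE_ACTIONS = ("P", "S")  # turn to port, turn to starboard


def build_action_set():
    """Return the four single actions plus every speed+course pair."""
    singles = list(SPEED_ACTIONS + COURSE_ACTIONS)
    combined = [s + c for s, c in product(SPEED_ACTIONS, COURSE_ACTIONS)]
    return singles + combined


T_S = build_action_set()
print(T_S)  # ['A', 'D', 'P', 'S', 'AP', 'AS', 'DP', 'DS']
```

Action amplitudes (for example, rudder angle) are deliberately left out; they would add a second dimension to each entry.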

MDPs
MDPs are well suited to a wide range of optimization problems solved through dynamic programming and reinforcement learning. They are a class of stochastic sequential decision processes in which the reward and transition functions depend only on the current state of the model and the current action [7]. With the mathematical framework of the navigation states constructed, we can build a simple model of completely autonomous decision-making based on Markov processes, as shown in Fig. 2. The green and orange circles represent the states and actions, respectively. In an MDP, an action taken in a given state may have more than one result. For example, after action a0 is taken in state S1, the model may return to the previous state S0 or change to state S2. It is also possible for the model to remain in the current state S1. The number above each vector arrow represents the probability that the previous state will shift to another state after the action. R, corresponding to the vector arrow, is the reward we need to observe.
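A stochastic transition of this kind can be sketched in a few lines of Python. The probabilities and rewards below are illustrative assumptions only and are not taken from Fig. 2; `TRANSITIONS` and `step` are hypothetical names.

```python
import random

# Hypothetical transition model: TRANSITIONS[state][action] is a list of
# (next_state, probability, reward). The numbers are illustrative only.
TRANSITIONS = {
    "S1": {
        "a0": [("S0", 0.3, -1.0), ("S1", 0.2, 0.0), ("S2", 0.5, 1.0)],
    },
}


def step(state, action, rng):
    """Sample (next_state, reward) for taking `action` in `state`."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for _, p, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)
print(step("S1", "a0", rng))
```

The next state and reward depend only on the current state and action, which is the Markov property described above.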

State Space
The decision state space, S, is the state of the unmanned navigation environment, which combines a vector set of the state, $P_{(x)}$, and a vector set of the dynamic change of external context, $Q_{(y)}$. $P_{(x)} = \{P_1, P_2, \ldots, P_m\}$ describes the set of actions, from act 1 to act m, while $Q_{(y)} = \{Q_1, Q_2, \ldots, Q_n\}$ is the set of vectors for which the model takes a specific decision procedure to change the navigation state. Therefore, we obtain the formula $S = (P_{(x)}, Q_{(y)}, \rho)$, where ρ is the factorial of the set, i!, that describes the total summation of the most recent situation. The decision state space is described by three equations, from which four further equations, one per decision situation, are obtained (4) [the equation bodies and the definitions of the other two situations are not recoverable from the source]. Of the four situations, V describes a single change in the unmanned vessel's maneuvers, but no change in the external context. This situation rarely occurs. Thus, the index weight will be relatively small.
Y describes the situation where a change occurs in both the unmanned vessel's maneuvers and the external context. This situation commonly occurs in complex waters, and the model needs a continuous decision-making process. Thus, the index weight will be relatively large.
Subsequently, the complete set of decision states is the union of the four situations. When the unmanned boat's performance limit y is a constant, θ represents all the possible operational decision states.

Action Set
For an unmanned vessel in the actual navigation process, from point A to point B, the system's navigation environment variables may occur in the entire navigation time interval [a, b]. In a safe navigation condition, the unmanned vessel does not have to make any maneuvers to change the model's navigational state; the actual decision-making time occurs only when the unmanned vessel encounters something affecting the safety of the navigation. Thus, the number of valid environment variables is the number of the MDP model's decision moments. For

Reward Function
It is assumed that the operating time interval of the unmanned vessel decision model is a time series that follows an exponential distribution. For the state S transferring to state S' by action A, the mathematical expression [not recoverable from the source] combines two terms: the reward generated by the traffic event occurrence of the action when the model is under a given decision operation, and the probability that the unmanned vessel in that decision-making operation needs to act upon the operation.
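The exponential timing assumption can be illustrated with a short Python sketch. It assumes that successive intervals between decision moments are independent and exponentially distributed with a single rate; the function name, `rate`, and `horizon` are hypothetical and are not parameters given in the study.

```python
import random


def sample_decision_times(rate, horizon, rng):
    """Return decision moments in [0, horizon] separated by exponential gaps.

    rate    -- assumed mean number of decisions per unit time (hypothetical)
    horizon -- length of the navigation time interval
    """
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(42)
decision_times = sample_decision_times(rate=0.5, horizon=60.0, rng=rng)
print(len(decision_times), "decision moments in 60 time units")
```

Each sampled moment would correspond to one decision epoch of the continuous-time model.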
The MDP decision model based on a continuous timeline is now complete.

Environment Formulation
Long-distance maritime transportation vessels have great inertia, which makes them difficult to control. Deceleration may cause damage to the ship's power system, while frequent steering with the rudder could reduce the service life of the steering gear and propeller. Therefore, to optimize maneuvering, this study introduces the Markov process, with safe navigation constraints, as an unmanned vessel decision model. It integrates all the navigation data in a continuous decision-making timeline for the action set T. In an avoidance scenario, maximizing the local reward function is not sufficient; the global process must also attain the maximum reward value. Thus, the model needs to optimize navigation with less manipulation while obtaining a larger reward function value. We can further simplify the MDP model and use the Python programming language to build a simple marine collision avoidance environment for simulation and verification.

Simplified Decision Algorithm
The process shown in Fig. 3 is a navigation situation commonly encountered by a vessel at sea. The own ship starts from the original state S0. According to the changes in the surrounding environment, the model tries a series of simple action behaviours a0. The model keeps changing the state of the ship until collision avoidance has been fully achieved, which gives the state Sn. The purpose of the decision-making algorithm is to try manipulations during this process; the model is not told directly whether a manipulation is right or wrong. Thus, the model only scores a reward value for each random action and makes the final global reward Rt approach an optimal value. The algorithm uses the following quantities: α is the learning efficiency, γ is the decay rate, MP(S_o, a_o) is the true value, and MP(S, a) is the estimated value.
In the model training of this study, we assume the following encounter scenario at sea: according to navigation experience and the collision avoidance rules (COLREGS), the appropriate action behaviour of an experienced officer of the watch is to hold "starboard five" for 10 s, which safely avoids the obstacles. Therefore, to simplify the demonstration algorithm, we run a single-threaded simulation in which one action behaviour is taken every second. The action may repeat the one taken in the previous state, and only the "port five" or "starboard five" steering actions can be chosen. Once the reward values for the different manipulative behaviours have been set, the whole training process proceeds without human intervention. If the trained model can perform the avoidance and steering operation in around 10 actions, this indicates that the method can be applied to unmanned vessel collision avoidance operations in a real offshore environment.
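The quantities α, γ, the true value, and the estimated value match the terms of a standard tabular Q-learning update. Assuming that this is the rule intended (the source does not state it explicitly), the update can be written as follows, where R is the step reward, S' is the state reached after the action, and a' ranges over the actions available in S':

```latex
\[
MP(S, a) \leftarrow MP(S, a)
  + \alpha \Bigl[ \underbrace{R + \gamma \max_{a'} MP(S', a')}_{\text{true (target) value}}
  - \underbrace{MP(S, a)}_{\text{estimated value}} \Bigr]
\]
```

Under this reading, α controls how far each estimate moves toward its target, and γ discounts rewards that arrive later in the avoidance manoeuvre.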
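The training scenario can be sketched as a small tabular Q-learning loop in Python. This is an illustration under stated assumptions, not the study's implementation: the state is taken to be the net number of "starboard five" seconds accumulated so far, a reward of 1 is given only when 10 such seconds have been reached, and the values of `ALPHA`, `GAMMA`, and `EPSILON` are hypothetical.

```python
import random

N_STEPS_TO_CLEAR = 10                  # "starboard five" held for 10 one-second steps
PORT, STARBOARD = 0, 1                 # the only two permitted steering actions
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # hypothetical hyperparameters


def train(episodes=20, seed=0):
    """Run tabular Q-learning; return the Q-table and steps per episode."""
    rng = random.Random(seed)
    # q[state][action]; state = net starboard seconds so far (0..10)
    q = [[0.0, 0.0] for _ in range(N_STEPS_TO_CLEAR + 1)]
    steps_per_episode = []
    for _ in range(episodes):
        state, steps = 0, 0
        while state < N_STEPS_TO_CLEAR:
            # Epsilon-greedy choice; act randomly while the values are tied.
            if rng.random() < EPSILON or q[state][PORT] == q[state][STARBOARD]:
                action = rng.randrange(2)
            else:
                action = PORT if q[state][PORT] > q[state][STARBOARD] else STARBOARD
            # A port step undoes one second of progress (floored at zero).
            next_state = state + 1 if action == STARBOARD else max(state - 1, 0)
            done = next_state == N_STEPS_TO_CLEAR
            reward = 1.0 if done else 0.0
            target = reward if done else GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state, steps = next_state, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode


q_table, steps_log = train()
print(steps_log)
```

The returned list gives the number of actions taken in each training run, which is the quantity of interest when judging whether the model approaches the 10-action manoeuvre.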

Training Result
After 20 iterations of model training, we obtain the training results shown in Fig. 4. In the first training run, the model did not know which manipulation was suitable for avoiding the obstacles. Therefore, it took 72 steps to achieve obstacle avoidance, which does not meet the requirement for collision avoidance in a practical situation at sea. However, as the number of training runs increased, and especially after 5 runs, the model quickly learned that "starboard five" is the most effective action to avoid the obstacles in front of it, and the number of subsequent rudder steps was clearly reduced. Finally, the model stabilizes at about 10 steps to complete the collision avoidance manipulation.

Steering Action Data Analysis
Moreover, as the number of training runs increases, we also obtain the results shown in Fig. 5. The abscissa in the figure is the number of steps performed in one state, and the ordinate is the average of the probabilities for decision-making. At the beginning, the algorithm does not know whether turning to port or to starboard will produce a desirable outcome, but a pattern can be observed in the statistics of the 20 training runs. Generally, after five steps, the probability of the "starboard five" action is significantly higher than that of the "port five" decision, so the model learns more quickly how to avoid the obstacles in front of it. Therefore, the results also show that the vessel is controlled to navigate safely and that the Markov processes-based autonomous decision-making model is applicable to unmanned vessel navigation at sea.
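A per-step statistic of this kind can be computed from logged action sequences. The sketch below assumes that the averaged decision probability can be approximated by the empirical frequency of "starboard" at each step index across runs; the function name and the example logs are hypothetical and are not data from the study.

```python
def starboard_frequency(action_logs, max_step):
    """For each step index, return the fraction of runs that chose 'starboard'.

    Runs shorter than a given step index are ignored for that index;
    None is returned where no run reached the step.
    """
    freqs = []
    for i in range(max_step):
        chosen = [log[i] for log in action_logs if len(log) > i]
        if not chosen:
            freqs.append(None)
        else:
            freqs.append(chosen.count("starboard") / len(chosen))
    return freqs


# Two made-up runs, used only to show the call.
logs = [
    ["port", "starboard", "starboard"],
    ["starboard", "starboard"],
]
print(starboard_frequency(logs, 3))  # [0.5, 1.0, 1.0]
```

The "port" frequency at each step is one minus the value returned, since only the two steering actions are available.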

Conclusion
This study formulated a Markov processes-based autonomous decision-making model to help an unmanned vessel achieve collision avoidance by methods it learns by itself. Further, it constructed a simple common encounter environment in the Python programming language, in which holding a "starboard five" rudder order for 10 s completes the obstacle avoidance. This action behaviour is known from the experience of officers of the watch and was learned by the model through training. No human-based factors control the model to avoid the obstacles. The model explores on its own and, at each step, receives a reward that guides it toward the most appropriate solution for achieving the collision avoidance goal.