Motion planning framework based on dual-agent DDPG method for dual-arm robots guided by human joint angle constraints

1 Introduction

In recent years, more and more researchers have paid attention to dual-arm robots, which are closer to human beings in terms of configuration, joint degrees of freedom, and workspace, and which can better take over human tasks by imitating human arms (Wang et al., 2022). Compared with single-arm robots, dual-arm robots have significant advantages in highly coordinated precision assembly and in multi-object assembly in unstructured environments. Their many degrees of freedom and the tight coordination they require make dual-arm motion planning a high-dimensional, strongly coupled problem. In particular, for complex multi-step tasks, the motion of a dual-arm robot can be divided into multiple sub-task processes, which further raises the dimensionality of motion planning at the task-planning level. Traditional motion planning methods mainly address the obstacle-avoidance problem of dual-arm robots by establishing constraint models and solving them as non-linear problems (Vahrenkamp et al., 2012; Fang et al., 2015; Giftthaler et al., 2017). However, such methods are of limited use for the multi-step coordination tasks of dual-arm robots, which restricts their application.

With the development of machine learning, more and more researchers use learning-based methods to plan the motion of dual-arm robots in multi-step coordination tasks (Bing et al., 2022a, 2023b,d). Learning-based motion planning methods are mainly divided into imitation learning and reinforcement learning. Motion planning based on imitation learning learns motion features from teaching demonstrations and then reproduces the demonstrated task on the robot. Maeda et al. (2020) proposed a programming-by-demonstration method that automatically derives task constraints from data for constraint-based robot controllers using the Dirichlet process Gaussian mixture model (DPGMM) and Gaussian mixture regression (GMR). In the study by Mronga and Kirchner (2021), phase portrait movement primitives (PPMPs), which predict the dynamics of a low-dimensional phase space and can then be used to control the high-dimensional kinematics of the task, were proposed. In the study by Dong et al. (2022), a model-based learnable graph attention network (GAT) was used to passively learn task-level skills from human demonstrations and was validated in a humanoid-robot experiment of waving and grasping boxes. Methods in this category imitate humans at the level of learning content and effectively acquire human motion knowledge, but they cannot independently optimize trajectories or learn new ones.

Motion planning based on reinforcement learning lets the agent explore and learn motion strategies by interacting with the environment (Bing et al., 2022b, 2023a; Chu et al., 2022). For example, Ren and Ben-Tzvi (2020) proposed an advising reinforcement learning approach based on the deep deterministic policy gradient (DDPG) and hindsight experience replay (HER), which applies the teacher-student framework to continuous control environments with sparse rewards to address the problem of extended agents. In the study by Jiang et al. (2021), a multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm was proposed for the on-orbit acquisition mission of a space robotic arm to generate real-time inverse kinematics solutions for the coordinated arms. In the study by Tang et al. (2022), the proximal policy optimization (PPO) algorithm with continuous rewards was used for trajectory planning of a dual-arm robot, and the reward and punishment function was designed based on the artificial potential field (APF) method so that the robot could approach and support patients in a complex environment. Methods in this category imitate humans at the level of learning style, mimicking the human trial-and-error, reward-driven learning mechanism. However, dual-arm multi-step tasks make trajectory planning high-dimensional, which enlarges the search space of reinforcement learning; as a result, training easily falls into local optima and struggles to converge.

In our previous study, the proposed dual-agent DDPG (DADDPG) algorithm could, to a certain extent, reduce the dimensionality of and decouple the dual-arm trajectory planning problem, and it successfully planned coordinated dual-arm trajectories for multi-objective tasks (Liang et al., 2023). Building on the DADDPG algorithm, this study proposes a motion planning framework guided by human joint angle constraints, which realizes human-like learning in both content and style. By introducing human joint angle constraints, the method reduces the exploration space of reinforcement learning, makes the exploration more reasonable and, to a certain extent, controllable, and accelerates learning.

Building a robot's structure or control algorithm by imitating humans or animals has long been a promising means of improving robot performance (Bing et al., 2023c). Bionics-based human-like arm motion planning methods extract biological features and rules from recorded movements to simulate arm trajectories (Gulletta et al., 2020). For example, Kim et al. proposed a method that extracts human arm movement features from a motion capture database, characterizes human arm movement by the elbow elevation angle, and uses this representation to generate human-like movements in real time (Kim et al., 2006; Shin and Kim, 2014). In the study by Suárez et al. (2015), a motion planning method for a dual-arm anthropomorphic system was proposed in which a new basis of the dual-arm configuration space, obtained by principal component analysis (PCA), was used to characterize dual-arm synergies. In this study, human joint angles are calculated from demonstration data collected by IMUs and mapped to the robot model. Joint angle constraints are then extracted piecewise from multiple groups of human demonstrations and used to guide the autonomous learning of multi-step coordinated trajectories for dual-arm robots.

Three kinds of existing methods use empirical knowledge to guide and optimize reinforcement learning: reward function optimization, exploration behavior optimization, and network parameter initialization (Taylor et al., 2011; Bougie et al., 2018; Xiang and Su, 2019). Among them, reward function optimization is the most consistent with human behavior patterns: it models the reward function of the reinforcement learning method on an empirical knowledge model, which guides reinforcement learning intuitively and effectively (Tian et al., 2021). The segmented guided step reward in this study is designed so that the joint angle constraints guide the DADDPG method to quickly learn coordinated dual-arm trajectories for complex multi-step tasks.

2 Methodology

This study proposes a motion planning framework for dual-arm robots guided by human joint angle constraints and based on the DADDPG method, as shown in Figure 1. In the proposed framework, joint angles are first calculated from the demonstration data collected by IMUs and mapped to the robot model. Joint angle constraints are then extracted piecewise from multiple groups of human demonstrations and used, through reward distribution, to guide the DADDPG method to quickly learn coordinated dual-arm trajectories for complex multi-step tasks. The following subsections introduce the four components in detail: the joint mapping model, the joint angle constraint, the reinforcement learning method, and the reward guidance design.

Figure 1. The proposed framework block diagram.

2.1 Human-robot joint mapping model

The human arm is modeled with three joints: shoulder, elbow, and wrist. The shoulder joint has three rotational degrees of freedom, the elbow joint has two, and the wrist joint has two. The kinematic models of the human arm and the Baxter robot are established using the standard D-H method (Denavit and Hartenberg, 1955); the parameters are shown in Tables 1, 2. Here, q1, q2, and q3 are the rotation angles of the shoulder joint, q4 and q5 are those of the elbow joint, and q6 and q7 are those of the wrist joint. When the same joint angle values are input, the postures of the two models are consistent, as shown in Figure 2: Figure 2A shows the simplified kinematic model of the human arms, and Figure 2B shows the kinematic model of the Baxter robot.

Table 1. D-H parameters of the human arm (right).

Table 2. D-H parameters of the Baxter arm (right).

Figure 2. Comparison of human and robot poses when the joint angles $q_l = q_r = [-0.078, -0.968, -1.150, 1.923, 0.648, 1.008, -0.483]$ are set. (A) Simplified kinematic model of human arms. (B) Kinematic model of Baxter robot.
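To make the joint-mapping model concrete, the sketch below chains standard D-H transforms to compute forward kinematics from a parameter table. The D-H values in the script are hypothetical placeholders, not the values of Tables 1, 2; the sketch only illustrates how the same seven joint angles drive both the human-arm model and the Baxter model.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard D-H homogeneous transform for one joint (Denavit-Hartenberg convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

def forward_kinematics(q, dh_table):
    """Chain the seven joint transforms; q holds the joint angles q1..q7."""
    T = np.eye(4)
    for qi, (d, a, alpha) in zip(q, dh_table):
        T = T @ dh_transform(qi, d, a, alpha)
    return T  # pose of the last (palm/wrist) frame in the base frame

# Hypothetical (d, a, alpha) per joint -- placeholders, NOT the values from Tables 1, 2.
DH_PLACEHOLDER = [(0.27, 0.07, -np.pi / 2), (0.0, 0.0, np.pi / 2),
                  (0.36, 0.07, -np.pi / 2), (0.0, 0.0, np.pi / 2),
                  (0.37, 0.01, -np.pi / 2), (0.0, 0.0, np.pi / 2),
                  (0.28, 0.0, 0.0)]

q_r = [-0.078, -0.968, -1.150, 1.923, 0.648, 1.008, -0.483]  # right-arm angles from Figure 2
print(forward_kinematics(q_r, DH_PLACEHOLDER))
```

Feeding the same angle vector into the human-arm and Baxter D-H tables yields the consistent postures shown in Figure 2.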

The following describes how the corresponding joint angles are solved from the demonstration data measured by the IMUs. The presenter wears six IMUs, as shown in Figure 3A, three on each arm, to measure the spatial orientations of the upper arm, forearm, and palm. Taking the right arm as an example, the coordinate frames are shown in Figure 3B: G is the global frame, U is the upper arm frame, F is the forearm frame, P is the palm frame, and $I_U$, $I_F$, and $I_P$ are the frames of the IMUs on the upper arm, forearm, and palm, respectively.

Figure 3. IMUs and coordinate frame schematic. (A) The position of the IMU on the presenter's arm. (B) Coordinate frame diagram.

The initial right-arm joint angles are specified as $q_r^0 = [q_1, q_2, q_3, q_4, q_5, q_6, q_7] = [0, 0, 0, 0, 0, 0, 0]$. In this initial configuration, the orientation of each right-arm frame with respect to the global frame is $R_{GU}^{0} = R_{GF}^{0} = R_{GP}^{0} = [0, 0, 1;\; 0, 1, 0;\; -1, 0, 0]$. The measured values of the IMUs on the upper arm, forearm, and palm are $R_{GI_U}^{0}$, $R_{GI_F}^{0}$, and $R_{GI_P}^{0}$, respectively, i.e., the orientations of the IMU frames with respect to the global frame. The orientation of each IMU relative to its arm segment can then be determined as shown in Equation (1):

\begin{cases}
R_{UI_U} = (R_{GU}^{0})^{T} R_{GI_U}^{0} \\
R_{FI_F} = (R_{GF}^{0})^{T} R_{GI_F}^{0} \\
R_{PI_P} = (R_{GP}^{0})^{T} R_{GI_P}^{0}
\end{cases}    (1)

where $R_{UI_U}$ is the orientation of the IMU frame on the upper arm relative to the upper arm frame, $R_{FI_F}$ is the orientation of the IMU frame on the forearm relative to the forearm frame, and $R_{PI_P}$ is the orientation of the IMU frame on the palm relative to the palm frame.
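A minimal sketch of this calibration step is given below, assuming each IMU reading is already available as a 3 × 3 rotation matrix; the variable names (e.g., R_G_IU0) are illustrative and the identity reading is only a placeholder.

```python
import numpy as np

# Orientation of upper arm, forearm, and palm w.r.t. the global frame in the
# initial configuration (q_r^0 = 0), as specified in the text.
R_GU0 = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])
R_GF0 = R_GU0.copy()
R_GP0 = R_GU0.copy()

def imu_to_segment_offset(R_G_segment0, R_G_imu0):
    """Equation (1): fixed orientation of an IMU frame relative to its arm segment."""
    return R_G_segment0.T @ R_G_imu0

# R_G_IU0 would be the initial reading of the upper-arm IMU (placeholder here).
R_G_IU0 = np.eye(3)
R_UIU = imu_to_segment_offset(R_GU0, R_G_IU0)
```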

When the arm moves to a new position, the orientation of each IMU with respect to its arm segment remains unchanged. Let the new orientations of the IMU frames relative to the global frame be $R_{GI_U}^{new}$, $R_{GI_F}^{new}$, and $R_{GI_P}^{new}$. The orientations of the upper arm, forearm, and palm can then be calculated using Equation (2):

\begin{cases}
R_{GU}^{new} = R_{GI_U}^{new} (R_{UI_U})^{T} \\
R_{GF}^{new} = R_{GI_F}^{new} (R_{FI_F})^{T} \\
R_{GP}^{new} = R_{GI_P}^{new} (R_{PI_P})^{T}
\end{cases}    (2)

where $R_{GU}^{new}$, $R_{GF}^{new}$, and $R_{GP}^{new}$ are the orientations of the upper arm, forearm, and palm frames with respect to the global frame in the new position. The relationship between the orientation of the upper arm frame with respect to the global frame and the joint angles q1, q2, and q3 is given in Equation (3):

R_{GU}^{new} = R_{X}(-q_1)\, R_{Y}(q_2)\, R_{Z}(q_3)    (3)

By substituting Equation (2) into Equation (3), the joint angles q1, q2, and q3 corresponding to the shoulder joint can be calculated as Equation (4):

\begin{cases}
q_1 = -\operatorname{atan2}\!\left(-R_{GU}^{new}(2,3),\; R_{GU}^{new}(3,3)\right) \\
q_3 = \operatorname{atan2}\!\left(-R_{GU}^{new}(1,2),\; R_{GU}^{new}(1,1)\right) \\
q_2 = \operatorname{atan2}\!\left(-R_{GU}^{new}(1,3),\; \dfrac{R_{GU}^{new}(1,1)}{\cos(q_3)}\right)
\end{cases}    (4)
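The sketch below implements Equation (4) for the shoulder joint, assuming R_GU_new has already been obtained from Equation (2); indices are converted from the paper's 1-based matrix notation to 0-based NumPy indexing.

```python
import numpy as np

def shoulder_angles(R_GU_new):
    """Equation (4): recover q1, q2, q3 from the upper-arm orientation R_GU^new.
    Paper notation R(2,3) corresponds to R_GU_new[1, 2], etc."""
    q1 = -np.arctan2(-R_GU_new[1, 2], R_GU_new[2, 2])
    q3 = np.arctan2(-R_GU_new[0, 1], R_GU_new[0, 0])
    q2 = np.arctan2(-R_GU_new[0, 2], R_GU_new[0, 0] / np.cos(q3))
    return q1, q2, q3

# Usage: R_GU_new = R_GI_U_new @ R_UIU.T, as in Equation (2).
```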

The orientation of the forearm relative to the upper arm can be calculated using Equation (5):

R_{UF}^{new} = (R_{GU}^{new})^{T} R_{GF}^{new}    (5)

The relationship between the orientation of the forearm's frame with respect to the upper arm's frame and the joint angles q4 and q5 is shown in Equation (6):

R_{UF}^{new} = R_{Y}(q_4)\, R_{Z}(q_5)    (6)

By substituting Equation (5) into Equation (6), the joint angles q4 and q5 corresponding to the elbow joint can be calculated as Equation (7):

\begin{cases}
q_4 = \operatorname{atan2}\!\left(-R_{UF}^{new}(1,3),\; R_{UF}^{new}(3,3)\right) \\
q_5 = \operatorname{atan2}\!\left(-R_{UF}^{new}(2,1),\; R_{UF}^{new}(2,2)\right)
\end{cases}    (7)

The orientation of the palm relative to the forearm can be calculated using Equation (8):

R_{FP}^{new} = (R_{GF}^{new})^{T} R_{GP}^{new}    (8)

The relationship between the orientation of the palm's frame with respect to the forearm's frame and the joint angles q6 and q7 is shown in Equation (9):

R_{FP}^{new} = R_{Y}(q_6)\, R_{Z}(q_7)    (9)

By substituting Equation (8) into Equation (9), the joint angles q6 and q7 corresponding to the wrist joint can be calculated as Equation (10):

\begin{cases}
q_6 = \operatorname{atan2}\!\left(-R_{FP}^{new}(1,3),\; R_{FP}^{new}(3,3)\right) \\
q_7 = \operatorname{atan2}\!\left(-R_{FP}^{new}(2,1),\; R_{FP}^{new}(2,2)\right)
\end{cases}    (10)

In this way, the right-arm joint angles are calculated; the left-arm joint angles can be obtained in the same manner.
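For completeness, the sketch below implements Equations (5)-(10) for the elbow and wrist joints, assuming the segment orientations from Equation (2) are available as NumPy rotation matrices.

```python
import numpy as np

def elbow_wrist_angles(R_GU_new, R_GF_new, R_GP_new):
    """Equations (5)-(10): recover the elbow (q4, q5) and wrist (q6, q7) angles
    from the upper-arm, forearm, and palm orientations in the global frame."""
    R_UF_new = R_GU_new.T @ R_GF_new          # Equation (5): forearm w.r.t. upper arm
    R_FP_new = R_GF_new.T @ R_GP_new          # Equation (8): palm w.r.t. forearm

    q4 = np.arctan2(-R_UF_new[0, 2], R_UF_new[2, 2])   # Equation (7)
    q5 = np.arctan2(-R_UF_new[1, 0], R_UF_new[1, 1])
    q6 = np.arctan2(-R_FP_new[0, 2], R_FP_new[2, 2])   # Equation (10)
    q7 = np.arctan2(-R_FP_new[1, 0], R_FP_new[1, 1])
    return q4, q5, q6, q7
```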

2.2 Joint angle constraint

For a complex multi-step task, a human demonstration can be divided into multiple sub-trajectories. For example, a bottle-cap screwing task can be broken down into reaching, grabbing, aligning, and screwing steps. Suppose a multi-step task is manually divided into N sub-tasks; in sub-task n, the angle constraint on the k-th joint of the human arms can be expressed as Equation (11):

a_k^n \le q_k^n \le b_k^n    (11)

where $0 < n \le N$, $0 < k \le 14$ (seven joints per arm), $a_k^n$ is the lower limit of the k-th joint angle in sub-task n, and $b_k^n$ is the corresponding upper limit.

When there are M groups of human demonstrations, the trajectories can be divided into M × N sub-trajectories, where N is the number of sub-tasks. Then, for sub-task n, the angle constraint on the k-th human arm joint can be expressed as Equation (12):

C_k^n:\; \begin{cases}
q_k^n \ge \min\!\left(a_{k,1}^{n}, a_{k,2}^{n}, \cdots, a_{k,M}^{n}\right) \\
q_k^n \le \max\!\left(b_{k,1}^{n}, b_{k,2}^{n}, \cdots, b_{k,M}^{n}\right)
\end{cases}    (12)

where $0 < m \le M$, $a_{k,m}^{n}$ is the lower limit of the k-th joint angle in sub-task n of demonstration m, and $b_{k,m}^{n}$ is the corresponding upper limit.
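As a concrete illustration of Equation (12), the sketch below extracts the per-sub-task joint angle bounds from M segmented demonstrations; the data layout (a list of demonstrations, each a list of N sub-task segments of shape samples × 14 joints) is an assumption for illustration only.

```python
import numpy as np

def joint_angle_constraints(demos, num_joints=14):
    """demos[m][n]: array of shape (T_mn, num_joints), the joint-angle samples of
    demonstration m within sub-task n. Returns (lower, upper) bounds per sub-task
    and joint, i.e. the constraint C_k^n of Equation (12)."""
    num_subtasks = len(demos[0])
    lower = np.full((num_subtasks, num_joints), np.inf)
    upper = np.full((num_subtasks, num_joints), -np.inf)
    for demo in demos:                      # m = 1..M
        for n, segment in enumerate(demo):  # n = 1..N
            lower[n] = np.minimum(lower[n], segment.min(axis=0))  # min over a_{k,m}^n
            upper[n] = np.maximum(upper[n], segment.max(axis=0))  # max over b_{k,m}^n
    return lower, upper
```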

2.3 Reinforcement learning method

The DADDPG method proposed in our previous study can plan the coordinated trajectories of dual-arm robots for multi-objective tasks (Liang et al., 2023). In this study, DADDPG is chosen as the reinforcement learning algorithm; it uses two agents to plan the coordinated trajectories of the left arm and the right arm simultaneously. Each agent contains four networks: the Actor $\mu_i(s_i \mid \theta_i^{\mu})$, the Critic $Q_i(s, a_i \mid \theta_i^{Q})$, the Target Actor $\mu_i'(s_i' \mid \theta_i^{\mu'})$, and the Target Critic $Q_i'(s', a_i' \mid \theta_i^{Q'})$, where $i = 1, 2$.

For agent i, the parameters of the Critic network are updated by minimizing the mean squared Bellman error (MSBE) loss $L_c$ via gradient descent, as in Equation (13) (Liang et al., 2023):

L_c = (y_i - q_i)^2 = \left( Q_i(s_j, a_{j,i} \mid \theta_i^{Q}) - \left( r_{j,i} + \gamma (1 - done)\, Q_i'\big(s_{j+1}, \mu_i'(s_{j+1,i} \mid \theta_i^{\mu'}) \mid \theta_i^{Q'}\big) \right) \right)^2    (13)

The parameters of the Actor network are updated by maximizing the expected cumulative return J of agent i via gradient ascent, as in Equation (14) (Liang et al., 2023):

\nabla_{\theta_i^{\mu}} J = \mathbb{E}_{s \sim \mu_1, \mu_2}\left[ \left. \nabla_{a_i} Q_i(s, a_i \mid \theta_i^{Q}) \right|_{a_i = \mu_i(s_i \mid \theta_i^{\mu})} \nabla_{\theta_i^{\mu}} \mu_i(s_i \mid \theta_i^{\mu}) \right]    (14)
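The following PyTorch sketch shows how one agent's Critic and Actor could be updated according to Equations (13) and (14). It is a minimal single-agent illustration under stated assumptions (network sizes, learning rates, discount factor, and the replay-batch format are all assumed), not the full dual-agent implementation of Liang et al. (2023).

```python
import torch
import torch.nn as nn

GAMMA = 0.98  # discount factor (assumed value)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

state_dim, action_dim = 20, 7  # illustrative dimensions
actor, critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
target_actor, target_critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    """batch: dict of tensors s, a, r, s_next, done (r and done as column tensors)."""
    s, a, r = batch["s"], batch["a"], batch["r"]
    s_next, done = batch["s_next"], batch["done"]

    # Critic update, Equation (13): minimize the MSBE loss (y - q)^2.
    with torch.no_grad():
        a_next = target_actor(s_next)
        y = r + GAMMA * (1.0 - done) * target_critic(torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update, Equation (14): gradient ascent on J, i.e. minimize -Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In the full dual-agent setting, each arm has its own copy of these networks and updates, and the target networks are typically soft-updated after each training step.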
