CN102723112B - Q-learning system based on a memristive crossbar array

Info

Publication number: CN102723112B
Application number: CN201210188573.2A
Authority: CN
Inventors: 王丽丹; 何朋飞; 段书凯; 钟宇平
Original assignee: Southwest University
Current assignee: Southwest University
Priority date: 2012-06-08
Filing date: 2012-06-08
Publication date: 2015-06-17
Anticipated expiration: 2032-06-08
Also published as: CN102723112A

Abstract

The invention discloses a Q-learning system based on a memristive crossbar array. The system comprises a memristive crossbar array and is characterized in that it further comprises: a read/write selection switch, which controls the read and write operations of the memristive crossbar array; a state selection switch, through which the row line corresponding to the current environmental state s_t, as detected by the state detection module, is selected; a column selection switch, which selects the column line corresponding to the action a_t when a Q value, that is, the resistance of a particular memristor in the crossbar array, needs to be updated; a delay unit, which delays the voltage of the selected column line by one time step; and a state detection module, which detects the current environmental state and saves the previous environmental state. The invention successfully applies a new circuit element, the memristor, to reinforcement learning, solves the problem that reinforcement learning requires a large amount of storage space, and provides a new approach for future research on reinforcement learning.

Description

A Q-learning system based on a memristive crossbar array

Technical field

The invention relates to a storage matrix and an intelligent learning algorithm.

Background

Reinforcement learning is an advanced intelligent learning algorithm that has been widely applied to intelligent robots in recent years and has become a research hotspot. In 1954, Minsky proposed the SNARC computational model of reinforcement learning. Sutton then proposed the AHC algorithm and the TD learning algorithm in his doctoral dissertation. Later, building on the TD learning algorithm, Watkins et al. proposed the Q-learning algorithm, now a classic reinforcement learning algorithm and an important milestone in the development of the field. After Q-learning was proposed, many researchers applied it to mobile robot navigation, robot soccer systems, and intelligent I/O scheduling. Reinforcement learning has its own limitations, however: when the problem is complex, it requires a large amount of state-action storage space. In 1971, Chua proposed a fourth circuit element, the memristor, on the basis of the completeness theory of circuits (L. O. Chua. Memristor - the missing circuit element. IEEE Trans. Circuit Theory. 1971, 18(5): 507-519.).

In 2008, HP Labs fabricated the first physical memristor, and the memristor has attracted widespread attention ever since. Memristors are nanoscale and nonlinear; their resistance changes with the input excitation, and the change is non-volatile, so memristors are well suited to large-scale memory design. The memristor crossbar array is one kind of memristor memory; it is simple in structure and convenient to design. Hu Xiaofang et al. used a memristor crossbar array to store images (Hu Xiaofang, Duan Shukai, Wang Lidan, et al. Memristor crossbar array and its application in image processing. 中国科学F辑：信息科学 (Science China Series F: Information Sciences). 2011, 41(4): 500-512.). Because memristors are nanoscale, a memristor crossbar array can be built as a large-scale memory, which addresses the large state-action storage space that reinforcement learning needs for complex problems. Implementing Q-learning with a memristive crossbar array is therefore a good choice.

The physical model of the HP memristor is shown in Figure 1. The memristor consists of a doped region and an undoped region, where w and D denote the width of the doped region and the total width of the memristor, respectively. Its mathematical model is as follows:

$M(t) = R_{ON} \frac{w(t)}{D} + R_{OFF} \left(1 - \frac{w(t)}{D}\right)$

where R_OFF and R_ON denote the resistance of the memristor when w equals 0 and D, respectively.

$\frac{dw(t)}{dt} = \frac{\mu_v R_{ON}}{D} i(t)$

Here, μ_v denotes the average ion mobility, in cm² s⁻¹ V⁻¹.
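The two model equations above can be explored numerically. The following is a minimal sketch, assuming a forward-Euler integration and a constant applied voltage; the function names and all parameter values are illustrative assumptions and do not come from the patent. SI units are used, so the mobility is given in m² s⁻¹ V⁻¹ rather than cm² s⁻¹ V⁻¹.

```python
def memristance(w, D, r_on, r_off):
    """M = R_ON * (w / D) + R_OFF * (1 - w / D), in ohms."""
    x = w / D
    return r_on * x + r_off * (1.0 - x)


def step_width(w, i, dt, D, r_on, mu_v):
    """One forward-Euler step of dw/dt = mu_v * R_ON / D * i, clamped to [0, D]."""
    w_next = w + dt * mu_v * r_on / D * i
    return min(max(w_next, 0.0), D)


# Illustrative values only (assumptions, not taken from the patent), in SI units.
D = 10e-9        # total width, m
R_ON = 100.0     # resistance at w = D, ohms
R_OFF = 16e3     # resistance at w = 0, ohms
MU_V = 1e-14     # average ion mobility, m^2 s^-1 V^-1
V_APPLIED = 1.0  # constant positive voltage across the device, V
DT = 1e-4        # integration time step, s

w = 0.1 * D
m_start = memristance(w, D, R_ON, R_OFF)
for _ in range(1000):
    current = V_APPLIED / memristance(w, D, R_ON, R_OFF)  # Ohm's law
    w = step_width(w, current, DT, D, R_ON, MU_V)
m_end = memristance(w, D, R_ON, R_OFF)

print(f"resistance: {m_start:.1f} ohm -> {m_end:.1f} ohm")
```

Under these assumed values a positive voltage widens the doped region, so the resistance falls over the run, and it stays at its new value once the current stops because w no longer changes.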

$T_w = \frac{\Phi_D}{V_A R_{OFF}^2} \left[ (R(w_0))^2 - (R(w))^2 \right]$

where

$\Phi_D = \frac{(\beta D)^2}{2 \mu_v (\beta - 1)}$

Here, T_w is the width of the voltage pulse applied across the memristor, V_A is the amplitude of the pulse, R(w₀) is the initial resistance of the memristor, R(w) is the resistance that the memristor can reach, and β = R_OFF/R_ON.

When R(w₀) is less than or equal to R(w), we obtain

$R(w) = \sqrt{(R(w_0))^2 - \frac{V_A T_w R_{OFF}^2}{\Phi_D}}, \quad R_{ON} \leq R(w) \leq R_{OFF}$

Therefore, for a fixed T_w, the resistance of the memristor changes as V_A changes, and this change is non-volatile.
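The pulse formulas can be checked against each other with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the device parameters are illustrative SI values that do not appear in the patent, and the result is clamped to the range R_ON to R_OFF as the constraint on R(w) requires.

```python
import math


def phi_d(beta, D, mu_v):
    """Phi_D = (beta * D)^2 / (2 * mu_v * (beta - 1))."""
    return (beta * D) ** 2 / (2.0 * mu_v * (beta - 1.0))


def resistance_after_pulse(r0, v_a, t_w, r_on, r_off, D, mu_v):
    """R(w) = sqrt(R(w0)^2 - V_A * T_w * R_OFF^2 / Phi_D), clamped to [R_ON, R_OFF]."""
    phi = phi_d(r_off / r_on, D, mu_v)
    r_squared = r0 ** 2 - v_a * t_w * r_off ** 2 / phi
    r = math.sqrt(max(r_squared, 0.0))
    return min(max(r, r_on), r_off)


def pulse_width(r0, r_target, v_a, r_on, r_off, D, mu_v):
    """T_w = Phi_D / (V_A * R_OFF^2) * (R(w0)^2 - R(w)^2)."""
    phi = phi_d(r_off / r_on, D, mu_v)
    return phi / (v_a * r_off ** 2) * (r0 ** 2 - r_target ** 2)


# Illustrative device parameters (assumptions, not from the patent), SI units.
params = dict(r_on=100.0, r_off=16e3, D=10e-9, mu_v=1e-14)

t_w = pulse_width(10e3, 9e3, v_a=1.0, **params)
r_new = resistance_after_pulse(10e3, v_a=1.0, t_w=t_w, **params)
print(f"pulse width {t_w:.4f} s moves 10000 ohm to {r_new:.1f} ohm")
```

With T_w held fixed, calling `resistance_after_pulse` with a larger positive `v_a` gives a lower resistance, which is the dependence on V_A described above.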

The memristor storage circuits are shown in Figures 2 and 3: Figure 2 shows the circuit for writing data, and Figure 3 shows the circuit for reading data. When data is written, a positive voltage pulse is applied to the memristor and R(w) decreases, so the memristor memorizes the applied voltage pulse. When data is read, different memristor resistances produce different values of V_out. V_out and the resistance of the memristor are in correspondence, so V_out correctly reflects the resistance of the memristor, that is, the value it stores.

The resistance of a memristor changes with the input excitation, and the change is non-volatile; the memristor therefore has very good storage properties. The memristor is also nanoscale, which makes it well suited to large-scale memory. The memristive crossbar array is one example of memristors used as memory.

The structure of the memristive crossbar array is shown in Figure 4, and the circuit represented by each circular area is shown in Figure 5. In Figure 5, the read/write switch controls whether data is written or read. When data is written to a memristor, the switch is connected to the left contact and the corresponding row line receives the write voltage V_in. When data is read from a memristor, the switch is connected to the right contact, the corresponding row line receives the read voltage V_in, and the corresponding column line outputs the voltage V_out.

Summary of the invention

The purpose of the present invention is to provide a Q-learning system based on a memristive crossbar array that implements the Q-learning algorithm.

To achieve this purpose, the following technical solution is adopted: a Q-learning system based on a memristive crossbar array, comprising a memristive crossbar array, characterized in that the system further comprises

Read/write selection switch: controls the read and write operations of the memristive crossbar array;

State selection switch: the state detection module detects the current environmental state s_t, and the corresponding row line is selected through the state selection switch;

Column selection switch: when a Q value, that is, the resistance of a particular memristor in the memristive crossbar array, needs to be updated, the column selection switch selects the column line corresponding to the action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environmental state and saves the previous environmental state. When an action needs to be selected according to the state, the state detection module detects the current environmental state and provides it to the state selection switch and the state control switch. After the action is executed, the state selection switch detects the environmental state at that time, saves the previous environmental state, and provides the current environmental state to the state selection switch and the state control switch. When the Q value is updated, the state detection module outputs the environmental state of the previous time step and provides it to the state selection switch to select the corresponding row line.

The invention successfully applies a new circuit element, the memristor, to reinforcement learning, solves the problem that reinforcement learning requires a large amount of storage space, and provides a new approach for future research on reinforcement learning.

Brief description of the drawings

Figure 1 is a structural diagram of the physical model of the HP memristor;

Figure 2 is the circuit diagram for writing data to a memristor;

Figure 3 is the circuit diagram for reading data from a memristor;

Figure 4 is a schematic structural diagram of the memristive crossbar array;

Figure 5 is the circuit diagram of a single memristor cell in the memristive crossbar array;

Figure 6 is a schematic structural diagram of the present invention;

Figure 7 is a schematic diagram of the robot and the obstacles in an embodiment of the present invention;

Figure 8 shows the simulation results of this embodiment.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawings and a specific embodiment.

The Q-learning algorithm is a classic reinforcement learning algorithm. Its simplest form is one-step Q-learning, whose Q-value update formula is

$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$

where α is the learning rate and γ is the discount rate. r_t+1 is the reward received from the environment for executing action a_t in state s_t. Q(s_t, a_t) is the state-action value function, that is, the value obtained by executing action a_t in state s_t.
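In software, the one-step update above amounts to a single table write. The following is a minimal sketch that assumes the Q values are held in a nested list indexed as `q[state][action]`; the function name and the example numbers are illustrative.

```python
def q_update(q, s, a, reward, s_next, alpha, gamma):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Two states, two actions; the values are arbitrary illustrations.
q_table = [[0.0, 0.0],
           [1.0, 2.0]]
new_value = q_update(q_table, s=0, a=1, reward=1.0, s_next=1, alpha=0.5, gamma=0.9)
print(new_value)
```

The table needs one entry per state-action pair, which is the storage cost that grows with problem complexity.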

The limitation of reinforcement learning is that it requires a large amount of storage space. The memristor, a new circuit element, is nanoscale and has storage properties, and a memristor-based crossbar array offers a large amount of storage space together with parallel processing capability, which makes it well suited to solving this problem.

In the Q-learning algorithm, each executed action yields a reward from the environment, and the maximum Q value among the state-action pairs of the current state, together with the reward obtained, is used to update the Q value of the previous state and the action that was selected. When Q-learning is implemented with a memristive crossbar array, the output voltage of each memristor represents the Q value of the corresponding state-action pair. By the storage principle of the memristor, its resistance does not change after power-off, so it is only necessary to apply across the memristor the write voltage

$V_i = \alpha \left( r + \gamma \max_{a} V(s_{t+1}, a) - V(s_t, a_t) \right)$

to update the resistance of the memristor corresponding to s_t and a_t, thereby changing the output voltage V(s_t, a_t) of that memristor, that is, the value of Q(s_t, a_t).

The process by which the memristive crossbar array implements Q-learning is shown in Figure 6. In the memristive crossbar array, each row line corresponds to a state s and each column line corresponds to an action a. The implementation proceeds as follows:

(1) The read/write selection switch is set to read. The state detection module in the robot detects the current environmental state s_t, and the corresponding row line is selected through the state selection switch;

(2) The column selection switch selects all columns, and the state control switch connects the column lines to the random selection module. The random selection module selects at random according to the voltage of each column line: the higher the voltage of a column line, the more likely it is to be selected. One column line is thus selected at random, the action a_t to execute is obtained from the selected column line, and the robot executes a_t. Alternatively, in certain preset states, the state control switch connects the column lines to the comparator module, which selects the column line with the highest voltage, and the connection selection switch then connects that column line to the delay unit. The state selection switch, random selection module, comparator, and connection selection module together implement the ε-greedy strategy of reinforcement learning.

(3) The selected column line is connected to the delay unit, which delays the voltage of the column line by one time step;

(4) The state detection module detects the current environmental state; the robot has now entered state s_t+1. The state control switch connects the column lines to the comparator, which selects the column line with the highest voltage, and the connection selection module connects that column line to the Q-value update module. The Q-value update module combines this voltage, the output voltage of the delay unit, and the reward obtained from the environment according to formula (7), the write-voltage expression given above, to obtain the write voltage V_i.

(5) The read/write selection switch is set to write, and the write voltage V_i is applied across the memristor for a time T_w.

(6) The above process is repeated until the set number of iterations is reached.
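Steps (1) to (5) can be mirrored in a behavioural sketch. It rests on several simplifying assumptions that are not in the patent: the crossbar is modelled as a nested list of read-out voltages `crossbar[row][column]`, a write changes the stored read-out voltage by exactly V_i (the memristor physics is not modelled), non-positive voltages get zero selection weight, and `env_step` is a hypothetical callback standing in for the robot and its environment.

```python
import random


def random_selection(voltages, rng):
    """Random selection module: pick a column with probability proportional to its voltage."""
    weights = [max(v, 0.0) for v in voltages]
    if sum(weights) <= 0.0:
        return rng.randrange(len(voltages))  # no positive voltage: pick uniformly
    return rng.choices(range(len(voltages)), weights=weights)[0]


def comparator(voltages):
    """Comparator module: index of the column line with the highest voltage."""
    return max(range(len(voltages)), key=lambda col: voltages[col])


def learning_step(crossbar, s_t, env_step, alpha, gamma, rng, explore=True):
    row = crossbar[s_t]                                   # (1) read the row for s_t
    a_t = random_selection(row, rng) if explore else comparator(row)  # (2) choose a_t
    delayed_voltage = row[a_t]                            # (3) delay unit holds V(s_t, a_t)
    s_next, reward = env_step(s_t, a_t)                   # robot acts, enters s_t+1
    next_row = crossbar[s_next]
    v_max = next_row[comparator(next_row)]                # (4) highest voltage in new row
    v_i = alpha * (reward + gamma * v_max - delayed_voltage)
    crossbar[s_t][a_t] += v_i                             # (5) idealized write of V_i
    return s_next, a_t, v_i


# Toy usage: two states, two actions, and an environment that always
# moves to state 1 with reward 1.0 (purely illustrative).
crossbar = [[0.0, 1.0],
            [2.0, 0.5]]
s_next, a_t, v_i = learning_step(
    crossbar, 0, lambda s, a: (1, 1.0),
    alpha=0.5, gamma=0.9, rng=random.Random(0), explore=False,
)
print(s_next, a_t, round(v_i, 3), crossbar[0])
```

Step (6) corresponds to calling `learning_step` in a loop, feeding each returned `s_next` back in as the next `s_t`.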

The robot obstacle-avoidance experiment aims to have the robot move without collisions in an environment containing obstacles. In this experiment, Q-learning based on the memristive crossbar array is used for the robot's learning, so that the robot eventually moves without collisions. The experiment uses the MobotSim software.

In Figure 7, the circular area represents the robot. The robot carries three sensors, labelled 0 to 2, each of which can detect a maximum distance of 1.5 meters. The black areas represent obstacles.

In this experiment, the distance to an obstacle detected by each sensor is divided into three segments, as follows:

where dist0 to dist2 denote the distance to an obstacle detected by each sensor. Combining s0 to s2 gives 27 cases, which are taken as the 27 states of the robot's environment and stored in a three-dimensional array state[s0, s1, s2]. On this experimental platform, a sensor returns -1 both when the robot collides with an obstacle and when the sensor cannot detect an obstacle. The state in which the robot collides with an obstacle is therefore classed as state 0, that is, the case where s0 to s2 are all 0.
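One way to connect the 27 states to crossbar row lines is to flatten the three segment codes into a single index. The sketch below assumes a row-major ordering, which the patent does not specify, and takes the segment codes s0 to s2 as given rather than deriving them from distances; `row_index` is a hypothetical helper name.

```python
LEVELS = 3  # each sensor reading is coded as segment 0, 1, or 2


def row_index(s0, s1, s2):
    """Map segment codes (s0, s1, s2) to a row number in 0..26 (row-major order, assumed)."""
    for code in (s0, s1, s2):
        if code not in range(LEVELS):
            raise ValueError(f"segment code must be 0, 1, or 2, got {code!r}")
    return (s0 * LEVELS + s1) * LEVELS + s2


# Three-dimensional array state[s0][s1][s2] holding the row number of each state.
state = [[[row_index(s0, s1, s2) for s2 in range(LEVELS)]
          for s1 in range(LEVELS)]
         for s0 in range(LEVELS)]

print(state[0][0][0], state[2][2][2])
```

With this ordering the collision state, where s0 to s2 are all 0, lands on row 0, matching its description as state 0.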

The reward function r is defined as:

In this experiment, the robot performs three actions: move forward, turn left, and turn right. When the robot is in state[2, 2, 2], the action is chosen at random in proportion to the Q values; in every other state, the action with the largest Q value is executed.
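The action rule for this experiment can be written as a small function. This is a sketch under assumptions: the order of the actions in `ACTIONS`, the zero weight given to non-positive Q values, and the uniform fallback when no Q value is positive are illustrative choices that the patent does not specify.

```python
import random

ACTIONS = ("forward", "turn_left", "turn_right")  # column order is an assumption
EXPLORE_STATE = (2, 2, 2)


def choose_action(state, q_row, rng):
    """Return a column index: Q-proportional random in state (2, 2, 2), greedy otherwise."""
    if state == EXPLORE_STATE:
        weights = [max(q, 0.0) for q in q_row]
        if sum(weights) > 0.0:
            return rng.choices(range(len(q_row)), weights=weights)[0]
        return rng.randrange(len(q_row))  # no positive Q value: pick uniformly
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)
print(ACTIONS[choose_action((0, 1, 2), [0.1, 0.7, 0.2], rng)])
```

In hardware terms, the first branch corresponds to routing the column lines to the random selection module and the second to routing them to the comparator.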

With α = 0.8 and γ = 0.98, the number of simulation runs is set to 500, with 2000 steps per run. The simulation results are shown in Figure 8.

Claims

1. A Q-learning system based on a memristive crossbar array, comprising a memristive crossbar array, characterized in that the system further comprises

Read/write selection switch: controls the read and write operations of the memristive crossbar array;

State selection switch: the state detection module detects the current environmental state s_t, and the corresponding row line is selected through the state selection switch;

Column selection switch: when a Q value, that is, the resistance of a particular memristor in the memristive crossbar array, needs to be updated, the column selection switch selects the column line corresponding to the action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environmental state and saves the previous environmental state. When an action needs to be selected according to the state, the state detection module detects the current environmental state and provides it to the state selection switch and the state control switch. After the action is executed, the state selection switch detects the environmental state at that time, saves the previous environmental state, and provides the current environmental state to the state selection switch and the state control switch. When the Q value is updated, the state detection module outputs the environmental state of the previous time step and provides it to the state selection switch to select the corresponding row line, and the write voltage

$V_i = \alpha \left( r + \gamma \max_{a} V(s_{t+1}, a) - V(s_t, a_t) \right)$

is applied across the memristor to update the resistance of the memristor corresponding to s_t and a_t, thereby changing the output voltage V(s_t, a_t) of the memristor, that is, the value of Q(s_t, a_t); here the value of V(s_t, a_t) is equal to the value of Q(s_t, a_t);

where α is the learning rate, r is the reward function, and γ is the discount rate.