CN102723112B - Q learning system based on memristor intersection array - Google Patents

Publication number
CN102723112B
CN102723112B CN201210188573.2A CN201210188573A CN102723112B CN 102723112 B CN102723112 B CN 102723112B CN 201210188573 A CN201210188573 A CN 201210188573A CN 102723112 B CN102723112 B CN 102723112B
Authority
CN
China
Prior art keywords
state
memristor
selection switch
value
cross array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210188573.2A
Other languages
Chinese (zh)
Other versions
CN102723112A (en)
Inventor
王丽丹
何朋飞
段书凯
钟宇平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University
Original Assignee
Southwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University filed Critical Southwest University
Priority to CN201210188573.2A priority Critical patent/CN102723112B/en
Publication of CN102723112A publication Critical patent/CN102723112A/en
Application granted granted Critical
Publication of CN102723112B publication Critical patent/CN102723112B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Manipulator (AREA)

Abstract

The invention discloses a Q-learning system based on a memristive crossbar array. In addition to the memristive crossbar array, the system comprises: a read/write selection switch, which controls the read and write operations of the array; a state selection switch, which selects the row line corresponding to the current environment state s_t detected by the state detection module; a column selection switch, which selects the column line corresponding to the action a_t when a Q value, that is, the resistance of one memristor in the array, needs to be updated; a delay unit, which delays the voltage of the selected column line by one time step; and a state detection module, which detects the current environment state and stores the previous one. The invention applies a new circuit element, the memristor, to reinforcement learning, addresses the large storage requirement of reinforcement learning, and offers a new approach for future reinforcement learning research.

Description

A Q-learning system based on a memristive crossbar array

Technical Field

The invention relates to a storage matrix and an intelligent learning algorithm.

Background Art

Reinforcement learning is an advanced intelligent learning algorithm that has been widely applied to intelligent robotics in recent years and has become a research hotspot. In 1954, Minsky proposed the SNARCs reinforcement learning computing model. Sutton then proposed the AHC algorithm and the TD learning algorithm in his doctoral dissertation. Later, building on TD learning, Watkins et al. proposed Q-learning, now a classic reinforcement learning algorithm and an important milestone in the development of the field. Since then, many researchers have applied Q-learning to mobile robot navigation, robot soccer systems, and intelligent I/O scheduling. Reinforcement learning has its own limitation, however: when the problem is complex, it requires a large amount of state-action storage. In 1971, Chua proposed the fourth circuit element, the memristor, on the basis of circuit completeness theory (L. O. Chua. Memristor - the missing circuit element. IEEE Trans. Circuit Theory. 1971, 18(5): 507-519.).

In 2008, HP Labs fabricated the first physical memristor, and memristors have attracted wide attention since. A memristor is a nanoscale, nonlinear device whose resistance changes with the input excitation, and the change is non-volatile, so memristors are well suited to the design of large-scale memories. The memristor crossbar array is one kind of memristor memory; its structure is simple and it is easy to design. Hu Xiaofang et al. used a memristor crossbar array to store images (Hu Xiaofang, Duan Shukai, Wang Lidan, et al. Memristor cross array and its application in image processing. Chinese Science Series F: Information Science. 2011, 41(4): 500-512.). Because memristors are nanoscale, a memristor crossbar array can be built into a large-scale memory, which addresses the large state-action storage that reinforcement learning needs for complex problems. Using a memristive crossbar array to implement Q-learning is therefore a good choice.

The physical model of the HP memristor is shown in Figure 1. The memristor consists of a doped region and an undoped region, where w and D denote the width of the doped region and the total width of the memristor, respectively. Its mathematical model is:

$$M(t) = R_{ON}\,\frac{w(t)}{D} + R_{OFF}\left(1 - \frac{w(t)}{D}\right)$$

where R_OFF and R_ON are the resistances of the memristor when w equals 0 and D, respectively.

$$\frac{dw(t)}{dt} = \mu_v\,\frac{R_{ON}}{D}\,i(t)$$

Here μ_v denotes the average ion mobility, in cm^2 s^-1 V^-1.
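As a rough illustration of the model above, the following Python sketch evaluates M(t) and advances w(t) with a forward-Euler step. The class name `HPMemristor`, the default parameter values, the use of SI units, and the clamping of w to the interval [0, D] are assumptions made for this example; none of them come from the patent.

```python
from dataclasses import dataclass


@dataclass
class HPMemristor:
    """Linear ion-drift memristor model (illustrative parameter values)."""

    r_on: float = 100.0   # ohms, resistance when w == D (assumed value)
    r_off: float = 16e3   # ohms, resistance when w == 0 (assumed value)
    d: float = 10e-9      # metres, total device width D (assumed value)
    mu_v: float = 1e-14   # m^2 s^-1 V^-1, average ion mobility (assumed value)
    w: float = 1e-9       # metres, current width of the doped region

    def memristance(self) -> float:
        """Return M = R_ON * w/D + R_OFF * (1 - w/D)."""
        x = self.w / self.d
        return self.r_on * x + self.r_off * (1.0 - x)

    def step(self, voltage: float, dt: float) -> float:
        """Advance w by one Euler step of dw/dt = mu_v * R_ON / D * i(t)."""
        current = voltage / self.memristance()
        self.w += self.mu_v * self.r_on / self.d * current * dt
        # ASSUMPTION: keep the doped-region boundary inside the device.
        self.w = min(max(self.w, 0.0), self.d)
        return self.memristance()
```

A positive voltage widens the doped region and lowers the memristance; a negative voltage does the opposite.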

$$T_w = \frac{\Phi_D}{V_A R_{OFF}^2}\left[\left(R(w_0)\right)^2 - \left(R(w)\right)^2\right]$$

where

$$\Phi_D = \frac{(\beta D)^2}{2\mu_v(\beta - 1)}$$

Here T_w is the width of the voltage pulse applied across the memristor, V_A is the pulse amplitude, R(w_0) is the initial resistance of the memristor, R(w) is the resistance the memristor reaches, and β = R_OFF / R_ON.

When R(w_0) is less than or equal to R(w), we obtain

$$R(w) = \sqrt{\left(R(w_0)\right)^2 - \frac{V_A T_w R_{OFF}^2}{\Phi_D}},\qquad R_{ON} \le R(w) \le R_{OFF}$$

Therefore, for a fixed T_w, the resistance of the memristor changes as V_A changes, and the change is non-volatile.
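The pulse relations above can be checked numerically with a short Python sketch. It computes Φ_D, the resistance reached after a pulse, and the pulse width needed for a target resistance. The function names, the argument order, and the clamping of the result to [R_ON, R_OFF] are choices made for this example, and any parameter values passed in are illustrative rather than taken from the patent.

```python
import math


def phi_d(r_on: float, r_off: float, d: float, mu_v: float) -> float:
    """Phi_D = (beta * D)^2 / (2 * mu_v * (beta - 1)), with beta = R_OFF / R_ON."""
    beta = r_off / r_on
    return (beta * d) ** 2 / (2.0 * mu_v * (beta - 1.0))


def resistance_after_pulse(r0: float, v_a: float, t_w: float,
                           r_on: float, r_off: float,
                           d: float, mu_v: float) -> float:
    """Resistance reached from r0 after a pulse of amplitude v_a and width t_w."""
    radicand = r0 ** 2 - v_a * t_w * r_off ** 2 / phi_d(r_on, r_off, d, mu_v)
    r = math.sqrt(max(radicand, 0.0))
    # The formula only holds for R_ON <= R(w) <= R_OFF, so saturate there.
    return min(max(r, r_on), r_off)


def pulse_width_for_target(r0: float, r_target: float, v_a: float,
                           r_on: float, r_off: float,
                           d: float, mu_v: float) -> float:
    """Pulse width T_w that moves the resistance from r0 to r_target."""
    scale = phi_d(r_on, r_off, d, mu_v) / (v_a * r_off ** 2)
    return scale * (r0 ** 2 - r_target ** 2)
```

The two functions are inverses of each other as long as the result stays strictly inside the [R_ON, R_OFF] range.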

The memristor storage circuits are shown in Figures 2 and 3: Figure 2 shows the circuit for writing data and Figure 3 the circuit for reading data. To write data, a positive voltage pulse is applied to the memristor; R(w) decreases, so the memristor remembers the applied pulse. When data is read, different memristor resistances yield different V_out values. V_out corresponds to the resistance of the memristor and therefore correctly reflects that resistance, that is, the value the memristor stores.

The resistance of a memristor changes with the input excitation, and the change is non-volatile, so memristors have very good storage properties. Memristors are also nanoscale, which makes them well suited to large-scale memories. The memristive crossbar array is one example of memristors used as memory.

The structure of the memristive crossbar array is shown in Figure 4, and the circuit represented by each circular region is shown in Figure 5. In Figure 5, the read/write switch controls whether data is written or read. To write data to a memristor, the switch connects to the left contact and the corresponding row line receives the write voltage V_in. To read data from a memristor, the switch connects to the right contact, the corresponding row line receives the read voltage V_in, and the corresponding column line outputs the voltage V_out.
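The addressing scheme just described can be mimicked in software. The sketch below models the array as a grid of resistances with a read/write switch. The class `MemristiveCrossbar`, its method names, and in particular the voltage-divider read-out V_out = V_in * r_ref / (r_ref + M) are assumptions made for illustration; the actual read circuit is the one in Figure 3, which this text does not specify.

```python
from typing import List


class MemristiveCrossbar:
    """Grid of memristor resistances addressed by row and column lines."""

    def __init__(self, n_rows: int, n_cols: int, r_init: float) -> None:
        self.resistance: List[List[float]] = [
            [r_init] * n_cols for _ in range(n_rows)
        ]
        self.mode = "read"  # position of the read/write switch

    def set_mode(self, mode: str) -> None:
        """Move the read/write switch to 'read' or 'write'."""
        if mode not in ("read", "write"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode

    def read_row(self, row: int, v_in: float, r_ref: float) -> List[float]:
        """Drive one row line with v_in and return V_out for every column.

        ASSUMPTION: the read-out is a plain voltage divider against r_ref,
        so a lower memristor resistance gives a higher output voltage.
        """
        if self.mode != "read":
            raise RuntimeError("switch is not in the read position")
        return [v_in * r_ref / (r_ref + m) for m in self.resistance[row]]

    def write_cell(self, row: int, col: int, new_resistance: float) -> None:
        """Record the resistance reached after a write pulse on (row, col)."""
        if self.mode != "write":
            raise RuntimeError("switch is not in the write position")
        self.resistance[row][col] = new_resistance
```

Keeping the mode check inside both methods mirrors the single switch of Figure 5: a cell cannot be read and written at the same time.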

Summary of the Invention

The object of the invention is to provide a Q-learning system based on a memristive crossbar array that implements the Q-learning algorithm.

To achieve this object, the following technical solution is adopted: a Q-learning system based on a memristive crossbar array, comprising a memristive crossbar array, characterized in that the system further comprises:

Read/write selection switch: controls the read and write operations of the memristive crossbar array;

State selection switch: the state detection module detects the current environment state s_t, and the state selection switch selects the corresponding row line;

Column selection switch: when a Q value, that is, the resistance of one memristor in the array, needs to be updated, the column selection switch selects the column line corresponding to the action a_t;

Delay unit: delays the voltage of the selected column line by one time step;

State detection module: detects the current environment state and stores the previous one. When an action must be selected according to the state, the state detection module detects the current environment state and supplies it to the state selection switch and the state control switch. After the action is executed, the state selection switch detects the environment state at that moment, stores the previous environment state, and supplies the new state to the state selection switch and the state control switch. When a Q value is updated, the state detection module outputs the environment state of the previous time step and supplies it to the state selection switch, which selects the corresponding row line.
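The bookkeeping role of the state detection module, which holds both the current and the previous environment state, can be sketched as a small Python class. The class and method names are hypothetical and serve only to illustrate which state drives which row selection.

```python
from typing import Hashable, Optional


class StateDetector:
    """Keeps the current environment state and the one before it."""

    def __init__(self) -> None:
        self.current: Optional[Hashable] = None
        self.previous: Optional[Hashable] = None

    def observe(self, state: Hashable) -> Hashable:
        """Record a newly detected state, shifting the old one to 'previous'."""
        self.previous = self.current
        self.current = state
        return state

    def row_for_action_selection(self) -> Optional[Hashable]:
        """State whose row line is selected when choosing an action."""
        return self.current

    def row_for_update(self) -> Hashable:
        """State whose row line is selected when a Q value is written."""
        if self.previous is None:
            raise RuntimeError("no previous state has been recorded yet")
        return self.previous
```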

The invention applies a new circuit element, the memristor, to reinforcement learning, addresses the large storage requirement of reinforcement learning, and offers a new approach for future reinforcement learning research.

Brief Description of the Drawings

Figure 1 is a structural diagram of the physical model of the HP memristor;

Figure 2 is the circuit diagram for writing data to the memristor;

Figure 3 is the circuit diagram for reading data from the memristor;

Figure 4 is a schematic structural diagram of the memristive crossbar array;

Figure 5 is the circuit diagram of a single memristor cell in the memristive crossbar array;

Figure 6 is a schematic structural diagram of the invention;

Figure 7 is a schematic diagram of the robot and the obstacles in the embodiment of the invention;

Figure 8 shows the simulation results of this embodiment.

Detailed Description of the Embodiments

The invention is further described below with reference to the accompanying drawings and a specific embodiment.

Q-learning is a classic reinforcement learning algorithm. Its simplest form is one-step Q-learning, whose Q-value update formula is

$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right)$$

where α is the learning rate and γ is the discount rate. r_{t+1} is the reward received from the environment for executing action a_t in state s_t. Q(s_t, a_t) is the state-action value function, that is, the value obtained by executing action a_t in state s_t.
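For readers who prefer code, a minimal tabular version of this update rule might look as follows. The function name `q_update` and the list-of-lists table layout are choices made for this sketch; it is the textbook software update, not the memristive circuit.

```python
from typing import List


def q_update(q: List[List[float]], s: int, a: int, reward: float,
             s_next: int, alpha: float, gamma: float) -> float:
    """Apply one one-step Q-learning update in place and return the new value.

    q[s][a] holds Q(s, a); rows are states and columns are actions.
    """
    td_target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]
```

For example, with `q = [[0.0, 0.0], [1.0, 2.0]]`, calling `q_update(q, 0, 1, 1.0, 1, 0.5, 0.9)` moves `q[0][1]` halfway from 0 toward the target 1.0 + 0.9 * 2.0.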

The limitation of reinforcement learning is that it needs a large amount of storage. The memristor, a new circuit element, is nanoscale and has storage properties, and a memristor-based crossbar array offers a large storage capacity and parallel processing capability, which makes it well suited to this problem.

In the Q-learning algorithm, each executed action yields a reward from the environment. The maximum Q value among the state-action pairs of the current state, together with that reward, is used to update the Q value of the previous state and the chosen action. When Q-learning is implemented with a memristive crossbar array, the output voltage of each memristor represents the Q value of the corresponding state-action pair. By the storage principle of the memristor, the resistance does not change after power-off, so it suffices to apply the write voltage

$$V_i = \alpha\left(r + \gamma \max_a V(s_{t+1}, a) - V(s_t, a_t)\right)$$

across the memristor to update the resistance of the memristor corresponding to s_t and a_t, thereby changing its output voltage V(s_t, a_t), that is, the value Q(s_t, a_t).
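A worked example may make the write voltage concrete. The values α = 0.8 and γ = 0.98 are the ones used later in the embodiment; the reward r = 0.2 and the two voltages V(s_t, a_t) = 0.5 and max_a V(s_{t+1}, a) = 1.0 are assumed purely for illustration.

```latex
\begin{aligned}
V_i &= \alpha\bigl(r + \gamma \max_a V(s_{t+1}, a) - V(s_t, a_t)\bigr) \\
    &= 0.8\,\bigl(0.2 + 0.98 \times 1.0 - 0.5\bigr) \\
    &= 0.8 \times 0.68 = 0.544
\end{aligned}
```

If the write shifts the stored output voltage by exactly V_i, which is an idealization of the device behaviour, the new value of V(s_t, a_t) would be 0.5 + 0.544 = 1.044.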

The process by which the memristive crossbar array implements Q-learning is shown in Figure 6. In the array, each row line corresponds to a state s and each column line corresponds to an action a. The procedure is as follows:

(1) The read/write selection switch selects read mode. The state detection module in the robot detects the current environment state s_t, and the state selection switch selects the corresponding row line.

(2) The column selection switch selects all columns, and the state control switch connects the column lines to the random selection module. The random selection module chooses a column line at random according to the column-line voltages: the higher the voltage of a column line, the more likely it is to be chosen. The chosen column line determines the action a_t, which the robot executes. Alternatively, in certain preset states, the state control switch connects the column lines to the comparator module, which selects the column line with the highest voltage, and the connection selection switch then connects that column line to the delay unit. Together, the state selection switch, the random selection module, the comparator, and the connection selection module implement the ε-greedy policy of reinforcement learning.

(3) The selected column line is connected to the delay unit, which delays the column-line voltage by one time step.

(4) The state detection module detects the current environment state; the robot has entered state s_{t+1}. The state control switch now connects the column lines to the comparator, which selects the column line with the highest voltage, and the connection selection module connects that column line to the Q-value update module. The Q-value update module combines this voltage, the output voltage of the delay unit, and the reward obtained from the environment according to formula (7), the write-voltage formula given above, to obtain the write voltage V_i.

(5) The read/write selection switch selects write mode, and the write voltage V_i is applied across the memristor for a duration T_w.

(6) The above process is repeated until the preset number of iterations is reached.
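Steps (1) to (5) can be summarized as one software iteration. In the sketch below the table `v` stands in for the column output voltages of the array, and the write in step (5) is idealized as adding V_i to the stored value. The function name, the callback signatures `choose_action` and `env_step`, and that idealized write are assumptions of this example rather than details given in the patent.

```python
from typing import Callable, List, Tuple


def learning_step(
    v: List[List[float]],
    state: int,
    choose_action: Callable[[List[float]], int],
    env_step: Callable[[int, int], Tuple[int, float]],
    alpha: float,
    gamma: float,
) -> Tuple[int, float]:
    """Run one iteration of steps (1)-(5); return (next_state, write_voltage)."""
    # (1) Read mode: the detected state selects one row line.
    row = v[state]
    # (2) A column line is chosen from the row's output voltages.
    action = choose_action(row)
    # (3) Delay unit: hold V(s_t, a_t) for one time step.
    delayed = row[action]
    # (4) Act, observe s_{t+1} and the reward; the comparator picks the maximum
    #     column voltage of the new row, and the write voltage is computed.
    next_state, reward = env_step(state, action)
    v_max = max(v[next_state])
    v_i = alpha * (reward + gamma * v_max - delayed)
    # (5) Write mode. ASSUMPTION: the pulse shifts the stored output by V_i.
    v[state][action] = delayed + v_i
    return next_state, v_i
```

Step (6) then amounts to calling `learning_step` in a loop, feeding each returned state back in, until the preset iteration count is reached.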

The robot obstacle avoidance experiment aims to make a robot move without collisions in an environment that contains obstacles. In this experiment the robot learns by Q-learning based on a memristive crossbar array and eventually moves without collisions. The experiment uses the MobotSim software.

In Figure 7, the circular region represents the robot, which carries three sensors labeled 0 to 2. Each sensor can detect obstacles up to 1.5 m away. The black regions represent obstacles.

In this experiment, the distance to an obstacle detected by each sensor is divided into three segments, as follows:

where dist0 to dist2 denote the distance to the obstacle detected by each sensor. Combining s0 to s2 gives 27 cases, which are taken as the 27 states of the robot's environment and stored in a three-dimensional array state[s0, s1, s2]. On this experimental platform a sensor returns -1 both when the robot collides with an obstacle and when the sensor detects no obstacle. The state in which the robot collides with an obstacle is therefore assigned to state 0, that is, the case where s0 to s2 are all 0.
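The 27-state encoding can be sketched in Python. The segment boundaries are not reproduced in this text, so `near` and `far` below are hypothetical thresholds supplied by the caller, and mapping a -1 reading to segment 0 is this sketch's reading of the sentence about collisions. The base-3 row index is likewise just one convenient way to map state[s0, s1, s2] onto 27 row lines.

```python
def segment(dist: float, near: float, far: float) -> int:
    """Map one sensor distance to a segment index 0, 1 or 2.

    near and far are hypothetical thresholds in metres (near < far).
    ASSUMPTION: a reading of -1 is grouped with the closest segment.
    """
    if dist < 0 or dist < near:
        return 0
    if dist < far:
        return 1
    return 2


def state_index(s0: int, s1: int, s2: int) -> int:
    """Flatten state[s0, s1, s2] into a row index from 0 to 26."""
    for s in (s0, s1, s2):
        if s not in (0, 1, 2):
            raise ValueError("segment indices must be 0, 1 or 2")
    return (s0 * 3 + s1) * 3 + s2
```

With this layout, state[0, 0, 0] maps to row 0 and state[2, 2, 2] maps to row 26.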

The reward function r is defined as:

In this experiment the robot performs three actions: move forward, turn left, and turn right. When the robot is in state[2, 2, 2], the action is chosen at random in proportion to the Q values; in all other states, the action with the largest Q value is executed.
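This selection rule can be written as a short Python sketch. The names `ACTIONS`, `proportional_choice`, and `select_action` are hypothetical. The sketch also assumes the Q values are non-negative and falls back to a uniform choice when they are not or when they sum to zero, a case the patent does not address.

```python
import random
from typing import Sequence, Tuple

ACTIONS = ("forward", "turn_left", "turn_right")


def proportional_choice(q_row: Sequence[float], rng: random.Random) -> int:
    """Pick a column index with probability proportional to its Q value."""
    if min(q_row) < 0 or sum(q_row) <= 0:
        # ASSUMPTION: without valid weights, fall back to a uniform choice.
        return rng.randrange(len(q_row))
    return rng.choices(range(len(q_row)), weights=q_row, k=1)[0]


def select_action(state: Tuple[int, int, int], q_row: Sequence[float],
                  rng: random.Random) -> int:
    """Random-proportional choice in state (2, 2, 2), greedy elsewhere."""
    if state == (2, 2, 2):
        return proportional_choice(q_row, rng)
    return max(range(len(q_row)), key=lambda i: q_row[i])
```

The returned index selects an entry of `ACTIONS`, for example `ACTIONS[select_action((0, 1, 2), [0.1, 0.9, 0.3], random.Random())]` gives `"turn_left"`.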

With α = 0.8 and γ = 0.98, the simulation is run 500 times with 2000 steps per run. The simulation results are shown in Figure 8.

Claims (1)

1. A Q-learning system based on a memristive crossbar array, comprising a memristive crossbar array, characterized in that the system further comprises:

a read/write selection switch, which controls the read and write operations of the memristive crossbar array;

a state selection switch: the state detection module detects the current environment state s_t, and the state selection switch selects the corresponding row line;

a column selection switch: when a Q value, that is, the resistance of one memristor in the memristive crossbar array, needs to be updated, the column selection switch selects the column line corresponding to the action a_t;

a delay unit, which delays the voltage of the selected column line by one time step;

a state detection module, which detects the current environment state and stores the previous one; when an action must be selected according to the state, the state detection module detects the current environment state and supplies it to the state selection switch and the state control switch; after the action is executed, the state selection switch detects the environment state at that moment, stores the previous environment state, and supplies the new state to the state selection switch and the state control switch; when a Q value is updated, the state detection module outputs the environment state of the previous time step and supplies it to the state selection switch, which selects the corresponding row line; applying the write voltage

$$V_i = \alpha\left(r + \gamma \max_a V(s_{t+1}, a) - V(s_t, a_t)\right)$$

across the memristor updates the resistance of the memristor corresponding to s_t and a_t, thereby changing its output voltage V(s_t, a_t), that is, the value Q(s_t, a_t); here the value of V(s_t, a_t) equals the value of Q(s_t, a_t);

where α is the learning rate, r is the reward function, and γ is the discount rate.
CN201210188573.2A 2012-06-08 2012-06-08 Q learning system based on memristor intersection array Expired - Fee Related CN102723112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210188573.2A CN102723112B (en) 2012-06-08 2012-06-08 Q learning system based on memristor intersection array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210188573.2A CN102723112B (en) 2012-06-08 2012-06-08 Q learning system based on memristor intersection array

Publications (2)

Publication Number Publication Date
CN102723112A CN102723112A (en) 2012-10-10
CN102723112B true CN102723112B (en) 2015-06-17

Family

ID=46948846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210188573.2A Expired - Fee Related CN102723112B (en) 2012-06-08 2012-06-08 Q learning system based on memristor intersection array

Country Status (1)

Country Link
CN (1) CN102723112B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150617

Termination date: 20170608