WO2022105876A1 - Method and apparatus for selecting a decision - Google Patents

Method and apparatus for selecting a decision

Info

Publication number
WO2022105876A1
Authority
WO
WIPO (PCT)
Prior art keywords
decision
explored
message
exploration
decisions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/131777
Other languages
English (en)
French (fr)
Inventor
皇甫幼睿
王坚
李榕
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP21894023.7A (EP4228330A4)
Publication of WO2022105876A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning

Definitions

  • The present application relates to the field of communications, and more particularly, to a method and apparatus for selecting a decision.
  • Wireless communication systems typically face changing channels, changing environments, and changing users.
  • Hardware non-idealities and modeling non-idealities make it difficult for a communication system to find the optimal decision through theoretical calculation as conditions change, so the optimal decision is usually hard to obtain.
  • Suboptimal decisions can be obtained by solving optimization problems, but solving them is also highly complex, and in some scenarios is itself difficult.
  • Deep reinforcement learning can search for optimal decisions through the interaction between a neural network model and an environment. Treating the communication system as the environment, deep reinforcement learning can be used to search for the optimal decision for the communication system. In a complex, specific communication scenario, a better decision can usually be found only through exploration. However, to ensure reliable operation, existing communication systems often stick to general, conservative decisions. As a result, most devices in the communication system work at non-optimal performance for long periods, which cannot meet the requirements of future high-performance communication systems.
  • The present application provides a method and apparatus for selecting a decision, which help the communication system operate reliably.
  • The present application provides a method for selecting a decision, the method comprising: acquiring state information of a communication system; determining, according to the state information, the performance corresponding to each of M first decisions and/or the number of times each first decision has been explored, where the M first decisions are the decisions that can be explored under the state information and M is a positive integer; and determining a target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored.
  • A decision can thus be selected according to the performance of the explorable decisions and/or the number of times they have been explored, which avoids random exploration and helps the communication system operate reliably.
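A minimal sketch of the final selection step, assuming one possible rule: take the best-performing decision among those that have been explored at least a minimum number of times, with the exploration count as a tie-breaker. The rule itself, the `min_explored` threshold, the decision names, and the numbers are all hypothetical; the text only says that performance and/or exploration counts are used.

```python
from dataclasses import dataclass


@dataclass
class DecisionStats:
    performance: float  # estimated performance of the decision
    explored: int       # number of times the decision has been explored


def select_target(stats: dict[str, DecisionStats], min_explored: int = 1) -> str:
    """Return the best-performing decision explored at least `min_explored` times."""
    eligible = {d: s for d, s in stats.items() if s.explored >= min_explored}
    if not eligible:
        raise ValueError("no decision has been explored often enough")
    # Highest performance wins; the exploration count breaks ties.
    return max(eligible, key=lambda d: (eligible[d].performance, eligible[d].explored))


# Hypothetical statistics for the M = 3 decisions available in one state.
stats = {
    "handover_to_cell_A": DecisionStats(performance=0.72, explored=40),
    "handover_to_cell_B": DecisionStats(performance=0.95, explored=0),
    "stay_on_serving_cell": DecisionStats(performance=0.64, explored=55),
}
target = select_target(stats)
```

In this example the unexplored decision is ignored even though its nominal performance is highest, which is one way to keep the final choice away from untested options.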
  • The method further includes: determining the performance corresponding to each of K second decisions and/or the number of times each second decision has been explored, where the K second decisions are the decisions that can be explored after the target first decision is selected and K is a positive integer; and determining a target second decision from the K second decisions according to the performance corresponding to each second decision and/or the number of times each second decision has been explored.
  • Each step of a multi-step decision can thus select a decision according to the performance of the currently explorable decisions and/or the number of times they have been explored, which avoids random exploration and helps the communication system operate reliably.
  • Determining, according to the state information, the performance corresponding to each of the M first decisions and/or the number of times each first decision has been explored includes: cyclically executing the following steps N times, where N is an integer greater than 1, to obtain the performance corresponding to each first decision and/or the number of times each first decision has been explored: selecting a first decision to be explored from the M first decisions according to the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored; and updating the performance corresponding to the first decision to be explored according to the state information and a model of the communication system, and/or adding 1 to the number of times the first decision to be explored has been explored.
  • The present application can thus use known models of the communication system to guide decision exploration, which avoids random exploration and helps the communication system operate reliably.
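The N-round loop described above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the selection score is a UCB1-style rule chosen here for concreteness, the running-mean update is one possible way to "update the performance", and `simulate`, `TRUE_MEAN`, and the decision names are a made-up stand-in for the model of the communication system.

```python
import math
import random


def explore(decisions, simulate, n_rounds, c_explore=1.0, seed=0):
    """Run n_rounds of model-guided exploration over a flat set of decisions."""
    rng = random.Random(seed)
    perf = {d: 0.0 for d in decisions}   # current performance estimate per decision
    count = {d: 0 for d in decisions}    # number of times each decision was explored

    for _ in range(n_rounds):
        total = sum(count.values())

        def score(d):
            if count[d] == 0:
                return math.inf          # try every decision at least once
            # Assumed UCB1-style score: estimate plus an exploration bonus.
            return perf[d] + c_explore * math.sqrt(math.log(total) / count[d])

        chosen = max(decisions, key=score)
        reward = simulate(chosen, rng)   # query the system model, not the live system
        count[chosen] += 1               # "add 1 to the number of times explored"
        perf[chosen] += (reward - perf[chosen]) / count[chosen]  # running mean
    return perf, count


# Hypothetical system model: mean performance per decision plus small noise.
TRUE_MEAN = {"low_power": 0.3, "mid_power": 0.6, "high_power": 0.5}


def simulate(decision, rng):
    return TRUE_MEAN[decision] + rng.uniform(-0.05, 0.05)


perf, count = explore(list(TRUE_MEAN), simulate, n_rounds=200)
```

Because every evaluation goes through `simulate`, the live system is never asked to try a poorly understood decision; only the statistics gathered from the model are carried forward.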
  • The method further includes: selecting a second decision to be explored from K second decisions according to the performance currently corresponding to each of the K second decisions and/or the number of times each second decision has currently been explored, where the K second decisions are the decisions that can be explored after the first decision to be explored is selected; updating the performance corresponding to the second decision to be explored according to the state information and the model of the communication system, and/or adding 1 to the number of times the second decision to be explored has been explored; and updating the performance corresponding to the first decision to be explored according to the performance corresponding to the second decision to be explored, and/or adding 1 to the number of times the first decision to be explored has been explored.
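A two-step version of the exploration can be sketched as a small tree in which the value observed for the second decision is also propagated back to the first decision. The UCB1-style `pick` rule, the running-mean `update`, the deterministic `MODEL` table, and the cell/beam names are assumptions made for illustration only.

```python
import math


class Node:
    """Statistics for one decision, plus the decisions reachable after it."""

    def __init__(self):
        self.performance = 0.0
        self.explored = 0
        self.children = {}

    def update(self, value):
        self.explored += 1
        self.performance += (value - self.performance) / self.explored


def pick(children, c_explore):
    """Choose the decision to explore next (assumed UCB1-style score)."""
    total = sum(ch.explored for ch in children.values())

    def score(name):
        ch = children[name]
        if ch.explored == 0:
            return math.inf
        return ch.performance + c_explore * math.sqrt(math.log(total) / ch.explored)

    return max(children, key=score)


def explore_two_steps(tree, model, n_rounds, c_explore=0.5):
    for _ in range(n_rounds):
        first = pick(tree, c_explore)
        second = pick(tree[first].children, c_explore)
        value = model(first, second)                  # evaluate with the system model
        tree[first].children[second].update(value)    # update the second decision
        tree[first].update(value)                     # propagate back to the first


# Hypothetical deterministic model of the performance of each (first, second) pair.
MODEL = {
    ("cell_A", "beam_1"): 0.4, ("cell_A", "beam_2"): 0.7,
    ("cell_B", "beam_1"): 0.5, ("cell_B", "beam_2"): 0.3,
}

tree = {}
for first, second in MODEL:
    tree.setdefault(first, Node()).children[second] = Node()

explore_two_steps(tree, lambda f, s: MODEL[(f, s)], n_rounds=100)
```

After the run, each first decision's exploration count equals the sum of its children's counts, and its performance is the average of the values seen beneath it.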
  • Selecting the first decision to be explored from the M first decisions according to the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored includes: selecting the first decision to be explored from the M first decisions according to the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored, together with an exploration coefficient of each first decision, where the exploration coefficient is used to control the tendency when selecting a decision.
  • The exploration coefficient controls the tendency when selecting a decision, which avoids random exploration and helps the communication system operate reliably.
  • Determining, according to the state information, the performance corresponding to each of the M first decisions and/or the number of times each first decision has been explored includes: determining the performance corresponding to each first decision and/or the number of times each first decision has been explored according to the state information and historical information, where the historical information includes the performance corresponding to each first decision under the state information and/or the number of times each first decision has been explored.
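As an illustration of how historical information might be organised, the sketch below keeps, for each state, the performance and exploration count of every decision tried in that state. The `History` class, its method names, the running-mean update, and the state and decision labels are hypothetical.

```python
from collections import defaultdict


class History:
    """Per-state record of (performance, times explored) for each decision."""

    def __init__(self):
        self._table = defaultdict(dict)

    def record(self, state, decision, observed):
        perf, n = self._table[state].get(decision, (0.0, 0))
        n += 1
        perf += (observed - perf) / n       # running mean of observed performance
        self._table[state][decision] = (perf, n)

    def lookup(self, state, decisions):
        """Return (performance, count) for each decision; unseen ones get (0.0, 0)."""
        known = self._table.get(state, {})
        return {d: known.get(d, (0.0, 0)) for d in decisions}


history = History()
history.record("low_snr", "mcs_2", 0.4)
history.record("low_snr", "mcs_2", 0.6)
stats = history.lookup("low_snr", ["mcs_2", "mcs_5"])
```

Here `stats["mcs_2"]` holds the average of the two recorded observations with a count of 2, while `stats["mcs_5"]` reports that the decision has never been explored in this state.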
  • Determining the target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored includes: determining the target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored, together with the exploration coefficient of each first decision, where the exploration coefficient is used to control the tendency when selecting a decision.
  • The exploration coefficient controls the tendency when selecting a decision, which helps select a more appropriate decision and contributes to the reliable operation of the communication system.
  • The method further includes: training a neural network model according to the state information and the target first decision, where the neural network model is used to output C_r.
  • The output of model-based decision exploration can thus be used to train the neural network model, and the trained neural network model can in turn guide model-based decision exploration.
  • The method further includes: acquiring parameters used for decision exploration, where the parameters include at least one of a performance indicator, N_2, C_2, C_r, ⁇, and N_t.
  • The terminal may acquire the parameters from the access network device.
  • The access network device may acquire the parameters from the terminal.
  • Acquiring the parameters used for decision exploration includes: acquiring the parameters according to the task type.
  • C_r should be smaller, which reduces the weight of the exploration term; ⁇ should be larger, so that decisions that have been explored too few times do not gain exploration confidence; and N should be larger, because a larger total number of explorations makes the final decision more accurate.
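One simple way to acquire parameters according to the task type is a lookup table, sketched below. The task-type names, the parameter values, and the assumption that a reliability-critical task is the one that calls for a smaller C_r and a larger N are all hypothetical; only the direction of the adjustments comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExplorationParams:
    performance_indicator: str  # what "performance" measures for this task
    n_explorations: int         # N: total number of exploration rounds
    c_r: float                  # C_r: weight of the exploration term


# Hypothetical table; values are placeholders, not taken from the source.
PARAMS_BY_TASK = {
    "high_reliability": ExplorationParams("handover_success_rate", 1000, 0.2),
    "best_effort": ExplorationParams("throughput", 200, 1.0),
}


def params_for(task_type: str) -> ExplorationParams:
    try:
        return PARAMS_BY_TASK[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None


params = params_for("high_reliability")
```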
  • The method further includes: acquiring support information, where the support information is used to determine the performance corresponding to the first decision and includes at least one of a simulator of the model of the communication system, simulation conditions, and historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored.
  • The method further includes: receiving a sixth message sent by the access network device, where the sixth message is used to query the parameters used for decision exploration, and the parameters include at least one of a performance indicator, N_2, C_2, C_r, ⁇, and N_t; and sending a seventh message to the access network device, where the seventh message is used to indicate the parameters.
  • The access network device can thus query the terminal for the parameters used for decision exploration, so that the access network device can estimate the time that decision exploration requires and handle it appropriately.
  • The method further includes: receiving a first message sent by an access network device, where the first message is used to query whether the terminal has the capability to make exploration decisions; sending a second message to the access network device, where the second message is used to indicate the capability to make exploration decisions; and receiving a fifth message sent by the access network device, where the fifth message is used to indicate that registration of the exploration decision capability at the core network device is complete.
  • The access network device needs to help the terminal explore better performance in the handover task, so the access network device should query and authenticate the terminal's reliable-exploration capability (VIP client).
  • The terminal has specific requirements for the reliability of its own tasks.
  • The access network device should ask the terminal for its specific settings when setting the exploration parameters. For high-value terminals within its coverage, the access network device can register a reliable-exploration permission on the core network device. For example, terminals that appear frequently within the coverage of the access network device are very helpful for effective, continuous exploration. Registering the reliable-exploration permission for such terminals helps the access network device improve reliable-exploration performance and accumulate experience.
  • The method further includes: receiving an eighth message sent by the terminal, where the eighth message is used to request the start of decision exploration; sending an exploration result to the terminal, where the exploration result includes information about the target first decision; and receiving a tenth message sent by the terminal, where the tenth message is used to request the end of decision exploration.
  • The start and end of decision exploration can thus be defined, which helps decision exploration proceed smoothly.
  • The present application provides a method for selecting a decision, the method comprising: sending a first message to a terminal, where the first message is used to query whether the terminal has the capability to make exploration decisions; receiving a second message sent by the terminal, where the second message is used to indicate the capability to make exploration decisions; sending a third message to a core network device, where the third message is used to request registration of the exploration decision capability; receiving a fourth message sent by the core network device, where the fourth message is used to indicate that registration of the exploration decision capability is complete; and sending a fifth message to the terminal, where the fifth message is used to indicate that registration of the exploration decision capability at the core network device is complete.
  • The access network device needs to help the terminal explore better performance in the handover task, so the access network device should query and authenticate the terminal's reliable-exploration capability (optional).
  • The terminal has specific requirements for the reliability of its own tasks.
  • The access network device should ask the terminal for its specific settings when setting the exploration parameters. For high-value terminals within its coverage, the access network device can register a reliable-exploration permission on the core network device. In the above technical solution, the reliable-exploration permission can be registered for the terminal, which helps the access network device improve reliable-exploration performance and accumulate experience.
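The five-message registration exchange can be sketched as a small simulation. The class and method names, the message dictionaries, and the `allowed` set used by the core network to decide whether a terminal may explore are hypothetical; the sketch only mirrors the order of the first to fifth messages described above.

```python
from dataclasses import dataclass, field


@dataclass
class Terminal:
    can_explore: bool
    registered: bool = False

    def on_first_message(self):
        # Second message: report whether this terminal can make exploration decisions.
        return {"msg": "second", "capable": self.can_explore}

    def on_fifth_message(self):
        self.registered = True


@dataclass
class CoreNetwork:
    allowed: set = field(default_factory=set)    # terminals permitted to explore
    registry: set = field(default_factory=set)   # terminals already registered

    def on_third_message(self, terminal_id):
        if terminal_id not in self.allowed:
            return None                           # registration refused
        self.registry.add(terminal_id)
        return {"msg": "fourth", "terminal": terminal_id}


def register(terminal_id, terminal, core):
    """Run the exchange from the access network device's side; return the message log."""
    log = ["first"]
    reply = terminal.on_first_message()
    log.append(reply["msg"])
    if not reply["capable"]:
        return log
    log.append("third")
    fourth = core.on_third_message(terminal_id)
    if fourth is None:
        return log
    log.append(fourth["msg"])
    log.append("fifth")
    terminal.on_fifth_message()
    return log


core = CoreNetwork(allowed={"ue-1"})
ue = Terminal(can_explore=True)
log = register("ue-1", ue, core)
```

The exchange stops early if the terminal reports no capability or if the core network does not grant permission, so the fifth message is only sent after a successful registration.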
  • The method further includes: sending a sixth message to the terminal, where the sixth message is used to query the parameters used for decision exploration; receiving a seventh message sent by the terminal, where the seventh message is used to indicate the parameters; and estimating the time of decision exploration according to the parameters.
  • The access network device can thus query the terminal for the parameters used for decision exploration, so that the access network device can estimate the time that decision exploration requires and handle it appropriately.
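The text does not say how the time is estimated from the parameters. As a rough sketch, assuming the dominant cost is proportional to the total number of exploration rounds, the estimate could be a simple product; the function name, the per-round cost, and the fixed overhead are all hypothetical.

```python
def estimate_exploration_time(n_explorations: int,
                              seconds_per_exploration: float,
                              overhead_seconds: float = 0.0) -> float:
    """Estimate the duration of decision exploration from the reported round count."""
    if n_explorations < 0 or seconds_per_exploration < 0 or overhead_seconds < 0:
        raise ValueError("all inputs must be non-negative")
    return overhead_seconds + n_explorations * seconds_per_exploration


# Hypothetical figures: 500 rounds at 2 ms each plus 50 ms of signalling overhead.
estimate = estimate_exploration_time(500, 0.002, overhead_seconds=0.05)
```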
  • The method further includes: sending support information to the terminal, where the support information includes at least one of a simulator of the communication system model, simulation conditions, and historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored.
  • The present application provides a method for selecting a decision, the method comprising: sending an eighth message to an access network device, where the eighth message is used to request the start of decision exploration; receiving an exploration result sent by the access network device; and sending a tenth message to the access network device, where the tenth message is used to request the end of decision exploration.
  • The start and end of decision exploration can thus be defined, which helps decision exploration proceed smoothly.
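The start/result/end exchange can be modelled as a three-state session on the access network device side. The state names, the `ExplorationSession` class, and the shape of the result dictionary are hypothetical; the sketch only enforces that a result follows the eighth message and that the tenth message closes a running session.

```python
from enum import Enum, auto


class SessionState(Enum):
    IDLE = auto()
    EXPLORING = auto()
    DONE = auto()


class ExplorationSession:
    """Tracks one decision-exploration session requested by a terminal."""

    def __init__(self, explore):
        self._explore = explore          # callable returning the target first decision
        self.state = SessionState.IDLE

    def on_eighth_message(self):
        """Terminal asks to start exploring; reply with the exploration result."""
        if self.state is not SessionState.IDLE:
            raise RuntimeError("exploration already started")
        self.state = SessionState.EXPLORING
        return {"target_first_decision": self._explore()}

    def on_tenth_message(self):
        """Terminal asks to end exploring."""
        if self.state is not SessionState.EXPLORING:
            raise RuntimeError("no exploration in progress")
        self.state = SessionState.DONE


session = ExplorationSession(explore=lambda: "handover_to_cell_A")
result = session.on_eighth_message()
session.on_tenth_message()
```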
  • The present application provides a method for selecting a decision, the method comprising: receiving a thirteenth message sent by an access network device, where the thirteenth message is used to request historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored; and sending a fifteenth message to the access network device, where the fifteenth message is used to indicate the historical information.
  • Historical information can thus be provided to the access network device, so that decisions can be made based on the historical information.
  • The method further includes: receiving a third message sent by the access network device, where the third message is used to request registration of the exploration decision capability for the terminal; and sending a fourth message to the access network device, where the fourth message is used to indicate that registration of the exploration decision capability is complete.
  • Before sending the fourth message to the access network device, the method further includes: determining that the terminal is allowed to make exploration decisions.
  • The present application provides an apparatus for selecting a decision, the apparatus comprising:
  • a processing unit configured to: acquire state information of a communication system; determine, according to the state information, the performance corresponding to each of M first decisions and/or the number of times each first decision has been explored, where the M first decisions are the decisions that can be explored under the state information and M is a positive integer; and determine a target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored.
  • A decision can thus be selected according to the performance of the explorable decisions and/or the number of times they have been explored, which avoids random exploration and helps the communication system operate reliably.
  • The processing unit is further configured to: determine the performance corresponding to each of K second decisions and/or the number of times each second decision has been explored, where the K second decisions are the decisions that can be explored after the target first decision is selected and K is a positive integer; and determine a target second decision from the K second decisions according to the performance corresponding to each second decision and/or the number of times each second decision has been explored.
  • Each step of a multi-step decision can thus select a decision according to the performance of the currently explorable decisions and/or the number of times they have been explored, which avoids random exploration and helps the communication system operate reliably.
  • The processing unit is specifically configured to: cyclically execute the following steps N times, where N is an integer greater than 1, to obtain the performance corresponding to each first decision and/or the number of times each first decision has been explored: select a first decision to be explored from the M first decisions according to the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored; and update the performance corresponding to the first decision to be explored according to the state information and a model of the communication system, and/or add 1 to the number of times the first decision to be explored has been explored.
  • The present application can thus use known models of the communication system to guide decision exploration, which avoids random exploration and helps the communication system operate reliably.
  • The method further includes: selecting a second decision to be explored from K second decisions according to the performance currently corresponding to each of the K second decisions and/or the number of times each second decision has currently been explored, where the K second decisions are the decisions that can be explored after the first decision to be explored is selected; updating the performance corresponding to the second decision to be explored according to the state information and the model of the communication system, and/or adding 1 to the number of times the second decision to be explored has been explored; and updating the performance corresponding to the first decision to be explored according to the performance corresponding to the second decision to be explored, and/or adding 1 to the number of times the first decision to be explored has been explored.
  • The processing unit is specifically configured to: select the first decision to be explored from the M first decisions according to the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored, together with an exploration coefficient of each first decision, where the exploration coefficient is used to control the tendency when selecting a decision.
  • The exploration coefficient controls the tendency when selecting a decision, which avoids random exploration and helps the communication system operate reliably.
  • The processing unit is specifically configured to determine the first decision to be explored from the M first decisions according to a formula in which X_1d is the performance currently corresponding to the d-th first decision among the M first decisions, N_1 is the total number of times the M first decisions have currently been explored, and N_1d is the number of times the d-th first decision has currently been explored.
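The formula itself is not reproduced in this text. One selection rule that is consistent with the symbols defined above is the UCB1-style score below. This is an assumption offered for orientation only: both the functional form and the use of C_r as the exploration coefficient are inferred, not taken from the source.

```latex
d^{*} = \arg\max_{d \in \{1,\dots,M\}} \left( X_{1d} + C_r \sqrt{\frac{\ln N_1}{N_{1d}}} \right)
```

Under this form, a decision with a small N_{1d} receives a large bonus and is explored sooner, while a larger C_r shifts the balance further toward rarely explored decisions.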
  • The processing unit is specifically configured to: determine the performance corresponding to each first decision and/or the number of times each first decision has been explored according to the state information and historical information, where the historical information includes the performance corresponding to each first decision under the state information and/or the number of times each first decision has been explored.
  • The processing unit is specifically configured to: determine the target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored, together with the exploration coefficient of each first decision, where the exploration coefficient is used to control the tendency when selecting a decision.
  • The exploration coefficient controls the tendency when selecting a decision, which helps select a more appropriate decision and contributes to the reliable operation of the communication system.
  • The processing unit is specifically configured to determine the target first decision from the M first decisions according to a formula in which X_2d is the performance corresponding to the d-th first decision among the M first decisions, N_2 is the total number of times the M first decisions have been explored, N_2d is the number of times the d-th first decision has been explored, and C_2 is a constant, varies with N_2d, or is determined by a neural network model.
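The sketch below illustrates how C_2 could be supplied either as a constant or as a function of N_2d when scoring the target decision. The score's functional form (a UCB1-style bonus) is an assumption, since the formula is not reproduced in this text; the function names, the decision labels, and the numbers are hypothetical.

```python
import math


def target_score(x_2d, n_2, n_2d, c_2):
    """Assumed score: performance plus a C_2-weighted confidence term."""
    coeff = c_2(n_2d) if callable(c_2) else c_2   # constant, or a function of N_2d
    return x_2d + coeff * math.sqrt(math.log(n_2) / n_2d)


def select_target(stats, c_2):
    """stats maps decision -> (X_2d, N_2d); unexplored decisions are skipped."""
    n_2 = sum(n for _, n in stats.values())
    explored = {d: xn for d, xn in stats.items() if xn[1] > 0}
    if not explored:
        raise ValueError("no explored decisions to choose from")
    return max(explored, key=lambda d: target_score(*explored[d][:1], n_2, explored[d][1], c_2))


stats = {"a": (0.70, 90), "b": (0.75, 10)}

# Constant C_2 = 0: the choice depends on performance alone.
by_performance = select_target(stats, c_2=0.0)

# C_2 varying with N_2d: decisions explored fewer than 20 times get no bonus.
by_confidence = select_target(stats, c_2=lambda n_2d: 0.5 if n_2d >= 20 else 0.0)
```

With the constant coefficient the slightly better but rarely explored decision "b" is chosen; with the count-dependent coefficient the well-explored decision "a" wins, which shows how C_2 steers the tendency of the final selection.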
  • The processing unit is further configured to: train a neural network model according to the state information and the target first decision, where the neural network model is used to output C_r.
  • The output of model-based decision exploration can thus be used to train the neural network model, and the trained neural network model can in turn guide model-based decision exploration.
  • The processing unit is further configured to: acquire parameters used for decision exploration, where the parameters include at least one of a performance indicator, N_2, C_2, C_r, ⁇, and N_t.
  • The terminal may acquire the parameters from the access network device.
  • The access network device may acquire the parameters from the terminal.
  • The processing unit is specifically configured to: acquire the parameters according to the task type.
  • C_r should be smaller, which reduces the weight of the exploration term; ⁇ should be larger, so that decisions that have been explored too few times do not gain exploration confidence; and N should be larger, because a larger total number of explorations makes the final decision more accurate.
  • The processing unit is further configured to: acquire support information, where the support information is used to determine the performance and includes at least one of a simulator of the model of the communication system, simulation conditions, and historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored.
  • The apparatus further includes a transceiver unit configured to: receive a sixth message sent by the access network device, where the sixth message is used to query the parameters used for decision exploration, and the parameters include at least one of a performance indicator, N_2, C_2, C_r, ⁇, and N_t; and send a seventh message to the access network device, where the seventh message is used to indicate the parameters.
  • The access network device can thus query the terminal for the parameters used for decision exploration, so that the access network device can estimate the time that decision exploration requires and handle it appropriately.
  • The transceiver unit is further configured to: receive a first message sent by the access network device, where the first message is used to query whether the terminal has the capability to make exploration decisions; send a second message to the access network device, where the second message is used to indicate the capability to make exploration decisions; and receive a fifth message sent by the access network device, where the fifth message is used to indicate that registration of the exploration decision capability at the core network device is complete.
  • The access network device needs to help the terminal explore better performance in the handover task, so the access network device should query and authenticate the terminal's reliable-exploration capability (VIP client).
  • The terminal has specific requirements for the reliability of its own tasks.
  • The access network device should ask the terminal for its specific settings when setting the exploration parameters. For high-value terminals within its coverage, the access network device can register a reliable-exploration permission on the core network device. For example, terminals that appear frequently within the coverage of the access network device are very helpful for effective, continuous exploration. Registering the reliable-exploration permission for such terminals helps the access network device improve reliable-exploration performance and accumulate experience.
  • The transceiver unit is further configured to: receive an eighth message sent by the terminal, where the eighth message is used to request the start of decision exploration; send an exploration result to the terminal, where the exploration result includes information about the target first decision; and receive a tenth message sent by the terminal, where the tenth message is used to request the end of decision exploration.
  • The start and end of decision exploration can thus be defined, which helps decision exploration proceed smoothly.
  • the present application provides an apparatus for selection decision, the apparatus comprising:
  • a transceiver unit configured to send a first message to the terminal, where the first message is used to query whether it has the capability to make an exploration decision; receive a second message sent by the terminal, where the second message is used to indicate that it has the capability to make an exploration decision ; Send a third message to the core network device, the third message is used to request the registration of the exploration decision-making capability; Receive the fourth message sent by the core network device, the fourth message is used to indicate that the registration of the exploration and decision-making capability is completed ; Send a fifth message to the terminal, where the fifth message is used to indicate that the registration of the discovery decision-making capability in the core network device is completed.
  • the access network device needs to assist the terminal to explore better performance in the handover task, so the access network device should inquire and authenticate the terminal's reliable exploration capability (optional).
  • the terminal has specific requirements for its own mission reliability.
  • the access network equipment should inquire about the specific settings of the terminal in the setting of the exploration parameters. For high-value terminals within the coverage of the access network equipment, the access network equipment can be used in the core. Reliable exploration permission is registered on the network device. In the above technical solution, the reliable exploration authority can be registered for the terminal, which helps to improve the reliable exploration performance of the access network equipment and accumulate experience.
  • the transceiver unit is further configured to: send a sixth message to the terminal, where the sixth message is used to query the parameters used in the exploration decision; and receive a seventh message sent by the terminal, where the seventh message is used to indicate the parameters;
  • the apparatus further includes a processing unit, configured to estimate the time of the exploration decision according to the parameters.
  • the access network device can query the terminal for the parameters used in the exploration decision, so that the access network device can estimate the time required for the exploration decision and handle it accordingly.
  • the transceiver unit is further configured to send support information to the terminal, where the support information includes at least one of a simulator of a communication system model, simulation conditions, and historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored.
  • the present application provides an apparatus for selection decision, the apparatus comprising:
  • a transceiver unit configured to: send an eighth message to an access network device, where the eighth message is used to request to start the exploration decision; receive an exploration result sent by the access network device; and send a tenth message to the access network device, where the tenth message is used to request to end the exploration decision.
  • the start and end of the decision exploration can be defined, which is helpful for the smooth progress of the decision exploration.
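  • the start/result/end exchange described above can be sketched as a small session tracker on the access network side. This is only an illustration: the names `Msg` and `ExplorationSession`, and the rule that a result may only be sent between the start and end requests, are assumptions and are not defined in this application.

```python
from enum import Enum, auto


class Msg(Enum):
    START_EXPLORATION = auto()   # stands in for the eighth message
    EXPLORATION_RESULT = auto()  # exploration result returned to the terminal
    END_EXPLORATION = auto()     # stands in for the tenth message


class ExplorationSession:
    """Hypothetical access-network-side view of one exploration session."""

    def __init__(self):
        self.active = False
        self.log = []

    def on_message(self, msg):
        # Only a start request while idle, or an end request while active, is accepted.
        if msg is Msg.START_EXPLORATION and not self.active:
            self.active = True
        elif msg is Msg.END_EXPLORATION and self.active:
            self.active = False
        else:
            raise ValueError(f"unexpected {msg.name} while active={self.active}")
        self.log.append(msg)

    def send_result(self, target_first_decision):
        # A result is only meaningful between the start and end requests.
        if not self.active:
            raise RuntimeError("no exploration in progress")
        self.log.append(Msg.EXPLORATION_RESULT)
        return {"type": Msg.EXPLORATION_RESULT,
                "target_first_decision": target_first_decision}
```

  • bounding the session explicitly makes it easy to reject a result that arrives before the start request or after the end request.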
  • the present application provides an apparatus for selection decision, the apparatus comprising:
  • a transceiver unit configured to: receive a thirteenth message sent by an access network device, where the thirteenth message is used to request historical information, and the historical information includes the performance corresponding to each decision in different states of the communication system and/or the number of times each decision has been explored; and send a fifteenth message to the access network device, where the fifteenth message is used to indicate the historical information.
  • historical information can be provided to the access network device, so as to realize decision-making based on the historical information.
  • the transceiver unit is further configured to: receive a third message sent by the access network device, where the third message is used to request registration of the exploration decision capability for the terminal; and send a fourth message to the access network device, where the fourth message is used to indicate that the registration of the exploration decision capability is completed.
  • the apparatus further includes a processing unit, configured to determine, before the fourth message is sent to the access network device, whether to allow the terminal to make exploration decisions.
  • the present application provides a communication device including a processor, a memory and a transceiver.
  • the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory and to control the transceiver to send and receive signals, so that the communication device executes the method in the first aspect, the second aspect, the third aspect or the fourth aspect, or in any possible implementation thereof.
  • the present application provides a communication device, comprising a processor and a communication interface, where the communication interface is used to receive a signal and transmit the received signal to the processor, and the processor processes the signal so that the method in the first aspect, the second aspect, the third aspect or the fourth aspect, or in any possible implementation thereof, is performed.
  • the above-mentioned communication interface may be an interface circuit
  • the processor may be a processing circuit
  • the present application provides a chip, comprising a logic circuit and a communication interface, where the communication interface is used to receive data and/or information to be processed, the logic circuit is used to perform the data and/or information processing described in any one of the above aspects or any implementation thereof, and the communication interface is also used to output the processing result obtained by the logic circuit.
  • the present application provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are run on a computer, the method in the first aspect, the second aspect, the third aspect or the fourth aspect, or in any possible implementation thereof, is executed.
  • the present application provides a computer program product comprising computer program code, and when the computer program code is run on a computer, the method in the first aspect, the second aspect, the third aspect or the fourth aspect, or in any possible implementation thereof, is executed.
  • the present application provides a wireless communication system, including the communication device according to the fifth aspect, the sixth aspect, the seventh aspect, or the eighth aspect.
  • FIG. 1 is a schematic structural diagram of a communication system to which an embodiment of the present application can be applied.
  • FIG. 2 is an overall flow diagram of the method for selection decision of the present application.
  • FIG. 3 is a schematic flowchart of the method for selection decision provided by the present application.
  • FIG. 4 is an example of assisting exploration through historical information.
  • FIG. 5 is a schematic diagram of a multi-step exploration.
  • FIG. 6 is an example (1) of assisting exploration through a communication system model.
  • FIG. 7 is an example (2) of assisting exploration through a communication system model.
  • FIG. 8 is an example in which the method of the present application is applied to a channel coding scenario.
  • FIG. 9 is a schematic flowchart of registering a reliable exploration capability for a terminal in a core network device.
  • FIG. 10 is a schematic diagram of signaling interaction when a decision is made by an access network device.
  • FIG. 11 is a schematic structural diagram of an apparatus for selection decision provided by an embodiment of the present application.
  • FIG. 12 is another schematic structural diagram of an apparatus for selection decision provided by an embodiment of the present application.
  • applicable communication systems include: global system for mobile communications (GSM), enhanced data rate for GSM evolution (EDGE), wideband code division multiple access (WCDMA), code division multiple access 2000 (CDMA2000), time division-synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), LTE frequency division duplex (FDD), LTE time division duplex (TDD), universal mobile telecommunication system (UMTS), worldwide interoperability for microwave access (WiMAX), narrow band-internet of things (NB-IoT), fifth generation (5G) or new radio (NR), satellite communication systems, and future mobile communication systems.
  • the technical solutions of the embodiments of the present application can be applied to application scenarios of 5G systems such as enhanced mobile broadband (eMBB), ultra-reliable and low latency communication (URLLC), and enhanced machine-type communication (eMTC).
  • FIG. 1 is a schematic structural diagram of a communication system to which an embodiment of the present application can be applied.
  • the communication system 100 may include an access network device (such as the access network device 110 in FIG. 1 ) and at least one terminal (such as the terminal 120 and the terminal 130 in FIG. 1 ).
  • a wireless communication system is usually composed of cells, each cell may include an access network device, a terminal is wirelessly connected to the access network device, and the access network device can provide communication services to multiple terminals.
  • the terminal in FIG. 1 may be at a fixed position or may be movable.
  • FIG. 1 is only a schematic diagram, and the communication system may also include other network devices, such as core network devices, wireless relay devices, and wireless backhaul devices, etc., which are not shown in FIG. 1 .
  • the embodiments of the present application do not limit the number of access network devices and terminals included in the communication system.
  • the terminal in the embodiments of the present application may also be referred to as user equipment (UE), user, access terminal, subscriber unit, subscriber station, mobile station (MS), remote station, remote terminal, mobile equipment, user terminal, terminal equipment, wireless communication equipment, user agent, etc.
  • the terminal may be a cell phone, smart watch, wireless data card, tablet computer, personal digital assistant (PDA) computer, wireless modem, computing device, other processing device connected to the wireless modem, handheld device, laptop computer, machine type communication (MTC) terminal, computer with wireless transceiver function, virtual reality terminal, augmented reality terminal, wireless terminal in industrial control, wireless terminal in unmanned driving, wireless terminal in remote surgery, wireless terminal in smart grids, wireless terminal in transportation security, wireless terminal in smart cities, wireless terminal in smart homes, wireless terminal in satellite communications (e.g., satellite phones or satellite terminals), etc.
  • the embodiments of the present application do not limit the specific technology and specific device form adopted by the terminal.
  • the access network device in this embodiment of the present application may be a device for communicating with a terminal.
  • the access network device may be a base transceiver station (BTS) in a GSM system or a code division multiple access (CDMA) system, a base station (NodeB, NB) in a wideband code division multiple access (WCDMA) system, an evolved base station (evolved NodeB, eNB or eNodeB) in an LTE system, or a radio controller in a cloud radio access network (CRAN) scenario.
  • alternatively, the network device may be a relay station, an access point, a vehicle-mounted device or a wearable device, or may be a terminal that undertakes the base station function in device-to-device (D2D) communication or machine communication.
  • the network device may also be a network device in a 5G network or an access network device in a future evolved public land mobile network (PLMN), which is not limited in the embodiments of the present application.
  • the access network device in this embodiment of the present application may also be a module or unit that completes some functions of the base station, for example, may be a centralized unit (central unit, CU), or may be a distributed unit (distributed unit, DU) .
  • the embodiments of the present application do not limit the specific technology and specific device form adopted by the access network device.
  • the above-mentioned base station may include a baseband unit (BBU) and a remote radio unit (RRU).
  • BBU and RRU can be placed in different places.
  • for example, the RRU is deployed remotely in a high-traffic area, and the BBU is placed in the central computer room.
  • BBU and RRU can also be placed in the same computer room.
  • the BBU and RRU can also be different components under one rack.
  • the terminal and access network equipment in the embodiments of the present application can be deployed on land, including indoors or outdoors, handheld or vehicle mounted; can also be deployed on water; and can also be deployed on aircraft, balloons, and artificial satellites in the air.
  • the embodiments of the present application do not limit the application scenarios of the terminal and the access network device.
  • the terminal and the access network device in the embodiments of the present application may communicate through licensed spectrum, may also communicate through unlicensed spectrum, or may communicate through licensed spectrum and unlicensed spectrum at the same time.
  • the terminal and the access network device can communicate through the spectrum below 6 gigahertz (GHz), through the spectrum above 6 GHz, or through both the spectrum below 6 GHz and the spectrum above 6 GHz at the same time.
  • the embodiments of the present application do not limit the spectrum resources used between the terminal and the access network device.
  • Wireless communication systems typically face changing channels, changing environments, and changing users.
  • the non-ideality of hardware and of modeling makes it difficult for the communication system to find the optimal decision through theoretical formula calculation as conditions change. The optimal decision is therefore usually hard to obtain, and sometimes a complex traversal search is required to obtain it.
  • suboptimal decisions can be obtained by solving optimization problems, but solving them is also highly complex and, in some scenarios, difficult. Often, for a complex and specific communication scenario, better decisions can only be found through exploration.
  • Deep reinforcement learning can use the interaction between the neural network model and the environment to search for optimal decisions. Taking the communication system as the environment, deep reinforcement learning can be used to search for the optimal decision for the communication system.
  • Model-free reinforcement learning algorithms are the most commonly used types of deep reinforcement learning, such as deep Q network (DQN), proximal policy optimization (PPO) and other algorithms.
  • however, the random exploration that such algorithms rely on may degrade the performance of the communication system. Because the communication system has high reliability requirements, this exploration method is unacceptable.
  • the present application provides a method and apparatus for selection decision, which can avoid random exploration and help the communication system to operate reliably.
  • FIG. 2 is an overall flow diagram of the method for selection decision of the present application.
  • the current state of the communication system can be used as an input to the known model of the communication system. The known model first determines the explorable decisions according to the current state, then selects a decision from the explorable decisions and evaluates the performance corresponding to that decision. After N iterations, the performance $X_\pi$ and the number of explorations $B_\pi$ of each explored decision $\pi$ are obtained, so as to guide the selection of reliable communication decisions.
  • the output of the known model of the communication system may include the performance of the above-mentioned explorable decision.
  • the real experience of the communication system, i.e. historical information, can also be queried under the current state of the communication system.
  • the historical information can give the performance $X_\pi$ and the number of explorations $B_\pi$ corresponding to each explored decision $\pi$.
  • the historical information is equivalent to having completed multiple explorations. Based on it, the selection of reliable communication decisions can be guided; because the historical information can be regarded as a model of the communication system, this process can also be called model-based exploration.
  • the historical information may include decisions and corresponding performances selected by the device or other devices in the history under the same or similar communication system state, and these selected decisions are exploratory decisions.
  • the output of the communication system state, the known model of the communication system and the historical information can also be used to train the neural network model, and the trained neural network model can in turn guide the selection of reliable communication decisions.
  • the communication system obtains a real performance after taking a reliable communication decision. The decision and the real performance can be stored in the historical information, and the historical information can be used to train the neural network model, so that for each communication system state the neural network model can output the estimated performance $X_\pi$ of a decision $\pi$ and the reliable exploration coefficient $C_\pi$. By computing the neural network iteratively many times, the number of explorations $B_\pi$ of each decision can be obtained, and reliable communication decisions are chosen based on the output of the neural network.
  • $Y_\pi = X_\pi + C_\pi \cdot B_\pi$
  • a decision $\pi$ that maximizes $Y_\pi$ can be selected.
  • the reliable exploration coefficient $C_\pi$ can be preset, obtained by looking up a table, calculated according to the current decision exploration situation, obtained through negotiation between communication devices during communication, or output by the neural network according to the current state of the communication system.
  • the reliable exploration coefficient $C_\pi$ balances the exploration item and the utilization item, where $X_\pi$ corresponds to the utilization item and $B_\pi$ corresponds to the exploration item.
  • the aim of the trade-off is that the chosen decisions do not perform too badly while less-explored decisions still get explored more.
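  • as a minimal sketch of the selection rule above, the score Y of each decision can be computed from its performance X, its reliable exploration coefficient C and its number of explorations B, and the decision with the largest score chosen. Combining C and B as a product is an assumption, and the function name `select_reliable_decision` and the dictionary layout are hypothetical.

```python
def select_reliable_decision(perf, explored, coeff):
    """Return the decision with the largest Y = X + C * B, plus all scores.

    perf, explored and coeff map each candidate decision to its estimated
    performance X, its number of explorations B and its reliable exploration
    coefficient C. Treating C and B as a product is an assumption.
    """
    scores = {d: perf[d] + coeff[d] * explored[d] for d in perf}
    best = max(scores, key=scores.get)
    return best, scores
```

  • with a coefficient of zero the rule reduces to picking the best-performing decision; a positive coefficient shifts the choice toward decisions whose estimates rest on more explorations.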
  • the reliable communication decision selected above may be a series of decisions, and selecting a reliable communication decision is equivalent to selecting a multi-step continuous decision.
  • the states of the communication systems described above may be different for different tasks.
  • the state of the communication system may be the location of the terminal.
  • the state of the communication system may be the current code configuration.
  • the above-mentioned explorable decision refers to a decision that is feasible under the current state of the communication system. For example, in a handover task, when the terminal is at a certain position there may be only two options, handover to base station A or handover to base station B, and no option such as handover to base station C; the only explorable decisions at this time are handover to A or handover to B.
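  • the handover example can be illustrated with a toy feasibility check. The circular coverage model, the coordinates and the function name `explorable_handover_decisions` are assumptions made only for this sketch; the application does not specify how feasibility is determined.

```python
import math


def explorable_handover_decisions(terminal_xy, base_stations):
    """Return the base stations the terminal could be handed over to.

    base_stations maps a name to (x, y, coverage_radius); a station is
    treated as feasible when the terminal lies inside its radius
    (a simplifying assumption).
    """
    x, y = terminal_xy
    return sorted(
        name
        for name, (bx, by, radius) in base_stations.items()
        if math.hypot(x - bx, y - by) <= radius
    )
```

  • for a terminal at (75, 0) with stations A at (0, 0), B at (150, 0) and C at (1000, 0), each with radius 100, only A and B are returned, so only those two handovers would be explored.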
  • the above communication system model can obtain the performance corresponding to the decision according to the state of the communication system.
  • the communication system model may be a Monte Carlo simulator, a model formula, a neural network model, or the like.
  • for different tasks, the model formula can be different.
  • the model formula may be at least one of a channel capacity formula, a signal-to-noise ratio (SNR) formula, an energy efficiency formula, a spectrum efficiency formula, and other related calculation formulas of the communication system.
  • the performance of the above decision can also be described as the performance corresponding to the decision, which is the performance that the communication system has after selecting and executing a certain decision. Executing this decision may be performed in a simulation environment, and the performance in the corresponding simulation environment may be obtained; it may also be executed in a real communication network, and the performance in the corresponding real environment may be obtained. Likewise, for different tasks, the performance of the decision may be at least one of communication system-related performance indicators such as channel capacity, signal-to-noise ratio, energy efficiency, and spectral efficiency.
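  • as one concrete instance of a model formula, the Shannon capacity of an additive white Gaussian noise channel, $C = B \log_2(1 + \mathrm{SNR})$, can serve as the performance returned for a decision. The function names below are hypothetical, and choosing this particular formula is an assumption for illustration only.

```python
import math


def snr_linear(signal_power_w, noise_power_w):
    """Signal-to-noise ratio as a linear (not dB) power ratio."""
    return signal_power_w / noise_power_w


def channel_capacity_bps(bandwidth_hz, snr):
    """Shannon capacity of an AWGN channel in bit/s."""
    return bandwidth_hz * math.log2(1.0 + snr)


def spectral_efficiency(capacity_bps, bandwidth_hz):
    """Capacity normalised by bandwidth, in bit/s/Hz."""
    return capacity_bps / bandwidth_hz
```

  • for example, a 1 MHz channel with a linear SNR of 1 has a capacity of 1 Mbit/s and a spectral efficiency of 1 bit/s/Hz.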
  • the above-mentioned reliable communication decision is the final decision after multiple explorations.
  • the reliable communication decision can also be regarded as a decision to be explored that is actually executed in the communication system.
  • the decision to be explored is a single-step exploration decision; it may be executed under simulation, whereas the reliable communication decision is executed in the real communication system.
  • the present application can use known models and/or historical information and/or neural network models in the communication system to guide decision-making exploration, avoid random exploration, and help the communication system operate reliably.
  • FIG. 3 is a schematic flowchart of the method for selection decision provided by the present application.
  • the method shown in FIG. 3 may include at least part of the following.
  • in step 310, state information of the communication system is acquired.
  • the above state information of the communication system is information used to represent the state of the communication system, and may be different for different tasks.
  • the status information may be information about the location of the terminal device.
  • the state information may be channel state information of the channel between the access network device and the terminal device.
  • the state information may be information of the current code construction.
  • in step 320, the performance of each of the M first decisions and/or the number of times each first decision has been explored is determined according to the state information of the communication system, where M is a positive integer. It should be noted that this application makes no distinction between the number of times a decision has been explored, the number of visits to the decision, and the number of times the decision has been made.
  • the M first decisions are explorable decisions under the state information.
  • the performance of the first decision may be arithmetic average performance, weighted average performance, maximum or minimum performance, cumulative performance, etc. of the first decision, which is not specifically limited.
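  • the alternative performance statistics listed above can be computed from the individual exploration results of one decision as in the following sketch. The function name `summarise_performance` and the dictionary keys are hypothetical.

```python
def summarise_performance(samples, weights=None):
    """Aggregate the per-exploration performances of one decision.

    samples: performance observed in each exploration of the decision.
    weights: optional per-sample weights for the weighted average
             (equal weights are assumed when omitted).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if weights is None:
        weights = [1.0] * len(samples)
    return {
        "arithmetic_average": sum(samples) / len(samples),
        "weighted_average": sum(w * s for w, s in zip(weights, samples)) / sum(weights),
        "maximum": max(samples),
        "minimum": min(samples),
        "cumulative": sum(samples),
    }
```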
  • the current state of the communication system can be used as input to the model of the communication system, and the performance of each of the M first decisions and/or the number of times each first decision has been explored is obtained through calculation with the communication system model.
  • the following steps can be performed N times in a loop to obtain the performance of each first decision and/or the number of times each first decision has been explored, where N is an integer greater than 1: select the first decision to be explored from the M first decisions according to the current performance of each first decision and/or the number of times each first decision has currently been explored; update the performance corresponding to the first decision to be explored according to the above state information and the model of the communication system; and/or add 1 to the number of times the first decision to be explored has been explored.
  • the first decision to be explored may be selected from the M first decisions according to the current performance of each first decision and/or the number of times each first decision has currently been explored, together with the exploration coefficient of each first decision.
  • the exploration coefficient is used to control the tendency in choosing the first decision, balancing exploration, utilization and reliability.
  • the tendency may be to choose more first decisions with fewer explorations, more first decisions with fewer explorations and better performance, or more first decisions with better performance, and so on.
  • the settings of exploration coefficients can be different.
  • the classical upper confidence bound (UCB) algorithm can be used in combination with the design of reliable communication, which is called reliable upper confidence bound (R-UCB) exploration in this application.
  • in formula 1, a value $y_1$ is computed for each of the M first decisions, and the first decision with the largest $y_1$ is selected as the first decision to be explored. $X_{1d}$ is the current performance of the d-th first decision among the M first decisions, $N_1$ is the total number of times the M first decisions have currently been explored, $N_{1d}$ is the number of times the d-th first decision has currently been explored, and $C_1$ is the exploration coefficient of each first decision, where $C_1$ is a constant.
  • $X_{1d}$ is the utilization item, and the remaining term, which depends on $N_1$ and $N_{1d}$, is the exploration item.
  • $C_1$ is used to control the tendency when choosing a decision, which can also be understood as changing the ratio between the utilization item and the exploration item: the larger $C_1$ is, the more the choice favors decisions that have been explored less, and the smaller $C_1$ is, the more it favors better-performing decisions.
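  • the exploration loop can be sketched as follows, under the assumption that formula 1 takes the classical UCB form $y_1 = X_{1d} + C_1 \sqrt{\ln N_1 / N_{1d}}$. That form, the rule of trying unexplored decisions first, the use of an arithmetic average for $X_{1d}$, and the names `select_first_decision`, `explore` and `model` are all assumptions for illustration, not the definitive method of this application.

```python
import math


def select_first_decision(perf, counts, c1):
    """Pick the index d with the largest y_1 (assumed classical UCB form)."""
    for d, n in enumerate(counts):
        if n == 0:  # an unexplored decision has no estimate yet: try it first
            return d
    n1 = sum(counts)  # N_1: total explorations so far
    scores = [x + c1 * math.sqrt(math.log(n1) / n) for x, n in zip(perf, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def explore(model, state, m, n_iters, c1):
    """Run the select / evaluate / update loop n_iters times."""
    perf = [0.0] * m   # X_1d, kept here as an arithmetic average
    counts = [0] * m   # N_1d
    for _ in range(n_iters):
        d = select_first_decision(perf, counts, c1)
        sample = model(state, d)  # performance returned by the system model
        counts[d] += 1
        perf[d] += (sample - perf[d]) / counts[d]  # incremental mean update
    return perf, counts
```

  • calling `explore(model, state, 3, 50, 0.5)` with a model that returns fixed performances 0.2, 0.8 and 0.5 concentrates most of the 50 explorations on the second decision while still revisiting the others occasionally.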
  • the communication device may record and store the historical decisions, and the corresponding performances, of the device or other devices under the same or similar communication system state. The performance of each of the M first decisions and/or the number of times each first decision has been explored can then be read from the stored records.
  • the communication device can store the historical decisions and corresponding performances of the device or other devices under different communication system states in the form of tables, and obtain the performance of each of the M first decisions and/or the number of times each first decision has been explored by looking up the table for the current state.
  • Mode 2 does not need to repeat single explorations one after another, because historical information can be regarded as multiple explorations already performed in history.
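  • a table of historical information can be sketched as a mapping from communication system state to per-decision records. The class name `ExplorationHistory`, the exact-match state key and the use of an average as the stored performance are assumptions; matching of merely similar states is not covered by this sketch.

```python
from collections import defaultdict


class ExplorationHistory:
    """Hypothetical table: state -> decision -> [cumulative performance, count]."""

    def __init__(self):
        self._table = defaultdict(dict)

    def record(self, state, decision, performance):
        """Store one executed decision and the performance it achieved."""
        entry = self._table[state].setdefault(decision, [0.0, 0])
        entry[0] += performance
        entry[1] += 1

    def lookup(self, state):
        """Return {decision: (average performance, times explored)} for a state."""
        return {
            decision: (total / count, count)
            for decision, (total, count) in self._table.get(state, {}).items()
        }
```

  • a lookup for a state that was never recorded returns an empty mapping, in which case the device would have to fall back on another exploration mode.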
  • in step 330, a target first decision is determined from the M first decisions according to the performance of each of the M first decisions and/or the number of times each first decision has been explored.
  • the target first decision may correspond to the reliable communication decision described above.
  • the first decision with the best performance may be selected from the M first decisions as the target first decision.
  • the first decision that has been explored the most times may be selected from the M first decisions as the target first decision.
  • the target first decision may be selected in consideration of performance and the number of times it has been explored.
  • the target first decision may be selected from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored, and the exploration coefficient of each first decision. .
  • the exploration coefficient is used to control the tendency to choose the first decision, balancing exploration, utilization and reliability.
  • the tendency may be to choose more first decisions with fewer explorations, more first decisions with fewer explorations and better performance, or more first decisions with better performance, and so on.
  • the settings of exploration coefficients can be different.
  • the first decision to be explored can be determined from the M first decisions according to the following formula 2:
  • $X_{2d}$ is the utilization item, and the remaining term, which depends on $N_2$ and $N_{2d}$, is the exploration item. $X_{2d}$ is the performance of the d-th first decision among the M first decisions, $N_2$ is the total number of times the M first decisions have been explored, $N_{2d}$ is the number of times the d-th first decision has been explored, and $C_2$ is the exploration coefficient of each first decision. $C_2$ may be a constant, may vary with $N_{2d}$, or may be determined by the neural network model.
  • $C_2$ is used to control the tendency when choosing a decision, which can also be understood as changing the ratio between the utilization item and the exploration item: the larger $C_2$ is, the more the choice favors decisions that have been explored less, and the smaller $C_2$ is, the more it favors better-performing decisions.
  • when selecting a reliable communication decision, if $C_2$ is set to vary with $N_{2d}$, $C_2$ may be set to 0 when $N_{2d}$ is less than a preset threshold. In this way, although some explorable decisions have potential exploration value, they have been explored too few times, so executing them in the communication system may bring unreliable results; such decisions are therefore not explored, which helps the communication system operate reliably.
  • here $N_t$ denotes the preset threshold, which is no greater than $N_2$, and the value that $C_2$ takes otherwise is a preset constant. For example, with a preset constant of 0.001 and a preset threshold of 10, $C_2$ is 0 for any first decision that has been explored fewer than 10 times.
  • this setting of $C_2$ is a step design, i.e. $C_2$ is either 0 or another value.
  • $C_2$ can also be set to change continuously, that is, the value of $C_2$ changes continuously with $N_{2d}$.
  • the setting of $C_2$ can differ when the communication system adopts different exploration methods.
  • for example, when exploring through the model of the communication system, $C_2$ is a fixed value; when using historical information as the basis for exploration, $C_2$ can be set to a value that varies with $N_{2d}$.
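  • the step design of the exploration coefficient can be sketched as follows. The exploration item is again assumed to take the classical UCB form $C_2 \sqrt{\ln N_2 / N_{2d}}$, and the names `exploration_coefficient` and `select_target`, as well as the default values, are hypothetical.

```python
import math


def exploration_coefficient(n_2d, threshold, constant):
    """Step design: 0 below the preset threshold, a preset constant otherwise."""
    return 0.0 if n_2d < threshold else constant


def select_target(perf, counts, threshold=10, constant=0.001):
    """Return the index of the first decision with the largest score.

    Decisions explored fewer than `threshold` times receive no exploration
    bonus, so they are not favored merely for being under-explored.
    """
    n_2 = sum(counts)  # N_2: total number of explorations
    best, best_y = None, -math.inf
    for d, (x, n) in enumerate(zip(perf, counts)):
        c_2 = exploration_coefficient(n, threshold, constant)
        bonus = c_2 * math.sqrt(math.log(n_2) / n) if c_2 > 0 and n > 0 else 0.0
        if x + bonus > best_y:
            best, best_y = d, x + bonus
    return best
```

  • with two decisions of equal performance 0.70 that have been explored 2 and 50 times, a threshold of 10 selects the well-explored one, whereas a threshold of 0 (a plain constant coefficient) would select the barely explored one.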
  • a multi-step decision can be represented by a tree.
  • the tree composed of multi-step decisions can be called a decision tree.
  • each node in the decision tree can correspond to an explorable decision.
  • the communication system model or historical information can be used to determine the performance of each node in the tree structure and the number of times it has been explored; similarly, a tree-shaped table that records historical information can be used for table lookup exploration.
  • the performance of the decision corresponding to the parent node is the sum of the performances of the decisions corresponding to all the child nodes, and the number of times the decision corresponding to the parent node has been explored is the sum of the times the decision corresponding to all the child nodes has been explored.
  • the state of the child node is the result of the decision to select the corresponding child node on the basis of the state of the parent node.
  • the performance of decision B in state A is the sum of the performance of decision C in state A+B and the performance of decision D in state A+B, and the number of times decision B has been explored in state A is the sum of the number of times decision C has been explored in state A+B and the number of times decision D has been explored in state A+B
  • the performance of decision C in state A+B is the performance of decision E in state A+B+C, and the number of times decision C has been explored in state A+B is the number of times decision E has been explored in state A+B+C
  • the following actions can be continued: according to the current performance of each of the K second decisions and/or the current number of times each second decision has been explored, selecting a second decision to be explored from the K second decisions, where the K second decisions are decisions that can be explored after the first decision to be explored is selected; according to the state information and the model of the communication system, updating the performance corresponding to the second decision to be explored, and/or adding 1 to the number of times the second decision to be explored has been explored; and, according to the performance corresponding to the second decision to be explored, updating the performance corresponding to the first decision to be explored, and/or adding 1 to the number of times the first decision to be explored has been explored.
  • updating the performance corresponding to the first decision to be explored according to the performance corresponding to the second decision to be explored can be understood as adding the performance of the second decision to be explored to the performance of the original first decision.
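The parent/child bookkeeping described above can be sketched as a small tree in which each exploration result is added to every decision on the explored path. This is an illustrative sketch only: the `Node` class, the `back_up` function, and the performance values 0.8 and 0.5 are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One explorable decision in the decision tree."""
    performance: float = 0.0   # accumulated performance of this decision
    visits: int = 0            # number of times this decision has been explored
    children: Dict[str, "Node"] = field(default_factory=dict)


def back_up(root: Node, path: List[str], estimated_performance: float) -> None:
    """Add one exploration result to every decision along the explored path."""
    node = root
    for name in path:
        node = node.children.setdefault(name, Node())
        node.performance += estimated_performance
        node.visits += 1


root = Node()                         # state A
back_up(root, ["B", "C", "E"], 0.8)   # explore B, then C, then E
back_up(root, ["B", "D"], 0.5)        # explore B, then D
b = root.children["B"]
```

After these two explorations, the statistics of B equal the sums over its children C and D, and the statistics of C equal those of its only child E, which matches the relations stated for states A, A+B, and A+B+C.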
  • the method shown in FIG. 3 may further include steps 340 and 350 .
  • in step 340, the performance of each of the K second decisions and/or the number of times each second decision has been explored is determined, where the K second decisions are decisions that can be explored after the target first decision is selected, and K is a positive integer.
  • in step 350, a target second decision is determined from the K second decisions according to the performance corresponding to each second decision and/or the number of times each second decision has been explored.
  • the method for determining the second target decision is the same as or similar to that for determining the first target decision, and reference may be made to the above related descriptions.
  • the difference is that the state of the communication system is the state after the target first decision has been selected.
  • FIG. 3 shows only two steps of the multi-step exploration; in practice the exploration may include more steps.
  • the method for exploring decision-making using the communication system model and historical information is described above, and the method for assisting the exploration decision-making by the neural network model is described below.
  • the decision space is very large, that is, the decision tree is both wide and deep.
  • N explorations in similar states have no mechanism for learning from one another and evolving. For example, assuming that the neighboring base stations and environments of base stations A, B, and C are similar, the exploration of UE handover at the same relative position follows similar patterns, so repeating it independently for each base station is wasteful.
  • the present application can use the neural network model to assist decision-making exploration, and assist in reducing the exploration space.
  • the output of model-based decision exploration can be used to train a neural network model, and the trained neural network model can in turn guide model-based decision exploration.
  • a communication system will get a realistic performance after taking a reliable communication decision, and a neural network model can be trained based on this reliable communication decision and performance.
  • the neural network model can be used to assist in reducing the exploration space since the neural network model has been fitted to the results of multiple explorations of other devices in history.
  • a factor P d can be added to the exploration item to generalize and prune the decision tree, as shown in formula 3:
  • the exploration coefficient is C ⁇ P d
  • P d is output by the neural network model.
  • here and below, the parameters of Formula 1 and Formula 2 are written uniformly as y, X d , C, N, and N d , without distinguishing between the two formulas. For their specific descriptions, refer to Formula 1 and Formula 2; they are not repeated here.
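The effect of scaling the exploration coefficient by P d can be sketched as follows. Formula 3 itself is not reproduced in this text, so the sketch assumes a UCB-style exploration term sqrt(ln N / N d ) and treats X d as a mean performance; the function name, the statistics, and the P d values (which would be output by the neural network model) are hypothetical.

```python
import math


def prior_weighted_score(x_d: float, n_total: int, n_d: int, c: float, p_d: float) -> float:
    """x_d plus an exploration term scaled by C * P_d (assumed UCB-style term)."""
    exploration_term = math.sqrt(math.log(n_total) / n_d)
    return x_d + c * p_d * exploration_term


# Hypothetical statistics: decision -> (x_d, n_d, p_d).
stats = {"A": (0.60, 50, 0.70), "B": (0.55, 5, 0.25), "C": (0.50, 5, 0.05)}
n_total = sum(n_d for _, n_d, _ in stats.values())
scores = {
    d: prior_weighted_score(x, n_total, n, c=1.414, p_d=p)
    for d, (x, n, p) in stats.items()
}
best = max(scores, key=scores.get)
```

Decisions B and C have been explored equally often, but the smaller P d of C shrinks its exploration bonus, which is how a small P d effectively prunes a branch of the decision tree.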
  • various parameters used in the decision exploration may be determined according to the task type or application scenario that triggers the decision exploration.
  • Each parameter may include at least one of N, C, C r , ⁇ , and N t .
  • a reliable exploration parameter table may be recorded or stored on the communication device, and each parameter is determined by looking up the table according to the task type.
  • Table 1 is an example of a reliable exploration parameter table.
  • Table 1. Example reliable exploration parameter table
      Task             | Performance indicator | Parameter C r | Parameter ⁇ | Total number of explorations N
      1. Handover      | SNR                   | 1.414         | 0.01        | 1000
      2. UE pairing    | Spectral efficiency   | 2.5           | 0.001       | 100
      3. Power control | Spectral efficiency   | 2.0           | 0.001       | 200
      ...              | ...                   | ...           | ...         | ...
  • C r should be smaller, which reduces the weight of the exploration term; ⁇ should be larger, so that decisions that have been explored too few times do not gain exploration confidence; N should be larger, so that the total number of explorations is larger and the final decision is more accurate.
  • when the task that triggers decision exploration is a handover task, the performance indicator can be the signal-to-noise ratio (SNR), C r is 1.414, ⁇ is 0.01, and N is 1000; when the task is a UE pairing task, the performance indicator can be spectral efficiency, C r is 2.5, ⁇ is 0.001, and N is 100; when the task is a power control task, the performance indicator can be spectral efficiency, C r is 2.0, ⁇ is 0.001, and N is 200; and so on.
  • FIGS. 6 and 7 are examples of exploration assisted by a model of the communication system.
  • to output a reliable communication decision, model-based exploration should, before producing an output, simulate the performance gains of as many decisions as possible and use the simulation results as the basis for the final reliable communication decision. The more simulations are run, the more system resources are consumed, so a trade-off is needed.
  • model-based exploration simulates N explorations before each decision. In each exploration, it selects an explorable decision (which can correspond to the first decision to be explored above), uses the communication system model to obtain the estimated performance of that decision from the input state of the communication system and the selected decision, adds the estimated performance to the cumulative performance of the selected decision, and records how many times the decision has been selected during the N explorations. After N explorations, the explorable decision with the most visits can be output as the reliable communication decision, or the reliable communication decision can be selected by jointly considering performance and number of visits.
  • for example, there are explorable decisions A, B, C, and D.
  • an explorable decision B is selected.
  • the explorable decision is output.
  • the obtained estimated performance is accumulated into the performance B, and then 1 is added to the number of visits NB.
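The N-step simulated exploration described above can be sketched as a loop that selects a decision, obtains its estimated performance from a system model, and accumulates performance and visit counts. This is a hedged sketch: the `explore` function, the UCB-style selection rule, the Gaussian stand-in for the communication system model, and the mean values assigned to A, B, C, and D are assumptions made for illustration only.

```python
import math
import random
from typing import Callable, Dict, List


def explore(decisions: List[str],
            simulate: Callable[[str], float],
            n_explorations: int,
            c: float = 1.414) -> Dict[str, Dict[str, float]]:
    """Run N simulated explorations; return accumulated performance and visit counts."""
    stats = {d: {"performance": 0.0, "visits": 0} for d in decisions}
    for t in range(1, n_explorations + 1):
        def score(d: str) -> float:
            s = stats[d]
            if s["visits"] == 0:
                return math.inf  # assumption: in simulation, every decision is tried once
            mean = s["performance"] / s["visits"]
            return mean + c * math.sqrt(math.log(t) / s["visits"])
        chosen = max(decisions, key=score)
        stats[chosen]["performance"] += simulate(chosen)  # estimate from the system model
        stats[chosen]["visits"] += 1
    return stats


random.seed(0)
true_mean = {"A": 0.2, "B": 0.8, "C": 0.5, "D": 0.4}  # hypothetical model behavior
stats = explore(list(true_mean), lambda d: random.gauss(true_mean[d], 0.1), 1000)
reliable_decision = max(stats, key=lambda d: stats[d]["visits"])
```

Here the decision with the most visits is output; a variant could instead rank decisions by a combination of mean performance and visit count, as the text allows.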
  • the channel capacity formula, the signal-to-noise ratio formula, the energy efficiency formula, the spectrum efficiency formula and other communication formulas can be used to calculate the result of selecting the explorable decision B under the current communication system state.
  • the performance of the system for explorable decision B can also be simulated with a Monte Carlo simulator in general scenarios (for example, Rayleigh channels); a generative adversarial network (GAN) can also be used as a scenario simulator to simulate a specific communication scenario and, combined with the simulator, obtain the performance of explorable decision B for the current communication system state in that scenario.
  • taking Table 2 as an example, the table lookup method based on historical information is used to describe how to explore and output reliable decisions.
  • the classical UCB algorithm would choose explorable decision F: because F has never been explored, the value of the UCB formula for it is infinite. For this reason the classical UCB algorithm is considered an optimistic algorithm. In reliable communication, however, this kind of optimism can lead to system crashes, so the formula is restricted.
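The difference between the classical rule and the restricted rule can be illustrated on a small history table. The table contents below are hypothetical (Table 2 is not reproduced here), the UCB-style term sqrt(ln N / N d ) is an assumed form, and the threshold 10 and coefficient 1.414 are example values.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical history for one state: decision -> (mean performance, times explored).
history: Dict[str, Tuple[float, int]] = {
    "B": (0.70, 400),
    "C": (0.65, 300),
    "E": (0.60, 3),   # explored too few times to be trusted
    "F": (0.00, 0),   # never explored
}
n_total = sum(n_d for _, n_d in history.values())


def classical_ucb(x_d: float, n: int, n_d: int, c: float = 1.414) -> float:
    if n_d == 0:
        return math.inf  # optimistic: an unexplored decision always wins
    return x_d + c * math.sqrt(math.log(n) / n_d)


def restricted_ucb(x_d: float, n: int, n_d: int, c: float = 1.414, threshold: int = 10) -> float:
    if n_d < threshold:
        return x_d       # coefficient is 0 below the threshold: no exploration bonus
    return x_d + c * math.sqrt(math.log(n) / n_d)


def select(score: Callable[[float, int, int], float]) -> str:
    return max(history, key=lambda d: score(history[d][0], n_total, history[d][1]))


classical_choice = select(classical_ucb)
restricted_choice = select(restricted_ucb)
```

With this data the classical rule picks the never-explored decision F, while the restricted rule picks a well-explored decision, since rarely explored decisions receive no exploration bonus.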
  • the "tree communication strategy value table" in state A is shown in Table 3.
  • the multi-step decision-making sequences B, C, and E will be selected with a higher probability, that is, user 1, user 2, and user 3 are selected for pairing.
  • This example is an example in which the method of the present application is applied to a channel coding scenario.
  • the exploration of the nested code structure of polar code is a decision tree with a large space.
  • the communication system model can be a Monte Carlo simulator of channel coding and decoding, the performance indicator can be -log(BLER), and a neural network model can be used for generalization and pruning of the decision tree.
  • 0, 1, 2, 3, and 4 are the reliability ranking position indications constructed by Polar's nested code. As the code length increases, the tree will be deeper and wider. Here is only a partial decision tree constructed by a Polar code nesting. Assuming that 0 is known to be the most reliable information bit position, 0 is directly determined as the parent node. Next, there are multiple options for the position of the next information bit, corresponding to multiple child nodes.
  • the performance of each child node is obtained by simulating the sequences 0->1, 0->2, and 0->4 respectively. From the simulated performance of the child nodes, the route 0->1 is found to perform best. The decisions below node 1 are then explored: the performances of 0->1->4 and 0->1->2 are compared, 0->1->2 is found to perform better, and the performances of 0->1->2 and 0->1->4 are added to the parent node 1. Note that 0->1 performing better than 0->2 does not mean that 0->1->2 necessarily performs better than 0->2->1.
  • the indicator is based on the aforementioned method of balancing exploration and utilization.
  • a mapping table can be designed and standardized, in which each reliable exploration level corresponds to a set of reliable exploration parameters. The transmitting end and the receiving end keep the same mapping table, so when signaling the reliable exploration parameters, the transmitting end needs to send only the sequence number of the reliable exploration level, and the receiving end can obtain the corresponding parameters.
  • a mapping table between tasks, reliable exploration levels, and performance indicators can also be preset. In this case the reliable exploration level and the performance indicator need not be transmitted, and the parameters corresponding to the task in the table are used for exploration.
  • Tables 4 and 5 are examples of reliable exploration ranking initial tables.
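  • Table 4. Reliable exploration levels
      Reliable exploration level | Parameter ⁇ | Total number of explorations N
      Level 0                    | 0.01        | 10000
      Level 1                    | 0.001       | 1000
      Level 2                    | 0.0001      | 100
      ...                        | ...         | ...
  • Table 5. Tasks, performance indicators, and reliable exploration levels
      Reliable exploration task | Performance indicator | Reliable exploration level
      1. Handover               | SNR                   | Level 0
      2. UE pairing             | Spectral efficiency   | Level 1
      3. Power control          | Spectral efficiency   | Level 2
      ...                       | ...                   | ...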
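The two mapping tables can be sketched as lookups shared by the transmitting and receiving ends. The values are taken from Tables 4 and 5; the type, field, key, and function names are hypothetical and chosen only for this illustration.

```python
from typing import Dict, NamedTuple, Tuple


class ExplorationParams(NamedTuple):
    level_param: float        # parameter in the second column of Table 4
    total_explorations: int   # N


# Table 4: reliable exploration level -> parameters
LEVEL_TABLE: Dict[int, ExplorationParams] = {
    0: ExplorationParams(0.01, 10000),
    1: ExplorationParams(0.001, 1000),
    2: ExplorationParams(0.0001, 100),
}

# Table 5: task -> (performance indicator, reliable exploration level)
TASK_TABLE: Dict[str, Tuple[str, int]] = {
    "handover": ("SNR", 0),
    "ue_pairing": ("spectral_efficiency", 1),
    "power_control": ("spectral_efficiency", 2),
}


def params_for_level(level: int) -> ExplorationParams:
    """Receiving end: recover the parameters from the signalled level index."""
    return LEVEL_TABLE[level]


def params_for_task(task: str) -> Tuple[str, ExplorationParams]:
    """No level signalling: both ends derive indicator and parameters from the task."""
    indicator, level = TASK_TABLE[task]
    return indicator, LEVEL_TABLE[level]
```

Signalling only the level index keeps the message small, at the cost of requiring both ends to hold identical copies of the tables.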
  • the methods shown in FIG. 3 to FIG. 8 may be executed by a terminal, an access network device, or a core network device, or by a module or unit (for example, a chip, a circuit, or a system on chip (SOC)) used in a terminal, an access network device, or a core network device. The following description takes execution by a terminal, an access network device, or a core network device as an example.
  • a decision exploration table may be exchanged among the terminal, the access network equipment, and the core network equipment.
  • the access network device needs to assist the terminal in exploring better performance in the handover task, so the access network device should query and authenticate the terminal's reliable exploration capability (VIP client).
  • the terminal has specific requirements for the reliability of its own tasks.
  • the access network device should query the terminal's specific settings when setting the exploration parameters. For high-value terminals within its coverage, the access network device can register reliable exploration permission on the core network device. For example, terminals that appear frequently within the coverage of the access network device are very helpful for effective and continuous exploration; registering reliable exploration permission for such terminals helps the access network device improve reliable exploration performance and accumulate experience.
  • FIG. 9 is a schematic flowchart of registering a reliable exploration capability for a terminal in a core network device.
  • in step 901, the access network device sends a first message to the terminal to query the reliable exploration capability of the terminal.
  • the terminal receives the first message from the access network device.
  • the reliable exploration capability here can be understood as whether the terminal supports selecting decisions with the methods shown in FIGS. 3 to 8.
  • the first message may be a system information block (SIB) or a master information block (MIB) message.
  • the first message may include at least one of a ueCapability field, a reliableSearchFlg field, and a reliableLevel field.
  • in step 902, the terminal sends a second message to the access network device to feed back its reliable exploration capability.
  • the access network device receives the second message from the terminal.
  • in step 903, if the terminal feeds back that it has the reliable exploration capability, the access network device sends a third message to the core network device to request registration of the reliable exploration capability for the terminal.
  • the core network device receives the third message from the access network device.
  • in step 904, the core network device completes the registration and sends a fourth message to the access network device to indicate that registration of the reliable exploration capability is complete.
  • the access network device receives the fourth message sent by the core network device.
  • in step 905, the access network device sends a fifth message to the terminal to indicate that registration of the reliable exploration capability is complete.
  • the terminal receives the fifth message sent by the access network device.
  • the core network device may also authenticate the terminal's reliable exploration capability to determine whether to allow the terminal to explore decisions. That is, step 906 may be performed before step 904, and step 904 is performed only when the core network device allows the terminal to explore decisions; otherwise, a registration exception or registration failure is fed back to the terminal.
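The registration flow of steps 901 to 906 can be traced with a small simulation. This sketch only models the order of messages and the two exit conditions; the function name, the set-based registry, and the allow-list used for authentication are hypothetical, and no real message encoding is implied.

```python
from typing import List, Set


def register_reliable_exploration(terminal_id: str,
                                  terminal_supports: bool,
                                  allowed_terminals: Set[str],
                                  registry: Set[str]) -> List[str]:
    """Trace the messages of steps 901 to 906 for one terminal, in order."""
    trace = ["901: access network -> terminal: query reliable exploration capability"]
    trace.append("902: terminal -> access network: capability = %s" % terminal_supports)
    if not terminal_supports:
        return trace                       # nothing to register
    trace.append("903: access network -> core network: request registration")
    if terminal_id not in allowed_terminals:   # step 906: authentication by the core network
        trace.append("906: core network: not allowed, registration failure fed back")
        return trace
    registry.add(terminal_id)
    trace.append("904: core network -> access network: registration completed")
    trace.append("905: access network -> terminal: registration completed")
    return trace


registry: Set[str] = set()
ok_trace = register_reliable_exploration("ue1", True, {"ue1"}, registry)
denied_trace = register_reliable_exploration("ue2", True, {"ue1"}, registry)
```

The allow-list stands in for whatever authentication policy the core network device applies in step 906; the text does not specify that policy.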
  • the access network device may also query the terminal for parameters used in the exploration decision, so that the access network device estimates the time required for the decision and exploration, so as to perform reasonable processing. Specifically, steps 907-908 may be performed.
  • step 907 the access network device sends a sixth message to the terminal for querying the parameters used in the exploration decision.
  • in step 908, the terminal may send a seventh message to the access network device to feed back the parameters used in the exploration decision.
  • the terminal may obtain at least part of the parameters, the communication system model, and the historical information required for the exploration decision from the access network device. For example, the terminal may acquire the above-mentioned content from the access network device in the process shown in FIG. 9 .
  • the methods shown in FIGS. 3 to 8 may also be performed by the access network device.
  • step 1001 the terminal sends an eighth message to the access network device, which is used to request to start a decision-making exploration.
  • the access network device receives the eighth message sent by the terminal.
  • the eighth message may include at least one of a task type that triggers the decision-making exploration, a model of a communication system, a neural network model parameter, and the like.
  • the reliable exploration task list of the eighth message includes the handover task
  • the reliable exploration performance index includes a communication formula, a simulator type, and the like.
  • in step 1002, the access network device sends a ninth message to the terminal to feed back that decision exploration has started.
  • step 1003 the access network device performs decision exploration according to the methods shown in FIG. 3 to FIG. 8, and sends the exploration result to the terminal.
  • the terminal receives the exploration result sent by the access network device.
  • the present application does not limit the manner in which the access network device feeds back the exploration result.
  • the access network device may send all the exploration results to the terminal at one time after the multi-step exploration is completed.
  • the access network device may send an exploration result to the terminal each time one step of exploration is performed, so that the results of the multi-step exploration are sent to the terminal through multiple messages.
  • step 1004 after receiving the required exploration result, the terminal sends a tenth message to the access network device for requesting to end the multi-step exploration.
  • the access network device receives the tenth message sent by the terminal.
  • in step 1005, the access network device sends an eleventh message to the terminal to notify the terminal that decision exploration has ended.
  • the access network device may obtain support information related to the exploration decision from other access network devices or core network devices according to the task type, for example, the simulator and simulation conditions required by the communication model, historical information, and the like.
  • in step 1006, the access network device may send a twelfth message to other access network devices to obtain the simulator and simulation conditions required by the communication model.
  • in step 1007, the access network device may send a thirteenth message to the core network device to obtain historical information under the same task.
  • step 1008 after receiving the twelfth message, other access network devices send a fourteenth message to the access network device for feeding back the simulator and simulation conditions required by the communication model.
  • step 1009 after receiving the thirteenth message, the core network device sends a fifteenth message to the access network device for feeding back historical information under the same task.
  • the terminal, the access network device, and the core network device include corresponding hardware structures and/or software modules for performing each function.
  • the units and method steps of each example described in conjunction with the embodiments disclosed in the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software-driven hardware depends on the specific application scenarios and design constraints of the technical solution.
  • FIG. 11 and FIG. 12 are schematic structural diagrams of possible apparatuses for selecting a decision provided by embodiments of the present application. These apparatuses can be used to implement the functions of the terminal, the access network device, or the core network device in the above method embodiments, and can therefore also achieve the beneficial effects of the above method embodiments.
  • the apparatus for selecting a decision may be the terminal 120 or the terminal 130 shown in FIG. 1, the radio access network device 110 shown in FIG. 1, or a core network device; it may also be a module (such as a chip) applied to a terminal, an access network device, or a core network device.
  • the apparatus 1100 includes a processing unit 1110 and a transceiver unit 1120 .
  • when the apparatus 1100 implements the functions of the terminal, the processing unit 1110 may be used to perform steps 310-350, and the transceiver unit 1120 may be used to perform steps 901-902, 905, 907-908, and 1001-1005.
  • when the apparatus 1100 implements the functions of the access network device, the processing unit 1110 may be used to perform steps 310-350, and the transceiver unit 1120 may be used to perform steps 901-905, 907-908, 1001, and 1006-1009.
  • when the apparatus 1100 implements the functions of the core network device, the processing unit 1110 may be used to perform step 906, and the transceiver unit 1120 may be used to perform steps 903-904, 1007, and 1009.
  • the apparatus 1200 includes a processor 1210 and an interface circuit 1220 .
  • the processor 1210 and the interface circuit 1220 are coupled to each other.
  • the interface circuit 1220 can be a transceiver or an input-output interface.
  • the apparatus 1200 may further include a memory 1230 for storing instructions executed by the processor 1210 or input data required by the processor 1210 to execute the instructions or data generated after the processor 1210 executes the instructions.
  • the processor 1210 is used to perform the function of the above-mentioned processing unit 1110
  • the interface circuit 1220 is used to perform the function of the above-mentioned transceiver unit 1120 .
  • the terminal chip implements the functions of the terminal in the above method embodiments. For example, the terminal chip receives information from other modules in the terminal (such as radio frequency modules or antennas), where the information was sent to the terminal by other devices; or the terminal chip sends information to other modules in the terminal (such as radio frequency modules or antennas), where the information is sent by the terminal to other devices.
  • the access network device chip implements the functions of the access network device in the above method embodiments.
  • the chip of the access network device receives information from other modules in the access network device (such as radio frequency modules or antennas), where the information was sent to the access network device by other devices; or the chip of the access network device sends information to other modules in the access network device (such as radio frequency modules or antennas), where the information is sent by the access network device to other devices.
  • the processor in the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), and may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field Programmable Gate Array (Field Programmable Gate Array, FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the method steps in the embodiments of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
  • Software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, removable hard disks, CD-ROMs, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the ASIC may be located in a terminal, an access network device or a core network device.
  • the processor and the storage medium may also exist in the terminal, the access network device or the core network device as discrete components.
  • the computer program product includes one or more computer programs or instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer program or instructions may be stored in or transmitted over a computer-readable storage medium.
  • the computer-readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server that integrates one or more available media.
  • the usable media can be magnetic media, such as floppy disks, hard disks, magnetic tapes; optical media, such as DVD; and semiconductor media, such as solid state disks (SSD).
  • “at least one” means one or more, and “plurality” means two or more.
  • “And/or”, which describes the relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, it can indicate that A exists alone, A and B exist at the same time, and B exists alone, where A, B can be singular or plural.
  • the character "/" generally indicates that the associated objects are in an "or" relationship; in the formulas of this application, the character "/" indicates that the associated objects are in a "division" relationship.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

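The present application provides a method and apparatus for selecting a decision. State information of a communication system may be obtained, and, according to the state information, the performance corresponding to each of M first decisions and/or the number of times each first decision has been explored is determined, where the M first decisions are decisions that can be explored under the state information and M is a positive integer. A target first decision is then determined from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored. In the technical solution of the present application, a decision can be selected according to the performance of the explorable decisions and/or the number of times the explorable decisions have been explored, which avoids random exploration and helps the communication system operate reliably.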

Description

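Method and apparatus for selecting a decision
This application claims priority to Chinese Patent Application No. 202011322773.3, filed with the China National Intellectual Property Administration on November 23, 2020 and entitled "Method and apparatus for selecting a decision", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the communications field, and more specifically, to a method and apparatus for selecting a decision.
Background
Wireless communication systems usually face changing channels, changing environments, and changing users. Non-ideal hardware and non-ideal modeling make it difficult for a communication system to find the optimal decision through theoretical formulas under such changes, so the optimal decision is usually hard to obtain and sometimes can be obtained only through an exhaustive search of very high complexity. A suboptimal decision can be obtained by solving an optimization problem, but solving the optimization problem is also highly complex and is likewise hard in some scenarios.
Deep reinforcement learning can search for the optimal decision through the interaction between a neural network model and an environment; taking the communication system as the environment, deep reinforcement learning can be used to search for the optimal decision for that communication system. Usually, for a complex, specific communication scenario, only exploration can find a better decision. In existing communication systems, however, generic and conservative decisions are often retained so that the system can operate reliably. As a result, most components of the communication system work at non-optimal performance for a long time, which cannot meet the requirements of future high-performance communication systems.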
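Summary
The present application provides a method and apparatus for selecting a decision, which help a communication system operate reliably.
According to a first aspect, the present application provides a method for selecting a decision. The method includes: obtaining state information of a communication system; determining, according to the state information, the performance corresponding to each of M first decisions and/or the number of times each first decision has been explored, where the M first decisions are decisions that can be explored under the state information, and M is a positive integer; and determining a target first decision from the M first decisions according to the performance corresponding to each first decision and/or the number of times each first decision has been explored.
In the above technical solution, a decision can be selected according to the performance of the explorable decisions and/or the number of times the explorable decisions have been explored, which avoids random exploration and helps the communication system operate reliably.
With reference to the first aspect, in a possible implementation, the method further includes: determining the performance corresponding to each of K second decisions and/or the number of times each second decision has been explored, where the K second decisions are decisions that can be explored after the target first decision is selected, and K is a positive integer; and determining a target second decision from the K second decisions according to the performance corresponding to each second decision and/or the number of times each second decision has been explored.
A decision in a communication system is often not a single-step decision but a series of decisions. In the above technical solution, each step of a multi-step decision can select a decision according to the performance of the currently explorable decisions and/or the number of times the currently explorable decisions have been explored, which avoids random exploration and helps the communication system operate reliably.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the determining, according to the state information, the performance corresponding to each of the M first decisions and/or the number of times each first decision has been explored includes: performing the following steps cyclically N times to obtain the performance corresponding to each first decision and/or the number of times each first decision has been explored, where N is an integer greater than 1: selecting a first decision to be explored from the M first decisions according to the current performance of each first decision and/or the current number of times each first decision has been explored; and updating the performance corresponding to the first decision to be explored according to the state information and a model of the communication system, and/or adding 1 to the number of times the first decision to be explored has been explored.
In the above technical solution, a known model of the communication system can be used to guide decision exploration, which avoids random exploration and helps the communication system operate reliably.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the method further includes: selecting a second decision to be explored from K second decisions according to the current performance of each of the K second decisions and/or the current number of times each second decision has been explored, where the K second decisions are decisions that can be explored after the first decision to be explored is selected; updating the performance corresponding to the second decision to be explored according to the state information and the model of the communication system, and/or adding 1 to the number of times the second decision to be explored has been explored; and updating the performance corresponding to the first decision to be explored according to the performance corresponding to the second decision to be explored, and/or adding 1 to the number of times the first decision to be explored has been explored.
With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the selecting a first decision to be explored from the M first decisions according to the current performance of each first decision and/or the current number of times each first decision has been explored includes: selecting the first decision to be explored from the M first decisions according to the current performance of each first decision and/or the current number of times each first decision has been explored, and an exploration coefficient of each first decision, where the exploration coefficient is used to control the tendency when selecting a decision.
In the above technical solution, the exploration coefficient controls the tendency when selecting a decision, which avoids random exploration and helps the communication system operate reliably.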
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中选择所述待探索的第一决策,包括:根据y 1=x 1+C 1·b 1,从所述M个第一决策中选择所述待探索的第一决策,所述待探索的第一决策对应的y 1的取值最大,其中,x 1为所述每个第一决策当前对应的性能的函数,C 1为所述探索系数,且C 1为常数,b 1为所述每个第一决策当前被探索过的次数的倒数的函数。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述根据y 1=x 1+C 1·b 1,从所述M个第一决策中选择所述待探索的第一决策,包括:
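According to
Figure PCTCN2021131777-appb-000001
the to-be-explored first decision is determined from the M first decisions, where X_1d is the performance currently corresponding to the d-th first decision among the M first decisions, N_1 is the total number of times the M first decisions have currently been explored, and N_1d is the number of times the d-th first decision has currently been explored.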
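With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the determining, based on the state information, the performance corresponding to each of the M first decisions and/or the number of times each first decision has been explored includes: determining, based on the state information and historical information, the performance corresponding to each first decision and/or the number of times each first decision has been explored, where the historical information includes the performance corresponding to each first decision under the state information and/or the number of times each first decision has been explored.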
在上述技术方案中,可以利用通信系统中的历史信息来指导决策探索,可以避免随机探索,有助于通信系统可靠运行。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,从所述M个第一决策中确定目标第一决策,包括:根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中确定所述目标第一决策,所述探索系数用于控制选择决策时的倾向。
在上述技术方案中,可以通过探索系数控制选择决策时的倾向,有助于选择更合适的决策,有助于通信系统的可靠运行。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中选择所述目标第一决策,包括:根据y 2=x 2+C 2·b 2,从所述M个第一决策中选择所述目标第一决策,所述目标第一决策对应的y 2的取值最大,其中,x 2为所述每个第一决策对应的性能的函数,C 2为所述探索系数,b 2为所述每个第一决策被探索过的次数的倒数的函数。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述根据y 2=x 2+C 2·b 2,从所述M个第一决策中选择所述目标第一决策,包括:
根据
Figure PCTCN2021131777-appb-000002
从所述M个第一决策中确定目标第一决策,其中,X 2d为所述M个第一决策中第d个第一决策对应的性能,N 2为所述M个第一决策被探索的总次数,N 2d为所述第d个第一决策被探索的次数,C 2为常数、C 2随N 2d变化或者C 2由神经网络模型确定。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,若C 2随N 2d变化,当N 2d小于预设阈值时,C 2为0。
这样,虽然一些可探索决策具有潜在的探索价值,但由于被探索的次数过少,在通信系统执行该决策可能会给通信系统带来不可靠的结果,因此不进行探索,有助于通信系统的可靠运行。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,C 2满足
Figure PCTCN2021131777-appb-000003
其中,N t为所述预设阈值,且N t=σ·N 2,σ为预设常数。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:根据所述状态信息、以及所述目标第一决策,对所述神经网络模型进行训练,所述神经网络模型用于输出C r
在上述技术方案中,可以使用基于模型的决策探索的输出来训练神经网络模型,训练好的神经网络模型可以反过来指导基于模型的决策探索。
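With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the method further includes: obtaining parameters used for exploring decisions, where the parameters include at least one of a performance metric, N_2, C_2, C_r, σ, and N_t.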
可选地,若该方法由终端执行,终端可以从接入网设备获取所述参数。
可选地,若该方法由接入网设备执行,接入网设备可以从终端获取所述参数。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述获取探索决策所使用的参数,包括:根据任务类型,获取所述参数。
这样,针对不同任务可以采用不同的参数,有助于得到更准确的决策。例如,对于可靠度要求越高的任务,则C r应越小,这样可以减小探索项的权重;σ应越大,这样探索次数太少的决策不增加该决策的探索置信度;N应越大,这样总探索次数大,最终的决策更准确。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:获取支持信息,所述支持信息用于确定所述第一决策对应的性能,所述支持信息包括所述通信系统的模型的仿真器、仿真条件、历史信息中的至少一个,所述历史信息包括在所述通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:接收接入网设备发送的第六消息,所述第六消息用于查询探索决策所使用的参数,所述参数包括性能指标、N 2、C 2、C r,σ,N t中的至少一个;向接入网设备发送第七消息,所述第七消息用于指示所述参数。
在上述技术方案中,接入网设备可以向终端查询探索决策使用的参数,以便接入网设备估计决策探索所需的时间,以便进行合理处理。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:接收接入网设备发送的第一消息,所述第一消息用于查询是否具备探索决策的能力;向接入网设备发送第二消息,所述第二消息用于指示具备探索决策的能力;接收接入网设备发送的第五消息,所述第五消息用于指示完成在核心网设备的探索决策能力的注册。
以切换任务为例,接入网设备需要协助终端在切换任务上探索更好的性能,所以接入网设备应对终端的可靠探索能力进行询问和鉴权(VIP客户)。终端对自身的任务可靠性有特定需求,接入网设备在探索参数的设置上应询问终端的特定设置,对于接入网设备覆盖范围内的高价值终端,接入网设备可以为其在核心网设备上注册可靠探索权限。例如,接入网设备覆盖范围内高频次出现的终端对于有效连续探索的帮助很大,可为这样的终端注册可靠探索权限,辅助接入网设备可靠探索性能提升、经验累积。
结合第一方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:接收终端发送的第八消息,所述第八消息用于请求开始探索决策;向终端发送探索结果,所述探索结果包括所述目标第一决策的信息;接收终端发送的第十消息,所述第十消息用于请求结束探索决策。
在上述技术方案中,可以定义决策探索的起始和结束,有助于决策探索的顺利进行。
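According to a second aspect, this application provides a method for selecting a decision. The method includes: sending a first message to a terminal, where the first message is used to query whether the terminal is capable of exploring decisions; receiving a second message sent by the terminal, where the second message indicates that the terminal is capable of exploring decisions; sending a third message to a core network device, where the third message requests registration of the decision exploration capability; receiving a fourth message sent by the core network device, where the fourth message indicates that registration of the decision exploration capability is complete; and sending a fifth message to the terminal, where the fifth message indicates that registration of the decision exploration capability at the core network device is complete.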
接入网设备需要协助终端在切换任务上探索更好的性能,所以接入网设备应对终端的可靠探索能力进行询问和鉴权(可选)。终端对自身的任务可靠性有特定需求,接入网设备在探索参数的设置上应询问终端的特定设置,对于接入网设备覆盖范围内的高价值终端,接入网设备可以为其在核心网设备上注册可靠探索权限。在上述技术方案中,可为终端注册可靠探索权限,有助于辅助接入网设备可靠探索性能提升、经验累积。
结合第二方面,在一种可能的实现方式中,所述方法还包括:向所述终端发送第六消息,所述第六消息用于查询探索决策所使用的参数;接收所述终端发送的第七消息,所述第七消息用于指示所述参数;根据所述参数,估计探索决策的时间。
在上述技术方案中,接入网设备可以向终端查询探索决策使用的参数,以便接入网设备估计决策探索所需的时间,以便进行合理处理。
结合第二方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述方法还包括:向所述终端发送支持信息,所述支持信息包括通信系统模型的仿真器、仿真条件、历史信息中的至少一个,所述历史信息包括在通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数。
第三方面,本申请提供了一种用于选择决策的方法,所述方法包括:向接入网设备发送第八消息,所述第八消息用于请求开始探索决策;接收所述接入网设备发送的探索结果;向所述接入网设备发送第十消息,所述第十消息用于请求结束探索决策。
在上述技术方案中,可以定义决策探索的起始和结束,有助于决策探索的顺利进行。
第四方面,本申请提供了一种用于选择决策的方法,所述方法包括:接收接入网设备发送的第十三消息,所述第十三消息用于请求历史信息,所述历史信息包括在通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数;向接入网设备发送第十五消息,所述第十五消息用于指示所述历史信息。
在上述技术方案中,可以向接入网设备提供历史信息,以便实现基于历史信息的决策探索。
结合第四方面,在一种可能的实现方式中,所述方法还包括:接收所述接入网设备发送的第三消息,所述第三消息用于请求为终端注册探索决策能力;向所述接入网设备发送第四消息,所述第四消息用于指示完成探索决策能力的注册。
结合第四方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,在向所述接入网设备发送第四消息之前,所述方法还包括:确定允许所述终端探索决策。
第五方面,本申请提供了一种用于选择决策的装置,所述装置包括:
处理单元,用于获取通信系统的状态信息;根据所述状态信息,确定M个第一决策中每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,所述M个第一决策为在所述状态信息下可探索的决策,M为正整数;根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,从所述M个第一决策中确定目标第一决策。
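In the foregoing technical solution, a decision can be selected according to the performance of the explorable decisions and/or the number of times they have already been explored, which avoids random exploration and helps the communication system operate reliably.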
结合第五方面,在一种可能的实现方式中,所述处理单元还用于确定K个第二决策中每个第二决策对应的性能和/或所述每个第二决策被探索过的次数,所述K个第二决策为在选择所述目标第一决策后可探索的决策,K为正整数;根据所述每个第二决策对应的性能和/或所述每个第二决策被探索过的次数,从所述K个第二决策中确定目标第二决策。
通信系统中的决策往往不是单步决策,而是一系列决策。在上述技术方案中,多步决策中的每一步均可以根据当前可探索的决策的性能和/或当前可探索的决策已经被探索的次数来选择决策,可以避免随机探索,有助于通信系统可靠运行。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:循环执行以下步骤N次,得到所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,N为大于1的整数:根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数,从所述M个第一决策中选择待探索的第一决策;根据所述状态信息、以及所述通信系统的模型,更新所述待探索的第一决策对应的性能;和/或,在所述待探索的第一决策的被探索次数上加1。
在上述技术方案中,本申请可以利用通信系统中的已知模型,来指导决策探索,可以避免随机探索,有助于通信系统可靠运行。
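With reference to the fifth aspect or any one of the foregoing possible implementations, in another possible implementation, the processing unit is further configured to: select a to-be-explored second decision from K second decisions based on the performance currently corresponding to each of the K second decisions and/or the number of times each second decision has currently been explored, where the K second decisions are the decisions that can be explored after the to-be-explored first decision is selected; update the performance corresponding to the to-be-explored second decision based on the state information and the model of the communication system, and/or add 1 to the exploration count of the to-be-explored second decision; and update the performance corresponding to the to-be-explored first decision based on the performance corresponding to the to-be-explored second decision, and/or add 1 to the exploration count of the to-be-explored first decision.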
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中选择所述待探索的第一决策,所述探索系数用于控制选择决策时的倾向。
在上述技术方案中,可以通过探索系数控制选择决策时的倾向,可以避免随机探索,有助于通信系统可靠运行。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据y 1=x 1+C 1·b 1,从所述M个第一决策中选择所述待探索的第一决策,所述待探索的第一决策对应的y 1的取值最大,其中,x 1为所述每个第一决策当前对应的性能的函数,C 1为所述探索系数,且C 1为常数,b 1为所述每个第一决策当前被探索过的次数的倒数的函数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据
Figure PCTCN2021131777-appb-000004
从所述M个第一决策中确定所述待探索的第一决策其中,X 1d为所述M个第一决策中第d个第一决策当前对应的性能,N 1为所述M个第一决策当前被探索的总次数,N 1d为所述第d个第一决策当前被探索的次数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据所述状态信息、以及历史信息,确定所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,所述历史信息包括在所述状态信息下的所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数。
在上述技术方案中,可以利用通信系统中的历史信息来指导决策探索,可以避免随机探索,有助于通信系统可靠运行。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中确定所述目标第一决策,所述探索系数用于控制选择决策时的倾向。
在上述技术方案中,可以通过探索系数控制选择决策时的倾向,有助于选择更合适的决策,有助于通信系统的可靠运行。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据y 2=x 2+C 2·b 2,从所述M个第一决策中选择所述目标第一决策,所述目标第一决策对应的y 2的取值最大,其中,x 2为所述每个第一决策对应的性能的函数,C 2为所述探索系数,b 2为所述每个第一决策被探索过的次数的倒数的函数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据
Figure PCTCN2021131777-appb-000005
从所述M个第一决策中确定目标第一决策,其中,X 2d为所述M个第一决策中第d个第一决策对应的性能,N 2为所述M个第一决策被探索的总次数,N 2d为所述第d个第一决策被探索的次数,C 2为常数、C 2随N 2d变化或者C 2由神经网络模型确定。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,若C 2随N 2d变化,当N 2d小于预设阈值时,C 2为0。
这样,虽然一些可探索决策具有潜在的探索价值,但由于被探索的次数过少,在通信系统执行该决策可能会给通信系统带来不可靠的结果,因此不进行探索,有助于通信系统的可靠运行。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,C 2满足
Figure PCTCN2021131777-appb-000006
其中,N t为所述预设阈值,且N t=σ·N 2,σ为预设常数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元还用于:根据所述状态信息、以及所述目标第一决策,对所述神经网络模型进行训练,所述神经网络模型用于输出C r
在上述技术方案中,可以使用基于模型的决策探索的输出来训练神经网络模型,训练好的神经网络模型可以反过来指导基于模型的决策探索。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元还用于:获取探索决策所使用的参数,所述参数包括性能指标、N 2、C 2、C r,σ,N t中的至少一个。
可选地,若该方法由终端执行,终端可以从接入网设备获取所述参数。
可选地,若该方法由接入网设备执行,接入网设备可以从终端获取所述参数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元具体用于:根据任务类型,获取所述参数。
这样,针对不同任务可以采用不同的参数,有助于得到更准确的决策。例如,对于可靠度要求越高的任务,则C r应越小,这样可以减小探索项的权重;σ应越大,这样探索次数太少的决策不增加该决策的探索置信度;N应越大,这样总探索次数大,最终的决策更准确。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述处理单元还用于:获取支持信息,所述支持信息用于确定所述第一决策对应的性能,所述支持信息包括所述通信系统的模型的仿真器、仿真条件、历史信息中的至少一个,所述历史信息包括在所述通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述装置还包括收发单元,用于接收接入网设备发送的第六消息,所述第六消息用于查询探索决策所使用的参数,所述参数包括性能指标、N 2、C 2、C r,σ,N t中的至少一个;向接入网设备发送第七消息,所述第七消息用于指示所述参数。
在上述技术方案中,接入网设备可以向终端查询探索决策使用的参数,以便接入网设备估计决策探索所需的时间,以便进行合理处理。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述收发单元还用于:接收接入网设备发送的第一消息,所述第一消息用于查询是否具备探索决策的能力;向接入网设备发送第二消息,所述第二消息用于指示具备探索决策的能力;接收接入网设备发送的第五消息,所述第五消息用于指示完成在核心网设备的探索决策能力的注册。
以切换任务为例,接入网设备需要协助终端在切换任务上探索更好的性能,所以接入网设备应对终端的可靠探索能力进行询问和鉴权(VIP客户)。终端对自身的任务可靠性有特定需求,接入网设备在探索参数的设置上应询问终端的特定设置,对于接入网设备覆盖范围内的高价值终端,接入网设备可以为其在核心网设备上注册可靠探索权限。例如,接入网设备覆盖范围内高频次出现的终端对于有效连续探索的帮助很大,可为这样的终端注册可靠探索权限,辅助接入网设备可靠探索性能提升、经验累积。
结合第五方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述收发单元还用于:接收终端发送的第八消息,所述第八消息用于请求开始探索决策;向终端发送探索结果,所述探索结果包括所述目标第一决策的信息;接收终端发送的第十消息,所述第十消息用于请求结束探索决策。
在上述技术方案中,可以定义决策探索的起始和结束,有助于决策探索的顺利进行。
第六方面,本申请提供了一种用于选择决策的装置,所述装置包括:
收发单元,用于向终端发送第一消息,所述第一消息用于查询是否具备探索决策的能力;接收所述终端发送的第二消息,所述第二消息用于指示具备探索决策的能力;向核心网设备发送第三消息,所述第三消息用于请求注册探索决策能力;接收所述的核心网设备发送的第四消息,所述第四消息用于指示完成探索决策能力的注册;向所述终端发送第五消息,所述第五消息用于指示完成在所述核心网设备的探索决策能力的注册。
接入网设备需要协助终端在切换任务上探索更好的性能,所以接入网设备应对终端的可靠探索能力进行询问和鉴权(可选)。终端对自身的任务可靠性有特定需求,接入网设备在探索参数的设置上应询问终端的特定设置,对于接入网设备覆盖范围内的高价值终端,接入网设备可以为其在核心网设备上注册可靠探索权限。在上述技术方案中,可为终端注册可靠探索权限,有助于辅助接入网设备可靠探索性能提升、经验累积。
结合第六方面,在一种可能的实现方式中,所述收发单元还用于向所述终端发送第六消息,所述第六消息用于查询探索决策所使用的参数;接收所述终端发送的第七消息,所述第七消息用于指示所述参数;所述装置还包括处理单元,用于根据所述参数,估计探索决策的时间。
在上述技术方案中,接入网设备可以向终端查询探索决策使用的参数,以便接入网设备估计决策探索所需的时间,以便进行合理处理。
结合第六方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述收发单元还用于向所述终端发送支持信息,所述支持信息包括通信系统模型的仿真器、仿真条件、历史信息中的至少一个,所述历史信息包括在通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数。
第七方面,本申请提供了一种用于选择决策的装置,所述装置包括:
收发单元,用于向接入网设备发送第八消息,所述第八消息用于请求开始探索决策;接收所述接入网设备发送的探索结果;向所述接入网设备发送第十消息,所述第十消息用于请求结束探索决策。
在上述技术方案中,可以定义决策探索的起始和结束,有助于决策探索的顺利进行。
第八方面,本申请提供了一种用于选择决策的装置,所述装置包括:
收发单元,用于接收接入网设备发送的第十三消息,所述第十三消息用于请求历史信息,所述历史信息包括在通信系统的不同状态下的每个决策对应的性能和/或所述每个决策被探索过的次数;向接入网设备发送第十五消息,所述第十五消息用于指示所述历史信息。
在上述技术方案中,可以向接入网设备提供历史信息,以便实现基于历史信息的决策探索。
结合第八方面,在一种可能的实现方式中,所述收发单元还用于:接收所述接入网设备发送的第三消息,所述第三消息用于请求为终端注册探索决策能力;向所述接入网设备发送第四消息,所述第四消息用于指示完成探索决策能力的注册。
结合第八方面或上述任意一种可能的实现方式,在另一种可能的实现方式中,所述装置还包括处理单元,用于在向所述接入网设备发送第四消息之前,确定允许所述终端探索决策。
第九方面,本申请提供了一种通信装置,包括处理器、存储器和收发器。其中,存储器用于存储计算机程序,处理器用于调用并运行存储器中存储的计算机程序,并控制收发器收发信号,以使通信装置执行如第一方面或其任意可能的实现方式中的方法,或者执行如第二方面或其任意可能的实现方式中的方法,或者执行如第三方面或其任意可能的实现方式中的方法,或者执行如第四方面或其任意可能的实现方式中的方法。
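According to a tenth aspect, this application provides a communication apparatus, including a processor and a communication interface. The communication interface is configured to receive a signal and transmit the received signal to the processor, and the processor processes the signal so that the method in the first aspect or any possible implementation thereof is performed, or the method in the second aspect or any possible implementation thereof is performed, or the method in the third aspect or any possible implementation thereof is performed, or the method in the fourth aspect or any possible implementation thereof is performed.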
可选地,上述通信接口可以为接口电路,处理器可以为处理电路。
第十一方面,本申请提供了一种芯片,包括逻辑电路和通信接口,所述通信接口,用于接收待处理的数据和/或信息,所述逻辑电路用于执行如上述任意一方面或其任意实现方式中所述的数据和/或信息处理,以及,所述通信接口还用于输出所述逻辑电路得到处理结果。
第十二方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机指令,当计算机指令在计算机上运行时,使得如第一方面或其任意可能的实现方式中的方法被执行,或者使得如第二方面或其任意可能的实现方式中的方法被执行,或者使得如第三方面或其任意可能的实现方式中的方法被执行,或者使得如第四方面或其任意可能的实现方式中的方法被执行。
第十三方面,本申请提供一种计算机程序产品,所述计算机程序产品包括计算机程序代码,当所述计算机程序代码在计算机上运行时,使得如第一方面或其任意可能的实现方式中的方法被执行,或者使得如第二方面或其任意可能的实现方式中的方法被执行,或者使得如第三方面或其任意可能的实现方式中的方法被执行,或者使得如第四方面或其任意可能的实现方式中的方法被执行。
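According to a fourteenth aspect, this application provides a wireless communication system, including the communication apparatus according to the fifth aspect, the sixth aspect, the seventh aspect, or the eighth aspect.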
附图说明
图1是可以应用本申请实施例的通信系统的一种示意性架构图。
图2是本申请的用于选择决策的方法的整体流程图。
图3是本申请提供的用于选择决策的方法的示意性流程图。
图4是通过历史信息辅助探索的示例。
图5是多步探索的示意图。
图6是通过通信系统的模型辅助探索的示例(一)。
图7是通过通信系统的模型辅助探索的示例(二)。
图8是本申请的方法应用于信道编码场景的一个示例。
图9是在核心网设备为终端注册可靠探索能力的示意性流程图。
图10是由接入网设备选择决策时的信令交互的示意图。
图11为本申请的实施例提供的选择决策的装置的结构示意图。
图12为本申请的实施例提供的选择决策的装置的另一结构示意图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
The technical solutions in the embodiments of this application may be applied to various communication systems, for example: a global system for mobile communications (GSM), an enhanced data rate for GSM evolution (EDGE) system, a wideband code division multiple access (WCDMA) system, a code division multiple access 2000 (CDMA2000) system, a time division-synchronous code division multiple access (TD-SCDMA) system, a long term evolution (LTE) system, an LTE frequency division duplex (FDD) system, an LTE time division duplex (TDD) system, a universal mobile telecommunications system (UMTS), a worldwide interoperability for microwave access (WiMAX) communication system, a narrowband internet of things (NB-IoT) system, a 5th generation (5G) system or new radio (NR), a satellite communication system, and future mobile communication systems. The technical solutions may also be applied to application scenarios of the 5G system such as enhanced mobile broadband (eMBB), ultra-reliable and low latency communication (URLLC), and enhanced machine-type communication (eMTC).
图1是可以应用本申请实施例的通信系统的一种示意性架构图。
如图1所示,该通信系统100可以包括接入网设备(如图1中的接入网设备110)和至少一个终端(如图1中的终端120和终端130)。无线通信系统通常由小区组成,每个小区可以包含一个接入网设备,终端通过无线的方式与接入网设备相连,接入网设备可以向多个终端提供通信服务。图1中的终端可以是固定位置的,也可以是可移动的。图1只是示意图,该通信系统中还可以包括其它网络设备,如还可以包括核心网设备、无线中继设备和无线回传设备等,在图1中未画出。本申请的实施例对该通信系统中包括的接入网设备和终端的数量不做限定。
本申请实施例中的终端也可以称为用户设备(user equipment,UE)、用户、接入终端、用户单元、用户站、移动站、移动台(mobile station,MS)、远方站、远程终端、移动设备、用户终端、终端设备、无线通信设备、用户代理或用户装置等。终端可以是蜂窝电话、智能手表、无线数据卡、手机、平板电脑、个人数字助理(personal digital assistant,PDA)电脑、无线调制解调器、计算设备、连接到无线调制解调器的其它处理设备、手持设备、膝上型电脑、机器类型通信(machine type communication,MTC)终端、带无线收发功能的电脑、虚拟现实终端、增强现实终端、工业控制中的无线终端、无人驾驶中的无线终端、远程手术中的无线终端、智能电网中的无线终端、运输安全中的无线终端、智慧城市中的无线终端、智慧家庭中的无线终端、卫星通信中的无线终端(例如,卫星电话或卫星终端等)等等。本申请的实施例对终端所采用的具体技术和具体设备形态不做限定。
本申请实施例中的接入网设备可以是用于与终端通信的设备。该接入网设备可以是GSM系统或码分多址(code division multiple access,CDMA)中的基站(base transceiver station,BTS),也可以是宽带码分多址(wideband code division multiple access,WCDMA)系统中的基站(nodeB,NB),还可以是LTE系统中的演进型基站(evolutional nodeB,eNB或eNodeB),还可以是云无线接入网络(cloud radio access network,CRAN)场景下的无线控制器,或者该网络设备可以为中继站、接入点、车载设备或者可穿戴设备,或者该网络设备可以为设备到设备(device to device,D2D)通信或机器通信中承担基站功能的终端,或者该网络设备可以为5G网络中的网络设备或者未来演进的公共陆地移动网 (public land mobile network,PLMN)网络中的接入网设备等,本申请实施例并不限定。此外,本申请实施例中的接入网设备也可以是完成基站部分功能的模块或单元,例如,可以是集中式单元(central unit,CU),也可以是分布式单元(distributed unit,DU)。本申请的实施例对接入网设备所采用的具体技术和具体设备形态不做限定。
在一些实现方式中,上述基站可以包含基带单元(baseband unit,BBU)和远端射频单元(remote radio unit,RRU)。BBU和RRU可以放置在不同的地方。例如,RRU拉远,放置于高话务量的区域,BBU放置于中心机房。BBU和RRU也可以放置在同一机房。BBU和RRU也可以为一个机架下的不同部件。
本申请实施例的终端和接入网设备可以部署在陆地上,包括室内或室外、手持或车载;也可以部署在水面上;还可以部署在空中的飞机、气球和人造卫星上。本申请的实施例对终端和接入网设备的应用场景不做限定。
本申请实施例的终端和接入网设备之间可以通过授权频谱进行通信,也可以通过免授权频谱进行通信,也可以同时通过授权频谱和免授权频谱进行通信。终端和接入网设备之间可以通过6千兆赫(gigahertz,GHz)以下的频谱进行通信,也可以通过6GHz以上的频谱进行通信,还可以同时使用6GHz以下的频谱和6GHz以上的频谱进行通信。本申请的实施例对终端和接入网设备之间所使用的频谱资源不做限定。
无线通信系统通常面对着变化的信道、变化的环境和变化的用户。硬件的非理想和建模的非理想使得通信系统在变化中很难通过理论公式计算寻求最优的决策,导致最优决策通常不易获取,有时甚至需要采用复杂度很高的遍历搜索才能得到最优决策;次优决策可以通过解优化问题的方式获得,但解优化问题的复杂度也很高,在一些场景下同样不易求解。通常,对于某个复杂的特定通信场景,只有探索可以找到更优的决策。
深度强化学习可以利用神经网络模型与环境的交互来搜索最优决策,以通信系统作为环境,深度强化学习可以用来搜索针对该通信系统的最优决策。
无模型(model-free)强化学习算法是深度强化学习中最常用的一类,例如深度Q网络(deep Q network,DQN)、近端策略优化(proximal policy optimization,PPO)等算法。无模型强化学习算法中没有模型,完全依靠和环境的交互得到一种决策,所以探索较随机,这在环境非常复杂无法建模的时候非常有用。但是对于实际的通信系统,随机探索有可能会导致通信系统性能恶化,由于通信系统的可靠性要求较高,这种探索方式是不能接受的。
这样,现有的通信系统为了通信系统可以可靠运行,往往坚持选用通用、保守的决策,导致通信系统大部分器件长期工作在非最优性能下,不能满足未来高性能通信系统的要求。
针对上述问题,本申请提供了一种用于选择决策的方法和装置,能够避免随机探索,有助于通信系统可靠运行。
图2是本申请的用于选择决策的方法的整体流程图。
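1) As shown in FIG. 2, the current state of the communication system may be input into a known model of the communication system. Based on that state, the model first determines the explorable decisions, then selects a decision from among them and evaluates the performance corresponding to it. After N iterations, the performance X_θ and the exploration count B_θ of each explored decision θ are obtained and used to guide the selection of a reliable communication decision. The output of the known model may include the performance of the explorable decisions.
2) As shown in FIG. 2, the real experience of the communication system, that is, its historical information, may also be queried based on the current state. Querying the historical information under the current state first determines the decisions explored in the past, and the historical information gives the performance X_θ and exploration count B_θ of each explored decision θ. The historical information is equivalent to many explorations that have already been completed, so it can guide the selection of a reliable communication decision. If the historical information is regarded as a communication model, this process may also be called model-based exploration. The historical information may include decisions selected in the past by this device or other devices under the same or a similar communication system state, together with the corresponding performance; these previously selected decisions are the explored decisions.
3) As shown in FIG. 2, the communication system state and the outputs of the known model and the historical information may also be used to train a neural network model, and the trained neural network model may in turn guide the selection of a reliable communication decision. For example, after the communication system adopts a reliable communication decision it obtains a real performance; the decision and the real performance may be stored in the historical information, and the historical information used to train the neural network model so that it outputs, for a given communication system state, the estimated performance X_θ and the reliable exploration coefficient C_θ of each decision. By evaluating the neural network over multiple iterations, the exploration count B_θ of each decision can be obtained. A reliable communication decision is then selected according to the output of the neural network.
4) As shown in FIG. 2, when a reliable communication decision is selected, Y_θ = X_θ + C_θ × B_θ may be computed and the decision θ that maximizes Y_θ selected. The reliable exploration coefficient C_θ may be preset, obtained by table lookup, computed from the current state of decision exploration, negotiated between communication devices during communication, or output by a neural network based on the current communication system state. The value of C_θ balances the exploitation term, which corresponds to X_θ, against the exploration term, which corresponds to B_θ. The balance is such that the performance of the selected decision is not too poor while decisions that have been explored less continue to receive more exploration.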
上述选择的可靠通信决策可以是一系列决策,选择可靠通信决策即相当于选择多步连续的决策。
上述通信系统的状态针对于不同的任务可以不同。例如,对于切换任务和配对任务,通信系统的状态可以是终端所在位置。又如,对于信道编码任务,通信系统的状态可以是当前的码构造。
上述可探索决策是指在当前通信系统的状态下可行的决策,例如,在切换任务中,当终端处于某个位置时,只有切换到A基站,或切换到B基站这两个选项,没有切换到C基站这样的选项,此时可探索决策就只有切换到A或B。
上述通信系统模型可以根据通信系统的状态得到决策对应的性能。例如,通信系统模型可以是蒙特卡洛仿真器、模型公式、神经网络模型等。针对不同的任务,该模型公式可以不同。例如,模型公式可以是信道容量公式、信噪比(signal-noise ratio,SNR)公式、能量效率公式、频谱效率公式等通信系统相关计算公式中的至少一个。
上述决策的性能,也可以描述为决策对应的性能,是选择并执行某个决策后导致通信系统具有的性能。执行这个决策可能是在仿真环境中执行,得到的是对应的仿真环境下的性能;也可能是在真实的通信网络中执行,得到的是对应的真实环境下的性能。同样,针对不同的任务,决策的性能可能是信道容量、信噪比、能量效率、频谱效率等通信系统相关性能指标中的至少一个。
上述可靠通信决策为通过多次探索后最终选择的决策。也可以把可靠通信决策看成真正在通信系统中执行的待探索决策,待探索决策是单步探索的决策,待探索决策有可能是在仿真下进行的,但可靠通信决策是在真实通信系统中执行的。
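It should be noted that 1), 2), and 3) above may be performed separately or in any combination, which is not specifically limited in the embodiments of this application.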
这样,本申请可以利用通信系统中的已知模型和/或历史信息和/或神经网络模型,来指导决策探索,可以避免随机探索,有助于通信系统可靠运行。
下面对本申请提供的用于选择决策的方法进行详细描述。
图3是本申请提供的用于选择决策的方法的示意性流程图。图3所示的方法可以包括以下内容的至少部分内容。
在步骤310中,获取通信系统的状态信息。
上述通信系统的状态信息为用于表征通信系统的状态的信息,针对不同的任务可以不同。例如,对于切换任务和配对任务,状态信息可以是终端设备所在位置的信息。对于功率分配任务,状态信息可以是接入网设备和终端设备之间的信道的信道状态信息。又如,对于信道编码的构造任务,状态信息可以是当前的码构造的信息。
在步骤320中,根据通信系统的状态信息,确定M个第一决策中每个第一决策的性能和/或每个第一决策被探索过的次数,M为正整数。需要说明的是,本文中对决策被探索过的次数、决策的访问次数、决策的次数不做区分。
其中,M个第一决策为在该状态信息下可探索的决策。第一决策的性能可以是第一决策的算数平均性能、加权平均性能、最大或最小性能、累加性能等,对此不作具体限定。
在本申请中,确定M个第一决策中每个第一决策的性能和/或每个第一决策被探索过的次数的方式有很多,本申请不作具体限定。例如,可以通过下述的方式1和2来实现。
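Manner 1: exploration assisted by a model of the communication system
That is, the current state of the communication system may be input into a model of the communication system, and the performance of each of the M first decisions and/or the number of times each first decision has been explored is determined through computation by the model.
As an example, the following steps may be performed N times in a loop to obtain the performance of each first decision and/or the number of times each first decision has been explored, where N is an integer greater than 1: select a to-be-explored first decision from the M first decisions based on the current performance of each first decision and/or the number of times each first decision has currently been explored; update the performance corresponding to the to-be-explored first decision based on the state information and the model of the communication system; and/or add 1 to the exploration count of the to-be-explored first decision.
Optionally, the to-be-explored first decision may be selected from the M first decisions based on the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored, and the exploration coefficient of each first decision. The exploration coefficient controls the tendency when a first decision is selected, balancing exploration, exploitation, and reliability. The tendency may be to select more often first decisions that have been explored fewer times, first decisions that have been explored fewer times and perform well, or first decisions that perform well. The exploration coefficient may be set differently for different tasks or scenarios.
Taking the case in which performance and exploration count are considered together as an example, the to-be-explored first decision is selected from the M first decisions based on the performance currently corresponding to each first decision, the number of times each first decision has currently been explored, and the exploration coefficient of each first decision.
In some implementations, the to-be-explored first decision may be selected from the M first decisions according to y_1 = x_1 + C_1·b_1, the to-be-explored first decision being the one with the largest y_1, where x_1 is a function of the performance currently corresponding to each first decision, C_1 is the exploration coefficient of each first decision and is a constant, and b_1 is a function of the reciprocal of the number of times each first decision has currently been explored.
For example, the classic upper confidence bound (UCB) algorithm may be combined with a design for reliable communication, which this application calls reliable upper confidence bound (R-UCB) exploration. The to-be-explored first decision may be determined from the M first decisions according to Formula 1: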
Figure PCTCN2021131777-appb-000007
Here, X_1d is the exploitation term,
Figure PCTCN2021131777-appb-000008
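is the exploration term. The to-be-explored first decision is the one with the largest y_1. X_1d is the current performance of the d-th first decision among the M first decisions, N_1 is the total number of times the M first decisions have currently been explored, N_1d is the number of times the d-th first decision has currently been explored, and C_1 is the exploration coefficient of each first decision and is a constant. C_1 controls the tendency when a decision is selected, or can be understood as changing the ratio between the exploitation term and the exploration term: the larger C_1 is, the more the selection favors decisions that have been explored fewer times, and the smaller C_1 is, the more it favors decisions that perform well.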
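The N-iteration loop of Manner 1 can be sketched as follows. This is only an illustration under stated assumptions: the formula image is not reproduced in this text, so the exploration term is assumed to take the classic UCB form sqrt(ln N_1 / N_1d); the performance is kept as a running mean (the description also allows cumulative or weighted variants); and `model`, `explore`, and `select_to_explore` are hypothetical names, not part of the application.

```python
import math
import random


def select_to_explore(perf, counts, c1):
    """Pick the index d that maximizes X_d + C1 * sqrt(ln N / N_d).

    Assumption: the classic UCB exploration term stands in for the
    formula image that is not reproduced in the text.
    """
    # Decisions never explored are tried first (the first pick may be random).
    unvisited = [d for d, n in enumerate(counts) if n == 0]
    if unvisited:
        return random.choice(unvisited)
    total = sum(counts)
    scores = [
        perf[d] + c1 * math.sqrt(math.log(total) / counts[d])
        for d in range(len(perf))
    ]
    return max(range(len(perf)), key=scores.__getitem__)


def explore(state, num_decisions, model, n_iters, c1=1.414):
    """Run n_iters simulated explorations and return (performance, counts)."""
    perf = [0.0] * num_decisions   # running mean performance X_d
    counts = [0] * num_decisions   # exploration counts N_d
    for _ in range(n_iters):
        d = select_to_explore(perf, counts, c1)
        estimate = model(state, d)  # hypothetical system model (simulator or formula)
        counts[d] += 1
        perf[d] += (estimate - perf[d]) / counts[d]
    return perf, counts
```

With a toy model such as `lambda state, d: [0.1, 0.9, 0.5][d]`, most of the exploration budget ends up on the best-performing decision while the weaker ones are still visited a few times.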
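Manner 2: exploration assisted by historical information
In some implementations, the communication apparatus may record or store the decisions made in the past by this device or other devices under the same or a similar communication system state, together with the corresponding performance. When a decision needs to be explored, the stored performance of each of the M first decisions and/or the number of times each first decision has been explored can be obtained based on the current communication system state.
For example, the communication apparatus may store, in the form of a table, the past decisions of this device or other devices under different communication system states and the corresponding performance. When a decision needs to be explored, the performance of each of the M first decisions and/or the number of times each has been explored is obtained by table lookup based on the current communication system state.
As shown in FIG. 4, unlike Manner 1, Manner 2 does not need to go through single explorations one after another, because the historical information can be regarded as many explorations already carried out in the past.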
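The table lookup of Manner 2 might look like the sketch below. The table layout, the state key, and the decision names are all invented for illustration; the application does not specify how the historical information is stored.

```python
# Hypothetical history table: state -> {decision: (mean performance, exploration count)}.
HISTORY = {
    "cell_edge_north": {
        "handover_to_A": (12.5, 40),
        "handover_to_B": (14.0, 8),
    },
}


def lookup_history(history, state):
    """Return (decisions, performances, counts) recorded for `state`.

    An unknown state yields three empty lists, meaning no past exploration
    is available and another manner (for example a system model) is needed.
    """
    entries = history.get(state, {})
    decisions = list(entries)
    perf = [entries[d][0] for d in decisions]
    counts = [entries[d][1] for d in decisions]
    return decisions, perf, counts
```

The returned performances and counts play the same role as the outputs of the simulated loop in Manner 1, so the same selection rule can be applied to either.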
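In step 330, a target first decision is determined from the M first decisions based on the performance of each of the M first decisions and/or the number of times each first decision has been explored. The target first decision may correspond to the reliable communication decision described above.
There are many ways to determine the target first decision from the M first decisions, which are not specifically limited in the embodiments of this application.
In some implementations, the first decision with the best performance may be selected from the M first decisions as the target first decision.
In other implementations, the first decision that has been explored the most times may be selected from the M first decisions as the target first decision.
In still other implementations, performance and exploration count may be considered together to select the target first decision.
Optionally, the target first decision may be selected from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored, and the exploration coefficient of each first decision. The exploration coefficient controls the tendency when a first decision is selected, balancing exploration, exploitation, and reliability. The tendency may be to select more often first decisions that have been explored fewer times, first decisions that have been explored fewer times and perform well, or first decisions that perform well. The exploration coefficient may be set differently for different tasks or scenarios.
Specifically, the target first decision may be selected from the M first decisions according to y_2 = x_2 + C_2·b_2, the target first decision being the one with the largest y_2, where x_2 is a function of the performance corresponding to each first decision, C_2 is the exploration coefficient of each first decision, and b_2 is a function of the reciprocal of the number of times each first decision has been explored.
For example, the target first decision may be determined from the M first decisions according to Formula 2: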
Figure PCTCN2021131777-appb-000009
Here, X_2d is the exploitation term,
Figure PCTCN2021131777-appb-000010
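is the exploration term. X_2d is the performance of the d-th first decision among the M first decisions, N_2 is the total number of times the M first decisions have been explored, N_2d is the number of times the d-th first decision has been explored, and C_2 is the exploration coefficient of each first decision; C_2 is a constant, varies with N_2d, or is determined by a neural network model. C_2 controls the tendency when a decision is selected, or can be understood as changing the ratio between the exploitation term and the exploration term: the larger C_2 is, the more the selection favors decisions that have been explored fewer times, and the smaller C_2 is, the more it favors decisions that perform well.
Optionally, when a reliable communication decision is selected, if C_2 is set to vary with N_2d, C_2 may be set to 0 when N_2d is less than a preset threshold. In this way, although some explorable decisions have potential exploration value, they have been explored too few times, and executing them in the communication system could produce unreliable results; they are therefore not explored, which helps the communication system operate reliably.
For example, C_2 satisfies: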
Figure PCTCN2021131777-appb-000011
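where N_t is the preset threshold, N_t = σ·N_2, and σ is a preset constant. For example, with σ = 0.001 and N_2 = 10000, the preset threshold is 10, that is, C_2 is 0 for any decision explored fewer than 10 times. C_2 is set here as a step function, taking either 0 or one other value. C_2 may also be set to vary continuously with N_2d.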
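The effect of the step-shaped C_2 can be illustrated with a short sketch. The assumptions are labeled here and in the comments: the non-zero value of C_2 is taken to be the parameter C_r, the exploration term is again assumed to be the classic UCB form, and the function names are hypothetical.

```python
import math


def reliable_coefficient(n_d, n_total, c_r, sigma):
    """Step-shaped C_2: 0 below the threshold N_t = sigma * N_2, else c_r.

    Assumption: the non-zero branch equals the parameter C_r.
    """
    n_t = sigma * n_total
    return 0.0 if n_d < n_t else c_r


def select_target(perf, counts, c_r=1.414, sigma=0.001):
    """Return the index of the decision with the largest y_2."""
    n_total = sum(counts)
    best, best_score = None, -math.inf
    for d, (x_d, n_d) in enumerate(zip(perf, counts)):
        c2 = reliable_coefficient(n_d, n_total, c_r, sigma)
        # Assumed UCB-style exploration term; it is skipped when C_2 is 0.
        bonus = c2 * math.sqrt(math.log(n_total) / n_d) if c2 > 0 and n_d > 0 else 0.0
        score = x_d + bonus
        if score > best_score:
            best, best_score = d, score
    return best
```

For performances `[0.60, 0.62, 0.55]` and counts `[9000, 995, 5]`, the third decision has been explored fewer than 10 times, so it receives no exploration bonus and the second decision is selected; with `sigma=0.0` the gate is disabled and the barely explored third decision would win on its bonus alone.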
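In addition, C_2 may be set differently depending on the exploration manner the communication system uses. For example, when exploration is assisted by the model of the communication system, C_2 is a fixed value; when historical information is used as the basis for exploration, C_2 may be set to a value that varies with N_2d.
Decisions in a communication system are usually not single-step decisions but a series of decisions. Such multi-step decisions can be represented as a tree, called a decision tree, in which each node corresponds to one explorable decision. The communication system model or the historical information can then be used to determine the performance and exploration count of every node in the tree; similarly, a tree-shaped table recording historical information can be used for exploration by table lookup. The performance of the decision corresponding to a parent node is the sum of the performances of the decisions corresponding to all its child nodes, and the exploration count of the decision corresponding to a parent node is the sum of the exploration counts of the decisions corresponding to all its child nodes. The state of a child node is the result of selecting the child node's decision on top of the parent node's state.
For example, in the multi-step decision shown in FIG. 5, the performance of decision B in state A is the sum of the performance of decision C in state A+B and the performance of decision D in state A+B, and the count of decision B in state A is the sum of the count of decision C in state A+B and the count of decision D in state A+B. The performance of decision C in state A+B is the sum of the performances of decisions E, F, and G in state A+B+C, and the count of decision C in state A+B is the sum of the counts of decisions E, F, and G in state A+B+C.
For multi-step exploration, Manner 1 above may continue with the following actions: select a to-be-explored second decision from K second decisions based on the performance currently corresponding to each of the K second decisions and/or the number of times each second decision has currently been explored, where the K second decisions are the decisions that can be explored after the to-be-explored first decision is selected; update the performance corresponding to the to-be-explored second decision based on the state information and the model of the communication system, and/or add 1 to the exploration count of the to-be-explored second decision; and update the performance corresponding to the to-be-explored first decision based on the performance corresponding to the to-be-explored second decision, and/or add 1 to the exploration count of the to-be-explored first decision. Updating the performance of the to-be-explored first decision based on the performance of the to-be-explored second decision can be understood as adding the performance of the to-be-explored second decision to the existing performance of the first decision.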
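The parent-equals-sum-of-children bookkeeping described above can be sketched with a small tree structure. `Node` and `backup` are hypothetical names, and the sketch assumes that each simulated exploration yields one performance estimate that is added to every decision on the explored path.

```python
class Node:
    """One explorable decision in the decision tree."""

    def __init__(self, decision):
        self.decision = decision
        self.perf = 0.0      # accumulated performance of this decision
        self.count = 0       # number of times this decision was explored
        self.children = {}   # decision label -> Node

    def child(self, decision):
        """Return the child for `decision`, creating it on first use."""
        if decision not in self.children:
            self.children[decision] = Node(decision)
        return self.children[decision]


def backup(path, estimate):
    """Add one exploration result to every node on the explored path.

    Because each result is added to a child and to all of its ancestors,
    a parent's performance and count stay equal to the sums over its children.
    """
    for node in path:
        node.perf += estimate
        node.count += 1
```

Exploring B then C with an estimate of 1.0 and B then D with an estimate of 2.0 leaves B with an accumulated performance of 3.0 and a count of 2, matching the FIG. 5 description.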
图3所示的方法还可以包括步骤340和步骤350。
在步骤340中,确定K个第二决策中每个第二决策的性能和/或每个第二决策被探索过的次数,K个第二决策为在选择所述目标第一决策后可探索的决策,K为正整数。
在步骤350中,根据每个第二决策对应的性能和/或每个第二决策被探索过的次数,从K个第二决策中确定目标第二决策。
确定目标第二决策的方法与确定目标第一决策相同或相似,可以参考上文的相关描述。不同的是,通信系统的状态为选择目标第一决策后的状态。
需要说明的是,图3仅示出了多步探索中的两步,实际上可以为更多步的探索。
上文描述了利用通信系统模型和历史信息探索决策的方法,下面对神经网络模型辅助探索决策的方法进行描述。
当单步可探索决策的数量大,且决策步数很多时,可以认为决策空间非常大,即决策树既宽且深,此时相似状态下的N次探索没有相互借鉴演进的机制。例如,假设基站A、B、C周边基站及环境相似,在同样的相对位置上,UE切换情况的探索符合相似规律,若基站A、B、C各自进行N次没有关联的探索,是对资源的浪费。本申请可以利用神经网络模型辅助决策探索,并辅助减小探索空间。
在本申请中,可以使用基于模型的决策探索的输出来训练神经网络模型,训练好的神经网络模型可以反过来指导基于模型的决策探索。例如,通信系统在采取可靠通信决策后会得到一个真实性能,可以根据这个可靠通信决策和性能来训练神经网络模型。
这样,由于神经网络模型已经拟合到历史上其它设备的多次探索结果,所以可以用于辅助减小探索空间。
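In some implementations, on the basis of Formula 1 or Formula 2, a factor P_d may be added to the exploration term to generalize and prune the decision tree, as shown in Formula 3: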
Figure PCTCN2021131777-appb-000012
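In this case, the exploration coefficient is C·P_d, where P_d is output by the neural network model. For ease of description, here and below the parameters in Formula 1 and Formula 2 are described uniformly as y, X_d, C, N, and N_d without distinction; for details, refer to Formula 1 and Formula 2, which are not repeated here.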
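How a network-supplied factor P_d can prune or favor branches may be sketched as below. The exact form of Formula 3 is not reproduced in this text, so the score is assumed to be X_d + C·P_d·sqrt(ln N / N_d); treating an unexplored decision as having no bonus is a further assumption made only to keep the sketch well defined.

```python
import math


def prior_weighted_score(x_d, n_d, n_total, c, p_d):
    """Score with exploration coefficient C * P_d (assumed UCB-style term).

    p_d is the factor a trained neural network would output for decision d.
    A factor of 0 removes the exploration bonus, which effectively prunes
    the branch unless its observed performance is already competitive.
    """
    if n_d == 0 or p_d == 0.0:
        return x_d
    return x_d + c * p_d * math.sqrt(math.log(n_total) / n_d)
```

Two decisions with equal performance and equal counts are then ranked by their factors, so experience learned from similar base stations can steer exploration toward branches that were promising elsewhere.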
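In this application, there are many ways to determine the parameters involved in or used for decision exploration, which are not specifically limited.
In some implementations, the parameters used for decision exploration may be determined according to the task type or application scenario that triggers the exploration. The parameters may include at least one of N, C, C_r, σ, and N_t.
For example, a reliable exploration parameter table may be recorded or stored on the communication apparatus, and the parameters determined by table lookup according to the task type.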
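Table 1 is an example of a reliable exploration parameter table.
Table 1
Reliable exploration task type    Performance metric     Parameter C_r    Parameter σ    Total explorations N
1. Handover                       SNR                    1.414            0.01           1000
2. UE pairing                     Spectral efficiency    2.5              0.001          100
3. Power control                  Spectral efficiency    2.0              0.001          200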
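Different tasks have different reliability requirements. For a task with a higher reliability requirement, C_r should be smaller, which reduces the weight of the exploration term; σ should be larger, so that decisions explored too few times do not gain exploration confidence; and N should be larger, so that the total number of explorations is large and the final decision is more accurate. As shown in Table 1, when the task that triggers decision exploration is a handover task, the performance metric may be the signal-to-noise ratio (SNR), C_r is 1.414, σ is 0.01, and N is 1000; when it is a UE pairing task, the performance metric may be spectral efficiency, C_r is 2.5, σ is 0.001, and N is 100; and when it is a power control task, the performance metric may be spectral efficiency, C_r is 2.0, σ is 0.001, and N is 200.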
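A lookup of the Table 1 parameters by task type might be sketched as follows. The values are taken from Table 1; the dictionary keys, the `ExploreParams` type, and the helper names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploreParams:
    metric: str     # performance metric
    c_r: float      # parameter C_r
    sigma: float    # parameter sigma
    n_total: int    # total number of explorations N


# Values from Table 1; the task keys are hypothetical identifiers.
PARAMS_BY_TASK = {
    "handover": ExploreParams("SNR", 1.414, 0.01, 1000),
    "ue_pairing": ExploreParams("spectral_efficiency", 2.5, 0.001, 100),
    "power_control": ExploreParams("spectral_efficiency", 2.0, 0.001, 200),
}


def params_for(task):
    """Look up the reliable exploration parameters for a task type."""
    return PARAMS_BY_TASK[task]


def threshold(params):
    """Preset threshold N_t = sigma * N below which C_2 is set to 0."""
    return params.sigma * params.n_total
```

For the handover row this gives a threshold of about 10 explorations, so under these settings a handover decision tried fewer than 10 times out of 1000 would receive no exploration bonus.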
下面结合具体的例子,对上述方法进行详细描述。
示例1
图6和图7是通过通信系统的模型辅助探索的示例。
为了输出一个可靠通信决策,基于模型探索需要在输出前尽量多的模拟执行各种决策带来的性能增益,并以这些模拟结果作为依据最终输出可靠通信决策,当然,模拟次数多了对系统资源的消耗也会增加,这里需要做出权衡。假设基于模型探索在每次决策前先模拟N次探索,每次探索选择一个可探索决策(可以对应于上文的待探索的第一决策),并利用通信系统模型,根据输入的通信系统状态和选择的可探索决策,得到该可探索决策的估计性能,并将该估计性能累加到该可探索决策在探索过程中被选择的累加性能中,并且记录该可探索决策在N次探索中已经被选择了多少次;达到N次探索后,可以输出访问次数最多的可探索决策作为可靠通信决策,也可以综合考虑性能和访问次数来选择可靠通信决策。
例如,如图6所示,在当前通信系统状态下,存在可探索决策A、B、C、D,在某次探索中选择了可探索决策B,经过通信系统的模型分析,输出可探索决策B的一个估计性能,将得到的估计性能累加到性能B中,再对访问次数NB加1。
具体地,如图7所示,选择可探索决策B后,可以通过信道容量公式、信噪比公式、能量效率公式、频谱效率公式等通信公式计算在当前通信系统状态下选择可探索决策B导致的系统性能;也可以利用蒙特卡洛仿真器仿真可探索决策B在一般场景(例如,瑞利信道等)下的性能;也可以利用生成对抗网络(generative adversarial networks,GAN)作为场景模拟器,模拟特定的通信场景,结合仿真器,得到可探索决策B在特定场景下针对当前通信系统状态的性能。
由图6和图7可知,单次探索选择可探索决策的方式不是随机的(首次可以是随机的),而是需要根据各个可探索决策的累加性能和访问次数来选择决策。
示例2
以表2为例,对根据历史信息的查表法如何探索、输出可靠决策进行描述。
此时,由于访问次数不是基于模型探索一次次探索出来的,所以不能再以访问次数最大来选择可靠决策,而应选择平均性能最好的可探索决策,或者也可以综合考虑性能和探索次数来选择可靠通信决策。
以综合考虑性能和探索次数来选择可靠通信决策为例,在表2中,可探索决策A、B、C、D中B、D的平均性能更高,且被探索过的次数少,其中D的探索次数更少,更值得探索。可探索决策E、F的过往探索次数过少,基于上述公式1和2,可探索决策E、F探索项可能为零。最终通过上述公式1和2,确定使y的取值最大的为可探索决策D,因此输出可探索决策D。
值得注意的是,经典的UCB算法会选择可探索决策F,因为可探索决策F没有被探索过,所以UCB公式的值为无穷大,经典的UCB算法因此也被认为是一个乐观的算法,但在可靠通信中,这种盲目乐观可能导致系统崩溃,因此要对公式加以限制。
表2
Figure PCTCN2021131777-appb-000013
示例3
结合表3对采用历史信息的查表法进行多步探索进行描述。
假设任务为UE配对任务,需要从6个UE中选择3个。状态A下的“树形通信策略价值表”如表3所示。
表3
Figure PCTCN2021131777-appb-000014
Figure PCTCN2021131777-appb-000015
根据公式1-4进行多步探索后,多步决策顺序B、C、E会以更大概率被选择,即用户1、用户2、用户3被选择用于配对。
示例4
该示例为本申请的方法应用于信道编码场景的一个示例。
极化码(polar code)的嵌套(nested)码构造的探索是一个空间很大的决策树,此时,通信系统模型可以为信道编译码的蒙特卡洛仿真器,性能指标可以为-log(BLER),神经网络模型可用于决策树的泛化和剪枝。如图8所示,0,1,2,3,4为Polar的nested码构造的可靠度排序位置指示。随着码长增加,树会更深更广,此处仅示例一个Polar码嵌套构造的部分决策树。假设已知0为最可靠的信息比特位置,直接确定0为父节点,接下来,下一个信息比特的位置可以有多种选择,对应多个子节点,此时,可以通过蒙特卡洛仿真得到各个子节点的性能,本例中,就是分别仿真0->1、0->2、0->4序列的性能,根据得到的每个子节点的仿真性能,发现0->1这条路线的性能最好,我们接着选择1下面的决策,对比0->1->4和0->1->2的性能,发现0->1->2的性能更好,把0->1->2的性能和0->1->4的性能加到父节点1上。注意,0->1的性能好于0->2不代表0->1->2的性能一定好于0->2->1,为了最终的长序列整体性能最好,即找到平均性能最好的一条路线,也需要探索父节点2下面的决策。在父节点1和父节点2中做出探索决策的选择时,依据的指标就是前面所述平衡探索和利用的方法。
示例5
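Transmitting reliable exploration parameters in signaling incurs considerable overhead. To reduce this overhead, a mapping table can be designed and standardized, as shown in Table 4, in which each reliable exploration level corresponds to one set of reliable exploration parameters. The transmitting and receiving ends store the same mapping table, so when reliable exploration parameters are to be conveyed in signaling, the transmitter sends only the index of the reliable exploration level and the receiver obtains the corresponding parameters.
Likewise, to reduce signaling overhead, as shown in Table 5, a mapping table between tasks, reliable exploration levels, and performance metrics can be preset for certain specific tasks. When exploring for such a task, the transmitting and receiving ends need not transmit the reliable exploration level or the performance metric, and instead use the parameters that the table assigns to the task.
Table 4 and Table 5 are examples of initial reliable exploration level tables.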
Table 4
Reliable exploration level    Parameter σ    Total explorations N
Level 0                       0.01           10000
Level 1                       0.001          1000
Level 2                       0.0001         100
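Table 5
Reliable exploration task type    Performance metric     Reliable exploration level
1. Handover                       SNR                    Level 0
2. UE pairing                     Spectral efficiency    Level 1
3. Power control                  Spectral efficiency    Level 2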
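The overhead saving from Tables 4 and 5 can be sketched as follows: both ends hold the same tables, so only a level index (or nothing at all for a preset task) needs to be signaled. The one-byte encoding and all function names are assumptions made for illustration; the application does not define a wire format.

```python
# Table 4: reliable exploration level -> (sigma, total explorations N).
LEVELS = {0: (0.01, 10000), 1: (0.001, 1000), 2: (0.0001, 100)}

# Table 5: task -> (performance metric, reliable exploration level).
TASK_TABLE = {
    "handover": ("SNR", 0),
    "ue_pairing": ("spectral_efficiency", 1),
    "power_control": ("spectral_efficiency", 2),
}


def encode_level(level):
    """Sender side: signal only the level index (assumed one-byte field)."""
    if level not in LEVELS:
        raise ValueError(f"unknown reliable exploration level: {level}")
    return bytes([level])


def decode_level(payload):
    """Receiver side: recover (sigma, N) from the shared mapping table."""
    return LEVELS[payload[0]]


def params_for_task(task):
    """Preset tasks need no signaling: both ends read the shared tables."""
    metric, level = TASK_TABLE[task]
    sigma, n_total = LEVELS[level]
    return metric, sigma, n_total
```

Sending `encode_level(1)` costs a single byte, and the receiver recovers σ = 0.001 and N = 1000 from its own copy of Table 4.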
需要说明的是,图3至图8所示的方法可以由终端、接入网设备、或核心网设备来执行,也可以由终端、接入网设备、或核心网设备中的模块或单元(例如,芯片、电路、片上系统(system on chip,SOC)等)来执行,下面以由终端、接入网设备、或核心网设备来执行为例进行描述。
在本申请中,终端、接入网设备、核心网设备之间可以交互决策探索表、多步决策探索表、可靠探索参数表、用于可靠探索的神经网络参数等。
1)由终端选择决策
即图3至8所示的方法由终端来执行。
以切换任务为例,接入网设备需要协助终端在切换任务上探索更好的性能,所以接入网设备应对终端的可靠探索能力进行询问和鉴权(VIP客户)。终端对自身的任务可靠性有特定需求,接入网设备在探索参数的设置上应询问终端的特定设置,对于接入网设备覆盖范围内的高价值终端,接入网设备可以为其在核心网设备上注册可靠探索权限。例如,接入网设备覆盖范围内高频次出现的终端对于有效连续探索的帮助很大,可为这样的终端注册可靠探索权限,辅助接入网设备可靠探索性能提升、经验累积。
图9是在核心网设备为终端注册可靠探索能力的示意性流程图。
在步骤901中,接入网设备向终端发送第一消息,用于查询终端的可靠探索能力。相应地,终端接收来自接入网设备的第一消息。
这里的可靠探索能力可以理解为是否支持图3至图8所示的用于选择决策的方法。
可选地,上述的第一消息可以为系统信息块(system information block,SIB)或主信息块(master information block,MIB)消息。第一消息可以包括ueCapability字段、reliableSearchFlg字段、reliableLevel字段中的至少一个。
在步骤902中,终端向接入网设备发送第二消息,向接入网设备反馈自己的可靠探索 能力。相应地,接入网设备接收来自终端的第二消息。
在步骤903中,若终端反馈其具备可靠探索能力,接入网设备向核心网设备发送第三消息,用于请求为终端注册可靠探索能力。相应地,核心网设备接收来自接入网设备的第三消息。
在步骤904中,核心网设备完成注册,并向接入网设备发送第四消息,用于指示可靠探索能力注册完成。相应地,接入网设备接收核心网设备发送的第四消息。
在步骤905中,接入网设备向终端发送第五消息,用于指示可靠探索能力注册完成。相应地,终端接收接入网设备发送的第五消息。
这样就完成了在核心网设备为终端注册可靠探索能力。
可选地,核心网设备还可以对终端的可靠探索能力进行鉴权,确定是否允许终端探索决策。即在步骤904之前,还可以执行步骤906,当核心网设备允许终端探索决策时才执行步骤904,否则向终端反馈注册异常或注册失败等。
可选地,接入网设备还可以向终端查询探索决策使用的参数,以便接入网设备估计决策探索所需的时间,以便进行合理处理。具体地,可以执行步骤907-908。
在步骤907中,接入网设备向终端发送第六消息,用于查询探索决策使用的参数。
在步骤908中,在接收到第六消息后,终端可以向接入网设备发送第七消息,用于反馈探索决策使用的参数。
在一些实现方式中,终端可以从接入网设备获取探索决策所需的参数、通信系统模型、历史信息中的至少部分。例如,终端可以在图9所示的过程中从接入网设备获取上述内容。
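The registration exchange of FIG. 9 (steps 901 to 906) can be summarized as a message trace. This is a toy simulation, not a protocol implementation: the message labels follow the first to fifth messages of the description, while the function name, the trace strings, and the boolean inputs are hypothetical.

```python
def register_exploration_capability(terminal_capable, core_allows=True):
    """Return the ordered messages exchanged while registering a terminal.

    AN = access network device, UE = terminal, CN = core network device.
    """
    trace = ["first message: AN -> UE (query reliable exploration capability)"]
    trace.append("second message: UE -> AN (capability feedback)")
    if not terminal_capable:
        return trace  # nothing to register at the core network
    trace.append("third message: AN -> CN (request registration)")
    if not core_allows:
        # Optional authentication (step 906) rejected the terminal.
        trace.append("failure: registration abnormal or failed, fed back to UE")
        return trace
    trace.append("fourth message: CN -> AN (registration complete)")
    trace.append("fifth message: AN -> UE (registration complete)")
    return trace
```

A capable and authorized terminal produces all five messages, while a terminal without the capability stops after the second message.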
2)由接入网设备选择决策
即图3至8所示的方法由接入网设备来执行。
在步骤1001中,终端向接入网设备发送第八消息,用于请求开始决策探索。相应地,接入网设备接收终端发送的第八消息。
可选地,第八消息中可以包括触发决策探索的任务类型、通信系统的模型、神经网络模型参数等中的至少一项。例如,对于切换任务,第八消息的可靠探索任务列表中包括切换任务,可靠探索性能指标包括通信公式、仿真器类型等。
In step 1002, the access network device sends a ninth message to the terminal to indicate that decision exploration has started.
在步骤1003中,接入网设备根据图3至图8所示的方法进行决策探索,并将探索结果发送给终端。相应地,终端接收接入网设备发送的探索结果。
需要说明的是,对于多步探索,本申请不限定接入网设备反馈探索结果的方式。例如,接入网设备可以待多步探索结束后一次性将全部探索结果发送给终端。又例如,接入网设备可以每执行一步探索向终端发送一次探索结果,通过多次消息将多步探索的结果发送给终端。
在步骤1004中,终端在接收到所需探索结果后,向接入网设备发送第十消息,用于请求结束多步探索。相应地,接入网设备接收终端发送的第十消息。
在步骤1005中,接入网设备向终端发送第十一消息,通知终端决策探索结束。
在一些实现方式中,接入网设备可以根据任务类型,从其它接入网设备或核心网设备获取探索决策相关的支持信息,例如,通信模型需要的仿真器,仿真条件等、同类任务下的历史信息等。
例如,如图10所示:
在步骤1006中,接入网设备可以在接收到终端发送的第八消息后,向其他接入网设备发送第十二消息,用于获取通信模型需要的仿真器、仿真条件等。
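In step 1007, after receiving the eighth message sent by the terminal, the access network device may send a thirteenth message to the core network device to obtain historical information for tasks of the same type.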
在步骤1008中,其他接入网设备在接收到第十二消息后,向接入网设备发送第十四消息,用于反馈通信模型需要的仿真器、仿真条件等。
在步骤1009中,核心网设备在接收到第十三消息后,向接入网设备发送第十五消息,用于反馈同类任务下的历史信息。
可以理解的是,为了实现上述实施例中功能,终端、接入网设备、以及核心网设备包括了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本申请中所公开的实施例描述的各示例的单元及方法步骤,本申请能够以硬件或硬件和计算机软件相结合的形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用场景和设计约束条件。
图11和图12为本申请的实施例提供的可能的选择决策的装置的结构示意图。这些装置可以用于实现上述方法实施例中终端、接入网设备、或核心网设备的功能,因此也能实现上述方法实施例所具备的有益效果。在本申请的实施例中,该选择决策的装置可以是如图1所示的终端120或终端130,也可以是如图1所示的无线接入网设备110,还可以是核心网设备,还可以是应用于终端、接入网设备、或核心网设备的模块(如芯片)。
如图11所示,装置1100包括处理单元1110和收发单元1120。
当装置1100用于实现终端的功能时:处理单元1110可以用于执行步骤310-350,收发单元1120可以用于执行步骤901-902、905、907-908、1001-1005。
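When the apparatus 1100 is configured to implement the functions of the access network device, the processing unit 1110 may be configured to perform steps 310-350, and the transceiver unit 1120 may be configured to perform steps 901-905, 907-908, 1001, and 1006-1009.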
当装置1100用于实现核心网设备的功能时:处理单元1110可以用于执行步骤906,收发单元1120可以用于执行步骤903-904、1007、1009。
有关上述处理单元1110和收发单元1120更详细的描述可以直接参考方法实施例中相关描述直接得到,这里不加赘述。
如图12所示,装置1200包括处理器1210和接口电路1220。处理器1210和接口电路1220之间相互耦合。可以理解的是,接口电路1220可以为收发器或输入输出接口。可选的,装置1200还可以包括存储器1230,用于存储处理器1210执行的指令或存储处理器1210运行指令所需要的输入数据或存储处理器1210运行指令后产生的数据。
当装置1200用于实现方法侧实施例中的方法时,处理器1210用于执行上述处理单元1110的功能,接口电路1220用于执行上述收发单元1120的功能。
当上述装置为应用于终端的芯片时,该终端芯片实现上述方法实施例中终端的功能。例如,该终端芯片从终端中的其它模块(如射频模块或天线)接收信息,该信息是其他设备发送给终端的;或者,该终端芯片向终端中的其它模块(如射频模块或天线)发送信息,该信息是终端发送给其他设备的。
当上述装置为应用于接入网设备的芯片时,该接入网设备芯片实现上述方法实施例中 接入网设备的功能。例如,该接入网设备的芯片从接入网设备中的其它模块(如射频模块或天线)接收信息,该信息是其他设备发送给接入网设备的;或者,该接入网设备的芯片向接入网设备中的其它模块(如射频模块或天线)发送信息,该信息是接入网设备发送给其他设备的。
可以理解的是,本申请的实施例中的处理器可以是中央处理单元(Central Processing Unit,CPU),还可以是其它通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其它可编程逻辑器件、晶体管逻辑器件,硬件部件或者其任意组合。通用处理器可以是微处理器,也可以是任何常规的处理器。
本申请的实施例中的方法步骤可以通过硬件的方式来实现,也可以由处理器执行软件指令的方式来实现。软件指令可以由相应的软件模块组成,软件模块可以被存放于随机存取存储器(Random Access Memory,RAM)、闪存、只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)、寄存器、硬盘、移动硬盘、CD-ROM或者本领域熟知的任何其它形式的存储介质中。一种示例性的存储介质耦合至处理器,从而使处理器能够从该存储介质读取信息,且可向该存储介质写入信息。当然,存储介质也可以是处理器的组成部分。处理器和存储介质可以位于ASIC中。另外,该ASIC可以位于终端、接入网设备或核心网设备中。当然,处理器和存储介质也可以作为分立组件存在于终端、接入网设备或核心网设备中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序或指令。在计算机上加载和执行所述计算机程序或指令时,全部或部分地执行本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其它可编程装置。所述计算机程序或指令可以存储在计算机可读存储介质中,或者通过所述计算机可读存储介质进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是集成一个或多个可用介质的服务器等数据存储设备。所述可用介质可以是磁性介质,例如,软盘、硬盘、磁带;也可以是光介质,例如,DVD;还可以是半导体介质,例如,固态硬盘(solid state disk,SSD)。
在本申请的各个实施例中,如果没有特殊说明以及逻辑冲突,不同的实施例之间的术语和/或描述具有一致性、且可以相互引用,不同的实施例中的技术特征根据其内在的逻辑关系可以组合形成新的实施例。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。在本申请的文字描述中,字符“/”,一般表示前后关联对象是一种“或”的关系;在本申请的公式中,字符“/”,表示前后关联对象是一种“相除”的关系。
可以理解的是,在本申请的实施例中涉及的各种数字编号仅为描述方便进行的区分,并不用来限制本申请的实施例的范围。上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (60)

  1. 一种用于选择决策的方法,其特征在于,包括:
    获取通信系统的状态信息;
    根据所述状态信息,确定M个第一决策中每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,所述M个第一决策为在所述状态信息下可探索的决策,M为正整数;
    根据所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,从所述M个第一决策中确定目标第一决策。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    确定K个第二决策中每个第二决策对应的性能和/或所述每个第二决策被探索过的次数,所述K个第二决策为在选择所述目标第一决策后可探索的决策,K为正整数;
    根据所述每个第二决策对应的性能和/或所述每个第二决策被探索过的次数,从所述K个第二决策中确定目标第二决策。
  3. 根据权利要求1或2所述的方法,其特征在于,所述根据所述状态信息,确定M个第一决策中每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,包括:
    循环执行以下步骤N次,得到所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,N为大于1的整数:
    根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数,从所述M个第一决策中选择待探索的第一决策;
    根据所述状态信息、以及所述通信系统的模型,更新所述待探索的第一决策对应的性能;和/或,在所述待探索的第一决策的被探索次数上加1。
  4. The method according to claim 3, wherein the method further comprises:
    根据K个第二决策中每个第二决策当前对应的性能和/或所述每个第二决策当前被探索过的次数,从所述K个第二决策中选择待探索的第二决策,所述K个第二决策为在选择所述待探索的第一决策后可探索的决策;
    根据所述状态信息、以及所述通信系统的模型,更新所述待探索的第二决策对应的性能;和/或,在所述待探索的第二决策的被探索次数上加1;
    根据所述待探索的第二决策对应的性能更新所述待探索的第一决策对应的性能;和/或,在所述待探索的第一决策被探索次数上加1。
  5. 根据权利要求3或4所述的方法,其特征在于,所述根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数,从所述M个第一决策中选择待探索的第一决策,包括:
    根据所述每个第一决策当前对应的性能和/或所述每个第一决策当前被探索过的次数、以及所述每个第一决策的探索系数,从所述M个第一决策中选择所述待探索的第一决策,所述探索系数用于控制选择决策时的倾向。
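  6. The method according to claim 5, wherein the selecting the to-be-explored first decision from the M first decisions based on the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored, and the exploration coefficient of each first decision, comprises: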
    根据y 1=x 1+C 1·b 1,从所述M个第一决策中选择所述待探索的第一决策,所述待探索的第一决策对应的y 1的取值最大,其中,x 1为所述每个第一决策当前对应的性能的函数,C 1为所述每个第一决策的探索系数,且C 1为常数,b 1为所述每个第一决策当前被探索过的次数的倒数的函数。
  7. 根据权利要求6所述的方法,其特征在于,所述根据y 1=x 1+C 1·b 1,从所述M个第一决策中选择所述待探索的第一决策,包括:
    根据
    Figure PCTCN2021131777-appb-100001
    从所述M个第一决策中确定所述待探索的第一决策,其中,X 1d为所述M个第一决策中第d个第一决策当前对应的性能,N 1为所述M个第一决策当前被探索的总次数,N 1d为所述第d个第一决策当前被探索的次数。
  8. 根据权利要求1或2所述的方法,其特征在于,所述根据所述状态信息,确定M个第一决策中每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,包括:
    根据所述状态信息、以及历史信息,确定所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数,所述历史信息包括在所述状态信息下的所述每个第一决策对应的性能和/或所述每个第一决策被探索过的次数。
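  9. The method according to any one of claims 1 to 8, wherein the determining a target first decision from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored comprises:
    determining the target first decision from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored, and on an exploration coefficient of each first decision, wherein the exploration coefficient is used to control the tendency when selecting a decision.
  10. The method according to claim 9, wherein the selecting the target first decision from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored, and on the exploration coefficient of each first decision, comprises:
    selecting the target first decision from the M first decisions according to $y_2 = x_2 + C_2 \cdot b_2$, wherein the target first decision is the one for which $y_2$ is largest, $x_2$ is a function of the performance corresponding to each first decision, $C_2$ is the exploration coefficient of each first decision, and $b_2$ is a function of the reciprocal of the number of times each first decision has been explored.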
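  11. The method according to claim 10, wherein the selecting the target first decision from the M first decisions according to $y_2 = x_2 + C_2 \cdot b_2$ comprises:
    determining, according to
    Figure PCTCN2021131777-appb-100002
    the target first decision from the M first decisions, wherein $X_{2d}$ is the performance corresponding to the d-th first decision among the M first decisions, $N_2$ is the total number of times the M first decisions have been explored, $N_{2d}$ is the number of times the d-th first decision has been explored, and $C_2$ is a constant, $C_2$ varies with $N_{2d}$, or $C_2$ is determined by a neural network model.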
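  12. The method according to claim 11, wherein, if $C_2$ varies with $N_{2d}$, $C_2$ is 0 when $N_{2d}$ is less than a preset threshold.
  13. The method according to claim 12, wherein $C_2$ satisfies
    Figure PCTCN2021131777-appb-100003
    wherein $N_t$ is the preset threshold, $N_t = \sigma \cdot N_2$, and $\sigma$ is a preset constant.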
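The thresholded coefficient of claims 12 and 13 can be sketched as follows. Only the zero branch (when $N_{2d} < N_t$) and the relation $N_t = \sigma \cdot N_2$ are stated in the text; the value used above the threshold (taken here to be $C_r$), the UCB-style bonus, and the function names are assumptions, since the formula itself is an image.

```python
import math
from typing import List


def exploration_coefficient(n_2d: int, n_2: int, sigma: float, c_r: float) -> float:
    """C_2 is 0 below the threshold N_t = sigma * N_2; assumed to be C_r otherwise."""
    n_t = sigma * n_2
    return 0.0 if n_2d < n_t else c_r


def select_target(
    performance: List[float], counts: List[int], sigma: float, c_r: float
) -> int:
    """Pick the target first decision maximizing X_2d + C_2 * b_2."""
    n_2 = sum(counts)  # N_2: total explorations over all first decisions

    def score(d: int) -> float:
        n_2d = counts[d]
        if n_2d == 0:
            return -math.inf  # assumption: never-explored decisions are skipped
        c_2 = exploration_coefficient(n_2d, n_2, sigma, c_r)
        # Assumed bonus b_2 = sqrt(ln(N_2) / N_2d), a function of 1 / N_2d.
        return performance[d] + c_2 * math.sqrt(math.log(n_2) / n_2d)

    return max(range(len(counts)), key=score)
```

Under this reading, a decision explored only a handful of times receives no bonus in the final choice, so it is judged on its estimated performance alone.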
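  14. The method according to claim 13, wherein the method further comprises:
    training the neural network model based on the state information and the target first decision, wherein the neural network model is used to output $C_r$.
  15. The method according to claim 13 or 14, wherein the method further comprises:
    obtaining parameters used for exploring decisions, wherein the parameters comprise at least one of a performance indicator, $N_2$, $C_2$, $C_r$, $\sigma$, and $N_t$.
  16. The method according to claim 15, wherein the obtaining parameters used for exploring decisions comprises:
    obtaining the parameters based on a task type.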
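  17. The method according to any one of claims 1 to 16, wherein the method further comprises:
    obtaining support information, wherein the support information is used to determine the performance corresponding to the first decision, the support information comprises at least one of a simulator of the model of the communication system, simulation conditions, and history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of the communication system.
  18. The method according to any one of claims 13 to 16, wherein the method further comprises:
    receiving a sixth message sent by an access network device, wherein the sixth message is used to query the parameters used for exploring decisions, and the parameters comprise at least one of $N_2$, $C_2$, $C_r$, $\sigma$, and $N_t$; and
    sending a seventh message to the access network device, wherein the seventh message is used to indicate the parameters.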
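  19. The method according to any one of claims 1 to 18, wherein the method further comprises:
    receiving a first message sent by an access network device, wherein the first message is used to query whether a capability of exploring decisions is available;
    sending a second message to the access network device, wherein the second message is used to indicate that the capability of exploring decisions is available; and
    receiving a fifth message sent by the access network device, wherein the fifth message is used to indicate that registration of the decision exploration capability with a core network device is complete.
  20. The method according to any one of claims 1 to 17, wherein the method further comprises:
    receiving an eighth message sent by a terminal, wherein the eighth message is used to request starting to explore decisions;
    sending an exploration result to the terminal, wherein the exploration result comprises information about the target first decision; and
    receiving a tenth message sent by the terminal, wherein the tenth message is used to request ending the exploration of decisions.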
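  21. A method for selecting a decision, comprising:
    sending a first message to a terminal, wherein the first message is used to query whether a capability of exploring decisions is available;
    receiving a second message sent by the terminal, wherein the second message is used to indicate that the capability of exploring decisions is available;
    sending a third message to a core network device, wherein the third message is used to request registration of the decision exploration capability;
    receiving a fourth message sent by the core network device, wherein the fourth message is used to indicate that registration of the decision exploration capability is complete; and
    sending a fifth message to the terminal, wherein the fifth message is used to indicate that registration of the decision exploration capability with the core network device is complete.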
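The five-message registration exchange of claim 21 can be traced from the access network device's side with a small sketch. The class name, message labels, and the modeling of the received second and fourth messages as boolean arguments are hypothetical simplifications; no real signaling format is implied.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AccessNetworkDevice:
    """Records the messages sent during capability registration."""

    log: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, peer: str, message: str) -> None:
        self.log.append((peer, message))

    def register_terminal(self, terminal_capable: bool, core_accepts: bool) -> bool:
        # First message: ask the terminal whether it can explore decisions.
        self.send("terminal", "first_message: capability query")
        if not terminal_capable:  # no second message indicating the capability
            return False
        # Third message: ask the core network device to register the capability.
        self.send("core", "third_message: registration request")
        if not core_accepts:  # no fourth message confirming registration
            return False
        # Fifth message: tell the terminal that registration is complete.
        self.send("terminal", "fifth_message: registration complete")
        return True
```

Calling `AccessNetworkDevice().register_terminal(True, True)` walks through all three outgoing messages; a terminal without the capability stops the exchange after the first query.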
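  22. The method according to claim 21, wherein the method further comprises:
    sending a sixth message to the terminal, wherein the sixth message is used to query parameters used for exploring decisions;
    receiving a seventh message sent by the terminal, wherein the seventh message is used to indicate the parameters; and
    estimating, based on the parameters, a time for exploring decisions.
  23. The method according to claim 21 or 22, wherein the method further comprises:
    sending support information to the terminal, wherein the support information comprises at least one of a simulator of a communication system model, simulation conditions, and history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of the communication system.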
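  24. A method for selecting a decision, comprising:
    sending an eighth message to an access network device, wherein the eighth message is used to request starting to explore decisions;
    receiving an exploration result sent by the access network device; and
    sending a tenth message to the access network device, wherein the tenth message is used to request ending the exploration of decisions.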
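  25. A method for selecting a decision, comprising:
    receiving a thirteenth message sent by an access network device, wherein the thirteenth message is used to request history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of a communication system; and
    sending a fifteenth message to the access network device, wherein the fifteenth message is used to indicate the history information.
  26. The method according to claim 25, wherein the method further comprises:
    receiving a third message sent by the access network device, wherein the third message is used to request registration of a decision exploration capability for a terminal; and
    sending a fourth message to the access network device, wherein the fourth message is used to indicate that registration of the decision exploration capability is complete.
  27. The method according to claim 25 or 26, wherein, before the sending a fourth message to the access network device, the method further comprises:
    determining that the terminal is allowed to explore decisions.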
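  28. A communication apparatus, wherein the apparatus comprises:
    a processing unit, configured to: obtain state information of a communication system; determine, based on the state information, a performance corresponding to each of M first decisions and/or a number of times each first decision has been explored, wherein the M first decisions are decisions that can be explored under the state information, and M is a positive integer; and determine a target first decision from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored.
  29. The apparatus according to claim 28, wherein
    the processing unit is further configured to: determine a performance corresponding to each of K second decisions and/or a number of times each second decision has been explored, wherein the K second decisions are decisions that can be explored after the target first decision is selected, and K is a positive integer; and determine a target second decision from the K second decisions based on the performance corresponding to each second decision and/or the number of times each second decision has been explored.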
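  30. The apparatus according to claim 28 or 29, wherein
    the processing unit is specifically configured to perform the following steps N times in a loop to obtain the performance corresponding to each first decision and/or the number of times each first decision has been explored, wherein N is an integer greater than 1: select a to-be-explored first decision from the M first decisions based on the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored; and update, based on the state information and a model of the communication system, the performance corresponding to the to-be-explored first decision; and/or increment by 1 the number of times the to-be-explored first decision has been explored.
  31. The apparatus according to claim 30, wherein
    the processing unit is further configured to: select a to-be-explored second decision from K second decisions based on the performance currently corresponding to each of the K second decisions and/or the number of times each second decision has currently been explored, wherein the K second decisions are decisions that can be explored after the to-be-explored first decision is selected; update, based on the state information and the model of the communication system, the performance corresponding to the to-be-explored second decision; and/or increment by 1 the number of times the to-be-explored second decision has been explored; and update the performance corresponding to the to-be-explored first decision based on the performance corresponding to the to-be-explored second decision; and/or increment by 1 the number of times the to-be-explored first decision has been explored.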
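  32. The apparatus according to claim 30 or 31, wherein
    the processing unit is specifically configured to select the to-be-explored first decision from the M first decisions based on the performance currently corresponding to each first decision and/or the number of times each first decision has currently been explored, and on an exploration coefficient of each first decision, wherein the exploration coefficient is used to control the tendency when selecting a decision.
  33. The apparatus according to claim 32, wherein
    the processing unit is specifically configured to select the to-be-explored first decision from the M first decisions according to $y_1 = x_1 + C_1 \cdot b_1$, wherein the to-be-explored first decision is the one for which $y_1$ is largest, $x_1$ is a function of the performance currently corresponding to each first decision, $C_1$ is the exploration coefficient and is a constant, and $b_1$ is a function of the reciprocal of the number of times each first decision has currently been explored.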
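  34. The apparatus according to claim 33, wherein
    the processing unit is specifically configured to determine, according to
    Figure PCTCN2021131777-appb-100004
    the to-be-explored first decision from the M first decisions, wherein $X_{1d}$ is the performance currently corresponding to the d-th first decision among the M first decisions, $N_1$ is the total number of times the M first decisions have currently been explored, and $N_{1d}$ is the number of times the d-th first decision has currently been explored.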
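  35. The apparatus according to claim 28 or 29, wherein
    the processing unit is specifically configured to determine, based on the state information and history information, the performance corresponding to each first decision and/or the number of times each first decision has been explored, wherein the history information comprises the performance corresponding to each first decision and/or the number of times each first decision has been explored under the state information.
  36. The apparatus according to any one of claims 28 to 35, wherein
    the processing unit is specifically configured to determine the target first decision from the M first decisions based on the performance corresponding to each first decision and/or the number of times each first decision has been explored, and on an exploration coefficient of each first decision, wherein the exploration coefficient is used to control the tendency when selecting a decision.
  37. The apparatus according to claim 36, wherein
    the processing unit is specifically configured to select the target first decision from the M first decisions according to $y_2 = x_2 + C_2 \cdot b_2$, wherein the target first decision is the one for which $y_2$ is largest, $x_2$ is a function of the performance corresponding to each first decision, $C_2$ is the exploration coefficient, and $b_2$ is a function of the reciprocal of the number of times each first decision has been explored.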
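  38. The apparatus according to claim 37, wherein
    the processing unit is specifically configured to determine, according to
    Figure PCTCN2021131777-appb-100005
    the target first decision from the M first decisions, wherein $X_{2d}$ is the performance corresponding to the d-th first decision among the M first decisions, $N_2$ is the total number of times the M first decisions have been explored, $N_{2d}$ is the number of times the d-th first decision has been explored, and $C_2$ is a constant, $C_2$ varies with $N_{2d}$, or $C_2$ is determined by a neural network model.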
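  39. The apparatus according to claim 38, wherein,
    if $C_2$ varies with $N_{2d}$, $C_2$ is 0 when $N_{2d}$ is less than a preset threshold.
  40. The apparatus according to claim 39, wherein
    $C_2$ satisfies
    Figure PCTCN2021131777-appb-100006
    wherein $N_t$ is the preset threshold, $N_t = \sigma \cdot N_2$, and $\sigma$ is a preset constant.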
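  41. The apparatus according to claim 40, wherein
    the processing unit is further configured to train the neural network model based on the state information and the target first decision, wherein the neural network model is used to output $C_r$.
  42. The apparatus according to claim 40 or 41, wherein
    the processing unit is further configured to obtain parameters used for exploring decisions, wherein the parameters comprise at least one of a performance indicator, $N_2$, $C_2$, $C_r$, $\sigma$, and $N_t$.
  43. The apparatus according to claim 42, wherein
    the processing unit is specifically configured to obtain the parameters based on a task type.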
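  44. The apparatus according to any one of claims 28 to 43, wherein
    the processing unit is further configured to obtain support information, wherein the support information is used to determine the performance corresponding to the first decision, the support information comprises at least one of a simulator of the model of the communication system, simulation conditions, and history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of the communication system.
  45. The apparatus according to any one of claims 40 to 43, wherein
    the apparatus further comprises a transceiver unit, configured to: receive a sixth message sent by an access network device, wherein the sixth message is used to query the parameters used for exploring decisions, and the parameters comprise at least one of a performance indicator, $N_2$, $C_2$, $C_r$, $\sigma$, and $N_t$; and send a seventh message to the access network device, wherein the seventh message is used to indicate the parameters.
  46. The apparatus according to claim 45, wherein
    the transceiver unit is further configured to: receive a first message sent by the access network device, wherein the first message is used to query whether a capability of exploring decisions is available; send a second message to the access network device, wherein the second message is used to indicate that the capability of exploring decisions is available; and receive a fifth message sent by the access network device, wherein the fifth message is used to indicate that registration of the decision exploration capability with a core network device is complete.
  47. The apparatus according to claim 45 or 46, wherein
    the transceiver unit is further configured to: receive an eighth message sent by a terminal, wherein the eighth message is used to request starting to explore decisions; send an exploration result to the terminal, wherein the exploration result comprises information about the target first decision; and receive a tenth message sent by the terminal, wherein the tenth message is used to request ending the exploration of decisions.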
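  48. A communication apparatus, wherein the apparatus comprises:
    a transceiver unit, configured to: send a first message to a terminal, wherein the first message is used to query whether a capability of exploring decisions is available; receive a second message sent by the terminal, wherein the second message is used to indicate that the capability of exploring decisions is available; send a third message to a core network device, wherein the third message is used to request registration of the decision exploration capability; receive a fourth message sent by the core network device, wherein the fourth message is used to indicate that registration of the decision exploration capability is complete; and send a fifth message to the terminal, wherein the fifth message is used to indicate that registration of the decision exploration capability with the core network device is complete.
  49. The apparatus according to claim 48, wherein
    the transceiver unit is further configured to: send a sixth message to the terminal, wherein the sixth message is used to query parameters used for exploring decisions; and receive a seventh message sent by the terminal, wherein the seventh message is used to indicate the parameters; and
    the apparatus further comprises a processing unit, configured to estimate, based on the parameters, a time for exploring decisions.
  50. The apparatus according to claim 48 or 49, wherein
    the transceiver unit is further configured to send support information to the terminal, wherein the support information comprises at least one of a simulator of a communication system model, simulation conditions, and history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of the communication system.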
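  51. A communication apparatus, wherein the apparatus comprises:
    a transceiver unit, configured to: send an eighth message to an access network device, wherein the eighth message is used to request starting to explore decisions; receive an exploration result sent by the access network device; and send a tenth message to the access network device, wherein the tenth message is used to request ending the exploration of decisions.
  52. A communication apparatus, wherein the apparatus comprises:
    a transceiver unit, configured to: receive a thirteenth message sent by an access network device, wherein the thirteenth message is used to request history information, and the history information comprises the performance corresponding to each decision and/or the number of times each decision has been explored under different states of a communication system; and send a fifteenth message to the access network device, wherein the fifteenth message is used to indicate the history information.
  53. The apparatus according to claim 52, wherein
    the transceiver unit is further configured to: receive a third message sent by the access network device, wherein the third message is used to request registration of a decision exploration capability for a terminal; and send a fourth message to the access network device, wherein the fourth message is used to indicate that registration of the decision exploration capability is complete.
  54. The apparatus according to claim 52 or 53, wherein
    the apparatus further comprises a processing unit, configured to determine, before the fourth message is sent to the access network device, that the terminal is allowed to explore decisions.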
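  55. An apparatus for selecting a decision, comprising at least one processor, wherein the at least one processor is coupled to at least one memory, and the at least one processor is configured to execute a computer program or instructions stored in the at least one memory, so that the apparatus performs the method according to any one of claims 1 to 27.
  56. A chip, comprising a logic circuit and a communication interface, wherein the communication interface is configured to receive data and/or information to be processed, the logic circuit is configured to perform the data and/or information processing according to any one of claims 1 to 27, and the communication interface is further configured to output a processing result obtained by the logic circuit.
  57. A computer-readable storage medium, storing computer instructions, wherein when the computer instructions are run on a computer, the method according to any one of claims 1 to 27 is implemented.
  58. A computer program product, wherein the computer program product comprises a computer program or instructions, and when the computer program or instructions are run on a computer, the method according to any one of claims 1 to 27 is performed.
  59. A computer program, wherein when the computer program is run on a computer, the method according to any one of claims 1 to 27 is performed.
  60. A wireless communication system, wherein the wireless communication system comprises the apparatus according to any one of claims 28 to 54.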
PCT/CN2021/131777 2020-11-23 2021-11-19 Method and apparatus for selecting a decision Ceased WO2022105876A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21894023.7A EP4228330A4 (en) 2020-11-23 2021-11-19 DECISION SELECTION METHOD AND APPARATUS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011322773.3 2020-11-23
CN202011322773.3A CN114615680A (zh) 2020-11-23 2020-11-23 Method and apparatus for selecting a decision

Publications (1)

Publication Number Publication Date
WO2022105876A1 true WO2022105876A1 (zh) 2022-05-27

Family

ID=81708391

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131777 Ceased WO2022105876A1 (zh) 2020-11-23 2021-11-19 Method and apparatus for selecting a decision

Country Status (3)

Country Link
EP (1) EP4228330A4 (zh)
CN (1) CN114615680A (zh)
WO (1) WO2022105876A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024131885A1 (zh) * 2022-12-22 2024-06-27 北京紫光展锐通信技术有限公司 Communication method and apparatus, terminal device, network device, and chip

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
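CN1571355A (zh) * 2003-07-16 2005-01-26 华为技术有限公司 Method for updating user terminal capability information
CN101299748A (zh) * 2007-04-30 2008-11-05 华为技术有限公司 Method, system, and apparatus for applying terminal capability information in IPTV services
WO2013049996A1 (zh) * 2011-10-08 2013-04-11 中国移动通信集团公司 Access network selection method, user equipment, system, and network selection policy unit
CN109445947A (zh) * 2018-11-07 2019-03-08 东软集团股份有限公司 Resource allocation processing method, apparatus, device, and storage medium
CN110519816A (zh) * 2019-08-22 2019-11-29 普联技术有限公司 Wireless roaming control method, apparatus, storage medium, and terminal device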
US20200076857A1 (en) * 2018-08-31 2020-03-05 Microsoft Technology Licensing, Llc Secure exploration for reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10225772B2 (en) * 2017-06-22 2019-03-05 At&T Intellectual Property I, L.P. Mobility management for wireless communication networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4228330A4

Also Published As

Publication number Publication date
EP4228330A1 (en) 2023-08-16
CN114615680A (zh) 2022-06-10
EP4228330A4 (en) 2024-07-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21894023

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021894023

Country of ref document: EP

Effective date: 20230511

NENP Non-entry into the national phase

Ref country code: DE