WO2023222086A1 - Method for establishing an optical path and related device - Google Patents
Method for establishing an optical path and related device
- Publication number
- WO2023222086A1 (PCT/CN2023/095073)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- optical network
- optical
- switching device
- subnet
- wavelength
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/0201—Add-and-drop multiplexing
- H04J14/0202—Arrangements therefor
- H04J14/021—Reconfigurable arrangements, e.g. reconfigurable optical add/drop multiplexers [ROADM] or tunable optical add/drop multiplexers [TOADM]
- H04J14/0212—Reconfigurable arrangements, e.g. reconfigurable optical add/drop multiplexers [ROADM] or tunable optical add/drop multiplexers [TOADM] using optical switches or wavelength selective switches [WSS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q11/00—Selecting arrangements for multiplex systems
- H04Q11/0001—Selecting arrangements for multiplex systems using optical switching
- H04Q11/0005—Switch and router aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/0227—Operation, administration, maintenance or provisioning [OAMP] of WDM networks, e.g. media access, routing or wavelength allocation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/0227—Operation, administration, maintenance or provisioning [OAMP] of WDM networks, e.g. media access, routing or wavelength allocation
- H04J14/0241—Wavelength allocation for communications one-to-one, e.g. unicasting wavelengths
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/0227—Operation, administration, maintenance or provisioning [OAMP] of WDM networks, e.g. media access, routing or wavelength allocation
- H04J14/0254—Optical medium access
- H04J14/0267—Optical signaling or routing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/0227—Operation, administration, maintenance or provisioning [OAMP] of WDM networks, e.g. media access, routing or wavelength allocation
- H04J14/0254—Optical medium access
- H04J14/0267—Optical signaling or routing
- H04J14/0268—Restoration of optical paths, e.g. p-cycles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04J—MULTIPLEX COMMUNICATION
- H04J14/00—Optical multiplex systems
- H04J14/02—Wavelength-division multiplex systems
- H04J14/03—WDM arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/02—Topology update or discovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/62—Wavelength based
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q11/00—Selecting arrangements for multiplex systems
- H04Q11/0001—Selecting arrangements for multiplex systems using optical switching
- H04Q11/0005—Switch and router aspects
- H04Q2011/0007—Construction
- H04Q2011/0026—Construction using free space propagation (e.g. lenses, mirrors)
- H04Q2011/003—Construction using free space propagation (e.g. lenses, mirrors) using switches based on microelectro-mechanical systems [MEMS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q11/00—Selecting arrangements for multiplex systems
- H04Q11/0001—Selecting arrangements for multiplex systems using optical switching
- H04Q11/0005—Switch and router aspects
- H04Q2011/0037—Operation
- H04Q2011/005—Arbitration and scheduling
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04Q—SELECTING
- H04Q2213/00—Indexing scheme relating to selecting arrangements in general and for multiplex systems
- H04Q2213/1301—Optical transmission, optical switches
Definitions
- the present application relates to the field of computer technology, and in particular to a method and system for establishing an optical path, as well as an optical network controller, an optical network switching device, a computer-readable storage medium, and a computer program product.
- Data Driven Architecture, also known as Memory Centric Architecture or Disaggregated Shared Memory architecture, is an architecture-level innovation in the computing industry and a leading research hotspot in the industry.
- the data-driven computing architecture builds a memory interconnection network (memory fabric) cluster on new memory semantic networks, such as the compute express link (CXL) memory semantic network, the coherent accelerator processor interface (CAPI) memory semantic network, or the GenZ memory semantic network, to realize hierarchical memory resource pooling and global sharing.
- CXL: compute express link
- CAPI: coherent accelerator processor interface
- GenZ: GenZ memory semantic network
- Memory interconnection network clusters require efficient, stable, and reliable cross-node memory access.
- memory interconnection network clusters usually require ultra-high bandwidth (Tbps level) and ultra-low average latency and long-tail latency (<1 µs, that is, at the level of hundreds of nanoseconds).
- Tbps level: ultra-high bandwidth
- <1 µs: hundreds of nanoseconds level
- the bandwidth can be at the terabits per second (Tbps) level
- the average delay or long-tail delay can be at the level of hundreds of nanoseconds, that is, less than 1 microsecond (µs).
- OneHop networking refers to a networking method in which computing nodes are connected to the same switching device, such as a switch, so that data can be exchanged between computing nodes through the switch.
- This application provides a method for establishing optical paths. The method uses wavelength splitting to establish different optical paths between the optical network switching equipment and the computing nodes in a subnet of a memory interconnection network cluster, so that the optical network switching equipment can distinguish optical signals by wavelength and select the corresponding optical path according to the wavelength of each optical signal, passing the optical signal directly to the computing node. This avoids congestion caused by burst traffic, shortens long-tail delays, and ensures the performance of the memory interconnection network cluster.
- This application also provides the above-mentioned corresponding systems, optical network controllers, optical network switching equipment, computer-readable storage media, and computer program products.
- this application provides a method for establishing an optical path.
- This method is applied to optical networks.
- the optical network is specifically an optical transport network, for example, a network that uses optical cross-connections to transmit optical signals.
- The optical network includes an optical network controller and optical network switching equipment. The optical network switching equipment is used to exchange data for the computing nodes in the cluster (i.e., the memory interconnection network cluster mentioned above).
- the optical network controller obtains the topology structure of the cluster's subnet, which records the addresses of the computing nodes in the subnet.
- the optical network controller determines the wavelength splitting configuration based on the topology structure and provides the wavelength splitting configuration to the optical network switching device.
- the wavelength splitting configuration is used to allocate different wavelengths to computing nodes in the subnet.
- the optical network switching device establishes optical paths between the optical network switching device and the computing nodes in the subnet according to the wavelength splitting configuration.
- optical paths between the optical network switching equipment and different computing nodes in the subnet are established based on the wavelength splitting configuration, so that the optical network switching equipment can distinguish optical signals by wavelength, select the corresponding optical path according to the wavelength of each optical signal, and pass the optical signal directly to the computing node. This avoids congestion caused by burst traffic, shortens long-tail delays, and ensures cluster performance.
- the subnet includes N computing nodes, where N is greater than 1.
- the target computing node can be any one of the N computing nodes. The optical network controller can determine a sub-range from the wavelength range (the usable wavelength range, such as the wavelength range of visible light) and sample N-1 wavelengths from the sub-range. The optical network controller then determines the wavelength splitting configuration based on the addresses of the N-1 computing nodes in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device connected to the N-1 computing nodes, and the ingress port of the optical network switching device connected to the target computing node.
- the wavelength range: the usable wavelength range, such as the wavelength range of visible light
- the wavelength splitting configuration can specifically be the correspondence among the address of a computing node in the subnet, the wavelength of the optical signal that the computing node is allowed to receive (a wavelength sampled from the sub-range), and the ingress port and egress port of the optical network switching device.
- the optical network controller determines the sub-range from the wavelength range and samples several wavelengths from the sub-range. Based on the sampled wavelengths, the addresses of the computing nodes in the subnet, and the ingress and egress ports of the optical network switching equipment, it determines the wavelength splitting configuration, achieving fine-grained wavelength splitting.
- an optical signal received at the ingress port can be passed directly from the corresponding egress port to the computing node through the corresponding optical path according to its wavelength, without complex path calculation, which shortens the delay, improves transmission efficiency, and ensures cluster performance.
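The configuration step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary layout, the even spacing of sampled wavelengths, and the 1530 to 1560 nm sub-range are all assumptions chosen for the example.

```python
def sample_wavelengths(lo_nm, hi_nm, count):
    """Sample `count` wavelengths from the sub-range [lo_nm, hi_nm].

    Even spacing is an assumption; the source only says wavelengths
    are sampled from the sub-range.
    """
    if count <= 0:
        return []
    if count == 1:
        return [lo_nm]
    step = (hi_nm - lo_nm) / (count - 1)
    return [lo_nm + i * step for i in range(count)]


def build_split_config(target, node_ports, sub_range):
    """Build the wavelength splitting configuration for one target node.

    node_ports maps a node address to the switch port it is attached to
    (hypothetical layout). Each of the other N-1 nodes gets its own
    wavelength, the ingress port of the target, and its own egress port.
    """
    others = [addr for addr in node_ports if addr != target]
    wavelengths = sample_wavelengths(sub_range[0], sub_range[1], len(others))
    return [
        {
            "address": addr,
            "wavelength_nm": wl,
            "ingress_port": node_ports[target],
            "egress_port": node_ports[addr],
        }
        for addr, wl in zip(others, wavelengths)
    ]


# Illustrative subnet of N = 4 nodes; H1 is the target (sending) node.
node_ports = {"H1": 1, "H2": 2, "H3": 3, "H4": 4}
config = build_split_config("H1", node_ports, (1530.0, 1560.0))
for entry in config:
    print(entry)
```

Repeating the call with each node in turn as the target would yield the entries for every sender in the subnet.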
- the optical network switching device is an optical cross-connect switch, and the optical cross-connect switch includes a wavelength selective switch.
- the optical network switching device can establish an optical path between the optical network switching device and the computing node in the subnet through the wavelength selective switch according to the wavelength splitting configuration.
- the optical network switching device can automatically adjust the optical path through the wavelength selective switch according to the wavelength splitting configuration, so that after an optical signal reaches the optical network switching device, it is automatically transmitted along the optical path to the corresponding computing node, without complex physical connections having to be made on the ports in advance. This not only solves the congestion problem but also simplifies networking and improves the user experience.
- the computing nodes in the subnet are configured with optical network adaptation equipment, and the optical network adaptation equipment is used to access the optical network.
- the optical network controller may also provide the wavelength splitting configuration to the optical network adaptation device, so that the optical network adaptation device converts the electrical signal into an optical signal of the corresponding wavelength according to the wavelength splitting configuration.
- the electrical signal can be converted into an optical signal of the corresponding wavelength, and then the optical network switching equipment will pass the optical signal of the above wavelength directly to the computing node through the corresponding optical path. This improves transmission efficiency, shortens latency, and ensures cluster performance.
- the subnet includes a first computing node and a second computing node.
- the optical network adaptation device configured on the first computing node can convert the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength according to the wavelength splitting configuration, and send the optical signal to the optical network switching device. Then, the optical network switching device transmits the optical signal to the second computing node through the optical path between the optical network switching device and the second computing node.
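The send path from the first computing node to the second can be simulated with two lookup tables. This is a sketch under stated assumptions: the table names, node names, port numbers, and wavelengths are hypothetical, and real devices act on light, not on Python dictionaries.

```python
# Held by the adaptation device of H1: destination address -> wavelength (nm).
ADAPTER_TABLE = {"H2": 1530.0, "H3": 1545.0}

# Held by the switching device: (ingress port, wavelength) -> egress port.
SWITCH_TABLE = {(1, 1530.0): 2, (1, 1545.0): 3}

# Physical cabling (assumed): switch port -> attached computing node.
PORT_TO_NODE = {1: "H1", 2: "H2", 3: "H3"}


def adapter_send(dest_address, payload):
    """Model electrical-to-optical conversion: pick the wavelength for dest."""
    return {"wavelength_nm": ADAPTER_TABLE[dest_address], "payload": payload}


def switch_forward(ingress_port, signal):
    """Select the optical path purely from the ingress port and wavelength."""
    egress_port = SWITCH_TABLE[(ingress_port, signal["wavelength_nm"])]
    return PORT_TO_NODE[egress_port]


signal = adapter_send("H3", b"gradient-chunk")
receiver = switch_forward(1, signal)
print(receiver)
```

Note that the switch never inspects the payload or the destination address; the wavelength alone determines the egress port.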
- the optical network controller may provide the first configuration information in the wavelength splitting configuration to the optical network switching device, and provide the second configuration information in the wavelength splitting configuration to the optical network adaptation device.
- the first configuration information includes the correspondence between the wavelength of the optical signal that a computing node in the subnet is allowed to receive and the ingress port and egress port of the optical network switching device
- the second configuration information includes the correspondence between the address of a computing node in the subnet and the wavelength of the optical signal that the computing node is allowed to receive. This can reduce transmission overhead and reduce costs.
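Splitting the full configuration into the two parts described above might look like the following sketch. The entry layout and the function name `split_config` are assumptions for illustration; the source does not specify a data format.

```python
def split_config(entries):
    """Split full entries into the switch's part and the adapter's part.

    first_info:  wavelength <-> (ingress port, egress port), for the switch.
    second_info: node address <-> wavelength, for the adaptation device.
    """
    first_info = [
        {
            "wavelength_nm": e["wavelength_nm"],
            "ingress_port": e["ingress_port"],
            "egress_port": e["egress_port"],
        }
        for e in entries
    ]
    second_info = {e["address"]: e["wavelength_nm"] for e in entries}
    return first_info, second_info


# Hypothetical full configuration for sender H1.
full_config = [
    {"address": "H2", "wavelength_nm": 1530.0, "ingress_port": 1, "egress_port": 2},
    {"address": "H3", "wavelength_nm": 1545.0, "ingress_port": 1, "egress_port": 3},
]
first_info, second_info = split_config(full_config)
print(first_info)
print(second_info)
```

Each device receives only the fields it needs, which is the source of the reduced transmission overhead mentioned above.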
- the optical network controller can uniformly provide relatively complete wavelength splitting configurations to optical network switching equipment and optical network adaptation equipment.
- the optical network controller can provide optical network switching equipment and optical network adaptation equipment.
- the cluster may be a high-performance computing cluster.
- High-performance computing clusters are equipped with job schedulers.
- the job scheduler can generate the topology structure of the subnet according to the scheduling policy
- the optical network controller can receive the topology structure of the subnet generated by the job scheduler according to the scheduling policy.
- the optical network controller can provide a northbound application programming interface
- the job scheduler can call the northbound application programming interface to deliver the subnet topology to the optical network controller.
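The interaction between the job scheduler and the controller's northbound interface can be sketched as below. The class and method names (`OpticalNetworkController.submit_subnet_topology`, `JobScheduler.schedule`) and the first-fit scheduling policy are hypothetical; the source does not define the interface.

```python
class OpticalNetworkController:
    """Toy controller exposing one northbound call (hypothetical API)."""

    def __init__(self):
        self.topologies = {}

    def submit_subnet_topology(self, subnet_id, node_addresses):
        # The topology records the addresses of the computing nodes.
        if len(node_addresses) < 2:
            raise ValueError("a subnet needs more than one computing node")
        self.topologies[subnet_id] = list(node_addresses)
        return {"subnet_id": subnet_id, "accepted": True}


class JobScheduler:
    """Toy scheduler: picks the first free nodes (assumed policy)."""

    def __init__(self, controller, free_nodes):
        self.controller = controller
        self.free_nodes = list(free_nodes)

    def schedule(self, job_id, nodes_needed):
        chosen = self.free_nodes[:nodes_needed]
        self.free_nodes = self.free_nodes[nodes_needed:]
        # Deliver the generated subnet topology through the northbound call.
        return self.controller.submit_subnet_topology(job_id, chosen)


controller = OpticalNetworkController()
scheduler = JobScheduler(controller, ["H1", "H2", "H3", "H4", "H5"])
reply = scheduler.schedule("job-42", 3)
print(reply, controller.topologies)
```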
- jobs can be scheduled for execution on computing nodes in the subnet of the high-performance computing cluster, making full use of the resources of the high-performance computing cluster and improving resource utilization.
- the optical network controller can also be connected to the cloud platform.
- the optical network controller may receive the topology structure of the subnet of the cluster sent by the infrastructure as a service IaaS layer network management of the cloud platform. This can achieve cloud-based scheduling of high-performance computing clusters, further improve resource utilization and reduce costs.
- the underlying overall architecture of the optical network does not require any changes. Only a small amount of adaptation is needed for the IaaS layer network management, such as the adaptation of the northbound application programming interface of the optical network controller, so that the optical network controller can be smoothly connected to the public IaaS network service layer of cloud/hybrid cloud and other cloud platforms.
- the optical network includes multiple optical network switching devices, and the optical network controller provides the wavelength splitting configuration to the multiple optical network switching devices. This prevents a single point of failure in an optical network switching device from making the service unavailable, improving the availability of the cluster.
- the working modes of the plurality of optical network switching devices are active/standby mode or multi-active mode.
- in active/standby mode, when the master device fails, the backup device can become the new master device and exchange data for the computing nodes in the cluster's subnet, ensuring the normal operation of services.
- when the working mode of the optical network switching devices is multi-active mode, on the one hand, a single point of failure does not make the service unavailable, and on the other hand, load balancing can be achieved.
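The two working modes can be contrasted with a small model. The class `SwitchGroup`, the mode strings, and round-robin selection in multi-active mode are assumptions for illustration; the source does not describe how a device is chosen.

```python
class SwitchGroup:
    """Toy model of several switching devices sharing one configuration."""

    def __init__(self, switch_ids, mode):
        if mode not in ("active_standby", "multi_active"):
            raise ValueError("unknown mode: " + mode)
        self.mode = mode
        self.healthy = list(switch_ids)
        self._next = 0

    def fail(self, switch_id):
        """Mark a device as failed."""
        self.healthy.remove(switch_id)

    def pick(self):
        """Return the device that carries the next transfer."""
        if not self.healthy:
            raise RuntimeError("no switching device available")
        if self.mode == "active_standby":
            # The first healthy device acts as master; the rest are backups.
            return self.healthy[0]
        # multi_active: round-robin over healthy devices (assumed policy).
        chosen = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return chosen


standby = SwitchGroup(["S1", "S2"], "active_standby")
before = standby.pick()
standby.fail("S1")
after = standby.pick()

multi = SwitchGroup(["S1", "S2"], "multi_active")
spread = [multi.pick() for _ in range(4)]
print(before, after, spread)
```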
- this application provides a system for establishing an optical path.
- the system includes an optical network controller and an optical network switching device, the optical network switching device being used to exchange data for computing nodes in the cluster;
- the optical network controller is used to obtain the topology structure of the subnet of the cluster, and the topology structure records the addresses of the computing nodes in the subnet;
- the optical network controller is further configured to determine a wavelength splitting configuration according to the topology structure, and provide the wavelength splitting configuration to the optical network switching device, where the wavelength splitting configuration includes the different wavelengths allocated to the computing nodes in the subnet;
- the optical network switching device is configured to establish an optical path between the optical network switching device and the computing node in the subnet according to the wavelength splitting configuration.
- the subnet includes N computing nodes, where N is greater than 1, and the optical network controller is specifically used to:
- the N-1 wavelengths, and the output of the optical network switching equipment connected to the N-1 computing nodes determine the wavelength splitting configuration.
- the optical network switching device is an optical cross-connect OXC switch, and the OXC switch includes a wavelength selective switch;
- optical network switching equipment is specifically used for:
- an optical path is established between the optical network switching device and the computing node in the subnet through the wavelength selective switch.
- the computing nodes in the subnet are configured with optical network adaptation equipment, the optical network adaptation equipment is used to access the optical network, and the optical network controller is also used to:
- the subnet includes a first computing node and a second computing node
- the optical network adaptation device configured on the first computing node is used to convert the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength according to the wavelength splitting configuration, and send the optical signal to the optical network switching device
- the optical network switching device is further configured to transmit the optical signal to the second computing node through the optical path between the optical network switching device and the second computing node.
- the optical network controller is specifically used to:
- the first configuration information includes the correspondence between the wavelength of the optical signal that the computing node in the subnet is allowed to receive and the ingress port and egress port of the optical network switching device.
- the optical network controller is specifically used for:
- the second configuration information includes the correspondence between the address of a computing node in the subnet and the wavelength of the optical signal that the computing node is allowed to receive.
- the optical network controller is specifically used to:
- the optical network controller is specifically used to:
- the system includes multiple optical network switching devices, and the optical network controller is specifically used to:
- the working modes of the plurality of optical network switching devices are active/standby mode or multi-active mode.
- this application provides an optical network controller.
- the controller includes at least one processor and at least one memory.
- the at least one processor and the at least one memory communicate with each other.
- the at least one processor is configured to execute instructions stored in the at least one memory, so that the optical network controller performs steps performed by the optical network controller in the method of the first aspect.
- this application provides an optical network switching device.
- the optical network switching device includes at least one processor and at least one memory.
- the at least one processor and the at least one memory communicate with each other.
- the at least one processor is configured to execute instructions stored in the at least one memory, so that the optical network switching device performs steps performed by the optical network switching device in the method of the first aspect.
- the present application provides a computer-readable storage medium in which instructions are stored, and the instructions are used to execute the method for establishing an optical path described in the above-mentioned first aspect or any implementation of the first aspect.
- the present application provides a computer program product containing instructions for executing the method for establishing an optical path described in the above first aspect or any implementation of the first aspect.
- Figure 1A is a schematic architectural diagram of a memory interconnection network cluster provided by an embodiment of the present application.
- Figure 1B is a schematic diagram of the networking mode of the memory interconnection network cluster provided by the embodiment of the present application.
- Figure 2A is a schematic diagram of the architecture of an optical network provided by an embodiment of the present application.
- Figure 2B is an architectural schematic diagram of another optical network provided by an embodiment of the present application.
- Figure 2C is a schematic diagram of the architecture of another optical network provided by an embodiment of the present application.
- Figure 2D is an architectural schematic diagram of another optical network provided by an embodiment of the present application.
- Figure 3 is a flow chart of a method for establishing an optical path provided by an embodiment of the present application.
- Figure 4 is a schematic diagram of path construction of an all-optical switch provided by an embodiment of the present application.
- Figure 5A is a schematic structural diagram of a wavelength selective switch provided by an embodiment of the present application.
- Figure 5B is a schematic structural diagram of another wavelength selective switch provided by an embodiment of the present application.
- Figure 6 is a schematic diagram of wavelength-splitting routing provided by an embodiment of the present application.
- Figure 7 is a schematic structural diagram of an optical network controller provided by an embodiment of the present application.
- Figure 8 is a schematic structural diagram of an optical network switching device provided by an embodiment of the present application.
- Figure 9 is a hardware structure diagram of an optical network controller provided by an embodiment of the present application.
- Figure 10 is a hardware structure diagram of an optical network switching device provided by an embodiment of the present application.
- first and second in the embodiments of this application are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, features defined as “first” and “second” may explicitly or implicitly include one or more of these features.
- Memory Centric Architecture, also known as Disaggregated Shared Memory architecture, is a computing architecture that builds a memory interconnection network (memory fabric) cluster based on a new memory semantic network.
- New memory semantic networks include but are not limited to CXL memory semantic network, CAPI memory semantic network or GenZ memory semantic network.
- CXL is an open source protocol standard used to realize high-speed interconnection between the central processing unit (CPU) and peripheral devices (external devices). This protocol standard is compatible with the Peripheral Component Interconnect Express (PCIe) protocol standard.
- PCIe peripheral component interconnect Express
- CAPI provides a customizable, efficient and easy-to-use hardware acceleration solution that shares CPU load.
- Its implementation carrier can be a Field Programmable Gate Array (FPGA).
- GenZ is a bus-structured protocol. GenZ uses memory-semantic communication to transfer data between the memories of different components with minimal overhead. It interconnects not only storage devices but also processors and accelerators. Accelerators can relieve the processing pressure on processors such as CPUs.
- RDMA remote direct memory access
- peripherals such as network cards, graphics cards, hard disks, accelerators, etc.
- DMA local host
- FIG. 1A shows an architectural diagram of a cluster.
- the cluster 10 includes multiple computing nodes 100.
- Each computing node 100 includes a processor and a memory.
- the processor may be a CPU of an X86 architecture or an Advanced RISC Machine (ARM) architecture, and the memory may be Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), where DDR SDRAM can be referred to as DDR for short.
- DDR SDRAM: Double Data Rate Synchronous Dynamic Random Access Memory
- Each computing node 100 can also be configured with a network adaptation device, such as a CXL-based network adaptation device, denoted as CXL Device.
- the network adaptation device can also have other functions.
- for example, the network adaptation device can be a network adaptation device based on the data streaming accelerator (DSA), denoted as DSA Device.
- DSA is an external device standard for operating data streams.
- the supported scenarios include data movement (which can be combined with deduplication, cyclic redundancy check, etc.), data comparison, and data conversion (which can be combined with data integrity field (DIF) insertion, DIF verification, DIF update, etc.).
- CXL Device or DSA Device can be used as a custom hardware bridge (self-defined hardware bridge) to translate CXL.IO, CXL.Mem, CXL.Cache or DSA into RDMA network semantics, thereby enabling direct access to remote memory through networks such as GenZ networks, so that resource pooling and global sharing of hierarchical memory can be achieved.
- custom hardware bridge self-defined hardware bridge
- the cluster 10 shown in Figure 1A requires efficient, stable, and reliable cross-node memory access. Specifically, the cluster 10 usually requires ultra-high bandwidth, ultra-low average latency, and long-tail latency. Considering the delay and reliability of long optical fibers, a cluster 10 can be obtained by networking multiple computing nodes 100 in a single cabinet or macro cabinet.
- the cluster 10 can be established using the OneHop networking mode.
- multiple computing nodes 100 (denoted as H1, H2...H8 in Figure 1B) are connected to the switch Switch, and data can be exchanged between H1 to H8 through the switch.
- sudden Incast/Outcast traffic (such as H1 to H7 sending data to H8 at the same time, or H8 returning data to H1 to H7 at the same time) can trigger short-term congestion, causing long-tail latency to surge and thus limiting the performance of the cluster 10.
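A back-of-the-envelope calculation shows why incast inflates tail latency on a single shared egress port. The model below is a deliberate simplification and all numbers are illustrative assumptions: every sender transmits one burst at the same instant, the egress port toward H8 drains at a fixed rate, and nothing is dropped.

```python
def incast_drain_time_s(senders, burst_bits, egress_rate_bps):
    """Time until the last burst has left the shared egress port.

    All bursts queue behind one port, so the last one waits for
    senders * burst_bits to be serialized.
    """
    return senders * burst_bits / egress_rate_bps


# Assumed values: 1 Mbit bursts, 100 Gbit/s egress port.
alone = incast_drain_time_s(1, 1_000_000, 100e9)   # H1 -> H8 only
incast = incast_drain_time_s(7, 1_000_000, 100e9)  # H1..H7 -> H8 together
print(alone, incast, incast / alone)
```

Under these assumptions the slowest transfer takes seven times as long as an uncontended one, which is the tail-latency effect the text describes.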
- optical transport network (OTN) refers to a transport network that realizes the transmission, multiplexing, routing, and monitoring of service signals in the optical domain and ensures their performance indicators and survivability.
- OTN can use optical cross connection (OXC) to transmit service signals.
- OXC optical cross connection
- Optical networks include optical network controllers and optical network switching equipment.
- the optical network controller may be a controller that supports OXC, also called an OXC controller.
- the optical network controller also supports software defined network (SDN).
- SDN software defined network
- the optical network controller can centrally manage and configure network equipment such as optical network switching equipment through standardized interfaces to achieve dynamic division of clusters to meet the needs of different services.
- the optical network switching device may be a switch that supports OXC, such as an OXC switch.
- the optical network switching device may also be an optical fiber router that supports OXC.
- the optical network controller can obtain the topology structure of the cluster's subnet, where the cluster includes multiple computing nodes.
- the cluster's subnet can be a subnetwork formed by some of the nodes in the cluster and is used to execute jobs delivered by upper-layer services (such as a weather forecast for a certain area for the next day, or animation rendering), where different subnets can be used to execute different jobs.
- the topology of a subnet is recorded with the addresses of the computing nodes in the subnet.
- the optical network controller can then determine the wavelength splitting configuration based on the topology and provide the wavelength splitting configuration to the optical network switching device.
- the wavelength splitting configuration includes allocating different wavelengths to the computing nodes in the subnet.
- the optical network switching device can establish an optical path between the optical network switching device and the computing nodes in the subnet based on the above wavelength splitting configuration.
- the optical network switching device establishes optical paths to different computing nodes in the subnet based on the wavelength splitting configuration, so that when an optical signal reaches the optical network switching device, the corresponding optical path can be selected based on the wavelength of the optical signal and the signal transmitted directly to the corresponding computing node. This solves the congestion caused by sudden incast or outcast traffic in traditional switch networks, avoids long-tail delay surges, and ensures cluster performance.
- the physical connection between the optical network switching equipment and the computing nodes in the cluster is similar to the physical connection between a traditional switch and the computing nodes in the OneHop networking method. This keeps the networking simple and avoids full-mesh networking, which requires each computing node to provide multiple egress ports for physical connections, thereby lowering the networking threshold and improving networking efficiency.
- the optical network 20 is connected to the upper service layer (not shown in Figure 2A) and the underlying cluster 10 respectively.
- Business applications in the business layer can generate jobs.
- model training business applications can generate model training jobs
- weather prediction business applications can generate weather prediction jobs.
- the above-mentioned jobs may be scheduled by the job scheduler 30 to the underlying cluster 10 for execution.
- the cluster 10 can be divided into different subnets, which are used to execute different jobs, thereby achieving job isolation and ensuring data security.
- the job scheduler 30 can split the job into multiple tasks (tasks), and then schedule the multiple tasks to different computing nodes 100 in the subnet, and the different computing nodes 100 can execute the tasks in parallel.
- When different computing nodes 100 in the subnet perform tasks, they usually need to exchange data to complete the job.
- For example, the job scheduler 30 can split a training job into 3 tasks, each of which uses 4GB of training data from the training set for model training, and the job scheduler 30 schedules the three tasks to three computing nodes 100 in the subnet.
- the three computing nodes 100 can also exchange the gradient of parameters during each round of training.
- Each computing node 100 performs the next round of parameter updates based on the parameter gradients it calculated itself and the parameter gradients obtained through the exchange.
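The exchange-then-update step can be illustrated with a minimal sketch. The text only says that the update is based on the node's own gradients and the exchanged ones; averaging them and applying a plain gradient-descent step is an assumption made here for illustration, as are the function names and the numbers.

```python
def average_gradients(gradients):
    """Element-wise mean of the gradient vectors gathered from all nodes."""
    count = len(gradients)
    return [sum(values) / count for values in zip(*gradients)]


def update_parameters(params, gradients, learning_rate):
    """One update step using the averaged gradient (assumed update rule)."""
    mean_grad = average_gradients(gradients)
    return [p - learning_rate * g for p, g in zip(params, mean_grad)]


# Gradients from three computing nodes: one computed locally, two received.
local_and_exchanged = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
new_params = update_parameters([10.0, 10.0], local_and_exchanged, learning_rate=0.5)
print(new_params)
```

Every node that applies the same rule to the same set of gradients ends the round with identical parameters, which is why the exchange has to complete before the next round starts.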
- the optical network 20 is used to establish an optical path with the computing node 100 in the subnet, so that data is transmitted directly to the corresponding computing node 100 in the form of optical signals. This solves the problem in traditional switch networks where congestion caused by sudden traffic degrades the performance of the cluster 10.
- the optical network 20 includes an optical network controller 22 and an optical network switching device 24. Further, the optical network 20 may also include an optical network adaptation device 26. The optical network controller 22 is used to configure and manage network devices such as the optical network switching device 24 and the optical network adaptation device 26 to realize optical signal transmission. The optical network switching device 24 is used to exchange data between the computing nodes 100 of the cluster 10.
- the optical network adaptation device 26 can usually be configured on the computing node 100 side, for example, mounted on the mainboard of the computing node 100 .
- the optical network adaptation device 26 is used to connect the computing node 100 to the optical network 20 to achieve high-speed communication between the computing nodes 100 through the optical network 20 .
- the optical network controller 22 may be an OXC controller
- the optical network switching device 24 may be an optical network switch, such as an OXC switch
- the optical network adaptation device 26 may be an optical network card, such as an OXC network card.
- the optical network switching device 24 may also be other devices, such as an optical fiber router, and the optical network adaptation device 26 may also be other devices used to access the optical network 20 .
- the optical network controller 22 is used to obtain the topology structure of the subnet of the cluster 10 .
- the topology records the addresses of the computing nodes 100 in the subnet.
- the optical network controller 22 is also used to determine the wavelength splitting configuration according to the topology structure, and provide the wavelength splitting configuration to the optical network switching device 24 .
- the wavelength splitting configuration is used to allocate different wavelengths to the computing nodes in the subnet.
- the wavelength splitting configuration may include the correspondence among the address of the computing node 100 in the subnet, the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive, and the ingress port and egress port of the optical network switching device 24.
- the optical network controller 22 may provide the wavelength splitting configuration to the optical network switching device 24 based on the management network.
- the management network is used to transmit control signaling to configure the optical network switching device 24.
- the management network may be Ethernet.
- the optical network switching device 24 is used to establish an optical path between the optical network switching device 24 and the computing node 100 in the subnet according to the wavelength splitting configuration. This optical path is used to transmit the optical signal of the corresponding wavelength received by the inlet port from the egress port to the subnet.
- the optical network adaptation device 26 configured on the computing node 100 in the subnet is used to convert the received optical signal into an electrical signal, and provide the electrical signal to the computing node 100, thereby realizing data exchange.
- FIG. 2A takes the case where the cluster 10 is a high-performance computing (HPC) cluster as an example.
- HPC cluster refers to a cluster that connects multiple computing nodes 100 through various interconnection technologies to handle large-scale computing problems through the comprehensive computing capabilities of the connected multiple computing nodes 100.
- Industries such as scientific research, weather forecasting, simulation experiments, biopharmaceuticals, gene sequencing, and image processing can all use HPC clusters to solve large-scale computing problems.
- the HPC cluster is provided with a job scheduler 30.
- When the job scheduler 30 receives a job, it can obtain the scheduling policy of the job.
- the scheduling policy can be, for example, a priority scheduling policy or a backfill scheduling policy.
- the job scheduler 30 can determine the topology structure of the cluster's subnet according to the scheduling policy, so as to subsequently schedule the job to the corresponding subnet for execution.
- the job scheduler 30 can call the northbound application programming interface (API) of the optical network controller 22 to deliver the topology structure of the subnet to the optical network controller 22, so that the optical network controller 22 can obtain the topology of the subnet.
- the optical network 20 can also be connected to the cloud service provided by the cloud platform to process the operations of the cloud platform.
- the optical network 20 includes an optical network controller 22 , an optical network switching device 24 and an optical network adaptation device 26 .
- the optical network controller 22 in Figure 2B can receive the topology structure of the subnet of the cluster 10 sent by the infrastructure as a service (IaaS) layer network management 32 (such as OpenStack Neutron) of the cloud platform, thereby offering the HPC cluster as a cloud service.
- the cloud platform can be a public cloud or a hybrid cloud.
- Hybrid cloud is a cloud platform that mixes private cloud (local infrastructure) and public cloud.
- the IaaS layer network management 32 can provide a management plane (also called a control plane).
- the management plane is used to transmit control signaling to the optical network controller 22, so as to manage the optical network switching device 24 and the optical network adaptation device 26.
- the IaaS layer network management 32 is connected to the IaaS layer network service domain 34 of the cloud platform.
- the IaaS layer provides computing, storage and network services through infrastructure.
- network services include but are not limited to virtual private cloud (VPC), elastic public IP (Elastic IP, EIP), network address translation (NAT), elastic load balance (ELB), or cloud dedicated line (direct connect, DC).
- Each of the above network services can usually be deployed on multiple servers, thereby forming an IaaS layer network service domain 34.
- the operation request can be a Hypertext Transfer Protocol (HTTP) request.
- the operation request may first arrive at the IaaS layer network service domain 34, and then be sent to the IaaS layer network management 32 after corresponding processing by the IaaS layer network service domain 34.
- the IaaS layer network management 32 can determine, according to the request, the topology structure of the subnet in the cluster 10 used to process the request, and then send the topology structure of the subnet to the optical network controller 22, so that the optical network controller 22 can configure and manage the optical network switching device 24 and the optical network adaptation device 26 according to the topology structure of the subnet.
- the specific implementation of how the optical network controller 22 configures and manages the optical network switching device 24 and the optical network adaptation device 26 according to the subnet topology can be found in the description of the embodiment shown in FIG. 2A and will not be described again here.
- the underlying overall architecture of the optical network 20 does not need to be changed at all. Only a small amount of adaptation needs to be added to the IaaS layer network management 32, such as adaptation to the northbound API of the optical network controller 22, so that the optical network controller 22 is smoothly connected to the IaaS network service layer of the public cloud or hybrid cloud.
- the optical network 20 can include multiple optical network switching devices 24, and a computing node 100 in the cluster 10 is connected to the multiple optical network switching devices 24, thereby enabling the computing node 100 to access multiple optical network planes at the same time.
- When the optical network controller 22 provides the wavelength splitting configuration to the optical network switching device 24, it may provide the wavelength splitting configuration to all of the above-mentioned multiple optical network switching devices 24.
- the working mode of the plurality of optical network switching devices 24 can be active/standby mode or multi-active mode.
- the active/standby mode refers to setting some devices among multiple devices (such as multiple optical network switching devices 24) as active devices and setting the other devices as standby devices. When an active device goes down, a standby device is set as the active device to provide services.
- Multi-active mode means that multiple devices provide services at the same time. In the multi-active mode, data synchronization can also be performed between multiple devices (such as multiple optical network switching devices 24).
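The active/standby behaviour described above can be sketched as follows. This is a minimal illustration under assumptions: the class name, the device names, and the rule of promoting the first standby device are all hypothetical and are not taken from the embodiment.

```python
class SwitchGroup:
    """Active/standby group: one active device, the rest on standby."""

    def __init__(self, devices):
        self.active = devices[0]
        self.standby = list(devices[1:])

    def report_failure(self, device):
        """Promote the first standby device if the active device fails."""
        if device == self.active and self.standby:
            self.active = self.standby.pop(0)
        elif device in self.standby:
            # A failed standby device is simply dropped from the pool.
            self.standby.remove(device)
        return self.active


group = SwitchGroup(["switch-a", "switch-b"])
group.report_failure("switch-a")
print(group.active)
```

In multi-active mode there would be no promotion step, since all devices serve traffic at once and only need to keep their state synchronized.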
- Figure 2C illustrates an example in which a computing node 100 in an HPC cluster is connected to two optical network switching devices 24.
- the computing node 100 can be connected to more optical network switching devices 24 to ensure availability.
- the computing nodes in the cluster 10 can also be connected to multiple optical network switching devices 24 , thereby ensuring the availability of cloud services.
- FIG. 2A to FIG. 2D introduce the architecture of the optical network 20 according to the embodiment of the present application.
- the following takes the architecture shown in FIG. 2A as an example to illustrate the method of establishing an optical path according to the embodiment of the present application.
- the method includes:
- the job scheduler 30 is specifically used to schedule jobs to appropriate computing nodes 100 according to the distribution and usage of resources such as computing nodes 100 in the cluster 10 to improve job execution efficiency and resource utilization.
- the job scheduler 30 usually has workload management and resource management functions.
- the job scheduler 30 may include a workload manager (workload manager) and a resource manager (resource manager).
- the resource manager is used to collect resource usage information
- the workload manager is used to schedule workloads, including jobs, to appropriate computing nodes 100 based on the resource usage information.
- the workload manager can also monitor the running status of the job on the computing node 100 to schedule the job based on the running status.
- the running status of the job can be characterized by at least one of execution progress and expected remaining execution time.
- the job scheduler 30 in this embodiment may be an open source scheduler.
- the job scheduler 30 may be the open source Openlava scheduler or the Slurm scheduler.
- the job scheduler 30 may also be a scheduler purchased by the user or self-developed.
- the job scheduler 30 can be a TORQUE scheduler or a Moab Cluster Suite scheduler.
- the above-mentioned job scheduler 30 can receive jobs submitted by clients (such as clients of business applications). Specifically, the client can provide an operation interface. After the user triggers an operation on the operation interface, for example, an operation of accessing a database, a job can be generated; in this example the job queries the database for data that meets the query conditions. The client can then submit the job, and accordingly the job scheduler 30 can receive the job submitted by the client. It should be noted that the job scheduler 30 can also create a job queue and add the jobs received from clients to the job queue, so as to manage the jobs based on the job queue.
- the job scheduler 30 determines the topology structure of the subnet based on job scheduling.
- the job scheduler 30 is configured with a scheduling policy for jobs.
- This scheduling strategy can be called a scheduling algorithm.
- the scheduling policy configured by the job scheduler 30 may be a job priority scheduling policy or a backfill scheduling policy. Different scheduling strategies are explained below.
- the job priority scheduling policy refers to starting jobs in the order of job priority. Jobs with higher priority are scheduled first, and jobs with lower priority are scheduled later. The job priority can be set when the job is created and can be measured by a priority value. Priority values for different jobs can be equal. When multiple jobs submitted to the job scheduler 30 have the same priority value, the final priority of each job may be determined based on the time at which the job scheduler 30 received it: among jobs with the same priority value, the job received first by the job scheduler 30 has the higher priority.
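The ordering rule above (higher priority value first, earlier reception time as the tie-breaker) can be sketched as a sort key. The `Job` type, its field names, and the convention that a larger value means higher priority are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int       # assumption: larger value means higher priority
    received_at: float  # time at which the scheduler received the job


def scheduling_order(jobs):
    """Sort by descending priority; ties go to the job received first."""
    return sorted(jobs, key=lambda job: (-job.priority, job.received_at))


jobs = [
    Job("render", priority=5, received_at=12.0),
    Job("forecast", priority=9, received_at=15.0),
    Job("training", priority=5, received_at=10.0),
]
order = [job.name for job in scheduling_order(jobs)]
print(order)
```

Here "training" and "render" share a priority value, so "training" runs first because it was received earlier.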
- the backfill scheduling policy refers to allowing lower-priority jobs to run first without delaying the expected start time of high-priority jobs.
- the lower priority job can be run on resources (such as the computing node 100) reserved for the above-mentioned high priority job. Jobs that run backfill (such as the low-priority jobs mentioned above) often need to be limited in run time.
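The run-time limit mentioned above is what makes backfill safe: a low-priority job may borrow reserved resources only if it is certain to finish before the high-priority job is expected to start. A minimal sketch of that check follows; the function and parameter names are hypothetical, and production schedulers take more factors into account.

```python
def can_backfill(now, run_time_limit, reserved_start):
    """Return True if a job started now would end before the reservation begins.

    now            -- current time, in seconds
    run_time_limit -- maximum run time allowed for the low-priority job
    reserved_start -- expected start time of the high-priority job
    """
    return now + run_time_limit <= reserved_start


# The high-priority job is expected to start at t=60.
print(can_backfill(now=0, run_time_limit=30, reserved_start=60))
print(can_backfill(now=0, run_time_limit=90, reserved_start=60))
```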
- the job scheduler 30 may determine the node used to execute the job according to the job scheduling policy. Nodes executing jobs can form a subnet, and subnets executing different jobs can be logically isolated. In this way, the job scheduler 30 can obtain the topology of the subnet. The topology is recorded with the addresses of the computing nodes 100 in the subnet.
- the address of the computing node 100 is unique, and the address may be an Internet Protocol (IP) address. In some embodiments, the address of the computing node 100 may also be another address, such as a media access control (MAC) address.
- the topology structure also records the connection relationships of the computing nodes 100 . The connection relationship may be a connection relationship between the computing node 100 and an ingress port or an egress port of the optical network switching device 24 .
- the topology of the subnet can be represented by a graph structure.
- the vertices (vertex) in the graph structure can be used to represent the computing nodes 100 in the subnet
- the edges (edges) in the graph structure can be used to represent the connection relationships of the computing nodes 100.
- edge 1 is included between vertex 1 and vertex 2
- edge 1 is used to represent the connection relationship between the computing node 100 corresponding to vertex 1, the computing node 100 corresponding to vertex 2, and the ports on the optical network switching device 24.
- the computing node 100 corresponding to vertex 1 may be H1
- the computing node corresponding to vertex 2 may be H2.
- Edge 1 indicates that H1 is connected to the inlet port 1 of the optical network switching device 24, and H2 is connected to the optical network switching device 24.
- the topology of the subnet can also be represented by a data table.
- the data table may include multiple records, and each record may include the following fields: subnet identifier, address of the computing node 100, ingress port, and egress port, thereby indicating the computing nodes 100 included in a subnet and the ingress port or egress port to which each computing node 100 is connected. Further, each record may also include a node identifier of the computing node 100. It should be noted that in a record, the ingress port or egress port can also be empty or have a default value.
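The record layout described above can be sketched as a small data structure. The field names follow the text; the concrete subnet identifier, addresses, and port numbers are invented for illustration, and `None` stands in for an empty port.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopologyRecord:
    subnet_id: str
    address: str                        # e.g. the IP address of the node
    ingress_port: Optional[int] = None  # may be empty
    egress_port: Optional[int] = None   # may be empty
    node_id: Optional[str] = None       # optional node identifier


def nodes_in_subnet(records, subnet_id):
    """Return the addresses of all computing nodes recorded for a subnet."""
    return [r.address for r in records if r.subnet_id == subnet_id]


# Illustrative values only.
topology = [
    TopologyRecord("subnet-1", "10.0.0.1", ingress_port=1, egress_port=1, node_id="H1"),
    TopologyRecord("subnet-1", "10.0.0.2", ingress_port=2, egress_port=2, node_id="H2"),
    TopologyRecord("subnet-2", "10.0.0.7", node_id="H7"),
]
print(nodes_in_subnet(topology, "subnet-1"))
```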
- the subnet's label can also be recorded in the subnet's topology structure.
- This tag can be used for filtering or querying.
- this label can identify the service ID or service type of the subnet's processing services.
- the job scheduler 30 delivers the subnet topology to the optical network controller 22.
- optical network controller 22 may provide a northbound API.
- the northbound API refers to an interface provided upwards, such as an interface provided to upper-layer business applications. Its goal is to enable business applications to conveniently call underlying network resources and capabilities.
- the corresponding southbound API is an interface provided downwards, such as an interface for managing network management or equipment from other manufacturers.
- the job scheduler 30 can globally control the resource status of the cluster 10 through the northbound API, and uniformly schedule jobs according to the resource status.
- the job scheduler 30 can call the northbound API to deliver the topology structure of the subnet to the optical network controller 22, so as to uniformly schedule jobs according to the topology structure of the subnet.
- the optical network controller 22 determines the wavelength splitting configuration according to the topology structure of the subnet.
- the subnet includes N computing nodes 100, where N is a positive integer greater than 1.
- the target computing node may be any computing node 100 among the N computing nodes 100
- the optical network controller 22 may determine a sub-range from the wavelength range and sample N-1 wavelengths from the sub-range. It may then determine the wavelength splitting configuration based on the addresses of the N-1 computing nodes 100 in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device 24 connected to those N-1 computing nodes, and the ingress port of the optical network switching device 24 connected to the target computing node.
- When determining the sub-range from the wavelength range, the average sampling method or the random sampling method may be used. For example, the optical network controller 22 may divide the wavelength range into N segments, thereby obtaining N sub-ranges. Each of the N sub-ranges represents the value range of the wavelength of an optical signal that can be sent by one computing node 100 in the subnet.
- When sampling N-1 wavelengths from the sub-range, the optical network controller 22 may also use the average sampling method or the random sampling method. For example, the optical network controller 22 may divide the sub-range into N-1 segments and take the left endpoint, middle point, or right endpoint of each of the N-1 segments, thereby obtaining N-1 wavelengths. The N-1 wavelengths are respectively used as the wavelengths of optical signals sent to the remaining N-1 computing nodes 100 in the subnet.
- the wavelength of visible light is usually between 780 and 400 nanometers (nm).
- the optical network controller 22 can divide the wavelength range into the following four sections: 780nm to 685nm, 685nm to 590nm, 590nm to 495nm, and 495nm to 400nm, where each section includes its left endpoint and does not include its right endpoint. The optical network controller 22 can then take three values from each section to achieve wavelength splitting. Taking the section from 780nm to 685nm as an example, the optical network controller 22 can take three values, namely 750nm, 720nm, and 690nm. In this way, when H1 sends data to H2, H3, and H5 respectively, the electrical signals carrying the data are converted into optical signals with wavelengths of 750nm, 720nm, and 690nm respectively.
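The average-sampling variant can be sketched in a few lines: divide the full range into N equal sub-ranges, then divide one sub-range into N-1 segments and take the middle point of each. The function names are hypothetical. Note that midpoint sampling yields values different from the 750nm, 720nm, and 690nm picked in the example above, which are simply three values chosen inside the section.

```python
def split_range(start_nm, end_nm, n):
    """Divide the span from start_nm to end_nm into n equal (left, right) segments."""
    step = (end_nm - start_nm) / n
    return [(start_nm + i * step, start_nm + (i + 1) * step) for i in range(n)]


def sample_midpoints(sub_range, count):
    """Take the middle point of each of `count` equal segments of a sub-range."""
    left, right = sub_range
    return [(a + b) / 2 for a, b in split_range(left, right, count)]


n_nodes = 4
sub_ranges = split_range(780.0, 400.0, n_nodes)              # one sub-range per sender
wavelengths = sample_midpoints(sub_ranges[0], n_nodes - 1)   # one per receiver
print(sub_ranges)
print(wavelengths)
```

Swapping `sample_midpoints` for a function that returns left or right endpoints, or random values inside each segment, gives the other sampling options the text mentions.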
- the optical network controller 22 may determine the wavelength splitting configuration.
- the wavelength splitting configuration includes the address of the computing node 100 in the subnet, the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive, and the correspondence between the ingress port and the egress port of the optical network switching device 24 .
- the topological structure of the subnet records the ingress port and egress port connected to the computing node 100.
- the optical network controller 22 can, based on the correspondence between the address of the computing node 100 in the subnet and the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive, together with the ingress port and egress port connected to the computing node 100 as recorded in the topology, determine the correspondence among the address of the computing node 100 in the subnet, the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive, and the ingress port and egress port of the optical network switching device 24, thereby obtaining the wavelength splitting configuration.
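As a rough sketch of how the pieces combine, the following joins the per-node ports recorded in the topology with the sampled wavelengths to produce one configuration entry per receiving node. The dictionary layout, the function name, the IP addresses, and the port numbers are assumptions for illustration; only the four kinds of fields come from the text.

```python
def build_wavelength_config(target, topology, wavelengths):
    """Build configuration entries for signals sent by `target`.

    topology maps a node address to its connected switch ports:
    {"ingress_port": int, "egress_port": int}.
    """
    peers = [address for address in topology if address != target]
    if len(wavelengths) != len(peers):
        raise ValueError("need exactly one wavelength per receiving node")
    ingress = topology[target]["ingress_port"]
    return [
        {
            "address": address,            # receiving node
            "wavelength_nm": wavelength,   # wavelength it is allowed to receive
            "ingress_port": ingress,       # switch port connected to the sender
            "egress_port": topology[address]["egress_port"],
        }
        for address, wavelength in zip(peers, wavelengths)
    ]


# Illustrative addresses and ports for H1, H2, H3, and H5.
topology = {
    "10.0.0.1": {"ingress_port": 1, "egress_port": 1},
    "10.0.0.2": {"ingress_port": 2, "egress_port": 2},
    "10.0.0.3": {"ingress_port": 3, "egress_port": 3},
    "10.0.0.5": {"ingress_port": 5, "egress_port": 5},
}
config = build_wavelength_config("10.0.0.1", topology, [750, 720, 690])
print(config)
```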
- the optical network controller 22 provides the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26 .
- the optical network controller 22 may provide the first configuration information in the wavelength splitting configuration to the optical network switching device 24, and provide the second configuration information in the wavelength splitting configuration to the optical network adaptation device 26.
- the first configuration information includes the correspondence between the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive and the ingress port and egress port of the optical network switching device 24.
- the second configuration information includes the correspondence between the address of the computing node 100 in the subnet and the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive. In this way, the transmission overhead among the optical network controller 22, the optical network switching device 24, and the optical network adaptation device 26 can be reduced.
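The split into first and second configuration information amounts to projecting each configuration entry onto the fields each device needs. The sketch below assumes a dictionary-based entry layout and illustrative values; none of the names are from the embodiment.

```python
# Illustrative full wavelength splitting configuration (values are assumptions).
config = [
    {"address": "10.0.0.2", "wavelength_nm": 750, "ingress_port": 1, "egress_port": 2},
    {"address": "10.0.0.3", "wavelength_nm": 720, "ingress_port": 1, "egress_port": 3},
    {"address": "10.0.0.5", "wavelength_nm": 690, "ingress_port": 1, "egress_port": 5},
]


def first_configuration(entries):
    """Fields for the switching device: wavelength, ingress port, egress port."""
    return [
        {
            "wavelength_nm": e["wavelength_nm"],
            "ingress_port": e["ingress_port"],
            "egress_port": e["egress_port"],
        }
        for e in entries
    ]


def second_configuration(entries):
    """Fields for the adaptation device: node address mapped to wavelength."""
    return {e["address"]: e["wavelength_nm"] for e in entries}


switch_info = first_configuration(config)
adapter_info = second_configuration(config)
print(switch_info)
print(adapter_info)
```

Neither device receives fields it does not use, which is the source of the reduced transmission overhead mentioned above.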
- the optical network controller 22 may also not differentiate between the optical network switching device 24 and the optical network adaptation device 26, and instead provide the complete correspondence to both in a unified manner. That is, the optical network controller 22 can provide both the optical network switching device 24 and the optical network adaptation device 26 with the correspondence among the address of the computing node 100 in the subnet, the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive, and the ingress port and egress port of the optical network switching device 24.
- when the optical network controller 22 provides the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26, there can be multiple implementation methods.
- One way of implementation is that the optical network controller 22 actively delivers the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26.
- Another way of implementation is that the optical network controller 22, in response to configuration acquisition requests from the optical network switching device 24 and the optical network adaptation device 26, returns the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26.
- the optical network controller 22 may not directly transmit the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26 .
- the optical network controller 22 may provide the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26 by sharing the wavelength splitting configuration.
- the optical network switching device 24 establishes an optical path between the optical network switching device 24 and the computing node 100 in the subnet according to the wavelength splitting configuration.
- the optical network switching device 24 can establish a corresponding optical path according to the corresponding relationship between wavelengths, ingress ports, and egress ports in the wavelength splitting configuration.
- the optical path is used to transmit the optical signal of the corresponding wavelength received by the inlet port from the egress port to the computing node 100 in the subnet.
- the optical network switching device 24 can establish multiple different optical paths. Different optical paths are used to transmit optical signals of different wavelengths to different computing nodes 100 .
- the optical network switching device 24 is an OXC switch
- the OXC switch is an all-optical switch
- when the source port is the port for receiving data from H1, and the destination ports are respectively the ports for sending data to H2, H3, and H5, optical paths for transmitting optical signals with wavelengths λ1, λ2, and λ3 can be established respectively.
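Logically, the set of established optical paths behaves like a lookup from (ingress port, wavelength) to egress port. The sketch below models only that mapping, not the optics; the class, method names, and port numbers are hypothetical.

```python
class OpticalSwitch:
    """Logical model of optical paths keyed by ingress port and wavelength."""

    def __init__(self):
        self._paths = {}

    def establish_path(self, ingress_port, wavelength, egress_port):
        """Record that this wavelength arriving on ingress_port exits on egress_port."""
        self._paths[(ingress_port, wavelength)] = egress_port

    def route(self, ingress_port, wavelength):
        """Return the egress port for a signal, or None if no path exists."""
        return self._paths.get((ingress_port, wavelength))


switch = OpticalSwitch()
# H1 is attached to ingress port 1; H2, H3, and H5 to egress ports 2, 3, and 5
# (port numbers are illustrative).
switch.establish_path(1, "lambda1", 2)
switch.establish_path(1, "lambda2", 3)
switch.establish_path(1, "lambda3", 5)
print(switch.route(1, "lambda2"))
```

Because the wavelength alone selects the path, no packet inspection or queuing is involved at the switch, which is what allows the signal to pass straight through to its destination.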
- the OXC switch may include a wavelength selective switch (Wavelength Selective Switch, WSS).
- the switch may be an NxN port matrix optical switch.
- The OXC switch can establish optical paths through the above-mentioned WSS according to the wavelength splitting configuration.
- WSS can be divided into WSS based on Microelectromechanical Systems (MEMS) and WSS based on Liquid Crystal on Silicon (LCoS). The principles of the above WSS are explained below respectively.
- the MEMS-based WSS includes an optical fiber array, a grating, a MEMS, and a reflector.
- the fiber array includes an inlet port (ie, input fiber port) and an egress port (ie, output fiber port).
- the optical signal received by the inlet port can be a wavelength division multiplexing signal, and the wavelength division multiplexing signal can achieve wavelength separation after passing through the grating.
- the optical signals of the two wavelengths can be reflected through the mirrors to specific exit ports of the optical fiber array by adjusting the angles of the different mirrors in the MEMS.
- the LCoS-based WSS includes an optical fiber array, a grating, a liquid crystal-based spatial light modulator (phased array LC-based switch) and a reflector. Specifically, after the wavelength division multiplexed optical signal passes through the grating, the optical signals of each wavelength are demultiplexed to different positions in space, thereby obtaining optical signals of different wavelengths. Unlike the MEMS-based WSS, which adjusts the optical path by controlling the mirror angle to change the direction of an optical signal of a certain wavelength in real time, the LCoS-based WSS adjusts the optical path by using the liquid crystal-based spatial light modulator to change the phase of the optical signal of a certain wavelength.
- the optical network controller 22 can establish an optical path for switching the optical signal of a specific wavelength from the designated inlet port to the designated egress port through the above-mentioned WSS switch according to the corresponding relationship between the wavelength and the ingress port and the egress port.
- S314 The optical network adaptation device 26 generates a forwarding table according to the wavelength splitting configuration.
- the wavelength splitting configuration includes a corresponding relationship between the address of the computing node 100 in the subnet and the wavelength of the optical signal that the computing node 100 in the subnet is allowed to receive.
- the optical network adaptation device 26 can generate a forwarding table indexed by the address of the computing node 100, according to the correspondence between the address of the computing node 100 and the wavelength of the optical signal that the computing node 100 is allowed to receive.
- the address of the computing node 100 may be an IP address, and accordingly, each entry in the forwarding table may include the IP address and the wavelength corresponding to the IP address.
- the optical network adaptation device 26 can store the above-mentioned forwarding table for subsequent signal conversion based on the forwarding table.
- the optical network adaptation device 26 can generate a forwarding table according to the wavelength splitting configuration, and the forwarding table includes the correspondence between the addresses of H2, H3, and H5 and the wavelengths of optical signals that H2, H3, and H5 are allowed to receive.
- the optical network adaptation device 26 can store the forwarding table so that subsequent conversion of electrical signals into optical signals is performed based on the forwarding table.
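A forwarding table indexed by node address can be sketched as a dictionary. The example below assumes IP addresses and uses Python's standard `ipaddress` module only to validate and normalize the keys; the addresses and wavelengths are illustrative, and the function name is hypothetical.

```python
import ipaddress


def build_forwarding_table(correspondence):
    """Build a table mapping each node's IP address to its receive wavelength.

    correspondence is an iterable of (address, wavelength_nm) pairs taken
    from the wavelength splitting configuration.
    """
    table = {}
    for address, wavelength_nm in correspondence:
        # Raises ValueError if the address is not a valid IP address.
        key = str(ipaddress.ip_address(address))
        table[key] = wavelength_nm
    return table


# Illustrative entries for H2, H3, and H5 as seen from H1's adaptation device.
forwarding_table = build_forwarding_table(
    [("10.0.0.2", 750), ("10.0.0.3", 720), ("10.0.0.5", 690)]
)
print(forwarding_table)
```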
- the above-mentioned S312 and S314 can be executed in parallel, or executed one after another in a set order.
- the execution order does not affect the specific implementation of the embodiment of the present application, and this embodiment does not limit this.
- the above-mentioned S314 may not be executed when performing the method of the embodiment of the present application.
- the optical network adaptation device 26 may perform subsequent processing according to the wavelength splitting configuration provided by the optical network controller 22 .
- S316 The optical network adaptation device 26 configured on the first computing node receives the electrical signal to be sent to the second computing node.
- the first computing node and the second computing node are different computing nodes 100 in the subnet, and data can be exchanged between the computing nodes 100 to complete the job. For example, gradient values can be exchanged between computing nodes 100 to complete model training operations.
- the first computing node and the second computing node may exchange data in the form of messages. Specifically, the CPU of the first computing node may generate a message in the form of an electrical signal and send it to the optical network adaptation device 26 configured on the first computing node.
- the optical network adaptation device 26 configured on the first computing node converts the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength according to the forwarding table.
- the source address and destination address are recorded in the message in the form of an electrical signal.
- the source address may be the address of the first computing node
- the destination address may be the address of the second computing node.
- the optical network adaptation device 26 configured on the first computing node can query the forwarding table according to the address of the second computing node to obtain the wavelength of the optical signal that the second computing node is allowed to receive, and then convert the electrical signal into an optical signal of the corresponding wavelength.
- the optical network adaptation device 26 includes a photoelectric conversion module. After determining the wavelength of the optical signal that the second computing node is allowed to receive, the optical network adaptation device 26 converts the electrical signal into an optical signal of the corresponding wavelength through the photoelectric conversion module.
- S320: The optical network adaptation device 26 configured on the first computing node sends the optical signal to the optical network switching device 24.
- the optical network adaptation device 26 configured on the first computing node can send optical signals to the optical network switching device 24 through the path between the first computing node and the optical network switching device 24 .
- the optical network switching device 24 transmits the optical signal to the second computing node through the optical path with the second computing node.
- the optical network switching device 24 may determine the egress port according to the correspondence between the wavelength of the optical signal that the second computing node is allowed to receive and the ingress port and egress port of the optical network switching device 24, together with the wavelength of the received optical signal and the ingress port on which it was received, and then transmit the optical signal to the second computing node through the optical path between that egress port and the second computing node. That is, the optical network switching device 24 can transmit the optical signal to the second computing node through wavelength splitting and routing, and the optical network adaptation device 26 configured on the second computing node can convert the optical signal into an electrical signal through its photoelectric conversion module, so that the message is delivered to the second computing node.
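The port selection performed by the switching device can be pictured as a lookup keyed by ingress port and wavelength. The sketch below is only an assumption about how such a table might be represented in software; the `Row` layout, the function names and the wavelength labels are hypothetical, and a real OXC switch would realize the mapping in its optical hardware.

```python
from typing import Dict, Iterable, Optional, Tuple

# (ingress_port, wavelength_label, egress_port): hypothetical layout of one
# row of the wavelength splitting configuration held by the switch.
Row = Tuple[int, str, int]


def build_cross_connects(rows: Iterable[Row]) -> Dict[Tuple[int, str], int]:
    """Map (ingress port, wavelength) to the egress port of the optical path."""
    cross_connects: Dict[Tuple[int, str], int] = {}
    for ingress_port, wavelength, egress_port in rows:
        key = (ingress_port, wavelength)
        if key in cross_connects:
            raise ValueError(f"conflicting rows for {key}")
        cross_connects[key] = egress_port
    return cross_connects


def select_egress(cross_connects: Dict[Tuple[int, str], int],
                  ingress_port: int, wavelength: str) -> Optional[int]:
    """Return the egress port for a signal, or None if no path is configured."""
    return cross_connects.get((ingress_port, wavelength))


# Hypothetical configuration: port 1 is H1, port 2 is H2, port 3 is H3.
rows = [(1, "lambda1", 2), (1, "lambda2", 3), (2, "lambda3", 1)]
cross_connects = build_cross_connects(rows)
```

With this table, a signal of wavelength `lambda1` arriving on port 1 leaves through port 2, with no per-packet path computation.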
- the embodiment of this application also provides an example for illustration.
- the optical network switching device 24 receives the above wavelengths in time slots T1, T2, and T3 respectively.
- the optical network switching device 24 can select the corresponding optical path through wavelength splitting and routing to transmit the optical signal to the corresponding computing node.
- the optical network switching device 24 receives the optical signal with wavelength λ1 sent by H1, determines the corresponding optical path based on the wavelength and the ingress port, and transmits the optical signal to H2.
- the optical network controller 22 splits the wavelengths according to the topology of the subnet, and sends the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26, so that the optical network adaptation device 26 can convert messages destined for different computing nodes 100 into optical signals of different wavelengths.
- the optical network switching device 24 distinguishes the optical signals according to wavelength and passes them directly to the corresponding computing nodes 100 through the pre-established optical paths, avoiding the congestion caused by bursty incast or outcast traffic in traditional switch networks, shortening the long-tail latency and safeguarding the performance of the cluster 10.
- the optical network switching device 24 can establish an optical path with a specific computing node 100 (specifically, a computing node 100 in the subnet) according to the wavelength splitting configuration, thereby realizing flexible scheduling of the cluster 10 and flexibly meeting the different needs of the business.
- the optical network controller 22 can also be connected to the IaaS layer network management 32 of a public cloud or a hybrid cloud, thereby realizing the cloud service of memory fabric clusters such as the cluster 10.
- the IaaS layer network service domain 34 of the public cloud or hybrid cloud can generate jobs.
- the IaaS layer network service domain 34 can generate jobs in response to user operations, and the IaaS layer network management 32 can determine the topology of the subnet according to the distribution of resources in the cluster 10, and then deliver the topology of the subnet to the optical network controller 22 through the northbound API of the optical network controller 22.
- the optical network controller 22 determines the wavelength splitting configuration according to the topology, and provides the wavelength splitting configuration to the optical network switching device 24.
- the optical network switching device 24 establishes an optical path with the computing node 100 in the subnet according to the wavelength splitting configuration, so that the optical signal is subsequently transmitted to the computing node 100 in the subnet through the optical path.
- the specific implementation of the optical network controller 22 determining the wavelength splitting configuration and the optical network switching device 24 establishing the optical path can be referred to the relevant content description shown in Figure 3, and will not be described again here.
- By offering the cluster 10 as a cloud service, this method can improve the utilization of resources (such as computing resources) in the cluster 10 and reduce costs.
- the computing node 100 can be connected to multiple optical network switching devices 24.
- the working mode of the multiple optical network switching devices 24 is an active/standby mode or a multi-active mode.
- the optical network controller 22 can provide the wavelength splitting configuration to the above-mentioned multiple optical network switching devices 24, which avoids network interruption caused by a single point of failure of an optical network switching device 24 and improves the availability of the cluster 10.
- the embodiment of the present application also provides a system for establishing an optical path.
- the system for establishing an optical path may be a hardware system, and the hardware system may include an optical network controller 22 and an optical network switching device 24 as shown in Figures 2A to 2D. Further, the hardware system may include the optical network adaptation device 26 as shown in Figures 2A to 2D. In other words, the system for establishing the optical path may be the optical network 20 as shown in FIGS. 2A to 2D .
- the optical network 20 includes an optical network controller 22 and an optical network switching device 24.
- the optical network switching device 24 is used to exchange data for the computing nodes 100 in the cluster 10.
- the optical network controller 22 is used to obtain the topology structure of the subnet of the cluster 10, and the topology structure records the address of the computing node 100 in the subnet;
- the optical network controller 22 is also configured to determine a wavelength splitting configuration according to the topology structure, and provide the wavelength splitting configuration to the optical network switching device 24, where the wavelength splitting configuration includes the different wavelengths allocated to the computing nodes in the subnet;
- the optical network switching device 24 is configured to establish an optical path between the optical network switching device 24 and the computing node 100 in the subnet according to the wavelength splitting configuration.
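- the subnet includes N computing nodes, where N is greater than 1, and the optical network controller 22 is specifically used to:
- for a target computing node among the N computing nodes, determine a sub-range from the wavelength range and sample N-1 wavelengths from the sub-range;
- determine the wavelength splitting configuration according to the addresses of the N-1 computing nodes in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device 24 to which the N-1 computing nodes are connected, and the ingress port of the optical network switching device 24 to which the target computing node is connected.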
- the optical network switching device 24 is an optical cross-connect OXC switch, and the OXC switch includes a wavelength selective switch;
- the optical network switching device 24 is specifically used for:
- an optical path is established between the optical network switching device 24 and the computing node 100 in the subnet through the wavelength selective switch.
- the computing node 100 in the subnet is configured with an optical network adaptation device 26.
- the optical network adaptation device 26 is used to access the optical network 20.
- the optical network controller 22 is also used to:
- provide the wavelength splitting configuration to the optical network adaptation device 26, so that the optical network adaptation device 26 converts the electrical signal into an optical signal of a corresponding wavelength according to the wavelength splitting configuration.
- the subnet includes a first computing node and a second computing node
- the optical network adaptation device 26 configured on the first computing node is used to convert the electrical signal to be sent to the second computing node into an optical signal of a corresponding wavelength according to the wavelength splitting configuration, and to send the optical signal to the optical network switching device 24;
- the optical network switching device 24 is also configured to transmit the optical signal to the second computing node through the optical path between the optical network switching device 24 and the second computing node.
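- the optical network controller 22 is specifically used to:
- provide first configuration information in the wavelength splitting configuration to the optical network switching device 24, where the first configuration information includes the correspondence between the wavelength of the optical signal that the computing node in the subnet is allowed to receive and the ingress port and egress port of the optical network switching device 24.
- the optical network controller 22 is specifically used to:
- provide second configuration information in the wavelength splitting configuration to the optical network adaptation device 26, where the second configuration information includes the correspondence between the address of a computing node in the subnet and the wavelength of the optical signal that the computing node in the subnet is allowed to receive.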
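- the optical network controller 22 is specifically used to:
- receive the topology of the subnet generated by the job scheduler 30 according to the scheduling policy.
- the optical network controller 22 is specifically used to:
- receive the topology of the subnet of the cluster 10 sent by the IaaS layer network management 32 of the cloud platform.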
- the optical network 20 includes a plurality of the optical network switching devices 24, and the optical network controller 22 is specifically used to:
- the wavelength splitting configuration is provided to a plurality of the optical network switching devices 24 .
- the working modes of the plurality of optical network switching devices 24 are active/standby mode or multi-active mode.
- the optical network 20 may correspondingly perform the method described in the embodiments of the present application, and the above and other operations and/or functions of the components of the optical network 20 respectively implement the corresponding processes of the method in the embodiment shown in FIG. 3, which are not described again here for brevity.
- the embodiment of the present application also provides an optical network controller 22 and an optical network switching device 24.
- the optical network controller 22 includes:
- the communication module 702 is used to obtain the topology structure of the cluster's subnet, and the topology structure records the addresses of the computing nodes in the subnet;
- Determining module 704 configured to determine the wavelength splitting configuration according to the topological structure
- Providing module 706 is configured to provide the wavelength splitting configuration to the optical network switching device 24.
- the wavelength splitting configuration is used to allocate different wavelengths to the computing nodes 100 in the subnet, so that the optical network switching device 24 establishes an optical path between the optical network switching device 24 and the computing node 100 in the subnet according to the wavelength splitting configuration.
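- the subnet includes N computing nodes, and N is greater than 1.
- the determining module 704 is specifically used to:
- for a target computing node among the N computing nodes, determine a sub-range from the wavelength range and sample N-1 wavelengths from the sub-range;
- determine the wavelength splitting configuration according to the addresses of the N-1 computing nodes in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device 24 to which the N-1 computing nodes are connected, and the ingress port of the optical network switching device 24 to which the target computing node is connected.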
- the computing node 100 in the subnet is configured with an optical network adaptation device 26.
- the optical network adaptation device 26 is used to access the optical network 20.
- the providing module 706 is also used to:
- provide the wavelength splitting configuration to the optical network adaptation device 26, so that the optical network adaptation device 26 converts the electrical signal into an optical signal of a corresponding wavelength according to the wavelength splitting configuration.
- the providing module 706 is specifically used to:
- provide first configuration information in the wavelength splitting configuration to the optical network switching device 24, where the first configuration information includes the correspondence between the wavelength of the optical signal that the computing node in the subnet is allowed to receive and the ingress port and egress port of the optical network switching device 24;
- provide second configuration information in the wavelength splitting configuration to the optical network adaptation device 26, where the second configuration information includes the correspondence between the address of a computing node in the subnet and the wavelength of the optical signal that the computing node in the subnet is allowed to receive.
- the communication module 702 is specifically used to:
- receive the topology of the subnet generated by the job scheduler 30 according to the scheduling policy.
- the communication module 702 is specifically used to:
- receive the topology of the subnet of the cluster 10 sent by the IaaS layer network management 32 of the cloud platform.
- the optical network 20 includes multiple optical network switching devices 24, and the providing module 706 is specifically used to:
- the wavelength splitting configuration is provided to a plurality of the optical network switching devices 24 .
- the working modes of the plurality of optical network switching devices 24 are active/standby mode or multi-active mode.
- the optical network controller 22 may correspondingly perform the method described in the embodiments of the present application, and the above and other operations and/or functions of the modules of the optical network controller 22 respectively implement the steps performed by the optical network controller 22 in the embodiment shown in FIG. 3, which are not described again here for brevity.
- the optical network switching device 24 includes:
- the communication module 802 is used to obtain the wavelength splitting configuration, which is used to allocate different wavelengths to the computing nodes 100 in the subnet;
- the establishment module 804 is configured to establish an optical path between the optical network switching device 24 and the computing node 100 in the subnet according to the wavelength splitting configuration.
- the optical network switching device 24 is an optical cross-connect OXC switch, and the OXC switch includes a wavelength selective switch;
- the establishment module 804 is specifically used to:
- an optical path is established between the optical network switching device 24 and the computing node 100 in the subnet through the wavelength selective switch.
- the subnet includes a first computing node and a second computing node
- the communication module 802 is also configured to receive an optical signal sent by the optical network adaptation device 26 configured on the first computing node, the optical signal being obtained by the optical network adaptation device 26 converting, according to the wavelength splitting configuration, the electrical signal to be sent to the second computing node, and then to transmit the optical signal to the second computing node through the optical path between the optical network switching device 24 and the second computing node.
- the optical network switching device 24 may correspondingly perform the method described in the embodiments of the present application, and the above and other operations and/or functions of the modules of the optical network switching device 24 respectively implement the steps performed by the optical network switching device 24 in the embodiment shown in FIG. 3, which are not described again here for brevity.
- Figure 9 provides a schematic structural diagram of an optical network controller 22.
- the optical network controller 22 includes a bus 901, a processor 902, a communication interface 903 and a memory 904.
- the processor 902, the memory 904 and the communication interface 903 communicate through the bus 901.
- the bus 901 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
- the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 9, but it does not mean that there is only one bus or one type of bus.
- the processor 902 may be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP) or a digital signal processor (DSP).
- the communication interface 903 is used for communicating with the outside.
- the communication interface 903 is used to obtain the topology structure of the subnet of the cluster 10, or provide wavelength splitting configuration to the optical network switching device 24, and so on.
- Memory 904 may include volatile memory, such as random access memory (RAM). Memory 904 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
- Computer-readable instructions are stored in the memory 904, and the processor 902 executes the computer-readable instructions, so that the optical network controller 22 performs the steps performed by the optical network controller 22 in the aforementioned method of establishing an optical path.
- the optical network switching device 24 includes a bus 1001, a processor 1002, a communication interface 1003 and a memory 1004.
- the processor 1002, the memory 1004 and the communication interface 1003 communicate through the bus 1001.
- Computer-readable instructions are stored in the memory 1004, and the processor 1002 executes the computer-readable instructions, so that the optical network switching device 24 performs the steps performed by the optical network switching device 24 in the aforementioned method of establishing an optical path.
- An embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be any available medium that the optical network controller 22 or the optical network switching device 24 can store, or a data storage device such as a data center containing one or more available media.
- the available media may be magnetic media (e.g., floppy disk, hard disk, tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state drive), etc.
- the computer-readable storage medium includes instructions that instruct the optical network controller 22 or the optical network switching device 24 to perform the steps performed by the optical network controller 22 or the optical network switching device 24 in the above method.
- An embodiment of the present application also provides a computer program product.
- the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the optical network controller 22 or the optical network switching device 24, the processes or functions described in the embodiments of the present application are generated in whole or in part.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; e.g., the computer instructions may be transmitted from one website, computing device, or data center to another website, computing device, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
- the computer program product may be a software installation package. If any of the foregoing methods for establishing an optical path needs to be used, the computer program product may be downloaded to and executed on the optical network controller 22 or the optical network switching device 24.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Optical Communication System (AREA)
Abstract
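The present application provides a method for establishing an optical path, applied to an optical network. The optical network includes an optical network controller and an optical network switching device, and the optical network switching device is used to exchange data for computing nodes in a cluster. The method includes: the optical network controller obtains the topology of a subnet of the cluster, the topology recording the addresses of the computing nodes in the subnet; the optical network controller determines a wavelength splitting configuration according to the topology, the wavelength splitting configuration being used to allocate different wavelengths to the computing nodes in the subnet, and then provides the wavelength splitting configuration to the optical network switching device; the optical network switching device then establishes optical paths between itself and the computing nodes in the subnet according to the wavelength splitting configuration. In this way, the optical network switching device can distinguish optical signals by wavelength and pass them straight through to the computing nodes over the optical paths, avoiding congestion caused by bursty traffic and shortening long-tail latency.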
Description
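This application claims priority to Chinese patent application No. 202210542499.3, entitled "Method for establishing an optical path and related device", filed with the China Patent Office on May 18, 2022, the entire contents of which are incorporated herein by reference.
This application relates to the field of computer technology, and in particular to a method and a system for establishing an optical path, as well as an optical network controller, an optical network switching device, a computer-readable storage medium, and a computer program product.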
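With the continuous development of computer technology, various computing architectures have emerged. Among them, the data-driven architecture, also called the memory-centric architecture or the disaggregated shared memory architecture, is an architecture-level innovation in the computing industry and an active research topic.
The data-driven architecture builds a memory fabric cluster on a new type of memory semantic network, such as a compute express link (CXL) memory semantic network, a coherent accelerator processor interface (CAPI) memory semantic network, or a GenZ memory semantic network, to achieve resource pooling and global sharing of tiered memory.
A memory fabric cluster requires efficient, stable and reliable cross-node memory access. Specifically, such a cluster typically requires ultra-high bandwidth and ultra-low average and long-tail latency. For example, the bandwidth may be at the terabits-per-second (Tbps) level, and the average or long-tail latency may be at the hundred-nanosecond level, that is, less than 1 microsecond (µs).
At present, memory fabric clusters are usually built with one-hop (OneHop) networking. OneHop networking connects the computing nodes to the same switching device, such as a switch, so that the computing nodes exchange data with each other through that switch.
However, with OneHop networking, bursty traffic from the computing nodes easily triggers short-term congestion, which causes long-tail latency to surge and thereby limits the performance of the memory fabric cluster.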
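Summary of the Invention
This application provides a method for establishing an optical path. By splitting wavelengths, the method establishes different optical paths between an optical network switching device and the computing nodes in a subnet of a memory fabric cluster, so that the optical network switching device can distinguish optical signals by wavelength and select the corresponding optical path according to the wavelength of an optical signal to pass the signal straight through to the computing node. This avoids congestion caused by bursty traffic, shortens long-tail latency, and safeguards the performance of the memory fabric cluster. This application also provides a corresponding system, optical network controller, optical network switching device, computer-readable storage medium, and computer program product.
In a first aspect, this application provides a method for establishing an optical path. The method is applied to an optical network. The optical network is specifically an optical transport network, for example a network that uses optical cross-connection to transport optical signals. The optical network includes an optical network controller and an optical network switching device, where the optical network switching device is used to exchange data for the computing nodes in a cluster (that is, the memory fabric cluster described above).
Specifically, the optical network controller obtains the topology of a subnet of the cluster, the topology recording the addresses of the computing nodes in the subnet. The optical network controller determines a wavelength splitting configuration according to the topology and provides the wavelength splitting configuration to the optical network switching device, where the wavelength splitting configuration is used to allocate different wavelengths to the computing nodes in the subnet. Correspondingly, the optical network switching device establishes optical paths between the optical network switching device and the computing nodes in the subnet according to the wavelength splitting configuration.
In this method, establishing optical paths between the optical network switching device and the different computing nodes in the subnet according to the wavelength splitting configuration allows the optical network switching device to distinguish optical signals by wavelength and to select the corresponding path according to the wavelength of an optical signal to pass the signal straight through to the computing node. This avoids congestion caused by bursty traffic, shortens long-tail latency, and safeguards the performance of the cluster.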
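In some possible implementations, the subnet includes N computing nodes, where N is greater than 1. For a target computing node among the N computing nodes (the target may be any of the N computing nodes), the optical network controller may determine a sub-range from the wavelength range (the usable wavelength range, for example the wavelength range of visible light) and sample N-1 wavelengths from that sub-range. The optical network controller then determines the wavelength splitting configuration according to the addresses of the N-1 computing nodes in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device to which the N-1 computing nodes are connected, and the ingress port of the optical network switching device to which the target computing node is connected. Specifically, the wavelength splitting configuration may be the correspondence among the addresses of the computing nodes in the subnet, the wavelengths of the optical signals that those computing nodes are allowed to receive (the wavelengths sampled from the sub-range), and the ingress and egress ports of the optical network switching device.
In this method, the optical network controller determines a sub-range from the wavelength range, samples several wavelengths from the sub-range, and determines the wavelength splitting configuration according to the sampled wavelengths, the addresses of the computing nodes in the subnet, and the ingress and egress ports of the optical network switching device, thereby achieving fine-grained wavelength splitting. An optical signal received on an ingress port can then be passed, according to its wavelength, from the corresponding egress port over the corresponding optical path straight through to the computing node, without complex path computation, which shortens latency, improves transmission efficiency, and safeguards the performance of the cluster.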
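The way the four inputs combine into a wavelength splitting configuration for one target node can be sketched as below. This is a minimal sketch under stated assumptions: the function name `build_config_for_target`, the dictionary layout of each entry, the `switch_port` mapping and the example values are hypothetical and not taken from the application.

```python
from typing import Dict, List, Sequence


def build_config_for_target(target: str,
                            peers: Sequence[str],
                            wavelengths: Sequence[float],
                            switch_port: Dict[str, int]) -> List[dict]:
    """Build the configuration entries for signals sent by `target`.

    `peers` are the other N-1 nodes, `wavelengths` the N-1 sampled
    wavelengths, and `switch_port` maps each node address to the switch
    port it is cabled to (an assumed representation).
    """
    if len(peers) != len(wavelengths):
        raise ValueError("need exactly one wavelength per peer node")
    if len(set(wavelengths)) != len(wavelengths):
        raise ValueError("wavelengths must be distinct")
    ingress_port = switch_port[target]  # port on which the target's signals arrive
    return [
        {
            "address": peer,                    # receiver address
            "wavelength": wavelength,           # wavelength the receiver may receive
            "ingress_port": ingress_port,
            "egress_port": switch_port[peer],   # port leading to the receiver
        }
        for peer, wavelength in zip(peers, wavelengths)
    ]


# Hypothetical three-node subnet.
switch_port = {"H1": 1, "H2": 2, "H3": 3}
config = build_config_for_target("H1", ["H2", "H3"], [1550.0, 1551.0], switch_port)
```

Repeating the call with each node in turn as the target would yield the full correspondence for the subnet.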
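In some possible implementations, the optical network controller may sample the N-1 wavelengths from the sub-range by random sampling or by uniform sampling. These N-1 wavelengths may be allocated to the N-1 computing nodes in the subnet other than the target computing node. Uniform sampling keeps the spacing between wavelengths fairly even, which avoids the signal interference that arises when wavelengths are too close together. Random sampling makes the wavelengths follow no discernible pattern, which increases the complexity of the optical paths and improves security.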
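The two sampling strategies can be illustrated with a short sketch. The function names, the nanometre values and the `grid` parameter (a minimum spacing used so that randomly drawn wavelengths never coincide) are assumptions made for this example; the application does not specify how the sampling is implemented.

```python
import random
from typing import List, Optional


def sample_even(low: float, high: float, count: int) -> List[float]:
    """Uniform sampling: `count` evenly spaced wavelengths in [low, high]."""
    if count < 1 or high <= low:
        raise ValueError("need count >= 1 and high > low")
    if count == 1:
        return [(low + high) / 2]
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


def sample_random(low: float, high: float, count: int,
                  grid: float = 0.5,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Random sampling: `count` distinct wavelengths drawn from a grid.

    Drawing grid slots without replacement guarantees that any two sampled
    wavelengths are at least `grid` apart (an assumed safeguard).
    """
    rng = rng or random.Random()
    slots = int((high - low) / grid) + 1
    if count > slots:
        raise ValueError("sub-range too narrow for the requested count")
    picks = rng.sample(range(slots), count)
    return sorted(low + i * grid for i in picks)


# Hypothetical sub-range of 1530 nm to 1560 nm, N = 5 nodes, so N-1 = 4 wavelengths.
even = sample_even(1530.0, 1560.0, 4)
rand = sample_random(1530.0, 1560.0, 4, rng=random.Random(0))
```

Uniform sampling maximizes the spacing for a given sub-range, while the random variant trades some spacing for unpredictability, mirroring the trade-off described in the text.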
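In some possible implementations, the optical network switching device is an optical cross-connect switch that includes a wavelength selective switch. Correspondingly, when establishing an optical path, the optical network switching device may establish the optical path between the optical network switching device and a computing node in the subnet through the wavelength selective switch according to the wavelength splitting configuration.
In this method, the wavelength selective switch lets the optical network switching device adjust the optical paths automatically according to the wavelength splitting configuration, so that an optical signal arriving at the optical network switching device is automatically transmitted along the optical path to the corresponding computing node, with no need for complex physical wiring of ports in advance. In addition to solving the congestion problem, this simplifies networking and improves the user experience.
In some possible implementations, the computing nodes in the subnet are configured with optical network adaptation devices, which are used to access the optical network. Correspondingly, the optical network controller may also provide the wavelength splitting configuration to the optical network adaptation device, so that the optical network adaptation device converts an electrical signal into an optical signal of the corresponding wavelength according to the wavelength splitting configuration.
In this way, when the computing nodes in the subnet exchange data, an electrical signal can be converted into an optical signal of the corresponding wavelength, and the optical network switching device then passes the optical signal of that wavelength straight through to the computing node over the corresponding optical path, which improves transmission efficiency, shortens latency, and safeguards the performance of the cluster.
In some possible implementations, the subnet includes a first computing node and a second computing node. The optical network adaptation device configured on the first computing node may, according to the wavelength splitting configuration, convert the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength and send the optical signal to the optical network switching device. The optical network switching device then transmits the optical signal to the second computing node through the optical path between the optical network switching device and the second computing node.
This enables high-speed communication between the computing nodes in the subnet, avoids congestion caused by bursty traffic, shortens long-tail latency, safeguards the performance of the cluster, and improves the efficiency of job execution.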
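In some possible implementations, the optical network controller may provide first configuration information in the wavelength splitting configuration to the optical network switching device, and provide second configuration information in the wavelength splitting configuration to the optical network adaptation device. The first configuration information includes the correspondence between the wavelengths of the optical signals that the computing nodes in the subnet are allowed to receive and the ingress and egress ports of the optical network switching device. The second configuration information includes the correspondence between the addresses of the computing nodes in the subnet and the wavelengths of the optical signals that those computing nodes are allowed to receive. Sending each device only the part it needs reduces transmission overhead and cost.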
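The division into first and second configuration information amounts to projecting the full configuration onto the fields each device needs. The sketch below assumes a hypothetical `Entry` record and a `split_config` helper; neither name comes from the application.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One row of the full (assumed) wavelength splitting configuration."""
    address: str       # address of the receiving computing node
    wavelength: str    # wavelength the node is allowed to receive
    ingress_port: int  # switch port on which the signal arrives
    egress_port: int   # switch port leading to the receiving node


def split_config(entries: List[Entry]) -> Tuple[List[Tuple[str, int, int]], Dict[str, str]]:
    """Return (first configuration information, second configuration information)."""
    # For the switching device: wavelength <-> ingress/egress ports.
    first = [(e.wavelength, e.ingress_port, e.egress_port) for e in entries]
    # For the adaptation device: node address <-> wavelength.
    second = {e.address: e.wavelength for e in entries}
    return first, second


entries = [
    Entry("10.0.0.2", "lambda1", 1, 2),
    Entry("10.0.0.3", "lambda2", 1, 3),
]
first_info, second_info = split_config(entries)
```

The switch never needs node addresses and the adaptation device never needs port numbers, which is why the split reduces what has to be transmitted to each.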
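In some possible implementations, the optical network controller may instead provide the same, more complete wavelength splitting configuration to both the optical network switching device and the optical network adaptation device. For example, the optical network controller may provide both devices with the correspondence among the addresses of the computing nodes in the subnet, the wavelengths of the optical signals that those computing nodes are allowed to receive, and the ingress and egress ports of the optical network switching device. This spares the optical network controller extra processing and reduces its computational load, so that even a low-specification optical network controller can meet the requirements.
In some possible implementations, the cluster may be a high-performance computing cluster provided with a job scheduler. The job scheduler may generate the topology of the subnet according to a scheduling policy, and the optical network controller may receive the topology of the subnet generated by the job scheduler according to the scheduling policy. For example, the optical network controller may provide a northbound application programming interface, and the job scheduler may call this northbound interface to deliver the topology of the subnet to the optical network controller.
In this way, jobs can be scheduled onto the computing nodes in a subnet of the high-performance computing cluster for execution, making full use of the cluster's resources and improving resource utilization.
In some possible implementations, the optical network controller may also interface with a cloud platform. Specifically, the optical network controller may receive the topology of the subnet of the cluster sent by the Infrastructure as a Service (IaaS) layer network management of the cloud platform. This enables cloud-based scheduling of the high-performance computing cluster, further improving resource utilization and reducing costs. Moreover, the underlying architecture of the optical network does not need to change; the IaaS layer network management only needs a small amount of adaptation, for example adaptation to the northbound application programming interface of the optical network controller, for the optical network controller to connect smoothly to the IaaS network service layer of a cloud platform such as a public cloud or hybrid cloud.
In some possible implementations, the optical network includes multiple optical network switching devices, and the optical network controller provides the wavelength splitting configuration to the multiple optical network switching devices. This avoids service unavailability caused by a single point of failure of an optical network switching device and improves the availability of the cluster.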
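In some possible implementations, the working mode of the multiple optical network switching devices is an active/standby mode or a multi-active mode. In active/standby mode, when the active device fails, a standby device can become the new active device and exchange data for the computing nodes in the subnet of the cluster, keeping the service running normally. In multi-active mode, service unavailability caused by a single point of failure is avoided and, in addition, load balancing can be achieved.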
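The difference between the two working modes can be shown with a small selection routine. Everything here is an illustrative assumption: the `Switch` record, the health flag, the priority ordering and the modulo-based load spreading are not described in the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Switch:
    name: str
    healthy: bool = True


def elect_active(switches: List[Switch]) -> Optional[Switch]:
    """Active/standby: the first healthy switch in priority order carries all traffic."""
    for switch in switches:
        if switch.healthy:
            return switch
    return None


def pick_multi_active(switches: List[Switch], flow_id: int) -> Optional[Switch]:
    """Multi-active: spread flows over all healthy switches."""
    healthy = [switch for switch in switches if switch.healthy]
    if not healthy:
        return None
    return healthy[flow_id % len(healthy)]


switches = [Switch("oxc-a"), Switch("oxc-b")]
before_failure = elect_active(switches).name   # the higher-priority device is active
switches[0].healthy = False                    # simulate a failure of the active device
after_failure = elect_active(switches).name    # the standby takes over
```

Because the controller provides the same wavelength splitting configuration to every switch, the standby already holds the optical paths it needs when it takes over.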
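In a second aspect, this application provides a system for establishing an optical path. The system includes an optical network controller and an optical network switching device, and the optical network switching device is used to exchange data for the computing nodes in a cluster;
the optical network controller is configured to obtain the topology of a subnet of the cluster, the topology recording the addresses of the computing nodes in the subnet;
the optical network controller is further configured to determine a wavelength splitting configuration according to the topology and provide the wavelength splitting configuration to the optical network switching device, the wavelength splitting configuration including the different wavelengths allocated to the computing nodes in the subnet;
the optical network switching device is configured to establish optical paths between the optical network switching device and the computing nodes in the subnet according to the wavelength splitting configuration.
In some possible implementations, the subnet includes N computing nodes, N being greater than 1, and the optical network controller is specifically configured to:
for a target computing node among the N computing nodes, determine a sub-range from the wavelength range and sample N-1 wavelengths from the sub-range;
determine the wavelength splitting configuration according to the addresses of the N-1 computing nodes in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device to which the N-1 computing nodes are connected, and the ingress port of the optical network switching device to which the target computing node is connected.
In some possible implementations, the optical network switching device is an optical cross-connect (OXC) switch, and the OXC switch includes a wavelength selective switch;
the optical network switching device is specifically configured to:
establish, according to the wavelength splitting configuration, the optical paths between the optical network switching device and the computing nodes in the subnet through the wavelength selective switch.
In some possible implementations, the computing nodes in the subnet are configured with optical network adaptation devices, the optical network adaptation devices are used to access the optical network, and the optical network controller is further configured to:
provide the wavelength splitting configuration to the optical network adaptation device, so that the optical network adaptation device converts an electrical signal into an optical signal of the corresponding wavelength according to the wavelength splitting configuration.
In some possible implementations, the subnet includes a first computing node and a second computing node;
the optical network adaptation device configured on the first computing node is configured to convert, according to the wavelength splitting configuration, the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength, and send the optical signal to the optical network switching device;
the optical network switching device is further configured to transmit the optical signal to the second computing node through the optical path between the optical network switching device and the second computing node.
In some possible implementations, the optical network controller is specifically configured to:
provide first configuration information in the wavelength splitting configuration to the optical network switching device, the first configuration information including the correspondence between the wavelengths of the optical signals that the computing nodes in the subnet are allowed to receive and the ingress and egress ports of the optical network switching device;
the optical network controller is specifically configured to:
provide second configuration information in the wavelength splitting configuration to the optical network adaptation device, the second configuration information including the correspondence between the addresses of the computing nodes in the subnet and the wavelengths of the optical signals that those computing nodes are allowed to receive.
In some possible implementations, the optical network controller is specifically configured to:
receive the topology of the subnet generated by a job scheduler according to a scheduling policy.
In some possible implementations, the optical network controller is specifically configured to:
receive the topology of the subnet of the cluster sent by the Infrastructure as a Service (IaaS) layer network management of a cloud platform.
In some possible implementations, the system includes multiple optical network switching devices, and the optical network controller is specifically configured to:
provide the wavelength splitting configuration to the multiple optical network switching devices.
In some possible implementations, the working mode of the multiple optical network switching devices is an active/standby mode or a multi-active mode.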
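In a third aspect, this application provides an optical network controller. The controller includes at least one processor and at least one memory, which communicate with each other. The at least one processor is configured to execute instructions stored in the at least one memory, so that the optical network controller performs the steps performed by the optical network controller in the method of the first aspect.
In a fourth aspect, this application provides an optical network switching device. The optical network switching device includes at least one processor and at least one memory, which communicate with each other. The at least one processor is configured to execute instructions stored in the at least one memory, so that the optical network switching device performs the steps performed by the optical network switching device in the method of the first aspect.
In a fifth aspect, this application provides a computer-readable storage medium storing instructions, the instructions being used to perform the method for establishing an optical path according to the first aspect or any implementation of the first aspect.
In a sixth aspect, this application provides a computer program product containing instructions, the instructions being used to perform the method for establishing an optical path according to the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application may further combine them to provide more implementations.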
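To describe the technical solutions of the embodiments of this application more clearly, the accompanying drawings used in the embodiments are briefly introduced below.
FIG. 1A is a schematic architecture diagram of a memory fabric cluster provided by an embodiment of this application;
FIG. 1B is a schematic diagram of the networking mode of a memory fabric cluster provided by an embodiment of this application;
FIG. 2A is a schematic architecture diagram of an optical network provided by an embodiment of this application;
FIG. 2B is a schematic architecture diagram of another optical network provided by an embodiment of this application;
FIG. 2C is a schematic architecture diagram of another optical network provided by an embodiment of this application;
FIG. 2D is a schematic architecture diagram of another optical network provided by an embodiment of this application;
FIG. 3 is a flowchart of a method for establishing an optical path provided by an embodiment of this application;
FIG. 4 is a schematic diagram of path establishment in an all-optical switch provided by an embodiment of this application;
FIG. 5A is a schematic structural diagram of a wavelength selective switch provided by an embodiment of this application;
FIG. 5B is a schematic structural diagram of another wavelength selective switch provided by an embodiment of this application;
FIG. 6 is a schematic diagram of wavelength splitting and routing provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of an optical network controller provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of an optical network switching device provided by an embodiment of this application;
FIG. 9 is a hardware structure diagram of an optical network controller provided by an embodiment of this application;
FIG. 10 is a hardware structure diagram of an optical network switching device provided by an embodiment of this application.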
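The terms "first" and "second" in the embodiments of this application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features.
Some technical terms used in the embodiments of this application are introduced first.
The data-driven architecture, also called the memory-centric architecture or the disaggregated shared memory architecture, is a computing architecture that builds a memory fabric cluster on a new type of memory semantic network.
New memory semantic networks include, but are not limited to, CXL, CAPI and GenZ memory semantic networks. CXL is an open protocol standard for high-speed interconnection between a central processing unit (CPU) and external devices, and it is compatible with the Peripheral Component Interconnect Express (PCIe) protocol standard. CAPI, an important acceleration feature of the Power processor architecture, provides a customizable, efficient and easy-to-use hardware acceleration solution that offloads the CPU; it can be implemented on a field programmable gate array (FPGA). GenZ is a bus-fabric protocol that uses memory-semantic communication to move data between the memories of different components with minimal overhead. It interconnects not only memory devices but also processors and accelerators, and accelerators can relieve the processing load on processors such as the CPU.
The above new memory semantic networks support remote direct memory access (RDMA). RDMA extends the ability of peripherals (such as network cards, graphics cards, hard disks and accelerators) to directly access the memory of the local host (DMA) into the ability of peripherals to directly access the memory of a remote host.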
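A memory fabric cluster can be built by interconnecting multiple computing nodes over such a new memory semantic network. A memory fabric cluster may also be referred to simply as a cluster. FIG. 1A shows an architecture diagram of a cluster. The cluster 10 includes multiple computing nodes 100, and each computing node 100 includes a processor and memory. The processor may be a CPU of the X86 architecture or the Advanced RISC Machine (ARM) architecture, and the memory may be double data rate synchronous dynamic random access memory (DDR SDRAM), abbreviated as DDR. Each computing node 100 may also be configured with a network adaptation device, for example a CXL-based network adaptation device, denoted CXL Device. The network adaptation device may further have other functions; for example, it may be a network adaptation device based on a data streaming accelerator (DSA), denoted DSA Device. DSA is an external device standard for operating on data streams. Supported scenarios include data movement (optionally combined with deduplication, cyclic redundancy check and the like), data comparison, and data conversion (optionally combined with Data Integrity Field (DIF) insertion, DIF checking, DIF updating and the like).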
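The CXL Device or DSA Device can act as a self-defined hardware bridge that translates CXL.IO, CXL.Mem, CXL.Cache or DSA into RDMA network semantics, so that remote memory can be accessed directly over a network such as a GenZ network. This achieves resource pooling and global sharing of tiered memory.
The cluster 10 shown in FIG. 1A requires efficient, stable and reliable cross-node memory access. Specifically, the cluster 10 typically requires ultra-high bandwidth and ultra-low average and long-tail latency. In view of the latency and reliability of long optical fibers, the cluster 10 may be formed by networking multiple computing nodes 100 within a single cabinet or a macro cabinet.
Referring next to the networking mode shown in FIG. 1B, the cluster 10 may be built with OneHop networking. Specifically, multiple computing nodes 100 (denoted H1, H2 … H8 in FIG. 1B) are connected to a switch, and H1 to H8 can exchange data through the switch. However, bursty incast or outcast traffic (for example, H1 to H7 sending data to H8 at the same time, or H8 returning data to H1 to H7 at the same time) can trigger short-term congestion and cause long-tail latency to surge, which limits the performance of the cluster 10.
In view of this, the embodiments of this application provide a method for establishing an optical path. The method is applied to an optical network. The optical network may be an optical transport network (OTN), which is a transport network that transports, multiplexes, routes and monitors service signals in the optical domain and guarantees their performance and survivability. In this embodiment, the OTN may use optical cross connection (OXC) to transport service signals, and an OTN that transports service signals using OXC may also be called an OXC network.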
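The optical network includes an optical network controller and an optical network switching device. The optical network controller may be a controller that supports OXC, also called an OXC controller. Further, the optical network controller may also support software defined networking (SDN). For example, the optical network controller can centrally manage and configure network devices such as the optical network switching device through standardized interfaces, so as to partition the cluster dynamically and meet the needs of different services. The optical network switching device may be a switch that supports OXC, for example an OXC switch. In some embodiments, the optical network switching device may also be a fiber router that supports OXC.
Specifically, the optical network controller may obtain the topology of a subnet of the cluster. The cluster includes multiple computing nodes, and a subnet of the cluster may be a sub-network formed by some of the nodes in the cluster and used to execute a job delivered by an upper-layer service (such as next-day weather prediction for a region, or animation rendering). Different subnets may be used to execute different jobs. The topology of the subnet records the addresses of the computing nodes in the subnet. The optical network controller may then determine a wavelength splitting configuration according to the topology and provide it to the optical network switching device. The wavelength splitting configuration is used to allocate different wavelengths to the computing nodes in the subnet. Correspondingly, the optical network switching device may establish optical paths between the optical network switching device and the computing nodes in the subnet based on the wavelength splitting configuration.
In this method, the optical network switching device establishes optical paths to the different computing nodes in the subnet according to the wavelength splitting configuration, so that when an optical signal arrives at the optical network switching device, the corresponding optical path can be selected according to the wavelength of the signal and the signal can be delivered directly to the corresponding computing node. This solves the congestion caused by bursty incast or outcast traffic in traditional switch networks, prevents long-tail latency from surging, and safeguards the performance of the cluster.
It should also be noted that the physical wiring between the optical network switching device and the computing nodes in the cluster is similar to the physical wiring between a traditional switch and the computing nodes in OneHop networking. This keeps networking simple: unlike full-mesh networking, computing nodes are not required to provide multiple egress ports for physical wiring, which lowers the barrier to networking and improves networking efficiency.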
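To make the technical solutions of this application clearer and easier to understand, the system architecture of the embodiments of this application is introduced below with reference to the accompanying drawings.
Referring to the system architecture of the optical network shown in FIG. 2A, the optical network 20 interfaces with the service layer above it (not shown in FIG. 2A) and with the cluster 10 below it. Service applications in the service layer can generate jobs; for example, a model training application can generate a model training job, and a weather prediction application can generate a weather prediction job. These jobs can be scheduled by the job scheduler 30 onto the underlying cluster 10 for execution. The cluster 10 can be divided into different subnets that execute different jobs, which isolates jobs from each other and protects data security.
For efficient job execution, the job scheduler 30 may split a job into multiple tasks and schedule the tasks onto different computing nodes 100 in the subnet, which can execute the tasks in parallel. While executing their tasks, the computing nodes 100 in the subnet usually also need to exchange data in order to complete the job. Take a model training job as an example. Assuming the training set contains 12 gigabytes (GB) of training data, the job scheduler 30 may split the training job into 3 tasks, each of which trains the model on 4 GB of the training data, and schedule the 3 tasks onto 3 computing nodes 100 in the subnet. In each training round, the 3 computing nodes 100 may also exchange parameter gradients, and each computing node 100 performs the next parameter update according to the gradients it computed itself and the gradients it obtained through the exchange.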
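The gradient exchange in the model training example can be illustrated with a toy data-parallel update. The text only says that each node combines its own gradients with the exchanged ones; combining them by simple averaging, the plain gradient-descent step, and the names `average_gradients` and `sgd_step` are assumptions made for this sketch.

```python
from typing import List


def average_gradients(per_node: List[List[float]]) -> List[float]:
    """Combine the gradients computed by each node (assumed: element-wise mean)."""
    if not per_node:
        raise ValueError("no gradients to combine")
    size = len(per_node[0])
    if any(len(grads) != size for grads in per_node):
        raise ValueError("all nodes must report gradients of the same length")
    return [sum(values) / len(per_node) for values in zip(*per_node)]


def sgd_step(params: List[float], grads: List[float], lr: float) -> List[float]:
    """One plain gradient-descent update, applied identically on every node."""
    return [p - lr * g for p, g in zip(params, grads)]


# Three nodes, each having processed its own 4 GB shard, report two gradients each.
node_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
combined = average_gradients(node_grads)
new_params = sgd_step([10.0, 10.0], combined, lr=0.5)
```

Every round therefore involves each node sending data to all the others at nearly the same time, which is the kind of bursty traffic pattern the optical paths are meant to absorb.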
光网络20用于建立与子网中计算节点100的光通路,从而实现将数据以光信号的形式直接传输至对应的计算节点100,从而解决传统交换机网络中突发的流量导致拥塞,进而影响集群10的性能的问题。
在本实施例中,光网络20包括光网络控制器22和光网络交换设备24。进一步地,光网络20还可以包括光网络适配设备26。其中,光网络控制器22用于对网络设备如光网络交换设备24、光网络适配设备26进行配置、管理,以实现光信号传送。光网络交换设备24用于在集群10的计算节点100之间进行数据交换。光网络适配设备26通常可以配置在计算节点100侧,例如是装配在计算节点100的主板上。光网络适配设备26用于将计算节点100接入光网络20,以通过光网络20实现计算节点100之间的高速通信。
其中,光网络控制器22可以是OXC控制器,光网络交换设备24可以是光网络交换机,例如为OXC交换机,光网络适配设备26可以是光网卡,例如为OXC网卡。在一些实施例中,光网络交换设备24也可以是其他设备,例如为光纤路由器,光网络适配设备26也可以是其他用于接入光网络20的设备。
具体地,光网络控制器22用于获取集群10的子网的拓扑结构。该拓扑结构记录有子网中计算节点100的地址。然后光网络控制器22还用于根据所述拓扑结构确定波长切分配置,并向光网络交换设备24提供波长切分配置。该波长切分配置用于为子网中计算节点分配不同波长,例如波长切分配置可以包括子网中计算节点100的地址、子网中计算节点100允许接收的光信号的波长和光网络交换设备24的入端口、出端口的对应关系。需要说明的是,光网络控制器22可以基于管理网络向光网络交换设备24提供波长切分配置。该管理网络用于传输控制信令,以对光网络交换设备24进行配置。在一些实施例中,管理网络可以是以太网。
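The optical network switching device 24 is configured to establish, according to the wavelength splitting configuration, optical paths between the optical network switching device 24 and the computing nodes 100 in the subnet. Each optical path carries an optical signal of the corresponding wavelength received at an ingress port out through an egress port to a computing node 100 in the subnet. The optical network adaptation device 26 configured on a computing node 100 in the subnet converts the received optical signal into an electrical signal and provides that electrical signal to the computing node 100, thereby completing the data exchange.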
需要说明的是,图2A以集群10为高性能计算(High-performance computing,HPC)集群进行示例说明。HPC集群是指通过各种互联技术将多个计算节点100连接,以通过连接后的多个计算节点100的综合计算能力处理大型计算问题的集群。科学研究、气象预报、仿真实验、生物制药、基因测序、图像处理等行业都可以使用HPC集群来解决大型计算问题。
具体地,参见图2A,HPC集群设置有作业调度器30,作业调度器30接收到作业时,可以获取该作业的调度策略,该调度策略例如可以是优先级调度策略或者回填调度策略。作业调度器30可以根据调度策略确定该集群的子网的拓扑结构,以便后续将作业调度至相应的子网中执行。其中,作业调度器30可以调用光网络控制器22的北向应用程序编程接口(application programming interface,API),下发子网的拓扑结构至光网络控制器22,如此光网络控制器22可以获得该子网的拓扑结构。
在另一些可能的实现方式中,光网络20还可以与云平台提供的云服务对接,用于处理云平台的作业。参见图2B所示的另一种光网络的系统架构图,如图2B所示,光网络20包括光网络控制器22、光网络交换设备24和光网络适配设备26。区别于图2A,图2B中的光网络控制器22可以接收云平台的基础设施即服务(Infrastructure as a Service,IaaS)层网络管理32(如OpenStack Neutron)发送的集群10的子网的拓扑结构,从而实现HPC集群的云服务化。其中,云平台可以是公有云,或者是混合云。混合云是私有云(本地基础架构)和公有云混合的云平台。
具体实现时,IaaS层网络管理32可以提供管理面(也称作控制面),管理面用于传输控制信令至光网络控制器22,以对光网络交换设备24、光网络适配设备26进行管理。其中,IaaS层网络管理32连接有云平台的IaaS层网络服务域34。
IaaS层通过基础设施提供计算、存储和网络服务。其中,网络服务包括但不限于虚拟私有云(virtual private cloud,VPC)、弹性公网IP(Elastic IP,EIP)、网络地址转换(network address translation,NAT)、弹性负载均衡(elastic load balance,ELB)或者云专线(direct connect,DC)。上述每一种网络服务通常可以部署在多个服务器上,从而形成IaaS层网络服务域34。当用户通过云平台的操作界面触发操作,例如触发访问数据库的操作时,客户端响应于用户的操作生成操作请求,该操作请求可以是超文本传输协议(Hypertext Transfer Protocol,HTTP)请求。该操作请求可以先到达IaaS层网络服务域34,由IaaS层网络服务域34进行相应处理后,发送至IaaS层网络管理32。IaaS层网络管理32接收到上述请求,可以根据请求确定集群10中用于处理上述请求的子网的拓扑结构,然后向光网络控制器22发送子网的拓扑结构,使得光网络控制器22根据子网的拓扑结构对光网络交换设备24和光网络适配设备26进行配置、管理。光网络控制器22根据子网的拓扑结构对光网络交换设备24和光网络适配设备26进行配置、管理的具体实现可以参见图2A所示实施例相关内容描述,在此不再赘述。
在该场景中,光网络20的底层整体架构无需任何变动,只需在IaaS层网络管理32添加少量适配,例如是对光网络控制器22的北向API的适配,即可实现光网络控制器22平滑对接到公有云/混合云的IaaS网络服务层。
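In some other possible implementations, referring to the schematic architecture of the optical network shown in FIG. 2C, when the cluster 10 has very high availability requirements, the optical network 20 may include multiple optical network switching devices 24. The computing nodes 100 in the cluster 10 are connected to the multiple optical network switching devices 24, so that each computing node 100 accesses multiple optical network planes at the same time. When providing the wavelength splitting configuration, the optical network controller 22 may provide it to every one of these optical network switching devices 24.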
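The multiple optical network switching devices 24 may work in active-standby mode or multi-active mode. In active-standby mode, some of the devices (for example, some of the optical network switching devices 24) are set as primary devices and the rest as standby devices; when a primary device goes down, a standby device is promoted to primary and takes over service. In multi-active mode, multiple devices provide service at the same time, and the devices (for example, the optical network switching devices 24) may also synchronize data with one another.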
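The two working modes can be illustrated with a minimal Python sketch. The function name `select_active`, the mode strings, and the device names `oxc-a` and `oxc-b` are hypothetical and do not come from the application; the sketch only shows which devices would serve traffic under each mode, given a health flag per device.

```python
def select_active(switches, mode):
    """Return the names of the devices that should serve traffic.

    switches: list of (name, healthy) tuples in preference order (assumed input).
    mode: "active-standby" or "multi-active" (hypothetical labels).
    """
    healthy = [name for name, ok in switches if ok]
    if not healthy:
        raise RuntimeError("no healthy optical network switching device")
    if mode == "active-standby":
        # The first healthy device acts as primary; a standby is promoted
        # automatically when the preferred device is down.
        return healthy[:1]
    if mode == "multi-active":
        # Every healthy device serves traffic at the same time.
        return healthy
    raise ValueError(f"unknown mode: {mode}")


# The preferred device is down, so the standby takes over.
failover = select_active([("oxc-a", False), ("oxc-b", True)], "active-standby")
# Both devices are healthy and both serve traffic.
both = select_active([("oxc-a", True), ("oxc-b", True)], "multi-active")
```

Data synchronization between multi-active devices is outside the scope of this sketch.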
图2C是以HPC集群中计算节点100连接两个光网络交换设备24进行示例说明,在一些可能的实现方式中,计算节点100可以连接更多的光网络交换设备24,从而保障可用性。进一步地,参见图2D所示的另一种光网络20的架构示意图,光网络控制器22对接IaaS层网络管理32的情况下,集群10中的计算节点也可以连接多个光网络交换设备24,从而保障云服务的可用性。
上述图2A至图2D对本申请实施例的光网络20的架构进行了介绍,下面以图2A所示的架构为例,对本申请实施例的建立光通路的方法进行示例说明。
参见图3所示的建立光通路的方法的交互流程图,该方法包括:
S302:作业调度器30接收作业。
作业调度器30具体用于根据集群10中计算节点100等资源的分布和使用情况,将作业调度至合适的计算节点100,以提高作业执行效率以及资源利用率。其中,作业调度器30通常具有工作负载管理和资源管理功能,例如,作业调度器30可以包括工作负载管理器(workload manager)和资源管理器(resource manager)。资源管理器用于收集资源使用信息,工作负载管理器用于根据资源使用信息,将包括作业在内的工作负载调度至适合的计算节点100。在一些实施例中,工作负载管理器还可以监控作业在计算节点100的运行状态,以便基于该运行状态进行作业调度。作业的运行状态可以通过执行进度,预期的剩余执行时间中的至少一种进行表征。
本实施例中的作业调度器30可以是开源的调度器。例如,作业调度器30可以是开源的Openlava调度器,或者是Slurm调度器。在一些实施例中,作业调度器30也可以是用户购买的或者自研的调度器。例如,作业调度器30可以为TORQUE调度器、Moab Cluster Suite调度器。
上述作业调度器30可以接收客户端(如业务应用的客户端)提交的作业。具体地,客户端可以提供操作界面,用户在该操作界面触发操作,例如是访问数据库的操作后,可以产生一个作业,该作业具体为查询数据库中满足查询条件的数据,然后客户端可以提交该作业,相应地,作业调度器30可以接收客户端提交的上述作业。需要说明的是,作业调度器30还可以创建作业队列,然后将接收到的客户端提交的上述作业加入作业队列,以便基于作业队列对作业进行管理。
S304:作业调度器30根据作业的调度,确定子网的拓扑结构。
具体地,作业调度器30配置有针对作业的调度策略。该调度策略可以称作调度算法。在本实施例中,作业调度器30配置的调度策略可以为作业优先级调度策略或回填调度策略。下面分别对不同调度策略进行说明。
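Job priority scheduling policy: jobs are started in order of job priority, so that higher-priority jobs are scheduled before lower-priority ones. The priority of a job may be set when the job is created and may be expressed as a priority value. Different jobs may have equal priority values. When several jobs submitted to the job scheduler 30 have the same priority value, their final priority may be determined by the time at which the job scheduler 30 received them: among jobs with the same priority value, the job received earlier has the higher priority.
Backfill scheduling policy: a lower-priority job is allowed to run first, provided that it does not delay the expected start time of any higher-priority job. The lower-priority job may run on resources (such as computing nodes 100) reserved for that higher-priority job. A backfilled job (such as the lower-priority job above) usually needs a limit on its running time.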
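The priority rule with its receipt-time tie-break can be sketched as a sort key. This is an illustrative sketch only: the `Job` class, its field names, and the convention that a larger number means a higher priority are assumptions, not details taken from the application or from any particular scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int       # assumption: a larger value means a higher priority
    received_at: float  # time at which the scheduler received the job


def dispatch_order(jobs):
    """Highest priority first; equal priorities fall back to earliest receipt."""
    return sorted(jobs, key=lambda j: (-j.priority, j.received_at))


jobs = [
    Job("render", priority=5, received_at=3.0),
    Job("weather", priority=9, received_at=2.0),
    Job("train", priority=5, received_at=1.0),
]
order = [j.name for j in dispatch_order(jobs)]
```

Here `train` and `render` share a priority value, so `train`, which was received earlier, is dispatched before `render`.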
具体地,作业调度器30可以根据作业调度策略,确定用于执行该作业的节点。执行作业的节点可以形成一个子网,执行不同作业的子网可以在逻辑上隔离。如此,作业调度器30可以获得子网的拓扑结构。拓扑结构记录有子网中计算节点100的地址。
计算节点100的地址具有唯一性,该地址可以是网际协议(Internet protocol,IP)地址。在一些实施例中,计算节点100的地址也可以是其他地址,例如为介质访问控制(media access control,MAC)地址。进一步地,拓扑结构还记录有计算节点100的连接关系。该连接关系可以是计算节点100与光网络交换设备24的入端口或出端口之间的连接关系。
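In some possible implementations, the topology of the subnet may be represented by a graph structure. A vertex in the graph may represent a computing node 100 in the subnet, and an edge may represent the connection relationship of computing nodes 100. For example, edge 1 lies between vertex 1 and vertex 2 and represents the connection relationship between the computing node 100 corresponding to vertex 1, the computing node 100 corresponding to vertex 2, and ports on the optical network switching device 24. Taking the system architecture in FIG. 2A as an example, the computing node 100 corresponding to vertex 1 may be H1, the computing node corresponding to vertex 2 may be H2, and edge 1 indicates that H1 is connected to ingress port 1 of the optical network switching device 24 and H2 is connected to egress port 3 of the optical network switching device 24.
In some possible implementations, the topology of the subnet may instead be represented by a data table. The table may contain multiple records, each of which may include the following fields: subnet identifier, address of the computing node 100, ingress port, and egress port, thereby indicating which computing nodes 100 a subnet contains and which ingress or egress port each of them is connected to. Each record may further include a node identifier of the computing node 100. Note that in a given record the ingress port or egress port may be empty or hold a default value.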
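A record of such a data table might be modeled as follows. This is a sketch under stated assumptions: the class name `TopologyRecord`, the field names, the subnet identifiers, the IP addresses, and the port numbers are all hypothetical illustrations of the fields listed above, not the contents of Table 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopologyRecord:
    subnet_id: str
    node_address: str               # e.g. an IP address
    in_port: Optional[int] = None   # may be empty, as the text allows
    out_port: Optional[int] = None  # may be empty, as the text allows
    node_id: Optional[str] = None   # optional node identifier


def nodes_of(records, subnet_id):
    """Addresses of all computing nodes that belong to one subnet."""
    return [r.node_address for r in records if r.subnet_id == subnet_id]


# Hypothetical records; every value is made up for illustration.
records = [
    TopologyRecord("subnet-1", "10.0.0.1", in_port=1, out_port=1, node_id="H1"),
    TopologyRecord("subnet-1", "10.0.0.2", in_port=2, out_port=3, node_id="H2"),
    TopologyRecord("subnet-2", "10.0.0.8", node_id="H8"),
]
subnet1_nodes = nodes_of(records, "subnet-1")
```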
为了便于理解,下面结合一具体示例进行说明。参见表1,表1示出了子网的拓扑结构:
表1子网的拓扑结构
在一些可能的实现方式中,子网的拓扑结构中还可以记录子网的标签。该标签可以用于过滤或查询。例如,该标签可以标识子网处理业务的业务标识或业务类型。
S306:作业调度器30向光网络控制器22下发子网的拓扑结构。
具体地,光网络控制器22可以提供北向API。北向API是指向上提供的接口,例如是向上层业务应用提供的接口,其目标是使得业务应用能够便利地调用底层的网络资源和能力。与之对应的南向API则是向下提供的接口,例如管理其他厂商网管或设备的接口。
作业调度器30可以通过北向API全局把控集群10的资源状态,并根据资源状态对作业进行统一调度。具体实现时,作业调度器30可以调用北向API,向光网络控制器22下发子网的拓扑结构,以便于根据该子网的拓扑结构对作业进行统一调度。
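S308: The optical network controller 22 determines the wavelength splitting configuration according to the topology of the subnet.
Specifically, the subnet includes N computing nodes 100, where N is a positive integer greater than 1. For a target computing node among the N computing nodes 100, which may be any one of them, the optical network controller 22 may determine a sub-range from the wavelength range and sample N-1 wavelengths from that sub-range. It then determines the wavelength splitting configuration according to the addresses of the N-1 computing nodes 100 in the subnet other than the target computing node, the N-1 wavelengths, the egress ports of the optical network switching device 24 to which the N-1 computing nodes are connected, and the ingress port of the optical network switching device 24 to which the target computing node is connected.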
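The wavelength range may be the usable wavelength range, for example the wavelength range of visible light. The optical network controller 22 may determine the sub-range corresponding to each target computing node by sampling from this wavelength range, and in this way obtains N sub-ranges. To avoid signal interference, the N sub-ranges usually do not overlap. The optical network controller 22 may sample N-1 wavelengths from each of the N sub-ranges, and thus determines N*(N-1) wavelengths in total.
When determining the sub-ranges from the range, the optical network controller 22 may use uniform sampling or random sampling. For example, it may divide the range into N segments to obtain N sub-ranges. Each of the N sub-ranges represents the range of wavelengths of the optical signals that one computing node 100 in the subnet can send. Similarly, when determining the N-1 wavelengths from each sub-range, the optical network controller 22 may also use uniform sampling or random sampling. For example, it may divide the sub-range into N-1 segments and take the left endpoint, midpoint, or right endpoint of each segment, thereby obtaining N-1 wavelengths. These N-1 wavelengths are the wavelengths of the optical signals sent to the remaining N-1 computing nodes 100 in the subnet, one wavelength per node.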
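For ease of understanding, an example follows.
The wavelength of visible light is usually between 780 and 400 nanometers (nm). Suppose the subnet includes four computing nodes 100, namely H1, H2, H3, and H5. The optical network controller 22 may then divide the wavelengths into the following four segments: 780 nm to 685 nm, 685 nm to 590 nm, 590 nm to 495 nm, and 495 nm to 400 nm, where each range includes its left endpoint and excludes its right endpoint. The optical network controller 22 may then take three values from each segment, thereby completing the wavelength splitting. Taking the 780 nm to 685 nm segment as an example, the optical network controller 22 may take the three values 750 nm, 720 nm, and 690 nm. Thus, when H1 sends data to H2, H3, and H5 respectively, the electrical signal carrying the data is converted into an optical signal with a wavelength of 750 nm, 720 nm, and 690 nm respectively.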
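The uniform variant of this splitting can be sketched in a few lines of Python. The function names `split_range` and `plan_wavelengths` are hypothetical, and the sketch picks the midpoint of each inner segment, which is one of the options the description mentions. Because of that choice, the wavelengths it produces for H1 differ from the 750 nm, 720 nm, and 690 nm picked by hand in the example, although they fall in the same 780 nm to 685 nm segment.

```python
def split_range(start_nm, end_nm, parts):
    """Cut a descending range [start, end] into `parts` equal segments."""
    step = (start_nm - end_nm) / parts
    return [(start_nm - i * step, start_nm - (i + 1) * step) for i in range(parts)]


def plan_wavelengths(nodes, start_nm=780.0, end_nm=400.0):
    """Give each sender one sub-range and one wavelength per other node.

    Requires at least two nodes (N > 1), as in the description.
    """
    n = len(nodes)
    plan = {}
    for sender, (hi, lo) in zip(nodes, split_range(start_nm, end_nm, n)):
        receivers = [x for x in nodes if x != sender]
        segments = split_range(hi, lo, n - 1)
        # Midpoint of each inner segment: one wavelength per receiver.
        plan[sender] = {rx: (a + b) / 2 for rx, (a, b) in zip(receivers, segments)}
    return plan


sub_ranges = split_range(780.0, 400.0, 4)
plan = plan_wavelengths(["H1", "H2", "H3", "H5"])
total = sum(len(per_sender) for per_sender in plan.values())  # N * (N - 1)
```

With four nodes, `sub_ranges` reproduces the four segments listed above and `total` is 4 × 3 = 12 wavelengths.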
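After the wavelength splitting is complete, the optical network controller 22 may determine the wavelength splitting configuration. The wavelength splitting configuration includes the correspondence among the addresses of the computing nodes 100 in the subnet, the wavelengths of the optical signals that the computing nodes 100 in the subnet are allowed to receive, and the ingress and egress ports of the optical network switching device 24.
The topology of the subnet records the ingress and egress ports to which the computing nodes 100 are connected. On this basis, the optical network controller 22 may combine the correspondence between the addresses of the computing nodes 100 in the subnet and the wavelengths they are allowed to receive with the ingress and egress ports recorded in the topology, to determine the correspondence among addresses, allowed wavelengths, and ingress and egress ports of the optical network switching device 24, thereby obtaining the wavelength splitting configuration.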
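The combination step can be pictured as a join between the per-receiver wavelengths and the port information in the topology. In the sketch below, `ConfigEntry` and `build_config` are hypothetical names, node names stand in for addresses, and the port numbers for H3 and H5 are made up; only the 750/720/690 nm values and the H1 ingress port 1 and H2 egress port 3 echo examples from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigEntry:
    dst_address: str      # node allowed to receive this wavelength
    wavelength_nm: float
    in_port: int          # switch port the sender is attached to
    out_port: int         # switch port the receiver is attached to


def build_config(sender, wavelengths, in_ports, out_ports):
    """Join wavelengths (receiver -> nm) with port data from the topology."""
    return [
        ConfigEntry(dst, wl, in_ports[sender], out_ports[dst])
        for dst, wl in wavelengths.items()
    ]


in_ports = {"H1": 1}                     # from the topology
out_ports = {"H2": 3, "H3": 4, "H5": 6}  # H3/H5 ports are assumed values
config = build_config(
    "H1", {"H2": 750.0, "H3": 720.0, "H5": 690.0}, in_ports, out_ports
)
```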
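S310: The optical network controller 22 provides the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26.
In a specific implementation, the optical network controller 22 may provide the optical network switching device 24 with first configuration information in the wavelength splitting configuration, which includes the correspondence between the wavelengths of the optical signals that the computing nodes 100 in the subnet are allowed to receive and the ingress and egress ports of the optical network switching device 24, and provide the optical network adaptation device 26 with second configuration information in the wavelength splitting configuration, which includes the correspondence between the addresses of the computing nodes 100 in the subnet and the wavelengths of the optical signals that those computing nodes 100 are allowed to receive. This reduces the transmission overhead between the optical network controller 22 and the optical network switching device 24 and optical network adaptation device 26.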
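One way to picture the split into first and second configuration information is as two projections of the same entries. The dictionary keys, the function name `split_config`, and the sample addresses and ports below are assumptions made for illustration.

```python
def split_config(entries):
    """Project full entries into the two per-device views.

    entries: list of dicts with keys address, wavelength_nm, in_port, out_port.
    """
    # First configuration information: what the switching device needs.
    first = [
        {k: e[k] for k in ("wavelength_nm", "in_port", "out_port")} for e in entries
    ]
    # Second configuration information: what the adaptation device needs.
    second = [{k: e[k] for k in ("address", "wavelength_nm")} for e in entries]
    return first, second


entries = [
    {"address": "10.0.0.2", "wavelength_nm": 750.0, "in_port": 1, "out_port": 3},
    {"address": "10.0.0.3", "wavelength_nm": 720.0, "in_port": 1, "out_port": 4},
]
first, second = split_config(entries)
```

Each device then receives only the columns it uses, which is where the reduced transmission overhead comes from.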
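In some possible implementations, the optical network controller 22 may instead make no distinction between the optical network switching device 24 and the optical network adaptation device 26 and provide the complete correspondence to both. That is, it may provide both devices with the correspondence among the addresses of the computing nodes 100 in the subnet, the wavelengths of the optical signals those nodes are allowed to receive, and the ingress and egress ports of the optical network switching device 24.
The optical network controller 22 may provide the wavelength splitting configuration to the optical network switching device 24 and the optical network adaptation device 26 in several ways. In one implementation, the optical network controller 22 actively pushes the configuration to both devices. In another, it returns the configuration in response to a configuration acquisition request from the optical network switching device 24 and the optical network adaptation device 26.
Note that the optical network controller 22 need not transmit the wavelength splitting configuration directly to the optical network switching device 24 and the optical network adaptation device 26. For example, it may make the configuration available to them by sharing it.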
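S312: The optical network switching device 24 establishes, according to the wavelength splitting configuration, optical paths between the optical network switching device 24 and the computing nodes 100 in the subnet.
Specifically, the optical network switching device 24 may establish the corresponding optical paths according to the correspondence between wavelengths and ingress and egress ports in the wavelength splitting configuration. Each optical path carries an optical signal of the corresponding wavelength received at an ingress port out through the egress port to a computing node 100 in the subnet. In this embodiment, the optical network switching device 24 may establish multiple different optical paths, which carry optical signals of different wavelengths to different computing nodes 100.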
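Logically, path establishment amounts to filling a cross-connect table keyed by ingress port and wavelength. The class below is a software model only: `OpticalSwitch`, its methods, and the dictionary keys are hypothetical, and a real device would program a wavelength selective switch rather than a Python dictionary.

```python
class OpticalSwitch:
    """Toy model of the cross-connect state of an optical switching device."""

    def __init__(self):
        self.paths = {}  # (in_port, wavelength_nm) -> out_port

    def establish(self, in_port, wavelength_nm, out_port):
        key = (in_port, wavelength_nm)
        # One wavelength on one ingress port may lead to only one egress port.
        if self.paths.get(key, out_port) != out_port:
            raise ValueError(f"{key} is already mapped to port {self.paths[key]}")
        self.paths[key] = out_port

    def apply(self, first_config):
        """Establish one path per entry of the (assumed) first configuration."""
        for entry in first_config:
            self.establish(entry["in_port"], entry["wavelength_nm"], entry["out_port"])


first_config = [
    {"in_port": 1, "wavelength_nm": 750.0, "out_port": 3},
    {"in_port": 1, "wavelength_nm": 720.0, "out_port": 4},
]
switch = OpticalSwitch()
switch.apply(first_config)
```

The conflict check is a design choice of the sketch, not something the description prescribes; it simply makes inconsistent configurations fail loudly.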
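Referring to the schematic diagram of path establishment in an all-optical switch shown in FIG. 4, in this example the optical network switching device 24 is an OXC switch, and the OXC switch is an all-optical switch. When the source port is the port that receives data from H1 and the destination ports are the ports that send data to H2, H3, and H5 respectively, optical paths for transmitting optical signals of wavelengths λ1, λ2, and λ3 may be established respectively.
The principle by which the OXC switch (all-optical switch) establishes optical paths is described below.
The OXC switch may include a wavelength selective switch (WSS). The switch may be an N×N-port matrix optical switch. The OXC switch may establish paths through the WSS according to the wavelength splitting configuration.
A WSS may be based on a microelectromechanical system (MEMS) or on liquid crystal on silicon (LCoS). The principles of the two kinds of WSS are described in turn below.
Referring to the schematic structure of a MEMS-based WSS shown in FIG. 5A, the MEMS-based WSS includes a fiber array, a grating, a MEMS, and a reflector. The fiber array includes ingress ports (input fiber ports) and egress ports (output fiber ports). The optical signal received at an ingress port may be a wavelength-division-multiplexed signal, which is separated by wavelength as it passes through the grating. In this example, optical signals of two wavelengths are assumed to be separated. After being reflected by the reflector, they reach different mirrors in the MEMS; by adjusting the angles of these mirrors, the optical signals of the two wavelengths can be reflected via the reflector to specific egress ports of the fiber array.
Next, referring to the schematic structure of an LCoS-based WSS shown in FIG. 5B, the LCoS-based WSS includes a fiber array, a grating, a liquid-crystal-based spatial light modulator (phased array LC-based switch), and a reflector. Specifically, when a wavelength-division-multiplexed optical signal passes through the grating, the optical signals of the individual wavelengths are demultiplexed to different spatial positions, yielding optical signals of different wavelengths. Whereas the MEMS-based WSS adjusts an optical path by controlling mirror angles to change the direction of the optical signal of a given wavelength in real time, the LCoS-based WSS adjusts an optical path by using the liquid-crystal-based spatial light modulator to change the phase of the optical signal of a given wavelength.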
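On this basis, the optical network switching device 24 may, according to the correspondence between wavelengths and ingress and egress ports, use the WSS described above to establish an optical path that switches an optical signal of a specific wavelength from a specified ingress port to a specified egress port.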
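S314: The optical network adaptation device 26 generates a forwarding table according to the wavelength splitting configuration.
Specifically, the wavelength splitting configuration includes the correspondence between the addresses of the computing nodes 100 in the subnet and the wavelengths of the optical signals those nodes are allowed to receive. From this correspondence, the optical network adaptation device 26 may generate a forwarding table indexed by the address of the computing node 100. The address of a computing node 100 may be an IP address; accordingly, each entry of the forwarding table may include an IP address and the wavelength corresponding to that IP address. The optical network adaptation device 26 may further store the forwarding table so that signal conversion can later be performed based on it.
For ease of understanding, FIG. 4 is again used as an example. The optical network adaptation device 26 may generate a forwarding table from the wavelength splitting configuration; the table contains the correspondence between the addresses of H2, H3, and H5 and the wavelengths of the optical signals that H2, H3, and H5 are allowed to receive. The optical network adaptation device 26 may store this forwarding table so that the electrical-to-optical conversion of signals can later be performed based on it.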
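A forwarding table indexed by address is naturally a dictionary. The sketch below assumes the second configuration information arrives as (address, wavelength) pairs; the function names, the IP addresses for H2, H3, and H5, and the numeric wavelengths standing in for λ1, λ2, and λ3 are all hypothetical.

```python
def build_forwarding_table(second_config):
    """Index destination address -> wavelength (nm) the destination may receive."""
    table = {}
    for address, wavelength_nm in second_config:
        table[address] = wavelength_nm
    return table


def wavelength_for(table, dst_address):
    """Look up the wavelength to use for a packet's destination address."""
    try:
        return table[dst_address]
    except KeyError:
        raise LookupError(f"no wavelength configured for {dst_address}") from None


# Assumed addresses of H2, H3 and H5, with assumed values for λ1, λ2 and λ3.
second_config = [("10.0.0.2", 750.0), ("10.0.0.3", 720.0), ("10.0.0.5", 690.0)]
table = build_forwarding_table(second_config)
lambda_for_h3 = wavelength_for(table, "10.0.0.3")
```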
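Note that S312 and S314 may be performed in parallel or one after the other in a set order; the order of execution does not affect the specific implementation of the embodiments of this application and is not limited in this embodiment. Furthermore, the method of the embodiments of this application may be performed without S314; for example, the optical network adaptation device 26 may perform subsequent processing directly from the wavelength splitting configuration provided by the optical network controller 22.
S316: The optical network adaptation device 26 configured on the first computing node receives an electrical signal to be sent to the second computing node.
The first computing node and the second computing node are different computing nodes 100 in the subnet, which may exchange data to complete a job. For example, computing nodes 100 may exchange gradient values to complete a model training job.
The first computing node and the second computing node may exchange data in the form of packets. Specifically, the CPU of the first computing node may generate a packet in the form of an electrical signal and then send it to the optical network adaptation device 26 configured on the first computing node.
S318: The optical network adaptation device 26 configured on the first computing node converts, according to the forwarding table, the electrical signal to be sent to the second computing node into an optical signal of the corresponding wavelength.
The packet in electrical-signal form records a source address and a destination address. In this embodiment, the source address may be the address of the first computing node and the destination address may be the address of the second computing node. The optical network adaptation device 26 configured on the first computing node may look up the forwarding table by the address of the second computing node to obtain the wavelength of the optical signal that the second computing node is allowed to receive, and then convert the electrical signal into an optical signal of that wavelength.
The optical network adaptation device 26 includes an electrical-optical conversion module. After determining the wavelength of the optical signal that the second computing node is allowed to receive, the optical network adaptation device 26 uses this module to convert the electrical signal into an optical signal of that wavelength.
S320: The optical network adaptation device 26 configured on the first computing node sends the optical signal to the optical network switching device 24.
The optical network adaptation device 26 configured on the first computing node may send the optical signal to the optical network switching device 24 over the path between the first computing node and the optical network switching device 24.
S322: The optical network switching device 24 transmits the optical signal to the second computing node over the optical path to the second computing node.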
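Specifically, the optical network switching device 24 may determine the egress port from the correspondence between the wavelengths of the optical signals that the second computing node is allowed to receive and the ingress and egress ports of the optical network switching device 24, together with the wavelength of the optical signal and the ingress port on which it was received, and then deliver the optical signal to the second computing node over the optical path between that egress port and the second computing node. In other words, the optical network switching device 24 may deliver the optical signal to the second computing node by wavelength-based path selection, and the optical network adaptation device 26 configured on the second computing node may convert the optical signal into an electrical signal through its optical-electrical conversion module, so that the packet reaches the second computing node.
For ease of understanding, an example is provided. Referring to the schematic diagram of wavelength-based path selection shown in FIG. 6, when the optical network adaptation device 26 generates optical signals of wavelengths λ1, λ2, and λ3 and the optical network switching device 24 receives them in time slots T1, T2, and T3 respectively, the optical network switching device 24 may select the corresponding optical path by wavelength-based path selection and transmit each optical signal to the corresponding computing node. For example, in time slot T1 the optical network switching device 24 (such as an OXC switch) receives the optical signal of wavelength λ1 sent by H1, determines the corresponding optical path from the wavelength and the ingress port, and delivers the optical signal to H2 over that optical path.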
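The time-slot example can be replayed with two lookup tables, one at H1's adaptation device and one at the switching device. Everything concrete in this sketch is an assumption: the numeric stand-ins for λ1, λ2, and λ3, the ingress port number 1, the node names used as addresses, and the function name `send`.

```python
LAMBDA1, LAMBDA2, LAMBDA3 = 750.0, 720.0, 690.0  # assumed values in nm

# Forwarding table held by the adaptation device on H1: destination -> wavelength.
forwarding_table = {"H2": LAMBDA1, "H3": LAMBDA2, "H5": LAMBDA3}
# Paths held by the switching device: (ingress port, wavelength) -> receiving node.
switch_paths = {(1, LAMBDA1): "H2", (1, LAMBDA2): "H3", (1, LAMBDA3): "H5"}


def send(dst, in_port=1):
    """Return (wavelength used, node reached) for one packet from H1."""
    wavelength = forwarding_table[dst]            # adapter picks the wavelength
    reached = switch_paths[(in_port, wavelength)]  # switch selects the path
    return wavelength, reached


# One packet per time slot, as in the example.
slots = {slot: send(dst) for slot, dst in (("T1", "H2"), ("T2", "H3"), ("T3", "H5"))}
```

The switching device never inspects the destination address; the wavelength and the ingress port alone determine where the signal goes.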
在该方法中,光网络控制器22根据子网的拓扑结构对波长进行切分,并将波长切分配置下发至光网络交换设备24和光网络适配设备26,使得光网络适配设备26可以将到达不同计算节点100的报文转换为不同波长的光信号,光网络交换设备24根据波长区分光信号,将光信号通过预先建立的光通路直通到相应的计算节点100,避免了传统交换机网络中突发的Incast流量或Outcast流量导致的拥塞问题,缩短了长尾时延,保障了集群10的性能。
而且,光网络交换设备24可以根据波长切分配置,与特定的计算节点100(具体是子网中的计算节点100)建立光通路,实现了集群10的弹性调度,能够灵活满足业务的不同需求。
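Note that the embodiment shown in FIG. 3 is described using the system architecture of FIG. 2A as an example. In the system architecture shown in FIG. 2B, the optical network controller 22 may instead interface with the IaaS-layer network management 32 of a public cloud or hybrid cloud, so that a memory-interconnect network cluster such as the cluster 10 is offered as a cloud service. Specifically, the IaaS-layer network service domain 34 of the public or hybrid cloud may generate a job, for example in response to a user operation. The IaaS-layer network management 32 may determine the topology of the subnet according to the distribution and usage of resources in the cluster 10 and then deliver the topology to the optical network controller 22 through the northbound API of the optical network controller 22. The optical network controller 22 then determines the wavelength splitting configuration according to the topology and provides it to the optical network switching device 24, and the optical network switching device 24 establishes optical paths to the computing nodes 100 in the subnet according to the wavelength splitting configuration, so that optical signals can subsequently be transmitted to those computing nodes 100 over the optical paths. For how the optical network controller 22 determines the wavelength splitting configuration and how the optical network switching device 24 establishes the optical paths, refer to the description of FIG. 3; details are not repeated here. By offering the cluster 10 as a cloud service, the method can improve the utilization of resources (such as computing resources) in the cluster 10 and reduce cost.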
进一步地,如图2C或图2D所示,计算节点100可以连接多个光网络交换设备24,多个光网络交换设备24的工作模式为主备模式或多活模式,光网络控制器22可以向上述多个光网络交换设备24提供波长切分配置,避免了光网络交换设备24发生单点故障导致网络中断,提高了集群10的可用性。
基于本申请实施例提供的建立光通路的方法,本申请实施例还提供了一种建立光通路的系统。建立光通路的系统可以是硬件系统,该硬件系统可以包括如图2A至图2D所示的光网络控制器22和光网络交换设备24。进一步地,该硬件系统可以包括如图2A至图2D所示的光网络适配设备26。换言之,建立光通路的系统可以是如图2A至图2D所示的光网络20。
参见图2A至图2D所示的光网络20的结构示意图,该光网络20包括光网络控制器22和光网络交换设备24,所述光网络交换设备24用于为集群10中的计算节点100交换数据。
所述光网络控制器22,用于获取集群10的子网的拓扑结构,所述拓扑结构记录有所述子网中计算节点100的地址;
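The optical network controller 22 is further configured to determine a wavelength splitting configuration according to the topology and provide the wavelength splitting configuration to the optical network switching device 24, where the wavelength splitting configuration includes different wavelengths assigned to the computing nodes in the subnet;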
所述光网络交换设备24,用于根据所述波长切分配置建立所述光网络交换设备24与所述子网中计算节点100的光通路。
在一些可能的实现方式中,所述子网中包括N个计算节点,所述N大于1,所述光网络控制器22具体用于:
针对所述N个计算节点中的目标计算节点,从波长范围中确定子范围,并从所述子范围中采样N-1个波长;
根据所述子网中除所述目标计算节点之外的N-1个计算节点的地址、所述N-1个波长、所述N-1个计算节点100连接的所述光网络交换设备的出端口、所述目标计算节点连接的所述光网络交换设备24的入端口,确定波长切分配置。
在一些可能的实现方式中,所述光网络交换设备24为光交叉连接OXC交换机,所述OXC交换机包括波长选择开关;
所述光网络交换设备24具体用于:
根据所述波长切分配置,通过所述波长选择开关建立光网络交换设备24与所述子网中计算节点100的光通路。
在一些可能的实现方式中,所述子网中计算节点100配置有光网络适配设备26,所述光网络适配设备26用于接入光网络20,所述光网络控制器22还用于:
向所述光网络适配设备26提供所述波长切分配置,以使所述光网络适配设备26根据所述波长切分配置将电信号转换为相应波长的光信号。
在一些可能的实现方式中,所述子网中包括第一计算节点和第二计算节点;
所述第一计算节点配置的光网络适配设备26,用于根据所述波长切分配置,将待发送至所述第二计算节点的电信号转换为相应波长的光信号,向所述光网络交换设备24发送所述光信号;
所述光网络交换设备24,还用于通过所述光网络交换设备24与所述第二计算节点的光通路传输所述光信号至所述第二计算节点。
在一些可能的实现方式中,所述光网络控制器22具体用于:
向所述光网络交换设备提供所述波长切分配置中的第一配置信息,所述第一配置信息包括所述子网中计算节点允许接收的光信号的波长和所述光网络交换设备的入端口、出端口的对应关系;
所述光网络控制器22具体用于:
向所述光网络适配设备提供所述波长切分配置中的第二配置信息,所述第二配置信息包括所述子网中计算节点的地址和所述子网中计算节点允许接收的光信号的波长的对应关系。
在一些可能的实现方式中,所述光网络控制器22具体用于:
接收作业调度器根据调度策略生成的子网的拓扑结构。
在一些可能的实现方式中,所述光网络控制器22具体用于:
接收云平台的基础设施即服务IaaS层网络管理发送的所述集群的子网的拓扑结构。
在一些可能的实现方式中,所述光网络20包括多个所述光网络交换设备24,所述光网络控制器22具体用于:
向多个所述光网络交换设备24提供所述波长切分配置。
在一些可能的实现方式中,所述多个光网络交换设备24的工作模式为主备模式或多活模式。
根据本申请实施例的光网络20可对应于执行本申请实施例中描述的方法,并且光网络20的各个组成部分的上述和其它操作和/或功能分别为了实现图3所示实施例中的各个方法的相应流程,为了简洁,在此不再赘述。
基于本申请实施例的建立光通路的方法和系统,本申请实施例还提供了一种光网络控制器22和光网络交换设备24。下面先从功能模块化的角度,对本申请实施例的光网络控制器22和光网络交换设备24进行介绍。
参见图7所示的光网络控制器22的结构示意图,光网络控制器22包括:
通信模块702,用于获取集群的子网的拓扑结构,所述拓扑结构记录有所述子网中计算节点的地址;
确定模块704,用于根据所述拓扑结构确定波长切分配置;
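a providing module 706, configured to provide the wavelength splitting configuration to the optical network switching device 24, where the wavelength splitting configuration is used to assign different wavelengths to the computing nodes 100 in the subnet, so that the optical network switching device 24 establishes, according to the wavelength splitting configuration, optical paths between the optical network switching device 24 and the computing nodes 100 in the subnet.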
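In some possible implementations, the subnet includes N computing nodes, N is greater than 1, and the determining module 704 is specifically configured to: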
针对所述N个计算节点中的目标计算节点,从波长范围中确定子范围,并从所述子范围中采样N-1个波长;
根据所述子网中除所述目标计算节点之外的N-1个计算节点的地址、所述N-1个波长、所述N-1个计算节点100连接的所述光网络交换设备的出端口、所述目标计算节点连接的所述光网络交换设备24的入端口,确定波长切分配置。
在一些可能的实现方式中,所述子网中计算节点100配置有光网络适配设备26,所述光网络适配设备26用于接入光网络20,所述提供模块706还用于:
向所述光网络适配设备26提供所述波长切分配置,以使所述光网络适配设备26根据所述波长切分配置将电信号转换为相应波长的光信号。
在一些可能的实现方式中,所述提供模块706具体用于:
向所述光网络交换设备提供所述波长切分配置中的第一配置信息,所述第一配置信息包括所述子网中计算节点允许接收的光信号的波长和所述光网络交换设备的入端口、出端口的对应关系;
向所述光网络适配设备提供所述波长切分配置中的第二配置信息,所述第二配置信息包括所述子网中计算节点的地址和所述子网中计算节点允许接收的光信号的波长的对应关系。
在一些可能的实现方式中,所述通信模块702具体用于:
接收作业调度器30根据调度策略生成的子网的拓扑结构。
在一些可能的实现方式中,所述通信模块702具体用于:
接收云平台的基础设施即服务IaaS层网络管理发送的所述集群10的子网的拓扑结构。
在一些可能的实现方式中,所述光网络20包括多个所述光网络交换设备24,所述提供模块706具体用于:
向多个所述光网络交换设备24提供所述波长切分配置。
在一些可能的实现方式中,所述多个光网络交换设备24的工作模式为主备模式或多活模式。
根据本申请实施例的光网络控制器22可对应于执行本申请实施例中描述的方法,并且光网络控制器22的各个模块的上述和其它操作和/或功能分别为了实现图3所示实施例中由光网络控制器22执行的步骤,为了简洁,在此不再赘述。
接着参见图8所示的光网络交换设备24的结构示意图,光网络交换设备24包括:
通信模块802,用于获取波长切分配置,所述波长切分配置用于为所述子网中计算节点100分配不同波长;
建立模块804,用于根据所述波长切分配置建立所述光网络交换设备24与所述子网中计算节点100的光通路。
在一些可能的实现方式中,所述光网络交换设备24为光交叉连接OXC交换机,所述OXC交换机包括波长选择开关;
所述建立模块804具体用于:
根据所述波长切分配置,通过所述波长选择开关建立光网络交换设备24与所述子网中计算节点100的光通路。
在一些可能的实现方式中,所述子网中包括第一计算节点和第二计算节点;
所述通信模块802,还用于接收所述第一计算节点配置的光网络适配设备26发送的光信号,所述光信号由所述光网络适配设备26根据所述波长切分配置,将待发送至所述第二计算节点的电信号转换得到,然后通过所述光网络交换设备24与所述第二计算节点的光通路传输所述光信号至所述第二计算节点。
根据本申请实施例的光网络交换设备24可对应于执行本申请实施例中描述的方法,并且光网络交换设备24的各个模块的上述和其它操作和/或功能分别为了实现图3所示实施例中由光网络交换设备24执行的步骤,为了简洁,在此不再赘述。
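FIG. 7 and FIG. 8 above describe the optical network controller 22 and the optical network switching device 24 of the embodiments of this application from the perspective of functional modules. They are described below from the perspective of hardware implementation.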
图9提供了一种光网络控制器22的结构示意图,如图9所示,光网络控制器22包括总线901、处理器902、通信接口903和存储器904。处理器902、存储器904和通信接口903之间通过总线901通信。
总线901可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图9中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
处理器902可以为中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
通信接口903用于与外部通信。例如,通信接口903用于获取集群10的子网的拓扑结构,或者是向光网络交换设备24提供波长切分配置等等。
存储器904可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器904还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,硬盘驱动器(hard disk drive,HDD)或固态驱动器(solid state drive,SSD)。
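The memory 904 stores computer-readable instructions, and the processor 902 executes the computer-readable instructions so that the optical network controller 22 performs the steps performed by the optical network controller 22 in the foregoing method for establishing an optical path.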
接下来,参见图10所示的光网络交换设备24的结构示意图,如图10所示,光网络交换设备24包括总线1001、处理器1002、通信接口1003和存储器1004。处理器1002、存储器1004和通信接口1003之间通过总线1001通信。
存储器1004中存储有计算机可读指令,处理器1002执行该计算机可读指令,以使得光网络交换设备24执行前述建立光通路的方法中由光网络交换设备24执行的步骤。
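An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that the optical network controller 22 or the optical network switching device 24 can store, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive). The computer-readable storage medium includes instructions that instruct the optical network controller 22 or the optical network switching device 24 to perform the steps performed by the optical network controller 22 or the optical network switching device 24 in the foregoing method.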
本申请实施例还提供了一种计算机程序产品。所述计算机程序产品包括一个或多个计算机指令。在光网络控制器22或光网络交换设备24上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算设备或数据中心进行传输。所述计算机程序产品可以为一个软件安装包,在需要使用前述建立光通路的方法的任一方法的情况下,可以下载该计算机程序产品并在光网络控制器22或光网络交换设备24上执行该计算机程序产品。
上述各个附图对应的流程或结构的描述各有侧重,某个流程或结构中没有详述的部分,可以参见其他流程或结构的相关描述。
Claims (24)
- 一种建立光通路的方法,其特征在于,应用于光网络,所述光网络包括光网络控制器和光网络交换设备,所述光网络交换设备用于为集群中的计算节点交换数据,所述方法包括:所述光网络控制器获取所述集群的子网的拓扑结构,所述拓扑结构记录有所述子网中计算节点的地址;所述光网络控制器根据所述拓扑结构确定波长切分配置,并向所述光网络交换设备提供所述波长切分配置,所述波长切分配置包括为所述子网中计算节点分配的不同波长;所述光网络交换设备根据所述波长切分配置建立所述光网络交换设备与所述子网中计算节点的光通路。
- 根据权利要求1所述的方法,其特征在于,所述子网中包括N个计算节点,所述N大于1,所述光网络控制器根据所述拓扑结构确定波长切分配置,包括:针对所述N个计算节点中的目标计算节点,所述光网络控制器从波长范围中确定子范围,并从所述子范围中采样N-1个波长;所述光网络控制器根据所述子网中除所述目标计算节点之外的N-1个计算节点的地址、所述N-1个波长、所述N-1个计算节点连接的所述光网络交换设备的出端口、所述目标计算节点连接的所述光网络交换设备的入端口,确定波长切分配置。
- 根据权利要求1或2所述的方法,其特征在于,所述光网络交换设备为光交叉连接OXC交换机,所述OXC交换机包括波长选择开关;所述光网络交换设备根据所述波长切分配置建立所述光网络交换设备与所述子网中计算节点的光通路,包括:所述光网络交换设备根据所述波长切分配置,通过所述波长选择开关建立所述光网络交换设备与所述子网中计算节点的光通路。
- 根据权利要求1至3任一项所述的方法,其特征在于,所述子网中计算节点配置有光网络适配设备,所述光网络适配设备用于接入所述光网络,所述方法还包括:所述光网络控制器向所述光网络适配设备提供所述波长切分配置,以使所述光网络适配设备根据所述波长切分配置将电信号转换为相应波长的光信号。
- 根据权利要求4所述的方法,其特征在于,所述子网中包括第一计算节点和第二计算节点,所述方法还包括:所述第一计算节点配置的光网络适配设备根据所述波长切分配置,将待发送至所述第二计算节点的电信号转换为相应波长的光信号,向所述光网络交换设备发送所述光信号;所述光网络交换设备通过所述光网络交换设备与所述第二计算节点的光通路传输所述光信号至所述第二计算节点。
- 根据权利要求4所述的方法,其特征在于,所述光网络控制器向所述光网络交换设备提供所述波长切分配置,包括:所述光网络控制器向所述光网络交换设备提供所述波长切分配置中的第一配置信息,所述第一配置信息包括所述子网中计算节点允许接收的光信号的波长和所述光网络交换设备的入端口、出端口的对应关系;所述光网络控制器向所述光网络适配设备提供所述波长切分配置,包括:所述光网络控制器向所述光网络适配设备提供所述波长切分配置中的第二配置信息,所述第二配置信息包括所述子网中计算节点的地址和所述子网中计算节点允许接收的光信号的波长的对应关系。
- 根据权利要求1至6任一项所述的方法,其特征在于,所述光网络控制器获取集群的子网的拓扑结构,包括:所述光网络控制器接收作业调度器根据调度策略生成的子网的拓扑结构。
- 根据权利要求1至6任一项所述的方法,其特征在于,所述光网络控制器获取集群的子网的拓扑结构,包括:所述光网络控制器接收云平台的基础设施即服务IaaS层网络管理发送的所述集群的子网的拓扑结构。
- 根据权利要求1至8任一项所述的方法,其特征在于,所述光网络包括多个所述光网络交换设备,所述光网络控制器向所述光网络交换设备提供所述波长切分配置,包括:所述光网络控制器向多个所述光网络交换设备提供所述波长切分配置。
- 根据权利要求9所述的方法,其特征在于,所述多个光网络交换设备的工作模式为主备模式或多活模式。
- 一种建立光通路的系统,其特征在于,所述系统包括光网络控制器和光网络交换设备,所述光网络交换设备用于为集群中的计算节点交换数据;所述光网络控制器,用于获取集群的子网的拓扑结构,所述拓扑结构记录有所述子网中计算节点的地址;所述光网络控制器,还用于根据所述拓扑结构确定波长切分配置,并向所述光网络交换设备提供所述波长切分配置,所述波长切分配置包括为所述子网中计算节点分配的不同波长;所述光网络交换设备,用于根据所述波长切分配置建立所述光网络交换设备与所述子网中计算节点的光通路。
- 根据权利要求11所述的系统,其特征在于,所述子网中包括N个计算节点,所述N大于1,所述光网络控制器具体用于:针对所述N个计算节点中的目标计算节点,从波长范围中确定子范围,并从所述子范围中采样N-1个波长;根据所述子网中除所述目标计算节点之外的N-1个计算节点的地址、所述N-1个波长、所述N-1个计算节点连接的所述光网络交换设备的出端口、所述目标计算节点连接的所述光网络交换设备的入端口,确定波长切分配置。
- 根据权利要求11或12所述的系统,其特征在于,所述光网络交换设备为光交叉连接OXC交换机,所述OXC交换机包括波长选择开关;所述光网络交换设备具体用于:根据所述波长切分配置,通过所述波长选择开关建立光网络交换设备与所述子网中计算节点的光通路。
- 根据权利要求11至13任一项所述的系统,其特征在于,所述子网中计算节点配置有光网络适配设备,所述光网络适配设备用于接入光网络,所述光网络控制器还用于:向所述光网络适配设备提供所述波长切分配置,以使所述光网络适配设备根据所述波长切分配置将电信号转换为相应波长的光信号。
- 根据权利要求14所述的系统,其特征在于,所述子网中包括第一计算节点和第二计算节点;所述第一计算节点配置的光网络适配设备,用于根据所述波长切分配置,将待发送至所述第二计算节点的电信号转换为相应波长的光信号,向所述光网络交换设备发送所述光信号;所述光网络交换设备,还用于通过所述光网络交换设备与所述第二计算节点的光通路传输所述光信号至所述第二计算节点。
- 根据权利要求14所述的系统,其特征在于,所述光网络控制器具体用于:向所述光网络交换设备提供所述波长切分配置中的第一配置信息,所述第一配置信息包括所述子网中计算节点允许接收的光信号的波长和所述光网络交换设备的入端口、出端口的对应关系;向所述光网络适配设备提供所述波长切分配置中的第二配置信息,所述第二配置信息包括所述子网中计算节点的地址和所述子网中计算节点允许接收的光信号的波长的对应关系。
- 根据权利要求11至16任一项所述的系统,其特征在于,所述光网络控制器具体用于:接收作业调度器根据调度策略生成的子网的拓扑结构。
- 根据权利要求11至16任一项所述的系统,其特征在于,所述光网络控制器具体用于:接收云平台的基础设施即服务IaaS层网络管理发送的所述集群的子网的拓扑结构。
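- The system according to any one of claims 11 to 18, wherein the system includes a plurality of the optical network switching devices, and the optical network controller is specifically configured to: provide the wavelength splitting configuration to the plurality of optical network switching devices.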
- 根据权利要求19所述的系统,其特征在于,所述多个光网络交换设备的工作模式为主备模式或多活模式。
- 一种光网络控制器,其特征在于,所述光网络控制器包括处理器和存储器,所述存储器中存储有计算机可读指令;所述处理器执行所述计算机可读指令,以使得所述光网络控制器执行如权利要求1至10中任一项所述的方法中由所述光网络控制器执行的步骤。
- 一种光网络交换设备,其特征在于,所述光网络交换设备包括处理器和存储器,所述存储器中存储有计算机可读指令;所述处理器执行所述计算机可读指令,以使得所述光网络交换设备执行如权利要求1至10中任一项所述的方法中由所述光网络交换设备执行的步骤。
- 一种计算机可读存储介质,其特征在于,包括计算机可读指令;所述计算机可读指令用于实现权利要求1至10任一项所述的方法。
- 一种计算机程序产品,其特征在于,包括计算机可读指令;所述计算机可读指令用于实现权利要求1至10任一项所述的方法。
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP23807036.1A EP4518344A4 (en) | 2022-05-18 | 2023-05-18 | METHOD FOR ESTABLISHING AN OPTICAL PATH, AND ASSOCIATED DEVICE |
| US18/949,240 US20250070906A1 (en) | 2022-05-18 | 2024-11-15 | Optical path establishment method and related device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210542499.3A CN117135495A (zh) | 2022-05-18 | 2022-05-18 | 一种建立光通路的方法及相关设备 |
| CN202210542499.3 | 2022-05-18 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/949,240 Continuation US20250070906A1 (en) | 2022-05-18 | 2024-11-15 | Optical path establishment method and related device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023222086A1 true WO2023222086A1 (zh) | 2023-11-23 |
Family
ID=88834662
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/095073 Ceased WO2023222086A1 (zh) | 2022-05-18 | 2023-05-18 | 一种建立光通路的方法及相关设备 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US20250070906A1 (zh) |
| EP (1) | EP4518344A4 (zh) |
| CN (1) | CN117135495A (zh) |
| WO (1) | WO2023222086A1 (zh) |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117135495A (zh) | 2023-11-28 |
| EP4518344A1 (en) | 2025-03-05 |
| US20250070906A1 (en) | 2025-02-27 |
| EP4518344A4 (en) | 2025-09-24 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23807036 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2023807036 Country of ref document: EP |
|
| ENP | Entry into the national phase |
Ref document number: 2023807036 Country of ref document: EP Effective date: 20241127 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |