WO2023218596A1 - Intra-server delay control device, intra-server delay control method, and program - Google Patents
- Publication number
- WO2023218596A1 (PCT/JP2022/020051, JP2022020051W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- thread
- mode
- packet
- sleep
- delay control
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/485—Task life-cycle, e.g. stopping, restarting, resuming execution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0852—Delays
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45575—Starting, stopping, suspending or resuming virtual machine instances
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45595—Network integration; Enabling network access in virtual machine instances
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/40—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using virtualisation of network functions or resources, e.g. SDN or NFV entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
- H04L43/103—Active monitoring, e.g. heartbeat, ping or trace-route with adaptive polling, i.e. dynamically adapting the polling rate
Definitions
- the present invention relates to an intra-server delay control device, an intra-server delay control method, and a program.
- NFV Network Functions Virtualization
- SFC Service Function Chaining
- a hypervisor environment composed of Linux (registered trademark) and KVM (kernel-based virtual machine) is known as a technology for configuring virtual machines.
- a Host OS with a built-in KVM module (the OS installed on a physical server is called a Host OS) can operate as a hypervisor in a memory area called kernel space, which is separate from user space.
- a virtual machine operates in the user space, and a Guest OS (the OS installed on the virtual machine is called a Guest OS) operates within the virtual machine.
- a virtual machine running a Guest OS is different from a physical server running a Host OS, in that all HW (hardware) including network devices (typified by Ethernet card devices, etc.) are allocated from the HW to the Guest OS.
- This register control is necessary for processing and writing from the Guest OS to the hardware. With this kind of register control, the notifications and processes that should normally be executed by physical hardware are imitated by software, so performance is generally lower than in the Host OS environment.
- for data input/output such as console, file input/output, and network communication, virtio defines data exchange through queues designed as ring buffers, which serve as a unidirectional transport for transfer data using queue operations.
- by following virtio's queue specifications and preparing the number and size of queues suitable for each device when the Guest OS starts, communication between the Guest OS and the outside of its own virtual machine can be achieved through queue operations alone, without executing hardware emulation.
- DPDK is a framework for controlling NIC (Network Interface Card) in user space, which was conventionally performed by Linux kernel (registered trademark).
- the biggest difference from processing in the Linux kernel is that it has a polling-based reception mechanism called PMD (Poll Mode Driver).
- PMD Poll Mode Driver
- in PMD, a dedicated thread continuously performs data arrival confirmation and reception processing. By eliminating overhead such as context switches and interrupts, high-speed packet processing can be performed.
- DPDK significantly increases packet processing performance and throughput, allowing more time for data-plane application processing.
- DPDK exclusively uses computer resources such as the CPU (Central Processing Unit) and NIC. For this reason, it is difficult to apply it to applications such as SFC, where modules are flexibly reconnected.
- SPP Soft Patch Panel
- SPP provides a shared memory between VMs and configures each VM to directly reference the same memory space, thereby omitting packet copying in the virtualization layer.
- DPDK is used to speed up the exchange of packets between the physical NIC and the shared memory.
- SPP can change the input destination and output destination of packets using software by controlling the reference destination for memory exchange of each VM. Through this processing, SPP realizes dynamic connection switching between VMs and between VMs and physical NICs.
- FIG. 18 is a schematic diagram of Rx side packet processing using New API (NAPI) implemented in Linux kernel 2.5/2.6 (see Non-Patent Document 1).
- the New API (NAPI) executes the packet processing APL 1 placed in the user space 60 on a server equipped with an OS 70 (for example, a Host OS), and transfers packets between the NIC 11 of the HW 10 connected to the OS 70 and the packet processing APL 1.
- OS 70 for example, Host OS
- the OS 70 has a kernel 71, a Ring Buffer 72, and a driver 73, and the kernel 71 has a protocol processing section 74.
- the Kernel 71 is a core function of the OS 70 (eg, Host OS), and monitors hardware and manages the execution status of programs on a process-by-process basis.
- the kernel 71 responds to requests from the packet processing APL1 and transmits requests from the HW 10 to the packet processing APL1.
- Kernel 71 processes requests from packet processing APL 1 through system calls (a "user program running in non-privileged mode" requests processing from the "kernel running in privileged mode").
- the Kernel 71 transmits the packet to the packet processing APL 1 via the Socket 75.
- the Kernel 71 receives packets from the packet processing APL 1 via the Socket 75.
- the Ring Buffer 72 is managed by the Kernel 71 and is located in the memory space of the server.
- the Ring Buffer 72 is a buffer of a fixed size that stores messages output by the Kernel 71 as a log, and is overwritten from the beginning when the upper limit size is exceeded.
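The fixed-size, overwrite-from-the-beginning behavior described above can be illustrated with a minimal sketch. The class name `RingBuffer` and its methods are hypothetical and chosen only for illustration; they are not taken from the Linux kernel or from the patent.

```python
class RingBuffer:
    """Fixed-size buffer that overwrites the oldest slot once it is full."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.next_write = 0   # index of the slot that will be written next
        self.count = 0        # number of valid entries currently held

    def write(self, item):
        # When the upper limit is reached, the write index wraps around
        # and the oldest entry is overwritten.
        self.slots[self.next_write] = item
        self.next_write = (self.next_write + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def snapshot(self):
        """Return the valid entries ordered from oldest to newest."""
        start = (self.next_write - self.count) % self.capacity
        return [self.slots[(start + i) % self.capacity]
                for i in range(self.count)]
```

Writing five items into a three-slot buffer leaves only the three most recent items, with the first two slots overwritten.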
- Driver 73 is a device driver for monitoring hardware with kernel 71. Note that Driver73 depends on Kernel71, and will be different if the created (built) kernel source changes. In this case, you will need to obtain the relevant driver source, rebuild it on the OS that uses the driver, and create the driver.
- the protocol processing unit 74 performs L2 (data link layer)/L3 (network layer)/L4 (transport layer) protocol processing defined by the OSI (Open Systems Interconnection) reference model.
- Socket 75 is an interface for kernel 71 to perform inter-process communication. Socket 75 has a socket buffer and does not cause data copy processing to occur frequently.
- the flow up to establishing communication via Socket 75 is as follows. 1. Create a socket file for the server side to accept clients. 2. Name the reception socket file. 3. Create a socket queue. 4. Accept the first connection from a client in the socket queue. 5. Create a socket file on the client side. 6. Send a connection request from the client side to the server. 7. Create a connection socket file on the server side, separate from the reception socket file.
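The seven steps above correspond closely to the standard socket calls. The following is a minimal sketch using Python's `socket` module with a Unix domain socket, assuming a POSIX system; the function name `run_echo_once`, the file name `reception.sock`, and the upper-casing reply are hypothetical choices for illustration only.

```python
import os
import socket
import tempfile
import threading

def run_echo_once():
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "reception.sock")  # hypothetical file name

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)  # step 1
    server.bind(path)                                           # step 2
    server.listen(1)                                            # step 3

    def serve():
        # Steps 4 and 7: accept() takes the first queued connection and
        # returns a new connection socket, separate from the reception socket.
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024).upper())

    worker = threading.Thread(target=serve)
    worker.start()

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)  # step 5
    client.connect(path)                                        # step 6
    client.sendall(b"ping")
    reply = client.recv(1024)

    client.close()
    worker.join()
    server.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return reply
```

Calling `run_echo_once()` performs one request and reply over the connection socket and then removes the socket file.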
- the packet processing APL1 can call system calls such as read() and write() to the kernel 71.
- the Kernel 71 receives notification of packet arrival from the NIC 11 using a hardware interrupt (hardIRQ), and schedules a software interrupt (softIRQ) for packet processing.
- the New API (NAPI) which has been implemented since Linux kernel 2.5/2.6, performs packet processing using a hardware interrupt (hardIRQ) and then a software interrupt (softIRQ) when a packet arrives.
- NAPI New API
- as shown in FIG. 18, in packet transfer using the interrupt model, packets are transferred by interrupt processing (see reference numeral a in FIG. 18), so a wait for interrupt processing occurs, resulting in a large delay in packet transfer.
- FIG. 19 is a diagram illustrating an overview of Rx side packet processing by New API (NAPI) in the area surrounded by the broken line in FIG. 18.
- the device driver includes NIC11 (physical NIC) that is a network interface card, hardIRQ81 that is a handler that is called when a processing request for NIC11 occurs and executes the requested processing (hardware interrupt), and netif_rx 82, which is a software interrupt processing function unit.
- the networking layer includes softIRQ83, which is a handler that is called upon the generation of a netif_rx82 processing request and executes the requested process (software interrupt), and do_softirq84, which is a control function unit that implements the actual software interrupt (softIRQ).
- net_rx_action 85 is a packet processing function unit executed in response to a software interrupt (softIRQ)
- poll_list 86 registers information on a net device (net_device) indicating which device the hardware interrupt from the NIC 11 belongs to.
- netif_receive_skb 87, which creates an sk_buff structure (a structure that allows the Kernel 71 to recognize the status of packets), and the Ring Buffer 72 are arranged.
- in the protocol layer, packet processing function units such as ip_rcv88 and arp_rcv89 are arranged.
- netif_rx82, do_softirq84, net_rx_action85, netif_receive_skb87, ip_rcv88, and arp_rcv89 are program components (function names) used for packet processing in the Kernel 71.
- [Rx side packet processing operation using New API (NAPI)] Arrows (reference numerals) b to m in FIG. 19 indicate the flow of packet processing on the Rx side.
- the hardware function unit 11a of the NIC 11 (hereinafter referred to as NIC 11)
- the packet arrives at the Ring Buffer 72 without using the CPU through DMA (Direct Memory Access) transfer.
- DMA Direct Memory Access
- This Ring Buffer 72 is a memory space within the server, and is managed by the Kernel 71 (see FIG. 18).
- as it stands, however, the Kernel 71 cannot recognize the packet. Therefore, when the packet arrives, the NIC 11 raises a hardware interrupt (hardIRQ) to the hardIRQ 81 (see reference numeral c in FIG. 19), and the netif_rx 82 executes the following processing so that the Kernel 71 recognizes the packet.
- hardIRQ81 shown enclosed in an ellipse in FIG. 19 represents a handler rather than a functional unit.
- netif_rx82 is the function that actually performs processing. When hardIRQ81 (handler) starts up (see reference numeral d in FIG. 19), netif_rx82 saves in poll_list86 the net device (net_device) information, one of the pieces of information contained in the hardware interrupt (hardIRQ), which indicates which device the hardware interrupt from NIC11 belongs to. netif_rx82 then registers queue reaping (referring to the contents of the packets accumulated in the buffer and, taking the next processing into account, deleting the corresponding queue entries from the buffer) (see reference numeral e in FIG. 19). Specifically, in response to packets being packed into the Ring Buffer 72, the netif_rx 82 uses the NIC 11 driver to register subsequent queue reaping in the poll_list 86. As a result, queue reaping information resulting from packets being packed into the Ring Buffer 72 is registered in the poll_list 86.
- in <Device driver> of FIG. 19, when the NIC 11 receives a packet, it copies the arrived packet to the Ring Buffer 72 by DMA transfer. The NIC 11 also raises the hardIRQ 81 (handler), and the netif_rx 82 registers net_device in the poll_list 86 and schedules a software interrupt (softIRQ). At this point, the hardware interrupt processing in <Device driver> of FIG. 19 ends.
- the netif_rx82 then raises softIRQ83 (handler), a software interrupt (softIRQ) for reaping the data stored in the Ring Buffer72 using the information (specifically, the pointer) held in the queue stored in the poll_list86 (see reference numeral f in FIG. 19), and notifies the do_softirq 84, which is the software interrupt control function unit (see reference numeral g in FIG. 19).
- softIRQ software interrupt
- the do_softirq 84 is a software interrupt control function unit, and defines each software interrupt function (packet processing takes various forms, of which interrupt processing is one; do_softirq 84 defines that interrupt processing). Based on this definition, do_softirq 84 notifies net_rx_action 85, which actually performs software interrupt processing, of the current (corresponding) software interrupt request (see reference numeral h in FIG. 19).
- the net_rx_action 85 calls a polling routine for reaping packets from the Ring Buffer 72 based on the net_device registered in the poll_list 86 (see reference numeral i in FIG. 19), and reaps the packets (see reference numeral j in FIG. 19). At this time, net_rx_action 85 continues reaping until poll_list 86 becomes empty. Thereafter, net_rx_action 85 notifies netif_receive_skb 87 (see reference numeral k in FIG. 19).
- the netif_receive_skb 87 creates a sk_buff structure, analyzes the contents of the packet, and sends processing to the subsequent protocol processing unit 74 (see FIG. 18) for each type.
- netif_receive_skb 87 analyzes the contents of the packet and, when performing processing according to those contents, passes the processing to ip_rcv 88 of <Protocol layer> (reference numeral l in FIG. 19).
- the processing is passed to arp_rcv89 (symbol m in FIG. 19).
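The Rx-side flow described above (hardware interrupt, registration in poll_list, reaping in softIRQ context, dispatch to the protocol layer) can be traced with a small toy model. This is a simplified sketch and not kernel code: the Python functions below only borrow the names used in the description, the per-device `rings` dictionary and the packet dictionaries are assumptions for illustration, and every non-ARP packet is routed to `ip_rcv` for brevity.

```python
from collections import deque

rings = {}            # net_device name -> deque, stands in for Ring Buffer 72
poll_list = deque()   # stands in for poll_list 86
delivered = []        # (handler name, payload) pairs handed to the protocol layer

def hard_irq(dev, packet):
    """Model of steps b to e: the packet is placed in the ring buffer and
    the device is registered in poll_list so that reaping is scheduled."""
    rings.setdefault(dev, deque()).append(packet)
    if dev not in poll_list:
        poll_list.append(dev)

def netif_receive_skb(packet):
    """Model of steps l and m: dispatch by packet type.
    EtherType 0x0806 is ARP; everything else goes to ip_rcv in this sketch."""
    handler = "arp_rcv" if packet["ethertype"] == 0x0806 else "ip_rcv"
    delivered.append((handler, packet["payload"]))

def net_rx_action():
    """Model of steps i to k: keep reaping until poll_list is empty."""
    while poll_list:
        dev = poll_list.popleft()
        while rings[dev]:
            netif_receive_skb(rings[dev].popleft())
```

In this model, the delay discussed in the text corresponds to the time between `hard_irq()` registering the device and `net_rx_action()` actually being scheduled to run.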
- Patent Document 1 describes an in-server network delay control device (KBP: Kernel Busy Poll). KBP constantly monitors packet arrival using a polling model within the kernel. This suppresses softIRQ and achieves low-latency packet processing.
- KBP Kernel Busy Poll
- both the interrupt model and the polling model for packet transfer have the following problems.
- the kernel receives an event (hardware interrupt) from the HW and transfers the packet through software interrupt processing for processing the packet. In the interrupt model, packet transfer is therefore performed by interrupt (software interrupt) processing, so if there is a conflict with other interrupts, or if the interrupt destination CPU is being used by a process with a higher priority, waiting occurs, which increases the packet transfer delay. If the interrupt processing becomes congested, the waiting delay increases further.
- NW delay on the order of ms occurs due to competition in interrupt processing (softIRQ), as shown in the broken line box n in FIG.
- softIRQ interrupt processing
- when the technology described in Patent Document 1 is used, software interrupts can be suppressed and low-delay packet harvesting can be realized by constantly monitoring packet arrival.
- however, because the CPU core is occupied and CPU time is used to monitor packet arrival, power consumption is high. That is, the kernel thread that constantly monitors packet arrivals monopolizes the CPU core and constantly uses CPU time, which poses a problem of increased power consumption. The relationship between workload and CPU usage rate will be described with reference to FIGS. 20 and 21.
- FIG. 20 is an example of video (30 FPS) data transfer.
- the workload shown in FIG. 20 has a transfer rate of 350 Mbps, and data is transferred intermittently every 30 ms.
- FIG. 21 is a diagram showing the CPU usage rate used by the busy poll thread in the KBP described in Patent Document 1.
- the kernel thread exclusively uses the CPU core to perform busy polls. Even with intermittent packet reception as shown in FIG. 20, KBP always uses the CPU regardless of whether or not a packet has arrived, so there is a problem of high power consumption.
- the present invention was made in view of this background.
- an object of the present invention is to suppress sleep and wake-up operations when the traffic inflow frequency is "dense", and to achieve power saving while achieving low latency.
- an in-server delay control device that is placed in the kernel space of the OS and starts up a thread that monitors packet arrival using a polling model. The operation modes of the thread are a sleep control mode, in which the thread is allowed to sleep, and a constant busy poll mode, in which the thread constantly busy-polls. The device includes a traffic frequency measurement unit that measures the traffic inflow frequency, and a mode switching control unit that switches the operation mode of the thread to either the sleep control mode or the constant busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit.
- this in-server delay control device is further characterized by comprising a packet arrival monitoring unit that monitors a poll list, in which information on a net device indicating which device a hardware interrupt belongs to is registered, and checks whether or not a packet has arrived.
- according to the present invention, it is possible to suppress sleep and wake-up operations when the traffic inflow frequency is "dense", thereby achieving low delay and power saving.
- FIG. 1 is a schematic configuration diagram of an intra-server delay control system according to an embodiment of the present invention. FIG. 2 is a configuration example in which the polling thread (in-server delay control device) of FIG. 1 is placed in kernel space. FIG. 3 is a configuration example in which the polling thread (in-server delay control device) of FIG. 1 is placed in the user space.
- FIG. 4 is a diagram illustrating the arrangement of a traffic frequency measurement unit of the polling thread (in-server delay control device) of FIG. 1.
- FIG. 6 is a diagram showing an example of polling thread operation when the traffic inflow frequency of the server delay control device of the server delay control system according to the embodiment of the present invention is “sparse”;
- FIG. 1 is a schematic configuration diagram of an intra-server delay control system according to an embodiment of the present invention.
- FIG. 6 is an example of data transfer when the traffic inflow frequency of the intra-server delay control system according to the embodiment of the present invention is "dense".
- FIG. 7 is a diagram illustrating an example of polling thread operation in the data transfer example of FIG. 6, in which the traffic inflow frequency is "dense". FIG. 8 is a diagram explaining the operation mode switching point of the intra-server delay control device of the intra-server delay control system according to the embodiment of the present invention.
- FIG. 3 is a table showing an example of switching determination logic of the intra-server delay control device of the intra-server delay control system according to the embodiment of the present invention.
- 2 is a flowchart showing NIC and HW interrupt processing of the intra-server delay control device of the intra-server delay control system according to the embodiment of the present invention.
- FIG. 1 is a hardware configuration diagram showing an example of a computer that implements the functions of an intra-server delay control device of an intra-server delay control system according to an embodiment of the present invention.
- FIG. 2 is a diagram illustrating an example in which an intra-server delay control system that places a polling thread in the kernel is applied to an interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration.
- FIG. 2 is a diagram illustrating an example in which an intra-server delay control system that places polling threads in the kernel is applied to an interrupt model in a container-configured server virtualization environment.
- FIG. 2 is a diagram illustrating an example in which an intra-server delay control system that places a polling thread in user space is applied to an interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration.
- FIG. 2 is a diagram illustrating an example in which an intra-server delay control system that places polling threads in user space is applied to an interrupt model in a container-configured server virtualization environment. FIG. 18 is a schematic diagram of Rx side packet processing by New API (NAPI) implemented in Linux kernel 2.5/2.6.
- NAPI New API
- FIG. 19 is a diagram illustrating an overview of Rx-side packet processing by New API (NAPI) in a portion surrounded by a broken line in FIG. 18.
- NAPI New API
- FIG. 20 is a diagram showing an example of data transfer of video (30 FPS).
- FIG. 21 is a diagram showing the CPU usage rate used by the busy poll thread in the KBP described in Patent Document 1.
- FIG. 1 is a schematic configuration diagram of an intra-server delay control system according to an embodiment of the present invention.
- This embodiment is an example in which the New API (NAPI) implemented in Linux kernel 2.5/2.6 is applied to Rx side packet processing. Components that are the same as those in FIG. 18 are given the same reference numerals. As shown in FIG.
- the in-server delay control system 1000 executes a packet processing APL 1 placed in a user space on a server equipped with an OS (for example, a Host OS), and transfers packets between the NIC 11 of the HW connected to the OS and the packet processing APL 1.
- OS for example, a Host OS
- the intra-server delay control system 1000 includes a NIC 11 (physical NIC), which is a network interface card; a hardIRQ 81, which is a handler that is called when a processing request from the NIC 11 occurs and executes the requested processing (hardware interrupt); a HW interrupt processing unit 182, which is the processing function unit for HW interrupts; a receive list 186; a Ring_Buffer 72; a polling thread (intra-server delay control device 100); and a protocol processing unit 74.
- the Ring Buffer 72 is managed by the kernel in a memory space within the server.
- the Ring Buffer 72 is a buffer of a fixed size that stores messages output by the kernel as a log, and is overwritten from the beginning when the upper limit size is exceeded.
- the protocol processing unit 74 handles Ethernet, IP, TCP/UDP, etc.
- the protocol processing unit 74 performs, for example, L2/L3/L4 protocol processing defined by the OSI reference model.
- the intra-server delay control device 100 is a polling thread placed in either the kernel space or the user space.
- the intra-server delay control device 100 includes a packet arrival monitoring unit 110, a packet harvesting unit 120, a sleep management unit 130, a CPU frequency/CPU idle setting unit 140, a mode switching control unit 150, and a traffic frequency measurement unit 160.
- the packet arrival monitoring section 110 includes a traffic frequency measuring section 160.
- the packet arrival monitoring unit 110 is a thread for monitoring whether a packet has arrived.
- the packet arrival monitoring unit 110 monitors (polls) the receive list 186.
- the packet arrival monitoring unit 110 acquires pointer information indicating that the packet exists in the Ring_Buffer 72 and net_device information from the receive list 186, and transmits the information (pointer information and net_device information) to the packet harvesting unit 120.
- the information for multiple pieces is transmitted.
- the packet harvesting unit 120 refers to the packets held in the Ring Buffer 72 and performs harvesting, which deletes the corresponding queue entries from the Ring Buffer 72 based on the processing to be performed next (hereinafter, this may be referred to simply as harvesting packets from the Ring Buffer 72).
- the packet harvesting unit 120 extracts the packet from the Ring_Buffer 72 based on the received information and transmits the packet to the protocol processing unit 74.
- the packet harvesting unit 120 harvests the plurality of packets at once and passes the harvested packets to the subsequent protocol processing unit 74. The number of packets harvested at once is called the quota, and this is often referred to as batch processing.
- the protocol processing unit 74 also performs high-speed protocol processing because it processes multiple packets at once.
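Quota-based batch harvesting can be sketched as follows. The function name `harvest` and the use of a `deque` to stand in for the Ring Buffer 72 are assumptions made for illustration; the patent does not prescribe this interface.

```python
from collections import deque

def harvest(ring, quota):
    """Remove up to `quota` packets from the ring and return them as one batch.

    The batch is what would be handed to the subsequent protocol
    processing unit in a single call.
    """
    batch = []
    while ring and len(batch) < quota:
        batch.append(ring.popleft())
    return batch
```

With five queued packets and a quota of three, the first call returns three packets and the second returns the remaining two, so the protocol processing side is invoked twice instead of five times.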
- the sleep management unit 130 causes a thread (polling thread) to go to sleep if a packet does not arrive for a predetermined period of time, and causes the thread (polling thread) to wake up from sleep using a hardware interrupt (hardIRQ) when a packet arrives. (details later).
- the CPU frequency/CPU idle setting unit 140 sets the CPU operating frequency of the CPU core used by the thread (polling thread) to a low value during sleep.
- the CPU frequency/CPU idle setting unit 140 sets the CPU idle state of the CPU core used by this thread (polling thread) to power saving mode during sleep (details will be described later).
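The sleep and wake-up behavior of the sleep management unit 130 can be sketched with a `threading.Event` that plays the role of the hardware interrupt. The class `SleepManager`, its method names, and the `idle_timeout_s` parameter are hypothetical; the CPU frequency and CPU idle settings made by unit 140 are not modeled here.

```python
import threading
import time

class SleepManager:
    """Toy model: sleep after an idle period, wake on a simulated hardIRQ."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s      # the "predetermined period"
        self.wakeup = threading.Event()
        self.last_packet = time.monotonic()

    def on_packet(self):
        # Called by the polling loop whenever a packet is harvested.
        self.last_packet = time.monotonic()

    def should_sleep(self):
        # True once no packet has arrived for the predetermined period.
        return time.monotonic() - self.last_packet >= self.idle_timeout_s

    def sleep(self, timeout=None):
        """Block until hard_irq() is called; return True if woken by it."""
        woken = self.wakeup.wait(timeout)
        self.wakeup.clear()
        return woken

    def hard_irq(self):
        # Stands in for the hardware interrupt raised on packet arrival.
        self.wakeup.set()
```

A polling loop would call `should_sleep()` after each empty poll and, if it returns true, call `sleep()` and resume polling when it returns.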
- the mode switching control unit 150 switches the operation mode of the polling thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit 160. For example, the mode switching control unit 150 performs mode switching control that switches the operation mode of the polling thread to the "sleep control mode" (a mode in which the polling thread is allowed to sleep) when the traffic inflow frequency is low, and to the "always busy poll mode" (a mode in which the polling thread always busy-polls) when the traffic inflow frequency is high.
- sleep control mode a mode in which the polling thread is allowed to sleep
- always busy poll mode a mode in which the polling thread always busy-polls
- the operation mode of the polling thread is either "sleep control mode" or "always busy poll mode": when it leaves the "sleep control mode" it is switched to the "always busy poll mode", and when it leaves the "always busy poll mode" it is switched to the "sleep control mode".
- the above-mentioned "case where the traffic inflow frequency is low" refers to a case where the packet inflow frequency is smaller than the threshold T (FIG. 8), as when the traffic inflow frequency is "sparse" (FIG. 20), and the above-mentioned "case where the traffic inflow frequency is high" refers to a case where the traffic inflow frequency is equal to or higher than the threshold T (FIG. 8), as when the traffic inflow frequency is "dense" (FIG. 6) (described later).
- the traffic frequency measurement unit 160 measures the traffic inflow frequency and transmits it to the mode switching control unit 150.
- the traffic frequency measurement unit 160 may measure the traffic frequency by approximately estimating it from the number of HW interrupts (recorded as statistical information in the kernel) or the like.
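The threshold-based mode selection and the interrupt-count approximation can be sketched together. This is an illustration under stated assumptions: the function names, the two-sample estimate, and the unit of events per second are hypothetical, and the description does not specify how the threshold T is chosen.

```python
SLEEP_CONTROL = "sleep control mode"
ALWAYS_BUSY_POLL = "always busy poll mode"

def estimate_frequency(irq_count_before, irq_count_after, interval_s):
    """Approximate the traffic inflow frequency (events per second) from two
    samples of a HW interrupt counter taken `interval_s` seconds apart."""
    return (irq_count_after - irq_count_before) / interval_s

def select_mode(frequency, threshold_t):
    """Below T the thread may sleep; at or above T it always busy-polls."""
    return ALWAYS_BUSY_POLL if frequency >= threshold_t else SLEEP_CONTROL
```

For example, a counter that advances from 100 to 400 in 0.5 s gives an estimate of 600 events per second, which selects the always busy poll mode for a threshold of 500.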
- FIGS. 2 and 3 are diagrams illustrating the arrangement of the polling thread (in-server delay control device 100) in FIG. 1.
- [Arrangement of polling thread in kernel space] FIG. 2 is a configuration example in which the polling thread (in-server delay control device 100) of FIG. 1 is arranged in kernel space.
- a polling thread (server delay control device 100) and a protocol processing unit 74 are arranged in a kernel space. This polling thread (server delay control device 100) operates within the kernel space.
- the intra-server delay control system 1000 executes a packet processing APL 1 placed in the user space on a server equipped with an OS, and transfers packets between the NIC 11 of the HW and the packet processing APL 1 via a device driver connected to the OS.
- a hardIRQ 81, a HW interrupt processing unit 182, a receive list 186, and a Ring_Buffer 72 are arranged in the Device driver.
- the Device driver is a driver for monitoring hardware.
- the mode switching control unit 150 of the intra-server delay control device 100 periodically wakes up the polling thread during sleep, or wakes it up immediately before a packet arrives, in accordance with the packet arrival timing.
- the mode switching control unit 150 manages HW interrupts and controls polling thread sleep and HW interrupt enable/disable for the hardIRQ 81 (see reference numeral xx in FIG. 2).
- the present invention can be applied to cases where there is a polling thread inside the kernel, such as NAPI or KBP.
- FIG. 3 is a configuration example in which the polling thread (in-server delay control device 100) of FIG. 1 is arranged in user space.
- a polling thread (server delay control device 100) and a protocol processing unit 74 are arranged in the user space.
- This polling thread (in-server delay control device 100) operates not in Kernel space but in User space.
- the polling thread (server delay control device 100) bypasses the kernel space and transfers packets between the device driver and NIC 11 and the packet processing APL 1.
- the mode switching control unit 150 of the intra-server delay control device 100 wakes up the sleeping polling thread periodically, or wakes it up immediately before a packet arrives in accordance with the packet arrival timing.
- the mode switching control unit 150 manages HW interrupts and controls the HW interrupt processing unit 182 to sleep the polling thread and enable/disable HW interrupts (see reference numeral yy in FIG. 3).
- the present invention can be applied when there is a polling thread in the user space, such as in DPDK.
- FIG. 4 is a diagram illustrating the arrangement of the traffic frequency measurement unit 160 of the polling thread (in-server delay control device 100) in FIG. 1.
- the traffic frequency measurement unit 160 of the intra-server delay control device 100 may be arranged as a thread independent from the packet arrival monitoring unit 110 to measure the traffic frequency.
- in this arrangement the traffic frequency measurement unit 160 cannot measure the traffic frequency directly, but it can estimate the traffic frequency approximately from the number of HW interrupts (recorded as statistical information in the kernel) or the like.
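- The approximation described above can be illustrated with a minimal sketch. It assumes that the interrupt statistics are available as text in the column layout of Linux's `/proc/interrupts`; the function names, the IRQ name `eth0-rx-0`, and the sample counter values are hypothetical and are not taken from this description.

```python
def count_interrupts(text: str, irq_name: str) -> int:
    """Sum the per-CPU counters of every row whose last column contains irq_name."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())            # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        fields = line.split()
        if not fields or irq_name not in fields[-1]:
            continue
        total += sum(int(v) for v in fields[1:1 + n_cpus])
    return total


def estimate_inflow_frequency(before: str, after: str,
                              irq_name: str, interval_s: float) -> float:
    """Approximate the inflow frequency by the HW interrupt rate between two samples."""
    delta = count_interrupts(after, irq_name) - count_interrupts(before, irq_name)
    return delta / interval_s


# Hypothetical samples taken 0.5 s apart.
SAMPLE_T0 = """
           CPU0       CPU1
 24:       1000       2000   PCI-MSI 1048576-edge      eth0-rx-0
 25:          5          7   PCI-MSI 1048577-edge      eth0-tx-0
"""
SAMPLE_T1 = """
           CPU0       CPU1
 24:       1600       2400   PCI-MSI 1048576-edge      eth0-rx-0
 25:          9          9   PCI-MSI 1048577-edge      eth0-tx-0
"""

rate = estimate_inflow_frequency(SAMPLE_T0, SAMPLE_T1, "eth0-rx-0", 0.5)
```

The result is only an approximation, since one HW interrupt does not necessarily correspond to exactly one packet.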
- the operation of the intra-server delay control system 1000 configured as described above will be described below.
- the present invention can be applied to cases where there is a polling thread inside the kernel, such as NAPI and KBP, or cases where there is a polling thread in user space, such as DPDK. This will be explained using an example where there is a polling thread inside the kernel.
- Arrows (symbols) aa to ii in FIGS. 1 to 4 indicate the flow of packet processing on the Rx side.
- When the NIC 11 receives a packet (or frame) from the opposite device, it copies the arrived packet to the Ring Buffer 72 by DMA transfer without using the CPU (see reference numeral aa in FIGS. 1 to 4).
- This Ring Buffer 72 is managed by <Device driver>.
- the NIC 11 raises a hardware interrupt (hardIRQ) to the hardIRQ 81 (handler) (see symbol bb in FIGS. 1 to 4), and the HW interrupt processing unit 182 recognizes the packet by executing the following processing.
- When the hardIRQ 81 (handler) starts up (see cc in FIG. 1), the HW interrupt processing unit 182 saves in the receive list 186 the net device (net_device) information, which is one item of the information contained in the hardware interrupt (hardIRQ) and indicates which device the hardware interrupt from the NIC 11 belongs to, and registers queue reaping information. Specifically, in response to packets being packed into the Ring Buffer 72, the HW interrupt processing unit 182 uses the driver of the NIC 11 to register subsequent queue reaping in the receive list 186 (see reference numeral dd in FIGS. 1 to 4). As a result, the queue reaping required because packets were packed into the Ring Buffer 72 is registered in the receive list 186.
- the HW interrupt processing unit 182 registers net_device in the receive list 186 but, unlike the netif_rx 82 in FIG. 19, does not schedule software interrupts (softIRQ).
- the HW interrupt processing unit 182 then performs sleep release to wake up the sleeping polling thread (see reference numeral ee in FIGS. 1 to 4). This completes the hardware interrupt processing in <Device driver> in FIGS. 1 to 4.
- softIRQ 83 and do_softirq 84 in <Networking layer> shown in FIG. 19 are deleted, and accordingly the softIRQ scheduling by netif_rx 82 shown in FIG. 19 is not performed.
- the intra-server delay control system 1000 deletes the softIRQ 83 and do_softirq 84 shown in FIG. 19, and instead provides a polling thread (intra-server delay control device 100) in <kernel space> (see FIG. 2).
- the server delay control system 1000 provides a polling thread (server delay control device 100) in <User space> (see FIG. 3).
- the packet arrival monitoring unit 110 monitors (polls) the receive list 186 (see reference numeral ff in FIGS. 1 to 4) and checks whether or not a packet has arrived.
- the packet arrival monitoring unit 110 acquires, from the receive list 186, pointer information indicating that the packet exists in the Ring_Buffer 72 and net_device information, and transmits this information (pointer information and net_device information) to the packet harvesting unit 120 (see reference numeral gg in FIGS. 1 to 4).
- the information for the plurality of packets is transmitted.
- the packet reaping unit 120 of the intra-server delay control device 100 reaps the packet from the Ring Buffer 72 if the packet has arrived (see symbol hh in FIGS. 1 to 4).
- the packet harvesting unit 120 extracts the packet from the Ring_Buffer 72 based on the received information and transmits the packet to the protocol processing unit 74 (see reference numeral ii in FIGS. 1 to 4).
- the intra-server delay control system 1000 stops softIRQ for packet processing, which is the main cause of NW delay occurrence, and the packet arrival monitoring unit 110 of the intra-server delay control device 100 executes a polling thread for monitoring packet arrival. Then, the packet harvesting unit 120 processes the packet using the polling model (without softIRQ) when the packet arrives.
- the polling thread (in-server delay control device 100) that monitors the arrival of packets can sleep while no packets arrive.
- the polling thread (server delay control device 100) sleeps depending on whether or not a packet arrives, and releases the sleep state using the hardIRQ 81 when a packet arrives.
- the sleep management unit 130 of the intra-server delay control device 100 causes the polling thread to sleep depending on whether or not a packet has arrived, that is, if no packet has arrived for a predetermined period of time.
- the sleep management unit 130 cancels sleep using the hardIRQ 81 when a packet arrives. This avoids softIRQ contention and achieves low latency.
- the CPU frequency/CPU idle setting unit 140 of the intra-server delay control device 100 changes the CPU operating frequency and idle setting depending on whether or not a packet has arrived. Specifically, the CPU frequency/CPU idle setting unit 140 lowers the CPU frequency during sleep, and increases the CPU frequency when restarting (returns the CPU operating frequency to the original). Further, the CPU frequency/CPU idle setting unit 140 changes the CPU idle setting to power saving during sleep. Power savings can also be achieved by lowering the CPU operating frequency during sleep and by changing the CPU idle setting to a power-saving setting.
- FIG. 5 is a diagram showing an example of polling thread operation when the traffic inflow frequency of the intra-server delay control device 100 is "sparse".
- the vertical axis shows the CPU usage rate [%] of the CPU core used by the polling thread, and the horizontal axis shows time.
- FIG. 5 shows an example of polling thread operation based on packet arrival, corresponding to the video (30 FPS) data transfer example shown in FIG. 20, in which packets are received intermittently.
- FIG. 5 is an example where the traffic inflow frequency is "sparse", as in the example of video (30 FPS) data transfer in FIG. 20. Note that an example where the traffic inflow frequency is "dense" will be described later with reference to FIG.
- the sleep management unit 130 of the intra-server delay control device 100 puts the polling thread to sleep when no packet arrives for a predetermined period of time (more specifically, when the next packet does not arrive within a predetermined fixed period after a given packet arrives) (see symbol p: sleep in FIG. 5). Then, when a packet arrives, the sleep management unit 130 wakes up the polling thread by means of the hardIRQ 81 (see q: wake up in FIG. 5).
- because the kernel thread does not occupy the CPU core exclusively, the CPU usage rate of the CPU core used by the polling thread may fluctuate for reasons other than the polling thread itself, for example when a timer interrupt for stable system operation enters that CPU core, or when a migration thread for error handling or the like enters that CPU core (see symbol r in FIG. 5).
- the method of putting the polling thread to sleep can achieve a sufficient power saving effect when the amount of traffic inflow is small.
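- The sleep and wake-up behavior described for FIG. 5 can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the implementation of the sleep management unit 130: the class name, the method names, and the use of explicit timestamps instead of a real hardIRQ are all hypothetical.

```python
class SleepManagerSketch:
    """Sleeps when no packet arrives for idle_timeout_s; wakes on packet arrival."""

    def __init__(self, idle_timeout_s: float, start_time: float = 0.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_arrival = start_time
        self.sleeping = False

    def on_packet(self, now: float) -> bool:
        """Stands in for the hardIRQ path: returns True if the thread was woken up."""
        woke_up = self.sleeping
        self.sleeping = False
        self.last_arrival = now
        return woke_up

    def tick(self, now: float) -> bool:
        """Called by the polling loop: returns True if the thread is (now) asleep."""
        if not self.sleeping and now - self.last_arrival >= self.idle_timeout_s:
            self.sleeping = True          # p: sleep in FIG. 5
        return self.sleeping


# Hypothetical 30 FPS-like traffic: a packet at t = 0 s, the next about 33 ms later.
manager = SleepManagerSketch(idle_timeout_s=0.001)
events = [
    manager.on_packet(0.0),      # thread is already awake
    manager.tick(0.0005),        # idle for 0.5 ms: keep polling
    manager.tick(0.0012),        # idle for 1.2 ms: go to sleep
    manager.on_packet(0.0333),   # q: wake up in FIG. 5
    manager.tick(0.0335),        # awake again
]
```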
- FIG. 6 is an example of data transfer when the traffic inflow frequency is "dense".
- the workload shown in FIG. 6 has a high traffic inflow frequency, that is, the traffic inflow frequency is "dense" in the time axis direction.
- FIG. 7 is a diagram showing an example of polling thread operation in the data transfer example when the traffic inflow frequency is "dense" in FIG. 6.
- the vertical axis shows the CPU usage rate [%] of the CPU core used by the polling thread, and the horizontal axis shows time.
- the same polling thread operations as in FIG. 5 are given the same reference numerals.
- the traffic inflow frequency is "dense" as shown in FIG. (See r:sleep/q:wake up).
- When the traffic inflow frequency is "sparse", the polling thread is placed under sleep control. When the traffic inflow frequency is "dense", the polling thread performs busy polling at all times. When the traffic inflow frequency is "dense", constant busy polling can consume fewer CPU cycles than frequently repeating sleep/wake up.
- the low-latency polling thread includes a packet arrival monitoring unit 110 and a packet harvesting unit 120, and performs packet arrival monitoring and reception processing using a polling model to realize low-latency packet reception processing. Specifically, the polling thread (in-server delay control device 100) performs packet arrival monitoring and reception processing using a low-delay polling model (see reference numeral ff in FIGS. 1 to 4). Therefore, softIRQ contention does not occur and the delay is small. Additionally, when a packet arrives during sleep, the polling thread is started using high-priority hardIRQ, so the overhead caused by sleep can be suppressed as much as possible.
- the polling thread includes a sleep management unit 130, and the sleep management unit 130 causes the polling thread to sleep while no packet has arrived, thereby preventing the wasteful power consumption caused by needless busy polling by the polling thread. Furthermore, the polling thread (server delay control device 100) includes a CPU frequency/CPU idle setting section 140, and the CPU frequency/CPU idle setting section 140 dynamically controls the CPU operating frequency. This dynamic control of the CPU operating frequency is used in conjunction with sleep control.
- while no packets arrive, the polling thread goes to sleep and the CPU frequency is set low, so it is possible to suppress the increase in power consumption due to busy polling (power saving).
- the polling thread (server delay control device 100) includes a mode switching control section 150 and a traffic frequency measurement section 160.
- the traffic frequency measurement section 160 measures the traffic inflow frequency and transmits it to the mode switching control section 150.
- the mode switching control unit 150 switches the operation mode of the polling thread between "always busy poll mode" and "sleep control mode" depending on the traffic inflow frequency. Specifically, the mode switching control unit 150 switches the operation mode of the polling thread to "sleep control mode" when the traffic inflow frequency is "sparse", and switches the operation mode of the polling thread to "always busy poll mode" when the traffic inflow frequency is "dense".
- FIG. 8 is a diagram illustrating operation mode switching points.
- the vertical axis shows the power consumption by the polling thread, and the horizontal axis shows the packet inflow frequency (traffic inflow frequency).
- in the always busy poll mode, the power consumption by the polling thread is constant regardless of the packet inflow frequency.
- in the sleep control mode, the power consumption by the polling thread increases as the packet inflow frequency increases.
- when the packet inflow frequency reaches a certain threshold value T, the power consumption by the polling thread becomes equal in the two modes, and when the packet inflow frequency exceeds the threshold value T, performing sleep control conversely results in higher power consumption by the polling thread.
- the range from low packet inflow frequency up to the threshold T is the "area where sleep control mode should be used" (see symbol jj in FIG. 8), and the range above the threshold T is the "area where always busy poll mode should be used" (see reference numeral kk in FIG. 8).
- the threshold value T becomes the operating mode switching point between the "sleep control mode" and the "always busy poll mode".
- the threshold value T differs depending on the type of server used.
- the operator measures the threshold value T in advance through experiments using the server used for the service.
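- As an illustration of how such a measurement could yield the threshold T, the sketch below locates the crossing point of FIG. 8 by linear interpolation between measured points. It is a minimal sketch under stated assumptions: the function name, the units, and all sample values are hypothetical, and a constant busy-poll power and a non-decreasing sleep-control power are assumed.

```python
from typing import Optional, Sequence


def find_threshold(freqs: Sequence[float],
                   sleep_power: Sequence[float],
                   busy_power: float) -> Optional[float]:
    """Return the inflow frequency at which sleep-control power reaches busy-poll power.

    freqs and sleep_power are measured pairs sorted by increasing frequency.
    Returns None if the two curves do not cross within the measured range.
    """
    for i in range(len(freqs) - 1):
        f0, f1 = freqs[i], freqs[i + 1]
        p0, p1 = sleep_power[i], sleep_power[i + 1]
        if p0 <= busy_power <= p1:
            if p1 == p0:
                return f0
            # Linear interpolation between the two neighbouring measurements.
            return f0 + (busy_power - p0) * (f1 - f0) / (p1 - p0)
    return None


# Hypothetical measurements: inflow frequency (kpps) and power (W) under sleep control.
measured_freqs = [0, 100, 200, 300]
measured_sleep_power = [10, 30, 50, 70]
threshold_t = find_threshold(measured_freqs, measured_sleep_power, busy_power=40)
```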
- FIG. 9 is a diagram illustrating an example of switching determination logic in the form of a table. As shown in FIG. 9, the switching determination logic consists of categories and a logic outline for each category.
- Simple threshold judgment: a method that measures the amount of traffic inflow and switches the operation mode when the frequency of traffic inflow per unit time exceeds the threshold T.
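- The simple threshold judgment can be written in a few lines. The following is a minimal sketch; the enum, the function name, and the numeric values are hypothetical, and "exceeds" is read as strictly greater than T.

```python
from enum import Enum


class Mode(Enum):
    SLEEP_CONTROL = "sleep control mode"
    ALWAYS_BUSY_POLL = "always busy poll mode"


def simple_threshold_judgment(inflow_per_unit_time: float, threshold_t: float) -> Mode:
    """Select always busy poll mode only when the inflow frequency exceeds T."""
    if inflow_per_unit_time > threshold_t:
        return Mode.ALWAYS_BUSY_POLL
    return Mode.SLEEP_CONTROL


# Hypothetical threshold T = 100 (inflows per unit time).
sparse_mode = simple_threshold_judgment(50, 100)
dense_mode = simple_threshold_judgment(150, 100)
```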
- FIG. 10 is a flowchart showing the NIC and HW interrupt processing of the polling thread (in-server delay control device 100). While the polling thread is activated, this operation flow is executed in a loop. When a packet arrives at the NIC 11, this flow starts. In step S1, the NIC 11 copies the arrived packet data to a memory area by DMA (Direct Memory Access).
- In step S2, the polling thread (in-server delay control device 100) determines whether HW interrupts are permitted. If HW interrupts are permitted (S2: Yes), the process advances to step S3, and if HW interrupts are not permitted (S2: No), the processing of this flow ends.
- In step S3, the NIC 11 raises a HW interrupt (hardIRQ) to the hardIRQ 81 (handler), and packet arrival information (NIC device information, etc.) is registered in the receive list 186.
- In step S4, if the polling thread (in-server delay control device 100) is sleeping, the NIC 11 wakes up the polling thread, and the processing of this flow ends.
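- Steps S1 to S4 of FIG. 10 can be traced with a minimal simulation. The data structure and function below are hypothetical stand-ins for the Ring Buffer 72, the receive list 186, and the hardIRQ path; registering each NIC only once in the receive list is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RxState:
    hw_irq_enabled: bool = True
    thread_sleeping: bool = True
    ring_buffer: list = field(default_factory=list)
    receive_list: list = field(default_factory=list)


def on_packet_arrival(state: RxState, nic: str, packet: bytes) -> None:
    # S1: the NIC copies the arrived packet data to memory by DMA.
    state.ring_buffer.append(packet)
    # S2: if HW interrupts are not permitted, the flow ends here.
    if not state.hw_irq_enabled:
        return
    # S3: raise the HW interrupt and register packet arrival information.
    if nic not in state.receive_list:
        state.receive_list.append(nic)
    # S4: wake up the polling thread if it is sleeping.
    if state.thread_sleeping:
        state.thread_sleeping = False


state = RxState()
on_packet_arrival(state, "nic0", b"first")      # interrupt permitted: thread is woken up
state.hw_irq_enabled = False                    # e.g. the polling thread is now processing
state.receive_list.clear()
on_packet_arrival(state, "nic0", b"second")     # only the DMA copy happens
```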
- FIG. 11 is a flowchart showing the operation mode switching process of the mode switching control unit 150 of the polling thread (server delay control device 100).
- the mode switching control unit 150 receives traffic inflow frequency information from the traffic frequency measurement unit 160.
- the mode switching control unit 150 determines which of the "sleep control mode" and the "always busy poll mode" is suitable, based on the received traffic inflow frequency information and according to the switching determination logic described in FIG. 9.
- the mode switching control unit 150 instructs each unit (packet arrival monitoring unit 110, packet harvesting unit 120, sleep management unit 130, and CPU frequency/CPU idle setting unit 140) of the determined operation mode. If the current operation mode is the same as the determined operation mode, no instruction regarding the operation mode is given to the units, so the current operation mode continues.
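- The flow of FIG. 11 (receive the frequency, determine the mode, instruct the units only on a change) can be sketched as follows. This is a minimal illustration, assuming the simple threshold judgment as the determination logic; the class name, the callback-style units, and the numeric values are hypothetical.

```python
SLEEP_CONTROL = "sleep control mode"
ALWAYS_BUSY_POLL = "always busy poll mode"


class ModeSwitchingControlSketch:
    def __init__(self, threshold_t, units, initial_mode=SLEEP_CONTROL):
        self.threshold_t = threshold_t
        self.units = units              # callables standing in for units 110 to 140
        self.mode = initial_mode

    def on_frequency(self, inflow_frequency) -> bool:
        """Handle one frequency report; returns True if the mode was switched."""
        decided = (ALWAYS_BUSY_POLL if inflow_frequency > self.threshold_t
                   else SLEEP_CONTROL)
        if decided == self.mode:
            return False                # same mode: no instruction is issued
        self.mode = decided
        for unit in self.units:
            unit(decided)               # instruct each unit of the new mode
        return True


instructions = []
control = ModeSwitchingControlSketch(threshold_t=100, units=[instructions.append])
results = [control.on_frequency(f) for f in (50, 200, 300, 80)]
```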
- FIG. 12 is a flowchart showing the polling thread operation mode switching process of the polling thread (server delay control device 100). This flow starts when a packet arrives while the polling thread is sleeping and the polling thread is woken up by a HW interrupt. In step S21, the mode switching control unit 150 prohibits HW interrupts by the NIC 11. Because a HW interrupt occurring during processing would interrupt that processing, the mode switching control unit 150 temporarily prohibits HW interrupts by the NIC 11.
- In step S22, the CPU frequency/CPU idle setting unit 140 sets the CPU frequency of the CPU core on which the polling thread operates to a high value and, if the CPU is in the idle state, releases the idle state.
- In step S23, the polling thread refers to the receive list 186.
- the polling thread learns which device the HW interrupt came from, and checks the packet arrival information in the receive list 186 in the next step S24.
- the Ring Buffer 72 may also be referred to directly to confirm whether or not a packet has arrived.
- NAPI implemented in the Linux kernel monitors a Control Plane list called poll_list.
- In step S24, the packet arrival monitoring unit 110 determines whether packet arrival information exists in the receive list 186. If the packet arrival information exists in the receive list 186 (S24: Yes), the process advances to step S25. If the packet arrival information does not exist in the receive list 186 (S24: No), that is, if there is no packet to be processed, the process skips the subsequent steps and proceeds to step S30.
- In step S25, the polling thread refers to the packet data in the ring buffer 72 and transfers the corresponding data to the subsequent protocol processing unit 74.
- the polling thread may receive and process multiple packets all at once.
- In step S26, the traffic frequency measurement unit 160 measures the traffic inflow frequency and transmits it to the mode switching control unit 150.
- the traffic frequency measurement unit 160 may measure the traffic frequency approximately, by estimating it from the number of HW interrupts (recorded as statistical information in the kernel). Furthermore, if the operation mode switching determination logic is light processing such as the simple threshold judgment shown in FIG. 9, the traffic frequency measurement unit 160 may make the determination itself (in that case, the traffic frequency measurement unit 160 also functions as the mode switching control unit 150).
- In step S27, the sleep management unit 130 causes the polling thread to sleep for a short time matched to the traffic inflow frequency. For example, if packets flow in at intervals of 5 us, the polling thread sleeps for about 3 us.
- In step S28, the packet arrival monitoring unit 110 determines whether the operation mode instructed by the mode switching control unit 150 is the "sleep control mode". If the operation mode instructed by the mode switching control unit 150 is not the "sleep control mode" (S28: No), the process returns to step S25.
- In step S29, the packet harvesting unit 120 determines whether or not there are unreceived packets in the ring buffer 72. If there is an unreceived packet in the ring buffer 72 (S29: Yes), the process returns to step S25.
- the loop from step S25 to step S28 is the always busy poll loop (see dashed-line box mm in FIG. 12); the other loop is the sleep control mode loop.
- In step S30, the CPU frequency/CPU idle setting unit 140 sets the CPU frequency of the CPU core on which the polling thread operates to a low value, and puts the corresponding CPU into the idle state.
- In step S31, the packet arrival monitoring unit 110 deletes the corresponding NIC information from the receive list 186.
- In step S32, the packet arrival monitoring unit 110 permits HW interrupts by the corresponding NIC.
- In step S33, the sleep management unit 130 puts the polling thread to sleep, and the processing of this flow ends.
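- The control flow of steps S21 to S33 can be traced with a minimal simulation. It is a sketch under stated assumptions rather than the device's implementation: the function, its parameters `modes` and `arrivals`, and the `max_loops` guard are hypothetical test scaffolding, and the CPU frequency setting, interrupt masking, frequency measurement, and the actual short sleep are only recorded as step labels.

```python
SLEEP_CONTROL = "sleep control mode"
ALWAYS_BUSY_POLL = "always busy poll mode"


def run_after_wakeup(ring_buffer, receive_list, nic, modes, arrivals=None, max_loops=1000):
    """Simulate one pass of the FIG. 12 flow; returns (delivered packets, step trace).

    modes    -- mode instructed at each visit to S28 (the last entry is reused).
    arrivals -- {loop index: [packets]} landing in the ring buffer during S27.
    """
    arrivals = arrivals or {}
    delivered = []
    trace = ["S21", "S22", "S23", "S24"]
    if nic in receive_list:                          # S24: arrival information exists
        for loop in range(max_loops):
            trace.append("S25")                      # hand packets to protocol processing
            delivered.extend(ring_buffer)
            ring_buffer.clear()
            trace.append("S26")                      # measure inflow frequency (not modelled)
            trace.append("S27")                      # short sleep; packets may arrive meanwhile
            ring_buffer.extend(arrivals.get(loop, []))
            trace.append("S28")
            if modes[min(loop, len(modes) - 1)] != SLEEP_CONTROL:
                continue                             # S28: No, always busy poll loop
            trace.append("S29")
            if ring_buffer:
                continue                             # S29: Yes, unreceived packets remain
            break
    trace += ["S30", "S31", "S32", "S33"]
    if nic in receive_list:
        receive_list.remove(nic)                     # S31: delete the NIC information
    return delivered, trace


# Sleep control mode: one pass through S25 to S29, then back to sleep.
sparse_out, sparse_trace = run_after_wakeup([b"p1", b"p2"], ["nic0"], "nic0", [SLEEP_CONTROL])

# Always busy poll mode for two rounds, then sleep control mode is instructed.
dense_out, dense_trace = run_after_wakeup(
    [b"p1"], ["nic0"], "nic0",
    modes=[ALWAYS_BUSY_POLL, ALWAYS_BUSY_POLL, SLEEP_CONTROL],
    arrivals={0: [b"p2"], 1: [b"p3"]},
)
```

In the second call, steps S25 to S28 repeat without visiting S29 until the sleep control mode is instructed, which corresponds to the always busy poll loop of FIG. 12.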
- FIG. 13 is a hardware configuration diagram showing an example of a computer 900 that implements the functions of the intra-server delay control device 100.
- the computer 900 has a CPU 901, a ROM 902, a RAM 903, an HDD 904, a communication interface (I/F) 906, an input/output interface (I/F) 905, and a media interface (I/F) 907.
- the CPU 901 operates based on a program stored in the ROM 902 or the HDD 904, and controls each part of the intra-server delay control device 100 shown in FIGS. 1 to 4.
- the ROM 902 stores a boot program executed by the CPU 901 when the computer 900 is started, programs depending on the hardware of the computer 900, and the like.
- the CPU 901 controls an input device 910 such as a mouse and a keyboard, and an output device 911 such as a display via an input/output I/F 905.
- the CPU 901 acquires data from the input device 910 via the input/output I/F 905 and outputs the generated data to the output device 911.
- the HDD 904 stores programs executed by the CPU 901 and data used by the programs.
- the communication I/F 906 receives data from other devices via a communication network (for example, NW (Network) 920) and outputs it to the CPU 901, and also sends data generated by the CPU 901 to other devices via the communication network.
- the media I/F 907 reads the program or data stored in the recording medium 912 and outputs it to the CPU 901 via the RAM 903.
- the CPU 901 loads a program related to target processing from the recording medium 912 onto the RAM 903 via the media I/F 907, and executes the loaded program.
- the recording medium 912 is an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable disk), a magneto-optical recording medium such as an MO (Magneto Optical disk), a magnetic recording medium, a conductive memory tape medium, a semiconductor memory, or the like.
- the CPU 901 of the computer 900 realizes the functions of the intra-server delay control device 100 by executing a program loaded on the RAM 903. Furthermore, data in the RAM 903 is stored in the HDD 904.
- the CPU 901 reads a program related to target processing from the recording medium 912 and executes it. In addition, the CPU 901 may read a program related to target processing from another device via a communication network (NW 920).
- the present invention can be applied to an in-server delay control device that starts a thread in the kernel that monitors packet arrival using a polling model, such as a polling thread (in-server delay control device 100) shown in FIG.
- the OS is not limited.
- it is not limited to being under a server virtualization environment. Therefore, the intra-server delay control system can be applied to each of the configurations shown in FIGS. 14 and 15.
- FIG. 14 is a diagram showing an example in which the intra-server delay control system 1000A is applied to an interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration. Components that are the same as those in FIGS. 1 and 18 are given the same reference numerals.
- the server delay control device 100 is placed in the Kernel 171 of the Guest OS 70, and the server delay control device 100 is placed in the Kernel 91 of the Host OS 90.
- the server includes a Host OS 90 in which a virtual machine and an external process formed outside the virtual machine can operate, and a Guest OS 70 that operates within the virtual machine.
- the HostOS 90 has the Kernel 91; the Ring Buffer 22 managed by the Kernel 91 in a memory space in the server equipped with the HostOS 90; the receive list 186 (FIG. 2), in which net device information indicating which device the hardware interrupt (hardIRQ) from the NIC 11 belongs to is registered; the vhost-net module 221, which is a kernel thread; the tap device 222, which is a virtual interface created by the Kernel 91; and the virtual switch (br) 223.
- the Kernel 91 includes an intra-server delay control device 100. Kernel 91 transmits the packet to virtual machine 40 configured with Linux (registered trademark) and KVM 30 via tap device 222 .
- the GuestOS 70 includes a kernel 171, a Ring Buffer 52, and a Driver 53, and the Driver 53 includes a virtio-driver 531.
- the Kernel 171 includes an intra-server delay control device 100 and a protocol processing unit 74 that performs protocol processing on packets that have been harvested.
- the Kernel 171 transmits the packet to the packet processing APL1 via the protocol processing unit 74.
- packet transfer can be performed with reduced delay within the server, without modifying the APL, in either OS (the HostOS 90 or the GuestOS 70).
- FIG. 15 is a diagram showing an example in which the intra-server delay control system 1000B is applied to an interrupt model in a container-configured server virtualization environment. Components that are the same as those in FIGS. 1 and 14 are given the same reference numerals.
- the intra-server delay control system 1000B has a container configuration in which GuestOS 70 is replaced with Container 211.
- Container 211 has vNIC (virtual NIC) 212.
- the present invention can be applied to a configuration example in which a polling thread (server delay control device 100) is arranged in the user space.
- the OS is not limited.
- it is not limited to being under a server virtualization environment. Therefore, the intra-server delay control system can be applied to each of the configurations shown in FIGS. 16 and 17.
- FIG. 16 is a diagram showing an example in which the intra-server delay control system 1000C is applied to an interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration. Components that are the same as those in FIGS. 1 and 14 are given the same reference numerals.
- the intra-server delay control system 1000C includes a Host OS 20 in which a virtual machine and an external process formed outside the virtual machine can operate, and the Host OS 20 includes a Kernel 21 and a Driver 23.
- the server delay control system 1000C includes a NIC 11 of the HW connected to the Host OS 20, a polling thread (server delay control device 100) placed in the User space 60, a virtual switch 53, a Guest OS 50 operating within the virtual machine, and a polling thread (server delay control device 100) connected to the host OS 20 and placed in the user space 60.
- packet transfer can be performed with reduced delay within the server, without changing the APL, in either the Host OS 20 or the Guest OS 1 (50).
- FIG. 17 is a diagram showing an example in which the intra-server delay control system 1000D is applied to an interrupt model in a container-configured server virtualization environment. Components that are the same as those in FIGS. 1, 14, and 16 are designated by the same reference numerals.
- the intra-server delay control system 1000D has a container configuration in which the Guest OS 50 in FIG. 16 is replaced with a Container 211.
- Container 211 has vNIC (virtual NIC) 212.
- the present invention can be applied to a system with a non-virtualized configuration such as a bare metal configuration.
- packet transfer can be performed with reduced delay within the server without modifying the APL.
- by working with RSS (Receive-Side Scaling), which can process inbound network traffic using multiple CPUs, the present invention can scale out against the network load by increasing the number of CPUs allocated to the packet arrival monitoring thread when the number of traffic flows increases.
- the present invention is similarly applicable to processors other than CPUs, such as GPUs, FPGAs, and ASICs (application specific integrated circuits), provided that they have an idle state function.
- the in-server delay control device 100 is placed in the kernel space of the OS and starts up a thread that monitors packet arrival using a polling model.
- the operation modes of the thread are a sleep control mode in which the thread can be put to sleep, and an always busy poll mode in which the thread performs busy polling at all times. The device includes a traffic frequency measurement unit 160 that measures the traffic inflow frequency, and a mode switching control unit 150 that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit 160.
- the mode switching control unit 150 switches the operation mode of the thread (polling thread) between the sleep control mode and the always busy poll mode based on the traffic inflow frequency, for example, when the traffic inflow frequency reaches a predetermined threshold (threshold T in FIG. 8). Thereby, when the traffic inflow frequency is "dense", sleep and sleep release operations can be suppressed, and power saving can be achieved while achieving low delay. Specifically, by dynamically switching to the appropriate packet reception mode (sleep control mode/always busy poll mode) according to the traffic inflow frequency, the system enjoys the power saving effect of sleep while the traffic inflow frequency is low, and when the traffic inflow frequency becomes "dense" and the interrupt overhead would outweigh the power saving effect of sleep, it performs busy polling at all times, which prevents power consumption from worsening.
- the following effects can be obtained by switching between the "sleep control mode" and the "always busy poll mode” rather than controlling the sleep time etc. in the sleep control.
- in the always busy poll mode, sleep control logic and the like, which are not necessary for simple busy polling, are not executed, so unnecessary control logic operations can be omitted, potentially reducing power consumption.
- the polling thread goes to sleep and controls the CPU frequency to be set low, so it is possible to suppress the increase in power consumption due to busy polling (power saving).
- the present invention can be applied to cases where there is a polling thread inside the kernel, such as NAPI or KBP.
- the in-server delay control device 100 (see FIGS. 1 and 3) is placed in the user space and starts up a thread that monitors packet arrival using a polling model.
- the operation modes of the thread are a sleep control mode in which the thread can be put to sleep, and an always busy poll mode in which the thread performs busy polling at all times. The device includes a traffic frequency measurement unit 160 that measures the traffic inflow frequency, and a mode switching control unit 150 that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit 160.
- the present invention can be applied to cases where there is a polling thread in the user space, such as in DPDK.
- the Guest OS (GuestOS 70) (see FIG. 14) operating within the virtual machine has: a kernel (Kernel 171); a ring buffer (Ring Buffer 72) (see FIG. 14) managed by the kernel in a memory space in the server equipped with the Guest OS; a packet arrival monitoring unit 110 that monitors a poll list in which net device information indicating which device the hardware interrupt from the interface unit (NIC 11) belongs to is registered, and checks whether or not a packet has arrived; a packet reaping unit 120 that, if a packet has arrived, refers to the packet held in the ring buffer (Ring Buffer 72) and deletes the corresponding queue entry from the ring buffer; a protocol processing unit that performs protocol processing on packets; and an intra-server delay control device 100 that starts up, in the kernel, a thread that monitors packet arrival using a polling model.
- the intra-server delay control device 100 has, as operation modes of the thread (polling thread), a sleep control mode in which the thread can be put to sleep, and an always busy poll mode in which the thread performs busy polling at all times. It includes a traffic frequency measurement unit 160 that measures the traffic inflow frequency, and a mode switching control unit 150 that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit 160.
- The Host OS (Host OS 90) (see FIG. 14) (Host OS 20) (see FIGS. 16 and 17), on which a virtual machine and external processes formed outside the virtual machine can run, has a kernel (Kernel 91) and a ring buffer (Ring Buffer 72) (see FIG. 18) managed by the kernel in the memory space of the server equipped with the Host OS.
- A packet arrival monitoring unit 110 monitors a poll list, in which net device information indicating which device a hardware interrupt from the interface unit (NIC 11) belongs to is registered, and checks whether a packet has arrived. If a packet has arrived, the packet held in the ring buffer is referred to and the corresponding queue entry is removed from the ring buffer.
- The Host OS includes, in the kernel, an in-server delay control device 100 that starts a thread that monitors packet arrival using a polling model. The in-server delay control device 100 has two operation modes for the thread: a sleep control mode in which the thread can be put to sleep, and an always busy poll mode in which the thread busy polls at all times.
- A traffic frequency measurement unit 160 measures the traffic inflow frequency, and a mode switching control unit 150 switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the measured traffic inflow frequency.
- As a result, for a server equipped with a kernel (Kernel 171) and a Host OS (Host OS 90), frequent sleep and wake-up operations are avoided when the traffic inflow frequency is "dense", so power savings can be achieved together with low latency.
- The mode switching control unit 150 switches the operation mode of the polling thread to the sleep control mode from the region where the traffic inflow frequency is low until a predetermined threshold (threshold T in FIG. 8) is reached, and switches it to the always busy poll mode when the traffic inflow frequency becomes equal to or greater than the predetermined threshold (threshold T in FIG. 8).
- In addition to switching the operation mode of the thread (polling thread) to the sleep control mode, the operation mode can therefore be switched to the always busy poll mode in the "region where the always busy poll mode should be used" at or above threshold T (see reference sign kk in FIG. 8).
- the optimal value of the threshold value T is selected using the switching determination logic shown in the table of FIG.
- A packet arrival monitoring unit 110 monitors (polls) packet arrival from the interface unit (NIC 11) and checks whether a packet has arrived.
- A sleep management unit 130 puts the thread (polling thread) to sleep when no packet arrives for a predetermined period, and releases the sleep of the thread by a hardware interrupt (hardIRQ) when a packet arrives.
- The packet arrival monitoring unit 110 performs packet arrival monitoring and reception processing using the polling model, so softIRQ contention does not occur and delay can be reduced. Furthermore, in the sleep control mode, when a packet arrives during sleep, the sleep management unit 130 wakes the polling thread using a high-priority hardIRQ, so the overhead caused by sleep is kept as small as possible.
- Each of the above-mentioned configurations, functions, processing units, processing means, and the like may be partially or entirely realized in hardware, for example by designing them as an integrated circuit.
- Each of the above-mentioned configurations, functions, and the like may also be realized in software, with a processor interpreting and executing a program that realizes each function.
- Information such as programs, tables, and files that realize each function can be held in memory, in a storage device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC (Integrated Circuit) card, SD (Secure Digital) card, or optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Environmental & Geological Engineering (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Description
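The technique of connecting and coordinating multiple virtual machines is called Inter-VM Communication, and in large-scale environments such as data centers, virtual switches have been the standard means of connecting VMs. However, because this technique incurs a large communication delay, faster techniques have been newly proposed. Examples include a technique that uses special hardware called SR-IOV (Single Root I/O Virtualization) and a software technique that uses Intel DPDK (Intel Data Plane Development Kit) (hereinafter, DPDK), a high-speed packet processing library.
FIG. 18 is a schematic diagram of Rx-side packet processing by the New API (NAPI) implemented since Linux kernel 2.5/2.6 (see Non-Patent Literature 1).
As shown in FIG. 18, New API (NAPI) executes the packet processing APL1 placed in the User space 60 available to users, on a server equipped with an OS 70 (for example, a Host OS), and transfers packets between the NIC 11 of the HW 10 connected to the OS 70 and the packet processing APL1.
Kernel 71 is the core function of the OS 70 (for example, a Host OS) and manages hardware monitoring and program execution state on a per-process basis. Here, Kernel 71 responds to requests from the packet processing APL1 and conveys requests from the HW 10 to the packet processing APL1. Kernel 71 handles requests from the packet processing APL1 via system calls (by which a "user program running in non-privileged mode" asks the "kernel running in privileged mode" to perform processing).
Kernel 71 delivers packets to the packet processing APL1 via Socket 75. Kernel 71 receives packets from the packet processing APL1 via Socket 75.
In the New API (NAPI) implemented since Linux kernel 2.5/2.6 described above, when a packet arrives, packet processing is performed by a software interrupt (softIRQ) that follows a hardware interrupt (hardIRQ). As shown in FIG. 18, packet transfer under the interrupt model is carried out by interrupt processing (see reference sign a in FIG. 18), so waiting for interrupt processing occurs and the packet transfer delay increases.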
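[Rx-side packet processing configuration by New API (NAPI)]
FIG. 19 illustrates the outline of Rx-side packet processing by New API (NAPI) in the area enclosed by the broken line in FIG. 18.
<Device driver>
As shown in FIG. 19, the Device driver contains NIC 11 (physical NIC), which is a network interface card; hardIRQ 81, a handler that is invoked when NIC 11 issues a processing request and executes the requested processing (hardware interrupt); and netif_rx 82, the software interrupt processing function unit.
The Networking layer contains softIRQ 83, a handler that is invoked when netif_rx 82 issues a processing request and executes the requested processing (software interrupt), and do_softirq 84, a control function unit that carries out the actual software interrupt (softIRQ). It also contains net_rx_action 85, a packet processing function unit executed upon receiving a software interrupt (softIRQ); poll_list 86, which registers net device (net_device) information indicating which device a hardware interrupt from NIC 11 belongs to; netif_receive_skb 87, which creates an sk_buff structure (a structure that lets Kernel 71 perceive the state of a packet); and Ring Buffer 72.
The Protocol layer contains packet processing function units such as ip_rcv 88 and arp_rcv 89.
Arrows (reference signs) b to m in FIG. 19 show the flow of Rx-side packet processing.
When the hardware function unit 11a of NIC 11 (hereinafter, NIC 11) receives a packet (or frame) within a frame from the opposing device, it copies the arrived packet to Ring Buffer 72 by DMA (Direct Memory Access) transfer without using the CPU (see reference sign b in FIG. 19). Ring Buffer 72 is a memory space in the server and is managed by Kernel 71 (see FIG. 18).
At this point, the hardware interrupt processing in <Device driver> of FIG. 19 stops.
After that, net_rx_action 85 notifies netif_receive_skb 87 (see reference sign k in FIG. 19).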
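In the interrupt model, the kernel, having received an event (hardware interrupt) from the HW, transfers packets through software interrupt processing that performs the packet processing. Because packet transfer is performed by interrupt (software interrupt) processing, waiting occurs when there is contention with other interrupts or when the interrupt-destination CPU is being used by a higher-priority process, and the packet transfer delay increases. If interrupt processing becomes congested, the waiting delay grows further.
In a typical kernel, packet transfer processing is conveyed by software interrupt processing after hardware interrupt processing.
When a software interrupt for packet transfer processing occurs, the software interrupt processing cannot be executed immediately under conditions (1) to (3) below. In that case the interrupt processing is arbitrated and scheduled by a scheduler such as ksoftirqd (a per-CPU kernel thread that runs when the software interrupt load becomes high), and waiting on the order of milliseconds occurs.
(1) When there is contention with other hardware interrupt processing
(2) When there is contention with other software interrupt processing
(3) When the interrupt-destination CPU is in use by another higher-priority process or kernel thread (such as a migration thread)
Under the above conditions, the software interrupt processing cannot be executed immediately.
On the other hand, the technique described in Patent Literature 1 constantly monitors packet arrival, which suppresses software interrupts and achieves low-latency packet reaping. However, monitoring packet arrival occupies a CPU core and consumes CPU time, which raises power consumption. That is, the kernel thread that constantly monitors packet arrival occupies a CPU core and always uses CPU time, so power consumption becomes large. The relationship between workload and CPU utilization is described with reference to FIGS. 20 and 21.
As shown in FIG. 21, in KBP the kernel thread occupies a CPU core in order to busy poll. Even with the intermittent packet reception shown in FIG. 20, KBP always uses the CPU regardless of whether packets arrive, so power consumption becomes large.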
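[Overview]
FIG. 1 is a schematic configuration diagram of an in-server delay control system according to an embodiment of the present invention. This embodiment is an example applied to Rx-side packet processing by New API (NAPI) implemented since Linux kernel 2.5/2.6. Components identical to those in FIG. 18 are given the same reference signs.
As shown in FIG. 1, the in-server delay control system 1000 executes the packet processing APL1 placed in the User space available to users, on a server equipped with an OS (for example, a Host OS), and transfers packets between the NIC 11 of the HW connected to the OS and the packet processing APL1.
Ring Buffer 72 is managed by the kernel in a memory space in the server. Ring Buffer 72 is a fixed-size buffer that stores messages output by the kernel as a log, and when its upper size limit is exceeded it is overwritten from the beginning.
The protocol processing unit 74 handles Ethernet, IP, TCP/UDP, and the like. It performs, for example, the L2/L3/L4 protocol processing defined by the OSI reference model.
The in-server delay control device 100 is a polling thread placed in either the kernel space or the User space.
The in-server delay control device 100 includes a packet arrival monitoring unit 110, a packet reaping unit 120, a sleep management unit 130, a CPU frequency/CPU idle setting unit 140, a mode switching control unit 150, and a traffic frequency measurement unit 160. In FIG. 1, the packet arrival monitoring unit 110 contains the traffic frequency measurement unit 160.
When multiple packets have accumulated in Ring_Buffer 72, the packet reaping unit 120 reaps them together and passes them to the subsequent protocol processing unit 74. The number of packets reaped together is called the quota, and this is often referred to as batch processing. The protocol processing unit 74 is fast because it also processes multiple packets together.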
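As a minimal illustrative sketch (not part of the specification), the quota-based batch reaping described above can be expressed as follows. The function name `reap`, the use of a `deque` as a stand-in for the ring buffer, and the packet labels are all assumptions made for illustration.

```python
from collections import deque


def reap(ring, quota):
    """Remove up to `quota` packets from the ring and return them as one batch."""
    batch = []
    while ring and len(batch) < quota:
        batch.append(ring.popleft())  # oldest packet first
    return batch


# Five packets are waiting; with quota=3 they are reaped in two batches.
ring = deque(f"pkt{i}" for i in range(5))
first_batch = reap(ring, quota=3)
second_batch = reap(ring, quota=3)
```

Each returned batch would then be handed to protocol processing in one call, which is where the batching gain described above would come from.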
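FIGS. 2 and 3 illustrate the placement of the polling thread (in-server delay control device 100) of FIG. 1.
・Placement of the polling thread in the kernel space
FIG. 2 is a configuration example in which the polling thread (in-server delay control device 100) of FIG. 1 is placed in the kernel space.
In the in-server delay control system 1000 shown in FIG. 2, the polling thread (in-server delay control device 100) and the protocol processing unit 74 are placed in the kernel space. This polling thread operates within the kernel space. The in-server delay control system 1000 executes the packet processing APL1 placed in the User space on a server equipped with an OS, and transfers packets between the NIC 11 of the HW and the packet processing APL1 via the Device driver connected to the OS.
As shown in FIG. 2, hardIRQ 81, the HW interrupt processing unit 182, receive list 186, and Ring_Buffer 72 are placed in the Device driver.
The Device driver is a driver for monitoring hardware.
FIG. 3 is a configuration example in which the polling thread (in-server delay control device 100) of FIG. 1 is placed in the User space.
In the in-server delay control system 1000 shown in FIG. 3, the polling thread (in-server delay control device 100) and the protocol processing unit 74 are placed in the User space. This polling thread operates in the User space, not in the Kernel space.
In the in-server delay control system 1000 shown in FIG. 3, the polling thread (in-server delay control device 100) bypasses the kernel space and transfers packets between the Device driver and NIC 11 on one side and the packet processing APL1 on the other.
FIG. 4 illustrates the placement of the traffic frequency measurement unit 160 of the polling thread (in-server delay control device 100) of FIG. 1.
As shown in FIG. 4, the traffic frequency measurement unit 160 of the in-server delay control device 100 may be placed as a thread independent of the packet arrival monitoring unit 110 and measure the traffic frequency there. In this case the traffic frequency measurement unit 160 can no longer measure the traffic frequency directly, but it can still obtain it by approximately inferring it from, for example, the number of HW interrupts (recorded as statistics in the kernel).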
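The approximation from HW interrupt counts could look roughly like the sketch below. This is only an illustration under stated assumptions: the function name `estimate_inflow_rate` and the sample counter values are hypothetical, and the two counter readings are assumed to have been taken `interval_s` seconds apart (on Linux, for example, per-IRQ counters of this kind are exposed in /proc/interrupts).

```python
def estimate_inflow_rate(prev_irq_count, curr_irq_count, interval_s):
    """Approximate the traffic inflow frequency as HW interrupts per second."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (curr_irq_count - prev_irq_count) / interval_s


# Hypothetical readings: 600 new interrupts observed over 2 seconds.
rate = estimate_inflow_rate(prev_irq_count=1000, curr_irq_count=1600, interval_s=2.0)
```

Because one interrupt can correspond to more than one packet, the result is an approximation of the inflow frequency rather than an exact packet rate.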
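The present invention can be applied both when the polling thread is inside the kernel, as with NAPI and KBP, and when the polling thread is in the user space, as with DPDK. The description below takes the case where the polling thread is inside the kernel as an example.
Arrows (reference signs) aa to ii in FIGS. 1 to 4 show the flow of Rx-side packet processing.
When NIC 11 receives a packet (or frame) within a frame from the opposing device, it copies the arrived packet to Ring Buffer 72 by DMA transfer without using the CPU (see reference sign aa in FIGS. 1 to 4). This Ring Buffer 72 is managed by <Device driver>.
At this point, the hardware interrupt processing in <Device driver> of FIGS. 1 to 4 stops.
The packet arrival monitoring unit 110 obtains from receive list 186 the pointer information indicating where packets exist in Ring_Buffer 72 and the net_device information, and passes this information (pointer information and net_device information) to the packet reaping unit 120 (see reference sign gg in FIGS. 1 to 4). If receive list 186 holds information on multiple packets, the information for all of them is passed.
Based on the received information, the packet reaping unit 120 takes the packets out of Ring_Buffer 72 and passes them to the protocol processing unit 74 (see reference sign ii in FIGS. 1 to 4).
The in-server delay control system 1000 stops the softIRQ for packet processing, which is the main cause of NW delay, and the packet arrival monitoring unit 110 of the in-server delay control device 100 runs a polling thread that monitors packet arrival. When a packet arrives, the packet reaping unit 120 processes it under the polling model (without softIRQ).
The polling thread (in-server delay control device 100) sleeps depending on whether packets arrive, and when a packet arrives the sleep is released by hardIRQ 81. Specifically, the sleep management unit 130 of the in-server delay control device 100 puts the polling thread to sleep when no packet has arrived for a predetermined period. When a packet arrives, the sleep management unit 130 releases the sleep via hardIRQ 81. This avoids softIRQ contention and achieves low latency.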
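The sleep and wake-up rule above can be illustrated with a small state-machine sketch. It is a simulation under assumptions, not kernel code: the class name `SleepManager`, its methods, and the integer time units are hypothetical, and the hardware interrupt is represented by an ordinary method call.

```python
class SleepManager:
    """Tracks whether the polling thread should be asleep."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # predetermined period without packets
        self.last_arrival = 0
        self.sleeping = False

    def on_poll(self, now, packet_arrived):
        """Called from the polling loop; returns True if the thread should sleep."""
        if packet_arrived:
            self.last_arrival = now
        elif now - self.last_arrival >= self.idle_timeout:
            self.sleeping = True  # no packet for the predetermined period
        return self.sleeping

    def on_hardirq(self, now):
        """Called from the (simulated) hardIRQ handler when a packet arrives."""
        self.sleeping = False  # release the sleep
        self.last_arrival = now
```

With `idle_timeout=5`, a thread that last saw a packet at time 1 keeps polling at time 4, goes to sleep at time 7, and is woken again when `on_hardirq` is called.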
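<When the traffic inflow frequency is "sparse">
First, the case where the traffic inflow frequency is "sparse" is described.
FIG. 5 shows an example of polling thread operation of the in-server delay control device 100 when the traffic inflow frequency is "sparse". The vertical axis shows the CPU utilization [%] of the CPU core used by the polling thread, and the horizontal axis shows time. FIG. 5 shows the polling thread operation for packet arrivals corresponding to the data transfer example of video (30 FPS), in which packets are received intermittently, shown in FIG. 20.
Next, the case where the traffic inflow frequency is "dense" is described.
FIG. 6 is a data transfer example for the case where the traffic inflow frequency is "dense". As can be seen by comparison with FIG. 20, the workload shown in FIG. 6 has a high traffic inflow frequency, that is, the traffic inflow is "dense" along the time axis.
For example, this applies to cases where data arrives at short symbol intervals in the time direction (for example, 8.92 us or 4.46 us intervals), as with numerology = 3 or 4 in a vRAN (virtual Radio Access Network) vDU (virtual Distributed Unit) system.
When the traffic inflow shown in FIG. 6 is heavy, that is, when the traffic inflow frequency is "dense", the time available for sleep becomes short, as shown in FIG. 7, and sleep and wake up are repeated at high frequency (see reference signs r: sleep / q: wake up in FIG. 7).
Next, the reason why simple busy polling consumes less power than repeating sleep/wake up is explained.
In the "sleep control mode", which performs sleep/wake up, receiving a packet during sleep triggers the following processing, which requires CPU cycles for its computation.
・Raising a hardware interrupt
・Hardware interrupt handler processing (registration in the receive list, waking the sleeping thread)
・If the polling thread is placed in the user space, a context switch accompanying the change from the kernel privileged mode used for hardware interrupt processing to the general mode used for waking the polling thread
When the traffic inflow frequency is "sparse", putting the polling thread to sleep can be expected to save power, so sleep control is performed. When the traffic inflow frequency is "dense", however, putting the polling thread to sleep results in sleep/wake up being repeated at high frequency. In that case the interrupt overhead can outweigh the power saved by sleeping, and sleep control may worsen power consumption. For this reason, the "always busy poll mode" is used (and the "sleep control mode" is not).
When the traffic inflow frequency is "dense", the polling thread busy polls at all times. In that case, always busy polling can reduce the number of CPU cycles compared with repeating sleep/wake up at high frequency.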
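・Low latency
The polling thread (in-server delay control device 100) includes the packet arrival monitoring unit 110 and the packet reaping unit 120, and achieves low-latency packet reception by monitoring packet arrival and performing reception processing under the polling model. Specifically, the polling thread (in-server delay control device 100) monitors packet arrival and performs reception processing using a low-latency polling model (see reference sign ff in FIGS. 1 to 4). As a result, softIRQ contention does not occur and the delay is small. In addition, when a packet arrives during sleep, the polling thread is woken by a high-priority hardIRQ, so the overhead caused by sleep is kept as small as possible.
The polling thread (in-server delay control device 100) includes the sleep management unit 130, which puts the polling thread to sleep while no packets are arriving and thereby prevents power from being wasted on needless busy polling.
The polling thread (in-server delay control device 100) also includes the CPU frequency/CPU idle setting unit 140, which dynamically controls the CPU operating frequency. This dynamic control of the CPU operating frequency is used together with sleep control.
The polling thread (in-server delay control device 100) includes the mode switching control unit 150 and the traffic frequency measurement unit 160. The traffic frequency measurement unit 160 measures the traffic inflow frequency and conveys it to the mode switching control unit 150. The mode switching control unit 150 switches the operation mode of the polling thread between the "always busy poll mode" and the "sleep control mode" according to the traffic inflow frequency. Specifically, the mode switching control unit 150 switches the operation mode of the polling thread to the "sleep control mode" when the traffic inflow frequency is "sparse", and to the "always busy poll mode" when the traffic inflow frequency is "dense".
Next, the merit of separating the modes, rather than controlling the sleep time or the like, is explained.
The modes are separated into the "sleep control mode", which performs sleep/wake up, and the "always busy poll mode", which performs simple busy polling. As a result, sleep control logic and the like that simple busy polling does not need does not have to be implemented in the simple busy poll mode. Unneeded control logic computation can therefore be omitted, which may reduce power consumption. Conversely, if the operation modes are not separated, decision logic such as that for the sleep control time has to be implemented even in the mode that performs simple busy polling, and its computation cost is incurred.
FIG. 8 illustrates the operation mode switching point. The vertical axis shows power consumption by the polling thread, and the horizontal axis shows the packet inflow frequency (traffic inflow frequency).
As shown in FIG. 8, when the polling thread always busy polls, its power consumption is constant. With sleep control, its power consumption rises as the packet inflow frequency rises. At a certain threshold T of the packet inflow frequency the two power consumptions are equal, and once the packet inflow frequency exceeds threshold T, performing sleep control actually increases the power consumption of the polling thread. From the standpoint of power consumption by the polling thread, the range from low packet inflow frequency up to threshold T is the "region where the sleep control mode should be used" (see reference sign jj in FIG. 8), and threshold T and above is the "region where the always busy poll mode should be used" (see reference sign kk in FIG. 8).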
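To make the crossover at threshold T concrete, the sketch below assumes a simple linear model that is not stated in the specification: busy polling draws a constant power `p_busy`, while sleep control draws an idle power `p_idle` plus a fixed cost per unit of inflow frequency. All names and numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def crossover_threshold(p_busy, p_idle, cost_per_unit_freq):
    """Inflow frequency at which sleep control costs as much as busy polling.

    Assumed model: P_sleep(f) = p_idle + cost_per_unit_freq * f, P_busy = p_busy.
    """
    return (p_busy - p_idle) / cost_per_unit_freq


def preferred_mode(freq, threshold):
    """Region jj (below T) uses sleep control; region kk (T and above) busy polls."""
    return "always_busy_poll" if freq >= threshold else "sleep_control"


# Hypothetical figures: 10 W busy polling, 2 W idle, 0.5 W per unit of frequency.
T = crossover_threshold(p_busy=10.0, p_idle=2.0, cost_per_unit_freq=0.5)
```

Under these assumed figures T is 16 units of inflow frequency; a measured curve such as the one in FIG. 8 would replace the linear model in practice.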
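FIG. 9 is a table showing examples of the switching decision logic.
As shown in FIG. 9, the switching decision logic consists of categories and a logic outline for each category.
A method that measures the traffic inflow and switches the operation mode when the traffic inflow frequency per unit time exceeds threshold T.
People are more active during the daytime, so traffic volume tends to be higher than at night. Taking this characteristic into account, operation mode switching that considers the time of day is performed as follows.
・Daytime: when the traffic inflow frequency rises above threshold T, the thread moves to the always busy poll mode. Even if the frequency then temporarily falls below threshold T, the thread does not move to the sleep control mode, which prevents hunting, in which the operation mode switches frequently around threshold T.
・Nighttime: when the traffic frequency falls below threshold T, the thread moves to the sleep control mode. Even if the frequency then temporarily exceeds threshold T, the thread does not move to the always busy poll mode, which prevents hunting, in which the operation mode switches frequently around threshold T.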
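The daytime and nighttime rules above amount to one-directional switching per time period. A minimal sketch, assuming hypothetical names (`next_mode`, the `"day"`/`"night"` labels, and the mode strings), is:

```python
SLEEP_CONTROL = "sleep_control"
ALWAYS_BUSY_POLL = "always_busy_poll"


def next_mode(period, mode, freq, threshold):
    """Return the operation mode after observing `freq` during `period`."""
    if period == "day" and freq > threshold:
        return ALWAYS_BUSY_POLL  # daytime: only ever move toward busy polling
    if period == "night" and freq < threshold:
        return SLEEP_CONTROL     # nighttime: only ever move toward sleep control
    return mode                  # otherwise keep the current mode (no hunting)
```

For example, during the day a thread already in the always busy poll mode stays there when the frequency dips to 80 against a threshold of 100, so brief dips do not cause the mode to flap.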
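The number of people in an area, and hence the traffic volume, can vary characteristically with events such as fireworks festivals or with store business hours. By obtaining such planned-event information and location information in advance and applying the same control as in item 2 above (switching that considers the time of day) according to the traffic volume expected for each time slot, effective operation mode switching is possible.
The transition of traffic volume is learned in advance by machine learning; the future traffic frequency is predicted by inference from the incoming traffic pattern, and the thread is switched to the suitable operation mode.
<NIC and HW interrupt processing>
FIG. 10 is a flowchart showing the NIC and HW interrupt processing of the polling thread (in-server delay control device 100).
While the polling thread is running, this operation flow is executed in a loop.
This flow starts when a packet arrives at NIC 11. In step S1, NIC 11 copies the arrived packet data to the memory area by DMA (Direct Memory Access).
In step S3, NIC 11 raises a HW interrupt (hardIRQ) to hardIRQ 81 (the handler) to start the HW interrupt, and registers packet arrival information (NIC device information and the like) in receive list 186.
In step S4, if the polling thread (in-server delay control device 100) is sleeping, NIC 11 wakes the polling thread and ends the processing of this flow.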
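The steps listed above (S1, S3, and S4) can be mimicked in user-level Python as a rough simulation. This is an assumption-laden sketch rather than driver code: `ring_buffer`, `receive_list`, `polling_thread_wakeup`, and `on_packet_arrival` are hypothetical names, the DMA copy is modeled as an append, and the wake-up is modeled with a `threading.Event`.

```python
import threading
from collections import deque

ring_buffer = deque()    # stands in for Ring Buffer 72
receive_list = deque()   # stands in for receive list 186
polling_thread_wakeup = threading.Event()  # set means "polling thread is awake"


def on_packet_arrival(packet, device="nic0"):
    """Simulate the NIC and HW interrupt flow for one arriving packet."""
    ring_buffer.append(packet)           # S1: copy packet data into memory
    receive_list.append(device)          # S3: register packet arrival information
    if not polling_thread_wakeup.is_set():
        polling_thread_wakeup.set()      # S4: wake the polling thread if sleeping
```

A polling thread in this model would block on `polling_thread_wakeup.wait()` while asleep and clear the event again when it decides to sleep.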
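<Operation mode switching processing>
FIG. 11 is a flowchart showing the operation mode switching processing of the mode switching control unit 150 of the polling thread (in-server delay control device 100).
In step S11, the mode switching control unit 150 receives traffic inflow frequency information from the traffic frequency measurement unit 160.
In step S12, based on the received traffic inflow frequency information and following the switching decision logic described in FIG. 9, the mode switching control unit 150 determines which of the "sleep control mode" and the "always busy poll mode" is suitable. If the current operation mode differs from the determined operation mode, the mode switching control unit 150 instructs each unit (the packet arrival monitoring unit 110, the packet reaping unit 120, the sleep management unit 130, and the CPU frequency/CPU idle setting unit 140) to use the determined operation mode. If the current operation mode is the same as the determined operation mode, no instruction is issued to the units, and the current operation mode continues.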
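Steps S11 and S12 can be sketched as a small controller that notifies the other units only when the decided mode differs from the current one. The class name `ModeSwitchController`, the callback list `units`, and the use of a plain threshold comparison as the decision logic are assumptions for illustration; FIG. 9 lists several alternative decision logics.

```python
SLEEP_CONTROL = "sleep_control"
ALWAYS_BUSY_POLL = "always_busy_poll"


class ModeSwitchController:
    def __init__(self, threshold, units):
        self.threshold = threshold
        self.units = units           # callables standing in for units 110 to 140
        self.mode = SLEEP_CONTROL

    def on_frequency(self, freq):
        """S11: receive the measured frequency. S12: decide and notify on change."""
        decided = ALWAYS_BUSY_POLL if freq >= self.threshold else SLEEP_CONTROL
        if decided != self.mode:
            self.mode = decided
            for notify in self.units:
                notify(decided)      # instruct each unit to use the new mode
        return self.mode
```

Passing a list's `append` method as the only unit makes the notifications easy to observe: repeated reports on the same side of the threshold produce no further entries.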
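FIG. 12 is a flowchart showing the operation mode switching processing of the polling thread (in-server delay control device 100).
This flow starts when a packet arrives while the polling thread is sleeping and the thread is woken by a HW interrupt.
In step S21, the mode switching control unit 150 prohibits HW interrupts by NIC 11. If a HW interrupt occurred during processing, the processing would be interrupted, so the mode switching control unit 150 temporarily prohibits HW interrupts by NIC 11.
Instead of referring to the Control Plane list called receive list 186, Ring Buffer 72 may be referenced directly to check whether a packet has arrived. For example, NAPI as implemented in the Linux kernel monitors a Control Plane list called poll_list.
The in-server delay control device 100 according to the above embodiment is realized, for example, by a computer 900 configured as shown in FIG. 13.
FIG. 13 is a hardware configuration diagram showing an example of the computer 900 that realizes the functions of the in-server delay control device 100.
The computer 900 has a CPU 901, a ROM 902, a RAM 903, an HDD 904, a communication interface (I/F) 906, an input/output interface (I/F) 905, and a media interface (I/F) 907.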
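(Form in which the polling thread is placed in the kernel)
As with the polling thread (in-server delay control device 100) shown in FIG. 2, the invention can be applied to an in-server delay control device that starts, within the Kernel, a thread that monitors packet arrival using a polling model. In this case the OS is not limited. Nor is the invention limited to a server virtualization environment. Accordingly, the in-server delay control system can be applied to each of the configurations shown in FIGS. 14 and 15.
FIG. 14 shows an example in which the in-server delay control system 1000A is applied to the interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration. Components identical to those in FIGS. 1 and 18 are given the same reference signs.
As shown in FIG. 14, in the in-server delay control system 1000A, an in-server delay control device 100 is placed in Kernel 171 of Guest OS 70, and an in-server delay control device 100 is placed in Kernel 91 of Host OS 90.
Host OS 90 has Kernel 91; Ring Buffer 22, managed by Kernel 91 in a memory space in the server equipped with Host OS 90; receive list 186 (FIG. 2), which registers net device information indicating which device a hardware interrupt (hardIRQ) from NIC 11 belongs to; the vhost-net module 221, which is a kernel thread; the tap device 222, a virtual interface created by Kernel 91; and a virtual switch (br) 223.
Kernel 91 delivers packets via the tap device 222 to the virtual machine 40, which is composed of Linux (registered trademark) and KVM 30.
Kernel 171 delivers packets to the packet processing APL1 via the protocol processing unit 74.
FIG. 15 shows an example in which the in-server delay control system 1000B is applied to the interrupt model in a server virtualization environment with a container configuration. Components identical to those in FIGS. 1 and 14 are given the same reference signs.
As shown in FIG. 15, the in-server delay control system 1000B has a container configuration in which Guest OS 70 is replaced with Container 211. Container 211 has a vNIC (virtual NIC) 212.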
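The form in which the polling thread is placed in the kernel has been described above. Next, the form in which the polling thread is placed in the user space is described.
As shown in FIG. 3, the invention can be applied to a configuration in which the polling thread (in-server delay control device 100) is placed in the User space. In this case the OS is not limited. Nor is the invention limited to a server virtualization environment. Accordingly, the in-server delay control system can be applied to each of the configurations shown in FIGS. 16 and 17.
FIG. 16 shows an example in which the in-server delay control system 1000C is applied to the interrupt model in a server virtualization environment with a general-purpose Linux kernel (registered trademark) and a VM configuration. Components identical to those in FIGS. 1 and 14 are given the same reference signs.
As shown in FIG. 16, the in-server delay control system 1000C includes a Host OS 20 on which a virtual machine and external processes formed outside the virtual machine can run, and the Host OS 20 has Kernel 21 and Driver 23. The in-server delay control system 1000C further includes the NIC 11 of the HW connected to Host OS 20, a polling thread (in-server delay control device 100) placed in User space 60, a virtual switch 53, a Guest OS 50 running in the virtual machine, and a polling thread (in-server delay control device 100) connected to Host OS 20 and placed in User space 60.
FIG. 17 shows an example in which the in-server delay control system 1000D is applied to the interrupt model in a server virtualization environment with a container configuration. Components identical to those in FIGS. 1, 14, and 16 are given the same reference signs.
As shown in FIG. 17, the in-server delay control system 1000D has a container configuration in which Guest OS 50 of FIG. 16 is replaced with Container 211. Container 211 has a vNIC (virtual NIC) 212.
The present invention can be applied to non-virtualized systems such as bare-metal configurations. In a non-virtualized system, packets can be transferred with reduced delay in the server without modifying the APL.
When traffic volume is large and multiple NIC devices or NIC ports are used, running multiple polling threads associated with them makes it possible to scale the polling threads in and out while controlling the HW interrupt frequency.
When the number of traffic flows increases, the invention can scale out against network load by increasing the number of CPUs allocated to the packet arrival monitoring thread in cooperation with RSS (Receive-Side Scaling), which can process inbound network traffic on multiple CPUs.
Although NIC (Network Interface Card) I/O has been used as the example, this technique is also applicable to the I/O of PCI devices such as accelerators (FPGA/GPU and the like). In particular, it can be used for polling when receiving the response that carries the result of offloading FEC (Forward Error Correction) to an accelerator in vRAN.
The present invention is similarly applicable when processors other than CPUs, such as GPU/FPGA/ASIC (application specific integrated circuit), have an idle state function.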
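As described above, the in-server delay control device 100 (see FIGS. 1 and 2) is placed in the kernel space of the OS and starts a thread that monitors packet arrival using a polling model. The operation modes of the thread (polling thread) are a sleep control mode, in which the thread can sleep, and an always busy poll mode, in which the thread busy polls at all times. The device includes a traffic frequency measurement unit 160 that measures the traffic inflow frequency, and a mode switching control unit 150 that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit 160.
The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that illustrated, and all or part of it may be functionally or physically distributed or integrated in arbitrary units depending on various loads, usage conditions, and the like.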
10 HW
11 NIC (physical NIC) (interface unit)
20, 90 Host OS (OS)
22, 52, 72 Ring Buffer
50, 70 Guest OS (OS)
60 user space
74 Protocol processing unit
86, 186 receive list (poll list)
91, 171 Kernel
100 In-server delay control device (polling thread)
110 Packet arrival monitoring unit
120 Packet reaping unit
130 sleep management unit
140 CPU frequency/CPU idle setting unit
150 Mode switching control unit
160 Traffic frequency measurement unit
211 Container
1000, 1000A, 1000B, 1000C, 1000D In-server delay control system
T Threshold
Claims (9)
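- An in-server delay control device that is placed in the kernel space of an OS and starts a thread that monitors packet arrival using a polling model, wherein
the operation modes of the thread include a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times,
the in-server delay control device comprising:
a traffic frequency measurement unit that measures a traffic inflow frequency; and
a mode switching control unit that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit.
- An in-server delay control device that is placed in the user space and starts a thread that monitors packet arrival using a polling model, wherein
the operation modes of the thread include a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times,
the in-server delay control device comprising:
a traffic frequency measurement unit that measures a traffic inflow frequency; and
a mode switching control unit that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit.
- An in-server delay control device, wherein
a Guest OS running in a virtual machine has:
a kernel;
a ring buffer managed by the kernel in a memory space in a server equipped with the Guest OS;
a packet arrival monitoring unit that monitors a poll list, which registers net device information indicating which device a hardware interrupt from an interface unit belongs to, and checks whether a packet has arrived;
a packet reaping unit that, when a packet has arrived, refers to the packet held in the ring buffer and performs reaping, which deletes the corresponding queue entry from the ring buffer; and
a protocol processing unit that performs protocol processing on the reaped packet,
the Guest OS includes, in the kernel, the in-server delay control device that starts a thread that monitors packet arrival using a polling model, and
the in-server delay control device
has, as operation modes of the thread, a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times, and comprises:
a traffic frequency measurement unit that measures a traffic inflow frequency; and
a mode switching control unit that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit.
- An in-server delay control device, wherein
a Host OS on which a virtual machine and external processes formed outside the virtual machine can run has:
a kernel;
a ring buffer managed by the kernel in a memory space in a server equipped with the Host OS;
a packet arrival monitoring unit that monitors packet arrival from an interface unit and checks whether a packet has arrived;
a packet reaping unit that, when a packet has arrived, refers to the packet held in the ring buffer and performs reaping, which deletes the corresponding queue entry from the ring buffer; and
a tap device, which is a virtual interface created by the kernel,
the Host OS includes, in the kernel, the in-server delay control device that starts a thread that monitors packet arrival using a polling model, and
the in-server delay control device
has, as operation modes of the thread, a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times, and comprises:
a traffic frequency measurement unit that measures a traffic inflow frequency; and
a mode switching control unit that switches the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the traffic inflow frequency measured by the traffic frequency measurement unit.
- The in-server delay control device according to any one of claims 1 to 4, wherein
the mode switching control unit switches the operation mode of the thread to the sleep control mode from a region where the traffic inflow frequency is low until a predetermined threshold is reached, and switches it to the always busy poll mode when the traffic inflow frequency becomes equal to or greater than the predetermined threshold.
- The in-server delay control device according to claim 1 or claim 2, comprising:
a packet arrival monitoring unit that, in the always busy poll mode, monitors a poll list, which registers net device information indicating which device a hardware interrupt from an interface unit belongs to, and checks whether a packet has arrived; and
a sleep management unit that, in the sleep control mode, puts the thread to sleep when no packet arrives for a predetermined period and releases the sleep of the thread by a hardware interrupt when a packet arrives.
- An in-server delay control method of an in-server delay control device that is placed in the kernel space of an OS and starts a thread that monitors packet arrival using a polling model, wherein
the operation modes of the thread include a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times,
the method comprising:
a step of measuring a traffic inflow frequency; and
a step of switching the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the measured traffic inflow frequency.
- An in-server delay control method of an in-server delay control device that is placed in the user space and starts a thread that monitors packet arrival using a polling model, wherein
the operation modes of the thread include a sleep control mode in which the thread can sleep and an always busy poll mode in which the thread busy polls at all times,
the method comprising:
a step of measuring a traffic inflow frequency; and
a step of switching the operation mode of the thread to either the sleep control mode or the always busy poll mode based on the measured traffic inflow frequency.
- A program for causing a computer to function as the in-server delay control device according to any one of claims 1 to 4.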
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP22941676.3A EP4524737A4 (en) | 2022-05-12 | 2022-05-12 | Intra-server delay control device and method, and associated program |
| US18/864,727 US20250328372A1 (en) | 2022-05-12 | 2022-05-12 | Server delay control device, server delay control method and program |
| PCT/JP2022/020051 WO2023218596A1 (ja) | 2022-05-12 | 2022-05-12 | In-server delay control device, in-server delay control method, and program |
| JP2024520175A JP7754299B2 (ja) | 2022-05-12 | 2022-05-12 | In-server delay control device, in-server delay control method, and program |
| CN202280095899.2A CN119173855A (zh) | 2022-05-12 | 2022-05-12 | In-server delay control device, in-server delay control method, and program |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/020051 WO2023218596A1 (ja) | 2022-05-12 | 2022-05-12 | In-server delay control device, in-server delay control method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023218596A1 (ja) | 2023-11-16 |
Family
ID=88730011
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2022/020051 Ceased WO2023218596A1 (ja) | In-server delay control device, in-server delay control method, and program | 2022-05-12 | 2022-05-12 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20250328372A1 (ja) |
| EP (1) | EP4524737A4 (ja) |
| JP (1) | JP7754299B2 (ja) |
| CN (1) | CN119173855A (ja) |
| WO (1) | WO2023218596A1 (ja) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20240160468A1 (en) * | 2021-03-18 | 2024-05-16 | Nippon Telegraph And Telephone Corporation | Server delay control device, server delay control method, and program |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021130828A1 (ja) | 2019-12-23 | 2021-07-01 | Nippon Telegraph And Telephone Corporation | In-server delay control device, in-server delay control method, and program |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12001895B2 (en) * | 2019-10-08 | 2024-06-04 | Nippon Telegraph And Telephone Corporation | Server delay control system, server delay control device, server delay control method, and program |
-
2022
- 2022-05-12 WO PCT/JP2022/020051 patent/WO2023218596A1/ja not_active Ceased
- 2022-05-12 US US18/864,727 patent/US20250328372A1/en active Pending
- 2022-05-12 JP JP2024520175A patent/JP7754299B2/ja active Active
- 2022-05-12 CN CN202280095899.2A patent/CN119173855A/zh active Pending
- 2022-05-12 EP EP22941676.3A patent/EP4524737A4/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021130828A1 (ja) | 2019-12-23 | 2021-07-01 | Nippon Telegraph And Telephone Corporation | In-server delay control device, in-server delay control method, and program |
Non-Patent Citations (2)
| Title |
|---|
| NEW API(NAPI, 4 April 2022 (2022-04-04), Retrieved from the Internet <URL:http://lwn.net/2002/0321/a/napi-howto.php3> |
| See also references of EP4524737A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| US20250328372A1 (en) | 2025-10-23 |
| JPWO2023218596A1 (ja) | 2023-11-16 |
| JP7754299B2 (ja) | 2025-10-15 |
| EP4524737A1 (en) | 2025-03-19 |
| CN119173855A (zh) | 2024-12-20 |
| EP4524737A4 (en) | 2026-03-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
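| JP7310924B2 (ja) | | In-server delay control device, server, in-server delay control method, and program |
| JP7251648B2 (ja) | | In-server delay control system, in-server delay control device, in-server delay control method, and program |
| JP2025100826A (ja) | | In-server data transfer device, in-server data transfer method, and program |
| JP2024180507A (ja) | | Server and program |
| JP7485101B2 (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7754299B2 (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7816498B2 (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7662062B2 (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7574902B2 (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7852718B2 (ja) | | In-server data transfer device, data transfer system, in-server data transfer method, and program |
| JP2026074185A (ja) | | In-server delay control device, in-server delay control method, and program |
| JP7740368B2 (ja) | | In-server data transfer device, in-server data transfer method, and program |
| JP7709645B2 (ja) | | In-server data transfer device, in-server data transfer method, and program |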
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22941676; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024520175; Country of ref document: JP |
| | WWE | Wipo information: entry into national phase | Ref document number: 18864727; Country of ref document: US |
| | WWE | Wipo information: entry into national phase | Ref document number: 2022941676; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2022941676; Country of ref document: EP; Effective date: 20241212 |
| | WWP | Wipo information: published in national office | Ref document number: 18864727; Country of ref document: US |