WO2022007968A1 - 分条管理方法、存储系统、分条管理装置及存储介质 - Google Patents
分条管理方法、存储系统、分条管理装置及存储介质 Download PDFInfo
- Publication number
- WO2022007968A1 WO2022007968A1 PCT/CN2021/105640 CN2021105640W WO2022007968A1 WO 2022007968 A1 WO2022007968 A1 WO 2022007968A1 CN 2021105640 W CN2021105640 W CN 2021105640W WO 2022007968 A1 WO2022007968 A1 WO 2022007968A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- storage system
- units
- check
- data
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0652—Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0616—Improving the reliability of storage systems in relation to life time, e.g. increasing Mean Time Between Failures [MTBF]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0614—Improving the reliability of storage systems
- G06F3/0619—Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0647—Migration mechanisms
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0688—Non-volatile semiconductor memory arrays
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0689—Disk arrays, e.g. RAID, JBOD
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/35—Unequal or adaptive error protection, e.g. by providing a different level of protection according to significance of source information or by adapting the coding according to the change of transmission channel characteristics
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M13/00—Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
- H03M13/37—Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
- H03M13/373—Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 with erasure correction and erasure determination, e.g. for packet loss recovery or setting of erasures for the decoding of Reed-Solomon codes
Definitions
- the embodiments of the present application relate to the technical field of data storage, and in particular, to a stripe management method, a storage system, a stripe management device, and a storage medium.
- EC erasure code
- the EC technology mainly encodes the data unit through the erasure code algorithm to obtain the check unit, and stores the data unit and the check unit together to achieve the purpose of fault tolerance.
- the larger the number of data units in encoding the higher the storage space utilization.
- it is difficult to fill the EC strip which affects the reliability of data storage. sex.
- Embodiments of the present application provide a stripe management method, a storage system, a stripe management device, and a storage medium, which can improve the storage space utilization rate of the storage system and improve the reliability of data storage.
- the technical solution is as follows:
- a striping management method which is applied to a storage system, and the method includes:
- a new check unit is generated according to the check units of the plurality of first bars; wherein, the new check unit and the data units in the plurality of first bars belong to a new bar; the The new stripe complies with the second erasure code matching; wherein, the number of data units corresponding to the first erasure code matching is smaller than the number of data units corresponding to the second erasure code. That is to say, first use a smaller erasure code ratio to store data, and then convert it to a larger erasure code ratio to store data. Data is stored with a smaller ratio of erasure codes, which is easy to fill up the strips, reduces write amplification, and improves storage space utilization.
- the verification units in the new strip are generated by the verification units of the multiple first strips, and the data units in the multiple first strips do not participate in the operation, thereby saving computing resources.
- the number of verification units in the new stripe is the same as the number of verification units in the first stripe.
- the plurality of first stripes include at least one first strip that is not persistently stored in the storage system and at least one first strip that has been persistently stored in the storage system ; Described obtaining the verification unit in a plurality of first sub-stripes, specifically includes:
- the method also includes:
- the plurality of first stripes are the first strips that are not persistently stored in the storage system; the method further includes:
- a storage system in a second aspect, includes one or more processors, and the one or more processors are configured to implement the data storage method of the first aspect.
- the storage system provided by the second aspect can be either a distributed storage system or a centralized storage system.
- a third aspect provides a stripe management apparatus, where the storage device is applied in a storage system, and the stripe management apparatus includes a plurality of units, and the plurality of units are used to implement the data storage method of the first aspect.
- a computer-readable storage medium is provided.
- the computer-readable storage medium contains computer program instructions, and one or more central processors in the storage system execute the computer program instructions to cause the storage system to perform the data storage method described in the first aspect.
- a computer program product containing instructions that, when executed on a computer, cause the computer to execute the data storage method described in the first aspect above.
- FIG. 1 is an architecture diagram of a storage system provided by an embodiment of the present application
- FIG. 2 is an architecture diagram of another storage system provided by an embodiment of the present application.
- FIG. 3 is a system architecture diagram of a storage device provided by an embodiment of the present application.
- FIG. 4 is a flowchart of a striping management method provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of the distribution of each unit in a first check matrix provided in an embodiment of the present application in the first dielectric layer;
- FIG. 6 is a schematic diagram of the distribution of cells in the w first check matrices in the first dielectric layer provided by an embodiment of the present application;
- FIG. 7 is a schematic diagram of the distribution of cells in a second check matrix in a second dielectric layer provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a principle of obtaining a second check matrix according to merging w first check matrices according to an embodiment of the present application;
- FIG. 9 is a schematic diagram of a check unit for obtaining a second check matrix provided by an embodiment of the present application.
- FIG. 10 is a schematic diagram of writing data in a second medium layer according to a second erasure code ratio provided by an embodiment of the present application;
- FIG. 11 is a schematic structural diagram of a striping management apparatus provided by an embodiment of the present application.
- FIG. 1 is a schematic structural diagram of a storage system provided by an embodiment of the present application.
- the storage system includes a computer node cluster and a storage node cluster.
- the computing node cluster includes one or more computing nodes 10 (two computing nodes 10 are shown in FIG. 1 , but not limited to two computing nodes 10 ).
- the computing node 10 is a computing device on the user side, such as a server, a desktop computer, and the like.
- the computing node 10 is provided with a central processing unit and a memory (not shown in FIG. 1 ).
- the computing node 10 runs an application program (application) 101 (referred to as application) and a client program 102 (referred to as client).
- application application
- client program 102 referred to as client
- Application 101 is a general term for various applications presented to the user.
- the client 102 is configured to receive a data access request triggered by the application 101 , interact with the storage node 20 , and send the data access request to the storage node 20 .
- the client 102 is also configured to receive data from the storage node and forward the data to the application 101 .
- the client 102 is a software program
- the functions of the client 102 are implemented by the central processing unit included in the computing node 10 running the program in the memory.
- Client 102 may also be implemented by hardware components located within compute node 10 . Any client 102 in the computing node cluster can access any storage node 20 in the storage node cluster.
- the storage node cluster includes one or more storage nodes 20 (three storage nodes 20 are shown in FIG. 1 , but not limited to three storage nodes 20 ), and each storage node 20 can be interconnected.
- Storage nodes such as servers, desktop computers, or storage array controllers, disk enclosures, etc. Functionally, the storage node 20 is mainly used for computing or processing data.
- the storage node 20 at least includes a memory, a network card and one or more central processing units.
- a central processing unit central processing unit, CPU
- CPU central processing unit
- Memory refers to a device used to store data.
- the memory may be a memory or a hard disk.
- the memory refers to the internal memory that directly exchanges data with the processor, it can read and write data at any time, and is very fast, as a temporary data storage for the operating system or other running programs.
- Memory includes one or more types of memory.
- memory can be either random access memory or read only memory (ROM).
- ROM read only memory
- the random access memory can be a dynamic random access memory (DRAM) or a storage class memory (SCM).
- DRAM is a semiconductor memory, and like most Random Access Memory (RAM), it belongs to a volatile memory (volatile memory) device.
- SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
- SCM can provide faster read and write speeds than hard disks, but is slower than DRAM in terms of operation speed and cheaper than DRAM in cost. It should be noted that the processor can directly access the memory, for example, as shown in Figure 2, the processor can directly access the DRAM and the SCM.
- the DRAM and the SCM are only exemplified in this embodiment, and in some possible cases, the memory may only include one of the DRAM and the SCM.
- the memory may also include other random access memories, such as static random access memory (Static Random Access Memory, SRAM) and the like.
- static random access memory Static Random Access Memory, SRAM
- PROM Programmable Read Only Memory
- EPROM Erasable Programmable Read Only Memory
- the memory may also be a dual in-line memory module or a dual in-line memory module (Dual In-line Memory Module, DIMM for short), that is, a module composed of dynamic random access memory (DRAM).
- DIMM Dual In-line Memory Module
- the memory includes one type of memory as an example for description, but this does not constitute a limitation on the number of memory types included in the memory.
- Hard disks read and write data slower than memory and are often used to store data persistently.
- the storage node 20a As an example, one or more hard disks are arranged inside; or, a hard disk enclosure (as shown in FIG. 2 ) is mounted outside the storage node 20, and multiple hard disks are arranged in the hard disk enclosure. Regardless of the deployment method, these hard disks can be regarded as hard disks included in the storage node 20 .
- the hard disk here may refer to a physical hard disk, or may refer to a logical domain or a fault domain including multiple physical hard disks, which is not limited in this embodiment of the present application.
- the type of the physical hard disk is a solid-state hard disk, a mechanical hard disk, or other types of hard disks.
- the memory and the hard disk included in the memory are two completely different storage media, and the performance of the two is completely different. Among them, compared with the hard disk, the data reading speed of the memory is faster and the delay is smaller, that is, the performance of the memory is higher than that of the hard disk. Based on this, in the embodiments of the present application, as shown in FIGS. 1 and 2 , the memory in each storage node 20 is referred to as the first medium layer, and the hard disk in each storage node 20 is referred to as the second medium layer. The performance of the first dielectric layer is higher than that of the second dielectric layer.
- each storage medium in the memory can also be used as a medium layer.
- SCM constitutes a medium layer
- hard disk constitutes a medium layer.
- the first medium layer may be a solid state disk (solid state disk, SSD)
- the second medium layer may be a hard disk drive (hard disk drvie, HDD).
- the network card is used to communicate with other storage nodes, or used to communicate with a disk enclosure coupled to the storage node.
- the network card can directly access the memory of the storage node, as shown in Figure 2, the network card can directly access the DRAM and SCM.
- the storage system does not include computing nodes.
- the computing node and the storage node in the storage system may be in the same node.
- the specific form of the storage system is not limited in the present invention.
- the function of one or more CPUs in the storage node in this embodiment of the present invention may be determined by a Field Programmable Gate Array (FPGA) or an Application-specific Integrated Circuit (ASIC), or the like or the above A variety of combinations to achieve.
- FPGA Field Programmable Gate Array
- ASIC Application-specific Integrated Circuit
- FIG. 3 is a schematic structural diagram of another storage system provided by an embodiment of the present application.
- the storage system shown in FIG. 3 is a storage array, and the storage array includes at least one controller (the controller 11 shown in FIG. 3 ) and a plurality of hard disks 22 .
- the controller 11 is connected to a host (not shown in the figure) through a storage area network (English: storage area network, SAN).
- Controller 11 may be a computing device such as a server, desktop computer, or the like.
- An operating system and application programs are installed on the controller 11 .
- the controller 11 can receive input-output (I/O) requests from the host.
- the controller 11 may also store the data (if any) carried in the I/O request and write the data to the hard disk 22 .
- the hard disk 22 is a mechanical hard disk or a solid-state hard disk, and the solid-state hard disk is a memory with a flash memory (English: flash memory) chip as a medium, also known as a solid state drive (
- FIG. 3 is only an exemplary illustration.
- the storage array may include two or more controllers.
- the physical structure and function of each controller are similar to those of the controller 11 , and this embodiment does not limit the controllers among the controllers. and the connection mode between any controller and the hard disk 22 .
- a hard disk may refer to a physical hard disk, or may refer to a logical domain or a fault domain including multiple physical hard disks, which is not limited in this embodiment of the present application.
- the controller 11 includes an interface card 110 , one or more processors 112 and an interface card 113 .
- the interface card 110 is used to communicate with the host, and the controller 11 receives the operation instruction of the host through the interface card 110 .
- the processor 112 may be a central processing unit (English: central processing unit, CPU). In this embodiment of the present application, the processor 112 is configured to receive an I/O request from a host and process the I/O request.
- the I/O request is a data write request or a data read request, and the processor 112 may also send the data in the write data request to the hard disk 22 .
- the interface card 113 is used to communicate with the hard disk 22 , and the controller 11 sends a write data request (including data, logical address of the data, and virtual address of the data) to the hard disk 22 for storage through the interface card 113 .
- the processor 112 in the embodiment of the present invention may also be implemented by a field programmable gate array (Field Programmable Gate Array, FPGA) or an application-specific integrated circuit (Application-specific integrated circuit, ASIC), etc., or a combination of the above.
- FPGA Field Programmable Gate Array
- ASIC application-specific integrated circuit
- the controller 11 further includes a memory 111 .
- the memory 111 is used to temporarily store data received from the host or data read from the hard disk 22 .
- the controller 11 may temporarily store the data in the multiple write data requests in the memory 111 .
- the capacity of the memory 111 reaches a certain threshold, the data stored in the memory 111 , the virtual address of the data, and the logical address allocated for the data are sent to the hard disk 22 .
- the hard disk 22 stores the received data.
- the memory 111 includes volatile memory, flash memory chips, or a combination thereof.
- the volatile memory is, for example, random-access memory (English: random-access memory, RAM).
- Flash memory chips are various machine-readable media that can store program codes, such as floppy disks, hard disks, and optical disks.
- the memory 111 has a power protection function, which means that when the system is powered off and then powered on again, the data stored in the memory 111 will not be lost.
- the memory 111 and the hard disk 22 included in the controller 11 are two completely different storage media. Among them, compared with the hard disk, the data reading speed of the memory is faster and the delay is smaller, that is, the performance of the memory is higher than that of the hard disk.
- the memory 111 with higher performance is referred to as the first medium layer
- the plurality of hard disks 22 with lower performance relative to the memory are referred to as the second medium layer, that is, the performance of the first medium layer higher than the second dielectric layer.
- the memory 111 with higher performance is called the second medium layer
- the plurality of hard disks 22 with lower performance relative to the memory are called the first medium layer. In this case, the performance of the first medium layer is lower than that of the second medium. Floor.
- Erasure coding is a data redundancy technology. Compared with the multi-copy strategy, erasure coding has higher disk utilization. For example, Reed-Solomon code is a common erasure code.
- the erasure coding technology mainly encodes the original data to obtain redundancy through the erasure coding algorithm, and stores the data and redundancy together to achieve the purpose of fault tolerance.
- the basic idea is to obtain m blocks of redundant elements (check units) by calculating n blocks of original data elements (data units), and the disk utilization rate is n/(n+m). For the elements of the n+m blocks, when any m block elements are in error (including the original data elements and redundant elements), the original n blocks of data elements can be recovered through the corresponding reconstruction algorithm.
- the process of generating checksums is called encoding, and the process of recovering lost data elements is called decoding.
- the ratio of erasure correction codes mentioned in this application refers to the ratio of the data element n to the redundant element m.
- the n blocks of data elements and m blocks of redundant elements based on erasure coding technology belong to one stripe; wherein, the data elements are also called data units, and the redundant elements are also called check units.
- FIG. 4 is a flowchart of a striping management method provided by an embodiment of the present application. The method may be applied to the storage system shown in FIG. 1 or FIG. 2 , and may also be applied to the storage system shown in FIG. 3 . Referring to Figure 4, the method includes the following steps:
- Step 401 In the first medium layer, cache a plurality of data units and check units of the first stripe according to the first erasure code ratio.
- the storage system includes a variety of different storage media.
- the storage system includes two different storage media, and the performance of the first media layer is higher than that of the second media.
- the data storage method is explained by taking the performance of the layer as an example.
- the first dielectric layer is DRAM and the second dielectric layer is hard disk.
- data is stored in accordance with the first erasure code ratio, that is, the stripes complying with the first erasure code ratio are cached in the DRAM, that is, the data units and check units in the stripe are both cached in the DRAM.
- the so-called performance of the first medium layer is higher than that of the second medium layer means that the read and write speed of the first medium layer is faster and the time delay is smaller than that of the second medium layer.
- the first erasure code ratio corresponding to the first medium layer and the second erasure code ratio corresponding to the second medium layer may be obtained respectively, and then In the first medium layer, the data unit and the check unit of the first stripe are cached according to the first erasure code ratio. That is, the received data is divided into data units according to the first erasure code ratio, and corresponding check units are obtained based on the erasure code algorithm, so as to obtain the first segment that complies with the first erasure code ratio.
- the stripe is a stripe that is not persistently stored in the storage system.
- the stripe is the stripe that is persistently stored in the storage system.
- the stripe when the first medium layer includes SCM, when the data unit and check unit in the stripe are stored in the SCM of the first medium layer, the stripe is still called as not persistent in the storage system.
- the stripe of the storage is stored, that is, the stripe is cached in the SCM.
- the stripe is still called the storage system Striping in persistent storage.
- the first erasure code ratio and the second erasure code ratio can be obtained by the management node in the storage system.
- the management node is any storage node among multiple storage nodes in the storage system, or the management node is a node in the storage system that is independent of the storage nodes and used to manage each storage node.
- the management node can obtain the first erasure code ratio and the second erasure code ratio during initialization, and can also obtain the first erasure code ratio and the second erasure code ratio during the operation of the storage system. , which is not limited in the embodiments of the present application. If the data storage method is applied to the storage system shown in FIG.
- the first erasure code matching ratio and the second erasure code matching ratio can be obtained by the controller in the storage system.
- an introduction is made by taking the management node acquiring the first erasure code ratio and the second erasure code ratio as an example.
- the management node determines the first erasure code matching ratio or the second erasure code matching ratio according to the topology structure and fault tolerance capability of the storage system.
- the first erasure code ratio and the second erasure code ratio may both be determined according to the topology and fault tolerance of the storage system, or the first erasure code ratio may be determined according to the topology and fault tolerance of the storage system Obtain, the second erasure code ratio is obtained according to the first erasure code ratio, or, the second erasure code ratio is determined according to the topology and fault tolerance of the storage system, and the first erasure code ratio is determined according to the second erasure code ratio. Erasure code matching is obtained, which is not limited in this embodiment of the present application.
- the topology is used to indicate the number of storage nodes included in the storage system
- the fault tolerance capability is used to indicate the number of storage nodes that the storage system tolerates errors.
- the number of error-tolerant storage nodes in the storage system is equal to the number of check units corresponding to the first erasure code matching, or the number of check units corresponding to the second erasure code matching.
- the management node first obtains the topology of the storage system.
- the management node may store the topology of the storage system, or receive the topology of the storage system sent by other devices, or receive the topology of the storage system input by the user.
- the topology can indicate the composition of the storage system, for example, the number of storage nodes included in the storage system, and the number of child nodes included in each storage node.
- the storage node is a server
- the number of child nodes of the storage node refers to the number of physical hard disks included in the server or the number of hard disk logical domains obtained by dividing the physical hard disks included in the corresponding storage node.
- the storage node is a cabinet
- the number of child nodes of the storage node refers to the number of servers included in the cabinet.
- a rack contains multiple servers.
- each server includes 60 physical hard disks, and every 15 physical hard disks is divided into 1 hard disk logical domain, according to the topology, it can be known that the storage system includes 4 storage nodes , each server is a storage node. Each storage node includes four hard disk logical domains, that is, each storage node includes four child nodes.
- the management node In addition to obtaining the topology of the storage system, the management node also obtains the security level and fault tolerance of the storage system.
- a configuration interface is displayed on the management node, and the configuration interface includes a security level configuration item and a fault tolerance capability configuration option.
- the user inputs the required security level in the security level configuration item, and inputs the number t of nodes that allow errors in the fault tolerance configuration option, where t is an integer greater than or equal to.
- the management node obtains the security level entered by the user and the number of error-allowed nodes t.
- the security level includes server-level security, cabinet-level security, and the like. Among them, server-level security is used to indicate that the storage system can tolerate the failure of at most t servers.
- Cabinet-level security is used to indicate that the storage system can tolerate up to t cabinet failures.
- the management node may also determine the security level of the storage system according to a preset principle according to the topology of the storage system, wherein the preset principle refers to a calculation principle that can ensure the reliability of the storage system.
- the fault tolerance capability of the storage system may also be a system default value, which is not limited in this embodiment of the present application.
- the management node After acquiring the topology structure, fault tolerance capability and security level of the storage system, the management node determines the value range of the number of data units in the first erasure code matching ratio through the following formula (1).
- N is the number of data units corresponding to the first erasure code matching.
- k is the number of nodes included in the storage system.
- the security level is server-level security, the above-mentioned nodes refer to servers, and when the security level is cabinet-level security, the above-mentioned nodes are cabinets.
- the management node After determining the value range of the number of data units in the first erasure code ratio, the management node determines to obtain a plurality of first candidate erasure code ratios according to the value range and M, and each candidate erasure code ratio The code ratio corresponds to a value in the value range. Afterwards, the corresponding erasure correction code ratio with the smallest write amplification value is selected from the plurality of first candidate erasure code ratios as the first erasure code ratio.
- write amplification means that the amount of data actually written by the storage node is greater than the amount of data received from the computing node.
- write amplification is represented by a write amplification value.
- the write amplification value corresponding to the first candidate erasure correction code ratio is equal to the first candidate erasure correction code ratio.
- the ratio between the total number of data units and check units for erasure matching and the number of data units For example, for the erasure code ratio of 6:2, the erasure code ratio is used to represent that every 6 data units corresponds to 2 check units. In this way, the write amplification value corresponding to the erasure code ratio is (6+2)/6.
- the management node is further configured to obtain the second erasure code matching ratio according to the topology structure and fault tolerance capability of the storage system. Specifically, the management node uses the following formula (2) to determine the range of the number of data units in the second erasure code matching ratio.
- X is the number of data units corresponding to the second erasure code matching, and X is greater than N.
- i is the number of child nodes of the node included in the storage system, where, when the security level is server-level security, i is the number of child nodes of the server included in the storage system, where the child node of a server may refer to the server An attached physical hard disk or a logical domain of hard disks.
- i is the number of child nodes of the cabinet included in the storage system, where the child nodes of the cabinet refer to the number of servers included in the cabinet.
- the security level can be configured by the user in the aforementioned configuration manner. In this case, the management node directly obtains the security level configured by the user. Alternatively, the security level may also be determined by the management node according to a preset principle according to the topology structure of the storage system, where the preset principle refers to a calculation principle that can ensure the reliability of the storage system. This is not limited.
- the management node After determining the value range of the number of data units in the second erasure code matching ratio, the management node determines the second erasure code matching ratio according to the value range and Y.
- each server included in the storage system has 4 sub-systems under it. node, so that the total number of child nodes of 4 servers is 16.
- the management node can select the number of data units to be 24. At this time, the ratio of the second erasure code is 24:2.
- N in the first erasure code matching and X in the second erasure code matching are not equal, and N is less than X.
- M in the first erasure erasure code configuration and Y in the second erasure erasure code configuration may be equal or unequal.
- the ratio of N and M is not equal to the ratio of X and Y.
- the management node determines the second erasure code matching ratio X according to the first erasure code matching ratio N:M and a preset w :Y. where X is equal to w*N and Y is equal to or greater than M.
- the management node determines the first erasure code matching ratio N:M according to the second erasure code matching ratio X:Y and the preset w. where N is equal to X/w and M is equal to or less than Y.
- the management node calculates the number X of data units in the second erasure code ratio and the number of data units in the first erasure code ratio
- the preset w is actually The above is the number of first parity check matrices included in the data stored in the first medium layer according to the first erasure code ratio.
- a first parity check matrix is a stripe that complies with the first erasure code matching.
- the administrator may configure the first erasure code ratio and the second erasure code ratio through the management node.
- the storage node After obtaining the first erasure code ratio, the second erasure code ratio, and w, subsequently, when the storage node receives a data write request sent by the computing node, the storage node will be in accordance with the first erasure code ratio and w in Data is written in the first media layer.
- the write data request includes data to be written.
- the target storage node receives the write data request sent by the computing node, and when the amount of received data to be written reaches the size of N data units, the target storage node divides the to-be-written data into N data units. , and generate M check units according to the N data units.
- the N data units and the M check units belong to one sub-data, and the sub-data corresponds to a first check matrix, wherein the first check matrix includes the N data units and the M check units.
- the target storage node stores the N data units and M check units contained in the first check matrix into the first medium layer of the storage system.
- the target storage node continues to receive the write data request sent by the computing node, obtains another first parity check matrix according to the above method, and stores it in the first medium layer.
- the subsequent step 402 can be executed.
- the data to be written is divided into 6 data units, and 2 check units are generated according to the 6 data units.
- a first check unit including 6 data units and 2 check units is generated. matrix.
- the 8 cells included in the first check matrix are stored in the memory of each storage node in the storage system.
- the target storage node can distribute the check cells included in each first check matrix on the same storage node, and the data cells included in each first check matrix can be distributed among the storage nodes according to the principle of even distribution. in the node.
- the storage nodes included in the storage system are four servers, and the memory included in the four servers is the first media layer.
- the target storage node stores the first check matrix.
- the 2 check units included in the check matrix are stored in its own memory, and the remaining 6 data units are forwarded to the other 3 servers for storage.
- the other 3 servers Stores 2 data units. In this way, after the four first check matrices are stored in the first medium layer, the distribution of the 32 units included in the four check matrices on the storage nodes of the storage system is shown in FIG. 6 .
- FIG. 5 and FIG. 6 are only possible distributions of each unit in a first parity check matrix given by way of example in the embodiment of the present application.
- the target storage node may also randomly distribute the plurality of cells evenly in each storage node according to the maximum number of cells allowed to be distributed in each storage node of the storage system and the number of cells included in the first check matrix. That is, the check unit is not limited to be on the same storage node.
- one data unit and one check unit in a first check matrix may be stored on the target storage node, and another check unit and one data unit may be stored on another storage node , in this way, there are still 4 data units remaining, and these 4 data units are respectively stored on the remaining two storage nodes.
- the 4 check units p in the 4 first check matrices are stored in a storage node
- the four verification units q are stored in one storage node, and the storage nodes where the four verification units p are located may be different from the storage nodes where the four verification units q are located.
- the above describes the process of caching data in the first medium layer according to the first erasure code ratio when the storage system is the storage system shown in FIG. 1 or FIG. 2 .
- the controller in the storage array can determine the first erasure code ratio, the second erasure code ratio, and the data stored in the first medium layer.
- the method for determining the number w of the first parity check matrix included in the data refers to the method described above, and details are not described herein again in this embodiment of the present application.
- the controller may cache data in the first media layer according to the first erasure code ratio and w, where the first media layer is Memory for the controller included in the storage system. It should be noted that the controller generates w first check matrices with reference to the method described above. According to the number of controllers included in the storage system and/or the number of memories included in each controller, the data units and check units included in each first parity check matrix are distributed and stored to the respective controllers with reference to the foregoing method. in memory.
- Step 402 Persistently store the plurality of first-striped data units and new check units cached in the first medium layer to the second medium layer according to the second erasure code ratio.
- the new check unit is generated by a plurality of check units of the first section, and the data units of the plurality of first sections and the new check unit belong to the sections that comply with the second erasure code ratio ( second clause).
- the storage node or the controller When the data cached in the first medium layer according to the first erasure code ratio reaches the set condition, the storage node or the controller will store the part of the data that meets the set condition, according to the second erasure code ratio.
- the second dielectric layer For example, the setting condition may be: the number of data units in the plurality of first stripes that are cached in the first medium layer conforming to the first erasure code matching reaches the second score that conforms to the second erasure code matching. The number of data cells in the bar. That is, the second strip is filled in the first dielectric layer.
- the ratio of the second erasure code is X:Y, that is, the data stored in the second medium layer includes X data units and Y check units.
- the second media layer includes hard disks included in storage nodes in the storage system.
- a second parity check matrix is a stripe that complies with the second erasure code matching.
- the first implementation manner after the target storage node collects the w first check matrices through the above step 401, the target storage node calculates and obtains the second check matrix according to the w ⁇ N data units included in the w first check matrices. Y check units. After that, the target storage node stores the calculated Y check units in the second medium layer. For other storage nodes, each storage node stores the data units belonging to the w first check matrices stored by itself to the second medium layer when the set conditions are met. In this way, the w ⁇ N data units stored in the second medium layer are the X data units of the second check matrix, and the Y check units calculated by the target storage node are the Y included in the second check matrix. a verification unit.
- the target storage node when the target storage node stores the data units and Y check units stored by itself in the second medium layer, if the number of hard disks included in the second medium layer is greater than the total number of units included in the second check matrix , the target storage node selects Y hard disks from the plurality of hard disks included in the target storage node according to the calculated number Y of verification units. Then the target storage node writes the Y check units into the selected hard disks, wherein one unit is written on each hard disk.
- the target storage node also stores data units belonging to w first check matrices, the target storage node selects a hard disk for each data unit from the hard disks included in the target storage node, and writes the data unit into the data unit. to the selected hard disks, where one unit is also written to each hard disk.
- the storage node determines the second medium according to the number of check cells included in the second check matrix. The maximum number of cells allowed to be distributed on each hard disk in the tier. Afterwards, if the target storage node also stores data units belonging to w first check matrices, then according to the maximum number of units and the number of data units belonging to w first check matrices stored by itself and Y from its own A plurality of hard disks are selected from the included hard disks, and then the stored data unit and the verification unit are written to the selected plurality of hard disks.
- the target storage node does not store data units belonging to the w first check matrices, select multiple hard disks from the hard disks included in the target storage node according to the maximum number of units and Y, so as to write the Y check units to the on the selected hard disk.
- a plurality of cells in the second parity check matrix may be stored distributed on one hard disk, but the number of stored cells does not exceed the maximum number of cells allowed to be distributed on the hard disk.
- the data units belonging to the w first parity check matrices stored by itself can be written into the second medium layer by referring to the above method.
- the w first parity check matrices include a total of 24 data units, that is, the second The number of data units included in the parity check matrix is 24.
- the number of check units calculated by the target storage node is 2. Since the number of check cells included in the second check matrix is 2, it can be known that the maximum number of cells allowed to be distributed on each hard disk logical domain is 2.
- each storage node combines the 24 data cells and the
- 2 parity units are stored in the second medium layer
- 3 storage nodes in the 4 storage nodes 2 units are stored in each hard disk logical domain of each storage node in the 3 storage nodes, that is, each A storage node stores a total of 8 units, and for another storage node, you can store 2 units on one hard disk logical domain of the storage node, or store one unit on each of the two hard disk logical domains of the storage node. unit.
- the second implementation manner after the target storage node collects the w first check matrices, it obtains Y checks in the second check matrix according to the w ⁇ M check units included in the w first check matrices. unit. After that, the target storage node stores the calculated Y check units in the second medium layer. For other storage nodes, each storage node stores the data units belonging to the w first parity check matrices stored by itself to the second medium layer when the amount of data in its own cache reaches a certain threshold.
- the target storage node obtains the w ⁇ M check units stored by itself, and then according to the w ⁇ M check units The unit obtains Y check units.
- the target storage node performs an XOR operation on the stored w check units p or other
- the check unit p' in the second check matrix is obtained by the calculation method
- the check unit q' in the second check matrix is obtained by performing an XOR operation on the stored w check units q or other calculation methods. It can be seen that, in the embodiment of the present application, the check units of the second check matrix can be obtained by directly calculating the M check units included in each first check matrix. All data units in the check matrix are recalculated to check the check unit, which reduces the amount of computation.
- the storage node can directly obtain the stored check cells to obtain the check cells in the second check matrix. Compared with the case where the verification units are distributed and stored in each storage node, there is no need to obtain verification units across storage nodes, which reduces the amount of network forwarding.
- the target storage node After calculating and obtaining Y check units, the target storage node stores the Y check units in the second medium layer with reference to the method described in the foregoing first embodiment, or stores the self-stored w check matrices belonging to the The data unit and Y check units are stored in the second medium layer. After the other storage nodes meet the set conditions, they store the data units stored in them in the second medium layer. In this way, the w first check matrices stored in each storage node are merged into a second check matrix and stored in the second medium layer. in the second dielectric layer.
- FIG. 8 is a schematic diagram of a principle of obtaining a second check matrix according to merging w first check matrices according to an embodiment of the present application.
- p1 is the check unit p
- q1 is the check unit q.
- a7 to a12 in the second first check matrix are 6 data units
- the remaining two columns of elements p2 and q2 are 2 check units, and so on.
- the elements of the first 6 columns of each check matrix are taken out to form 24 data units of the second check matrix.
- FIG. 8 exemplarily illustrates the process of merging w first check matrices into a second check matrix.
- the performance of the first medium layer is lower than the The performance of the two medium layers, in the process of reading the data from the first medium layer to the second medium layer, it may be necessary to split the second check matrix into w first check matrices. In this case, It is only necessary to perform the above process in reverse to obtain w first check matrices, which is not limited in this embodiment of the present application.
- the target storage node when the M check units included in each first check matrix are check units r, the target storage node performs an XOR operation on the stored w check units r, and incremental calculation obtains the second check unit.
- check unit r' in the check matrix after that, the target storage node can obtain the data units in each first check matrix stored on itself and other storage nodes, and calculate the check unit according to the obtained w ⁇ N data units p' and check unit q'.
- the check unit r', the check unit p' and the check unit q' obtained by calculation are used as the Y check units in the second check matrix. It can be seen that in this implementation, Y and M are not equal.
- the check unit r in each first check matrix is incrementally calculated to obtain the check unit in the second check matrix. unit r', reducing the computational cost.
- the check unit p' and the check unit q' are calculated according to w ⁇ N data units, so that the second check matrix includes 3 check units, which improves the redundancy of the data stored in the second medium layer. Redundancy improves fault tolerance.
- the target storage node performs an exclusive process on the stored 3 check units r. OR operation to obtain the check unit r' in the second check matrix.
- the target storage node obtains the 21 data units stored in itself and other storage nodes, and calculates the check unit p' and the check unit q' according to the 21 data units.
- the Y check units included in the second check matrix are the generated check unit p', check unit q' and check unit r'.
- the target storage node After calculating and obtaining Y check units, the target storage node also refers to the method described in the first implementation manner to store the stored data unit and Y check units in the second medium layer, and other storage nodes cache their own After the amount of data in the storage device reaches a certain threshold, the respective stored data units are stored in the second medium layer, which is not repeated in this embodiment of the present application.
- the target storage node obtains the stored check units from each storage node, and then according to the obtained w ⁇ M Y verification units are obtained from the number of verification units.
- the implementation manner of the target storage node obtaining Y verification units according to the acquired w ⁇ M verification units refers to the implementation manner in the above (1), which is not repeated in this embodiment of the present application.
- the target storage node stores the stored data unit and Y check units in the second medium layer with reference to the method described in the first implementation manner, and other storage nodes store data in their own caches. After the amount of data reaches a certain threshold, the respective stored data units are stored in the second medium layer, and details are not described herein again in this embodiment of the present application.
- the third implementation method After the target storage node has assembled the w first check matrices, it writes the units belonging to the w first check matrices stored by itself to the second medium layer, and each other storage node caches its own After the stored quantity reaches a certain threshold, the cells belonging to the w first parity check matrices stored by itself are also written into the second medium layer. After that, the target storage node obtains the w ⁇ M check units written into the second medium layer, and then calculates and obtains Y check units according to the w ⁇ M check units, and uses the calculated Y check units as The Y parity units of the second parity check matrix are written into the second medium layer.
- each storage node stores the data units and check units belonging to the w first parity check matrices.
- the second medium layer for data units, one hard disk can be selected for each data unit, and each data unit can be written to the hard disk selected for the corresponding data unit, wherein the Hard drives are also different.
- the X data units included in the second parity check matrix the X data units will be written into the X hard disks.
- each storage node may store the check units stored by itself in the remaining hard disks other than the above-mentioned X hard disks.
- one check unit may be written on each hard disk, so that w ⁇ M check units will be written to w ⁇ M hard disks.
- all parity units can be written to one hard disk.
- the target storage node After each storage node writes the data units and the check units belonging to the w first check matrices stored by itself into the second medium layer, the target storage node obtains w ⁇ M check units from the second medium layer . Wherein, if w ⁇ M check units are to be written into w ⁇ M hard disks, the target storage node reads w ⁇ M check units from these w ⁇ M hard disks. If all verification units are written into one hard disk, the target storage node obtains w ⁇ M verification units from the hard disk at one time, which can reduce the number of network communications and save bandwidth resources.
- the target storage The node reads the parity units located in the same column from each hard disk, thereby obtaining w ⁇ M parity units. In this way, the number of network communications can also be reduced to a certain extent, and bandwidth resources can be saved.
- the target storage node After obtaining w ⁇ M check units, the target storage node refers to the method described in the first implementation manner, calculates and obtains Y check units according to the w ⁇ M check units, and then calculates the Y check units.
- the units are respectively written to Y hard disks, wherein each hard disk is written with a parity unit.
- the Y hard disks in which the Y verification units are written are not the hard disks in the X hard disks in which the data units are written.
- each storage node can refer to the method described in the foregoing first implementation manner, on one hard disk. Write to two or more cells, as long as the maximum number of cells allowed to be stored is not exceeded.
- the M check units included in each first check matrix may be stored on a hard disk under the same storage node, or the M check units included in each first check matrix
- the parity units located in the same column can be stored in the hard disk under a storage node, for example, stored in the same hard disk logical domain under a storage node, or stored in a physical hard disk under a storage node, so as to This reduces the number of network forwarding required when calculating Y check units in the second check matrix.
- FIG. 10 is a schematic diagram of writing data according to the second erasure code ratio in the second medium layer according to an embodiment of the present application.
- the two verification units are respectively the verification unit p and the verification unit q.
- the second media layer includes 16 hard disk logical domains distributed on 4 storage nodes (a1 to a4), each storage node is distributed with 4 hard disk logical domains, and each hard disk logical domain includes 16 physical hard disks. Since the number of check cells is 2, a maximum of 2 cells in the second check matrix are allowed to be distributed on each hard disk logical domain.
- two hard disk logical domains are first selected from the 16 hard disk logical domains.
- the selected two hard disk logical domains are the hard disk logical domain a21 on storage node a2 and the hard disk logical domain a41 on storage node a4.
- each storage node storing the check unit p of each first check matrix writes the check unit p stored by itself into the physical hard disk of the hard disk logical domain a21, for example, each first check matrix in the
- the verification unit p is all written into the first physical hard disk of the hard disk logical domain a21.
- each storage node storing the check unit q of each first check matrix writes the check unit q stored by itself into the first physical hard disk of the hard disk logical domain a41.
- the target storage node or the storage node a2 performs an XOR operation on the four check units p in the four first check matrices written to the hard disk logical domain a21 to obtain the second check
- the check unit p' in the matrix is stored in the hard disk logical domain a21.
- each hard disk logical domain is allowed to store at most two cells in the second parity check matrix, after writing the calculated check cell p' in the second check matrix to the hard disk logical domain a21,
- the hard disk logical domain a21 can further store at most one data unit in the second parity check matrix.
- the target storage node or the storage node a4 can also perform an XOR operation on the four check units q stored therein, and incrementally calculate the check unit q' in the second check matrix. , and store it in the hard disk logical domain a41. In this way, the hard disk logical domain a41 can store only one data unit in the second parity check matrix at most.
- each storage node distributes 24 data units in 4 units according to the principle of distributing at most two units in the second check matrix on each hard disk logical domain.
- Each storage node includes 16 hard disk logical domains.
- the above describes the process of caching data in the first medium layer according to the first erasure code ratio when the storage system is the storage system shown in FIG. 1 or FIG. 2 .
- the operations performed by the above-mentioned storage nodes can be performed by the controller, so that the data in the first medium layer includes the w first parity check matrices.
- the data unit and the check unit are combined into a second check matrix and written into the second medium layer, which is not repeated in this embodiment of the present application.
- the data unit and the check unit at the non-faulty position are read from the first medium layer and reconstructed, so as to restore the data in the fault point in the first medium layer.
- the number of data units corresponding to the first erasure code matching is smaller than the number of data units corresponding to the second erasure code, that is, the first erasure code matching is a small proportion. ratio, and the second erasure code matching ratio is a large ratio matching ratio.
- data is cached in the first medium layer according to the first erasure code ratio, and data is stored in the second medium layer according to the second erasure code ratio, and the performance of the first medium layer is higher than that of the second medium layer.
- the medium layer that is to say, the high-performance medium layer uses a smaller ratio of erasure codes to store data, and the low-performance medium layer uses a larger ratio of erasure codes to store data.
- the high-performance medium layer Since the IO granularity received by the high-performance medium layer is relatively small, when the high-performance medium layer stores data with a small erasure code ratio, whenever the size of the received data reaches the corresponding erasure code ratio When the size of N data units is used, one stripe can be filled (N data units and M check units can form one stripe). Compared with the larger ratio of erasure codes, the smaller The erasure code ratio makes it easier to fill up the stripes, thereby reducing the amount of data that is filled with 0s in the stripes, reducing write amplification, and improving storage space utilization. For example, in the high-performance media layer, the erasure code ratio of 6:2 is used to store data.
- a piece of data cached in the first medium layer according to the first erasure code ratio can be directly converted into a piece of data that conforms to the second erasure code ratio, and then stored in the second
- the media layer improves the data storage reliability while improving the storage space utilization of the storage system.
- the data units in the plurality of first stripes no longer need to participate in the operation, thereby saving the computing resources of the storage system.
- the above-mentioned embodiment mainly introduces that the number of data units in a plurality of first strips that comply with the first erasure code ratio and cached in the first medium layer reaches the second strip that complies with the second erasure code ratio.
- the number of data units that is, when the multiple first stripes cached in the first media layer can fill up the second strip, the data units of the first strip and the check unit of the second strip are stored in the The second dielectric layer.
- the first stripe that complies with the first erasure code ratio can be persistently stored in the storage system, that is, stored in the second medium layer, which can improve the Data reliability, preventing the failure of the storage system from leading to the loss of the first piece of data that is not persistently stored in the first media layer.
- the first strip of the non-persistent storage in the first media layer and the first strip of the persistent storage can fill the second strip, that is, the first strip of the non-persistent storage in the first media layer
- the number of data units and the number of data units in the first segment that have been persisted are equal to the number of data units in the second segment
- the verification unit in the first strip of the non-persistent storage in the first medium layer is read, and the verification unit of the second strip is generated. description, which is not repeated in this embodiment of the present invention.
- an interface card in a storage system for example, a host bus adapter (Host Bus Adapter, HBA), a network interface card (Network Interface Card, NIC) or an expander (Expander) It is not repeated in the present invention.
- HBA host Bus Adapter
- NIC Network Interface Card
- Expander Expander
- an embodiment of the present application provides a striping management device, which is applied to any storage node in the storage system shown in FIG. 1 or FIG. 2 described above, and can also be applied to the storage system shown in FIG. 3 . storage array, etc.
- the strip management device includes:
- the obtaining unit 1101 is used to obtain the verification units in the plurality of first strips; wherein, the first strips comply with the first erasure code ratio;
- the generating unit 1102 is configured to generate a new verification unit according to the verification units of the plurality of first strips; wherein, the new verification unit and the data units in the plurality of first strips belong to a new verification unit.
- the new stripe follows the second erasure code matching ratio; wherein, the number of data units corresponding to the first erasure code matching is less than the number of data units corresponding to the second erasure code.
- the number of verification units in the new stripe is the same as the number of verification units in the first stripe.
- the multiple first stripes include at least one first strip that is not persistently stored in the storage system and at least one first strip that has been persistently stored in the storage system; the acquiring unit 1101, Specifically for:
- the stripe management apparatus further includes a storage unit 1103; the storage unit 1103 is configured to persistently store the at least one data unit in the first stripe that is not persistently stored in the storage system and the new verification unit.
- the plurality of first stripes are the first strips that are not persistently stored in the storage system; the stripe management apparatus further includes a storage unit 1103; the storage unit 1103 is used for persistence The data units in the plurality of first stripes and the new check unit are stored.
- the performance of the first medium layer and the second medium layer are different. Based on this, the first medium layer and the second medium layer perform data processing according to different ratios of erasure codes. storage. Due to the different write amplification corresponding to different erasure code ratios, the storage space utilization rate is also different. Therefore, selecting different erasure code ratios for data storage according to the performance of the medium layer can better play the corresponding role. The storage performance of the media layer can effectively improve the utilization of storage space.
- the stripe management device provided by the above embodiment performs data storage
- only the division of the above functional units is used as an example for illustration.
- the above functions can be allocated to different functional units as required. , that is, dividing the internal structure of the device into different functional modules to complete all or part of the functions described above.
- the stripe management device provided by the above embodiments and the stripe management method embodiments belong to the same concept, and the specific implementation process is detailed in the method embodiments, which will not be repeated here.
- the computer program product includes one or more computer program instructions.
- the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
- the computer program instructions may be stored on or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer program instructions may be downloaded from a website, computer, server or The data center transmits data to another website site, computer, server or data center by wired (eg coaxial cable, optical fiber, Digital Subscriber Line, DSL) or wireless (eg: infrared, wireless, microwave, etc.) transmission.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
- the available media may be magnetic media (eg: floppy disk, hard disk, magnetic tape), optical media (eg: Digital Versatile Disc (DVD)), or semiconductor media (eg: Solid State Disk (SSD)) )Wait.
- references herein to "at least one” refers to one or more, and “plurality” refers to two or more.
- “/” means or means, for example, A/B can mean A or B;
- "and/or” in this document is only an association relationship to describe associated objects, indicating that There can be three kinds of relationships, for example, A and/or B, which can mean that A exists alone, A and B exist at the same time, and B exists alone.
- words such as “first” and “second” are used to distinguish the same or similar items with basically the same function and effect. Those skilled in the art can understand that the words “first”, “second” and the like do not limit the quantity and execution order, and the words “first”, “second” and the like are not necessarily different.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Computer Security & Cryptography (AREA)
- Probability & Statistics with Applications (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
一种分条管理方法、存储系统、分条管理装置及存储介质,属于数据存储技术领域。该方法包括:获取多个第一分条中的校验单元;其中,所述第一分条遵从第一纠删码配比;根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。
Description
本申请要求于2020年7月10日提交中国专利局、申请号为202010661972.0、申请名称为“数据存储方法以及存储设备”的中国专利申请以及于2020年10月23日提交中国专利局、申请号为202011148485.0、申请名称为“分条管理方法、存储系统、分条管理装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请实施例涉及数据存储技术领域,特别涉及一种分条管理方法、存储系统、分条管理装置及存储介质。
在存储系统中,提升有效存储容量是降低存储成本的有力武器,而纠删码(erasure code,EC)技术就能够提升存储系统的有效存储容量。当前,EC技术被广泛的应用于存储系统中。EC技术主要是通过纠删码算法将数据单元进行编码得到校验单元,并将数据单元和校验单元一并存储起来,以达到容错的目的。存储系统中为了降低成本,采用EC技术时编码时数据单元的个数越大,存储空间利用率越高,但是数量单元的个数较大时凑满EC条带比较困难,从而影响数据存储可靠性。
发明内容
本申请实施例提供了一种分条管理方法、存储系统、分条管理装置及存储介质,能够提高存储系统的存储空间利用率的同时提高数据存储可靠性。所述技术方案如下:
第一方面,提供了一种分条管理方法,应用于存储系统中,所述方法包括:
获取多个第一分条中的校验单元;其中,所述第一分条遵从第一纠删码配比;
根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。也就是说,先采用较小的纠删码配比来存储数据,然后再转换成较大纠删码配比较大的纠删码配比来存储数据。通过较小的纠删码配比来存储数据,容易凑满分条,降低了写放大,提高了存储空间利用率。另外,采用较大的纠删码配比来存储数据,能够减少冗余数据在存储空间中的占比,从而提高存储空间利用率。因此,能够提高存储系统的存储空间利用率的同时提高数据存储可靠性。同时,新的分条中的校验单元由多个第一分条的校验单元生成,多个第一分条中的数据单元不参与运算,从而节约了计算资源。
一种情况下,所述新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
在一种实现方式中,所述多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;所述获取多个第一分条中的校验单元,具体包括:
读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;
读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
进一步地,所述方法还包括:
持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
在一种实现方式中,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述方法还包括:
持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
第二方面,提供了一种存储系统,所述存储系统包含一个或多个处理器,所述一个或多个处理器用于实现上述第一方面的数据存储方法。
第二方面所提供的存储系统既可以是一种分布式的存储系统,也可以是一种集中式的存储系统。
第三方面提供一种分条管理装置,所述存储设备应用于存储系统中,所述分条管理装置包含多个单元,所述多个单元用于实现上述第一方面的数据存储方法。
第四方面,提供了一种计算机可读存储介质,
所述计算机可读存储介质包含计算机程序指令,存储系统中的一个或多个中央处理器执行所述计算机程序指令使得所述存储系统执行上述第一方面所述的数据存储方法。
第五方面,提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的数据存储方法。
上述第二方面、第三方面、第四方面、第五方面和第六方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
图1是本申请实施例提供的一种存储系统架构图;
图2是本申请实施例提供的另一种存储系统架构图;
图3是本申请实施例提供的一种存储设备的系统架构图;
图4是本申请实施例提供的一种分条管理方法的流程图;
图5是本申请实施例提供的一个第一校验矩阵中的各个单元在第一介质层中的分布示意图;
图6是本申请实施例提供的w个第一校验矩阵中的单元在第一介质层中的分布示意图;
图7是本申请实施例提供第二校验矩阵中的单元在第二介质层中的分布示意图;
图8是本申请实施例提供的一种根据w个第一校验矩阵合并得到第二校验矩阵的原理示意图;
图9是本申请实施例提供的一种获得第二校验矩阵的校验单元的示意图;
图10是本申请实施例提供的一种在第二介质层中按照第二纠删码配比写入数据的示意图;
图11是本申请实施例提供的一种分条管理装置的结构示意图。
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
在对本申请实施例进行详细的解释说明之前,先对本申请实施例涉及的系统架构进行介绍。
图1是本申请实施例提供的一种存储系统的结构示意图。如图1所示,该存储系统包括计算机节点集群和存储节点集群。其中,计算节点集群包括一个或多个计算节点10(图1中示出了两个计算节点10,但不限于两个计算节点10)。计算节点10是用户侧的一种计算设备,如服务器、台式计算机等。在硬件层面,计算节点10中设置有中央处理器和内存(图1中未示出)。在软件层面,计算节点10上运行有应用程序(application)101(简称应用)和客户端程序102(简称客户端)。应用101是对用户呈现的各种应用程序的统称。客户端102用于接收由应用101触发的数据访问请求,并且与存储节点20交互,向存储节点20发送数据访问请求。客户端102还用于接收来自存储节点的数据,并向应用101转发该数据。应理解的是,当客户端102是软件程序时,客户端102的功能由计算节点10所包含的中央处理器运行内存中的程序来实现。客户端102也可以由位于计算节点10内部的硬件组件来实现。计算节点集群中的任意一个客户端102可以访问存储节点集群中的任意一个存储节点20。
存储节点集群包括一个或多个存储节点20(图1中示出了三个存储节点20,但不限于三个存储节点20),各个存储节点20之间可以互联。存储节点如服务器、台式计算机或者存储阵列的控制器、硬盘框等。在功能上,存储节点20主要用于对数据进行计算或处理等。
在硬件上,如图1所示,存储节点20至少包括存储器、网卡和一个或多个中央处理器。
其中,中央处理器(central processing unit,CPU),用于处理来自存储节点20外部的数据,或者存储节点20内部生成的数据。
存储器,是指用于存储数据的装置。在本申请实施例中,存储器可以是内存,也可以是硬盘。其中,内存是指与处理器直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作系统或其他正在运行中的程序的临时数据存储器。内存包括一种或多种类型的存储器,例如内存既可以是随机存取存储器,也可以是只读存储器(Read Only Memory,ROM)。举例来说,随机存取存储器可以是动态随机存取存储器(Dynamic Random Access Memory,DRAM),也可以是存储级存储器(Storage Class Memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(Random Access Memory,RAM)一样,属于一种易失性存储器(volatile memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,SCM能够提供比硬盘更快速的读写速度,但运算速度上比DRAM慢,在成本上也比DRAM更为便宜。需 要说明的是,处理器可以直接访问内存,例如,如图2中所示,处理器可以直接访问DRAM和SCM。
然而,DRAM和SCM在本实施例中只是示例性的说明,在一些可能的情况中,内存可以只包含DRAM和SCM中的其中一种。或者,内存还可以包括其他随机存取存储器,例如静态随机存取存储器(Static Random Access Memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(Programmable Read Only Memory,PROM)、可抹除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)等。另外,内存还可以是双列直插式存储器模块或双线存储器模块(Dual In-line Memory Module,简称DIMM),即由动态随机存取存储器(DRAM)组成的模块。在后续的实施例中,均以内存包括一种存储器为例那个说明,但这并不构成对内存包括的存储器类型数量的限制。
硬盘读写数据的速度比内存慢,通常用于持久性地存储数据。以存储节点20a为例,其内部设置一个或多个硬盘;或者,在存储节点20的外部挂载一个硬盘框(如图2所示),在硬盘框中设置多个硬盘。无论哪一种部署方式,这些硬盘都可以视作存储节点20所包含的硬盘。其中,这里的硬盘可以是指物理硬盘,也可以是指一个包括多个物理硬盘的逻辑域或故障域,本申请实施例对此不作限定。另外,物理硬盘的类型为固态硬盘、机械硬盘,或者其他类型的硬盘。
需要说明的是,内存包括的存储器与硬盘是完全不同的两种存储介质,二者的性能完全不同。其中,相较于硬盘,内存的数据读取速度更快,时延更小,也即,内存的性能高于硬盘的性能。基于此,在本申请实施例中,如图1和2中所示,将各个存储节点20中的内存称为第一介质层,将各个存储节点20中的硬盘称为第二介质层。其中,第一介质层的性能高于第二介质层。当然,也可以将各个存储节点20中的内存称为第二介质层,将各个存储节点20中的硬盘称为第一介质层。此时,第一介质层的性能低于第二介质层。可选地,当内存包括多种性能不同的存储介质时,也可以将内存中的每种存储介质作为一个介质层,例如,如图1和图2中,各个存储节点中的DRAM组成一个介质层,SCM组成一个介质层,硬盘组成一个介质层。可选地,在一些可能的情况中,第一介质层可以是固态硬盘(solid state disk,SSD),第二介质层可以为硬盘驱动器(hard disk drvie,HDD)。后续实施例中将以内存中包括一种存储介质,将各个存储节点的内存中的该种存储介质作为第一介质层、硬盘作为第二介质层为例进行说明。
网卡用于与其他存储节点进行通信,或者,用于与该存储节点耦合的硬盘框进行通信。另外,网卡可以直接访问存储节点的内存,如图2所示,网卡可以直接访问DRAM和SCM。
在本发明另一实施例中,存储系统不包含计算节点。在本发明另一实施例中,存储系统中的计算节点和存储节点可以在同一个节点。关于存储系统的具体形态,本发明对此不作限定。另外,本发明实施例中的存储节点中的一个或多个CPU的功能可以由现场可编程门阵列(Field Programmable Gate Array,FPGA)或者专用集成电路(Application-specific integrated circuit,ASIC)等或者上述多种组合来实现。本发明实施例中将上述各种实现方式统称为由一个或多个处理器实现。
图3是本申请实施例提供的另一种存储系统的结构示意图。图3所示的存储系统为一个存储阵列,该存储阵列包括至少一个控制器(如图3所示的控制器11)和多个硬盘22。控制器11通过存储区域网络(英文:storage area network,SAN)与主机(图中未示出)连接。控制器11可以是一种计算设备,如服务器、台式计算机等等。在控制器11上安装有操作系统以及应用程序。控制器11可以接收来自主机的输入输出(I/O)请求。控制器11还可以存储I/O请求中携带的数据(如果有的话),并且将该数据写入硬盘22中。其中,硬盘22为机械硬盘或固态硬盘,固态硬盘是以闪存(英文:flash memory)芯片为介质的存储器,又名固态驱动器(Solid State Drive,SSD)。
图3仅是示例性说明,在实际应用中存储阵列可包含两个或两个以上控制器,每个控制器的物理结构和功能与控制器11类似,并且本实施例并不限定控制器之间,以及任意一个控制器与硬盘22之间的连接方式。只要各个控制器之间,以及各个控制器和硬盘22之间能够相互通信。另外,在本实施例中,硬盘可以是指物理硬盘,也可以是指一个包括多个物理硬盘的逻辑域或故障域,本申请实施例对此不作限定。
如图3所示,控制器11包括接口卡110、一个或多个处理器112和接口卡113。
接口卡110用于和主机通信,控制器11通过接口卡110接收主机的操作指令。处理器112可能是中央处理器(英文:central processing unit,CPU)。在本申请实施例中,处理器112用于接收来自主机的I/O请求、处理所述I/O请求。所述I/O请求是写数据请求或者读数据请求,处理器112还可以将写数据请求中的数据发送给硬盘22。接口卡113,用于和硬盘22通信,控制器11通过接口卡113将写数据请求(包括数据、数据的逻辑地址以及数据的虚拟地址)发送给硬盘22存储。本发明实施例中的处理器112还可以由现场可编程门阵列(Field Programmable Gate Array,FPGA)或者专用集成电路(Application-specific integrated circuit,ASIC)等或者上述多种组合来实现。本发明实施例中将上述各种实现方式统称为由一个或多个处理器实现。
可选地,控制器11还包括内存111。内存111用于临时存储从主机接收的数据或从硬盘22读取的数据。控制器11接收主机发送的多个写数据请求时,可以将多个写数据请求中的数据暂时保存在内存111中。当内存111的容量达到一定阈值时,将内存111存储的数据、数据的虚拟地址以及为数据分配的逻辑地址发送给硬盘22。硬盘22存储接收到的数据。内存111包括易失性存储器,闪存芯片或其组合。易失性存储器例如为随机访问存储器(英文:random-access memory,RAM)。闪存芯片例如软盘、硬盘、光盘等各种可以存储程序代码的机器可读介质。内存111具有保电功能,保电功能是指系统发生掉电又重新上电时,内存111中存储的数据也不会丢失。
需要说明的是,控制器11包括的内存111与硬盘22为完全不同的两种存储介质。其中,相较于硬盘,内存的数据读取速度更快,时延更小,也即,内存的性能高于硬盘的性能。在本申请实施例中,将性能更高的内存111称为第一介质层,将性能相对内存而言较低的多个硬盘22称为第二介质层,也即,第一介质层的性能高于第二介质层。或者,将性能更高的内存111称为第二介质层,将性能相对内存而言较低的多个硬盘22称为第一介质层,此时,第一介质层的性能低于第二介质层。
纠删码是一种数据冗余技术,相对于多副本策略,纠删码具有更高的磁盘利用率。例如Reed-Solomon码就是一种常见的纠删码。纠删码技术主要是通过纠删码算法将原 始的数据进行编码得到冗余,并将数据和冗余一并存储起来,以达到容错的目的。其基本思想是将n块原始的数据元素(数据单元)通过一定的计算,得到m块冗余元素(校验单元),磁盘利用率为n/(n+m)。对于这n+m块的元素,当其中任意的m块元素出错(包括原始的数据元素和冗余元素)时,均可以通过对应的重构算法恢复出原来的n块数据元素。生成校验的过程被成为编码(encoding),恢复丢失的数据元素的过程被称为解码(decoding)。本申请所提到的纠删码配比是指数据元素n与冗余元素m的比值。基于纠删码技术的n块数据元素和m块冗余元素属于一个分条;其中,数据元素也称为数据单元,冗余元素也称为校验单元。接下来对本申请实施例提供的数据存储方法进行介绍。
图4是本申请实施例提供的一种分条管理方法的流程图。该方法可以应用于图1或图2所示的存储系统中,也可以应用于图3所示的存储系统中。参见图4,该方法包括以下步骤:
步骤401:在第一介质层中,按照第一纠删码配比缓存多个第一分条的数据单元及校验单元。
由前述图1至图3中的介绍可知,存储系统中包括多种不同的存储介质,本申请实施例中以存储系统包括两种不同的存储介质且第一介质层的性能高于第二介质层的性能为例对该数据存储方法进行解释说明。例如,第一介质层是DRAM,第二介质层是硬盘。在第一介质层中,按照第一纠删码配比存储数据,即遵从第一纠删码配比的分条缓存在DRAM,即分条中的数据单元和校验单元均缓存在DRAM。所谓第一介质层的性能高于第二介质层的性能是指第一介质层比第二介质层的读写速度更快、时延更小。
在本申请实施例中,针对第一介质层和第二介质层,可以分别获取第一介质层对应的第一纠删码配比和第二介质层对应的第二纠删码配比,进而在第一介质层中,按照第一纠删码配比缓存第一分条的数据单元及校验单元。即按照第一纠删码配比将接收的数据划分为数据单元,基于纠删码算法得到相应的校验单元,从而得到遵从第一纠删码配比的第一分条。在本发明实施例中,当分条中的数据单元和校验单元缓存在DRAM中时,该分条为在存储系统中未持久化存储的分条。当分条中的数据单元和校验单元存储在存储系统的非易失性存储介质上时,该分条为在存储系统中持久化存储的分条。进一步地,本发明实施例中,当第一介质层包含SCM时,当分条中的数据单元和校验单元存储在第一介质层的SCM时,仍称该分条为在存储系统中未持久化存储的分条,即分条缓存在SCM中。另一种实现,当第一介质层不包含SCM,第二介质层包含SCM,当分条中的数据单元和校验单元存储在第二介质层的SCM时,仍称该分条为在存储系统中持久化存储的分条。
其中,如果该数据存储方法应用于图1或图2所示的存储系统中,则可以由存储系统中的管理节点获取第一纠删码配比和第二纠删码配比。其中,该管理节点为该存储系统的多个存储节点中的任一存储节点,或者是,该管理节点为该存储系统中独立于存储节点之外用于对各个存储节点进行管理的一个节点。另外,该管理节点可以在初始化时获取第一纠删码配比和第二纠删码配比,也可以在存储系统运行过程中获取第一纠删码配比和第二纠删码配比,本申请实施例对此不做限定。如果该数据存储方 法应用于图3所示的存储系统中,则可以由存储系统中的控制器来获取第一纠删码配比和第二纠删码配比。接下来以管理节点获取第一纠删码配比和第二纠删码配比为例来进行介绍。
在一种可能的实现方式中,管理节点根据该存储系统的拓扑结构和容错能力确定第一纠删码配比或第二纠删码配比。其中,第一纠删码配比和第二纠删码配比可以均根据存储系统的拓扑结构和容错能力确定得到,或者,第一纠删码配比根据存储系统的拓扑结构和容错能力确定得到,第二纠删码配比根据第一纠删码配比获得,或者,第二纠删码配比根据存储系统的拓扑结构和容错能力确定得到,第一纠删码配比根据第二纠删码配比获得,本申请实施例对此不作限定。另外,拓扑结构用于指示存储系统所包含的存储节点的数量,容错能力用于指示存储系统容忍出错的存储节点的数量。其中,该存储系统容忍出错的存储节点的数量等于第一纠删码配比对应的校验单元的数量,或者第二纠删码配比对应的校验单元的数量。
其中,管理节点首先获取存储系统的拓扑结构。示例性地,管理节点中可以存储有存储系统的拓扑结构,或者,接收由其他设备发送的该存储系统的拓扑结构,或者,接收用户输入的存储系统的拓扑结构。该拓扑结构能够指示存储系统的组成,例如,该存储系统内所包含的存储节点数量,每个存储节点包括的子节点的数量。其中,当存储节点为一台服务器时,存储节点的子节点的数量是指服务器包括的物理硬盘的数量或者对相应存储节点包括的物理硬盘进行划分得到的硬盘逻辑域的数量。当存储节点为一个机柜时,存储节点的子节点的数量是指机柜内包括的服务器的数量。通常,一个机柜包括多个服务器。
例如,假设存储系统内包括4个服务器,每个服务器包括60个物理硬盘,其中,每15个物理硬盘划分为1个硬盘逻辑域,则根据该拓扑结构可知,该存储系统包括4个存储节点,每个服务器即为一个存储节点。每个存储节点包括4个硬盘逻辑域,也即,每个存储节点包括的子节点的数量为4。
除了获取存储系统的拓扑结构,管理节点还获取该存储系统的安全级别和容错能力。在一种可能的实现方式中,管理节点上显示有配置界面,该配置界面包括安全级别配置项和容错能力配置选项。用户在该安全级别配置项中输入所需的安全级别,并在容错能力配置选项中输入允许出错的节点数量t,t为大于或等于的整数。管理节点获取用户输入的安全级别和允许出错的节点数量t。其中,安全级别包括服务器级安全、机柜级安全等。其中,服务器级安全用于指示存储系统最多能够容忍t个服务器出现故障。机柜级安全用于指示该存储系统最多能够容忍t个机柜出现故障。可选地,管理节点也可以根据该存储系统的拓扑结构,按照预设原则确定该存储系统的安全级别,其中,该预设原则是指能够保证该存储系统的可靠性的计算原则,本申请实施例对此不作限定。另外,该存储系统的容错能力也可以是一个系统默认值,本申请实施例对此不作限定。
在获取到存储系统的拓扑结构、容错能力和安全级别之后,管理节点通过下述公式(1)确定第一纠删码配比中的数据单元的数量的取值范围。
N≤(k*M)-M (1)
其中,N为第一纠删码配比对应的数据单元的数量。k为存储系统中包含的节点 的数量,当安全级别为服务器级安全时,上述节点指服务器,当安全级别为机柜级安全时,上述节点为机柜。M为容错能力所指示的该存储系统能够容忍出错的节点的数量,也即,第一纠删码配比中的校验单元的数量。需要说明的是,M可以是默认值,也可以是由用户自定义的值,且M为大于或等于1的整数,例如,M=2。
在确定出第一纠删码配比中的数据单元的数量的取值范围之后,管理节点根据该取值范围和M,确定得到多个第一候选纠删码配比,每个候选纠删码配比对应所述取值范围的一个值。之后,从多个第一候选纠删码配比中选择对应的写放大值最小的纠删码配比作为第一纠删码配比。
其中,写放大是指存储节点实际写入的数据量大于从计算节点接收到的数据量。在本申请实施例中,写放大通过写放大值来表征,对于任一个第一候选纠删码配比而言,该第一候选纠删码配比对应的写放大值等于该第一候选纠删码配比的数据单元和校验单元的总数量与数据单元的数量之间的比值。例如,对于纠删码配比6:2而言,该纠删码配比用于表征每6个数据单元对应2个校验单元,如此,该纠删码配比对应的写放大值即为(6+2)/6。
示例性地,假设根据该存储系统的拓扑结构指示该存储系统包括4个服务器,用户输入的安全级别为服务器级安全,则k=4。假设该存储系统能够容忍出错的存储节点的数量为2个,也即,M=2。根据上述公式(1)可得,第一纠删码配比的数据单元的数量取值范围为N≤4*2-2,也即,N≤6。在确定第一纠删码配比的数据单元的数量取值范围之后,根据该取值范围和校验单元的数量可得到常用的多个第一候选纠删码配比分别为6:2、4:2和2:2。由于6:2这一配比是三个配比中写放大最小的,因此,将6:2这一配比作为第一纠删码配比。
除了第一纠删码配比之外,管理节点还用于根据该存储系统的拓扑结构和容错能力获取第二纠删码配比。具体的,管理节点通过下述公式(2)确定第二纠删码配比中的数据单元的数量取值范围。
X≤(i*Y)-Y (2)
其中,X为第二纠删码配比对应的数据单元的数量,且X大于N。i为该存储系统包含的节点的子节点的数量,其中,当安全级别为服务器级安全时,i为该存储系统包含的服务器的子节点的数量,其中,服务器的子节点可以是指该服务器连接的物理硬盘或硬盘逻辑域。当安全级别为机柜级安全时,i为该存储系统包含的机柜的子节点的数量,其中,机柜的子节点是指该机柜包含的服务器的数量。Y为容错能力所指示的该存储系统能够容忍出错的节点的数量,也即,Y为第二纠删码配比对应的校验单元的数量。需要说明的是,Y可以是默认值,也可以是由用户自定义的值,且Y大于或等于1,例如,Y=2。另外,Y与M可以相等,也可以不相等,本申请实施例对此不作限定。还需要说明的是,安全级别可以由前述介绍的配置方式由用户进行配置,在这种情况下,管理节点直接获取用户配置的安全级别。或者,该安全级别也可以是管理节点根据该存储系统的拓扑结构按照预设原则确定得到的,其中,该预设原则是指能够保证该存储系统的可靠性的计算原则,本申请实施例对此不作限定。
在确定出第二纠删码配比中的数据单元的数量取值范围之后,管理节点根据该取值范围和Y,确定第二纠删码配比。
例如,仍以前述的包含4个服务器的存储系统为例,假设每个服务器包括4个硬盘逻辑域,则在安全级别为服务器级安全时,该存储系统中包含的每个服务器下有4个子节点,这样,4个服务器的子节点的总数量为16。假设容错能力所指示的该存储系统能够容忍出错的节点的数量为2,也即Y=2,则根据上述公式(2)可得,X≤(16*2)-2,也即,X≤30。根据该取值范围,考虑到系统可靠性约束机制,管理节点可选择数据单元的数量为24,此时,第二纠删码配比即为24:2。
由上文可知,第一纠删码配比中的N和第二纠删码配比中的X不相等,且N小于X。另外,第一纠删码配比中的M和第二纠删码配比中的Y可以相等也可以不等。除此之外,N和M的比值不等于X和Y的比值。
上述介绍了分别根据存储系统的拓扑结构和容错能力确定第一纠删码配比和第二纠删码配比的实现过程。在一些可能的实现方式中,在参考上述方式确定得到第一纠删码配比之后,管理节点根据第一纠删码配比N:M和预设的w确定第二纠删码配比X:Y。其中,X等于w*N,Y等于M或大于M。或者,在参考上述方式确定得到第二纠删码配比之后,管理节点根据第二纠删码配比X:Y和预设的w确定第一纠删码配比N:M。其中,N等于X/w,M等于Y或小于Y。
管理节点获得了第一纠删码配比和第二纠删码配比之后,计算第二纠删码配比中的数据单元的数量X和第一纠删码配比中的数据单元的数量N之间的比值,该比值就等于在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数w。例如,当第二纠删码配比中的数据单元的数量X=24,第一纠删码配比中的数据单元的数量N=6,则可以确定在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数w=4。由此可见,在前述根据第一纠删码配比获得第二纠删码配比或者是根据第二纠删码配比获得第一纠删码配比的实现方式中,预设的w实际上就是在第一介质层中按照第一纠删码配比存储的数据包括的第一校验矩阵的个数。其中,一个第一校验矩阵为一个遵从第一纠删码配比的分条。
本发明实施例的另一种实现方式,可以由管理员通过管理节点配置第一纠删码配比和第二纠删码配比。
在获得第一纠删码配比、第二纠删码配比以及w之后,后续,当存储节点接收到计算节点发送的写数据请求时,存储节点按照第一纠删码配比和w在第一介质层中写入数据。其中,写数据请求包括待写入的数据。接下来以存储系统中的目标存储节点接收计算节点发送的写数据请求为例对该过程进行说明。
示例性地,目标存储节点接收计算节点发送的写数据请求,当接收的待写入数据的数据量达到N个数据单元的尺寸时,目标存储节点将这些待写入数据划分成N个数据单元,并根据该N个数据单元生成M个校验单元。所述N个数据单元和M个校验单元属于一个子数据,该子数据对应一个第一校验矩阵,其中,该第一校验矩阵包括该N个数据单元和M个校验单元。之后,目标存储节点将第一校验矩阵包含的N个数据单元和M个校验单元存储至该存储系统的第一介质层中。与此同时,目标存储节点继续接收计算节点发送的写数据请求,按照上述方式获得另一个第一校验矩阵,并存储至所述第一介质层。如此,当按照上述方式该目标存储节点将该w个第一校验矩阵包括的数据单元和校验单元作为写入至第一介质层后,即可以执行后续步骤402。
例如,第一纠删码配比为6:2,也即,N=6,M=2,且w=4,当目标存储节点接收到计算节点发送的待写入数据的数量达到6个数据单元的尺寸时,将这些待写入数据划分成6个数据单元,根据这6个数据单元生成2个校验单元,之后,生成包括6个数据单元和2个校验单元的第一校验矩阵。将第一校验矩阵包括的8个单元存储至该存储系统中的各个存储节点的内存中。
具体的,目标存储节点可以将各个第一校验矩阵包括的校验单元分布在同一个存储节点上,对于各个第一校验矩阵包括的数据单元,则可按照平均分布的原则分布在各个存储节点中。
参见图5,假设该存储系统包括的存储节点为4个服务器,这4个服务器包括的内存为第一介质层。第一纠删码配比为6:2,也即,N=6,M=2,并且,w=4。由于M=2,所以每个服务器的内存中最多允许存储第一校验矩阵的8个单元中的2个单元,基于此,目标存储节点在获得一个第一校验矩阵之后,将第一校验矩阵包括的2个校验单元存储至自身内存中,而将剩余的6个数据单元转发至其他3个服务器中存储,例如,参见图5,在其他3个服务器中的每个服务器上各存储2个数据单元。如此,在将4个第一校验矩阵存储至第一介质层后,4个校验矩阵包含的32个单元在存储系统的存储节点的分布如图6所示。
图5和图6仅是本申请实施例示例性的给出的一种第一校验矩阵中各个单元的可能分布。可选地,目标存储节点也可以根据存储系统每个存储节点最多允许分布的单元数量和第一校验矩阵包括的单元的数量,随机将多个单元平均分布在各个存储节点中。也即,不限制校验单元在同一个存储节点上。例如,在上述的示例中,可以将一个第一校验矩阵中的1个数据单元和1个校验单元存储在目标存储节点上,另外一个校验单元和一个数据单元存储至另外一个存储节点中,这样,还剩余4个数据单元,这4个数据单元分别存储在剩余的两个存储节点上。进一步地,当第一校验矩阵中的两个校验单元为校验单元p和校验单元q时,4个第一校验矩阵中的4个校验单元p存储在一个存储节点中,4个校验单元q存储在一个存储节点中,且4个校验单元p所在的存储节点和4个校验单元q所在的存储节点可以不同。
上述介绍了存储系统为图1或图2所示的存储系统时,在第一介质层中按照第一纠删码配比缓存数据的过程。可选地,当存储系统为图3所示的存储阵列中,可以由存储阵列中的控制器来确定第一纠删码配比、第二纠删码配比和第一介质层中存储的数据包括的第一校验矩阵的个数w,确定的方法参考前述介绍的方法,本申请实施例在此不再赘述。在确定第一纠删码配比、第二纠删码配比和w之后,控制器可以按照第一纠删码配比和w在第一介质层中缓存数据,其中,第一介质层即为存该存储系统包括的控制器的内存。需要说明的是,控制器参考前述介绍的方法生成w个第一校验矩阵。根据该存储系统包括的控制器的数量和/或每个控制器包括的内存的数量,参考前述的方法将每个第一校验矩阵包括的数据单元和校验单元分布存储至各个控制器的内存中。
步骤402:将第一介质层中缓存的多个第一分条的数据单元和新的校验单元按照第二纠删码配比持久化存储到第二介质层。
其中,新的校验单元是由多个第一分条的校验单元生成的,多个第一分条的数据 单元和新的校验单元属于遵从第二纠删码配比的分条(第二分条)。
在按照第一纠删码配比在第一介质层中缓存的数据达到设定条件时,存储节点或控制器将达到设定条件的这部分数据,按照第二纠删码配比存储到至第二介质层。例如,设定条件可以是:第一介质层中缓存的遵从第一纠删码配比的多个第一分条中的数据单元的个数达到遵从第二纠删码配比的第二分条中数据单元的个数。即在第一介质层中凑满第二分条。其中,第二纠删码配比为X:Y,也即,存储第二介质层中的数据包括X个数据单元和Y个校验单元。
接下来仍以该数据存储方法应用于图1或图2所示的存储系统中为例来对本步骤进行说明。在该存储系统中,第二介质层包括该存储系统内的存储节点所包括的硬盘。
示例性地,在凑齐w个第一校验矩阵,也即在将w个第一校验矩阵包括的数据单元和校验单元缓存至第一介质层后,根据第一介质层中存储的数据包括的w个第一校验矩阵中每个第一校验矩阵所包含的N个数据单元获得X个数据单元,X是N的整数倍;计算获得Y个校验单元以生成第二校验矩阵,第二校验矩阵包括X个数据单元和Y个校验单元;将第二校验矩阵写入第二介质层中。其中,上述过程可以通过以下几种不同的实现方式来实现。一个第二校验矩阵为一个遵从第二纠删码配比的分条。
第一种实现方式:目标存储节点在通过上述步骤401凑齐w个第一校验矩阵之后,根据w个第一校验矩阵包括的w×N个数据单元计算得到第二校验矩阵中的Y个校验单元。之后,目标存储节点将计算得到的Y个校验单元存储至第二介质层。对于其他存储节点而言,各个存储节点在满足设定条件时,将自身存储的属于w个第一校验矩阵的数据单元存储至第二介质层。如此,存储至第二介质层中的w×N个数据单元即为第二校验矩阵的X个数据单元,目标存储节点计算得到的Y个校验单元即为第二校验矩阵包括的Y个校验单元。
其中,目标存储节点在将自身存储的数据单元和Y个校验单元存储至第二介质层中时,如果第二介质层包括的硬盘的数量大于第二校验矩阵所包含的单元的总数量,则目标存储节点根据计算得到的校验单元的数量Y,从自身所包括的多个硬盘中选择Y个硬盘。然后目标存储节点将Y个校验单元写入至选择的硬盘中,其中,每个硬盘上写入一个单元。可选地,如果目标存储节点上还存储有属于w个第一校验矩阵的数据单元,则目标存储节点从自身所包括的硬盘中为每个数据单元选择一个硬盘,并将数据单元写入至选择的硬盘上,其中,每个硬盘上也同样写入一个单元。
可选地,如果第二介质层包括的硬盘的数量不大于第二校验矩阵所包含的单元的总数量,则存储节点根据第二校验矩阵包括的校验单元的数量,确定第二介质层中每个硬盘上允许分布的最大单元数。之后,如果目标存储节点上还存储有属于w个第一校验矩阵的数据单元,则按照该最大单元数和自身存储的属于w个第一校验矩阵的数据单元的数量和Y从自身所包括的硬盘中选择多个硬盘,进而将存储的数据单元和校验单元写入至选择的多个硬盘中。当然,如果目标存储节点中未存储属于w个第一校验矩阵的数据单元,则按照最大单元数和Y从自身所包括的硬盘中选择多个硬盘,从而将Y个校验单元写入至选择的硬盘上。在这种情况下,一个硬盘上分布可能存储第二校验矩阵中的多个单元,但是,存储的单元的数量不超过硬盘允许分布的最大单元数。对于除目标存储节点之外的其他存储节点,均可参考上述方法将自身存储的属于 w个第一校验矩阵的数据单元写入至第二介质层中。
例如,参见图7,假设第二介质层包括16个硬盘逻辑域,且16个硬盘逻辑域分属于4个存储节点,w个第一校验矩阵一共包括24个数据单元,也即,第二校验矩阵包括的数据单元的数量为24。目标存储节点计算得到的校验单元的数量为2。由于第二校验矩阵包括的校验单元的数量为2,因此可知每个硬盘逻辑域上允许分布的最大单元数为2,如此,各个存储节点在按照上述介绍的方式将24个数据单元和2个校验单元存储至第二介质层时,对于4个存储节点中的3个存储节点,这3个存储节点中每个存储节点的每个硬盘逻辑域内存储2个单元,也即,每个存储节点上一共存储8个单元,而对于另外一个存储节点,则可以在该存储节点的一个硬盘逻辑域上存储2个单元,或者是在该存储节点的两个硬盘逻辑域上各存储一个单元。
第二种实现方式:目标存储节点在凑齐w个第一校验矩阵之后,根据w个第一校验矩阵包括的w×M个校验单元获得第二校验矩阵中的Y个校验单元。之后,目标存储节点将计算得到的Y个校验单元存储至第二介质层。对于其他存储节点而言,各个存储节点在自身缓存中的数据量达到一定阈值时,将自身存储的属于w个第一校验矩阵的数据单元存储至第二介质层。
对于第二种实现方式,分别针对以下几种不同的情况进行说明。
(1)当w个第一校验矩阵包括的所有校验单元均存储在目标存储节点中时,目标存储节点获取自身存储的w×M个校验单元,进而根据该w×M个校验单元获得Y个校验单元。
举例来说,当每个第一校验矩阵包括的M个校验单元分别为校验单元p和校验单元q时,目标存储节点对存储的w个校验单元p进行异或运算或者其他计算方式得到第二校验矩阵中的校验单元p’,对存储的w个校验单元q进行异或运算或者其他计算方式得到第二校验矩阵中的校验单元q’。由此可见,本申请实施例中,通过直接对各个第一校验矩阵包括的M个校验单元进行计算即能够得到第二校验矩阵的校验单元,相较于根据w个第一校验矩阵中的所有数据单元重新计算校验单元,减少了计算量。并且,由于各个第一校验矩阵中的校验单元均存储在同一个存储节点上,因此,该存储节点能够直接获取存储的校验单元来获得第二校验矩阵中的校验单元,相较于将校验单元分布存储在各个存储节点中的情况,无需跨存储节点中获取校验单元,减少了网络转发量。
在计算获得Y个校验单元之后,目标存储节点参考前述第一种实施例中介绍的方法将Y个校验单元存储至第二介质层中,或者,将自身存储的属于w个校验矩阵的数据单元和Y个校验单元存储至第二介质层中。其他存储节点在满足设定条件之后,将各自存储的数据单元在存储至第二介质层中,如此,各个存储节点中存储的w个第一校验矩阵则合并为第二校验矩阵存储至了第二介质层中。
图8是本申请实施例示出的一种根据w个第一校验矩阵合并得到第二校验矩阵的原理示意图。如图8所示,w=4,第一个第一校验矩阵中的前六列元素a1至a6为6个数据单元,剩下两列元素p1和q1为2个校验单元。其中,p1为校验单元p,q1为校验单元q。同理,第二个第一校验矩阵中的a7至a12为6个数据单元,并且,剩下两列元素p2和q2为2个校验单元,以此类推。将每个校验矩阵的前6列元素取出, 组成第二校验矩阵的24个数据单元。将每个第一校验矩阵中的校验单元p(也即p1至p4)取出,进行异或运算或者采用其他计算方式得到第二校验矩阵中的校验单元p’,将每个第一校验矩阵中的校验单元q(也即q1至q4)取出,进行异或运算,或者采用其他计算方式得到第二校验矩阵中的校验单元q’。
需要说明的是,图8示例性地说明了将w个第一校验矩阵合并为第二校验矩阵的过程,在一些可能的应用场景中,例如,当第一介质层的性能低于第二介质层的性能,将第一介质层中的数据读取到第二介质层的过程中,可能需要将第二校验矩阵拆分为w个第一校验矩阵,在这种情况下,只需将上述过程逆向执行即能够得到w个第一校验矩阵,本申请实施例对此不做限定。
可选地,当每个第一校验矩阵包括的M个校验单元为校验单元r时,目标存储节点对存储的w个校验单元r进行异或运算,增量计算得到第二校验矩阵中的校验单元r’,之后,目标存储节点可以获取自身以及其他存储节点上存储的各个第一校验矩阵中的数据单元,根据获取的w×N个数据单元计算得到校验单元p’和校验单元q’。将计算得到的校验单元r’、校验单元p’和校验单元q’作为第二校验矩阵中的Y个校验单元。由此可见,在该种实现方式中,Y和M不相等。并且,由于根据数据单元计算校验单元r’的过程较为复杂,因此,本申请实施例中根据各个第一校验矩阵中的校验单元r增量计算得到第二校验矩阵中的校验单元r’,减小了计算开销。另外,根据w×N个数据单元计算得到校验单元p’和校验单元q’,使得第二校验矩阵中包含了3个校验单元,提升了第二介质层中存储的数据的冗余度,使得容错能力得以提升。
例如,参见图9,假设有3个第一校验矩阵,每个第一校验矩阵包括7个数据单元和1个校验单元r,目标存储节点将存储的3个校验单元r进行异或运算得到第二校验矩阵中的校验单元r’。之后,目标存储节点获取自身及其他各个存储节点中存储的21个数据单元,根据这21个数据单元计算得到校验单元p’和校验单元q’。如此,第二校验矩阵包括的Y个校验单元即为生成的校验单元p’、校验单元q’和校验单元r’。由此可见,通过该种实现方式,能够提升第二介质层中存储的数据的冗余度,使得容错能力得以提升,同时还能够减少部分计算开销。
在计算获得Y个校验单元之后,目标存储节点同样参考前述第一种实现方式中介绍的方法将存储的数据单元和Y个校验单元存储至第二介质层中,其他存储节点在自身缓存中的数据量达到一定阈值之后,将各自存储的数据单元在存储至第二介质层中,本申请实施例在此不再赘述。
(2)当各个第一校验矩阵包括的M个校验单元分散存储在不同的存储节点中时,目标存储节点从各个存储节点中获取存储的校验单元,进而根据获取到的w×M个校验单元获得Y个校验单元。其中,目标存储节点根据获取到的w×M个校验单元获得Y个校验单元的实现方式参考上述(1)中的实现方式,本申请实施例不再赘述。在获得Y个校验单元之后,目标存储节点参考前述第一种实现方式中介绍的方法将存储的数据单元和Y个校验单元存储至第二介质层中,其他存储节点在自身缓存中的数据量达到一定阈值之后,将各自存储的数据单元在存储至第二介质层中,本申请实施例在此不再赘述。
第三种实现方式:目标存储节点在凑齐w个第一校验矩阵之后,将自身存储的属 于w个第一校验矩阵的单元写入至第二介质层,其他各个存储节点在自身缓存存储的数量达到一定阈值之后,将自身存储的属于w个第一校验矩阵的单元也写入至第二介质层中。之后,目标存储节点获取写入至第二介质层中的w×M个校验单元,进而根据w×M个校验单元计算获得Y个校验单元,将计算得到的Y个校验单元作为第二校验矩阵的Y个校验单元写入至第二介质层中。
其中,如果第二介质层包括的硬盘的数量大于第二校验矩阵所包含的单元的总数量,则各个存储节点在将自身存储的属于w个第一校验矩阵的数据单元和校验单元写入至第二介质层时,对于数据单元,可以为每个数据单元选择一个硬盘,并将每个数据单元写入至为相应数据单元选择的硬盘中,其中,为不同的数据单元选择的硬盘也不同。这样,对于第二校验矩阵包括的X个数据单元,这X个数据单元将被写入至X个硬盘中。对于w个第一校验矩阵中的校验单元,各个存储节点可以将自身存储的校验单元存储至除上述的X个硬盘中的剩余硬盘中。
需要说明的是,在写入校验单元时,可以在每个硬盘上写入一个校验单元,这样,w×M个校验单元将被写入至w×M个硬盘中。或者,所有的校验单元可以写入至一个硬盘中。或者,可以在M个硬盘上写入w×M个校验单元,其中,M个硬盘中每个硬盘上写入的校验单元为第一校验矩阵中位于同一列的校验单元。例如,当M=2时,两个校验单元中的一个校验单元为校验单元p,另一个校验单元为校验单元q,则各个第一校验矩阵中的校验单元p写入至一个硬盘上,校验单元q写入至另一个硬盘上。
各个存储节点在将自身存储的属于w个第一校验矩阵的数据单元和校验单元写入至第二介质层中之后,目标存储节点从第二介质层中获取w×M个校验单元。其中,如果w×M个校验单元将被写入至w×M个硬盘中,则目标存储节点从这w×M个硬盘中读取w×M个校验单元。如果所有的校验单元被写入至一个硬盘中,则目标存储节点从该硬盘中一次性获取w×M个校验单元,这样,能够减少网络通信次数,节省带宽资源。如果w×M个校验单元被写入至M个硬盘中,且M个硬盘中每个硬盘上写入的校验单元为第一校验矩阵中位于同一列的校验单元,则目标存储节点从各个硬盘上读取位于相同列的校验单元,从而得到w×M个校验单元。如此,在一定程度上也可以减少网络通信次数,节省带宽资源。
在获取到w×M个校验单元之后,目标存储节点参考前述第一种实现方式中介绍的方法,根据该w×M个校验单元计算获得Y个校验单元,进而将这Y个计算单元分别写入至Y个硬盘,其中,每个硬盘上写入一个校验单元。并且,写入Y个校验单元的Y个硬盘不为前述写入数据单元的X个硬盘中的硬盘。
可选地,如果第二介质层包括的硬盘的数量不大于第二校验矩阵所包含的单元的总数量,则各个存储节点可参考前述第一种实现方式中介绍的方法,在一个硬盘上写入两个或两个以上的单元,只要不超出允许存储的最大单元数即可。同样的,在这种情况下,各个第一校验矩阵包括的M个校验单元中可以存储在同一个存储节点下的硬盘上,或者,各个第一校验矩阵包括的M个校验单元中位于同一列的校验单元可以存储在一个存储节点下的硬盘中,例如,存储在一个存储节点下的同一个硬盘逻辑域上,或者是存储在一个存储节点下的一个物理硬盘上,以此来减少计算第二校验矩阵中Y个校验单元时所需的网络转发次数。
举例说明,图10是本申请实施例提供的一种在第二介质层中按照第二纠删码配比写入数据的示意图。如图10所示,第一校验矩阵的个数w=4,每个第一校验矩阵包括6个数据单元和2个校验单元。2个校验单元分别为校验单元p和校验单元q。第二介质层包括分布在4个存储节点(a1至a4)上的16个硬盘逻辑域,每个存储节点上分布有4个硬盘逻辑域,每个硬盘逻辑域内包括16个物理硬盘。由于校验单元的数量为2,因此,每个硬盘逻辑域上最多允许分布第二校验矩阵中的2个单元。基于此,首先从16个硬盘逻辑域中选择两个硬盘逻辑域,例如,选择的两个硬盘逻辑域分别为存储节点a2上的硬盘逻辑域a21和存储节点a4上的硬盘逻辑域a41。之后,存储有各个第一校验矩阵的校验单元p的各个存储节点将自身存储的校验单元p写入至硬盘逻辑域a21的物理硬盘中,例如,每个第一校验矩阵中的校验单元p均写入至硬盘逻辑域a21的第一个物理硬盘中。同理,存储有各个第一校验矩阵的校验单元q的各个存储节点将自身存储的校验单元q写入至硬盘逻辑域a41的第一个物理硬盘中。之后,对于硬盘逻辑域a21,目标存储节点或存储节点a2对写入至硬盘逻辑域a21上的4个第一校验矩阵中的4个校验单元p进行异或运算,得到第二校验矩阵中的校验单元p’,将该校验单元p’存储在硬盘逻辑域a21上。由于每个硬盘逻辑域上最多允许存储第二校验矩阵中的两个单元,所以,在将计算得到的第二校验矩阵中的校验单元p’写入至硬盘逻辑域a21上之后,该硬盘逻辑域a21上最多还能再存储第二校验矩阵中的一个数据单元。同理,对于硬盘逻辑域a41,同样可以由目标存储节点或存储节点a4对其中存储的4个校验单元q进行异或运算,增量计算得到第二校验矩阵中的校验单元q’,并将其存储在硬盘逻辑域a41中,如此,硬盘逻辑域a41上最多也只能存储第二校验矩阵中的一个数据单元。之后,对于每个第一校验矩阵包括的6个数据单元,各个存储节点根据每个硬盘逻辑域上最多分布第二校验矩阵中的两个单元的原则,将24个数据单元分布在4个存储节点包括的16个硬盘逻辑域上。
上述介绍了存储系统为图1或图2所示的存储系统时,在第一介质层中按照第一纠删码配比缓存数据的过程。可选地,当存储系统为图3所示的存储阵列中,则上述存储节点执行的操作可以由控制器来执行,从而将第一介质层中数据包括的w个第一校验矩阵中的数据单元和校验单元合并为第二校验矩阵写入至第二介质层中,本申请实施例对此不再赘述。
在按照上述的数据存储方法存储数据后,当第一介质层包括的节点或第二介质层包括的硬盘发生故障时,如果数据已经存入至了第二介质层,也即,第二校验矩阵已经生成,则根据故障点的个数、故障位置以及第二校验矩阵中各个单元的分布位置,从第二介质层中读取除故障点之外的其他位置上的数据单元和校验单元进行重构,从而恢复出故障点中的数据。可选地,如果数据已存入至第一介质层,但是还未存入至第二介质层,则根据故障点的个数、故障位置以及各个第一校验矩阵中各个单元的分布位置,从第一介质层中读取未发生故障的位置上的数据单元和校验单元进行重构,从而恢复出第一介质层中故障点中的数据。
在本申请实施例中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数,也即,第一纠删码配比为小比例的配比,而第二纠删码配比为大比例的配比。在此基础上,在第一介质层中按照第一纠删码配比缓存数据,在第二介 质层中按照第二纠删码配比存储数据,而第一介质层的性能高于第二介质层,也就是说,在高性能介质层采用较小的纠删码配比来存储数据,而在低性能介质层采用较大的纠删码配比来存储数据。由于高性能介质层接收到的IO粒度比较小,所以,在高性能介质层通过较小的纠删码配比来存储数据时,每当接收到数据的尺寸达到该纠删码配比对应的N个数据单元的尺寸时,即可以凑满一个分条(N个数据单元和M个校验单元即可组成一个分条),相较于较大的纠删码配比而言,小的纠删码配比更容易凑满分条,从而使得分条中补0的数据量减少,降低了写放大,提高了存储空间利用率。例如,在高性能介质层中采用6:2的纠删码配比来存储数据,相较于采用24:2来存储数据,在指定的时间段内,根据接收到的小粒度的IO请求,凑齐6个数据单元比凑齐24个数据单元更为容易,这样,就不必在凑不齐24个数据单元时进行补0,也即,使得分条中补0的数据量减少,降低了分条中冗余数据量的占比,降低了写放大,提高了存储空间利用率。另外,在低性能介质层采用较大的纠删码配比来存储数据,能够减少冗余数据在存储空间中的占比,从而提高存储空间利用率。
另外,在本申请实施例中,在第一介质层中按照第一纠删码配比缓存的一份数据能够直接转换为符合第二纠删码配比的一份数据,进而存储至第二介质层,在提高存储系统的存储空间利用率的同时提高数据存储可靠性。同时上述转换过程中,多个第一分条中的数据单元不再需要参与运算,从而节省了存储系统的计算资源。
上述实施例主要介绍了第一介质层中缓存的遵从第一纠删码配比的多个第一分条中的数据单元的个数达到遵从第二纠删码配比的第二分条中数据单元的个数,即缓存在第一介质层的多个第一分条可以凑满第二分条时,将多个第一分条的数据单元以及第二分条的校验单元存储到第二介质层。
本发明另一实施例,为了提高数据存储的可靠性,可以将遵从第一纠删码配比的第一分条在存储系统中进行持久化存储,即存储到第二介质层,这样可以提高数据可靠性,防止存储系统故障导到第一介质层中没有持久化存储的第一分条数据丢失。当第一介质层中非持久化存储的第一分条与已经持久化存储的第一分条能够凑满第二分条,即第一介质层中非持久化存储的第一分条中的数据单元的个数与已经持久化存储的第一分条中的数据单元的个数等于第二分条中数据单元的个数时,一方面,读取已经持久化存储的第一分条中的校验单元,另一方面,读取第一介质层中非持久化存储的第一分条中的校验单元,生成第二分条的校验单元,具体实现方式可以参考前述实施例的描述,本发明实施例不再赘述。将非持久化存储的第一分条中的数据单元以及第二分条的校验单元持久化存储到第二介质层,实现第二分条在存储系统中持久化存储。
本发明上述实施例另外一种实现方式,可以由存储系统中的接口卡实现,例如由主机总线适配器(Host Bus Adapter,HBA)、网络接口卡(Network Interface Card,NIC)或扩展器(Expander)等实现,本发明对此不再赘述。
接下来对本申请实施例提供的数据存储装置进行介绍。
参见图11,本申请实施例提供了一种分条管理装置,该分条管理装置应用于前述介绍图1或图2所示的存储系统中的任意一个存储节点,也可以应用于图3所示存储阵列等。该分条管理装置包括:
获取单元1101,用于获取多个第一分条中的校验单元;其中,所述第一分条遵从第一纠删码配比;
生成单元1102,用于根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。本发明实施例上述各单元的实现可以参考本发明前述实施例中的描述。
可选地,新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
可选地,多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;获取单元1101,具体用于:
读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;
读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
进一步地,所述分条管理装置还包括存储单元1103;所述存储单元1103,用于持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
可选地,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述分条管理装置还包括存储单元1103;所述存储单元1103,用于持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
综上所述,在本申请实施例中,第一介质层和第二介质层的性能不同,基于此,在第一介质层和第二介质层中按照不同的纠删码配比来进行数据存储。由于不同的纠删码配比对应的写放大不同,所导致的存储空间利用率也不同,因此,根据介质层的性能的不同选取不同的纠删码配比进行数据存储能够更好的发挥相应介质层的存储性能,有效的提高存储空间利用率。
需要说明的是:上述实施例提供的分条管理装置在进行数据存储时,仅以上述各功能单元的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的分条管理装置与分条管理方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机程序指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机程序指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机程序指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(Digital Subscriber Line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所 述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(Digital Versatile Disc,DVD))、或者半导体介质(例如:固态硬盘(Solid State Disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
应当理解的是,本文提及的“至少一个”是指一个或多个,“多个”是指两个或两个以上。在本文的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
Claims (20)
- 一种分条管理方法,其特征在于,应用于存储系统,所述方法包括:获取多个第一分条中的校验单元;其中,所述第一分条遵从第一纠删码配比;根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。
- 根据权利要求1所述的方法,其特征在于,所述新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
- 根据权利要求1或2所述的方法,其特征在于,所述多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;所述获取多个第一分条中的校验单元,具体包括:读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
- 根据权利要求3所述的方法,其特征在于,所述方法还包括:持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
- 根据权利要求1或2所述的方法,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述方法还包括:持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
- 一种存储系统,其特征在于,所述存储系统包含一个或多个处理器,所述一个或多个处理器用于:获取多个第一分条中的校验单元,其中,所述第一分条遵从第一纠删码配比;根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。
- 根据权利要求6所述的存储系统,其特征在于,所述新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
- 根据权利要求6或7所述的存储系统,其特征在于,所述多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;所述一个或多个处理器具体用于:读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
- 根据权利要求8所述的存储系统,其特征在于,所述一个或多个处理器还用于:持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
- 根据权利要求6或7所述的存储系统,其特征在于,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述一个或多个处理器还用于:持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
- 一种分条管理装置,其特征在于,所述分条管理装置应用于存储系统中,所述数据存储装置包含获取单元和生成单元;其中,所述获取单元,用于获取多个第一分条中的校验单元,其中,所述第一分条遵从第一纠删码配比;所述生成单元,用于根据所述多个第一分条的校验单元生成新的校验单元;其中, 所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。
- 根据权利要求11所述的分条管理装置,其特征在于,所述新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
- 根据权利要求11或12所述的分条管理装置,其特征在于,所述多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;所述获取单元具体用于:读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
- 根据权利要求13所述的分条管理装置,其特征在于,所述分条管理装置还包括存储单元;所述存储单元,用于持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
- 根据权利要求11或12所述的分条管理装置,其特征在于,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述分条管理装置还包括存储单元;所述存储单元,用于持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
- 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质包含计算机程序指令,存储系统中的一个或多个中央处理器执行所述计算机程序指令使得所述存储系统执行:获取多个第一分条中的校验单元,其中,所述第一分条遵从第一纠删码配比;根据所述多个第一分条的校验单元生成新的校验单元;其中,所述新的校验单元与所述多个第一分条中的数据单元属于新的分条;所述新的分条遵从第二纠删码配比;其中,第一纠删码配比对应的数据单元的个数小于第二纠删码对应的数据单元的个数。
- 根据权利要求16所述的计算机可读存储介质,其特征在于,所述新的分条中的校验单元的个数与所述第一分条中的校验单元的个数相同。
- 根据权利要求16或17所述的计算机可读存储介质,其特征在于,所述多个第一分条包含至少一个在所述存储系统中未持久化存储的第一分条和至少一个已经在所述存储系统持久化存储的第一分条;所述获取多个第一分条中的校验单元,具体包括:读取所述至少一个已经在所述存储系统持久化存储的第一分条中的校验单元;读取所述至少一个在所述存储系统中未持久化存储的第一分条中的校验单元。
- 根据权利要求18所述的计算机可读存储介质,其特征在于,所述一个或多个中央处理器执行所述计算机程序指令使得所述存储系统还执行:持久化存储所述至少一个在所述存储系统中未持久化存储的第一分条中的数据单元和所述新的校验单元。
- 根据权利要求16或17所述的计算机可读存储介质,所述多个第一分条为所述存储系统中未持久化存储的第一分条;所述一个或多个中央处理器执行所述计算机程序指令使得所述存储系统还执行:持久化存储所述多个第一分条中的数据单元以及所述新的校验单元。
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21837389.2A EP4174630A4 (en) | 2020-07-10 | 2021-07-12 | STRIP MANAGEMENT METHOD, STORAGE SYSTEM, STRIP MANAGEMENT APPARATUS AND STORAGE MEDIUM |
| US18/151,848 US12131052B2 (en) | 2020-07-10 | 2023-01-09 | Stripe management method, storage system, stripe management apparatus, and storage medium to improve data storage reliability |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010661972.0 | 2020-07-10 | ||
| CN202010661972 | 2020-07-10 | ||
| CN202011148485.0A CN113918083A (zh) | 2020-07-10 | 2020-10-23 | 分条管理方法、存储系统、分条管理装置及存储介质 |
| CN202011148485.0 | 2020-10-23 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/151,848 Continuation US12131052B2 (en) | 2020-07-10 | 2023-01-09 | Stripe management method, storage system, stripe management apparatus, and storage medium to improve data storage reliability |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022007968A1 true WO2022007968A1 (zh) | 2022-01-13 |
Family
ID=79232468
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/120835 Ceased WO2022007225A1 (zh) | 2020-07-10 | 2020-10-14 | 数据存储方法、存储系统、存储设备及存储介质 |
| PCT/CN2021/105640 Ceased WO2022007968A1 (zh) | 2020-07-10 | 2021-07-12 | 分条管理方法、存储系统、分条管理装置及存储介质 |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/120835 Ceased WO2022007225A1 (zh) | 2020-07-10 | 2020-10-14 | 数据存储方法、存储系统、存储设备及存储介质 |
Country Status (4)
| Country | Link |
|---|---|
| US (2) | US12131051B2 (zh) |
| EP (2) | EP4170499A4 (zh) |
| CN (2) | CN113918378A (zh) |
| WO (2) | WO2022007225A1 (zh) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115686342A (zh) * | 2021-07-22 | 2023-02-03 | 华为技术有限公司 | 存储系统中的数据存储方法以及装置 |
| CN114816837B (zh) * | 2022-06-28 | 2022-12-02 | 苏州浪潮智能科技有限公司 | 一种纠删码融合方法、系统、电子设备及存储介质 |
| CN119814947B (zh) * | 2023-10-10 | 2025-11-18 | 浙江宇视科技有限公司 | 基于纠删码的数据处理方法、装置、设备和存储介质 |
| CN119645307B (zh) * | 2024-12-02 | 2025-10-14 | 天翼云科技有限公司 | 数据文件处理方法、装置、计算机设备和存储介质 |
| CN119690358B (zh) * | 2025-02-24 | 2025-05-09 | 浙江大华技术股份有限公司 | 数据降级存储方法、电子设备以及存储介质 |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103577274A (zh) * | 2012-07-31 | 2014-02-12 | 国际商业机器公司 | 管理存储器阵列的方法和装置 |
| US20150378820A1 (en) * | 2014-06-25 | 2015-12-31 | Quantum Corporation | High Reliability Erasure Code Distribution |
| CN105630423A (zh) * | 2015-12-25 | 2016-06-01 | 华中科技大学 | 一种基于数据缓存的纠删码集群存储扩容方法 |
| CN106201766A (zh) * | 2016-07-25 | 2016-12-07 | 深圳市中博科创信息技术有限公司 | 数据存储控制方法及数据服务器 |
| WO2017173623A1 (zh) * | 2016-04-07 | 2017-10-12 | 华为技术有限公司 | 用于处理存储设备中分条的方法和存储设备 |
| CN107436733A (zh) * | 2017-06-29 | 2017-12-05 | 华为技术有限公司 | 分片管理方法和分片管理装置 |
| CN109189326A (zh) * | 2018-07-25 | 2019-01-11 | 华为技术有限公司 | 分布式集群的管理方法和装置 |
| CN110865901A (zh) * | 2018-08-28 | 2020-03-06 | 华为技术有限公司 | 组建ec条带的方法和装置 |
Family Cites Families (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4814119B2 (ja) * | 2007-02-16 | 2011-11-16 | 株式会社日立製作所 | 計算機システム、ストレージ管理サーバ、及びデータ移行方法 |
| US9898224B1 (en) * | 2012-09-12 | 2018-02-20 | EMC IP Holding Company LLC | Automatic adjustment of capacity usage by data storage optimizer for data migration |
| US9584160B2 (en) * | 2014-02-20 | 2017-02-28 | Quantum Corporation | Dynamically configuring erasure code redundancy and distribution |
| US9595979B2 (en) * | 2015-01-20 | 2017-03-14 | International Business Machines Corporation | Multiple erasure codes for distributed storage |
| CN105159618B (zh) * | 2015-09-25 | 2018-08-28 | 清华大学 | 用于单盘失效修复的优化方法及优化装置 |
| CN105487823B (zh) * | 2015-12-04 | 2018-06-05 | 华为技术有限公司 | 一种数据迁移的方法及装置 |
| US9817713B2 (en) * | 2016-02-04 | 2017-11-14 | International Business Machines Corporation | Distributed cache system utilizing multiple erasure codes |
| US10142419B2 (en) * | 2016-03-04 | 2018-11-27 | Sandisk Technologies Llc | Erasure correcting coding using data subsets and partial parity symbols |
| US10218789B2 (en) * | 2016-03-04 | 2019-02-26 | Western Digital Technologies, Inc. | Erasure correcting coding using temporary erasure data |
| KR101934204B1 (ko) * | 2017-07-28 | 2018-12-31 | 한양대학교 산학협력단 | 데이터 저장을 위한 소실 부호의 부호화 방법 및 장치 |
| US10929226B1 (en) * | 2017-11-21 | 2021-02-23 | Pure Storage, Inc. | Providing for increased flexibility for large scale parity |
| US20200042193A1 (en) * | 2018-08-03 | 2020-02-06 | EMC IP Holding Company LLC | Method, storage system and computer program product for managing data storage |
| CN111176880B (zh) * | 2018-11-09 | 2021-08-13 | 杭州海康威视系统技术有限公司 | 磁盘分配方法、装置和可读存储介质 |
| US11487715B1 (en) * | 2019-07-18 | 2022-11-01 | Pure Storage, Inc. | Resiliency in a cloud-based storage system |
| CN110531936B (zh) * | 2019-08-29 | 2021-05-28 | 西安交通大学 | 基于多种存储介质的分布式纠删码混合存储的林型存储结构及方法 |
| US11144394B1 (en) * | 2020-06-05 | 2021-10-12 | Vmware, Inc. | Storing B-tree pages in capacity tier for erasure-coded storage in distributed data systems |
-
2020
- 2020-08-21 CN CN202010851132.0A patent/CN113918378A/zh active Pending
- 2020-10-14 WO PCT/CN2020/120835 patent/WO2022007225A1/zh not_active Ceased
- 2020-10-14 EP EP20944002.3A patent/EP4170499A4/en active Pending
- 2020-10-23 CN CN202011148485.0A patent/CN113918083A/zh active Pending
-
2021
- 2021-07-12 WO PCT/CN2021/105640 patent/WO2022007968A1/zh not_active Ceased
- 2021-07-12 EP EP21837389.2A patent/EP4174630A4/en active Pending
-
2023
- 2023-01-03 US US18/149,288 patent/US12131051B2/en active Active
- 2023-01-09 US US18/151,848 patent/US12131052B2/en active Active
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103577274A (zh) * | 2012-07-31 | 2014-02-12 | 国际商业机器公司 | 管理存储器阵列的方法和装置 |
| US20150378820A1 (en) * | 2014-06-25 | 2015-12-31 | Quantum Corporation | High Reliability Erasure Code Distribution |
| CN105630423A (zh) * | 2015-12-25 | 2016-06-01 | 华中科技大学 | 一种基于数据缓存的纠删码集群存储扩容方法 |
| WO2017173623A1 (zh) * | 2016-04-07 | 2017-10-12 | 华为技术有限公司 | 用于处理存储设备中分条的方法和存储设备 |
| CN106201766A (zh) * | 2016-07-25 | 2016-12-07 | 深圳市中博科创信息技术有限公司 | 数据存储控制方法及数据服务器 |
| CN107436733A (zh) * | 2017-06-29 | 2017-12-05 | 华为技术有限公司 | 分片管理方法和分片管理装置 |
| CN109189326A (zh) * | 2018-07-25 | 2019-01-11 | 华为技术有限公司 | 分布式集群的管理方法和装置 |
| CN110865901A (zh) * | 2018-08-28 | 2020-03-06 | 华为技术有限公司 | 组建ec条带的方法和装置 |
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4174630A4 |
Also Published As
| Publication number | Publication date |
|---|---|
| EP4174630A4 (en) | 2023-11-29 |
| EP4170499A4 (en) | 2023-11-29 |
| WO2022007225A1 (zh) | 2022-01-13 |
| US20230163789A1 (en) | 2023-05-25 |
| CN113918083A (zh) | 2022-01-11 |
| EP4174630A1 (en) | 2023-05-03 |
| US12131051B2 (en) | 2024-10-29 |
| EP4170499A1 (en) | 2023-04-26 |
| US12131052B2 (en) | 2024-10-29 |
| US20230137007A1 (en) | 2023-05-04 |
| CN113918378A (zh) | 2022-01-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2022007968A1 (zh) | 分条管理方法、存储系统、分条管理装置及存储介质 | |
| US11204716B2 (en) | Compression offloading to RAID array storage enclosure | |
| US10831407B2 (en) | Write flow offloading to raid array storage enclosure | |
| US10453530B2 (en) | RDMA-SSD dual-port unified memory and network controller | |
| US8145837B2 (en) | Computer storage system with redundant storage servers and at least one cache server | |
| US11061618B1 (en) | Disk array enclosure configured to determine metadata page location based on metadata identifier | |
| US12498855B2 (en) | Storage system, data storage method, data read method, and storage medium | |
| US10387307B2 (en) | Lock-free raid implementation in multi-queue architecture | |
| CN102667727A (zh) | 用于实现从多达n个存储设备失效恢复的n路奇偶校验技术 | |
| US20240362117A1 (en) | Reliability coding with reduced network traffic | |
| US20190042365A1 (en) | Read-optimized lazy erasure coding | |
| US20250130723A1 (en) | Data processing method and related device | |
| WO2023000686A1 (zh) | 存储系统中的数据存储方法以及装置 | |
| CN118349188A (zh) | 一种基于非易失性存储的编码数据快速更新方法 | |
| CN104636078B (zh) | 用于对多种类型的存储等级组的非易失性存储(nvs)的有效阈值化的方法和系统 | |
| CN115408345A (zh) | 应用于并行文件系统的数据存储方法及装置 | |
| Akutsu et al. | MEC: Network optimized multi-stage erasure coding for scalable storage systems | |
| Zhang et al. | On the implementation of BRS codes in Ceph | |
| US20250390377A1 (en) | Retrieving user data and cyclic redundancy check information using a single access operation | |
| Huang et al. | Optimizing erasure-coded data archival for replica-based storage clusters | |
| CN117632572A (zh) | 数据存储方法、装置及存储介质 | |
| Volos et al. | Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication | |
| Wei et al. | TSUE+: an Efficient Update Framework with Swift Recycling Mechanism for Erasure-Coded Cluster File Systems | |
| Pan et al. | ACS: an alternate coding scheme to improve degrade read performance for SSD-based RAID5 systems | |
| WO2026045259A1 (zh) | 存储介质控制器、存储器以及存储设备 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21837389 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2021837389 Country of ref document: EP Effective date: 20230127 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |