WO2024149207A1 - 数据处理方法和装置、介质和计算机设备 - Google Patents

数据处理方法和装置、介质和计算机设备 Download PDF

Info

Publication number
WO2024149207A1
WO2024149207A1 PCT/CN2024/071228 CN2024071228W WO2024149207A1 WO 2024149207 A1 WO2024149207 A1 WO 2024149207A1 CN 2024071228 W CN2024071228 W CN 2024071228W WO 2024149207 A1 WO2024149207 A1 WO 2024149207A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
record
data page
column
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2024/071228
Other languages
English (en)
French (fr)
Inventor
张峥
马占峰
杨莘军
李飞飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou AliCloud Feitian Information Technology Co Ltd
Original Assignee
Hangzhou AliCloud Feitian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou AliCloud Feitian Information Technology Co Ltd filed Critical Hangzhou AliCloud Feitian Information Technology Co Ltd
Priority to EP24741206.7A priority Critical patent/EP4650978A4/en
Publication of WO2024149207A1 publication Critical patent/WO2024149207A1/zh
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1744Redundancy elimination performed by the file system using compression, e.g. sparse files
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/221Column-oriented storage; Management thereof
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6047Power optimization with respect to the encoder, decoder, storage or transmission
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound
    • H03M7/707Structured documents, e.g. XML
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60General implementation details not specific to a particular type of compression
    • H03M7/6017Methods or arrangements to increase the throughput
    • H03M7/6023Parallelization

Definitions

  • the present disclosure relates to the field of data processing technology, and in particular to a data processing method and apparatus, a medium and a computer device.
  • Databases generally use data pages as the smallest storage unit. In order to reduce storage overhead and improve the performance of data systems, data pages are usually compressed. However, the data compression method of related technologies generally removes the high-order 0s of the numbers to be compressed in the data page and stores the remaining valid digits in a variable length manner. However, this method has a poor compression effect for numbers with larger values.
  • an embodiment of the present disclosure provides a data processing method, the method comprising: obtaining a target compression column of multiple user records on a first data page; the target compression column comprises digital type data; determining a first basic value corresponding to each target compression column, the first basic value corresponding to the target compression column being used to differentially encode the target compression column of each user record on the first data page; generating a first basic value record based on the first basic value corresponding to each target compression column, and storing the first basic value record in the first data page.
  • the generating of the first basic value record based on the first basic value corresponding to each target compression column includes: if the remaining storage space of the first data page is less than the storage space occupied by the next user record to be inserted into the first data page, releasing the storage space occupied by invalid user records in the first data page, and determining the compression rate of valid user records in the multiple user records in the process of releasing the storage space; if the compression rate of the valid user record meets a preset first compression rate condition, compressing the corresponding target compression columns in the valid user records based on the first basic values corresponding to each target compression column, and generating the first basic value record based on the first basic values corresponding to each target compression column.
  • the method further comprises: releasing the memory occupied by invalid user records in the first data page; After the remaining storage space of the first data page is filled, if the remaining storage space of the first data page is still smaller than the storage space occupied by the next user record to be inserted into the first data page, the target user record among the multiple user records is migrated to the second data page.
  • migrating the target user record among the multiple user records to the second data page includes: if the compression rate of the target user record does not meet the preset second compression rate condition, migrating the target user record to the second data page in an uncompressed state; and/or if the compression rate of the target user record meets the preset second compression rate condition, migrating the target user record to the second data page in a compressed state, and storing a second basic value record in the second data page; the second basic value record includes a second basic value corresponding to each of the target compression columns, and the second basic value corresponding to the target compression column is used to compress the target compression column of the target basic value record.
  • migrating the target user record in a compressed state to the second data page includes: if the target user record is a compressed user record in the first data page, decompressing the target user record based on the first basic value record; and migrating the decompressed target user record to the second data page.
  • the method further includes: after generating a first base value record based on the first base value corresponding to each target compression column, recording a first flag bit in the first data page, the first flag bit being used to indicate that the first data page includes the first base value record; and/or after compressing the target compression column of the user record in the first data page, recording a second flag bit in the user record, the second flag bit being used to indicate that the user record is a compressed user record.
  • the generating of the first base value record based on the first base value corresponding to each target compression column includes: generating a compression entry corresponding to each target compression column, the compression entry corresponding to the target compression column including identification information of the target compression column and the base value of the target compression column; encapsulating the number of target compression columns in the first data page and the compression entries corresponding to each target compression column into the first base value record.
  • the first base value record also includes at least one of the following information: a version number of a compression algorithm; and verification information for verifying the first base value record.
  • the method further includes: compressing a new user record to be inserted into the first data page based on the first base value record; inserting the compressed new user record into the first data page; and/or decompressing the user record to be updated in the first data page based on the first base value record; after updating the decompressed user record to be updated, compressing the updated user record based on the first base value record.
  • the numerical relationship between the target compressed columns of the multiple user records after digital compression is the same as the numerical relationship before compression; the method further includes: obtaining the data to be queried after differentially encoding the target compressed columns of the multiple user records on the first data page; compressing the data to be queried based on the first basic value corresponding to the target compressed column to which the data to be queried belongs; and determining the target record from the compressed multiple user records based on the compressed data to be queried.
  • the method further includes: decompressing each target compressed column of the compressed user record based on the first base value record.
  • an embodiment of the present disclosure provides a data processing device, comprising: an acquisition module, used to acquire a target compression column of multiple user records on a first data page; the target compression column comprises digital type data; a determination module, used to determine a first basic value corresponding to each target compression column, the first basic value corresponding to the target compression column being used to differentially encode the target compression column of each user record on the first data page; a generation module, used to generate a first basic value record based on the first basic value corresponding to each target compression column, and store the first basic value record in the first data page.
  • an embodiment of the present disclosure provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described in any embodiment of the present disclosure.
  • an embodiment of the present disclosure provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in any embodiment of the present disclosure when executing the program.
  • the disclosed embodiment obtains the first basic value of each target compression column, and differentially encodes the target compression column of each user record of the first data page based on the first basic value.
  • the first basic value can usually be determined for the data in the same target compression column; on the other hand, the first basic value is used to convert the number to be encoded in the target compression column into a difference with the basic value.
  • the difference is smaller than the original value, and more high-order 0s can be removed. Therefore, the number with a larger value is converted into a number with a smaller value, thereby improving the compression rate.
  • the first basic value record can be directly read from the first data page during decompression to decompress each compressed target compression column.
  • FIG. 1 is a flow chart of a data processing method according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of user records on a first data page.
  • FIG. 3 is a schematic diagram of a user record after differential encoding.
  • FIG. 4 is a schematic diagram of the specific structure of the first data page before and after compression according to an embodiment of the present disclosure.
  • FIG. 5 is a flow chart of a compression process according to an embodiment of the present disclosure.
  • FIG. 6 is a flowchart of a decompression process according to an embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of a data processing device according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present disclosure.
  • first, second, third, etc. may be used in the present disclosure to describe various information, such information should not be limited to these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information.
  • word "if” as used herein may be interpreted as "at the time of” or "when” or "in response to determining”.
  • databases have been widely used in many industries. With the popularization of the Internet and the application of various new technologies, the amount of business data generated is increasing day by day, and the amount of data stored in databases is also increasing. Databases generally use fixed-size data pages as the smallest storage unit. Common data organization methods include heap tables and index tables. Data compression based on data pages can reduce storage overhead and improve the performance of data systems.
  • the decimal number 1 can be represented as 0000 0000 0000 0000 0001 before compression;
  • the decimal number 2147483647 can be represented as 1111 1111 1111 1111 1111 1111 1111. It can be seen that no matter whether the stored number is 1 or the maximum value 2147483647, it occupies 4 bytes (if it is less than 4 bytes, the high bits are filled with 0), resulting in a waste of storage space. In view of this situation, in some compression schemes, the high-order 0s of the number to be compressed are removed, and the remaining valid digits are stored in a variable-length manner of header+payload.
  • the header occupies one byte, and the number of bytes occupied by the remaining valid digits is stored.
  • the remaining valid digit is 1 (0000 0001 in binary), which can be expressed as 0000 0001 0000 0001, which occupies 2 bytes in total; the decimal number 2147483647 has no 0 in the high digit to remove, and the remaining valid digit is still 4 (0000 0100 in binary), which can be expressed as 0000 0100 1111 1111 1111 1111 1111 1111 1111 1111, which occupies 5 bytes in total.
  • this method is very effective, but the compression effect of large numbers is not ideal.
  • the numbers compressed in this way can only be compared after decompression, and it does not support direct comparison of the compressed numbers, and more efficient byte-by-byte comparison of the compressed numbers.
  • the present disclosure proposes a data processing method, referring to FIG1 , wherein the method includes:
  • Step S102 obtaining a target compression column of a plurality of user records on a first data page; the target compression column includes numbers;
  • Step S104 determining a first base value corresponding to each target compression column, where the first base value corresponding to the target compression column is used to perform base-delta encoding on the target compression column of each user record on the first data page;
  • Step S106 Generate a first basic value record based on the first basic value corresponding to each target compression column, and store the first basic value record in the first data page.
  • the first data page may be a data page of a fixed size (e.g., 16K) in a database.
  • the first data page may include multiple user records, for example, records for storing user identity information, records for storing user transaction information, records for storing user credits, etc., which are not limited by the present disclosure.
  • Each user record may include one or more identical columns.
  • each user record may include columns such as the user's ID, name, age, gender, and ID number; in the records for storing user transaction information, each user record may include columns such as the user's ID, transaction time, and transaction amount; in the records for storing user credits, each user record may include columns such as the user's student number, the name of the selected course, and the grade of the selected course.
  • a target compression column may be selected from the columns included on the first data page.
  • the target compression column is a column including numbers, wherein the numbers may be data such as user IDs and transaction amounts.
  • the user records on the first data page may be records for storing user identity information, and the ID card in the user record may include numbers or letters (e.g., X). All column information on the first data page may be traversed, and whether the column is a target compression column may be determined according to the type of each column. In this case, if a column of data includes numbers, the column is a target compression column; otherwise, the column is not a target compression column.
  • the column is a target compression column based on the type of each column of data and the storage length of the column of data. In this case, if a column of data includes numbers, and the storage length of the numbers in the column is greater than the preset length threshold, the column is a target compression column; as long as any one of the two conditions of data type and storage length is not met, the column is not a target compression column.
  • a preset length threshold e.g. 3 bytes
  • the number of target compressed columns in the first data page may be greater than or equal to 1.
  • each target compressed column on the first data page may be obtained.
  • the user record on the first data page is a record for storing user transaction information, in which the column corresponding to the user ID and the column corresponding to the transaction amount may both be target compressed columns. Therefore, the two target compressed columns may be obtained from the first data page respectively.
  • a corresponding first base value may be determined for each target compressed column.
  • the first base value corresponding to a target compressed column is used to subtract the number included in the target compressed column to obtain a difference value (delta). Since the difference value is usually smaller than the original value, a number with a larger value is converted into a number with a smaller value, thereby saving storage space.
  • the first basic value can be the minimum value, average value, median value, etc. of the target compression column in each user record of the first data page.
  • the first basic value corresponding to each target compression column can be selected using the same or different strategies. All user records on the first data page can be traversed, and the data of the target compression column can be extracted therefrom for aggregation and calculation, the data distribution characteristics of the data of the target compression column can be determined, and the first basic value can be determined based on the data distribution characteristics.
  • the minimum value when used as the first basic value, all data (numbers) of the same target compression column can be compared, and the number with the smallest value in the target compression column can be used as the first basic value.
  • the average value when used as the first basic value, all data (numbers) of the same target compression column can be summed (SUM), and the sum result can be divided by the number of numbers in the target compression column to obtain the first basic value.
  • the middle value is used as the first base value, all data (numbers) in the same target compression column can be sorted. If there are an odd number of numbers, the (n+1)/2th number in the sorting result is selected as the first base value; if there are an even number of numbers, the n/2th number in the sorting result is selected as the first base value.
  • FIG. 2 shows a user record for storing user transaction information, which is a user record before compression, including three columns of data: user ID, name, and amount.
  • the column where the ID is located and the column where the amount is located are the target compression columns.
  • the first basic value corresponding to the column where the ID is located is 100.
  • the first basic value corresponding to the column where the ID is located is 175350 (i.e., the average value of 100, 101, ..., 600).
  • the median value is used as the first basic value
  • the first basic value corresponding to the column where the ID is located is 350.
  • FIG. 3 is the difference obtained by subtracting the corresponding basic value from the data in the target compression column. It can be seen that after subtracting the first basic value, the difference corresponding to the target compression column in most data records is smaller than the original value, so that the space actually required to store these numbers can be reduced.
  • the first basic values corresponding to all selected target compressed columns can be stored in the first basic value record.
  • a compression entry corresponding to each target compressed column can be generated, wherein the compression entry corresponding to each target compressed column includes the identification information of the target compressed column (for example, the column number of the target compressed column) and the basic value of the target compressed column. Then, the number of target compressed columns in the first data page and the compression entries corresponding to each target compressed column can be encapsulated into a first basic value record.
  • the compression entries ⁇ col no_1, base_1>, ⁇ col no_2, base_2>, ..., ⁇ col no_m, base_m> corresponding to each target compressed column can be obtained and encapsulated into a first basic value record.
  • the column where the ID is located and the column where the amount is located are both target compressed columns, and when the minimum value is taken as the first basic value, the first basic value corresponding to the column where the ID is located is 100, and the first basic value corresponding to the column where the amount is located is 100.
  • not every column in the first data page is a target compression column.
  • the column corresponding to the name is not a target compression column. Therefore, the column corresponding to the name does not include a compression entry, indicating that the data in this column does not need to participate in the compression of the first data page.
  • the first basic value record may include the following information:
  • the version number (Version) of the compression algorithm may be represented by 1 byte of data; the compression algorithm may include differential coding and other algorithms.
  • the target compression column in each user record may be compressed by differential coding, and then the difference after differential coding may be compressed by other coding methods. Therefore, the version number of the compression algorithm may refer to the version number of the entire set of compression algorithms including differential coding and other algorithms.
  • Verification information (Magic number) for verifying the first basic value record, which can be represented by 1 byte of data;
  • the number of compressed entries (#columns) is used to record the number of ⁇ col no,base> contained in the first base value record. Two bytes of data can be used to represent this information, so that the first base values corresponding to up to 65536 target compressed columns can be stored in the first base value record.
  • the compression entry ⁇ col no, base> including the identification information of the target compression column and its corresponding first base value, can be stored in a compact manner, and its length depends on the size of the first base value.
  • the first basic value record may include only part of the above information, for example, only the compression entries and the number of compression entries corresponding to each target compression column, or only the compression entries, the number of compression entries and the version number of the compression algorithm corresponding to each target compression column, or only the compression entries, the number of compression entries and the verification information corresponding to each target compression column.
  • the number of bytes used by the above various information can be set to other values as needed.
  • the total length of the first basic value record is variable, and its length depends on the number of target compression columns involved in compression and the size of the basic value corresponding to each target compression column. Different data pages can have first basic value records of different lengths. When constructing the first basic value record, all user records on the current data page can be traversed, and the data distribution characteristics of each target compression column can be analyzed. Finally, a suitable value is selected as the first basic value based on the data distribution characteristics.
  • the following is an example of an embodiment of a specific process of generating the first basic value record.
  • data compression when initially inserting a user record into the first data page or changing a user record, data compression may not be performed on the user record.
  • “initially” may refer to before the first data page is filled for the first time (i.e., the remaining storage space of the first data page is insufficient to store new user records), or before the number of user records in the first data page reaches a preset number threshold.
  • the number of data records in the first data page is relatively small, it is usually difficult to accurately determine the distribution characteristics of the data in each target compression column, and thus it is difficult to select a suitable first basic value to compress the data in the target compression column. Therefore, not compressing the initially inserted or changed user records can reduce the situation where the selected first basic value is inappropriate due to the difficulty in determining the distribution characteristics of the data.
  • the first data page When the remaining storage space of the first data page is less than the storage space occupied by the next user record to be inserted into the first data page, the first data page may be reorganized or split to try to free up enough continuous free storage space in the first data page.
  • the first basic value record may be generated and the user record may be compressed.
  • the invalid user record may be a user record that has been deleted. After releasing the storage space occupied by the invalid user record, you can try to compress the valid user records in the first data page (that is, the remaining user records in the first data page). If the compression rate of the valid user record meets the preset first compression rate condition, the corresponding target compression column in the valid user record is compressed based on the first basic value corresponding to each target compression column, and the first basic value record is generated based on the first basic value corresponding to each target compression column.
  • the first compression rate condition is used to characterize whether the valid user records in the first data page are suitable for compression, that is, whether the storage space occupied in the first data page can be effectively reduced after compression.
  • the first compression rate condition can be the total compression rate of each valid user record, which can be determined based on the ratio of the length of each valid user record after compression to the length before compression. If the total compression rate is greater than a preset compression rate threshold, it can be determined that the first compression rate condition is not met, otherwise it is determined that the first compression rate condition is met.
  • the first compression rate condition can also be the number of valid user records whose compression rate is less than a preset compression rate threshold. If the number is greater than a preset number threshold, it can be determined that the first compression rate condition is met, otherwise it is determined that the first compression rate condition is not met.
  • the first compression rate condition can also be other conditions, which are not listed here one by one.
  • a temporary data page When attempting to compress valid user records in the first data page, a temporary data page may be generated, and the corresponding target compression column of each user record in the first data page may be compressed based on the first basic value corresponding to each target compression column, and then the compressed column may be compressed.
  • the compressed user records are written into the temporary data page. Then, it is determined whether the compressed user records in the temporary data page meet the first compression rate condition. If so, the user records before compression can be deleted from the first data page, and the compressed user records in the temporary data page can be written into the first data page. If the first compression rate condition is not met, the temporary data page can be directly deleted, and the uncompressed user records are still stored in the first data page.
  • some numbers with large intervals are not suitable for compression and generally do not meet the first compression rate condition. If the numbers included in the target compression column have such characteristics, compression can be omitted and the uncompressed user records can be directly stored in the first data page.
  • the first data page can be split, that is, the target user record among the multiple user records is migrated to the second data page.
  • the target user record may include some user records in the first data page, and there may be multiple ways to select the target user record. For example, a preset number of user records in the first data page may be migrated to the second data page, and the preset number of user records may be multiple continuous user records in the first data page, or multiple user records with similar data distribution characteristics, or multiple user records that meet other conditions.
  • the following situations may occur when migrating the target user records to the second data page:
  • the compressed target user record can be stored in the second data page. At this time, if the target user record in the first data page is not compressed, these uncompressed target user records can be compressed and migrated to the second data page (that is, the target user record is migrated to the second data page in a compressed state);
  • the uncompressed target user record may be stored in the second data page. At this time, if the target user record in the first data page is uncompressed, the uncompressed target user record may be directly migrated to the second data page without compression (i.e., the target user record is migrated to the second data page in an uncompressed state);
  • the compressed target user record can be stored in the second data page.
  • the compressed target user record can be directly migrated to the second data page, so that the compressed target user record is stored in the second data page (that is, the target user record is migrated to the second data page in a compressed state);
  • the uncompressed target user record may be stored in the second data page.
  • the compressed target user record may be decompressed and migrated to the second data page, so as to store the uncompressed target user record in the second data page (that is, the target user record is migrated to the second data page in an uncompressed state).
  • the decompression of the target user record may be implemented based on the first basic value record.
  • the second compression rate condition may be the compression rate of the target user record, and the compression rate may be determined based on the ratio between the length of the target user record after compression and the length before compression.
  • the compression rate may be determined based on the ratio between the length of the target user record after compression and the length before compression.
  • the number of target user records is greater than 1, it may be determined whether each target user record meets the second compression rate condition. If the compression rate of the target user record is less than a certain threshold (such as 80%, configurable), it may be determined that the second compression rate condition is met, otherwise it is determined that the second compression rate condition is not met.
  • the compression operation is data-related, and the distribution characteristics of the data directly affect the compression rate.
  • the embodiment of the present disclosure sets a second compression rate condition when compressing the target user record. If the second compression rate condition is met, the compression is considered to be valid, and the compressed target user record can be stored on the second data page; otherwise, the compression is considered to be invalid, and the uncompressed target user record can be stored on the second data page.
  • each target user record has the flexibility of being compressed or not compressed individually.
  • a second flag bit may be recorded in the target user record to characterize that the target user record is a compressed user record.
  • the second flag bit may be represented by a 1-bit binary number. For example, if the second flag bit in a user record is "1", it indicates that the user record is a compressed user record; if the second flag bit in a user record is "0", it indicates that the user record is an uncompressed user record.
  • the second flag bit in a user record is not empty, it indicates that the user record is a compressed user record; if the second flag bit in a user record is empty, it indicates that the user record is an uncompressed user record.
  • each user record in the first data page may also include a second flag bit to characterize whether the user record is a compressed user record.
  • a second base value record may also be stored in the second data page, and the second base value record includes a second base value corresponding to each of the target compressed columns.
  • the second base value may be the same as the first base value, or may be different from the first base value.
  • 300 may be used as the second base value, that is, the second base value is different from the first base value.
  • the first basic value is different.
  • 100 can still be used as the second basic value, that is, the second basic value is the same as the first basic value.
  • the information that may be included in the second base value record is the same as that in the first base value record.
  • it may include the version number of the differential encoding algorithm, verification information for verifying the second base value record, the number of compression entries and the compression entries corresponding to each target compression column, etc.
  • the detailed description of the above information can be found in the relevant description in the first base value record, which will not be repeated here.
  • the first base value record can also be stored in a specified position of the first data page (for example, before the first user record on the first data page).
  • Figure 4 shows the layout of the first data page before and after compression. It can be seen that before compression, the first data page only includes a page header, a user record, and a page trailer; after compression, a base value record is added to the first data page, and the base value (base) used by each compressed column is stored in the base value record. This information is needed during compression and decompression.
  • not all first data pages include the first base value record.
  • the first base value record can be constructed and saved in the first data page only when the user record on the first data page is compressed for the first time.
  • a first flag bit can be recorded in the first data page to characterize that the first base value record is included in the first data page.
  • the first flag bit can be recorded in the header of the first data page. If the user record on a first data page does not need to be compressed, the first data page does not include the first base value record. Therefore, the data page has the flexibility of being compressed or not compressed separately.
  • recording the first flag bit in the first data page can be changing the first flag bit in the first data page from an invalid state to a valid state, for example, using 1 bit to represent the first flag bit, when the first flag bit is "0", it is an invalid state, and when the first flag bit is "1", it is a valid state. Therefore, recording the first flag bit in the first data page can be changing the first flag bit in the first data page from "0" to "1". Alternatively, recording the first flag bit in the first data page may also be writing a non-empty first flag bit into the first data page when the first flag bit in the first data page is empty.
  • the new user records to be inserted into the first data page can be compressed based on the first basic value record, and the compressed new user records can be inserted into the first data page.
  • the user records to be updated in the first data page can be decompressed based on the first basic value record, and after updating the decompressed user records to be updated, the updated user records can be decompressed based on the first basic value record. The records are recompressed.
  • the first basic value record when inserting a new user record, can be read from the designated storage location of the first data page (S502). For each column in the new user record, it can be determined whether the first basic value corresponding to the column exists in the first basic value record (S504). If so, the first basic value is used to compress the corresponding column in the new user record to be inserted (S506), and it is determined whether the compression is effective (S508), for example, whether the second compression rate condition is met.
  • the compressed form of the column is used (i.e., the data of the column is compressed) (S510), and the compressed form of the column is inserted into the first data page (S514); otherwise, the uncompressed form of the column is used (i.e., the data of the column is not compressed) (S512), and the uncompressed form of the column is inserted into the first data page (S514).
  • the first basic value corresponding to a column does not exist in the first basic value record, it means that the column does not need to be compressed, so the data of the column can be directly inserted into the first data page (S514).
  • the existing user records in the first data page are updated to obtain the updated user records
  • the compressed form of the column is used (i.e., the data of the column is compressed) (S510), and the compressed form of the column is inserted into the first data page (S514); otherwise, the uncompressed form of the column is used (i.e., the data of the column is not compressed) (S512), and the uncompressed form of the column is inserted into the first data page (S514).
  • the first basic value corresponding to a column does not exist in the first basic value record, it means that the column does not need to be compressed, so the data of the column can be directly inserted into the first data page (S514).
  • the user record When querying the user record in the first data page, if the user record has been compressed, it needs to be decompressed, and the specific process is shown in Figure 6.
  • a user record to be queried it can be determined to read the first basic value record from the specified storage location of the first data page (S602), and determine whether the column to be queried has been compressed (S604). If so, the first basic value corresponding to the column to be queried in the first basic value record is used to decompress the column to be queried (S606), and then the decompressed data is returned (S608); otherwise, the data in the column to be queried is directly returned (S608).
  • the target compressed columns of the compressed user records can be decompressed based on the first base value record.
  • differential encoding can be used to preprocess the user records on the first data page, and then other encoding algorithms can be further used to re-encode the values of the preprocessed target compressed columns, remove the high-order 0s, and store them in a byte-comparable manner.
  • decompressing other encoding algorithms can be used first.
  • the decoding algorithm corresponding to the algorithm decodes the re-encoded value of the target compressed column, and then decodes the decoded value again based on the first basic value to obtain the uncompressed value of the target compressed column.
  • the number of bytes of the payload in the target compressed column after preprocessing can be determined; the payload is the minimum data that can represent the size of the target compressed column after preprocessing after removing the high-order 0s; a header byte is generated; the header byte at least includes: length data, the length data is obtained according to the number of bytes of the payload; the generated header byte is combined with the payload of the target compressed column after preprocessing, and the combined data is used as the encoding result of the target compressed column after preprocessing. Further, the header byte at least includes: a sign bit and length data; the sign bit is used to indicate the positive or negative of the integer to be encoded.
  • the value of the target compression column after preprocessing is an integer
  • the value of the target compression column after preprocessing can be called the integer to be encoded. Since the number of bytes included in the payload is related to the size of an integer, the number of bytes of the integer payload is recorded by adding a header byte.
  • the header byte sizes of the two integers are different, the sizes of the two data can be determined by simply comparing the sizes of the header bytes of the two numbers; when the header byte sizes of the two integers are the same, it is necessary to further compare the payload parts of the two integers to know the sizes of the two integers. In this way, through the above method, while compression is achieved, it can be ensured that the compressed integers can participate in sorting and can be compared with other compressed integers.
  • the byte corresponding to the first data bit of the data in the encoding value can also be stored in the first address bit of the storage address
  • the byte corresponding to the second data bit of the data in the encoding value can be stored in the second address bit of the storage address, wherein the first data bit is higher than the second data bit (i.e., the first data bit is the high bit of the data, and the second data bit is the low bit of the data), and the first address bit is lower than the second address bit (i.e., the first address bit is the low bit of the storage address, and the second address bit is the high bit of the storage address).
  • the encoding values stored in the storage address can be directly read in the order of address bits from low to high, and the read encoding values can be compared. If the encoding values in the lower address bits are the same, the encoding values in the second lowest address bits are compared, and so on, until the encoding values in all address bits are compared, or the address bits with different encoding values are found. If the address bits with different encoding values are found, the value of the data before encoding corresponding to the larger encoding value in the address bits is also larger. This method is called byte comparable storage method.
  • the header byte can be stored in the lowest address bit of the storage address. For example, if the header byte occupies 1 byte, the header byte can be stored in bits 0 to 7 of the storage address. Then, the payload is stored.
  • the 0th to 7th bits of the two storage addresses can be read first to obtain the header bytes of the two encoded numbers. If the header byte of number A is larger than the header byte of number B, it means that the decoded value of number A is also larger than the decoded value of number B. If the header byte of number A is equal to the header byte of number B, then compare the payload of number A with the payload of number B.
  • the encoded value can be read from the lowest address bit of the corresponding storage address first to obtain "04" and "05” respectively. Since the value of "04" is less than "05", and "04" and "05" both correspond to the highest bit of the number before encoding, it can be determined that the highest bit of number A before encoding is less than the highest bit of number B before encoding, so that the size of number A and number B can be determined without decompressing the encoded values of number A and number B.
  • the data in multiple compressed user records can be compared with the data to be queried, so as to determine the user records that hit the data to be queried, thereby eliminating the need to decompress each user record.
  • the numerical relationship of the target compressed data of the plurality of user records in the first data page after compression is the same as the numerical relationship before compression.
  • the numerical relationship includes greater than, equal to, and less than. For example, for the same target compressed data of two user records, user record A and user record B, if the numerical value of the target compressed data of user record A before compression is greater than/equal to/less than the numerical value of the target compressed data of user record B before compression, then the numerical value of the target compressed data of user record A after compression is also greater than/equal to/less than the numerical value of the target compressed data of user record B after compression.
  • the data to be queried can be obtained; the data to be queried is compressed based on the first basic value corresponding to the target compressed column to which the data to be queried belongs; and the target record is determined from the compressed multiple user records based on the compressed data to be queried. That is to say, in the embodiment of the present disclosure, instead of first decompressing each compressed user record and then comparing it with the data to be queried, the data to be queried is compressed and then compared with the compressed user record.
  • the numerical relationship after compression is the same, so by comparing the compressed data with the compressed user records, it is possible to determine which user record hits the data to be queried. In this way, the number of decompressions during the query process is reduced, thereby improving data query efficiency.
  • each user record can be encoded first. After encoding, assuming that the data to be queried is the amount data in the user record with ID 200, the first basic value corresponding to the column where the ID is located can be queried from the first basic value record, and the number "200" can be compressed by the first basic value. Then, based on the compressed value of the number "200" and the compressed value of the column where the ID is located in each user record, the user record with ID 200 is determined, and then the first basic value corresponding to the column where the amount is located is queried from the first basic value record, and the column where the amount is located in the user record with ID 200 is decompressed by the first basic value, and the decompressed amount data is returned. In this way, the entire query process only needs one compression and one decompression.
  • this specification also provides an integer decoding method for decoding the value of the target compressed column re-encoded in the above manner. Still taking the value of the target compressed column as an integer (referred to as the integer to be decoded) as an example, the method includes the following steps:
  • the effective load is processed to obtain a decoding result corresponding to the integer to be decoded.
  • the decoded target compressed column may be decoded again based on the basic value in the first data page to obtain the original data of the target compressed column.
  • the disclosed embodiment compresses and decompresses data at the granularity of data page and user record, and uses differential coding for preprocessing, which can cover a larger data interval.
  • a base value is selected from all the numbers in the same column, and each number is converted into a difference from the base value. The difference is often smaller than the original value, and the actual storage space required is also less. In this way, large numbers can also be compressed.
  • the disclosed embodiment does not compress the entire data page, but compresses the user records.
  • compression and decompression are performed directly in the database storage layer, so that the overall size of the data page before and after compression can remain unchanged.
  • the size of the data page before and after compression is 16K.
  • the operating system does not need to support sparse files and hole punching functions to determine the actual size of each data page after compression; on the other hand, each data page can accommodate more user records. Therefore, after caching data pages of the same size into the buffer pool, more user records can be read from the data pages in the buffer pool, thereby reducing disk I/O and improving system performance.
  • each target compression column in the same user record can be compressed and decompressed by reading the basic value record. In this way, the user record can be read row by row and compressed and decompressed, reducing the number of IO times during the compression and decompression process.
  • an embodiment of the present disclosure further provides a data processing device, the device comprising:
  • An acquisition module 702 is used to acquire a target compression column of multiple user records on a first data page; the target compression column includes digital data;
  • a determination module 704 is used to determine a first basic value corresponding to each target compression column, where the first basic value corresponding to the target compression column is used to perform differential encoding on the target compression column of each user record on the first data page;
  • the generating module 706 is configured to generate a first basic value record based on the first basic value corresponding to each target compressed column, and store the first basic value record in the first data page.
  • the functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the method described in the above method embodiments.
  • the specific implementation can refer to the description of the above method embodiments, and for the sake of brevity, it will not be repeated here.
  • the present disclosure also provides a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in any of the above embodiments when executing the program.
  • FIG8 shows a more specific schematic diagram of the hardware structure of a computing device provided by an embodiment of the present disclosure.
  • the device may include: a processor 802, a memory 804, an input/output interface 806, a communication interface 808 and a bus 810.
  • the processor 802, the memory 804, the input/output interface 806 and the communication interface 808 are connected to each other through the bus 810 in communication with each other.
  • the processor 802 can be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solutions provided by the embodiments of the present disclosure.
  • the processor 802 can also include a graphics card, which can be a Nvidia titan X graphics card or a 1080Ti graphics card, etc.
  • the memory 804 can be implemented in the form of a read-only memory (ROM), a random access memory (RAM), a static storage device, a dynamic storage device, etc.
  • the memory 804 can store an operating system and other application programs.
  • the relevant program code is stored in the memory 804 and is called and executed by the processor 802.
  • the input/output interface 806 is used to connect the input/output module to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or it can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, etc.
  • the communication interface 808 is used to connect a communication module (not shown) to realize communication interaction between the device and other devices.
  • the communication module can realize communication through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
  • the bus 810 comprises a pathway for transmitting information between the various components of the device (eg, the processor 802, the memory 804, the input/output interface 806, and the communication interface 808).
  • the device may also include other components necessary for normal operation.
  • the above device may also only include the components necessary for implementing the embodiments of the present disclosure, and does not necessarily include all the components shown in the figure.
  • the present disclosure also provides a computer-readable storage medium on which a computer program is stored.
  • a computer program is stored on which a computer program is stored.
  • the program is executed by a processor, the method described in any of the above embodiments is implemented.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media and can store information by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory media such as modulated data signals and carrier waves.
  • the embodiments of the present disclosure can be implemented by means of software plus a necessary general hardware platform. Based on such an understanding, the technical solution of the embodiments of the present disclosure is essentially or the part that contributes to the prior art can be embodied in the form of a software product, which can be stored in a storage medium such as ROM/RAM, a disk, an optical disk, etc., including several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment of the present disclosure or some parts of the embodiments.
  • a computer device which can be a personal computer, a server, or a network device, etc.
  • a typical implementation device is a computer, which may be in the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email transceiver, a game console, a tablet computer, a wearable device or a combination of any of these devices.
  • each embodiment in the present disclosure is described in a progressive manner, and the same and similar parts between the embodiments can be referred to each other, and each embodiment focuses on the differences from other embodiments.
  • the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.
  • the device embodiment described above is merely schematic, wherein the modules described as separate components may or may not be physically separated, and the functions of each module can be implemented in the same one or more software and/or hardware when implementing the embodiment of the present disclosure. It is also possible to select some or all of the modules according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement it without paying creative labor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

一种数据处理方法和装置、介质和计算机设备,所述方法包括:获取第一数据页上的多条用户记录的目标压缩列;所述目标压缩列包括数字类型的数据;确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码;基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。

Description

数据处理方法和装置、介质和计算机设备
本申请要求于2023年01月09日提交中国专利局、申请号为202310029941.7、申请名称为“数据处理方法和装置、介质和计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及数据处理技术领域,尤其涉及数据处理方法和装置、介质和计算机设备。
背景技术
数据库一般使用数据页(data page)作为最小存储单位。为了降低存储开销,并且提高数据系统的性能,通常会对数据页进行压缩。然而,相关技术的数据压缩方式一般是把数据页中待压缩数字高位的0去掉,以变长的方式存储剩余部分的有效数字。但是,这种方式对于数值较大的数字而言压缩效果较差。
发明内容
第一方面,本公开实施例提供一种数据处理方法,所述方法包括:获取第一数据页上的多条用户记录的目标压缩列;所述目标压缩列包括数字类型的数据;确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码;基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。
在一些实施例中,所述基于各个目标压缩列对应的第一基础值生成第一基础值记录,包括:若所述第一数据页的剩余存储空间小于待插入所述第一数据页的下一条用户记录占用的存储空间,释放所述第一数据页中无效用户记录占用的存储空间,并在释放存储空间的过程中确定所述多条用户记录中的有效用户记录的压缩率;若所述有效用户记录的压缩率满足预设的第一压缩率条件,分别基于各个目标压缩列对应的第一基础值对所述有效用户记录中对应的目标压缩列进行压缩,并基于各个目标压缩列对应的第一基础值生成所述第一基础值记录。
在一些实施例中,所述方法还包括:在释放所述第一数据页中无效用户记录占用的存 储空间之后,若所述第一数据页的剩余存储空间仍小于待插入所述第一数据页的下一条用户记录占用的存储空间,将所述多条用户记录中的目标用户记录迁移到第二数据页中。
在一些实施例中,所述将所述多条用户记录中的目标用户记录迁移到第二数据页中,包括:若所述目标用户记录的压缩率不满足预设的第二压缩率条件,将所述目标用户记录以未压缩状态迁移到所述第二数据页中;和/或若所述目标用户记录的压缩率满足预设的第二压缩率条件,将所述目标用户记录以压缩状态迁移到所述第二数据页中,并在所述第二数据页中存储第二基础值记录;所述第二基础值记录中包括每个所述目标压缩列对应的第二基础值,目标压缩列对应的第二基础值用于对所述目标基础值记录的该目标压缩列进行压缩。
在一些实施例中,所述将所述目标用户记录以压缩状态迁移到所述第二数据页中,包括:若所述目标用户记录为所述第一数据页中经过压缩的用户记录,基于所述第一基础值记录对所述目标用户记录进行解压缩;将解压后的目标用户记录迁移到所述第二数据页中。
在一些实施例中,所述方法还包括:在基于各个目标压缩列对应的第一基础值生成第一基础值记录之后,在所述第一数据页中记录第一标志位,所述第一标志位用于表征所述第一数据页中包括第一基础值记录;和/或在对所述第一数据页中的用户记录的目标压缩列进行压缩之后,在该用户记录中记录第二标志位,所述第二标志位用于表征该用户记录为压缩后的用户记录。
在一些实施例中,所述基于各个目标压缩列对应的第一基础值生成第一基础值记录,包括:生成每个目标压缩列对应的压缩条目,目标压缩列对应的压缩条目包括该目标压缩列的标识信息和该目标压缩列的基础值;将所述第一数据页中目标压缩列的数量和各个目标压缩列对应的压缩条目封装成所述第一基础值记录。
在一些实施例中,所述第一基础值记录中还包括以下至少一种信息:压缩算法的版本号;用于对所述第一基础值记录进行校验的校验信息。
在一些实施例中,在将所述第一基础值记录存储到所述第一数据页中之后,所述方法还包括:基于所述第一基础值记录对待插入所述第一数据页的新的用户记录进行压缩;将压缩后的新的用户记录插入所述第一数据页;和/或基于所述第一基础值记录对所述第一数据页中的待更新的用户记录进行解压缩;在对解压缩后的待更新的用户记录进行更新之后,基于所述第一基础值记录对更新后的用户记录进行压缩。
在一些实施例中,所述多条用户记录的目标压缩列的数字压缩后的数值关系与压缩前的数值关系相同;所述方法还包括:在对所述第一数据页上的所述多条用户记录的目标压缩列进行差分编码之后,获取待查询数据;基于所述待查询数据所属的目标压缩列对应的第一基础值对所述待查询数据进行压缩;基于压缩后的待查询数据从压缩后的所述多条用户记录中确定目标记录。
在一些实施例中,所述方法还包括:基于所述第一基础值记录对压缩后的用户记录的各个目标压缩列进行解压缩。
第二方面,本公开实施例提供一种数据处理装置,所述装置包括:获取模块,用于获取第一数据页上的多条用户记录的目标压缩列;所述目标压缩列包括数字类型的数据;确定模块,用于确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码;生成模块,用于基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。
第三方面,本公开实施例提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现本公开任一实施例所述的方法。
第四方面,本公开实施例提供一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现本公开任一实施例所述的方法。
本公开实施例通过获取每个目标压缩列的第一基础值,并基于第一基础值对第一数据页的每条用户记录的目标压缩列进行差分编码。一方面,由于同一目标压缩列的数据具有相同的属性,因此,一般具有相似的数据分布特征,因此,通常能够为处于同一目标压缩列的数据确定出第一基础值;另一方面,采用第一基础值对目标压缩列中要编码的数字转换为与基础值的差值,通常差值比原始值更小,高位有更多的0可以去掉,因此,把数值较大的数字转换成了数值较小的数字,从而提高了压缩率。此外,通过将基础值记录到第一基础值记录中,并在第一数据页中存储第一基础值记录,因此,在解压时可以直接从第一数据页中读取第一基础值记录来对压缩后的各个目标压缩列进行解压缩。
应当理解,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。
附图说明
此处的附图被并入说明书中并构成本公开的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1是本公开实施例的数据处理方法的流程图。
图2是第一数据页上的用户记录的示意图。
图3是差分编码后的用户记录的示意图。
图4是本公开实施例的第一数据页压缩前后的具体结构示意图。
图5是本公开实施例的压缩过程的流程图。
图6是本公开实施例的解压缩过程的流程图。
图7是本公开实施例的数据处理装置的示意图。
图8是本公开实施例的计算机设备的示意图。
具体实施方式
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合。
应当理解,尽管在本公开可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。
为了使本技术领域的人员更好的理解本公开实施例中的技术方案,并使本公开实施例的上述目的、特征和优点能够更加明显易懂,下面结合附图对本公开实施例中的技术方案 作进一步详细的说明。
作为数据存储的基础软件,数据库一直以来在许多行业有着广泛的应用。随着互联网的普及以及各种新技术的应用,产生的业务数据日益增多,数据库中存储的数据也越来越多。数据库一般使用固定大小的数据页(data page)作为最小存储单位,常用的数据组织方式有堆表(heap-organized table)和索引表(index-organized table)。基于数据页的数据压缩能够降低存储开销,并提高数据系统的性能。
在数据库系统支持的数据类型中,数字类型的使用非常广泛,对该类型的数据进行压缩能够有效减少存储空间。在一些压缩方案中,在存储一个数字时,先把高位的0去掉,只存储低位的有效部分。这样一来,就把数字由定长存储改为变长存储,实际占用的空间由数字的数值大小决定。然而,这种压缩方式对于数值较大的数字压缩效果并不理想。
以整型数字为例,在压缩前,固定采用4字节表示一个数字,例如,十进制数1在压缩前可以表示为:0000 0000 0000 0000 0000 0000 0000 0001;十进制数2147483647在压缩前可以表示为:1111 1111 1111 1111 1111 1111 1111 1111。可以看出,无论存储的数字是1,还是最大值2147483647,都占用4字节(不足4字节时高位补0),造成了存储空间的浪费。针对这种情况,在一些压缩方案中,把待压缩数字高位的0去掉,以header+payload的变长的方式存储剩余部分的有效数字,header占用一个字节,存储剩余部分的有效数字占用的字节数。按照这种方法,十进制数1去掉高位3字节的0后,剩余有效位为1(二进制表示为0000 0001)字节,可以表示为0000 0001 0000 0001,共占用2字节;十进制数2147483647高位没有0可以去掉,剩余有效位仍为4(二进制表示为0000 0100)字节,可以表示为0000 0100 1111 1111 1111 1111 1111 1111 1111 1111,共占用5字节。显然,如果数字的数值较小,这种方法很有效果,但对数值大的数字的压缩效果并不理想。并且,以这种压缩方式压缩的数字只能先解压缩才能比较大小,并不支持压缩后的数字直接比较大小,以及更高效的压缩后的数字按字节比较大小。
基于此,本公开实施例提出一种数据处理方法,参见图1,所述方法包括:
步骤S102:获取第一数据页上的多条用户记录(user record)的目标压缩列;所述目标压缩列包括数字;
步骤S104:确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码(base-delta encoding);
步骤S106:基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。
在步骤S102中,第一数据页可以是数据库中一个大小固定(例如,16K)的数据页。第一数据页上可以包括多条用户记录,例如,用于存储用户身份信息的记录、用于存储用户交易信息的记录、用于存储用户学分的记录等,对此本公开不做限制。每条用户记录可以包括一个或多个相同的列(column),例如,在用于存储用户身份信息的记录中,每条用户记录可以包括用户的ID、姓名、年龄、性别、身份证号等列;在用于存储用户交易信息的记录中,每条用户记录可以包括用户的ID、交易时间、交易金额等列;在用于存储用户学分的记录中,每条用户记录可以包括用户的学号、所选课程的名称以及所选课程的成绩等列。
可以从第一数据页上包括的各个列中选出目标压缩列。目标压缩列是包括数字的列,其中,数字可以是用户ID、交易金额等数据。例如,在一些应用场景中,第一数据页上的用户记录可以是用于存储用户身份信息的记录,用户记录中的身份证可以包括数字,也可以包括字母(例如,X)。可以遍历第一数据页上的所有列信息,根据每一列的类型确定该列是否为目标压缩列。在这种情况下,如果某一列数据包括数字,则该列为目标压缩列;否则该列不为目标压缩列。进一步地,在数字的存储长度大于预设长度阈值(例如,3字节)时,压缩效果比较好,因此,也可以同时基于每一列数据的类型和该列数据的存储长度确定该列是否为目标压缩列。在这种情况下,如果某一列数据包括数字,且该列中数字的存储长度大于预设长度阈值,则该列为目标压缩列;只要数据类型和存储长度两个条件中的任意一个条件不满足,则该列不为目标压缩列。
第一数据页中的目标压缩列的数量可以大于或等于1,在目标压缩列的数量大于1的情况下,可以获取第一数据页上的每个目标压缩列。以第一数据页上的用户记录为用于存储用户交易信息的记录为例,其中,用户的ID对应的列和交易金额对应的列都可以是目标压缩列,因此,可以分别从第一数据页上获取这两个目标压缩列。
在步骤S104中,可以为每个目标压缩列确定对应的第一基础值(base)。一个目标压缩列对应的第一基础值用于与该目标压缩列包括的数字进行相减,得到一个差值(delta)。由于差值通常比原始值更小。这样一来,就把数值较大的数字转换成了数值较小的数字,从而节省了存储空间。
第一基础值的选取有多种策略,可以是第一数据页的各条用户记录中目标压缩列的最小值,平均值、中间值等。在目标压缩列数量大于1时,各个目标压缩列对应的第一基础值可以采用相同或不同的策略来选取。可以遍历第一数据页上所有的用户记录,从中提取出目标压缩列的数据进行汇总和计算,确定目标压缩列的数据的数据分布特征,并基于该数据分布特征确定第一基础值。例如,在将最小值作为第一基础值时,可以对同一目标压缩列的所有数据(数字)进行比较,将该目标压缩列中取值最小的数字作为第一基础值。在将平均值作为第一基础值时,可以对同一目标压缩列的所有数据(数字)进行求和(SUM)运算,并将求和结果除以该目标压缩列中数字的个数,得到第一基础值。在将中间值作为第一基础值时,可以对同一目标压缩列的所有数据(数字)进行排序,如果有奇数个数字,选取排序结果中的第(n+1)/2个数字作为第一基础值;如果有偶数个数字,选取排序结果中的第n/2个数字作为第一基础值。
图2示出了用于存储用户交易信息的用户记录,该用户记录为压缩前的用户记录,包括用户的ID、姓名和金额三列数据。其中,ID所在的列和金额所在的列为目标压缩列。以ID所在的列为例,在将最小值作为第一基础值时,ID所在的列对应的第一基础值为100。在将平均值作为第一基础值时,ID所在的列对应的第一基础值为175350(即100,101,……,600的平均值)。在将中间值作为第一基础值时,ID所在的列对应的第一基础值为350。以将最小值作为第一基础值为例,参见图3,是将目标压缩列中的数据减去对应的基础值后得到的差值。可以看出,减去第一基础值后,大部分数据记录中目标压缩列对应的差值比原始值小,这样就可以减少存储这些数字实际需要的空间。
在步骤S106中,可以把所有选出的目标压缩列对应的第一基础值存储到第一基础值记录中。具体来说,可以生成每个目标压缩列对应的压缩条目,其中,每个目标压缩列对应的压缩条目包括该目标压缩列的标识信息(例如,目标压缩列的列号)和该目标压缩列的基础值。然后,可以将第一数据页中目标压缩列的数量和各个目标压缩列对应的压缩条目封装成第一基础值记录。假设第i个目标压缩列的标识信息记为col_no_i,目标压缩列的基础值记为base_i,则可以获取每个目标压缩列对应的压缩条目<col no_1,base_1>,<col no_2,base_2>,……,<col no_m,base_m>封装成第一基础值记录。在上述用于存储用户交易信息的用户记录中,ID所在的列和金额所在的列均为目标压缩列,并且在将最小值作为第一基础值的情况下,ID所在的列对应的第一基础值为100,金额所在的列对应的第一基 础值为10000,因此,可以将<col no_1=1,base_1=100>和<col no_2=3,base_2=10000>两个压缩条目封装成第一基础值记录。
应当说明的是,在一些情况下,并非第一数据页中的每个列都属于目标压缩列。例如,在图2和图3所示的实施例中,姓名对应的列并不是目标压缩列。因此,姓名对应的列不包括压缩条目,表示该列数据无需参加第一数据页的压缩。
在一个更为具体的实施例中,第一基础值记录中可以包括以下信息:
压缩算法的版本号(Version),可以采用1个字节的数据来表示该信息;其中,压缩算法可以包括差分编码,还可以包括其他算法,例如,可以通过差分编码对每条用户记录中的目标压缩列进行压缩处理,然后,再采用其他编码方法对经差分编码后的差值进行压缩,因此,压缩算法的版本号可以指包括差分编码和其他算法在内的整套压缩算法的版本号;
用于对所述第一基础值记录进行校验的校验信息(Magic number),可以采用1个字节的数据来表示该信息;
压缩条目的数量(#columns),用于记录第一基础值记录中包含的<col no,base>的数量,可以采用2个字节的数据来表示该信息,从而在第一基础值记录中最多能够保存65536个目标压缩列对应的第一基础值。
压缩条目<col no,base>,包括目标压缩列的标识信息及其对应的第一基础值,可以采用紧凑的方式存储,其长度取决于第一基础值的大小。
可以理解,上述实施例仅为示例性说明,在实际应用中,第一基础值记录中可以仅包括上述的部分信息,例如,仅包括各个目标压缩列对应的压缩条目和压缩条目的数量,或者仅包括各个目标压缩列对应的压缩条目、压缩条目的数量和压缩算法的版本号,或者仅包括各个目标压缩列对应的压缩条目、压缩条目的数量和校验信息。此外,上述各种信息所采用的字节数可以根据需要设置为其他取值。
第一基础值记录的总长度是可变的,其长度取决于参与压缩的目标压缩列的个数以及各个目标压缩列对应的基础值的大小,不同的数据页可以有不同长度的第一基础值记录。构建第一基础值记录时,可以遍历当前数据页上的所有用户记录,对每个目标压缩列的数据分布特征进行分析,根据数据分布特征最终选择一个合适的值作为第一基础值。
下面对生成第一基础值记录的具体过程的实施例进行举例说明。
在一些实施例中,最初往第一数据页中插入用户记录或者更改用户记录时,可以不对用户记录进行数据压缩。其中,“最初”可以是指第一数据页首次存满(即第一数据页的剩余存储空间不足以存入新的用户记录)之前,或者第一数据页中的用户记录的数量达到预设数量阈值之前。在第一数据页中数据记录数量比较少时,通常难以准确地确定各个目标压缩列中的数据的分布特征,从而难以选取出合适的第一基础值来对目标压缩列中的数据进行压缩。因此,对最初插入或更改的用户记录不进行压缩,能够减少因难以确定数据的分布特征而导致选取的第一基础值不合适的情况。
在第一数据页的剩余存储空间小于待插入该第一数据页的下一条用户记录占用的存储空间时,可以对第一数据页进行数据页整理(page reorganize)或数据页分裂(page split),以尝试在第一数据页中腾出足够的连续空闲的存储空间。在上述过程中,可以生成第一基础值记录,并对用户记录进行压缩。
这里涉及到的操作主要有两个,首先尝试对第一数据页进行数据页整理,即释放第一数据页中无效用户记录占用的存储空间。其中,无效用户记录可以是已经被删除的用户记录。在释放无效用户记录占用的存储空间之后,可以尝试对第一数据页中的有效用户记录(即第一数据页中剩余的用户记录)进行压缩。如果有效用户记录的压缩率满足预设的第一压缩率条件,则分别基于各个目标压缩列对应的第一基础值对所述有效用户记录中对应的目标压缩列进行压缩,并基于各个目标压缩列对应的第一基础值生成所述第一基础值记录。
其中,第一压缩率条件用于表征第一数据页中的有效用户记录是否适合压缩,即,压缩后能否有效减少对第一数据页中的存储空间的占用。可选地,第一压缩率条件可以是各条有效用户记录的总的压缩率,可以基于各条有效用户记录压缩后的长度与压缩前的长度的比值来确定。如果所述总的压缩率大于预设的压缩率阈值,则可以确定不满足第一压缩率条件,否则确定满足第一压缩率条件。或者,第一压缩率条件还可以是压缩率小于预设压缩率阈值的有效用户记录的数量。如果所述数量大于预设的数量阈值,则可以确定满足第一压缩率条件,否则确定不满足第一压缩率条件。第一压缩率条件还可以是其他条件,此处不再一一列举。
在尝试对第一数据页中的有效用户记录时,可以生成临时数据页,基于各个目标压缩列对应的第一基础值对第一数据页中每条用户记录的对应目标压缩列进行压缩,然后将压 缩后的用户记录写入临时数据页。然后,确定临时数据页中压缩后的各条用户记录是否满足第一压缩率条件。如果满足,可以从第一数据页中删除压缩前的用户记录,并将临时数据页中压缩后的用户记录写入第一数据页。如果不满足第一压缩率条件,可以直接删除临时数据页,仍然在第一数据页中存储未压缩的用户记录。例如,一些间隔较大的数字(例如,100000,100000,200000,300000)不适合进行压缩,一般不满足第一压缩率条件,如果目标压缩列中包括的数字具有这种特征,则可以不进行压缩,直接在第一数据页中存储未压缩的用户记录。
在释放第一数据页中无效用户记录占用的存储空间之后,若第一数据页的剩余存储空间仍小于待插入第一数据页的下一条用户记录占用的存储空间,可以对第一数据页进行数据页分裂,即,将多条用户记录中的目标用户记录迁移到第二数据页中。其中,目标用户记录可以包括第一数据页中的部分用户记录,选择目标用户记录的方式可以有多种,例如,可以将第一数据页中预设数量条用户记录迁移到第二数据页中,所述预设数量条用户记录可以是第一数据页中连续的多条用户记录,也可以是具有相似的数据分布特征的多条用户记录,或者满足其他条件的多条用户记录。
由于第一数据页中的用户记录可能经过压缩,也可能未经过压缩,因此,在向第二数据页迁移目标用户记录时,可能发生以下几种情况:
(1)若目标用户记录的压缩率满足预设的第二压缩率条件,则可以在第二数据页中存储经过压缩的目标用户记录。此时,如果第一数据页中的目标用户记录未经压缩,则可以将这些未经压缩的目标用户记录压缩后迁移到第二数据页中(即,将目标用户记录以压缩状态迁移到所述第二数据页中);
(2)若目标用户记录的压缩率不满足预设的第二压缩率条件,则可以在第二数据页中存储未经压缩的目标用户记录。此时,如果第一数据页中的目标用户记录未经压缩,则可以将这些未经压缩的目标用户记录直接迁移到第二数据页中,而不进行压缩(即,将目标用户记录以未压缩状态迁移到所述第二数据页中);
(3)若目标用户记录的压缩率满足预设的第二压缩率条件,则可以在第二数据页中存储经过压缩的目标用户记录。此时,如果第一数据页中的目标用户记录经过压缩,则可以将这些经过压缩的目标用户记录直接迁移到第二数据页中,以便在第二数据页中存储压缩后的目标用户记录(即,将目标用户记录以压缩状态迁移到所述第二数据页中);
(4)若目标用户记录的压缩率不满足预设的第二压缩率条件,则可以在第二数据页中存储未经压缩的目标用户记录。此时,如果第一数据页中的目标用户记录经过压缩,则可以将这些经过压缩的目标用户记录解压缩后迁移到第二数据页中,以便在第二数据页中存储未压缩的目标用户记录(即,将目标用户记录以未压缩状态迁移到所述第二数据页中)。对目标用户记录的解压缩可以基于第一基础值记录实现。
其中,第二压缩率条件可以是目标用户记录的压缩率,该压缩率可以基于目标用户记录压缩后的长度与压缩前的长度之间的比值来确定。在目标用户记录的数量大于1时,可以分别确定每条目标用户记录是否满足第二压缩率条件。如果目标用户记录的压缩率小于某个阈值(如80%,可配置),可以确定满足第二压缩率条件,否则确定不满足第二压缩率条件。压缩操作是数据相关的,数据的分布特征直接影响压缩率。由于用户记录中包含的目标压缩列中数据分布特征的不确定性,并不是每条用户记录在压缩后都能获得可观的压缩率,而无效(即不满足第二压缩率条件)的压缩和后续的解压缩操作会消耗额外的CPU,需要尽量避免。因此,本公开实施例在压缩目标用户记录时设置了第二压缩率条件,如果满足第二压缩率条件,则认为压缩是有效的,可以在第二数据页上存储压缩后的目标用户记录;反之,则认为压缩是无效的,可以在第二数据页上存储未压缩的目标用户记录。
因此,每条目标用户记录有单独压缩和不压缩的灵活性。在一些实施例中,在对目标用户记录进行压缩后,可以在该目标用户记录中记录第二标志位,用于表征该目标用户记录为压缩后的用户记录。第二标志位可以采用1比特的二进制数来表示,例如,一条用户记录中的第二标志位为“1”,表示该用户记录为压缩后的用户记录;一条用户记录中的第二标志位为“0”,表示该用户记录为未经压缩的用户记录。或者,如果一条用户记录中的第二标志位为非空,表示该用户记录为压缩后的用户记录;一条用户记录中的第二标志位为空,表示该用户记录为未经压缩的用户记录。除了目标用户记录之外,第一数据页中的每条用户记录中也可以包括第二标志位,用于表征该用户记录是否为压缩后的用户记录。
若第二数据页中包括压缩后的目标用户记录,还可以在第二数据页中存储第二基础值记录,该第二基础值记录中包括每个所述目标压缩列对应的第二基础值。第二基础值可以与第一基础值相同,也可以与第一基础值不同。假设第一数据页中从用户记录如图2所示,在进行数据页分裂时,假设将ID为300到600的用户记录作为目标用户记录迁移到第二数据页中,对于ID所在的目标压缩列,可以将300作为第二基础值,即,第二基础值与 第一基础值不同。假设将ID为100到299的用户记录作为目标用户记录迁移到第二数据页中,对于ID所在的目标压缩列,可以仍将100作为第二基础值,即,第二基础值与第一基础值相同。
第二基础值记录中可包括的信息与第一基础值记录相同,例如,可以包括差分编码算法的版本号、用于对所述第二基础值记录进行校验的校验信息、压缩条目的数量和各个目标压缩列对应的压缩条目等,上述信息的详细说明可参见第一基础值记录中的相关说明,此处不再赘述。
在生成第一基础值记录之后,还可以把第一基础值记录存储到第一数据页的指定位置(例如,第一数据页上的第一条用户记录之前)。图4展示了压缩前后第一数据页的布局,可以看出,在压缩前,第一数据页中仅包括页头(page header)、用户记录和页尾(page trailer);在压缩后,第一数据页上新增了一条基础值记录,该基础值记录里保存了每个压缩列使用的基础值(base),压缩和解压缩的时候都需要用到这些信息。
在一些实施例中,并非所有的第一数据页中都会包括第一基础值记录。例如,可以仅在第一数据页上的用户记录首次被压缩时,在该第一数据页中构建并保存第一基础值记录。在为一个第一数据页生成第一基础值记录之后,可以在该第一数据页中记录第一标志位,用于表征该第一数据页中包括第一基础值记录。例如,可以将第一标志位记录在该第一数据页的页头中。如果一个第一数据页上的用户记录不需要压缩,则该第一数据页不包括第一基础值记录。因此,数据页有单独压缩和不压缩的灵活性。其中,在第一数据页中记录第一标志位可以是将第一数据页中的第一标志位从无效状态更改为有效状态,例如,采用1比特表示第一标志位,第一标志位为“0”时为无效状态,第一标志位为“1”时为有效状态。因此,在第一数据页中记录第一标志位可以是将第一数据页中的第一标志位从“0”更改为“1”。或者,在第一数据页中记录第一标志位也可以是在第一数据页中的第一标志位为空时,将非空的第一标志位写入第一数据页。
除了上面提到的生成第一基础值记录时批量对用户记录进行压缩外,在向第一数据页插入新的用户记录时,可以基于第一基础值记录对待插入第一数据页的新的用户记录进行压缩,并将压缩后的新的用户记录插入第一数据页。此外,在对第一数据页中的用户记录进行更新时,可以基于第一基础值记录对第一数据页中的待更新的用户记录进行解压缩,并在对解压缩后的待更新的用户记录进行更新之后,基于第一基础值记录对更新后的用户 记录重新进行压缩。
参见图5,在插入新的用户记录时,可以先从第一数据页的指定存储位置读取第一基础值记录(S502)。针对新的用户记录中的每一列,可以判断第一基础值记录中是否存在该列对应的第一基础值(S504)。如果是,则采用该第一基础值对待插入的新的用户记录中对应的列进行压缩(S506),并判断压缩是否有效(S508),例如,判断是否满足第二压缩率条件。如果压缩有效,则采用该列的压缩形式(即,对该列的数据进行压缩)(S510),并将该列的压缩形式插入第一数据页(S514);否则,采用该列的未压缩形式(即,不对该列的数据进行压缩)(S512),并将该列的未压缩形式插入第一数据页(S514)。如果第一基础值记录中不存在某一列对应的第一基础值,则表示该列不需要压缩,从而可以直接将该列的数据插入第一数据页(S514)。
在对第一数据页中已有的用户记录进行更新,得到更新后的用户记录之后,针对更新的用户记录中的每一列,同样可以判断第一基础值记录中是否存在该列对应的第一基础值(S504)。如果是,采用第一基础值对更新后的用户记录中对应的列进行压缩(S506),并判断压缩是否有效(S508)。如果有效,则采用该列的压缩形式(即,对该列的数据进行压缩)(S510),并将该列的压缩形式插入第一数据页(S514);否则,采用该列的未压缩形式(即,不对该列的数据进行压缩)(S512),并将该列的未压缩形式插入第一数据页(S514)。如果第一基础值记录中不存在某一列对应的第一基础值,则表示该列不需要压缩,从而可以直接将该列的数据插入第一数据页(S514)。
在对第一数据页中的用户记录进行查询操作时,如果该用户记录已经被压缩,则需要对其进行解压缩,具体过程如图6所示。针对待查询的一条用户记录,可以确定从第一数据页的指定存储位置读取第一基础值记录(S602),并判断待查询列是否经过压缩(S604)。如果是,则采用第一基础值记录中该待查询列对应的第一基础值对该待查询列进行解压缩(S606),然后返回解压缩后的数据(S608);否则,直接返回待查询列中的数据(S608)。
在一些实施例中,在对第一数据页上的用户记录中的目标压缩列进行差分编码之后,还可以基于第一基础值记录对压缩后的用户记录的各个目标压缩列进行解压缩。进一步地,差分编码可以用于对第一数据页上的用户记录进行预处理,然后,还可以进一步采用其他编码算法对预处理之后的目标压缩列的值进行再次编码,将其高位的0去掉后以字节可比较(byte comparable)的方式存储。在这种情况下,进行解压缩时,可以先采用其他编码 算法对应的解码算法对再次编码的目标压缩列的值进行解码,再基于第一基础值对解码后的值进行再次解码,得到未经压缩的目标压缩列的值。
下面对再次编码的方式进行举例说明。本领域技术人员可以理解,以下仅为再次编码的一种可选方式,在其他实施例中,也可以采用其他编码方式进行再次编码,此处不再一一列举。
在一些实施例中,可以确定预处理之后的目标压缩列中有效负载的字节数量;所述有效负载为去掉高位的0后能够表征所述预处理之后的目标压缩列大小的最少数据;生成头部字节;所述头部字节至少包括:长度数据,所述长度数据根据所述有效负载的字节数量得到;将生成的所述头部字节与所述预处理之后的目标压缩列的有效负载进行组合,将组合后的数据作为所述预处理之后的目标压缩列的编码结果。进一步地,所述头部字节至少包括:符号位和长度数据;所述符号位用于表示所述待编码整数的正负。
以预处理之后的目标压缩列的数值是整数为例,在这种情况下,可以将预处理之后的目标压缩列的值称为待编码整数。由于有效负载包括的字节数量和一个整数的大小相关,通过增加头部字节来记录整数有效负载的字节数量,在两个整数的头部字节大小不同的情况下,可以只比较两个数字头部字节的大小就能确定两个数据的大小;在两个整数的头部字节大小相同的情况下,需要进一步比较两个整数的有效负载部分从而知道两个整数的大小。这样,通过上述方法,在实现了压缩的同时,能保证压缩后的整数可以参与排序且可以与其他压缩后的整数比较大小。
针对某个数据对应的压缩后的编码值,还可以将编码值中对应于该数据的第一数据位的字节存储在存储地址的第一地址位,将编码值中对应于该数据的第二数据位的字节存储在存储地址的第二地址位,其中,第一数据位高于第二数据位(即第一数据位为数据的高位,第二数据位为数据的低位),第一地址位低于第二地址位(即第一地址位为存储地址的低位,第二地址位为存储地址的高位)。这样,在对这两个数据进行比较时,可以直接按照地址位从低到高的顺序读取存储地址中存储的编码值,并对读取的编码值进行比较。如果较低的地址位中的编码值相同,则对次低的地址位中的编码值进行比较,以此类推,直到对所有地址位中的编码值均比较完成,或者找到编码值不同的地址位。如果找到编码值不同的地址位,则在该地址位中较大的编码值对应的数据编码前的数值也较大。这种方式称为byte comparable的存储方式。
在上述编码方式中,由于最先比较的是头部字节,因此,可以将头部字节存储在存储地址的最低地址位,例如,头部字节占用1个字节,则可以将头部字节存储在存储地址的第0到7位。然后,再存储有效负载。假设有效负载为16进制数0x 04 03 02 01,其中,“0x”表示十六进制,“04”为该16进制数的最高字节,“01”为该16进制数的最低字节,则可以将0x 04 03 02 01中的“04”、“03”、“02”和“01”依次存储在同一个存储地址中的第8到15位、第16到23位、第24到31位、第32到39位。
在比较数字A和数字B(均按照上述方式存储)两个编码后的数字的大小时,可以先读取两个存储地址中的第0到7位,从而获取两个编码后的数字的头部字节。如果数字A的头部字节比数字B的头部字节大,则表示数字A解码后的数值也比数字B解码后的数值要大。如果数字A的头部字节与数字B的头部字节相等,再比较数字A的有效负载和数字B的有效负载。假设数字A的有效负载为0x 04 03 02 01,数字B的有效负载为0x 0503 02 00,可以先从对应存储地址的最低地址位读取编码值,分别得到“04”和“05”。由于“04”的值小于“05”,且“04”和“05”对应的都是编码前的数字的最高位,因此,可以确定编码前的数字A的最高位小于编码前的数字B的最高位,从而可以在不对数字A和数字B的编码值进行解压缩的情况下,确定出数字A和数字B的大小。在进行数据查询时,可以基于该特性,对多条压缩后的用户记录中的数据与待查询数据进行比较,从而确定命中待查询数据的用户记录,从而无需对各条用户记录进行解压缩。
在一些实施例中,第一数据页中的所述多条用户记录的目标压缩列的数字压缩后的数值关系与压缩前的数值关系相同。所述数值关系包括大于、等于和小于。例如,针对用户记录A和用户记录B这两条用户记录的同一个目标压缩列而言,如果用户记录A的该目标压缩的数据在压缩前的数值大于/等于/小于用户记录B的该目标压缩的数据在压缩前的数值,则用户记录A的该目标压缩的数据在压缩后的数值也相应地大于/等于/小于用户记录B的该目标压缩的数据在压缩后的数值。
因此,在对第一数据页上的所述多条用户记录的目标压缩列进行差分编码之后,可以获取待查询数据;基于待查询数据所属的目标压缩列对应的第一基础值对所述待查询数据进行压缩;基于压缩后的待查询数据从压缩后的所述多条用户记录中确定目标记录。也就是说,在本公开实施例中,并非先将各条压缩后的用户记录先解压,再与待查询数据进行比较,而是将待查询数据进行压缩后,再与压缩后的用户记录进行比较。由于数据压缩前 后的数值关系是相同的,因此,通过对压缩后的数据与压缩后的用户记录进行比较,即可判断出哪条用户记录命中待查询数据。这样,减少了查询过程中解压缩的次数,从而提高了数据查询效率。
例如,针对图2所示的用户记录,可以先对各条用户记录进行编码,在编码后,假设待查询数据是ID为200的用户记录中的金额数据,则可以从第一基础值记录中查询ID所在的列对应的第一基础值,通过该第一基础值对数字“200”进行压缩,再基于数字“200”压缩后的数值与各用户记录中ID所在的列压缩后的数值,确定ID为200的用户记录,然后从第一基础值记录中查询金额所在的列对应的第一基础值,通过该第一基础值对ID为200的用户记录中的金额所在的列进行解压缩,并返回解压缩后的金额数据。这样,整个查询过程只需要进行一次压缩和一次解压缩。
此外,本说明书还提供一种整数解码方法,用于对通过上述方式进行再次编码的目标压缩列的值进行解码,仍以目标压缩列的值是整数(称为待解码整数)为例,该方法包括以下步骤:
从待解码整数的头部字节中确定所述待解码整数的有效负载的字节数量;
基于确定的所述有效负载的字节数量,读取有效负载;
基于预设的整数数据类型,对所述有效负载进行处理得到所述待解码整数对应的解码结果。
在采用上述方式进行解码之后,还可以基于第一数据页中的基础值对解码后的目标压缩列进行再次解码,得到目标压缩列的原始数据。
本公开实施例具有以下优点:
(1)本公开实施例以数据页和用户记录为粒度进行压缩和解压缩,使用差分编码进行预处理,能够覆盖更大的数据区间。在同一列的所有数字中选定一个基础值,把每个数字都转换成与基础值的差值。差值往往比原始值更小,其实际需要的存储空间也更少。这样一来,大数字也可以被压缩。
(2)以byte comparable的方式对差值进行编码。该编码方式的一个重要特点是,编码后的数据依然保留了原始数字的大小关系,也就是说,在不用解压缩的情况下,直接比较编码后的数字就能知道两个数字原始值的大小。这样的好处是,在执行搜索或者排序等操作时,避免了因为要解压缩大量用户记录而产生的额外开销,有助于系统性能的提升。
(3)本公开实施例并非对整个数据页进行压缩,而是对用户记录进行压缩,这样,压缩和解压缩直接在数据库存储层进行,可以使压缩前后数据页整体的大小保持不变,例如,数据页压缩前后均为16K。这样,一方面,操作系统无需支持稀疏文件(sparsefile)和打洞(hole punching)功能来确定每个数据页压缩后的实际大小;另一方面,每个数据页可以容纳更多的用户记录,因此,在将同样大小的数据页缓存到buffer pool之后,可以从buffer pool的数据页中读取到更多的用户记录,从而减少了磁盘I/O,提升了系统性能。
(4)由于基础值记录中存储在多个目标压缩列的基础值,因此,通过读取基础值记录,即可对同一条用户记录中的每个目标压缩列进行压缩和解压缩,这样,就可以按行读取用户记录并进行压缩和解压缩,减少了压缩和解压缩过程中的IO次数。
(5)通过设置第一压缩率条件和第二压缩率条件,仅对适合压缩的用户记录进行压缩,从而减少了CPU消耗,减少了非必要的压缩。
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
参见图7,本公开实施例还提供一种数据处理装置,所述装置包括:
获取模块702,用于获取第一数据页上的多条用户记录的目标压缩列;所述目标压缩列包括数字类型的数据;
确定模块704,用于确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码;
生成模块706,用于基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模块可以用于执行上文方法实施例描述的方法,其具体实现可以参照上文方法实施例的描述,为了简洁,这里不再赘述。
本公开实施例还提供一种计算机设备,其至少包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,处理器执行所述程序时实现前述任一实施例所述的方法。
图8示出了本公开实施例所提供的一种更为具体的计算设备硬件结构示意图,该设备 可以包括:处理器802、存储器804、输入/输出接口806、通信接口808和总线810。其中处理器802、存储器804、输入/输出接口806和通信接口808通过总线810实现彼此之间在设备内部的通信连接。
处理器802可以采用通用的中央处理器(Central Processing Unit,CPU)、微处理器、应用专用集成电路(Application Specific Integrated Circuit,ASIC)、或者一个或多个集成电路等方式实现,用于执行相关程序,以实现本公开实施例所提供的技术方案。处理器802还可以包括显卡,所述显卡可以是Nvidia titan X显卡或者1080Ti显卡等。
存储器804可以采用只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、静态存储设备,动态存储设备等形式实现。存储器804可以存储操作系统和其他应用程序,在通过软件或者固件来实现本公开实施例所提供的技术方案时,相关的程序代码保存在存储器804中,并由处理器802来调用执行。
输入/输出接口806用于连接输入/输出模块,以实现信息输入及输出。输入输出/模块可以作为组件配置在设备中(图中未示出),也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等,输出设备可以包括显示器、扬声器、振动器、指示灯等。
通信接口808用于连接通信模块(图中未示出),以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。
总线810包括一通路,在设备的各个组件(例如处理器802、存储器804、输入/输出接口806和通信接口808)之间传输信息。
需要说明的是,尽管上述设备仅示出了处理器802、存储器804、输入/输出接口806、通信接口808以及总线810,但是在具体实施过程中,该设备还可以包括实现正常运行所必需的其他组件。此外,本领域的技术人员可以理解的是,上述设备中也可以仅包含实现本公开实施例方案所必需的组件,而不必包含图中所示的全部组件。
本公开实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现前述任一实施例所述的方法。
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计 算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。
通过以上的实施方式的描述可知,本领域的技术人员可以清楚地了解到本公开实施例可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解,本公开实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开实施例各个实施例或者实施例的某些部分所述的方法。
上述实施例阐明的系统、装置、模块或单元,具体可以由计算机装置或实体实现,或者由具有某种功能的产品来实现。一种典型的实现设备为计算机,计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任意几种设备的组合。
本公开中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于装置实施例而言,由于其基本相似于方法实施例,所以描述得比较简单,相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,在实施本公开实施例方案时可以把各模块的功能在同一个或多个软件和/或硬件中实现。也可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
以上所述仅是本公开实施例的具体实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本公开实施例原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本公开实施例的保护范围。

Claims (13)

  1. 一种数据处理方法,所述方法包括:
    获取第一数据页上的多条用户记录的目标压缩列;所述目标压缩列包括数字类型的数据;
    确定每个目标压缩列对应的第一基础值,目标压缩列对应的第一基础值用于对所述第一数据页上的各条用户记录的该目标压缩列进行差分编码;
    基于各个目标压缩列对应的第一基础值生成第一基础值记录,并将所述第一基础值记录存储到所述第一数据页中。
  2. 根据权利要求1所述的方法,所述基于各个目标压缩列对应的第一基础值生成第一基础值记录,包括:
    若所述第一数据页的剩余存储空间小于待插入所述第一数据页的下一条用户记录占用的存储空间,释放所述第一数据页中无效用户记录占用的存储空间,并在释放存储空间的过程中确定所述多条用户记录中的有效用户记录的压缩率;
    若所述有效用户记录的压缩率满足预设的第一压缩率条件,基于各个目标压缩列对应的第一基础值生成所述第一基础值记录,并基于各个目标压缩列对应的第一基础值对所述有效用户记录中对应的目标压缩列进行压缩。
  3. 根据权利要求2所述的方法,所述方法还包括:
    在释放所述第一数据页中无效用户记录占用的存储空间之后,若所述第一数据页的剩余存储空间仍小于待插入所述第一数据页的下一条用户记录占用的存储空间,将所述多条用户记录中的目标用户记录迁移到第二数据页中。
  4. 根据权利要求3所述的方法,所述将所述多条用户记录中的目标用户记录迁移到第二数据页中,包括:
    若所述目标用户记录的压缩率不满足预设的第二压缩率条件,将所述目标用户记录以未压缩状态迁移到所述第二数据页中;和/或
    若所述目标用户记录的压缩率满足预设的第二压缩率条件,将所述目标用户记录以压缩状态迁移到所述第二数据页中,并在所述第二数据页中存储第二基础值记录;
    所述第二基础值记录中包括每个所述目标压缩列对应的第二基础值,目标压缩列对应的第二基础值用于对所述目标基础值记录的该目标压缩列进行压缩。
  5. 根据权利要求4所述的方法,所述将所述目标用户记录以压缩状态迁移到所述第二数据页中,包括:
    若所述目标用户记录为所述第一数据页中经过压缩的用户记录,基于所述第一基础值记录对所述目标用户记录进行解压缩;
    将解压后的目标用户记录迁移到所述第二数据页中。
  6. 根据权利要求1所述的方法,所述方法还包括:
    在基于各个目标压缩列对应的第一基础值生成第一基础值记录之后,在所述第一数据页中记录第一标志位,所述第一标志位用于表征所述第一数据页中包括第一基础值记录;和/或
    在对所述第一数据页中的用户记录的目标压缩列进行压缩之后,在该用户记录中记录第二标志位,所述第二标志位用于表征该用户记录为压缩后的用户记录。
  7. 根据权利要求1所述的方法,所述基于各个目标压缩列对应的第一基础值生成第一基础值记录,包括:
    生成每个目标压缩列对应的压缩条目,目标压缩列对应的压缩条目包括该目标压缩列的标识信息和该目标压缩列的基础值;
    将所述第一数据页中目标压缩列的数量和各个目标压缩列对应的压缩条目封装成所述第一基础值记录。
  8. 根据权利要求7所述的方法,所述第一基础值记录中还包括以下至少一种信息:
    压缩算法的版本号;
    用于对所述第一基础值记录进行校验的校验信息。
  9. 根据权利要求1所述的方法,在将所述第一基础值记录存储到所述第一数据页中之后,所述方法还包括:
    基于所述第一基础值记录对待插入所述第一数据页的新的用户记录进行压缩;
    将压缩后的新的用户记录插入所述第一数据页;
    和/或
    基于所述第一基础值记录对所述第一数据页中的待更新的用户记录进行解压缩;
    在对解压缩后的待更新的用户记录进行更新之后,基于所述第一基础值记录对更新后的用户记录进行压缩。
  10. 根据权利要求1所述的方法,所述多条用户记录的目标压缩列的数字压缩后的数值关系与压缩前的数值关系相同;所述方法还包括:
    在对所述第一数据页上的所述多条用户记录的目标压缩列进行差分编码之后,获取待查询数据;
    基于所述待查询数据所属的目标压缩列对应的第一基础值对所述待查询数据进行压缩;
    基于压缩后的待查询数据从压缩后的所述多条用户记录中确定目标记录。
  11. 根据权利要求1所述的方法,所述方法还包括:
    在对所述第一数据页上的用户记录的各个目标压缩列进行差分编码之后,基于所述第一基础值记录对压缩后的用户记录的各个目标压缩列进行解压缩。
  12. 一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现权利要求1至11任意一项所述的方法。
  13. 一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现权利要求1至11任意一项所述的方法。
PCT/CN2024/071228 2023-01-09 2024-01-08 数据处理方法和装置、介质和计算机设备 Ceased WO2024149207A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP24741206.7A EP4650978A4 (en) 2023-01-09 2024-01-08 DATA PROCESSING METHOD AND APPARATUS, AND COMPUTER SUPPORT AND DEVICE

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310029941.7A CN115934730B (zh) 2023-01-09 2023-01-09 数据处理方法和装置、介质和计算机设备
CN202310029941.7 2023-01-09

Publications (1)

Publication Number Publication Date
WO2024149207A1 true WO2024149207A1 (zh) 2024-07-18

Family

ID=86554243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/071228 Ceased WO2024149207A1 (zh) 2023-01-09 2024-01-08 数据处理方法和装置、介质和计算机设备

Country Status (3)

Country Link
EP (1) EP4650978A4 (zh)
CN (1) CN115934730B (zh)
WO (1) WO2024149207A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115934730B (zh) * 2023-01-09 2023-07-18 阿里云计算有限公司 数据处理方法和装置、介质和计算机设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975498A (zh) * 2016-04-27 2016-09-28 华为技术有限公司 数据查询的方法、装置和系统
US20200409925A1 (en) * 2018-08-16 2020-12-31 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus, storage medium and electronic device
CN115203148A (zh) * 2021-04-09 2022-10-18 华为技术有限公司 修改文件的方法和装置
CN115934730A (zh) * 2023-01-09 2023-04-07 阿里云计算有限公司 数据处理方法和装置、介质和计算机设备

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6892207B2 (en) * 2003-01-24 2005-05-10 Hewlett-Packard Development Company, L.P. Method of updating data in a compressed data structure
US9195695B2 (en) * 2006-09-15 2015-11-24 Ibm International Group B.V. Technique for compressing columns of data
US20090006399A1 (en) * 2007-06-29 2009-01-01 International Business Machines Corporation Compression method for relational tables based on combined column and row coding
US9507811B2 (en) * 2008-12-22 2016-11-29 Oracle International Corporation Compressed data page with uncompressed data fields
US8239421B1 (en) * 2010-08-30 2012-08-07 Oracle International Corporation Techniques for compression and processing optimizations by using data transformations
CN101996250B (zh) * 2010-11-15 2012-07-25 中国科学院计算技术研究所 一种基于Hadoop的海量流数据存储和查询方法及系统
US8898118B2 (en) * 2012-11-30 2014-11-25 International Business Machines Corporation Efficiency of compression of data pages
CN103258030B (zh) * 2013-05-09 2016-04-13 西安电子科技大学 基于字典与游长编码的移动设备内存压缩方法
CN103326732B (zh) * 2013-05-10 2016-12-28 华为技术有限公司 压缩数据的方法、解压数据的方法、编码器和解码器
US9798727B2 (en) * 2014-05-27 2017-10-24 International Business Machines Corporation Reordering of database records for improved compression
CN104156407B (zh) * 2014-07-29 2017-08-25 华为技术有限公司 索引数据的存储方法、装置及存储设备
JP6397995B2 (ja) * 2015-04-13 2018-09-26 株式会社日立製作所 データベース管理システム、データベースサーバ、及び、データベース管理方法
US10083203B2 (en) * 2015-08-11 2018-09-25 International Business Machines Corporation Reducing the cost of update, delete, and append-only insert operations in a database
US11030149B2 (en) * 2018-09-06 2021-06-08 Sap Se File format for accessing data quickly and efficiently
US10719253B2 (en) * 2018-10-31 2020-07-21 EMC IP Holding Company LLC Efficient compression of data in storage systems through offloading computation to storage devices
JP2020143928A (ja) * 2019-03-04 2020-09-10 株式会社豊田中央研究所 データ収録装置、データ収録方法、及びデータ収録プログラム
CN111817722B (zh) * 2020-07-09 2025-07-11 北京奥星贝斯科技有限公司 数据压缩方法、装置及计算机设备

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975498A (zh) * 2016-04-27 2016-09-28 华为技术有限公司 数据查询的方法、装置和系统
US20200409925A1 (en) * 2018-08-16 2020-12-31 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus, storage medium and electronic device
CN115203148A (zh) * 2021-04-09 2022-10-18 华为技术有限公司 修改文件的方法和装置
CN115934730A (zh) * 2023-01-09 2023-04-07 阿里云计算有限公司 数据处理方法和装置、介质和计算机设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4650978A1

Also Published As

Publication number Publication date
CN115934730B (zh) 2023-07-18
EP4650978A1 (en) 2025-11-19
CN115934730A (zh) 2023-04-07
EP4650978A4 (en) 2026-04-15

Similar Documents

Publication Publication Date Title
US11755565B2 (en) Hybrid column store providing both paged and memory-resident configurations
CN103326732B (zh) 压缩数据的方法、解压数据的方法、编码器和解码器
US20160147776A1 (en) Altering data type of a column in a database
CN119149890A (zh) 一种矩阵计算装置、方法、系统、电路、芯片及设备
US11704286B2 (en) High-density compression method and computing system
CN106570018A (zh) 序列化与反序列化的方法、装置、系统以及电子设备
CN107565971A (zh) 一种数据压缩方法及装置
CN101751451B (zh) 一种中文数据压缩及解压缩方法及相关设备
CN113742056A (zh) 一种数据存储方法、装置、设备及计算机可读存储介质
CN114547030B (zh) 多级时序数据压缩方法、装置、电子设备及存储介质
CN115483935A (zh) 一种数据处理方法及装置
US20160226516A1 (en) Non-transitory computer-readable recording medium, compression method, decompression method, compression device, and decompression device
CN105320669B (zh) 数据存储、读取方法及数据存储、读取装置
CN116185305A (zh) 业务数据存储方法、装置、计算机设备和存储介质
CN116208168A (zh) 一种数据压缩方法、数据解压方法及装置
WO2024149207A1 (zh) 数据处理方法和装置、介质和计算机设备
US10885074B2 (en) Memory optimization system for inverted indexes
US20220199202A1 (en) Method and apparatus for compressing fastq data through character frequency-based sequence reordering
CN108880559B (zh) 数据压缩方法、数据解压缩方法、压缩设备及解压缩设备
WO2024138981A1 (zh) 数据压缩和解压缩方法、装置、电子设备及存储介质
AU2017248412A1 (en) Information processing apparatus, and data management method
CN107291832A (zh) 一种基于列表存储结构的数据存储方法
CN104133883B (zh) 电话号码归属地数据压缩方法
CN115765754B (zh) 一种数据编码方法及一种编码数据比较方法
WO2025242037A1 (zh) 数据压缩方法、数据处理方法及装置、电子设备、介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24741206

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 11202503059Q

Country of ref document: SG

WWP Wipo information: published in national office

Ref document number: 11202503059Q

Country of ref document: SG

NENP Non-entry into the national phase

Ref country code: DE

WWP Wipo information: published in national office

Ref document number: 2024741206

Country of ref document: EP