WO2021208612A1 - Method and device for data processing - Google Patents

Method and device for data processing

Info

Publication number
WO2021208612A1
WO2021208612A1 PCT/CN2021/078390 CN2021078390W WO2021208612A1 WO 2021208612 A1 WO2021208612 A1 WO 2021208612A1 CN 2021078390 W CN2021078390 W CN 2021078390W WO 2021208612 A1 WO2021208612 A1 WO 2021208612A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
mask
word
text
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2021/078390
Other languages
English (en)
French (fr)
Inventor
廖亿
李博文
郑豪
蒋欣
刘群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP21789181.1A (patent EP4131020A4)
Publication of WO2021208612A1
Priority to US17/964,165 (patent US12608606B2)

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/274Converting codes to words; Guess-ahead of partial word inputs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method and device for data processing.
  • Natural language processing (NLP) is a technology that allows computers to understand and process human natural language, and is an important technical means of realizing artificial intelligence.
  • The pretrained language model (PLM) is an important general-purpose model in the field of NLP that has emerged in recent years.
  • Training schemes for the PLM are a research hotspot in this field.
  • Training schemes for the PLM can be improved in two directions: first, improving the natural language understanding ability of the PLM; second, speeding up model training (that is, speeding up model convergence).
  • A commonly used training scheme for the PLM is called the masked language model (MLM).
  • The training principle of the MLM is to enable the PLM to learn the ability to capture the contextual information of text.
  • In the MLM training scheme, a training sample of the PLM is text that has been masked, that is, a sentence in which part of the text is replaced with a special mark symbol (for example, [MASK]). For example, the original text is "Today is a sunny Saturday" and the masked text is "Today [MASK] is sunny [MASK] Saturday"; the masked text is input into the PLM, and the PLM needs to predict the words that were replaced by [MASK].
  • The training samples of the PLM can be called mask training samples.
  • This application provides a data processing method and device, which can improve the natural language understanding ability of PLM.
  • In a first aspect, a data processing method is provided, comprising: determining an original text sample, where the original text sample has not been subjected to mask processing; and performing mask processing on the original text sample to obtain a mask training sample, where the mask processing makes the mask ratio of the mask training sample not fixed, and the mask training sample is used to train a pretrained language model (PLM).
  • the mask ratio of the mask training sample includes the text level mask ratio and/or the word level mask ratio.
  • the text-level mask ratio is used to indicate the ratio of the words processed by the mask in a text to all the words in the text.
  • The text-level mask ratio can also be referred to as the sentence-level mask ratio or the sample-level mask ratio.
  • the word level mask ratio is used to indicate the probability of a word being masked.
  • each word has a word-level mask ratio.
  • the word level mask ratio can also be referred to as the word mask probability.
  • That the mask ratio of the mask training sample is not fixed includes:
  • the text-level mask ratios of different samples in the mask training sample are not completely the same; and/or
  • the word-level mask ratios of the words in any sample in the mask training sample are not completely the same.
  • a variety of implementation methods can be used to perform mask processing on the original text samples to obtain mask training samples.
  • Performing mask processing on the original text sample to obtain the mask training sample includes: using a prior probability distribution model to generate a text-level mask ratio for each sample in the original text sample, where the prior probability distribution model makes the text-level mask ratios of different samples in the original text sample not completely the same; and performing mask processing on each sample according to its text-level mask ratio to obtain the mask training sample.
  • the length of the probability value interval of the prior probability distribution model is not less than 40%.
  • Performing mask processing on the original text sample to obtain the mask training sample includes: obtaining the word-level mask ratio of each word in a first text sample in the original text sample, where the word-level mask ratios of different words in the first text sample are not completely the same; and performing mask processing on some of the words in the first text sample according to the word-level mask ratio of each word in the first text sample to obtain a first training sample in the mask training sample.
  • Performing mask processing on some of the words in the first text sample according to the word-level mask ratio of each word to obtain the first training sample includes: masking the first S words, or the words located in the top G%, of the first text sample in descending order of word-level mask ratio to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
  • Obtaining the word-level mask ratio of each word in the first text sample in the original text sample includes: using a prior probability distribution model to generate the word-level mask ratio of each word in the first text sample, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not completely the same.
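  • A minimal sketch of generating word-level mask ratios from a prior distribution. The uniform prior on [0, 1), the whitespace tokenization, and the function name `sample_word_level_ratios` are assumptions made for illustration; the text only requires that the ratios of different words are not completely the same.

```python
import random


def sample_word_level_ratios(words, rng):
    """Draw one word-level mask ratio (a probability in [0, 1)) per word.

    Assumption: a uniform prior; the source does not fix the distribution.
    """
    return [rng.random() for _ in words]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
words = "Today is a sunny Saturday".split()
ratios = sample_word_level_ratios(words, rng)

for word, ratio in zip(words, ratios):
    print(f"{word}: {ratio:.2f}")
```

  • Each word receives its own ratio, so the ratios within one sample differ, which is the property this implementation relies on.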
  • Obtaining the word-level mask ratio of each word in the first text sample in the original text sample includes: inputting the first text sample into a neural network model and obtaining the word-level mask ratio of each word in the first text sample from the output of the neural network model, where the output of the neural network model is the word-level mask ratio of each word in the input text, and the neural network model is obtained through optimization learning in the following steps, where the initial value of i is 1.
  • Step 6): using the neural network model obtained in step 4) as the neural network model obtained through optimization learning.
  • Step 3) includes: using the training sample corresponding to the i-th sample to perform a training update on the PLM, and obtaining the loss value of the updated PLM for the masked words. Step 4) includes: updating and optimizing the neural network model according to the loss value of the updated PLM for the masked words and the output signal of the neural network model for the masked words.
  • In a second aspect, a data processing method is provided, including: obtaining mask training samples by the method provided in the first aspect; and training a pretrained language model (PLM) using the mask training samples, where the PLM is used to predict the masked text.
  • In a third aspect, a data processing method is provided, comprising: determining a target text to be predicted, where the target text includes a sentence lacking part of its text; and inputting the target text into a PLM and predicting the missing text in the target text from the output of the PLM, where the PLM is obtained through training by the method provided in the second aspect.
  • In a fourth aspect, a data processing device is provided, including a first processing unit and a second processing unit.
  • the first processing unit is configured to determine an original text sample, and the original text sample has not been masked.
  • the second processing unit is configured to perform mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples not fixed, and the mask training The samples are used to train the pre-trained language model PLM.
  • The mask ratio of the mask training sample includes the text-level mask ratio and/or the word-level mask ratio. For details, see the description above; they are not repeated here.
  • The second processing unit is configured to: use a prior probability distribution model to generate a text-level mask ratio for each sample in the original text sample, where the prior probability distribution model makes the text-level mask ratios of different samples in the original text sample not completely the same; and perform mask processing on each sample according to its text-level mask ratio to obtain the mask training sample.
  • the length of the probability value interval of the prior probability distribution model is not less than 40%.
  • The second processing unit is configured to: obtain the word-level mask ratio of each word in a first text sample in the original text sample, where the word-level mask ratios of different words in the first text sample are not completely the same; and perform mask processing on some of the words in the first text sample according to the word-level mask ratio of each word in the first text sample to obtain a first training sample in the mask training sample.
  • The second processing unit is configured to use a prior probability distribution model to generate the word-level mask ratio of each word in the first text sample, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not completely the same.
  • The second processing unit is configured to input the first text sample into a neural network model and obtain the word-level mask ratio of each word in the first text sample from the output of the neural network model, where the output of the neural network model is the word-level mask ratio of each word in the input text, and the neural network model is obtained through optimization learning in step 1) to step 6) described above, where the initial value of i is 1. For details, see the description above; they are not repeated here.
  • The second processing unit is configured to mask the first S words, or the words located in the top G%, of the first text sample in descending order of word-level mask ratio to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
  • In a fifth aspect, a data processing device is provided, including: a first processing unit, configured to obtain mask training samples by the method provided in the first aspect; and a second processing unit, configured to use the mask training samples to train a pretrained language model (PLM), where the PLM is used to predict the masked text.
  • In a sixth aspect, a data processing device is provided, comprising: a first processing unit, configured to determine a target text to be predicted, where the target text includes a sentence lacking part of its text; and a second processing unit, configured to input the target text into a PLM and predict the missing text in the target text from the output of the PLM, where the PLM is obtained through training by the method provided in the second aspect.
  • In a seventh aspect, a data processing device is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the method in the first aspect, the second aspect, or the third aspect.
  • In an eighth aspect, a computer-readable medium is provided, which stores program code for execution by a device, where the program code includes instructions for executing the method in the first, second, or third aspect.
  • In a ninth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer executes the method in the first, second, or third aspect.
  • In a tenth aspect, a chip is provided, including a processor and a data interface. The processor reads instructions stored in a memory through the data interface and executes the method in the first, second, or third aspect.
  • The chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory.
  • The processor is configured to execute the method in the first aspect, the second aspect, or the third aspect described above.
  • In an eleventh aspect, an electronic device is provided, and the electronic device includes the device provided in the fourth, fifth, sixth, or seventh aspect described above.
  • Training the PLM using mask training samples with an unfixed mask ratio can enhance the pattern diversity of the PLM training samples, so that the features learned by the PLM are more diverse and the generalization ability of the PLM improves. This can in turn improve the natural language understanding ability of the trained PLM.
  • FIG. 1 is a schematic diagram of the training principle of the pretrained language model (PLM).
  • FIG. 2 is a schematic diagram of a system architecture applicable to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for obtaining mask training samples provided by an embodiment of the present application.
  • FIG. 4 is another schematic flowchart of a method for obtaining mask training samples provided by an embodiment of the present application.
  • FIG. 5 is another schematic flowchart of the method for obtaining mask training samples provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the word level mask ratio of words in the original text sample in the embodiment of the present application.
  • FIG. 7 is a schematic flowchart of optimization learning of a neural network model for generating word-level mask ratios of words in an embodiment of the present application.
  • FIG. 8 is another schematic flowchart of optimization learning of a neural network model for generating word-level mask ratios of words in an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a data processing method provided by another embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a data processing method according to another embodiment of the present application.
  • FIG. 11 is a schematic block diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the application of the device shown in FIG. 11.
  • FIG. 13 is another schematic block diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 14 is another schematic block diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • NLP: natural language processing.
  • AI: artificial intelligence.
  • NLP can cover a variety of downstream tasks: sentiment analysis, part-of-speech analysis, intent analysis, named entity recognition, reading comprehension, logical reasoning, machine translation, or dialogue robots.
  • PLM: pretrained language model.
  • MLM: masked language model.
  • In the MLM training scheme, a training sample of the PLM is text that has been masked, that is, a sentence in which part of the text is replaced with a special mark symbol (for example, [MASK]). For example, the original text is "Today is a sunny Saturday" and the masked text is "Today [MASK] is sunny [MASK] Saturday"; the masked text is input into the PLM, and the PLM needs to predict the words that were replaced by [MASK].
  • the training samples of PLM can be called mask training samples.
  • In a text (for example, a sentence), the words that are not masked are the context information of a masked word.
  • By predicting the masked words, the PLM learns the ability to capture the context information of the text. A PLM trained with the MLM training scheme therefore has the ability to understand the deep semantics of natural language and can be used for a range of NLP-related downstream tasks.
  • the PLM training program has two improvement directions: first, to improve the natural language understanding ability of PLM; second, to speed up the model training speed (that is, to speed up the model convergence speed).
  • In existing MLM training schemes, a random strategy is used to select words in each text for mask processing according to a fixed mask ratio to obtain mask training samples.
  • For example, the fixed mask ratio is denoted r; for a text containing N words, r*N words are randomly selected for mask processing.
  • Obtaining mask training samples with a random strategy at a fixed mask ratio makes the PLM training samples follow a single pattern, so the features learned by the PLM are relatively fixed and the PLM lacks generalization ability. This creates a bottleneck in the natural language understanding ability of the trained PLM.
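  • The fixed-ratio baseline described above can be sketched as follows. This is an illustration only: the function name `mask_fixed_ratio`, the placeholder words, and rounding r*N to the nearest integer are assumptions, since the text only states that r*N words are selected at random.

```python
import random

MASK = "[MASK]"


def mask_fixed_ratio(words, r, rng):
    """Mask round(r * N) randomly chosen words of a text with N words."""
    n_mask = round(r * len(words))  # assumption: round r*N to an integer
    positions = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in positions else w for i, w in enumerate(words)]


rng = random.Random(0)
# Three placeholder texts of 20 words each.
texts = [[f"w{i}" for i in range(20)] for _ in range(3)]
masked = [mask_fixed_ratio(t, 0.15, rng) for t in texts]
counts = [m.count(MASK) for m in masked]
print(counts)
```

  • Every text ends up with the same share of masked words; only the positions vary. This is the single pattern that the scheme in this application sets out to avoid.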
  • To address the above problem, an embodiment of the present application proposes a scheme for generating mask training samples for the PLM, which can improve the natural language understanding ability of the trained PLM.
  • Training the PLM with the mask training samples obtained in the embodiment of the present application can overcome the bottleneck in the natural language understanding ability of the PLM in the prior art.
  • FIG. 2 is a schematic diagram of a system architecture applicable to an embodiment of the application.
  • the system may include a data collection device 21, a server device 22, and a client device 23.
  • the data collection device 21, the server device 22, and the client device 23 are connected through a communication network.
  • the data collection device 21 is used to obtain original text samples (for example, a large number of sentences), and transmit the original text samples to the server device 22.
  • the data collection device 21 can obtain the original text sample through a variety of ways. For example, it can be obtained through manual input and/or network search.
  • the server device 22 is configured to obtain mask training data using the solution provided in the embodiment of the present application, and then obtain the trained PLM, and can output the PLM to the client device 23.
  • The client device 23 uses the PLM trained by the server device 22 to perform natural language understanding and processing, for example, any one or more of the following NLP downstream tasks: sentiment analysis, part-of-speech analysis, intent analysis, named entity recognition, reading comprehension, logical reasoning, machine translation, or dialogue robots.
  • FIG. 2 is only an example and not a limitation.
  • the data collection device 21 is optional.
  • the operation of the data collection device 21 may be performed on the server device 22.
  • the client device 23 is optional.
  • the operation of the client device 23 may be performed on the server device 22.
  • the original text sample represents a collection of texts to be masked.
  • Each sample in the original text sample represents a text (or called a text sentence).
  • the original text sample is a collection of multiple text sentences.
  • the mask training sample represents the set of text processed by the mask.
  • Each sample in the mask training sample represents a text after mask processing.
  • the mask ratios involved in the embodiments of the present application include text-level mask ratios and word-level mask ratios.
  • the text-level mask ratio is used to indicate the ratio of the words processed by the mask in a text to all the words in the text.
  • the text-level mask ratio can also be referred to as the sentence-level mask ratio or the sample-level mask ratio.
  • the word level mask ratio is used to indicate the probability of a word being masked.
  • each word has a word-level mask ratio.
  • The "mask ratio of the mask training sample" and the "mask ratio of the original text sample" mentioned in this document include the text-level mask ratio and/or the word-level mask ratio.
  • the word level mask ratio can also be referred to as the word mask probability.
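  • To make the two definitions concrete, the sketch below computes the text-level mask ratio of one masked sample and shows word-level mask ratios as one probability per word. The six-token split of the example sentence and the probability values are hypothetical, chosen only for illustration.

```python
MASK = "[MASK]"


def text_level_mask_ratio(masked_words):
    """Share of words in one text that were replaced by the mask symbol."""
    return masked_words.count(MASK) / len(masked_words)


# The masked example from the text, split into six tokens (an assumption).
masked_sample = ["Today", MASK, "is", "sunny", MASK, "Saturday"]
ratio = text_level_mask_ratio(masked_sample)
print(f"text-level mask ratio: {ratio:.3f}")

# Word-level mask ratios: one probability per word (hypothetical values).
word_level_ratios = {"Today": 0.10, "is": 0.60, "a": 0.05, "sunny": 0.80, "Saturday": 0.30}
```

  • The text-level ratio is a single number per sample, whereas the word-level ratio is defined separately for every word in the sample.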
  • FIG. 3 is a schematic flowchart of a data processing method 300 provided by an embodiment of the application.
  • the method 300 may be executed by the server device 22 in FIG. 2.
  • the method 300 includes step S310 and step S320.
  • For example, one of the original text samples is "Today is a sunny Saturday".
  • S320 Perform mask processing on the original text sample to obtain a mask training sample.
  • the mask processing makes the mask ratio of the mask training sample not fixed.
  • Mask training samples are used to train PLM, and PLM is used to predict the text processed by the mask.
  • For example, a sample in the original text sample is "Today is a sunny Saturday", and the training sample obtained by masking this sample is "Today [MASK] is sunny [MASK] Saturday".
  • the mask ratio of the mask training sample includes the text level mask ratio and/or the word level mask ratio.
  • For example, the mask ratio of the mask training sample is the text-level mask ratio.
  • In this case, that the mask ratio of the mask training sample is not fixed means that the text-level mask ratios of different samples in the mask training sample are not completely the same.
  • The word-level mask ratios of different words within each sample may be the same, different, or not completely the same.
  • For example, the mask training sample includes a first sample and a second sample, the text-level mask ratio of the first sample is 15%, and the text-level mask ratio of the second sample is 20%. If the first sample and the second sample each contain 100 words, 15 words in the first sample are masked and 20 words in the second sample are masked.
  • As another example, the mask ratio of the mask training sample is the word-level mask ratio.
  • In this case, that the mask ratio of the mask training sample is not fixed means that the word-level mask ratios of different words in each sample in the mask training sample are not completely the same.
  • The text-level mask ratios of different samples in the mask training sample may be the same, different, or not completely the same.
  • As yet another example, the mask ratio of the mask training sample includes both the text-level mask ratio and the word-level mask ratio.
  • In this case, that the mask ratio of the mask training sample is not fixed means that the text-level mask ratios of different samples in the mask training sample are not completely the same, and the word-level mask ratios of different words in each sample in the mask training sample are not completely the same.
  • Step S320 includes: obtaining a mask strategy that makes the mask ratio of the original text sample not fixed; determining, according to the mask strategy, whether each word in each sample in the original text sample needs to be masked; if so, replacing the word with a mark symbol (for example, [MASK]); if not, leaving the word unchanged; and finally obtaining the mask training sample.
  • the mask strategy can be obtained in a variety of ways, which will be described below, and will not be described in detail here.
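  • The per-word decision in step S320 can be sketched as below. The `should_mask(sample_index, word_index, word)` callback is a hypothetical interface standing in for whatever mask strategy is used, and the hand-picked decisions exist only to make the example deterministic.

```python
MASK = "[MASK]"


def apply_mask_strategy(samples, should_mask):
    """Replace each word with [MASK] when the strategy says so; else keep it.

    should_mask(sample_index, word_index, word) -> bool is a hypothetical
    interface for the mask strategy.
    """
    return [
        [MASK if should_mask(i, j, w) else w for j, w in enumerate(words)]
        for i, words in enumerate(samples)
    ]


samples = [["Today", "is", "a", "sunny", "Saturday"], ["It", "rains"]]
# Hand-picked (sample, word) positions to mask, for illustration only.
decisions = {(0, 1), (0, 3), (1, 0)}
masked = apply_mask_strategy(samples, lambda i, j, w: (i, j) in decisions)

for words in masked:
    print(" ".join(words))
```

  • With these decisions the first sample has 2 of 5 words masked and the second has 1 of 2, so the text-level mask ratio differs between the two samples.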
  • Step S310 is optional.
  • For example, mask processing can be performed directly on the original text samples to obtain mask training samples, that is, step S320 is performed directly without performing step S310.
  • In step S320, multiple implementations can be adopted to perform mask processing on the original text samples to obtain mask training samples.
  • In other words, the mask strategy for the original text sample can be obtained in a variety of ways.
  • step S320 includes step S321 and step S322.
  • S321 Use a prior probability distribution model to generate a text level mask ratio of each sample in the original text sample.
  • the prior probability distribution model makes the text level mask ratio of different samples in the original text sample not completely the same.
  • the prior probability distribution model is used to generate a mask ratio for each sample in the original text sample.
  • For the i-th sample in the original text sample, the prior probability distribution model is used to generate a probability, and this probability is used as the text-level mask ratio of the i-th sample, where i = 1, ..., M, and M is the number of samples in the original text sample.
  • The probabilities generated by the prior probability distribution model obey a certain probability distribution, so the mask ratios generated by the model are dynamic rather than fixed. That is, the text-level mask ratios generated for the samples in the original text sample are not completely the same: either all samples have different text-level mask ratios, or at least some samples do.
  • The prior probability distribution model is denoted P(r), where r represents the probability.
  • The value interval of r can be 0% to 100%, in which case the mask ratios generated using P(r) lie between 0% and 100%.
  • The probability distribution obeyed by the prior probability distribution model can be any continuous or discrete probability distribution.
  • For example, the probability distribution obeyed by the prior probability distribution model is a uniform distribution or a Gaussian distribution.
  • The Gaussian distribution is also called the normal distribution.
  • As another example, the probability distribution obeyed by the prior probability distribution model is a truncated Gaussian distribution (also called a truncated normal distribution).
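  • Step S321 can be sketched with either a uniform prior or a truncated Gaussian prior. The interval bounds (0% to 40%), the Gaussian mean and standard deviation, and the function names are illustrative assumptions; the truncation is done here by simple rejection sampling, which is one possible way to realize it.

```python
import random


def sample_uniform_ratio(rng, low=0.0, high=0.4):
    """Text-level mask ratio drawn from a uniform prior on [low, high]."""
    return rng.uniform(low, high)


def sample_truncated_gaussian_ratio(rng, mean=0.2, std=0.1, low=0.0, high=0.4):
    """Text-level mask ratio drawn from a Gaussian truncated to [low, high].

    Rejection sampling: redraw until the value falls inside the interval.
    """
    while True:
        r = rng.gauss(mean, std)
        if low <= r <= high:
            return r


rng = random.Random(0)
M = 5  # number of samples in the original text sample
uniform_ratios = [sample_uniform_ratio(rng) for _ in range(M)]
gaussian_ratios = [sample_truncated_gaussian_ratio(rng) for _ in range(M)]

print([f"{r:.2f}" for r in uniform_ratios])
print([f"{r:.2f}" for r in gaussian_ratios])
```

  • Because each sample gets its own draw, the text-level mask ratios across the M samples are not completely the same.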
  • S322 Perform mask processing on the corresponding sample according to the text level mask ratio of each sample in the original text sample to obtain a mask training sample.
  • For example, let r1 and r2 denote the text-level mask ratios of text sample 1 and text sample 2. In step S322, text sample 1 is masked according to the mask ratio r1 to obtain the training sample corresponding to text sample 1 (denoted training sample 1), and text sample 2 is masked according to the mask ratio r2 to obtain the training sample corresponding to text sample 2 (denoted training sample 2). If r1 and r2 are different, the mask ratios of training sample 1 and training sample 2 are different.
  • One way to obtain the training sample corresponding to text sample 1 is: according to the mask ratio r1, use a random strategy to select r1*N1 words in text sample 1 for mask processing, where N1 is the total number of words in text sample 1.
  • Other feasible strategies may also be used to select r1*N1 words in text sample 1 for mask processing to obtain the training sample corresponding to text sample 1. The embodiments of this application do not limit this.
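  • A minimal sketch of step S322 using the random strategy. The values of r1 and r2, the placeholder words, and rounding r*N to the nearest integer are assumptions; in the scheme itself the ratios come from the prior probability distribution model.

```python
import random

MASK = "[MASK]"


def mask_with_ratio(words, ratio, rng):
    """Mask round(ratio * N) randomly selected words of one text sample."""
    n_mask = round(ratio * len(words))  # assumption: round to nearest integer
    chosen = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in chosen else w for i, w in enumerate(words)]


rng = random.Random(0)
text_sample_1 = [f"a{i}" for i in range(10)]  # N1 = 10 placeholder words
text_sample_2 = [f"b{i}" for i in range(10)]  # N2 = 10 placeholder words
r1, r2 = 0.1, 0.3  # assumed to have been generated in step S321

training_sample_1 = mask_with_ratio(text_sample_1, r1, rng)
training_sample_2 = mask_with_ratio(text_sample_2, r2, rng)
print(training_sample_1.count(MASK), training_sample_2.count(MASK))
```

  • Here training sample 1 has 1 masked word and training sample 2 has 3, so the two training samples carry different text-level mask ratios.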
  • The text-level mask ratios of the mask training samples are not completely the same; in other words, the text-level mask ratio is not fixed.
  • Training the PLM with mask training samples whose text-level mask ratio is not fixed can enhance the pattern diversity of the PLM training samples, so that the features learned by the PLM are more diverse, which can improve the natural language understanding ability of the trained PLM.
  • the length of the probability value interval of the prior probability distribution model is not less than 40%.
  • the probability value interval of the prior probability distribution model is 0% to 40%.
  • Simulation experiments show that the PLM trained using the mask training samples obtained in this embodiment has the ability to generate natural language in a random order.
  • the PLM trained using the mask training samples obtained in this embodiment can generate natural language in a random order generation manner as shown in Table 1.
  • Usually, natural language text is generated from left to right, whereas the PLM trained using the mask training samples obtained in this embodiment can specify the position of the next text to be generated each time, and fluent text can still be generated.
  • step S320 includes: step S323 and step S324. Through step S323 and step S324, the first training sample corresponding to the first text sample in the original text sample can be obtained.
  • the first text sample is taken as an example to illustrate each sample in the original text sample. That is, the following description of the first text sample applies to each sample in the original text sample.
  • the word-level mask ratios of different words in the first text sample are not completely the same, which means that at least two words in the first text sample have different word-level mask ratios.
  • the word level mask ratios of different words in the first text sample are all different.
  • some words in the first text sample have different word-level mask ratios, and some words have the same word-level mask ratio.
  • In step S323, the distribution of the word-level mask ratio of each word in "Today is a sunny Saturday" is obtained as shown in Figure 6. In the example of FIG. 6, the word-level mask ratios of all words in the first text sample are different.
  • the word-level mask ratio of a word indicates the probability of this word being masked.
  • S324 Perform mask processing on part of the words in the first text sample according to the word level mask ratio of each word in the first text sample to obtain the first training sample in the mask training sample.
  • Masking some of the words in the first text sample refers to masking the words with a larger mask ratio in the first text sample.
  • step S324 includes: masking the first S words in the first text sample in descending order of word-level mask ratio to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample.
  • For example, the first text sample is "今天是晴朗的周六" ("Today is a sunny Saturday"), and the word-level mask ratio of each word is as shown in Figure 6. Assuming that S is 2, the two characters with the largest mask ratios, "天" and "朗", are masked to obtain the first training sample "今[MASK]是晴[MASK]的周六".
  • step S324 includes: masking the words ranked in the top G% of the first text sample in descending order of word-level mask ratio to obtain the first training sample, where G is an integer greater than 0 and less than 100.
  • For example, the first text sample is "今天是晴朗的周六", and the word-level mask ratio of each word is as shown in Figure 6. Assuming that G is 25, the words ranked in the top 25% by word-level mask ratio, namely "天" and "朗", are masked to obtain the first training sample "今[MASK]是晴[MASK]的周六".
  • step S324 includes: masking the words whose mask ratio reaches D in the first text sample to obtain the first training sample, where D is a decimal number greater than 0 and less than 1, and D is smaller than the word-level mask ratio of the word with the smallest word-level mask ratio in the first text sample.
  • the word level mask ratio of the word reaches D, which means that the word level mask ratio of the word is greater than or equal to D.
  • the first text sample is "Today is a sunny Saturday”
  • the word level mask ratio of each word in the first text sample is shown in Figure 6, assuming that there are only “lang” and "
  • the word-level mask ratios of the mask training samples are not completely the same, or in other words, the word-level mask ratios of words are not fixed.
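  • The three selection rules described for step S324 (top S words, top G% of words, and words whose ratio reaches D) can be sketched as follows. The ratio values below are hypothetical and are not read from Figure 6; the function names and the choice to round the top-G% count down are likewise assumptions made for illustration.

```python
MASK = "[MASK]"

def mask_top_s(tokens, ratios, s):
    """Mask the S tokens with the highest word-level mask ratios."""
    order = sorted(range(len(tokens)), key=lambda i: ratios[i], reverse=True)
    chosen = set(order[:s])
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]

def mask_top_g_percent(tokens, ratios, g):
    """Mask the tokens ranked in the top G% (count rounded down, an assumption)."""
    return mask_top_s(tokens, ratios, len(tokens) * g // 100)

def mask_at_least_d(tokens, ratios, d):
    """Mask every token whose word-level mask ratio is greater than or equal to D."""
    return [MASK if r >= d else t for t, r in zip(tokens, ratios)]

tokens = list("今天是晴朗的周六")
# Hypothetical word-level mask ratios, highest for "天" and "朗".
ratios = [0.10, 0.80, 0.20, 0.30, 0.90, 0.15, 0.25, 0.35]

print("".join(mask_top_s(tokens, ratios, 2)))           # 今[MASK]是晴[MASK]的周六
print("".join(mask_top_g_percent(tokens, ratios, 25)))  # 今[MASK]是晴[MASK]的周六
print("".join(mask_at_least_d(tokens, ratios, 0.5)))    # 今[MASK]是晴[MASK]的周六
```

  • With these illustrative ratios all three rules select the same two characters; with other ratios they can differ, since the threshold rule does not fix the number of masked words in advance.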
  • the prior art uses a random strategy to select words in each text for mask processing according to a fixed mask ratio to obtain mask training samples.
  • The mask training samples generated randomly may have repetitive features. Training the PLM with such mask training samples may cause the PLM to repeatedly learn the same training samples during training, so rapid convergence of the model cannot be guaranteed.
  • In the embodiments of this application, the words in each sample of the original text sample are given word-level mask ratios that are not completely the same, and each original text sample is masked according to the word-level mask ratio of each word rather than according to a random strategy.
  • This can reduce or avoid repetitive features in the mask training samples, which to a certain extent prevents the PLM from repeatedly learning the same samples during training and enables fast model convergence.
  • step S323 multiple implementation manners can be used to obtain the word-level mask ratio of each word in the first text sample in the original text sample, so that the word-level mask ratios of different words in the first text sample are not completely the same .
  • step S323 includes: using a prior probability distribution model to generate the word-level mask ratio of each word in the first text sample, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not completely the same.
  • A mask ratio is generated for each word in the first text sample. For example, for the j-th word in the first text sample, the prior probability distribution model is used to generate a probability, and the probability is used as the word-level mask ratio of the j-th word, where j is 1, ..., N1, and N1 denotes the total number of words contained in the first text sample.
  • In step S323, the prior probability distribution model is used to obtain the word-level mask ratio of each word in "Today is a sunny Saturday"; the resulting distribution is shown in Figure 6.
  • the probability generated by the prior probability distribution model obeys a certain probability distribution.
  • The probability generated by the prior probability distribution model ranges from 0% to 100%. Therefore, the mask ratio generated using the prior probability distribution model is dynamic rather than fixed.
  • The prior probability distribution model is denoted as P(r), where r represents the probability. The value interval of r can be between 0% and 100%, so the mask ratio generated using P(r) also lies between 0% and 100%.
  • the prior probability distribution model mentioned here is the same as the prior probability distribution model mentioned in step S321.
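  • Generating word-level mask ratios from a prior P(r), as described for step S323, can be sketched as follows. The specific priors used here (a uniform distribution and a Beta(2, 5) distribution) and the function name `word_level_ratios` are assumptions chosen for illustration; the text above only requires that P(r) produces probabilities between 0% and 100%.

```python
import random

def word_level_ratios(tokens, rng, prior=None):
    """Return one mask probability per word, each drawn independently from P(r)."""
    if prior is None:
        prior = lambda g: g.random()  # assumed uniform prior on [0, 1)
    return [prior(rng) for _ in tokens]

rng = random.Random(0)
tokens = list("今天是晴朗的周六")
uniform_ratios = word_level_ratios(tokens, rng)
# A different assumed prior: Beta(2, 5), which favours smaller ratios.
beta_ratios = word_level_ratios(tokens, rng, prior=lambda g: g.betavariate(2, 5))
for tok, u, b in zip(tokens, uniform_ratios, beta_ratios):
    print(tok, f"uniform={u:.2f}", f"beta={b:.2f}")
```

  • Since each word draws its own value, the word-level mask ratios within one sample are, in practice, not all the same.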
  • step S323 includes: inputting the first text sample into a neural network model and obtaining the word-level mask ratio of each word in the first text sample from the output of the neural network model, where the output of the neural network model is the word-level mask ratio of each word in the input text.
  • the neural network model is obtained by optimization learning, and the learning optimization process of the neural network model is shown in Figure 7. In Figure 7, the initial value of i is 1.
  • the neural network model can output a loss value (loss) for each word, and the loss value can be mapped to a mask ratio.
  • The neural network model can output a loss value for each word in the sample "今天是晴朗的周六" ("Today is a sunny Saturday"). For example, loss_0 represents the loss value output by the neural network model for "今" in the sample; the meanings of loss_1 to loss_7 are similar and are not repeated here.
  • The word-level mask ratio of a word is obtained according to the loss value output by the neural network model for the word and the mapping relationship between the loss value and the mask ratio.
  • mapping relationship between the loss value output by the neural network model and the mask ratio can be designed according to application requirements, which is not limited in this application.
  • the neural network model can directly output the word-level mask ratio of the word for each word.
  • the word level mask ratio of each word can be obtained directly according to the output signal of the neural network model for each word in the i-th sample.
  • Step 2) may correspond to step S324.
  • the implementation of masking the i-th sample according to the word-level mask ratio of each word in the i-th sample is detailed in the foregoing, and will not be repeated here.
  • The loss value output by the PLM for the masked words, obtained in step 3), can be called a feedback signal.
  • the PLM can predict the word processed by the mask for the input mask training data, and can also output the loss value (loss') for the word processed by the mask.
  • the PLM may be a model with fixed parameters. It will be described below and will not be detailed here.
  • The first signal and the second signal are signals with the same meaning (that is, signals that can be compared); the neural network model is optimized and updated based on the difference between the first signal and the second signal.
  • If the loss value output by the PLM for the masked word (denoted as output signal 1) has the same meaning as the output signal of the neural network model for the masked word (denoted as output signal 2), the two signals can be compared directly, and the neural network model can be optimized and updated directly based on the difference between output signal 1 and output signal 2.
  • For example, the PLM outputs loss values loss_1' and loss_4' for the masked words "天" and "朗", respectively, and the output signal of the neural network model for the masked words is also a loss value; the neural network model can then be optimized and updated by comparing the loss value output by the PLM with the loss value output by the neural network model.
  • If the loss value output by the PLM for the masked word (denoted as output signal 1) does not have the same meaning as the output signal of the neural network model for the masked word (denoted as output signal 2), the two signals cannot be compared directly. In this case, one of output signal 1 and output signal 2 can be converted into a signal with the same meaning as the other, and the two are then compared.
  • For example, the output signal of the neural network model for the masked word is a mask ratio, and the loss value output by the PLM for the masked word has a mapping relationship with the mask ratio. In this case, the loss value output by the PLM can first be converted into a mask ratio according to the mapping relationship and then compared with the mask ratio output by the neural network model, so that the neural network model is optimized and updated.
  • each loss value is divided by the same larger value (greater than all loss values), and the obtained ratio is used as the mask ratio of each loss value mapping.
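  • The mapping just described, dividing every loss value by one common value larger than all of them, can be sketched as follows. How that common value is chosen is not specified in the text; taking the maximum loss plus a positive `margin` is an assumption made here, as is the function name. The loss values are illustrative and assumed to be non-negative.

```python
def losses_to_mask_ratios(losses, margin=1.0):
    """Map non-negative loss values to mask ratios in [0, 1).

    Every loss is divided by the same denominator, which is strictly
    larger than all losses (max loss plus an assumed positive margin).
    """
    denom = max(losses) + margin
    return [loss / denom for loss in losses]

losses = [0.5, 2.0, 1.0, 3.0]  # illustrative loss values
print(losses_to_mask_ratios(losses))  # [0.125, 0.5, 0.25, 0.75]
```

  • The mapping preserves the ordering of the losses, so a word with a larger loss receives a larger mask ratio.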
  • Step 5): determine whether the neural network model meets the convergence condition; if yes, go to step 6); if not, add 1 to the value of i and go to step 1).
  • the neural network model obtained in step 4) is used as the neural network model learned by optimization.
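  • The optimization loop of steps 1) to 6) can be sketched with toy stand-ins. Everything concrete here is an assumption for illustration: a lookup table `predicted` plays the role of the neural network model, `plm_loss_stub` is a hypothetical fixed-parameter PLM that returns a preset loss per character, both signals are loss values (so they can be compared directly), and convergence is declared when the largest difference seen in the current step falls below `tol`.

```python
MASK = "[MASK]"

def train_mask_generator(samples, plm_loss, s=2, lr=0.5, tol=1e-3, max_steps=1000):
    """Toy version of the loop in steps 1) to 6); not the application's model."""
    predicted = {}  # character -> predicted loss (stand-in for the neural network)
    i = 1  # the initial value of i is 1
    while i <= max_steps:
        tokens = samples[(i - 1) % len(samples)]
        # Step 1: output signal of the generator for each word of the i-th sample.
        signal = [predicted.get(t, 1.0) for t in tokens]
        # Step 2: mask the S words with the largest signal.
        order = sorted(range(len(tokens)), key=lambda k: signal[k], reverse=True)
        masked_pos = order[:s]
        masked = [MASK if k in masked_pos else t for k, t in enumerate(tokens)]
        # Step 3: feedback signal, the PLM loss for each masked word.
        feedback = {k: plm_loss(masked, k, tokens[k]) for k in masked_pos}
        # Step 4: update the generator from the difference between the two signals.
        gap = 0.0
        for k, loss in feedback.items():
            diff = loss - signal[k]
            predicted[tokens[k]] = signal[k] + lr * diff
            gap = max(gap, abs(diff))
        # Step 5: stop if converged; otherwise add 1 to i and return to step 1.
        if gap < tol:
            break
        i += 1
    # Step 6: the generator obtained in step 4 is the learned model.
    return predicted

def plm_loss_stub(masked_tokens, position, target):
    """Hypothetical PLM with fixed parameters: a preset loss per character."""
    return {"天": 2.0, "朗": 3.0}.get(target, 0.5)

learned = train_mask_generator([list("今天是晴朗的周六")], plm_loss_stub)
print(sorted(learned, key=learned.get, reverse=True)[:2])
```

  • In this toy setting the table converges towards the stub's losses for "朗" and "天", so those two characters end up with the largest signals and are the ones selected for masking.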
  • S810 Input the sample "Today is a sunny Saturday” into the neural network model, and obtain the word-level mask ratio of each word in the sample "Today is a sunny Saturday” from the output of the neural network model.
  • S810 corresponds to step 1) in FIG. 7.
  • The neural network model outputs a loss value for each word in the sample "今天是晴朗的周六" ("Today is a sunny Saturday"). For example, loss_0 represents the loss value output by the neural network model for "今" in the sample; the meanings of loss_1 to loss_7 are similar and are not repeated here.
  • The loss value output by the neural network model for each word has a mapping relationship with the mask ratio. As shown in Figure 8, loss_0 to loss_7 each map to a mask ratio, so the word-level mask ratio of each word, including the words to be masked, can be obtained according to the loss value and the mapping relationship.
  • S820: Mask the words "天" and "朗", whose mask ratios meet the condition, in the sample "今天是晴朗的周六" to obtain the corresponding training sample "今[MASK]是晴[MASK]的周六". S820 corresponds to step 2) in Figure 7.
  • S830: Input the mask training sample "今[MASK]是晴[MASK]的周六" into the PLM, obtain the prediction result for the masked words from the output of the PLM, and also obtain the output signal of the PLM for the masked words (that is, "天" and "朗").
  • S830 corresponds to step 3) in FIG. 7.
  • The neural network model is updated and optimized according to its output signal for the masked words (that is, "天" and "朗") and the loss value output by the PLM for the masked words.
  • S840 corresponds to step 4) in FIG. 7.
  • PLM is a model that performs real-time parameter update based on mask training samples.
  • Step 3) includes: using the training sample corresponding to the i-th sample to perform a training update on the PLM; and inputting the training sample corresponding to the i-th sample into the updated PLM to obtain the loss value output by the updated PLM for the masked words.
  • Step 4) includes: updating and optimizing the neural network model according to the loss value output by the updated PLM for the masked words and the output signal of the neural network model for the masked words.
  • the neural network model learned by optimization is used to generate the word-level mask ratio of each word in each sample in the original text sample, which is equivalent to optimized learning of the mask strategy, which can generate better Mask training samples. Therefore, using such mask training samples to train PLM can achieve rapid convergence of the PLM model and improve the natural language understanding ability of PLM.
  • The foregoing embodiment of obtaining the mask strategy according to the probability distribution model can be regarded as obtaining the mask strategy according to a preset model (or according to experience).
  • The generation of mask training samples can be controlled in a certain way, for example, by controlling the mask ratio of the text (that is, the sentence), which is the text-level mask ratio, or by controlling the word-level mask ratio of the words in the text. Therefore, in the embodiments of this application, mask training samples can be generated in a controlled manner rather than randomly, so that both the natural language understanding ability of the PLM and the convergence speed of the PLM can be improved by controlling the mask training samples.
  • the solution for obtaining mask training samples of PLM provided in the embodiments of the present application can be applied to all MLM-based PLMs.
  • an embodiment of the present application also provides a method 900 for training PLM.
  • the method 900 may be executed by the server device 22 in FIG. 2.
  • the method 900 includes step S910 and step S920.
  • S910 Obtain mask training samples by using the method 300 in the above embodiment.
  • S920 Use mask training samples to train a pre-trained language model PLM, where PLM is used to predict the text processed by the mask.
  • Using mask training samples with an unfixed mask ratio to train the PLM can enhance the pattern diversity of the PLM's training samples, so that the features learned by the PLM are more diverse and the generalization ability of the PLM is improved; therefore, the natural language understanding ability of the trained PLM can be improved.
  • Determining the mask strategy according to the word-level mask ratio of each word, rather than according to a random strategy, can reduce or avoid repetitive features in the mask training samples, which to a certain extent prevents the PLM from repeatedly learning the same samples during training and enables rapid model convergence.
  • The word-level mask ratio of each word in each sample of the original text sample is generated using the neural network model obtained through optimization learning, and the mask strategy is then determined according to the word-level mask ratio of each word, which can generate better mask training samples. Therefore, using such mask training samples to train the PLM can achieve rapid convergence of the PLM and improve its natural language understanding ability.
  • the PLM training solution provided in the embodiments of the present application can be applied to all MLM-based PLMs.
  • an embodiment of the present application also provides a data processing method 1000.
  • the method 1000 may be executed by the server device 22 or the client device 23 in FIG. 2.
  • the method 1000 includes step S1010 and step S1020.
  • S1010 Determine the target text to be predicted, where the target text includes sentences lacking part of the text.
  • the target text includes sentences that lack part of the text, and can also be expressed as the target text includes sentences that lack context information.
  • the target text is input into the PLM, and the missing characters in the target text are predicted from the output of the PLM.
  • the PLM is the PLM obtained by training through the method 900 provided in the above embodiment.
  • Table 2 shows that the PLM obtained by training with the scheme provided in the embodiments of this application (denoted as u-PMLM-A) achieves improved scores compared with the existing BERT(A) model.
  • CoLA, SST-2, MRPC, STS-B, QQP, MNLI-m/mm, QNLI, RTE, and AX in the header row of Table 2 are the names of subtasks in the natural language processing task set GLUE, and AVG. represents the average score over these subtasks.
  • FIG. 11 is a schematic block diagram of a data processing apparatus 1100 according to an embodiment of the application.
  • the device 1100 includes a mask generation module 1110 and a PLM model training module 1120.
  • the mask generation module 1110 is used to perform mask processing on the original text samples by using the method 300 provided in the embodiment of the present application to obtain mask training data.
  • the mask generation module 1110 includes a neural network model as shown in FIG. 8.
  • the PLM model training module 1120 is configured to use the mask training data obtained by the mask generation module 1110 to perform PLM model training.
  • The text sample "今天是晴朗的周六" ("Today is a sunny Saturday") is input to the mask generation module 1110, and the mask generation module 1110 outputs the training sample "今[MASK]是晴[MASK]的周六".
  • The training sample "今[MASK]是晴[MASK]的周六" is input to the PLM model training module 1120, and the PLM model training module 1120 outputs the prediction results "天" and "朗" for the masked words.
  • an embodiment of the present application further provides an apparatus 1300 for data processing, and the apparatus 1300 is configured to execute the above method embodiment.
  • the device 1300 includes a first processing unit 1310 and a second processing unit 1320.
  • the apparatus 1300 is configured to execute the method 300 in the above method embodiment.
  • The first processing unit 1310 is used to determine the original text sample, where the original text sample has not undergone mask processing; the second processing unit 1320 is used to perform mask processing on the original text sample to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples unfixed, and the mask training samples are used to train the PLM.
  • The mask ratio of the mask training samples includes the text-level mask ratio and/or the word-level mask ratio. See the description above for details, which are not repeated here.
  • The second processing unit 1320 is configured to: use a prior probability distribution model to generate the text-level mask ratio of each sample in the original text sample, where the prior probability distribution model makes the text-level mask ratios of different samples in the original text sample not completely the same; and mask each sample according to its text-level mask ratio to obtain the mask training samples.
  • the length of the probability value interval of the prior probability distribution model is not less than 40%.
  • the second processing unit 1320 is configured to obtain the word-level mask ratio of each word in the first text sample in the original text sample, and the word-level mask ratios of different words in the first text sample are not completely the same; According to the word level mask ratio of each word in the first text sample, mask processing is performed on part of the words in the first text sample to obtain the first training sample in the mask training sample.
  • the second processing unit 1320 is configured to use a prior probability distribution model to generate the word-level mask ratio of each word in the first text sample, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not completely the same.
  • the second processing unit 1320 is configured to input the first text sample into the neural network model, and obtain the word-level mask ratio of each word in the first text sample from the output of the neural network model, and the output of the neural network model is The word level mask ratio of each word in the input text.
  • The neural network model is obtained through optimization learning using the steps shown in FIG. 7, where the initial value of i is 1. See the previous description for details, which are not repeated here.
  • The second processing unit 1320 is configured to mask the top S words, or the words ranked in the top G%, of the first text sample in descending order of word-level mask ratio to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
  • the device 1300 under the first design may be set in the mask generation module 1110 in the device 1100.
  • the mask generation module 1110 in the device 1100 includes the device 1300 in the first design.
  • the apparatus 1300 is configured to execute the method 900 in the above method embodiment.
  • the first processing unit 1310 is used to obtain mask training samples by the method 300 in the above method embodiment;
  • the second processing unit 1320 is used to train the pre-trained language model PLM using the mask training samples, where the PLM is used to predict the masked text.
  • the device 1300 under the second design may be set in the PLM model training module 1120 in the device 1100.
  • the PLM model training module 1120 in the device 1100 includes the device 1300 in the second design.
  • the apparatus 1300 is configured to execute the method 1000 in the above method embodiment.
  • the first processing unit 1310 is used to determine the target text to be predicted, and the target text includes sentences lacking part of the text;
  • the second processing unit 1320 is used to input the target text into the pre-trained language model PLM and predict the missing text in the target text from the output of the PLM, where the PLM is obtained by training through the method 900 in the above method embodiment.
  • first processing unit 1310 and the second processing unit 1320 may be implemented by processors.
  • an embodiment of the present application also provides a data processing apparatus 1400.
  • the device 1400 includes a processor 1410 coupled to a memory 1420; the memory 1420 is used to store computer programs or instructions, and the processor 1410 is used to execute the computer programs or instructions stored in the memory 1420, so that the method in the above method embodiment is executed.
  • the apparatus 1400 may further include a memory 1420.
  • the device 1400 may further include a data interface 1430, and the data interface 1430 is used for data transmission with the outside world.
  • the apparatus 1400 is used to implement the method 300 in the foregoing embodiment.
  • the apparatus 1400 is used to implement the method 900 in the foregoing embodiment.
  • the apparatus 1400 is used to implement the method 1000 in the foregoing embodiment.
  • the embodiments of the present application also provide a computer-readable medium that stores program code for execution by a device, where the program code includes instructions for executing the method of the above-mentioned embodiments.
  • the embodiments of the present application also provide a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to execute the method of the foregoing embodiment.
  • An embodiment of the present application further provides a chip, which includes a processor and a data interface, and the processor reads instructions stored on the memory through the data interface, and executes the method of the foregoing embodiment.
  • the chip may further include a memory in which instructions are stored, and the processor is used to execute the instructions stored on the memory, and when the instructions are executed, the processor is used to execute the method in the foregoing embodiment.
  • An embodiment of the present application also provides an electronic device, and the electronic device includes the apparatus 1100 in the foregoing embodiment.
  • An embodiment of the present application also provides an electronic device, which includes a device 1300 in a first design, or a device 1300 in a second design, or a device 1300 in a third design.
  • An embodiment of the present application also provides an electronic device, which includes an apparatus 1300 in a first design and an apparatus 1300 in a second design.
  • FIG. 15 is a chip hardware structure provided by an embodiment of the application, and the chip includes a neural network processor 1500.
  • the chip can be installed in any one or more of the following devices:
  • the methods 300, 900, or 1000 in the above method embodiments can all be implemented in the chip as shown in FIG. 15.
  • the neural network processor 1500 is mounted on a host CPU as a coprocessor, and the host CPU distributes tasks.
  • the core part of the neural network processor 1500 is the arithmetic circuit 1503, and the controller 1504 controls the arithmetic circuit 1503 to obtain data in the memory (weight memory 1502 or input memory 1501) and perform calculations.
  • the arithmetic circuit 1503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1503 is a general-purpose matrix processor.
  • the arithmetic circuit 1503 fetches the data corresponding to matrix B from the weight memory 1502 and caches it on each PE in the arithmetic circuit 1503.
  • the arithmetic circuit 1503 fetches the data of matrix A from the input memory 1501, performs a matrix operation with matrix B, and stores the partial result or final result of the obtained matrix in an accumulator 1508.
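  • The arithmetic described above, a matrix product whose partial results are accumulated step by step, can be illustrated in software. This is only a sketch of the computation, not a model of the hardware: the loop order, the function name, and the use of a plain Python list as the accumulator are assumptions made for illustration.

```python
def matmul_accumulate(a, b):
    """Compute C = A x B by accumulating one partial product per step."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]  # stands in for the accumulator
    for k in range(inner):  # one row of B (weights) per step
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += a[i][k] * b[k][j]  # add this step's partial result
    return acc

a = [[1, 2], [3, 4]]  # input data (matrix A)
b = [[5, 6], [7, 8]]  # weights (matrix B)
print(matmul_accumulate(a, b))  # [[19, 22], [43, 50]]
```

  • After the last step the accumulator holds the final result; before that it holds partial sums.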
  • the vector calculation unit 1507 can perform further processing on the output of the arithmetic circuit 1503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 1507 can be used for network calculations in the non-convolution/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector calculation unit 1507 can store the processed output vector in a unified memory (also referred to as a unified buffer) 1506.
  • the vector calculation unit 1507 may apply a nonlinear function to the output of the arithmetic circuit 1503, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 1507 generates normalized values, combined values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 1503, for example for use in subsequent layers in a neural network.
  • the method 300, 900 or 1000 in the above method embodiment may be executed by the arithmetic circuit 1503 or the vector calculation unit 1507.
  • the unified memory 1506 is used to store input data and output data.
  • The storage unit access controller 1505 (direct memory access controller, DMAC) transfers the input data in the external memory to the input memory 1501 and/or the unified memory 1506, stores the weight data in the external memory into the weight memory 1502, and stores the data in the unified memory 1506 into the external memory.
  • the bus interface unit (BIU) 1510 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 1509 through the bus.
  • An instruction fetch buffer 1509 connected to the controller 1504 is used to store instructions used by the controller 1504;
  • the controller 1504 is used to call the instructions cached in the memory 1509 to control the working process of the computing accelerator.
  • the data here may be original text samples.
  • the data here may be mask training samples.
  • the data here may be the target text to be predicted.
  • the unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch memory 1509 are all on-chip (On-Chip) memories.
  • the external memory is a memory external to the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application.
  • the aforementioned storage media include: Universal Serial Bus flash disk (USB flash disk, UFD) (UFD can also be referred to as U disk or USB flash drive for short), mobile hard disk, read-only memory (read-only memory, ROM), random access Various media that can store program codes, such as random access memory (RAM), magnetic disks, or optical disks.
  • USB flash disk UFD
  • UFD Universal Serial Bus flash disk
  • ROM read-only memory
  • RAM random access memory
  • magnetic disks magnetic disks
  • optical disks optical disks.


Abstract

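This application provides a data processing method and apparatus, relating to the field of artificial intelligence and specifically to natural language processing. The method includes: determining original text samples, on which no mask processing has been performed; and performing mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a pretrained language model (PLM). Training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features, generalizes better, and achieves better natural language understanding.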

Description

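Data processing method and apparatus
This application claims priority to Chinese Patent Application No. 202010286915.9, filed with the China Patent Office on April 13, 2020 and entitled "数据处理的方法与装置" (Data processing method and apparatus), which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and specifically to a data processing method and apparatus.
Background
Natural language processing (NLP) is a technology that enables computers to understand and process human natural language, and is an important means of realizing artificial intelligence. The pretrained language model (PLM) is an important general-purpose model in the NLP field that has emerged in recent years. PLM training schemes are a research hotspot, with two directions for improvement: first, improving the natural language understanding capability of the PLM; second, speeding up model training (that is, speeding up model convergence). A commonly used PLM training scheme is the masked language model (MLM).
The principle of MLM training is to make the PLM learn to capture contextual information. In the MLM training scheme, a PLM training sample is a masked text, that is, a sentence in which some words are replaced with a special marker (for example, [MASK]). For example, the original text is "今天是晴朗的周六" ("Today is a sunny Saturday") and the masked text is "今[MASK]是晴[MASK]的周六"; the masked text is input into the PLM, which must predict that the masked words are "天" and "朗" respectively. PLM training samples may be referred to as mask training samples.
In the current MLM training scheme, the words of each text are selected for masking by a random strategy at a fixed mask ratio to obtain mask training samples. Mask training samples obtained in this way follow a single pattern, so training a PLM with them creates a bottleneck in the PLM's natural language understanding capability.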
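Summary
This application provides a data processing method and apparatus that can improve the natural language understanding capability of a PLM.
According to a first aspect, a data processing method is provided. The method includes: determining original text samples, on which no mask processing has been performed; and performing mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a pretrained language model (PLM).
The mask ratio of the mask training samples includes a text-level mask ratio and/or a word-level mask ratio.
The text-level mask ratio indicates the proportion of masked words in a text to all words in that text.
The text-level mask ratio may also be referred to as a sentence-level mask ratio or a sample-level mask ratio.
The word-level mask ratio indicates the probability that a word is masked. In a text, each word has its own word-level mask ratio.
The word-level mask ratio may also be referred to as the mask probability of a word.
That the mask ratio of the mask training samples is not fixed includes:
the text-level mask ratios of different samples in the mask training samples are not all the same; and/or
the word-level mask ratios of the words in any one of the mask training samples are not all the same.
It should be understood that training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features and generalizes better; the natural language understanding capability of the trained PLM is therefore improved.
The original text samples can be masked in multiple ways to obtain the mask training samples.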
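With reference to the first aspect, in a possible implementation of the first aspect, the performing mask processing on the original text samples to obtain mask training samples includes: generating a text-level mask ratio for each of the original text samples by using a prior probability distribution model, where the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same; and masking each sample according to its text-level mask ratio to obtain the mask training samples.
Optionally, the length of the probability value interval of the prior probability distribution model is not less than 40%.
It should be understood that training the PLM with mask training samples whose text-level mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features, which improves the natural language understanding capability of the trained PLM.
With reference to the first aspect, in a possible implementation of the first aspect, the performing mask processing on the original text samples to obtain mask training samples includes: obtaining a word-level mask ratio for each word in a first text sample of the original text samples, where the word-level mask ratios of different words in the first text sample are not all the same; and masking some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples.
Optionally, the masking some of the words in the first text sample according to the word-level mask ratios, to obtain the first training sample, includes: masking, in descending order of word-level mask ratio, the first S words or the words in the top G% of the first text sample to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
It should be understood that the words in each original text sample are given word-level mask ratios that are not all the same, and the mask strategy is determined from these ratios rather than by a random strategy. This reduces or avoids repeated features in the mask training samples, which to some extent prevents the PLM from repeatedly learning the same samples during training and enables fast model convergence.
The word-level mask ratio of each word in the first text sample can be obtained in multiple ways such that the word-level mask ratios of different words in the first text sample are not all the same.
With reference to the first aspect, in a possible implementation of the first aspect, the obtaining a word-level mask ratio for each word in the first text sample includes: generating the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.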
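With reference to the first aspect, in a possible implementation of the first aspect, the obtaining a word-level mask ratio for each word in the first text sample includes: inputting the first text sample into a neural network model and obtaining, from the output of the neural network model, the word-level mask ratio of each word in the first text sample, where the output of the neural network model is the word-level mask ratio of each word in the input text, and the neural network model is obtained through optimization learning by the following steps, with i initially set to 1.
1) Input the i-th sample of the original text samples into the neural network model, and obtain from the output of the neural network model the word-level mask ratio of each word in the i-th sample.
2) Mask some of the words in the i-th sample according to the word-level mask ratios of the words in the i-th sample, to obtain the training sample corresponding to the i-th sample.
3) Input the training sample corresponding to the i-th sample into the PLM, and obtain the loss values output by the PLM for the masked words.
4) Update and optimize the neural network model according to the loss values output by the PLM for the masked words and the output signals of the neural network model for the masked words.
5) Determine whether the neural network model satisfies a convergence condition; if so, go to step 6); if not, increase i by 1 and go to step 1).
6) Use the neural network model obtained in step 4) as the neural network model obtained through optimization learning.
Optionally, step 3) includes: performing one training update on the PLM by using the training sample corresponding to the i-th sample; and inputting the training sample corresponding to the i-th sample into the updated PLM to obtain the loss values of the updated PLM for the masked words. Step 4) then includes: updating and optimizing the neural network model according to the loss values of the updated PLM for the masked words and the output signals of the neural network model for the masked words.
It should be understood that generating the word-level mask ratio of each word in each original text sample with a neural network model obtained through optimization learning is equivalent to learning the mask strategy, so better mask training samples can be generated. Training the PLM with such samples enables fast convergence of the PLM and improves its natural language understanding capability.
Training and updating the PLM while the neural network model that generates word-level mask ratios is being optimized makes it possible to generate still better mask training samples; training the PLM with such samples enables fast convergence of the PLM and improves its natural language understanding capability.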
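According to a second aspect, a data processing method is provided. The method includes: obtaining mask training samples by using the method provided in the first aspect; and training a pretrained language model (PLM) with the mask training samples, where the PLM is used to predict masked words.
Training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features and generalizes better; the natural language understanding capability of the trained PLM is therefore improved.
According to a third aspect, a data processing method is provided. The method includes: determining a target text to be predicted, where the target text includes a sentence with some words missing; and inputting the target text into a pretrained language model (PLM) and predicting, from the output of the PLM, the words missing from the target text, where the PLM is trained by using the method provided in the second aspect.
According to a fourth aspect, a data processing apparatus is provided. The apparatus includes a first processing unit and a second processing unit. The first processing unit is configured to determine original text samples, on which no mask processing has been performed. The second processing unit is configured to perform mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a pretrained language model (PLM).
The mask ratio of the mask training samples includes a text-level mask ratio and/or a word-level mask ratio. See the description above; details are not repeated here.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is configured to: generate a text-level mask ratio for each of the original text samples by using a prior probability distribution model, where the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same; and mask each sample according to its text-level mask ratio to obtain the mask training samples.
Optionally, the length of the probability value interval of the prior probability distribution model is not less than 40%.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is configured to: obtain a word-level mask ratio for each word in a first text sample of the original text samples, where the word-level mask ratios of different words in the first text sample are not all the same; and mask some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is configured to generate the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is configured to input the first text sample into a neural network model and obtain, from the output of the neural network model, the word-level mask ratio of each word in the first text sample, where the output of the neural network model is the word-level mask ratio of each word in the input text, and the neural network model is obtained through optimization learning by steps 1) to 6) described above, with i initially set to 1. See the description above; details are not repeated here.
With reference to the fourth aspect, in a possible implementation of the fourth aspect, the second processing unit is configured to mask, in descending order of word-level mask ratio, the first S words or the words in the top G% of the first text sample to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
According to a fifth aspect, a data processing apparatus is provided. The apparatus includes: a first processing unit, configured to obtain mask training samples by using the method provided in the first aspect; and a second processing unit, configured to train a pretrained language model (PLM) with the mask training samples, where the PLM is used to predict masked words.
According to a sixth aspect, a data processing apparatus is provided. The apparatus includes: a first processing unit, configured to determine a target text to be predicted, where the target text includes a sentence with some words missing; and a second processing unit, configured to input the target text into a pretrained language model (PLM) and predict, from the output of the PLM, the words missing from the target text, where the PLM is trained by using the method provided in the second aspect.
According to a seventh aspect, a data processing apparatus is provided. The apparatus includes: a memory, configured to store a program; and a processor, configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in the first aspect, the second aspect, or the third aspect.
According to an eighth aspect, a computer-readable medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in the first aspect, the second aspect, or the third aspect.
According to a ninth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in the first aspect, the second aspect, or the third aspect.
According to a tenth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in the first aspect, the second aspect, or the third aspect.
Optionally, in an implementation, the chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in the first aspect, the second aspect, or the third aspect.
According to an eleventh aspect, an electronic device is provided. The electronic device includes the apparatus provided in the fourth aspect, the fifth aspect, the sixth aspect, or the seventh aspect.
In the solution provided in this application, training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features and generalizes better; the natural language understanding capability of the trained PLM is therefore improved.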
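Brief Description of Drawings
FIG. 1 is a schematic diagram of the training principle of a pretrained language model (PLM).
FIG. 2 is a schematic diagram of a system architecture to which the embodiments of this application are applicable.
FIG. 3 is a schematic flowchart of a method for obtaining mask training samples according to an embodiment of this application.
FIG. 4 is another schematic flowchart of the method for obtaining mask training samples according to an embodiment of this application.
FIG. 5 is still another schematic flowchart of the method for obtaining mask training samples according to an embodiment of this application.
FIG. 6 is a schematic diagram of the word-level mask ratios of the words in an original text sample according to an embodiment of this application.
FIG. 7 is a schematic flowchart of the optimization learning of the neural network model used to generate word-level mask ratios according to an embodiment of this application.
FIG. 8 is another schematic flowchart of the optimization learning of the neural network model used to generate word-level mask ratios according to an embodiment of this application.
FIG. 9 is a schematic flowchart of a data processing method according to another embodiment of this application.
FIG. 10 is a schematic flowchart of a data processing method according to yet another embodiment of this application.
FIG. 11 is a schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 12 is a schematic diagram of an application of the apparatus shown in FIG. 11.
FIG. 13 is another schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 14 is yet another schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 15 is a schematic diagram of a chip hardware structure according to an embodiment of this application.
Description of Embodiments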
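The technical solutions of this application are described below with reference to the accompanying drawings.
Natural language processing (NLP) is a technology that enables computers to understand and process human natural language, and is an important means of realizing artificial intelligence (AI). For example, NLP covers downstream tasks such as sentiment analysis, part-of-speech analysis, intent analysis, named entity recognition, reading comprehension, logical reasoning, machine translation, and conversational robots. The pretrained language model (PLM) is an important general-purpose model in the NLP field that has emerged in recent years. A PLM performs well on most downstream NLP tasks.
A commonly used PLM training scheme is the masked language model (MLM). The principle of MLM training is to make the PLM learn to capture contextual information.
As shown in FIG. 1, in the MLM training scheme, a PLM training sample is a masked text, that is, a sentence in which some words are replaced with a special marker (for example, [MASK]). For example, the original text is "今天是晴朗的周六" ("Today is a sunny Saturday") and the masked text is "今[MASK]是晴[MASK]的周六"; the masked text is input into the PLM, which must predict that the masked words are "天" and "朗" respectively. PLM training samples may be referred to as mask training samples. In a text (for example, a sentence), the unmasked words are the context of the masked words, and by predicting the masked words the PLM learns to capture contextual information. A PLM trained with the MLM scheme is therefore able to understand the deep semantics of natural language and can be used for a range of NLP downstream tasks.
In the current MLM training scheme, the words of each text are selected for masking by a random strategy at a fixed mask ratio to obtain mask training samples.
As described above, PLM training schemes have two directions for improvement: first, improving the natural language understanding capability of the PLM; second, speeding up model training (that is, speeding up model convergence). Training a PLM with mask training samples obtained by the existing MLM training scheme creates a bottleneck in the PLM's natural language understanding capability, for the following reason.
In the current MLM training scheme, the words of each text are selected for masking by a random strategy at a fixed mask ratio. For example, denote the fixed mask ratio as r; for each text, r*N words are randomly selected for masking, where N is the number of words in the text (if the text is regarded as a sentence, N is the sentence length). For example, for a sentence of length N=100 and a mask ratio r=15%, 100*15%=15 words of the sentence are randomly selected and replaced with [MASK].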
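The fixed-ratio baseline described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `fixed_ratio_mask`, the rounding of r*N, and the stand-in sentence are not from the application.

```python
import random

MASK = "[MASK]"

def fixed_ratio_mask(words, r=0.15, seed=None):
    """Replace round(r * N) randomly chosen words with [MASK], N = len(words)."""
    rng = random.Random(seed)
    n_mask = round(r * len(words))          # rounding rule is an assumption
    positions = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in positions else w for i, w in enumerate(words)]

# Stand-in sentence of length N = 100 (hypothetical tokens w0..w99).
sentence = [f"w{i}" for i in range(100)]
masked = fixed_ratio_mask(sentence, r=0.15, seed=0)
print(masked.count(MASK))  # 15
```

Every sample produced this way carries the same proportion of masks, which is the uniformity the application identifies as the source of the bottleneck.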
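In the current MLM training scheme, mask training samples are obtained by a random strategy at a fixed mask ratio. The training samples of the PLM therefore follow a single pattern, the features the PLM learns are correspondingly fixed, and the PLM lacks generalization capability, which creates a bottleneck in the natural language understanding capability of the trained PLM.
To address this problem, the embodiments of this application propose a solution for generating mask training samples for a PLM that can improve the natural language understanding capability of the trained PLM. In other words, training a PLM with mask training samples obtained according to the embodiments of this application can overcome the natural language understanding bottleneck of the prior art.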
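FIG. 2 is a schematic diagram of a system architecture to which the embodiments of this application are applicable. The system may include a data collection device 21, a server device 22, and a client device 23, which are connected through a communication network.
The data collection device 21 is configured to obtain original text samples (for example, a large number of sentences) and transmit them to the server device 22.
The data collection device 21 may obtain the original text samples in multiple ways, for example, through manual input and/or web search.
The server device 22 is configured to obtain mask training data by using the solution provided in the embodiments of this application and then obtain a trained PLM, which may be output to the client device 23.
The client device 23 is configured to perform natural language understanding and processing with the PLM trained by the server device 22, for example, any one or more of the following NLP downstream tasks: sentiment analysis, part-of-speech analysis, intent analysis, named entity recognition, reading comprehension, logical reasoning, machine translation, or conversational robots.
It should be noted that FIG. 2 is merely an example and not a limitation.
For example, the data collection device 21 is optional; its operations may be performed on the server device 22.
For another example, the client device 23 is optional; its operations may be performed on the server device 22.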
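For ease of understanding and description, the terms used herein are explained as follows.
1. Original text samples
The original text samples are the set of texts to be masked. Each of the original text samples is one text (also called a text sentence). For example, the original text samples are a set of multiple text sentences.
2. Mask training samples
The mask training samples are the set of masked texts. Each of the mask training samples is one text after mask processing.
3. Mask ratio
The mask ratio in the embodiments of this application includes a text-level mask ratio and a word-level mask ratio.
The text-level mask ratio indicates the proportion of masked words in a text to all words in that text.
The text-level mask ratio may also be referred to as a sentence-level mask ratio or a sample-level mask ratio.
The word-level mask ratio indicates the probability that a word is masked. In a text, each word has its own word-level mask ratio.
The expressions "mask ratio of the mask training samples" and "mask ratio of the original text samples" used herein cover the text-level mask ratio and/or the word-level mask ratio.
The word-level mask ratio may also be referred to as the mask probability of a word.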
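FIG. 3 is a schematic flowchart of a data processing method 300 according to an embodiment of this application. For example, the method 300 may be performed by the server device 22 in FIG. 2. The method 300 includes steps S310 and S320.
S310: Determine original text samples, on which no mask processing has been performed.
As an example, one of the original text samples is "今天是晴朗的周六".
The original text samples may be obtained in multiple ways, for example, through manual input and/or web search.
S320: Perform mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples non-fixed. The mask training samples are used to train a PLM, and the PLM is used to predict masked words.
As an example, one of the original text samples is "今天是晴朗的周六", and the corresponding training sample after masking is "今[MASK]是晴[MASK]的周六".
The mask ratio of the mask training samples includes a text-level mask ratio and/or a word-level mask ratio.
For example, the mask ratio of the mask training samples is the text-level mask ratio.
In this example, that the mask ratio is not fixed means that the text-level mask ratios of different samples in the mask training samples are not all the same. The word-level mask ratios of different words within each sample may be the same, different, or not all the same.
As an example, the mask training samples include a first sample and a second sample, where the text-level mask ratio of the first sample is 15% and that of the second sample is 20%. Assuming that each of the two samples contains 100 words in total, 15 words are masked in the first sample and 20 words are masked in the second sample.
For another example, the mask ratio of the mask training samples is the word-level mask ratio.
In this example, that the mask ratio is not fixed means that the word-level mask ratios of different words within each of the mask training samples are not all the same. The text-level mask ratios of different samples may be the same, different, or not all the same.
For still another example, the mask ratio of the mask training samples includes both the text-level mask ratio and the word-level mask ratio.
In this example, that the mask ratio is not fixed means that the text-level mask ratios of different samples are not all the same, and the word-level mask ratios of different words within each sample are not all the same.
For example, step S320 includes: obtaining a mask strategy that makes the mask ratio of the original text samples non-fixed; and determining, according to the mask strategy, whether each word in each original text sample needs to be masked. If so, the word is replaced with the marker (for example, [MASK]); if not, the word is left unchanged. The mask training samples are finally obtained.
The mask strategy may be obtained in multiple ways, which are described later.
It should be understood that training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features and generalizes better; the natural language understanding capability of the trained PLM is therefore improved.
It should be noted that step S310 is optional. For example, in practice, when the original text samples are already known or readily available, mask processing can be performed on them directly to obtain the mask training samples; that is, step S320 is performed directly without step S310.
In step S320, the original text samples can be masked in multiple ways to obtain the mask training samples. In other words, the mask strategy for the original text samples can be obtained in multiple ways.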
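Optionally, in one implementation, as shown in FIG. 4, step S320 includes steps S321 and S322.
S321: Generate a text-level mask ratio for each of the original text samples by using a prior probability distribution model, where the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same.
In other words, a mask ratio is generated for each original text sample by using the prior probability distribution model.
As an example, for the i-th original text sample, a probability is generated with the prior probability distribution model and used as the text-level mask ratio of the i-th sample, where i = 1, …, M and M is the number of original text samples.
The probabilities generated by the prior probability distribution model follow a probability distribution, so the generated mask ratios vary dynamically rather than being fixed. That is, the text-level mask ratios generated for the original text samples are not all the same: for example, all samples have different text-level mask ratios, or at least some of the samples have different text-level mask ratios.
As an example, denote the prior probability distribution model as P(r), where r is a probability whose value range may be 0% to 100%; the mask ratios generated with P(r) then range from 0% to 100%.
The probability distribution followed by the prior probability distribution model may be any continuous or discrete distribution, for example, a uniform distribution or a Gaussian distribution. The Gaussian distribution is also called the normal distribution.
Optionally, the probability distribution followed by the prior probability distribution model is a truncated Gaussian distribution (also called a truncated normal distribution).
The truncation range of the truncated Gaussian distribution can be set according to application requirements.
S322: Mask each original text sample according to its text-level mask ratio to obtain the mask training samples.
As an example, assume that in step S321 the mask ratios r1 and r2 are generated for text sample 1 and text sample 2 of the original text samples. In step S322, text sample 1 is masked at ratio r1 to obtain its training sample (training sample 1), and text sample 2 is masked at ratio r2 to obtain its training sample (training sample 2). If r1 and r2 differ, training sample 1 and training sample 2 have different mask ratios.
In the above example, assume that text sample 1 contains N1 words in total. One way to obtain the training sample corresponding to text sample 1 is to select, by a random strategy at mask ratio r1, r1*N1 words of text sample 1 for masking. Alternatively, another feasible strategy may be used to select the r1*N1 words of text sample 1 to be masked. This is not limited in the embodiments of this application.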
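Steps S321 and S322 can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names, the 0% to 40% interval used as the default, rejection sampling as the way to draw from a truncated Gaussian, and the rounding of r*N are all assumptions.

```python
import random

MASK = "[MASK]"

def sample_text_level_ratio(rng, low=0.0, high=0.4, mu=None, sigma=None):
    """Draw one ratio r from a prior P(r) supported on [low, high].

    Uniform by default; if mu and sigma are given, a truncated Gaussian
    drawn by rejection sampling (an assumed, simple way to truncate)."""
    if mu is None or sigma is None:
        return rng.uniform(low, high)
    while True:
        r = rng.gauss(mu, sigma)
        if low <= r <= high:
            return r

def mask_with_ratio(words, r, rng):
    """S322: mask round(r * N) randomly chosen words of one sample."""
    n_mask = min(len(words), round(r * len(words)))
    positions = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in positions else w for i, w in enumerate(words)]

def build_mask_training_samples(samples, seed=0, **prior):
    """S321 + S322: one freshly drawn text-level ratio per sample."""
    rng = random.Random(seed)
    out = []
    for words in samples:
        r = sample_text_level_ratio(rng, **prior)
        out.append((r, mask_with_ratio(words, r, rng)))
    return out

samples = [list("今天是晴朗的周六"), [f"w{i}" for i in range(100)]]
for r, masked in build_mask_training_samples(samples, seed=1, mu=0.2, sigma=0.1):
    print(f"r={r:.2f}", masked.count(MASK), "of", len(masked), "masked")
```

Because r is drawn per sample, two samples of the same length will generally end up with different numbers of masked words, which is what "not all the same" requires at the text level.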
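It should be understood that, in the embodiment shown in FIG. 4, the text-level mask ratios of the mask training samples are not all the same; in other words, the text-level mask ratio is not fixed.
In the embodiments of this application, training the PLM with mask training samples whose text-level mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features, which improves the natural language understanding capability of the trained PLM.
Optionally, in the embodiment shown in FIG. 4, the length of the probability value interval of the prior probability distribution model is not less than 40%.
For example, the probability value interval of the prior probability distribution model is 0% to 40%.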
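Simulation experiments show that a PLM trained with mask training samples obtained in this embodiment is able to generate natural language in a random order.
As an example, a PLM trained with mask training samples obtained in this embodiment can generate natural language in the random-order manner shown in Table 1.
Ordinarily, natural language text is generated from left to right. A PLM trained with mask training samples obtained in this embodiment can instead be given the position of the next word to generate at each step, and can still generate fluent text when the order is random.
Table 1
[Table 1 is provided as an image in the source (Figure PCTCN2021078390-appb-000001) and is not reproduced here.]
In Table 1, the generation order of the words in the sequence "This is a sentence generated in random order" is 3→7→1→2→4→6→5→8.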
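Since Table 1 survives only as an image reference, the following sketch shows one plausible reading of the stated order: the numbers are taken as the 1-based positions filled at successive steps. That reading, and the `oracle_predict` stand-in that simply returns the known word instead of calling a trained PLM, are assumptions.

```python
target = "This is a sentence generated in random order".split()
order = [3, 7, 1, 2, 4, 6, 5, 8]   # 1-based positions, in generation order

def oracle_predict(state, position):
    """Stand-in for the PLM: returns the word at the given 1-based position.

    A real PLM would predict it from the words already present in `state`."""
    return target[position - 1]

state = ["_"] * len(target)
history = []
for step, pos in enumerate(order, start=1):
    state[pos - 1] = oracle_predict(state, pos)
    history.append(" ".join(state))
    print(step, pos, history[-1])
```

Under this reading the first step produces "_ _ a _ _ _ _ _" and the eighth step completes the sentence.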
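Optionally, in another implementation, as shown in FIG. 5, step S320 includes steps S323 and S324, through which a first training sample corresponding to a first text sample of the original text samples is obtained.
For ease of description and not as a limitation, in the embodiment shown in FIG. 5 the first text sample stands for each of the original text samples. That is, the description of the first text sample below applies to every original text sample.
S323: Obtain a word-level mask ratio for each word in the first text sample of the original text samples, where the word-level mask ratios of different words in the first text sample are not all the same.
That the word-level mask ratios of different words in the first text sample are not all the same means that at least two words in the first text sample have different word-level mask ratios.
Optionally, all words in the first text sample have different word-level mask ratios.
Optionally, some words in the first text sample have different word-level mask ratios and some have the same word-level mask ratio.
As an example, assume that the first text sample is "今天是晴朗的周六". FIG. 6 shows the distribution of the word-level mask ratios obtained in step S323 for the words in "今天是晴朗的周六". In the example of FIG. 6, all words in the first text sample have different word-level mask ratios.
The word-level mask ratio of a word is the probability that the word is masked.
S324: Mask some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain the first training sample of the mask training samples.
Masking some of the words in the first text sample means masking the words with relatively large mask ratios.
Based on the word-level mask ratios of the words in the first text sample, the first text sample can be masked in multiple ways to obtain the first training sample.
Optionally, in one manner, step S324 includes: masking, in descending order of word-level mask ratio, the first S words of the first text sample to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample.
As an example, again using FIG. 6, the first text sample is "今天是晴朗的周六" with the word-level mask ratios shown in FIG. 6. Assuming S is 2, the 2 words with the largest mask ratios, "朗" and "天", are masked to obtain the first training sample "今[MASK]是晴[MASK]的周六".
Optionally, in another manner, step S324 includes: masking, in descending order of word-level mask ratio, the words in the top G% of the first text sample to obtain the first training sample, where G is an integer greater than 0 and less than 100.
As an example, again using FIG. 6, assuming G is 25, the words in the top 25% in descending order of word-level mask ratio, namely "朗" and "天", are masked to obtain the first training sample "今[MASK]是晴[MASK]的周六".
Optionally, in still another manner, step S324 includes: masking the words in the first text sample whose mask ratio reaches D to obtain the first training sample, where D is a decimal greater than 0 and less than 1, and D is greater than the smallest word-level mask ratio in the first text sample (so that only some of the words are masked).
That the word-level mask ratio of a word reaches D means that the ratio is greater than or equal to D.
As an example, again using FIG. 6, assuming that only the mask ratios of "朗" and "天" reach D, "朗" and "天" are masked to obtain the first training sample "今[MASK]是晴[MASK]的周六".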
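The three selection rules for step S324 (top S words, top G%, threshold D) can be sketched as below. The per-word ratios are hypothetical values chosen so that "朗" and "天" rank highest, as in the FIG. 6 example; the actual values in FIG. 6 are not available, and the function names and the truncation used for G% are assumptions.

```python
MASK = "[MASK]"

def mask_top_s(words, ratios, s):
    """Mask the S words with the highest word-level mask ratios."""
    ranked = sorted(range(len(words)), key=lambda j: ratios[j], reverse=True)
    chosen = set(ranked[:s])
    return [MASK if j in chosen else w for j, w in enumerate(words)]

def mask_top_g_percent(words, ratios, g):
    """Mask the words ranked in the top G% by word-level mask ratio."""
    s = int(len(words) * g / 100)   # truncation is an assumed rounding rule
    return mask_top_s(words, ratios, s)

def mask_reaching_d(words, ratios, d):
    """Mask every word whose word-level mask ratio is >= D."""
    return [MASK if p >= d else w for w, p in zip(words, ratios)]

words = list("今天是晴朗的周六")
# Hypothetical ratios for 今 天 是 晴 朗 的 周 六 (not the values of FIG. 6).
ratios = [0.10, 0.80, 0.05, 0.30, 0.90, 0.15, 0.40, 0.20]

print("".join(mask_top_s(words, ratios, s=2)))
print("".join(mask_top_g_percent(words, ratios, g=25)))
print("".join(mask_reaching_d(words, ratios, d=0.5)))
```

With these ratios all three rules yield 今[MASK]是晴[MASK]的周六, matching the example in the text.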
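It should be understood that, in the embodiment shown in FIG. 5, the word-level mask ratios of the mask training samples are not all the same; in other words, the word-level mask ratio is not fixed.
As described above, the prior art selects the words of each text for masking by a random strategy at a fixed mask ratio. Randomly generated mask training samples may have repeated features, and training a PLM with them causes the PLM to repeatedly learn the same training samples, so fast model convergence cannot be guaranteed.
In the embodiments of this application, the words in each original text sample are given word-level mask ratios that are not all the same, and the mask strategy is determined from these ratios rather than by a random strategy. This reduces or avoids repeated features in the mask training samples, which to some extent prevents the PLM from repeatedly learning the same samples during training and enables fast model convergence.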
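In step S323, the word-level mask ratio of each word in the first text sample can be obtained in multiple ways such that the word-level mask ratios of different words in the first text sample are not all the same.
Optionally, in one implementation, step S323 includes: generating the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.
In other words, a mask ratio is generated for each word in the first text sample by using the prior probability distribution model. For example, for the j-th word in the first text sample, a probability is generated with the prior probability distribution model and used as the word-level mask ratio of the j-th word, where j = 1, …, N1 and N1 is the total number of words in the first text sample.
As an example, assume that the first text sample is "今天是晴朗的周六". FIG. 6 shows the distribution of the word-level mask ratios obtained in step S323 with the prior probability distribution model for the words in "今天是晴朗的周六".
It should be understood that the probabilities generated by the prior probability distribution model follow a probability distribution (for example, with values between 0% and 100%), so the generated mask ratios vary dynamically rather than being fixed.
As an example, denote the prior probability distribution model as P(r), where r is a probability whose value range may be 0% to 100%; the mask ratios generated with P(r) then range from 0% to 100%.
The prior probability distribution model mentioned here is the same as that mentioned in step S321; see the description above for details.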
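A brief sketch of this prior-based variant of step S323, followed by a top-S selection for step S324. The uniform prior on [0, 1), the function names, and S = 2 are assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def sample_word_level_ratios(words, rng):
    """S323 (prior-based variant): draw one ratio per word from P(r).

    P(r) is taken to be uniform on [0, 1) here, which is an assumption."""
    return [rng.random() for _ in words]

def mask_top_s(words, ratios, s):
    """S324: mask the S words with the highest word-level mask ratios."""
    ranked = sorted(range(len(words)), key=lambda j: ratios[j], reverse=True)
    chosen = set(ranked[:s])
    return [MASK if j in chosen else w for j, w in enumerate(words)]

rng = random.Random(0)
words = list("今天是晴朗的周六")          # N1 = 8 words
ratios = sample_word_level_ratios(words, rng)
masked = mask_top_s(words, ratios, s=2)
print("".join(masked))
```

Which two words are masked depends on the draw, so repeated passes over the same sentence tend to produce different training samples.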
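Optionally, in another implementation, step S323 includes: inputting the first text sample into a neural network model and obtaining, from the output of the neural network model, the word-level mask ratio of each word in the first text sample, where the output of the neural network model is the word-level mask ratio of each word in the input text. The neural network model is obtained through optimization learning, and the optimization process is shown in FIG. 7. In FIG. 7, i is initially 1.
1) Input the i-th sample of the original text samples into the neural network model, and obtain from the output of the neural network model the word-level mask ratio of each word in the i-th sample.
Obtaining the word-level mask ratio of each word in the i-th sample from the output of the neural network model means that the ratio of each word can be obtained from the output signal of the neural network model for that word.
For example, the neural network model may output a loss value (loss) for each word, and the loss value can be mapped to a mask ratio. As shown in FIG. 8, the neural network model can output a loss value for each word in the sample "今天是晴朗的周六"; for example, loss_0 is the loss value output by the neural network model for "今" in the sample, and loss_1 to loss_7 have similar meanings.
In this example, the word-level mask ratio of a word is obtained from the loss value output by the neural network model for the word and the mapping between loss values and mask ratios.
In this example, the mapping between the loss values output by the neural network model and mask ratios can be designed according to application requirements, which is not limited in this application.
For another example, the neural network model may directly output the word-level mask ratio of each word.
In this example, the word-level mask ratio of each word can be obtained directly from the output signal of the neural network model for each word in the i-th sample.
2) Mask some of the words in the i-th sample according to the word-level mask ratios of the words in the i-th sample, to obtain the training sample corresponding to the i-th sample.
Step 2) may correspond to step S324; for how the i-th sample is masked according to the word-level mask ratios of its words, see the description above.
3) Input the training sample corresponding to the i-th sample into the PLM, and obtain the loss values output by the PLM for the masked words.
Because the loss values output by the PLM for the masked words in step 3) serve as the feedback signal for the neural network model, they may be referred to as the feedback signal.
For the input mask training data, the PLM can predict the masked words and can also output a loss value (loss') for each masked word.
Optionally, the PLM may be a model with fixed parameters. An alternative is described later.
4) Update and optimize the neural network model according to the loss values output by the PLM for the masked words and the output signals of the neural network model for the masked words.
A first signal is obtained from the loss values output by the PLM for the masked words, and a second signal is obtained from the output signals of the neural network model for the masked words, where the first signal and the second signal have the same meaning (that is, they are comparable); the neural network model is then optimized and updated based on the difference between the first signal and the second signal.
Optionally, the loss value output by the PLM for a masked word (output signal 1) and the output signal of the neural network model for the masked word (output signal 2) have the same meaning and can be compared directly; in this case the neural network model can be optimized and updated directly based on the difference between output signal 1 and output signal 2.
As an example, as shown in FIG. 8, the PLM outputs the loss values loss_1' and loss_4' for the masked words "天" and "朗" respectively, and the output signals of the neural network model for the masked words are also loss values (loss_1 and loss_4 in FIG. 8); the neural network model can then be optimized and updated by comparing the loss values output by the PLM with those output by the neural network model.
Optionally, the loss value output by the PLM for a masked word (output signal 1) and the output signal of the neural network model for the masked word (output signal 2) do not have the same meaning and cannot be compared directly; in this case one of the two signals can be converted into a signal with the same meaning as the other before the comparison.
As an example, the output signal of the neural network model for a masked word is a mask ratio, and the loss value output by the PLM for the masked word has a mapping to mask ratios. In this case, the loss value output by the PLM is first converted into a mask ratio according to the mapping and then compared with the mask ratio output by the neural network model for the masked word, so as to optimize and update the neural network model.
The method for establishing the mapping between loss values and mask ratios is not limited in the embodiments of this application. For example, each loss value is divided by the same larger value (greater than all loss values), and the resulting quotient is used as the mask ratio to which the loss value maps.
5) Determine whether the neural network model satisfies a convergence condition; if so, go to step 6); if not, increase i by 1 and go to step 1).
6) Use the neural network model obtained in step 4) as the neural network model obtained through optimization learning.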
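Steps 1) to 6) can be sketched as a toy loop. Everything here is a stand-in chosen for illustration: `ToyScorer` is a lookup table rather than a neural network, `toy_plm_losses` returns fixed hypothetical loss values instead of querying a PLM, the loss-to-ratio mapping divides by an assumed constant `LOSS_SCALE` larger than every loss, the update moves the scorer's output toward the mapped PLM loss in proportion to their difference, and the sample index is 0-based and wraps around.

```python
MASK = "[MASK]"
LOSS_SCALE = 10.0   # assumed constant larger than every PLM loss value

class ToyScorer:
    """Stand-in for the neural network model: one learnable ratio per word."""
    def __init__(self, init=0.5):
        self.init = init
        self.table = {}

    def ratios(self, words):
        return [self.table.get(w, self.init) for w in words]

    def update(self, word, target, lr=0.5):
        """Move the stored ratio toward the target; return the old gap."""
        old = self.table.get(word, self.init)
        self.table[word] = old + lr * (target - old)
        return abs(target - old)

def toy_plm_losses(masked, original):
    """Stand-in for the PLM: a loss per masked position (hypothetical values)."""
    difficulty = {"天": 6.0, "朗": 8.0}
    return {k: difficulty.get(original[k], 1.0)
            for k, w in enumerate(masked) if w == MASK}

def optimize_scorer(samples, scorer, s=2, tol=1e-3, max_iters=1000):
    i = 0                                              # 0-based sample index
    for _ in range(max_iters):
        words = samples[i % len(samples)]
        ratios = scorer.ratios(words)                  # step 1
        top = set(sorted(range(len(words)),
                         key=lambda k: ratios[k], reverse=True)[:s])
        masked = [MASK if k in top else w
                  for k, w in enumerate(words)]        # step 2
        gap = 0.0
        for k, loss in toy_plm_losses(masked, words).items():   # step 3
            target = loss / LOSS_SCALE                 # map loss to a ratio
            gap = max(gap, scorer.update(words[k], target))     # step 4
        if gap < tol:                                  # step 5: converged
            break
        i += 1
    return scorer                                      # step 6

words = list("今天是晴朗的周六")
scorer = optimize_scorer([words], ToyScorer())
print([round(p, 2) for p in scorer.ratios(words)])
```

In this toy run the scorer ends up ranking "朗" and "天" highest, because those are the words the stand-in PLM reports as hardest. Words that are never selected keep their initial ratio, a limitation of the greedy top-S selection in this sketch rather than of the method.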
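As an example, assume that one of the original text samples is "今天是晴朗的周六". FIG. 8 is a schematic flowchart of one round of optimization learning (one iteration) of the neural network model using this sample.
S810: Input the sample "今天是晴朗的周六" into the neural network model, and obtain from the output of the neural network model the word-level mask ratio of each word in the sample. S810 corresponds to step 1) in FIG. 7.
Obtaining the word-level mask ratio of each word in the sample from the output of the neural network model means that the ratio of each word can be obtained from the output signal of the neural network model for that word.
As shown in FIG. 8, the neural network model outputs a loss value for each word in the sample "今天是晴朗的周六"; for example, loss_0 is the loss value output for "今", and loss_1 to loss_7 have similar meanings. The loss value output by the neural network model for each word has a mapping to a mask ratio: loss_0 to loss_7 in FIG. 8 each map to one mask ratio. The word-level mask ratio of each word, including the words that are masked, can therefore be obtained from the loss values and the mapping.
S820: Mask the words "天" and "朗" in the sample "今天是晴朗的周六", whose mask ratios satisfy the condition, to obtain the corresponding training sample "今[MASK]是晴[MASK]的周六". S820 corresponds to step 2) in FIG. 7.
S830: Input the mask training sample "今[MASK]是晴[MASK]的周六" into the PLM, obtain from the output of the PLM the prediction results for the masked words in the sample, and obtain the output signals of the PLM for the masked words (that is, "天" and "朗"). S830 corresponds to step 3) in FIG. 7.
S840: Update and optimize the neural network model according to the output signals of the neural network model for the masked words (that is, "天" and "朗") and the loss values output by the PLM for the masked words. S840 corresponds to step 4) in FIG. 7.
Optionally, in the embodiment shown in FIG. 7, the PLM is a model whose parameters are updated in real time based on the mask training samples.
For example, step 3) includes: performing one training update on the PLM by using the training sample corresponding to the i-th sample; and inputting the training sample corresponding to the i-th sample into the updated PLM to obtain the loss values output by the updated PLM for the masked words.
Performing one training update on the PLM by using the training sample corresponding to the i-th sample means training the PLM once with that training sample so that the parameters of the PLM are updated.
Step 4) then includes: updating and optimizing the neural network model according to the loss values output by the updated PLM for the masked words and the output signals of the neural network model for the masked words.
Training and updating the PLM while the neural network model that generates word-level mask ratios is being optimized makes it possible to generate still better mask training samples; training the PLM with such samples enables fast convergence of the PLM and improves its natural language understanding capability.
It should be understood that, in the embodiment shown in FIG. 7 or FIG. 8, the optimization learning of the neural network model is equivalent to the optimization learning of the mask strategy.
In this embodiment, generating the word-level mask ratio of each word in each original text sample with the neural network model obtained through optimization learning is equivalent to learning the mask strategy, so better mask training samples can be generated. Training the PLM with such samples enables fast convergence of the PLM and improves its natural language understanding capability.
In contrast to the embodiment shown in FIG. 7 or FIG. 8, which obtains the mask strategy through a learnable neural network model, the earlier embodiments that obtain the mask strategy from a probability distribution model can be regarded as obtaining the mask strategy from a preset model (or from experience).
In the embodiments of this application, the generation of mask training samples can be controlled in certain ways: for example, by controlling the mask ratio of a text (that is, a sentence), namely the text-level mask ratio, or by controlling the word-level mask ratios of the words in a text. Mask training samples are thus generated in a controlled manner rather than randomly, and this control can be used to improve both the natural language understanding capability and the convergence speed of the PLM.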
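The solution for obtaining mask training samples for a PLM provided in the embodiments of this application is applicable to all MLM-based PLMs.
As shown in FIG. 9, an embodiment of this application further provides a method 900 for training a PLM. For example, the method 900 may be performed by the server device 22 in FIG. 2. The method 900 includes steps S910 and S920.
S910: Obtain mask training samples by using the method 300 in the foregoing embodiments.
S920: Train a pretrained language model (PLM) with the mask training samples, where the PLM is used to predict masked words.
In the embodiments of this application, training the PLM with mask training samples whose mask ratio is not fixed increases the pattern diversity of the training samples, so that the PLM learns more diverse features and generalizes better; the natural language understanding capability of the trained PLM is therefore improved.
Further, the words in each original text sample are given word-level mask ratios that are not all the same, and the mask strategy is determined from these ratios rather than by a random strategy. This reduces or avoids repeated features in the mask training samples, which to some extent prevents the PLM from repeatedly learning the same samples during training and enables fast model convergence.
Still further, during the mask processing of the original text samples, the word-level mask ratio of each word in each sample is generated with the neural network model obtained through optimization learning, and the mask strategy is then determined from these ratios. Better mask training samples can thus be generated, and training the PLM with them enables fast convergence of the PLM and improves its natural language understanding capability.
The PLM training solution provided in the embodiments of this application is applicable to all MLM-based PLMs.
As shown in FIG. 10, an embodiment of this application further provides a data processing method 1000. For example, the method 1000 may be performed by the server device 22 or the client device 23 in FIG. 2. The method 1000 includes steps S1010 and S1020.
S1010: Determine a target text to be predicted, where the target text includes a sentence with some words missing.
That the target text includes a sentence with some words missing may also be expressed as: the target text includes a sentence lacking contextual information.
S1020: Input the target text into a PLM, and predict from the output of the PLM the words missing from the target text, where the PLM is a PLM trained by using the method 900 in the foregoing embodiments.
Simulation experiments show that a PLM trained with the solution provided in the embodiments of this application achieves clearly improved results on downstream tasks related to natural language understanding.
As an example, Table 2 shows that a PLM trained with the solution provided in the embodiments of this application (denoted u-PMLM-A) scores higher than the existing BERT(A) model.
Table 2
Model     COLA  SST2  MRPC       STSB       QQP        MNLI-m/mm  QNLI  RTE   AX    AVG.
BERT(A)   52.1  93.5  88.9/84.8  87.1/85.8  71.2/89.2  84.6/83.4  90.5  66.4  34.2  78.3
u-PMLM-A  56.5  94.3  88.8/84.4  87.0/85.9  71.4/89.2  84.5/83.5  91.8  66.1  37.0  79.0
In the header row of Table 2, COLA, SST2, MRPC, STSB, QQP, MNLI-m/mm, QNLI, RTE, and AX are the names of subtasks in the natural language processing task collection GLUE, and AVG. is the average score over these subtasks.
The embodiments described herein may be independent solutions or may be combined according to their internal logic; all such solutions fall within the protection scope of this application.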
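The method embodiments of this application are described above, and the apparatus embodiments are described below. It should be understood that the descriptions of the apparatus embodiments correspond to those of the method embodiments; for content not described in detail, see the method embodiments above.
FIG. 11 is a schematic block diagram of a data processing apparatus 1100 according to an embodiment of this application. The apparatus 1100 includes a mask generation module 1110 and a PLM model training module 1120.
The mask generation module 1110 is configured to perform mask processing on the original text samples by using the method 300 provided in the embodiments of this application, to obtain mask training data.
For example, in some embodiments, the mask generation module 1110 includes the neural network model shown in FIG. 8.
The PLM model training module 1120 is configured to train the PLM model with the mask training data obtained by the mask generation module 1110.
As an example, as shown in FIG. 12, the text sample "今天是晴朗的周六" is input into the mask generation module 1110, which outputs the training sample "今[MASK]是晴[MASK]的周六"; the training sample is input into the PLM model training module 1120, which outputs the prediction results "天" and "朗" for the masked words.
As shown in FIG. 13, an embodiment of this application further provides a data processing apparatus 1300 configured to perform the foregoing method embodiments. The apparatus 1300 includes a first processing unit 1310 and a second processing unit 1320.
Optionally, in a first design, the apparatus 1300 is configured to perform the method 300 in the foregoing method embodiments. The first processing unit 1310 is configured to determine original text samples, on which no mask processing has been performed; the second processing unit 1320 is configured to perform mask processing on the original text samples to obtain mask training samples, where the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a PLM.
The mask ratio of the mask training samples includes a text-level mask ratio and/or a word-level mask ratio. See the description above; details are not repeated here.
Optionally, the second processing unit 1320 is configured to: generate a text-level mask ratio for each of the original text samples by using a prior probability distribution model, where the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same; and mask each sample according to its text-level mask ratio to obtain the mask training samples.
Optionally, the length of the probability value interval of the prior probability distribution model is not less than 40%.
Optionally, the second processing unit 1320 is configured to: obtain a word-level mask ratio for each word in a first text sample of the original text samples, where the word-level mask ratios of different words in the first text sample are not all the same; and mask some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples.
Optionally, the second processing unit 1320 is configured to generate the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, where the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.
Optionally, the second processing unit 1320 is configured to input the first text sample into a neural network model and obtain, from the output of the neural network model, the word-level mask ratio of each word in the first text sample, where the output of the neural network model is the word-level mask ratio of each word in the input text. The neural network model is obtained through optimization learning by the steps shown in FIG. 7, with i initially set to 1. See the description above; details are not repeated here.
Optionally, the second processing unit 1320 is configured to mask, in descending order of word-level mask ratio, the first S words or the words in the top G% of the first text sample to obtain the first training sample, where S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.
The apparatus 1300 in the first design may be disposed in the mask generation module 1110 of the apparatus 1100.
In some embodiments, the mask generation module 1110 of the apparatus 1100 includes the apparatus 1300 in the first design.
Optionally, in a second design, the apparatus 1300 is configured to perform the method 900 in the foregoing method embodiments. The first processing unit 1310 is configured to obtain mask training samples by using the method 300 in the foregoing method embodiments; the second processing unit 1320 is configured to train a pretrained language model (PLM) with the mask training samples, where the PLM is used to predict masked words.
The apparatus 1300 in the second design may be disposed in the PLM model training module 1120 of the apparatus 1100.
In some embodiments, the PLM model training module 1120 of the apparatus 1100 includes the apparatus 1300 in the second design.
Optionally, in a third design, the apparatus 1300 is configured to perform the method 1000 in the foregoing method embodiments. The first processing unit 1310 is configured to determine a target text to be predicted, where the target text includes a sentence with some words missing; the second processing unit 1320 is configured to input the target text into a pretrained language model (PLM) and predict, from the output of the PLM, the words missing from the target text, where the PLM is trained by using the method 900 in the foregoing method embodiments.
It should be understood that the first processing unit 1310 and the second processing unit 1320 may be implemented by a processor.
As shown in FIG. 14, an embodiment of this application further provides a data processing apparatus 1400. The apparatus 1400 includes a processor 1410 coupled to a memory 1420. The memory 1420 is configured to store a computer program or instructions, and the processor 1410 is configured to execute the computer program or instructions stored in the memory 1420, so that the methods in the foregoing method embodiments are performed.
Optionally, as shown in FIG. 14, the apparatus 1400 may further include the memory 1420.
Optionally, as shown in FIG. 14, the apparatus 1400 may further include a data interface 1430 configured to exchange data with the outside.
Optionally, in one solution, the apparatus 1400 is configured to implement the method 300 in the foregoing embodiments.
Optionally, in another solution, the apparatus 1400 is configured to implement the method 900 in the foregoing embodiments.
Optionally, in still another solution, the apparatus 1400 is configured to implement the method 1000 in the foregoing embodiments.
An embodiment of this application further provides a computer-readable medium that stores program code to be executed by a device, where the program code includes instructions for performing the methods in the foregoing embodiments.
An embodiment of this application further provides a computer program product containing instructions; when the computer program product runs on a computer, the computer is caused to perform the methods in the foregoing embodiments.
An embodiment of this application further provides a chip. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the methods in the foregoing embodiments.
Optionally, in an implementation, the chip may further include a memory that stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the methods in the foregoing embodiments.
An embodiment of this application further provides an electronic device that includes the apparatus 1100 in the foregoing embodiments.
An embodiment of this application further provides an electronic device that includes the apparatus 1300 in the first design, the apparatus 1300 in the second design, or the apparatus 1300 in the third design.
An embodiment of this application further provides an electronic device that includes the apparatus 1300 in the first design and the apparatus 1300 in the second design.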
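FIG. 15 shows a chip hardware structure according to an embodiment of this application. The chip includes a neural network processor 1500. The chip may be disposed in any one or more of the following apparatuses:
the apparatus 1100 shown in FIG. 11, the apparatus 1300 shown in FIG. 13, and the apparatus 1400 shown in FIG. 14.
The methods 300, 900, and 1000 in the foregoing method embodiments can all be implemented in the chip shown in FIG. 15.
The neural network processor 1500 is mounted to a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the neural network processor 1500 is an operation circuit 1503; a controller 1504 controls the operation circuit 1503 to fetch data from a memory (a weight memory 1502 or an input memory 1501) and perform operations.
In some implementations, the operation circuit 1503 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 1503 is a two-dimensional systolic array. The operation circuit 1503 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1503 is a general-purpose matrix processor.
For example, assume an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 1503 fetches the data of matrix B from the weight memory 1502 and caches it on each PE of the operation circuit 1503. The operation circuit 1503 then fetches the data of matrix A from the input memory 1501, performs a matrix operation with matrix B, and stores partial or final results of the resulting matrix in an accumulator 1508.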
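The role of the accumulator in the matrix example can be illustrated in software. This sketch is not a model of the systolic array or of the hardware data path; it only shows C = A·B being built up from partial results that are added into an accumulator, with the tile size and function name chosen as assumptions.

```python
def matmul_accumulate(A, B, tile=2):
    """Compute C = A @ B by summing partial products over slices of the
    shared dimension, keeping running totals in an accumulator."""
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0.0] * m for _ in range(n)]        # plays the role of the accumulator
    for k0 in range(0, k, tile):               # each pass adds one partial result
        k1 = min(k0 + tile, k)
        for i in range(n):
            for j in range(m):
                acc[i][j] += sum(A[i][t] * B[t][j] for t in range(k0, k1))
    return acc

A = [[1, 2, 3],
     [4, 5, 6]]          # input matrix A (2 x 3)
B = [[7, 8],
     [9, 10],
     [11, 12]]           # weight matrix B (3 x 2)
print(matmul_accumulate(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

After the first pass the accumulator holds only a partial result; it becomes the final matrix C once every slice of the shared dimension has been added.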
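A vector calculation unit 1507 can further process the output of the operation circuit 1503, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. For example, the vector calculation unit 1507 can be used for network computations in the non-convolutional/non-FC layers of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 1507 can store the processed output vector into a unified memory (also called a unified buffer) 1506. For example, the vector calculation unit 1507 may apply a nonlinear function to the output of the operation circuit 1503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 1507 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1503, for example for use in a subsequent layer of the neural network.
The methods 300, 900, and 1000 in the foregoing method embodiments may be performed by the operation circuit 1503 or the vector calculation unit 1507.
The unified memory 1506 is configured to store input data and output data.
A direct memory access controller (DMAC) 1505 transfers input data in an external memory to the input memory 1501 and/or the unified memory 1506, stores weight data from the external memory into the weight memory 1502, and stores data from the unified memory 1506 into the external memory.
A bus interface unit (BIU) 1510 is configured to implement interaction among the host CPU, the DMAC, and an instruction fetch buffer 1509 through a bus.
The instruction fetch buffer 1509, connected to the controller 1504, is configured to store instructions used by the controller 1504.
The controller 1504 is configured to invoke the instructions cached in the instruction fetch buffer 1509 to control the working process of the operation accelerator.
When the chip shown in FIG. 15 is used to perform the method 300 in the foregoing method embodiments, the data here may be the original text samples.
When the chip shown in FIG. 15 is used to perform the method 900 in the foregoing method embodiments, the data here may be the mask training samples.
When the chip shown in FIG. 15 is used to perform the method 1000 in the foregoing method embodiments, the data here may be the target text to be predicted.
Generally, the unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch buffer 1509 are all on-chip memories, and the external memory is a memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.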
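Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by a person skilled in the technical field of this application. The terms used in the specification of this application are merely for the purpose of describing specific embodiments and are not intended to limit this application.
It should be noted that ordinal numbers such as first, second, third, and fourth used herein are merely for ease of description and are not intended to limit the scope of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division by logical function, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The storage medium includes any medium that can store program code, such as a universal serial bus flash disk (UFD, also called a U disk or USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.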

Claims (25)

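  1. A data processing method, comprising:
    determining original text samples, on which no mask processing has been performed; and
    performing mask processing on the original text samples to obtain mask training samples, wherein the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a pretrained language model (PLM).
  2. The method according to claim 1, wherein the mask ratio of the mask training samples comprises:
    a text-level mask ratio, indicating the proportion of masked words in a sample to all words in the sample; and/or
    a word-level mask ratio, indicating the probability that a word is masked;
    wherein that the mask ratio of the mask training samples is not fixed comprises:
    the text-level mask ratios of different samples in the mask training samples are not all the same; and/or
    the word-level mask ratios of the words in any one of the mask training samples are not all the same.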
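  3. The method according to claim 1 or 2, wherein the performing mask processing on the original text samples to obtain mask training samples comprises:
    generating a text-level mask ratio for each of the original text samples by using a prior probability distribution model, wherein the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same; and
    masking each sample according to its text-level mask ratio to obtain the mask training samples.
  4. The method according to claim 3, wherein the length of the probability value interval of the prior probability distribution model is not less than 40%.
  5. The method according to claim 1 or 2, wherein the performing mask processing on the original text samples to obtain mask training samples comprises:
    obtaining a word-level mask ratio for each word in a first text sample of the original text samples, wherein the word-level mask ratios of different words in the first text sample are not all the same; and
    masking some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples.
  6. The method according to claim 5, wherein the obtaining a word-level mask ratio for each word in a first text sample of the original text samples comprises:
    generating the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, wherein the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.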
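  7. The method according to claim 5, wherein the obtaining a word-level mask ratio for each word in a first text sample of the original text samples comprises:
    inputting the first text sample into a neural network model and obtaining, from the output of the neural network model, the word-level mask ratio of each word in the first text sample, wherein the neural network model is obtained through optimization learning by the following steps, with i initially set to 1:
    1) inputting the i-th sample of the original text samples into the neural network model, and obtaining from the output of the neural network model the word-level mask ratio of each word in the i-th sample;
    2) masking some of the words in the i-th sample according to the word-level mask ratios of the words in the i-th sample, to obtain the training sample corresponding to the i-th sample;
    3) inputting the training sample corresponding to the i-th sample into the PLM, and obtaining the loss values of the PLM for the masked words;
    4) updating and optimizing the neural network model according to the loss values output by the PLM for the masked words and the output signals of the neural network model for the masked words;
    5) determining whether the neural network model satisfies a convergence condition; if so, going to step 6); if not, increasing i by 1 and going to step 1); and
    6) using the neural network model obtained in step 4) as the neural network model obtained through optimization learning.
  8. The method according to claim 7, wherein step 3) comprises:
    performing one training update on the PLM by using the training sample corresponding to the i-th sample; and
    inputting the training sample corresponding to the i-th sample into the updated PLM to obtain the loss values output by the updated PLM for the masked words;
    wherein step 4) comprises: updating and optimizing the neural network model according to the loss values output by the updated PLM for the masked words and the output signals of the neural network model for the masked words.
  9. The method according to any one of claims 5 to 8, wherein the masking some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples, comprises:
    masking, in descending order of word-level mask ratio, the first S words or the words in the top G% of the first text sample to obtain the first training sample, wherein S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.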
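  10. A data processing method, comprising:
    obtaining mask training samples by using the method according to any one of claims 1 to 9; and
    training a pretrained language model (PLM) with the mask training samples, wherein the PLM is used to predict masked words.
  11. A data processing method, comprising:
    determining a target text to be predicted, wherein the target text comprises a sentence with some words missing; and
    inputting the target text into a pretrained language model (PLM), and predicting from the output of the PLM the words missing from the target text,
    wherein the PLM is trained by using the method according to claim 10.
  12. A data processing apparatus, comprising:
    a first processing unit, configured to determine original text samples, on which no mask processing has been performed; and
    a second processing unit, configured to perform mask processing on the original text samples to obtain mask training samples, wherein the mask processing makes the mask ratio of the mask training samples non-fixed, and the mask training samples are used to train a pretrained language model (PLM).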
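  13. The apparatus according to claim 12, wherein the mask ratio of the mask training samples comprises:
    a text-level mask ratio, indicating the proportion of masked words in a sample to all words in the sample; and/or
    a word-level mask ratio, indicating the probability that a word is masked;
    wherein that the mask ratio of the mask training samples is not fixed comprises:
    the text-level mask ratios of different samples in the mask training samples are not all the same; and/or
    the word-level mask ratios of the words in any one of the mask training samples are not all the same.
  14. The apparatus according to claim 12 or 13, wherein the second processing unit is configured to:
    generate a text-level mask ratio for each of the original text samples by using a prior probability distribution model, wherein the prior probability distribution model makes the text-level mask ratios of different original text samples not all the same; and
    mask each sample according to its text-level mask ratio to obtain the mask training samples.
  15. The apparatus according to claim 14, wherein the length of the probability value interval of the prior probability distribution model is not less than 40%.
  16. The apparatus according to claim 12 or 13, wherein the second processing unit is configured to:
    obtain a word-level mask ratio for each word in a first text sample of the original text samples, wherein the word-level mask ratios of different words in the first text sample are not all the same; and
    mask some of the words in the first text sample according to the word-level mask ratios of the words in the first text sample, to obtain a first training sample of the mask training samples.
  17. The apparatus according to claim 16, wherein the second processing unit is configured to generate the word-level mask ratio of each word in the first text sample by using a prior probability distribution model, wherein the prior probability distribution model makes the word-level mask ratios of different words in the first text sample not all the same.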
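  18. The apparatus according to claim 16, wherein the second processing unit is configured to input the first text sample into a neural network model and obtain, from the output of the neural network model, the word-level mask ratio of each word in the first text sample,
    wherein the neural network model is obtained through optimization learning by the following steps, with i initially set to 1:
    1) inputting the i-th sample of the original text samples into the neural network model, and obtaining from the output of the neural network model the word-level mask ratio of each word in the i-th sample;
    2) masking some of the words in the i-th sample according to the word-level mask ratios of the words in the i-th sample, to obtain the training sample corresponding to the i-th sample;
    3) inputting the training sample corresponding to the i-th sample into the PLM, and obtaining the loss values output by the PLM for the masked words;
    4) updating and optimizing the neural network model according to the loss values output by the PLM for the masked words and the output signals of the neural network model for the masked words;
    5) determining whether the neural network model satisfies a convergence condition; if so, going to step 6); if not, increasing i by 1 and going to step 1); and
    6) using the neural network model obtained in step 4) as the neural network model obtained through optimization learning.
  19. The apparatus according to claim 18, wherein step 3) comprises:
    performing one training update on the PLM by using the training sample corresponding to the i-th sample; and
    inputting the training sample corresponding to the i-th sample into the updated PLM to obtain the loss values of the updated PLM for the masked words;
    wherein step 4) comprises: updating and optimizing the neural network model according to the loss values of the updated PLM for the masked words and the output signals of the neural network model for the masked words.
  20. The apparatus according to any one of claims 16 to 19, wherein the second processing unit is configured to mask, in descending order of word-level mask ratio, the first S words or the words in the top G% of the first text sample to obtain the first training sample, wherein S is a positive integer less than the total number of words in the first text sample, and G is an integer greater than 0 and less than 100.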
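  21. A data processing apparatus, comprising:
    a first processing unit, configured to obtain mask training samples by using the method according to any one of claims 1 to 9; and
    a second processing unit, configured to train a pretrained language model (PLM) with the mask training samples, wherein the PLM is used to predict masked words.
  22. A data processing apparatus, comprising:
    a first processing unit, configured to determine a target text to be predicted, wherein the target text comprises a sentence with some words missing; and
    a second processing unit, configured to input the target text into a pretrained language model (PLM) and predict, from the output of the PLM, the words missing from the target text, wherein the PLM is trained by using the method according to claim 10.
  23. A data processing apparatus, comprising:
    a memory, configured to store a program; and
    a processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 11.
  24. A computer-readable storage medium, wherein the computer-readable storage medium stores program code to be executed by a device, and when the program code is executed, the device performs the method according to any one of claims 1 to 11.
  25. A chip, comprising at least one processor and a data interface;
    wherein the at least one processor is configured to invoke and run, through the data interface, a computer program stored in a memory, so that the chip performs the method according to any one of claims 1 to 11.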
PCT/CN2021/078390 2020-04-13 2021-03-01 数据处理的方法与装置 Ceased WO2021208612A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21789181.1A EP4131020A4 (en) 2020-04-13 2021-03-01 DATA PROCESSING METHOD AND DEVICE
US17/964,165 US12608606B2 (en) 2020-04-13 2022-10-12 Data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010286915.9A CN111611790B (zh) 2020-04-13 2020-04-13 数据处理的方法与装置
CN202010286915.9 2020-04-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/964,165 Continuation US12608606B2 (en) 2020-04-13 2022-10-12 Data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2021208612A1 true WO2021208612A1 (zh) 2021-10-21

Family

ID=72203710

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078390 Ceased WO2021208612A1 (zh) 2020-04-13 2021-03-01 数据处理的方法与装置

Country Status (4)

Country Link
US (1) US12608606B2 (zh)
EP (1) EP4131020A4 (zh)
CN (1) CN111611790B (zh)
WO (1) WO2021208612A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114820398A (zh) * 2022-07-01 2022-07-29 北京汉仪创新科技股份有限公司 基于扩散模型的图片字体替换方法、系统、设备和介质
CN116049400A (zh) * 2023-01-04 2023-05-02 北京百度网讯科技有限公司 文本分类模型的训练方法、文本分类方法及其装置
CN116451708A (zh) * 2023-03-16 2023-07-18 苏州大学 基于自适应掩码策略的文本预测方法、系统及电子设备
CN119621881A (zh) * 2024-10-24 2025-03-14 中国科学院香港创新研究院人工智能与机器人创新中心 基于视频大语言模型的手术视频分析方法及相关设备
US12608606B2 (en) * 2020-04-13 2026-04-21 Huawei Technologies Co., Ltd. Data processing method and apparatus

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
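CN112328778B (zh) * 2020-11-03 2025-08-12 腾讯科技(深圳)有限公司 Method, apparatus, device, and medium for determining user features and for model training
CN114462628B (zh) * 2020-11-09 2025-05-02 华为技术有限公司 Data augmentation method and apparatus, computing device, and computer-readable storage medium
CN112580339B (zh) * 2020-12-18 2022-04-05 北京百度网讯科技有限公司 Model training method and apparatus, electronic device, and storage medium
CN112507735B (zh) * 2020-12-18 2024-07-02 北京百度网讯科技有限公司 Machine translation model training method and apparatus, and electronic device
CN113221569A (zh) * 2021-05-27 2021-08-06 中国人民解放军军事科学院国防工程研究院工程防护研究所 Method for extracting text information from damage tests
CN113591459B (zh) * 2021-08-10 2023-09-15 平安银行股份有限公司 Address standardization processing method and apparatus, electronic device, and readable storage medium
CN113779959B (zh) * 2021-08-31 2023-06-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Hybrid augmentation method for few-shot text data
CN115994530A (zh) * 2021-10-18 2023-04-21 北京京东尚科信息技术有限公司 Language model pre-training method and apparatus, electronic device, and storage medium
CN114330512B (zh) * 2021-12-13 2024-04-26 腾讯科技(深圳)有限公司 Data processing method and apparatus, electronic device, and computer-readable storage medium
CN114676234A (zh) * 2022-02-22 2022-06-28 华为技术有限公司 Model training method and related device
CN114722153B (zh) * 2022-04-15 2025-11-11 贝壳找房(北京)科技有限公司 Intent classification method and apparatus
CN115640611B (zh) * 2022-11-25 2023-05-23 荣耀终端有限公司 Method for updating a natural language processing model, and related device
CN116051859B (zh) * 2023-02-21 2023-09-08 阿里巴巴(中国)有限公司 Service providing method, device, and storage medium
CN116910549A (zh) * 2023-07-17 2023-10-20 Oppo广东移动通信有限公司 Model training method and apparatus, computer device, and storage medium
CN120430317B (zh) * 2025-07-08 2025-09-19 北京邮电大学 Unsupervised text encoding method and apparatus, and electronic device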

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
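CN110196894A (zh) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 Language model training method and prediction method
CN110377744A (zh) * 2019-07-26 2019-10-25 北京香侬慧语科技有限责任公司 Public opinion classification method and apparatus, storage medium, and electronic device
CN111611790A (zh) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and apparatus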

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543639B (zh) * 2019-09-12 2023-06-02 扬州大学 English sentence simplification algorithm based on a pre-trained Transformer language model
CN111309908B (zh) * 2020-02-12 2023-08-25 支付宝(杭州)信息技术有限公司 Text data processing method and apparatus
US11875120B2 (en) * 2021-02-22 2024-01-16 Robert Bosch Gmbh Augmenting textual data for sentence classification using weakly-supervised multi-reward reinforcement learning
US11989240B2 (en) * 2022-06-22 2024-05-21 Optum Services (Ireland) Limited Natural language processing machine learning frameworks trained using multi-task training routines
WO2024191475A1 (en) * 2023-03-14 2024-09-19 OpenAI Opco, LLC Systems and methods for language model-based text editing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137653A1 (en) * 2009-12-04 2011-06-09 At&T Intellectual Property I, L.P. System and method for restricting large language models
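CN110196894A (zh) * 2019-05-30 2019-09-03 北京百度网讯科技有限公司 Language model training method and prediction method
CN110377744A (zh) * 2019-07-26 2019-10-25 北京香侬慧语科技有限责任公司 Public opinion classification method and apparatus, storage medium, and electronic device
CN111611790A (zh) * 2020-04-13 2020-09-01 华为技术有限公司 Data processing method and apparatus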

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4131020A4

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12608606B2 (en) * 2020-04-13 2026-04-21 Huawei Technologies Co., Ltd. Data processing method and apparatus
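CN114820398A (zh) * 2022-07-01 2022-07-29 北京汉仪创新科技股份有限公司 Diffusion model-based image font replacement method, system, device, and medium
CN114820398B (zh) * 2022-07-01 2022-11-04 北京汉仪创新科技股份有限公司 Diffusion model-based image font replacement method, system, device, and medium
CN116049400A (zh) * 2023-01-04 2023-05-02 北京百度网讯科技有限公司 Text classification model training method, text classification method, and apparatus therefor
CN116451708A (zh) * 2023-03-16 2023-07-18 苏州大学 Text prediction method, system, and electronic device based on an adaptive masking strategy
CN119621881A (zh) * 2024-10-24 2025-03-14 中国科学院香港创新研究院人工智能与机器人创新中心 Surgical video analysis method based on a video large language model, and related device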

Also Published As

Publication number Publication date
US20230048031A1 (en) 2023-02-16
EP4131020A4 (en) 2023-08-23
US12608606B2 (en) 2026-04-21
CN111611790A (zh) 2020-09-01
EP4131020A1 (en) 2023-02-08
CN111611790B (zh) 2022-09-16

Similar Documents

Publication Publication Date Title
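WO2021208612A1 (zh) Data processing method and apparatus
CN111951805B (zh) Text data processing method and apparatus
CN112487182B (zh) Text processing model training method, and text processing method and apparatus
CN112288075B (zh) Data processing method and related device
WO2021159714A1 (zh) Data processing method and related device
WO2022001724A1 (zh) Data processing method and apparatus
WO2017124646A1 (zh) Artificial neural network computing apparatus and method for sparse connections
CN111160049B (zh) Text translation method and apparatus, machine translation system, and storage medium
CN110083842B (zh) Translation quality detection method and apparatus, machine translation system, and storage medium
WO2022127613A1 (zh) Translation model training method, translation method, and device
CN112784003A (zh) Method for training a sentence paraphrasing model, sentence paraphrasing method, and apparatus therefor
WO2024114659A1 (zh) Summary generation method and related device
CN118038238A (zh) Visual question answering method and apparatus, electronic device, and storage medium
JP7596559B2 (ja) Neural network with feed-forward spatial transformation unit
CN112132281B (zh) Artificial intelligence-based model training method and apparatus, server, and medium
CN118551053B (zh) Medical text classification method based on base learners and meta-learners
CN117892733A (zh) Text matching method and system based on joint multi-semantic information
WO2023202484A1 (zh) Neural network model repair method and related device
WO2023143262A1 (zh) Data processing method and related device
CN110334359B (zh) Text translation method and apparatus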
US20250363361A1 (en) Systems and methods for embedding variational generative dynamics to a machine-learning model
CN112836024B (zh) 问题生成方法、装置及计算机可读存储介质
US20250390286A1 (en) Synthetic generation of software code using language models
US20260093525A1 (en) Processor cache allocation for optimized task execution
US20260105265A1 (en) Streaming machine translations enhanced with language model predictions

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21789181

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021789181

Country of ref document: EP

Effective date: 20221027

NENP Non-entry into the national phase

Ref country code: DE