Disclosure of Invention
The main object of the invention is to provide a model construction method, apparatus, device and medium that reduce the labor cost of manually sorting and labeling corpora and improve the efficiency of building classification models.
In order to achieve the above object, the present invention provides a model construction method, including the steps of:
obtaining training corpora for constructing a model;
clustering the training corpora based on a pre-trained clustering model to obtain corresponding clustering results, wherein the clustering results comprise clustering labels and clustering corpora corresponding to the clustering labels;
and performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining a target intention recognition model according to the model training and prediction result.
Preferably, the step of clustering the training corpora based on a pre-trained clustering model to obtain corresponding clustering results includes:
shuffling the order of the training corpora and partitioning them to obtain clustering sample corpora;
and clustering the clustering sample corpora based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpora corresponding to the clustering labels.
Preferably, the step of clustering the clustering sample corpora based on the HAC algorithm to obtain clustering labels and the clustering corpora corresponding to the clustering labels includes:
classifying the clustering sample corpora and grouping corpora of the same kind into one cluster, to obtain clustering labels corresponding to the different kinds of clusters;
determining the clustering corpora corresponding to the different kinds of clustering labels based on the clustering labels corresponding to the clusters;
if the number of corpora in a cluster is greater than a preset threshold N1, using the cluster as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label;
if the number of corpora in a cluster is not greater than the preset threshold N1, using "other" as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label.
Preferably, the step of performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining the target intention recognition model according to the model training and prediction result includes:
dividing the clustering corpora in the clustering result into training corpora and prediction corpora according to the clustering labels;
performing model training based on the training corpora to obtain a trained initial classification model;
inputting the prediction corpora into the trained initial classification model for prediction to obtain prediction score values;
determining the precision, recall and F1 (PRF) values of the clustering result based on the clustering labels in the clustering result and the prediction score values;
and determining a corresponding target intention recognition model based on the PRF values.
Preferably, the step of determining a corresponding target intention recognition model based on the PRF value includes determining, based on the PRF value, whether the clustering result is reasonable:
judging whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain an adjusted clustering result;
taking the adjusted clustering result as the current clustering result, and repeating the steps of:
dividing the clustering corpora in the clustering result into training corpora and prediction corpora according to the clustering labels;
performing model training based on the training corpora to obtain a trained initial classification model;
inputting the prediction corpora into the trained initial classification model for prediction to obtain prediction score values;
determining the precision, recall and F1 (PRF) values of the clustering result based on the clustering labels in the clustering result and the prediction score values;
until the PRF value reaches the preset threshold, at which point the clustering result is reasonable and the initial classification model is output as the target intention recognition model.
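By way of illustration only, the iteration described above might be sketched as follows. The function name `build_intent_model`, the callbacks `train_and_score` and `adjust`, the default threshold of 0.9, and the `max_rounds` safety limit are all assumptions introduced for this sketch; the text itself only states that the steps repeat until the PRF value reaches a preset threshold.

```python
def build_intent_model(clusters, train_and_score, adjust,
                       prf_threshold=0.9, max_rounds=10):
    """Repeat train/predict/evaluate until the PRF value reaches the threshold.

    train_and_score(clusters) -> (model, prf_value)   # hypothetical callback
    adjust(clusters)          -> adjusted clusters    # hypothetical callback
    """
    for _ in range(max_rounds):
        model, prf = train_and_score(clusters)
        if prf >= prf_threshold:
            # Clustering result is considered reasonable: output the model.
            return model, clusters
        # Clustering result is considered unreasonable: adjust and retry.
        clusters = adjust(clusters)
    # max_rounds is a guard added for this sketch, not part of the source text.
    raise RuntimeError("PRF threshold not reached within max_rounds")
```

The PRF value is treated here as a single number for simplicity; an implementation could equally compare precision, recall and F1 separately.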
Preferably, the clustering result comprises other clustering corpora, whose clustering label is "other", and non-other clustering corpora, whose clustering label is not "other",
and the step of performing classification adjustment on the clustering result to obtain an adjusted clustering result comprises:
adjusting the other clustering corpora and the non-other clustering corpora to obtain the clustering corpora corresponding to the adjusted clustering labels;
calculating the confusion degree of the clustering corpora corresponding to the adjusted clustering labels;
and when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpora from before and after the adjustment to serve as the non-other clustering corpora of the current clustering result, and taking the remaining corpora, labeled "other", as the other clustering corpora.
Preferably, the step of adjusting the other clustering corpora and the non-other clustering corpora to obtain the clustering corpora corresponding to the adjusted clustering labels includes:
acquiring the prediction score values of the non-other clustering corpora;
if the prediction score value of a non-other clustering corpus is lower than a preset threshold T1, changing the clustering label of that corpus to "other";
and when the number of other clustering corpora, whose clustering label is "other", exceeds a preset threshold N2, acquiring the adjusted other clustering corpora and the adjusted non-other clustering corpora.
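A minimal sketch of the relabeling step above is given below, for illustration only. The function name `relabel_low_confidence` and the default values of `t1` and `n2` are hypothetical; the text does not give concrete values for T1 or N2. The sketch reports whether the N2 condition is met and leaves the follow-up action to the caller.

```python
OTHER = "other"


def relabel_low_confidence(labels, scores, t1=0.5, n2=100):
    """Move low-scoring non-other corpora to the "other" label.

    labels: clustering label of each corpus
    scores: prediction score value of each corpus (same order as labels)
    Returns the adjusted labels and whether the "other" count exceeds n2.
    """
    adjusted = [
        OTHER if label != OTHER and score < t1 else label
        for label, score in zip(labels, scores)
    ]
    other_count = adjusted.count(OTHER)
    return adjusted, other_count > n2
```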
Preferably, the step of obtaining training corpora for constructing a model includes:
acquiring original corpora from a service end;
preprocessing the original corpora to obtain training corpora for model construction;
wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing salutations and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
Further, to achieve the above object, the present invention also provides a model building apparatus including:
the acquisition module is used for acquiring training corpora for constructing a model;
the clustering module is used for clustering the training corpora based on a pre-trained clustering model to obtain corresponding clustering results, wherein the clustering results comprise clustering labels and clustering corpora corresponding to the clustering labels;
and the determining module is used for carrying out model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining a target intention recognition model according to the model training and prediction result.
Preferably, the acquisition module is further configured to:
acquire original corpora from a service end;
preprocess the original corpora to obtain training corpora for model construction;
wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing salutations and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
Preferably, the clustering module is further configured to:
shuffle the order of the training corpora and partition them to obtain clustering sample corpora;
and cluster the clustering sample corpora based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpora corresponding to the clustering labels.
Preferably, the clustering module is further configured to:
classify the clustering sample corpora and group corpora of the same kind into one cluster, to obtain clustering labels corresponding to the different kinds of clusters;
determine the clustering corpora corresponding to the different kinds of clustering labels based on the clustering labels corresponding to the clusters;
if the number of corpora in a cluster is greater than a preset threshold N1, use the cluster as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label;
if the number of corpora in a cluster is not greater than the preset threshold N1, use "other" as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label.
Preferably, the determining module is further configured to:
divide the clustering corpora in the clustering result into training corpora and prediction corpora according to the clustering labels;
perform model training based on the training corpora to obtain a trained initial classification model;
input the prediction corpora into the trained initial classification model for prediction to obtain prediction score values;
determine the precision, recall and F1 (PRF) values of the clustering result based on the clustering labels in the clustering result and the prediction score values;
and determine a corresponding target intention recognition model based on the PRF values.
Preferably, the determining module is further configured to:
judge whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain an adjusted clustering result;
take the adjusted clustering result as the current clustering result, and repeat the steps of:
dividing the clustering corpora in the adjusted clustering result into training corpora and prediction corpora according to the clustering labels;
performing model training based on the training corpora to obtain a trained initial classification model;
inputting the prediction corpora into the trained initial classification model for prediction to obtain prediction score values;
determining the precision, recall and F1 (PRF) values of the clustering result based on the clustering labels in the clustering result and the prediction score values;
until the PRF value reaches the preset threshold, at which point the clustering result is reasonable and the initial classification model is output as the target intention recognition model.
Preferably, the determining module is further configured to:
adjust the other clustering corpora and the non-other clustering corpora to obtain the clustering corpora corresponding to the adjusted clustering labels;
calculate the confusion degree of the clustering corpora corresponding to the adjusted clustering labels;
and when the confusion degree is greater than a preset threshold T2, merge the non-other clustering corpora from before and after the adjustment to serve as the non-other clustering corpora of the current clustering result, and take the remaining corpora, labeled "other", as the other clustering corpora.
Preferably, the determining module is further configured to:
acquire the prediction score values of the non-other clustering corpora;
if the prediction score value of a non-other clustering corpus is lower than a preset threshold T1, change the clustering label of that corpus to "other";
and when the number of other clustering corpora, whose clustering label is "other", exceeds a preset threshold N2, acquire the adjusted other clustering corpora and the adjusted non-other clustering corpora.
Further, to achieve the above object, the present invention also provides a model building device including: a memory, a processor, and a model building program stored on the memory and executable on the processor, the model building program, when executed by the processor, implementing the steps of the model building method described above.
Further, to achieve the above object, the present invention also provides a medium which is a computer-readable storage medium having stored thereon a model construction program which, when executed by a processor, implements the steps of the model construction method as described above.
The model building method, the device, the equipment and the medium provided by the invention are characterized in that training corpora for building the model are obtained; clustering the training corpora based on a pre-trained clustering model to obtain corresponding clustering results, wherein the clustering results comprise clustering labels and clustering corpora corresponding to the clustering labels; and performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining a target intention recognition model according to the model training and prediction result.
Clustering the training corpora used to build the intention recognition model yields a clustering result corresponding to the training corpora, comprising the clustering labels into which the corpora are classified and the clustering corpora corresponding to each label. Model training and prediction are performed on this clustering result to obtain the PRF values corresponding to the clustering result, and the target intention recognition model is determined according to the PRF values. Generating the target intention recognition model automatically in this way reduces the time spent on understanding service points and labeling data, improves the efficiency of sorting out service points and labeling service corpora, and reduces labor cost.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic device structure diagram of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the invention can be a mobile terminal or a server device.
As shown in fig. 1, the apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. The communication bus 1002 is used to implement connection communication among these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration of the apparatus shown in fig. 1 is not intended to be limiting of the apparatus and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a model building program.
The operating system is a program that manages and controls the hardware and software resources of the model building device, and supports the operation of the network communication module, the user interface module, the model building program, and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the model building apparatus shown in fig. 1, the model building apparatus calls a model building program stored in a memory 1005 by a processor 1001 and performs operations in various embodiments of the model building method described below.
Based on the hardware structure, the embodiment of the model construction method is provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the model building method of the present invention, and the method includes:
step S10, obtaining training corpora for constructing a model;
original corpora are acquired from a service end and preprocessed to obtain training corpora for model construction; the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing salutations and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
In a specific embodiment, conversation corpora from a service scenario are collected for model construction: a large number of user chat records from an intelligent customer service system are used as the original corpora for training the target model. The original corpora are normalized by one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing salutations and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks. Applying this normalization to all of the original corpora from the client reduces noise in the original corpora, improves their purity, and yields training corpora that can be used for model construction.
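For illustration, a minimal sketch of such a normalization pass is given below. The stop-word and salutation lists, the punctuation map, and the emoji ranges are hypothetical placeholders, and whitespace tokenization is an assumption that would need to be replaced by a word segmenter for languages such as Chinese. Full-width to half-width conversion is approximated here with Unicode NFKC normalization, which also folds some other compatibility characters. Filtering of meaningless questions is not covered.

```python
import re
import unicodedata

# Hypothetical lists; a real deployment would load its own.
STOP_WORDS = {"the", "a", "an", "please"}
SALUTATIONS = {"hello", "hi", "dear"}

# Punctuation variants that NFKC does not fold, mapped to one canonical form.
PUNCT_MAP = str.maketrans({"。": ".", "、": ","})

# Rough emoji/symbol ranges; not an exhaustive definition of "emoticon".
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Anything that is not a word character, whitespace, or common punctuation.
UNCOMMON_PUNCT_RE = re.compile(r"[^\w\s.,!?;:'-]")


def to_half_width(text: str) -> str:
    """Convert full-width characters to half-width via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def preprocess(utterance: str) -> str:
    text = to_half_width(utterance)
    text = text.translate(PUNCT_MAP)        # unify punctuation marks
    text = EMOJI_RE.sub("", text)           # remove emoticons
    text = UNCOMMON_PUNCT_RE.sub("", text)  # remove uncommon punctuation
    dropped = STOP_WORDS | SALUTATIONS
    tokens = [
        tok for tok in text.split()
        if tok.lower().strip(".,!?;:") not in dropped
    ]
    return " ".join(tokens)
```

Each operation is a separate line in `preprocess`, so any subset can be enabled, in keeping with the "one or more of" wording above.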
Step S20, clustering the training corpora based on a pre-trained clustering model to obtain corresponding clustering results, wherein the clustering results comprise clustering labels and clustering corpora corresponding to the clustering labels;
The principle of clustering the training corpora to obtain the corresponding clustering result is as follows: the clustering process can cluster and merge adjacent similar classified regions using morphological operators, and after clustering and merging, clusters of different categories are generated. Each cluster produced is a set of data objects that are similar to the other objects in the same cluster and dissimilar to objects in other clusters.
Most existing clustering algorithms can handle noisy data, although some are very sensitive to it, and they yield correspondingly accurate clustering results once the data have been clustered.
In a specific embodiment, the corpus data collected from the service end are preprocessed, and the preprocessed corpora are input into a pre-trained clustering model. The order of the preprocessed corpora is shuffled, and part of the training corpora is extracted for clustering. The clustering algorithm may be the hierarchical agglomerative clustering (HAC) algorithm, which produces clusters of different categories, each containing its corresponding corpora, so that a clustering result comprising clustering labels and the corresponding clustering corpora is obtained. In the clustering result, if the number of corpora in a cluster is greater than a preset threshold N1 (for example, 50), the cluster id is used as the clustering label; the remaining corpora use "other" as the clustering label.
And step S30, performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining a target intention recognition model according to the model training and prediction result.
In the prior art, a data set for model training is obtained by manual sorting and labeling, and an initial pre-trained model is then trained on that data set by deep learning to obtain the target classification model. In the model building method of this embodiment, the training corpora of the service end are obtained directly and clustered to obtain a clustering result comprising clustering labels and the corresponding clustering corpora; training and prediction are performed on the clustering result, and the target intention recognition model is determined according to the model training and prediction results.
In a specific embodiment, the clustering result, comprising the clustering labels and the corresponding clustering corpora, is divided evenly into 5 parts per clustering label. For example, if 200 corpora have the clustering label (cluster id) label1, each part contains 40 corpora labeled label1 after the division. Each time, 4 parts are used to train the pre-trained classification model and the remaining 1 part is predicted, so that prediction score values are obtained for all 5 parts. From the clustering labels and the corresponding prediction score values, the PRF values of the clustering result, namely Precision, Recall and F1, are obtained. Whether the clustering result is reasonable is evaluated according to the PRF values; if it is reasonable, the classification model trained on the training data is output as the corresponding target intention recognition model.
In this embodiment, the preprocessed training corpora are clustered to obtain clustering labels for the different feature categories and the clustering corpora corresponding to those labels; model training and prediction are performed on this clustering result, and the target intention recognition model is determined according to the model training and prediction results. Building the intention recognition model automatically greatly reduces the labor cost of manual labeling and relieves the workload of operators. In addition, the automatic model construction method helps operators understand the business and discover and sort out the different business types involved, which improves the efficiency of building the intention recognition model.
Further, a second embodiment of the model construction method of the present invention is proposed based on the first embodiment of the model construction method of the present invention.
The second embodiment differs from the first embodiment in that it refines step S20, clustering the training corpora based on the pre-trained clustering model to obtain the corresponding clustering result. Referring to fig. 4, the step specifically includes:
step S21, shuffling the order of the training corpora and partitioning them to obtain clustering sample corpora;
In a specific embodiment, the preprocessed corpora are input into the pre-trained clustering model, their order is shuffled, and part of the corpora is extracted for clustering. Specifically, part of the training corpora is clustered first, and the remaining training corpora are then added gradually to obtain the clustering result corresponding to the training corpora. Compared with clustering all the training corpora directly, extracting part of the training corpora for clustering gives a better final classification effect.
In an embodiment, the preprocessed corpora are shuffled and part of them is extracted as the clustering sample corpora; for example, 20% of the total corpora may be extracted as the clustering sample corpora for clustering.
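The shuffle-and-extract step can be sketched as follows, purely as an illustration. The function name `split_cluster_sample` and the fixed random seed are assumptions; the 20% ratio is the example given above.

```python
import random


def split_cluster_sample(corpus, sample_ratio=0.2, seed=0):
    """Shuffle the corpus and split it into (clustering sample, remainder)."""
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    shuffled = list(corpus)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * sample_ratio)
    return shuffled[:cut], shuffled[cut:]
```

The remainder is returned rather than discarded so that it can be added back gradually after the sample has been clustered.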
Step S22, clustering the clustering sample corpora based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpora corresponding to the clustering labels.
There are many approaches to clustering, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, the transitive closure method, the Boolean matrix method, the direct clustering method, correlation analysis clustering, statistics-based clustering, and so on. Within each approach there are various clustering algorithms, each of which can produce a corresponding clustering result.
In a specific embodiment, the clustering sample corpora are clustered based on the HAC algorithm to obtain the different clusters corresponding to each category. Each cluster contains its corresponding corpora; the clusters are referred to as clustering labels, and the corpora in each cluster serve as the clustering corpora corresponding to that label. A clustering result comprising the clustering labels and the corresponding clustering corpora is thereby obtained.
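As a rough illustration of agglomerative clustering on short utterances, a naive pure-Python sketch follows. The text does not specify a text representation, distance measure, linkage criterion, or stopping rule, so the choices here (whitespace token sets, Jaccard distance, average linkage, a distance threshold of 0.7) are assumptions made only for this example. The naive implementation is roughly cubic in the number of utterances and is not suited to large corpora.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def hac(utterances, distance_threshold=0.7):
    """Average-linkage agglomerative clustering; returns a cluster id per utterance."""
    token_sets = [set(u.lower().split()) for u in utterances]
    clusters = [[i] for i in range(len(utterances))]  # start with singletons

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(
            jaccard_distance(token_sets[i], token_sets[j])
            for i in c1 for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > distance_threshold:
            break  # closest pair is too far apart: stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j

    labels = [None] * len(utterances)
    for cluster_id, members in enumerate(clusters):
        for idx in members:
            labels[idx] = cluster_id
    return labels
```

The returned integers play the role of the cluster ids that the text refers to as clustering labels.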
Referring to fig. 5, step S22 specifically includes:
step A1, classifying the clustering sample corpora and grouping corpora of the same kind into one cluster, to obtain clustering labels corresponding to the different kinds of clusters;
In a specific embodiment, the clustering sample corpora are classified to obtain clustering labels for the different categories, and the clustering sample corpora are grouped according to those labels to obtain the clustering labels and the corresponding clustering corpora.
Step A2, determining the clustering corpora corresponding to the different kinds of clustering labels based on the clustering labels corresponding to the clusters;
In a specific embodiment, the clustering sample corpora are classified based on the HAC algorithm to obtain the different clusters corresponding to each category. Each cluster contains its corresponding corpora; the clusters are referred to as clustering labels, and the corpora in each cluster serve as the clustering corpora corresponding to that label.
Step A3, if the number of corpora in a cluster is greater than a preset threshold N1, using the cluster as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label;
step A4, if the number of corpora in a cluster is not greater than the preset threshold N1, using "other" as the clustering label of the corpora in that cluster, the corpora in the cluster being the clustering corpora corresponding to that label.
In a specific embodiment, the clustering sample corpora may be divided into five categories, label1, label2, label3, label4 and label5, each with its corresponding clustering corpora. If the clusters whose corpus count is greater than the preset threshold of 50 are label1, label2, label3 and label4, then the corpora in those clusters keep label1, label2, label3 and label4 as their clustering labels, and the corpora of the remaining cluster, label5, use the "other" label.
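Steps A3 and A4 amount to a size filter on the clusters, which might be sketched as follows. The function name `apply_size_threshold` is hypothetical; the default of 50 is the example value of N1 given above.

```python
from collections import Counter

OTHER = "other"


def apply_size_threshold(cluster_ids, n1=50):
    """Keep the cluster id as label for clusters larger than n1, else use "other"."""
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else OTHER for cid in cluster_ids]
```

Note that a cluster with exactly N1 corpora falls under "not greater than N1" and is therefore relabeled "other".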
In this embodiment, the training corpora are shuffled and partitioned to obtain the clustering sample corpora used for clustering, while the remaining non-sample corpora are retained. The clustering sample corpora are clustered based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering result comprising clustering labels and the clustering corpora corresponding to the clustering labels.
Further, a third embodiment of the model construction method of the present invention is proposed based on the first and second embodiments of the model construction method of the present invention.
The third embodiment differs from the first and second embodiments in that it refines step S30, performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining the target intention recognition model according to the model training and prediction results. Referring to fig. 6, the step specifically includes:
step S31, dividing the clustering corpora in the clustering result into training corpora and prediction corpora according to the clustering labels;
The clustering corpora in the clustering result are divided evenly into n parts per clustering label. Each time, m of the n parts are used to train the initial model to obtain a corresponding classification model, and the remaining parts are predicted, so that prediction score values are obtained for all clustering corpora.
In a specific embodiment, the clustering corpora in the clustering result are divided evenly into 5 parts per clustering label, for the labels label1 and label2. If 200 clustering corpora have the label label1, each part contains 40 corpora labeled label1; 4 parts are used for training each time and the remaining part is predicted, so that prediction score values are obtained for all clustering corpora labeled label1. Likewise, if 400 clustering corpora have the label label2, each part contains 80 corpora labeled label2; 4 parts are used for training each time and the remaining part is predicted, so that prediction score values are obtained for all clustering corpora labeled label2.
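The per-label division described above resembles stratified k-fold cross-validation. A minimal sketch under that reading is shown below; the function names are hypothetical, and round-robin assignment is just one simple way to spread each label evenly over the parts.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Return n_folds lists of sample indices, each label spread evenly."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # round-robin within each label
    return folds


def train_predict_splits(labels, n_folds=5):
    """Yield (train_indices, held_out_indices), holding out each fold once."""
    folds = stratified_folds(labels, n_folds)
    for k in range(n_folds):
        train = [idx for f, fold in enumerate(folds) if f != k for idx in fold]
        yield train, folds[k]
```

With 200 corpora labeled label1 and 400 labeled label2, each of the 5 parts contains 40 and 80 of them respectively, and every corpus is predicted exactly once across the 5 rounds.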
Step S32, performing model training based on the training corpus to obtain a trained initial classification model;
step S33, inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
in a specific embodiment, the training corpora in the divided clustering corpora are used to train a pre-trained classification model to obtain a corresponding classification model; the prediction corpora in the divided clustering corpora are input into the classification model to obtain corresponding classification results, and the classification results are evaluated to obtain the prediction score values of the classification results corresponding to the classification model.
Step S34, determining the precision, recall and F1 value (PRF value) of the clustering result based on the clustering label in the clustering result and the prediction score value;
The Precision, Recall and F1 value corresponding to the clustering result, namely the PRF value, are obtained from the clustering labels in the clustering result and the prediction score values corresponding to the classification results of the classification model. The PRF value is used to evaluate whether the classification model is reasonable, which in turn indicates whether the training data of the classification model is reasonable, and therefore whether the clustering result is reasonable.
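As an illustration of the PRF value, the sketch below computes precision, recall and F1 for each label using their standard definitions. It assumes the clustering labels serve as reference labels and the classification model's outputs as predicted labels; the function name `prf_per_label` is hypothetical, and how the per-label values are aggregated into a single PRF value is not specified here.

```python
def prf_per_label(cluster_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} from two parallel label lists."""
    pairs = list(zip(cluster_labels, predicted_labels))
    result = {}
    for label in set(cluster_labels) | set(predicted_labels):
        tp = sum(1 for c, p in pairs if c == label and p == label)
        fp = sum(1 for c, p in pairs if c != label and p == label)
        fn = sum(1 for c, p in pairs if c == label and p != label)
        # Guard against empty denominators for labels never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        result[label] = (precision, recall, f1)
    return result


scores = prf_per_label(
    ["label1", "label1", "label2", "label2"],
    ["label1", "label2", "label2", "label2"],
)
```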
Step S35, determining a corresponding target intention recognition model based on the PRF value.
In a specific embodiment, the target intention recognition model to be output is determined according to the PRF value of the clustering result. Specifically, the determination rule for the PRF value is as follows:
if the PRF value reaches a preset threshold, the clustering result is reasonable, and the pre-trained model is trained with all corpora in the clustering result to obtain the corresponding classification model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels of the clustering result and the clustering corpora corresponding to those labels need to be classified and adjusted. Model training and prediction are then performed again on the adjusted clustering labels and their corresponding clustering corpora, and this is repeated until the PRF value corresponding to the clustering result reaches the preset threshold, at which point the corresponding classification model is output.
In this embodiment, training and prediction are performed on the clustering result obtained by the clustering process, yielding the classification output of the corresponding classification model and the prediction score values corresponding to the clustering result. The PRF value of the clustering result is obtained from the classification output and the prediction score values, whether the clustering result is reasonable is determined according to whether the PRF value reaches a preset threshold, and the corresponding target intention recognition model is finally determined. This improves the accuracy of automatically creating the target intention recognition model and provides a fault-tolerant mechanism: the classification results of the classification model are checked by prediction, which improves their accuracy.
Further, a fourth embodiment of the model construction method of the present invention is proposed based on the first, second, and third embodiments of the model construction method of the present invention.
The fourth embodiment of the model construction method differs from the first, second, and third embodiments in that it refines step S35, in which the corresponding target intention recognition model is determined based on the PRF value. Referring to fig. 7, the step specifically includes:
step B1, judging whether the PRF value reaches a preset threshold value;
step B2, if the PRF value reaches a preset threshold value, the clustering result is reasonable, and the initial classification model is output as a target intention identification model;
in a specific embodiment, if the PRF value of the classification results output by the classification model reaches the preset threshold, it is determined whether all clustering corpora have been used in training the pre-trained model. If all of them have been used, the classification model is directly output as the target intention recognition model. If not, the unused corpora are added and the classification model is trained further until all corpora have been used in training, and the classification model is then output as the target intention recognition model.
Step B3, if the PRF value does not reach a preset threshold value, the clustering result is unreasonable, and the clustering result is classified and adjusted to obtain a classified and adjusted clustering result;
referring to fig. 8, step B3 specifically includes:
the clustering result comprises other clustering corpora, whose clustering label is other, and non-other clustering corpora, whose clustering label is not other, and the step of classifying and adjusting the clustering result to obtain the classified and adjusted clustering result comprises the following steps:
step b1, adjusting the other clustering corpora and the non-other clustering corpora to obtain the clustering corpora corresponding to the adjusted clustering labels;
step b2, calculating the confusion degree of the clustering corpora corresponding to the adjusted clustering labels;
step b3, when the confusion degree is greater than the preset threshold T2, merging the non-other clustering corpora before and after the classification adjustment as the non-other clustering corpora of the current clustering result, and taking the remaining corpora as the other clustering corpora with the clustering label other.
In a specific embodiment, the step of performing classification adjustment on the clustering result includes three steps:
First, the untrusted part of the non-other clustering corpora is relabeled as other. The clustering labels and prediction score values of all clustering corpora have been obtained through the preceding steps. Non-other clustering corpora whose prediction score value is lower than a certain threshold (for example, 0.3) and whose predicted label is inconsistent with their clustering label are moved into the clustering corpora labeled other. Changing labels in this way combines the advantages of classification and clustering by placing the untrusted part of the non-other clustering corpora into other.
Then, it is judged whether the number of corpora with the clustering label other exceeds a threshold of 10% of the overall training corpus. If so, the clustering algorithm is executed again to obtain a new clustering result; each cluster in the new clustering result whose number of corpora exceeds a preset threshold N2 is marked as a new clustering corpus and added to the training corpus.
Finally, the new clustering corpora and the clustering corpora before the classification adjustment are classified and merged. Merging requires calculating the confusion degree of the clustering corpora before and after the classification adjustment, and the formula for calculating the confusion degree is as follows:
wherein $N_{cate_i, cate_j}$ represents the number of corpora whose actual intent is $cate_i$ but which are misclassified as $cate_j$; $N_{cate_i}$ represents the actual number of corpora of $cate_i$; and the remaining term represents the number of corpora predicted as $cate_i$. Two clustering corpora whose confusion degree is greater than the threshold T2 (for example, 0.25) and whose clustering labels are both non-other are merged to obtain the clustering result after the classification adjustment.
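The relabeling step and the quantities entering the confusion degree can be sketched as below. This is an illustration under stated assumptions: each sample is a dict with the hypothetical keys `cluster_label`, `predicted_label` and `score`; the relabeling condition is read as requiring both a score below 0.3 and a predicted label that disagrees with the clustering label; and the function names are invented for this sketch. The code only tallies the three counts defined above and does not combine them into a confusion degree.

```python
from collections import Counter


def relabel_untrusted(samples, score_threshold=0.3):
    """Move untrusted non-other samples into the 'other' cluster."""
    adjusted = []
    for s in samples:
        untrusted = (
            s["cluster_label"] != "other"
            and s["score"] < score_threshold
            and s["predicted_label"] != s["cluster_label"]
        )
        new_label = "other" if untrusted else s["cluster_label"]
        adjusted.append({**s, "cluster_label": new_label})
    return adjusted


def other_exceeds_ratio(samples, ratio=0.10):
    """Check whether 'other' samples exceed the given share of all samples."""
    n_other = sum(1 for s in samples if s["cluster_label"] == "other")
    return n_other > ratio * len(samples)


def confusion_counts(actual, predicted):
    """Tally misclassified pairs, actual counts and predicted counts per category."""
    pair_counts = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return pair_counts, Counter(actual), Counter(predicted)


samples = [
    {"cluster_label": "label1", "predicted_label": "label2", "score": 0.2},
    {"cluster_label": "label1", "predicted_label": "label1", "score": 0.2},
    {"cluster_label": "label1", "predicted_label": "label2", "score": 0.9},
]
adjusted = relabel_untrusted(samples)
```

Only the first sample is moved to other here, since it is the only one that is both low-scoring and inconsistent with its clustering label.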
Step B4, taking the clustering result after the classification adjustment as the current clustering result;
obtaining a current clustering result, and executing the following steps:
dividing the clustering corpora in the clustering result into training corpora and prediction corpora according to the clustering label;
performing model training based on the training corpora to obtain a trained initial classification model;
inputting the prediction corpora into the trained initial classification model for prediction to obtain a prediction score value;
determining the precision, recall and F1 value (PRF value) of the clustering result based on the clustering label and the prediction score value in the clustering result;
in a specific embodiment, if the PRF value of the clustering result does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels of the clustering result and the clustering corpora corresponding to those labels need to be classified and adjusted. After the classified and adjusted clustering result is obtained, it is taken as the current clustering result, and the steps of performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determining the target intention recognition model according to the model training and prediction result, are repeated.
Further, the clustering corpora in the clustering result are divided into training corpora and prediction corpora according to the clustering label; the classification result of the classification model is obtained from the training corpora, and the prediction score value of the clustering result is obtained from the prediction corpora; the PRF value corresponding to the clustering result is determined based on the classification result and the prediction score value, and the corresponding target intention recognition model is then determined according to the PRF value.
Step B5, repeating the above steps until the PRF value reaches the preset threshold, at which point the clustering result is reasonable, and outputting the initial classification model as the target intention recognition model.
And if the PRF value reaches a preset threshold value and the clustering result is reasonable, outputting a classification model obtained by training the clustering result as a target intention recognition model.
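The loop formed by steps B1 to B5 can be sketched as follows. This is a schematic outline under stated assumptions, not the actual implementation: `train_and_score` and `adjust` are hypothetical callables standing in for the training/prediction steps and the classification adjustment, the PRF value is assumed to be a single number, and the `max_rounds` guard is an addition of this sketch, since the text itself repeats until the threshold is reached.

```python
def build_intent_model(clustering_result, train_and_score, adjust, threshold, max_rounds=10):
    """Repeat training and adjustment until the PRF value reaches the threshold.

    train_and_score(result) -> (model, prf_value)
    adjust(result) -> classified and adjusted clustering result
    """
    current = clustering_result
    for _ in range(max_rounds):
        model, prf_value = train_and_score(current)
        if prf_value >= threshold:
            # The clustering result is reasonable: output the model.
            return model, current
        # The clustering result is unreasonable: adjust it and try again.
        current = adjust(current)
    raise RuntimeError("PRF value did not reach the threshold within max_rounds")


# Toy stand-ins: the "clustering result" is a round counter and the PRF value
# rises by 0.2 with each adjustment.
model, final_result = build_intent_model(
    clustering_result=0,
    train_and_score=lambda r: (f"model-{r}", 0.5 + 0.2 * r),
    adjust=lambda r: r + 1,
    threshold=0.8,
)
```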
In this embodiment, the PRF value of the classification results is used to judge whether the classification model is reasonable, which in turn indicates whether the training data of the classification model is reasonable and whether the clustering result containing that training data is reasonable. If so, the classification model is directly output as the target intention recognition model. If not, the clustering result is classified and adjusted to obtain corresponding model training data, from which the target intention recognition model is then obtained. This improves both the accuracy and the efficiency of automatically creating the target intention recognition model.
The invention also provides a model construction device. Referring to fig. 9, the model construction device of the present invention includes:
an obtaining module 10, configured to obtain a training corpus for constructing a model;
the clustering module 20 is configured to perform clustering processing on the training corpora based on a pre-trained clustering model to obtain corresponding clustering results, where the clustering results include clustering labels and clustering corpora corresponding to the clustering labels;
and the determining module 30 is configured to perform model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpora, and determine a target intention recognition model according to the model training and prediction result.
Furthermore, the present invention also provides a medium, preferably a computer-readable storage medium, having stored thereon a model construction program which, when executed by a processor, implements the steps of the model construction method as described above.
In the embodiments of the model building apparatus and medium of the present invention, all technical features of the embodiments of the model building method are included, and the descriptions and explanations are basically the same as those of the embodiments of the model building method, and are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.