CN114817455B - Model building methods, devices, equipment and media

Info

Publication number: CN114817455B
Application number: CN202210229151.9A
Authority: CN
Inventors: 赵高枫; 文俊杰; 李金龙
Original assignee: China Merchants Bank Co Ltd
Current assignee: China Merchants Bank Co Ltd
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2026-04-07
Anticipated expiration: 2042-03-08
Also published as: CN114817455A

Abstract

The invention relates to the technical field of artificial intelligence and discloses a model construction method, device, equipment and medium. The method comprises: obtaining a training corpus for constructing a model; clustering the training corpus with a pre-trained clustering model to obtain a clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label; performing model training and prediction based on the clustering labels and the corresponding clustering corpus; and determining a target intention recognition model according to the model training and prediction results. Because the target intention recognition model is generated automatically, less time is spent becoming familiar with business points and labeling data, business points are sorted out and business corpus is labeled faster, the target intention recognition model is built more efficiently, and labor cost is reduced.

Description

Model construction method, device, equipment and medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for model construction.

Background

With the continuous development of artificial intelligence technology, dialogue systems such as unmanned customer service systems are used more and more widely. Intention recognition is an important component of such systems. A common approach is to recognize the user's intention through text classification: user intentions are divided into several categories, and a corresponding response scheme is matched to each category.

When a dialogue system is built, text classification is usually the simplest and most effective means of building a user intention recognition model. Existing text classification models for intention recognition are built mainly around the business that the dialogue system has to handle. During construction, an operator usually has to understand the business points the dialogue system must cover, sort out the relevant business point information, and label a large amount of corpus provided by the business party. After labeling, the categories are adjusted with a classification adjustment tool, and if the corpus is insufficient, expanding it with a tool is considered. Only then is a text classification model for recognizing user intention obtained. In this process, sorting out business points and labeling business corpus by hand makes building the text classification model inefficient and consumes a large amount of manpower.

Disclosure of Invention

The main object of the invention is to provide a model construction method, device, equipment and medium that reduce the labor cost of manually sorting out and labeling data and improve the efficiency of building the classification model.

In order to achieve the above object, the present invention provides a model construction method comprising the steps of:

acquiring a training corpus for constructing a model;

clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and the clustering corpus corresponding to the clustering label;

and performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and determining a target intention recognition model according to the model training and prediction results.

Preferably, the step of clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result includes:

shuffling the order of the training corpus and splitting it to obtain a clustering sample corpus;

and clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and the clustering corpus corresponding to the clustering label.

Preferably, the step of clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and the clustering corpus corresponding to the clustering label includes:

classifying the clustering sample corpus and grouping samples of the same kind into one cluster, to obtain the cluster labels corresponding to the different kinds of clusters;

determining, based on the cluster label of each cluster, the intra-cluster corpus corresponding to each kind of cluster label;

if the number of intra-cluster corpus items is greater than a preset threshold N1, using the cluster itself as the clustering label of the intra-cluster corpus, the intra-cluster corpus being the clustering corpus corresponding to that clustering label;

if the number of intra-cluster corpus items is not greater than the preset threshold N1, using "other" as the clustering label of the intra-cluster corpus, the intra-cluster corpus being the clustering corpus corresponding to that clustering label.

Preferably, the step of performing model training and prediction based on the clustering labels in the clustering results and the corresponding clustering corpus, and determining the target intention recognition model according to the model training and prediction results includes:

dividing the clustering corpus in the clustering result into a training corpus and a prediction corpus according to the clustering labels;

performing model training based on the training corpus to obtain a trained initial classification model;

inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;

determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;

and determining a corresponding target intention recognition model based on the PRF value.

Preferably, the step of judging whether the clustering result is reasonable based on the PRF value includes:

judging whether the PRF value reaches a preset threshold;

if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;

if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a classification-adjusted clustering result;

taking the classification-adjusted clustering result as the current clustering result;

and repeating the above until the PRF value reaches the preset threshold and the clustering result is reasonable, and then outputting the initial classification model as the target intention recognition model.

Preferably, the clustering result comprises an "other" clustering corpus whose clustering label is "other" and a non-"other" clustering corpus whose clustering label is not "other",

and the step of performing classification adjustment on the clustering result to obtain a classification-adjusted clustering result comprises:

adjusting the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;

calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;

and when the confusion degree is greater than a preset threshold T2, merging and adjusting the non-"other" clustering corpora from before and after the classification adjustment to serve as the non-"other" clustering corpus of the current clustering result, and taking the remaining corpus as the "other" clustering corpus whose clustering label is "other".

Preferably, the step of adjusting the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels includes:

acquiring the prediction score value of the non-"other" clustering corpus;

if the prediction score value of a non-"other" clustering corpus item is lower than a preset threshold T1, changing its clustering label to "other";

and when the number of clustering corpus items whose clustering label is "other" exceeds a preset threshold N2, acquiring the adjusted "other" clustering corpus and the adjusted non-"other" clustering corpus.
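The relabeling rule above can be illustrated with a short Python sketch. The function name, the triple layout and the default values of `t1` and `n2` are hypothetical. The text does not say exactly what happens once the "other" count exceeds N2, so this sketch only reports that condition as a flag, which is an assumption.

```python
def relabel_low_scores(samples, t1=0.5, n2=100, other="other"):
    """samples: (text, label, score) triples.

    Returns the adjusted triples and a flag telling whether the number of
    "other" items now exceeds n2 (how the flag is used is an assumption).
    """
    adjusted = [
        # Demote a non-"other" item to "other" when its score is below T1.
        (text, other if label != other and score < t1 else label, score)
        for text, label, score in samples
    ]
    other_count = sum(1 for _, label, _ in adjusted if label == other)
    return adjusted, other_count > n2
```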

Preferably, the step of acquiring the training corpus for constructing the model includes:

acquiring an original corpus from a business end;

preprocessing the original corpus to obtain the training corpus used for model construction;

wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.

In addition, in order to achieve the above object, the present invention also provides a model building apparatus comprising:

an acquisition module, configured to acquire a training corpus for constructing a model;

a clustering module, configured to cluster the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and the clustering corpus corresponding to the clustering label;

and a determining module, configured to perform model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and to determine a target intention recognition model according to the model training and prediction results.

Preferably, the acquisition module is further configured to:

Acquiring an original corpus from a service end;

Preferably, the clustering module is further configured to:

Preferably, the determining module is further configured to:

judging whether the PRF value reaches a preset threshold value or not;

taking the clustering result after classification adjustment as a current clustering result, and executing the steps:

Dividing the clustered corpus in the clustered result after classification adjustment into training corpus and prediction corpus according to the clustered labels;

Preferably, the determining module is further configured to:

acquiring a predictive score value of the non-other clustering corpus;

In addition, in order to achieve the above object, the present invention also provides a model construction apparatus including a memory, a processor, and a model construction program stored on the memory and executable on the processor, the model construction program implementing the steps of the model construction method as described above when executed by the processor.

In addition, in order to achieve the above object, the present invention also provides a medium that is a computer-readable storage medium having stored thereon a model building program that, when executed by a processor, implements the steps of the model building method as described above.

With the model construction method, device, equipment and medium provided by the invention, a training corpus for constructing a model is acquired; the training corpus is clustered based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and the clustering corpus corresponding to the clustering label; model training and prediction are performed based on the clustering labels and the corresponding clustering corpus in the clustering result; and a target intention recognition model is determined according to the model training and prediction results.

The training corpus for constructing the intention recognition model is clustered to obtain a clustering result corresponding to the training corpus, the clustering result comprising the clustering labels and the clustering corpus corresponding to each label. Model training and prediction are performed on this clustering result to obtain its PRF value, and the target intention recognition model is determined according to the PRF value. Because the target intention recognition model is generated automatically, less time is spent becoming familiar with business points and labeling data, business points are sorted out and business corpus is labeled more efficiently, and labor cost is reduced.

Drawings

FIG. 1 is a schematic diagram of a device architecture of a hardware operating environment involved in an embodiment of the model building of the present invention;

FIG. 2 is a schematic flow chart of a first embodiment of the model building method of the present invention;

FIG. 3 is a schematic flow chart of a first embodiment of the model building method of the present invention;

FIG. 4 is a flow chart of a second embodiment of the model building method of the present invention;

FIG. 5 is a schematic view showing a sub-process of step S22 in a second embodiment of the model building method of the present invention;

FIG. 6 is a flow chart of a third embodiment of the model building method of the present invention;

FIG. 7 is a flow chart of a fourth embodiment of the model building method of the present invention;

FIG. 8 is a schematic flow chart of a step B3 in a fourth embodiment of the model building method of the present invention;

FIG. 9 is a schematic functional block diagram of a model building apparatus according to a first embodiment of the model building method of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic device structure of a hardware running environment according to an embodiment of the present invention.

The device of the embodiment of the invention can be a mobile terminal or a server device.

As shown in fig. 1, the device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

It will be appreciated by those skilled in the art that the device structure shown in fig. 1 does not limit the device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

As shown in fig. 1, an operating system, a network communication module, a user interface module, and a model building program may be included in the memory 1005, which is a type of computer storage medium.

The operating system is a program for managing and controlling the model building device and software resources, and supports the operation of the network communication module, the user interface module, the model building program, and other programs or software. The network communication module is used for managing and controlling the network interface 1004, and the user interface module is used for managing and controlling the user interface 1003.

In the model building apparatus shown in fig. 1, the model building apparatus calls a model building program stored in a memory 1005 by a processor 1001 and performs operations in various embodiments of the model building method described below.

Based on the hardware structure, the embodiment of the model building method is provided.

Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a model building method according to the present invention, where the method includes:

Step S10, acquiring a training corpus for constructing a model;

Specifically, an original corpus is acquired from a business end and preprocessed to obtain the training corpus used for model construction, wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.

In a specific embodiment, dialogue corpus from a business scenario is collected for model construction: a large number of user chat records from intelligent customer service are used as the original corpus for training the target model. The original corpus is given standardized preprocessing, which may be one or more of removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks. Applying this standardized preprocessing to the whole original corpus from the client reduces noise in the original corpus, improves its purity, and yields a training corpus that can be used to construct the model.
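As an illustration only, the preprocessing options listed above could be combined along the following lines in Python. The stop-word and greeting lists, the emoji ranges and the set of punctuation treated as common are all hypothetical placeholders, and removing meaningless questions is not covered because the text gives no rule for detecting them.

```python
import re
import unicodedata

# Hypothetical word lists; a real deployment would supply its own.
STOP_WORDS = {"the", "a", "an", "please"}
GREETINGS = {"hello", "hi", "dear"}

# Rough emoji ranges (an assumption, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Anything that is not a word character, whitespace or , . ? ! is
# treated here as uncommon punctuation.
UNCOMMON_PUNCT_RE = re.compile(r"[^\w\s,.?!]")


def preprocess(utterance: str) -> str:
    # NFKC normalization folds full-width characters into half-width ones.
    text = unicodedata.normalize("NFKC", utterance)
    text = EMOJI_RE.sub("", text)
    # Unify punctuation: map a few variants onto one canonical mark.
    text = text.replace(";", ",").replace("。", ".")
    text = UNCOMMON_PUNCT_RE.sub(" ", text)
    drop = STOP_WORDS | GREETINGS
    tokens = [t for t in text.lower().split() if t not in drop]
    return " ".join(tokens)
```

Whitespace tokenization is used here for brevity; Chinese chat logs would need a word segmenter before stop-word removal.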

Step S20, clustering is carried out on the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and clustering corpus corresponding to the clustering label;

The clustering process may use a morphological operator to cluster and merge nearby regions of similar classification. After clustering and merging, clusters of different categories are generated. Each cluster produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters.

Among existing clustering algorithms, most are able to handle noisy data while some are very sensitive to it; a correspondingly accurate clustering result can be obtained after the corpus data are clustered.

In a specific embodiment, the corpus data collected from the business end is preprocessed to obtain the preprocessed training corpus, which is input into a pre-trained clustering model. The order of the preprocessed training corpus is shuffled and part of it is extracted for clustering, where the clustering algorithm may be the hierarchical agglomerative clustering (HAC) algorithm. This yields clusters of different categories, each with its own corpus, and hence a clustering result comprising clustering labels and the clustering corpus corresponding to each label. In the clustering result, if the number of corpus items in a cluster is greater than a preset threshold N1 (for example, 50), the cluster id is used as the clustering label; otherwise "other" is used as the clustering label.

Step S30, performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and determining a target intention recognition model according to the model training and prediction results.

In the prior art, data are manually divided and labeled to obtain a data set on which a model can be trained, and an initial pre-trained model is then trained by deep learning on that data set to obtain the target classification model. In the model construction method of the present invention, the training corpus of the business end is obtained directly and clustered to obtain a clustering result comprising clustering labels and the clustering corpus corresponding to each label; training and prediction are performed on the clustering result; and the target intention recognition model is determined according to the model training and prediction results.

In a specific embodiment, the above clustering result, comprising the clustering labels and the clustering corpus corresponding to each label, is divided into 5 parts according to the clustering labels. For example, if there are 200 corpus items whose clustering label (cluster id) is label1, each part contains 40 items labeled label1 after the division. Each time, 4 parts are used to train the pre-trained classification model and the remaining part is predicted, so that prediction score values are obtained for all 5 parts. From the clustering labels and the corresponding prediction score values, the PRF value of the clustering result, namely precision, recall and F1 value, can be obtained. The PRF value is used to evaluate whether the clustering result is reasonable; if it is, the classification model trained on the training data is output as the corresponding target intention recognition model.

In this embodiment, the preprocessed training corpus is clustered to obtain clustering labels of different feature categories and the clustering corpus corresponding to each label; these are used as the clustering result for model training and prediction, and the target intention recognition model is determined according to the model training and prediction results. This automatic construction method greatly reduces the labor cost of manual labeling and relieves the workload of operators. It can also help operators understand the business and discover the different business types involved when sorting out business points, which improves the efficiency of constructing the intention recognition model.

Further, based on the first embodiment of the model building method of the present invention, a second embodiment of the model building method of the present invention is proposed.

The second embodiment differs from the first embodiment in that it refines step S20, clustering the training corpus based on the pre-trained clustering model to obtain the corresponding clustering result. Referring to fig. 4, the step specifically includes:

Step S21, shuffling the order of the training corpus and splitting it to obtain a clustering sample corpus;

In a specific embodiment, the preprocessed training corpus is input into a pre-trained clustering model, its order is shuffled, and part of it is extracted for clustering. Specifically, the extracted part is clustered first and the rest of the training corpus is then added gradually, yielding the clustering result for the whole training corpus. Compared with clustering all of the training corpus directly, extracting part of it for clustering gives a better final classification effect.

In a specific embodiment, the preprocessed training corpus is shuffled and part of it is extracted as the clustering sample corpus; for example, with reference to the total corpus size, 20% of the total corpus may be extracted as the clustering sample corpus for clustering.
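A minimal Python sketch of the shuffle-and-split step, assuming the corpus is a list of utterances. The function name and the fixed seed are illustrative choices; the 20% fraction follows the example above.

```python
import random


def sample_for_clustering(corpus, fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and split off a clustering sample.

    Returns (clustering sample, remaining corpus).
    """
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]
```

The remaining corpus is returned as well, since the text describes adding it to the clusters gradually afterwards.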

Step S22, clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and the clustering corpus corresponding to the clustering label.

There are various clustering methods, including partitioning, hierarchical, density-based, grid-based, model-based, transitive closure, Boolean matrix, direct clustering, correlation analysis, and statistics-based clustering. Each method has a variety of clustering algorithms, from which the corresponding clustering results can be obtained.

In a specific embodiment, the clustering sample corpus is clustered based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clusters of different categories, each with its own corpus. Each cluster is referred to as a clustering label, and the corpus belonging to each cluster is the clustering corpus corresponding to that label. This yields a clustering result comprising the clustering labels and the clustering corpus corresponding to each label.
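To make the idea of hierarchical agglomerative clustering concrete, here is a small pure-Python sketch. The text does not specify the text representation, distance measure, linkage or stopping rule, so token-set Jaccard distance, average linkage and the `max_distance` cut-off are all assumptions chosen for brevity. A production system would more likely use sentence embeddings and a library implementation.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 minus the token-set overlap of two utterances."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def hac(texts, max_distance=0.7):
    """Average-linkage agglomerative clustering; returns one cluster id per text."""
    clusters = [[i] for i in range(len(texts))]  # start with singletons

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(jaccard_distance(texts[i], texts[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > max_distance:
            break  # remaining clusters are too dissimilar to merge
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(texts)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels
```

This naive version recomputes all pairwise linkages on every merge, so it is only suitable for small samples.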

Referring to fig. 5, step S22 specifically includes:

Step A1, classifying the clustering sample corpus and grouping samples of the same kind into one cluster, to obtain the cluster labels corresponding to the different kinds of clusters;

In a specific embodiment, the clustering sample corpus is classified to obtain clustering labels of different categories, and the samples are grouped according to these labels to obtain the clustering labels and the clustering corpus corresponding to each label.

Step A2, determining, based on the cluster label of each cluster, the intra-cluster corpus corresponding to each kind of cluster label;

In a specific embodiment, the clustering sample corpus is classified based on the hierarchical agglomerative clustering (HAC) algorithm to obtain the clusters corresponding to each category, each with its own corpus. Each cluster is referred to as a clustering label, and the corpus belonging to each cluster is the clustering corpus corresponding to that label.

Step A3, if the number of intra-cluster corpus items is greater than a preset threshold N1, using the cluster itself as the clustering label of the intra-cluster corpus, the intra-cluster corpus being the clustering corpus corresponding to that clustering label;

Step A4, if the number of intra-cluster corpus items is not greater than the preset threshold N1, using "other" as the clustering label of the intra-cluster corpus, the intra-cluster corpus being the clustering corpus corresponding to that clustering label.

In a specific embodiment, the clustering sample corpus may be divided into five categories, label1, label2, label3, label4 and label5, each with its corresponding clustering corpus. If the clusters whose number of intra-cluster corpus items is greater than the preset threshold of 50 are label1, label2, label3 and label4, the corpora of label1, label2, label3 and label4 take their cluster ids as clustering labels, while the remaining corpus, that of label5, takes "other" as its clustering label; in each case the intra-cluster corpus is the clustering corpus corresponding to the label.
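The size rule of steps A3 and A4 can be sketched in a few lines of Python. The function name is hypothetical; the input is assumed to be one cluster id per corpus item, and N1 defaults to the example value of 50.

```python
from collections import Counter


def assign_cluster_labels(cluster_ids, n1=50, other="other"):
    """Keep the cluster id as label only for clusters with more than n1 items."""
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else other for cid in cluster_ids]
```

Note that a cluster with exactly N1 items falls under "other", because the rule requires the count to be greater than N1.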

In this embodiment, the training corpus is shuffled and split to obtain the clustering sample corpus, while the remaining corpus not used as clustering samples is retained. The clustering sample corpus is clustered based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering result comprising clustering labels and the clustering corpus corresponding to each label, and this clustering result is used to train the target intention recognition model. Introducing a clustering algorithm to process the training data reduces the manual effort involved in building the intention recognition text classifier and improves the efficiency of constructing the intention recognition model.

Further, based on the first and second embodiments of the model building method of the present invention, a third embodiment of the model building method of the present invention is provided.

The third embodiment differs from the first and second embodiments in that it refines step S30, performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus and determining the target intention recognition model according to the model training and prediction results. Referring to fig. 6, the step specifically includes:

Step S31, dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;

The clustering corpus in the clustering result is divided evenly into n parts according to the clustering labels. Each time, m of the n parts are used to train the initial model to obtain a corresponding classification model, and the remaining parts are predicted, so that prediction score values are obtained for all of the clustering corpus.

In a specific embodiment, the clustering corpus in the clustering result is divided into 5 parts according to the clustering labels label1 and label2. If there are 200 corpus items labeled label1, each part contains 40 items labeled label1; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpus items labeled label1. Likewise, if there are 400 corpus items labeled label2, each part contains 80 items labeled label2; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpus items labeled label2.
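The per-label division described above corresponds to a stratified n-fold split. The following Python sketch assumes the clustering result is a list of (text, label) pairs; the function names are hypothetical and the round-robin assignment is one simple way to keep the per-label counts even.

```python
from collections import defaultdict


def stratified_folds(samples, n_folds=5):
    """Split (text, label) pairs into n_folds parts, label by label."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    folds = [[] for _ in range(n_folds)]
    for items in by_label.values():
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)  # round-robin keeps label counts even
    return folds


def train_predict_rounds(folds):
    """Yield (training part, prediction part) so every fold is predicted once."""
    for k, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, held_out
```

With 200 items of label1 and 400 of label2, each of the 5 folds holds 40 and 80 items respectively, matching the figures in the example.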

Step S32, model training is carried out based on the training corpus, and a trained initial classification model is obtained;

Step S33, inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;

In a specific embodiment, the training corpus from the divided clustering corpus is used to train a pre-trained classification model, yielding a corresponding classification model. The prediction corpus from the divided clustering corpus is then input into that classification model to obtain a classification result, and a prediction test on the classification result yields the prediction score value for the classification model.

Step S34, determining the precision, recall and F1 (PRF) value of the clustering result based on the cluster labels in the clustering result and the prediction score values;

The precision (Precision), recall (Recall) and F1 value (F1) of the clustering result, abbreviated as the PRF value, are obtained from the cluster labels in the clustering result and the prediction score values corresponding to the classification results of the classification model. The PRF value is used to evaluate whether the classification model is reasonable, which in turn indicates whether the training data of the classification model is reasonable and therefore whether the clustering result is reasonable.
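As an illustration of how a PRF value could be computed, the sketch below treats the cluster label of each item as the reference label and the classifier's predicted label as the prediction, and reports precision, recall and F1 per label. Per-label scoring and the function name `prf_per_label` are assumptions for illustration; the text does not state how the values are aggregated.

```python
def prf_per_label(cluster_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} with cluster labels as reference."""
    pairs = list(zip(cluster_labels, predicted_labels))
    scores = {}
    for label in sorted(set(cluster_labels) | set(predicted_labels)):
        tp = sum(1 for ref, pred in pairs if ref == label and pred == label)
        fp = sum(1 for ref, pred in pairs if ref != label and pred == label)
        fn = sum(1 for ref, pred in pairs if ref == label and pred != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f1)
    return scores


# Five items: one label1 item is predicted as label2.
reference = ["label1", "label1", "label1", "label2", "label2"]
predicted = ["label1", "label1", "label2", "label2", "label2"]
prf = prf_per_label(reference, predicted)
```

In this example label1 has precision 1 and recall 2/3, while label2 has precision 2/3 and recall 1; both F1 values come out at 0.8.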

Step S35, determining a corresponding target intention recognition model based on the PRF value.

In a specific embodiment, the target intention recognition model to be output is determined according to the PRF value of the clustering result. The rules applied to the PRF value are as follows:

If the PRF value reaches a preset threshold, the clustering result is reasonable, and the pre-trained model is trained with all of the corpus in the clustering result to obtain the corresponding classification model;

If the PRF value does not reach the preset threshold, the clustering result is unreasonable. Classification adjustment is then performed on the cluster labels of the clustering result and on the clustering corpus corresponding to those labels, and model training and prediction are performed again on the adjusted clustering result. This is repeated until the PRF value of the clustering result reaches the preset threshold, at which point the corresponding classification model is output.

In this embodiment, the clustering result is used for training and prediction to obtain the classification output of the corresponding classification model and the prediction score values of the clustering result. The PRF value of the clustering result is obtained from the classification output and the prediction score values, whether the clustering result is reasonable is determined by whether the PRF value reaches a preset threshold, and the target intention recognition model is determined accordingly. This improves the accuracy of the automatically created target intention recognition model. It also provides a fault tolerance mechanism: the classification results of the classification model are checked by prediction, which improves their accuracy.

Further, based on the first, second, and third embodiments of the model building method of the present invention, a fourth embodiment of the model building method of the present invention is provided.

The fourth embodiment of the model building method differs from the first, second and third embodiments in that it refines step S35, in which the corresponding target intention recognition model is determined based on the PRF value. Referring to fig. 7, the step specifically includes:

Step B1, judging whether the PRF value reaches a preset threshold;

Step B2, if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;

In a specific embodiment, if the PRF value of the classification result output by the classification model reaches the preset threshold, it is judged whether all of the clustered training corpus was used when training the pre-trained model. If so, the classification model is output directly as the target intention recognition model. If not, the remaining training corpus is added and the classification model is trained further until all of the training corpus has been used, after which the classification model is output as the target intention recognition model.

Step B3, if the PRF value does not reach the preset threshold value, the clustering result is unreasonable, and classification adjustment is carried out on the clustering result to obtain a clustering result after classification adjustment;

Referring to fig. 8, step B3 specifically includes:

The clustering result includes other clustering corpus, whose cluster label is other, and non-other clustering corpus, whose cluster label is not other. The step of performing classification adjustment on the clustering result to obtain the clustering result after classification adjustment includes:

Step b1, adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted cluster labels;

Step b2, calculating the confusion degree of the clustering corpus corresponding to the adjusted cluster labels;

Step b3, when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpora from before and after the classification adjustment to serve as the non-other clustering corpus of the current clustering result, and taking the remaining corpus as the other clustering corpus with the cluster label other.

In a specific embodiment, the step of performing classification adjustment on the clustering result has three parts:

First, the labels of the unreliable part of the non-other clustering corpus are changed to other. The cluster labels and prediction score values of all of the clustering corpus are available from the preceding steps. Any item in the non-other clustering corpus whose predicted value is inconsistent with its cluster label is relabeled as other. Changing the labels in this way combines the advantages of classification and clustering by moving the unreliable part of the non-other clustering corpus into other.
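The relabeling in this first part can be sketched as follows. The sketch assumes each item carries its cluster label together with the label predicted for it during the held-out prediction rounds, and that "inconsistent" means the two labels differ; the tuple layout and the name `relabel_unreliable` are hypothetical.

```python
OTHER = "other"


def relabel_unreliable(items):
    """Move non-other items whose prediction disagrees with their cluster label to other.

    items: list of (text, cluster_label, predicted_label) tuples.
    Returns a list of (text, adjusted_label) tuples in the same order.
    """
    adjusted = []
    for text, cluster_label, predicted_label in items:
        if cluster_label != OTHER and predicted_label != cluster_label:
            adjusted.append((text, OTHER))
        else:
            adjusted.append((text, cluster_label))
    return adjusted


sample = [
    ("check my balance", "label1", "label1"),   # consistent, kept
    ("what is the weather", "label1", "label2"),  # inconsistent, moved to other
    ("hello", OTHER, "label2"),                 # already other, unchanged
]
adjusted_sample = relabel_unreliable(sample)
```

A variant of this rule would compare the prediction score against a threshold instead of comparing labels; which criterion is applied depends on how the prediction score value is defined.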

Second, it is judged whether the number of corpus items with the label other exceeds a threshold of 10% of the overall training corpus. If so, the clustering algorithm is executed again on the new clustering result, and any cluster in the result whose number of items exceeds a preset threshold N2 is marked as a new clustering corpus and added to the current training corpus.
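A minimal sketch of this second part is given below. It only covers the two checks (the 10% share of other and the size threshold N2); the re-clustering itself is assumed to have been done elsewhere and is represented by a list of group identifiers. The names `other_share_exceeded`, `promote_large_clusters` and the `new_cluster_` label prefix are hypothetical.

```python
from collections import Counter

OTHER = "other"


def other_share_exceeded(labels, max_share=0.10):
    """Return True if the share of items labeled other is above max_share."""
    if not labels:
        return False
    return labels.count(OTHER) / len(labels) > max_share


def promote_large_clusters(recluster_ids, n2):
    """Give a new label to re-clustered groups with more than n2 items.

    recluster_ids: one group identifier per item, as produced by a clustering
    algorithm (assumed, not implemented here). Smaller groups stay other.
    """
    sizes = Counter(recluster_ids)
    return [
        f"new_cluster_{group}" if sizes[group] > n2 else OTHER
        for group in recluster_ids
    ]


labels_after_relabel = ["label1"] * 8 + [OTHER] * 2
needs_recluster = other_share_exceeded(labels_after_relabel)
new_labels = promote_large_clusters([0, 0, 0, 1], n2=2)
```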

Finally, the new clustering corpus is merged with the clustering corpus from before the classification adjustment. Merging requires calculating the confusion degree between the clustering corpora before and after the classification adjustment, and the formula for calculating the confusion degree is as follows:

where N_catei,catej denotes the number of items whose actual intent is catei but which are misclassified as catej, and N_catei denotes the actual number of items of intent catei. The clustering corpora of two cluster labels whose confusion degree is greater than a threshold T2 (for example, 0.25) are merged as one non-other cluster, giving the clustering result after classification adjustment.
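The formula itself is not reproduced in the text above. Going only by the two quantities it defines, one plausible reading is that the confusion degree of catei with catej is the ratio N_catei,catej / N_catei. The Python sketch below implements that assumed ratio and the comparison against T2; it is an illustration under that assumption, not the patent's stated formula, and the function names are hypothetical.

```python
from collections import Counter


def confusion_degrees(actual, predicted):
    """Return {(cate_i, cate_j): N_ij / N_i} for every misclassified pair.

    ASSUMPTION: confusion degree = (items of cate_i predicted as cate_j)
    divided by (all items whose actual label is cate_i).
    """
    totals = Counter(actual)
    mistakes = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return {pair: count / totals[pair[0]] for pair, count in mistakes.items()}


def pairs_to_merge(degrees, t2=0.25):
    """Return label pairs whose confusion degree is strictly greater than t2."""
    return sorted(pair for pair, degree in degrees.items() if degree > t2)


actual = ["a", "a", "a", "a", "b", "b", "b", "b"]
predicted = ["a", "a", "b", "b", "b", "b", "b", "a"]
degrees = confusion_degrees(actual, predicted)
merge_candidates = pairs_to_merge(degrees, t2=0.25)
```

Here two of the four "a" items are predicted as "b" (degree 0.5) and one of the four "b" items is predicted as "a" (degree 0.25), so only the pair ("a", "b") exceeds T2 = 0.25.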

Step B4, taking the clustering result after classification adjustment as the current clustering result;

acquiring the current clustering result, and returning to execute the following step:

determining the precision, recall and F1 (PRF) value of the clustering result based on the cluster labels in the clustering result and the prediction score values;

In a specific embodiment, if the PRF value of the clustering result does not reach the preset threshold, the clustering result is unreasonable, and the cluster labels of the clustering result and the corresponding clustering corpus need classification adjustment. Once the adjusted clustering result is obtained, it is taken as the current clustering result, and the steps of performing model training and prediction based on the cluster labels and corresponding clustering corpus in the clustering result, and of determining the target intention recognition model according to the model training and prediction results, are performed again.

Specifically, the clustering corpus in the clustering result is divided into training corpus and prediction corpus according to the cluster labels, the classification model is trained on the training corpus to obtain its classification result, the prediction score value of the clustering result is obtained by prediction, the PRF value of the clustering result is determined based on the classification result and the prediction score value, and the corresponding target intention recognition model is determined according to the PRF value.

Step B5, repeating the above until the PRF value reaches the preset threshold, at which point the clustering result is reasonable and the initial classification model is output as the target intention recognition model.

If the PRF value reaches the preset threshold, the clustering result is reasonable, and the classification model obtained by training on the clustering result is output as the target intention recognition model.

In this embodiment, the PRF value of the classification result of the classification model is used to judge whether the classification model is reasonable, and from that whether its training data, and hence the clustering result containing that training data, is reasonable. If so, the classification model is output directly as the target intention recognition model. If not, classification adjustment is performed on the clustering result to obtain corrected model training data, from which the target intention recognition model is obtained. This improves both the accuracy and the efficiency of automatically creating the target intention recognition model.
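The control flow of steps B1 to B5 can be summarized in a short sketch. Training, scoring and classification adjustment are passed in as callables because their details are described elsewhere; `train_and_score`, `adjust` and the `max_rounds` safety limit are hypothetical additions for illustration (the text simply repeats until the threshold is reached), and the PRF value is assumed to be a single number.

```python
def build_intent_model(clustering_result, train_and_score, adjust, threshold,
                       max_rounds=10):
    """Repeat train/predict/adjust until the PRF value reaches the threshold.

    train_and_score(result) -> (model, prf_value)
    adjust(result)          -> adjusted clustering result
    """
    current = clustering_result
    for _ in range(max_rounds):
        model, prf_value = train_and_score(current)
        if prf_value >= threshold:
            # Steps B2 / B5: clustering result is reasonable, output the model.
            return model, current
        # Steps B3 / B4: adjust and use the adjusted result as the current one.
        current = adjust(current)
    raise RuntimeError("PRF threshold not reached within max_rounds")


# Stub callables standing in for real training and adjustment.
fake_scores = [0.60, 0.80, 0.95]


def fake_train_and_score(result):
    return f"model-{result['round']}", fake_scores[result["round"]]


def fake_adjust(result):
    return {"round": result["round"] + 1}


final_model, final_result = build_intent_model(
    {"round": 0}, fake_train_and_score, fake_adjust, threshold=0.9
)
```

With the stub scores above, the first two rounds fall below the threshold and trigger an adjustment, and the third round's model is returned.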

The invention also provides a model construction device. Referring to fig. 9, the model building apparatus of the present invention includes:

An acquisition module 10, configured to acquire a training corpus for constructing a model;

A clustering module 20, configured to perform clustering processing on the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, where the clustering result includes a clustering label and a clustering corpus corresponding to the clustering label;

A determining module 30, configured to perform model training and prediction based on the cluster labels in the clustering result and the corresponding clustering corpus, and to determine a target intention recognition model according to the model training and prediction results.

Furthermore, the present invention provides a medium, preferably a computer-readable storage medium, having stored thereon a model building program which, when executed by a processor, implements the steps of the model building method as described above.

In the embodiments of the model building apparatus and medium of the present invention, all the technical features of each embodiment of the model building method are included, and description and explanation contents are substantially the same as those of each embodiment of the model building method, which are not repeated herein.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, in the field of other related technology.

Claims

1. A model building method, characterized in that the model building method includes the following steps:

Obtain the training corpus for building the model;

Based on a pre-trained clustering model, the training corpus is clustered to obtain the corresponding clustering results, wherein the clustering results include clustering labels and the clustered corpus corresponding to the clustering labels;

Based on the clustering labels and corresponding clustering corpora in the clustering results, the model is trained and predicted, and the target intent recognition model is determined according to the model training and prediction results.

The steps of training and predicting the model based on the clustering labels and corresponding clustering corpus in the clustering results, and determining the target intent recognition model based on the model training and prediction results, include:

The clustering corpus in the clustering results is divided into training corpus and prediction corpus according to the clustering labels.

The model is trained based on the training corpus to obtain a trained initial classification model;

The prediction corpus is input into the trained initial classification model for prediction, and the prediction score is obtained.

Based on the cluster labels in the clustering results and the predicted scores, the precision, recall and F1 (PRF) value of the clustering results is determined.

Based on the PRF value, the corresponding target intent recognition model is determined;

The step of determining the corresponding target intent recognition model based on the PRF value includes:

Determine whether the PRF value reaches a preset threshold;

If the PRF value reaches a preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intent recognition model.

If the PRF value does not reach the preset threshold, the clustering result is unreasonable. The clustering result is then adjusted to obtain the adjusted clustering result.

Use the clustering result after classification adjustment as the current clustering result, and return to the execution steps:

Until the PRF value reaches a preset threshold, the clustering result is considered reasonable, and the initial classification model is output as the target intent recognition model.

The clustering results include corpora of other clusters labeled "other" and corpora of non-other clusters labeled "non-other".

The step of classifying and adjusting the clustering results to obtain the adjusted clustering results includes:

Adjust the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;

Calculate the confusion level of the clustered corpus corresponding to the adjusted clustering labels;

When the confusion level is greater than the preset threshold T2, the non-other clustered data before and after the classification is merged and adjusted as the non-other clustered data of the current clustering result, and the other data are used as the other clustered data with the clustering label "other".

The step of adjusting the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels includes:

Obtain the predicted score of the non-other clustered corpus;

If the predicted score of the non-other clustered corpus is lower than the preset threshold T1, then the clustering label of the non-other clustered corpus will be changed to other.

When the number of other cluster corpora with the clustering label "other" exceeds a preset threshold N2, the intra-cluster corpora in the clustering results that exceed the preset threshold N2 are marked as new cluster corpora.

Obtain the adjusted "other" clustered corpus and the adjusted non-"other" clustered corpus;

The step of clustering the training corpus based on a pre-trained clustering model to obtain the corresponding clustering results includes:

The training corpus is sequentially shuffled and segmented to obtain clustered sample corpus.

Based on the hierarchical agglomerative clustering algorithm (HAC), the clustered sample corpus is clustered to obtain cluster labels and the corresponding clustered corpus.

The steps of performing clustering processing on the clustering sample corpus based on the hierarchical agglomerative clustering algorithm (HAC) to obtain cluster labels and the corresponding clustered corpus include:

The clustered sample corpus is classified, and the clustered sample corpus of the same type after classification is divided into a cluster to obtain the clustering labels corresponding to different types of clusters.

Based on the clustering labels corresponding to the clusters, determine the intra-cluster corpus corresponding to different types of clustering labels;

If the number of corpora within a cluster is greater than a preset threshold N1, then the corpora within a cluster are labeled with the cluster as the cluster label, and the corpora within the cluster corresponding to the cluster label are the corresponding cluster corpora.

If the number of corpora within a cluster is not greater than a preset threshold N1, then the corpora within a cluster are labeled with "other", and the corpora within a cluster corresponding to the cluster label are the corresponding clustered corpora.

2. The model building method as described in claim 1, wherein the step of obtaining the training corpus for building the model includes:

Obtain raw corpus from the business side;

The original corpus is preprocessed to obtain training corpus for model construction;

The preprocessing methods include one or more of the following: removing stop words, using full-width characters, removing emoticons, removing greetings and meaningless questions, and using standardized punctuation marks.

3. A model building apparatus, characterized in that the model building apparatus comprises:

The acquisition module is used to acquire the training corpus for building the model;

The clustering module is used to perform clustering processing on the training corpus based on a pre-trained clustering model to obtain the corresponding clustering results, wherein the clustering results include clustering labels and the clustered corpus corresponding to the clustering labels;

The determination module is used to train and predict the model based on the clustering labels in the clustering results and the corresponding clustering corpus, and to determine the target intent recognition model based on the model training and prediction results.

The step of training and predicting the model based on the clustering labels and corresponding clustering corpus in the clustering results, and determining the target intent recognition model based on the model training and prediction results, includes:

Determine whether the PRF value reaches a preset threshold;

The process of classifying and adjusting the clustering results to obtain the adjusted clustering results includes:

Obtain the predicted score of the non-other clustered corpus;

The clustering process based on the pre-trained clustering model, which performs clustering on the training corpus to obtain the corresponding clustering results, includes:

The hierarchical agglomerative clustering algorithm (HAC) performs clustering processing on the clustered sample corpus to obtain cluster labels and the corresponding clustered corpus, including:

4. A model building device, characterized in that the model building device comprises: a memory, a processor, and a model building program stored in the memory and executable on the processor, wherein the model building program, when executed by the processor, implements the steps of the model building method as described in any one of claims 1 to 2.

5. A medium, the medium being a computer-readable storage medium, characterized in that the computer-readable storage medium stores a model building program, which, when executed by a processor, implements the steps of the model building method as described in any one of claims 1 to 2.