CN114817455B - Model building methods, devices, equipment and media - Google Patents
Info
- Publication number
- CN114817455B (application CN202210229151.9A)
- Authority
- CN
- China
- Prior art keywords
- clustering
- corpus
- model
- clustered
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of artificial intelligence and discloses a model construction method, apparatus, device and medium. The method comprises: obtaining a training corpus for constructing a model; clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label; and performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and determining a target intention recognition model according to the model training and prediction results. By generating the target intention recognition model automatically, the method reduces the time spent on becoming familiar with business points and on data labeling, speeds up the sorting out of business points and the labeling of business corpus, improves the efficiency of constructing the target intention recognition model, and reduces labor cost.
Description
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for model construction.
Background
With the continuous development of artificial intelligence technology, dialogue systems such as unmanned customer service systems are applied more and more widely. Intention recognition is an important component of such systems. A common approach to intention recognition is text classification: the user's intentions are divided into several categories, and a corresponding response scheme is matched under each category.
When a dialogue system is built, text classification is usually the simplest and most effective means of building a user intention recognition model. Existing text classification models for intention recognition are mainly built around the business that the dialogue system needs to handle. During construction, an operator usually has to understand the business points the dialogue system needs to handle, sort out the relevant business point information, and label a large amount of corpus provided by the business party. After labeling, the classification is adjusted with a classification adjustment tool, and if the corpus is insufficient, expanding it with tools is considered, until a text classification model for recognizing user intention is finally obtained. In this process, building the text classification model by having operators sort out business points and label business corpus is inefficient and consumes a large amount of manpower.
Disclosure of Invention
The main purpose of the invention is to provide a model construction method, apparatus, device and medium, which aim to reduce the labor cost of manual sorting and labeling and to improve the efficiency of constructing classification models.
In order to achieve the above object, the present invention provides a model construction method comprising the steps of:
Acquiring a training corpus for constructing a model;
Clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and a clustering corpus corresponding to the clustering label;
And performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and determining a target intention recognition model according to the model training and prediction results.
Preferably, the step of clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result includes:
Shuffling the order of the training corpus and splitting it to obtain a clustering sample corpus;
And clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and a clustering corpus corresponding to the clustering label.
Preferably, the step of clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and a clustering corpus corresponding to the clustering label includes:
Classifying the clustering sample corpus, and grouping clustering sample corpus of the same kind into one cluster, to obtain the clustering labels corresponding to the different kinds of clusters;
Determining, based on the clustering label corresponding to each cluster, the intra-cluster corpus corresponding to each kind of clustering label;
If the number of items in the intra-cluster corpus is greater than a preset threshold N1, using the cluster as the clustering label of the intra-cluster corpus, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus;
If the number of items in the intra-cluster corpus is not greater than the preset threshold N1, using other as the clustering label of the intra-cluster corpus, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus.
Preferably, the step of performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and determining a target intention recognition model according to the model training and prediction results includes:
Dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
Performing model training based on the training corpus to obtain a trained initial classification model;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining a precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;
Determining a corresponding target intention recognition model based on the PRF value.
Preferably, the step of determining a corresponding target intention recognition model based on the PRF value includes judging whether the clustering result is reasonable based on the PRF value:
Judging whether the PRF value reaches a preset threshold;
If the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
If the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a classification-adjusted clustering result;
Taking the classification-adjusted clustering result as the current clustering result, and repeating the steps of:
Dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
Performing model training based on the training corpus to obtain a trained initial classification model;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining a precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;
Until the PRF value reaches the preset threshold and the clustering result is reasonable, whereupon the initial classification model is output as the target intention recognition model.
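The train, evaluate and adjust cycle described in the steps above can be sketched as a simple loop. This is a minimal illustration rather than the patented implementation: `train_and_score` and `adjust` are hypothetical callables supplied by the caller, the comparison uses a single score (assumed here to be F1, since the text does not say which PRF component is compared), and `max_rounds` is an added safety limit that the text does not mention.

```python
def build_intent_model(clustering_result, train_and_score, adjust,
                       prf_threshold=0.9, max_rounds=10):
    """Repeat train/predict/evaluate until the score reaches the threshold.

    train_and_score(result) -> (model, score)   # hypothetical callable
    adjust(result) -> adjusted result           # hypothetical callable
    """
    current = clustering_result
    for _ in range(max_rounds):
        model, score = train_and_score(current)
        if score >= prf_threshold:
            # Clustering result judged reasonable: output the model.
            return model, current
        # Otherwise perform classification adjustment and try again.
        current = adjust(current)
    raise RuntimeError("PRF threshold not reached within max_rounds")
```

The function returns both the model and the clustering result that produced it, so the final labels can be inspected alongside the model.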
Preferably, the clustering result comprises other clustering corpus, whose clustering label is other, and non-other clustering corpus, whose clustering label is not other,
The step of performing classification adjustment on the clustering result to obtain a classification-adjusted clustering result comprises:
Adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
Calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;
And when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpus from before and after the classification adjustment to serve as the non-other clustering corpus of the current clustering result, and taking the remaining corpus as the other clustering corpus whose clustering label is other.
Preferably, the step of adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels includes:
Acquiring the prediction score value of the non-other clustering corpus;
If the prediction score value of a non-other clustering corpus is lower than a preset threshold T1, changing the clustering label of that non-other clustering corpus to other;
And when the number of other clustering corpus items whose clustering label is other exceeds a preset threshold N2, acquiring the adjusted other clustering corpus and the adjusted non-other clustering corpus.
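The relabeling rule above can be illustrated with a short sketch. It is one possible reading, not the definitive procedure: the tuple layout `(text, label, score)`, the function name, and the default values of `t1` and `n2` are assumptions, and the N2 check is exposed only as a flag because the text does not spell out what happens when the count stays at or below N2.

```python
def relabel_low_confidence(items, t1=0.5, n2=100, other_label="other"):
    """Move low-scoring non-other items into the 'other' class.

    items: iterable of (text, label, score) tuples (assumed layout).
    Returns (non_other, other, exceeds_n2).
    """
    non_other, other = [], []
    for text, label, score in items:
        # A non-other item scoring below T1 is relabeled as other.
        if label != other_label and score < t1:
            label = other_label
        target = other if label == other_label else non_other
        target.append((text, label))
    # Flag whether the number of 'other' items now exceeds N2.
    return non_other, other, len(other) > n2
```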
Preferably, the step of acquiring a training corpus for constructing a model includes:
Acquiring an original corpus from a business end;
Preprocessing the original corpus to obtain a training corpus for model construction;
The preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
In addition, in order to achieve the above object, the present invention also provides a model building apparatus comprising:
An acquisition module, configured to acquire a training corpus for constructing a model;
A clustering module, configured to cluster the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and a clustering corpus corresponding to the clustering label;
And a determining module, configured to perform model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and to determine a target intention recognition model according to the model training and prediction results.
Preferably, the acquisition module is further configured to:
Acquiring an original corpus from a business end;
Preprocessing the original corpus to obtain a training corpus for model construction;
The preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
Preferably, the clustering module is further configured to:
Shuffling the order of the training corpus and splitting it to obtain a clustering sample corpus;
And clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering label and a clustering corpus corresponding to the clustering label.
Preferably, the clustering module is further configured to:
Classifying the clustering sample corpus, and grouping clustering sample corpus of the same kind into one cluster, to obtain the clustering labels corresponding to the different kinds of clusters;
Determining, based on the clustering label corresponding to each cluster, the intra-cluster corpus corresponding to each kind of clustering label;
If the number of items in the intra-cluster corpus is greater than a preset threshold N1, using the cluster as the clustering label of the intra-cluster corpus, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus;
If the number of items in the intra-cluster corpus is not greater than the preset threshold N1, using other as the clustering label of the intra-cluster corpus, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus.
Preferably, the determining module is further configured to:
Dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
Performing model training based on the training corpus to obtain a trained initial classification model;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining a precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;
Determining a corresponding target intention recognition model based on the PRF value.
Preferably, the determining module is further configured to:
Judging whether the PRF value reaches a preset threshold;
If the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
If the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a classification-adjusted clustering result;
Taking the classification-adjusted clustering result as the current clustering result, and repeating the steps of:
Dividing the clustering corpus in the classification-adjusted clustering result into training corpus and prediction corpus according to the clustering labels;
Performing model training based on the training corpus to obtain a trained initial classification model;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining a precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;
Until the PRF value reaches the preset threshold and the clustering result is reasonable, whereupon the initial classification model is output as the target intention recognition model.
Preferably, the determining module is further configured to:
Adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
Calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;
And when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpus from before and after the classification adjustment to serve as the non-other clustering corpus of the current clustering result, and taking the remaining corpus as the other clustering corpus whose clustering label is other.
Preferably, the determining module is further configured to:
Acquiring the prediction score value of the non-other clustering corpus;
If the prediction score value of a non-other clustering corpus is lower than a preset threshold T1, changing the clustering label of that non-other clustering corpus to other;
And when the number of other clustering corpus items whose clustering label is other exceeds a preset threshold N2, acquiring the adjusted other clustering corpus and the adjusted non-other clustering corpus.
In addition, in order to achieve the above object, the present invention also provides a model construction device including a memory, a processor, and a model construction program stored on the memory and executable on the processor, the model construction program implementing the steps of the model construction method as described above when executed by the processor.
In addition, in order to achieve the above object, the present invention also provides a medium that is a computer-readable storage medium having stored thereon a model building program that, when executed by a processor, implements the steps of the model building method as described above.
The model construction method, apparatus, device and medium provided by the invention acquire a training corpus for constructing a model; cluster the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and the clustering corpus corresponding to the clustering label; and perform model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, determining a target intention recognition model according to the model training and prediction results.
By clustering the training corpus used to construct the intention recognition model, a clustering result corresponding to the training corpus is obtained, the clustering result comprising the clustering labels and the clustering corpus corresponding to each clustering label. Model training and prediction are performed on this clustering result to obtain the corresponding PRF value, and the target intention recognition model is determined according to the PRF value. Because the target intention recognition model is generated automatically, the time spent on becoming familiar with business points and on data labeling is reduced, the efficiency of sorting out business points and labeling business corpus is improved, and labor cost is reduced.
Drawings
FIG. 1 is a schematic diagram of a device architecture of a hardware operating environment involved in an embodiment of the model building of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the model building method of the present invention;
FIG. 3 is a schematic flow chart of a first embodiment of the model building method of the present invention;
FIG. 4 is a flow chart of a second embodiment of the model building method of the present invention;
FIG. 5 is a schematic view showing a sub-process of step S22 in a second embodiment of the model building method of the present invention;
FIG. 6 is a flow chart of a third embodiment of the model building method of the present invention;
FIG. 7 is a flow chart of a fourth embodiment of the model building method of the present invention;
FIG. 8 is a schematic flow chart of a step B3 in a fourth embodiment of the model building method of the present invention;
FIG. 9 is a schematic functional block diagram of a model building apparatus according to a first embodiment of the model building method of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic device structure of a hardware running environment according to an embodiment of the present invention.
The device of the embodiment of the invention can be a mobile terminal or a server device.
As shown in FIG. 1, the device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the device structure shown in fig. 1 is not limiting of the device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a model building program may be included in the memory 1005, which is a type of computer storage medium.
The operating system is a program that manages and controls the model building device and its software resources, and supports the running of the network communication module, the user interface module, the model building program, and other programs or software; the network communication module is used to manage and control the network interface 1004, and the user interface module is used to manage and control the user interface 1003.
In the model building apparatus shown in FIG. 1, the model building apparatus calls, through the processor 1001, the model building program stored in the memory 1005 and performs the operations in the various embodiments of the model building method described below.
Based on the hardware structure, the embodiment of the model building method is provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of a model building method according to the present invention, where the method includes:
Step S10, obtaining a training corpus for constructing a model;
An original corpus is obtained from the business end and preprocessed to obtain a training corpus for model construction, wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
In a specific embodiment, dialogue corpus from a business scenario is collected for model construction: a large number of user chat records from intelligent customer service are used as the original corpus for training the target model. The original corpus undergoes standardized preprocessing, which may be one or more of removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks. Applying standardized preprocessing to the whole original corpus from the client reduces noise in the original corpus, improves its purity, and yields a training corpus that can be used to construct the model.
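The preprocessing options listed above can be illustrated with a small sketch using only the Python standard library. The stop-word and greeting lists, the punctuation mapping, the emoji ranges, and the set of punctuation treated as "common" are all illustrative assumptions; a real system would use lists suited to its own business corpus and language.

```python
import re
import unicodedata

# Hypothetical word lists; real ones would come from the business domain.
STOP_WORDS = {"the", "a", "an", "please"}
GREETINGS = {"hello", "hi", "thanks"}
DROP_WORDS = STOP_WORDS | GREETINGS

# Map a few punctuation variants onto one canonical form (assumed mapping).
PUNCT_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'",
                           "。": ".", "、": ","})
# Rough emoji/symbol ranges (an approximation, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Everything except word characters, whitespace and a few common marks.
UNCOMMON_PUNCT_RE = re.compile(r"[^\w\s.,?!'\"]")


def preprocess(utterance: str) -> str:
    """Normalize one utterance; returns '' if nothing meaningful is left."""
    text = unicodedata.normalize("NFKC", utterance)  # full-width -> half-width
    text = text.translate(PUNCT_MAP)                 # unify punctuation
    text = EMOJI_RE.sub("", text)                    # remove emoticons
    text = UNCOMMON_PUNCT_RE.sub("", text)           # remove uncommon marks
    tokens = [t for t in text.lower().split()
              if t.strip(".,?!'\"") not in DROP_WORDS]
    return " ".join(tokens)


def build_training_corpus(raw_corpus):
    """Preprocess every utterance and drop those that become empty."""
    cleaned = (preprocess(u) for u in raw_corpus)
    return [c for c in cleaned if c]
```

Utterances that consist only of greetings or stop words become empty and are dropped, which corresponds loosely to removing greetings and meaningless questions.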
Step S20, clustering is carried out on the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises a clustering label and clustering corpus corresponding to the clustering label;
The clustering process may use a morphological operator to cluster and merge nearby regions of similar classification. After clustering and merging, clusters of different categories are generated. A cluster generated by clustering is a set of data objects that are similar to the other objects in the same cluster and different from the objects in other clusters.
Among existing clustering algorithms, most are able to handle noisy data, although some are very sensitive to it; a correspondingly accurate clustering result can be obtained once the data have been clustered.
In a specific embodiment, the corpus data collected from the business end is preprocessed to obtain the preprocessed training corpus, which is input into the pre-trained clustering model. The order of the preprocessed training corpus is shuffled and part of the training corpus is extracted for clustering; the clustering algorithm may be the hierarchical agglomerative clustering (HAC) algorithm. This yields clusters of different categories, each with its corresponding corpus, and hence a clustering result comprising the clustering labels and the clustering corpus corresponding to each clustering label. In the clustering result, if the number of corpus items in a cluster is greater than a preset threshold N1 (for example, 50), the cluster id is used as the clustering label; otherwise, other is used as the clustering label.
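The N1 rule can be sketched as follows, assuming the clustering step has already produced one cluster id per corpus item. The function name and the `cluster_<id>` label format are hypothetical; the default of 50 follows the example value given in the text.

```python
from collections import Counter, defaultdict


def build_clustering_result(corpus, cluster_ids, n1=50, other_label="other"):
    """Group corpus items by clustering label, folding small clusters into 'other'.

    corpus: list of utterances; cluster_ids: one cluster id per utterance.
    """
    sizes = Counter(cluster_ids)
    result = defaultdict(list)
    for text, cid in zip(corpus, cluster_ids):
        # Clusters with more than N1 items keep their cluster id as label.
        label = f"cluster_{cid}" if sizes[cid] > n1 else other_label
        result[label].append(text)
    return dict(result)
```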
And step S30, performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering results, and determining a target intention recognition model according to the model training and prediction results.
In the prior art, data are manually classified and labeled to obtain a data set that can be used for model training, and an initial pre-trained model is trained by deep learning on that data set to obtain the target classification model. In the model construction method of the invention, the training corpus of the business end is obtained directly and clustered to obtain a clustering result comprising the clustering labels and the clustering corpus corresponding to each clustering label; training and prediction are performed on the clustering result, and the target intention recognition model is determined according to the model training and prediction results.
In a specific embodiment, the clustering result, which comprises the clustering labels and the clustering corpus corresponding to each clustering label, is divided into 5 parts according to the clustering labels. For example, if there are 200 corpus items whose clustering label (cluster id) is label1, each part after the division contains 40 items labeled label1. Each time, 4 parts of the clustering corpus are used to train the pre-trained classification model and the remaining 1 part is predicted, so that prediction score values are obtained for all 5 parts. From the clustering labels and the corresponding prediction score values, the PRF value of the clustering result, namely precision, recall and F1 value, can be obtained. The PRF value is used to evaluate whether the clustering result is reasonable; if it is, the classification model trained on the training data is output as the corresponding target intention recognition model.
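The 5-part split and PRF evaluation described above amount to stratified cross-validation with out-of-fold predictions. The sketch below is a simplified illustration under stated assumptions: `train_token_classifier` is a toy stand-in for the pre-trained classification model, and macro-averaging over labels is an assumed choice, since the text does not say how the per-label values are combined.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=5):
    """Spread the items of every label evenly over k folds (round-robin)."""
    folds = [[] for _ in range(k)]
    seen = Counter()
    for idx, label in enumerate(labels):
        folds[seen[label] % k].append(idx)
        seen[label] += 1
    return folds


def train_token_classifier(texts, labels):
    """Toy classifier (assumption): score = share of a label's tokens matched."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.split())

    def predict(text):
        tokens = text.split()
        scores = {lab: sum(c[t] for t in tokens) / (sum(c.values()) or 1)
                  for lab, c in counts.items()}
        best = max(scores, key=scores.get)
        return best, scores[best]
    return predict


def out_of_fold_predictions(texts, labels, k=5, train_fn=train_token_classifier):
    """Train on k-1 parts, predict the held-out part, for every part."""
    pred_labels, pred_scores = [None] * len(texts), [0.0] * len(texts)
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train_idx = [i for i in range(len(texts)) if i not in held]
        predict = train_fn([texts[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        for i in held_out:
            pred_labels[i], pred_scores[i] = predict(texts[i])
    return pred_labels, pred_scores


def macro_prf(gold, predicted):
    """Macro-averaged precision, recall and F1 over the gold labels."""
    labels = sorted(set(gold))
    totals = [0.0, 0.0, 0.0]
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, predicted))
        fp = sum(g != lab and p == lab for g, p in zip(gold, predicted))
        fn = sum(g == lab and p != lab for g, p in zip(gold, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        for i, value in enumerate((prec, rec, f1)):
            totals[i] += value
    return tuple(t / len(labels) for t in totals)
```

With 200 items labeled label1, `stratified_folds` places 40 of them in each of the 5 parts, matching the example in the text.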
In this embodiment, the preprocessed training corpus is clustered to obtain clustering labels of different feature categories and the clustering corpus corresponding to each clustering label; these are used as the clustering result for model training and prediction, and the target intention recognition model is determined according to the model training and prediction results. This automatic construction method greatly reduces the labor cost of manual labeling and eases the workload of operators. It can also help operators understand the business and discover the different business types involved when sorting out business points, improving the efficiency of constructing the intention recognition model.
Further, based on the first embodiment of the model building method of the present invention, a second embodiment of the model building method of the present invention is proposed.
The second embodiment of the model building method differs from the first embodiment in that it refines step S20, in which the training corpus is clustered based on the pre-trained clustering model to obtain the corresponding clustering result. Referring to fig. 4, the step specifically includes:
Step S21, shuffling and splitting the training corpus to obtain a clustering sample corpus;
In a specific embodiment, the preprocessed training corpus is input into the pre-trained clustering model, the order of the training corpus is shuffled, and part of the training corpus is extracted for clustering. Specifically, the extracted part is clustered first, and the rest of the training corpus is then added step by step, so that the clustering result corresponding to the whole training corpus is obtained. Compared with clustering all of the training corpus directly, clustering an extracted part first yields a better final classification effect.
In a specific embodiment, the preprocessed training corpus is shuffled and part of it is extracted as the clustering sample corpus. For example, with reference to the total corpus size, 20% of the total corpus can be extracted as the clustering sample corpus for clustering.
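Step S21 can be illustrated with a minimal Python sketch. The function name `draw_cluster_sample`, the fixed `seed`, and the toy utterances are assumptions made for illustration; only the shuffle-then-extract idea and the 20% figure come from the description.

```python
import random

def draw_cluster_sample(corpus, fraction=0.2, seed=0):
    """Shuffle the corpus and split it into a clustering sample and the remainder."""
    shuffled = list(corpus)                 # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded only to make the sketch repeatable
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus of 100 utterances; 20% becomes the clustering sample.
utterances = [f"utterance {i}" for i in range(100)]
sample, remainder = draw_cluster_sample(utterances)
```

The remainder is returned rather than discarded because the description keeps the non-sample corpus for later steps.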
Step S22, clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpus corresponding to each clustering label.
There are many clustering methods, including partitioning, hierarchical, density-based, grid-based, model-based, transitive-closure, Boolean-matrix, direct, correlation-analysis, and statistics-based clustering. Each method has several different clustering algorithms, from which the corresponding clustering results can be obtained.
In a specific embodiment, the clustering sample corpus is clustered based on the HAC algorithm to obtain clusters of different categories. Each cluster is called a clustering label, and the corpus belonging to each cluster is the clustering corpus corresponding to that clustering label. The clustering result after clustering therefore comprises the clustering labels and the clustering corpus corresponding to each clustering label.
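As an illustration of hierarchical agglomerative clustering, the following sketch merges the two closest clusters repeatedly until no pair is closer than a cutoff. The description does not say how utterances are turned into vectors, which linkage is used, or when merging stops, so the 2-D points, the average linkage, and the `max_distance` parameter are all assumptions of this sketch.

```python
import math

def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters of vectors."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def hac(vectors, max_distance):
    """Merge the two closest clusters until none are closer than max_distance.

    Returns one cluster id per input vector.
    """
    clusters = [[i] for i in range(len(vectors))]   # every item starts in its own cluster
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage([vectors[i] for i in clusters[x]],
                                    [vectors[i] for i in clusters[y]])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > max_distance:
            break                                   # remaining clusters are far apart
        clusters[x].extend(clusters[y])
        del clusters[y]
    labels = [0] * len(vectors)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Hypothetical embeddings: three points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels = hac(points, max_distance=1.0)
```

This brute-force version recomputes every pairwise linkage on each pass and is only meant to show the merging idea; a production system would normally rely on an optimized library implementation.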
Referring to fig. 5, step S22 specifically includes:
Step A1, classifying the clustering sample corpus and grouping clustering sample corpus of the same category into one cluster, to obtain the clustering labels corresponding to the clusters of different categories;
In a specific embodiment, the clustering sample corpus is classified to obtain clustering labels of different categories, and the clustering sample corpus is divided according to the clustering labels to obtain the clustering labels and the clustering corpus corresponding to each clustering label.
Step A2, determining the intra-cluster corpus corresponding to each category of clustering label based on the clustering labels corresponding to the clusters;
In a specific embodiment, the clustering sample corpus is classified based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a cluster for each category. Each cluster is called a clustering label, and the corpus of each cluster is used as the clustering corpus corresponding to that clustering label.
Step A3, if the number of intra-cluster corpus entries is greater than a preset threshold N1, the cluster itself is used as the clustering label, and the intra-cluster corpus is the clustering corpus corresponding to that clustering label;
Step A4, if the number of intra-cluster corpus entries is not greater than the preset threshold N1, "other" is used as the clustering label, and the intra-cluster corpus is the clustering corpus corresponding to that clustering label.
In a specific embodiment, the clustering sample corpus may be divided into five categories, label1, label2, label3, label4 and label5, each with its corresponding clustering corpus. If, with a preset threshold of 50, the clusters whose number of intra-cluster corpus entries exceeds the threshold are label1, label2, label3 and label4, then the clustering corpora of label1, label2, label3 and label4 take their cluster ids as clustering labels, and the remaining clustering sample corpus, that of label5, takes "other" as its clustering label.
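Steps A3 and A4 amount to a size filter on the clusters. A minimal sketch, assuming the cluster assignments are available as a flat list of ids; the function name `apply_size_threshold` and the cluster sizes in the example are hypothetical, while the threshold of 50 and the "other" label follow the example in the description.

```python
from collections import Counter

def apply_size_threshold(cluster_ids, n1=50, other_label="other"):
    """Keep the cluster id for clusters larger than n1; relabel the rest as 'other'."""
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else other_label for cid in cluster_ids]

# Hypothetical sizes: label1..label4 exceed 50 entries, label5 does not.
raw = (["label1"] * 120 + ["label2"] * 80 + ["label3"] * 60
       + ["label4"] * 51 + ["label5"] * 12)
final = apply_size_threshold(raw, n1=50)
```

Because step A4 uses "not greater than", a cluster with exactly N1 entries is also relabeled as "other".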
In this embodiment, the training corpus is shuffled and split to obtain the clustering sample corpus, while the remaining non-sample corpus is retained. The clustering sample corpus is clustered based on the hierarchical agglomerative clustering (HAC) algorithm to obtain a clustering result comprising clustering labels and the clustering corpus corresponding to each clustering label, and this clustering result is used to train the target intention recognition model. Introducing a clustering algorithm to process the training data reduces the manual effort required to build an intention recognition text classifier and improves the efficiency of constructing the intention recognition model.
Further, based on the first and second embodiments of the model building method of the present invention, a third embodiment of the model building method of the present invention is provided.
The third embodiment of the model building method differs from the first and second embodiments in that it refines step S30, in which model training and prediction are performed based on the clustering labels in the clustering result and the corresponding clustering corpus, and the target intention recognition model is determined according to the model training and prediction results. Referring to fig. 6, the step specifically includes:
Step S31, dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
The clustering corpus in the clustering result is divided evenly into n parts according to the clustering labels. Each time, m of the n parts are used to train the initial model and obtain a corresponding classification model, and the remaining parts are predicted with that model, so that prediction score values are obtained for all of the clustering corpus.
In a specific embodiment, the clustering corpus in the clustering result is divided into 5 parts for each of the clustering labels label1 and label2. When there are 200 corpus entries with clustering label label1, each part contains 40 of them; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpus entries labeled label1. Likewise, when there are 400 corpus entries with clustering label label2, each part contains 80 of them; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpus entries labeled label2.
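The per-label split in step S31 can be sketched as follows. This is an illustration under the assumption that m = n - 1, as in the 4-of-5 example; the function name `split_by_label` and the `(text, label)` tuple layout are hypothetical.

```python
from collections import defaultdict

def split_by_label(items, n=5):
    """Yield (train, predict) pairs; each fold holds 1/n of every label's items."""
    by_label = defaultdict(list)
    for text, label in items:
        by_label[label].append((text, label))
    folds = [[] for _ in range(n)]
    for members in by_label.values():
        for i, item in enumerate(members):
            folds[i % n].append(item)       # deal items round-robin so folds stay even
    for held_out in range(n):
        train = [item for k, fold in enumerate(folds) if k != held_out for item in fold]
        yield train, folds[held_out]

# The sizes from the example: 200 entries for label1 and 400 for label2.
items = ([(f"a{i}", "label1") for i in range(200)]
         + [(f"b{i}", "label2") for i in range(400)])
rounds = list(split_by_label(items, n=5))
```

Each entry lands in the prediction part of exactly one round, which is what allows a prediction score value to be collected for every entry.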
Step S32, model training is carried out based on the training corpus, and a trained initial classification model is obtained;
Step S33, inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
In a specific embodiment, the training corpus from the divided clustering corpus is used to train a pre-trained classification model to obtain the corresponding classification model. The prediction corpus from the divided clustering corpus is then input into the classification model to obtain the classification result, and a prediction test is performed on the classification result to obtain the prediction score value of the classification result for that classification model.
Step S34, determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score values;
From the clustering labels in the clustering result and the prediction score values of the classification results, the precision, recall and F1 value of the clustering result, PRF value for short, are obtained. The PRF value is used to evaluate whether the classification model is reasonable, and in turn whether the training data of the classification model is reasonable, and finally whether the clustering result is reasonable.
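Precision, recall and F1 for one clustering label can be computed as in the sketch below, which treats the clustering labels as the reference and the classifier's predicted labels as the output. Reducing the prediction score values to a single predicted label per entry is an assumption of this sketch, and the description does not state how per-label values are combined into one PRF value.

```python
def prf(cluster_labels, predicted_labels, target):
    """Precision, recall and F1 for one label, with cluster labels as the reference."""
    pairs = list(zip(cluster_labels, predicted_labels))
    tp = sum(c == target and p == target for c, p in pairs)
    fp = sum(c != target and p == target for c, p in pairs)
    fn = sum(c == target and p != target for c, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels for six corpus entries.
cluster_labels = ["label1", "label1", "label1", "label2", "label2", "other"]
predictions = ["label1", "label1", "label2", "label2", "label2", "label2"]
p, r, f = prf(cluster_labels, predictions, "label1")
```

For label1 in this toy data there are two true positives, no false positives and one false negative, so precision is 1.0, recall is 2/3 and F1 is 0.8.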
Step S35, determining the corresponding target intention recognition model based on the PRF value.
In a specific embodiment, the target intention recognition model to be output is determined according to the PRF value of the clustering result. The decision rule for the PRF value is as follows:
If the PRF value reaches a preset threshold, the clustering result is reasonable, and the pre-trained model is trained with all of the corpus in the clustering result to obtain the corresponding classification model;
If the PRF value does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels and their corresponding clustering corpus need to undergo classification adjustment. Model training and prediction are then performed again on the adjusted clustering labels and their corresponding clustering corpus, and this is repeated until the PRF value of the clustering result reaches the preset threshold, at which point the corresponding classification model is output.
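The decision rule can be read as a loop around three operations: evaluate, adjust, and train. The sketch below passes those operations in as callables; the names `evaluate`, `adjust` and `train_full`, the integer "quality" stand-in for the PRF value, and the `max_rounds` safety cap are assumptions of this sketch and do not appear in the description, which simply repeats until the threshold is reached.

```python
def build_intent_model(clustering, evaluate, adjust, train_full,
                       threshold, max_rounds=10):
    """Repeat evaluate-and-adjust until the PRF value reaches the threshold."""
    for _ in range(max_rounds):
        if evaluate(clustering) >= threshold:
            return train_full(clustering)    # reasonable: train on all corpus
        clustering = adjust(clustering)      # unreasonable: classification adjustment
    raise RuntimeError("PRF threshold not reached within max_rounds")

# Toy stand-ins: the clustering is a dict whose 'quality' plays the PRF value (in percent).
def evaluate(clustering):
    return clustering["quality"]

def adjust(clustering):
    return {"quality": clustering["quality"] + 10}

def train_full(clustering):
    return ("model", clustering["quality"])

model = build_intent_model({"quality": 70}, evaluate, adjust, train_full, threshold=90)
```

The cap on rounds is a defensive choice for the sketch, since an adjustment step that never improves the PRF value would otherwise loop forever.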
In this embodiment, the clustering result is used for training and prediction to obtain the classification output of the corresponding classification model and the prediction score values. The PRF value of the clustering result is obtained from the classification output and the prediction score values, whether the clustering result is reasonable is determined by whether the PRF value reaches the preset threshold, and the corresponding target intention recognition model is finally determined. This improves the accuracy of automatically creating the target intention recognition model. It also provides a fault tolerance mechanism: the classification results of the classification model are checked by prediction, which improves their accuracy.
Further, based on the first, second, and third embodiments of the model building method of the present invention, a fourth embodiment of the model building method of the present invention is provided.
The fourth embodiment of the model building method differs from the first, second and third embodiments in that it refines step S35, in which the corresponding target intention recognition model is determined based on the PRF value. Referring to fig. 7, the step specifically includes:
Step B1, judging whether the PRF value reaches a preset threshold;
Step B2, if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
In a specific embodiment, if the PRF value of the classification result output by the classification model reaches the preset threshold, it is judged whether all of the clustered training corpus was added while training the pre-trained model. If so, the classification model is output directly as the target intention recognition model. If not, the remaining training corpus is added to train the classification model further until all of the training corpus has been used, and the classification model is then output as the target intention recognition model.
Step B3, if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain an adjusted clustering result;
Referring to fig. 8, step B3 specifically includes:
The clustering result comprises other clustering corpus, whose clustering label is "other", and non-other clustering corpus, whose clustering label is not "other". The step of performing classification adjustment on the clustering result to obtain the adjusted clustering result comprises:
Step b1, adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
Step b2, calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;
Step b3, when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpora from before and after the classification adjustment to serve as the non-other clustering corpus of the current clustering result, and using the remaining corpus as the other clustering corpus, whose clustering label is "other".
In a specific embodiment, the classification adjustment of the clustering result consists of three parts:
First, the unreliable part of the non-other clustering corpus is relabeled as "other". The clustering labels and prediction score values of all clustering corpus are available from the preceding steps. Corpus entries whose clustering label is not "other" and whose predicted value is inconsistent with the clustering label are relabeled as "other". Relabeling in this way combines the advantages of classification and clustering by moving the unreliable part of the non-other clustering corpus into "other".
Then, it is judged whether the amount of clustering corpus labeled "other" exceeds a threshold of 10% of the overall training corpus. If so, the clustering algorithm is executed on the new clustering result, and clustering corpora in the resulting clustering result whose size exceeds a preset threshold N2 are marked as new clustering corpus and added to the current training corpus.
Finally, the new clustering corpus and the clustering corpus from before the classification adjustment are classified and merged. Merging requires calculating the confusion degree between the clustering corpora before and after the classification adjustment, and the formula for calculating the confusion degree is as follows:
where $N_{cate_i,cate_j}$ denotes the number of corpus entries whose actual intention is $cate_i$ but which were misclassified as $cate_j$, and $N_{cate_i}$ denotes the number of corpus entries whose actual intention is $cate_i$. The clustering corpora of any two clustering labels whose confusion degree is greater than the threshold T2 (for example, 0.25) are merged into a single non-other cluster, giving the adjusted clustering result.
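The first and third parts of the adjustment can be sketched in Python. The confusion-degree formula itself is not reproduced in the text, so this sketch assumes the simplest ratio consistent with the two defined quantities, namely the count of entries with actual intention cate_i misclassified as cate_j divided by the count of entries with actual intention cate_i. That ratio, the function names, and the toy labels are all assumptions rather than the patent's stated formula.

```python
from collections import Counter

OTHER = "other"

def demote_unreliable(cluster_labels, predicted_labels):
    """Relabel non-other entries as 'other' when the prediction disagrees with the cluster."""
    return [OTHER if c != OTHER and c != p else c
            for c, p in zip(cluster_labels, predicted_labels)]

def confusion_degree(actual, predicted):
    """Assumed ratio: count(actual i, predicted j) / count(actual i), for i != j."""
    totals = Counter(actual)
    mistakes = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return {pair: n / totals[pair[0]] for pair, n in mistakes.items()}

def pairs_to_merge(actual, predicted, t2=0.25):
    """Label pairs whose assumed confusion degree exceeds T2, ignoring 'other'."""
    return sorted(pair for pair, degree in confusion_degree(actual, predicted).items()
                  if degree > t2 and OTHER not in pair)

# Part one: the second and fourth entries end up labeled 'other'.
demoted = demote_unreliable(["label1", "label1", "label2", "other"],
                            ["label1", "label2", "label2", "label1"])

# Part three: 4 of 10 label1 entries are predicted as label2, 1 of 10 the other way.
actual = ["label1"] * 10 + ["label2"] * 10
predicted = ["label1"] * 6 + ["label2"] * 4 + ["label2"] * 9 + ["label1"]
degrees = confusion_degree(actual, predicted)
merge = pairs_to_merge(actual, predicted, t2=0.25)
```

With T2 set to 0.25, only the label1-to-label2 direction (0.4) exceeds the threshold, so label1 and label2 would be merged into one non-other cluster under this assumed formula.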
Step B4, taking the adjusted clustering result as the current clustering result;
Acquiring the current clustering result, and executing the following steps:
Dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
Performing model training based on the training corpus to obtain a trained initial classification model;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score value;
In a specific embodiment, if the PRF value of the clustering result does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels and their corresponding clustering corpus need classification adjustment. Once the adjusted clustering result is obtained, it is taken as the current clustering result, and the steps of performing model training and prediction based on the clustering labels and corresponding clustering corpus in the clustering result and determining the target intention recognition model according to the model training and prediction results are repeated.
Further, the clustering corpus in the clustering result is divided into training corpus and prediction corpus according to the clustering labels; the classification result of the classification model is obtained from the training corpus, and the prediction score value of the clustering result is obtained by prediction; the PRF value of the clustering result is determined based on the classification result and the prediction score value, and the corresponding target intention recognition model is determined according to the PRF value.
Step B5, repeating the above until the PRF value reaches the preset threshold and the clustering result is reasonable, and then outputting the initial classification model as the target intention recognition model.
If the PRF value reaches the preset threshold, the clustering result is reasonable, and the classification model obtained by training on the clustering result is output as the target intention recognition model.
In this embodiment, the PRF value of the classification result is used to judge whether the classification model is reasonable, and in turn whether its training data, and the clustering result containing that training data, are reasonable. If so, the classification model is output directly as the target intention recognition model. If not, classification adjustment is performed on the clustering result to obtain the corresponding model training data, from which the target intention recognition model is obtained. This improves both the accuracy and the efficiency of automatically creating the target intention recognition model.
The invention also provides a model construction device. Referring to fig. 9, the model building apparatus of the present invention includes:
An acquisition module 10, configured to acquire a training corpus for constructing a model;
A clustering module 20, configured to perform clustering processing on the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, where the clustering result includes a clustering label and a clustering corpus corresponding to the clustering label;
The determining module 30 is configured to perform model training and prediction based on the cluster labels in the cluster results and the corresponding cluster corpus, and determine a target intention recognition model according to the model training and prediction results.
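The three modules form a simple pipeline, which can be sketched as a dataclass holding one callable per module. The class name `ModelBuildingDevice`, the type aliases, and the lambda stand-ins are hypothetical; the description only names the modules and their responsibilities.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Corpus = List[str]
ClusterResult = List[Tuple[str, str]]    # (utterance, clustering label)

@dataclass
class ModelBuildingDevice:
    acquire: Callable[[], Corpus]                   # acquisition module 10
    cluster: Callable[[Corpus], ClusterResult]      # clustering module 20
    determine: Callable[[ClusterResult], object]    # determining module 30

    def build(self):
        """Run acquisition, clustering and model determination in order."""
        return self.determine(self.cluster(self.acquire()))

# Toy stand-ins: every utterance is labeled 'other' and the 'model' is just a count.
device = ModelBuildingDevice(
    acquire=lambda: ["check my balance", "reset my password"],
    cluster=lambda corpus: [(text, "other") for text in corpus],
    determine=lambda result: len(result),
)
built = device.build()
```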
Furthermore, the present invention provides a computer-readable storage medium having stored thereon a model building program which, when executed by a processor, implements the steps of the model building method as described above.
In the embodiments of the model building apparatus and medium of the present invention, all the technical features of each embodiment of the model building method are included, and description and explanation contents are substantially the same as those of each embodiment of the model building method, which are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, in the field of other related technology.
Claims (5)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210229151.9A CN114817455B (en) | 2022-03-08 | 2022-03-08 | Model building methods, devices, equipment and media |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210229151.9A CN114817455B (en) | 2022-03-08 | 2022-03-08 | Model building methods, devices, equipment and media |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114817455A CN114817455A (en) | 2022-07-29 |
| CN114817455B CN114817455B (en) | 2026-04-07 |
Family
ID=82528956
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210229151.9A Active CN114817455B (en) | 2022-03-08 | 2022-03-08 | Model building methods, devices, equipment and media |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114817455B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116467602B (en) * | 2023-04-27 | 2025-12-23 | 中国工商银行股份有限公司 | Training data generation method, device, computer equipment and storage medium |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113704429A (en) * | 2021-08-31 | 2021-11-26 | 平安普惠企业管理有限公司 | Semi-supervised learning-based intention identification method, device, equipment and medium |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101866337B (en) * | 2009-04-14 | 2014-07-02 | 日电(中国)有限公司 | Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model |
| CN109739984A (en) * | 2018-12-25 | 2019-05-10 | 贵州商学院 | A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform |
| CN111813905B (en) * | 2020-06-17 | 2024-05-10 | 平安科技(深圳)有限公司 | Corpus generation method, corpus generation device, computer equipment and storage medium |
| CN113191148B (en) * | 2021-04-30 | 2024-05-28 | 西安理工大学 | A rail transit entity recognition method based on semi-supervised learning and clustering |
| CN113704479B (en) * | 2021-10-26 | 2022-02-18 | 深圳市北科瑞声科技股份有限公司 | Unsupervised text classification method and device, electronic equipment and storage medium |
| CN114003720A (en) * | 2021-10-29 | 2022-02-01 | 平安国际智慧城市科技股份有限公司 | Business document classification method, device, equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114817455A (en) | 2022-07-29 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||