CN114817455B - Model building methods, devices, equipment and media - Google Patents

Model building methods, devices, equipment and media

Info

Publication number
CN114817455B
Authority
CN
China
Prior art keywords
clustering
corpus
model
clustered
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210229151.9A
Other languages
Chinese (zh)
Other versions
CN114817455A (en)
Inventor
赵高枫
文俊杰
李金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Merchants Bank Co Ltd
Priority to CN202210229151.9A
Publication of CN114817455A
Application granted
Publication of CN114817455B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and discloses a model construction method, apparatus, device, and medium. The method comprises: obtaining a training corpus for constructing a model; clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label; performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result; and determining a target intention recognition model according to the model training and prediction results. By generating the target intention recognition model automatically, the method reduces the time spent becoming familiar with business points and labeling data, speeds up the sorting out of business points and the labeling of business corpus, improves the efficiency of constructing the target intention recognition model, and reduces labor cost.

Description

Model construction method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for model construction.
Background
With the continuous development of artificial intelligence technology, dialogue systems such as unmanned customer service systems are being applied ever more widely. Intention recognition is an important component of such systems, and a common approach is to recognize the user's intention through text classification: user intentions are divided into several categories, and a corresponding response scheme is matched to each category.
When a dialogue system is built, text classification is usually the simplest and most effective means of building a user intention recognition model. Existing text classification models for intention recognition are built mainly around the business that the dialogue system has to handle. During construction, an operator typically has to understand the business points the dialogue system must cover, sort out the relevant business point information, and label a large amount of corpus provided by the business side. After labeling, the categories are adjusted with a classification adjustment tool, and if the corpus is insufficient, tool-based corpus expansion is considered; only then is a text classification model for recognizing user intention obtained. In this process, building the text classification model by having operators sort out business points and label business corpus is inefficient and consumes a great deal of manpower.
Disclosure of Invention
The main purpose of the invention is to provide a model construction method, apparatus, device, and medium that reduce the labor cost of manual sorting and labeling and improve the efficiency of constructing a classification model.
In order to achieve the above object, the present invention provides a model construction method comprising the steps of:
acquiring a training corpus for constructing a model;
clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label;
and performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and determining a target intention recognition model according to the model training and prediction results.
Preferably, the step of clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result includes:
shuffling the order of the training corpus and splitting it to obtain a clustering sample corpus;
and clustering the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpus corresponding to each clustering label.
Preferably, the step of clustering the clustering sample corpus based on the HAC algorithm to obtain clustering labels and the clustering corpus corresponding to each clustering label includes:
classifying the clustering sample corpus and grouping sample corpus of the same category into one cluster, to obtain the cluster labels corresponding to the clusters of different categories;
determining, based on the cluster label of each cluster, the intra-cluster corpus corresponding to each cluster label;
if the number of intra-cluster corpus entries is greater than a preset threshold N1, using the cluster itself as the clustering label, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus;
if the number of intra-cluster corpus entries is not greater than the preset threshold N1, using "other" as the clustering label, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus.
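As an illustrative, non-limiting sketch, the N1 rule above can be expressed in a few lines of Python. The function name `label_small_clusters`, its parameters, and the literal label string `"other"` are assumptions made for illustration; the default of 50 follows the example value given later in the description.

```python
from collections import Counter


def label_small_clusters(cluster_ids, n1=50, other_label="other"):
    """Keep the cluster id as the label only for clusters with more than n1 entries.

    cluster_ids holds one cluster id per corpus entry, as produced by a clustering step.
    Entries in clusters of size <= n1 are relabelled with other_label.
    """
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else other_label for cid in cluster_ids]
```

With `n1=2`, for instance, a cluster of three entries keeps its id while clusters of one or two entries are folded into "other".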
Preferably, the step of performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and determining the target intention recognition model according to the model training and prediction results, includes:
dividing the clustering corpus in the clustering result into a training corpus and a prediction corpus according to the clustering labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain predicted score values;
determining the precision, recall, and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the predicted score values;
and determining the corresponding target intention recognition model based on the PRF value.
Preferably, the step of determining the corresponding target intention recognition model based on the PRF value includes judging, based on the PRF value, whether the clustering result is reasonable:
judging whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a classification-adjusted clustering result;
taking the classification-adjusted clustering result as the current clustering result, and repeating the steps of:
dividing the clustering corpus in the current clustering result into a training corpus and a prediction corpus according to the clustering labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain predicted score values;
determining the precision, recall, and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the predicted score values;
until the PRF value reaches the preset threshold and the clustering result is reasonable, whereupon the initial classification model is output as the target intention recognition model.
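The iteration described above may be sketched as a simple control loop. This is only an outline under stated assumptions: `train_and_score` and `adjust` are hypothetical callables standing in for the training/prediction/PRF steps and for the classification adjustment, the threshold value 0.9 is arbitrary, and `max_rounds` is a safety stop that the source does not mention.

```python
def build_intent_model(clusters, train_and_score, adjust, prf_threshold=0.9, max_rounds=10):
    """Repeat train -> predict -> PRF -> adjust until the PRF threshold is reached.

    clusters:        list of (text, label) pairs from the clustering step
    train_and_score: hypothetical callable returning (model, prf_value) for the clusters
    adjust:          hypothetical callable returning the classification-adjusted clusters
    max_rounds:      added safeguard against endless iteration (not in the source)
    """
    for _ in range(max_rounds):
        model, prf = train_and_score(clusters)
        if prf >= prf_threshold:
            # Clustering result judged reasonable: output the current model.
            return model, clusters
        # Clustering result judged unreasonable: adjust and try again.
        clusters = adjust(clusters)
    raise RuntimeError("PRF threshold not reached within max_rounds")
```

Passing the two steps in as callables keeps the loop independent of any particular classifier or adjustment strategy.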
Preferably, the clustering result comprises "other" clustering corpus, whose clustering label is "other", and non-"other" clustering corpus, whose clustering label is not "other";
the step of performing classification adjustment on the clustering result to obtain a classification-adjusted clustering result comprises:
adjusting the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
calculating the degree of confusion of the clustering corpus corresponding to the adjusted clustering labels;
and when the degree of confusion is greater than a preset threshold T2, merging the non-"other" clustering corpora from before and after the classification adjustment to serve as the non-"other" clustering corpus of the current clustering result, and taking the remaining corpus as "other" clustering corpus whose clustering label is "other".
Preferably, the step of adjusting the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels includes:
acquiring the predicted score values of the non-"other" clustering corpus;
if the predicted score value of a non-"other" clustering corpus entry is lower than a preset threshold T1, changing the clustering label of that entry to "other";
and when the number of clustering corpus entries whose clustering label is "other" exceeds a preset threshold N2, acquiring the adjusted "other" clustering corpus and the adjusted non-"other" clustering corpus.
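A minimal sketch of the T1 and N2 checks follows. The tuple layout `(text, label, score)`, the function name `demote_low_scores`, and the default values of `t1` and `n2` are assumptions for illustration only. The sketch merely reports whether the N2 condition holds; the subsequent confusion calculation against T2 is not modelled here.

```python
def demote_low_scores(items, t1=0.5, n2=100, other_label="other"):
    """Relabel low-scoring non-"other" entries as "other" and check the N2 condition.

    items: list of (text, label, score) tuples, where score is the predicted score value.
    Returns (adjusted_items, exceeds_n2), where exceeds_n2 is True when the number of
    entries labelled other_label after adjustment is greater than n2.
    """
    adjusted = [
        (text, other_label if label != other_label and score < t1 else label, score)
        for text, label, score in items
    ]
    other_count = sum(label == other_label for _, label, _ in adjusted)
    return adjusted, other_count > n2
```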
Preferably, the step of acquiring the training corpus for constructing the model includes:
acquiring an original corpus from a business end;
preprocessing the original corpus to obtain the training corpus used for model construction;
wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing forms of address and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
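By way of illustration only, several of the listed preprocessing operations can be sketched with the Python standard library. The word lists, the emoji ranges, and the set of punctuation treated as "common" are all assumptions; the source does not enumerate them. The sketch also tokenizes on whitespace for simplicity, whereas a Chinese corpus would need a word segmentation step first.

```python
import re
import unicodedata

# Hypothetical stop-word and greeting lists; the source does not enumerate them.
STOP_WORDS = {"the", "a", "an", "please"}
GREETINGS = {"hello", "hi", "dear"}
DROP_WORDS = STOP_WORDS | GREETINGS

# Rough emoji ranges (an assumption, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Only . , ? ! are kept; any other non-word, non-space character is treated as uncommon.
UNCOMMON_PUNCT_RE = re.compile(r"[^\w\s.,?!]")


def preprocess(text: str) -> str:
    # NFKC normalization folds full-width letters, digits and punctuation to half-width.
    text = unicodedata.normalize("NFKC", text)
    # Remove emoticons / emoji.
    text = EMOJI_RE.sub("", text)
    # Unify CJK punctuation that NFKC leaves unchanged.
    text = text.replace("\u3002", ".").replace("\u3001", ",")
    # Remove uncommon punctuation marks.
    text = UNCOMMON_PUNCT_RE.sub("", text)
    # Drop stop words and greetings (compared case-insensitively, ignoring edge punctuation).
    tokens = [t for t in text.split() if t.lower().strip(".,?!") not in DROP_WORDS]
    return " ".join(tokens)
```

For example, a full-width input such as "Ｈｉ ｃｈｅｃｋ ｍｙ ｂａｌａｎｃｅ？" followed by an emoji is reduced to the half-width string "check my balance?".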
In addition, in order to achieve the above object, the present invention also provides a model building apparatus comprising:
an acquisition module, configured to acquire a training corpus for constructing a model;
a clustering module, configured to cluster the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label;
and a determining module, configured to perform model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and to determine a target intention recognition model according to the model training and prediction results.
Preferably, the acquisition module is further configured to:
acquire an original corpus from a business end;
preprocess the original corpus to obtain the training corpus used for model construction;
wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing forms of address and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
Preferably, the clustering module is further configured to:
shuffle the order of the training corpus and split it to obtain a clustering sample corpus;
and cluster the clustering sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain clustering labels and the clustering corpus corresponding to each clustering label.
Preferably, the clustering module is further configured to:
classify the clustering sample corpus and group sample corpus of the same category into one cluster, to obtain the cluster labels corresponding to the clusters of different categories;
determine, based on the cluster label of each cluster, the intra-cluster corpus corresponding to each cluster label;
if the number of intra-cluster corpus entries is greater than a preset threshold N1, use the cluster itself as the clustering label, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus;
if the number of intra-cluster corpus entries is not greater than the preset threshold N1, use "other" as the clustering label, the intra-cluster corpus corresponding to that clustering label being the corresponding clustering corpus.
Preferably, the determining module is further configured to:
divide the clustering corpus in the clustering result into a training corpus and a prediction corpus according to the clustering labels;
perform model training based on the training corpus to obtain a trained initial classification model;
input the prediction corpus into the trained initial classification model for prediction to obtain predicted score values;
determine the precision, recall, and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the predicted score values;
and determine the corresponding target intention recognition model based on the PRF value.
Preferably, the determining module is further configured to:
judge whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a classification-adjusted clustering result;
take the classification-adjusted clustering result as the current clustering result, and repeat the steps of:
dividing the clustering corpus in the classification-adjusted clustering result into a training corpus and a prediction corpus according to the clustering labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain predicted score values;
determining the precision, recall, and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the predicted score values;
until the PRF value reaches the preset threshold and the clustering result is reasonable, whereupon the initial classification model is output as the target intention recognition model.
Preferably, the determining module is further configured to:
adjust the "other" clustering corpus and the non-"other" clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
calculate the degree of confusion of the clustering corpus corresponding to the adjusted clustering labels;
and when the degree of confusion is greater than a preset threshold T2, merge the non-"other" clustering corpora from before and after the classification adjustment to serve as the non-"other" clustering corpus of the current clustering result, and take the remaining corpus as "other" clustering corpus whose clustering label is "other".
Preferably, the determining module is further configured to:
acquire the predicted score values of the non-"other" clustering corpus;
if the predicted score value of a non-"other" clustering corpus entry is lower than a preset threshold T1, change the clustering label of that entry to "other";
and when the number of clustering corpus entries whose clustering label is "other" exceeds a preset threshold N2, acquire the adjusted "other" clustering corpus and the adjusted non-"other" clustering corpus.
In addition, in order to achieve the above object, the present invention also provides a model construction apparatus including a memory, a processor, and a model construction program stored on the memory and executable on the processor, the model construction program implementing the steps of the model construction method as described above when executed by the processor.
In addition, in order to achieve the above object, the present invention also provides a medium that is a computer-readable storage medium having stored thereon a model building program that, when executed by a processor, implements the steps of the model building method as described above.
According to the model construction method, apparatus, device, and medium provided by the invention, a training corpus for constructing a model is acquired; the training corpus is clustered based on a pre-trained clustering model to obtain a corresponding clustering result comprising clustering labels and the clustering corpus corresponding to each clustering label; model training and prediction are performed based on the clustering labels and the corresponding clustering corpus in the clustering result; and a target intention recognition model is determined according to the model training and prediction results.
By clustering the training corpus used to construct the intention recognition model, a clustering result corresponding to the training corpus is obtained, comprising the clustering labels and the clustering corpus corresponding to each clustering label. Model training and prediction are performed on this clustering result to obtain its PRF value, and the target intention recognition model is determined according to the PRF value. Because the target intention recognition model is generated automatically, the time spent becoming familiar with business points and labeling data is reduced, the efficiency of sorting out business points and labeling business corpus is improved, and labor cost is reduced.
Drawings
FIG. 1 is a schematic diagram of a device architecture of a hardware operating environment involved in an embodiment of the model building of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the model building method of the present invention;
FIG. 3 is a schematic flow chart of a first embodiment of the model building method of the present invention;
FIG. 4 is a flow chart of a second embodiment of the model building method of the present invention;
FIG. 5 is a schematic view showing a sub-process of step S22 in a second embodiment of the model building method of the present invention;
FIG. 6 is a flow chart of a third embodiment of the model building method of the present invention;
FIG. 7 is a flow chart of a fourth embodiment of the model building method of the present invention;
FIG. 8 is a schematic flow chart of a step B3 in a fourth embodiment of the model building method of the present invention;
FIG. 9 is a schematic functional block diagram of a model building apparatus according to a first embodiment of the model building method of the present invention.
The realization of the objects, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic device structure of a hardware running environment according to an embodiment of the present invention.
The device of the embodiment of the invention can be a mobile terminal or a server device.
As shown in fig. 1, the device may include a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device separate from the aforementioned processor 1001.
It will be appreciated by those skilled in the art that the device structure shown in fig. 1 is not limiting of the device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a model building program.
The operating system is a program that manages and controls the hardware and software resources of the model building device and supports the operation of the network communication module, the user interface module, the model building program, and other programs or software. The network communication module is used for managing and controlling the network interface 1004, and the user interface module is used for managing and controlling the user interface 1003.
In the model building device shown in fig. 1, the processor 1001 calls the model building program stored in the memory 1005 and performs the operations in the various embodiments of the model building method described below.
Based on the above hardware structure, embodiments of the model building method are provided.
Referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the model building method of the present invention. The method includes:
Step S10, acquiring a training corpus for constructing a model;
Specifically, an original corpus is acquired from the business end and preprocessed to obtain the training corpus used for model construction, wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing forms of address and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
In a specific embodiment, dialogue corpus from a business scenario is collected for model construction: a large volume of user chat records from intelligent customer service is used as the original corpus for training the target model, and this original corpus undergoes standardized preprocessing, which may be one or more of removing stop words, converting full-width characters to half-width, removing emoticons, removing forms of address and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks. Applying standardized preprocessing to the entire original corpus from the client reduces noise in the original corpus, improves its purity, and yields a training corpus that can be used to construct the model.
Step S20, clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label;
The clustering process may use a morphological operator to cluster and merge nearby similar classification regions. After clustering and merging, clusters of different categories are generated. Each cluster produced by clustering is a set of data objects that are similar to the other objects in the same cluster and dissimilar to the objects in other clusters.
Most existing clustering algorithms are able to handle noisy data, although some are very sensitive to it; clustering the noisy data can then yield a correspondingly accurate clustering result.
In a specific embodiment, the corpus data collected from the business end is preprocessed to obtain the training corpus, which is input into the pre-trained clustering model. The order of the preprocessed training corpus is shuffled, and part of the training corpus is extracted for clustering. The clustering algorithm may be the hierarchical agglomerative clustering (HAC) algorithm, which yields clusters of different categories, each with its own corpus, and hence a clustering result comprising clustering labels and the clustering corpus corresponding to each clustering label. In the clustering result, if the number of corpus entries in a cluster is greater than a preset threshold N1 (for example, 50), the cluster id is used as the clustering label; otherwise, "other" is used as the clustering label.
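For readers unfamiliar with HAC, the following is a deliberately naive, standard-library-only sketch of agglomerative clustering. It is not the clustering model of the embodiment: the bag-of-words representation, cosine distance, average linkage, and the `distance_threshold` stopping rule are all assumptions chosen for illustration, and the cubic running time makes it suitable only for small samples.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def hac(texts, distance_threshold=0.5):
    """Naive average-linkage agglomerative clustering; returns one cluster id per text."""
    vectors = [Counter(t.lower().split()) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one cluster per text

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(texts)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels
```

The returned ids play the role of the cluster ids mentioned above; clusters with no more than N1 entries would subsequently be relabelled as "other".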
Step S30, performing model training and prediction based on the clustering labels and the corresponding clustering corpus in the clustering result, and determining a target intention recognition model according to the model training and prediction results.
In the prior art, data are manually classified and labeled to obtain a dataset on which a model can be trained, and an initial pre-trained model is then trained on that dataset by deep learning to obtain the target classification model. In the model construction method of the present invention, the training corpus of the business end is acquired directly and clustered to obtain a clustering result comprising clustering labels and the clustering corpus corresponding to each clustering label; training and prediction are performed on the clustering result, and a target intention recognition model is determined according to the model training and prediction results.
In a specific embodiment, the clustering result, which comprises the clustering labels and the clustering corpus corresponding to each clustering label, is divided into 5 parts according to the clustering labels. For example, if 200 corpus entries have the clustering label (cluster id) label1, then after division each part contains 40 entries labeled label1. Each time, 4 parts are used to train the pre-trained classification model and the remaining 1 part is predicted, so that predicted score values are obtained for all 5 parts. From the clustering labels and the corresponding predicted score values, the PRF value of the clustering result, namely precision, recall, and F1, can be obtained. The PRF value is used to evaluate whether the clustering result is reasonable; if it is, the classification model trained on the training data is output as the corresponding target intention recognition model.
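The 5-part division and PRF evaluation can be sketched as follows. This is an assumption-laden outline rather than the embodiment's implementation: `train_fn` is a hypothetical stand-in for the pre-trained classification model, the round-robin fold assignment is one simple way to spread each label evenly, and macro averaging is assumed because the source does not say how the per-label values are combined.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each entry a fold index so that every label is spread evenly over k folds."""
    seen = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(seen[label] % k)
        seen[label] += 1
    return folds


def cross_predict(texts, labels, train_fn, k=5):
    """Train on k-1 folds and predict the held-out fold, so every entry gets a prediction.

    train_fn(train_texts, train_labels) is a hypothetical callable that must return
    a function mapping a text to (predicted_label, score).
    """
    folds = stratified_folds(labels, k)
    predictions = [None] * len(texts)
    for held_out in range(k):
        train = [i for i, f in enumerate(folds) if f != held_out]
        predict = train_fn([texts[i] for i in train], [labels[i] for i in train])
        for i, f in enumerate(folds):
            if f == held_out:
                predictions[i] = predict(texts[i])
    return predictions


def macro_prf(gold, predicted):
    """Macro-averaged precision, recall and F1 over the gold labels."""
    classes = sorted(set(gold), key=str)
    p_sum = r_sum = f_sum = 0.0
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, predicted))
        fp = sum(g != c and p == c for g, p in zip(gold, predicted))
        fn = sum(g == c and p != c for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        p_sum += precision
        r_sum += recall
        f_sum += f1
    n = len(classes) or 1  # avoid division by zero on empty input
    return p_sum / n, r_sum / n, f_sum / n
```

With 200 entries labelled label1 and `k=5`, `stratified_folds` places 40 of them in each fold, matching the example in the text.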
In this embodiment, the preprocessed training corpus is clustered to obtain clustering labels for different feature categories and the clustering corpus corresponding to each clustering label; model training and prediction are performed on this clustering result, and the target intention recognition model is determined according to the model training and prediction results. This automatic construction method greatly reduces the labor cost of manual labeling and eases the workload of operators. It can also help operators understand the business and discover and sort out the different business types involved, improving the efficiency of constructing the intention recognition model.
Further, based on the first embodiment of the model building method of the present invention, a second embodiment of the model building method of the present invention is proposed.
The second embodiment differs from the first in that it refines step S20, in which the training corpus is clustered based on the pre-trained clustering model to obtain the corresponding clustering result. Referring to fig. 4, the step specifically includes:
Step S21, shuffling and then segmenting the training corpus to obtain a clustering sample corpus;
In a specific embodiment, the preprocessed training corpus is input into the pre-trained clustering model, its order is shuffled, and part of it is extracted for clustering. Specifically, this part of the training corpus is clustered first, and the remaining training corpus is then added gradually to obtain the clustering result for the whole training corpus. Compared with clustering all of the training corpus directly, extracting part of it for clustering yields a better final classification effect.
In a specific embodiment, the preprocessed training corpus is shuffled and part of it is extracted as the clustering sample corpus; for example, with reference to the total amount of corpus, 20% of it can be extracted as the clustering sample corpus.
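The shuffle-and-extract step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `sample_for_clustering`, the `seed` parameter and the toy utterances are assumptions introduced here, and only the 20% fraction comes from the example above.

```python
import random

def sample_for_clustering(corpus, fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and split off a clustering sample.

    Returns (sample, remainder); the remainder is kept so that it can be
    added later, as the embodiment describes.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    shuffled = list(corpus)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded only for reproducibility
    cut = max(1, int(len(shuffled) * fraction)) if shuffled else 0
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus of 100 utterances.
corpus = [f"utterance {i}" for i in range(100)]
sample, remainder = sample_for_clustering(corpus, fraction=0.2)
```

Keeping the remainder alongside the sample makes it straightforward to add the rest of the corpus gradually after the first clustering pass.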
Step S22, clustering the clustering sample corpus based on the hierarchical agglomerative clustering algorithm (HAC) to obtain clustering labels and the clustering corpus corresponding to each clustering label.
There are many clustering approaches, including partitioning, hierarchical, density-based, grid-based, model-based, transitive-closure, Boolean-matrix, direct-clustering, correlation-analysis and statistics-based methods, and each approach has a variety of clustering algorithms from which a corresponding clustering result can be obtained.
In a specific embodiment, the clustering sample corpus is clustered based on the hierarchical agglomerative clustering algorithm (HAC) to obtain the different clusters corresponding to each category. Each cluster, referred to as a clustering label, has its own corpora, and the corpora belonging to each cluster serve as the clustering corpus corresponding to that clustering label. This yields a clustering result comprising the clustering labels and the clustering corpus corresponding to each clustering label.
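For readers unfamiliar with hierarchical agglomerative clustering, the following is a small pure-Python sketch of the general technique. It is not taken from the patent: the average-linkage criterion, the Euclidean distance, the `distance_threshold` stopping rule and the toy 2-D points (standing in for whatever vector representation of the utterances is actually used) are all assumptions made for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hac(points, distance_threshold):
    """Bottom-up clustering: start from singletons and repeatedly merge the
    closest pair of clusters until no pair is closer than the threshold.
    Returns one integer cluster id per input point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), average_linkage(clusters[a], clusters[b], points))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > distance_threshold:
            break
        clusters[a].extend(clusters[b])   # a < b, so index a stays valid
        del clusters[b]
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Toy vectors: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
labels = hac(points, distance_threshold=1.0)
```

This naive version recomputes all pairwise linkages on every merge, which is acceptable for a sample of the corpus but would need a more efficient implementation for large inputs.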
Referring to fig. 5, step S22 specifically includes:
Step A1, classifying the clustering sample corpus and grouping sample corpora of the same kind into one cluster, to obtain the clustering labels corresponding to the different kinds of clusters;
In a specific embodiment, the clustering sample corpus is classified to obtain clustering labels of different categories, and the sample corpora are grouped by clustering label to obtain the clustering labels and the clustering corpus corresponding to each label.
Step A2, determining intra-cluster corpus corresponding to different kinds of cluster labels based on the cluster labels corresponding to the clusters;
In a specific embodiment, the clustering sample corpus is classified based on the hierarchical agglomerative clustering algorithm (HAC) to obtain the different clusters corresponding to each category. Each cluster, referred to as a clustering label, has its own corpora, and the corpora belonging to each cluster serve as the clustering corpus corresponding to that clustering label.
Step A3, if the number of intra-cluster corpora is greater than a preset threshold N1, the intra-cluster corpora take the cluster as their clustering label, and the intra-cluster corpora corresponding to that clustering label are the corresponding clustering corpus;
and step A4, if the number of intra-cluster corpora is not greater than the preset threshold N1, the intra-cluster corpora take other as their clustering label, and the intra-cluster corpora corresponding to that clustering label are the corresponding clustering corpus.
In a specific embodiment, the clustering sample corpus can be divided into five categories, label1, label2, label3, label4 and label5, each with its corresponding clustering corpus. If the clusters whose number of intra-cluster corpora is greater than a preset threshold of 50 are label1, label2, label3 and label4, the clustering corpora of label1, label2, label3 and label4 take their cluster id as the clustering label, while the remaining clustering sample corpora, those of label5, take other as the clustering label.
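Steps A3 and A4 amount to a size filter on the clusters. A minimal sketch, assuming the cluster assignments are available as a flat list with one cluster id per corpus entry (the function name `apply_size_threshold` and the cluster sizes are hypothetical; the threshold of 50 follows the example above):

```python
from collections import Counter

def apply_size_threshold(cluster_ids, n1, other_label="other"):
    """Keep the cluster id as the label when the cluster has more than n1
    members; otherwise relabel its members as `other_label`."""
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else other_label for cid in cluster_ids]

# Hypothetical sizes: label1 and label2 are large enough, label5 is not.
cluster_ids = ["label1"] * 60 + ["label2"] * 55 + ["label5"] * 10
labels = apply_size_threshold(cluster_ids, n1=50)
```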
In this embodiment, the training corpus is shuffled and segmented to obtain the clustering sample corpus, while the remaining non-sample corpus is retained. The clustering sample corpus is clustered based on the hierarchical agglomerative clustering algorithm (HAC) to obtain a clustering result comprising the clustering labels and the clustering corpus corresponding to each label, and this clustering result is used to train the target intention recognition model. Introducing a clustering algorithm to process the training data reduces the manual effort involved in building an intention recognition text classifier and improves the efficiency of constructing the intention recognition model.
Further, based on the first and second embodiments of the model building method of the present invention, a third embodiment of the model building method of the present invention is provided.
The third embodiment differs from the first and second embodiments in that it refines step S30, in which model training and prediction are performed based on the clustering labels in the clustering result and the corresponding clustering corpus, and the target intention recognition model is determined according to the model training and prediction results. Referring to fig. 6, the step specifically includes:
Step S31, dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
The clustering corpus in the clustering result is divided evenly into n parts according to the clustering labels. Each time, m of the n parts are used to train the initial model to obtain a corresponding classification model, and the remaining parts are predicted, so that prediction score values are obtained for all of the clustering corpus.
In a specific embodiment, the clustering corpus in the clustering result is divided into 5 parts according to the clustering labels label1 and label2. If 200 corpora carry the label label1, each part contains 40 corpora labeled label1; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpora labeled label1. Likewise, if 400 corpora carry the label label2, each part contains 80 corpora labeled label2; each time, 4 parts are used for training and the remaining part is predicted, so that prediction score values are obtained for all corpora labeled label2.
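The division described above is a label-stratified n-fold split. The sketch below shows one way to produce it using only the standard library; the helper names `stratified_folds` and `train_predict_splits` are assumptions, and the 200/400 counts mirror the example.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign each sample index to one of n_folds folds so that every
    clustering label is spread as evenly as possible across the folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

def train_predict_splits(folds):
    """Yield (train_indices, predict_indices), holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, folds[held_out]

labels = ["label1"] * 200 + ["label2"] * 400
folds = stratified_folds(labels, n_folds=5)
splits = list(train_predict_splits(folds))
```

Because every index is held out exactly once, collecting the predictions over the five rounds gives a prediction score value for every corpus entry without ever scoring an entry with a model that was trained on it.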
Step S32, model training is carried out based on the training corpus, and a trained initial classification model is obtained;
Step S33, inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
In a specific embodiment, the training corpus from the divided clustering corpus is used to train the pre-trained classification model to obtain a corresponding classification model. The prediction corpus from the divided clustering corpus is then input into the classification model to obtain the corresponding classification result, and a prediction test on that result yields the prediction score value of the classification result.
Step S34, determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score values;
From the clustering labels in the clustering result and the prediction score values of the classification results, the precision, recall and F1 value of the clustering result, PRF value for short, are obtained. The PRF value is used to evaluate whether the classification model is reasonable, hence whether the training data of the classification model is reasonable, and hence whether the clustering result is reasonable.
Step S35, determining the corresponding target intention recognition model based on the PRF value.
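As an illustration of the PRF computation, the sketch below treats the clustering labels as reference labels and compares them with the labels predicted by the classifier. The text does not say how the per-label values are combined into a single PRF value, so the macro average used here is an assumption, as are the function names and the toy data.

```python
def prf_per_label(cluster_labels, predicted_labels):
    """Precision, recall and F1 for each label, with the clustering labels
    serving as the reference."""
    pairs = list(zip(cluster_labels, predicted_labels))
    report = {}
    for label in sorted(set(cluster_labels) | set(predicted_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report

def macro_prf(report):
    """Unweighted mean of precision, recall and F1 over all labels
    (the averaging scheme is an assumption)."""
    n = len(report)
    return tuple(sum(values[i] for values in report.values()) / n
                 for i in range(3))

cluster_labels = ["a", "a", "a", "b", "b"]
predicted_labels = ["a", "a", "b", "b", "b"]
report = prf_per_label(cluster_labels, predicted_labels)
precision, recall, f1 = macro_prf(report)
```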
In a specific embodiment, the target intention recognition model to be output is determined according to the PRF value of the clustering result. Specifically, the rule for judging the PRF value is as follows:
If the PRF value reaches the preset threshold, the clustering result is reasonable, and the pre-trained model is trained with all of the corpus in the clustering result to obtain the corresponding classification model;
If the PRF value does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels and the clustering corpus corresponding to each label must undergo classification adjustment. Model training and prediction are then performed again on the adjusted clustering labels and their corresponding clustering corpus, and this is repeated until the PRF value of the clustering result reaches the preset threshold, at which point the corresponding classification model is output.
In this embodiment, the clustering result is used for training and prediction to obtain the classification output of the corresponding classification model and the prediction score values of the clustering result. The PRF value of the clustering result is obtained from the classification output and the prediction score values, whether the clustering result is reasonable is determined by whether the PRF value reaches the preset threshold, and the corresponding target intention recognition model is finally determined. This provides a fault tolerance mechanism in which the classification results of the classification model are predicted and checked, which improves the accuracy of the classification results and of the automatically created target intention recognition model.
Further, based on the first, second, and third embodiments of the model building method of the present invention, a fourth embodiment of the model building method of the present invention is provided.
The fourth embodiment differs from the first, second and third embodiments in that it refines step S35, in which the corresponding target intention recognition model is determined based on the PRF value. Referring to fig. 7, the step specifically includes:
step B1, judging whether the PRF value reaches a preset threshold value or not;
step B2, if the PRF value reaches a preset threshold, the clustering result is reasonable, and the initial classification model is output as a target intention recognition model;
In a specific embodiment, if the PRF value of the classification result output by the classification model reaches the preset threshold, it is judged whether all of the clustered training corpus has been added in the course of training the pre-trained model. If so, the classification model is output directly as the target intention recognition model; if not, training corpus continues to be added to train the classification model further until all of it has been used, and the classification model is then output as the target intention recognition model.
Step B3, if the PRF value does not reach the preset threshold value, the clustering result is unreasonable, and classification adjustment is carried out on the clustering result to obtain a clustering result after classification adjustment;
Referring to fig. 8, step B3 specifically includes:
the clustering result comprises other clustering corpus with a clustering label being other and non-other clustering corpus with a clustering label being non-other, and the step of classifying and adjusting the clustering result to obtain the clustering result after classifying and adjusting comprises the following steps:
step b1, adjusting the other clustering corpus and the non-other clustering corpus to obtain clustering corpus corresponding to the adjusted clustering labels;
step b2, calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;
step b3, when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpora from before and after the classification adjustment as the non-other clustering corpus of the current clustering result, with the other corpora serving as the other clustering corpus whose clustering label is other.
In a specific embodiment, the step of classifying and adjusting the clustering result includes three parts:
First, the labels of the unreliable part of the non-other clustering corpus are changed to other. Through the preceding steps, the clustering labels and prediction score values of all clustering corpora are available. Non-other clustering corpora whose prediction is inconsistent with their clustering label are relabeled as other. Changing labels in this way combines the advantages of classification and clustering by moving the unreliable part of the non-other clustering corpus into other.
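The relabeling in this first part can be sketched as below. The description speaks of predictions that are inconsistent with the clustering label, while the claims speak of a prediction score value below a preset threshold T1; treating either condition as "unreliable" is an assumption of this sketch, as are the tuple layout, the function name and the value 0.5 chosen for T1.

```python
def demote_unreliable(samples, t1, other_label="other"):
    """samples: list of (cluster_label, predicted_label, score) tuples.

    A non-other sample is moved to `other_label` when its score is below t1
    or when the classifier's prediction disagrees with its clustering label.
    Samples already labeled other are left unchanged.
    """
    adjusted = []
    for cluster_label, predicted_label, score in samples:
        unreliable = (
            cluster_label != other_label
            and (score < t1 or predicted_label != cluster_label)
        )
        adjusted.append(other_label if unreliable else cluster_label)
    return adjusted

samples = [
    ("label1", "label1", 0.93),   # consistent and confident: kept
    ("label1", "label2", 0.81),   # prediction disagrees: moved to other
    ("label2", "label2", 0.40),   # score below T1: moved to other
    ("other",  "label1", 0.95),   # already other: unchanged
]
adjusted = demote_unreliable(samples, t1=0.5)
```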
Next, it is judged whether the number of clustering corpora labeled other exceeds a threshold of 10% of the overall training corpus. If so, the clustering algorithm is run again on the new clustering result, and clusters in the result whose number of corpora exceeds a preset threshold N2 are marked as new clustering corpora and added to the current training corpus.
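A hedged sketch of this second part follows. It assumes that the clustering algorithm is re-run on the corpora currently labeled other and returns one cluster id per such corpus; that reading, the helper names, the `new_` label prefix and the value 10 used for N2 are all assumptions introduced for illustration.

```python
from collections import Counter

def needs_reclustering(labels, ratio=0.10, other_label="other"):
    """True when the share of `other` samples exceeds `ratio` of all samples."""
    return labels.count(other_label) > ratio * len(labels)

def promote_new_clusters(other_cluster_ids, n2, prefix="new_", other_label="other"):
    """other_cluster_ids: cluster id given to each `other` sample by the
    second clustering pass. Clusters with more than n2 members receive a
    new label; the rest stay `other`."""
    sizes = Counter(other_cluster_ids)
    return [f"{prefix}{cid}" if sizes[cid] > n2 else other_label
            for cid in other_cluster_ids]

labels = ["label1"] * 80 + ["other"] * 20
recluster = needs_reclustering(labels)            # 20 of 100 exceeds 10%

# Hypothetical result of re-clustering the 20 `other` samples.
other_cluster_ids = [0] * 12 + [1] * 5 + [2] * 3
new_labels = promote_new_clusters(other_cluster_ids, n2=10)
```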
Finally, the new clustering corpus is merged with the clustering corpus from before the classification adjustment. Merging requires calculating the confusion degree between the clustering corpora before and after the classification adjustment, and the formula for calculating the confusion degree is as follows:
Where N_{catei,catej} denotes the number of corpora whose actual intention is catei but which are misclassified as catej, and N_{catei} denotes the actual number of corpora of catei. The clustering corpora of any two clustering labels whose confusion degree is greater than the threshold T2 (for example, 0.25) are merged as one non-other cluster, giving the clustering result after classification adjustment.
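The confusion formula itself is not reproduced in this text. Purely as an assumption, the sketch below takes the confusion degree of catei towards catej to be the ratio N_{catei,catej} / N_{catei} built from the two quantities defined above; the actual formula may differ. The function names and the toy data are hypothetical, and T2 = 0.25 follows the example.

```python
from collections import Counter

def confusion_degree(true_labels, predicted_labels):
    """Return {(cate_i, cate_j): N_ij / N_i} for every observed i != j,
    where N_ij counts samples of cate_i predicted as cate_j and N_i is
    the number of samples of cate_i. The ratio form is an assumption."""
    totals = Counter(true_labels)
    mistakes = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return {pair: count / totals[pair[0]] for pair, count in mistakes.items()}

def pairs_to_merge(confusion, t2=0.25):
    """Label pairs whose confusion degree exceeds the threshold T2."""
    return sorted(pair for pair, value in confusion.items() if value > t2)

true_labels = ["a"] * 10 + ["b"] * 10
predicted_labels = ["a"] * 6 + ["b"] * 4 + ["b"] * 9 + ["a"] * 1
confusion = confusion_degree(true_labels, predicted_labels)
merge = pairs_to_merge(confusion, t2=0.25)
```

In this toy data 4 of the 10 samples of `a` are predicted as `b`, so only the pair (`a`, `b`) exceeds the threshold and would be merged.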
Step B4, taking the clustering result after classification adjustment as a current clustering result;
acquiring a current clustering result, and executing the following steps:
Dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
model training is carried out based on the training corpus, and a trained initial classification model is obtained;
Inputting the prediction corpus into the trained initial classification model for prediction to obtain a prediction score value;
Determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score values;
In a specific embodiment, if the PRF value of the clustering result does not reach the preset threshold, the clustering result is unreasonable, and the clustering labels and the clustering corpus corresponding to each label must undergo classification adjustment. The clustering result after classification adjustment is then taken as the current clustering result, and the steps of performing model training and prediction based on the clustering labels and the corresponding clustering corpus and of determining the target intention recognition model according to the training and prediction results are repeated.
Further, the clustering corpus in the clustering result is divided into training corpus and prediction corpus according to the clustering labels. The classification result of the classification model is obtained from the training corpus, the prediction score values of the clustering result are obtained through prediction, the PRF value of the clustering result is determined from the classification result and the prediction score values, and the corresponding target intention recognition model is determined according to the PRF value.
Step B5, repeating the above until the PRF value reaches the preset threshold, at which point the clustering result is reasonable, and outputting the initial classification model as the target intention recognition model.
If the PRF value reaches the preset threshold, the clustering result is reasonable, and the classification model trained on that clustering result is output as the target intention recognition model.
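Steps B1 to B5 form a train, evaluate and adjust loop. The control flow can be sketched as follows, with the training and adjustment steps passed in as callables. Everything named here (`build_intent_model`, `train_and_score`, `adjust`, the `max_rounds` safety limit and the stub functions) is hypothetical scaffolding and not part of the patent text.

```python
def build_intent_model(clustering, train_and_score, adjust, threshold,
                       max_rounds=10):
    """Repeat train -> evaluate -> adjust until the PRF value reaches
    `threshold`. `max_rounds` is an added safety limit, not from the source."""
    for _ in range(max_rounds):
        model, prf = train_and_score(clustering)   # steps S31 to S34
        if prf >= threshold:                       # steps B1, B2 and B5
            return model, clustering
        clustering = adjust(clustering)            # steps B3 and B4
    raise RuntimeError("PRF threshold not reached within max_rounds")

# Stub callables: the 'clustering' is just a counter whose quality
# improves by 0.1 with every adjustment.
def fake_train_and_score(clustering):
    return f"model-{clustering}", 0.6 + 0.1 * clustering

def fake_adjust(clustering):
    return clustering + 1

model, final_clustering = build_intent_model(
    0, fake_train_and_score, fake_adjust, threshold=0.85
)
```

Passing the steps in as callables keeps the loop independent of any particular classifier or clustering library.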
In this embodiment, the PRF value of the classification result is used to judge whether the classification model is reasonable, hence whether its training data is reasonable, and hence whether the clustering result containing that training data is reasonable. If it is reasonable, the classification model is output directly as the target intention recognition model. If it is not, the clustering result undergoes classification adjustment to obtain the corresponding model training data, from which the target intention recognition model is obtained. This improves both the accuracy and the efficiency of automatically creating the target intention recognition model.
The invention also provides a model construction device. Referring to fig. 9, the model building apparatus of the present invention includes:
An acquisition module 10, configured to acquire a training corpus for constructing a model;
the clustering module 20 is configured to perform clustering processing on the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, where the clustering result includes a clustering label and a clustering corpus corresponding to the clustering label;
The determining module 30 is configured to perform model training and prediction based on the cluster labels in the cluster results and the corresponding cluster corpus, and determine a target intention recognition model according to the model training and prediction results.
Furthermore, the present invention provides a medium, preferably a computer-readable storage medium, having stored thereon a model building program which, when executed by a processor, implements the steps of the model building method as described above.
In the embodiments of the model building apparatus and medium of the present invention, all the technical features of each embodiment of the model building method are included, and description and explanation contents are substantially the same as those of each embodiment of the model building method, which are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein, or any application, directly or indirectly, in the field of other related technology.

Claims (5)

1. A model building method, characterized in that the model building method comprises the following steps:
obtaining a training corpus for building a model;
clustering the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label;
performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and determining a target intention recognition model according to the model training and prediction results;
wherein the step of performing model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and determining the target intention recognition model according to the model training and prediction results, comprises:
dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain prediction score values;
determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score values;
determining the corresponding target intention recognition model based on the PRF value;
wherein the step of determining the corresponding target intention recognition model based on the PRF value comprises:
judging whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intention recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and classification adjustment is performed on the clustering result to obtain a clustering result after classification adjustment;
taking the clustering result after classification adjustment as the current clustering result, and returning to the steps of:
dividing the clustering corpus in the clustering result into training corpus and prediction corpus according to the clustering labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain prediction score values;
determining the precision, recall and F1 (PRF) value of the clustering result based on the clustering labels in the clustering result and the prediction score values;
until the PRF value reaches the preset threshold and the clustering result is reasonable, whereupon the initial classification model is output as the target intention recognition model;
wherein the clustering result comprises an other clustering corpus whose clustering label is other and a non-other clustering corpus whose clustering label is not other,
wherein the step of performing classification adjustment on the clustering result to obtain the clustering result after classification adjustment comprises:
adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels;
calculating the confusion degree of the clustering corpus corresponding to the adjusted clustering labels;
when the confusion degree is greater than a preset threshold T2, merging the non-other clustering corpora from before and after the classification adjustment as the non-other clustering corpus of the current clustering result, with the other corpora serving as the other clustering corpus whose clustering label is other;
wherein the step of adjusting the other clustering corpus and the non-other clustering corpus to obtain the clustering corpus corresponding to the adjusted clustering labels comprises:
obtaining the prediction score values of the non-other clustering corpus;
if the prediction score value of a non-other clustering corpus is lower than a preset threshold T1, changing the clustering label of that non-other clustering corpus to other;
when the number of other clustering corpora whose clustering label is other exceeds a preset threshold N2, marking the intra-cluster corpora of clusters in the clustering result whose number of corpora exceeds the preset threshold N2 as new clustering corpora;
obtaining the adjusted other clustering corpus and the adjusted non-other clustering corpus;
wherein the step of clustering the training corpus based on the pre-trained clustering model to obtain the corresponding clustering result comprises:
shuffling and then segmenting the training corpus to obtain a clustering sample corpus;
clustering the clustering sample corpus based on the hierarchical agglomerative clustering algorithm (HAC) to obtain clustering labels and the clustering corpus corresponding to each clustering label;
wherein the step of clustering the clustering sample corpus based on the hierarchical agglomerative clustering algorithm (HAC) to obtain clustering labels and the clustering corpus corresponding to each clustering label comprises:
classifying the clustering sample corpus and grouping sample corpora of the same kind into one cluster to obtain the clustering labels corresponding to the different kinds of clusters;
determining, based on the clustering labels corresponding to the clusters, the intra-cluster corpora corresponding to the different kinds of clustering labels;
if the number of intra-cluster corpora is greater than a preset threshold N1, the intra-cluster corpora take the cluster as their clustering label, and the intra-cluster corpora corresponding to that clustering label are the corresponding clustering corpus;
if the number of intra-cluster corpora is not greater than the preset threshold N1, the intra-cluster corpora take other as their clustering label, and the intra-cluster corpora corresponding to that clustering label are the corresponding clustering corpus.
2. The model building method according to claim 1, characterized in that the step of obtaining the training corpus for building the model comprises:
obtaining an original corpus from the business side;
preprocessing the original corpus to obtain the training corpus for model building;
wherein the preprocessing comprises one or more of: removing stop words, converting full-width characters to half-width, removing emoticons, removing greetings and meaningless questions, unifying punctuation marks, and removing uncommon punctuation marks.
3. A model building apparatus, characterized in that the model building apparatus comprises:
an acquisition module, configured to obtain a training corpus for building a model;
a clustering module, configured to cluster the training corpus based on a pre-trained clustering model to obtain a corresponding clustering result, wherein the clustering result comprises clustering labels and the clustering corpus corresponding to each clustering label;
a determining module, configured to perform model training and prediction based on the clustering labels in the clustering result and the corresponding clustering corpus, and to determine a target intention recognition model according to the model training and prediction results;
the performing of model training and prediction based on the cluster labels in the clustering result and the corresponding cluster corpus, and the determining of the target intent recognition model according to the model training and prediction results, includes:
dividing the cluster corpus in the clustering result into a training corpus and a prediction corpus according to the cluster labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain prediction score values;
determining the precision/recall/F-measure (PRF) value of the clustering result based on the cluster labels in the clustering result and the prediction score values;
determining the corresponding target intent recognition model based on the PRF value;
the determining of the corresponding target intent recognition model based on the PRF value includes:
judging whether the PRF value reaches a preset threshold;
if the PRF value reaches the preset threshold, the clustering result is reasonable, and the initial classification model is output as the target intent recognition model;
if the PRF value does not reach the preset threshold, the clustering result is unreasonable, and the clustering result is reclassified to obtain an adjusted clustering result;
taking the adjusted clustering result as the current clustering result, and returning to execute the steps of:
dividing the cluster corpus in the clustering result into a training corpus and a prediction corpus according to the cluster labels;
performing model training based on the training corpus to obtain a trained initial classification model;
inputting the prediction corpus into the trained initial classification model for prediction to obtain prediction score values;
determining the precision/recall/F-measure (PRF) value of the clustering result based on the cluster labels in the clustering result and the prediction score values;
until the PRF value reaches the preset threshold, at which point the clustering result is reasonable and the initial classification model is output as the target intent recognition model;
the clustering result includes an "other" cluster corpus whose cluster label is "other" and a non-"other" cluster corpus whose cluster label is not "other",
the reclassifying of the clustering result to obtain the adjusted clustering result includes:
adjusting the "other" cluster corpus and the non-"other" cluster corpus to obtain the cluster corpus corresponding to the adjusted cluster labels;
calculating the confusion degree of the cluster corpus corresponding to the adjusted cluster labels;
when the confusion degree is greater than a preset threshold T2, merging the non-"other" cluster corpora from before and after the adjustment as the non-"other" cluster corpus of the current clustering result, with the remaining corpus serving as the "other" cluster corpus whose cluster label is "other";
the adjusting of the "other" cluster corpus and the non-"other" cluster corpus to obtain the cluster corpus corresponding to the adjusted cluster labels includes:
obtaining the prediction score values of the non-"other" cluster corpus;
if the prediction score value of a non-"other" cluster corpus is lower than a preset threshold T1, changing the cluster label of that non-"other" cluster corpus to "other";
when the number of items in the "other" cluster corpus whose cluster label is "other" exceeds a preset threshold N2, marking the intra-cluster corpus in the clustering result whose number of items exceeds the preset threshold N2 as a new cluster corpus;
obtaining the adjusted "other" cluster corpus and the adjusted non-"other" cluster corpus;
the clustering of the training corpus based on the pre-trained clustering model to obtain the corresponding clustering result includes:
shuffling and then segmenting the training corpus to obtain a cluster sample corpus;
clustering the cluster sample corpus based on the hierarchical agglomerative clustering (HAC) algorithm to obtain cluster labels and the cluster corpus corresponding to each cluster label;
the clustering of the cluster sample corpus based on the HAC algorithm to obtain cluster labels and the cluster corpus corresponding to each cluster label includes:
classifying the cluster sample corpus and grouping samples of the same class into one cluster, to obtain the cluster labels corresponding to the different clusters;
determining, based on the cluster labels corresponding to the clusters, the intra-cluster corpus corresponding to each cluster label;
if the number of items in the intra-cluster corpus is greater than a preset threshold N1, the cluster itself is used as the cluster label of the intra-cluster corpus, and the intra-cluster corpus corresponding to that cluster label is the corresponding cluster corpus;
if the number of items in the intra-cluster corpus is not greater than the preset threshold N1, "other" is used as the cluster label of the intra-cluster corpus, and the intra-cluster corpus corresponding to that cluster label is the corresponding cluster corpus.
4. A model building device, characterized in that the model building device comprises: a memory, a processor, and a model building program stored in the memory and executable on the processor, wherein the model building program, when executed by the processor, implements the steps of the model building method according to any one of claims 1 to 2.
5. A medium, the medium being a computer-readable storage medium, characterized in that a model building program is stored on the computer-readable storage medium, and the model building program, when executed by a processor, implements the steps of the model building method according to any one of claims 1 to 2.
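The thresholding steps recited in the claims can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the HAC step has already assigned a cluster id to every corpus item, and the function names (`label_small_clusters`, `demote_low_scores`, `macro_prf`) are hypothetical. The claims do not say how the PRF value is computed, so macro-averaging over labels is an assumption made only for illustration.

```python
from collections import Counter

OTHER = "other"


def label_small_clusters(cluster_ids, n1):
    """Keep a cluster's own id as its label only if the cluster holds
    more than n1 items; items of smaller clusters are labeled "other"."""
    sizes = Counter(cluster_ids)
    return [cid if sizes[cid] > n1 else OTHER for cid in cluster_ids]


def demote_low_scores(labels, scores, t1):
    """Relabel a non-"other" item as "other" when its prediction score
    is below the threshold t1."""
    return [
        OTHER if label != OTHER and score < t1 else label
        for label, score in zip(labels, scores)
    ]


def macro_prf(gold, predicted):
    """Macro-averaged precision, recall and F1 over the labels in gold.
    (Assumption: the source does not specify the averaging scheme.)"""
    labels = set(gold)
    totals = [0.0, 0.0, 0.0]
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        totals[0] += precision
        totals[1] += recall
        totals[2] += f1
    return tuple(total / len(labels) for total in totals)


# Hypothetical cluster ids standing in for the output of the HAC step.
cluster_ids = ["a", "a", "a", "b", "c", "c", "c", "c"]
labels = label_small_clusters(cluster_ids, n1=2)

# Hypothetical prediction scores from the initial classification model.
scores = [0.9, 0.2, 0.8, 0.5, 0.95, 0.9, 0.1, 0.7]
adjusted = demote_low_scores(labels, scores, t1=0.3)
```

In the iterative procedure of the claims, the resulting PRF value would be compared with a preset threshold, and the relabeling would be repeated on the adjusted clustering result until the threshold is reached. The N2 and T2 checks are omitted here to keep the sketch short.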
CN202210229151.9A 2022-03-08 2022-03-08 Model building methods, devices, equipment and media Active CN114817455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210229151.9A CN114817455B (en) 2022-03-08 2022-03-08 Model building methods, devices, equipment and media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210229151.9A CN114817455B (en) 2022-03-08 2022-03-08 Model building methods, devices, equipment and media

Publications (2)

Publication Number Publication Date
CN114817455A CN114817455A (en) 2022-07-29
CN114817455B CN114817455B (en) 2026-04-07

Family

ID=82528956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210229151.9A Active CN114817455B (en) 2022-03-08 2022-03-08 Model building methods, devices, equipment and media

Country Status (1)

Country Link
CN (1) CN114817455B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467602B (en) * 2023-04-27 2025-12-23 中国工商银行股份有限公司 Training data generation method, device, computer equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704429A (en) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 Semi-supervised learning-based intention identification method, device, equipment and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337B (en) * 2009-04-14 2014-07-02 日电(中国)有限公司 Part-of-speech tagging system, and device and method thereof for training part-of-speech tagging model
CN109739984A (en) * 2018-12-25 2019-05-10 贵州商学院 A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
CN111813905B (en) * 2020-06-17 2024-05-10 平安科技(深圳)有限公司 Corpus generation method, corpus generation device, computer equipment and storage medium
CN113191148B (en) * 2021-04-30 2024-05-28 西安理工大学 A rail transit entity recognition method based on semi-supervised learning and clustering
CN113704479B (en) * 2021-10-26 2022-02-18 深圳市北科瑞声科技股份有限公司 Unsupervised text classification method and device, electronic equipment and storage medium
CN114003720A (en) * 2021-10-29 2022-02-01 平安国际智慧城市科技股份有限公司 Business document classification method, device, equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704429A (en) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 Semi-supervised learning-based intention identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN114817455A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN112836509B (en) A method and system for constructing an expert system knowledge base
CN109800306B (en) Intent analysis method, device, display terminal and computer-readable storage medium
CN111651996B (en) Abstract generation method, device, electronic device and storage medium
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
JP4311552B2 (en) Automatic document separation
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN109359296B (en) Public opinion emotion recognition method, device and computer-readable storage medium
CN111428028A (en) Information classification method based on deep learning and related equipment
CN116955534B (en) Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN113012687B (en) Information interaction method and device and electronic equipment
CN112671985A (en) Agent quality inspection method, device, equipment and storage medium based on deep learning
CN117113982A (en) A big data topic analysis method based on embedding model
CN114491010B (en) Training method and device for information extraction model
CN116644183B (en) Text classification method, device and storage medium
CN113095073B (en) Corpus tag generation method and device, computer equipment and storage medium
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN118445398A (en) Intelligent question number auxiliary decision-making method and system based on AI large model
CN114817455B (en) Model building methods, devices, equipment and media
TW202034207A (en) Dialogue system using intention detection ensemble learning and method thereof
CN119155391B (en) Full-scale voice analysis method, device and equipment based on large model and storage medium
CN113505227A (en) Text classification method and device, electronic equipment and readable storage medium
CN111767735B (en) Method, apparatus and computer readable storage medium for executing tasks
CN121605398A (en) Text recognition method, model and electronic equipment
CN114239565A (en) Deep learning-based emotion reason identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant