WO2017118427A1 - Method and apparatus for webpage training, and method and apparatus for search intent recognition - Google Patents
- Publication number
- WO2017118427A1 (application PCT/CN2017/070504)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- webpage
- training
- category
- query string
- classification model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Definitions
- The present invention relates to the field of Internet technologies, and in particular to a method and apparatus for webpage training and a method and apparatus for search intent recognition.
- With the development of Internet technology, people can use search engines to retrieve the information they need over the network. For example, when a user enters "Fairy Swordsman" in a search engine, the user probably intends to search for either a TV series or a game. The search engine must first determine the user's search intent so that the returned results are closer to what the user needs. Intent recognition means determining, for any given query string, the category to which the query string belongs.
- a method of web page training comprising:
- A device for webpage training, comprising:
- a webpage vector generating module configured to obtain a training webpage collection with manually labeled categories and to generate webpage vectors of the webpages in the training webpage collection, where the webpage vector generating module includes:
- a word segmentation unit configured to obtain the valid historical query strings of a first training webpage in the training webpage collection and to segment the valid historical query strings into word segments;
- a word segmentation weight calculation unit configured to obtain the effective count of each word segment, the effective count being the total number of occurrences of the word segment in the valid historical query strings, and to calculate the word segmentation weight of each word segment according to its effective count;
- a webpage vector generating unit configured to generate the webpage vector of the first training webpage according to each word segment and its corresponding word segmentation weight; and
- a webpage classification model generating module configured to generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage collection and the corresponding webpage vectors.
- The method and apparatus for webpage training described above obtain a training webpage collection with manually labeled categories and generate webpage vectors of the webpages in the collection. Generating a webpage vector includes: obtaining the valid historical query strings of a first training webpage in the training webpage collection; segmenting the valid historical query strings into word segments; obtaining the effective count of each word segment, the effective count being the total number of occurrences of the word segment in the valid historical query strings; calculating the word segmentation weight of each word segment according to its effective count; and generating the webpage vector of the first training webpage according to each word segment and its corresponding weight.
- A webpage classification model is then generated according to the manually labeled categories of the webpages in the training webpage collection and the corresponding webpage vectors. Because training uses webpage vectors generated from segmented valid historical query strings, the training cost is low and the efficiency is high, and once the webpage classification model has been generated it can label webpages automatically.
- Medium-tail and long-tail webpages can therefore also obtain categories automatically, so that the coverage of webpage categories in intent recognition is high and the identified intent is more accurate.
- A method for search intent recognition, comprising:
- An intent recognition result of the query string is obtained according to the intent distribution.
- A device for search intent recognition, comprising:
- an obtaining module configured to obtain a query string to be identified and to obtain a historical webpage collection corresponding to the query string, where the historical webpage collection includes each webpage that was historically clicked for the query string;
- a webpage category obtaining module configured to obtain the webpage classification model generated by the above device for webpage training and to obtain the category of each webpage in the historical webpage collection according to the webpage classification model;
- an intent recognition module configured to count the number of webpages in each category in the historical webpage collection, to calculate an intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage collection, and to obtain an intent recognition result of the query string according to the intent distribution.
- The method and device for search intent recognition described above obtain a query string to be identified and the historical webpage collection corresponding to the query string, where the historical webpage collection includes each webpage that was historically clicked for the query string.
- The webpage classification model generated by the webpage training method of the above embodiment is obtained, the category of each webpage in the historical webpage collection is obtained according to the webpage classification model, and the number of webpages in each category in the historical webpage collection is counted.
- The intent distribution of the query string is calculated according to the number of webpages in each category and the total number of webpages in the historical webpage collection, and the intent recognition result of the query string is obtained according to the intent distribution.
- During intent recognition, the categories of the webpages in the historical webpage collection are identified automatically by the webpage classification model, which covers more webpages than manually labeled categories do.
- Medium-tail and long-tail webpages can therefore also obtain categories automatically, and the identified intent is more accurate.
- FIG. 1 is an application environment diagram of a method for webpage training and a method for search intent recognition in an embodiment
- FIG. 2 is a diagram showing the internal structure of the server of FIG. 1 in an embodiment
- FIG. 3 is a flow chart of a method for webpage training in an embodiment
- FIG. 4 is a flow chart of a method for search intent recognition in an embodiment
- FIG. 5 is a flow chart of generating a string classification model in an embodiment
- FIG. 6 is a structural block diagram of an apparatus for webpage training in an embodiment
- FIG. 7 is a structural block diagram of an apparatus for webpage training in another embodiment
- FIG. 8 is a structural block diagram of an apparatus for search intent recognition in an embodiment
- FIG. 9 is a structural block diagram of an apparatus for search intent recognition in another embodiment
- FIG. 10 is a structural block diagram of an apparatus for search intent recognition in still another embodiment
- FIG. 1 is an application environment diagram of a method for webpage training and a method for searching for intent recognition in an embodiment.
- the application environment includes a terminal 110 and a server 120, wherein the terminal 110 and the server 120 communicate through a network.
- the terminal 110 can be a smartphone, a tablet, a notebook, a desktop computer, etc., but is not limited thereto.
- the terminal 110 transmits a query string to the server 120 for searching through the network, and the server 120 can respond to the request sent by the terminal 110.
- the internal structure of server 120 in FIG. 1 is as shown in FIG. 2, which includes a processor, storage medium, memory, and network interface connected by a system bus.
- The storage medium of the server 120 stores an operating system, a database, and a search intent recognition device, where the search intent recognition device includes a webpage training device. The database is used to store data.
- The search intent recognition device implements a search intent recognition method applicable to the server 120, and the webpage training device implements a webpage training method applicable to the server 120.
- the processor of the server 120 is used to provide computing and control capabilities to support the operation of the entire server 120.
- The memory of the server 120 provides an environment for running the search intent recognition device stored in the storage medium.
- the network interface of the server 120 is used to communicate with the external terminal 110 via a network connection, such as receiving a search request sent by the terminal 110 and returning data to the terminal 110.
- A method for webpage training is provided, applied to the server in the above application environment, and includes the following steps:
- Step S210 Acquire a training webpage collection of the manual annotation category, and generate a webpage vector of the webpage in the training webpage collection.
- The number of webpages in the training webpage collection can be customized according to requirements. To make the trained webpage classification model more accurate, the collection should contain enough webpages, and those webpages should span enough different categories.
- The webpages in the training webpage collection are manually labeled with categories; for example, mp3.baidu.com is manually labeled as music, and youku.com is manually labeled as video.
- When generating webpage vectors of the webpages in the training webpage collection, a webpage vector may be generated for every webpage in the collection, or a subset of webpages may be selected according to a preset condition and webpage vectors generated only for them, for example by selecting a preset number of webpages from each manually labeled category.
- the step of generating a webpage vector of the webpage in the training webpage collection specifically includes:
- Step S211 Obtain a valid historical query string of the first training webpage in the training webpage set, and perform segmentation on the valid historical query string.
- If the first training webpage is returned as a search result of a first query string and is clicked by the user, the first query string is a valid historical query string of the first training webpage.
- A second query string that does not meet this condition is not a valid historical query string of the first training webpage.
- The number of valid historical query strings of the first training webpage can be customized according to requirements, but it needs to be large enough for the training result to be meaningful. For example, all valid historical query strings of the first training webpage within a preset time period may be obtained, the preset time period being one close to the current time. Word segmentation is performed on each valid historical query string so that the query string is represented by its word segments.
- Step S212 Obtain the effective count of each word segment, where the effective count is the total number of occurrences of the word segment in the valid historical query strings.
- For example, if segmenting the valid historical query strings yields 30 occurrences of the word segment "Jay Chou", the effective count of "Jay Chou" is 30. The larger the effective count of a word segment, the more often query strings containing that word segment led to the current training webpage.
- Step S213 calculating the word segmentation weight of each word segment according to the effective number of each word segmentation.
- The word segmentation weight increases with the effective count, and the specific method of calculating the weight can be customized as needed.
- A log function is relatively smooth and preserves the increasing relationship between the word segmentation weight $W(q_i)$ and the effective count $c_i$, so the weight of each word segment can be obtained simply and conveniently.
- Step S214 generating a webpage vector of the first training webpage according to each word segmentation and the corresponding word segmentation weight.
- The webpage vector of the first training webpage can be expressed as $\{q_1: W(q_1), q_2: W(q_2), \ldots, q_m: W(q_m)\}$, and the generated webpage vector represents the bag-of-words feature of the first training webpage.
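Steps S211 to S214 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the whitespace segmenter `segment`, the function name `webpage_vector`, the sample queries, and the specific weight `log(1 + c_i)` are all assumptions, since the text only requires a log-based weight that grows with the effective count.

```python
import math
from collections import Counter

def segment(query):
    # Hypothetical segmenter: splits on whitespace. A real system would use
    # a proper word segmenter, especially for Chinese query strings.
    return query.lower().split()

def webpage_vector(valid_queries):
    """Build {word segment: weight} from a page's valid historical query strings."""
    counts = Counter()
    for q in valid_queries:
        counts.update(segment(q))  # effective count c_i of each word segment
    # Assumed weighting W(q_i) = log(1 + c_i); the source only says a log
    # function is used, not its exact form.
    return {seg: math.log(1 + c) for seg, c in counts.items()}

# Hypothetical valid historical query strings for one training webpage.
queries = ["jay chou songs", "jay chou mp3", "free mp3 download"]
vec = webpage_vector(queries)
```

The resulting dictionary is a sparse bag-of-words vector: segments that appear in more valid query strings receive larger weights.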
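- When the similarity between the webpage vector of a first webpage and the webpage vector of a second webpage is greater than a preset threshold and the first webpage belongs to a first category, it can be inferred that the category of the second webpage is also the first category.
- For example, if the cosine similarity between the webpage vector of mp3.baidu.com and the webpage vector of y.qq.com is greater than the preset threshold, then, because mp3.baidu.com belongs to the music category, y.qq.com is inferred to belong to the music category as well.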
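The cosine-based category inference described above can be sketched as follows. The function names, the sample vectors, and the threshold value of 0.8 are illustrative assumptions; the source does not state a threshold.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def propagate_category(labeled_vec, labeled_cat, other_vec, threshold=0.8):
    # Returns the labeled page's category when the two pages are similar
    # enough, otherwise None. The 0.8 threshold is an assumed value.
    return labeled_cat if cosine(labeled_vec, other_vec) > threshold else None

# Hypothetical webpage vectors.
a = {"jay": 1.0, "mp3": 2.0}
b = {"jay": 1.0, "mp3": 2.0, "lyrics": 0.5}
c = {"movie": 1.0}
```

Here `propagate_category(a, "music", b)` yields the music category because the two vectors are nearly parallel, while `c` shares no terms with `a` and receives no category.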
- Step S215 Acquire other training webpages in the training webpage collection, and repeat step S211 to step S214 until the webpage vector of the target training webpage is generated.
- The number of target training webpages may be customized according to requirements. The target training webpages may be training webpages filtered from the training webpage collection by a preset rule, or all training webpages in the collection may be used directly as target training webpages.
- Step S220 Generate a webpage classification model according to the manual annotation category of the webpage in the training webpage collection and the corresponding webpage vector.
- The manually labeled categories of the webpages in the training webpage collection and the corresponding webpage vectors are fed into a logistic regression model for training, and the webpage classification model is obtained.
- The training of the webpage classification model adopts a logistic regression method.
- The logistic regression (LR) model applies a logistic function on top of linear regression, and the webpage classification model trained in this way has a high accuracy rate.
- the webpage classification model is a mathematical model for classifying webpages, and different methods can be used to train the classification model to obtain different webpage classification models. Choose a training method as needed.
- the trained webpage classification model is used for category prediction when the webpage is predicted online.
- A webpage classification model is generated from a limited number of webpages with manually labeled categories and their generated webpage vectors, and automatic labeling of webpage categories can then be performed by the webpage classification model.
- Because the webpage vectors are used as the training data, there is no need to crawl the full content of each webpage and convert it into a bag of words, so the data cost of training is low and the training efficiency is high.
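To make the training step concrete, the following is a minimal multinomial logistic regression (softmax) trained by stochastic gradient descent on sparse webpage vectors, written with the standard library only. The function names, hyperparameters, and the four toy webpage vectors are assumptions for illustration; a production system would more likely use an established machine learning library.

```python
import math

def predict_proba(w, b, x):
    """Softmax probabilities over categories for sparse input x = {feature: value}."""
    scores = {c: b[c] + sum(w[c].get(f, 0.0) * v for f, v in x.items()) for c in w}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_softmax(samples, labels, epochs=200, lr=0.5):
    """Fit per-category weights and biases with plain SGD on the log-loss."""
    cats = sorted(set(labels))
    w = {c: {} for c in cats}
    b = {c: 0.0 for c in cats}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = predict_proba(w, b, x)
            for c in cats:
                g = p[c] - (1.0 if c == y else 0.0)  # gradient w.r.t. the score of c
                b[c] -= lr * g
                for f, v in x.items():
                    w[c][f] = w[c].get(f, 0.0) - lr * g * v
    return w, b

def predict(w, b, x):
    p = predict_proba(w, b, x)
    return max(p, key=p.get)

# Hypothetical webpage vectors with manually labeled categories.
pages = [
    {"mp3": 1.1, "songs": 0.7},
    {"lyrics": 0.7, "songs": 1.1},
    {"episode": 1.1, "watch": 0.7},
    {"trailer": 0.7, "watch": 1.1},
]
labels = ["music", "music", "video", "video"]
w, b = train_softmax(pages, labels)
```

After training, `predict(w, b, vector)` assigns a category to any unlabeled webpage vector, which is how medium-tail and long-tail webpages could be labeled automatically.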
- In this embodiment, a training webpage collection with manually labeled categories is obtained and webpage vectors of the webpages in the collection are generated. This specifically includes: obtaining the valid historical query strings of the first training webpage in the training webpage collection and segmenting them into word segments;
- obtaining the effective count of each word segment, where the effective count is the total number of occurrences of the word segment in the valid historical query strings;
- calculating the word segmentation weight of each word segment according to its effective count; and generating the webpage vector of the first training webpage according to each word segment and its corresponding weight.
- A webpage classification model is generated according to the manually labeled categories of the webpages in the training webpage collection and the corresponding webpage vectors. Because training uses webpage vectors generated from segmented valid historical query strings, the training cost is low and the efficiency is high. Once the webpage classification model has been generated, webpages can be labeled automatically, so medium-tail and long-tail webpages can also obtain categories automatically, the coverage of webpage categories in intent recognition is high, and the identified intent is more accurate.
- Before step S220, the method further includes: acquiring an LDA feature of each webpage in the training webpage collection.
- LDA (Latent Dirichlet Allocation) is used to perform topic clustering on text, and the LDA feature of a webpage can be obtained by inputting the webpage text into the LDA model.
- Step S220 is: generating a webpage classification model according to the LDA feature of the webpage, the manual labeling category, and the corresponding webpage vector.
- The LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors are fed into the logistic regression model for training, and the webpage classification model is obtained.
- The training of the webpage classification model adopts a logistic regression method.
- The logistic regression (LR) model applies a logistic function on top of linear regression, and the webpage classification model trained in this way has a high accuracy rate.
- the training data of the training webpage classification model adds the LDA feature of the webpage, and the LDA feature reflects the theme of the webpage, so that the trained webpage classification model can more accurately classify the webpage.
- LDA denotes the document topic generation model
- LR+LDA means that both the LR (logistic regression) model and the LDA feature are used
- LR+BOW+LDA means that the LR model, the LDA feature, and the BOW (bag-of-words) feature are used together.
- The accuracy rate (precision) is the proportion of retrieved items (for example, documents or webpages) that are correct; the recall rate is the proportion of all correct items that were retrieved.
- Accuracy rate = number of correct items extracted / number of items extracted
- Recall rate = number of correct items extracted / number of correct items in the sample
- F1 is the harmonic mean of the accuracy rate and the recall rate.
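The three measures defined above can be computed as in the short sketch below. The function name and the sample item identifiers are hypothetical; the formulas follow the definitions in the text, with F1 taken as $2PR/(P+R)$.

```python
def precision_recall_f1(extracted, relevant):
    """Accuracy rate (precision), recall rate, and F1 for a set of extracted items."""
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)          # correct items extracted
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1

# Hypothetical example: 4 items extracted, 3 items are actually correct,
# and 2 of the extracted items are among the correct ones.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
```

In this example the accuracy rate is 2/4, the recall rate is 2/3, and F1 is 4/7.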
- A method for search intent recognition, including:
- Step S310 Acquire a query string to be identified, and obtain a historical webpage collection corresponding to the query string.
- The historical webpage collection includes each webpage that was historically clicked for the query string.
- The query string to be identified is a query string entered into the search engine from the terminal, and the historical webpage collection consists of the webpages that were clicked for this query string in historical searches.
- Step S320 Obtain a webpage classification model generated by the webpage training method of any of the above embodiments, and obtain a category of the webpage in the historical webpage collection according to the webpage classification model.
- the webpage classification model generated by the webpage training method in the above embodiment automatically classifies the webpages in the historical webpage collection.
- The historical webpage collection is $\{url_1, url_2, \ldots, url_n\}$, where $url_i$ ($1 \le i \le n$) represents each webpage, and the category of each webpage is obtained as $url_1 \in d_1, url_2 \in d_2, \ldots, url_n \in d_s$,
- where $d_1, d_2, \ldots, d_s$ represent the categories, $s$ is the total number of categories, and the category set is $\{d_1, d_2, \ldots, d_s\}$.
- Step S330 counting the number of webpages in each category in the historical webpage collection, and calculating the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage collection.
- The number of webpages in each category in the historical webpage collection is counted; for example, if category $d_1$ contains $t$ webpages, the count for $d_1$ is $t$. The webpages in the historical webpage collection are also counted to obtain the total number of webpages; for example, for the historical webpage collection $\{url_1, url_2, \ldots, url_n\}$ the total is $n$.
- Step S340 obtaining an intention recognition result of the query string according to the intention distribution.
- The category with the highest probability in the intent distribution may be used as the intent recognition result of the query string; alternatively, a preset number of categories may be taken in descending order of probability, or the categories whose probability is greater than a preset threshold may be taken, as the intent recognition result of the query string.
- The service corresponding to the application that currently sends the query string may also be obtained, and the intent recognition result of the query string is then obtained according to the service information and the intent distribution. For example, if the service information of the application sending the query string indicates a music service, then even if the category with the highest probability in the intent distribution is not music, the music category can still be used as an intent recognition result.
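Steps S330 and S340 can be illustrated with the sketch below. It assumes that the probability of a category is its webpage count divided by the total number of webpages, which is consistent with the description but not spelled out in the surviving text; the function names, parameters, and sample categories are hypothetical.

```python
from collections import Counter

def intent_distribution(page_categories):
    """Map each category to (pages in category) / (total pages); assumed formula."""
    total = len(page_categories)
    if total == 0:
        return {}
    counts = Counter(page_categories)
    return {cat: n / total for cat, n in counts.items()}

def recognize(dist, top_k=None, threshold=None):
    """Pick the result by threshold, by top-k, or by highest probability."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        return [c for c, p in ranked if p > threshold]
    if top_k is not None:
        return [c for c, _ in ranked[:top_k]]
    return [ranked[0][0]] if ranked else []

# Hypothetical categories predicted for the pages clicked for one query string.
cats = ["video", "video", "game", "video", "music"]
dist = intent_distribution(cats)
```

With these five pages the distribution is 0.6 for video and 0.2 each for game and music, so the default result is the video category, while a threshold of 0.1 returns all three.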
- In this embodiment, the query string to be identified is obtained together with the historical webpage collection corresponding to the query string, where the historical webpage collection includes each webpage that was historically clicked for the query string, and the webpage classification model generated by the webpage training method of the above embodiment is obtained.
- The category of each webpage in the historical webpage collection is obtained according to the webpage classification model, the number of webpages in each category in the historical webpage collection is counted, the intent distribution of the query string is calculated according to the number of webpages in each category and the total number of webpages in the historical webpage collection, and the intent recognition result of the query string is obtained according to the intent distribution.
- During intent recognition, the webpage classification model identifies the categories of the webpages in the historical webpage collection automatically, which covers more webpages than manually labeled categories do, so medium-tail and long-tail webpages can also obtain categories automatically and the identified intent is more accurate.
- Before step S340, the method further includes: acquiring a string classification model, and obtaining a prediction category of the query string according to the string classification model.
- The string classification model is a mathematical model for classifying query strings. Different methods can be used to train the classification model to obtain different string classification models, and the training method is selected as needed. After the string classification model has been trained offline by a supervised learning method, it can be used to predict the category of a query string during intent recognition.
- the prediction category of the query string can correct the intent recognition result of the query string when the intent distribution of the query string is not obvious. For example, the intent distribution of the query string has many categories, and the probabilities of each category are close and relatively small. When the identification is based only on the intent distribution of the query string, the result is often inaccurate.
- Step S340 is: obtaining an intent recognition result of the query string according to the intention distribution and the prediction category.
- The intent recognition result of the query string may be determined according to the number of categories in the intent distribution and the probability of each category. If the intent distribution contains many categories and the probability of each is relatively small, the prediction category can be used directly as the intent recognition result, or the highest-probability category in the intent distribution can be combined with the prediction category to form the intent recognition result. The specific algorithm for obtaining the intent recognition result can be customized as needed. When no usable intent distribution is available, for example when the query string is rare and the corresponding historical webpage collection contains zero or very few webpages, the intent distribution either cannot be calculated or contains a single category with a probability of 100% that is likely to be wrong; in that case the prediction category of the query string can also be used directly as the intent recognition result.
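One possible way to combine the intent distribution with the prediction category is sketched below. Since the text leaves the algorithm open, every rule here is an assumption: the function name `final_intent`, the `min_pages` cutoff for rare query strings, and the `min_top_prob` cutoff for a flat distribution are all hypothetical.

```python
def final_intent(dist, predicted, n_pages, min_pages=3, min_top_prob=0.5):
    """Combine an intent distribution with a classifier's prediction category.

    dist      -- {category: probability} from clicked-page statistics
    predicted -- prediction category from the string classification model
    n_pages   -- number of pages in the historical webpage collection
    """
    # Rare query string: the distribution is missing or built from too few pages.
    if not dist or n_pages < min_pages:
        return [predicted]
    top_cat = max(dist, key=dist.get)
    # Clear distribution: trust the click statistics.
    if dist[top_cat] >= min_top_prob:
        return [top_cat]
    # Flat distribution: combine the top category with the prediction category.
    return [top_cat] if top_cat == predicted else [top_cat, predicted]
```

For example, a query string with a single clicked page returns only the prediction category, while a flat four-way distribution returns both its top category and the prediction category.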
- Before the step of acquiring the string classification model, the method further includes:
- Step S410 Acquire, from the historical query strings, the query strings corresponding to the category with the highest intent probability in their intent distributions as category training query strings, where the categories with the highest intent probability include a plurality of different categories.
- The intent distribution is calculated for a large number of historical query strings, and the category with the highest intent probability may differ from one query string to another.
- Each query string is used as a category training query string for the category with the highest intent probability in its intent distribution, and these highest-probability categories include a plurality of different categories, which ensures the validity of the training data.
- Step S420 Extract word-based and/or character-based n-gram features from the category training query strings corresponding to the different categories, where n is an integer greater than 1 and less than M, and M is the length, in words or in characters, of the category training query string currently being processed.
- Extracting word-based and/or character-based n-gram features expands the feature length. Multiple extractions can be performed on the same query string, with a different number of elements extracted each time. Here, the number of elements means the number of words, and the results of all extractions together form a feature combination. For example, for the training string "Jay Jay's Songs", the word-based n-gram features are extracted as follows:
- The extracted character-based 1-gram to 3-gram features are as follows:
- With character-based 1-gram to 3-gram features, the feature length is expanded to 15 or more, which effectively alleviates the problem of feature sparseness.
- As long as the training data is large enough, the approach scales well.
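The n-gram extraction of step S420 can be sketched as follows. The function names and the sample strings are hypothetical, and the sketch extracts 1-grams to 3-grams to match the "1-3" range mentioned above; whitespace splitting stands in for a real word segmenter.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of a token list for n_min <= n <= n_max."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(tuple(tokens[i:i + n]))
    return feats

def word_ngrams(query, n_min=1, n_max=3):
    # Word-based features; whitespace splitting is an assumed segmenter.
    return [" ".join(g) for g in ngrams(query.split(), n_min, n_max)]

def char_ngrams(query, n_min=1, n_max=3):
    # Character-based features; spaces are dropped before extraction.
    chars = list(query.replace(" ", ""))
    return ["".join(g) for g in ngrams(chars, n_min, n_max)]

word_feats = word_ngrams("jay chou songs")
char_feats = char_ngrams("abcde")
```

A three-word query yields 6 word-based features, and a five-character string yields 12 character-based features, which shows how the extraction lengthens the feature list of a short query string.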
- Step S430 Use the n-gram features and the corresponding categories as training data, and train a classification model on them to generate the string classification model.
- The n-gram features and the corresponding categories are used as training data and fed into the classification model for training to obtain the string classification model.
- Because the n-gram features and the corresponding categories are used as training data, the training data is expanded beyond the category training query strings themselves, which can improve the accuracy and coverage of the resulting string classification model.
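As an illustration of step S430, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing on word-based 1-gram and 2-gram features. Naive Bayes is only one possible choice of classification model; the function names, the n-gram range, and the four training query strings are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def word_ngrams(query, n_max=2):
    words = query.split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def train_nb(queries, labels):
    """Count n-gram features per category."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)
    for q, y in zip(queries, labels):
        feat_counts[y].update(word_ngrams(q))
    vocab = {f for counts in feat_counts.values() for f in counts}
    return class_counts, feat_counts, vocab

def predict_nb(model, query):
    class_counts, feat_counts, vocab = model
    n_samples = sum(class_counts.values())
    feats = [f for f in word_ngrams(query) if f in vocab]  # ignore unseen features
    scores = {}
    for c, n_c in class_counts.items():
        total = sum(feat_counts[c].values())
        score = math.log(n_c / n_samples)                  # log prior
        for f in feats:                                    # add-one smoothed likelihoods
            score += math.log((feat_counts[c][f] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical category training query strings.
train_queries = ["jay chou songs", "new pop songs",
                 "watch movie online", "movie trailer online"]
train_labels = ["music", "music", "video", "video"]
model = train_nb(train_queries, train_labels)
```

Calling `predict_nb(model, "pop songs")` then returns the prediction category for a new query string, which can be combined with the intent distribution during intent recognition.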
- The training features can be mapped to vectors of a fixed dimension (for example, 1 million dimensions) to improve training efficiency and to reduce invalid training data so that the training result is more accurate.
- Alternatively, the category proportion feature of the webpages clicked for the query string can be added to increase the coverage of the training data.
- The category proportion feature refers to the proportion of clicked webpages in each category relative to all clicked webpages, such as the ratio of clicked video webpages to all clicked webpages.
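The two ideas above can be sketched briefly. The fixed-dimension mapping is shown here as feature hashing with `zlib.crc32`, which is one common way to do it and is an assumption rather than the method the source specifies; the function names and sample inputs are likewise hypothetical.

```python
import zlib

def hash_features(features, dim=1_000_000):
    """Map string features to a sparse vector with a fixed number of dimensions."""
    vec = {}
    for f in features:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        idx = zlib.crc32(f.encode("utf-8")) % dim
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

def category_proportions(clicked_categories):
    """Proportion of clicked webpages that fall in each category."""
    total = len(clicked_categories)
    props = {}
    for c in clicked_categories:
        props[c] = props.get(c, 0.0) + 1.0 / total
    return props

hashed = hash_features(["jay", "chou", "jay chou", "jay"])
props = category_proportions(["video", "video", "music", "video"])
```

The hashed vector keeps the total feature count while bounding the index range, and `props` gives 0.75 for video and 0.25 for music, which could be appended to the n-gram features as extra inputs.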
- Bayesian denotes the naive Bayesian model
- the word segmentation feature denotes extraction of word-based n-gram features
- the character feature denotes extraction of character-based n-gram features
- SVM denotes the support vector machine model.
- The string classification model trained on extracted character-based n-gram features classifies query strings with a high accuracy rate and recall rate, and extracting both character-based and word-based n-gram features gives a still higher accuracy rate and recall rate.
- The overall accuracy of intent recognition using this method can be increased from 54.6% before use to 85%, a relative increase of roughly 60%.
- An apparatus for webpage training, including:
- The webpage vector generating module 510 is configured to acquire a training webpage set with manually labeled categories and to generate webpage vectors of the webpages in the training webpage set.
- The webpage vector generating module 510 includes:
- The word segmentation unit 511 is configured to acquire the valid historical query strings of the first training webpage in the training webpage set and to segment the valid historical query strings into words.
- The segment weight calculation unit 512 is configured to acquire the valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings, and to calculate the segment weight of each word segment according to its valid count.
- The webpage vector generating unit 513 is configured to generate the webpage vector of the first training webpage according to the word segments and the corresponding segment weights.
- The webpage classification model generating module 520 is configured to generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
- The apparatus further includes:
- The LDA feature acquisition module 530 is configured to acquire the LDA features of the webpages in the training webpage set.
- The webpage classification model generating module 520 is further configured to generate the webpage classification model according to the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors.
- The webpage classification model generating module 520 is further configured to substitute the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors into a logistic regression model for training, to obtain the webpage classification model.
- An apparatus for search intent recognition, including:
- The obtaining module 610 is configured to obtain a query string to be recognized and to obtain the historical webpage set corresponding to the query string.
- The historical webpage set includes each webpage that was historically clicked through the query string.
- The webpage category obtaining module 620 is configured to acquire a webpage classification model generated by the apparatus for webpage training according to any of the above embodiments, and to obtain the categories of the webpages in the historical webpage set according to the webpage classification model.
- The intent recognition module 630 is configured to count the number of webpages in each category in the historical webpage set, to calculate the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set, and to obtain the intent recognition result of the query string according to the intent distribution.
- The apparatus further includes:
- The prediction category module 640 is configured to acquire a string classification model and to obtain a predicted category of the query string according to the string classification model.
- The intent recognition module 630 is further configured to obtain the intent recognition result of the query string according to the intent distribution and the predicted category.
- The apparatus further includes:
- The string classification model generating module 650 is configured to take, for each historical query string, the category with the highest intent probability in its intent distribution and to use the query string with that category as a category training query string, where these highest-probability categories cover multiple different categories; to extract word-based and/or character-based n-gram features from the category training query strings of the different categories, n being an integer greater than 1 and smaller than the word length or character length of the query string currently being processed; and to train a classification model with the n-gram features and the corresponding categories as training data, generating a string classification model.
- The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
- An embodiment of the present invention further provides a computer storage medium storing a computer program for performing the webpage training method or the search intent recognition method according to an embodiment of the present invention.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
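A webpage training method and apparatus, and a search intent recognition method and apparatus. The webpage training method includes: acquiring a training webpage set with manually labeled categories and generating webpage vectors of the webpages in the training webpage set (S210), which specifically includes: acquiring the valid historical query strings of a first training webpage in the training webpage set and segmenting the valid historical query strings into words (S211); acquiring the valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings (S212); calculating the segment weight of each word segment according to its valid count (S213); generating the webpage vector of the first training webpage according to the word segments and the corresponding segment weights (S214); and generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors (S220). The method and apparatus have a low training cost and high efficiency, and once the webpage classification model is generated, webpages can be labeled with categories automatically, so the recognized intents are more accurate.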
Description
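This patent application claims priority to Chinese Patent Application No. 201610008131.3, filed on January 7, 2016 by Tencent Technology (Shenzhen) Company Limited and entitled "Webpage training method and apparatus, and search intent recognition method and apparatus", the entire contents of which are incorporated herein by reference.
The present invention relates to the field of Internet technologies, and in particular to a webpage training method and apparatus and a search intent recognition method and apparatus.
With the development of Internet technology, people can use search engines over the network to retrieve the information they need. For example, when a user enters "仙剑奇侠传" into a search engine, the user most likely intends to search for either a TV series or a game. The search engine must first determine the user's search intent before it can return results closer to what the user needs. Intent recognition means determining, for any given query string, the category to which the query string belongs.
Existing search intent recognition methods usually label webpage categories manually, and intent recognition then relies on these manually labeled categories. The webpage set of every category has to be labeled by hand, which is too costly, and the number of manually labeled results is usually limited. For webpages with few clicks, the category is likely to be unknown, so the accuracy of intent recognition is low.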
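Summary of the Invention
In view of the above technical problem, it is necessary to provide a webpage training method and apparatus and a search intent recognition method and apparatus that improve the accuracy of search intent recognition.
A webpage training method, the method comprising:
acquiring a training webpage set with manually labeled categories, and generating webpage vectors of the webpages in the training webpage set, which specifically comprises:
acquiring valid historical query strings of a first training webpage in the training webpage set, and segmenting the valid historical query strings into words;
acquiring a valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings;
calculating a segment weight of each word segment according to the valid count of each word segment;
generating the webpage vector of the first training webpage according to the word segments and the corresponding segment weights; and
generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.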
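A webpage training apparatus, the apparatus comprising:
a webpage vector generating module, configured to acquire a training webpage set with manually labeled categories and to generate webpage vectors of the webpages in the training webpage set, the webpage vector generating module comprising:
a word segmentation unit, configured to acquire valid historical query strings of a first training webpage in the training webpage set and to segment the valid historical query strings into words;
a segment weight calculation unit, configured to acquire a valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings, and to calculate a segment weight of each word segment according to the valid count of each word segment;
a webpage vector generating unit, configured to generate the webpage vector of the first training webpage according to the word segments and the corresponding segment weights; and
a webpage classification model generating module, configured to generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
With the above webpage training method and apparatus, a training webpage set with manually labeled categories is acquired and webpage vectors of its webpages are generated as follows: the valid historical query strings of a first training webpage are acquired and segmented into words; the valid count of each word segment, namely the total number of times it occurs in the valid historical query strings, is acquired; the segment weight of each word segment is calculated from its valid count; and the webpage vector of the first training webpage is generated from the word segments and the corresponding segment weights. A webpage classification model is then generated from the manually labeled categories of the webpages and the corresponding webpage vectors. Training on webpage vectors built from segmented valid historical query strings is low in cost and high in efficiency. Once generated, the webpage classification model can label webpages with categories automatically, so mid-tail and long-tail webpages also obtain categories. This raises the coverage of webpage categories in intent recognition and makes the recognized intents more accurate.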
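A search intent recognition method, the method comprising:
obtaining a query string to be recognized, and obtaining a historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string;
acquiring a webpage classification model generated by the above webpage training method, and obtaining the categories of the webpages in the historical webpage set according to the webpage classification model;
counting the number of webpages in each category in the historical webpage set, and calculating the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set; and
obtaining the intent recognition result of the query string according to the intent distribution.
A search intent recognition apparatus, the apparatus comprising:
an obtaining module, configured to obtain a query string to be recognized and to obtain a historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string;
a webpage category obtaining module, configured to acquire a webpage classification model generated by the above webpage training apparatus and to obtain the categories of the webpages in the historical webpage set according to the webpage classification model; and
an intent recognition module, configured to count the number of webpages in each category in the historical webpage set, to calculate the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set, and to obtain the intent recognition result of the query string according to the intent distribution.
With the above search intent recognition method and apparatus, a query string to be recognized is obtained together with its historical webpage set, which includes each webpage historically clicked through the query string. A webpage classification model generated by the webpage training method of the above embodiments is acquired and used to obtain the categories of the webpages in the historical webpage set. The number of webpages in each category is counted, the intent distribution of the query string is calculated from these counts and the total number of webpages in the historical webpage set, and the intent recognition result is obtained from the intent distribution. Because the webpage classification model classifies the webpages in the historical webpage set automatically during intent recognition, webpage coverage is larger than with manually labeled categories, mid-tail and long-tail webpages also obtain categories, and the recognized intents are more accurate.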
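FIG. 1 is a diagram of the application environment of the webpage training method and the search intent recognition method in one embodiment;
FIG. 2 is a diagram of the internal structure of the server of FIG. 1 in one embodiment;
FIG. 3 is a flowchart of the webpage training method in one embodiment;
FIG. 4 is a flowchart of the search intent recognition method in one embodiment;
FIG. 5 is a flowchart of generating the string classification model in one embodiment;
FIG. 6 is a structural block diagram of the webpage training apparatus in one embodiment;
FIG. 7 is a structural block diagram of the webpage training apparatus in another embodiment;
FIG. 8 is a structural block diagram of the search intent recognition apparatus in one embodiment;
FIG. 9 is a structural block diagram of the search intent recognition apparatus in another embodiment;
FIG. 10 is a structural block diagram of the search intent recognition apparatus in yet another embodiment.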
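FIG. 1 is a diagram of the application environment in which the webpage training method and the search intent recognition method run in one embodiment. As shown in FIG. 1, the application environment includes a terminal 110 and a server 120, which communicate over a network.
The terminal 110 may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, or a desktop computer. The terminal 110 sends a query string to the server 120 over the network to perform a search, and the server 120 can respond to the request sent by the terminal 110.
In one embodiment, the internal structure of the server 120 in FIG. 1 is as shown in FIG. 2. The server 120 includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the server 120 stores an operating system, a database, and a search intent recognition apparatus, where the search intent recognition apparatus includes a webpage training apparatus. The database stores data, the search intent recognition apparatus implements a search intent recognition method suitable for the server 120, and the webpage training apparatus implements a webpage training method suitable for the server 120. The processor of the server 120 provides computing and control capabilities and supports the operation of the entire server 120. The memory of the server 120 provides an environment for running the search intent recognition apparatus stored in the storage medium. The network interface of the server 120 communicates with the external terminal 110 over a network connection, for example to receive search requests sent by the terminal 110 and to return data to the terminal 110.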
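As shown in FIG. 3, in one embodiment a webpage training method is provided. It is described here as applied to the server in the above application environment and includes the following steps:
Step S210: acquire a training webpage set with manually labeled categories, and generate webpage vectors of the webpages in the training webpage set.
Specifically, the number of webpages in the training webpage set can be customized as needed. For the trained webpage classification model to be more accurate, the set should contain sufficiently many webpages belonging to different categories, and sufficiently many categories. All webpages in the training webpage set have manually labeled categories; for example, mp3.baidu.com is manually labeled as music and youku.com is manually labeled as video. Webpage vectors may be generated for all webpages in the training webpage set, or for a subset selected by a preset condition, for example by taking the different manually labeled categories and selecting a preset number of webpages from each category.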
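The step of generating the webpage vectors of the webpages in the training webpage set specifically includes:
Step S211: acquire the valid historical query strings of a first training webpage in the training webpage set, and segment the valid historical query strings into words.
Specifically, if the first training webpage was returned as a search result for a first query string and the user clicked into it, the first query string is a valid historical query string of the first training webpage. If the first training webpage was returned as a search result for a second query string but the user did not click into it, the second query string is not a valid historical query string of the first training webpage. The number of valid historical query strings can be customized as needed but must be large enough for the training result to be useful; for example, all valid historical query strings of the first training webpage within a preset time period are acquired, where the preset time period may be a period close to the current time. The valid historical query strings are segmented into words, and each query string is represented by its word segments; for example, "周杰伦的歌" is segmented into "周杰伦" and "歌". The purpose of segmentation is to represent the webpage better. Representing the webpage directly by whole query strings makes the data too sparse: "周杰伦的歌" and "周杰伦的歌曲" are two different query strings, but segmentation yields "周杰伦", "歌" and "周杰伦", "歌曲", which share the word segment "周杰伦", so the similarity between the query strings increases.
Step S212: acquire the valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings.
Specifically, if segmenting the valid historical query strings yields 30 occurrences of the word segment "周杰伦", the valid count of "周杰伦" is 30. The larger the valid count of a word segment, the more often the current training webpage was entered through query strings containing that word segment.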
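Step S213: calculate the segment weight of each word segment according to its valid count.
Specifically, the segment weight increases with the valid count; the specific calculation method can be customized as needed.
In one embodiment, the segment weight W(qi) of word segment qi is calculated according to the formula W(qi) = log(ci + 1), where i is the index of the word segment and ci is the valid count of word segment qi.
Specifically, the log function is smooth and keeps the segment weight W(qi) increasing with the valid count ci, so the segment weight of each word segment can be obtained simply and conveniently.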
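Step S214: generate the webpage vector of the first training webpage according to the word segments and the corresponding segment weights.
Specifically, if the valid historical query strings of the first training webpage yield m word segments, denoted qi with 1 ≤ i ≤ m, and W(qi) is the segment weight of qi, then the webpage vector of the first training webpage can be written as {q1: W(q1), q2: W(q2), …, qm: W(qm)}. The generated webpage vector represents the bag-of-words feature of the first training webpage. For example, the webpage vector of the training webpage mp3.baidu.com is {周杰伦: 5.4, 歌曲: 3.6, 蔡依林: 3.0, tfboys: 10}. The similarity between different webpages can be calculated from their webpage vectors. If the similarity between a first webpage and a second webpage satisfies a preset condition and the first webpage belongs to a first category, it can be inferred that the second webpage also belongs to the first category. For example, if the cosine similarity between the webpage vectors of mp3.baidu.com and y.qq.com exceeds a preset threshold, then from mp3.baidu.com being a music page it is inferred that y.qq.com is also a music page.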
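Steps S212 to S214 and the cosine comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the queries are assumed to be already segmented into word lists, the natural logarithm is assumed because the text does not specify a base, and the function names `build_page_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def build_page_vector(segmented_queries):
    """Build a sparse webpage vector {word segment: weight}.

    segmented_queries: list of word-segment lists, one per valid
    historical query string (segmentation is assumed done elsewhere).
    """
    # Valid count c_i: total occurrences of each word segment (step S212).
    counts = Counter(seg for query in segmented_queries for seg in query)
    # Segment weight W(q_i) = log(c_i + 1) (step S213); natural log assumed.
    return {seg: math.log(c + 1) for seg, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(seg, 0.0) for seg, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


page = build_page_vector([["周杰伦", "歌"], ["周杰伦", "歌曲"]])
```

Storing the vector as a dictionary keeps it sparse, which suits a vocabulary drawn from query logs where each page sees only a small share of all word segments.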
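Step S215: acquire the other training webpages in the training webpage set, and repeat steps S211 to S214 until the webpage vectors of all target training webpages have been generated.
Specifically, the number of target training webpages can be customized as needed. The target training webpages may be training webpages selected from the training webpage set by a preset rule, or all training webpages in the set may be used directly as target training webpages.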
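Step S220: generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
Specifically, the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors are substituted into a logistic regression model for training to obtain the webpage classification model. In the embodiments of the present invention, the webpage classification model is trained by logistic regression. The logistic regression (LR) model applies a logistic function on top of linear regression, and the webpage classification model trained this way has high accuracy.
Specifically, the webpage classification model is a mathematical model used to classify webpages. Different training methods yield different webpage classification models, and the training method is chosen as needed.
After the webpage classification model has been trained offline by supervised learning, it is used to predict categories when webpages are classified online. In this embodiment, the webpage classification model is generated from a limited number of webpages with manually labeled categories and the generated webpage vectors, and the model then labels webpage categories automatically. Using webpage vectors as training data also avoids crawling the full content of every webpage and converting it to a bag of words, so the data cost of training is low and training is efficient.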
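As a rough sketch of step S220, the block below trains a logistic regression classifier on sparse webpage vectors with plain stochastic gradient descent. It is an assumption-laden illustration only: it handles two categories (label 1 for music, 0 for video), whereas the method covers many categories; the learning rate, epoch count, toy vectors, and the names `train_binary_lr` and `predict_proba` are all hypothetical, and a production system would more likely use an established library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary_lr(samples, epochs=200, lr=0.5):
    """Train on (webpage_vector, label) pairs, label in {0, 1}.

    Returns (weights, bias); weights is a dict keyed by word segment.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, label in samples:
            z = bias + sum(weights.get(s, 0.0) * w for s, w in vec.items())
            err = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * err
            for s, w in vec.items():
                weights[s] = weights.get(s, 0.0) - lr * err * w
    return weights, bias


def predict_proba(model, vec):
    """Probability that the page belongs to the positive category."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(s, 0.0) * w for s, w in vec.items()))


# Toy training set: 1 = music page, 0 = video page (illustrative values).
training = [
    ({"周杰伦": 1.1, "歌曲": 0.7}, 1),
    ({"歌曲": 1.4, "mp3": 0.7}, 1),
    ({"电影": 1.1, "在线观看": 0.7}, 0),
    ({"电视剧": 1.4, "在线观看": 1.1}, 0),
]
model = train_binary_lr(training)
```

Extending this to several categories could be done with one such binary model per category (one-vs-rest), which is one common choice rather than something the text prescribes.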
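In this embodiment, a training webpage set with manually labeled categories is acquired and webpage vectors of its webpages are generated as follows: the valid historical query strings of a first training webpage are acquired and segmented into words; the valid count of each word segment, namely the total number of times it occurs in the valid historical query strings, is acquired; the segment weight of each word segment is calculated from its valid count; and the webpage vector of the first training webpage is generated from the word segments and the corresponding segment weights. A webpage classification model is then generated from the manually labeled categories of the webpages and the corresponding webpage vectors. Training on webpage vectors built from segmented valid historical query strings is low in cost and high in efficiency. Once generated, the webpage classification model can label webpages with categories automatically, so mid-tail and long-tail webpages also obtain categories, which raises the coverage of webpage categories in intent recognition and makes the recognized intents more accurate.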
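In one embodiment, before step S220, the method further includes: acquiring the LDA features of the webpages in the training webpage set.
Specifically, LDA (Latent Dirichlet Allocation, a document topic generation model) is used to cluster text by topic. The LDA features of a webpage can be obtained by feeding the text of the webpage into an LDA model.
Step S220 then becomes: generate the webpage classification model according to the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors.
Specifically, the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors are substituted into a logistic regression model for training to obtain the webpage classification model. In the embodiments of the present invention, the webpage classification model is trained by logistic regression. The logistic regression (LR) model applies a logistic function on top of linear regression, and the webpage classification model trained this way has high accuracy.
Specifically, adding the LDA features of the webpages to the training data brings in the topic of each webpage, so the trained webpage classification model labels webpage categories more accurately.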
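Table 1 shows the precision and recall with which webpage classification models trained by different models and methods classify webpages. It lists only the precision and recall for the novel category and for all categories combined, together with F1, which combines precision and recall as F1 = 2 × precision × recall / (precision + recall). In the table, LDA denotes the document topic generation model, LR+LDA denotes using the LR (logistic regression) model together with LDA features, and LR+BOW+LDA denotes training with the LR model, LDA features, and the webpage vector BOW (Bag of Words) feature together. Here, precision is the share of retrieved items (for example documents or webpages) that are correct, and recall is the share of all correct items that are retrieved. Precision = number of correct items extracted / number of items extracted; recall = number of correct items extracted / number of items in the sample; F1 is the harmonic mean of precision and recall.
Table 1
The table shows that when webpages are classified by a webpage classification model trained by logistic regression on webpage vectors, precision and recall improve in most cases, and F1, which combines precision and recall, is considerably higher than with the other methods.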
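The precision, recall, and F1 definitions used for Table 1 can be made concrete with a short Python sketch. The function name `precision_recall_f1` and the sample labels are hypothetical and chosen only to show the arithmetic; they are not data from the table.

```python
def precision_recall_f1(predicted, actual, category):
    """Per-category precision, recall, and F1 for parallel label lists."""
    true_pos = sum(
        1 for p, a in zip(predicted, actual) if p == category and a == category
    )
    predicted_pos = sum(1 for p in predicted if p == category)
    actual_pos = sum(1 for a in actual if a == category)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted = ["novel", "novel", "video", "novel"]
actual = ["novel", "video", "video", "novel"]
p, r, f1 = precision_recall_f1(predicted, actual, "novel")
```

In this toy case two of the three pages predicted as novel are correct and both true novel pages are found, so precision is 2/3, recall is 1, and F1 is 0.8.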
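In one embodiment, as shown in FIG. 4, a search intent recognition method is provided, including:
Step S310: obtain a query string to be recognized, and obtain the historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string.
Specifically, the query string to be recognized is a query string entered into the search engine at the terminal, and the historical webpage set consists of the webpages clicked through this query string in historical searches.
Step S320: acquire a webpage classification model generated by the webpage training method of any of the above embodiments, and obtain the categories of the webpages in the historical webpage set according to the webpage classification model.
Specifically, the webpage classification model generated by the webpage training method of the above embodiments automatically classifies the webpages in the historical webpage set. For example, if the historical webpage set is {url1, url2, …, urln}, where urli (1 ≤ i ≤ n) denotes each webpage, the categories obtained are url1 ∈ d1, url2 ∈ d2, …, urln ∈ ds, where d1, d2, …, ds denote categories, s is the total number of categories, and the category set is {d1, d2, …, ds}.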
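Step S330: count the number of webpages in each category in the historical webpage set, and calculate the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set.
Specifically, the number of webpages in each category in the historical webpage set is counted, for example category d1 contains t webpages, and the webpages in the historical webpage set are counted to obtain their total number, for example totalurl = n for the historical webpage set {url1, url2, …, urln}. The probability that the query string to be recognized, p-query, belongs to category d1 is then p(d1/p-query) = t / totalurl. The probability p(di/p-query) that p-query belongs to each category is calculated in the same way, giving the intent distribution of the query string, where 1 ≤ i ≤ s and p(di/p-query) expresses how likely the query string is to belong to category di.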
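Step S330 amounts to normalizing per-category counts by the total number of clicked webpages. A minimal Python sketch follows, assuming the webpage classification model has already assigned a category to every page in the historical webpage set; the function name `intent_distribution` and the sample categories are hypothetical.

```python
from collections import Counter


def intent_distribution(page_categories):
    """Return {category: p(category / query)} for one query string.

    page_categories: one predicted category per webpage in the
    historical webpage set of the query string.
    """
    total = len(page_categories)
    if total == 0:
        return {}  # no click history: the distribution is undefined
    counts = Counter(page_categories)
    # p(d_i / p-query) = (pages in d_i) / totalurl
    return {category: n / total for category, n in counts.items()}


dist = intent_distribution(["video", "video", "game", "video"])
```

Returning an empty dictionary for a query with no click history makes the "distribution cannot be computed" case explicit for whatever logic consumes the result.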
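Step S340: obtain the intent recognition result of the query string according to the intent distribution.
Specifically, the category with the highest probability in the intent distribution can be taken as the intent recognition result, or a preset number of categories can be taken in descending order of probability, or the categories whose probability exceeds a preset threshold can be taken. The service corresponding to the application that sent the query string can also be obtained, and the intent recognition result can be derived from the service information together with the intent distribution. For example, if the application that sent the query string is a music service, the music category can be included in the intent recognition result even if it is not the highest-probability category in the intent distribution.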
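The selection options described for step S340 can be sketched as one small function. This is an illustrative reading of the text, not a prescribed algorithm: the parameter names `k`, `threshold`, and `business_category`, and the choice to append the service category at the end of the result, are assumptions.

```python
def select_intents(dist, k=1, threshold=None, business_category=None):
    """Pick intent categories from an intent distribution.

    dist: {category: probability}.
    k: number of top categories to keep when no threshold is given.
    threshold: if set, keep every category with probability above it.
    business_category: optional category of the calling service, added
    to the result even when it is not among the top categories.
    """
    # Sort by descending probability; break ties by name for determinism.
    ranked = sorted(dist, key=lambda c: (-dist[c], c))
    if threshold is not None:
        chosen = [c for c in ranked if dist[c] > threshold]
    else:
        chosen = ranked[:k]
    if business_category is not None and business_category not in chosen:
        chosen.append(business_category)
    return chosen


dist = {"video": 0.6, "game": 0.3, "novel": 0.1}
top = select_intents(dist)
```

With the default arguments the function returns only the highest-probability category; passing `k=2` or `threshold=0.25` on the same distribution keeps both video and game.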
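In this embodiment, a query string to be recognized is obtained together with its historical webpage set, which includes each webpage historically clicked through the query string. A webpage classification model generated by the webpage training method of the above embodiments is acquired and used to obtain the categories of the webpages in the historical webpage set. The number of webpages in each category is counted, the intent distribution of the query string is calculated from these counts and the total number of webpages in the historical webpage set, and the intent recognition result is obtained from the intent distribution. Because the webpage classification model classifies the webpages in the historical webpage set automatically during intent recognition, webpage coverage is larger than with manually labeled categories, mid-tail and long-tail webpages also obtain categories, and the recognized intents are more accurate.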
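In one embodiment, before step S340, the method further includes: acquiring a string classification model, and obtaining a predicted category of the query string according to the string classification model.
Specifically, the string classification model is a mathematical model used to classify query strings. Different training methods yield different string classification models, and the training method is chosen as needed. After the string classification model has been trained offline by supervised learning, it can be used to predict the category of a query string during intent recognition. The predicted category can correct the intent recognition result when the intent distribution of the query string is not distinctive, for example when the distribution contains many categories whose probabilities are all similar and small; in that case a result based on the intent distribution alone is often inaccurate.
Step S340 then becomes: obtain the intent recognition result of the query string according to the intent distribution and the predicted category.
Specifically, the intent recognition result can be decided from the number of categories in the intent distribution and the probability of each category. If the distribution contains many categories, each with a small probability, the predicted category can be used directly as the intent recognition result, or the highest-probability category and the predicted category can be combined to form the result; the specific algorithm can be customized as needed. When no usable intent distribution is available, for example when the query string is rare and its historical webpage set contains zero or very few webpages, so that the intent distribution cannot be calculated or consists of a single category with a probability of 100% that is likely to be wrong, the predicted category of the query string can also be used directly as the intent recognition result.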
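One possible way to combine the intent distribution with the predicted category is sketched below. The text leaves the algorithm open, so every rule here is an assumption: the thresholds `min_pages` and `min_top_prob` are hypothetical tuning values, and combining the top category with the predicted category for a flat distribution is just one of the options the description mentions.

```python
def recognize_intent(dist, predicted_category, total_pages,
                     min_pages=5, min_top_prob=0.4):
    """Combine a click-based intent distribution with a model prediction.

    dist: {category: probability} from click history (may be empty).
    predicted_category: output of the string classification model.
    total_pages: size of the historical webpage set for the query.
    """
    # Rare query: too little click history to trust the distribution.
    if not dist or total_pages < min_pages:
        return [predicted_category]
    top = max(dist, key=dist.get)
    # Flat distribution: many categories, none clearly dominant.
    if dist[top] < min_top_prob:
        if top == predicted_category:
            return [top]
        return [top, predicted_category]
    # Clear distribution: the click history decides on its own.
    return [top]


result = recognize_intent({"video": 0.8, "game": 0.2}, "music", 50)
```

A single clicked page classified as video yields a distribution of 100% video, which this sketch distrusts in favor of the predicted category because the history is below `min_pages`.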
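In one embodiment, as shown in FIG. 5, before the step of acquiring the string classification model, the method further includes:
Step S410: for each historical query string, take the category with the highest intent probability in its intent distribution, and use the query string with that category as a category training query string, where these highest-probability categories cover multiple different categories.
Specifically, intent distributions have been calculated for a large number of historical query strings, and the category with the highest intent probability may differ from one query string to another. Using the query strings labeled with their highest-probability categories as category training query strings, with those categories spanning multiple different categories, ensures the validity of the training data.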
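Step S410 can be read as turning each historical query string into a labeled training example whose label is the top category of its intent distribution. The following Python sketch shows that reading under stated assumptions; the function name `build_training_queries` and the sample queries and probabilities are hypothetical.

```python
def build_training_queries(query_distributions):
    """Label each historical query with its highest-probability category.

    query_distributions: {query string: {category: probability}}.
    Returns a list of (query string, category) training pairs.
    """
    training = []
    for query, dist in query_distributions.items():
        if not dist:
            continue  # no click history, so no label can be derived
        training.append((query, max(dist, key=dist.get)))
    return training


history = {
    "周杰伦的歌曲": {"music": 0.9, "video": 0.1},
    "仙剑奇侠传": {"video": 0.55, "game": 0.45},
    "罕见查询": {},
}
pairs = build_training_queries(history)
```

Because the labels come from click behavior rather than manual annotation, the training set grows with the query log, which matches the scalability point made for this embodiment.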
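Step S420: extract word-based and/or character-based n-gram features from the category training query strings of the different categories, where n is an integer greater than 1 and less than M, and M is the word length or character length of the category training query string currently being processed.
Specifically, if the model is trained directly on the category training query strings, the features are too sparse for short query strings, for example those about 4 words long, and training does not give good results. Extracting word-based and/or character-based n-gram features expands the feature length. The same query string can be processed several times with a different order n each time, where the order is the number of words per feature, and the results of all passes are combined into one feature set. For example, for the category training query string "周杰伦的歌曲", the word-based 1- to 3-gram features are:
1-gram features: 周杰伦 的 歌曲
2-gram features: 周杰伦的 的歌曲
3-gram features: 周杰伦的歌曲
The character-based 1- to 3-gram features are:
1-gram features: 周 杰 伦 的 歌 曲
2-gram features: 周杰 杰伦 伦的 的歌 歌曲
3-gram features: 周杰伦 杰伦的 伦的歌 的歌曲
For a query string 3 words long, extracting the character-based 1- to 3-gram features expands the feature length to 15 or more dimensions, which effectively addresses feature sparseness. Because the training data is large enough, the approach also scales well.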
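The word-based and character-based extraction above uses the same sliding-window procedure over different units. A minimal Python sketch, assuming the query string has already been segmented into words for the word-based case; the function names `ngrams` and `ngram_features` are hypothetical.

```python
def ngrams(units, n):
    """All contiguous n-grams over a list of units (words or characters)."""
    return ["".join(units[i:i + n]) for i in range(len(units) - n + 1)]


def ngram_features(units, max_n=3):
    """Concatenate the 1-gram to max_n-gram features into one list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(units, n))
    return features


words = ["周杰伦", "的", "歌曲"]   # word segmentation assumed done
chars = list("周杰伦的歌曲")        # character units
word_features = ngram_features(words)
char_features = ngram_features(chars)
```

For this six-character query the character-based pass produces 6 + 5 + 4 = 15 features, which is the expansion to 15 dimensions mentioned in the text.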
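Step S430: train a classification model with the n-gram features and the corresponding categories as training data to generate the string classification model.
Specifically, the n-gram features and the corresponding categories are used as training data and substituted into the classification model for training to obtain the string classification model.
Specifically, using the n-gram features and the corresponding categories as training data extends the training data beyond the category training query strings themselves, which can improve both the accuracy and the coverage of the resulting string classification model. In one embodiment, the training features can be mapped to vectors of a fixed dimension (for example 1 million dimensions) to improve training efficiency and reduce invalid training data, which improves the accuracy of the training results; alternatively, features such as the category proportion feature of the webpages clicked for the query string can be added to increase the coverage of the training data. Here, the category proportion feature is the proportion of clicked webpages in each category relative to all webpages, such as the proportion of clicked video webpages among all webpages.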
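Mapping n-gram features to a fixed-dimension vector is commonly done with feature hashing, sketched below in Python. The text does not say how the mapping is performed, so hashing is an assumed technique here; the use of CRC32 as the hash, the count-based values, and the function name `hash_features` are likewise illustrative choices.

```python
import zlib


def hash_features(features, dim=1_000_000):
    """Map string features to a sparse vector of fixed dimension `dim`.

    Returns {index: count}. CRC32 is used because it is stable across
    runs, unlike Python's built-in hash() for strings.
    """
    vector = {}
    for feature in features:
        index = zlib.crc32(feature.encode("utf-8")) % dim
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


hashed = hash_features(["周杰", "杰伦", "伦的", "的歌", "歌曲"])
```

Distinct features can collide on the same index; with a dimension near one million and short query strings such collisions are typically rare, but they are a known trade-off of this technique.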
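Table 2 shows the precision and recall with which string classification models trained by different models and methods classify query strings, together with F1, which combines precision and recall as F1 = 2 × precision × recall / (precision + recall). In the table, NB (Bayesian) denotes the naive Bayes model, "word segmentation" denotes extracting word-based n-gram features, "character feature" denotes extracting character-based n-gram features, and SVM (support vector machine) denotes the support vector machine model.
Table 2
The table shows that a string classification model trained on character-based n-gram features classifies query strings with high precision and recall, and that using character-based and word-based n-gram features together yields still higher precision and recall. With this method, the overall accuracy of intent recognition rises from 54.6% to 85%, a stated improvement of 60%.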
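In one embodiment, as shown in FIG. 6, a webpage training apparatus is provided, including:
A webpage vector generating module 510, configured to acquire a training webpage set with manually labeled categories and to generate webpage vectors of the webpages in the training webpage set. The webpage vector generating module 510 includes:
A word segmentation unit 511, configured to acquire the valid historical query strings of a first training webpage in the training webpage set and to segment the valid historical query strings into words.
A segment weight calculation unit 512, configured to acquire the valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings, and to calculate the segment weight of each word segment according to its valid count.
A webpage vector generating unit 513, configured to generate the webpage vector of the first training webpage according to the word segments and the corresponding segment weights.
A webpage classification model generating module 520, configured to generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
In one embodiment, as shown in FIG. 7, the apparatus further includes:
An LDA feature acquisition module 530, configured to acquire the LDA features of the webpages in the training webpage set.
The webpage classification model generating module 520 is further configured to generate the webpage classification model according to the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors.
In one embodiment, the webpage classification model generating module 520 is further configured to substitute the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors into a logistic regression model for training to obtain the webpage classification model.
In one embodiment, the segment weight calculation unit 512 is further configured to calculate the segment weight W(qi) of word segment qi according to the formula W(qi) = log(ci + 1), where i is the index of the word segment and ci is the valid count of word segment qi.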
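In one embodiment, as shown in FIG. 8, a search intent recognition apparatus is provided, including:
An obtaining module 610, configured to obtain a query string to be recognized and to obtain the historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string.
A webpage category obtaining module 620, configured to acquire a webpage classification model generated by the webpage training apparatus of any of the above embodiments and to obtain the categories of the webpages in the historical webpage set according to the webpage classification model.
An intent recognition module 630, configured to count the number of webpages in each category in the historical webpage set, to calculate the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set, and to obtain the intent recognition result of the query string according to the intent distribution.
In one embodiment, as shown in FIG. 9, the apparatus further includes:
A prediction category module 640, configured to acquire a string classification model and to obtain a predicted category of the query string according to the string classification model.
The intent recognition module 630 is further configured to obtain the intent recognition result of the query string according to the intent distribution and the predicted category.
In one embodiment, as shown in FIG. 10, the apparatus further includes:
A string classification model generating module 650, configured to take, for each historical query string, the category with the highest intent probability in its intent distribution and to use the query string with that category as a category training query string, where these highest-probability categories cover multiple different categories; to extract word-based and/or character-based n-gram features from the category training query strings of the different categories, n being an integer greater than 1 and less than the word length or character length of the query string currently being processed; and to train a classification model with the n-gram features and the corresponding categories as training data to generate the string classification model.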
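A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor of the computer system to carry out the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Accordingly, an embodiment of the present invention further provides a computer storage medium storing a computer program for performing the webpage training method or the search intent recognition method of the embodiments of the present invention.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.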
Claims (14)
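- A webpage training method, the method comprising: acquiring a training webpage set with manually labeled categories, and generating webpage vectors of the webpages in the training webpage set, which specifically comprises: acquiring valid historical query strings of a first training webpage in the training webpage set, and segmenting the valid historical query strings into words; acquiring a valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings; calculating a segment weight of each word segment according to the valid count of each word segment; generating the webpage vector of the first training webpage according to the word segments and the corresponding segment weights; and generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
- The method according to claim 1, wherein before the step of generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors, the method further comprises: acquiring LDA features of the webpages in the training webpage set; and the step of generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors is: generating the webpage classification model according to the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors.
- The method according to claim 1, wherein generating a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors comprises: substituting the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors into a logistic regression model for training to obtain the webpage classification model.
- The method according to claim 1, wherein the step of calculating a segment weight of each word segment according to the valid count of each word segment comprises: calculating the segment weight W(qi) of word segment qi according to the formula W(qi) = log(ci + 1), where i is the index of the word segment and ci is the valid count of word segment qi.
- A search intent recognition method, the method comprising: obtaining a query string to be recognized, and obtaining a historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string; acquiring a webpage classification model generated by the webpage training method according to any one of claims 1 to 4, and obtaining the categories of the webpages in the historical webpage set according to the webpage classification model; counting the number of webpages in each category in the historical webpage set, and calculating the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set; and obtaining the intent recognition result of the query string according to the intent distribution.
- The method according to claim 5, wherein before the step of obtaining the intent recognition result of the query string according to the intent distribution, the method further comprises: acquiring a string classification model, and obtaining a predicted category of the query string according to the string classification model; and the step of obtaining the intent recognition result of the query string according to the intent distribution is: obtaining the intent recognition result of the query string according to the intent distribution and the predicted category.
- The method according to claim 6, wherein before the step of acquiring a string classification model, the method further comprises: taking, for each historical query string, the category with the highest intent probability in its intent distribution, and using the query string with that category as a category training query string, wherein the highest-probability categories cover multiple different categories; extracting word-based and/or character-based n-gram features from the category training query strings of the different categories, n being an integer greater than 1 and less than the word length or character length of the query string currently being processed; and training a classification model with the n-gram features and the corresponding categories as training data to generate the string classification model.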
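- A webpage training apparatus, the apparatus comprising: a webpage vector generating module, configured to acquire a training webpage set with manually labeled categories and to generate webpage vectors of the webpages in the training webpage set, the webpage vector generating module comprising: a word segmentation unit, configured to acquire valid historical query strings of a first training webpage in the training webpage set and to segment the valid historical query strings into words; a segment weight calculation unit, configured to acquire a valid count of each word segment, the valid count being the total number of times the word segment occurs in the valid historical query strings, and to calculate a segment weight of each word segment according to the valid count of each word segment; and a webpage vector generating unit, configured to generate the webpage vector of the first training webpage according to the word segments and the corresponding segment weights; and a webpage classification model generating module, configured to generate a webpage classification model according to the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors.
- The apparatus according to claim 8, wherein the apparatus further comprises: an LDA feature acquisition module, configured to acquire LDA features of the webpages in the training webpage set; and the webpage classification model generating module is further configured to generate the webpage classification model according to the LDA features of the webpages, the manually labeled categories, and the corresponding webpage vectors.
- The apparatus according to claim 8, wherein the webpage classification model generating module is further configured to substitute the manually labeled categories of the webpages in the training webpage set and the corresponding webpage vectors into a logistic regression model for training to obtain the webpage classification model.
- The apparatus according to claim 8, wherein the segment weight calculation unit is further configured to calculate the segment weight W(qi) of word segment qi according to the formula W(qi) = log(ci + 1), where i is the index of the word segment and ci is the valid count of word segment qi.
- A search intent recognition apparatus, the apparatus comprising: an obtaining module, configured to obtain a query string to be recognized and to obtain a historical webpage set corresponding to the query string, the historical webpage set including each webpage historically clicked through the query string; a webpage category obtaining module, configured to acquire a webpage classification model generated by the webpage training apparatus according to any one of claims 8 to 11 and to obtain the categories of the webpages in the historical webpage set according to the webpage classification model; and an intent recognition module, configured to count the number of webpages in each category in the historical webpage set, to calculate the intent distribution of the query string according to the number of webpages in each category and the total number of webpages in the historical webpage set, and to obtain the intent recognition result of the query string according to the intent distribution.
- The apparatus according to claim 12, wherein the apparatus further comprises: a prediction category module, configured to acquire a string classification model and to obtain a predicted category of the query string according to the string classification model; and the intent recognition module is further configured to obtain the intent recognition result of the query string according to the intent distribution and the predicted category.
- The apparatus according to claim 13, wherein the apparatus further comprises: a string classification model generating module, configured to take, for each historical query string, the category with the highest intent probability in its intent distribution and to use the query string with that category as a category training query string, wherein the highest-probability categories cover multiple different categories; to extract word-based and/or character-based n-gram features from the category training query strings of the different categories, n being an integer greater than 1 and less than the word length or character length of the query string currently being processed; and to train a classification model with the n-gram features and the corresponding categories as training data to generate the string classification model.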
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP17735865.2A EP3401802A4 (en) | 2016-01-07 | 2017-01-06 | Webpage training method and device, and search intention identification method and device |
| JP2018516619A JP6526329B2 (ja) | 2016-01-07 | 2017-01-06 | ウェブページトレーニング方法及び装置、検索意図識別方法及び装置 |
| MYPI2017704608A MY188760A (en) | 2016-01-07 | 2017-01-06 | Search intention identifying method and device |
| KR1020177037044A KR102092691B1 (ko) | 2016-01-07 | 2017-01-06 | 웹페이지 트레이닝 방법 및 기기, 그리고 검색 의도 식별 방법 및 기기 |
| US15/843,267 US20180107933A1 (en) | 2016-01-07 | 2017-12-15 | Web page training method and device, and search intention identifying method and device |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610008131.3A CN106951422B (zh) | 2016-01-07 | 2016-01-07 | 网页训练的方法和装置、搜索意图识别的方法和装置 |
| CN201610008131.3 | 2016-01-07 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/843,267 Continuation US20180107933A1 (en) | 2016-01-07 | 2017-12-15 | Web page training method and device, and search intention identifying method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017118427A1 true WO2017118427A1 (zh) | 2017-07-13 |
Family
ID=59273509
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/070504 Ceased WO2017118427A1 (zh) | 2016-01-07 | 2017-01-06 | 网页训练的方法和装置、搜索意图识别的方法和装置 |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US20180107933A1 (zh) |
| EP (1) | EP3401802A4 (zh) |
| JP (1) | JP6526329B2 (zh) |
| KR (1) | KR102092691B1 (zh) |
| CN (1) | CN106951422B (zh) |
| MY (1) | MY188760A (zh) |
| WO (1) | WO2017118427A1 (zh) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107807987A (zh) * | 2017-10-31 | 2018-03-16 | 广东工业大学 | 一种字符串分类方法、系统及一种字符串分类设备 |
| CN108052613A (zh) * | 2017-12-14 | 2018-05-18 | 北京百度网讯科技有限公司 | 用于生成页面的方法和装置 |
| CN110019784A (zh) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | 一种文本分类方法及装置 |
| CN111046662A (zh) * | 2018-09-26 | 2020-04-21 | 阿里巴巴集团控股有限公司 | 分词模型的训练方法、装置、系统和存储介质 |
| CN111161890A (zh) * | 2019-12-31 | 2020-05-15 | 嘉兴太美医疗科技有限公司 | 不良事件和合并用药的关联性判断方法及系统 |
| CN111581388A (zh) * | 2020-05-11 | 2020-08-25 | 北京金山安全软件有限公司 | 一种用户意图识别方法、装置及电子设备 |
| CN113312523A (zh) * | 2021-07-30 | 2021-08-27 | 北京达佳互联信息技术有限公司 | 字典生成、搜索关键字推荐方法、装置和服务器 |
Families Citing this family (25)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170300533A1 (en) * | 2016-04-14 | 2017-10-19 | Baidu Usa Llc | Method and system for classification of user query intent for medical information retrieval system |
| CN107506472B (zh) * | 2017-09-05 | 2020-09-08 | 淮阴工学院 | 一种学生浏览网页分类方法 |
| CN107862027B (zh) * | 2017-10-31 | 2019-03-12 | 北京小度信息科技有限公司 | 检索意图识别方法、装置、电子设备及可读存储介质 |
| CN107967256B (zh) * | 2017-11-14 | 2021-12-21 | 北京拉勾科技有限公司 | 词语权重预测模型生成方法、职位推荐方法及计算设备 |
| CN109948036B (zh) * | 2017-11-15 | 2022-10-04 | 腾讯科技(深圳)有限公司 | 一种分词词项权重的计算方法和装置 |
| KR101881744B1 (ko) * | 2017-12-18 | 2018-07-25 | 주식회사 머니브레인 | 대화형 ai 에이전트 시스템을 위한 계층적 대화 흐름 관리 모델을 자동으로 구축 또는 갱신하는 방법, 컴퓨터 장치 및 컴퓨터 판독가능 기록 매체 |
| RU2711104C2 (ru) * | 2017-12-27 | 2020-01-15 | Общество С Ограниченной Ответственностью "Яндекс" | Способ и компьютерное устройство для определения намерения, связанного с запросом для создания зависящего от намерения ответа |
| RU2693332C1 (ru) | 2017-12-29 | 2019-07-02 | Общество С Ограниченной Ответственностью "Яндекс" | Способ и компьютерное устройство для выбора текущего зависящего от контекста ответа для текущего пользовательского запроса |
| CN108710613B (zh) * | 2018-05-22 | 2022-04-08 | 平安科技(深圳)有限公司 | 文本相似度的获取方法、终端设备及介质 |
| CN109635157B (zh) * | 2018-10-30 | 2021-05-25 | 北京奇艺世纪科技有限公司 | 模型生成方法、视频搜索方法、装置、终端及存储介质 |
| TWI701565B (zh) * | 2018-12-19 | 2020-08-11 | 財團法人工業技術研究院 | 資料標記系統及資料標記方法 |
| CN109408731B (zh) * | 2018-12-27 | 2021-03-16 | 网易(杭州)网络有限公司 | 一种多目标推荐方法、多目标推荐模型生成方法以及装置 |
| CN110162535B (zh) * | 2019-03-26 | 2023-11-07 | 腾讯科技(深圳)有限公司 | 用于执行个性化的搜索方法、装置、设备以及存储介质 |
| CN110503143B (zh) * | 2019-08-14 | 2024-03-19 | 平安科技(深圳)有限公司 | 基于意图识别的阈值选取方法、设备、存储介质及装置 |
| CN110598067B (zh) * | 2019-09-12 | 2022-10-21 | 腾讯音乐娱乐科技(深圳)有限公司 | 词语权重获取方法、装置及存储介质 |
| US11860903B1 (en) * | 2019-12-03 | 2024-01-02 | Ciitizen, Llc | Clustering data base on visual model |
| CN111061835B (zh) * | 2019-12-17 | 2023-09-22 | 医渡云(北京)技术有限公司 | 查询方法及装置、电子设备和计算机可读存储介质 |
| CN111695337B (zh) * | 2020-04-29 | 2024-11-08 | 平安科技(深圳)有限公司 | 智能面试中专业术语的提取方法、装置、设备及介质 |
| CN112200546A (zh) * | 2020-11-06 | 2021-01-08 | 南威软件股份有限公司 | 基于bayes交叉模型的政务审批智能筛查方法 |
| CN114694106A (zh) * | 2020-12-29 | 2022-07-01 | 北京万集科技股份有限公司 | 道路检测区域的提取方法、装置、计算机设备和存储介质 |
| JP7372278B2 (ja) * | 2021-04-20 | 2023-10-31 | ヤフー株式会社 | 算出装置、算出方法及び算出プログラム |
| CN113343028B (zh) * | 2021-05-31 | 2022-09-02 | 北京达佳互联信息技术有限公司 | 意图确定模型的训练方法和装置 |
| CN114661910B (zh) * | 2022-03-25 | 2025-05-06 | 平安科技(深圳)有限公司 | 一种意图识别方法、装置、电子设备及存储介质 |
| CN116248375B (zh) * | 2023-02-01 | 2023-12-15 | 北京市燃气集团有限责任公司 | 一种网页登录实体识别方法、装置、设备和存储介质 |
| CN115827953B (zh) * | 2023-02-20 | 2023-05-12 | 中航信移动科技有限公司 | 用于网页数据抽取的数据处理方法、存储介质及电子设备 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101261629A (zh) * | 2008-04-21 | 2008-09-10 | 上海大学 | 基于自动分类技术的特定信息搜索方法 |
| CN101609450A (zh) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | 基于训练集的网页分类方法 |
| US20130144860A1 (en) * | 2011-09-07 | 2013-06-06 | Cheng Xu | System and Method for Automatically Identifying Classified Websites |
| CN104834640A (zh) * | 2014-02-10 | 2015-08-12 | 腾讯科技(深圳)有限公司 | 网页的识别方法及装置 |
| US20150248715A1 (en) * | 2014-02-28 | 2015-09-03 | Ebay Inc. | Suspicion classifier for website activity |
Family Cites Families (25)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7698626B2 (en) * | 2004-06-30 | 2010-04-13 | Google Inc. | Enhanced document browsing with automatically generated links to relevant information |
| JP4757016B2 (ja) * | 2005-12-21 | 2011-08-24 | 富士通株式会社 | 文書分類プログラム、文書分類装置、および文書分類方法 |
| KR100898458B1 (ko) * | 2007-08-10 | 2009-05-21 | 엔에이치엔(주) | 정보 검색 방법 및 그 시스템 |
| US8103676B2 (en) * | 2007-10-11 | 2012-01-24 | Google Inc. | Classifying search results to determine page elements |
| JP5133946B2 (ja) * | 2009-06-18 | 2013-01-30 | ヤフー株式会社 | 情報検索装置及び情報検索方法 |
| CN101673306B (zh) * | 2009-10-19 | 2011-08-24 | 中国科学院计算技术研究所 | 网页信息查询方法及其系统 |
| US20110208715A1 (en) * | 2010-02-23 | 2011-08-25 | Microsoft Corporation | Automatically mining intents of a group of queries |
| CN102999520B (zh) * | 2011-09-15 | 2016-04-27 | 北京百度网讯科技有限公司 | 一种搜索需求识别的方法和装置 |
| JP5648008B2 (ja) * | 2012-03-19 | 2015-01-07 | 日本電信電話株式会社 | 文書分類方法、装置、及びプログラム |
| CN103838744B (zh) * | 2012-11-22 | 2019-01-15 | 百度在线网络技术(北京)有限公司 | 一种查询词需求分析的方法及装置 |
| CN103020164B (zh) * | 2012-11-26 | 2015-06-10 | 华北电力大学 | 一种基于多语义分析和个性化排序的语义检索方法 |
| CN103049542A (zh) * | 2012-12-27 | 2013-04-17 | 北京信息科技大学 | 一种面向领域的网络信息搜索方法 |
| CN103914478B (zh) * | 2013-01-06 | 2018-05-08 | 阿里巴巴集团控股有限公司 | 网页训练方法及系统、网页预测方法及系统 |
| CN103106287B (zh) * | 2013-03-06 | 2017-10-17 | 深圳市宜搜科技发展有限公司 | 一种用户检索语句的处理方法及系统 |
| US9875237B2 (en) * | 2013-03-14 | 2018-01-23 | Microsfot Technology Licensing, Llc | Using human perception in building language understanding models |
| CN104424279B (zh) * | 2013-08-30 | 2018-11-20 | 腾讯科技(深圳)有限公司 | 一种文本的相关性计算方法和装置 |
| CN103744981B (zh) * | 2014-01-14 | 2017-02-15 | 南京汇吉递特网络科技有限公司 | 一种基于网站内容用于网站自动分类分析的系统 |
| CN103870538B (zh) * | 2014-01-28 | 2017-02-15 | 百度在线网络技术(北京)有限公司 | 针对用户进行个性化推荐的方法、用户建模设备及系统 |
| US9870356B2 (en) * | 2014-02-13 | 2018-01-16 | Microsoft Technology Licensing, Llc | Techniques for inferring the unknown intents of linguistic items |
| CN104268546A (zh) * | 2014-05-28 | 2015-01-07 | 苏州大学 | 一种基于主题模型的动态场景分类方法 |
| CN105159898B (zh) * | 2014-06-12 | 2019-11-26 | 北京搜狗科技发展有限公司 | 一种搜索的方法和装置 |
| CN104778161B (zh) * | 2015-04-30 | 2017-07-07 | 车智互联(北京)科技有限公司 | 基于Word2Vec和Query log抽取关键词方法 |
| CN104820703A (zh) * | 2015-05-12 | 2015-08-05 | 武汉数为科技有限公司 | 一种文本精细分类方法 |
| CN104866554B (zh) * | 2015-05-15 | 2018-04-27 | 大连理工大学 | 一种基于社会化标注的个性化搜索方法及系统 |
| CN104951433B (zh) * | 2015-06-24 | 2018-01-23 | 北京京东尚科信息技术有限公司 | 基于上下文进行意图识别的方法和系统 |
-
2016
- 2016-01-07 CN CN201610008131.3A patent/CN106951422B/zh active Active
-
2017
- 2017-01-06 JP JP2018516619A patent/JP6526329B2/ja active Active
- 2017-01-06 MY MYPI2017704608A patent/MY188760A/en unknown
- 2017-01-06 KR KR1020177037044A patent/KR102092691B1/ko active Active
- 2017-01-06 WO PCT/CN2017/070504 patent/WO2017118427A1/zh not_active Ceased
- 2017-01-06 EP EP17735865.2A patent/EP3401802A4/en not_active Ceased
- 2017-12-15 US US15/843,267 patent/US20180107933A1/en not_active Abandoned
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3401802A4 * |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110019784A (zh) * | 2017-09-29 | 2019-07-16 | 北京国双科技有限公司 | 一种文本分类方法及装置 |
| CN110019784B (zh) * | 2017-09-29 | 2021-10-15 | 北京国双科技有限公司 | 一种文本分类方法及装置 |
| CN107807987A (zh) * | 2017-10-31 | 2018-03-16 | 广东工业大学 | 一种字符串分类方法、系统及一种字符串分类设备 |
| CN108052613A (zh) * | 2017-12-14 | 2018-05-18 | 北京百度网讯科技有限公司 | 用于生成页面的方法和装置 |
| CN108052613B (zh) * | 2017-12-14 | 2021-12-31 | 北京百度网讯科技有限公司 | 用于生成页面的方法和装置 |
| CN111046662A (zh) * | 2018-09-26 | 2020-04-21 | 阿里巴巴集团控股有限公司 | 分词模型的训练方法、装置、系统和存储介质 |
| CN111046662B (zh) * | 2018-09-26 | 2023-07-18 | 阿里巴巴集团控股有限公司 | 分词模型的训练方法、装置、系统和存储介质 |
| CN111161890A (zh) * | 2019-12-31 | 2020-05-15 | 嘉兴太美医疗科技有限公司 | 不良事件和合并用药的关联性判断方法及系统 |
| CN111581388A (zh) * | 2020-05-11 | 2020-08-25 | 北京金山安全软件有限公司 | 一种用户意图识别方法、装置及电子设备 |
| CN111581388B (zh) * | 2020-05-11 | 2023-09-19 | 北京金山安全软件有限公司 | 一种用户意图识别方法、装置及电子设备 |
| CN113312523A (zh) * | 2021-07-30 | 2021-08-27 | 北京达佳互联信息技术有限公司 | 字典生成、搜索关键字推荐方法、装置和服务器 |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20180011254A (ko) | 2018-01-31 |
| CN106951422A (zh) | 2017-07-14 |
| EP3401802A4 (en) | 2019-01-02 |
| KR102092691B1 (ko) | 2020-03-24 |
| EP3401802A1 (en) | 2018-11-14 |
| CN106951422B (zh) | 2021-05-28 |
| JP6526329B2 (ja) | 2019-06-05 |
| MY188760A (en) | 2021-12-29 |
| JP2018518788A (ja) | 2018-07-12 |
| US20180107933A1 (en) | 2018-04-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2017118427A1 (zh) | 网页训练的方法和装置、搜索意图识别的方法和装置 | |
| US11886515B2 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
| CN108009228B (zh) | 一种内容标签的设置方法、装置及存储介质 | |
| CN106649818B (zh) | 应用搜索意图的识别方法、装置、应用搜索方法和服务器 | |
| CN111539197B (zh) | 文本匹配方法和装置以及计算机系统和可读存储介质 | |
| CN104834747B (zh) | 基于卷积神经网络的短文本分类方法 | |
| CN103699625B (zh) | 基于关键词进行检索的方法及装置 | |
| CN114780746A (zh) | 基于知识图谱的文档检索方法及其相关设备 | |
| CN112989208B (zh) | 一种信息推荐方法、装置、电子设备及存储介质 | |
| WO2021051518A1 (zh) | 基于神经网络模型的文本数据分类方法、装置及存储介质 | |
| CN109271624B (zh) | 一种目标词确定方法、装置及存储介质 | |
| WO2018176913A1 (zh) | 搜索方法、装置及非临时性计算机可读存储介质 | |
| CN105893362A (zh) | 获取知识点语义向量的方法、确定相关知识点的方法及系统 | |
| CN110858217A (zh) | 微博敏感话题的检测方法、装置及可读存储介质 | |
| CN113761125B (zh) | 动态摘要确定方法和装置、计算设备以及计算机存储介质 | |
| CN103164428B (zh) | 确定微博与给定实体的相关性的方法和装置 | |
| CN112926308B (zh) | 匹配正文的方法、装置、设备、存储介质以及程序产品 | |
| WO2019064137A1 (en) | Extraction of expression for natural language processing | |
| CN114416998B (zh) | 文本标签的识别方法、装置、电子设备及存储介质 | |
| CN105989047A (zh) | 获取装置、获取方法、训练装置以及检测装置 | |
| US12613921B2 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
| Wandabwa et al. | Topical affinity in short text microblogs | |
| CN115329754A (zh) | 一种文本主题提取方法、装置、设备及存储介质 | |
| CN114706948A (zh) | 新闻处理方法、装置、存储介质以及电子设备 | |
| CN116151258A (zh) | 文本消岐方法、电子设备、存储介质 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17735865 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2018516619 Country of ref document: JP Kind code of ref document: A |
|
| ENP | Entry into the national phase |
Ref document number: 20177037044 Country of ref document: KR Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |


