WO2016066043A1 - Webpage deduplication method and apparatus - Google Patents
Webpage deduplication method and apparatus
- Publication number
- WO2016066043A1 (PCT/CN2015/092510)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- webpage
- feature code
- words
- current
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
Definitions
- The present application relates to the field of Internet technologies, and in particular to a webpage deduplication method and apparatus.
- Webpages can be deduplicated by extracting a feature code from each webpage and comparing the feature codes.
- The existing process of deduplicating webpages by feature code is as follows. First, a period is selected as an anchor point in webpage 1, and a certain number of Chinese characters on both sides of the anchor point are selected as the feature code. Then the feature code of webpage 2 is acquired in the same manner, and the feature codes of the two webpages are compared. If the feature codes are the same, webpage 2 is determined to be a duplicate webpage and is deleted; if the feature codes are not the same, the two webpages are determined to be different, that is, webpage 2 is not a duplicate of webpage 1.
- For example, webpage 1 is a verse of a few dozen words. After reprinting webpage 1, a user adds an interpretation of the verse, hundreds of words or more, according to his own understanding, and the explanatory text contains no period. If the pages are deduplicated based on the feature code alone, the two pages will be judged to be the same page, although they should be treated as different pages. Therefore, the above webpage deduplication method has a low accuracy rate. In addition, the feature code extracted in the above manner is not accurate.
- For example, the original webpage and the reprinted webpage may have different feature codes even though they contain the same body content.
- The present application aims to solve, at least to some extent, one of the technical problems in the related art.
- The first object of the present application is to provide a webpage deduplication method that can greatly improve the accuracy of webpage deduplication and reduce its false positive rate.
- A second object of the present application is to provide a webpage deduplication apparatus.
- The first aspect of the present application provides a webpage deduplication method, including: acquiring webpages of a predetermined type; and, for each webpage, extracting the feature code of the current webpage and the number of words included in the body of the current webpage, and querying whether a preset data table contains the feature code; if the feature code is contained, reading the number of words of the webpage body corresponding to the feature code in the data table, and discarding the current webpage when the difference between the read word count and the extracted word count is within a preset range.
- The webpage deduplication method of the embodiment of the present application acquires webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the number of words included in the body of the current webpage, and queries whether the preset data table contains the feature code. If the feature code is contained, the number of words of the webpage body corresponding to the feature code in the data table is read, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
- Because the embodiment deduplicates webpages based on both the feature code of the webpage and the number of words included in the webpage body, it can greatly improve the accuracy of webpage deduplication and reduce its false positive rate compared with the existing method that relies on the feature code alone.
- The second aspect of the present application provides a webpage deduplication apparatus, including: an obtaining module, configured to acquire webpages of a predetermined type; and a first processing module, configured to, for each webpage, extract the feature code of the current webpage and the number of words included in the body of the current webpage, and query whether a preset data table contains the feature code; if the feature code is contained, read the number of words of the webpage body corresponding to the feature code in the data table, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
- The webpage deduplication apparatus of the embodiment of the present application acquires webpages of a predetermined type through the obtaining module. For each webpage, the first processing module extracts the feature code of the current webpage and the number of words included in the body of the current webpage, and queries whether the preset data table contains the feature code. If the feature code is contained, the number of words of the webpage body corresponding to the feature code in the data table is read, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
- Because webpages are deduplicated based on both the feature code of the webpage and the number of words included in the webpage body, the accuracy of webpage deduplication can be greatly improved and its false positive rate reduced compared with the existing method that relies on the feature code alone.
- FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application.
- FIG. 2 is a first schematic diagram of a web page according to an embodiment of the present application.
- FIG. 3 is a second schematic diagram of a webpage according to an embodiment of the present application.
- FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to an embodiment of the present application.
- FIG. 5 is a schematic structural diagram of a webpage deduplication apparatus according to another embodiment of the present application.
- FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application. As shown in FIG. 1, the webpage deduplication method includes:
- S101: Acquire webpages of a predetermined type, for example, webpages that include body text.
- S102: For each webpage, extract the feature code of the current webpage and the number of words included in the body of the current webpage, and query whether the preset data table contains the feature code. If the feature code is contained, read the number of words of the webpage body corresponding to the feature code in the data table, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
- The paragraphs included in the body of the current webpage may be acquired. For each paragraph, a first preset number of characters is selected at a preset position of the current paragraph; the characters selected from all paragraphs are spliced into a string; and an operation is performed on the string to generate the feature code.
- With the middle position of the current paragraph as the center, a second preset number of characters may be selected from each of the left and right sides of the center, where the second preset number is one half of the first preset number and may be 3 to 8.
- Preferably, the second preset number may be 5 and, correspondingly, the first preset number may be 10.
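The character selection and splicing described above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the class and method names are hypothetical, characters are counted as UTF-16 code units, and paragraphs shorter than the window are simply clipped to their bounds (the text does not say how short paragraphs are handled).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the string that is later hashed into a feature code.
final class ParagraphFeatureString {
    // Takes up to `perSide` characters on each side of the middle of a paragraph.
    // Assumption: the window is clipped when the paragraph is too short.
    static String middleChars(String paragraph, int perSide) {
        int mid = paragraph.length() / 2;
        int start = Math.max(0, mid - perSide);
        int end = Math.min(paragraph.length(), mid + perSide);
        return paragraph.substring(start, end);
    }

    // Splices the selected characters of all paragraphs in paragraph order.
    static String build(List<String> paragraphs, int perSide) {
        StringBuilder sb = new StringBuilder();
        for (String p : paragraphs) {
            sb.append(middleChars(p, perSide));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList("abcdefghijklmnopqrstuvwxyz", "0123456789");
        // perSide = 5 corresponds to a second preset number of 5 (10 characters per paragraph).
        System.out.println(build(paragraphs, 5));
    }
}
```

With a second preset number of 5, each sufficiently long paragraph contributes 10 characters, so the spliced string grows with the number of paragraphs rather than with the length of the page.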
- The selected characters may be spliced into a string in paragraph order. To allow fast and efficient lookup by this string, the string corresponding to each webpage may be operated on to generate a corresponding feature code.
- Specifically, the string corresponding to each webpage may be converted into a hash value by a hash function, and the hash value is used as the feature code of the webpage.
- The hash function that converts a string into the corresponding hash value is described as follows:
- The hash function used in this example multiplies the running hash of the higher-order characters of the string by 31 and adds the next lower-order character. Since the value range of the int type in Java is -2147483648 to 2147483647, the hash covers more than 4 billion values, so it is unlikely that different strings produce the same hash value. That is, the possibility that the same feature code appears on different webpages is small, and the accuracy of the extracted feature code is high.
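The "multiply by 31 and add" rule described above matches the polynomial documented for Java's built-in `String.hashCode`. The following is a minimal sketch of that rule, assuming this is the function meant; the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the "multiply by 31 and add" hash.
final class FeatureCodeHash {
    // Folds the characters left to right: h = 31 * h + c, wrapping on int overflow.
    static int featureCode(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String sample = "ijklmnopqr0123456789";
        // The loop reproduces the documented String.hashCode polynomial.
        System.out.println(featureCode(sample) == sample.hashCode()); // prints true
    }
}
```

Because the result is a 32-bit int, collisions between different strings remain possible, which is one reason the word count is checked in addition to the feature code.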
- When obtaining the feature code of a webpage, the embodiment fully considers the text structure of the webpage: for each paragraph in the body of the webpage, it selects a first preset number of characters at the middle position of the paragraph, splices the characters selected from all paragraphs into a string, and obtains the feature code from the string.
- The feature code extracted in this way is highly accurate. However, when different websites reprint information, they usually add different annotations, edits and the like, and the articles may be abridged, modified, paginated or supplemented differently. Therefore, to further improve the accuracy of classifying identical webpages, the number of words included in the body of each webpage also needs to be extracted along with the feature code.
- The preset data table, for example a hash table, may be queried to determine whether it contains the feature code, that is, whether the hash table contains the hash value. If the hash table contains the hash value, the number of words of the webpage body corresponding to the hash value is read from the hash table and compared with the number of words of the body of the current webpage. When the difference in word count between the two is within a preset range, for example 0-50, the current webpage is a duplicate webpage and is discarded.
- A hash table is a good data structure for organizing feature codes. It accesses a record by mapping the key value, that is, the feature code of a webpage, to a position in the table, which speeds up lookup. A hash table offers efficient search and supports the storage and retrieval of dynamic data.
- For example, assume that the preset range is 0-50, that the hash value of the webpage shown in FIG. 3 and the number of words in its body are already stored in the hash table, and that the feature code of the webpage shown in FIG. 4 has been extracted.
- Querying the hash table shows that the feature code of the webpage shown in FIG. 4 is the same as the stored feature code.
- The number of words in the webpage body corresponding to that hash value, that is, the word count of the webpage shown in FIG. 3, can then be read from the hash table, and the word count of the body of the webpage shown in FIG. 4 can be obtained by calculation.
- The two word counts differ by 18, which is within the preset range. Therefore, the webpages shown in FIG. 4 and FIG. 3 can be regarded as the same webpage, and the webpage shown in FIG. 4 is discarded.
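The range check in the example above can be sketched as a single predicate. This is a minimal sketch with assumptions: the text gives only the difference (18), so the concrete word counts below are hypothetical, and the use of the absolute difference is also an assumption.

```java
// Hypothetical helper for the word-count comparison.
final class WordCountRange {
    // True when the absolute difference of the two word counts lies in [min, max].
    static boolean withinRange(int storedCount, int currentCount, int min, int max) {
        int diff = Math.abs(storedCount - currentCount);
        return diff >= min && diff <= max;
    }

    public static void main(String[] args) {
        // Hypothetical counts chosen so that the difference is 18, as in the example.
        System.out.println(withinRange(500, 518, 0, 50)); // prints true: 18 lies in 0-50
        // A short verse versus a long annotated reprint (hypothetical counts).
        System.out.println(withinRange(60, 400, 0, 50));  // prints false: 340 exceeds 50
    }
}
```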
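- If the data table does not contain the feature code, the extracted feature code and word count of the current webpage are correspondingly written into the data table.
- When the difference between the read word count and the extracted word count is not within the preset range, the extracted feature code and word count of the current webpage are likewise correspondingly written into the data table.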
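The query, discard and write-back steps can be combined into one small class. The following is a minimal sketch, not the applicant's implementation: the class name, the `offer` method and the sample feature codes are hypothetical, the preset range is taken as 0 to `maxDiff`, and keeping a list of word counts per feature code is an assumption, since the text does not say how several pages with the same feature code but different lengths are stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the data table: feature code -> word counts seen so far.
final class WebpageDeduplicator {
    private final Map<Integer, List<Integer>> table = new HashMap<>();
    private final int maxDiff;

    WebpageDeduplicator(int maxDiff) {
        this.maxDiff = maxDiff;
    }

    // Returns true if the page is kept, false if it is discarded as a duplicate.
    boolean offer(int featureCode, int wordCount) {
        List<Integer> counts = table.get(featureCode);
        if (counts != null) {
            for (int stored : counts) {
                if (Math.abs(stored - wordCount) <= maxDiff) {
                    return false; // same feature code and similar length: discard
                }
            }
        }
        // Feature code absent, or word counts too far apart: record the page.
        table.computeIfAbsent(featureCode, k -> new ArrayList<>()).add(wordCount);
        return true;
    }

    public static void main(String[] args) {
        WebpageDeduplicator dedup = new WebpageDeduplicator(50);
        System.out.println(dedup.offer(12345, 500)); // true: first page with this code
        System.out.println(dedup.offer(12345, 518)); // false: difference 18 is within 50
        System.out.println(dedup.offer(12345, 900)); // true: difference exceeds 50
        System.out.println(dedup.offer(67890, 500)); // true: different feature code
    }
}
```

Each page is handled with one hash lookup, so the table is built and the pages are deduplicated in the same pass.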
- In addition to comparing the feature codes of two webpages, the webpage deduplication method of the embodiment of the present application compares the difference in their word counts, thereby effectively reducing the misjudgment of webpages that have the same feature code but a large difference in word count.
- Moreover, because the feature code extraction method used in the embodiment of the present application differs from that used in the prior art, it can effectively reduce the misjudgment of webpages that have the same feature code and a small difference in word count, which further improves the accuracy of webpage deduplication.
- When the word count difference is not within the preset range, the current webpage is not regarded as a duplicate webpage, and the feature code and word count of the current webpage may be added to the hash table.
- For example, a search engine obtains 10 webpages related to a keyword, three of which have the same content. The feature codes of the 10 webpages and the word counts of the corresponding webpage bodies may be extracted separately, and the 10 webpages are deduplicated through the hash table.
- The process of deduplicating webpages is also the process of building the hash table: by the time the hash table is built, the webpages have been deduplicated, and the identical webpages among the 10 webpages have been removed. Therefore, compared with building a retrieval system from feature codes and then querying that retrieval system to deduplicate webpages, deduplicating webpages in this manner improves the efficiency of webpage deduplication.
- Random sampling may be used for evaluation. Assume that 6 people each randomly select 50 duplicate webpages for evaluation; the webpage deduplication results obtained are shown in Table 1.
- The number of errors in Table 1 is the number of identical webpages that this embodiment did not remove. The deduplication accuracy calculated from Table 1 is 96.7%.
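The 96.7% figure can be reproduced from the error counts listed for Table 1 (2, 1, 4, 1, 1 and 1 errors out of 50 webpages each). The sketch below assumes that accuracy is defined as one minus the total number of errors divided by the total number of webpages evaluated; the text does not state the formula, and the class name is hypothetical.

```java
// Hypothetical helper: accuracy as 1 - (total errors / total pages evaluated).
final class DedupAccuracy {
    static double accuracy(int[] pages, int[] errors) {
        int totalPages = 0;
        int totalErrors = 0;
        for (int i = 0; i < pages.length; i++) {
            totalPages += pages[i];
            totalErrors += errors[i];
        }
        return 1.0 - (double) totalErrors / totalPages;
    }

    public static void main(String[] args) {
        int[] pages = {50, 50, 50, 50, 50, 50};
        int[] errors = {2, 1, 4, 1, 1, 1}; // error counts listed for Table 1
        // 10 errors in 300 pages: 290 / 300, about 96.7 percent.
        System.out.println(100 * accuracy(pages, errors));
    }
}
```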
- The deduplication accuracy calculated from Table 2 is 90.37%.
- Comparing the deduplication accuracy in Tables 1 and 2 shows that the deduplication accuracy of this embodiment is higher than that of the feature-code-only method.
- The webpage deduplication method of the embodiment of the present application acquires webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the number of words included in the body of the current webpage, and queries whether the preset data table contains the feature code. If the feature code is contained, the number of words of the webpage body corresponding to the feature code in the data table is read, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
- Because webpages are deduplicated based on both the feature code of the webpage and the number of words included in the webpage body, the accuracy of webpage deduplication can be greatly improved and its false positive rate reduced compared with the existing method that relies on the feature code alone.
- The present application also proposes a webpage deduplication apparatus.
- FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes: an obtaining module 100 and a first processing module 200, where:
- The obtaining module 100 is configured to acquire webpages of a predetermined type. The first processing module 200 is configured to, for each webpage, extract the feature code of the current webpage and the number of words included in the body of the current webpage, and query whether the preset data table contains the feature code; if the feature code is contained, read the number of words of the webpage body corresponding to the feature code in the data table, and discard the current webpage when the difference between the read and extracted word counts is within a preset range.
- The obtaining module 100 may select webpages of a predetermined type, for example webpages that include a body, from among webpages of multiple types.
- The first processing module 200 is specifically configured to: acquire the paragraphs included in the body of the current webpage; for each paragraph, select a first preset number of characters at a preset position of the current paragraph; and splice the characters selected from all paragraphs into a string and operate on the string to generate the feature code.
- The first processing module 200 may convert the string corresponding to each webpage into a hash value by using a hash function, and use the hash value as the feature code of the webpage.
- With the middle position of the current paragraph as the center, the first processing module 200 may select a second preset number of characters from each of the left and right sides of the center, where the second preset number is one half of the first preset number and may be 3 to 8.
- Preferably, the second preset number may be 5 and, correspondingly, the first preset number may be 10.
- The preset data table may be, for example, a hash table.
- A hash table is a good data structure for organizing feature codes. It accesses a record by mapping the key value, that is, the feature code of a webpage, to a position in the table, which speeds up lookup. A hash table offers efficient retrieval and supports the storage and retrieval of dynamic data.
- The foregoing apparatus may further include a second processing module 300, configured to: after the first processing module 200 queries whether the preset data table contains the feature code, write the extracted feature code and word count of the current webpage correspondingly into the data table if the data table does not contain the feature code.
- The foregoing apparatus may further include a third processing module 400, configured to write the extracted feature code and word count of the current webpage correspondingly into the data table when the difference between the read and extracted word counts is not within the preset range.
- The webpage deduplication apparatus of the embodiment of the present application acquires webpages of a predetermined type through the obtaining module. For each webpage, the first processing module extracts the feature code of the current webpage and the number of words included in the body of the current webpage, and queries whether the preset data table contains the feature code. If the feature code is contained, the number of words of the webpage body corresponding to the feature code in the data table is read, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
- Because webpages are deduplicated based on both the feature code of the webpage and the number of words included in the webpage body, the accuracy of webpage deduplication can be greatly improved and its false positive rate reduced compared with the existing method that relies on the feature code alone.
- The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to.
- A feature defined with "first" or "second" may explicitly or implicitly include at least one such feature.
- "A plurality" means at least two, such as two or three, unless specifically defined otherwise.
- A "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
- Examples of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CD-ROM).
- The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise suitably processing it, and then stored in computer memory.
- Portions of the present application can be implemented in hardware, software, firmware, or a combination thereof.
- Multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system.
- For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
- Each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
- The above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
- The integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer-readable storage medium.
- The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like. While the embodiments of the present application have been shown and described above, it is understood that the above-described embodiments are illustrative and are not to be construed as limiting the scope of the present application. The embodiments are subject to changes, modifications, substitutions and variations.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
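Table 1
| User | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Number of webpages | 50 | 50 | 50 | 50 | 50 | 50 |
| Number of errors | 2 | 1 | 4 | 1 | 1 | 1 |

Table 2
| User | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Number of webpages | 50 | 50 | 50 | 50 | 50 | 50 |
| Number of errors | 4 | 2 | 6 | 2 | 3 | 2 |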
Claims (12)
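- A webpage deduplication method, characterized by comprising: acquiring webpages of a predetermined type; and, for each webpage, extracting a feature code of the current webpage and the number of words contained in the body of the current webpage, and querying whether a preset data table contains the feature code; if the feature code is contained, reading the number of words of the webpage body corresponding to the feature code in the data table, and discarding the current webpage when the difference between the read word count and the extracted word count is within a preset range.
- The method according to claim 1, characterized in that, after querying whether the preset data table contains the feature code, the method further comprises: if the data table does not contain the feature code, writing the extracted feature code and word count of the current webpage correspondingly into the data table.
- The method according to claim 1, characterized by further comprising: when the difference between the read word count and the extracted word count is not within the preset range, writing the extracted feature code of the current webpage and the word count correspondingly into the data table.
- The method according to any one of claims 1-3, characterized in that extracting the feature code of the current webpage comprises: acquiring the paragraphs contained in the body of the current webpage; for each paragraph, selecting a first preset number of characters at a preset position of the current paragraph; and splicing the characters selected from all paragraphs into a string and performing an operation on the string to generate the feature code.
- The method according to claim 4, characterized in that selecting a first preset number of characters at a preset position of the current paragraph comprises: with the middle position of the current paragraph as the center, selecting a second preset number of characters from each of the left side and the right side of the center, wherein the second preset number is one half of the first preset number, and the second preset number is 3 to 8.
- The method according to claim 5, characterized in that the second preset number is preferably 5.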
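- A webpage deduplication apparatus, characterized by comprising: an obtaining module, configured to acquire webpages of a predetermined type; and a first processing module, configured to, for each webpage, extract a feature code of the current webpage and the number of words contained in the body of the current webpage, and query whether a preset data table contains the feature code; if the feature code is contained, read the number of words of the webpage body corresponding to the feature code in the data table, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
- The apparatus according to claim 7, characterized by further comprising: a second processing module, configured to, after the first processing module queries whether the preset data table contains the feature code, write the extracted feature code and word count of the current webpage correspondingly into the data table if the data table does not contain the feature code.
- The apparatus according to claim 7, characterized by further comprising: a third processing module, configured to write the extracted feature code of the current webpage and the word count correspondingly into the data table when the difference between the read word count and the extracted word count is not within the preset range.
- The apparatus according to any one of claims 7-9, characterized in that the first processing module is specifically configured to: acquire the paragraphs contained in the body of the current webpage; for each paragraph, select a first preset number of characters at a preset position of the current paragraph; and splice the characters selected from all paragraphs into a string and perform an operation on the string to generate the feature code.
- The apparatus according to claim 10, characterized in that the first processing module is specifically configured to: with the middle position of the current paragraph as the center, select a second preset number of characters from each of the left side and the right side of the center, wherein the second preset number is one half of the first preset number, and the second preset number is 3 to 8.
- The apparatus according to claim 11, characterized in that the second preset number is preferably 5.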
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2017522605A JP6672292B2 (ja) | 2014-10-30 | 2015-10-22 | Method and apparatus for removing duplicate webpages |
| SG11201703563SA SG11201703563SA (en) | 2014-10-30 | 2015-10-22 | Methods and apparatus for removing a duplicated web page |
| KR1020177014662A KR102179855B1 (ko) | 2014-10-30 | 2015-10-22 | Method and apparatus for removing duplicate webpages |
| EP15853793.6A EP3214557B1 (en) | 2014-10-30 | 2015-10-22 | Web page deduplication method and apparatus |
| US15/582,322 US10691769B2 (en) | 2014-10-30 | 2017-04-28 | Methods and apparatus for removing a duplicated web page |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201410599140.5 | 2014-10-30 | ||
| CN201410599140.5A CN105630802A (zh) | 2014-10-30 | 2014-10-30 | Webpage deduplication method and apparatus |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/582,322 Continuation US10691769B2 (en) | 2014-10-30 | 2017-04-28 | Methods and apparatus for removing a duplicated web page |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016066043A1 (zh) | 2016-05-06 |
Family
ID=55856595
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2015/092510 Ceased WO2016066043A1 (zh) | 2014-10-30 | 2015-10-22 | Webpage deduplication method and apparatus |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US10691769B2 (zh) |
| EP (1) | EP3214557B1 (zh) |
| JP (1) | JP6672292B2 (zh) |
| KR (1) | KR102179855B1 (zh) |
| CN (1) | CN105630802A (zh) |
| SG (1) | SG11201703563SA (zh) |
| WO (1) | WO2016066043A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10691769B2 (en) | 2014-10-30 | 2020-06-23 | Alibaba Group Holding Limited | Methods and apparatus for removing a duplicated web page |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20180107580A1 (en) * | 2016-10-14 | 2018-04-19 | Microsoft Technology Licensing, Llc | Metadata enabled comparison of user interfaces |
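| CN106527876A (zh) * | 2016-11-10 | 2017-03-22 | 广东工业大学 | Method and system for counting the words of a webpage |
| CN108205810B (zh) * | 2016-12-16 | 2021-08-10 | 富士通株式会社 | Image comparison apparatus and method, and electronic device |
| CN107729343A (zh) * | 2017-07-24 | 2018-02-23 | 上海壹账通金融科技有限公司 | Resource extraction method, computer-readable storage medium, and electronic device |
| CN109033385B (zh) * | 2018-07-27 | 2021-08-27 | 百度在线网络技术(北京)有限公司 | Picture retrieval method, apparatus, server, and storage medium |
| CN109103953B (zh) * | 2018-08-23 | 2021-07-20 | 广州市香港科大霍英东研究院 | Active balancing control method, system, and apparatus for a battery pack |
| KR102722603B1 (ko) * | 2022-11-14 | 2024-10-25 | 고려대학교 산학협력단 | Method and apparatus for determining the similarity of programming code based on a cross-validated ensemble and a filtering strategy |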
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101102316A (zh) * | 2007-06-22 | 2008-01-09 | 腾讯科技(深圳)有限公司 | Method and system for webpage deduplication |
| CN101645082A (zh) * | 2009-04-17 | 2010-02-10 | 华中科技大学 | System for deduplicating similar webpages based on a parallel programming model |
| US7698317B2 (en) * | 2007-04-20 | 2010-04-13 | Yahoo! Inc. | Techniques for detecting duplicate web pages |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6421675B1 (en) * | 1998-03-16 | 2002-07-16 | S. L. I. Systems, Inc. | Search engine |
| KR100406671B1 (ko) * | 2000-07-24 | 2003-11-21 | 주식회사 유니마이다스 | Method for searching for sentence plagiarism and misappropriation |
| US6778986B1 (en) * | 2000-07-31 | 2004-08-17 | Eliyon Technologies Corporation | Computer method and apparatus for determining site type of a web site |
| US6658423B1 (en) * | 2001-01-24 | 2003-12-02 | Google, Inc. | Detecting duplicate and near-duplicate files |
| CA2577841A1 (en) * | 2004-08-19 | 2006-03-02 | Claria Corporation | Method and apparatus for responding to end-user request for information |
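| CN101499098B (zh) * | 2009-03-04 | 2012-07-11 | 阿里巴巴集团控股有限公司 | Method and system for determining and applying a webpage evaluation value |
| KR20100115048A (ko) * | 2009-04-17 | 2010-10-27 | 정원석 | System and method for identifying copied documents |
| KR20120124581A (ko) * | 2011-05-04 | 2012-11-14 | 엔에이치엔(주) | Improved method, apparatus, and computer-readable recording medium for detecting similar documents |
| CN102799647B (zh) * | 2012-06-30 | 2015-01-21 | 华为技术有限公司 | Webpage deduplication method and device |
| CN103559259A (zh) * | 2013-11-04 | 2014-02-05 | 同济大学 | Cloud-platform-based method for eliminating near-duplicate webpages |
| CN103646078B (zh) * | 2013-12-11 | 2017-01-25 | 北京启明星辰信息安全技术有限公司 | Method and apparatus for evaluating Internet publicity monitoring targets |
| CN105630802A (zh) | 2014-10-30 | 2016-06-01 | 阿里巴巴集团控股有限公司 | Webpage deduplication method and apparatus |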
| US11843679B2 (en) * | 2015-07-27 | 2023-12-12 | Wp Company Llc | Automated dependency management based on page components |
2014
- 2014-10-30: CN application CN201410599140.5A, published as CN105630802A (zh), status: Pending
2015
- 2015-10-22: EP application EP15853793.6A, published as EP3214557B1 (en), status: Active
- 2015-10-22: JP application JP2017522605A, published as JP6672292B2 (ja), status: Active
- 2015-10-22: WO application PCT/CN2015/092510, published as WO2016066043A1 (zh), status: Ceased
- 2015-10-22: KR application KR1020177014662A, published as KR102179855B1 (ko), status: Active
- 2015-10-22: SG application SG11201703563SA, published as SG11201703563SA (en), status: unknown
2017
- 2017-04-28: US application US15/582,322, published as US10691769B2 (en), status: Active
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10691769B2 (en) | 2014-10-30 | 2020-06-23 | Alibaba Group Holding Limited | Methods and apparatus for removing a duplicated web page |
Also Published As
| Publication number | Publication date |
|---|---|
| KR102179855B1 (ko) | 2020-11-18 |
| US20170235746A1 (en) | 2017-08-17 |
| JP6672292B2 (ja) | 2020-03-25 |
| US10691769B2 (en) | 2020-06-23 |
| CN105630802A (zh) | 2016-06-01 |
| EP3214557A4 (en) | 2017-09-06 |
| SG11201703563SA (en) | 2017-06-29 |
| JP2017532690A (ja) | 2017-11-02 |
| EP3214557A1 (en) | 2017-09-06 |
| KR20170078777A (ko) | 2017-07-07 |
| EP3214557B1 (en) | 2019-02-20 |
Similar Documents
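| Publication | Publication Date | Title |
|---|---|---|
| WO2016066043A1 (zh) | | Web page deduplication method and apparatus |
| US20240330661A1 (en) | | Techniques for generating and correcting language model outputs |
| CN106649786B (zh) | | Answer retrieval method and apparatus based on deep question answering |
| WO2022121171A1 (zh) | | Similar text matching method and apparatus, electronic device, and computer storage medium |
| JP2016522524A (ja) | | Method and apparatus for detecting synonymous expressions and searching for related content |
| US20170300533A1 (en) | | Method and system for classification of user query intent for medical information retrieval system |
| WO2014000517A1 (zh) | | Recommendation system and method for search input |
| WO2021169186A1 (zh) | | Text duplicate-checking method, electronic device, and computer-readable storage medium |
| US10235350B2 (en) | | Detect annotation error locations through unannotated document segment partitioning |
| CN106874481B (zh) | | Method and system for reading metadata information of a distributed file system |
| US10417285B2 (en) | | Corpus generation based upon document attributes |
| US10664755B2 (en) | | Searching method and system based on multi-round inputs, and terminal |
| WO2020074017A1 (zh) | | Deep-learning-based method and apparatus for screening keywords in medical literature |
| CN106156143A (zh) | | Web page processing apparatus and web page processing method |
| US8862556B2 (en) | | Difference analysis in file sub-regions |
| CN107111618A (zh) | | Linking thumbnails of images to web pages |
| CN113988015A (zh) | | Document structure detection method and apparatus |
| CN103999079A (zh) | | Aligning annotations of fields of documents |
| CN105574004A (zh) | | Web page deduplication method and device |
| KR101545273B1 (ko) | | Duplicate document detection apparatus and method for detecting duplication in big-data text using clustering and hashing |
| JP2010182238A (ja) | | Citation detection device, source document database generation device, methods therefor, program, and recording medium |
| US20130144799A1 (en) | | Computing device and method for extracting patent rejection information |
| KR102075709B1 (ko) | | Content search method and apparatus |
| CN109947947A (zh) | | Text classification method, apparatus, and computer-readable storage medium |
| CN107644049B (zh) | | Search index generation method and server using the method |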
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15853793; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2017522605; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWE | WIPO information: entry into national phase | Ref document number: 11201703563S; Country of ref document: SG |
| | ENP | Entry into the national phase | Ref document number: 20177014662; Country of ref document: KR; Kind code of ref document: A |
| | REEP | Request for entry into the European phase | Ref document number: 2015853793; Country of ref document: EP |