WO2016066043A1 - Webpage deduplication method and apparatus - Google Patents

Webpage deduplication method and apparatus

Info

Publication number
WO2016066043A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
feature code
words
current
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2015/092510
Other languages
English (en)
French (fr)
Inventor
唐小棚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to JP2017522605A (JP6672292B2)
Priority to SG11201703563SA
Priority to KR1020177014662A (KR102179855B1)
Priority to EP15853793.6A (EP3214557B1)
Publication of WO2016066043A1
Priority to US15/582,322 (US10691769B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a webpage deduplication method and apparatus.
  • a webpage can be deduplicated by selecting a feature code in the webpage and comparing feature codes.
  • the existing process of deduplicating webpages by feature code is as follows: first, a period in webpage 1 is selected as an anchor point, and a certain number of Chinese characters on both sides of the anchor point are taken as the feature code. Then the feature code of webpage 2 is obtained in the same manner, and the feature codes of the two webpages are compared. If they are the same, webpage 2 is judged to be a duplicate and is deleted; if they differ, the two webpages are judged to be different, that is, webpage 2 is not a duplicate of webpage 1.
  • deduplication based only on the feature code easily misjudges webpages that share a feature code but differ in actual content. For example, webpage 1 is a verse of a few dozen words; a user reprints webpage 1 and adds an interpretation of several hundred words or more that contains no period. With feature-code-only deduplication the two pages are judged to be the same page, although they should be different pages. The accuracy of this deduplication method is therefore low. In addition, the feature code extracted in this manner is not accurate.
  • for example, if a user adds a period in a caption or editorial note of the reprinted webpage, the original webpage and the reprinted webpage get different feature codes and are judged to be different webpages.
  • in fact, the original webpage and the reprinted webpage may contain the same body content.
  • the present application aims to solve at least one of the technical problems in the related art to some extent.
  • the first object of the present application is to provide a webpage deduplication method, which can greatly improve the accuracy of webpage deduplication and reduce the false positive rate of webpage deduplication.
  • a second object of the present application is to provide a web page deduplication device.
  • the first aspect of the present application provides a webpage deduplication method, including: acquiring webpages of a predetermined type; and, for each webpage, extracting the feature code of the current webpage and the number of words in the body of the current webpage, querying whether a preset data table contains the feature code, and, if it does, reading from the data table the word count of the webpage body corresponding to the feature code and discarding the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  • the webpage deduplication method of the embodiment of the present application acquires webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
  • the embodiment deduplicates webpages based on both the feature code and the number of words in the webpage body. Compared with the existing method of deduplicating webpages based only on the feature code, this can greatly improve deduplication accuracy and reduce the misjudgment rate.
  • the second aspect of the present application provides a webpage deduplication apparatus, including: an obtaining module, configured to acquire webpages of a predetermined type; and a first processing module, configured to extract, for each webpage, the feature code of the current webpage and the number of words in the body of the current webpage, query whether a preset data table contains the feature code, and, if it does, read from the data table the word count of the webpage body corresponding to the feature code and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  • in the webpage deduplication apparatus of the embodiment of the present application, the obtaining module acquires webpages of a predetermined type, and the first processing module extracts, for each webpage, the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
  • the webpage is deduplicated based on both the feature code and the number of words in the webpage body. Compared with the existing method of deduplicating webpages based only on the feature code, this can greatly improve deduplication accuracy and reduce the misjudgment rate.
  • FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application.
  • FIG. 2 is a first schematic diagram of a web page according to an embodiment of the present application.
  • FIG. 3 is a second schematic diagram of a webpage according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a webpage deduplication apparatus according to another embodiment of the present application.
  • FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application. As shown in FIG. 1, the webpage deduplication method includes:
  • S101: acquire webpages of a predetermined type. After a user searches with a keyword in a search engine, multiple webpages related to the keyword are obtained, and webpages of the predetermined type, for example webpages that include body text, are selected from them.
  • S102: for each webpage, extract the feature code of the current webpage and the number of words in the body of the current webpage, and query whether the preset data table contains the feature code. If it does, read from the data table the word count of the webpage body corresponding to the feature code, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  • the paragraphs in the body of the current webpage may be acquired; for each paragraph, a first preset number of characters is selected at a preset position of the paragraph, the characters selected from all paragraphs are spliced into a string, and an operation is performed on the string to generate the feature code.
  • preferably, the middle of the current paragraph is taken as the center and a second preset number of characters is selected on each of the left and right sides of the center, where the second preset number is one half of the first preset number and may be 3 to 8.
  • to improve deduplication capability and reduce the storage occupied by feature codes, the second preset number may be 5 and, correspondingly, the first preset number may be 10.
  • the characters obtained from each paragraph may be spliced into one string in paragraph order.
  • so that duplicate webpages can be found quickly through this string, an operation may be performed on the string corresponding to each webpage to generate the corresponding feature code.
  • specifically, a hash function may convert the string corresponding to each webpage into a hash value, and the hash value is used as the feature code of the webpage.
  • in Java, for example, a hash function converts a string into its hash value; the code is listed in the Description.
  • the hash function used in this example multiplies the running hash of the preceding characters by 31 and adds the next character. Since the int type in Java ranges from -2147483648 to 2147483647, covering more than 4 billion values, different strings will rarely yield the same hash value. That is, different webpages are unlikely to share a feature code, and the extracted feature codes are highly accurate.
  • when obtaining the feature code of a webpage, the embodiment takes the text structure of the webpage fully into account: for each paragraph in the body, a first preset number of characters is selected at the middle of the paragraph, the characters selected from all paragraphs are spliced into a string, and the feature code is derived from that string.
  • compared with the existing approach of using a period as the anchor point, the feature code obtained in this embodiment is more accurate. When different websites reprint information, they usually add different captions, editorial notes and the like, and may also abridge, modify, paginate or add to the article in different ways. Therefore, to further improve the accuracy of classifying identical webpages, the number of words in the body of each webpage also needs to be extracted along with its feature code.
  • after the feature code and body word count of the current webpage are extracted, the preset data table, for example a hash table, may be queried for the feature code, that is, for the hash value. If the hash table contains the hash value, the word count of the webpage body corresponding to that hash value is read from the hash table and compared with the body word count of the current webpage. When the difference between the two is within a preset range, for example 0-50, the current webpage is regarded as a duplicate and is discarded.
  • the hash table is a good data structure for organizing feature codes. It accesses a record by mapping the key value, that is, the feature code of the webpage, to a position in the table, which speeds up lookup. A hash table offers efficient retrieval and supports storing and retrieving dynamic data.
  • for example, with a preset range of 0-50, assume that the hash value of the webpage shown in FIG. 2 and the number of words in its body are already stored in the hash table, and that the feature code and body word count of the webpage shown in FIG. 3 have been extracted.
  • querying the hash table then shows that the feature code of the webpage shown in FIG. 3 is the same as the feature code of the webpage shown in FIG. 2.
  • the word count stored for that hash value, that is, the body word count of the webpage shown in FIG. 2, may be read from the hash table and compared with the body word count of the webpage shown in FIG. 3.
  • the difference between the body word counts of the two webpages is 18, which is within the preset range. Therefore the webpages shown in FIG. 3 and FIG. 2 can be regarded as the same webpage, and the webpage shown in FIG. 3 is discarded.
  • if the data table does not contain the feature code, the extracted feature code and word count of the current webpage are written into the data table as a corresponding pair.
  • when the difference between the read and extracted word counts is not within the preset range, the extracted feature code and word count of the current webpage are likewise written into the data table.
  • in addition to comparing the feature codes of two webpages, the webpage deduplication method of the embodiment of the present application also compares the difference between their word counts, which effectively reduces misjudgment of webpages that have the same feature code but very different word counts.
  • because the feature code extraction used in the embodiment differs from the extraction used in the prior art, misjudgment of webpages that have the same feature code and only slightly different word counts is also effectively reduced, which further improves deduplication accuracy.
  • for example, if the preset range is 0-50, the current webpage has 4900 words and the word count stored for its feature code is 5000, the absolute difference is 100, which is outside the preset range, so the current webpage is not regarded as a duplicate.
  • in that case, the body word count of the current webpage may be added to the hash table.
  • as another example, a search engine obtains 10 webpages related to a keyword, three of which have the same content. The feature code and body word count of each of the 10 webpages may be extracted, and the 10 webpages deduplicated through the hash table.
  • the process of deduplicating the webpages is also the process of building the hash table: when the hash table is complete, deduplication is finished, and the identical webpages among the 10 have been removed. Compared with building a retrieval system from feature codes and then querying it to deduplicate webpages, this approach improves deduplication efficiency.
  • to evaluate accuracy, manual random sampling can be used: assume 6 people each randomly select 50 duplicate webpages for evaluation, with the corresponding deduplication results shown in Table 1.
  • the error count in Table 1 is the number of identical webpages that this embodiment failed to remove; by calculation, the deduplication accuracy in Table 1 is 96.7%.
  • with the existing feature-code-only method (Table 2), the deduplication accuracy obtained by calculation is 90.37%.
  • comparing the accuracies of Tables 1 and 2 shows that the deduplication accuracy of this embodiment is higher than that of the feature-code-only method.
  • the webpage deduplication method of the embodiment of the present application acquires webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
  • because webpages are deduplicated based on both the feature code and the number of words in the webpage body, this method, compared with the existing feature-code-only method, can greatly improve deduplication accuracy and reduce the misjudgment rate.
  • the present application also proposes a webpage deduplication device.
  • FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes: an obtaining module 100 and a first processing module 200, where:
  • the obtaining module 100 is configured to acquire webpages of a predetermined type; and the first processing module 200 is configured to extract, for each webpage, the feature code of the current webpage and the number of words in the body of the current webpage, query whether the preset data table contains the feature code, and, if it does, read from the data table the word count of the webpage body corresponding to the feature code and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  • the obtaining module 100 may select a predetermined type of webpage from among a plurality of types of webpages, for example, a webpage including a webpage body.
  • the first processing module 200 is specifically configured to: acquire the paragraphs in the body of the current webpage; select, for each paragraph, a first preset number of characters at a preset position of the paragraph; and splice the characters selected from all paragraphs into a string and perform an operation on the string to generate the feature code.
  • the first processing module 200 may use a hash function to convert the string corresponding to each webpage into a hash value, and use the hash value as the feature code of the webpage.
  • the first processing module 200 may take the middle of the current paragraph as the center and select a second preset number of characters on each of the left and right sides of the center, where the second preset number is one half of the first preset number and may be 3 to 8.
  • to improve deduplication capability and reduce the storage occupied by feature codes, the second preset number may be 5 and, correspondingly, the first preset number may be 10.
  • the preset data table may be, for example, a hash table.
  • the hash table is a good data structure for organizing feature codes. It accesses a record by mapping the key value, that is, the feature code of the webpage, to a position in the table, which speeds up lookup. A hash table offers efficient retrieval and supports storing and retrieving dynamic data.
  • the foregoing apparatus may further include a second processing module 300, configured to: after the first processing module 200 queries whether the preset data table contains the feature code, write the extracted feature code and word count of the current webpage into the data table as a corresponding pair if the data table does not contain the feature code.
  • the foregoing apparatus may further include a third processing module 400, configured to: when the difference between the read and extracted word counts is not within the preset range, write the extracted feature code and word count of the current webpage into the data table as a corresponding pair.
  • for example, if the preset range is 0-50 and the word counts of the two webpages differ by 120, the third processing module 400 writes the extracted feature code and word count of the current webpage into the data table.
  • in the webpage deduplication apparatus of the embodiment of the present application, the obtaining module acquires webpages of a predetermined type, and the first processing module extracts, for each webpage, the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range.
  • the webpage is deduplicated based on both the feature code and the number of words in the webpage body. Compared with the existing method of deduplicating webpages based only on the feature code, this can greatly improve deduplication accuracy and reduce the misjudgment rate.
  • first and second are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated.
  • features defining “first” or “second” may include at least one of the features, either explicitly or implicitly.
  • the meaning of "a plurality" is at least two, such as two, three, etc., unless specifically defined otherwise.
  • a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in conjunction with, an instruction execution system, apparatus, or device.
  • more specific examples of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CDROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the paper or other medium can be optically scanned and then edited, interpreted or, if necessary, processed in another suitable manner to obtain the program electronically, after which it is stored in computer memory.
  • portions of the application can be implemented in hardware, software, firmware, or a combination thereof.
  • multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), etc.
  • each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like. While the embodiments of the present application have been shown and described above, it is understood that the above-described embodiments are illustrative and are not to be construed as limiting the scope of the present application. The embodiments are subject to variations, modifications, substitutions and variations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

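The present application proposes a webpage deduplication method and apparatus. The method includes: acquiring webpages of a predetermined type; and, for each webpage, extracting the feature code of the current webpage and the number of words in its body, and querying whether a preset data table contains the feature code; if it does, reading from the data table the word count of the webpage body corresponding to the feature code, and discarding the current webpage when the difference between the read and extracted word counts is within a preset range. The method and apparatus of the embodiments deduplicate webpages based on both the feature code and the body word count, which can greatly improve deduplication accuracy and reduce the misjudgment rate.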

Description

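Webpage deduplication method and apparatus
Technical Field
The present application relates to the field of Internet technologies, and in particular to a webpage deduplication method and apparatus.
Background
With the development of Internet technology, the Internet has become an important source of all kinds of information, but much of the information on it is duplicated. Among the billions to tens of billions of webpages that currently exist, a large number carry duplicate information, and these duplicate webpages greatly complicate information processing. Deduplicating webpages is therefore necessary.
At present, webpages can be deduplicated by selecting a feature code in each webpage and comparing feature codes. The existing process is as follows: first, a period in webpage 1 is selected as an anchor point, and a certain number of Chinese characters on both sides of the anchor point are taken as the feature code. Then the feature code of webpage 2 is obtained in the same manner and the two feature codes are compared. If they are the same, webpage 2 is judged to be a duplicate and is deleted; if they differ, the two webpages are judged to be different, that is, webpage 2 is not a duplicate of webpage 1.
The problem with the existing feature-code-only approach is that it easily misjudges webpages whose feature codes are the same but whose actual content differs. For example, webpage 1 is a verse of a few dozen words; after reprinting webpage 1, a user adds an interpretation of several hundred words or more, and the interpretation contains no period. If deduplication relies only on the feature code, the two webpages are judged to be the same, although they should be different webpages. The deduplication accuracy of this approach is therefore low. In addition, the feature code extracted in this way is not accurate. For example, if a user adds a period in a caption or editorial note of the reprinted webpage, then the feature codes of the original and the reprinted webpage differ when extracted in the existing way, and the two are judged to be different webpages. In fact, the body content of the original and the reprinted webpage may be the same.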
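Summary
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present application is to provide a webpage deduplication method that can greatly improve deduplication accuracy and reduce the misjudgment rate.
A second object of the present application is to provide a webpage deduplication apparatus.
To achieve the above objects, an embodiment of the first aspect of the present application provides a webpage deduplication method, including: acquiring webpages of a predetermined type; and, for each webpage, extracting the feature code of the current webpage and the number of words in the body of the current webpage, querying whether a preset data table contains the feature code, and, if it does, reading from the data table the word count of the webpage body corresponding to the feature code and discarding the current webpage when the difference between the read word count and the extracted word count is within a preset range.
The webpage deduplication method of the embodiment acquires webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range. Because the embodiment deduplicates webpages based on both the feature code and the body word count, it can, compared with the existing feature-code-only approach, greatly improve deduplication accuracy and reduce the misjudgment rate.
To achieve the above objects, an embodiment of the second aspect of the present application provides a webpage deduplication apparatus, including: an obtaining module, configured to acquire webpages of a predetermined type; and a first processing module, configured to extract, for each webpage, the feature code of the current webpage and the number of words in the body of the current webpage, query whether a preset data table contains the feature code, and, if it does, read from the data table the word count of the webpage body corresponding to the feature code and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
In the webpage deduplication apparatus of the embodiment, the obtaining module acquires webpages of a predetermined type, and the first processing module extracts, for each webpage, the feature code of the current webpage and the number of words in its body, and queries whether the preset data table contains the feature code. If it does, the word count of the webpage body corresponding to the feature code is read from the data table, and the current webpage is discarded when the difference between the read and extracted word counts is within a preset range. Because the embodiment deduplicates webpages based on both the feature code and the body word count, it can, compared with the existing feature-code-only approach, greatly improve deduplication accuracy and reduce the misjudgment rate.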
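Brief Description of the Drawings
FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application.
FIG. 2 is a first schematic diagram of a webpage according to an embodiment of the present application.
FIG. 3 is a second schematic diagram of a webpage according to an embodiment of the present application.
FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a webpage deduplication apparatus according to another embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, intended to explain the present application, and are not to be construed as limiting it.
The webpage deduplication method and apparatus of the embodiments of the present application are described below with reference to the drawings.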
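FIG. 1 is a flowchart of a webpage deduplication method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
S101: acquire webpages of a predetermined type.
Specifically, after a user searches a search engine with a keyword, multiple webpages related to the keyword are obtained, and webpages of the predetermined type, for example webpages containing body text, are selected from them.
S102: for each webpage, extract the feature code of the current webpage and the number of words in the body of the current webpage, and query whether the preset data table contains the feature code. If it does, read from the data table the word count of the webpage body corresponding to the feature code, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
Specifically, after webpages of the predetermined type, for example webpages containing body text, are obtained, the paragraphs in the body of the current webpage may be acquired for each webpage. For each paragraph, a first preset number of characters is selected at a preset position of the paragraph; the characters selected from all paragraphs are spliced into a string, and an operation is performed on the string to generate the feature code.
Preferably, for each paragraph, the middle of the paragraph is taken as the center and a second preset number of characters is selected on each of the left and right sides of the center, where the second preset number is one half of the first preset number and may be 3 to 8. To improve deduplication capability and reduce the storage occupied by feature codes, the second preset number is preferably 5 and, correspondingly, the first preset number is 10.
For example, suppose a webpage has only one paragraph containing 1000 Chinese characters. Then 5 Chinese characters are taken to the left and 5 to the right of the middle of the paragraph, 10 in total. Under the N-gram definition from information theory, these 10 characters form a 10-gram. Counting 6763 Chinese characters, the probability that these 10 characters repeat is about 1/6763^10; that is, the probability that the feature code of this webpage repeats is about 1/6763^10. Taking 10 characters from the middle of each paragraph therefore effectively keeps webpage feature codes distinct and improves the accuracy of feature-code computation.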
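The estimate above can be written out as a short worked calculation. This is only a sketch under an idealized assumption that is not stated in the source: each of the 10 characters is drawn independently and uniformly from an alphabet of 6763 Chinese characters. Real text is far from uniform, so the true repetition probability is higher.

```latex
P(\text{two 10-character samples coincide})
  \approx \left(\frac{1}{6763}\right)^{10}
  = 6763^{-10}
  = 10^{-10\,\log_{10} 6763}
  \approx 10^{-38.3}
```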
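It should be noted that if a paragraph of the webpage has fewer characters than the first preset number, it can be padded with specific characters.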
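The selection and padding steps described above can be sketched in Java as follows. This is a minimal illustration, not the applicant's implementation: the class name `FeatureStringSketch`, the constant names, and the padding character `'#'` are all assumptions (the text only says "specific characters"), and paragraphs are assumed to be already split out of the page body.

```java
import java.util.List;

final class FeatureStringSketch {
    // Second preset number: characters taken on each side of the midpoint (5 in the text).
    static final int HALF = 5;
    // Hypothetical padding character; the source does not say which character is used.
    static final char PAD = '#';

    /** Returns 2 * HALF characters centered on the paragraph midpoint, padding short paragraphs. */
    static String middleChars(String paragraph) {
        int total = 2 * HALF; // first preset number (10 in the text)
        if (paragraph.length() < total) {
            StringBuilder padded = new StringBuilder(paragraph);
            while (padded.length() < total) {
                padded.append(PAD);
            }
            return padded.toString();
        }
        int mid = paragraph.length() / 2;
        // Counts UTF-16 code units, which matches characters for common Chinese text.
        return paragraph.substring(mid - HALF, mid + HALF);
    }

    /** Splices the selected characters of all paragraphs together, in paragraph order. */
    static String featureString(List<String> paragraphs) {
        StringBuilder joined = new StringBuilder();
        for (String p : paragraphs) {
            joined.append(middleChars(p));
        }
        return joined.toString();
    }
}
```

The resulting string is what would then be hashed to obtain the feature code.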
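In addition, after the first preset number of characters, for example 10, is obtained from each paragraph of the current webpage, the characters can be spliced into one string in paragraph order. So that duplicate webpages can be found quickly through this string, an operation can be performed on the string corresponding to each webpage to generate the corresponding feature code. Specifically, a hash function can convert the string corresponding to each webpage into a hash value, and the hash value is used as the feature code of the webpage.
For example, in Java, the code of a hash function that converts a string into its hash value is as follows: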
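// Method excerpt: hash, offset, value and count are fields of the enclosing string class.
public int hashCode() {
    int h = hash;                 // cached hash value; 0 means not yet computed
    if (h == 0) {
        int off = offset;         // index of the first character in value[]
        char val[] = value;       // backing character array
        int len = count;          // number of characters in the string
        for (int i = 0; i < len; i++) {
            h = 31 * h + val[off++];   // running hash times 31, plus the next character
        }
        hash = h;                 // cache the result for later calls
    }
    return h;
}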
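As the code shows, the hash function in this example multiplies the running hash of the preceding characters by 31 and adds the next character. Since the int type in Java ranges from -2147483648 to 2147483647, covering more than 4 billion values, different strings will rarely yield the same hash value. In other words, different webpages are unlikely to share a feature code, so the extracted feature codes are highly accurate.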
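Because the listing above is a method excerpt that depends on the fields of its enclosing class, a standalone version may be easier to experiment with. The sketch below is an illustration only; the class name `PolyHashSketch` and the sample string are assumptions. It applies the same multiply-by-31 recurrence to any `String`, where `int` overflow simply wraps around.

```java
final class PolyHashSketch {
    /** Computes h = 31 * h + c over all characters; int overflow wraps silently. */
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Hypothetical spliced feature string from two paragraphs.
        String feature = "fghijklmnoabc#######";
        System.out.println(polyHash(feature));
        // String.hashCode() is specified with the same formula, so the values agree.
        System.out.println(polyHash(feature) == feature.hashCode());
    }
}
```

In practice `String.hashCode()` can be called directly; the explicit loop is shown only to make the recurrence visible.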
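When obtaining the feature code of a webpage, this embodiment takes the text structure of the webpage fully into account: for each paragraph in the body, a first preset number of characters is selected at the middle of the paragraph, the characters selected from all paragraphs are spliced into a string, and the feature code is derived from that string. Compared with the existing approach of using a period as the anchor point, the feature code obtained this way is more accurate. When different websites reprint information, they usually add different captions, editorial notes and the like, and may also abridge, modify, paginate or add to the article in different ways. Therefore, to further improve the accuracy of classifying identical webpages, the number of words in the body of each webpage also needs to be extracted along with its feature code.
After the feature code and body word count of the current webpage are extracted, the preset data table, for example a hash table, can be queried for the feature code, that is, for the hash value. If the hash table contains the hash value, the word count of the webpage body corresponding to that hash value is read from the hash table and compared with the body word count of the current webpage. When the difference between the two is within a preset range, for example 0-50, the current webpage is regarded as a duplicate and is discarded.
The hash table is a good data structure for organizing feature codes. It accesses a record by mapping the key value, that is, the feature code of the webpage, to a position in the table, which speeds up lookup. A hash table offers efficient retrieval and supports storing and retrieving dynamic data.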
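For example, with a preset range of 0-50, assume the hash table already stores the hash value and body word count of the webpage shown in FIG. 2. After the feature code and body word count of the webpage shown in FIG. 3 are extracted, querying the hash table shows that the feature code of the webpage in FIG. 3 is the same as that of the webpage in FIG. 2. The body word count stored for that hash value, that is, the body word count of the webpage in FIG. 2, is then read from the hash table. By calculation, the difference between the body word counts of the webpages in FIG. 3 and FIG. 2 is 18, which is within the preset range. The webpages shown in FIG. 3 and FIG. 2 can therefore be regarded as the same webpage, and the webpage shown in FIG. 3 is discarded.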
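In this embodiment, if the data table does not contain the feature code, the extracted feature code and word count of the current webpage are written into the data table as a corresponding pair.
In addition, in this embodiment, when the difference between the read and extracted word counts is not within the preset range, the extracted feature code and word count of the current webpage are likewise written into the data table.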
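The lookup, compare, and write-back rules of S102 can be sketched with a `java.util.HashMap`. This is a hedged illustration rather than the patented implementation: the class `DedupTableSketch` and its method names are hypothetical, the preset range is modeled as `0..maxDiff`, and a new word count is assumed to overwrite the stored one for the same feature code, a detail the source does not specify.

```java
import java.util.HashMap;
import java.util.Map;

final class DedupTableSketch {
    // Feature code (hash value) -> word count of the stored webpage body.
    private final Map<Integer, Integer> table = new HashMap<>();
    // Upper bound of the preset range, e.g. 50 for the range 0-50.
    private final int maxDiff;

    DedupTableSketch(int maxDiff) {
        this.maxDiff = maxDiff;
    }

    /** Returns true if the page is judged a duplicate and should be discarded. */
    boolean isDuplicate(int featureCode, int wordCount) {
        Integer stored = table.get(featureCode);
        if (stored != null && Math.abs(stored - wordCount) <= maxDiff) {
            return true; // same feature code and similar length: discard
        }
        // Feature code absent, or word counts differ too much: write the pair.
        // Assumption: the new count replaces any count stored under the same code.
        table.put(featureCode, wordCount);
        return false;
    }
}
```

With `maxDiff = 50`, a page with feature code 42 and 5000 words is stored on first sight; a later page with the same code and 5018 words is discarded, while one with 4900 words is kept and written to the table.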
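Compared with existing deduplication based on feature codes alone, the method of the embodiments of this application compares not only the feature codes of two webpages but also the difference between their word counts, which effectively reduces misjudgment of webpages that share a feature code but differ greatly in word count. Moreover, because the embodiments extract feature codes differently from the prior art, misjudgment of webpages that share a feature code and differ only slightly in word count is also effectively reduced, which further improves deduplication accuracy.
For example, let the preset range be 0-50 and the body word count of the current webpage be 4900. If the feature code of the current webpage is in the hash table and the word count stored for that feature code is 5000, the absolute difference between the two word counts is 100, which is outside the preset range. The current webpage is therefore not considered a duplicate, and the body word count of the current webpage may be added to the hash table.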
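The lookup-and-compare logic described above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the class name DedupSketch, the method isDuplicate, and the choice to keep every word count seen for a given feature code (rather than overwriting the earlier one) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    // Data table: feature code -> body word counts of the pages kept so far.
    private final Map<Integer, List<Integer>> table = new HashMap<>();
    // Upper bound of the preset range, e.g. 50 for the range 0-50.
    private final int maxDiff;

    DedupSketch(int maxDiff) {
        this.maxDiff = maxDiff;
    }

    /** Returns true if the page is judged a duplicate and should be discarded. */
    boolean isDuplicate(int featureCode, int wordCount) {
        List<Integer> counts = table.computeIfAbsent(featureCode, k -> new ArrayList<>());
        for (int stored : counts) {
            if (Math.abs(stored - wordCount) <= maxDiff) {
                return true; // same feature code and a similar word count
            }
        }
        // Feature code absent, or word counts differ too much: record the page.
        counts.add(wordCount);
        return false;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch(50);
        System.out.println(dedup.isDuplicate(42, 5000)); // first page with this code
        System.out.println(dedup.isDuplicate(42, 4900)); // difference of 100
        System.out.println(dedup.isDuplicate(42, 4918)); // difference of 18 from 4900
    }
}
```

Because each call both checks and updates the table, running it once over a batch of pages builds the table and removes duplicates in the same pass.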
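As another example, suppose a search engine obtains 10 webpages related to a keyword, three of which have identical content. The feature code and body word count of each of the 10 webpages may be extracted, and the 10 webpages deduplicated using a hash table. The deduplication process is also the process of building the hash table: once the hash table has been built, deduplication is complete and the duplicate webpages among the 10 have been removed. Compared with building a retrieval system from feature codes and then querying it to deduplicate webpages, this approach improves deduplication efficiency.
Suppose 50,000 webpages have been obtained and deduplicated using the above embodiment. To evaluate the deduplication accuracy of the embodiment, manual random sampling may be used. Suppose 6 people each randomly select 50 duplicate webpages for evaluation; the deduplication results obtained are shown in Table 1.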
Table 1: Webpage deduplication results
Evaluator    1    2    3    4    5    6
Pages       50   50   50   50   50   50
Errors       2    1    4    1    1    1
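The number of errors in Table 1 is the number of identical webpages that this embodiment failed to remove. By calculation, the deduplication accuracy in Table 1 is 96.7%.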
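The Table 1 figure follows from simple arithmetic: 10 errors over 300 sampled pages gives 1 - 10/300, or about 96.7%. A small sketch of that calculation (the class and method names are hypothetical):

```java
class AccuracySketch {
    /** Accuracy = 1 - (total errors / total pages) across all evaluators. */
    static double accuracy(int[] pagesPerEvaluator, int[] errorsPerEvaluator) {
        int pages = 0;
        int errors = 0;
        for (int p : pagesPerEvaluator) {
            pages += p;
        }
        for (int e : errorsPerEvaluator) {
            errors += e;
        }
        return 1.0 - (double) errors / pages;
    }

    public static void main(String[] args) {
        int[] pages = {50, 50, 50, 50, 50, 50};
        int[] errors = {2, 1, 4, 1, 1, 1}; // Table 1
        System.out.println(100 * accuracy(pages, errors));
    }
}
```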
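Correspondingly, if webpages are deduplicated using the existing feature-code-based approach, the deduplication results obtained are shown in Table 2.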
Table 2: Webpage deduplication results
Evaluator    1    2    3    4    5    6
Pages       50   50   50   50   50   50
Errors       4    2    6    2    3    2
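By calculation, the deduplication accuracy in Table 2 is 90.37%. Comparing the accuracies in Table 1 and Table 2 shows that the deduplication accuracy of this embodiment is higher than that of the approach based on feature codes alone.
The webpage deduplication method of the embodiments of this application obtains webpages of a predetermined type and, for each webpage, extracts the feature code of the current webpage and the word count of its body, then queries a preset data table for the feature code. If the table contains the feature code, the body word count corresponding to the feature code is read from the table, and the current webpage is discarded when the difference between the read word count and the extracted word count is within a preset range. Because this embodiment deduplicates webpages using both the feature code and the body word count, it greatly improves deduplication accuracy and reduces the misjudgment rate compared with existing deduplication based on feature codes alone.
To implement the above embodiments, this application further provides a webpage deduplication apparatus.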
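FIG. 4 is a schematic structural diagram of a webpage deduplication apparatus according to one embodiment of this application. As shown in FIG. 4, the apparatus includes an acquisition module 100 and a first processing module 200, where:
The acquisition module 100 is configured to obtain webpages of a predetermined type. The first processing module 200 is configured to, for each webpage, extract the feature code of the current webpage and the word count of its body, and query a preset data table for the feature code; if the table contains the feature code, read the body word count corresponding to the feature code from the table, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
Specifically, where webpages of multiple types are present, the acquisition module 100 may select webpages of the predetermined type from among them, for example webpages that contain body text.
The first processing module 200 is specifically configured to: obtain the paragraphs contained in the body of the current webpage; for each paragraph, select the first preset number of characters at a preset position of the current paragraph; and concatenate the characters selected from all paragraphs into a character string and perform an operation on the string to generate the feature code.
Specifically, the first processing module 200 may use a hash function to convert each webpage's character string into a hash value and use the hash value as the feature code of the webpage.
More specifically, the first processing module 200 may take the middle of the current paragraph as the center and select a second preset number of characters on each of the left and right sides of the center. The second preset number is one half of the first preset number and may be 3 to 8. To improve deduplication capability while reducing the storage occupied by feature codes, the second preset number is preferably 5 and, correspondingly, the first preset number is 10.
Note that if a paragraph of the webpage contains fewer characters than the first preset number, it may be padded with a specific character.
The preset data table may be, for example, a hash table. A hash table is a data structure well suited to organizing feature codes. It accesses a record by mapping the key value, here the feature code of the webpage, to a position in the table, which speeds up lookup. A hash table offers efficient retrieval and supports storing and retrieving dynamic data.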
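In addition, as shown in FIG. 5, the apparatus may further include a second processing module 300, configured to, after the first processing module 200 queries the preset data table for the feature code, write the extracted feature code and word count of the current webpage to the data table in correspondence if the table does not contain the feature code.
In addition, the apparatus may further include a third processing module 400, configured to write the extracted feature code and word count of the current webpage to the data table in correspondence when the difference between the read word count and the extracted word count is not within the preset range.
Specifically, if the difference between the body word count read from the preset data table, for example a hash table, and the word count extracted from the current webpage falls outside the preset range, for example when the preset range is 0-50 and the two webpages differ by 120 words, the third processing module 400 writes the extracted feature code and word count of the current webpage to the data table in correspondence.
In the webpage deduplication apparatus of the embodiments of this application, the acquisition module obtains webpages of a predetermined type, and the first processing module, for each webpage, extracts the feature code of the current webpage and the word count of its body, then queries a preset data table for the feature code. If the table contains the feature code, the body word count corresponding to the feature code is read from the table, and the current webpage is discarded when the difference between the read word count and the extracted word count is within a preset range. Because this embodiment deduplicates webpages using both the feature code and the body word count, it greatly improves deduplication accuracy and reduces the misjudgment rate compared with existing deduplication based on feature codes alone.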
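In this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. Such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine different embodiments or examples described in this specification, and the features of different embodiments or examples.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of this application, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of this application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of this application belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise suitably processing it if necessary, and then stored in a computer memory.
It should be understood that each part of this application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiment methods may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If implemented as a software functional module and sold or used as a standalone product, the integrated module may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc or the like. Although embodiments of this application have been shown and described above, it will be understood that they are exemplary and are not to be construed as limiting this application; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of this application.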

Claims (12)

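  1. A webpage deduplication method, characterized by comprising:
    obtaining webpages of a predetermined type; and
    for each webpage, extracting a feature code of the current webpage and a word count of the body of the current webpage, and querying whether a preset data table contains the feature code; if the data table contains the feature code, reading from the data table the body word count corresponding to the feature code, and discarding the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  2. The method according to claim 1, characterized in that, after the querying whether the preset data table contains the feature code, the method further comprises:
    if the data table does not contain the feature code, writing the extracted feature code and word count of the current webpage to the data table in correspondence.
  3. The method according to claim 1, characterized by further comprising:
    when the difference between the read word count and the extracted word count is not within the preset range, writing the extracted feature code of the current webpage and the word count to the data table in correspondence.
  4. The method according to any one of claims 1-3, characterized in that the extracting a feature code of the current webpage comprises:
    obtaining the paragraphs contained in the body of the current webpage;
    for each paragraph, selecting a first preset number of characters at a preset position of the current paragraph; and
    concatenating the characters selected from all paragraphs into a character string, and performing an operation on the character string to generate the feature code.
  5. The method according to claim 4, characterized in that the selecting a first preset number of characters at a preset position of the current paragraph comprises:
    taking the middle of the current paragraph as the center, and selecting a second preset number of characters on each of the left and right sides of the center, wherein the second preset number is one half of the first preset number, and the second preset number is 3 to 8.
  6. The method according to claim 5, characterized in that the second preset number is preferably 5.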
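  7. A webpage deduplication apparatus, characterized by comprising:
    an acquisition module, configured to obtain webpages of a predetermined type; and
    a first processing module, configured to, for each webpage, extract a feature code of the current webpage and a word count of the body of the current webpage, and query whether a preset data table contains the feature code; if the data table contains the feature code, read from the data table the body word count corresponding to the feature code, and discard the current webpage when the difference between the read word count and the extracted word count is within a preset range.
  8. The apparatus according to claim 7, characterized by further comprising:
    a second processing module, configured to, after the first processing module queries whether the preset data table contains the feature code, write the extracted feature code and word count of the current webpage to the data table in correspondence if the data table does not contain the feature code.
  9. The apparatus according to claim 7, characterized by further comprising:
    a third processing module, configured to write the extracted feature code of the current webpage and the word count to the data table in correspondence when the difference between the read word count and the extracted word count is not within the preset range.
  10. The apparatus according to any one of claims 7-9, characterized in that the first processing module is specifically configured to:
    obtain the paragraphs contained in the body of the current webpage; for each paragraph, select a first preset number of characters at a preset position of the current paragraph; and concatenate the characters selected from all paragraphs into a character string, and perform an operation on the character string to generate the feature code.
  11. The apparatus according to claim 10, characterized in that the first processing module is specifically configured to:
    take the middle of the current paragraph as the center, and select a second preset number of characters on each of the left and right sides of the center, wherein the second preset number is one half of the first preset number, and the second preset number is 3 to 8.
  12. The apparatus according to claim 11, characterized in that the second preset number is preferably 5.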
PCT/CN2015/092510 2014-10-30 2015-10-22 Webpage deduplication method and apparatus Ceased WO2016066043A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2017522605A JP6672292B2 (ja) 2014-10-30 2015-10-22 Method and apparatus for removing duplicate web pages
SG11201703563SA SG11201703563SA (en) 2014-10-30 2015-10-22 Methods and apparatus for removing a duplicated web page
KR1020177014662A KR102179855B1 (ko) 2014-10-30 2015-10-22 Method and apparatus for removing duplicate web pages
EP15853793.6A EP3214557B1 (en) 2014-10-30 2015-10-22 Web page deduplication method and apparatus
US15/582,322 US10691769B2 (en) 2014-10-30 2017-04-28 Methods and apparatus for removing a duplicated web page

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410599140.5 2014-10-30
CN201410599140.5A CN105630802A (zh) 2014-10-30 2014-10-30 Webpage deduplication method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/582,322 Continuation US10691769B2 (en) 2014-10-30 2017-04-28 Methods and apparatus for removing a duplicated web page

Publications (1)

Publication Number Publication Date
WO2016066043A1 true WO2016066043A1 (zh) 2016-05-06

Family

ID=55856595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/092510 Ceased WO2016066043A1 (zh) 2014-10-30 2015-10-22 Webpage deduplication method and apparatus

Country Status (7)

Country Link
US (1) US10691769B2 (zh)
EP (1) EP3214557B1 (zh)
JP (1) JP6672292B2 (zh)
KR (1) KR102179855B1 (zh)
CN (1) CN105630802A (zh)
SG (1) SG11201703563SA (zh)
WO (1) WO2016066043A1 (zh)

Also Published As

Publication number Publication date
KR102179855B1 (ko) 2020-11-18
US20170235746A1 (en) 2017-08-17
JP6672292B2 (ja) 2020-03-25
US10691769B2 (en) 2020-06-23
CN105630802A (zh) 2016-06-01
EP3214557A4 (en) 2017-09-06
SG11201703563SA (en) 2017-06-29
JP2017532690A (ja) 2017-11-02
EP3214557A1 (en) 2017-09-06
KR20170078777A (ko) 2017-07-07
EP3214557B1 (en) 2019-02-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15853793

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017522605

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 11201703563S

Country of ref document: SG

ENP Entry into the national phase

Ref document number: 20177014662

Country of ref document: KR

Kind code of ref document: A

REEP Request for entry into the european phase

Ref document number: 2015853793

Country of ref document: EP