WO2013087012A1 - Method and system for collecting network data - Google Patents
- Publication number
- WO2013087012A1 (PCT/CN2012/086584)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- page
- url
- webpage
- link address
- chapter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/06—Generation of reports
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- The invention belongs to the technical field of information retrieval and data integration, and in particular relates to a method and system for collecting network data.
Background
- Internet literature refers to newly created literary works, literature-like texts, and online artworks containing literary elements that are presented through hypertext links and multimedia, using the Internet as their platform and medium of dissemination; original online works make up the majority. Online literature can be divided into three categories: the first is existing literary works turned into digital resources by electronic scanning or manual input; the second is literary works "published" directly on the Internet; the third is literary works created by computer or generated by computer software and then placed on the Internet, together with "relay novels" written collaboratively, thanks to the openness of the Internet, by dozens of writers or even hundreds of netizens. The second category is the most common.
- The invention provides a method and system for collecting network data that can collect the latest network data in real time.
- One aspect of the present invention provides a method for collecting network data, used for collecting data of network documents respectively related to M topics published on a website, where M is a positive integer. The method includes: configuring, according to the type corresponding to the webpage link address of the network data to be collected, that webpage link address into a queue of the corresponding type, where the webpage link address of the network data to be collected is the link address of a webpage on which the data of the network documents respectively related to the M topics is located; obtaining the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type; and extracting the data of the network document corresponding to a URL according to the uniform resource locator (URL) information corresponding to the webpage source code and the collection depth value of the URL.
- the refresh time interval is set according to an update frequency of the network document respectively related to the M topics, and the webpage link address of the network data to be collected is refreshed based on the refresh time interval.
- Each of the M topics is a work of network literature, and the method further includes configuring the collection depth value of the URL according to the structure of that work.
- The types corresponding to the webpage link address of the network data to be collected include a topic name page, a list page, and a content page. The topic name page is configured for extracting the topic name; the list page is configured for extracting the topic's chapter directory or chapters; the content page is configured for extracting the topic's body content.
- Configuring the webpage link address of the network data to be collected into the queue of the corresponding type is specifically: adding a link address whose type is the topic name page to the topic name page queue; adding a link address whose type is the list page to the list page queue; and adding a link address whose type is the content page to the content page queue.
- Obtaining the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type is specifically: obtaining the webpage source code corresponding to the link address of the topic name page in the topic name page queue.
- Extracting the data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code and the collection depth value of the URL is specifically: if the collection depth value is the first threshold, extracting the name of the topic and the URL corresponding to the name, marking the collection depth value of that URL as the second threshold, and adding it to the list page queue; if the collection depth value is the second threshold, extracting the name of the topic and the URL corresponding to the name, marking the collection depth value of that URL as the third threshold, and adding it to the list page queue.
- Obtaining the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type is specifically: obtaining the webpage source code corresponding to the link address of the list page in the list page queue.
- Extracting the data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code and the collection depth value of the URL is specifically: if the collection depth value is the second threshold, extracting the chapter directory of the topic and the URL corresponding to the chapter directory, marking the collection depth value of that URL as the third threshold, and adding it to the list page queue; if the collection depth value is the third threshold, determining whether the URL corresponding to the webpage source code has a superior URL: if so, extracting the chapter titles of the topic and the URLs of the chapters corresponding to the chapter titles, and adding the chapter URLs to the content page queue; if not, extracting the name of the topic, the chapter titles of the topic, and the URLs of the chapters corresponding to the chapter titles, and adding the chapter URLs to the content page queue.
- Obtaining the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type is specifically: obtaining the webpage source code corresponding to the link address of the content page in the content page queue.
- Extracting the data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code and the collection depth value of the URL is specifically: extracting the chapter title and chapter body content of the topic from the webpage source code, and extracting the chapter ID of the chapter corresponding to the chapter title from the URL corresponding to the webpage source code.
- The first-page link of the chapter body content serves as the unique key under which the content of each page is stored, and an end mark is set when the last page has been collected.
- Another aspect of the present invention provides a system for collecting network data, used for collecting data of network documents respectively related to M topics published on a website, where M is a positive integer. The system includes: a configuration module, configured to configure, according to the type corresponding to the webpage link address of the network data to be collected, that webpage link address into a queue of the corresponding type; a webpage obtaining module, configured to obtain the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type; and a data extraction module, configured to extract the data of the network document corresponding to a URL according to the uniform resource locator (URL) information corresponding to the webpage source code and the collection depth value of the URL.
- The system further includes a refresh module, configured to set a refresh time interval according to the update frequency with which the website publishes the network documents respectively related to the M topics, and to refresh the webpage link addresses of the network data to be collected based on the refresh time interval.
- The types corresponding to the webpage link address of the network data to be collected include a topic name page, a list page, and a content page. The configuration module includes a webpage configuration module, configured to configure the topic name page for extracting the topic name, the list page for extracting the topic's chapter directory or chapters, and the content page for extracting the topic's body content.
- The configuration module further includes a queue configuration module, configured to configure the webpage link address of the network data to be collected into a queue of the corresponding type. The queue configuration module includes: a first allocation unit, configured to assign a link address whose type is the topic name page to the topic name page queue; a second allocation unit, configured to assign a link address whose type is the list page to the list page queue; and a third allocation unit, configured to assign a link address whose type is the content page to the content page queue.
- a network data collection system is used to collect network data.
- the system acquires a link address of the network data and then configures the type of the link address, and places the link address into the corresponding queue according to the type of the link address.
- the source code corresponding to the link address is obtained from the queue, and the information of the network data is extracted according to the corresponding URL information in the source code and the collected depth value of the URL, thereby achieving the technical effect of collecting network data in real time.
- the content merge module is also used, and the network documents belonging to the same topic can be merged, so that the convenient centralized browsing effect can be achieved on the basis of collecting network data in real time.
- FIG. 1 is a flowchart of a collection method according to an embodiment of the present invention.
- FIG. 2 is a detailed flowchart of the collection method of FIG. 1.
- FIG. 3 is a structural diagram of a collection system according to a first embodiment of the present invention.
- FIG. 4 is a structural diagram of a configuration module in an embodiment of the present invention.
- FIG. 5 is a structural diagram of a webpage obtaining module according to an embodiment of the present invention.
- FIG. 6 is a structural diagram of a data extraction module according to an embodiment of the present invention.
- FIG. 7 is a structural diagram of a collection system according to a second embodiment of the present invention.
- FIG. 8 is a structural diagram of a collection system according to a third embodiment of the present invention.
- FIG. 9 is a structural diagram of a collection system according to a fourth embodiment of the present invention.
Detailed description
- An embodiment of the present invention provides a method for collecting network data, used for collecting data of network documents respectively related to M topics published on a website, where M is a positive integer.
- FIG. 1 is a flowchart of the collection method in this embodiment. As shown in FIG. 1, the collection method includes:
- Step 11: According to the type corresponding to the webpage link address of the network data to be collected, configure that webpage link address into the queue of the corresponding type, where the webpage link address of the network data to be collected is the link address of the webpage on which the data of the network documents respectively related to the M topics is located;
- Step 12: Obtain the webpage source code corresponding to the webpage link address of the network data to be collected in the queue of the corresponding type;
- Step 13: Extract the data of the network document corresponding to the URL according to the Uniform Resource Locator (URL) information corresponding to the webpage source code and the collection depth value of the URL.
- The M topics published on the website may be M works of network literature.
- Network literature has a publication structure that differs from that of topics such as web news.
- Online news items are generally single pages, whereas online literary works are generally presented on a website in one of two forms.
- The first resembles a novel-reading website: "literature name -> chapter directory page -> content page of a specific chapter"; some works also have a "volume" level before the chapter directory page. The second resembles the content listing page of a general news website.
- In the second form, the chapters of different literary works are interspersed, but each title marks the chapter within its work, for example "literary work name (5)".
- In this embodiment, according to the structure in which network literary works are published on the website, the data of a network document generally includes the name of the network literary work to which the document belongs, the name of the volume and/or chapter to which the document belongs, and the body content of the document.
- Correspondingly, the types of the link address of the webpage on which the data of the network document is located include: a topic name page, used to extract the name of the network literary work to which the document belongs; a list page, used to extract the chapter directory links and chapter links of the work, where the chapter directory includes the volume directory and the chapter directory; and a content page, used to extract the body content of the topic.
- The link addresses of the webpages on which the data of the M network literary works is located are placed in different queues according to their types. Specifically, a link address whose type is the topic name page is assigned to the topic name page queue; a link address whose type is the list page is assigned to the list page queue; and a link address whose type is the content page is assigned to the content page queue.
- For example, three online literary works, A1, A2, and A3, are published on website A.
- The publication structure of A1 on website A is: literature name -> volume directory -> chapter directory -> content page of a specific chapter;
- The publication structure of A2 on website A is: literature name -> chapter directory -> content page of a specific chapter;
- The publication structure of A3 on website A is: chapter name -> content page of a specific chapter, where the chapter name of A3 is the combination of the name of the work and the chapter number. For example, the chapter name of Chapter 1 of A3 is A3 (1), and the chapter name of Chapter 5 of A3 is A3 (5).
- The link address B1 of the webpage carrying the name of work A1 is placed in the topic name page queue; the link address B2 of the webpage carrying the name of work A2 is placed in the topic name page queue; the link address B3 of the webpage carrying the chapter links of work A3 is placed in the list page queue to await collection.
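The placement of B1, B2, and B3 described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `PageType` enum, the helper names, and the example addresses are assumptions introduced here.

```python
from collections import deque
from enum import Enum


class PageType(Enum):
    TOPIC_NAME = "topic name page"
    LIST = "list page"
    CONTENT = "content page"


def make_queues():
    """Create one FIFO queue per page type."""
    return {page_type: deque() for page_type in PageType}


def enqueue(queues, link_address, page_type):
    """Place a link address into the queue that matches its configured type."""
    queues[page_type].append(link_address)


queues = make_queues()
# B1, B2, B3 follow the example in the text; the addresses are hypothetical.
link_addresses = {
    "B1": "http://site-a.example/a1",
    "B2": "http://site-a.example/a2",
    "B3": "http://site-a.example/a3/chapters",
}
enqueue(queues, link_addresses["B1"], PageType.TOPIC_NAME)
enqueue(queues, link_addresses["B2"], PageType.TOPIC_NAME)
enqueue(queues, link_addresses["B3"], PageType.LIST)
```

After these calls the topic name page queue holds B1 and B2, the list page queue holds B3, and the content page queue is still empty.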
- A timed refresh strategy can be used.
- An adaptive refresh strategy can also be used.
- Under the adaptive strategy, the refresh interval is adjusted automatically according to the frequency with which the website publishes updates to the different network literary works. When a work is detected to have reached its refresh interval, the refreshed webpage link address of the network data to be collected is put into the queue of its corresponding type.
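One way to picture the adaptive refresh strategy is the sketch below. The text does not give a formula, so the mean-gap heuristic, the clamping bounds, and both function names are assumptions made only for illustration.

```python
def adaptive_interval(update_times, min_interval=60.0, max_interval=86400.0):
    """Estimate a refresh interval in seconds as the mean gap between observed updates.

    update_times: timestamps (seconds) at which new chapters were seen.
    The result is clamped to [min_interval, max_interval]; both bounds are hypothetical.
    """
    if len(update_times) < 2:
        return max_interval  # not enough history: fall back to the slowest rate
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return max(min_interval, min(max_interval, mean_gap))


def due_for_refresh(last_refresh, interval, now):
    """Return True when the work has reached its refresh interval."""
    return now - last_refresh >= interval
```

A work that updates hourly would be refreshed about every 3600 seconds, while a work with no update history falls back to the maximum interval.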
- Obtaining the webpage source code corresponding to the webpage link address of the network data to be collected in each queue is specifically: according to a URL acquisition policy set in the system (for example, one that staff configure according to system operating conditions, the state of the queues, or timing requirements), a link address to be collected is taken from each queue, and the system then obtains the source code of the webpage through an HTTP request.
- In this embodiment, the webpage link addresses B1 and B2 of the network data to be collected are taken from the topic name page queue, and the webpage source code corresponding to B1 and the webpage source code corresponding to B2 are obtained according to the URL acquisition policy set in the system; the webpage link address B3 of the network data to be collected is taken from the list page queue, and its webpage source code is likewise obtained according to the URL acquisition policy set in the system.
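The acquisition step can be sketched as below, assuming a simple policy that takes one link address from each non-empty queue per round. The policy, the function names, and the UTF-8 decoding are assumptions; the text only states that a configurable URL acquisition policy is used and that the source code is obtained through an HTTP request.

```python
from collections import deque
from urllib.request import urlopen


def http_fetch(url, timeout=10):
    """Fetch webpage source over HTTP; decoding as UTF-8 is an assumption."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_round(queues, fetch=http_fetch):
    """Take one link address from each non-empty queue and fetch its source code.

    queues: mapping of queue name -> deque of link addresses.
    fetch:  callable returning page source for a URL (injectable for testing).
    """
    results = []
    for name, queue in queues.items():
        if queue:
            url = queue.popleft()
            results.append((name, url, fetch(url)))
    return results
```

Passing a stub for `fetch` lets the policy be exercised without network access; in a real run the default `http_fetch` would be used.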
- the URL information corresponding to the webpage source code includes a network literary work name, a chapter directory, a chapter link, and a link to the body content.
- The collection depth value of the URL is configured according to the structure of the network literature, specifically:
- the first threshold indicates that the structure of the work is "name -> volume -> chapter -> content";
- the second threshold indicates that the structure of the work is "name -> chapter -> content";
- the third threshold indicates that the structure of the work is "chapter -> content".
- In this embodiment, the first threshold is 3, the second threshold is 2, and the third threshold is 1.
- The collection depth values configured according to the structure of the network literature can be understood with reference to A1, A2, and A3 published on website A.
- The structure of A1 is "literature name -> volume directory -> chapter directory -> content page of a specific chapter", so the collection depth value of the URL corresponding to the source code obtained from B1 (i.e., URL-A1) is 3.
- The structure of A2 is "literature name -> chapter directory -> content page of a specific chapter", so the collection depth value of the URL corresponding to the source code obtained from B2 (i.e., URL-A2) is 2.
- The structure of A3 is "chapter name -> content page of a specific chapter", so the collection depth value of the URL corresponding to the source code obtained from B3 (i.e., URL-A3) is 1.
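The mapping from publication structure to collection depth value can be illustrated with a short sketch. It relies on the stated thresholds (first = 3, second = 2, third = 1, where the third threshold corresponds to "chapter -> content"); the rule "depth equals the number of levels above the content page" is an observed pattern used here as an assumption, and the names are hypothetical.

```python
FIRST_THRESHOLD, SECOND_THRESHOLD, THIRD_THRESHOLD = 3, 2, 1

# Publication structures of the three example works, each ending at the content page.
STRUCTURES = {
    "A1": ["literature name", "volume directory", "chapter directory", "content page"],
    "A2": ["literature name", "chapter directory", "content page"],
    "A3": ["chapter name", "content page"],
}


def collection_depth(levels):
    """Return the collection depth value: the number of levels above the content page."""
    depth = len(levels) - 1
    if depth not in (FIRST_THRESHOLD, SECOND_THRESHOLD, THIRD_THRESHOLD):
        raise ValueError(f"unsupported structure with {len(levels)} levels")
    return depth


depths = {work: collection_depth(levels) for work, levels in STRUCTURES.items()}
```

Under this assumption A1 takes the first threshold, A2 the second, and A3 the third.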
- Step 13 specifically includes: (Please refer to Figure 2)
- Step 131: Extract the data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code of the topic name page link address obtained from the topic name page queue, and the collection depth value of the URL.
- Step 132: Extract the data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code of the list page link address obtained from the list page queue, and the collection depth value of the URL.
- Step 133: According to the URL corresponding to the webpage source code of the content page link address obtained from the content page queue, extract the chapter title and chapter body content of the topic from the webpage source code, and extract the chapter ID of the chapter corresponding to the chapter title from the URL corresponding to the webpage source code.
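Extracting a chapter ID from a content page URL, as in step 133, might look like the sketch below. The text does not say how the ID is encoded, so the assumption here is that it is the final numeric segment of the URL path; the example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def chapter_id_from_url(url):
    """Return the trailing numeric path segment of a URL, or None if there is none.

    Assumes (hypothetically) that the site encodes the chapter ID as the last
    number in the path, optionally followed by a file extension or a slash.
    """
    path = urlparse(url).path
    match = re.search(r"(\d+)(?:\.\w+)?/?$", path)
    return match.group(1) if match else None
```

For a URL such as `http://site-a.example/book/12/chapter/3456.html` this returns `"3456"`; a site with a different URL scheme would need its own pattern.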
- Steps 131, 132, and 133 need not be performed in any particular order.
- Whenever a queue holds a link address to be collected, that link address can be collected: the webpage source code corresponding to the link address is obtained, and the data of the network document corresponding to the URL is extracted according to the URL information corresponding to the webpage source code and the collection depth value of the URL. The extraction of network document data in each step is described in detail below.
- In step 131, extracting the data of the network document corresponding to the URL is specifically:
- if the collection depth value of the URL is 3, the name of the topic and the URL corresponding to the name are extracted, and the collection depth value of the URL corresponding to the name is marked as 2 (the second threshold) before the URL is added to the list page queue;
- if the collection depth value of the URL is 2, the name of the topic and the URL corresponding to the name are extracted, and the collection depth value of the URL corresponding to the name is marked as 1 before the URL is added to the list page queue.
- In this embodiment, the link addresses taken from the topic name page queue are the link address B1 of A1 and the link address B2 of A2. Since the collection depth value of URL-A1, which corresponds to the source code of B1, is 3, the topic name of A1 is extracted, denoted "name A1". The URL corresponding to "name A1" is also extracted, denoted "URL-A11"; its collection depth value is marked as 2 and it is added to the list page queue, so that the remaining information of work A1 can be extracted from URL-A11. For the link address B2, since the collection depth value of URL-A2 is 2, the topic name of A2 is extracted, denoted "name A2". The URL corresponding to "name A2" is also extracted, denoted "URL-A21"; its collection depth value is marked as 1 and it is added to the list page queue, so that the remaining information of work A2 can be extracted from URL-A21.
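The handling of topic name pages in step 131 can be sketched as follows. The dictionary layout, the `parent` field used to remember the superior URL, and the stand-in parser are all assumptions; real parsing of the page source is outside the scope of this sketch.

```python
from collections import deque

# Thresholds from the embodiment: first = 3, second = 2, third = 1.
NEXT_DEPTH = {3: 2, 2: 1}


def handle_topic_name_page(entry, parse_name_and_link, list_queue):
    """Extract the topic name and queue the URL behind it with the next depth value."""
    if entry["depth"] not in NEXT_DEPTH:
        raise ValueError(f"unexpected depth for a topic name page: {entry['depth']}")
    name, link = parse_name_and_link(entry["source"])
    list_queue.append(
        {"url": link, "depth": NEXT_DEPTH[entry["depth"]], "parent": entry["url"]}
    )
    return name


def fake_parser(source):
    # Stand-in for real HTML parsing: the source is 'name|link' in this sketch.
    name, link = source.split("|")
    return name, link


list_queue = deque()
names = [
    handle_topic_name_page(
        {"url": "URL-A1", "depth": 3, "source": "name A1|URL-A11"}, fake_parser, list_queue
    ),
    handle_topic_name_page(
        {"url": "URL-A2", "depth": 2, "source": "name A2|URL-A21"}, fake_parser, list_queue
    ),
]
```

After both calls the list page queue holds URL-A11 with depth 2 and URL-A21 with depth 1, matching the example in the text.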
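- In step 132, extracting the data of the network document corresponding to the URL is specifically:
- if the collection depth value of the URL is 2, the chapter directory of the topic and the URL corresponding to the chapter directory are extracted, and the collection depth value of the URL corresponding to the chapter directory is marked as 1 before the URL is added to the list page queue;
- if the collection depth value of the URL is 1, it is determined whether the URL has a superior URL: if so, the chapter titles of the topic and the URLs of the chapters corresponding to the chapter titles are extracted, and the chapter URLs are added to the content page queue; if not, the name of the topic, the chapter titles of the topic, and the URLs of the chapters corresponding to the chapter titles are extracted, and the chapter URLs are added to the content page queue.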
- In this embodiment, after step 131 the to-be-collected URL-A11 and URL-A21 have been stored in the list page queue.
- The link address B3 corresponding to work A3 has also been placed in the list page queue.
- For URL-A11, the collection depth value is 2, so the chapter directory of A1 and the URL corresponding to the chapter directory are extracted, the latter denoted "URL-A12". The collection depth value of URL-A12 is marked as 1 and URL-A12 is added to the list page queue.
- For URL-A21, the collection depth value is 1 and it has a superior URL, so the chapter title of A2 and the URL of the chapter corresponding to the chapter title are extracted, the latter denoted "URL-A22", and URL-A22 is added to the content page queue.
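The list page branch of step 132 can be sketched in the same style. The shape of `parsed` (keys `directory_url`, `chapters`, `topic_name`), the `parent` field standing for the superior URL, and the label `URL-A31` are hypothetical; the branching on depth 2, depth 1 with a superior URL, and depth 1 without one follows the description above.

```python
from collections import deque


def handle_list_page(entry, parsed, list_queue, content_queue):
    """Route a list page according to its collection depth value."""
    if entry["depth"] == 2:
        # Chapter directory level: queue the directory URL with depth 1.
        list_queue.append(
            {"url": parsed["directory_url"], "depth": 1, "parent": entry["url"]}
        )
        return {}
    if entry["depth"] == 1:
        record = {"chapter_titles": [title for title, _ in parsed["chapters"]]}
        if entry.get("parent") is None:
            # No superior URL: the topic name has to come from this page as well.
            record["topic_name"] = parsed["topic_name"]
        for _, chapter_url in parsed["chapters"]:
            content_queue.append(chapter_url)
        return record
    raise ValueError(f"unexpected depth for a list page: {entry['depth']}")


list_q, content_q = deque(), deque()
handle_list_page(
    {"url": "URL-A11", "depth": 2, "parent": "URL-A1"},
    {"directory_url": "URL-A12"}, list_q, content_q,
)
with_parent = handle_list_page(
    {"url": "URL-A21", "depth": 1, "parent": "URL-A2"},
    {"chapters": [("Chapter 1", "URL-A22")]}, list_q, content_q,
)
without_parent = handle_list_page(
    {"url": "B3", "depth": 1, "parent": None},
    {"topic_name": "A3", "chapters": [("A3 (1)", "URL-A31")]}, list_q, content_q,
)
```

Only the entry without a superior URL (B3 in this sketch) yields a topic name, since works reached through a topic name page already had their name extracted in step 131.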
- In step 133, if the chapter body is split across pages, the link address of the next page must be extracted, the page number of the current page and the page number of the next page are marked, and the link address of the next page is added to the content page queue to await collection.
- The first-page link of the chapter body content serves as the unique key under which the content of each page is stored, and an end mark is set when the last page has been collected.
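Storing paginated chapter bodies under the first-page link, with an end mark on the last page, can be sketched as below. The in-memory dictionary, the field names, and the merge helper are assumptions; the sketch also assumes pages are collected in order, so the end mark is set only after all earlier pages have been stored.

```python
def store_page(store, first_page_url, page_no, text, is_last):
    """Store one page of a chapter body under the first-page link as the unique key."""
    chapter = store.setdefault(first_page_url, {"pages": {}, "ended": False})
    chapter["pages"][page_no] = text
    if is_last:
        chapter["ended"] = True  # end mark: the last page has been collected


def merged_body(store, first_page_url):
    """Return the pages joined in page order, or None until the end mark is set."""
    chapter = store[first_page_url]
    if not chapter["ended"]:
        return None
    return "".join(chapter["pages"][n] for n in sorted(chapter["pages"]))


store = {}
key = "http://site-a.example/a1/ch1"  # hypothetical first-page link
store_page(store, key, 1, "First half. ", is_last=False)
partial = merged_body(store, key)
store_page(store, key, 2, "Second half.", is_last=True)
full = merged_body(store, key)
```

Until the end mark is set, `merged_body` returns `None`, so an incomplete chapter is never merged.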
- The website, the name of the topic, the chapter title of the topic, the chapter ID, and the chapter body content are uploaded to the database.
- Alternatively, the chapter body content can be stored as an attachment on a file server, with the path of the stored file recorded in the database.
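The two storage options can be sketched with Python's built-in `sqlite3` module and a local directory standing in for the file server. The table layout, the `max_inline_chars` cutoff, and the file naming are assumptions introduced for illustration; the text only says that the body is either uploaded to the database or stored as an attachment with its path recorded.

```python
import sqlite3
import tempfile
from pathlib import Path

SCHEMA = """CREATE TABLE chapters (
    website TEXT, topic_name TEXT, chapter_title TEXT,
    chapter_id TEXT, body TEXT, body_path TEXT)"""


def save_chapter(conn, file_root, row, body, max_inline_chars=1000):
    """Store the body in the database, or as a file with its path recorded instead."""
    body_path = None
    if len(body) > max_inline_chars:
        path = Path(file_root) / f"{row['chapter_id']}.txt"
        path.write_text(body, encoding="utf-8")
        body, body_path = None, str(path)
    conn.execute(
        "INSERT INTO chapters VALUES (?, ?, ?, ?, ?, ?)",
        (row["website"], row["topic_name"], row["chapter_title"],
         row["chapter_id"], body, body_path),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = {"website": "A", "topic_name": "A1", "chapter_title": "Chapter 1", "chapter_id": "1001"}
with tempfile.TemporaryDirectory() as file_root:
    save_chapter(conn, file_root, row, "short body")
    save_chapter(conn, file_root, dict(row, chapter_id="1002"), "x" * 5000)
    stored = conn.execute(
        "SELECT chapter_id, body, body_path FROM chapters ORDER BY chapter_id"
    ).fetchall()
    long_body_on_disk = Path(stored[1][2]).read_text(encoding="utf-8")
```

Short bodies stay inline in the `body` column, while long ones leave `body` empty and record only the file path.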
- The method for collecting and merging network data in this embodiment allows network literature to be presented in the form of a book. Further, automatic refreshing enables real-time data collection, so this embodiment provides real-time, convenient, and centralized browsing of online literature.
- A first embodiment of the present invention provides a system for collecting network data, used to collect data of network documents respectively related to M topics published on a website, where M is a positive integer. Please refer to FIG. 3, which is a structural diagram of the collection system in this embodiment.
- the system for collecting data includes a configuration module 31, a webpage obtaining module 32, and a data extracting module 33.
- The configuration module 31 is configured to configure, according to the type corresponding to the webpage link address of the network data to be collected, that webpage link address into the queue of the corresponding type, where the webpage link address of the network data to be collected is the link address of the webpage on which the data of the network documents respectively related to the M topics is located.
- the webpage obtaining module 32 is configured to obtain a webpage source code corresponding to a webpage link address of the network data to be collected in the corresponding type of queue.
- the data extraction module 33 is configured to extract data of the network document corresponding to the URL according to the URL information corresponding to the webpage source code and the collected depth value of the URL.
- The types corresponding to the webpage link address of the network data to be collected include a topic name page, a list page, and a content page.
- the configuration module 31 includes a webpage configuration module 311 for configuring a topic name page for extracting a topic name, a configuration list page for extracting a topic chapter directory or a topic chapter, and a configuration content page for extracting topic content.
- The configuration module 31 further includes a queue configuration module 312, configured to configure the webpage link address of the network data to be collected into a queue of the corresponding type.
- The queue configuration module 312 further includes: a first allocation unit 3121, configured to allocate a link address whose type is the topic name page into the topic name page queue; a second allocation unit 3122, configured to allocate a link address whose type is the list page into the list page queue; and a third allocation unit 3123, configured to allocate a link address whose type is the content page into the content page queue.
- the webpage obtaining module 32 includes: a first obtaining unit 321 configured to obtain a webpage source code corresponding to a link address of the topic name page in the topic name page queue.
- the second obtaining unit 322 is configured to obtain a webpage source code corresponding to the link address of the list page in the list page queue.
- the third obtaining unit 323 is configured to obtain a webpage source code corresponding to the link address of the content page in the content page queue. Please refer to Figure 5.
- The data extraction module 33 further includes: a first extraction unit 331, configured to, when the collection depth value of the URL corresponding to the webpage source code is the first threshold, extract the name of the topic and the URL corresponding to the name, mark the collection depth value of the URL corresponding to the name as the second threshold, and send that URL to the second allocation unit 3122.
- The second extraction unit 332 is configured to, when the collection depth value of the URL corresponding to the webpage source code is the second threshold, extract the name of the topic and the URL corresponding to the name, mark the collection depth value of the URL corresponding to the name as the third threshold, and send that URL to the second allocation unit 3122.
- The third extraction unit 333 is configured to, when the collection depth value of the URL corresponding to the webpage source code is the second threshold, extract the chapter directory of the topic and the URL of the chapter directory, mark the collection depth value of the URL of the chapter directory as the third threshold, and send that URL to the second allocation unit 3122.
- The fourth extraction unit 334 is configured to determine whether the URL corresponding to the webpage source code has a superior URL and, when the determination result is yes, extract the chapter title and the URL of the chapter corresponding to the chapter title, and send the URL of the chapter to the third allocation unit 3123.
- the fifth extracting unit 335 is configured to extract a chapter title and a chapter body content of the topic from the webpage source code, and extract a chapter ID of the chapter corresponding to the chapter title from the URL corresponding to the webpage source code.
- The page determining unit 336 is configured to determine whether the chapter body content is split across pages; when it is, the fifth extraction unit 335 is further configured to extract the link address of the next page, mark the page number of the current page and the page number of the next page, and send the link address of the next page to the third allocation unit 3123.
- The page storage unit 337 is configured to store the content of each page using the first-page link of the chapter body content as the unique key, and to set an end mark when the last page has been collected. Please refer to FIG. 6.
- In the second embodiment, the system further includes a refresh module 34, configured to set a refresh time interval according to the update frequency with which the website publishes the network documents respectively related to the M topics, and to refresh the webpage link addresses of the network data to be collected based on the refresh time interval. Please refer to FIG. 7 for this embodiment.
- the system further includes a content merge module 35 for combining the extracted body contents of all the pages and outputting them in conjunction with the chapter titles. Please refer to FIG. 8 for this embodiment.
- The refresh module of the second embodiment can also be used in the collection system of this embodiment; the combined system is not described in detail.
- The system further includes a first data storage module 36, configured to upload the website, the name of the topic, the chapter title of the topic, the chapter ID, and the chapter body content to the database.
- The system further includes a second data storage module 37, configured to, when the chapter body content would occupy a large amount of database space, upload the website, the name of the topic, the chapter title of the topic, the chapter ID, and the storage path of the chapter body content to the database, where the storage path of the chapter body content is the path at which the chapter body content is stored as an attachment on the file server. Please refer to FIG. 9 for this embodiment.
- The refresh module of the second embodiment can also be used in the collection system of this embodiment; the combined system is not described in detail.
- The systems of the first, second, third, and fourth embodiments described above can be implemented with reference to the embodiment of the network data collection method provided by the present invention and its description. For brevity of the specification, they are not described in further detail.
- A network data collection system is used to collect network data.
- The system acquires a link address of the network data, configures the type of the link address, and places the link address into the queue corresponding to that type.
- The source code corresponding to the link address is obtained from the queue, and the information of the network data is extracted according to the corresponding URL information in the source code and the collection depth value of the URL, thereby achieving the technical effect of collecting network data in real time.
- The content merge module can also be used to merge network documents belonging to the same topic, so that convenient centralized browsing is achieved on top of collecting network data in real time.
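The queue-based collection summarized above can be sketched as follows. The two link types, the `classify` rule, the `href` regular expression, and the in-memory `PAGES` site are hypothetical; the source specifies only that link addresses are typed, queued by type, and processed according to the URL information in the page source and a collection depth value.

```python
import re
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

LINK_TYPES = ("list", "chapter")  # hypothetical link types


def classify(url: str) -> str:
    """Assumed rule: chapter pages contain '/chapter/' in their address."""
    return "chapter" if "/chapter/" in url else "list"


def crawl(seeds: List[str], fetch: Callable[[str], str], max_depth: int) -> List[str]:
    """Queue links by type and follow them until the collection depth is reached."""
    queues: Dict[str, Deque[Tuple[str, int]]] = {t: deque() for t in LINK_TYPES}
    seen = set(seeds)
    for url in seeds:
        queues[classify(url)].append((url, 0))
    collected: List[str] = []
    while any(queues.values()):  # an empty deque is falsy
        for kind in LINK_TYPES:
            if not queues[kind]:
                continue
            url, depth = queues[kind].popleft()
            source = fetch(url)  # page source for this link address
            if kind == "chapter":
                collected.append(url)
            if depth >= max_depth:
                continue  # depth limit reached: do not follow further links
            for link in re.findall(r'href="([^"]+)"', source):
                if link not in seen:
                    seen.add(link)
                    queues[classify(link)].append((link, depth + 1))
    return collected


# Stand-in for real page sources.
PAGES = {
    "/book/1": '<a href="/chapter/1">1</a><a href="/chapter/2">2</a>',
    "/chapter/1": '<a href="/chapter/1/notes">notes</a>',
    "/chapter/2": "",
    "/chapter/1/notes": "",
}

chapters = crawl(["/book/1"], PAGES.__getitem__, max_depth=1)
```

Raising `max_depth` to 2 in this sketch would also reach `/chapter/1/notes`, which shows how the depth value bounds how far the collection follows links from the seed address.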
- Embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
- The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2014532241A JP5823620B2 (ja) | 2011-12-13 | 2012-12-13 | Method and system for collecting network data |
| EP12857177.5A EP2793143A4 (en) | 2011-12-13 | 2012-12-13 | METHOD AND SYSTEM FOR COLLECTING NETWORK DATA |
| US14/123,036 US9525605B2 (en) | 2011-12-13 | 2012-12-13 | Method of and system for collecting network data |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201110415356.8 | 2011-12-13 | | |
| CN201110415356.8A CN103164435B (zh) | 2011-12-13 | 2011-12-13 | Method and system for collecting network data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2013087012A1 (zh) | 2013-06-20 |
Family
ID=48587529
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2012/086584 Ceased WO2013087012A1 (zh) | 2011-12-13 | 2012-12-13 | Method and system for collecting network data |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US9525605B2 (zh) |
| EP (1) | EP2793143A4 (zh) |
| JP (1) | JP5823620B2 (zh) |
| CN (1) | CN103164435B (zh) |
| WO (1) | WO2013087012A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110297994A (zh) * | 2019-06-03 | 2019-10-01 | 北京金蝶管理软件有限公司 | Web page data collection method and apparatus, computer device, and storage medium |
Families Citing this family (15)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9596313B2 (en) * | 2013-04-12 | 2017-03-14 | Tencent Technology (Shenzhen) Company Limited | Method, terminal, cache server and system for updating webpage data |
| CN104426900B (zh) * | 2013-09-11 | 2019-12-06 | 腾讯科技(深圳)有限公司 | Multimedia data collection method and system |
| CN104065741B (zh) * | 2014-07-04 | 2018-06-19 | 用友网络科技股份有限公司 | Data collection system and data collection method |
| CN105630942B (zh) * | 2015-12-23 | 2019-05-21 | 北京奇虎科技有限公司 | Method and apparatus for scheduling updated chapters of e-books |
| CN106919722A (zh) * | 2017-04-28 | 2017-07-04 | 暴风集团股份有限公司 | Network data acquisition method and system for sports events |
| CN109067853B (zh) * | 2018-07-16 | 2021-07-30 | 郑州云海信息技术有限公司 | Method for automatically attempting to acquire source code of dynamic Web pages |
| CN109376327B (zh) * | 2018-10-10 | 2021-09-21 | 北京北信源信息安全技术有限公司 | Method for managing website URLs |
| CN109543086B (zh) * | 2018-11-23 | 2022-11-22 | 北京信息科技大学 | Network data collection and display method for multiple data sources |
| CN109670099A (zh) * | 2018-12-21 | 2019-04-23 | 全通教育集团(广东)股份有限公司 | Topic collection method based on education network information |
| GB2583771B (en) * | 2019-05-10 | 2022-06-15 | Samsung Electronics Co Ltd | Improvements in and relating to data analytics in a telecommunication network |
| CN111858476A (zh) * | 2020-07-20 | 2020-10-30 | 上海闻泰电子科技有限公司 | File processing method and apparatus, electronic device, and computer-readable storage medium |
| CN112035723A (zh) * | 2020-08-28 | 2020-12-04 | 光大科技有限公司 | Method and apparatus for determining a resource library, storage medium, and electronic apparatus |
| CN113569181B (zh) * | 2021-07-29 | 2024-12-20 | 山东亿云信息技术有限公司 | Paginated data collection method and system |
| CN115017430B (zh) * | 2022-06-27 | 2024-10-18 | 京东科技控股股份有限公司 | Method and apparatus for determining a list page, electronic device, and storage medium |
| CN115827942A (zh) * | 2022-11-25 | 2023-03-21 | 四川文化产业职业学院 | Method and system for crawling dynamic data of Ajax-based news web pages |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
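| CN101094135A (zh) * | 2006-06-23 | 2007-12-26 | 腾讯科技(深圳)有限公司 | Method and system for extracting Internet content information |
| CN101136026A (zh) * | 2007-05-15 | 2008-03-05 | 北京聚生科技有限公司 | Web page content collection method based on XMLHTTP component technology |
| CN101593200A (zh) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese web page classification method based on keyword frequency analysis |
| CN102118400A (zh) * | 2009-12-31 | 2011-07-06 | 北京四维图新科技股份有限公司 | Data collection method and data collection system |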
Family Cites Families (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6292796B1 (en) * | 1999-02-23 | 2001-09-18 | Clinical Focus, Inc. | Method and apparatus for improving access to literature |
| US7275061B1 (en) * | 2000-04-13 | 2007-09-25 | Indraweb.Com, Inc. | Systems and methods for employing an orthogonal corpus for document indexing |
| US6718363B1 (en) * | 1999-07-30 | 2004-04-06 | Verizon Laboratories, Inc. | Page aggregation for web sites |
| US6665665B1 (en) * | 1999-07-30 | 2003-12-16 | Verizon Laboratories Inc. | Compressed document surrogates |
| US6757866B1 (en) * | 1999-10-29 | 2004-06-29 | Verizon Laboratories Inc. | Hyper video: information retrieval using text from multimedia |
| FR2807537B1 (fr) * | 2000-04-06 | 2003-10-17 | France Telecom | Hypermedia resource search engine and associated indexing method |
| US6912525B1 (en) * | 2000-05-08 | 2005-06-28 | Verizon Laboratories, Inc. | Techniques for web site integration |
| US7013323B1 (en) * | 2000-05-23 | 2006-03-14 | Cyveillance, Inc. | System and method for developing and interpreting e-commerce metrics by utilizing a list of rules wherein each rule contain at least one of entity-specific criteria |
| US7080073B1 (en) * | 2000-08-18 | 2006-07-18 | Firstrain, Inc. | Method and apparatus for focused crawling |
| US7330850B1 (en) * | 2000-10-04 | 2008-02-12 | Reachforce, Inc. | Text mining system for web-based business intelligence applied to web site server logs |
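| JP2004118415A (ja) * | 2002-09-25 | 2004-04-15 | Fujitsu Ltd | Information collection method and program for causing a computer to perform the processing in the method |
| JP4350001B2 (ja) * | 2004-08-17 | 2009-10-21 | 富士通株式会社 | Page information collection program, page information collection method, and page information collection apparatus |
| JP4718205B2 (ja) * | 2005-02-22 | 2011-07-06 | 三菱電機株式会社 | Selective Web information collection apparatus |
| CN101178713A (zh) * | 2006-11-29 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Method and system for collecting web pages |
| WO2010041517A1 (ja) * | 2008-10-08 | 2010-04-15 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Information collection apparatus, search engine, information collection method, and program |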
| US8140526B1 (en) * | 2009-03-16 | 2012-03-20 | Guangsheng Zhang | System and methods for ranking documents based on content characteristics |
| US8229873B1 (en) * | 2009-09-18 | 2012-07-24 | Google Inc. | News topic-interest-based recommendations twiddling |
| US8650195B2 (en) * | 2010-03-26 | 2014-02-11 | Palle M Pedersen | Region based information retrieval system |
| JP5063729B2 (ja) * | 2010-03-31 | 2012-10-31 | ヤフー株式会社 | Crawler management system and method |
| US20140108445A1 (en) * | 2011-05-05 | 2014-04-17 | Google Inc. | System and Method for Personalizing Query Suggestions Based on User Interest Profile |
| US8538949B2 (en) * | 2011-06-17 | 2013-09-17 | Microsoft Corporation | Interactive web crawler |
- 2011-12-13: CN, application CN201110415356.8A, publication CN103164435B (zh), not active (Expired - Fee Related)
- 2012-12-13: US, application US14/123,036, publication US9525605B2 (en), not active (Expired - Fee Related)
- 2012-12-13: WO, application PCT/CN2012/086584, publication WO2013087012A1 (zh), not active (Ceased)
- 2012-12-13: JP, application JP2014532241A, publication JP5823620B2 (ja), not active (Expired - Fee Related)
- 2012-12-13: EP, application EP12857177.5A, publication EP2793143A4 (en), not active (Withdrawn)
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP2793143A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US9525605B2 (en) | 2016-12-20 |
| JP2014528136A (ja) | 2014-10-23 |
| US20140289394A1 (en) | 2014-09-25 |
| JP5823620B2 (ja) | 2015-11-25 |
| EP2793143A1 (en) | 2014-10-22 |
| CN103164435A (zh) | 2013-06-19 |
| EP2793143A4 (en) | 2015-08-12 |
| CN103164435B (zh) | 2016-03-09 |
Similar Documents
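| Publication | Publication Date | Title |
|---|---|---|
| WO2013087012A1 (zh) | | Method and system for collecting network data |
| CN102708174B (zh) | | Method and apparatus for displaying rich media information in a browser |
| WO2013097632A1 (zh) | | Information publishing method and apparatus |
| CN103531218B (zh) | | Online multimedia file editing method and system |
| CN104572870B (zh) | | Method, apparatus, and system for providing online document reading |
| CN103034722B (zh) | | Apparatus and method for aggregating online video comments |
| WO2015021199A1 (en) | | Access and management of entity-augmented content |
| JP2010536075A5 (zh) | | |
| CN106372113A (zh) | | News content push method and system |
| CN105138557B (zh) | | Method and apparatus for random music playback |
| CN105095211A (zh) | | Method and apparatus for acquiring multimedia data |
| WO2017107620A1 (zh) | | Page data loading method and system |
| CN106033428A (zh) | | Uniform resource locator selection method and uniform resource locator selection apparatus |
| WO2012083870A9 (zh) | | Method and system for incremental collection of forum replies |
| CN104504006A (zh) | | Method and system for collecting and parsing data from news clients |
| WO2014108040A1 (zh) | | Method and apparatus for presenting content on an electronic device |
| CN103034655B (zh) | | Method and system for collecting user behavior information, and related device |
| CN102118400B (zh) | | Data collection method and data collection system |
| CN102629265A (zh) | | Method and system for building a web page database |
| CN104536972B (zh) | | CDN-based web page content awareness system and method |
| CN102819613B (zh) | | System and method for paginated crawling of RSS information |
| CN103164438B (zh) | | Method and system for collecting online comments |
| JP2014142738A (ja) | | Management method, management apparatus, and management program |
| CN103020195A (zh) | | File browsing method and apparatus |
| CN104113464A (zh) | | Interaction method, apparatus, and system based on instant messaging prompts |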
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12857177; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 14123036; Country of ref document: US |
| | ENP | Entry into the national phase | Ref document number: 2014532241; Country of ref document: JP; Kind code of ref document: A |
| | REEP | Request for entry into the european phase | Ref document number: 2012857177; Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |