WO2017190641A1 - Method and device for intercepting a crawler, server terminal, and computer-readable medium - Google Patents
Method and device for intercepting a crawler, server terminal, and computer-readable medium
- Publication number
- WO2017190641A1 (PCT/CN2017/082707)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- page
- crawler
- value
- server
- field value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- The present invention relates to network technologies, and in particular to a method, an apparatus, a server terminal, and a computer-readable medium for intercepting a crawler.
- Web crawlers are a fundamental part of search engine technology.
- Web crawler technology starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages.
- While crawling, it continuously extracts new URLs from the current page and adds them to a queue until a stop condition is met.
- The crawled web page information is then stored on the search engine's server.
- An object of the embodiments of the present invention is to provide a method, an apparatus, a server terminal, and a computer readable medium for intercepting a crawler, which can effectively intercept crawler access.
- An embodiment of the present invention provides a method for intercepting a crawler, the method comprising:
- after receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the picture's uniform resource locator (URL) path, which contains the picture attribute value, into the requested page;
- the server determines whether the page currently being accessed is a page to which direct access is allowed and, if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the access request contains no such field value, or the field value it contains is invalid, the client is confirmed to be a crawler and the first page of the category of the accessed page is returned to the client.
- An embodiment of the present invention further provides a device for intercepting a crawler, applied to a server, the device including:
- a generating and saving unit which, after receiving an access request for a page sent by a client, generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the picture's uniform resource locator (URL) path, which contains the picture attribute value, into the requested page;
- a processing unit which determines whether the page currently being accessed is a page to which direct access is allowed and, if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the access request contains no such field value, or the field value it contains is invalid, the client is confirmed to be a crawler and the first page of the category of the accessed page is returned to the client.
- An embodiment of the present invention further provides a device for intercepting a crawler, applied to a client that is a browser, the device including:
- a download unit which downloads the image to the browser according to the image URL path contained in the page returned by the server;
- an extracting unit which parses the image, extracts the field value for identifying a crawler, and carries that field value in the access request when the browser accesses other pages.
- An embodiment of the present invention further provides a server terminal, the server terminal including:
- one or more processors;
- a storage device for storing one or more programs;
- when the one or more programs are executed by the one or more processors, the one or more processors implement the method for intercepting a crawler of an embodiment of the present invention.
- an embodiment of the present invention further provides a computer readable medium having stored thereon a computer program, the program being executed by a processor to implement a method for intercepting a crawler according to an embodiment of the present invention.
- In the embodiments of the present invention, after receiving an access request for a page sent by a client, the server generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the picture's uniform resource locator (URL) path, which contains the picture attribute value, into the requested page. The server then determines whether the page currently being accessed is a page to which direct access is allowed and, if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the access request contains no such field value, or the field value it contains is invalid, the client is confirmed to be a crawler and the first page of the category of the accessed page is returned to the client.
- The present invention exploits the fact that a crawler does not execute JavaScript (JS) and does not download the images in a web page: the server saves the generated field value (a cookie value) for identifying a crawler into an image, and the crawler does not download that image.
- Therefore, applying the invention effectively improves the crawler interception rate, reduces the load on the server, and ensures the stability and high concurrency of the website, while normal user access is not blocked.
- FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present invention may be applied.
- FIG. 2 is a schematic flow chart of a method for intercepting a crawler according to an embodiment of the present invention.
- FIG. 3 is a schematic structural diagram of an apparatus for intercepting a crawler, applied to the above method, according to an embodiment of the present invention.
- FIG. 4 is a block diagram showing the structure of a computer system suitable for implementing a terminal device or a server of an embodiment of the present invention.
- FIG. 1 illustrates an exemplary system architecture 100 in which the intercept crawler method or intercept crawler device of the present application can be applied.
- system architecture 100 can include terminal devices 101, 102, 103, network 104, and server 105.
- the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
- Network 104 may include various types of connections, such as wired, wireless communication links, fiber optic cables, and the like.
- the user can interact with the server 105 over the network 104 using the terminal devices 101, 102, 103 to receive or transmit messages and the like.
- Various communication client applications such as a shopping application, a web browser application, a search application, an instant communication tool, a mailbox client, a social platform software, and the like can be installed on the terminal devices 101, 102, and 103 (for example only).
- the terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop portable computers, desktop computers, and the like.
- The server 105 may be a server that provides various services, such as a background management server (for example only) that supports a shopping website browsed by the user using the terminal devices 101, 102, 103.
- The background management server may analyze and process data such as a received product information query request, and feed back the processing result (for example, target push information or product information; examples only) to the terminal device.
- The method for intercepting a crawler provided by the embodiments of the present invention is generally performed by the server 105. Accordingly, the device for intercepting a crawler is generally disposed in the server 105.
- The numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers.
- While preserving normal browser access, the invention effectively blocks crawlers. It exploits the fact that a crawler does not execute JS and does not download the images in a web page: the server saves the generated cookie value used for identifying a crawler into an image.
- Because the crawler does not download the image, the access requests it sends do not carry the cookie value. Crawler requests and browser requests can therefore be distinguished by whether the access request carries the cookie value, which ultimately achieves effective interception of the crawler.
- An embodiment of the invention discloses a method for intercepting a crawler, which comprises the following steps.
- The process is shown schematically in FIG. 2.
- Step 21: After receiving an access request for a page sent by the client, the server generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the image URL path containing the picture attribute value into the requested page.
- The field value used to identify a crawler may be a cookie value; the picture attribute value may be a picture name.
- After receiving the access request for a page sent by the client, for example an HTTP request, the server generates a cookie value and a picture name, and then saves the image URL path containing the picture name into the requested page. Specifically:
- The method by which the server generates the current cookie value for identifying a crawler includes: the server selects the value of the current timestamp according to the validity period of the cookie value, and encrypts the string formed from the selected timestamp value and the configured current first key, for example with an MD5 message digest operation, to obtain the current cookie value.
- The method by which the server generates the picture name includes: the server selects the value of the current timestamp according to the validity period of the cookie value, and encrypts the string formed from the selected timestamp value and the configured current second key, for example with an MD5 message digest operation, to obtain the picture name.
- The cookie value in the present invention is time-limited, and its generation is tied to the timestamp. Other methods that derive the cookie value and the picture name from the timestamp are also within the scope of the present invention.
- a URL is an identification method for completely describing the addresses of web pages and other resources on the Internet.
- each web page on the Internet has a unique URL.
- The access request carries the URL path information of the page. It should be noted that the image URL path is additionally saved in the page, and the specific location where it is saved may be set according to the implementation. In one embodiment, the image URL path may be saved in an image tag of the page.
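A minimal Python sketch of saving the image URL path into an image tag of the page. The function name `embed_image_tag`, the placement before `</body>`, the 1x1 size, and the example path `/img/abc123.png` are illustrative assumptions; the description only says that the path may be saved in an image tag.

```python
from html import escape

def embed_image_tag(page_html: str, image_url: str) -> str:
    """Insert an <img> tag pointing at image_url just before </body>.

    Hypothetical helper: the exact position of the tag is not specified
    in the description, so "before </body>" is an assumption.
    """
    tag = '<img src="{}" alt="" width="1" height="1">'.format(
        escape(image_url, quote=True)
    )
    marker = "</body>"
    if marker in page_html:
        # Replace only the first closing body tag.
        return page_html.replace(marker, tag + marker, 1)
    # No closing body tag found: append the image tag at the end.
    return page_html + tag

page = "<html><body><p>Page 1</p></body></html>"
result = embed_image_tag(page, "/img/abc123.png")
```

A browser that renders the returned page fetches the image as part of normal page loading, whereas a client that only reads the HTML never requests it.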
- Step 22: The server determines whether the page currently being accessed is a page to which direct access is allowed and, if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the access request contains no such field value, or the field value it contains is invalid, the client is confirmed to be a crawler and the first page of the category of the accessed page is returned to the client.
- The method by which the server determines whether the page currently being accessed is a directly accessible page includes: the server is preset with a range of pages to which direct access is allowed; the server determines whether the currently accessed page falls within that range and, if so, the page is a directly accessible page.
- The method by which the server determines whether the HTTP request contains a valid cookie value includes: the server compares the cookie value it generates with the cookie value carried in the HTTP request; if the two are equal, the cookie value carried in the HTTP request is valid; if they are not equal, it is invalid.
- The cookie value generated by the server changes once every predetermined period. In other words, assuming the predetermined period is 10 minutes, the cookie value generated by the server stays the same within each 10-minute period. The server returns the page containing the cookie value to the client, so as long as the client is a browser, the cookie value can be parsed out, carried in the next HTTP request, and sent to the server. As long as this happens within the same 10 minutes, the cookie value received by the server is the same as the one the server itself generates, which indicates that the cookie value is valid.
- If the client sends an HTTP request carrying the previous cookie value during the next 10-minute period, the server will have generated a new cookie value, so the cookie value received by the server is inconsistent with the one generated by the server itself, which means the cookie value is invalid.
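The validity check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the key `FIRST_KEY`, the function names, and the concatenation order (timestamp prefix followed by key) are hypothetical, and the 10-minute period is obtained by dropping the last minute digit of a `yyyyMMddHHmm` timestamp.

```python
import hashlib
import hmac
from datetime import datetime
from typing import Optional

FIRST_KEY = "example-first-key"  # hypothetical key, not from the source

def cookie_for(moment: datetime, key: str = FIRST_KEY) -> str:
    """Cookie value for the 10-minute period that contains `moment`."""
    # Keep the first 11 digits of yyyyMMddHHmm, i.e. drop the last minute digit.
    prefix = moment.strftime("%Y%m%d%H%M")[:11]
    return hashlib.md5((prefix + key).encode("utf-8")).hexdigest()

def is_valid(received: Optional[str], now: datetime) -> bool:
    """True if the received cookie equals the value the server generates now."""
    if not received:
        return False  # no cookie value carried in the request
    expected = cookie_for(now)
    # Compare as bytes; compare_digest avoids leaking timing information.
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))

issued = cookie_for(datetime(2016, 1, 1, 8, 15))
still_valid = is_valid(issued, datetime(2016, 1, 1, 8, 19))
expired = is_valid(issued, datetime(2016, 1, 1, 8, 20))
```

A value issued at 8:15 is accepted at 8:19 but rejected at 8:20, because the server's own value has moved on to the next period.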
- After receiving an HTTP request from a crawler, the server likewise saves the image URL path into the requested page. The server then determines whether the page currently being accessed is a directly accessible page and, if so, returns the requested page to the crawler. This is because, in practical applications, crawlers are generally allowed to access a limited number of pages, which in one embodiment may be pages 1-10 of the same category.
- If the server determines that the page currently being accessed is not a directly accessible page, for example because the crawler wants to access page 11, it further determines whether the HTTP request contains a valid cookie value. On finding that the crawler's HTTP request carries no cookie value, it intercepts the request and returns the first page of the current category to the crawler. In this way, the crawler always gets the first page of the current category and will not get more pages.
- After receiving an HTTP request from a browser, the server saves the image URL path into the requested page. The server then determines whether the page currently being accessed is a directly accessible page and, if so, returns the requested page to the browser. The browser then downloads the image according to the image URL path contained in the returned page, parses the image in JavaScript, extracts the cookie value, and saves it, so that the cookie value is carried in the HTTP request when the browser accesses other pages.
- Suppose the browser then accesses page 11 and carries the parsed cookie value in the HTTP request. After receiving the HTTP request, the server determines whether the cookie value is valid. If it is valid, access to page 11 is allowed; if it is invalid, the first page of the current category is returned to the browser.
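The decision flow of Step 22 can be summarized in a short Python sketch. The function name `page_to_serve` and its signature are hypothetical; the direct-access range of pages 1-10 follows the example given in the description.

```python
from typing import Optional

# Pages 1-10 may be accessed directly, as in the example embodiment.
DIRECT_ACCESS_PAGES = range(1, 11)

def page_to_serve(requested_page: int,
                  request_cookie: Optional[str],
                  current_cookie: str) -> int:
    """Return the page number the server should send back.

    requested_page: page number the client asked for
    request_cookie: cookie value carried in the request, or None
    current_cookie: cookie value the server generates for the current period
    """
    if requested_page in DIRECT_ACCESS_PAGES:
        return requested_page  # direct access, no cookie check needed
    if request_cookie is not None and request_cookie == current_cookie:
        return requested_page  # valid cookie value: treated as a browser
    return 1  # missing or invalid cookie value: treated as a crawler

browser_result = page_to_serve(11, "abc", "abc")
crawler_result = page_to_serve(11, None, "abc")
```

Returning page 1 instead of an error keeps the response indistinguishable from a normal page, which is the behaviour the description specifies.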
- The pages to which direct access is allowed are cached on a CDN (Content Delivery Network) server; when the client requests such a page, the CDN server returns the requested page to the client.
- CDN technology forms a layer of intelligent virtual network on top of the existing Internet by placing CDN servers throughout the network. Usually, a large amount of data can be cached on a CDN server; when a user accesses the cached content, the CDN server can provide the data directly, giving the user a fast response. In this way, crawler traffic is directed to the CDN servers of each province and city, thereby protecting the server and ensuring normal access by users.
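As a rough illustration of how caching the directly accessible pages shields the origin server, the following Python sketch models a single edge cache. The class `EdgeCache`, its counter, and the `origin` callable are hypothetical names; cache expiry is omitted, although a real cache would presumably need to refresh entries whenever the embedded image URL changes.

```python
from typing import Callable, Dict

DIRECT_ACCESS_PAGES = range(1, 11)  # pages 1-10, as in the example embodiment

class EdgeCache:
    """Toy model of a CDN node in front of an origin server."""

    def __init__(self, origin: Callable[[int], str]) -> None:
        self._origin = origin
        self._cache: Dict[int, str] = {}
        self.origin_hits = 0  # how many requests reached the origin

    def _fetch(self, page: int) -> str:
        self.origin_hits += 1
        return self._origin(page)

    def get(self, page: int) -> str:
        if page in DIRECT_ACCESS_PAGES:
            # Directly accessible pages are served from the cache after the first fetch.
            if page not in self._cache:
                self._cache[page] = self._fetch(page)
            return self._cache[page]
        # Other pages always go to the origin, which performs the cookie check.
        return self._fetch(page)

edge = EdgeCache(lambda page: "content of page {}".format(page))
for _ in range(5):
    edge.get(1)  # repeated requests for page 1, e.g. from a crawler
hits_after_page_one = edge.origin_hits
```

Five requests for page 1 reach the origin only once in this model, which is the sense in which crawler traffic is absorbed by the CDN.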
- the cookie value generated by the server side changes every 10 minutes, that is, the cookie value is valid for 10 minutes.
- The server takes the first 11 digits of the current timestamp, for example 20160101081, which denotes the 10 minutes from 8:10 to 8:19 on January 1, 2016. The string formed by combining 20160101081 with the current first key is then subjected to an MD5 message digest operation to obtain the current cookie value.
- The string formed by combining 20160101081 with the current second key is subjected to an MD5 message digest operation to obtain the picture name.
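The two digest operations can be sketched in Python with the standard `hashlib` module. The key strings and the concatenation order (timestamp prefix first, key second) are assumptions made for illustration; the description only states that the prefix and the key are combined.

```python
import hashlib

FIRST_KEY = "example-first-key"    # hypothetical first key
SECOND_KEY = "example-second-key"  # hypothetical second key

def md5_hex(text: str) -> str:
    """MD5 message digest of a string, as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# First 11 digits of the timestamp 201601010815 (8:15 on January 1, 2016).
prefix = "201601010815"[:11]

cookie_value = md5_hex(prefix + FIRST_KEY)   # sent back by browsers
picture_name = md5_hex(prefix + SECOND_KEY)  # file name of the generated picture
```

Using two different keys means the picture name, which is visible in the page, does not reveal the cookie value stored inside the picture.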
- The server puts the obtained cookie value into the description information of the picture, generates a new picture, and saves it under the obtained picture name; the server then saves the picture URL path containing the picture name into the requested page.
- The description information of the picture includes, but is not limited to, the time the photo was taken, the resolution of the photo, the type of camera, and the like.
- The new picture, named with the picture name, thus contains the cookie value.
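One possible way to carry a value in a picture's description information is a PNG `tEXt` chunk. The Python sketch below builds a 1x1 grayscale PNG with such a chunk and reads the value back. The choice of PNG, the keyword `Description`, and the function names are assumptions; the description does not name an image format, and it has the browser do the parsing in JavaScript, so the reader function here only demonstrates the round trip.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def _chunk(kind: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)

def build_png_with_text(keyword: str, text: str) -> bytes:
    """Build a 1x1 8-bit grayscale PNG carrying one tEXt chunk."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x00")  # filter byte 0 plus one black pixel
    text_data = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"tEXt", text_data)
            + _chunk(b"IDAT", pixels)
            + _chunk(b"IEND", b""))

def read_png_text(png: bytes, keyword: str) -> Optional[str]:
    """Return the text stored under `keyword`, or None if it is absent."""
    if not png.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        kind = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        if kind == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return None

picture = build_png_with_text("Description", "0123456789abcdef0123456789abcdef")
extracted = read_png_text(picture, "Description")
```

A client has to download and parse the picture to obtain the value, which is exactly the step a crawler is assumed to skip.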
- Embodiment 1. In one embodiment:
- The browser sends an HTTP request to the server, requesting the first page of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into the first page.
- The server is preset with a range of pages 1-10 to which direct access is allowed. The server determines that the first page falls within this range and therefore returns the first page, containing the image URL path, to the browser.
- The browser automatically downloads the image according to the image URL path contained in the first page of the current category, parses the image with a JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
- The browser sends an HTTP request carrying the cookie value to the server, requesting page 10 of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into page 10. Since the validity period is 10 minutes, the cookie value generated by the server is the same as the one carried in the HTTP request.
- The server is preset with a range of pages 1-10 to which direct access is allowed, and it determines that page 10 falls within this range. There is therefore no need to determine whether the cookie value is valid at this point, and page 10, containing the image URL path, is returned directly to the browser.
- The browser automatically downloads the image according to the image URL path contained in page 10 of the current category, parses the image with a JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
- The browser sends an HTTP request carrying the cookie value to the server, requesting page 11 of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into page 11. Since the validity period is 10 minutes, the cookie value generated by the server is the same as the one carried in the HTTP request.
- The server is preset with a range of pages 1-10 to which direct access is allowed, and it determines that page 11 does not fall within this range. It therefore further determines whether the cookie value is valid.
- As explained above, the request falls within the 10-minute validity period, so the cookie value generated by the server is the same as the one carried in the HTTP request. The cookie value is therefore determined to be valid, and page 11, containing the image URL path, is returned to the browser.
- The browser automatically downloads the image according to the image URL path contained in page 11 of the current category, parses the image with a JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
- Embodiment 2. In another embodiment:
- The browser sends an HTTP request to the server, requesting page 10 of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into page 10.
- The server is preset with a range of pages 1-10 to which direct access is allowed, and it determines that page 10 falls within this range. Therefore, although the HTTP request carries no cookie value at this point, page 10, containing the image URL path, is returned directly to the browser.
- The browser automatically downloads the image according to the image URL path contained in page 10 of the current category, parses the image with a JS method, extracts the cookie value, and saves it; the cookie value is carried when turning pages.
- Embodiment 3. In another embodiment:
- The browser sends an HTTP request to the server, requesting page 11 of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into page 11.
- The server determines that page 11 is not within the direct access range, so it further determines whether the HTTP request carries a cookie value. Since the browser opened the link directly, the HTTP request carries no cookie value, so the first page of the current category is returned to the browser.
- The crawler sends an HTTP request to the server, requesting the first page of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into the first page.
- The server is preset with a range of pages 1-10 to which direct access is allowed. The server determines that the first page falls within this range and therefore returns the first page, containing the image URL path, to the crawler.
- In the prior art, a crawler does not download images, nor does it use a JS method to parse them, because doing so would greatly increase the crawler's costs, including CPU and bandwidth costs. Therefore, unlike a browser, the crawler does not extract the cookie value from the image and carry it when accessing other pages, and so it is intercepted by the server.
- The crawler sends an HTTP request to the server, requesting page 11 of the current category.
- The server generates the URL path of the picture containing the cookie value and saves it into page 11.
- The server determines that page 11 is not within the direct access range, so it further determines whether the HTTP request carries a cookie value. Since an HTTP request sent by the crawler cannot carry a cookie value, the server returns the first page of the current category to the crawler.
- In this way, the web crawler can only capture a limited number of pages, while normal browser access is ensured.
- an embodiment of the present invention also provides a device for intercepting a crawler, which is applied to a server end, as shown in FIG.
- the device includes:
- a generating and saving unit 301 which, after receiving an access request for a page sent by the client, generates a current field value for identifying a crawler, generates a picture attribute value for the picture in which the field value is saved, and saves the picture's uniform resource locator (URL) path, which contains the picture attribute value, into the requested page;
- a processing unit 302 which determines whether the page currently being accessed is a page to which direct access is allowed and, if so, returns the requested page to the client; if not, it further determines whether the access request contains a valid field value for identifying a crawler. If the field value is valid, the requested page is returned to the client; if the access request contains no such field value, or the field value it contains is invalid, the client is confirmed to be a crawler and the first page of the category of the accessed page is returned to the client.
- The invention also proposes a device for intercepting a crawler, applied to a client that is a browser, comprising:
- a download unit which downloads the image to the browser according to the image URL path contained in the page returned by the server;
- an extracting unit which parses the image, extracts the field value for identifying a crawler, and carries that field value in the access request when the browser accesses other pages.
- Referring now to FIG. 4, there is shown a block diagram of a computer system 400 suitable for implementing a terminal device of an embodiment of the present invention.
- The terminal device shown in FIG. 4 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.
- The computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage portion 408 into a random access memory (RAM) 403.
- In the RAM 403, various programs and data required for the operation of the system 400 are also stored.
- the CPU 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
- An input/output (I/O) interface 405 is also coupled to bus 404.
- The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse, and the like; an output portion 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 408 including a hard disk or the like; and a communication portion 409 including a network interface card such as a LAN card, a modem, or the like. The communication portion 409 performs communication processing via a network such as the Internet.
- A drive 410 is also coupled to the I/O interface 405 as needed.
- a removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like is mounted on the drive 410 as needed so that a computer program read therefrom is installed into the storage portion 408 as needed.
- embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for executing the method illustrated in the flowchart.
- the computer program can be downloaded and installed from the network via the communication portion 409, and/or installed from the removable medium 411.
- the computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium or any combination of the two.
- the computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
- More specific examples of computer readable storage media may include, but are not limited to, an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- a computer readable storage medium may be any tangible medium that can contain or store a program, which can be used by or in connection with an instruction execution system, apparatus or device.
- a computer readable signal medium may include a data signal that is propagated in the baseband or as part of a carrier, in which computer readable program code is carried. Such propagated data signals can take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer readable signal medium can also be any computer readable medium other than a computer readable storage medium, which can transmit, propagate, or transport a program for use by or in connection with the instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium can be transmitted by any suitable medium, including but not limited to wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
- Each block of the flowchart or block diagrams can represent a module, a program segment, or a portion of code that includes one or more executable instructions.
- The functions noted in the blocks may also occur in a different order than that illustrated in the drawings. For example, two successively represented blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
- Each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments of the present invention may be implemented by software or by hardware.
- The described units can also be provided in a processor; for example, it can be described as: a processor includes a generating and saving unit and a processing unit.
- The names of these units do not, in some cases, constitute a limitation on the units themselves.
- For example, the generating and saving unit may also be described as "a unit that generates the current field value for identifying a crawler after receiving the access request for a page sent by the client".
- the present invention also provides a computer readable medium, which may be included in the apparatus described in the above embodiments, or may be separately present and not incorporated in the apparatus.
- the computer readable medium carries one or more programs.
- the device includes: after the server receives the access request of the access page sent by the client, generating the current use.
- the crawler traffic is directed to the CDN server of each province and city, thereby further protecting the server and ensuring that users can access normally.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to a method and a device for intercepting a crawler, a server, and a storage medium. The method comprises: after receiving an access request sent by a client to access a page, the server side generates a current field value for identifying a crawler and generates an image attribute value so as to save the field value in an image; an image uniform resource locator (URL) path containing the image attribute value is saved in the requested page; the server side determines whether the page currently being accessed is a page for which direct access is permitted; if so, the requested page is returned to the client; if not, the server side further determines whether the access request contains a valid field value for identifying a crawler; if a valid field value is present, the requested page is returned to the client; and if the access request contains no field value for identifying a crawler, or the field value it contains is invalid, the request is confirmed as coming from a crawler and a first classified page of the accessed page is returned to the client. With the present invention, crawler access can be intercepted effectively.
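The decision flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the HMAC-signed timestamp used as the field value, the `fv` query attribute, the `/static/pixel.gif` path, the validity window, the set of directly accessible pages, and the reading of "first classified page" as the top-level category of the requested path are all assumptions made for illustration. How the field value travels back to the server on later requests is not stated in the abstract, so the sketch simply takes it as a parameter.

```python
import hashlib
import hmac
import secrets
import time
from urllib.parse import urlencode

# All constants below are hypothetical; the source does not specify them.
SECRET_KEY = secrets.token_bytes(32)          # per-server signing key
FIELD_TTL_SECONDS = 300                       # how long a field value stays valid
DIRECT_ACCESS_PAGES = {"/", "/index.html"}    # pages that may be accessed directly


def _sign(issued: str) -> str:
    return hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()


def generate_field_value(now=None) -> str:
    """Generate the current field value used to recognize a crawler."""
    issued = str(int(time.time() if now is None else now))
    return f"{issued}.{_sign(issued)}"


def image_url_for(field_value: str) -> str:
    """Build an image URL path whose attribute carries the field value."""
    return "/static/pixel.gif?" + urlencode({"fv": field_value})


def is_valid_field_value(value, now=None) -> bool:
    """A value is valid if it is well formed, correctly signed, and not expired."""
    if not value or "." not in value:
        return False
    issued, sig = value.split(".", 1)
    if not issued.isdigit():
        return False
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    if not hmac.compare_digest(sig.encode(), _sign(issued).encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(issued) <= FIELD_TTL_SECONDS


def first_level_page(path: str) -> str:
    """Assumed meaning of 'first classified page': the top-level category."""
    parts = [p for p in path.split("/") if p]
    return f"/{parts[0]}/" if parts else "/"


def handle_request(path: str, field_value=None, now=None) -> dict:
    """Decide which page to return for an access request."""
    # A fresh field value is generated for every request and embedded
    # in the returned page through the image URL path.
    image_url = image_url_for(generate_field_value(now))
    if path in DIRECT_ACCESS_PAGES or is_valid_field_value(field_value, now):
        return {"page": path, "image_url": image_url, "crawler": False}
    # Missing or invalid field value: treat the requester as a crawler.
    return {"page": first_level_page(path), "image_url": image_url, "crawler": True}
```

In this sketch, a browser that renders a page and loads the embedded image would end up holding a valid field value, while a crawler that requests a deep page directly would not, and would receive the top-level page instead of the requested one.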
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201610286222.3 | 2016-05-03 | | |
| CN201610286222.3A CN107341160B (zh) | 2016-05-03 | 2016-05-03 | 一种拦截爬虫的方法及装置 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017190641A1 (fr) | 2017-11-09 |
Family
ID=60202740
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2017/082707 Ceased WO2017190641A1 (fr) | Procédé et dispositif d'interception de robot, terminal de serveur et support lisible par ordinateur | 2016-05-03 | 2017-05-02 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN107341160B (fr) |
| WO (1) | WO2017190641A1 (fr) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109784960B (zh) * | 2017-11-10 | 2024-05-14 | 北京奇虎科技有限公司 | 一种创意自动化审核方法、装置和设备 |
| CN108763274B (zh) * | 2018-04-09 | 2021-06-11 | 北京三快在线科技有限公司 | 访问请求的识别方法、装置、电子设备及存储介质 |
| CN109492146B (zh) * | 2018-11-09 | 2021-06-29 | 杭州安恒信息技术股份有限公司 | 一种防web爬虫的方法和装置 |
| CN110958228A (zh) * | 2019-11-19 | 2020-04-03 | 用友网络科技股份有限公司 | 爬虫访问拦截方法及设备、服务器和计算机可读存储介质 |
| CN111683098B (zh) * | 2020-06-10 | 2022-12-23 | 创新奇智(成都)科技有限公司 | 反爬虫方法、装置、电子设备及存储介质 |
| CN111783006B (zh) * | 2020-07-22 | 2024-07-12 | 网易(杭州)网络有限公司 | 页面的生成方法、装置、电子设备及计算机可读介质 |
| CN116467504A (zh) * | 2023-03-28 | 2023-07-21 | 深圳市一览网络股份有限公司 | 一种爬虫阻断方法、设备及存储介质 |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101635622A (zh) * | 2008-07-24 | 2010-01-27 | 阿里巴巴集团控股有限公司 | 一种网页加密和解密的方法、系统及设备 |
| US20110208714A1 (en) * | 2010-02-19 | 2011-08-25 | c/o Microsoft Corporation | Large scale search bot detection |
| CN102833212A (zh) * | 2011-06-14 | 2012-12-19 | 阿里巴巴集团控股有限公司 | 网页访问者身份识别方法及系统 |
| US20140019488A1 (en) * | 2012-07-16 | 2014-01-16 | Salesforce.Com, Inc. | Methods and systems for regulating database activity |
| CN105426415A (zh) * | 2015-10-30 | 2016-03-23 | Tcl集团股份有限公司 | 网站访问请求的管理方法、装置及系统 |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7130466B2 (en) * | 2000-12-21 | 2006-10-31 | Cobion Ag | System and method for compiling images from a database and comparing the compiled images with known images |
| CN103107948B (zh) * | 2011-11-15 | 2016-02-03 | 阿里巴巴集团控股有限公司 | 一种流量控制方法和装置 |
| CA2762544C (fr) * | 2011-12-20 | 2019-03-05 | Ibm Canada Limited - Ibm Canada Limitee | Identification des demandes qui invalident les sessions utilisateurs |
| CN102663025B (zh) * | 2012-03-22 | 2014-04-02 | 浙江盘石信息技术有限公司 | 一种违规在线商品检测方法 |
| CN104281607A (zh) * | 2013-07-08 | 2015-01-14 | 上海锐英软件技术有限公司 | 微博热点话题分析方法 |
| CN104281626B (zh) * | 2013-07-12 | 2018-01-19 | 阿里巴巴集团控股有限公司 | 基于图片化处理的网页展示方法及网页展示装置 |
- 2016-05-03: CN application CN201610286222.3A, published as CN107341160B (zh); status: Active
- 2017-05-02: WO application PCT/CN2017/082707, published as WO2017190641A1 (fr); status: Ceased
Cited By (27)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109657176A (zh) * | 2018-10-16 | 2019-04-19 | 深圳壹账通智能科技有限公司 | 网络使用状态识别方法、装置、设备及可读存储介质 |
| CN110069688A (zh) * | 2019-03-16 | 2019-07-30 | 平安城市建设科技(深圳)有限公司 | 反爬虫的页面展示方法、服务器、存储介质及装置 |
| CN110209911A (zh) * | 2019-06-03 | 2019-09-06 | 桂林电子科技大学 | 一种基于请求成功率的自适应休眠时间调节方法 |
| CN110209911B (zh) * | 2019-06-03 | 2023-03-28 | 桂林电子科技大学 | 一种基于请求成功率的自适应休眠时间调节方法 |
| CN112784195A (zh) * | 2019-11-07 | 2021-05-11 | 北京沃东天骏信息技术有限公司 | 一种页面数据发布方法和系统 |
| CN111475700A (zh) * | 2020-03-19 | 2020-07-31 | 平安国际智慧城市科技股份有限公司 | 一种数据提取方法及相关设备 |
| CN111428108A (zh) * | 2020-03-25 | 2020-07-17 | 山东浪潮通软信息科技有限公司 | 一种基于深度学习的反爬虫方法、装置和介质 |
| CN111614652A (zh) * | 2020-05-15 | 2020-09-01 | 广东科徕尼智能科技有限公司 | 一种爬虫识别拦截方法、设备、存储介质 |
| CN113704080A (zh) * | 2020-05-22 | 2021-11-26 | 北京沃东天骏信息技术有限公司 | 一种自动化测试方法和装置 |
| CN112003819A (zh) * | 2020-07-07 | 2020-11-27 | 瑞数信息技术(上海)有限公司 | 识别爬虫的方法、装置、设备和计算机存储介质 |
| CN112003819B (zh) * | 2020-07-07 | 2022-07-01 | 瑞数信息技术(上海)有限公司 | 识别爬虫的方法、装置、设备和计算机存储介质 |
| CN112073412A (zh) * | 2020-09-08 | 2020-12-11 | 北京天融信网络安全技术有限公司 | 一种反爬虫方法、装置、处理器及计算机可读介质 |
| CN113010818A (zh) * | 2021-02-23 | 2021-06-22 | 腾讯科技(深圳)有限公司 | 访问限流方法、装置、电子设备及存储介质 |
| CN113010818B (zh) * | 2021-02-23 | 2023-06-30 | 腾讯科技(深圳)有限公司 | 访问限流方法、装置、电子设备及存储介质 |
| CN113515682A (zh) * | 2021-05-19 | 2021-10-19 | 平安国际智慧城市科技股份有限公司 | 数据爬取方法、装置、计算机设备和存储介质 |
| CN113901299A (zh) * | 2021-08-31 | 2022-01-07 | 重庆小雨点小额贷款有限公司 | 一种数据处理方法、装置及计算机可读存储介质 |
| CN113806614A (zh) * | 2021-10-10 | 2021-12-17 | 北京亚鸿世纪科技发展有限公司 | 一种基于分析Http请求的网络爬虫快速识别装置 |
| CN113806614B (zh) * | 2021-10-10 | 2024-05-17 | 北京亚鸿世纪科技发展有限公司 | 一种基于分析Http请求的网络爬虫快速识别装置 |
| CN114386059A (zh) * | 2021-12-15 | 2022-04-22 | 北京五八信息技术有限公司 | 网页文本混淆反爬虫方法、装置、电子设备及存储介质 |
| CN115037507B (zh) * | 2022-04-22 | 2024-04-05 | 京东科技控股股份有限公司 | 用户访问管理的方法、装置和系统 |
| CN115037507A (zh) * | 2022-04-22 | 2022-09-09 | 京东科技控股股份有限公司 | 用户访问管理的方法、装置和系统 |
| CN115632817A (zh) * | 2022-09-22 | 2023-01-20 | 浪潮卓数大数据产业发展有限公司 | 一种安卓端反爬方法及装置 |
| CN115632817B (zh) * | 2022-09-22 | 2023-09-05 | 浪潮卓数大数据产业发展有限公司 | 一种安卓端反爬方法及装置 |
| CN116455660A (zh) * | 2023-05-04 | 2023-07-18 | 北京数美时代科技有限公司 | 页面访问请求的控制方法、系统、存储介质和电子设备 |
| CN116455660B (zh) * | 2023-05-04 | 2023-10-17 | 北京数美时代科技有限公司 | 页面访问请求的控制方法、系统、存储介质和电子设备 |
| CN116932854A (zh) * | 2023-09-14 | 2023-10-24 | 百鸟数据科技(北京)有限责任公司 | 一种网页信息反爬虫方法、装置、系统、设备及存储介质 |
| CN118573449A (zh) * | 2024-06-07 | 2024-08-30 | 舟谱数据技术南京有限公司 | 一种授信爬虫识别及防御方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN107341160B (zh) | 2020-09-01 |
| CN107341160A (zh) | 2017-11-10 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| WO2017190641A1 (fr) | Procédé et dispositif d'interception de robot, terminal de serveur et support lisible par ordinateur | |
| US11416564B1 (en) | Web scraper history management across multiple data centers | |
| CN102710748B (zh) | 数据获取方法、系统及设备 | |
| CN105100294B (zh) | 获取网页的方法、系统、网络服务器、浏览器和gslb | |
| CN107547548B (zh) | 数据处理方法及系统 | |
| CN104335524B (zh) | 用于客户端侧页面处理的公共web可访问数据存储 | |
| US8484373B2 (en) | System and method for redirecting a request for a non-canonical web page | |
| CN110661826B (zh) | 代理服务器端处理网络请求的方法和代理服务器 | |
| WO2015106692A1 (fr) | Procédé, client, serveur, et système de poussée de page web | |
| CN111614624A (zh) | 风险检测方法、装置、系统及存储介质 | |
| WO2010051766A1 (fr) | Procédé et dispositif pour acquérir des informations de ressources cibles | |
| CN105959358A (zh) | Cdn服务器及其缓存数据的方法 | |
| US20230018983A1 (en) | Traffic counting for proxy web scraping | |
| CN113452733A (zh) | 文件下载方法和装置 | |
| US8676884B2 (en) | Security configuration | |
| CN106850572A (zh) | 目标资源的访问方法和装置 | |
| WO2017088369A1 (fr) | Procédé, dispositif et système de requête de données inter-domaines | |
| CN103905477B (zh) | 一种处理http请求的方法及服务器 | |
| CN111953718B (zh) | 一种页面调试方法和装置 | |
| CN117633320A (zh) | 基于湖仓一体的数据处理方法、装置和电子设备 | |
| WO2015123990A1 (fr) | Procédé, dispositif, serveur et système d'envoi de page | |
| CN103957252B (zh) | 云储存系统的日志获取方法及其系统 | |
| CN116186723A (zh) | 一种权限控制系统、方法、设备、介质及产品 | |
| CN106411978A (zh) | 一种资源缓存方法及装置 | |
| CN112149392A (zh) | 一种富文本编辑方法和装置 | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17792467; Country of ref document: EP; Kind code of ref document: A1 |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.02.19) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17792467; Country of ref document: EP; Kind code of ref document: A1 |