WO2002010968A2 - Data mining system - Google Patents
Data mining system
- Publication number
- WO2002010968A2 WO2002010968A2 PCT/US2001/041515 US0141515W WO0210968A2 WO 2002010968 A2 WO2002010968 A2 WO 2002010968A2 US 0141515 W US0141515 W US 0141515W WO 0210968 A2 WO0210968 A2 WO 0210968A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- database
- information
- mail address
- person
- name
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0241—Advertisements
- G06Q30/0251—Targeted advertisements
- G06Q30/0257—User requested
- G06Q30/0258—Registration
Definitions
- a global computer network e.g., the Internet
- a global computer network is formed of a plurality of computers coupled to a communication line for communicating with each other.
- Each computer is referred to as a network node.
- Some nodes serve as information bearing sites while other nodes provide connectivity between end users and the information bearing sites.
- the explosive growth of the internet makes it an essential component of every business, organization and institution strategy, and leads to massive amounts of information being placed in the public domain for people to read and explore.
- the type of information available ranges from information about companies and their products, services, activities, people and partners, to information about conferences, seminars, and exhibitions, to news sites, to information about universities, schools, colleges, museums and hospitals, to information about government organizations, their purpose, activities and people.
- the Internet became the venue of choice for every organization for providing pertinent, detailed and timely information about themselves, their cause, services and activities.
- the Internet essentially is nothing more than the network infrastructure that connects geographically dispersed computer systems. Every such computer system may contain publicly available (shareable) data that are available to users connected to this network. However, until the early 1990's there was no uniform way or standard conventions for accessing this data. The users had to use a variety of techniques to connect to remote computers (e.g. telnet, ftp, etc) using passwords that were usually site-specific, and they had to know the exact directory and file name that contained the information they were looking for.
- the World Wide Web was created in an effort to simplify and facilitate access to publicly available information from computer systems connected to the Internet.
- a set of conventions and standards was developed that enabled users to access every Web site (computer system connected to the Web) in the same uniform way, without the need to use special passwords or techniques. In addition, Web browsers became available that let users navigate easily through Web sites by simply clicking hyperlinks (words or sentences connected to some Web resource).
- Web Domain: a Web domain is an Internet address that provides a connection to a Web server.
- URL stands for Uniform Resource Locator.
- the first part describes the protocol used to access the content pointed to by the URL
- the second contains the directory in which the content is located
- the third contains the file that stores the content: <protocol>://<domain>/<directory>/<file>
- For example:
  - http://www.corex.com/bios.html
  - http://www.cardscan.com/index.html
  - http://fn.cnn.com/archives/may99/pr37.html
  - ftp://shiva.lin.com/soft/words.zip
- the <protocol> part may be missing.
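- The URL parts described above can be illustrated with a short sketch. It uses Python's standard `urllib.parse.urlsplit`; the function name `split_url` and the dictionary keys are illustrative choices, not terms from the patent.

```python
import posixpath
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the protocol, domain, directory and file parts."""
    parts = urlsplit(url)
    # The path holds both the directory and the file name.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "domain": parts.netloc,
        "directory": directory,
        "file": filename,
    }


if __name__ == "__main__":
    for example in (
        "http://www.corex.com/bios.html",
        "http://fn.cnn.com/archives/may99/pr37.html",
        "ftp://shiva.lin.com/soft/words.zip",
    ):
        print(split_url(example))
```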
- A Web page is the content associated with a URL. In its simplest form, this content is static text, which is stored in a text file indicated by the URL. However, very often the content contains multi-media elements (e.g. images, audio, video, etc.) as well as non-static text or other elements (e.g. news tickers, frames, scripts, streaming graphics, etc.). Very often, more than one file forms a Web page; however, there is only one file that is associated with the URL and which initiates or guides the Web page generation.
- Web Browser: a Web browser is a software program that allows users to access the content stored in Web sites. Modern Web browsers can also create content "on the fly", according to instructions received from a Web site. This concept is commonly referred to as "dynamic page generation". In addition, browsers can commonly send information back to the Web site, thus enabling two-way communication between the user and the Web site.
- Some of these search engines have a user-friendly front end that accepts natural-language queries. In general, these queries are analyzed to extract the keywords the user is possibly looking for, and then a simple keyword-based search is performed through the engine's indexes.
- this essentially corresponds to querying only one field in a database, and it lacks the multi-field queries that are typical of any database system.
- Web queries cannot become very specific; therefore they tend to return thousands of results, of which only a few may be relevant.
- the "results" returned are not specific data, similar to what database queries typically return; instead, they are lists of Web pages, which may or may not contain the requested answer.
- In order to leverage the information retrieval power and search sophistication of database systems, the information needs to be structured, so that it can be stored in database format. Since the Web contains mostly unstructured information, methods and techniques are needed to extract data and discover patterns in the Web in order to transform the unstructured information into structured data.
- the Web is a vast repository of information and data that grows continuously. Information traditionally published in other media (e.g. manuals, brochures, magazines, books, newspapers, etc.) is now increasingly published either exclusively on the Web, or in two versions, one of which is distributed through the Web. In addition, older information and content from traditional media is now routinely transferred into electronic format to be made available in the Web, e.g. old books from libraries, journals from professional associations, etc. As a result, the Web becomes gradually the primary source of information in our society, with other sources (e.g. books, journals, etc) assuming a secondary role.
- the purpose of the present invention is to extract this kind of public data about people from the Web and organize it into a database, so that simple database queries can answer such questions.
- this invention also extracts from the Web organization information. Many people are working in positions that directly relate to the organization's core activities. Hence their skills, knowledge and specialty likely match the activity of the workplace. Gathering information about the organization adds another dimension to the biographical information collected and maintained about people.
- An organization's Web site contains a lot of information about the organization, its business, products, mission, people, location, partners and more. As with people, the described invention can only collect and organize information that exists on the site itself; hence the level, accuracy and amount of collected information will vary from organization to organization. In general, one can expect to find some or all of the following information.
- Contact information including phone, fax and certain general email addresses such as sales@corpx.com
- Web site are significant sources of information identifying the main keywords describing the organization and hence the people who are associated with it. Noun phrases such as "signal processing”, “public relations”, “intellectual property” or “early childhood” can dramatically narrow a search for people in a specific profession.
- the purpose of the present invention is to develop an automated approach for the data extraction and collection.
- the benefits over the manual approach are obvious: a) automation is cheap. Computers can work 24 hours a day, 7 days a week. A single personal computer can replace tens of human workers in this data extraction task. b) computers are fast. In general, using the method described in this invention, 5 minutes of computer time in a low-cost computer can produce the same data as several man-hours of manual work. c) very high accuracy can be achieved. Even though programming errors are unavoidable, these errors can usually be found and corrected fairly easily, so that the accuracy of the system increases over time.
- a computer automated system and method mines, from a global computer network, information on people and organizations.
- the invention system includes: a plurality of automated crawlers for traversing sites of a global computer network and retrieving pages that contain information of interest; a distributor coupled to the crawlers for controlling crawler processing; an extractor responsive to the crawler retrieved pages and extracting information about people and organizations therefrom, the extracted information being stored in a database; an integrator coupled to the database for resolving duplicate information and combining related information in the database; and a post processor coupled to the database for analyzing contents of the database and generating missing information therefrom.
- the database stores information about different people in different respective records.
- Given two records of potentially the same person, the integrator combines the two records if: (a) the name of the person is the same in the two records, and (b) either the affiliated organization name or the respective title is the same in the two records. The integrator may also consider person name and title combination matches in light of the statistical rarity of the title and the person's name.
- the post-processor generates an email address (e.g. business/non-personal email address) of a subject person named in the database with respect to organization named in the database for the subject person.
- the email address is generated by the post-processor: obtaining a working e-mail address to the respective organization, the working e-mail address not being the e-mail address of the subject person; deducing from the working e-mail address, format of e-mail addresses to the respective organization;
- the post-processor also may utilize predefined common email address formats to construct potential business/non-personal email addresses of the subject person.
- Fig. 1 is a block diagram of a preferred embodiment of the present invention.
- Fig. 2 is a schematic illustration of a global computer network in which the invention system of Fig. 1 operates.
- Fig. 3 is a flow diagram of email address interpolation by a post-processor of the embodiment of Fig. 1.
- Fig. 4 is a flow diagram of duplicate record detection for merger by an integrator of the embodiment of Fig. 1.
- the main components of the invention system 40 are the following:
  - a) The "Crawler" 11: a software robot that visits and traverses Web sites in search of Web pages that contain information of interest.
  - b) The "Distributor" 47: a software system that controls several Crawler processes.
  - c) The "Extractor" 41: a software module that processes Web pages returned by the Crawler 11 to extract the information about people and companies (organizations).
  - d) The "Loader" 43: a software program that loads the data found by the Extractor into the database.
  - e) The database 45: the place where all the information is stored.
  - f) The "Integrator" 49: a software module that resolves duplicates and combines related information in the database.
  - g) The "Post-Processor" 51: a module which enhances the data; in particular, it analyzes the data and adds missing pieces of information, such as email addresses.
- It is understood that each component 11, 47, 41, 43, 45, 49, 51 is implemented in hardware, software or a combination thereof and is executed by digital processing means (e.g., a computer) 27. A single computer or a series of computers processing in parallel, distributed or other fashion is suitable. For example, as illustrated in Fig. 2, computer 27 executes invention system 40 in working memory. Computer 27 is coupled across communication lines 23 to a global network 21 of computers 25.
- Each node 25, 27 on the network 21 has a respective architecture (e.g., local area network, wide area network, client server, etc.) which may use routers, high speed connections, and the like to couple to global network 21.
- Some nodes may serve as service providers or host servers to a multiplicity of end users, and so forth.
- the Crawler 11 is a software robot that systematically visits and traverses Web sites in order to identify and collect Web pages that contain information of interest to the users. Such a robot for extracting information about people and organizations is described in detail in U.S. Patent Application
- contact information list of people, multimedia content, etc.
- All of the data collected by the crawler 11 are passed to the other components of the system so that they may use this data in their own analyses.
- the automated system 40 described by this invention needs to be as efficient as possible because of the sheer size of the Web.
- One of the measures of efficiency is the number of Web sites visited and traversed per hour.
- the Web currently contains many millions of Web sites (estimates in January 2000 put this number at over 10 million Web sites, and the number keeps increasing exponentially).
- a system that can visit and extract information from an average of 10 Web sites per hour will need at least 1 million hours to cover the entire Web, that is, about 100 years!
- a system that can visit 1,000 Web sites per hour (100 times more efficient) will need about 1 year to cover the entire Web, whereas a system capable of visiting 10,000 Web sites per hour can cover the entire Web in less than 2 months.
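- The coverage estimates above follow from simple division. The sketch below reproduces the arithmetic, taking the 10 million Web site estimate from the text; the function name and the constant `HOURS_PER_YEAR` are illustrative.

```python
HOURS_PER_YEAR = 24 * 365
TOTAL_SITES = 10_000_000  # January 2000 estimate quoted in the text


def hours_to_cover(total_sites: int, sites_per_hour: int) -> float:
    """Hours needed to visit every site once at a fixed throughput."""
    return total_sites / sites_per_hour


if __name__ == "__main__":
    for rate in (10, 1_000, 10_000):
        hours = hours_to_cover(TOTAL_SITES, rate)
        print(
            f"{rate:>6} sites/hour: {hours:>9,.0f} hours, "
            f"{hours / 24:,.0f} days, {hours / HOURS_PER_YEAR:.2f} years"
        )
```

- At 10 sites per hour the result is 1,000,000 hours (roughly 114 years, which the text rounds to about 100); at 10,000 sites per hour it is 1,000 hours, or about 42 days.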
- system throughput is used to refer to the number of Web sites visited and processed per hour.
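- the system throughput is related to the average time that the invention system 40 needs to visit and process (extract information from) one Web site: $\text{System throughput} = 1 / (\text{Average time per Web site})$
- the average time per Web site is the sum of the times needed by each system module to perform its functions, that is: $\text{Average time per Web site} = T_{\text{Crawler}} + T_{\text{Extractor}} + T_{\text{Loader}} + T_{\text{Integrator}}$
- $T_{\text{Crawler}}$ is the time required by the Crawler 11 to crawl one Web site
- $T_{\text{Extractor}}$ is the time required by the Extractor 41 to extract data from the contents of one Web site
- $T_{\text{Loader}}$ is the time required by the Loader 43 to load the data extracted from one Web site into the database 45
- $T_{\text{Integrator}}$ is the time required by the Integrator 49 to post-process and clean up the data related to one Web site.
- When the modules run concurrently rather than one after another, the average time per Web site is bounded by the slowest module: $\text{Average time per Web site} = \max(T_{\text{Crawler}}, T_{\text{Extractor}}, T_{\text{Loader}}, T_{\text{Integrator}})$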
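- The difference between the sum and the maximum can be seen in a small sketch. The per-module times below are made-up illustrative numbers, not measurements from the patent, and the assumption is that the maximum applies when the modules run concurrently as a pipeline.

```python
from typing import Dict


def sequential_time(stage_minutes: Dict[str, float]) -> float:
    """Average time per site when modules run one after another."""
    return sum(stage_minutes.values())


def pipelined_time(stage_minutes: Dict[str, float]) -> float:
    """Average time per site when modules overlap: the slowest stage dominates."""
    return max(stage_minutes.values())


def throughput_per_hour(minutes_per_site: float) -> float:
    """Sites processed per hour for a given average time per site."""
    return 60.0 / minutes_per_site


# Hypothetical per-site processing times, in minutes.
STAGES = {"crawler": 3.0, "extractor": 1.5, "loader": 0.5, "integrator": 1.0}

if __name__ == "__main__":
    print("sequential:", throughput_per_hour(sequential_time(STAGES)), "sites/hour")
    print("pipelined: ", throughput_per_hour(pipelined_time(STAGES)), "sites/hour")
```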
- This module is the Distributor 47. Because the Distributor 47 is integrated with the Crawler 11, it is able to adjust the schedule of which websites to visit by leveraging the information that the Crawler 11 extracted during previous visits. Because the crawler 11 is automatically determining the site type, the Distributor 47 is able to give higher priority to some sites and lower priority to others.
- the Distributor 47 may adjust the schedule to visit this website on a daily basis.
- the Distributor 47 adjusts the schedule to visit the website only every six months or longer.
- the Distributor 47 offers exactly this functionality: it is a software module whose main function is to control and distribute work to multiple Crawlers 11.
- the Distributor 47 uses a database 14 to keep track of all the Web sites that must be visited, and the visiting schedule for each one (some Web sites must be visited more frequently than others, depending on how often their contents change).
- the Distributor 47 prioritizes the Web sites according to their relative importance for the users, and it manages the Crawlers 11 so that the most important sites are visited first.
- the Distributor 47 is responsible for starting multiple Crawler 11 processes and keeping their number as high as possible without hurting the overall system performance. It also monitors the status of the running Crawler processes and stops or kills any processes that exhibit unwanted behavior (e.g. a process that takes too long, uses too much memory or disk space, etc.).
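- A minimal sketch of the scheduling behavior described above, assuming a priority queue keyed on the next visit day and then on site importance. The class and field names, the example domains, and the day-based clock are all hypothetical; the patent does not specify the Distributor's data structures, and process monitoring is left out.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class ScheduledSite:
    next_visit_day: int                       # earliest day the site is due
    neg_priority: int                         # negated so higher priority sorts first
    domain: str = field(compare=False)
    revisit_days: int = field(compare=False)  # must be at least 1


class Distributor:
    """Hands due sites to crawlers, most important first, then reschedules them."""

    def __init__(self) -> None:
        self._queue: List[ScheduledSite] = []

    def add_site(self, domain: str, priority: int, revisit_days: int,
                 first_visit_day: int = 0) -> None:
        heapq.heappush(
            self._queue,
            ScheduledSite(first_visit_day, -priority, domain, revisit_days),
        )

    def next_batch(self, today: int, max_crawlers: int) -> List[str]:
        """Return up to max_crawlers domains that are due on or before today."""
        batch: List[str] = []
        while (self._queue and len(batch) < max_crawlers
               and self._queue[0].next_visit_day <= today):
            site = heapq.heappop(self._queue)
            batch.append(site.domain)
            # Reschedule according to how often the site's content changes.
            heapq.heappush(
                self._queue,
                ScheduledSite(today + site.revisit_days, site.neg_priority,
                              site.domain, site.revisit_days),
            )
        return batch


if __name__ == "__main__":
    d = Distributor()
    d.add_site("news.example", priority=5, revisit_days=1)      # changes daily
    d.add_site("corp.example", priority=3, revisit_days=30)
    d.add_site("static.example", priority=1, revisit_days=180)  # rarely changes
    print(d.next_batch(today=0, max_crawlers=2))
```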
- Every Crawler 11 returns and saves in local storage 48 a set of Web pages 12 that potentially contain useful information.
- These Web pages 12 are then processed by a software module that can extract data from HTML code or plain text, the Extractor 41.
- the Extractor 41 uses linguistic methods to parse and "understand” text so that it identifies and extracts useful information.
- the users of the system 40 define what they consider to be "useful” information, and customize accordingly the Extractor 41.
- the Extractor 41 itself is a very generic and flexible tool, that has the ability to read and parse correctly text written in any language, according to the syntax and grammar rules of that language. However, it needs customization in order to work with a specific language (e.g. English, French, German, etc.) and furthermore, it needs to be “trained” in order to recognize what the users consider useful information.
- Customizing the Extractor 41 for a specific language means that one provides it with a set of syntax and grammar rules so that it correctly identifies subject, verb and object in a sentence, it recognizes the time that the sentence refers to (past, present or future), it recognizes the beginning and end of sentences, etc.
- Training the Extractor 41 to recognize "useful" information means that one provides it with rules and dictionaries of specific terms so that it recognizes keywords and, using the rules, decides whether something is useful or not.
- this training may be automated to a significant degree by using examples of "useful" and "useless" text and letting the Extractor 41 determine statistically which terms may be considered keywords of useful information, and also which rules (or tests) are good candidates for use during the data extraction process.
- the Extractor 41 may use for its internal decision making and pattern recognition, for example, template-based pattern recognition (see U.S. Patent Application No. 09/585,320 filed on June 2, 2000 for a "Method and Apparatus for Deriving Information from Written Text"), Bayesian Networks for decision making (see U.S. Patent Application No.
- Extracting Data from Web Pages Attorney Docket No. 2937.1000-005. That Extractor 41 uses various methods and techniques described in U.S. Patent Application No. 09/585,320 filed on June 2, 2000 for a "Method and Apparatus for Deriving Information from Written Text". For mining information about people and organizations, the Extractor 41 extracts the following data:
  - a) Names of people
  - b) Positions that these people hold or have held, including title, organization name, organization location, start and end dates, and whether the person still holds the position
  - c) Educational degrees these people have received
  - d) Certifications that these people have received (e.g. CPA, RN, LCSW,
- Extractor 41 places the foregoing extracted data into working records 16 (for information on people), 17 (for information on organizations). After the Extractor 41 has processed the Web pages returned by a Crawler 11 and it has extracted the useful information, it passes the extracted information (records 16, 17) to the Loader 43, which is the software module responsible for storing the information in the database 45.
- the Loader 43 is the software module responsible for storing the information in the database 45.
- One of Loader's 43 responsibilities is to make sure that the information is internally consistent, for example, with no duplicate or conflicting data (i.e., no duplicate records 16, 17).
- the Loader 43 also implements data filtering rules that have been given by the system users in order to avoid cluttering the database with "garbage" data.
- the Extractor 41 may return any information it finds connected to a person's name.
- the Loader 43 may employ filters to discard any information referring to fictional characters or historical figures, e.g. Donald Duck or Alexander the Great, and load in the database 45 only what appears to be current information about real (and alive) people.
- this filtering may also be performed as post-processing by the Integrator 49, however, by doing the filtering prior to loading the information/records 16, 17 into the database 45, one avoids cluttering the system with obviously useless data.
- some filtering rules may be employed by the Extractor 41 , however, the Extractor 41 preferably does not communicate directly with the database 45, and some of the filtering may require database access.
- Another major responsibility of the Loader 43 is to merge information found by the Extractor 41 from separate Web pages in the same Web site. In general, the Extractor 41 works in a page-by-page mode, extracting any information it finds in each individual page. Very often, though, the same information may be found repeatedly in more than one Web page from the Web site, e.g. every press release potentially contains the company address.
- the Extractor 41 itself may keep track of what information it has found as it progressively processes all the Web pages from a Web site; however, that would require a lot of "bookkeeping" from the Extractor 41.
- a simpler way is to let the Extractor 41 extract all the useful information it finds, and then let the Loader 43 decide what is duplicate information or merge pieces of partial information.
- the first and last name of an employee may be found in one Web page, whereas another Web page contains only his last name and his title.
- the Loader 43 recognizes that these two pieces of information actually complement one another, and that they may be safely merged into one piece that contains the first name, last name, and title of the person.
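- The merge of complementary partial records can be sketched as follows. The record layout (plain dictionaries with `first_name`, `last_name`, `title` keys) and the rule of refusing to merge on any conflicting field are assumptions made for illustration; the patent does not give the Loader's exact merge rules.

```python
from typing import Dict, Optional

Record = Dict[str, str]


def merge_partial(a: Record, b: Record) -> Optional[Record]:
    """Merge two same-site person records if they complement each other.

    Returns None when the last names differ or are missing, or when any
    shared field carries conflicting values.
    """
    if not a.get("last_name") or a.get("last_name") != b.get("last_name"):
        return None
    merged: Record = {}
    for key in sorted(set(a) | set(b)):
        value_a, value_b = a.get(key), b.get(key)
        if value_a and value_b and value_a != value_b:
            return None  # conflicting data, keep the records separate
        merged[key] = value_a or value_b
    return merged


if __name__ == "__main__":
    page_one = {"first_name": "Jane", "last_name": "Doe"}   # hypothetical person
    page_two = {"last_name": "Doe", "title": "CFO"}
    print(merge_partial(page_one, page_two))
```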
- the Loader 43 also uses the data collected by the Crawler 11 and the Extractor 41 to tie disparate pieces of information together at the database level. For example, if the Crawler 11 finds that the owner of a website is company "A", and the Extractor 41 finds an address for company "A" and a person working for company "A", the Loader 43 combines this information when storing it in the database 45 to show that this person works for company "A" at the found address.
- the Loader 43 also assigns a date to all of the information that it loads. A press release is often maintained on an organization's Web site for years, but the information can quickly go out of date. For example, if the CEO of a company is replaced, all of the older press releases will still refer to that person as working at the company.
- each record 16, 17 carries two dates: the date that the information was extracted, and the original date of the document if such a date can be found. If a page does not contain a date and it is a management team page, it can generally be assumed to be current.
- the modification date of a document on the Internet cannot be used to date the information in the article, since this can change for technical reasons, such as using a new layout for a Web site, a page can be dynamically generated, etc.
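- The dating rules above can be sketched as a small function. The `DatedRecord` type, the `page_type` labels and the choice to return `None` when no trustworthy date exists are illustrative assumptions; note that the page's modification date is deliberately not an input.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatedRecord:
    extracted_on: date               # when the system extracted the information
    document_date: Optional[date]    # original date found in the document, if any
    page_type: str                   # e.g. "press_release", "management_team"


def effective_date(record: DatedRecord) -> Optional[date]:
    """Best estimate of when the information was current."""
    if record.document_date is not None:
        return record.document_date
    if record.page_type == "management_team":
        # Undated management team pages are generally assumed to be current.
        return record.extracted_on
    return None  # no reliable date available


if __name__ == "__main__":
    old_release = DatedRecord(date(2001, 7, 1), date(1999, 5, 12), "press_release")
    team_page = DatedRecord(date(2001, 7, 1), None, "management_team")
    print(effective_date(old_release), effective_date(team_page))
```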
- Loader 43 is described in the related U.S. Patent
- the database 45 used by the system 40 must be a modern high-end database that can handle large quantities of data and a high number of transactions.
- the amount of data collected by the system 40 can potentially tax the capacity and capabilities of any database system, therefore particular attention must be paid to the specifications and maintenance of the database 45.
- this also depends on the user requirements and the type of information that the system is designed to collect; for example, a system that collects information about "the health industry" probably requires a higher capacity database than another system that collects information about "zebras". In addition, a system that offers a Web interface through which anybody in the world may access the data requires a much more powerful database than another system which is not expected to have more than 10 users at a time browsing through the data.
- Integrator 49: Another part of the invention system 40 is the Integrator 49, the software module that periodically operates on the data in the database 45, trying to identify duplicate data and aliases, and to merge or remove any incomplete or low-quality data. In essence, the Integrator 49 finds and exploits any data connections that may exist in the database 45.
- b) Data connections within a Web site: Another type of data connection may exist at the Web site level. Very often, a Web site is focused on providing information about a specific "subject". For example, company Web sites usually provide information related to the company, whereas a Web site maintained by the "Johnny Cash Fan Club" probably contains information that is focused exclusively on the singer Johnny Cash. A Web site with such a strong focus tends to assume that human readers are familiar with the central subject of the site, and so the site often provides incomplete information in its Web pages. For example, a company Web site may provide the company address in some Web page without including the company name, since it is assumed that a human reader already knows the company name. As described in the previous sections, this type of data connection is handled in the system by the Loader 43.
- the third type of data connections refers to data collected from different Web sites.
- the products of "RND Corporation” may be found in the RND Corporation's Web site
- the stock ticker for this company may be found in the Fidelity Investments Web site
- a brief description about the company may be found in a press release from the PRNewsWire Web site
- reviews about the company's main product may be found in a trade publication's Web site.
- the Integrator module 49 identifies and handles this type of data connections. As the database 45 is populated with new data and older data are "refreshed” by revisiting Web sites, new interconnections of this type are continuously introduced. For every new piece of information, the Integrator 49 "scans" the database to find other pieces of information that potentially share a connection. Fig. 4 illustrates the Integrator 49 process.
- at step 123, if two records 16 of people with the same name are found working for the same company (organization), either currently or in the past, the records 16 are assumed to be of the same person and are therefore combined (at 140). This is almost always safe for smaller companies. For very large companies, it is possible that two people with the same name work there. However, both of these people would have to be mentioned on the Web and found by the system 40, which reduces the chances that this situation will actually be encountered. Thus, while this process may introduce a small number of erroneous combinations, the vast majority will be correct.
- at step 125, titles are used to determine whether subject records 16 of people information should be combined. In this case, if two records 16 of people with the same name are found to have the same title, it is possible to combine the subject records 16. This cannot be done blindly, since it is entirely possible that there are two people named "John Smith" with the title of "Product Manager" in the world. So after step 125 detects the same title, step 127 determines the statistical rarity of the title shared by the two subject records 16.
- the database 45 itself may be used to determine statistics of the frequency of titles.
- Statistics on names may also be used when combining on a name-title match (step 129). For example, if a relatively rare name, such as "Geoffrey Westerchest", is encountered in two separate records 16, the chances that they are the same person are higher, because there are fewer people with that name. Thus, the requirement for how rare a title needs to be might be relaxed in that case. In other words, while it is quite possible that two different instances of "John Smith, Product Manager" are two different people, it is unlikely that two instances of "Geoffrey Westerchest, Product Manager" are different people. Thus, at step 140, Integrator 49 combines records 16 corresponding to these two Geoffreys, i.e., a statistically rare name but a not so rare/uncommon title, as determined at step 129.
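- The duplicate-detection decisions of steps 123 to 129 can be sketched as a single predicate. The record layout, the use of the product of name and title frequencies as a joint rarity score, and the threshold value are assumptions chosen for illustration; the patent only says that rarer names allow less rare titles.

```python
from typing import Dict, Mapping

Record = Dict[str, str]


def should_combine(a: Record, b: Record,
                   title_freq: Mapping[str, float],
                   name_freq: Mapping[str, float],
                   rarity_threshold: float = 1e-6) -> bool:
    """Decide whether two person records likely describe the same person.

    title_freq and name_freq map a title or name to its relative frequency
    (0..1); unknown values default to 1.0, i.e. they are treated as common.
    """
    if a["name"] != b["name"]:
        return False
    # Same name and same organization: assume the same person.
    if a.get("organization") and a.get("organization") == b.get("organization"):
        return True
    # Same name and same title: combine only if the pairing is rare enough.
    if a.get("title") and a.get("title") == b.get("title"):
        joint = title_freq.get(a["title"], 1.0) * name_freq.get(a["name"], 1.0)
        return joint < rarity_threshold
    return False


# Hypothetical frequency tables, e.g. computed from the database itself.
TITLE_FREQ = {"Product Manager": 1e-3}
NAME_FREQ = {"John Smith": 1e-2, "Geoffrey Westerchest": 1e-7}

if __name__ == "__main__":
    smith = {"name": "John Smith", "title": "Product Manager"}
    rare = {"name": "Geoffrey Westerchest", "title": "Product Manager"}
    print(should_combine(smith, dict(smith), TITLE_FREQ, NAME_FREQ))
    print(should_combine(rare, dict(rare), TITLE_FREQ, NAME_FREQ))
```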
- post-processor 51 (Fig 1). An example of this is using the email addresses of one or two people at an organization to compute email addresses for the other people in the organization. Whenever a person is associated with an organization, post-processor 51 attempts to identify an email address for that person. Most sites will list an email address for at least one of the people in the organization, or at the very least a generic email for the site (e.g. sales@corex.com) revealing the domain name used for sending emails, which in some cases might be different from the domain name of the site. As soon as a single email address is found, post-processor 51 deduces email addresses for the rest of the people at the organization as follows and as illustrated in Fig. 3.
- Most organizations have a standard format for their email addresses based on name of a person. From one found/extracted email address of a person at a given organization, post-processor 51 reverse engineers the organization's standard format for email addresses at step 101.
- the preferred algorithm searches for substrings of the known/given person's name within the given email address. For example, if the name is Dexter Sealy, and his given email address is DeSealy@corex.com, the last name is completely contained within the email address, and the given email address starts with the first two letters of the corresponding person's (Dexter's) first name.
- step 105: {first name: 1st 2 characters}{last name}@corex.com and {first name: 1st 2 characters}{last name: 1st 5 characters}@corex.com.
- step 107 forms and applies rules to database records 16 that indicate people at the given organization whose email addresses are missing from the records 16. Accordingly, the people at the given organization whose records 16 do not indicate respective email addresses have these rules applied to their names, as indicated in the name fields of records 16.
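- The format deduction of steps 101 to 107 can be sketched as below. The sketch assumes the known address has the form DeSealy@corex.com (first two letters of the first name followed by the last name, as the description implies) and only handles formats built from a first-name prefix directly followed by a last-name prefix. The `Rule` tuple, the function names and the second person's name are hypothetical.

```python
from typing import List, Optional, Tuple

# (characters taken from first name, characters taken from last name);
# None for the last-name count means "the whole last name".
Rule = Tuple[int, Optional[int]]


def deduce_rules(first: str, last: str, email: str) -> List[Rule]:
    """Find every prefix rule that reproduces the known address."""
    local = email.split("@", 1)[0].lower()
    first, last = first.lower(), last.lower()
    rules: List[Rule] = []
    for n_first in range(len(first) + 1):
        if not local.startswith(first[:n_first]):
            break
        rest = local[n_first:]
        if rest and last.startswith(rest):
            if rest == last:
                rules.append((n_first, None))  # whole last name
            rules.append((n_first, len(rest)))  # fixed-length prefix
    return rules


def apply_rule(rule: Rule, first: str, last: str, domain: str) -> str:
    """Build a candidate address for another person at the same organization."""
    n_first, n_last = rule
    last_part = last if n_last is None else last[:n_last]
    return f"{first[:n_first]}{last_part}@{domain}".lower()


if __name__ == "__main__":
    rules = deduce_rules("Dexter", "Sealy", "DeSealy@corex.com")
    print(rules)
    for rule in rules:
        print(apply_rule(rule, "Maria", "Papadopoulos", "corex.com"))
```

- Because "Sealy" has exactly five letters, one known address cannot distinguish "whole last name" from "first five letters of the last name", which is why two candidate formats come out of the example.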
- the post-processing routine 51 applies preferred rules 111 of the most common combinations for creating an email address.
- Such common combinations include:
  - {first}.{last}@{server name}
  - {last}@{server name}
  - {first x letters of last name}@{server name}
- email servers will try to alias email addresses to someone in the organization and forward on the message, even if the message originally used an incorrect email address. For example, many email servers will accept an email address in the form of {first name}_{last name}@{server name} and send it to the appropriate person.
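- When no address at the organization is known, candidates can be generated from the common combinations listed above. In this sketch the value of x (here 5) and the ordering of the candidates are assumptions; the patent does not fix either.

```python
from typing import List


def candidate_addresses(first: str, last: str, server: str, x: int = 5) -> List[str]:
    """Build candidate addresses from common formats, without duplicates."""
    first, last = first.lower(), last.lower()
    local_parts = [
        f"{first}.{last}",   # {first}.{last}
        last,                # {last}
        last[:x],            # {first x letters of last name}
        f"{first}_{last}",   # {first name}_{last name}, often aliased by servers
    ]
    candidates: List[str] = []
    for local in local_parts:
        address = f"{local}@{server}"
        if address not in candidates:
            candidates.append(address)
    return candidates


if __name__ == "__main__":
    for address in candidate_addresses("Dexter", "Sealy", "corex.com"):
        print(address)
```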
- step 115 sends a test or email message using a candidate generated email address, out to the person from post processor 51/invention system 40. If the first tested candidate email address is incorrect, an unrecognized recipient reply will be sent back from the mail server to system 40 (host server 27). In such a case, post-processor 51 tries another candidate or an alternate variant of the email address at test step 115 until either a mail delivery acknowledgment is received or no error reply comes back.
- a unique code may be embedded in the subject field of each trial to simplify matching it with the delivery acknowledgment or error message.
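The trial loop of step 115, together with the unique subject code, can be sketched as below. Everything here is an assumption for illustration: `deliver` is a hypothetical callable standing in for the mail transport (it returns `None` when nothing comes back, or a `(kind, subject)` pair where `kind` is `"error"` or `"ack"`), and the `[probe:...]` subject format is invented for this sketch.

```python
import re
import uuid
from typing import Callable, Dict, Iterable, Optional, Tuple

Reply = Optional[Tuple[str, str]]  # (kind, subject) or None if no reply
CODE_RE = re.compile(r"\[probe:([0-9a-f]{32})\]")


def match_reply(subject: str, pending: Dict[str, str]) -> Optional[str]:
    """Map a reply back to the candidate address via its embedded code."""
    m = CODE_RE.search(subject)
    return pending.get(m.group(1)) if m else None


def find_working_address(
    candidates: Iterable[str],
    deliver: Callable[[str, str], Reply],
) -> Optional[str]:
    """Try candidates in order until one produces no matching error reply."""
    pending: Dict[str, str] = {}
    for address in candidates:
        code = uuid.uuid4().hex  # unique code for this trial
        pending[code] = address
        reply = deliver(address, f"Test message [probe:{code}]")
        if reply is not None:
            kind, subject = reply
            if kind == "error" and match_reply(subject, pending) == address:
                continue  # unrecognized recipient: try the next variant
        return address  # acknowledgment received, or no error came back
    return None
```

Because each trial carries its own code, an error reply can be tied to the exact candidate that caused it even when replies quote only the subject line. A real system would receive replies asynchronously; the synchronous `deliver` call is a simplification.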
- the present invention may be applied to other global computer networks and is not dependent on the web platform, the HTTP protocol, and the like.
- “company” and “organization” are used to refer to a variety of entities and/or employers such as businesses, associations, societies, governmental bodies, clubs and the like. Hence association with any one of these entities is generically termed “employment” or business/non-personal relations and is intended to cover any affiliation, membership, or connection a person has with the corresponding entity. That is, the terms “employer” and “employed by” are to be given a more generic interpretation akin to non-personal/business affairs of a person. Similarly, the term “business email address” is intended to distinguish from a personal, private, at-home email address of a person, but may correspond to any of the variety of entities noted above.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Finance (AREA)
- General Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Data Mining & Analysis (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- General Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Marketing (AREA)
- Economics (AREA)
- Information Transfer Between Computers (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2001283531A AU2001283531A1 (en) | 2000-07-31 | 2001-07-31 | Data mining system |
Applications Claiming Priority (12)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US22175000P | 2000-07-31 | 2000-07-31 | |
| US60/221,750 | 2000-07-31 | ||
| US09/703,907 | 2000-11-01 | ||
| US09/704,080 | 2000-11-01 | ||
| US09/704,080 US6618717B1 (en) | 2000-07-31 | 2000-11-01 | Computer method and apparatus for determining content owner of a website |
| US09/703,907 US6778986B1 (en) | 2000-07-31 | 2000-11-01 | Computer method and apparatus for determining site type of a web site |
| US09/768,869 | 2001-01-24 | ||
| US09/768,869 US7356761B2 (en) | 2000-07-31 | 2001-01-24 | Computer method and apparatus for determining content types of web pages |
| US09/821,908 US6983282B2 (en) | 2000-07-31 | 2001-03-30 | Computer method and apparatus for collecting people and organization information from Web sites |
| US09/821,908 | 2001-03-30 | ||
| US09/918,312 | 2001-07-20 | ||
| US09/918,312 US20020032740A1 (en) | 2000-07-31 | 2001-07-30 | Data mining system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2002010968A2 (fr) | 2002-02-07 |
| WO2002010968A3 (fr) | 2003-07-31 |
Family
ID=27559144
Family Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2001/024162 Ceased WO2002010989A2 (fr) | 2000-07-31 | 2001-07-30 | Technique for keeping information about people and organizations up to date |
| PCT/US2001/041515 Ceased WO2002010968A2 (fr) | 2000-07-31 | 2001-07-31 | Data mining system |
Family Applications Before (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2001/024162 Ceased WO2002010989A2 (fr) | 2000-07-31 | 2001-07-30 | Technique for keeping information about people and organizations up to date |
Country Status (2)
| Country | Link |
|---|---|
| AU (2) | AU2001278122A1 (fr) |
| WO (2) | WO2002010989A2 (fr) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2006518512A (ja) * | 2003-02-18 | 2006-08-10 | ダン アンド ブラッドストリート インコーポレイテッド | Data integration method |
| CN110222069A (zh) * | 2013-03-15 | 2019-09-10 | 美国结构数据有限公司 | Apparatus, systems, and methods for batch and real-time data processing |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2007127841A2 (fr) | 2006-04-26 | 2007-11-08 | Arteriocyte Medical Systems, Inc. | Compositions and methods of preparing the same |
| GB2470563A (en) * | 2009-05-26 | 2010-12-01 | John Robinson | Populating a database |
| US9977717B2 (en) | 2016-03-30 | 2018-05-22 | Wipro Limited | System and method for coalescing and representing knowledge as structured data |
| WO2021229352A1 (fr) * | 2020-05-11 | 2021-11-18 | Samel Siddharth Mohan | System and method for detecting affiliated partners of an entity |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5319777A (en) * | 1990-10-16 | 1994-06-07 | Sinper Corporation | System and method for storing and retrieving information from a multidimensional array |
| US5764906A (en) * | 1995-11-07 | 1998-06-09 | Netword Llc | Universal electronic resource denotation, request and delivery system |
| US5813006A (en) * | 1996-05-06 | 1998-09-22 | Banyan Systems, Inc. | On-line directory service with registration system |
| AU740007B2 (en) * | 1997-02-21 | 2001-10-25 | Dudley John Mills | Network-based classified information systems |
| JPH10320315A (ja) * | 1997-05-14 | 1998-12-04 | Nippon Telegr & Teleph Corp <Ntt> | E-mail transmission management device and recording medium storing a program for carrying out e-mail transmission management processing |
2001
- 2001-07-30 WO PCT/US2001/024162 patent/WO2002010989A2/fr not_active Ceased
- 2001-07-30 AU AU2001278122A patent/AU2001278122A1/en not_active Abandoned
- 2001-07-31 AU AU2001283531A patent/AU2001283531A1/en not_active Abandoned
- 2001-07-31 WO PCT/US2001/041515 patent/WO2002010968A2/fr not_active Ceased
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2006518512A (ja) * | 2003-02-18 | 2006-08-10 | ダン アンド ブラッドストリート インコーポレイテッド | Data integration method |
| EP1599778A4 (fr) * | 2003-02-18 | 2006-11-15 | Dun & Bradstreet Inc | Data integration method |
| US7822757B2 (en) | 2003-02-18 | 2010-10-26 | Dun & Bradstreet, Inc. | System and method for providing enhanced information |
| US8346790B2 (en) | 2003-02-18 | 2013-01-01 | The Dun & Bradstreet Corporation | Data integration method and system |
| CN110222069A (zh) * | 2013-03-15 | 2019-09-10 | 美国结构数据有限公司 | Apparatus, systems, and methods for batch and real-time data processing |
| US11762818B2 (en) | 2013-03-15 | 2023-09-19 | Foursquare Labs, Inc. | Apparatus, systems, and methods for analyzing movements of target entities |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2002010989A3 (fr) | 2003-07-10 |
| AU2001278122A1 (en) | 2002-02-13 |
| AU2001283531A1 (en) | 2002-02-13 |
| WO2002010989A2 (fr) | 2002-02-07 |
| WO2002010968A3 (fr) | 2003-07-31 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20020032740A1 (en) | | Data mining system |
| US6681223B1 (en) | | System and method of performing profile matching with a structured document |
| Fu et al. | | A focused crawler for dark web forums |
| Jantz | | Knowledge management in academic libraries: special tools and processes to support information professionals |
| US7305381B1 (en) | | Asynchronous unconscious retrieval in a network of information appliances |
| US8577867B2 (en) | | Method and system for expanding a website |
| Kruschwitz et al. | | Searching the Enterpris |
| WO2002033594A2 (fr) | | Issues associated with an information storage and retrieval architecture using Internet elements |
| US20120246139A1 (en) | | System and method for resume, yearbook and report generation based on webcrawling and specialized data collection |
| WO2000048057A2 (fr) | | Bookmark search engine |
| Kern et al. | | Understanding the information needs of social scientists in Germany |
| WO2002010968A2 (fr) | | Data mining system |
| Kang et al. | | Making sense of archived e-mail: Exploring the Enron collection with NetLens |
| Ni et al. | | Journal clustering through interlocking editorship information |
| Theimer | | Interactivity, flexibility and transparency |
| Lamont | | Knowledge management at your service: New solutions and sources for librarians |
| Schwartz | | Shared semantics and the use of organizational memories for e-mail communications |
| Peters | | Folksonomies, social tagging and information retrieval |
| Arora | | Information and Communication Technology (ICT) for Academic Libraries |
| Hendry | | Workspaces for search |
| EP2108156A1 (fr) | | Apparatus and method for managing and sharing knowledge within the enterprise |
| Sodhi | | An OR/MS guide to the internet |
| Kramer | | Agent based personalized information retrieval |
| Campbell et al. | | “PUSH-BASED” STRATEGIES FOR IMPROVING THE EFFICIENCY OF INFORMATION MANAGEMENT IN DESIGN |
| Elbegbayan | | Shared reflection-case study on a conference support website |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A2. Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A2. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | REG | Reference to national code | Ref country code: DE. Ref legal event code: 8642 |
| | 122 | Ep: pct application non-entry in european phase | |
| | NENP | Non-entry into the national phase | Ref country code: JP |