EP2776945A4 - EXTRACTION OF THE MAIN CONTENT OF WEB PAGES - Google Patents

Info

Publication number
EP2776945A4
Authority
EP
European Patent Office
Prior art keywords
extraction
web pages
main content
content
main
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP12847034.1A
Other languages
German (de)
French (fr)
Other versions
EP2776945A1 (en)
Inventor
Jakob Bignert
Gabriel Alexandru Coarna
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Evernote Corp
Original Assignee
Evernote Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2011-11-10
Filing date
2012-11-07
Publication date
2015-05-27
Application filed by Evernote Corp
Publication of EP2776945A1
Publication of EP2776945A4
Current legal status: Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)
EP12847034.1A 2011-11-10 2012-11-07 EXTRACTION OF THE MAIN CONTENT OF WEB PAGES Ceased EP2776945A4 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201161558153P 2011-11-10 2011-11-10
US13/563,060 US9152730B2 (en) 2011-11-10 2012-07-31 Extracting principal content from web pages
PCT/US2012/063777 WO2013070645A1 (en) 2011-11-10 2012-11-07 Extracting principal content from web pages

Publications (2)

Publication Number Publication Date
EP2776945A1 (en) 2014-09-17
EP2776945A4 (en) 2015-05-27

Family

ID=48281623

Family Applications (1)

Application Number Legal Status Publication Number Priority Date Filing Date Title
EP12847034.1A Ceased EP2776945A4 (en) 2011-11-10 2012-11-07 EXTRACTION OF THE MAIN CONTENT OF WEB PAGES

Country Status (5)

Country Link
US (1) US9152730B2 (en)
EP (1) EP2776945A4 (en)
JP (1) JP2015502603A (en)
CA (1) CA2853199A1 (en)
WO (1) WO2013070645A1 (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130339839A1 (en) * 2012-06-14 2013-12-19 Emre Yavuz Baran Analyzing User Interaction
US20140380142A1 (en) * 2013-06-20 2014-12-25 Microsoft Corporation Capturing website content through capture services
WO2015018244A1 (en) 2013-08-07 2015-02-12 Microsoft Corporation Augmenting and presenting captured data
US20150067476A1 (en) * 2013-08-29 2015-03-05 Microsoft Corporation Title and body extraction from web page
US9117280B2 (en) 2013-08-29 2015-08-25 Microsoft Technology Licensing, Llc Determining images of article for extraction
US9876848B1 (en) * 2014-02-21 2018-01-23 Twitter, Inc. Television key phrase detection
KR102063566B1 (en) * 2014-02-23 2020-01-09 삼성전자주식회사 Operating Method For Text Message and Electronic Device supporting the same
US9665617B1 (en) * 2014-04-16 2017-05-30 Google Inc. Methods and systems for generating a stable identifier for nodes likely including primary content within an information resource
US9910644B2 (en) * 2015-03-03 2018-03-06 Microsoft Technology Licensing, Llc Integrated note-taking functionality for computing system entities
US10607152B2 (en) * 2015-05-26 2020-03-31 Textio, Inc. Using machine learning to predict outcomes for documents
EP4044022A1 (en) * 2015-07-30 2022-08-17 Wix.com Ltd. System integrating a mobile device application creation, editing and distribution system with a website design system
US11500535B2 (en) * 2015-10-29 2022-11-15 Lenovo (Singapore) Pte. Ltd. Two stroke quick input selection
US10324699B2 (en) * 2015-12-15 2019-06-18 International Business Machines Corporation Enhanceable cross-domain rules engine for unmatched registry entries filtering
US10289642B2 (en) * 2016-06-06 2019-05-14 Baidu Usa Llc Method and system for matching images with content using whitelists and blacklists in response to a search query
US9817806B1 (en) * 2016-06-28 2017-11-14 International Business Machines Corporation Entity-based content change management within a document content management system
WO2018039774A1 (en) 2016-09-02 2018-03-08 FutureVault Inc. Systems and methods for sharing documents
CA3035097C (en) 2016-09-02 2024-05-21 FutureVault Inc. Automated document filing and processing methods and systems
WO2018039772A1 (en) 2016-09-02 2018-03-08 FutureVault Inc. Real-time document filtering systems and methods
CN108241612B (en) * 2016-12-27 2021-11-05 北京国双科技有限公司 Method and device for processing punctuation marks
JP7009840B2 (en) 2017-08-30 2022-01-26 富士通株式会社 Information processing equipment, information processing method and dialogue control system
US11030223B2 (en) * 2017-10-09 2021-06-08 Box, Inc. Collaboration activity summaries
US11928083B2 (en) 2017-10-09 2024-03-12 Box, Inc. Determining collaboration recommendations from file path information
KR102462516B1 (en) 2018-01-09 2022-11-03 삼성전자주식회사 Display apparatus and Method for providing a content thereof
US10824306B2 (en) * 2018-10-16 2020-11-03 Lenovo (Singapore) Pte. Ltd. Presenting captured data
CN111460272B (en) * 2019-01-22 2024-02-13 北京国双科技有限公司 Text page ordering method and related equipment
US11042555B1 (en) * 2019-06-28 2021-06-22 Bottomline Technologies, Inc. Two step algorithm for non-exact matching of large datasets
US11960834B2 (en) * 2019-09-30 2024-04-16 Brave Software, Inc. Reader mode-optimized attention application
US10956731B1 (en) 2019-10-09 2021-03-23 Adobe Inc. Heading identification and classification for a digital document
US10949604B1 (en) * 2019-10-25 2021-03-16 Adobe Inc. Identifying artifacts in digital documents
CN117707505A (en) * 2022-09-08 2024-03-15 北京有竹居网络技术有限公司 Webpage generation method and device, electronic equipment and storage medium
US12223255B2 (en) * 2022-09-12 2025-02-11 Google Llc Reading assistant in a browser environment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110066662A1 (en) * 2009-09-14 2011-03-17 Adtuitive, Inc. System and Method for Content Extraction from Unstructured Sources

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6249483A (en) * 1985-08-28 1987-03-04 Hitachi Ltd Character input method for real-time handwritten character recognition
US7536561B2 (en) * 1999-10-15 2009-05-19 Ebrary, Inc. Method and apparatus for improved information transactions
US7137067B2 (en) 2000-03-17 2006-11-14 Fujitsu Limited Device and method for presenting news information
JP3703080B2 (en) 2000-07-27 2005-10-05 インターナショナル・ビジネス・マシーンズ・コーポレーション Method, system and medium for simplifying web content
US6778986B1 (en) * 2000-07-31 2004-08-17 Eliyon Technologies Corporation Computer method and apparatus for determining site type of a web site
US7467206B2 (en) 2002-12-23 2008-12-16 Microsoft Corporation Reputation system for web services
US7653621B2 (en) * 2003-07-30 2010-01-26 Oracle International Corporation Method of determining the similarity of two strings
US7392474B2 (en) 2004-04-30 2008-06-24 Microsoft Corporation Method and system for classifying display pages using summaries
US20130212463A1 (en) 2004-09-07 2013-08-15 Evernote Corporation Smart document processing with associated online data and action streams
US7774192B2 (en) 2005-01-03 2010-08-10 Industrial Technology Research Institute Method for extracting translations from translated texts using punctuation-based sub-sentential alignment
US8468445B2 (en) * 2005-03-30 2013-06-18 The Trustees Of Columbia University In The City Of New York Systems and methods for content extraction
US9141718B2 (en) 2005-06-03 2015-09-22 Apple Inc. Clipview applications
JP4238849B2 (en) 2005-06-30 2009-03-18 カシオ計算機株式会社 Web page browsing apparatus, Web page browsing method, and Web page browsing processing program
US7548929B2 (en) * 2005-07-29 2009-06-16 Yahoo! Inc. System and method for determining semantically related terms
US8126898B2 (en) 2006-11-06 2012-02-28 Salesforce.Com, Inc. Method and system for generating scored recommendations based on scored references
US8181107B2 (en) 2006-12-08 2012-05-15 Bytemobile, Inc. Content adaptation
TW200836075A (en) 2007-02-16 2008-09-01 Esobi Inc Method of converting hypertext markup language web page into pure text and system thereof
US8806325B2 (en) 2009-11-18 2014-08-12 Apple Inc. Mode identification for selective document content presentation
US8819028B2 (en) * 2009-12-14 2014-08-26 Hewlett-Packard Development Company, L.P. System and method for web content extraction
US8281232B2 (en) * 2010-04-22 2012-10-02 Rockmelt, Inc. Integrated adaptive URL-shortening functionality
US20130155463A1 (en) * 2010-07-30 2013-06-20 Jian-Ming Jin Method for selecting user desirable content from web pages
US10089404B2 (en) 2010-09-08 2018-10-02 Evernote Corporation Site memory processing
CA2808943A1 (en) 2010-09-08 2012-03-15 Evernote Corporation Site memory processing and clipping control

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
See also references of WO2013070645A1 *
TOPHER KESSLER: "How to use Safari's new 'Reader'", 9 June 2010 (2010-06-09), pages 1 - 4, XP055477560, Retrieved from the Internet <URL:https://www.cnet.com/news/how-to-use-safaris-new-reader/> [retrieved on 20180523] *

Also Published As

Publication number Publication date
JP2015502603A (en) 2015-01-22
US20130124513A1 (en) 2013-05-16
EP2776945A1 (en) 2014-09-17
WO2013070645A1 (en) 2013-05-16
US9152730B2 (en) 2015-10-06
CA2853199A1 (en) 2013-05-16

Similar Documents

Publication Publication Date Title
EP2776945A4 (en) EXTRACTION OF THE MAIN CONTENT OF WEB PAGES
EP2724557A4 (en) PROVISION OF RELEVANT CONTENT
EP2932401A4 (en) CONTENT DISTRIBUTION CADRICAL
EP2862048A4 (en) SELECTION AND ROUTING OF ADDITIONAL CONTENT
EP2852952A4 (en) AUDIO CONTENT AUDIO
EP2761573A4 (en) TECHNIQUES TO MANAGE AND VIEW CONTENT FOLLOW-UP
EP2561455A4 (en) Selectively adding social dimension to web searches
EP2817970A4 (en) AUTOMATIC RECOMMENDATION CONTENT
EP2726969A4 (en) DISPLAY OF CONTENT
EP2734909A4 (en) Rich web page generation
ES1078354Y (en) BANK OF FOLDING STRUCTURE ROLLERS
LT2775868T (en) SMOKING PRODUCT WITH VISIBLE CONTENT
HUE048841T2 (en) Cloud-based web content filtering
EP2915038A4 (en) PROVIDING VIRTUALIZED CONTENT
AT509318B8 (en) separation
BRDI7105192S (en) CONFIGURATION APPLIES TO BAG
EP2748728A4 (en) SORT OF FREQUENCY CONTENT
FR2962044B1 (en) LACRYMIMETIC EMULSION
EP2727324A4 (en) CONTEXT EXTRACTION
GB201414340D0 (en) Web application content mapping
DK2898148T3 (en) DRAINAGE STRUCTURE
EP2807800A4 (en) AUTHORIZATIONS FOR EXPLOITABLE CONTENT
FI20100033A0 (en) extraction
GB201222514D0 (en) Web page variation
PL2523567T3 (en) WATER RECOVERY

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20140602

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

RIN1 Information on inventor provided before grant (corrected)

Inventor name: BIGNERT, JAKOB

Inventor name: COARNA, GABRIEL, ALEXANDRU

DAX Request for extension of the european patent (deleted)
RA4 Supplementary search report drawn up and despatched (corrected)

Effective date: 20150424

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 17/30 20060101AFI20150420BHEP

17Q First examination report despatched

Effective date: 20170919

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20190222