BRPI0604212A - Automatic detection of character encoding - Google Patents

Automatic detection of character encoding

Info

Publication number
BRPI0604212A
Authority
BR
Brazil
Prior art keywords
legally
candidates
text strings
character encoding
automatic character
Prior art date
Application number
BRPI0604212-0A
Other languages
English (en)
Inventor
Gilbert B Porter Iii
Mirelsa Fontanes-Perez
William Tay
Original Assignee
Xerox Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xerox Corp
Publication of BRPI0604212A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Document Processing Apparatus (AREA)

Abstract

AUTOMATIC DETECTION OF CHARACTER ENCODING. The present invention relates to a method for detecting the encoding used in an electronic document, which includes testing the text strings to determine whether the electronic document contains only text strings having legal numeric codes. A statistical analysis of the text strings is then performed to provide a mapping of the legally encoded candidates. The legally encoded candidates are ranked and matched against an expected ranking of the legally encoded candidates to provide a most probable character mapping.
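The two stages named in the abstract (a legality test, then a ranking comparison) can be illustrated with a minimal Python sketch. This is not the patented implementation: the function names, the rank-displacement score, the choice of candidate encodings, and the expected ranking used in the usage lines are all assumptions introduced here for illustration. The abstract does not say which statistics are ranked; character frequency is assumed.

```python
from collections import Counter


def legal_candidates(data: bytes, candidates: list[str]) -> dict[str, str]:
    """Return {encoding: decoded text} for every candidate encoding
    under which the data contains only legal codes."""
    decoded = {}
    for name in candidates:
        try:
            # Strict decoding raises on any illegal byte sequence.
            decoded[name] = data.decode(name)
        except UnicodeDecodeError:
            pass
    return decoded


def observed_ranking(text: str, top: int) -> list[str]:
    """Rank the non-whitespace characters by descending frequency."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return [ch for ch, _ in counts.most_common(top)]


def ranking_distance(observed: list[str], expected: list[str]) -> int:
    """Assumed score (lower is better): summed rank displacement, with a
    fixed penalty for characters absent from the expected ranking."""
    missing_penalty = len(expected)
    total = 0
    for position, ch in enumerate(observed):
        if ch in expected:
            total += abs(position - expected.index(ch))
        else:
            total += missing_penalty
    return total


def most_probable_encoding(data, candidates, expected, top=3):
    """Pick the legal candidate whose observed ranking best matches the
    expected ranking; return None when no candidate is legal."""
    decoded = legal_candidates(data, candidates)
    if not decoded:
        return None
    return min(
        decoded,
        key=lambda name: ranking_distance(
            observed_ranking(decoded[name], top), expected
        ),
    )


# Usage: both encodings decode these bytes legally, so the ranking decides.
sample = "élève été élégie".encode("utf-8")
print(most_probable_encoding(sample, ["utf-8", "cp1252"], ["é", "e", "l"]))
# prints "utf-8"
```

In this toy case the legality test alone cannot separate the candidates, because every byte of the sample is also a defined cp1252 code. The ranking step is what rejects cp1252, whose most frequent characters ("Ã", "©") do not appear in the assumed expected ranking.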
BRPI0604212-0A 2005-08-05 2006-08-07 Automatic detection of character encoding BRPI0604212A (pt)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/198,428 US7148824B1 (en) 2005-08-05 2005-08-05 Automatic detection of character encoding format using statistical analysis of the text strings

Publications (1)

Publication Number Publication Date
BRPI0604212A (pt) 2007-07-17

Family

ID=37497287

Family Applications (1)

Application Number Title Priority Date Filing Date
BRPI0604212-0A BRPI0604212A (pt) 2005-08-05 2006-08-07 Automatic detection of character encoding

Country Status (3)

Country Link
US (1) US7148824B1 (pt)
JP (1) JP2007048284A (pt)
BR (1) BRPI0604212A (pt)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7499943B2 (en) * 2006-01-09 2009-03-03 International Business Machines Corporation Mapping for mapping source and target objects
NZ549548A (en) * 2006-08-31 2009-04-30 Arc Innovations Ltd Managing supply of a utility to a customer premises
WO2009002593A2 (en) * 2007-04-20 2008-12-31 Stephen Murphy Apparatuses, methods and systems for a multi-modal data interfacing platform
US8156432B2 (en) * 2007-11-14 2012-04-10 Zih Corp. Detection of UTF-16 encoding in streaming XML data without a byte-order mark and related printers, systems, methods, and computer program products
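JP2010176237A (ja) * 2009-01-28 2010-08-12 Nec Corp Automatic character code determination system, automatic character code determination method, and automatic character code determination program
CN102567293B (zh) * 2010-12-13 2015-05-20 汉王科技股份有限公司 Method and device for detecting the encoding format of a text file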
GB2489512A (en) 2011-03-31 2012-10-03 Clearswift Ltd Classifying data using fingerprint of character encoding
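CN104156373B (zh) * 2013-05-15 2017-06-06 宏碁股份有限公司 Encoding format detection method and device
CN104516862B (zh) * 2013-09-29 2018-05-01 北大方正集团有限公司 Method and system for selecting the encoding format used to read a target document
CN104361021B (zh) * 2014-10-21 2018-07-24 小米科技有限责任公司 Web page encoding identification method and device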
US9665546B1 (en) * 2015-12-17 2017-05-30 International Business Machines Corporation Real-time web service reconfiguration and content correction by detecting in invalid bytes in a character string and inserting a missing byte in a double byte character
DE102018108693A1 (de) * 2017-04-13 2018-10-18 Hirschmann Car Communication Gmbh Character set recognition
US10949617B1 (en) * 2018-09-27 2021-03-16 Amazon Technologies, Inc. System for differentiating encoding of text fields between networked services
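CN110196968B (zh) * 2019-06-06 2023-04-07 北京林业大学 System and method for automatically identifying Simplified Chinese encodings based on searching for specific character strings
CN113569534A (zh) * 2020-04-29 2021-10-29 杭州海康威视数字技术股份有限公司 Method and device for detecting garbled characters in a document
CN117424765B (zh) * 2023-12-19 2024-03-22 天津医康互联科技有限公司 Distributed one-hot encoding method, apparatus, electronic device, and computer storage medium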

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4843389A (en) * 1986-12-04 1989-06-27 International Business Machines Corp. Text compression and expansion method and apparatus
US5282194A (en) * 1992-08-17 1994-01-25 Loral Aerospace Corporation Interactive protocol analysis system
US5414650A (en) * 1993-03-24 1995-05-09 Compression Research Group, Inc. Parsing information onto packets using context-insensitive parsing rules based on packet characteristics
US5649214A (en) * 1994-09-20 1997-07-15 Unisys Corporation Method and apparatus for continued use of data encoded under a first coded character set while data is gradually transliterated to a second coded character set
US5684478A (en) * 1994-12-06 1997-11-04 Cennoid Technologies, Inc. Method and apparatus for adaptive data compression
US5778361A (en) * 1995-09-29 1998-07-07 Microsoft Corporation Method and system for fast indexing and searching of text in compound-word languages
JP3499671B2 (ja) * 1996-02-09 2004-02-23 富士通株式会社 Data compression device and data decompression device
US7058726B1 (en) * 1996-07-08 2006-06-06 Internet Number Corporation Method and systems for accessing information on a network using message aliasing functions having shadow callback functions
US6035268A (en) * 1996-08-22 2000-03-07 Lernout & Hauspie Speech Products N.V. Method and apparatus for breaking words in a stream of text
TW421750B (en) * 1997-03-14 2001-02-11 Omron Tateisi Electronics Co Language identification device, language identification method and storage media recorded with program of language identification
US6049869A (en) * 1997-10-03 2000-04-11 Microsoft Corporation Method and system for detecting and identifying a text or data encoding system
US6525831B1 (en) * 1998-12-02 2003-02-25 Xerox Corporation Non-format violating PDL guessing technique to determine the page description language in which a print job is written
US6314469B1 (en) * 1999-02-26 2001-11-06 I-Dns.Net International Pte Ltd Multi-language domain name service
MXPA01010103A (es) * 1999-04-05 2002-11-04 Neomedia Tech Inc System and method for using machine-readable or human-readable linking codes to access networked data resources
US6400287B1 (en) * 2000-07-10 2002-06-04 International Business Machines Corporation Data structure for creating, scoping, and converting to unicode data from single byte character sets, double byte character sets, or mixed character sets comprising both single byte and double byte character sets
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
US6829386B2 (en) * 2001-02-28 2004-12-07 Sun Microsystems, Inc. Methods and apparatus for associating character codes with optimized character codes
US7010779B2 (en) * 2001-08-16 2006-03-07 Knowledge Dynamics, Inc. Parser, code generator, and data calculation and transformation engine for spreadsheet calculations
US6650261B2 (en) * 2001-09-06 2003-11-18 Xerox Corporation Sliding window compression method utilizing defined match locations
US6701320B1 (en) * 2002-04-24 2004-03-02 Bmc Software, Inc. System and method for determining a character encoding scheme

Also Published As

Publication number Publication date
US7148824B1 (en) 2006-12-12
JP2007048284A (ja) 2007-02-22

Similar Documents

Publication Publication Date Title
BRPI0604212A (pt) Automatic detection of character encoding
BR112012009445A2 (pt) Audio encoder, audio decoder, method for encoding audio information, method for decoding audio information, and computer program using a detection of a group of previously decoded spectral values
Saastamoinen The phraseology and structure of Latin building inscriptions in Roman north Africa
BR112012011230A2 (pt) Risk factors and prediction of myocardial infarction
BRPI0720343A2 (pt) Method, apparatus, and computer program for computer fraud detection
MY149569A (en) Improvements in resisting the spread of unwanted code and data
BR112014007214A8 (pt) Method for determining the likelihood that an individual is at elevated risk of a cardiovascular event, method for assessing the risk of a future cardiovascular event, and computer-implemented method for assessing the risk of a cardiovascular event
BR112015022493A2 (pt) Demographic context determination system
BR112012013160A2 (pt) Beverage preparation machine with ambient emulsion functionality
BRPI0600359A (pt) Method and computer-readable medium for providing spreadsheet-driven key performance indicators
BR112013003391A2 (pt) Pancreatic cancer biomarkers and uses thereof
ATE527834T1 (de) Economical loudness measurement of coded audio
ATE522875T1 (de) Identification of text passages
BR112015009022A2 (pt) Methods for determining the abundance of an analyte in a plurality of samples
WO2015003143A3 (en) Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
Pawelka et al. Is this code written in English? A study of the natural language of comments and identifiers in practice
BR112013023409A2 (pt) Method and test kit for determining nitrate concentration
BR112015014557A2 (pt) Method, apparatus, and system for indexing content based on time information
RU2011153489A (ru) Method for automated determination of the language and/or encoding of a text document
BR112013006724A2 (pt) Apparatus, integrated circuit or chipset, positioning device, method, computer program, and signal
BRPI0705108A2 (pt) System and method for evaluating capacitive bushings
Olsson et al. Perception of glare in relation to the CIE scale on Unified Glare Rating (UGR) and the impact of ambient light on both UGR and Subjective Glare Index Scales (SGI)
王薇 Analysis of Addition in English Translation of Su Shi's Lyric Poems
Sabry The phenomenon of substitution in the Akkadian and Arabic languages-a comparative study
Azenabor Developing electronic government models for Nigeria: an analysis

Legal Events

Date Code Title Description
B08F Application dismissed because of non-payment of annual fees [chapter 8.6 patent gazette]

Free format text: REGARDING THE 10TH ANNUITY.

B08K Patent lapsed as no evidence of payment of the annual fee has been furnished to INPI [chapter 8.11 patent gazette]

Free format text: IN VIEW OF THE DISMISSAL PUBLISHED IN RPI 2385 OF 20-09-2016 AND CONSIDERING THE ABSENCE OF A RESPONSE WITHIN THE LEGAL DEADLINES, I REPORT THAT THE DISMISSAL OF THE PATENT APPLICATION IS TO BE MAINTAINED, PURSUANT TO ARTICLE 12 OF RESOLUTION 113/2013.