WO2020258303A1 - Semantic model instantiation method, system and device - Google Patents

Semantic model instantiation method, system and device

Info

Publication number
WO2020258303A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
semantic
semantic model
keyword
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/093873
Other languages
English (en)
French (fr)
Inventor
李婧
张瑞国
司伟平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens Ltd China
Siemens AG
Siemens Corp
Original Assignee
Siemens Ltd China
Siemens AG
Siemens Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Ltd China, Siemens AG, Siemens Corp
Priority to CN201980008609.4A (patent CN112449700B)
Priority to US16/970,692 (patent US20220129635A1)
Priority to PCT/CN2019/093873 (patent WO2020258303A1)
Priority to EP19915576.3A (patent EP3783522A4)
Publication of WO2020258303A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • The present invention relates to the field of industrial software, and in particular to a method, system and device for instantiating a semantic model.
  • The prior art provides two solutions.
  • One solution is table analysis and retrieval, which targets the correlation between user questions and table content.
  • When a user asks a question, the table analysis and retrieval algorithm searches the table data to determine one or more tables that could potentially answer the question.
  • Retrieval methods include the string similarity algorithm BM25 and cell data similarity calculation.
  • The system may include devices for semantic parsing, table format analysis, table similarity comparison and table retrieval. However, this solution focuses only on how to match user queries with table content.
  • The other solution is ontology matching, which aims to find correlations between the entities of two ontologies, including types, parameters and instances.
  • Ontology matching includes two basic steps: similarity calculation and alignment extraction. These steps compare the two ontologies from both linguistic and structural perspectives, with the aim of transferring data from one ontology model to the other.
  • However, this solution does not take tables as input.
  • Some similar methods also attempt to extract web table information based on ontology information, but they rely mainly on heuristic rules, which are difficult to extend to arbitrary tables with diverse layouts.
  • The first aspect of the present invention provides a semantic model instantiation method, which includes the following steps: S1, receiving an ontology-based semantic model, parsing the semantic model and converting it into a feature vector set, wherein the feature vector represents the types and attributes of the ontology and the relationships between the attributes; S3, importing a semi-structured file and converting it into a keyword vector based on the semantic vector of the semantic model; S4, comparing the correlation between the semantic vector and the keyword vector, and identifying the keyword vector corresponding to the semantic vector.
  • Further, between steps S1 and S3 the method includes the following step: S2, matching synonyms of the words of the semantic vector based on the semantic vector of the semantic model, wherein step S3 further includes the following step: converting the semi-structured file into a keyword vector based on the semantic vector of the semantic model and its synonyms.
  • Further, after step S4 the method includes the following step: extracting the semi-structured file instance data of the keyword vector corresponding to the semantic vector to the database.
  • Further, the ontology includes types, attributes, and relationships between the attributes.
  • Further, when the semi-structured file is a table file, step S3 further includes the following steps: determining the header position of the table file, and identifying the data part of the table file.
  • Further, step S4 includes the following step: executing multiple correlation calculation methods based on the semantic vector, the synonym lexicon and the keyword vector to obtain multiple correlation values for comparing the correlation between the semantic vector and the keyword vector, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vector corresponding to the semantic vector, wherein the parameter mapping represents the matched keyword vector and semantic vector.
  • Further, the correlation matrix is constructed by the following formula:
  • $M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
  • where $M_{ij}$ is the correlation, $O$ is the semantic vector, $K$ is the keyword vector, $w_q$ is the weight, $\mathrm{Sim}_q$ is the correlation algorithm, and $i$, $j$ and $q$ are natural numbers.
  • The second aspect of the present invention provides a semantic model instantiation system, including: a processor; and a memory coupled with the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions, the actions including: S1, receiving an ontology-based semantic model, parsing the semantic model and converting it into a feature vector set, wherein the feature vector represents the types and attributes of the ontology and the relationships between the attributes; S3, importing a semi-structured file and converting it into a keyword vector based on the semantic vector of the semantic model; S4, comparing the correlation between the semantic vector and the keyword vector, and identifying the keyword vector corresponding to the semantic vector.
  • Further, between actions S1 and S3 the actions include: S2, matching synonyms of the words of the semantic vector based on the semantic vector of the semantic model, wherein action S3 further includes: converting the semi-structured file into a keyword vector based on the semantic vector of the semantic model and its synonyms.
  • Further, after action S4 the actions include: extracting the semi-structured file instance data of the keyword vector corresponding to the semantic vector to the database.
  • Further, the ontology includes types, attributes, and relationships between the attributes.
  • Further, when the semi-structured file is a table file, action S3 further includes: determining the header position of the table file, and identifying the data part of the table file.
  • Further, action S4 includes: executing multiple correlation calculation methods based on the semantic vector, the synonym lexicon and the keyword vector to obtain multiple correlation values for comparing the correlation between the semantic vector and the keyword vector, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vector corresponding to the semantic vector, wherein the parameter mapping represents the matched keyword vector and semantic vector.
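  • Further, the correlation matrix is constructed by the following formula:
  • $M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
  • where $M_{ij}$ is the correlation, $O$ is the semantic vector, $K$ is the keyword vector, $w_q$ is the weight, $\mathrm{Sim}_q$ is the correlation algorithm, and $i$, $j$ and $q$ are natural numbers.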
  • The third aspect of the present invention provides a semantic model instantiation device, which includes: a first conversion device, which receives an ontology-based semantic model, parses the semantic model and converts it into a feature vector set, wherein the feature vector represents the types and attributes of the ontology and the relationships between the attributes; a second conversion device, which imports a semi-structured file and converts it into a keyword vector based on the semantic vector of the semantic model; and a comparison and recognition device, which compares the correlation between the semantic vector and the keyword vector and identifies the keyword vector corresponding to the semantic vector.
  • The fourth aspect of the present invention provides a computer program product, which is tangibly stored on a computer-readable medium and includes computer-executable instructions that, when executed, cause at least one processor to perform the method according to the first aspect of the present invention.
  • The fifth aspect of the present invention provides a computer-readable medium on which computer-executable instructions are stored, and when executed, the computer-executable instructions cause at least one processor to perform the method according to the first aspect of the present invention.
  • One innovation of the present invention is to transform the semantic model into semantic vectors, including type vectors and related vectors, and to calculate synonyms and construct a synonym dictionary for each semantic vector.
  • The separated semantic vectors serve as a guide for information extraction. This enables any semantic model to be parsed into multiple data retrieval formulas, which helps to automate the matching and data retrieval process described by the semantic model.
  • Another innovation of the present invention is to organize the useful header data of any semi-structured file and convert it into a keyword vector, which includes identifying the keyword parameter part and the data part of the table file and extracting the keyword parameters into a tree structure. This enables tables to be converted into vectors, from which data can be extracted for further comparison and calculation.
  • A further innovation of the present invention is to extract the correlation mapping between arbitrary semantic vectors and keyword vectors in order to extract relevant information from semi-structured files. This involves calculating the difference between the semantic vector and the keyword vector and matching the parameter mapping. It enables a fast and automatic way of evaluating and matching data based on a model.
  • The invention can greatly reduce the workload and cost of constructing a knowledge graph, and accelerates convenient knowledge-based services.
  • FIG. 1 is a schematic structural diagram of a semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of the ontology of the semantic model of the semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 3 is a structural diagram of the second conversion device 120 of the semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 4 is a schematic diagram of table file processing of a semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 5 is a flowchart of the steps for defining the four key parts (ULC, RH, CH, and Data) of a table file in the semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a keyword matrix of a semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 7 is a schematic diagram of correlation calculation of a semantic model instantiation device according to a specific embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a correlation matrix of a semantic model instantiation device according to a specific embodiment of the present invention.
  • The present invention provides a semantic model instantiation mechanism, which can extract data instances based on the summary model, using the corresponding semi-structured data and semantic models.
  • The present invention automatically selects and processes semi-structured files in the field, and quickly determines and extracts useful data instances into a knowledge database based on semantic definitions with reasonable accuracy, so that data can be extracted automatically from semi-structured files based on any semantic model.
  • The semantic model instantiation method provided by the present invention is executed by a semantic model instantiation device 100, which includes a first conversion device 110, a second conversion device 120, a comparison and recognition device 130, a matching device 140, an extraction device 150, and a database 160.
  • The first conversion device 110 parses the semantic model A and converts it into a feature vector set.
  • The matching device 140 matches synonyms of the words of the semantic vector of the semantic model A.
  • The second conversion device 120 takes the semantic vector and the synonyms of its words as input, imports a semi-structured file B, and converts the semi-structured file B into a keyword vector based on the semantic vector of the semantic model A.
  • The comparison and recognition device 130 compares the correlation between the semantic vector and the keyword vector, and identifies the keyword vector corresponding to the semantic vector.
  • The extraction device 150 extracts the semi-structured file instance data of the keyword vector corresponding to the semantic vector to the database 160.
  • The first aspect of the present invention provides a semantic model instantiation method, which includes the following steps:
  • First, step S1 is executed.
  • The first conversion device 110 receives an ontology-based semantic model A, parses the semantic model A, and converts it into a feature vector set, where the feature vector represents the types and attributes of the ontology and the relationships between the attributes. That is, the first conversion device 110 decomposes the semantic model A into individual classes and subclasses, and uses feature vectors to describe them.
  • The ontology includes types, attributes, and relationships between the attributes.
  • A type also includes the subcategories of that type.
  • The present invention can establish an ontology library in advance and continuously update it during execution.
  • The types in the ontology library include equipment, products, labor, materials, processes, and maintenance. These types are interrelated.
  • The ontology includes a broad product model, which includes multiple subcategories: maintenance, equipment, workshop, process, product, and labor.
  • Each subcategory corresponds to multiple attributes.
  • The attributes of labor include name, phone number, level, gender, and serial number.
  • The attributes of maintenance include serial number, labor, month, week, planned time, actual time, working hours, and level.
  • The attributes of equipment include parameters, name, service start time, type and power.
  • The attributes of workshop include name.
  • The attributes of process include actual start time, actual end time, blockade, buffer size, planned end time, number, planned start time and name.
  • The attributes of product include order number, picture confirmation, actual transportation time, contract, transportation method, customer, planned transportation time, payment, price, structure and production capacity, etc.
  • The output of the first conversion device 110 is a set of feature vectors and the relationships between them, where the feature vectors include semantic vectors and, in particular, ontology type vectors.
  • Each vector includes the type name, the vector name and the relationship between them. Exemplarily, the format of one of the semantic vectors is:
  • For example, the semantic vectors are "worker operates machine C", "worker produces product" and "machine has malfunction", in which "operates", "produces" and "has" are the relationships.
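
Step S1 can be illustrated with a minimal Python sketch. It assumes the parsed ontology is already available as a dictionary of types with their attributes plus a list of relationships. The `SemanticVector` class, the `has_attribute` relation label and the sample data are hypothetical names chosen for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticVector:
    """One (subject, relation, object) entry of the feature vector set."""
    subject: str   # ontology type (class) name
    relation: str  # relationship name, or "has_attribute" (hypothetical label)
    obj: str       # related type or attribute name


def ontology_to_vectors(types, relations):
    """Flatten an ontology into semantic vectors.

    types:     dict mapping a type name to its list of attribute names
    relations: list of (subject type, relation, object type) tuples
    """
    vectors = []
    for type_name, attributes in types.items():
        for attribute in attributes:
            vectors.append(SemanticVector(type_name, "has_attribute", attribute))
    for subject, relation, obj in relations:
        vectors.append(SemanticVector(subject, relation, obj))
    return vectors


# Hypothetical excerpt of the product model described above.
types = {
    "labor": ["name", "phone number", "level"],
    "equipment": ["name", "type", "power"],
}
relations = [("labor", "operates", "equipment")]
vectors = ontology_to_vectors(types, relations)
```

Each resulting vector can then serve as one retrieval target when the table headers are compared against the model.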
  • Then, step S3 is executed: the second conversion device 120 imports a semi-structured file B and converts it into a keyword vector based on the semantic vector of the semantic model A. Specifically, the second conversion device 120 extracts header data from any semi-structured file B and reorganizes the header data according to a certain logic for subsequent processing, where the semi-structured file B is a table file. As shown in FIG. 3, the second conversion device 120 includes three sub-devices: a preprocessing device 1201, a recognition device 1202, and a keyword device 1203. Step S3 includes three sub-steps S31, S32 and S33. In many industrial fields there is one predominant file type; at production sites, for example, it is the semi-structured file, such as a table in a database, a manually constructed Excel table, or a web HTML table.
  • Step S3 further includes the following steps: determining the header position of the table file, and identifying the data part of the table file.
  • The preprocessing device 1201 performs basic conversion and cleaning of the input table file.
  • For example, the preprocessing device 1201 can convert an Excel table file into an HTML table, because the HTML table includes richer and clearer header data.
  • The recognition device 1202 reads the table preprocessed by the preprocessing device 1201 to recognize the attributes of the data content in the table file. Specifically, the present invention first defines four key parts, ULC, RH, CH, and Data, for any table file, and then determines these key parts.
  • Four key parts, ULC, RH, CH and Data, are first defined for table B1 in order to identify the header and the content of table B1.
  • The header part is the RH part. RH represents the depth of the table row header, and the height of RH is h1.
  • CH represents the depth of the table column header, and its width is h2.
  • ULC lies between RH and CH. ULC represents the upper left space of the entire table; the height of ULC is h1 and the width of ULC is h2.
  • The part below RH and to the right of CH is the data part Data, where the upper left cell of the data part is C3 and the lower right cell is C4.
  • The upper left cell of ULC is C1, and the lower right cell of ULC is C2.
  • As shown in FIG. 5, the ULC part is found first, and C1, C2, h1 and h2 of the ULC part are identified.
  • When table B1 is a two-dimensional table, C3 should be identified according to the extraction rules for two-dimensional tables.
  • Otherwise, C3 should be identified according to the extraction rules for one-dimensional tables.
  • It is then judged whether RH > h1.
  • If RH > h1 is satisfied, only RH and C3 of the data part are extracted.
  • If RH > h1 is not satisfied, it is judged whether CH > h2.
  • If CH > h2 is satisfied, only CH and C3 of the data part are extracted.
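
The region layout described above can be sketched in Python under strong simplifying assumptions: the table is a rectangular list of rows of strings, the ULC block consists of empty cells in the top-left corner, and a table without an empty corner is treated as one-dimensional with a single header row. These detection rules and the function names `locate_ulc` and `split_table` are illustrative assumptions and are not the extraction rules of the patent.

```python
def locate_ulc(table):
    """Return (h1, h2): the height and width of the empty upper-left corner.

    Assumption: ULC cells are empty strings.
    """
    h2 = 0
    for cell in table[0]:          # count empty cells at the start of row 0
        if cell != "":
            break
        h2 += 1
    h1 = 0
    for row in table:              # count rows whose first cell is empty
        if row[0] != "":
            break
        h1 += 1
    return h1, h2


def split_table(table):
    """Split a table into the RH band, the CH band and the data part."""
    h1, h2 = locate_ulc(table)
    if h1 == 0 and h2 == 0:
        h1 = 1                     # assumed one-dimensional table: one header row
    rh = [row[h2:] for row in table[:h1]]    # header band above the data (height h1)
    ch = [row[:h2] for row in table[h1:]]    # header band left of the data (width h2)
    data = [row[h2:] for row in table[h1:]]  # data part; its first cell is C3
    return rh, ch, data


# Hypothetical two-dimensional table with a one-cell ULC.
table = [
    ["", "Mon", "Tue"],
    ["machine A", "3", "5"],
    ["machine B", "4", "6"],
]
rh, ch, data = split_table(table)
```

In this sketch C3 is the cell at row h1, column h2 of the original table, which is `data[0][0]` after the split.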
  • The input of the keyword device 1203 is a table with its key positions marked. The keyword device 1203 applies standard rules to extract table titles and attributes, and saves them in a tree structure.
  • The tree structure is then reorganized into weighted vectors for the subsequent analysis steps.
  • The attributes of a one-dimensional table are extracted into a tree structure and transformed into the following table keyword vector:
  • Preferably, a step S2 is further included: matching synonyms of the words of the semantic vector based on the semantic vector of the semantic model.
  • Step S3 further includes the following step: the second conversion device 120 converts the semi-structured file into a keyword vector based on the semantic vector of the semantic model and its synonyms.
  • The second conversion device 120 is used to generate a set of synonyms for each word of the semantic vector.
  • Although existing software can also help provide synonyms automatically, it is difficult for such tools to provide reasonable results for complex or compound words, especially words composed of more than one second-level word. Therefore, the present invention provides that the second conversion device 120 can be applied to complex or compound words.
  • A compound word is first divided into multiple second-level words (sub-word #1, sub-word #2, ..., sub-word #n), then the correlation of each second-level word is calculated, and finally the compound word is reconstructed according to the association principle. The second conversion device 120 therefore uses the synonym result lists to build a synonym matrix, so the keyword database is also composed of a keyword matrix.
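
The compound-word handling can be illustrated with a small Python sketch. It assumes that a compound word is split on whitespace and that a thesaurus lookup per sub-word is available. The `THESAURUS` dictionary is a hypothetical stand-in for a real synonym resource, and recombining every sub-word choice with `itertools.product` is one simple interpretation of the association principle, not the patent's exact procedure.

```python
from itertools import product

# Hypothetical thesaurus: sub-word -> list of synonyms.
THESAURUS = {
    "start": ["begin"],
    "time": ["date"],
}


def compound_synonyms(compound, thesaurus):
    """Generate synonym candidates for a compound word.

    The compound is split into sub-words, each sub-word is expanded with its
    synonyms, and all combinations are joined back together.
    """
    sub_words = compound.split()
    options = [[word] + thesaurus.get(word, []) for word in sub_words]
    candidates = [" ".join(choice) for choice in product(*options)]
    # Drop the original compound itself; keep only the new candidates.
    return [candidate for candidate in candidates if candidate != compound]


candidates = compound_synonyms("start time", THESAURUS)
# candidates == ["start date", "begin time", "begin date"]
```

One row of a keyword matrix could then hold the original word followed by these candidates.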
  • FIG. 6 shows a keyword matrix.
  • The type name (class name) has a first attribute (attribute 1), a second attribute (attribute 2), ..., and an N-th attribute (attribute N).
  • The type name (class name), the first attribute (attribute 1), the second attribute (attribute 2), ..., and the N-th attribute (attribute N) each have an initial word, and each initial word has synonyms s_1, s_2, ..., s_M.
  • The original words and their synonyms are as follows:
  • Finally, step S4 is executed: the comparison and recognition device 130 compares the correlation between the semantic vector and the keyword vector, and identifies the keyword vector corresponding to the semantic vector.
  • In this embodiment, the keyword vector is a table keyword vector. The comparison and recognition device 130 therefore calculates the correlation between the table keyword vector and the semantic vector.
  • The input of the comparison and recognition device 130 includes a keyword vector, a semantic vector, and a synonym dictionary. The present invention uses an algorithm to calculate the difference between the keyword vector and the semantic vector.
  • Step S4 further includes the following step: executing multiple correlation calculation methods based on the semantic vector, the synonym lexicon and the keyword vector to obtain multiple correlation values for comparing the correlation between the semantic vector and the keyword vector, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vector corresponding to the semantic vector, wherein the parameter mapping represents the matched keyword vector and semantic vector.
  • The correlation algorithms include a first correlation algorithm, a second correlation algorithm, and a third correlation algorithm.
  • The first correlation algorithm is the Cilin correlation algorithm.
  • The second correlation algorithm is the word2vector correlation algorithm.
  • The third correlation algorithm is the modified Jaccard correlation algorithm.
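  • The correlation matrix is constructed by the following formula:
  • $M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
  • where $M_{ij}$ is the correlation, $O$ is the semantic vector, $K$ is the keyword vector, $w_q$ is the weight, $\mathrm{Sim}_q$ is the correlation algorithm, and $i$, $j$ and $q$ are natural numbers.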
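
The weighted sum that builds the correlation matrix can be sketched in Python. The sketch assumes each semantic vector and keyword vector is represented by a plain string. The two similarity functions, a token-level Jaccard measure and an exact-match measure, are simple stand-ins chosen for illustration; the Cilin, word2vector and modified Jaccard algorithms named above are not reimplemented here, and the weights are arbitrary example values.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity (plain version, not the modified one)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def exact(a, b):
    """1.0 for a case-insensitive exact match, otherwise 0.0 (stand-in measure)."""
    return 1.0 if a.lower() == b.lower() else 0.0


def correlation_matrix(semantic, keywords, sims, weights):
    """M[i][j] = sum over q of weights[q] * sims[q](semantic[i], keywords[j])."""
    return [
        [sum(w * sim(o, k) for sim, w in zip(sims, weights)) for k in keywords]
        for o in semantic
    ]


# Hypothetical inputs: ontology attributes versus table header keywords.
semantic = ["start time", "name"]
keywords = ["name", "planned start time"]
M = correlation_matrix(semantic, keywords, [jaccard, exact], [0.5, 0.5])
```

Rows of `M` correspond to semantic vectors and columns to keyword vectors, matching the layout described for the correlation matrix.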
  • FIG. 8 shows the correlation matrix, the abscissa of which is the keyword vector K and the ordinate of which is the semantic vector O.
  • The parameter mapping is then filtered out.
  • A threshold rule is applied to determine the matched keyword pairs.
  • The output is the parameter mapping, a labeled binary vector that represents the matching result of the table parameters.
  • The parameter mapping indicates the matched keyword vector and semantic vector, and filtering out the parameter mapping executes the Similarity Couple Determination algorithm. "1" means a matched parameter, and "0" means no matched parameter.
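
The threshold rule can be illustrated with a short Python sketch. It assumes that each semantic vector (row) is matched to at most one keyword vector (column), namely its highest-scoring column, and only when that score reaches the threshold. This best-match rule, the function name `parameter_map` and the threshold values are assumptions for illustration; the patent's Similarity Couple Determination algorithm is not specified in this text.

```python
def parameter_map(matrix, threshold=0.5):
    """Turn a correlation matrix into a binary parameter mapping.

    mapping[i][j] == 1 means semantic vector i is matched to keyword vector j,
    and 0 means no match.
    """
    mapping = [[0] * len(row) for row in matrix]
    for i, row in enumerate(matrix):
        if not row:
            continue
        best = max(range(len(row)), key=row.__getitem__)  # best column index
        if row[best] >= threshold:
            mapping[i][best] = 1
    return mapping


# Hypothetical correlation matrix: 2 semantic vectors x 2 keyword vectors.
matrix = [
    [0.0, 0.33],
    [1.0, 0.0],
]
strict = parameter_map(matrix, threshold=0.5)
loose = parameter_map(matrix, threshold=0.3)
```

With the stricter threshold only the second semantic vector is matched; lowering the threshold also accepts the weaker pair in the first row.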
  • Further, the extraction device 150 extracts the semi-structured file instance data of the keyword vector corresponding to the semantic vector to the database 160.
  • The extraction device 150 extracts table data based on the output of the comparison and recognition device 130. In one embodiment, only the data matched to the semantic model is extracted. In another embodiment, data for both matched and unmatched table parameters is extracted and stored, labeled with different relevance levels. Unmatched table parameters are extracted for potential future analysis and use. Data correlation is also identified and extracted.
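
The first embodiment, in which only matched data is extracted, can be sketched with Python's standard `sqlite3` module. The sketch assumes the match result is available as a dictionary from table column header to ontology attribute name and that each ontology type is stored as one database table. The function name `extract_matched`, the sample headers and the in-memory database are illustrative assumptions only.

```python
import sqlite3


def extract_matched(header, rows, matches, conn, table):
    """Store only the matched columns of a table file in the database.

    header:  list of column headers of the table file
    rows:    list of data rows (lists of cell values)
    matches: dict mapping a matched column header to an ontology attribute
    table:   name of the database table for the ontology type
    """
    columns = [i for i, name in enumerate(header) if name in matches]
    attributes = [matches[header[i]] for i in columns]
    quoted = ", ".join(f'"{attribute}"' for attribute in attributes)
    placeholders = ", ".join("?" for _ in attributes)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})',
        [[row[i] for i in columns] for row in rows],
    )
    conn.commit()


# Hypothetical table file and match result.
header = ["Worker", "Tel", "Shift"]
rows = [["Li", "123", "A"], ["Wang", "456", "B"]]
matches = {"Worker": "name", "Tel": "phone number"}

conn = sqlite3.connect(":memory:")
extract_matched(header, rows, matches, conn, "labor")
stored = conn.execute('SELECT * FROM "labor"').fetchall()
```

The unmatched "Shift" column is skipped here; the second embodiment would instead store it with a lower relevance label.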

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

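The present invention provides a semantic model instantiation method, system and device. The method includes the following steps: S1, receiving an ontology-based semantic model, parsing the semantic model and converting it into a feature vector set, wherein the feature vector represents the types and attributes of the ontology and the relationships between the attributes; S3, importing a semi-structured file and converting it into a keyword vector based on the semantic vector of the semantic model; S4, comparing the correlation between the semantic vector and the keyword vector, and identifying the keyword vector corresponding to the semantic vector. The present invention can greatly reduce the workload and cost of constructing a knowledge graph, and accelerates convenient knowledge-based services.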

Description

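Semantic model instantiation method, system and device
Technical field
The present invention relates to the field of industrial software, and in particular to a semantic model instantiation method, system and device.
Background
Many industries, such as social networks, e-commerce and manufacturing, have begun to provide customers with knowledge-based intelligent functions and services, which require a scalable knowledge database as a foundation. A domain semantic model or schema can be built by domain experts; however, populating the knowledge database with data according to the semantic model is not simple.
For example, instantiating a semantic model by populating it with data instances or data individuals still relies mainly on manual work. Typically, when a semantic model is instantiated, the data instances are identified and extracted manually by engineers in the field. Alternatively, the data must be processed into predefined data formats and intermediate tables so that a customized program can populate the knowledge database. Both approaches involve a high degree of human effort and are therefore costly and time-consuming. In many industrial fields, raw data comes in different forms, which makes a customized data extraction process difficult to apply to other situations. Customers therefore lack tools that automatically extract data instances from domain files based on a defined domain semantic model.
The prior art provides two solutions. One is table analysis and retrieval, which targets the correlation between user questions and table content. When a user asks a question, the table analysis and retrieval algorithm searches the table data to determine one or more tables that could potentially answer the question. Retrieval methods include the string similarity algorithm BM25 and cell data similarity calculation. The system may include devices for semantic parsing, table format analysis, table-question similarity comparison and table retrieval. However, this solution focuses only on how to match user queries with table content.
The other solution is ontology matching, which aims to find correlations between the entities of two ontologies, including types, parameters and instances. Ontology matching includes two basic steps: similarity calculation and alignment extraction. These steps compare the two ontologies from both linguistic and structural perspectives, with the aim of transferring data from one ontology model to the other. However, this solution does not take tables as input. Some similar methods also attempt to extract web table information based on ontology information, but they rely mainly on heuristic rules, which are difficult to extend to arbitrary tables with diverse layouts.
In addition, existing software tools in the industrial field cannot automatically identify the relationship between an arbitrary semi-structured file (table) and a domain semantic model in order to extract the relevant data instances.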
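Summary of the Invention
A first aspect of the present invention provides a semantic model instantiation method, comprising the following steps: S1, receiving an ontology-based semantic model, parsing the semantic model and converting it into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes; S3, importing a semi-structured file and converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model; S4, comparing the correlation between the semantic vectors and the keyword vectors, and identifying the keyword vectors corresponding to the semantic vectors.
Further, the following step is included between steps S1 and S3: S2, matching synonyms of the words of the semantic vectors based on the semantic vectors of the semantic model, wherein step S3 further comprises: converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model and their synonyms.
Further, the following step is included after step S4: extracting, into a database, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors.
Further, the ontology comprises types, attributes and relationships between the attributes.
Further, when the semi-structured file is a table file, step S3 further comprises: determining the header position of the table file and identifying the data part of the table file.
Further, step S4 further comprises: executing multiple correlation calculation methods based on the semantic vectors, the synonym thesaurus and the keyword vectors to obtain multiple correlation values for comparing the correlation between the semantic vectors and the keyword vectors, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vectors corresponding to the semantic vectors, wherein the parameter mapping represents matched keyword vectors and semantic vectors.
Further, the correlation matrix is constructed by the following formula:
$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
where $M_{ij}$ is the correlation, $O$ is a semantic vector, $K$ is a keyword vector, $w_q$ is a weight, $\mathrm{Sim}_q$ is a correlation algorithm, and $i$, $j$, $q$ are natural numbers.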
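A second aspect of the present invention provides a semantic model instantiation system, comprising: a processor; and a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions, the actions comprising: S1, receiving an ontology-based semantic model, parsing the semantic model and converting it into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes; S3, importing a semi-structured file and converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model; S4, comparing the correlation between the semantic vectors and the keyword vectors, and identifying the keyword vectors corresponding to the semantic vectors. Further, between actions S1 and S3 the actions further comprise: S2, matching synonyms of the words of the semantic vectors based on the semantic vectors of the semantic model, wherein action S3 further comprises: converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model and their synonyms.
Further, after action S4 the actions further comprise: extracting, into a database, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors.
Further, the ontology comprises types, attributes and relationships between the attributes.
Further, when the semi-structured file is a table file, action S3 further comprises: determining the header position of the table file and identifying the data part of the table file.
Further, action S4 further comprises: executing multiple correlation calculation methods based on the semantic vectors, the synonym thesaurus and the keyword vectors to obtain multiple correlation values for comparing the correlation between the semantic vectors and the keyword vectors, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vectors corresponding to the semantic vectors, wherein the parameter mapping represents matched keyword vectors and semantic vectors.
Further, the correlation matrix is constructed by the following formula:
$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
where $M_{ij}$ is the correlation, $O$ is a semantic vector, $K$ is a keyword vector, $w_q$ is a weight, $\mathrm{Sim}_q$ is a correlation algorithm, and $i$, $j$, $q$ are natural numbers.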
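A third aspect of the present invention provides a semantic model instantiation apparatus, comprising: a first conversion device, which receives an ontology-based semantic model, parses the semantic model and converts it into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes; a second conversion device, which imports a semi-structured file and converts the semi-structured file into keyword vectors based on the semantic vectors of the semantic model; and a comparison and identification device, which compares the correlation between the semantic vectors and the keyword vectors and identifies the keyword vectors corresponding to the semantic vectors.
A fourth aspect of the present invention provides a computer program product, which is tangibly stored on a computer-readable medium and comprises computer-executable instructions that, when executed, cause at least one processor to perform the method according to the first aspect of the present invention.
A fifth aspect of the present invention provides a computer-readable medium on which computer-executable instructions are stored that, when executed, cause at least one processor to perform the method according to the first aspect of the present invention.
One innovation of the present invention is to convert the semantic model into semantic vectors, which include type vectors and relation vectors, and to compute synonyms and build a synonym thesaurus for each semantic vector. The separated semantic vectors serve as a guide for information extraction. This allows any semantic model to be decomposed into multiple queries for data retrieval, which helps automate the process of matching and retrieving the data described by the semantic model.
Another innovation of the present invention is to organize the useful header data from any semi-structured file and convert it into keyword vectors. This includes identifying the keyword-parameter part and the data part of a table file and extracting the keyword parameters into a tree structure. Tables can thus be converted into vectors, which support data extraction and further comparison and calculation.
A further innovation is to extract the correlation mapping between arbitrary semantic vectors and keyword vectors in order to extract relevant information from semi-structured files. The difference between the semantic vectors and the keyword vectors is calculated and the parameter mapping is matched, which gives a fast, automatic, model-based way of evaluating and matching data.
The invention can greatly reduce the workload and cost of constructing a knowledge graph and speed up the delivery of knowledge-based services.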
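Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of a semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the ontology of the semantic model of the semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 3 is a diagram of the second conversion device 120 of the semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 4 is a schematic diagram of table file processing by the semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 5 is a flowchart of the steps by which the semantic model instantiation apparatus according to a specific embodiment of the present invention defines the four key parts ULC, RH, CH and data of a table file;
Fig. 6 is a schematic diagram of the keyword matrix of the semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 7 is a schematic diagram of the correlation calculation of the semantic model instantiation apparatus according to a specific embodiment of the present invention;
Fig. 8 is a schematic diagram of the correlation matrix of the semantic model instantiation apparatus according to a specific embodiment of the present invention.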
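Detailed Description
Specific embodiments of the present invention are described below with reference to the accompanying drawings.
The present invention provides a semantic model instantiation mechanism that can extract data instances based on an abstract model, using corresponding semi-structured data and a semantic model. The invention automatically screens and processes domain semi-structured files and, based on semantic definitions, quickly identifies useful data instances with reasonable accuracy and extracts them into a knowledge database, so that data is extracted automatically from semi-structured files based on any semantic model.
As shown in Fig. 1, the semantic model instantiation method provided by the present invention is executed by a semantic model instantiation apparatus 100, which comprises a first conversion device 110, a second conversion device 120, a comparison and identification device 130, a matching device 140, an extraction device 150 and a database 160. The first conversion device 110 parses a semantic model A and converts it into a set of feature vectors. The matching device 140 matches synonyms of the words of the semantic vectors of semantic model A. The second conversion device 120 then takes the semantic vectors and their synonyms as input and imports a semi-structured file B, so as to convert the semi-structured file B into keyword vectors based on the semantic vectors of semantic model A. Next, the comparison and identification device 130 compares the correlation between the semantic vectors and the keyword vectors and identifies the keyword vectors corresponding to the semantic vectors. Finally, the extraction device 150 extracts, into the database 160, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors.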
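A first aspect of the present invention provides a semantic model instantiation method, comprising the following steps:
First, step S1 is executed: the first conversion device 110 receives an ontology-based semantic model A, parses it and converts it into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes. That is, the first conversion device 110 decomposes semantic model A into individual class and subclass concepts and describes the classes and subclasses with feature vectors.
The ontology comprises types, attributes and relationships between the attributes. A type may also include subclasses. The invention can build an ontology library in advance and keep updating it while the invention is carried out. For example, the types in the ontology library include equipment, product, labor, material, process and maintenance. These types are related to one another.
For example, as shown in Fig. 2, the ontology includes a top-level class, product model, which contains several subclasses: maintenance, equipment, workshop, process, product and labor. Each subclass has several attributes. Specifically, the attributes of labor include name, telephone, level, gender and number; the attributes of maintenance include number, labor, month, week, planned time, actual time, working hours and grade; the attributes of equipment include parameters, name, service start time, type and power; the attribute of workshop is name; the attributes of process include actual start time, actual end time, blocking, buffer size, planned end time, number, planned start time and name; the attributes of product include order number, picture confirmation, actual shipping time, contract, shipping method, customer, planned shipping time, payment, price, structure and production capacity.
The output of the first conversion device 110 is therefore a set of feature vectors together with the relationships between the vectors. The feature vectors include semantic vectors and feature vectors, the latter being in particular vectors of the ontology types. Specifically, each vector contains a type name, vector names and the relationships between them. By way of example, one semantic vector has the format:
(type name, vector 1, vector 2, ..., vector N, relationship 1, relationship 2, ..., relationship M)
For example, the semantic vectors may be "worker operates machine C", "worker produces product" and "machine has fault", in which "operates", "produces" and "has" are the relationships.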
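The vector layout above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the class names `Relation` and `SemanticVector`, the function `parse_model` and the plain-dict input format are all hypothetical, chosen only to show how a type name, its attributes and its relationships might be flattened into one tuple.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One relationship, e.g. ("worker", "operates", "machine")."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SemanticVector:
    """Holds (type name, vector 1..N, relationship 1..M) for one ontology type."""
    type_name: str
    attributes: list[str] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)

    def as_tuple(self) -> tuple:
        # Flatten into the (type name, vectors..., relationships...) layout.
        return (self.type_name, *self.attributes,
                *(r.predicate for r in self.relations))


def parse_model(model: dict) -> list[SemanticVector]:
    """Convert a plain-dict ontology (hypothetical input format) into vectors."""
    vectors = []
    for type_name, spec in model.items():
        relations = [Relation(type_name, predicate, target)
                     for predicate, target in spec.get("relations", [])]
        vectors.append(SemanticVector(type_name,
                                      list(spec.get("attributes", [])),
                                      relations))
    return vectors


# Toy model based on the examples in the text.
model = {
    "worker": {"attributes": ["name", "telephone", "level"],
               "relations": [("operates", "machine"), ("produces", "product")]},
    "machine": {"attributes": ["name", "type", "power"],
                "relations": [("has", "fault")]},
}
vectors = parse_model(model)
```

A real ontology would more likely be read from an OWL or RDF file; the dict is used here only to keep the sketch self-contained.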
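Step S3 is then executed: the second conversion device 120 imports a semi-structured file B and converts it into keyword vectors based on the semantic vectors of semantic model A. Specifically, the second conversion device 120 extracts the header data from any semi-structured file B and reorganizes that header data according to a defined logic for subsequent processing, the semi-structured file B being a table file. As shown in Fig. 3, the second conversion device 120 comprises three sub-devices: a preprocessing device 1201, an identification device 1202 and a keyword device 1203. Step S3 comprises three sub-steps S31, S32 and S33. In many industrial fields, for example on the production floor, the predominant file type is the semi-structured file, such as tables in databases, manually built Excel spreadsheets and web HTML tables.
When the semi-structured file is a table file, step S3 further comprises: determining the header position of the table file and identifying the data part of the table file.
In sub-step S31, the preprocessing device 1201 performs basic conversion and cleaning of the input table file. For example, the preprocessing device 1201 can convert an Excel table file into an HTML table, because an HTML table contains richer and clearer header data.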
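Then, in sub-step S32, the identification device 1202 reads the table preprocessed by the preprocessing device 1201 in order to identify the attributes of the data content in the table file. Specifically, the invention first defines four key parts, ULC, RH, CH and data, for any table file and then locates these parts.
Specifically, referring to Fig. 4, four key parts ULC, RH, CH and data are first defined for table B1 so as to identify the header and the content of table B1. Consider first the table structure B', which is a two-dimensional table. The header part is the RH part; RH denotes the row header depth of the table, and the height of RH is h1. CH denotes the column header depth of the table, and its width is h2. Between RH and CH lies ULC, which denotes the upper-left corner region of the whole table; the height of ULC is h1 and its width is h2. The region below RH and to the right of CH is the data part, whose top-left cell is C3 and whose bottom-right cell is C4. The top-left cell of ULC is C1 and its bottom-right cell is C2. The problem is how to find and define the four key parts ULC, RH, CH and data.
Specifically, as shown in Fig. 5, the ULC part is found first, and C1, C2, h1 and h2 of the ULC part are identified. When h1 > 0 and h2 > 0, it is then checked whether RH = h1 and CH = h2; if so, table B1 is determined to be a two-dimensional table, and C3 is identified according to the extraction rules for two-dimensional tables. Otherwise, that is, when no ULC of positive height and width is found, it is determined that there is no ULC part, and C3 is identified according to the extraction rules for one-dimensional tables.
Next, when RH = h1 and CH = h2 is not satisfied, it is checked whether RH < h1 or CH < h2; if so, the correlation between the semantic vectors and the keyword vectors is calculated, C3 is identified and a potentially embedded one-dimensional table is extracted.
When RH < h1 or CH < h2 is not satisfied, it is checked whether RH > h1; if so, only RH and C3 of the data part are extracted. When RH > h1 is not satisfied, it is checked whether CH > h2; if so, only CH and C3 of the data part are extracted.
By executing the above steps, the four key parts ULC, RH, CH and data can be found and defined, thereby determining the header part and the data part of table B1.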
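The decision flow of Fig. 5 as described above can be sketched in Python as follows. This is only one reading of the text: the names `TableKind` and `classify_table` are hypothetical, and the sketch assumes that h1, h2, RH and CH have already been measured as integers. It returns which extraction rule applies and does not perform the extraction itself.

```python
from enum import Enum


class TableKind(Enum):
    ONE_DIMENSIONAL = "one-dimensional"
    TWO_DIMENSIONAL = "two-dimensional"
    EMBEDDED_ONE_DIMENSIONAL = "embedded one-dimensional"
    ROW_HEADER_ONLY = "row header only"
    COLUMN_HEADER_ONLY = "column header only"


def classify_table(h1: int, h2: int, rh: int, ch: int) -> TableKind:
    """Pick the extraction rule from the ULC size (h1, h2) and header depths."""
    # No upper-left corner region: treat the table as one-dimensional.
    if not (h1 > 0 and h2 > 0):
        return TableKind.ONE_DIMENSIONAL
    # Header depths agree with the ULC: a regular two-dimensional table.
    if rh == h1 and ch == h2:
        return TableKind.TWO_DIMENSIONAL
    # A header is shallower than the ULC: look for an embedded 1-D table.
    if rh < h1 or ch < h2:
        return TableKind.EMBEDDED_ONE_DIMENSIONAL
    # Row header deeper than the ULC: extract only RH and the data part.
    if rh > h1:
        return TableKind.ROW_HEADER_ONLY
    # Only ch > h2 can remain here: extract only CH and the data part.
    return TableKind.COLUMN_HEADER_ONLY
```

For instance, `classify_table(1, 1, 1, 1)` selects the two-dimensional rule, while `classify_table(0, 0, 1, 0)` falls back to the one-dimensional rule.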
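In sub-step S33, the input of the keyword device 1203 is the table with its key positions marked. The device applies normalization rules to extract the table title and attributes and stores them in a tree structure. The tree structure is then reorganized into a weighted vector for the subsequent analysis steps.
For example, the attributes of a one-dimensional table are extracted into a tree structure and converted into the following table keyword vector:
Operating equipment ledger | Serial no. | Category | Importance | Equipment ownership | Installation location | Equipment name | Equipment number | ... | Remarks
0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1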
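A small Python sketch of turning a header tree into such a vector is given below. The source does not say how the numbers in the second row are derived; the sketch assumes they are the depth of each node in the header tree (0 for the table title, 1 for its direct attributes), which matches the example but is only an assumption. `HeaderNode` and `to_keyword_vector` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class HeaderNode:
    """One node of the header tree: the table title or an attribute."""
    label: str
    children: list["HeaderNode"] = field(default_factory=list)


def to_keyword_vector(node: HeaderNode, depth: int = 0) -> list[tuple[str, int]]:
    """Flatten the tree depth-first into (keyword, depth) pairs."""
    pairs = [(node.label, depth)]
    for child in node.children:
        pairs.extend(to_keyword_vector(child, depth + 1))
    return pairs


# Toy tree built from the first few columns of the example table.
ledger = HeaderNode("Operating equipment ledger", [
    HeaderNode("Serial no."),
    HeaderNode("Category"),
    HeaderNode("Equipment name"),
])
keyword_vector = to_keyword_vector(ledger)
```

With nested headers, deeper attributes would simply receive larger depth values under this assumption.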
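Further, according to a preferred embodiment of the present invention, a step S2 is included between steps S1 and S3: matching synonyms of the words of the semantic vectors based on the semantic vectors of the semantic model. Step S3 then further comprises: the second conversion device 120 converts the semi-structured file into keyword vectors based on the semantic vectors of the semantic model and their synonyms.
The second conversion device 120 is used to generate a set of synonyms for each word of a semantic vector. Although existing software can also supply synonyms automatically, such tools rarely give reasonable results for complex or compound words, especially those composed of more than one sub-word. The second conversion device 120 provided by the invention is therefore designed to handle complex or compound words.
For example, a compound word is first split into several sub-words (sub-word#1, sub-word#2, ..., sub-word#n); the correlation of each sub-word is then calculated; finally the compound word is rebuilt using association rules. The second conversion device 120 thus includes a list of synonym results from which a synonym matrix is built, so the keyword thesaurus likewise consists of keyword matrices.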
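The split-and-recombine idea for compound words can be sketched in Python as below. This is a deliberately simplified stand-in: the source speaks of per-sub-word correlation and association rules without detailing them, so the sketch merely looks up each sub-word in a toy thesaurus and recombines the alternatives with a Cartesian product. The thesaurus contents, the whitespace-based splitting and the function names are all assumptions.

```python
from itertools import product

# Hypothetical toy thesaurus mapping a sub-word to its synonyms.
SUB_WORD_SYNONYMS = {
    "equipment": ["device", "machine"],
    "name": ["title"],
}


def split_compound(word: str) -> list[str]:
    """Split a compound word into sub-words (whitespace split as a stand-in)."""
    return word.lower().split()


def compound_synonyms(word: str, thesaurus: dict[str, list[str]]) -> list[str]:
    """Recombine sub-word synonyms into candidate synonyms of the compound."""
    parts = split_compound(word)
    # Each position may keep the original sub-word or use one of its synonyms.
    options = [[part, *thesaurus.get(part, [])] for part in parts]
    original = " ".join(parts)
    candidates = (" ".join(combo) for combo in product(*options))
    return [c for c in candidates if c != original]


synonyms = compound_synonyms("equipment name", SUB_WORD_SYNONYMS)
```

For Chinese headers, the whitespace split would have to be replaced by a proper word segmenter, and the candidates would normally be ranked or pruned rather than all kept.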
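Fig. 6 shows a keyword matrix: the class name has a first attribute (attribute 1), a second attribute (attribute 2), ..., an N-th attribute (attribute N). The class name and each of attribute 1 to attribute N have an original word together with its synonyms s1, s2, ..., sM. For example, original words and their synonyms are as follows: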
Figure PCTCN2019093873-appb-000001
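Finally, step S4 is executed: the comparison and identification device 130 compares the correlation between the semantic vectors and the keyword vectors and identifies the keyword vectors corresponding to the semantic vectors. According to a specific embodiment of the invention, the keyword vectors are table keyword vectors, so the comparison and identification device 130 calculates the correlation between the table keyword vectors and the semantic vectors. Its inputs are the keyword vectors, the semantic vectors and the synonym thesaurus. The invention uses algorithms to calculate the difference between the keyword vectors and the semantic vectors.
Specifically, step S4 further comprises: executing multiple correlation calculation methods based on the semantic vectors, the synonym thesaurus and the keyword vectors to obtain multiple correlation values for comparing the correlation between the semantic vectors and the keyword vectors, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vectors corresponding to the semantic vectors, wherein the parameter mapping represents matched keyword vectors and semantic vectors.
As shown in Fig. 7, multiple correlation calculation methods are executed based on the semantic vectors, the synonym thesaurus and the keyword vectors. By way of example, the correlation algorithms include a first, a second and a third correlation algorithm: the first is the cilin correlation algorithm, the second is the word2vector correlation algorithm, and the third is the modified jaccard correlation algorithm. Executing the three algorithms on the semantic vectors, the synonym thesaurus and the keyword vectors yields a first, a second and a third correlation value respectively. The three correlation values are combined by the following formula to construct the correlation matrix:
$M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
where $M_{ij}$ is the correlation, $O$ is a semantic vector, $K$ is a keyword vector, $w_q$ is a weight, $\mathrm{Sim}_q$ is a correlation algorithm, and $i$, $j$, $q$ are natural numbers. The correlation between the table title and the semantic type name can be given a higher weight, because a name usually conveys more information than any individual parameter.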
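The weighted sum above can be illustrated with a short Python sketch. The two similarity functions used here, a token-level Jaccard index and the `difflib.SequenceMatcher` ratio from the standard library, are stand-ins chosen so that the example runs without external resources; they are not the cilin, word2vector or modified jaccard measures named in the text, and the weights are arbitrary.

```python
from difflib import SequenceMatcher
from typing import Callable

Similarity = Callable[[str, str], float]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index (stand-in for one Sim_q)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity ratio (stand-in for another Sim_q)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def correlation_matrix(semantic: list[str], keywords: list[str],
                       measures: list[tuple[float, Similarity]]) -> list[list[float]]:
    """M[i][j] = sum over q of w_q * Sim_q(O_i, K_j)."""
    return [[sum(w * sim(o, k) for w, sim in measures) for k in keywords]
            for o in semantic]


semantic_terms = ["equipment name", "power"]
table_keywords = ["equipment name", "rated power"]
M = correlation_matrix(semantic_terms, table_keywords,
                       [(0.5, jaccard), (0.5, edit_ratio)])
```

Rows of `M` correspond to semantic vectors and columns to keyword vectors, in line with the indices of $M_{ij}$. A higher weight for the title-to-type-name comparison, as suggested in the text, could be added by scaling the relevant row or column.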
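Fig. 8 shows the correlation matrix, with the keyword vectors K on the horizontal axis and the semantic vectors O on the vertical axis. Once the correlation matrix has been obtained, the parameter mapping is filtered: a threshold rule is applied to determine the well-matched keyword pairs, and the output is the parameter mapping, that is, a labeled binary vector representing the matching result for the table parameters. The parameter mapping represents the matched keyword vectors and semantic vectors, and the filtering is performed by the Similarity Couple Determination algorithm. "1" denotes a matched parameter and "0" an unmatched parameter.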
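A threshold rule of this kind might look like the following Python sketch. The source names the Similarity Couple Determination algorithm but does not describe it, so this is not that algorithm: it is a plain best-match-above-threshold rule, and the function name `parameter_mapping` and the threshold value 0.6 are assumptions.

```python
def parameter_mapping(matrix: list[list[float]],
                      threshold: float = 0.6) -> tuple[dict[int, int], list[int]]:
    """Return ({keyword index: semantic index}, binary flags per keyword).

    Rows of the matrix are semantic vectors, columns are keyword vectors.
    """
    n_keywords = len(matrix[0]) if matrix else 0
    flags = [0] * n_keywords
    pairs: dict[int, int] = {}
    for j in range(n_keywords):
        # Best-scoring semantic vector for this keyword column.
        best_i = max(range(len(matrix)), key=lambda i: matrix[i][j])
        if matrix[best_i][j] >= threshold:
            flags[j] = 1          # "1": matched parameter
            pairs[j] = best_i
    return pairs, flags


example = [[0.9, 0.2, 0.1],
           [0.3, 0.7, 0.4]]
pairs, flags = parameter_mapping(example)
```

Here the third keyword column never reaches the threshold, so it is flagged "0" and left out of the mapping.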
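Finally, the following step is included after step S4: the extraction device 150 extracts, into the database 160, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors. The extraction device 150 extracts table data based on the output of the comparison and identification device 130. In one embodiment, only the matched data is extracted from the semantic model. In another embodiment, the data for both matched and unmatched table parameters is extracted and stored, but labeled with different correlation levels. Unmatched table parameters are extracted for potential future analysis and use. Data correlations are likewise identified and extracted.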
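The two extraction embodiments can be sketched with Python's built-in `sqlite3` module. The table name `instance_data`, its columns and the function `extract_to_db` are hypothetical; the source does not specify a database schema. The `mapping` argument maps a column index to the matched semantic attribute name.

```python
import sqlite3


def extract_to_db(conn: sqlite3.Connection, headers: list[str],
                  rows: list[list[str]], mapping: dict[int, str],
                  keep_unmatched: bool = False) -> None:
    """Store table cells as (row, attribute, value, matched) records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS instance_data "
        "(row_id INTEGER, attribute TEXT, value TEXT, matched INTEGER)"
    )
    for row_id, row in enumerate(rows):
        for j, value in enumerate(row):
            attribute = mapping.get(j)
            if attribute is None and not keep_unmatched:
                continue  # first embodiment: matched columns only
            conn.execute(
                "INSERT INTO instance_data VALUES (?, ?, ?, ?)",
                # Unmatched columns keep their header and get matched = 0.
                (row_id, attribute or headers[j], value, int(attribute is not None)),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
headers = ["Equipment name", "Remarks"]
rows = [["Pump A", "ok"], ["Pump B", "check"]]
extract_to_db(conn, headers, rows, {0: "name"})
```

Calling the function with `keep_unmatched=True` corresponds to the second embodiment, where unmatched columns are stored as well and can be told apart by the `matched` flag.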
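Although the content of the present invention has been described in detail through the above preferred embodiments, it should be recognized that the above description is not to be regarded as limiting the invention. Various modifications and alternatives will be apparent to those skilled in the art after reading the above. The scope of protection of the invention should therefore be defined by the appended claims. Furthermore, no reference sign in the claims should be construed as limiting the claim concerned; the word "comprising" does not exclude devices or steps not listed in other claims or in the description; and words such as "first" and "second" are used only to denote names and do not indicate any particular order.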

Claims (17)

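  1. A semantic model instantiation method, comprising the following steps:
    S1, receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes;
    S3, importing a semi-structured file and converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model;
    S4, comparing the correlation between the semantic vectors and the keyword vectors, and identifying the keyword vectors corresponding to the semantic vectors.
  2. The semantic model instantiation method according to claim 1, characterized in that the following step is further included between steps S1 and S3:
    S2, matching synonyms of the words of the semantic vectors based on the semantic vectors of the semantic model,
    wherein step S3 further comprises the following step:
    converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model and their synonyms.
  3. The semantic model instantiation method according to claim 1, characterized in that the following step is further included after step S4: extracting, into a database, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors.
  4. The semantic model instantiation method according to claim 1, characterized in that the ontology comprises types, attributes and relationships between the attributes.
  5. The semantic model instantiation method according to claim 1, characterized in that, when the semi-structured file is a table file, step S3 further comprises the following step:
    determining the header position of the table file and identifying the data part of the table file.
  6. The semantic model instantiation method according to claim 1, characterized in that step S4 further comprises the following step:
    executing multiple correlation calculation methods based on the semantic vectors, the synonym thesaurus and the keyword vectors to obtain multiple correlation values for comparing the correlation between the semantic vectors and the keyword vectors, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vectors corresponding to the semantic vectors,
    wherein the parameter mapping represents matched keyword vectors and semantic vectors.
  7. The semantic model instantiation method according to claim 6, characterized in that the correlation matrix is constructed by the following formula:
    $M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
    where $M_{ij}$ is the correlation, $O$ is a semantic vector, $K$ is a keyword vector, $w_q$ is a weight, $\mathrm{Sim}_q$ is a correlation algorithm, and $i$, $j$, $q$ are natural numbers.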
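  8. A semantic model instantiation system, comprising:
    a processor; and
    a memory coupled to the processor, the memory having instructions stored therein which, when executed by the processor, cause the electronic device to perform actions, the actions comprising:
    S1, receiving an ontology-based semantic model, parsing the semantic model and converting the semantic model into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes;
    S3, importing a semi-structured file and converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model;
    S4, comparing the correlation between the semantic vectors and the keyword vectors, and identifying the keyword vectors corresponding to the semantic vectors.
  9. The semantic model instantiation system according to claim 8, characterized in that, between actions S1 and S3, the actions further comprise:
    S2, matching synonyms of the words of the semantic vectors based on the semantic vectors of the semantic model,
    wherein action S3 further comprises:
    converting the semi-structured file into keyword vectors based on the semantic vectors of the semantic model and their synonyms.
  10. The semantic model instantiation system according to claim 8, characterized in that, after action S4, the actions further comprise: extracting, into a database, the instance data of the semi-structured file for the keyword vectors corresponding to the semantic vectors.
  11. The semantic model instantiation system according to claim 8, characterized in that the ontology comprises types, attributes and relationships between the attributes.
  12. The semantic model instantiation system according to claim 8, characterized in that, when the semi-structured file is a table file, action S3 further comprises:
    determining the header position of the table file and identifying the data part of the table file.
  13. The semantic model instantiation system according to claim 8, characterized in that action S4 further comprises:
    executing multiple correlation calculation methods based on the semantic vectors, the synonym thesaurus and the keyword vectors to obtain multiple correlation values for comparing the correlation between the semantic vectors and the keyword vectors, weighting the correlation values to construct a correlation matrix, and filtering out a parameter mapping to identify the keyword vectors corresponding to the semantic vectors,
    wherein the parameter mapping represents matched keyword vectors and semantic vectors.
  14. The semantic model instantiation system according to claim 13, characterized in that the correlation matrix is constructed by the following formula:
    $M_{ij} = \sum_{q} w_q \, \mathrm{Sim}_q(O_i, K_j)$
    where $M_{ij}$ is the correlation, $O$ is a semantic vector, $K$ is a keyword vector, $w_q$ is a weight, $\mathrm{Sim}_q$ is a correlation algorithm, and $i$, $j$, $q$ are natural numbers.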
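  15. A semantic model instantiation apparatus, comprising:
    a first conversion device, which receives an ontology-based semantic model, parses the semantic model and converts the semantic model into a set of feature vectors, wherein the feature vectors represent the types and attributes of the ontology and the relationships between the attributes;
    a second conversion device, which imports a semi-structured file and converts the semi-structured file into keyword vectors based on the semantic vectors of the semantic model;
    a comparison and identification device, which compares the correlation between the semantic vectors and the keyword vectors and identifies the keyword vectors corresponding to the semantic vectors.
  16. A computer program product, which is tangibly stored on a computer-readable medium and comprises computer-executable instructions that, when executed, cause at least one processor to perform the method according to any one of claims 1 to 7.
  17. A computer-readable medium on which computer-executable instructions are stored that, when executed, cause at least one processor to perform the method according to any one of claims 1 to 7.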
PCT/CN2019/093873 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus Ceased WO2020258303A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201980008609.4A CN112449700B (zh) 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus
US16/970,692 US20220129635A1 (en) 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus
PCT/CN2019/093873 WO2020258303A1 (zh) 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus
EP19915576.3A EP3783522A4 (en) 2019-06-28 2019-06-28 METHOD, SYSTEM AND DEVICE FOR INSTANTIATING SEMANTIC MODELS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/093873 WO2020258303A1 (zh) 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus

Publications (1)

Publication Number Publication Date
WO2020258303A1 true WO2020258303A1 (zh) 2020-12-30

Family

ID=74059647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/093873 Ceased WO2020258303A1 (zh) 2019-06-28 2019-06-28 Semantic model instantiation method, system and apparatus

Country Status (4)

Country Link
US (1) US20220129635A1 (zh)
EP (1) EP3783522A4 (zh)
CN (1) CN112449700B (zh)
WO (1) WO2020258303A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342976A (zh) * 2021-06-17 2021-09-03 北京海数宝科技有限公司 一种自动采集处理数据的方法、装置、存储介质及设备
CN115795075A (zh) * 2022-11-29 2023-03-14 自然资源部国土卫星遥感应用中心 一种遥感影像产品通用模型构建方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880484B (zh) * 2022-05-11 2023-06-16 军事科学院系统工程研究院网络信息研究所 一种基于向量映射的卫星通信频轨资源图谱构建方法
CN115079979A (zh) * 2022-06-17 2022-09-20 北京字跳网络技术有限公司 一种虚拟角色驱动方法、装置、设备及存储介质
CN115880120B (zh) * 2023-02-24 2023-05-16 江西微博科技有限公司 一种在线政务服务系统及服务方法
CN116524926B (zh) * 2023-04-27 2024-06-04 百洋智能科技集团股份有限公司 一种用于在移动端通过语音控制生成业务表单的方法
CN118468881A (zh) * 2024-04-30 2024-08-09 北京八月瓜科技有限公司 一种自动提取关键词的语义检索方法及系统

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1766871A (zh) * 2004-10-29 2006-05-03 中国科学院研究生院 基于上下文的半结构化数据语义提取的处理方法
CN104063502A (zh) * 2014-07-08 2014-09-24 中南大学 一种基于语义模型的wsdl半结构化文档相似性分析及分类方法

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103365A1 (en) * 2002-11-27 2004-05-27 Alan Cox System, method, and computer program product for an integrated spreadsheet and database
US20060242130A1 (en) * 2005-04-23 2006-10-26 Clenova, Llc Information retrieval using conjunctive search and link discovery
CN102682122B (zh) * 2012-05-15 2014-11-26 北京科技大学 基于本体构建材料科学领域语义数据模型的方法
US20140236860A1 (en) * 2013-02-19 2014-08-21 Ray Camrass system allowing banks to diversify their loan portfolios via exchanging loans
US9256761B1 (en) * 2014-08-18 2016-02-09 Yp Llc Data storage service for personalization system
US10496749B2 (en) * 2015-06-12 2019-12-03 Satyanarayana Krishnamurthy Unified semantics-focused language processing and zero base knowledge building system
US9984068B2 (en) * 2015-09-18 2018-05-29 Mcafee, Llc Systems and methods for multilingual document filtering
CN106919674A (zh) * 2017-02-20 2017-07-04 广东省中医院 一种基于Wiki语义网络构建的知识问答系统及智能检索方法
CN108804409A (zh) * 2017-04-28 2018-11-13 西安科技大市场创新云服务股份有限公司 一种语义检索方法和装置
EP3407208A1 (en) * 2017-05-22 2018-11-28 Fujitsu Limited Ontology alignment apparatus, program, and method
KR102472572B1 (ko) * 2017-07-21 2022-11-30 십일번가 주식회사 사용자 의도 프로파일링 방법 및 이를 위한 장치
KR102054514B1 (ko) * 2017-08-07 2019-12-10 강준철 인공지능(ai)을 통한 딥러닝훈련모듈과, 순위화프레임워크모듈을 활용하여, 법률전문가에게 최적화된 모범답안을 제시하는 한편, 법률정보를 의미 벡터로 변환하여, 데이터베이스에 저장하고, 이에 대한 문자열 사전모듈을 활용한 온라인 법률정보사전을 제공하는 시스템 및 그 방법
CN110162776B (zh) * 2019-03-26 2024-10-18 腾讯科技(深圳)有限公司 交互消息处理方法、装置、计算机设备和存储介质
US11120798B2 (en) * 2019-06-27 2021-09-14 Atlassian Pty Ltd. Voice interface system for facilitating anonymized team feedback for a team health monitor
US11270697B2 (en) * 2019-06-27 2022-03-08 Atlassian Pty Ltd. Issue tracking system having a voice interface system for facilitating a live meeting directing status updates and modifying issue records

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3783522A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113342976A (zh) * 2021-06-17 2021-09-03 北京海数宝科技有限公司 一种自动采集处理数据的方法、装置、存储介质及设备
CN113342976B (zh) * 2021-06-17 2023-07-04 北京海数宝科技有限公司 一种自动采集处理数据的方法、装置、存储介质及设备
CN115795075A (zh) * 2022-11-29 2023-03-14 自然资源部国土卫星遥感应用中心 一种遥感影像产品通用模型构建方法
CN115795075B (zh) * 2022-11-29 2023-08-11 自然资源部国土卫星遥感应用中心 一种遥感影像产品通用模型构建方法

Also Published As

Publication number Publication date
CN112449700B (zh) 2024-09-24
EP3783522A4 (en) 2021-11-24
EP3783522A1 (en) 2021-02-24
US20220129635A1 (en) 2022-04-28
CN112449700A (zh) 2021-03-05

Similar Documents

Publication Publication Date Title
WO2020258303A1 (zh) 语义模型实例化方法、系统和装置
CN110929038B (zh) 基于知识图谱的实体链接方法、装置、设备和存储介质
CN108804521B (zh) 一种基于知识图谱的问答方法及农业百科问答系统
CN111291161A (zh) 法律案件知识图谱查询方法、装置、设备及存储介质
CN118195533A (zh) 基于人工智能的项目申报与企业信息交互方法及系统
CN120873195A (zh) 用于基于图谱的动态信息检索及合成的系统和方法
CN112925901B (zh) 一种辅助在线问卷评估的评估资源推荐方法及其应用
CN111324631A (zh) 一种将查询数据的人类自然语言自动生成sql语句的方法
CN111274267A (zh) 一种数据库查询方法、装置及计算机可读取存储介质
CN115563313A (zh) 基于知识图谱的文献书籍语义检索系统
CN103886099B (zh) 一种模糊概念的语义检索系统及方法
CN119557424B (zh) 一种数据分析方法、系统以及存储介质
CN121233616B (zh) 一种基于自然语言理解的智能客服查询系统及方法
CN115982322A (zh) 一种水利行业设计领域知识图谱的检索方法及检索系统
CN120144549B (zh) 用于多领域数据共享的元数据实时自适应标准化系统
CN118245564B (zh) 一种支持语义查重查新的特征比对库构建方法及装置
CN113610626A (zh) 银行信贷风险识别知识图谱构建方法、装置、计算机设备及计算机可读存储介质
CN119578422B (zh) 一种成交客户社交网络构建和拓展方法及系统
CN120611023A (zh) 多源知识增强的大语言模型问答方法、装置、设备及介质
CN118779439A (zh) 基于检索增强的问答方法、装置、设备及存储介质
CN112486919A (zh) 文档管理方法、系统及存储介质
CN118733860A (zh) 一种基于多维特征优化的媒体影响力评估模型构建方法
CN116610810B (zh) 基于调控云知识图谱血缘关系的智能搜索方法及系统
CN115374108B (zh) 一种基于知识图谱技术的数据标准生成与自动映射方法
CN117149804A (zh) 数据处理方法、装置、电子设备及存储介质

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019915576

Country of ref document: EP

Effective date: 20200825

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Ref document number: 2019915576

Country of ref document: EP