CN121661663A - Multimodal structured data intelligent recognition method, device, system and storage medium - Google Patents

Multimodal structured data intelligent recognition method, device, system and storage medium

Info

Publication number
CN121661663A
Authority
CN
China
Prior art keywords
text
feature
features
visual
structured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511844237.2A
Other languages
Chinese (zh)
Inventor
郑穗
方英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Roadshow Moment Network Data Co ltd
Original Assignee
Shenzhen Roadshow Moment Network Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Character Input (AREA)

Abstract

The invention provides a method, device, system and storage medium for intelligent recognition of multimodal structured data. The method includes at least the following steps: acquiring a multimodal picture that contains a table image and table text, and extracting visual features and text features respectively; fusing the visual features and the text features according to position information and/or format information in the text features and wireframe information in the visual features, to obtain multimodal fusion features; and, based on the multimodal fusion features, decoding and outputting structured data corresponding to the multimodal picture. By parsing the multimodal picture, the method obtains the table image and the table text together with their corresponding visual features and text features, analyzes the two kinds of features separately, uses the inherent properties of the text features to fuse them with the visual features by computer, and finally decodes structured data that a computer can process.

Description

Multi-mode structured data intelligent identification method, device, system and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for intelligently identifying multi-mode structured data.
Background
OCR (Optical Character Recognition) is a conventional technique in which a device such as a scanner or digital camera examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text by character recognition. Although computer recognition technology has advanced considerably and the accuracy of OCR keeps improving, OCR still copes poorly with real working conditions.
In the field of work tables, for example, a single-modality method cannot effectively associate visual features with semantic information for complex table structures such as merged cells and nested tables. In addition, existing table recognition systems generalize poorly and struggle to adapt to tables that differ in template, sharpness, skew and/or display area. Text detection and table structure recognition are often handled as separate stages, so recognition errors accumulate and the results are seriously distorted.
Disclosure of Invention
Based on this, in order to solve at least one of the above-mentioned problems, the present invention provides a method, a device, a system and a storage medium for intelligently identifying multi-modal structured data.
In a first aspect, the present invention provides a method for intelligently identifying multi-modal structured data, at least comprising the following steps:
acquiring a multi-modal picture comprising a table image and table text, and extracting visual features and text features respectively;
fusing the visual features and the text features according to position information and/or format information in the text features and wireframe information in the visual features, to obtain multi-modal fusion features; and
decoding and outputting structured data corresponding to the multi-modal picture based on the multi-modal fusion features.
In certain implementations of the first aspect, the position information includes first coordinate information of the table text on the multi-modal picture, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features includes:
resampling the data sequence corresponding to the text features and the data sequence corresponding to the visual features into associated data sequences of consistent length according to the first coordinate information and the second coordinate information.
With reference to the first aspect and the foregoing implementations, in some implementations of the first aspect, the format information includes the length, width and number of rows of the table text, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features includes:
performing standardized conversion on the data sequence corresponding to the text features according to the format information and the second coordinate information, to obtain fused text features; and
resampling the data sequence corresponding to the fused text features and the data sequence corresponding to the visual features into associated data sequences of the same length according to the position information and the wireframe information.
With reference to the first aspect and the foregoing implementations, in some implementations of the first aspect, the position information further includes a tilt angle of the table text, and the step of performing standardized conversion on the data sequence corresponding to the text features further includes adding deflection angle data, according to the tilt angle, to both the data sequence corresponding to the text features and the data sequence corresponding to the visual features.
With reference to the first aspect and the foregoing implementations, in some implementations of the first aspect, the step of extracting the visual features and the text features respectively includes:
invoking an OCR recognition engine to recognize the table text and obtain the text features, wherein the text features comprise a plurality of mutually independent text blocks and third coordinate information of each text block.
With reference to the first aspect and the foregoing implementations, in some implementations of the first aspect, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features further includes:
correcting the second coordinate information according to the third coordinate information by a preset data difference value, to obtain corrected second coordinate information; and
resampling the data sequence corresponding to the text blocks and the data sequence corresponding to the visual features into associated data sequences of the same length according to the third coordinate information and the corrected second coordinate information.
With reference to the first aspect and the foregoing implementations, in certain implementations of the first aspect, the step of acquiring the multi-modal picture comprising the table image and the table text includes scanning or photographing a document on which the table content is recorded.
In a second aspect, the present invention provides a multi-modal structured data intelligent recognition apparatus, including:
an acquisition module, configured to acquire a multi-modal picture comprising a table image and table text, and to extract visual features and text features respectively;
a fusion module, configured to fuse the visual features and the text features according to position information and/or format information in the text features and wireframe information in the visual features, to obtain multi-modal fusion features; and
a decoding module, configured to decode and output structured data corresponding to the multi-modal picture based on the multi-modal fusion features.
In a third aspect, the invention provides a multi-modal structured data intelligent recognition system, which comprises a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to realize the multi-modal structured data intelligent recognition method according to any one of the first aspect of the invention.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the multimodal structured data intelligent identification method of any of the first aspects of the invention.
The technical scheme provided by the embodiment of the invention has the following beneficial technical effects:
According to the multi-modal structured data intelligent recognition method provided by the invention, the multi-modal picture is parsed to obtain the table image and the table text together with their corresponding visual features and text features. The two kinds of features are analyzed separately, the text features and the visual features are then fused by computer using the inherent properties of the text features, and structured data that a computer can process is finally decoded. As a result, structured data corresponding to the table image can be generated effectively even when table templates differ, table sharpness varies, or positional deviation occurs.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for intelligent recognition of multi-modal structured data in accordance with an embodiment of the present invention;
FIG. 2 is a flow chart of the method for fusing visual features and text features according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of implementing structured data output by a multi-modal structured data intelligent recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic logic flow diagram of a fusion process performed by the multi-modal structured data intelligent recognition method according to an embodiment of the present invention;
FIG. 5 is a flow chart of table image reconstruction by the multi-modal structured data intelligent recognition method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a multi-modal identification table device according to an embodiment of the invention;
FIG. 7 is a schematic diagram of a multi-modal structured data intelligent recognition system according to an embodiment of the present invention.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "fixed to" another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
An embodiment of the first aspect of the present invention provides a method for intelligently identifying multi-mode structured data, as shown in fig. 1, at least including the following steps:
S100: acquire a multi-modal picture comprising a table image and table text, and extract visual features and text features respectively. In S100, the multi-modal picture is interpreted and analyzed from two aspects, the table wireframe and the text set, and is converted into visual features corresponding to the table image and text features corresponding to the table text. As in the prior art, a computer can read text through OCR and can read lines, converting both into internal data recorded as text features and visual features respectively. Specifically, the step of acquiring the multi-modal picture includes scanning or photographing a document on which the table content is recorded. The application scenario of the invention is recognizing tables recorded on paper in everyday work and converting them directly and accurately into tables that can be edited on a computer. The multi-modal picture is therefore acquired by photographing or scanning the paper table into a computer device or the multi-modal structured data intelligent recognition system, after which conventional OCR technology is used for subsequent processing.
S200: fuse the visual features and the text features according to the position information and/or format information in the text features and the wireframe information in the visual features, to obtain multi-modal fusion features. The text features include position information, that is, the specific position of the text on the multi-modal picture, and format information, that is, the length, width, number of lines and so on of the area the text occupies. Fusing the visual features and the text features means aligning the text features with the visual features. This is a soft alignment inside the model rather than a strict one-to-one correspondence: the criterion is that, after processing by the fusion encoder, the model can find, for each token in the text sequence, sufficient visual evidence in the corresponding resampled visual features to support that token's semantics and position. The alignment is driven by a loss function during training. For example, when predicting the coordinates of a cell, the model uses both the text token and the visual features of the cell that contains it, that is, the area framed by the lines. If the alignment succeeds, the model can accurately associate a piece of text with its specific location in the image. After training, a visualized attention weight map shows intuitively whether the attention of a text token (e.g., "monetary amount") falls within the corresponding cell region of the table in the image. Through S200, the visual features and text features, which were read separately, are fused or aligned to form the multi-modal fusion features.
S300: based on the multi-modal fusion features, decode and output structured data corresponding to the multi-modal picture. From the multi-modal fusion features the computer can decode and output structured data in a format such as HTML, JSON or CSV. By reading this structured data, the computer can render editable table content.
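As an illustration of the output side of S300 only, the following minimal sketch serializes an already decoded cell list into JSON, CSV and HTML. The cell schema (`row`, `col`, `text`, `rowspan`, `colspan`) and the function names are assumptions made for this example; the description does not specify an output schema.

```python
import csv
import io
import json
from html import escape


def to_json(cells):
    """Serialize the decoded cells as a JSON array (keeps non-ASCII text)."""
    return json.dumps(cells, ensure_ascii=False)


def to_csv(cells, n_rows, n_cols):
    """Place each cell's text at (row, col) in a grid and write it as CSV."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


def to_html(cells):
    """Emit an HTML table; merged cells carry rowspan/colspan attributes."""
    rows = {}
    for c in sorted(cells, key=lambda c: (c["row"], c["col"])):
        attrs = ""
        if c.get("rowspan", 1) > 1:
            attrs += f' rowspan="{c["rowspan"]}"'
        if c.get("colspan", 1) > 1:
            attrs += f' colspan="{c["colspan"]}"'
        rows.setdefault(c["row"], []).append(f"<td{attrs}>{escape(c['text'])}</td>")
    body = "".join(f"<tr>{''.join(tds)}</tr>" for _, tds in sorted(rows.items()))
    return f"<table>{body}</table>"


# Hypothetical decoder output: a header cell merged across two columns.
cells = [
    {"row": 0, "col": 0, "text": "Item", "colspan": 2},
    {"row": 1, "col": 0, "text": "Amount"},
    {"row": 1, "col": 1, "text": "120.50"},
]
```

HTML can express merged cells directly through `rowspan` and `colspan`, whereas the CSV form in this sketch simply leaves the covered positions empty.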
According to the multi-modal structured data intelligent recognition method provided by the invention, the multi-modal picture is parsed to obtain the table image and the table text together with their corresponding visual features and text features. The two kinds of features are analyzed separately, the text features and the visual features are then fused by computer using the inherent properties of the text features, and structured data that a computer can process is finally decoded. As a result, structured data corresponding to the table image can be generated effectively even when table templates differ, table sharpness varies, or positional deviation occurs.
Specifically, in some implementations of the first aspect, the position information includes first coordinate information of the table text on the multi-modal picture, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features includes resampling the data sequence corresponding to the text features and the data sequence corresponding to the visual features into associated data sequences of consistent length according to the first coordinate information and the second coordinate information. The position and size information is fully used, so the computer can process the picture information it has read according to specific settings and obtain structured data that matches the multi-modal picture more closely. The table text, that is, the text inside the table, usually comprises several characters (for Chinese, at least one Chinese character). This text occupies a definite position on the picture, and the computer can determine that position at the same time as it reads out the text features. The table is made up of vertical, horizontal or oblique lines, in particular explicit lines. After the computer reads the visual features corresponding to the table image, each line is assigned a line number and is recorded with its length, start coordinate, end coordinate and so on.
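The first and second coordinate information described above can be pictured as two small record types. The sketch below is one possible in-memory representation; the class names, field names and the 10% orientation tolerance are assumptions for illustration, not part of the description.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TableLine:
    """One explicit line detected in the table image (pixel coordinates)."""
    line_id: int
    start: tuple  # (x, y) of the starting point
    end: tuple    # (x, y) of the end point

    @property
    def length(self) -> float:
        return math.hypot(self.end[0] - self.start[0], self.end[1] - self.start[1])

    @property
    def orientation(self) -> str:
        # Assumed tolerance: a line counts as axis-aligned if its minor
        # extent is at most 10% of its major extent.
        dx = abs(self.end[0] - self.start[0])
        dy = abs(self.end[1] - self.start[1])
        if dy <= 0.1 * dx:
            return "horizontal"
        if dx <= 0.1 * dy:
            return "vertical"
        return "oblique"


@dataclass(frozen=True)
class TextBox:
    """One piece of table text with its bounding box on the picture."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)
```

For example, `TableLine(1, (0, 10), (200, 10))` is classified as horizontal, and `TextBox("Amount", 10, 20, 50, 40).center` gives the point used to decide which cell the text belongs to.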
Fusing the visual features and the text features builds a deep semantic association between the visual features (lines, cell layout, cell positions and similar information) and the text features (characters, numbers, symbols and their sizes and positions). The two are first aligned spatially. Using the text bounding-box coordinates provided by OCR, each text token is bound to a specific spatial position in the image, namely its corresponding visual region. Attention is paid to the visual context around a given text, such as the border thickness, shading and color of its cell; the text in one visual region is matched to the cells within a certain range around it, the two are merged, and their respective data sequences are resampled and fused in the computer. In particular, this is implemented by a perception resampler: for example, the data sequence V = [v_1, v_2, ..., v_k] corresponding to the visual features is resampled into a data sequence V' = [v'_1, v'_2, ..., v'_m] whose length matches that of the data sequence T = [t_1, t_2, ..., t_m] corresponding to the text features.
Second, semantic alignment is performed: the visual features and the text features are mapped onto each other through a cross-modal attention mechanism so that each can query the other. For example, a text feature can act as a query that retrieves, through attention, the visual features most relevant to it; conversely, a visual feature (e.g., the pattern of a merged cell) can act as a query that finds the key text describing it. Taking the data sequence T corresponding to the text features as the Query and the sequence V corresponding to the visual features as both the Key and the Value, one attention computation, Attention(Q = T, K = V, V = V), is performed to produce the alignment weights, referred to as the weight matrix A. The weight matrix A (of size m × k) represents how much attention each text token pays to each image region. The visual features V are weighted and summed using A to obtain preliminary fusion features. These are then mapped by a learnable feed-forward neural network into a space of the same dimension as the text features, and the resampled visual features V' are output. After this step, each v'_i in V' is a condensed summary of the visual context most relevant to the text feature t_i. The resampled visual features V' are then fused in depth with the text features T by a multi-modal Transformer layer: V' and T are concatenated into one long multi-modal sequence X = [v'_1, v'_2, ..., v'_m, t_1, t_2, ..., t_m], which is fed into a standard Transformer encoder layer. In the self-attention mechanism of this layer, every token, visual or textual, interacts with every other token in the sequence. For example, a text token t_j can attend to the text tokens around it and to the visual token v'_j aligned with it, and possibly also to visual tokens in other rows and columns.
After self-attention, each token undergoes a nonlinear transformation in a feed-forward network. N such multi-modal Transformer layers (for example, 6) are stacked to achieve deep cross-modal understanding.
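The resampling step described above can be sketched in plain Python. This is a simplified illustration, not the trained model: it assumes text and visual vectors already share one dimension d, uses scaled dot-product attention, and replaces the learnable query/key/value projections and the feed-forward mapping with the identity. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def resample_visual(T, V):
    """Cross-attention with Q = T and K = V (values are also V).

    T: m text vectors, V: k visual vectors, all of dimension d.
    Returns the m x k weight matrix A and the m resampled vectors V'.
    """
    d = len(T[0])
    scale = math.sqrt(d)
    # Row i of A: attention of text token t_i over the k image regions.
    A = [softmax([dot(t, v) / scale for v in V]) for t in T]
    # v'_i is the A-weighted sum of the visual vectors.
    V_prime = [[sum(a * v[j] for a, v in zip(row, V)) for j in range(d)] for row in A]
    return A, V_prime


def fuse_sequence(T, V):
    """Build X = [v'_1, ..., v'_m, t_1, ..., t_m] for the encoder stack."""
    _, V_prime = resample_visual(T, V)
    return V_prime + T
```

With m = 2 text tokens and k = 3 visual tokens, `resample_visual` returns a 2 × 3 matrix whose rows each sum to 1, and `fuse_sequence` returns 4 vectors, which is the length-consistent sequence the Transformer layers would consume.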
In particular, with reference to the embodiment of the first aspect and the foregoing implementations, in other embodiments of the first aspect, the format information includes the length, width and number of rows of the table text, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features includes performing standardized conversion on the data sequence corresponding to the text features according to the format information and the second coordinate information, to obtain fused text features. The data sequence corresponding to the fused text features and the data sequence corresponding to the visual features are then resampled into associated data sequences of the same length according to the position information and the wireframe information.
As described above, the table text, that is, the text inside the table, usually comprises several characters in a regular format. The spacing between characters and the length and width of each character and character string constitute the format information, which OCR converts into data a computer can process. In this embodiment, the alignment between the text in a given cell and the text in neighboring areas (above, below, left or right) is part of the format information, and from it the computer can infer that the area where the text sits should have a cell boundary. The text boundary given by the format information is expanded outward by a specific size. This size can be treated as an editing parameter that is set and adjusted manually; all text is processed uniformly according to it and converted in a standardized way into new text with a new text boundary, which is the fused text feature described above. The new text is then fused with the visual features to form the associated data sequence. With this processing, text can be correctly "filled" into the table and structured data can be produced even when the cell boundary is absent from the visual features, that is, when the table contains hidden (implicit) wireframes.
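The outward expansion of a text boundary can be illustrated as follows. This is a minimal sketch under assumed conventions: boxes are `(x0, y0, x1, y1)` in pixels, `margin` stands in for the manually set editing parameter, and the result is clamped to the page. The function names are hypothetical.

```python
def infer_cell_box(text_box, margin, page_size):
    """Expand a text bounding box outward by `margin` on every side.

    text_box: (x0, y0, x1, y1); page_size: (width, height).
    The expanded box is the inferred (possibly invisible) cell boundary.
    """
    x0, y0, x1, y1 = text_box
    width, height = page_size
    return (
        max(0, x0 - margin),
        max(0, y0 - margin),
        min(width, x1 + margin),
        min(height, y1 + margin),
    )


def normalize_text_boxes(text_boxes, margin, page_size):
    """Apply the same editing parameter uniformly to every text box."""
    return [infer_cell_box(b, margin, page_size) for b in text_boxes]
```

For instance, `infer_cell_box((10, 20, 50, 40), 5, (100, 100))` yields `(5, 15, 55, 45)`, and a box already touching the page edge is not pushed outside it.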
With reference to the embodiment of the first aspect and the foregoing implementations, in other implementations of the first aspect, the position information further includes a tilt angle of the table text, and the step of performing standardized conversion on the data sequence corresponding to the text features further includes adding deflection angle data, according to the tilt angle, to both the data sequence corresponding to the text features and the data sequence corresponding to the visual features. In the prior art, a Hough transform or contour detection is typically used to find straight lines, from which the tilt angle is calculated for rotation. In the invention, the tilt angle of the text in the table is recorded first. It can be determined from statistics over the tilt angles of several groups of text in the table, so text tilt can be handled even when the surrounding table or cell has no explicit lines. Specifically, an OCR engine is used to obtain the minimum bounding rectangle of each text line, the tilt angles of all text lines are collected, and the median is taken as the tilt angle of the text direction. This effectively corrects text deformation caused by the shooting angle or the way the paper is placed. In addition, the outermost border lines of the table can be found by line detection and used to estimate the border tilt angle. If the border lines are continuous and detected with high confidence, their direction is used for secondary verification. The final correction angle = α × (text tilt angle) + β × (border tilt angle), where α and β are weighting coefficients. In this way, accurate correction can be achieved from the large amount of text information even if the table wireframe is incomplete.
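The angle estimate above can be sketched as follows. The median over text lines and the weighted sum follow the description; the default weights `alpha = 0.7` and `beta = 0.3`, the fallback when no reliable border is found, and the representation of each text line by the two end points of its bottom edge are assumptions for this example.

```python
import math
from statistics import median


def line_angle_deg(p0, p1):
    """Angle of the segment p0 -> p1 relative to the x-axis, in degrees."""
    return math.degrees(math.atan2(p1[1] - p0[1], p1[0] - p0[0]))


def text_tilt(baselines):
    """Median tilt over all text lines; robust to a few outlier lines.

    baselines: list of (p0, p1) point pairs, e.g. the bottom edge of each
    text line's minimum bounding rectangle as returned by an OCR engine.
    """
    return median(line_angle_deg(p0, p1) for p0, p1 in baselines)


def correction_angle(text_angle, border_angle=None, alpha=0.7, beta=0.3):
    """Final angle = alpha * text tilt + beta * border tilt.

    If no continuous, high-confidence border line was detected,
    fall back to the text tilt alone (assumed behaviour).
    """
    if border_angle is None:
        return text_angle
    return alpha * text_angle + beta * border_angle
```

Using the median rather than the mean means that a single badly detected text line, for example a rotated stamp, does not shift the estimated direction.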
In addition, some specific embodiments adopt adaptive enhancement based on task feedback: a quick pre-recognition pass is run on the original multi-modal picture to diagnose its main problems, such as low contrast, background noise, uneven illumination and moiré. For low contrast or uneven illumination, the parameters of the CLAHE algorithm (grid size and contrast limit) can be adjusted adaptively according to the image resolution. For background noise, non-local means denoising is used instead of Gaussian filtering to better preserve line and character edges. Moiré is removed with a dedicated frequency-domain filtering method. The goal of these corrections is to provide a better input for the subsequent line detection and OCR recognition. Furthermore, a detection model fine-tuned on a large number of table images is used. Its backbone neural network is more sensitive to line and text features, and the model filters out interference from non-table lines such as headers, footers, stamps and handwritten notes. In practice, many tables suffer from scanning skew, broken lines, stains, or lines that are too thick or too thin, and a purely visual method fails easily on them.
Optionally, with reference to the foregoing embodiments and implementations, in still other embodiments of the first aspect, the step of extracting the visual features and the text features respectively includes invoking an OCR recognition engine to recognize the table text and obtain the text features, where the text features include a plurality of mutually independent text blocks and third coordinate information of each text block. In many tables a cell may contain more than one line of text. Using semantic recognition, the computer can interpret several lines of text as one sentence expressing one meaning, so that multiple lines are treated as a single text block: each semantic unit of text corresponds to one text block, and each text block corresponds to one piece of coordinate information, namely the third coordinate information. Further, in one embodiment of the present invention, as shown in fig. 2, the wireframe information includes second coordinate information of the explicit lines in the table image, and the step of fusing the visual features and the text features in S200 further includes:
S210, correcting the second coordinate information according to the third coordinate information by using a preset data difference value to obtain corrected second coordinate information.
S220, resampling the data sequence corresponding to the text block and the data sequence corresponding to the visual feature into an associated data sequence with the same length according to the third coordinate information and the corrected second coordinate information.
In this implementation, it is first determined whether the table is a fully wired table, a half-wired table or a wireless table, which provides the basis for the subsequent flow. Second, when the table is half-wired or wireless, such as a conventional three-line table, the text blocks in the table indicate that cells exist around them even though their boundaries are not drawn. As in the previous embodiment, the area where a text block sits can be understood to have a cell boundary. The third coordinate information of the text block is extended outward by a specific size, which can be treated as an editing parameter that is set or adjusted manually. All blocks are processed uniformly according to this parameter and converted in a standardized way into new second coordinate information; in other words, the hidden lines of the original half-wired or wireless table are represented as data, giving the corrected second coordinate information. The redrawn table is then fused with the text to form the associated data sequence. Explicit line detection provides the "physical structure seen by the eye", while the analysis of text blocks is "implicit structure perception", providing the "logical structure that the brain understands but that has no physical form yet". The method provided by the invention has both capabilities and fuses them intelligently, so that it can understand diverse tables flexibly and accurately, much as a human does.
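The table-type check and the correction of the second coordinate information can be sketched as follows. This is a deliberately simplified illustration: explicit lines are assumed to be axis-aligned and are reduced to their y (horizontal) or x (vertical) coordinate, `delta` stands in for the preset data difference value, and the rule "add an implied line only where no line already lies within `delta`" is an assumption of this example rather than the described procedure.

```python
def classify_table(blocks, h_lines, v_lines):
    """Return 'wireless', 'half wired' or 'full wired'.

    blocks: text-block boxes (x0, y0, x1, y1), i.e. third coordinates.
    h_lines / v_lines: y / x positions of explicit lines.
    """
    if not h_lines and not v_lines:
        return "wireless"
    enclosed = all(
        any(y <= y0 for y in h_lines) and any(y >= y1 for y in h_lines)
        and any(x <= x0 for x in v_lines) and any(x >= x1 for x in v_lines)
        for (x0, y0, x1, y1) in blocks
    )
    return "full wired" if enclosed else "half wired"


def _add_implied(coords, candidate, tol):
    """Append `candidate` unless an existing line is within `tol` of it."""
    if all(abs(c - candidate) > tol for c in coords):
        coords.append(candidate)


def corrected_lines(blocks, h_lines, v_lines, delta):
    """Extend every text block outward by `delta` and record the implied
    (hidden) lines as data next to the explicit ones."""
    hs, vs = list(h_lines), list(v_lines)
    for x0, y0, x1, y1 in blocks:
        _add_implied(hs, y0 - delta, delta)
        _add_implied(hs, y1 + delta, delta)
        _add_implied(vs, x0 - delta, delta)
        _add_implied(vs, x1 + delta, delta)
    return sorted(hs), sorted(vs)


# A three-line table: three horizontal rules, no vertical rules.
blocks = [(10, 12, 40, 22), (50, 12, 80, 22), (10, 32, 40, 42), (50, 32, 80, 42)]
h_explicit, v_explicit = [5, 27, 47], []
```

On this example the explicit horizontal rules are kept as they are, and three implied vertical lines are added at the left edge, between the two columns and at the right edge, after which every text block is enclosed on all four sides.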
The scheme provided by the invention uses a single unified model that can process a "fully wired table", a "wireless table" and the "semi-wired table" between the two. The implicit structure perception step can accurately infer the extent of merged cells by analyzing text blocks that span rows or columns, text blocks that break the alignment pattern, and the like. Different models do not need to be trained or switched for tables of different styles, which significantly improves the practical value.
In some practical cases, the above method steps may be combined with the prior art to output structured data. For ease of understanding, the details are as follows:
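One way to picture the inference of merged cells is an overlap test between a text block and the column boundaries. The sketch below is an assumed heuristic, not the method of the invention: `col_edges` is a sorted list of column boundary x-coordinates, and a column counts as spanned when the text block covers at least `min_overlap` of its width. The function name `spanned_columns` is hypothetical.

```python
def spanned_columns(x0, x1, col_edges, min_overlap=0.5):
    """Return the indices of the columns that a text block [x0, x1] covers."""
    cols = []
    for i in range(len(col_edges) - 1):
        left, right = col_edges[i], col_edges[i + 1]
        overlap = min(x1, right) - max(x0, left)
        # A column is spanned when enough of its width lies under the block.
        if right > left and overlap / (right - left) >= min_overlap:
            cols.append(i)
    return cols
```

A block that covers several columns suggests a merged cell whose column span equals the number of returned indices; the same test can be applied to row boundaries.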
As shown in FIG. 3, the input table image is first preprocessed, which mainly involves three aspects: image correction, image enhancement and target detection. Image correction comprises image de-warping and image rotation, image enhancement mainly comprises contrast adjustment and binarization, and target detection mainly comprises detection of the table area. After the table image is acquired, the text encoder and the visual encoder of the multimodal large-model core produce the text features and the visual features. These are processed by the perception resampler and fused by attention in the fusion encoder, and the result is passed to a large language model core, such as the existing LLaMA, GPT or ChatGLM, which outputs the structured data in JSON, CSV or Excel form.
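The final output stage can be illustrated with a short sketch that serializes decoded cells to JSON and CSV using only the Python standard library. The cell layout (`row`, `col`, `text` and optional `rowspan`/`colspan` keys) is an assumption made for this example and is not specified in the original text; Excel output is omitted.

```python
import csv
import io
import json


def to_json(cells):
    """Serialize decoded cells, keeping span information, as a JSON string."""
    return json.dumps({"cells": cells}, ensure_ascii=False)


def to_csv(cells, n_rows, n_cols):
    """Place each cell's text at its top-left grid position and emit CSV."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()
```

JSON keeps merged-cell spans explicitly, whereas the CSV form in this sketch leaves the covered positions empty.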
Referring to FIG. 4, the above procedure can be understood in more detail. The input layer receives the original high-resolution image and may also receive the user's query text. The image is split into image blocks and the text into tokens, and the visual encoder and the text encoder output the original visual feature sequence and the text feature sequence. The two feature sequences do not necessarily contain the same number of tokens, but the perception resampler outputs an aligned visual token sequence. The fusion and understanding core takes this sequence as input and combines it with the information carried by the text feature sequence as an instruction prefix; the large language model processes it into a fused representation, from which an answer based on the combined image and text information is generated.
Referring to FIG. 5, at the technical level, processing starts with the acquisition of projection data. Each projection line is transformed to the frequency domain by Fourier transform, and the fan-beam projection data are interpolated onto a Cartesian rectangular grid by frequency-domain gridding. The gridded frequency-domain data are transformed back to image space by inverse Fourier transform to obtain the reconstructed image in the computer. The image quality is then evaluated. If it meets the requirement, the reconstructed image is output; if not, parameters such as the filter function and the interpolation method are adjusted, the flow returns to the Fourier transform step, and processing is repeated until the final reconstructed image is output.
To better illustrate the multi-modal structured data intelligent recognition method provided by the invention, a practical application case is presented in table form: a slightly tilted invoice image containing merged cells is processed into computer-processable structured data, which can be presented on a computer display device. See Table 1 below:
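How a resampler can turn a token sequence of arbitrary length into a fixed-length one can be sketched with single-head cross-attention. This is a simplified illustration under assumptions: the query vectors would be learned in a real model but are passed in as plain lists here, and the key, value and output projections are omitted. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def resample_tokens(queries, tokens):
    """Each query attends over all tokens; the output has one vector per query."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in queries:
        weights = softmax([dot(query, token) * scale for token in tokens])
        # Weighted average of the tokens, computed per dimension.
        output.append(
            [sum(w * token[d] for w, token in zip(weights, tokens)) for d in range(dim)]
        )
    return output
```

Because the output length equals the number of queries, visual sequences of different lengths are mapped to the same number of tokens before fusion.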
An embodiment of the second aspect of the present invention provides a multi-modal structured data intelligent recognition device 10, as shown in fig. 6, including an acquisition module 11, a fusion module 12, and a decoding module 13. Wherein:
The acquisition module 11 is configured to acquire a multi-modal picture including a form image and a form text, and to extract visual features and text features respectively.
The fusion module 12 is configured to fuse the visual feature with the text feature according to the position information and/or the format information in the text feature and the wire frame information in the visual feature, so as to obtain a multimodal fusion feature.
The decoding module 13 is configured to decode and output structured data corresponding to the multi-mode picture based on the multi-mode fusion feature.
Specifically, the position information comprises first coordinate information of the form text on the multi-mode picture, and the wire frame information comprises second coordinate information of dominant lines in the form image. The step of fusing the visual features and the text features by the fusion module 12 comprises resampling the data sequence corresponding to the text features and the data sequence corresponding to the visual features into an associated data sequence of consistent length according to the first coordinate information and the second coordinate information.
Specifically, the format information includes the length, width and number of lines of the form text, and the wire frame information includes the second coordinate information of dominant lines in the form image, and the step of fusing the visual features and the text features by the fusion module 12 includes:
performing standardized conversion on the data sequence corresponding to the text feature according to the format information and the second coordinate information, to obtain a fused text feature; and
resampling the data sequence corresponding to the fused text feature and the data sequence corresponding to the visual feature into an associated data sequence of the same length according to the position information and the wire frame information.
Specifically, the position information further includes an inclination angle of the form text, and the standardized conversion of the data sequence corresponding to the text feature further includes adding deflection angle data to both the data sequence corresponding to the text feature and the data sequence corresponding to the visual feature according to the inclination angle.
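The handling of the inclination angle can be sketched as follows. The sketch assumes that each item of a data sequence is a dictionary with `x` and `y` coordinates, that the deflection angle is stored in an added `angle` field, and that the coordinates are additionally rotated back by the same angle; the rotation step and the sign convention are assumptions of this example, and the function names are hypothetical.

```python
import math


def rotate_point(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) about (cx, cy) by angle_deg, counter-clockwise in math axes."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(a) - dy * math.sin(a),
        cy + dx * math.sin(a) + dy * math.cos(a),
    )


def add_deflection(items, tilt_deg, cx, cy):
    """Attach the tilt angle to every item and undo the tilt on its coordinates."""
    result = []
    for item in items:
        x, y = rotate_point(item["x"], item["y"], -tilt_deg, cx, cy)
        result.append({**item, "x": x, "y": y, "angle": tilt_deg})
    return result
```

Calling `add_deflection` with the same `tilt_deg` on the text sequence and on the visual sequence keeps the two sequences in one coordinate frame before they are resampled.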
Specifically, the step of extracting the visual features and the text features by the acquisition module 11 includes invoking an OCR recognition engine to recognize the form text to obtain the text features, where the text features include a plurality of mutually independent text blocks and third coordinate information of each text block.
Further, the wire frame information comprises second coordinate information of dominant lines in the form image. The step of fusing the visual features and the text features by the fusion module 12 further comprises correcting the second coordinate information according to the third coordinate information by using a preset data difference value to obtain corrected second coordinate information, and resampling the data sequence corresponding to the text blocks and the data sequence corresponding to the visual features into an associated data sequence of the same length according to the third coordinate information and the corrected second coordinate information.
Optionally, the step of acquiring the multi-modal picture including the form image and the form text by the acquisition module 11 includes scanning or photographing a document in which the form content is recorded.
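The division of work among the three modules can be sketched as a small pipeline. The class below is only a structural illustration: the class name `RecognitionDevice` and its callable fields are hypothetical, and the actual feature extraction, fusion and decoding logic is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class RecognitionDevice:
    """Wires together stand-ins for modules 11, 12 and 13."""
    acquire: Callable[[Any], Tuple[Any, Any]]  # acquisition module 11
    fuse: Callable[[Any, Any], Any]            # fusion module 12
    decode: Callable[[Any], Any]               # decoding module 13

    def run(self, picture):
        # Module 11: extract visual features and text features.
        visual, text = self.acquire(picture)
        # Module 12: fuse the two kinds of features.
        fused = self.fuse(visual, text)
        # Module 13: decode the fused features into structured data.
        return self.decode(fused)
```

Each module can then be replaced or tested independently, for example by passing simple functions in place of the encoders.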
Based on the same inventive concept, referring to FIG. 7, an embodiment of the third aspect of the present invention provides a multi-modal structured data intelligent recognition system 1000, comprising a processor 1001, a memory 1003 and a computer program stored on the memory 1003, wherein the processor 1001 and the memory 1003 are electrically connected, for example by a bus 1002, and the processor 1001 executes the computer program to implement the multi-modal structured data intelligent recognition method of any embodiment of the first aspect of the present invention.
The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 1001 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 1002 may include a path for transferring information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.
The memory 1003 may be, but is not limited to, a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
Those skilled in the art will appreciate that the multi-modal structured data intelligent recognition system 1000 provided by embodiments of the present invention may be specially designed and manufactured for the required purposes, or may comprise known devices in a general purpose computer. These devices have computer programs stored therein that are selectively activated or reconfigured. Such a computer program may be stored in a device (e.g., computer) readable medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, respectively.
The multi-modal structured data intelligent recognition system 1000 provided by the invention runs a computer program implementing the multi-modal structured data intelligent recognition method. By parsing the multi-modal picture, it obtains the form image and the form text together with the corresponding visual features and text features, analyzes the two kinds of features separately, and fuses the text features with the visual features by means of the inherent characteristics of the text features. It finally decodes structured data that a computer can process. Even when the table templates differ, the table definition is inconsistent, or positional deviation occurs, the structured data corresponding to the form image can still be generated effectively.
Specifically, the multi-modal structured data intelligent recognition system 1000 includes a transceiver 1004. The transceiver 1004 may be used for both reception and transmission of signals. The transceiver 1004 may allow the multi-modal structured data intelligent recognition system 1000 to communicate wirelessly or by wire with other devices to exchange data. It should be noted that, in practical applications, the number of transceivers 1004 is not limited to one.
Specifically, the multi-modal structured data intelligent recognition system 1000 includes an input unit 1005. The input unit 1005 may be used to receive entered numbers, characters and/or images, or to generate key signal inputs related to user settings and function controls of the multi-modal structured data intelligent recognition system 1000. The input unit 1005 may include, but is not limited to, one or more of a touch screen, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, a camera, a scanner, and the like.
Specifically, the multi-modal structured data intelligent recognition system 1000 further includes an output unit 1006. The output unit 1006 may be used to output or present information processed by the processor 1001. The output unit 1006 may include, but is not limited to, one or more of a display device, a speaker, a vibration device, and the like.
While FIG. 7 illustrates a multi-modal structured data intelligent recognition system 1000 having various devices, it should be understood that not all illustrated devices are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
Optionally, the memory 1003 is used for storing application program code for performing the aspects of the invention, and its execution is controlled by the processor 1001. The processor 1001 is configured to execute the application program code stored in the memory 1003, so as to implement any of the multi-modal structured data intelligent recognition methods provided in the embodiments of the present invention.
Based on the same technical concept, an embodiment of the fourth aspect of the present invention further provides a computer readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the multi-modal structured data intelligent recognition method of any embodiment of the first aspect of the present invention.
The computer readable storage medium can be applied to various computer devices. By parsing the multi-modal picture, the form image and the form text are obtained together with the corresponding visual features and text features; the two kinds of features are analyzed separately, and the text features are fused with the visual features by means of the inherent characteristics of the text features. Structured data that a computer can process are finally decoded. Even when the table templates differ, the table definition is inconsistent, or positional deviation occurs, the structured data corresponding to the form image can still be generated effectively.
The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The above examples illustrate only a few embodiments of the invention. Although they are described in specific detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, and these all fall within the scope of protection of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (10)

1. A multi-modal structured data intelligent recognition method, characterized by comprising at least the following steps:
acquiring a multi-modal picture comprising a form image and a form text, and extracting visual features and text features respectively;
fusing the visual features and the text features according to the position information and/or the format information in the text features and the wire frame information in the visual features, to obtain multi-modal fusion features; and
decoding and outputting structured data corresponding to the multi-modal picture based on the multi-modal fusion features.
2. The method for intelligently identifying multi-modal structured data according to claim 1, wherein the location information includes first coordinate information of the form text on the multi-modal picture, the wire frame information includes second coordinate information of dominant lines in the form image, and the step of fusing the visual feature and the text feature includes:
and resampling the data sequence corresponding to the text feature and the data sequence corresponding to the visual feature into an associated data sequence with consistent length according to the first coordinate information and the second coordinate information.
3. The method for intelligently identifying multi-modal structured data according to claim 1, wherein the format information includes a length, a width and a number of lines of a form text, the wire frame information includes second coordinate information of dominant lines in the form image, and the step of fusing the visual features and the text features includes:
performing standardized conversion on the data sequence corresponding to the text feature according to the format information and the second coordinate information, to obtain a fusion text feature; and
resampling the data sequence corresponding to the fusion text feature and the data sequence corresponding to the visual feature into an associated data sequence of the same length according to the position information and the wire frame information.
4. The method for intelligently identifying multi-modal structured data as set forth in claim 3 wherein the location information further includes an inclination angle of the form text, and wherein the step of performing standardized conversion on the data sequence corresponding to the text feature further includes adding deflection angle data to both the data sequence corresponding to the text feature and the data sequence corresponding to the visual feature according to the inclination angle.
5. The method for intelligently identifying multi-modal structured data according to claim 1, wherein the step of extracting visual features and text features respectively comprises:
calling an OCR recognition engine to recognize the form text to obtain the text features, wherein the text features comprise a plurality of mutually independent text blocks and third coordinate information of each text block.
6. The method for intelligently identifying multi-modal structured data as set forth in claim 5, wherein the wire frame information includes second coordinate information of dominant lines in the form image, and the step of fusing the visual features with the text features further comprises:
correcting the second coordinate information according to the third coordinate information by using a preset data difference value to obtain corrected second coordinate information; and
resampling the data sequence corresponding to the text blocks and the data sequence corresponding to the visual features into an associated data sequence of the same length according to the third coordinate information and the corrected second coordinate information.
7. The method of claim 1, wherein the step of obtaining a multimodal picture including a form image and form text comprises scanning or photographing a document in which the form content is recorded.
8. A multi-modal structured data intelligent recognition device, comprising:
an acquisition module, configured to acquire a multi-modal picture comprising a form image and a form text, and to extract visual features and text features respectively;
a fusion module, configured to fuse the visual features and the text features according to the position information and/or the format information in the text features and the wire frame information in the visual features, to obtain multi-modal fusion features; and
a decoding module, configured to decode and output structured data corresponding to the multi-modal picture based on the multi-modal fusion features.
9. A multi-modal structured data intelligent recognition system, comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the multi-modal structured data intelligent recognition method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the multimodal structured data intelligent recognition method of any of claims 1 to 7.
CN202511844237.2A 2025-12-08 2025-12-08 Multimodal structured data intelligent recognition method, device, system and storage medium Pending CN121661663A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511844237.2A CN121661663A (en) 2025-12-08 2025-12-08 Multimodal structured data intelligent recognition method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511844237.2A CN121661663A (en) 2025-12-08 2025-12-08 Multimodal structured data intelligent recognition method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN121661663A true CN121661663A (en) 2026-03-13

Family

ID=98984732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511844237.2A Pending CN121661663A (en) 2025-12-08 2025-12-08 Multimodal structured data intelligent recognition method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN121661663A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination