CN114238564B - Information retrieval method, device, electronic device and storage medium - Google Patents
Information retrieval method, device, electronic device and storage medium
- Publication number
- CN114238564B CN202111495205.8A
- Authority
- CN
- China
- Prior art keywords
- target
- word
- document
- information retrieval
- relevance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides an information retrieval method, a device, electronic equipment and a storage medium. Based on a trained target information retrieval model, the method determines a first word vector of each target word in a target query sentence and a central word vector of that target word in each target document, and determines a first relevance between the query sentence and each target document according to the first word vector and the central word vector.
Description
Technical Field
The present application relates to the field of power grid optimization technologies, and in particular, to an information retrieval method, an information retrieval device, an electronic device, and a storage medium.
Background
Information retrieval in the prior art mainly retrieves search text from a database using a matching algorithm based on word frequency, and word-frequency-based retrieval is still the mainstream method of current retrieval systems. However, word-frequency-based retrieval only considers how many times a word occurs in the database and does not take the semantic context and word-sense context into account, so the matching degree and accuracy of the retrieved documents are low.
Disclosure of Invention
Accordingly, an object of the present application is to provide an information retrieval method, an apparatus, an electronic device, and a storage medium that, based on a trained target information retrieval model, determine a first word vector of each target word in a target query sentence and a central word vector of that target word in each target document, and determine a first relevance between the query sentence and each target document according to the first word vector and the central word vector.
The embodiment of the application provides an information retrieval method, which comprises the following steps:
Inputting a target query sentence into a query layer in a trained target information retrieval model to obtain at least one target word shared between the target query sentence and each target document in a text library, wherein a target document is a document in the text library that contains at least one word also present in the target query sentence;
Inputting each obtained target word into a word vector extraction network layer in the trained target information retrieval model to obtain a first word vector of each target word in the target query sentence and a second word vector of each target word in each sliding window of each target document, wherein a sliding window is a window containing a first preset number of adjacent characters in the target document and contains at least one character of at least one target word;
for any target word in each target document, determining a central word vector of the target word based on second word vectors of sliding windows associated with the target word;
inputting the first word vector and the central word vector corresponding to the target words of each target document into a relevance scoring layer in the trained target information retrieval model, and calculating a first relevance between the target query sentence and each target document;
and selecting a preset number of target documents, in descending order of the relevance between the target query sentence and the target documents, to output as retrieved documents.
Further, for any target word in each target document, determining a center word vector of the target word based on the second word vector of each sliding window associated with the target word, including:
for any target word in each target document, obtaining the second word vector of the target word in each sliding window;
and summing the second word vectors, computing the average of the sum, and determining the average value as the central word vector of the target word.
Further, a first relevance between the target query statement and each target document is calculated by:
performing a dot-product calculation between the first word vector corresponding to each target word and each central word vector corresponding to that target word, and determining the dot-product result as a second relevance between each target word and each target document;
and summing the second relevances corresponding to the target words in each target document, and determining the summation result as the first relevance between the target query sentence and each target document.
Further, performing the dot-product calculation between the first word vector corresponding to each target word and each central word vector corresponding to that target word, and determining the dot-product result as the second relevance between each target word and each target document, includes:
performing the dot-product calculation between the first word vector corresponding to each target word and each central word vector corresponding to that target word, and selecting the maximum dot-product value as a resulting relevance;
and determining the resulting relevance as the second relevance between each target word and each target document.
Further, the trained target information retrieval model is determined by:
Acquiring a sample query statement and a sample document corresponding to the sample query statement;
dividing the sample query sentences and the sample documents by relevance according to a preset relevance, determining the sample query sentences and sample documents whose relevance is greater than or equal to the preset relevance as sample-related texts, and determining the sample query sentences and sample documents whose relevance is less than the preset relevance as sample-unrelated texts;
training an initial information retrieval model according to the sample-related texts and the sample-unrelated texts, and determining the trained target information retrieval model.
The embodiment of the application also provides an information retrieval device, which comprises:
The first determining module is used for inputting a target query sentence into a query layer in a trained target information retrieval model to obtain at least one target word shared between the target query sentence and each target document in a text library, wherein a target document is a document in the text library that contains at least one word also present in the target query sentence;
the second determining module is used for inputting each obtained target word into a word vector extraction network layer in the trained target information retrieval model to obtain a first word vector of each target word in the target query sentence and a second word vector of each target word in each sliding window of each target document, wherein a sliding window is a window containing a first preset number of adjacent characters in the target document and contains at least one character of at least one target word;
the third determining module is used for determining, for any target word in each target document, a central word vector of the target word based on the second word vectors of the sliding windows associated with the target word;
the calculating module is used for inputting the first word vector and the central word vector corresponding to the target words of each target document into a relevance scoring layer in the trained target information retrieval model, and calculating a first relevance between the target query sentence and each target document;
and the fourth determining module is used for selecting a preset number of target documents, in descending order of the relevance between the target query sentence and the target documents, to output as retrieved documents.
Further, the third determining module determines, for any target word in each target document, the central word vector of the target word based on the second word vectors of the sliding windows associated with the target word, by:
for any target word in each target document, obtaining the second word vector of the target word in each sliding window;
and summing the second word vectors, computing the average of the sum, and determining the average value as the central word vector of the target word.
Further, the calculating module calculates the first relevance between the target query sentence and each target document by:
performing a dot-product calculation between the first word vector corresponding to each target word and each central word vector corresponding to that target word, and determining the dot-product result as a second relevance between each target word and each target document;
and summing the second relevances corresponding to the target words in each target document, and determining the summation result as the first relevance between the target query sentence and each target document.
The embodiment of the application also provides an electronic device, which comprises a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the information retrieval method described above.
The embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the information retrieval method as described above.
Compared with the prior art, the information retrieval method, device, electronic device and storage medium provided by the embodiments of the application have the advantage that, based on a trained target information retrieval model, the first word vector of each target word in the target query sentence and the central word vector of that target word in each target document are determined, and the first relevance between the query sentence and each target document is determined according to the first word vector and the central word vector.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an information retrieval method according to an embodiment of the present application;
FIG. 2 is a diagram showing the relationship between a sliding window and a target document in an information retrieval method according to an embodiment of the present application;
FIG. 3 is a flow chart of another information retrieval method provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of an information retrieval device according to an embodiment of the present application;
Fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
In the figure:
400-information retrieval means, 410-first determination module, 420-second determination module, 430-third determination module, 440-calculation module, 450-fourth determination module, 500-electronic device, 510-processor, 520-memory, 530-bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment obtained by a person skilled in the art without making any inventive effort falls within the scope of protection of the present application.
First, the application scenario applicable to the application is introduced. Research shows that information retrieval in the prior art mainly retrieves search text from a database using a matching algorithm based on word frequency, and word-frequency-based retrieval is still the mainstream method of current retrieval systems. However, word-frequency-based retrieval only considers how many times a word occurs in the database and does not take the semantic context and word-sense context into account, so the matching degree and accuracy of the retrieved documents are low.
Based on the above, the embodiments of the application provide an information retrieval method, an information retrieval device, an electronic device and a storage medium that, based on a trained target information retrieval model, determine a first word vector of each target word in a target query sentence and a central word vector of that target word in each target document, and determine a first relevance between the query sentence and each target document according to the first word vector and the central word vector.
Referring to fig. 1, fig. 1 is a flowchart of an information retrieval method according to an embodiment of the application. As shown in fig. 1, the information retrieval method provided by the embodiment of the present application includes:
S101, inputting a target query sentence into a query layer in a trained target information retrieval model to obtain at least one target word shared between the target query sentence and each target document in a text library, wherein a target document is a document in the text library that contains at least one word also present in the target query sentence.
In this step, after a user produces a need to retrieve a series of documents corresponding to a target query sentence, the target query sentence is input into the query network layer in the trained target information retrieval model to obtain the target words shared between the target query sentence and each target document in the text library. The documents in the text library are first screened: documents unrelated to the target query sentence are removed, and documents in the text library that share at least one word with the target query sentence are retained and determined as target documents.
A target word is defined as a word that appears in both the target query sentence and a target document. The number of characters in a target word is not fixed and can be divided in a customised way according to the expression characteristics of Chinese. In the embodiments provided by the application, q denotes the target words, qi denotes the i-th labelled target word, and the query network layer denotes the network structure layer in the target information retrieval model that determines the target words.
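A minimal sketch of this screening step is given below. It assumes the query and the documents have already been segmented into words; the helper names and data structures are illustrative assumptions, not the query-layer implementation of the embodiment:

```python
# Hedged sketch: select target documents and the target words they share with
# the query. Word segmentation is assumed to have been done elsewhere.
def find_target_words(query_words, doc_words):
    """Words that occur in both the query sentence and the document."""
    return set(query_words) & set(doc_words)

def select_target_documents(query_words, corpus):
    """Keep only documents that share at least one word with the query."""
    targets = {}
    for doc_id, doc_words in corpus.items():
        shared = find_target_words(query_words, doc_words)
        if shared:                      # unrelated documents are removed
            targets[doc_id] = shared
    return targets

corpus = {
    "d1": ["women", "volleyball", "olympic", "champion", "proud"],
    "d2": ["weather", "forecast", "rain"],
}
print(select_target_documents(["proud", "champion"], corpus))
# {'d1': {'proud', 'champion'}} (set order may vary)
```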
The target information retrieval model is obtained by training an initial information retrieval model, and the trained target information retrieval model is determined in the following manner:
acquiring a sample query sentence and a sample document corresponding to the sample query sentence.
First, an initial sample query sentence and an initial sample document corresponding to the initial sample query sentence are acquired; the initial sample query sentence is obtained from a log or another external sample database. The initial sample document corresponding to the initial sample query sentence is then manually labelled, and the initial query sentence and the initial sample document are denoised: meaningless special characters such as spaces and garbled characters are deleted, and the text is cleaned using regular expressions.
A regular expression is a logical formula for operating on character strings: a "regular string" is formed from a number of predefined specific characters and combinations of those characters, and this "regular string" expresses the filtering logic applied to the character strings.
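A hedged example of this regular-expression cleaning is given below; the concrete patterns (removing whitespace, control characters and other meaningless symbols) are assumptions, since the embodiment does not list its exact expressions:

```python
import re

# Assumed cleaning rules; the actual expressions used by the embodiment may differ.
def denoise(text: str) -> str:
    text = re.sub(r"\s+", "", text)                        # spaces, tabs, newlines
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)            # control characters
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)  # garbled or special symbols
    return text

print(denoise("女排 获得\x07奥运冠军!!  "))   # -> 女排获得奥运冠军
```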
The sample query sentences and sample documents are then divided by relevance according to a preset relevance: sample query sentences and sample documents whose relevance is greater than or equal to the preset relevance are determined as sample-related texts, and sample query sentences and sample documents whose relevance is less than the preset relevance are determined as sample-unrelated texts.
Here, both the sample-related texts and the sample-unrelated texts include a sample training set and a sample validation set for training the initial information retrieval model. The python programming language is used to divide the sample query sentences and sample documents according to the preset relevance, by comparing the relevance label of each sample query sentence and sample document with the preset relevance: pairs whose relevance is greater than or equal to the preset relevance are taken as sample-related texts, and pairs whose relevance is less than the preset relevance are taken as sample-unrelated texts (a minimal sketch of this split is given below). In addition, sample query sentences and sample documents that receive a high predicted relevance from a probability retrieval algorithm are determined as difficult (hard) unrelated samples.
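The split by preset relevance can be sketched as follows; the threshold value and the triple format are assumptions used purely for illustration:

```python
# Hedged sketch: divide (query, document, relevance) triples into sample-related
# and sample-unrelated texts using a preset relevance threshold.
PRESET_RELEVANCE = 0.5   # assumed threshold value

def split_samples(samples):
    related, unrelated = [], []
    for query, doc, relevance in samples:
        (related if relevance >= PRESET_RELEVANCE else unrelated).append((query, doc))
    return related, unrelated
```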
Here, the probability retrieval algorithm is an algorithm proposed on the basis of a probabilistic retrieval model, including but not limited to the BM25 information retrieval algorithm.
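For reference, a compact BM25 (Okapi) scoring sketch is given below; it is one common form of such a probability retrieval algorithm and is not necessarily the exact variant used by the embodiment:

```python
import math
from collections import Counter

def bm25_scores(query_words, docs, k1=1.5, b=0.75):
    """docs: list of word lists; returns one BM25 score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(w for d in docs for w in set(d))   # document frequency per word
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for w in query_words:
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1.0)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["proud", "champion", "volleyball"], ["rain", "forecast"]]
print(bm25_scores(["proud", "champion"], docs))   # the first document scores higher
```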
The initial information retrieval model is then trained according to the sample training set and the sample validation set, and the trained target information retrieval model is determined.
The initial information retrieval model is trained on the sample training set, and during training the network structure parameters of the initial information retrieval model are updated and replaced in real time using the sample validation set, so that the training effect of the initial information retrieval model is verified. After training is finished, the model files of the trained target information retrieval model are stored accordingly, which facilitates subsequent information retrieval tasks or target document re-ranking tasks.
In the first usage mode, the documents in the text library are first processed offline with a document encoder, which pre-processes the documents with respect to target query sentences and deletes some of the noise in them; the trained target information retrieval model is then loaded into the retrieval system and initialised, and the denoised target query sentence is input to obtain a preset number of target documents corresponding to the target query sentence, which are output as retrieved documents.
In the second usage mode, the documents in the text library are likewise processed offline with the document encoder, a designated number of candidate target documents are recalled with the BM25 information retrieval algorithm in a search engine, the initialised target information retrieval model then performs relevance matching on the recalled target documents, and a preset number of target documents are output as the re-ranked retrieved documents.
The initial information retrieval model in the embodiments provided by the application may be a language model based on BERT encoding, although it is not limited to BERT encoding; the BERT encoder is fine-tuned with an Adam optimizer, which adjusts the parameters of the initial information retrieval model during training.
The initial information retrieval model may be trained with a negative log-likelihood function, which is a function of the parameters of the model. "Likelihood" and "probability" have similar everyday meanings but quite different meanings in statistics: probability is used to predict the next observation given the parameters, whereas likelihood is used to estimate the possible values of the parameters of a given model based on some observations.
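A hedged sketch of one training step with such a loss is shown below. It assumes a PyTorch-style setting in which score_fn returns the model's relevance score for a query-document pair as a scalar tensor; the pairwise negative log-likelihood over one related and one unrelated document is only one possible realisation of the loss described above:

```python
import torch

def nll_step(score_fn, optimizer, query, pos_doc, neg_doc):
    """One assumed training step: maximise the likelihood of the related document."""
    s_pos = score_fn(query, pos_doc)             # score of the sample-related text
    s_neg = score_fn(query, neg_doc)             # score of the sample-unrelated text
    logits = torch.stack([s_pos, s_neg])
    loss = -torch.log_softmax(logits, dim=0)[0]  # negative log-likelihood of the positive
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # e.g. an Adam optimizer, as in the text
    return loss.item()
```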
Here, a Token denotes an element of the character strings generated from a sample query sentence and a sample document. The term is borrowed from the token a client carries when making requests: after a first login the server generates a token and returns it to the client, and afterwards the client only needs to carry the token when requesting data; a token can in fact be regarded as a kind of pass code, i.e. before some data is transmitted the pass code is checked, and different pass codes authorise different data operations. In the embodiments provided by the application, a Token is any character or word in a sample query sentence or sample document, that is, each sample query sentence and each sample document contains several Tokens, and the number of Tokens depends on the number of characters or words.
Here, the encoding of a character string Token is expressed as Tok(t) = W_tok · LM(t) + b_tok, where LM(t) is the nlm-dimensional output of the initial information retrieval model for the Token t, W_tok is a matrix that maps this nlm-dimensional output to vectors of low dimension nt, and b_tok is a coefficient. The trained target information retrieval model is an information retrieval model to which [CLS] matching from the pre-trained language model (Bidirectional Encoder Representations from Transformers, BERT) is added; the pre-trained language model learns deep bidirectional representations from unlabelled data through pre-training, and adding [CLS] to the pre-trained language model means that the output character vector corresponding to this symbol is taken as the semantic representation of the whole text or sentence.
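Under the assumption that this mapping is a simple linear projection, it can be sketched as follows; the dimensions 768 and 128 are illustrative values for nlm and nt, not figures given by the embodiment:

```python
import torch
import torch.nn as nn

n_lm, n_t = 768, 128                 # assumed: BERT hidden size -> low dimension
tok_proj = nn.Linear(n_lm, n_t)      # realises Tok(t) = W_tok · LM(t) + b_tok

lm_output = torch.randn(1, 10, n_lm)   # hidden states of 10 Tokens from the language model
token_vectors = tok_proj(lm_output)    # shape (1, 10, n_t)
```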
Further, a standard sample query text and a standard sample document corresponding to the standard sample query text are acquired by the following steps:
and acquiring an initial sample query text and an initial standard sample document corresponding to the initial sample query text.
And denoising the initial sample query text and the initial standard sample document to determine the standard sample query text and the standard sample document.
S102, inputting each obtained target word into a word vector extraction network layer in the trained target information retrieval model to obtain a first word vector of each target word in the target query sentence and a second word vector of each target word in each sliding window of each target document, wherein a sliding window is a window containing a first preset number of adjacent characters in the target document, each sliding window contains at least one character of at least one target word, and two adjacent sliding windows share an overlapping part containing a second preset number of characters.
In this step, each target word shared between the target query sentence and each target document in the text library is input into the word vector extraction network layer in the trained target information retrieval model, and the first word vector of the target word in the target query sentence and the second word vector of the target word in each sliding window of each target document are obtained. The Token-based encoding of the first word vector is expressed as E(q_i) = W_tok · Tok(q_i) + b_tok, where the English name of the target query sentence is query, abbreviated q, q_i is the character string of the i-th labelled target query word, b_tok is a coefficient, and W_tok is a coefficient in matrix form.
Considering the relevance between different target words in the target query sentence, the first word vector is further obtained using [CLS] matching in BERT.
Similarly, the Token-based encoding of the second word vector is expressed as E(d_i) = W_tok · Tok(d_i) + b_tok, where the English name of the target document is document, abbreviated d, d_i is the character string of the i-th labelled target document word, and b_tok is a coefficient.
Considering the relevance between different target words in the target query sentence, the second word vector is likewise obtained using [CLS] matching in BERT.
In the above, the relevance between the first word vectors and the second word vectors can provide high-level semantic matching information and alleviates the problem of vocabulary mismatch.
Here, the sliding window is a window that slides over each target document in a preset direction. A sliding window contains a first preset number of characters; this number is not fixed and can be customised according to the expression characteristics of Chinese characters. Each sliding window contains at least one character of at least one target word, and two adjacent sliding windows share an overlapping part containing a second preset number of characters.
As shown in fig. 2, fig. 2 is a structural diagram of a relationship between a sliding window and a target document in an information retrieval method according to an embodiment of the present application.
The sliding window is set to contain four characters, that is, the first preset number of characters is four. For example, the target query sentence provided in this embodiment is a sentence about pride and lagging behind, and one target document in the corresponding text library is a sentence to the effect of "As a Chinese person, I feel proud and honoured that the women's volleyball team won the Olympic championship."
Here, only one target word is shared between the target query sentence and the target document, namely "proud". All sliding windows containing any character of the target word "proud" are then obtained, and an encoder is used to obtain the semantics-based second word vector of the target word in each such sliding window; in FIG. 2, these are the four-character windows that overlap the characters of "proud".
If several target words exist in the target document, the sliding windows are set for each target word in the same way as described above, which is not repeated here.
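A minimal sketch of this sliding-window construction is given below, assuming a window of four characters and an overlap of one character between adjacent windows (i.e. a step of three characters); only windows containing at least one character of the target word are kept, and the Chinese sentence is merely an illustration of the kind of document described above:

```python
# Hedged sketch: enumerate four-character windows with a one-character overlap
# and keep those containing any character of the target word.
def windows_for_word(document: str, target_word: str, win: int = 4, overlap: int = 1):
    stride = win - overlap
    target_chars = set(target_word)
    selected = []
    for start in range(0, max(len(document) - win + 1, 1), stride):
        window = document[start:start + win]
        if target_chars & set(window):   # the window touches the target word
            selected.append(window)
    return selected

doc = "身为中国人女排获得奥运冠军我感到骄傲和自豪"
print(windows_for_word(doc, "骄傲"))     # windows overlapping the characters of the target word
```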
S103, determining a central word vector of any target word in each target document based on the second word vector of each sliding window associated with the target word.
In this step, the second word vectors of all sliding windows associated with any target word are summed and averaged, and the result is determined as the central word vector of the target word in the target document.
Thus, the center word vector of the target word in the target document is denoted by CK.
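A minimal sketch of this averaging step is shown below, assuming each associated sliding window has already been encoded into a second word vector; numpy arrays are used purely for illustration:

```python
import numpy as np

def center_word_vector(second_word_vectors):
    """CK: sum the second word vectors of the associated windows, then average."""
    return np.mean(np.stack(second_word_vectors, axis=0), axis=0)

windows = [np.array([0.2, 0.4]), np.array([0.6, 0.0]), np.array([0.1, 0.5])]
print(center_word_vector(windows))   # [0.3 0.3]
```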
S104, inputting the first word vector and the central word vector corresponding to the target word of each target document into a relevance scoring layer in a trained target information retrieval model, and calculating to obtain the first relevance between the target query sentence and each target document.
In this step, after the first word vector and the central word vector corresponding to the target words of each target document are input into the relevance scoring layer in the trained target information retrieval model, a dot-product calculation is performed between the first word vector corresponding to each target word and each central word vector corresponding to that target word, the maximum dot-product value is selected as the resulting relevance, and this dot-product result is determined as the second relevance between each target word and each target document.
A summation is then performed over the second relevances corresponding to the target words in each target document, and the summation result is determined as the first relevance between the target query sentence and each target document.
The expression for calculating the first relevance between the target query sentence and each target document can be written as s(q, d) = Σ_{q_i ∈ q ∩ d} max(E(q_i) · CK), where q_i ∈ q ∩ d indicates that the i-th labelled target word is a word shared by the target query sentence and the target document, and the maximum dot-product value is selected, i.e. the max operation is performed, to capture the important semantic information of the target word in the target document.
The expression above calculates the first relevance between the target query sentence and each target document without introducing the [CLS] matching of BERT; when [CLS] matching is added, the first relevance additionally incorporates the output, denoted full, of the relevance calculation network layer with [CLS] matching in the trained target information retrieval model.
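A hedged sketch of the scoring without the [CLS] term is given below: for each shared target word, the dot product between its first word vector and each candidate central word vector is computed, the maximum is kept as the second relevance, and the per-word maxima are summed into the first relevance. The data structures are assumptions for illustration:

```python
import numpy as np

def first_relevance(first_vectors, center_vectors):
    """first_vectors: {word: E(q_i)}; center_vectors: {word: [CK, ...]} for one document."""
    score = 0.0
    for word, e_qi in first_vectors.items():
        candidates = center_vectors.get(word, [])
        if candidates:
            # second relevance: maximum dot product over the candidate center vectors
            score += max(float(np.dot(e_qi, ck)) for ck in candidates)
    return score

first_vectors = {"proud": np.array([0.1, 0.9])}
center_vectors = {"proud": [np.array([0.2, 0.7]), np.array([0.5, 0.1])]}
print(first_relevance(first_vectors, center_vectors))   # ≈ 0.65
```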
S105, selecting a preset number of target documents, in descending order of the relevance between the target query sentence and the target documents, to output as retrieved documents.
In this step, when a user queries with a target query sentence, the target query sentence is input into the trained target information retrieval model, and a preset number of target documents, ranked from high to low by relevance, are output as retrieved documents; these retrieved documents can serve as answer documents for the target query sentence.
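A minimal sketch of this final selection, sorting target documents by their first relevance in descending order and returning a preset number of them:

```python
def top_k_documents(relevance_by_doc, k):
    """relevance_by_doc: {doc_id: first relevance}; return the k most relevant doc ids."""
    ranked = sorted(relevance_by_doc.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

print(top_k_documents({"d1": 2.7, "d2": 0.4, "d3": 1.9}, 2))   # ['d1', 'd3']
```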
Compared with the information retrieval methods in the prior art, the information retrieval method provided by the embodiment of the application determines, based on a trained target information retrieval model, the first word vector of each target word in the target query sentence and the central word vector of that target word in each target document, and determines the first relevance between the query sentence and each target document according to the first word vector and the central word vector.
Referring to fig. 3, fig. 3 is a flowchart of an information retrieval method according to another embodiment of the application. As shown in fig. 3, the information retrieval method provided by the embodiment of the present application includes:
S201, inputting a target query sentence into a query layer in a trained target information retrieval model to obtain at least one target word shared between the target query sentence and each target document in a text library, wherein a target document is a document in the text library that contains at least one word also present in the target query sentence.
S202, inputting each obtained target word into a word vector extraction network layer in the trained target information retrieval model to obtain a first word vector of each target word in the target query sentence and a second word vector of each target word in each sliding window of each target document, wherein a sliding window is a window containing a first preset number of adjacent characters in the target document, each sliding window contains at least one character of at least one target word, and two adjacent sliding windows share an overlapping part containing a second preset number of characters.
S203, for any target word in each target document, acquiring the second word vector of the target word in each sliding window.
In this step, for each of the target words in each target document, the second word vector of the target word in each sliding window is obtained.
Here, a sliding window is a window that slides over each target document in a preset direction and contains a first preset number of characters; this number is not fixed and can be set according to the expression characteristics of Chinese characters. Each sliding window contains at least one character of at least one target word, two adjacent sliding windows share an overlapping part containing a second preset number of characters, and in this embodiment the overlapping part is set to one character.
S204, summing the second word vectors, computing the average of the sum, and determining the average value as the central word vector of the target word.
In this step, the central word vector of each target word is obtained by summing the second word vectors of the sliding windows and then taking the average.
S205, inputting the first word vector and the central word vector corresponding to the target words of each target document into a relevance scoring layer in the trained target information retrieval model, and calculating a first relevance between the target query sentence and each target document.
S206, selecting a preset number of target documents, in descending order of the relevance between the target query sentence and the target documents, to output as retrieved documents.
The descriptions of S201 to S202 and S205 to S206 may refer to the descriptions of S101 to S102 and S104 to S105, and the same technical effects can be achieved, which will not be described in detail.
Compared with the information retrieval methods in the prior art, the information retrieval method provided by the embodiment of the application determines, based on a trained target information retrieval model, the first word vector of each target word in the target query sentence and the central word vector of that target word in each target document, and determines the first relevance between the query sentence and each target document according to the first word vector and the central word vector.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an information retrieval device according to an embodiment of the application. As shown in fig. 4, the information retrieval apparatus 400 includes:
The first determining module 410 is configured to input a target query sentence into a query layer in a trained target information retrieval model to obtain at least one target word shared between the target query sentence and each target document in a text library, where a target document is a document in the text library that contains at least one word also present in the target query sentence.
The second determining module 420 is configured to input each obtained target word into a word vector extraction network layer in the trained target information retrieval model to obtain a first word vector of each target word in the target query sentence and a second word vector of each target word in each sliding window of each target document, where a sliding window is a window containing a first preset number of adjacent characters in the target document, each sliding window contains at least one character of at least one target word, and two adjacent sliding windows share an overlapping part containing a second preset number of characters.
The third determining module 430 is configured to determine, for any target word in each target document, a central word vector of the target word based on the second word vectors of the sliding windows associated with the target word.
Further, the third determining module determines, for any target word in each target document, the central word vector of the target word based on the second word vectors of the sliding windows associated with the target word, by:
for any target word in each target document, acquiring the second word vector of the target word in each sliding window;
and summing the second word vectors, computing the average of the sum, and determining the average value as the central word vector of the target word.
The calculating module 440 is configured to input the first word vector and the central word vector corresponding to the target words of each target document into a relevance scoring layer in the trained target information retrieval model, and to calculate a first relevance between the target query sentence and each target document.
Further, the calculating module 440 calculates the first relevance between the target query sentence and each target document by:
performing a dot-product calculation between the first word vector corresponding to each target word and each central word vector corresponding to that target word, and determining the dot-product result as a second relevance between each target word and each target document;
and summing the second relevances corresponding to the target words in each target document, and determining the summation result as the first relevance between the target query sentence and each target document.
The fourth determining module 450 is configured to select a preset number of target documents, in descending order of the relevance between the target query sentence and the target documents, and to output them as retrieved documents.
Compared with the information retrieval methods in the prior art, the information retrieval device 400 provided by the embodiment of the application determines, based on a trained target information retrieval model, the first word vector of each target word in the target query sentence and the central word vector of that target word in each target document, and determines the first relevance between the query sentence and each target document according to the first word vector and the central word vector.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the application. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 is running, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the information retrieval method in the method embodiments shown in fig. 1 and fig. 3 can be executed; for the specific implementation, reference may be made to the method embodiments, which are not repeated here.
The embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the information retrieval method in the method embodiments shown in fig. 1 and fig. 3; for the specific implementation, reference may be made to the method embodiments, which are not repeated here.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
It should be noted that the foregoing embodiments are merely specific embodiments of the present application, intended to be illustrative rather than restrictive, and the scope of protection of the application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that modifications, variations or substitutions of some of the technical features of the described embodiments may still be readily conceived within the technical scope disclosed by the present application without departing from the spirit and scope of the technical solutions of the embodiments of the present application, and such modifications, variations and substitutions shall all fall within the scope of protection of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111495205.8A CN114238564B (en) | 2021-12-09 | 2021-12-09 | Information retrieval method, device, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111495205.8A CN114238564B (en) | 2021-12-09 | 2021-12-09 | Information retrieval method, device, electronic device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114238564A (en) | 2022-03-25 |
| CN114238564B (en) | 2025-05-13 |
Family
ID=80754175
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111495205.8A Active CN114238564B (en) | 2021-12-09 | 2021-12-09 | Information retrieval method, device, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114238564B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115168537B (en) * | 2022-06-30 | 2023-06-27 | 北京百度网讯科技有限公司 | Training method and device for semantic retrieval model, electronic equipment and storage medium |
| CN118568204B (en) * | 2024-07-29 | 2024-11-01 | 北京城市网邻信息技术有限公司 | Text information processing method, device and electronic device based on artificial intelligence |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108520033A (en) * | 2018-03-28 | 2018-09-11 | 华中师范大学 | Enhancing pseudo-linear filter model information search method based on superspace simulation language |
| CN109522392A (en) * | 2018-10-11 | 2019-03-26 | 平安科技(深圳)有限公司 | Voice-based search method, server and computer readable storage medium |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107491547B (en) * | 2017-08-28 | 2020-11-10 | 北京百度网讯科技有限公司 | Search method and device based on artificial intelligence |
| CN113536800A (en) * | 2020-04-13 | 2021-10-22 | 北京金山数字娱乐科技有限公司 | A word vector representation method and device |
| CN112507091A (en) * | 2020-12-01 | 2021-03-16 | 百度健康(北京)科技有限公司 | Method, device, equipment and storage medium for retrieving information |
-
2021
- 2021-12-09 CN CN202111495205.8A patent/CN114238564B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108520033A (en) * | 2018-03-28 | 2018-09-11 | 华中师范大学 | Enhancing pseudo-linear filter model information search method based on superspace simulation language |
| CN109522392A (en) * | 2018-10-11 | 2019-03-26 | 平安科技(深圳)有限公司 | Voice-based search method, server and computer readable storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114238564A (en) | 2022-03-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111444320B (en) | Text retrieval method and device, computer equipment and storage medium | |
| CN110347835B (en) | Text clustering method, electronic device and storage medium | |
| US10606946B2 (en) | Learning word embedding using morphological knowledge | |
| CN107836000B (en) | Improved artificial neural network method and electronic device for language modeling and prediction | |
| CN110162630B (en) | A method, device and equipment for deduplication of text | |
| US11573994B2 (en) | Encoding entity representations for cross-document coreference | |
| CN110309192B (en) | Structural data matching using neural network encoders | |
| CN111985228B (en) | Text keyword extraction method, text keyword extraction device, computer equipment and storage medium | |
| CN112256822A (en) | Text search method, apparatus, computer equipment and storage medium | |
| CN111651986B (en) | Event keyword extraction method, device, equipment and medium | |
| CN109902159A (en) | A kind of intelligent O&M statement similarity matching process based on natural language processing | |
| US20250209277A1 (en) | Systems and Methods for Machine-Learned Prediction of Semantic Similarity Between Documents | |
| CN106815252A (en) | A kind of searching method and equipment | |
| CN113076739A (en) | Method and system for realizing cross-domain Chinese text error correction | |
| CN110162771B (en) | Event trigger word recognition method and device and electronic equipment | |
| CN112395875A (en) | Keyword extraction method, device, terminal and storage medium | |
| CN111680494A (en) | Similar text generation method and device | |
| CN108475264B (en) | Machine translation method and device | |
| CN110457707B (en) | Content word keyword extraction method, device, electronic equipment and readable storage medium | |
| CN113553510A (en) | Text information recommendation method and device and readable medium | |
| CN114861654A (en) | A Defense Method for Adversarial Training Based on Part-of-Speech Fusion in Chinese Text | |
| CN113553410A (en) | Long document processing method, processing device, electronic equipment and storage medium | |
| CN114328894A (en) | Document processing method, document processing device, electronic equipment and medium | |
| CN114238564B (en) | Information retrieval method, device, electronic device and storage medium | |
| CN114548123B (en) | Machine translation model training methods and apparatus, and text translation methods and apparatus |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |