Detailed Description
To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
According to an embodiment of the present invention, a method embodiment for generating target information is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one given here.
Fig. 1 is a flowchart of a method for generating target information according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S12, acquiring the initial text content.
Specifically, in this solution, the initial text content may be obtained through an acquisition terminal. The initial text content may be the basic description information of a travel vacation product, which may include the title, the features, and the itinerary description. It should be noted that the basic description information contains a large amount of useless information.
Step S14, performing information point extraction processing on the initial text content according to a preset word segmentation dictionary, and generating a plurality of information points.
Specifically, in this solution, the processing terminal may perform extraction processing on the basic description information according to the preset word segmentation dictionary. The extraction includes word segmentation and feature value extraction, and generates a plurality of information points. It should be noted that each information point consists of a segmented word and the feature value of that segmented word.
Step S16, extracting the plurality of information points through a preset extraction algorithm to generate target information.
Specifically, in this solution, the processing terminal may extract the plurality of information points through the preset extraction algorithm to generate the target information, that is, travel product information. It should be noted that the target information may be destination, hotel, shopping, traffic information, and the like.
In this embodiment, the initial text content is acquired; information point extraction processing is performed on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points; and the plurality of information points are extracted through a preset extraction algorithm to generate target information. It is easy to see that in this embodiment only the basic description information needs to be acquired; the processing terminal can then extract from it automatically to generate travel vacation product information, which greatly saves entry time and avoids entry errors caused by a heavy workload. This embodiment therefore solves the technical problem that existing travel product information has to be generated by manually screening a large amount of text content, which makes the generation of travel product information inefficient. Because the important information of a product is extracted automatically from the existing product information, manual entry time is shortened and the manual entry error rate is reduced, which improves the user's experience of acquiring travel product information.
Optionally, before step S12 of acquiring the initial text content, the method provided in this embodiment may further include:
Step S10, creating a word segmentation dictionary according to a travel vocabulary database, wherein the word segmentation dictionary comprises a plurality of travel product vocabularies and the features of the travel product vocabularies.
Specifically, in this solution, the travel vocabulary database may be an information knowledge base built from existing travel industry information and the corresponding product information, and a word segmentation tool may be used to build the word segmentation dictionary from this knowledge base.
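As a minimal sketch of step S10, the following Python example builds a word segmentation dictionary from a few rows standing in for the travel vocabulary database. The function name `build_segmentation_dictionary`, the sample rows, and the choice to store each feature as a category label are assumptions made for illustration; the embodiment does not specify the storage format or the word segmentation tool.

```python
from typing import Dict, Iterable, Tuple


def build_segmentation_dictionary(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each travel product word to its feature (assumed here to be a category label)."""
    dictionary: Dict[str, str] = {}
    for word, feature in records:
        word = word.strip()
        if word:
            # Keep the first feature seen for a word; later duplicates are ignored.
            dictionary.setdefault(word, feature)
    return dictionary


# Hypothetical rows standing in for the travel vocabulary database.
travel_vocabulary = [
    ("Sanya", "destination"),
    ("Hilton", "hotel"),
    ("duty-free shop", "shopping"),
    ("high-speed rail", "traffic"),
    ("Sanya", "hotel"),  # duplicate word, ignored by setdefault
]

segmentation_dictionary = build_segmentation_dictionary(travel_vocabulary)
```

Keeping the first feature for a duplicated word is only one possible policy; a real knowledge base might instead allow several features per word.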
Optionally, in step S14, the step of performing information point extraction processing on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points may include:
Step S141, segmenting the initial text content to generate a plurality of sub-initial text contents.
Specifically, in this solution, the basic description information may be divided, for example into paragraphs or into sentences, to generate the plurality of sub-initial text contents (for example, a plurality of paragraphs or a plurality of clauses).
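The division in step S141 can be sketched as follows. This is an illustrative assumption rather than the embodiment's actual rule: the hypothetical helper `split_into_sub_contents` simply splits on line breaks and on common sentence-ending punctuation, and it would also split on a period inside a number such as "3.5".

```python
import re
from typing import List


def split_into_sub_contents(text: str) -> List[str]:
    """Split text into sentence-level pieces on line breaks and end punctuation."""
    pieces = re.split(r"[.!?。！？\n]+", text)
    # Drop empty pieces and surrounding whitespace left over from the split.
    return [piece.strip() for piece in pieces if piece.strip()]


description = "Day 1: fly to Sanya. Check in at the Hilton!\nDay 2: duty-free shopping"
sub_contents = split_into_sub_contents(description)
```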
Step S142, using a plurality of travel product vocabularies to perform word segmentation processing and feature extraction processing on each sub-initial text content in turn to generate a plurality of information points, wherein each information point at least comprises: a segmented word and the feature value of the segmented word.
Specifically, in this solution, word segmentation processing and feature extraction can be performed on the text data of the basic information of the product using the plurality of travel product vocabularies in the word segmentation dictionary, so that a plurality of information points are obtained.
It should be noted that this solution may perform word segmentation on each of the sub-initial text contents through the KMP (Knuth-Morris-Pratt) string-matching algorithm to obtain all the information mentioned in the product and the features of the product.
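The following sketch shows one way step S142 could use KMP matching: every dictionary word is searched for in every sub-initial text content, and each match becomes an information point. The record layout (`word`, `feature`, `sub_content`, `position`) and the function names are assumptions for illustration; the embodiment only states that an information point contains the segmented word and its feature value.

```python
from typing import Dict, List


def build_failure_table(pattern: str) -> List[int]:
    """For each prefix of pattern, length of its longest proper prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text: str, pattern: str) -> List[int]:
    """Return the start index of every (possibly overlapping) occurrence of pattern."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 1)
            k = table[k - 1]
    return positions


def extract_information_points(sub_contents: List[str], dictionary: Dict[str, str]) -> List[dict]:
    """Match every dictionary word against every sub-content (hypothetical record layout)."""
    points = []
    for index, content in enumerate(sub_contents):
        for word, feature in dictionary.items():
            for position in kmp_find_all(content, word):
                points.append({"word": word, "feature": feature,
                               "sub_content": index, "position": position})
    return points


points = extract_information_points(
    ["Fly to Sanya", "Sanya Hilton by the sea"],
    {"Sanya": "destination", "Hilton": "hotel"},
)
```

Searching each word separately keeps the sketch short; with a large dictionary a multi-pattern matcher would likely be preferable, but that is outside what the embodiment describes.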
Optionally, the preset extraction algorithm may be an area algorithm. When the area algorithm is used as the preset extraction algorithm, step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information may include the following steps:
Step S1601, counting the occurrence frequency of a first information point of the plurality of information points in each sub-initial text content.
Specifically, this solution may randomly select one information point, that is, the first information point, and then count how often it occurs in each paragraph and sentence.
Step S1602, calculating a decline rate of the occurrence frequency of the first information point according to the occurrence frequency of the first information point in each sub-initial text content.
Step S1603, determining the first information point as the target information when the decline rate does not exceed a first threshold.
Specifically, in this solution, the decline rate of the first information point across the paragraphs and sentences may be calculated. When the decline rate does not exceed the first threshold, the first information point is determined to be main information of the travel product, and this solution determines the first information point as the target information.
In a preferred embodiment, this solution may extract the information points related to the travel product through the area algorithm. That is, the sentences and paragraphs of the whole description information are treated as areas that measure how much space the description devotes to each piece of product information. If an information point occurs with frequency a in a first area and with frequency b in a second area, the area decline rate is q = (a - b)/b, and this rate can be used to find the area boundary of the information points. The area algorithm may sort the areas of all information points in descending order and then search for the boundary from the largest area to the smallest, calculating the corresponding area decline rate from the accumulated feature area of the information as the search proceeds. When the decline rate and the accumulated area are both larger than the set thresholds, the search stops, and the accumulated area corresponds to the main information of the product, namely the target information.
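The area algorithm can be sketched as below, under several stated assumptions: the decline rate between adjacent areas is q = (a - b)/b, a drop to zero is treated as an unbounded decline, steps S1601 to S1603 are read as "no adjacent decline rate exceeds the first threshold", and the boundary search stops once both the decline rate and the accumulated area exceed their thresholds. The function names and sample values are hypothetical.

```python
from typing import Dict, List


def decline_rate(a: float, b: float) -> float:
    """Area decline rate q = (a - b) / b between two adjacent areas."""
    if b == 0:
        # Assumption: no change if both are zero, otherwise an unbounded decline.
        return 0.0 if a == 0 else float("inf")
    return (a - b) / b


def is_target_by_decline(frequencies: List[int], first_threshold: float) -> bool:
    """Steps S1601-S1603: keep the point if no adjacent decline rate exceeds the threshold."""
    rates = [decline_rate(a, b) for a, b in zip(frequencies, frequencies[1:])]
    return all(q <= first_threshold for q in rates)


def find_main_points(areas: Dict[str, float], rate_threshold: float,
                     area_threshold: float) -> List[str]:
    """Walk the areas from largest to smallest and stop at the boundary."""
    ranked = sorted(areas.items(), key=lambda item: item[1], reverse=True)
    selected: List[str] = []
    accumulated = 0.0
    for i, (word, area) in enumerate(ranked):
        selected.append(word)
        accumulated += area
        if i + 1 < len(ranked):
            q = decline_rate(area, ranked[i + 1][1])
            # Boundary: a sharp drop to the next area after enough area is accumulated.
            if q > rate_threshold and accumulated > area_threshold:
                break
    return selected


main_points = find_main_points(
    {"Sanya": 10, "Hilton": 8, "duty-free shop": 6, "taxi": 1},
    rate_threshold=2.0,
    area_threshold=20,
)
```

In this example the drop from 6 to 1 gives q = 5, which exceeds the rate threshold after 24 units of area have been accumulated, so "taxi" falls outside the boundary.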
Alternatively, step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information may include the following step:
Step S1604, determining a first information point of the plurality of information points as the target information in a case where the feature value of the first information point exceeds a second threshold and/or text content associated with the first information point is contained in the initial text content.
Specifically, in this solution, isolated and only occasionally mentioned information points may be filtered out by a plaintext rule algorithm: when the feature value of the first information point does not exceed the second threshold, the first information point has a small feature value and is isolated; and if the context of the initial text content contains no text describing information related to the first information point, the first information point is considered to have been mentioned only in passing and does not belong to the main information of the travel product, namely the target information.
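A minimal sketch of the rule in step S1604 follows. It assumes a numeric feature value (for instance an occurrence count), reads "and/or" as a logical OR, and uses a hypothetical list of related terms to decide whether associated text appears in the initial text content; none of these details are fixed by the embodiment.

```python
from typing import Iterable


def passes_plaintext_rule(feature_value: float, related_terms: Iterable[str],
                          initial_text: str, second_threshold: float) -> bool:
    """Keep a point if its feature value is large enough or the text mentions related terms."""
    has_context = any(term in initial_text for term in related_terms)
    return feature_value > second_threshold or has_context


text = "Day 2: visit the duty-free shop and pay by card at the checkout. A taxi is nearby."

# Low feature value, but the context describes related information: kept.
shop_kept = passes_plaintext_rule(1, ["checkout", "pay by card"], text, second_threshold=3)
# Low feature value and no related context: treated as mentioned in passing.
taxi_kept = passes_plaintext_rule(1, ["fare", "driver"], text, second_threshold=3)
```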
Alternatively, step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information may include the following step:
Step S1605, filtering the plurality of information points in the initial text content according to a preset standard information point database, and determining the information points contained in the standard information point database as the target information.
Specifically, in this solution, unreliable information points may be filtered out through the preset standard information point database, that is, those information points that are contained in the standard information point database are determined as the target information.
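Step S1605 amounts to a membership test against the standard information point database. The sketch below assumes the database can be represented as a collection of words and that matching ignores case and surrounding whitespace; both the representation and the normalization are illustrative assumptions.

```python
from typing import Iterable, List


def normalize(word: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return word.strip().casefold()


def filter_by_standard_database(points: Iterable[str],
                                standard_points: Iterable[str]) -> List[str]:
    """Keep only the information points found in the standard database."""
    standard = {normalize(word) for word in standard_points}
    return [point for point in points if normalize(point) in standard]


# Hypothetical standard database and extracted points.
kept = filter_by_standard_database(
    ["Sanya", "hilton", "Sanyaa"],
    ["Sanya", "Hilton", "Haikou"],
)
```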
It should be noted that, in this solution, similar-looking or untrue information points may also be filtered through a semantic annotation algorithm: a large number of existing product lines are labeled manually and the final results are recorded; this data is used as training data to train a machine learning model; and the trained model is then used to process the current product data and filter out information points that carry similarly unreliable labels.
Alternatively, step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information may include the following steps:
Step S1606, acquiring the distances between a first information point and the other information points in the plurality of information points.
Step S1607, determining the first information point as the target information when the distance does not exceed a second threshold.
Specifically, in this solution, unreliable information points among the plurality of information points can be filtered out using the distances between information points: if the distance between an information point and the other information points does not exceed a certain threshold, namely the second threshold, the information point belongs to the product information; otherwise it is filtered out.
It should be noted that this solution may also filter unreliable information points among the plurality of information points through an area range algorithm: when the features of almost all information points fall within the same area and only a few information points fall outside it, those few information points are excluded.
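The distance-based filter of steps S1606 and S1607 can be sketched as follows. The embodiment does not define the distance, so this example assumes that each information point has a numeric feature vector, that the distance is Euclidean, and that "the distance" to the other points is their mean; the vectors and names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def filter_by_distance(vectors: Dict[str, Sequence[float]],
                       second_threshold: float) -> List[str]:
    """Keep points whose mean Euclidean distance to all other points is within the threshold."""
    kept = []
    for name, vector in vectors.items():
        others = [v for other, v in vectors.items() if other != name]
        if not others:
            kept.append(name)  # a single point has nothing to be far from
            continue
        mean_distance = sum(math.dist(vector, v) for v in others) / len(others)
        if mean_distance <= second_threshold:
            kept.append(name)
    return kept


# Hypothetical 2-D feature vectors; "ski resort" lies far from the rest.
kept_points = filter_by_distance(
    {"Sanya": (0, 0), "Hilton": (1, 0), "beach": (0, 1), "ski resort": (10, 10)},
    second_threshold=6.0,
)
```

The same outlier would also be excluded by the area range idea, since three of the four points fall in one small region and only one lies outside it.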
Alternatively, step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information may include the following steps:
Step S1608, calculating the probability that a first information point of the plurality of information points and the other information points appear together in preset text content.
Step S1609, determining the first information point as the target information when the probability exceeds a third threshold.
Specifically, in this solution, information points with a small co-occurrence probability in the product information may be filtered out through a co-occurrence relation algorithm, and information points with a large co-occurrence probability (that is, one exceeding the third threshold) are determined as target information. It should be noted that the co-occurrence relation algorithm uses existing product information and statistical methods to calculate the probability that different information points appear together in the same product, and uses these probabilities to judge whether the information points in a product are usable. For example, if the probability that information point A co-occurs with information points B and C is relatively high, it is considered reasonable for A, B, and C to appear together in a product. If the probability that A co-occurs with B and C is small and A, B, and C nevertheless appear together in a product, the result is considered unreasonable, and A needs to be filtered out so that the product information is reasonable.
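One possible reading of steps S1608 and S1609 is sketched below. It assumes that existing products are available as sets of information points, that the pairwise co-occurrence probability of A with B is the share of products containing B that also contain A, and that a point's score is the mean of these pairwise values over the other points in the same product. These definitions and the sample history are assumptions; the embodiment only says the probabilities come from existing product information and statistical methods.

```python
from typing import List, Sequence, Set


def cooccurrence_probability(point: str, others: Sequence[str],
                             history: List[Set[str]]) -> float:
    """Mean over the other points of P(point present | other present) in past products."""
    probabilities = []
    for other in others:
        with_other = [product for product in history if other in product]
        if not with_other:
            continue  # no evidence for this pair
        both = sum(1 for product in with_other if point in product)
        probabilities.append(both / len(with_other))
    return sum(probabilities) / len(probabilities) if probabilities else 0.0


def filter_by_cooccurrence(points: Sequence[str], history: List[Set[str]],
                           third_threshold: float) -> List[str]:
    """Keep the points whose co-occurrence score exceeds the third threshold."""
    return [
        p for p in points
        if cooccurrence_probability(p, [o for o in points if o != p], history) > third_threshold
    ]


# Hypothetical existing products, each reduced to its set of information points.
history = [
    {"Sanya", "beach", "Hilton"},
    {"Sanya", "beach"},
    {"Harbin", "ski resort"},
    {"Sanya", "Hilton"},
]
reasonable = filter_by_cooccurrence(
    ["Sanya", "Hilton", "beach", "ski resort"], history, third_threshold=0.3
)
```

Here "ski resort" never appears with the other three points in the history, so its score is 0 and it is filtered out.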
Optionally, after step S16 of extracting the plurality of information points through the preset extraction algorithm to generate the target information, the method provided in this embodiment may further include:
Step S17, sending the target information to a search engine, wherein the target information at least includes: destination, hotel, shopping, and traffic information.
Specifically, this solution can provide the extracted information point data (the target information) to a search engine as a basis for user searches.
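As an illustration of what step S17 might hand to a search engine, the sketch below groups the target information by category and serializes it as JSON. The field names `product_id` and `target_information`, the product identifier, and the use of JSON are assumptions; the embodiment does not specify a transport or data format.

```python
import json
from typing import Dict, Iterable, List, Tuple


def build_search_payload(product_id: str,
                         target_information: Iterable[Tuple[str, str]]) -> str:
    """Group (word, category) pairs by category and serialize them as JSON."""
    grouped: Dict[str, List[str]] = {}
    for word, category in target_information:
        grouped.setdefault(category, []).append(word)
    document = {"product_id": product_id, "target_information": grouped}
    return json.dumps(document, ensure_ascii=False, sort_keys=True)


# Hypothetical product identifier and extracted target information.
payload = build_search_payload(
    "P001",
    [("Sanya", "destination"), ("Hilton", "hotel"),
     ("duty-free shop", "shopping"), ("high-speed rail", "traffic")],
)
```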
Preferably, the information point data of the product can also be directly displayed on the processing terminal to provide reference for the user.
In summary, this embodiment acquires the basic information of a product, performs word segmentation and feature extraction on it using the accumulated information knowledge base to obtain all information points of the product and their feature values, and then analyzes the product and applies an extraction algorithm (area algorithm, plaintext rule algorithm, semantic annotation algorithm, distance calculation algorithm, area range algorithm, or co-occurrence relation algorithm) to extract the information points related to the product. This makes it convenient for users to consult and search, improves the user experience, and reduces suppliers' entry costs.
Example two
The present application also provides an apparatus for generating target information, which may be configured to execute the above method for generating target information. As shown in Fig. 2, the apparatus may include: an acquisition unit 20, configured to acquire initial text content; a processing unit 22, configured to perform information point extraction processing on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points; and an extraction unit 24, configured to extract the plurality of information points through a preset extraction algorithm to generate target information.
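The three-unit structure of the apparatus can be pictured as three pluggable callables run in sequence. The class name `TargetInformationGenerator`, the `(word, feature)` tuple layout, and the stand-in functions below are hypothetical and serve only to show how the acquisition, processing, and extraction units connect.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

InformationPoint = Tuple[str, str]  # (segmented word, feature), an assumed layout


@dataclass
class TargetInformationGenerator:
    acquire: Callable[[], str]                                            # acquisition unit 20
    process: Callable[[str], List[InformationPoint]]                      # processing unit 22
    extract: Callable[[List[InformationPoint]], List[InformationPoint]]   # extraction unit 24

    def run(self) -> List[InformationPoint]:
        text = self.acquire()
        points = self.process(text)
        return self.extract(points)


# Stand-in units: a fixed text, a naive dictionary lookup, and a pass-through extractor.
dictionary = {"Sanya": "destination", "Hilton": "hotel"}
generator = TargetInformationGenerator(
    acquire=lambda: "Fly to Sanya and relax",
    process=lambda text: [(w, f) for w, f in dictionary.items() if w in text],
    extract=lambda points: points,
)
target_information = generator.run()
```

Passing the units in as callables mirrors the statement that the division into units is logical and that units may be combined or replaced in an actual implementation.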
In this embodiment, the initial text content is acquired; information point extraction processing is performed on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points; and the plurality of information points are extracted through a preset extraction algorithm to generate target information. It is easy to see that in this embodiment only the basic description information needs to be acquired; the processing terminal can then extract from it automatically to generate travel vacation product information, which greatly saves entry time and avoids entry errors caused by a heavy workload. This embodiment therefore solves the technical problem that existing travel product information has to be generated by manually screening a large amount of text content, which makes the generation of travel product information inefficient.
Optionally, the apparatus may further include: a creating unit, configured to create a word segmentation dictionary according to a travel vocabulary database, wherein the word segmentation dictionary comprises a plurality of travel product vocabularies.
Optionally, the processing unit may include: a first processing module, configured to segment the initial text content to generate a plurality of sub-initial text contents; and a second processing module, configured to use the plurality of travel product vocabularies to perform word segmentation processing and feature extraction processing on each sub-initial text content in turn to generate a plurality of information points, wherein each information point at least comprises: a segmented word and the feature value of the segmented word.
Optionally, the extraction unit may include: a statistical module, configured to count the occurrence frequency of a first information point of the plurality of information points in each sub-initial text content; a first calculating module, configured to calculate a decline rate of the occurrence frequency of the first information point according to the occurrence frequency of the first information point in each sub-initial text content; and a first determining module, configured to determine the first information point as the target information when the decline rate does not exceed a first threshold.
Optionally, the extraction unit may include: a second determining module, configured to determine a first information point of the plurality of information points as the target information in a case where the feature value of the first information point exceeds a second threshold and/or text content associated with the first information point is contained in the initial text content.
Optionally, the extraction unit may include: a filtering module, configured to filter the plurality of information points in the initial text content according to a preset standard information point database and determine the information points contained in the standard information point database as the target information.
Optionally, the extraction unit may further include: an acquisition module, configured to acquire the distances between a first information point of the plurality of information points and the other information points; and a third determining module, configured to determine the first information point as the target information when the distance does not exceed the second threshold.
Optionally, the extraction unit may further include: a second calculating module, configured to calculate the probability that a first information point of the plurality of information points and the other information points appear together in preset text content; and a fourth determining module, configured to determine the first information point as the target information when the probability exceeds a third threshold.
Optionally, the apparatus may further include: a sending unit, configured to send the target information to a search engine, wherein the target information at least includes: destination, hotel, shopping, and traffic information.
Example three
The present application also provides a server. As shown in Fig. 3, the server may include:
a receiving end 30, configured to receive initial text content; a processor 32, configured to perform information point extraction processing on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points, and to extract the plurality of information points through a preset extraction algorithm to generate target information; and a transmitting end 34, configured to transmit the target information to a user terminal.
In this embodiment, the initial text content is acquired; information point extraction processing is performed on the initial text content according to a preset word segmentation dictionary to generate a plurality of information points; and the plurality of information points are extracted through a preset extraction algorithm to generate target information. It is easy to see that in this embodiment only the basic description information needs to be acquired; the processing terminal can then extract from it automatically to generate travel vacation product information, which greatly saves entry time and avoids entry errors caused by a heavy workload. This embodiment therefore solves the technical problem that existing travel product information has to be generated by manually screening a large amount of text content, which makes the generation of travel product information inefficient.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.