CN119512674A

CN119512674A - A data generation method, device, equipment and readable storage medium

Info

Publication number: CN119512674A
Application number: CN202411724366.3A
Authority: CN
Inventors: 韩东明; 丁子祥; 高军; 汪建波
Original assignee: Hangzhou Xingrui Network Information Technology Co ltd; Zhejiang University ZJU
Current assignee: Hangzhou Xingrui Network Information Technology Co ltd; Zhejiang University ZJU
Priority date: 2024-11-28
Filing date: 2024-11-28
Publication date: 2025-02-25
Anticipated expiration: 2044-11-28
Also published as: CN119512674B

Abstract

The present application discloses a data generation method, device, equipment and readable storage medium, the method comprising: splitting an application programming interface at different granularities to obtain a plurality of application programming sub-interfaces; performing type statistics on the plurality of application programming sub-interfaces, and determining a generation ratio based on the statistical results; obtaining prompt words corresponding to the plurality of application programming sub-interfaces; inputting the prompt words into a text language model for data generation processing to obtain instruction result pairs; wherein the text language model has learned the technical documentation and user guide of the application programming interface; storing the instruction result pairs into a data set, and performing statistics on the instruction result pairs in the data set to obtain distribution data; and determining the data set as a target data set when the distribution data matches the generation ratio. Technical effect of the present application: a target data set with high instruction coverage and customizable data distribution can be generated.

Description

Data generation method, device, equipment and readable storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a data generating method, apparatus, device, and readable storage medium.

Background

API documents and guidelines are generally written for domain experts, with the aim of describing the functionality of the API accurately and avoiding ambiguity. For this purpose, APIs are typically described and defined using standardized languages and uniform formats. However, general or non-professional users often face a significant barrier when using an API: they can understand its function and usage only by consulting the documents. This comprehension gap makes it considerably harder for them to call the API correctly.

Currently, the calling and parameter setting of APIs can be controlled through natural language. With the development of natural language processing (NLP) technology, the semantic understanding of models and their alignment with code have become more accurate. However, although models generalize well across a wide range of fields, the training data of the underlying model often fails to cover custom API functions. In this case, further pre-training or supervised fine-tuning (SFT) on a large amount of task data is required to strengthen the model's understanding of the particular domain.

In order for a generative model (e.g., GPT) to learn an API in a particular domain, the model must be trained on a large data set containing instruction result pairs (each pair records the correspondence between an input instruction, i.e., a descriptive instruction entered by the user, and an API result, i.e., the result of programming that descriptive instruction). However, existing schemes for generating such data sets suffer from problems such as low instruction coverage and unbalanced data distribution, so that the generative model cannot effectively learn the API in a comprehensive way.

In summary, how to effectively solve the problems of generating the data set for training the generative model is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a data generation method, a device, equipment and a readable storage medium, so as to obtain a target data set with high instruction coverage rate and customizable data distribution.

In order to solve the technical problems, the application provides the following technical scheme:

A data set generation method, comprising:

splitting an application programming interface at different granularities to obtain a plurality of application programming sub-interfaces;

performing type statistics on the plurality of application programming sub-interfaces, and determining a generation proportion based on the statistical result;

acquiring prompt words respectively corresponding to the plurality of application programming sub-interfaces;

inputting the prompt words into a text language model for data generation processing to obtain instruction result pairs, wherein the text language model has learned the technical document and the use guide of the application programming interface;

storing the instruction result pairs into a data set, and performing statistics on the instruction result pairs in the data set to obtain distribution data;

and determining the data set as a target data set when the distribution data matches the generation proportion.

Preferably, splitting the application programming interface at different granularities to obtain a plurality of application programming sub-interfaces includes:

acquiring declaration information of the application programming interface;

determining the different granularities based on the declaration information;

and splitting the application programming interface at the different granularities to obtain the plurality of application programming sub-interfaces.

Preferably, in the case where the distribution data does not match the generation ratio, the method further includes:

supplementing the data set by re-acquiring new prompt words, so that the distribution data corresponding to the supplemented data set matches the generation proportion.

Preferably, supplementing the data set by re-acquiring new prompt words includes:

comparing the distribution data with the generation proportion, and determining the target type of the lacking instruction result pairs;

determining the application programming sub-interface matching the target type as a target application programming sub-interface;

re-acquiring the prompt words corresponding to the target application programming sub-interface;

inputting the re-acquired prompt words into the text language model for data generation processing to obtain complementary instruction result pairs;

and writing the complementary instruction result pairs into the data set.

Preferably, determining the generation ratio based on the statistical result includes:

outputting the statistical result on a visual interface;

acquiring input configuration information;

and determining the generation proportion by using the configuration information.

Preferably, determining the generation proportion using the configuration information includes:

reading the category proportion of each category in the configuration information, and determining the generation proportion based on the category proportion.

Preferably, determining whether the distribution data matches the generation ratio includes:

obtaining the subclass proportion of the subclass under each class by using the class proportion;

reading the minimum expansion record number of each subclass from the configuration information;

determining the minimum record number corresponding to each subclass by using the subclass proportion and the minimum expansion record number;

calculating the minimum total record number that satisfies the category matching based on the minimum record number;

judging whether the minimum record number, the minimum total record number and the distribution data are matched;

if yes, determining that the distribution data matches the generation proportion;

if not, determining that the distribution data does not match the generation proportion.

A data generating apparatus comprising:

a splitting module, used for splitting an application programming interface at different granularities to obtain a plurality of application programming sub-interfaces;

a generation proportion statistics module, used for performing type statistics on the plurality of application programming sub-interfaces and determining a generation proportion based on the statistical result;

a prompt processing module, used for acquiring prompt words respectively corresponding to the plurality of application programming sub-interfaces;

an instruction result pair generating module, used for inputting the prompt words into a text language model for data generation processing to obtain instruction result pairs, wherein the text language model has learned the technical document and the use guide of the application programming interface;

an analysis module, used for storing the instruction result pairs into a data set and performing statistics on the instruction result pairs in the data set to obtain distribution data;

and a data generation module, used for determining the data set as a target data set when the distribution data matches the generation proportion.

An electronic device, comprising:

a memory for storing a computer program;

and a processor for implementing the steps of the data generation method when executing the computer program.

A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data generation method described above.

With the method provided by the embodiment of the application, an application programming interface is split at different granularities to obtain a plurality of application programming sub-interfaces; type statistics are performed on the plurality of application programming sub-interfaces, and a generation proportion is determined based on the statistical result; prompt words respectively corresponding to the plurality of application programming sub-interfaces are acquired; the prompt words are input into a text language model for data generation processing to obtain instruction result pairs, wherein the text language model has learned the technical document and the use guide of the application programming interface; the instruction result pairs are stored into a data set, and statistics are performed on the instruction result pairs in the data set to obtain distribution data; and the data set is determined to be a target data set when the distribution data matches the generation proportion.

Splitting the application programming interface at different granularities yields application programming sub-interfaces of different granularities. Type statistics are performed on these sub-interfaces, and the generation proportion can be determined from the statistical result. To make the data set of instruction result pairs cover the application programming interface more comprehensively, prompt words are acquired for each of the application programming sub-interfaces, and a text language model that has already learned the technical document and the use guide of the application programming interface then performs data generation processing on the prompt words to obtain the instruction result pairs. The generated instruction result pairs are written into the data set and analyzed statistically; when the distribution data matches the generation proportion, generation of the data set is determined to be complete and the target data set is obtained.

The technical effect is as follows. When the instruction result pairs in the data set are generated, splitting the application programming interface yields instructions of different granularities that cover the application programming interface. The generation proportion is determined through statistics on the application programming sub-interfaces, and after the distribution data of the instruction result pairs in the data set is obtained, the data set is determined as the target data set only when the distribution data matches the generation proportion. The data distribution of the finally generated target data set can therefore be controlled, which provides comprehensive and customizable training samples for further training of a generative model.

Correspondingly, the embodiment of the application also provides a data generating device, equipment and a readable storage medium corresponding to the data generating method, which have the technical effects and are not repeated herein.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a flowchart of a data generation method according to an embodiment of the present application;

FIG. 2 is a schematic diagram illustrating an implementation of a data generating method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a data generating device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to better understand the aspects of the present application, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Referring to FIG. 1, which is a flowchart of a data generation method according to an embodiment of the application, the method includes the following steps:

S101, splitting the application programming interfaces under different granularities to obtain a plurality of application programming sub-interfaces.

For ease of description, the application programming interface is referred to below as the API, and an application programming sub-interface is referred to as a sub-API. A sub-API is a functional sub-module of its corresponding API.

In the application, different granularities can be preset, and then the APIs are split based on the different granularities, so that a plurality of sub APIs are obtained.

For example, if the internal functions of the API are organized as a tree structure according to parent-child relationships and logical architecture, the path from the root node to each leaf node may correspond to one sub-API.
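The tree view described above can be sketched as follows. This is a minimal illustration rather than the method of the application: the tree contents (including the node name footnote_enable) and the function name root_to_leaf_paths are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical API tree: each key is a node, each value lists its children.
API_TREE: Dict[str, List[str]] = {
    "footnote": ["footnote_position", "footnote_enable"],
    "footnote_position": ["lower left", "lower right"],
    "footnote_enable": [],
    "lower left": [],
    "lower right": [],
}


def root_to_leaf_paths(tree: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return every path from `root` down to a leaf node."""
    children = tree.get(root, [])
    if not children:
        return [[root]]
    paths = []
    for child in children:
        for tail in root_to_leaf_paths(tree, child):
            paths.append([root] + tail)
    return paths


# Each path is treated as one candidate sub-API.
SUB_API_PATHS = root_to_leaf_paths(API_TREE, "footnote")
```

Under these assumptions the tree yields three paths, so three candidate sub-APIs.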

In one embodiment of the present application, splitting an application programming interface under different granularities to obtain a plurality of application programming sub-interfaces includes:

Step one, acquiring declaration information of an application programming interface;

step two, determining different granularity based on the declaration information;

Step three, splitting the application programming interface at the different granularities to obtain a plurality of application programming sub-interfaces.

For convenience of description, the above steps are described in combination.

In this embodiment, the declaration information of the API may be obtained by reading a storage device, or receiving an input, or the like.

As shown in FIG. 2, the declaration information may be explanatory information that a technician (developer) provides for the API declaration. The declaration information (which belongs to the root requirement) may specifically include the following features:

Description: the overall functionality of the API, including usage scenarios and application scope.

Type: specifies the type of the returned value, such as color, enumerated value, string or function.

Hierarchy: the level to which the current API belongs, i.e., which module it is a child node of.

Dependency relationship: indicates whether the API needs to work in concert with other APIs. For example, before the footer is adjusted, the footer function may need to be enabled first.

Enumerable items: lists the enumerated values supported by the API, such as font size (small four, large four, etc.) and location (top left, bottom right, etc.).

Expansion words: lists key instructions or keywords related to the current API, for example words related to "high" such as highest, largest and uppermost.

By way of illustration, below is declaration information for a particular API:

{ "level 1 api": "footnote", "level 1 description": "footnote, used for data generation", "level 1 data generation": "data generation", "level 2 api_name": "footnote_position", "level 2 description": "adjust the position of the footnote in the document", "level 2 type": "enumeration value", "level 2 dependency": [ "enable footnote" ], "level 2 enumeration item": [ { "value": "lower left", "expansion word": [ "leftbottm", "left" ] }, { "value": "lower right", "expansion word": [ "rightbottom", "right" ] }, { "value": "upper", "expansion word": [ "top", "upper", "highest" ] }, { "value": "lower", "expansion word": [ "lower", "bottom", "lowest" ] } ], "level 2 expansion word": [ "footnote", "page footnote", "lower word" ], ... (other content) }

Based on the features in the declaration information of the API of the previous step, the API can be split into sub-APIs with different granularities. The splitting process specifically comprises the following steps:

Type combination: according to the type, hierarchy, dependency relationships, enumerated items and expansion words of the API, these features are combined into multiple sub-APIs so that data can be processed and generated at a finer level. For example, a complex API may be split into multiple functional sub-modules, each of which generates data independently.

Hierarchical analysis: the hierarchical relationships of the API are taken into account, and the top-level API is split from its sub-APIs to ensure that the API at every level is fully covered.

Splitting in this step enables finer operation and control of the API and facilitates subsequent data processing.

For example, from the declaration information of the API listed above, the following three sub-APIs can be split out:

Sub-API 1: { "level 1 api": "footnote", "level 1 description": "footnote, used for data generation", "level 1 data generation": "data generation", "level 2 api_name": "footnote_position", "level 2 description": "adjust the position of the footnote in the document", "level 2 type": "enumeration value", "level 2 dependency": [ "enable footnote" ], "level 2 enumeration item": { "value": "lower left", "expansion word": [ "leftbottm" ] }, "level 2 expansion word": [ "footnote" ], ... (other content) }

Sub-API 2: { "level 1 api": "footnote", "level 1 description": "footnote, used for data generation", "level 1 data generation": "data generation", "level 2 api_name": "footnote_position", "level 2 description": "adjust the position of the footnote in the document", "level 2 type": "enumeration value", "level 2 dependency": [ "enable footnote" ], "level 2 enumeration item": { "value": "lower left", "expansion word": [ "left" ] }, "level 2 expansion word": [ "page footnote" ], ... }

Sub-API 3: { "level 1 api": "footnote", "level 1 description": "footnote, used for data generation", "level 1 data generation": "data generation", "level 2 api_name": "footnote_position", "level 2 description": "adjust the position of the footnote in the document", "level 2 type": "enumeration value", "level 2 dependency": [ "enable footnote" ], "level 2 enumeration item": { "value": "lower left", "expansion word": [ "left" ] }, "level 2 expansion word": [ "lower word" ], ... }
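One way such a split could be carried out is sketched below. The combination rule is an assumption: the text does not fix how enumeration items and expansion words are paired, so this sketch simply emits one sub-API for every combination of an enumeration-item expansion word and an API-level expansion word. The function name split_declaration and the shortened declaration are illustrative only.

```python
from itertools import product
from typing import Any, Dict, List

# Shortened declaration modelled on the footnote example; key names are illustrative.
DECLARATION: Dict[str, Any] = {
    "level 1 api": "footnote",
    "level 2 api_name": "footnote_position",
    "level 2 type": "enumeration value",
    "level 2 dependency": ["enable footnote"],
    "level 2 enumeration item": [
        {"value": "lower left", "expansion word": ["leftbottm", "left"]},
        {"value": "lower right", "expansion word": ["rightbottom", "right"]},
    ],
    "level 2 expansion word": ["footnote", "page footnote"],
}


def split_declaration(decl: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Assumed rule: one sub-API per (enum expansion word, API expansion word)."""
    split_keys = ("level 2 enumeration item", "level 2 expansion word")
    shared = {k: v for k, v in decl.items() if k not in split_keys}
    sub_apis = []
    for item in decl["level 2 enumeration item"]:
        for enum_word, api_word in product(item["expansion word"],
                                           decl["level 2 expansion word"]):
            sub_api = dict(shared)  # fields common to every sub-API
            sub_api["level 2 enumeration item"] = {
                "value": item["value"], "expansion word": [enum_word]}
            sub_api["level 2 expansion word"] = [api_word]
            sub_apis.append(sub_api)
    return sub_apis


SUB_APIS = split_declaration(DECLARATION)
```

A coarser granularity could be obtained by grouping on the enumeration value only; which granularities are used is a design choice left to the declaration information.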

In practical applications, when a data set is generated for a single API, splitting and subsequent processing may be performed only for that API; when a data set is generated for a plurality of APIs, all of the APIs, or only some of them, may be split. The actual splitting can be determined according to the needs of training the generative model and is not limited herein.

S102, performing type statistics on a plurality of application programming sub-interfaces, and determining a generation proportion based on a statistical result.

After the sub-APIs are split out, type statistics can be performed on them to obtain a statistical result, and the generation ratio can be determined based on that result.

Specifically, the quantity distribution of sub-APIs of different granularities and different types is counted, for example the total number of sub-APIs and the number of sub-APIs under a given type. This clarifies how the data is distributed and provides a basis for the subsequent determination of the data proportion.

After the statistical information is obtained, the generation ratio may be directly specified based on the statistical information.
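The type statistics described here amount to counting sub-APIs per type. A minimal sketch, assuming each sub-API record carries a type field (the record layout and the sample entries are hypothetical):

```python
from collections import Counter
from typing import Dict, List

# Hypothetical sub-APIs, each tagged with the type taken from its declaration.
SUB_APIS: List[Dict[str, str]] = [
    {"name": "footnote_position/lower left", "type": "enumeration value"},
    {"name": "footnote_position/lower right", "type": "enumeration value"},
    {"name": "title_color", "type": "color"},
    {"name": "title_text", "type": "string"},
]


def type_statistics(sub_apis: List[Dict[str, str]]) -> Dict[str, int]:
    """Count how many sub-APIs fall under each type."""
    return dict(Counter(api["type"] for api in sub_apis))


STATS = type_statistics(SUB_APIS)  # number of sub-APIs per type
TOTAL = sum(STATS.values())        # total number of sub-APIs
```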

In one embodiment of the present application, determining the generation ratio based on the statistical result includes:

Step one, outputting a statistical result on a visual interface;

step two, acquiring input configuration information;

Step three, determining the generation proportion by using the configuration information.

For convenience of description, the following description will be given by combining the above three steps.

The statistical result is displayed on the visual interface, either as a list or as a tree topology.

The user can then input configuration information according to the statistical result and the actual requirements. The configuration information may carry the generation proportion directly, or may carry the number of records to generate for each category, from which the generation proportion is then calculated.

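In a specific embodiment of the present application, determining the generation ratio using the configuration information includes:

reading the category proportion of each category in the configuration information, and determining the generation ratio based on the category proportions.

That is, the generation ratio is determined directly from the category proportion of each category. For example, if there are only categories A, B and C and their configured category proportions are 1:2:1, the generation ratio of A, B and C is directly 1:2:1.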
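Both forms of configuration mentioned above (category proportions, or record counts per category) can be reduced to the same normalized ratio. The following sketch assumes hypothetical configuration keys category_proportion and record_count; neither name comes from the application.

```python
from fractions import Fraction
from typing import Dict


def generation_ratio(config: Dict[str, Dict[str, int]]) -> Dict[str, Fraction]:
    """Normalise category proportions (or record counts) to shares summing to 1."""
    # Hypothetical config keys: "category_proportion" or "record_count".
    weights = config.get("category_proportion") or config["record_count"]
    total = sum(weights.values())
    return {category: Fraction(w, total) for category, w in weights.items()}


# The 1:2:1 example from the text.
RATIO = generation_ratio({"category_proportion": {"A": 1, "B": 2, "C": 1}})
```

Exact fractions are used instead of floats so that later equality checks against the data distribution are not affected by rounding.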

S103, acquiring prompt words respectively corresponding to the application programming sub-interfaces.

In order to enable the text language model to generate instruction result pairs of all sub-APIs, prompt words corresponding to the sub-APIs respectively can be obtained.

Specifically, a developer can custom-design the prompt words according to the API characteristics at different granularities. The prompt words incorporate the type, hierarchy, dependency relationships and other features of the API so that the generated data is more accurate. Prompt word design should be refined down to specific API categories to avoid interference from other API categories.

One specific prompt word is illustrated as follows:

Task: the user uses natural language to control the fields, properties, events and parameters of the component.

Requirements: 1. Generate the corresponding API for the current selectable items and expansion words, where query is an instruction statement expressing the user's actual need; it should be rich and varied and close to the user's usage habits;

Note: output in JSON format; no apology or explanation is needed, and no comments should be generated. To repeat: no apology or explanation is needed, and no comments should be generated.

Result: use a JSON format containing a list, in which each element contains two fields, query and api. The parameters of the event are generated according to the content of the corresponding field; query is the user's natural language description, and api is the corresponding JSON.

Example: the input is {sample_input}, and the output is {sample_output};

Current: the input is {d}; please output {k} results.
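Filling such a template for one sub-API might look like the sketch below. The template text is an abridged paraphrase of the prompt above, the placeholders {sample_input}, {sample_output}, {d} and {k} are the ones named in it, and the function name build_prompt and the sample values are assumptions for illustration.

```python
import json

# Abridged paraphrase of the prompt; only the four placeholders are substituted.
PROMPT_TEMPLATE = (
    "Task: the user uses natural language to control the fields, properties, "
    "events and parameters of the component.\n"
    "Note: output JSON only, with no apology, explanation or comments.\n"
    "Result: a JSON list whose elements each contain a query and an api field.\n"
    "Example: the input is {sample_input}, and the output is {sample_output}.\n"
    "Current: the input is {d}; please output {k} results."
)


def build_prompt(sub_api: dict, sample_input: dict, sample_output: list, k: int) -> str:
    """Fill the template for one sub-API; structured parts are JSON-encoded."""
    return PROMPT_TEMPLATE.format(
        sample_input=json.dumps(sample_input, ensure_ascii=False),
        sample_output=json.dumps(sample_output, ensure_ascii=False),
        d=json.dumps(sub_api, ensure_ascii=False),
        k=k,
    )


# Hypothetical sub-API and worked example.
PROMPT = build_prompt(
    sub_api={"api_name": "footnote_position", "value": "lower left"},
    sample_input={"api_name": "title_color", "value": "red"},
    sample_output=[{"query": "set the title color to red",
                    "api": {"title": {"color": "red"}}}],
    k=5,
)
```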

S104, inputting the prompt word into the text language model for data generation processing, and obtaining an instruction result pair.

The text language model has learned the technical document and the use guide of the application programming interface.

It should be noted that, in the embodiment of the present application, so that the text language model can generate instruction result pairs from the prompt words, the text language model may learn the technical document and the use guide of the API in advance. For how to make the text language model learn them, reference may be made to the specific training scheme of the text language model, which is not described herein.

The text language model may be, for example, a GPT (Generative Pre-trained Transformer) model, i.e., a language model based on the Transformer architecture.

Taking a GPT generation model as the text language model, the generation of instruction result pairs is described in detail below:

The customized prompt words are input into the GPT generation model, and the strong generation capability of the model is used to produce a large number of structured API instruction and result pairs (namely, instruction result pairs). This step produces the required diversified data and provides the basis for subsequent data processing and analysis.

An instruction result pair comprises an input descriptive instruction and the coding result of that instruction. For example, if the user inputs the instruction "set the title color to red", the coding result is { title: { color: red } }.

For example, part of the instruction result pairs generated by the GPT generation model is as follows:

[ { "query": "set the tabs component to line-type display", "api": { "atom_Tabs": ... } }, ... ]

S105, storing the instruction result pairs into a data set, and performing statistics on the instruction result pairs in the data set to obtain distribution data.

In this embodiment, a data set may be created in advance, in which the generated instruction result pair is stored.

In the process of generating the instruction result pairs, or after the instruction result pairs corresponding to all prompt words are generated, the instruction result pairs in the data set can be counted, so that the distribution data of the instruction result pairs can be obtained.
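Storing and counting can be sketched as follows, assuming an in-memory list as the data set and assuming each stored record is tagged with the category of the sub-API whose prompt word produced it. The function names and the footnote api encodings are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical in-memory data set of instruction result pairs.
DATASET: List[Dict[str, object]] = []


def store_pairs(dataset: List[Dict[str, object]], category: str,
                pairs: List[Dict[str, object]]) -> None:
    """Append generated instruction result pairs, tagged with their category."""
    for pair in pairs:
        dataset.append({"category": category, **pair})


def distribution(dataset: List[Dict[str, object]]) -> Dict[str, int]:
    """Count instruction result pairs per category (the distribution data)."""
    return dict(Counter(record["category"] for record in dataset))


store_pairs(DATASET, "color", [
    {"query": "set the title color to red", "api": {"title": {"color": "red"}}},
])
store_pairs(DATASET, "enumeration value", [
    {"query": "move the footnote to the lower left",
     "api": {"footnote_position": "lower left"}},
    {"query": "put the footnote at the bottom left",
     "api": {"footnote_position": "lower left"}},
])
DISTRIBUTION = distribution(DATASET)
```

Because distribution only reads the data set, it can be called during generation as well as after all prompt words have been processed.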

S106, determining the data set as a target data set when the distribution data matches the generation proportion.

When it is determined that the distribution data matches the generation proportion, the current data set satisfies the requirement and may be determined as the target data set.

Specifically, the distribution data may be a distribution ratio calculated from the counts of instruction result pairs of the different categories in the data set; when the distribution ratio matches the generation ratio, the distribution data is determined to match the generation ratio. Alternatively, the generation ratio and the minimum expansion record number may be used to calculate the minimum total record number corresponding to each category. In that case the distribution data is the total record number of each category, and the distribution data is determined to match the generation ratio when each total record number is greater than or equal to the corresponding minimum total record number.
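The two matching criteria in this paragraph can be sketched as two small predicates. The function names are illustrative, and exact equality of ratios is an assumption; a practical implementation might instead allow a tolerance.

```python
from fractions import Fraction
from typing import Dict


def ratio_matches(counts: Dict[str, int], ratio: Dict[str, int]) -> bool:
    """True when the per-category counts are exactly in the generation ratio."""
    if set(counts) != set(ratio) or sum(counts.values()) == 0:
        return False
    total_count, total_ratio = sum(counts.values()), sum(ratio.values())
    return all(Fraction(counts[c], total_count) == Fraction(ratio[c], total_ratio)
               for c in ratio)


def totals_match(counts: Dict[str, int], min_totals: Dict[str, int]) -> bool:
    """True when every category holds at least its minimum total record number."""
    return all(counts.get(c, 0) >= minimum for c, minimum in min_totals.items())
```

For instance, counts of 2, 4 and 2 satisfy a 1:2:1 generation ratio, while counts of 1, 2 and 3 do not satisfy 2:2:3.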

It should be noted that, based on the above embodiments, the embodiments of the present application further provide corresponding improvements. The preferred/improved embodiments relate to the same steps as those in the above embodiments or the steps corresponding to the steps may be referred to each other, and the corresponding advantages may also be referred to each other, so that detailed descriptions of the preferred/improved embodiments are omitted herein.

In a specific embodiment of the present application, in a case where the distribution data does not match the generation ratio, the method further includes:

supplementing the data set by re-acquiring new prompt words, so that the distribution data corresponding to the supplemented data set matches the generation ratio.

That is, when the distribution data is found not to match the generation ratio, this indicates that the current data distribution in the data set does not yet satisfy the required generation ratio. In this case, generation of instruction result pairs may continue and the data set may be supplemented, so that the distribution data of the supplemented data set matches the generation ratio.

Specifically, new prompt words may be re-acquired, and the data set is then supplemented based on the new prompt words.

In one embodiment of the present application, supplementing the data set by re-acquiring new prompt words includes:

Step one, comparing the distribution data with the generation proportion, and determining the target type of the lacking instruction result pairs;

Step two, determining the application programming sub-interface matching the target type as a target application programming sub-interface;

Step three, re-acquiring the prompt words corresponding to the target application programming sub-interface;

Step four, inputting the re-acquired prompt words into the text language model for data generation processing to obtain complementary instruction result pairs;

Step five, writing the complementary instruction result pairs into the data set.

For convenience of description, the following description will be given by combining the above five steps.

In this embodiment, the target type of the lacking instruction result pairs can be determined by comparing the distribution data with the generation ratio. For example, assume the types are A, B and C, the current data distribution corresponds to a ratio of 1:2:3, and the generation ratio is 2:2:3. Instruction result pairs of type A are then still lacking, so type A is determined as the target type and the sub-API matching type A as the target application programming sub-interface. The prompt words for that sub-API are acquired again, and the text language model generates the instruction result pairs corresponding to those prompt words. For ease of distinction, an instruction result pair generated from a re-acquired prompt word is referred to in this embodiment as a complementary instruction result pair. The complementary instruction result pairs are then written into the data set to supplement it.

After the data set has been supplemented, whether the distribution data of the current data set matches the generation ratio can be judged again. If so, the target data set is obtained; if not, the data set can be supplemented again.
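The comparison in the A, B, C example above can be sketched as follows. This is only a minimal illustration: the function name `lacking_types` is hypothetical, and comparing each type's share of the data set against its share of the generation ratio is an assumption about how "lacking" is decided, not a rule stated in the embodiment.

```python
from fractions import Fraction


def lacking_types(counts, ratio):
    """Return the types whose share of the data set is below the target share."""
    total_count = sum(counts.values())
    total_ratio = sum(ratio.values())
    lacking = []
    for name, weight in ratio.items():
        target_share = Fraction(weight, total_ratio)
        # An empty data set lacks every type with a positive weight.
        current_share = (
            Fraction(counts.get(name, 0), total_count) if total_count else Fraction(0)
        )
        if current_share < target_share:
            lacking.append(name)
    return lacking


# Current distribution 1:2:3, generation ratio 2:2:3 (values from the example).
print(lacking_types({"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 2, "C": 3}))  # ['A']
```

Exact fractions are used instead of floating point so that equal shares compare as equal.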

It should be noted that, when the data in the data set already satisfies the record number corresponding to the minimum expansion multiple and only the proportions of the distribution data do not match the generation ratio, the distribution data can also be brought into line with the generation ratio by deleting redundant instruction result pairs.
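Deleting redundant instruction result pairs could look like the sketch below. The record layout (a dict with a `type` key), the function name `trim_to_targets`, and the choice to keep the earliest pairs of each type are all assumptions made for illustration.

```python
def trim_to_targets(dataset, targets):
    """Keep at most targets[type] pairs per type, preserving the original order.

    Pairs whose type is not listed in targets are dropped.
    """
    kept = []
    seen = {}
    for pair in dataset:
        pair_type = pair["type"]
        seen[pair_type] = seen.get(pair_type, 0) + 1
        if seen[pair_type] <= targets.get(pair_type, 0):
            kept.append(pair)
    return kept


dataset = [
    {"id": 1, "type": "A"},
    {"id": 2, "type": "A"},
    {"id": 3, "type": "B"},
    {"id": 4, "type": "A"},
]
trimmed = trim_to_targets(dataset, {"A": 2, "B": 1})
print([pair["id"] for pair in trimmed])  # [1, 2, 3]
```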

In one specific embodiment of the present application, determining whether the distribution data matches the generation ratio includes:

Step one, obtaining the sub-category proportion of each sub-category under each category by using the category proportion;

Step two, reading the minimum expansion record number of each sub-category from the configuration information;

Step three, determining the minimum record number corresponding to each sub-category by using the sub-category proportion and the minimum expansion record number;

Step four, calculating, based on the minimum record number, the minimum total record number that satisfies the category proportioning;

Step five, judging whether the distribution data matches the minimum record number and the minimum total record number;

Step six, if so, determining that the distribution data matches the generation ratio;

Step seven, if not, determining that the distribution data does not match the generation ratio.

For convenience of description, the above steps are described in combination.

For a category set A that comprises several sub-categories a, b, c and d, let the category proportion be P; the proportions of the sub-categories are then $P_a$, $P_b$, $P_c$ and $P_d$.

The minimum number of expansion records required per sub-API in each sub-category is $\mathrm{lim}_a$, $\mathrm{lim}_b$, $\mathrm{lim}_c$ and $\mathrm{lim}_d$. That is, for each sub-API of a given sub-category, this is the minimum number of instruction result pairs that the model must generate.

The total number of sub-APIs is N, and the numbers of sub-APIs in the sub-categories are $N_a$, $N_b$, $N_c$ and $N_d$.

The match determination may be made by performing the following steps:

Step 1, calculating the record number that satisfies the minimum number requirement.

For each sub-category $i$ in category set A, calculate the minimum number of records $M_i$ required by that sub-category by multiplying its minimum number of expansion records per sub-API, $\mathrm{lim}_i$, by its number of sub-APIs, $N_i$, i.e., $M_i = \mathrm{lim}_i \times N_i$.

Step 2, calculating the minimum total record number that satisfies the proportioning requirement P.

The minimum total record number $S$ is the smallest total for which every sub-category $i$ reaches $M_i$ while keeping its proportion $P_i$, i.e., $S = \max_i \left( M_i / P_i \right)$.

Step 3, calculating the final record number required by each sub-category.

For each sub-category $i$, calculate the final record number $S_i$ required by that sub-category, i.e., $S_i = S \times P_i$.

After the final record number required by each sub-category is obtained, and once the distribution data of the data set is available, it can be judged whether the statistical number of each sub-category is consistent with its final record number; if so, the distribution data matches the generation ratio. If the statistical number of a sub-category exceeds its final record number, the surplus can be deleted; if it falls short, the data can subsequently be replenished for that sub-category.
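The consistency check described above can be sketched as follows, assuming the statistics and the final record numbers are both available as per-sub-category counts. The function name `compare_with_final` and the returned structure are hypothetical.

```python
def compare_with_final(stats, final):
    """Compare per-sub-category counts with the final record numbers.

    Returns (matched, surplus, shortfall), where surplus and shortfall map
    a sub-category to the number of records to delete or to generate.
    """
    surplus = {}
    shortfall = {}
    for name, required in final.items():
        current = stats.get(name, 0)
        if current > required:
            surplus[name] = current - required
        elif current < required:
            shortfall[name] = required - current
    matched = not surplus and not shortfall
    return matched, surplus, shortfall


# Illustrative counts only: a has too many records, c has too few.
matched, surplus, shortfall = compare_with_final(
    stats={"a": 1900, "b": 1800, "c": 1000, "d": 1200},
    final={"a": 1800, "b": 1800, "c": 1200, "d": 1200},
)
print(matched, surplus, shortfall)  # False {'a': 100} {'c': 200}
```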

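For example, suppose there is a set of attributes (i.e., a set of categories; here attributes correspond to categories) $A = \{a, b, c, d\}$, and the number of records for each attribute in the original data set is $N_a = 100$, $N_b = 200$, $N_c = 300$, $N_d = 400$;

the proportioning requirement is $P_a = 0.3$, $P_b = 0.3$, $P_c = 0.2$, $P_d = 0.2$;

the minimum expansion record number of each sub-API is $\mathrm{lim}_a = 2$, $\mathrm{lim}_b = 3$, $\mathrm{lim}_c = 4$, $\mathrm{lim}_d = 3$.

Step 1, calculating the record number that satisfies the minimum expansion requirement of each sub-category a, b, c and d, namely:

$M_a = 2 \times 100 = 200$, $M_b = 3 \times 200 = 600$, $M_c = 4 \times 300 = 1200$, $M_d = 3 \times 400 = 1200$.

Step 2, calculating the minimum total record number S that satisfies the proportioning requirement: $S = \max(200/0.3,\ 600/0.3,\ 1200/0.2,\ 1200/0.2) = 6000$.

Step 3, calculating the final record number required by each sub-category:

$S_a = 6000 \times 0.3 = 1800$, $S_b = 6000 \times 0.3 = 1800$, $S_c = 6000 \times 0.2 = 1200$, $S_d = 6000 \times 0.2 = 1200$.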

Final results:

Attribute a requires 1800 records, attribute b requires 1800 records, attribute c requires 1200 records, and attribute d requires 1200 records.

Verification that the final record count for each attribute is at least a specified multiple of the original record count: a: 1800 > 2 x 100, b: 1800 > 3 x 200, c: 1200 = 4 x 300, d: 1200 > 3 x 400.

The final proportions are a 1800/6000=0.3, b 1800/6000=0.3, c 1200/6000=0.2, d 1200/6000=0.2.

The verification shows that the determination mode of the final record number of each attribute ensures that each attribute at least reaches the minimum expansion multiple specified by the attribute, and simultaneously meets the given proportioning requirement.
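The three calculation steps can be reproduced with a short sketch. The function name `final_record_counts` is hypothetical, the proportions are passed as integer weights (3:3:2:2 for 0.3, 0.3, 0.2, 0.2) so that exact fractions can be used, and rounding up to whole records is an assumption that the embodiment does not spell out.

```python
from fractions import Fraction
from math import ceil


def final_record_counts(sub_api_counts, min_expansion, ratio):
    """Return (M, S, final) for the three calculation steps.

    sub_api_counts: N_i per sub-category
    min_expansion:  lim_i per sub-category
    ratio:          integer weights whose normalised values are P_i
    """
    total_weight = sum(ratio.values())
    share = {name: Fraction(weight, total_weight) for name, weight in ratio.items()}

    # Step 1: M_i = lim_i * N_i
    minimum = {name: min_expansion[name] * sub_api_counts[name] for name in ratio}

    # Step 2: smallest total S such that S * P_i >= M_i for every i
    total = max(ceil(minimum[name] / share[name]) for name in ratio)

    # Step 3: S_i = S * P_i, rounded up to whole records (assumption)
    final = {name: ceil(total * share[name]) for name in ratio}
    return minimum, total, final


minimum, total, final = final_record_counts(
    sub_api_counts={"a": 100, "b": 200, "c": 300, "d": 400},
    min_expansion={"a": 2, "b": 3, "c": 4, "d": 3},
    ratio={"a": 3, "b": 3, "c": 2, "d": 2},
)
print(minimum)  # {'a': 200, 'b': 600, 'c': 1200, 'd': 1200}
print(total)    # 6000
print(final)    # {'a': 1800, 'b': 1800, 'c': 1200, 'd': 1200}
```

The printed values agree with the final results and the verification given in the example.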

In practical application, the generated instruction result pairs can be strictly screened to remove data that was generated incorrectly or does not conform to the specification, thereby ensuring the high quality and consistency of the final output data.

After the data generation is completed, the data proportion of each API can be recalculated and compared with the previously set proportioning requirement. APIs with an insufficient number of records are then screened out for further supplementation.

For the screened-out APIs with insufficient records, customized prompt words take the existing generation results of the API as input and guide the model to generate new data that differs from the existing data. This helps to further enrich the diversity of the data and to avoid generating similar or repeated data.
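One way to feed existing results back into a customized prompt is sketched below. The prompt wording, the function name `build_supplement_prompt` and the API name `fontSize` are hypothetical; the embodiment does not prescribe a prompt format.

```python
def build_supplement_prompt(api_name, existing_pairs, needed):
    """Build a prompt asking for `needed` new pairs that differ from existing ones."""
    lines = [
        f"Generate {needed} new instruction-result pairs for the API '{api_name}'.",
        "Each new pair must differ from the existing pairs listed below.",
        "Existing pairs:",
    ]
    for instruction, result in existing_pairs:
        lines.append(f"- instruction: {instruction} | result: {result}")
    return "\n".join(lines)


prompt = build_supplement_prompt(
    api_name="fontSize",
    existing_pairs=[("Adjust the font size to size four", "fontSize(4)")],
    needed=3,
)
print(prompt)
```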

In addition, the data can be continuously generated and optimized until the number and diversity requirements of all APIs are met. This iterative process ensures that the final dataset is of high quality, diversity and coverage.

In order to facilitate the understanding and implementation of the data generation method provided by the embodiments of the present application by those skilled in the art, the data generation method is described in detail below with reference to related technical schemes and specific application scenarios as examples.

API-oriented data synthesis methods fall mainly into the following three types:

1. Manual data construction, namely creating data by hand or obtaining user-generated data by crawling historical data.

2. Automated program generation, namely setting a strategy for each API programmatically, automatically generating possible API results, and then back-labeling the corresponding input instructions.

3. Generation based on API documents, namely constructing API documents, inputting each API description, and using an existing open-source large model to generate possible input and output cases from which the training data is built.

Manually constructing data, or obtaining it by crawling historical data, has the following disadvantages:

(1) High verification cost: each piece of data must be verified manually to ensure its correctness, which is time-consuming and labor-intensive.

(2) Low generation efficiency: heavy reliance on manual operation makes data generation slow, so the demand for large-scale data is difficult to meet.

(3) Data distribution and quantity are hard to control: fine-grained manual control over data scale and diversity is difficult, so data bias or gaps easily occur.

For automated program generation, a strategy is set for each API by writing a program, the output results of the API are automatically enumerated and generated, and the input instructions are back-labeled according to the output results. This approach has the following disadvantages:

(1) High development cost: it relies on a detailed generation strategy for each API, such as colors, fonts and preset values, which consumes a great deal of development time and resources.

(2) Poor extensibility: when an API, data type or return value is added, the generation strategy must be redesigned and re-implemented, so maintenance costs are high.

(3) Limited flexibility: automatically generated data is often constrained by the rules, making it difficult to cover all possible usage scenarios of the API.

For automatic generation based on API documents and large models, the API documents are provided and an open-source large model is used to generate input-output example data for the API. This approach has the following disadvantages:

(1) Low instruction coverage: the input and output cases generated by the large model often cannot fully cover all usage scenarios of the API, and some important use cases are easily missed.

(2) Unbalanced data distribution: the generated data may be skewed in distribution, so full coverage of the data cannot be achieved.

(3) Insufficient handling of extreme values: boundary values and abnormal conditions receive little consideration, so the comprehensiveness of the data is difficult to ensure.

(4) Weak generalization: the generated data instructions lack flexibility and are difficult to adapt to wide application across various API scenarios.

As can be seen from the above embodiments, the technical solution provided by the present application introduces defined API features and automatically splits an API into a plurality of small sub-APIs based on those features, so as to generate dedicated business data for large model training. This not only helps to control the distribution ratio of the data, but also significantly increases the diversity and coverage of the generated instructions, so that one general scheme can automatically and controllably generate training data in batches for multiple business parties.

Specifically, the present application defines a standard set of API specifications, including types, levels, dependencies, enumerated items and vocabularies. This not only provides a clear framework, but also provides a basis for subsequent automated splitting and data generation.
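As a rough illustration of such a specification, the five features named above could be held in a small record type. The class name `ApiSpec`, its field names, and the example API `tabBar.displayType` with its values are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class ApiSpec:
    """One API entry described by the five features of the specification."""
    name: str
    type: str                                             # e.g. "enum", "number", "boolean"
    level: int                                            # nesting depth within the component
    depends_on: list[str] = field(default_factory=list)   # APIs that must be set first
    enum_items: list[str] = field(default_factory=list)   # allowed values, if enumerated
    vocabulary: list[str] = field(default_factory=list)   # natural-language synonyms


# Hypothetical entry for a tab bar component's display type.
tab_type = ApiSpec(
    name="tabBar.displayType",
    type="enum",
    level=2,
    enum_items=["text", "icon"],
    vocabulary=["tab style", "label type"],
)
print(tab_type.name, tab_type.enum_items)
```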

Based on the defined features, the API is split at different granularities, and its sub-functional modules are further refined and reconstructed. This process allows APIs to be handled more flexibly.
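A minimal sketch of splitting at different granularities is given below, under the assumption that an API declaration can be viewed as a nested tree and that the depth of a node stands for its granularity. The declaration layout (a `children` key), the function name `split_api` and the API names are hypothetical.

```python
def split_api(name, declaration, path=()):
    """Yield (granularity, dotted_path) for every node of a nested declaration.

    Granularity 1 is the whole API; deeper nodes are finer-grained sub-APIs.
    """
    full_path = path + (name,)
    yield len(full_path), ".".join(full_path)
    for child_name, child_decl in declaration.get("children", {}).items():
        yield from split_api(child_name, child_decl, full_path)


declaration = {
    "children": {
        "label": {"children": {"fontSize": {}}},
        "border": {},
    }
}
for granularity, sub_api in split_api("chart", declaration):
    print(granularity, sub_api)
```

With this input the sketch yields `chart` at granularity 1, `chart.label` and `chart.border` at granularity 2, and `chart.label.fontSize` at granularity 3.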

The number distribution of APIs of different types and granularities is counted, and the data generation ratio is calculated from it, so that the balance and quality of the data set can be controlled.
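Counting the type distribution and turning it into a ratio can be sketched as follows. Making the generation ratio directly proportional to the sub-API counts is an assumption for illustration; the ratio may equally be taken from configuration information. The function name `generation_ratio` and the record layout are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import gcd


def generation_ratio(sub_apis):
    """Count sub-APIs per type and reduce the counts to the smallest integer ratio."""
    counts = Counter(api["type"] for api in sub_apis)
    if not counts:
        return {}
    divisor = reduce(gcd, counts.values())
    return {api_type: count // divisor for api_type, count in counts.items()}


sub_apis = [
    {"name": "chart.border.color", "type": "enum"},
    {"name": "chart.label.align", "type": "enum"},
    {"name": "chart.cursor.color", "type": "enum"},
    {"name": "chart.cursor.style", "type": "enum"},
    {"name": "chart.label.fontSize", "type": "number"},
    {"name": "chart.border.width", "type": "number"},
]
print(generation_ratio(sub_apis))  # {'enum': 2, 'number': 1}
```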

By defining API features and splitting automatically, this general solution framework with controllable API distribution realizes customized and diversified generation of large model training data. It allows a unified framework to automatically generate business data with specific parameters and configurations for different business scenarios, ensures the balance and coverage of the data, and improves the efficiency and quality of data generation.

After a generation model is trained on the target data set generated by the present application, the following effects can be achieved:

The user specifies the corresponding component (i.e., the API name) in natural language and sets the various parameters of the component (the API parameters), which makes use quick and convenient and increases creation efficiency.

In a front-end business building scenario, a product manager can quickly build a page and try out and compare different schemes, for example: "I need a tab bar component displayed as a text type" or "I want the third tab of the tab bar to be in a disabled state".

In a component adjustment scenario, the user can quickly specify the desired effect without searching for the required adjustment items among a large number of APIs. The user sets the various parameters of the component in natural language, for example: "adjust the font size to size four", "enable footnotes" and "use a seasonal chart".

In market chart configuration in the finance field, product managers and designers can quickly compare different schemes and quickly build different market chart styles and functions, for example: "set the border color of the candlestick chart to dark red" and "change the crosshair cursor to blue".
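To make the notion of an instruction result pair concrete, the last scenario above (setting a chart border color to dark red) might be stored as a record like the one below. The field names, the API path `candlestick.border.color` and the value `darkred` are purely hypothetical; the application does not fix a storage format.

```python
import json

# Hypothetical instruction result pair: a natural-language instruction
# and the API name and parameters that it should resolve to.
pair = {
    "instruction": "Set the border color of the candlestick chart to dark red",
    "result": {
        "api": "candlestick.border.color",
        "params": {"value": "darkred"},
    },
}
print(json.dumps(pair, indent=2))
```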

Referring to fig. 2, the steps of the present application include:

1. Constructing a standard API specification: first, a standard set of API specifications is defined, including types, levels, dependencies, enumerated items and an extensible vocabulary. This step provides a clear framework and basis for subsequent splitting and data generation.

2. API splitting and reconstruction: the API is split at different granularities according to the defined features. This means that not only the functionality of the API as a whole is considered, but its sub-functional modules are also further refined and reconstructed. This helps to handle APIs more flexibly and lays the foundation for subsequent data generation.

3. Counting the number distribution: the numbers of APIs of different granularities and types are counted to form a comprehensive data distribution diagram. The purpose of this step is to obtain a clear picture of the number and distribution of each class of API.

4. Calculating the data proportion: the proportion required by each class of API in the data generation process is calculated from the distribution data. This step helps to control the amount of generated data more precisely and to ensure the balance of the data set.

5. Constructing customized prompt words: customized prompt words are designed for APIs of different granularities. The prompt words can incorporate specific quantity requirements to ensure that the generated data achieves the expected effect.

6. Generating data: a GPT generation model is used to generate data according to the designed prompt words. This step takes advantage of the capabilities of generative AI to produce diverse data on a large scale.

7. Data processing and error screening: the generated data is post-processed, and erroneous data is screened out and removed. This step ensures the quality and accuracy of the final data.

8. Recalculating the data proportion and screening APIs: the data proportion is calculated again, and APIs with an insufficient number of records are screened out. The purpose of this step is to ensure that each API reaches the required quantity and coverage.

9. Continuing data generation: the APIs screened out in step 8 are combined with the customized prompt words constructed in step 5, and the existing generation results of each API are incorporated into the context to guide the model to generate new data that differs from the existing data. This helps to fill in missing data, enriches the diversity of the data, and avoids repetitive and uniform data.

10. Iterative optimization: steps 6 to 9 are executed repeatedly, and data is continuously generated and optimized until the quantity and diversity requirements of all APIs are met.
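The loop formed by steps 6 to 9 can be sketched as follows. The model call is replaced by a stub, and the function names `iterate_generation` and `stub_generate`, the record layout and the `max_rounds` limit are all assumptions made so that the sketch is self-contained; a real implementation would call the text language model with the customized prompt words.

```python
from collections import Counter


def iterate_generation(targets, generate, is_valid, max_rounds=10):
    """Repeat generate / screen / recount until every API reaches its target."""
    dataset = []
    for _ in range(max_rounds):
        # Step 8: recount and screen out the APIs that are still short.
        counts = Counter(pair["api"] for pair in dataset)
        lacking = {
            api: need - counts[api]
            for api, need in targets.items()
            if counts[api] < need
        }
        if not lacking:
            break
        for api, missing in lacking.items():
            # Step 9: pass the existing results as context to the generator.
            existing = [pair for pair in dataset if pair["api"] == api]
            accepted = 0
            for pair in generate(api, missing, existing):  # step 6
                if accepted == missing:
                    break
                # Step 7: drop invalid or duplicate pairs.
                if is_valid(pair) and pair not in dataset:
                    dataset.append(pair)
                    accepted += 1
    return dataset


def stub_generate(api, missing, existing):
    """Stand-in for the model call; numbering continues after existing pairs."""
    start = len(existing)
    return [
        {
            "api": api,
            "instruction": f"{api} request {start + i}",
            "result": f"{api}(value={start + i})",
        }
        for i in range(missing)
    ]


data = iterate_generation({"color": 3, "fontSize": 2}, stub_generate, lambda pair: True)
print(Counter(pair["api"] for pair in data))
```

The `max_rounds` limit keeps the loop from running forever when the generator cannot produce enough valid, distinct pairs.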

Corresponding to the above method embodiments, the embodiments of the present application further provide a data generating device, where the data generating device described below and the data generating method described above may be referred to correspondingly to each other.

Referring to fig. 3, the apparatus includes the following modules:

the splitting module 101 is configured to split the application programming interfaces under different granularities to obtain a plurality of application programming sub-interfaces;

the generation proportion statistics module 102 is used for carrying out type statistics on the plurality of application programming sub-interfaces and determining generation proportion based on the statistical result;

The prompt processing module 103 is configured to obtain prompt words respectively corresponding to the multiple application programming subinterfaces;

the instruction result pair generating module 104 is configured to input a prompt word into the text language model to perform data generation processing, so as to obtain an instruction result pair, where the text language model has learned a technical document and a use guide of the application programming interface;

the analysis module 105 is configured to store the instruction result pairs into a data set, and perform statistics on the instruction result pairs in the data set to obtain distribution data;

The data generating module 106 is configured to determine the data set as a target data set in a case where the distribution data matches the generation ratio.

The device provided by the embodiment of the present application splits the application programming interface at different granularities to obtain a plurality of application programming sub-interfaces; performs type statistics on the plurality of application programming sub-interfaces and determines the generation ratio based on the statistical result; obtains the prompt words corresponding to the plurality of application programming sub-interfaces; and inputs the prompt words into a text language model for data generation processing to obtain instruction result pairs, where the text language model has learned the technical documents and use guidelines of the application programming interface. The instruction result pairs are stored into a data set, statistics are performed on the instruction result pairs in the data set to obtain distribution data, and the data set is determined to be the target data set when the distribution data matches the generation ratio.

In one embodiment of the present application, the splitting module is specifically configured to acquire declaration information of the application programming interface;

determine different granularities based on the declaration information;

and split the application programming interface at the different granularities to obtain a plurality of application programming sub-interfaces.

In one embodiment of the present application, the device further comprises:

And the supplementing module is used for supplementing the data set by re-acquiring the new prompt words under the condition that the distribution data is not matched with the generation proportion, so that the distribution data corresponding to the supplemented data set is matched with the generation proportion.

In one embodiment of the present application, the supplemental module is specifically configured to compare the distribution data with the generation ratio, and determine a target type of the lacking instruction result pair;

Re-acquiring the prompt words corresponding to the target application programming sub-interfaces;

The complementary instruction result pairs are written into the dataset.

In one embodiment of the present application, the generation proportion statistics module is specifically configured to output the statistical result on a visual interface;

Acquiring input configuration information;

The generation ratio is determined using the configuration information.

In a specific embodiment of the present application, the generation proportion statistics module is specifically configured to read a category proportion of each category in the configuration information, and determine the generation proportion based on the category proportion.

In a specific embodiment of the present application, the matching judgment module is configured to obtain a subclass proportion of the subclass under each class by using the class proportion;

determining the minimum record number corresponding to the sub-category by using the sub-category proportion and the minimum expansion record number;

Corresponding to the above method embodiment, the embodiment of the present application further provides an electronic device, where an electronic device described below and a data generating method described above may be referred to correspondingly.

Referring to fig. 4, the electronic device includes:

a memory 332 for storing a computer program;

a processor 322 for implementing the steps of the data generating method of the above-described method embodiment when executing a computer program.

Specifically, referring to fig. 5, which is a schematic diagram of a specific structure of an electronic device according to this embodiment, the electronic device may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPU) 322 and a memory 332, where the memory 332 stores one or more computer programs 342 or data 344. The memory 332 may be transient storage or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a series of instruction operations for the data processing device. Further, the processor 322 may be configured to communicate with the memory 332 and to execute the series of instruction operations in the memory 332 on the electronic device 301.

The electronic device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.

The steps in the data generation method described above may be implemented by the structure of the electronic device.

Corresponding to the above method embodiments, the embodiments of the present application further provide a readable storage medium, where a readable storage medium described below and a data generating method described above may be referred to correspondingly to each other.

A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data generation method of the above-described method embodiments.

The readable storage medium may be any readable storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it is further noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "include", "comprise" and any other variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus.

Specific examples have been used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is provided only to assist in understanding the method and core concepts of the application. At the same time, those skilled in the art may, in accordance with the ideas of the present application, make changes to the specific embodiments and to the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims

1. A data generation method, comprising:

2. The method of claim 1, wherein splitting the application programming interface at different granularities results in a plurality of application programming subinterfaces, comprising:

acquiring declaration information of the application programming interface;

determining different granularity based on the declaration information;

3. The method of claim 1, further comprising, in the event that the distribution data does not match the generation ratio:

4. A method according to claim 3, wherein supplementing the data set by re-acquiring new hint words comprises:

And writing the complementary instruction result pair into the data set.

5. The method of claim 1, wherein determining the generation ratio based on the statistics comprises:

Outputting the statistical result on a visual interface;

Acquiring input configuration information;

6. The method of any of claims 1 to 5, wherein determining the generation ratio using the configuration information comprises:

7. The method of claim 6, wherein determining whether the distribution data matches the generation ratio comprises:

8. A data generating apparatus, comprising:

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the data generation method according to any of claims 1 to 7 when executing said computer program.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the data generation method according to any of claims 1 to 7.