CN121237196A - Artificial intelligence-based target sequence screening methods, kits, and applications for species identification - Google Patents

Artificial intelligence-based target sequence screening methods, kits, and applications for species identification

Info

Publication number
CN121237196A
Authority
CN
China
Prior art keywords
species
gene
target
target sequence
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511335015.8A
Other languages
Chinese (zh)
Inventor
辛天怡
宋经元
史志杰
甘雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Medicinal Plant Development of CAMS and PUMC
Original Assignee
Institute of Medicinal Plant Development of CAMS and PUMC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Medicinal Plant Development of CAMS and PUMC
Priority to CN202511335015.8A
Publication of CN121237196A

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The disclosure provides an artificial intelligence-based target sequence screening method, kit and application for species identification. The method first uses a target sequence recognition model to predict a plurality of first gene fragments derived from whole genome data of a target species and outputs a prediction probability for each; it then screens the first gene fragments based on the prediction probability and a preset first screening condition to obtain second gene fragments, screens the second gene fragments against a pre-constructed nucleic acid database to obtain third gene fragments, and finally determines the third gene fragments that pass verification with target primers to be target sequences. Because the target sequence recognition model can score a plurality of first gene fragments and output their prediction probabilities, the efficiency and accuracy of target sequence screening are improved.

Description

Artificial intelligence-based target sequence screening method, kit and application for species identification
Technical Field
The disclosure relates to the field of biotechnology, and in particular relates to a target sequence screening method, a kit and application for species identification based on artificial intelligence.
Background
Classical species identification methods face several problems in practical application: misjudgment caused by morphological similarity, dependence on the experience and subjective judgment of identification experts, long turnaround time, high cost, and difficulty in handling large numbers of samples. Emerging molecular biology techniques such as DNA barcoding focus on only a few specific regions of the genome; although they achieve species identification to a certain extent, some species still cannot be accurately identified because their DNA barcode sequences are identical.
Artificial intelligence (AI) recognition models based on deep learning rely mainly on image recognition. Because the datasets used to train these models usually capture the apparent morphological characteristics of species, the recognition capability of existing AI models hits an obvious bottleneck as the number of species and the data scale increase. For example, in practical application such models usually achieve good identification only at the genus level, while identification at the species and subspecies level remains difficult. In recent years, AI technology has shown significant advantages in fields such as gene sequence analysis and mutation site annotation, using algorithms such as neural networks and natural language processing. However, no study has reported how to use AI technology to mine identifying features in the whole genome data of species for biological species identification.
Disclosure of Invention
In view of the above, the disclosure aims to provide an artificial intelligence-based target sequence screening method, kit and application for species identification.
Based on the above objects, the present disclosure provides an artificial intelligence-based target sequence screening method for species identification, comprising:
acquiring whole genome data of a target species and determining a plurality of first gene fragments based on the whole genome data;
performing target sequence probability prediction on the plurality of first gene fragments using a target sequence recognition model, and outputting a prediction probability for each first gene fragment;
screening the plurality of first gene fragments according to the prediction probability and a preset first screening condition to obtain a plurality of second gene fragments;
screening the plurality of second gene fragments based on a pre-constructed nucleic acid database to obtain a third gene fragment;
designing a target primer based on the third gene fragment, verifying the specificity of the third gene fragment using the target primer, a first genome of the target species and a second genome of a non-target species, and determining the verified third gene fragment to be the target sequence;
wherein the target sequence recognition model is obtained by training an initial model on a dataset;
and wherein the dataset comprises target sequence samples and whole genome samples of the species to which the target sequence samples correspond.
In some embodiments, acquiring the whole genome data of the target species and determining the plurality of first gene fragments specifically comprises:
acquiring whole genome data of a plurality of individuals of the target species, and segmenting each whole genome to obtain a plurality of fourth gene fragments;
establishing an inverted index based on the plurality of fourth gene fragments and each individual, and counting the number of times each fourth gene fragment occurs in different individuals;
screening the plurality of fourth gene fragments based on a preset second screening condition and the number of occurrences to obtain the plurality of first gene fragments.
In some embodiments, the preset first screening condition includes a first ranking range;
screening the plurality of first gene fragments according to the prediction probability and the preset first screening condition to obtain the plurality of second gene fragments specifically comprises:
ranking the plurality of first gene fragments based on the prediction probability;
in response to determining that a first gene fragment falls within the first ranking range, determining that first gene fragment to be a second gene fragment.
In some embodiments, screening the plurality of second gene fragments based on the pre-constructed nucleic acid database to obtain the third gene fragment specifically comprises:
aligning each of the second gene fragments with the nucleic acid sequences in the nucleic acid database;
in response to determining that a second gene fragment differs by at least N bases from every nucleic acid sequence in the nucleic acid database that is equal in length to the second gene fragment and belongs to any species other than the target species, determining that second gene fragment to be the third gene fragment;
wherein N≥3.
In some embodiments, the target sequence sample satisfies a plurality of filtering rules, wherein the filtering rules are statistically derived based on a plurality of the target sequence samples;
the dataset also includes at least one of a first type of sequence sample and a second type of sequence sample;
the first type sequence samples meet the plurality of filtering rules and do not belong to the target sequence, and the second type sequence samples are random sequences or meet part of the plurality of filtering rules.
In some embodiments, the ratio of the number of target sequence samples, the first class of sequence samples, and the second class of sequence samples is 1:0.8-1.0:0.05-0.2.
In some embodiments, the initial model comprises a feature extraction module and a classification module, wherein the feature extraction module is a pre-trained model based on the Transformer architecture.
Based on the same inventive concept, embodiments of the present disclosure also provide a kit for species identification, the kit comprising primer sequences designed for a target sequence selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3' (SEQ ID NO:1) and 5'-TTTCCTGACGAATGGACATGTTGCG-3' (SEQ ID NO:4).
Based on the same inventive concept, embodiments of the present disclosure also provide the use in species identification of a target sequence selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3' (SEQ ID NO:1) and 5'-TTTCCTGACGAATGGACATGTTGCG-3' (SEQ ID NO:4). It is noted that the application provided by the embodiments of the present disclosure can identify any sample from which the target sequence can be obtained, including but not limited to traditional Chinese medicinal materials, decoction pieces, Chinese patent medicines, dietary supplements, and the like.
In some embodiments, SEQ ID NO:1 is used to identify the fungus Colletotrichum fructicola and SEQ ID NO:4 is used to identify the fungus Colletotrichum siamense.
From the above, it can be seen that the present disclosure provides an artificial intelligence-based target sequence screening method, kit and application for species identification. The method first uses a target sequence recognition model to predict a plurality of first gene fragments derived from whole genome data of a target species and outputs a prediction probability for each; it then screens the first gene fragments based on the prediction probability and a preset first screening condition to obtain second gene fragments, screens the second gene fragments against a pre-constructed nucleic acid database to obtain third gene fragments, and finally determines the third gene fragments that pass verification with target primers to be target sequences. Because the target sequence recognition model can score a plurality of first gene fragments and output their prediction probabilities, the efficiency and accuracy of target sequence screening are improved.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or related art, the drawings required for the embodiments or related art description will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
FIG. 1A shows a schematic flow diagram of an artificial intelligence-based target sequence screening method for species identification provided by an embodiment of the present disclosure;
FIG. 1B shows a partial flow diagram of another artificial intelligence-based target sequence screening method for species identification provided by an embodiment of the present disclosure;
FIG. 2A shows the GenBank alignment result of the species-specific target sequence of Colletotrichum fructicola in Example 2 of the present disclosure;
FIG. 2B shows the GenBank alignment result of the species-specific target sequence of Colletotrichum siamense in Example 2 of the present disclosure;
FIG. 2C shows the Sanger sequencing result of the species-specific target sequence of Colletotrichum fructicola in Example 2 of the present disclosure;
FIG. 2D shows the Sanger sequencing result of the species-specific target sequence of Colletotrichum siamense in Example 2 of the present disclosure;
FIG. 3A shows the microplate reader results for Colletotrichum fructicola and other closely related Colletotrichum species in Example 3 of the present disclosure;
FIG. 3B shows the microplate reader results for Colletotrichum siamense and other closely related Colletotrichum species in Example 3 of the present disclosure;
FIG. 4A shows the visual fluorescence detection results for Colletotrichum fructicola and other closely related Colletotrichum species in Example 3 of the present disclosure;
FIG. 4B shows the visual fluorescence detection results for Colletotrichum siamense and other closely related Colletotrichum species in Example 3 of the present disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like, as used in embodiments of the present disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items.
Abbreviations used in the examples of the present disclosure have their conventional meaning in the chemical and biological arts. The chemical structures and formulas set forth herein are constructed according to standard valence rules known in the chemical arts. "μM" herein refers to "μmol/L" and "mM" refers to "mmol/L" unless otherwise specified.
As described in the background section, the related art has not exploited AI technology to mine identifying features in the whole genome data of species for biological species identification.
In view of the above, the embodiments of the present disclosure provide an artificial intelligence-based target sequence screening method, kit and application for species identification. The screening method first uses a target sequence recognition model to predict a plurality of first gene fragments derived from whole genome data of a target species and outputs a prediction probability for each; it then screens the first gene fragments based on the prediction probability and a preset first screening condition to obtain second gene fragments, screens the second gene fragments against a pre-constructed nucleic acid database to obtain third gene fragments, and finally determines the third gene fragments that pass verification with target primers to be target sequences. Because the target sequence recognition model can score a plurality of first gene fragments and output their prediction probabilities, the efficiency and accuracy of target sequence screening are improved.
FIG. 1A shows a schematic flow chart of an artificial intelligence-based target sequence screening method for species identification according to an embodiment of the present disclosure, and FIG. 1B shows a partial schematic flow chart of another such method according to an embodiment of the present disclosure.
In some embodiments, an initial model is first constructed.
Alternatively, the initial model may include a feature extraction module and a classification module, where the feature extraction module is a pre-trained model based on the Transformer architecture, such as the DNABERT model, and the classification module may be a linear layer or another network that implements classification. It should be noted that the feature extraction module may output the feature at the [CLS] position to the classification module as the sequence feature.
Alternatively, the initial model may also be a DNABERT model with a classification head, such as a sequence-level classification model or a token-level classification model, which is not limited by the present disclosure.
It should be noted that the initial model may also be a large text classification model, which is not limited in this disclosure.
Next, referring to fig. 1B, a dataset 201A is constructed over the data for the known specific species. It should be noted that the data set may include a training set and a test set.
Illustratively, the dataset may include target sequence samples of 31 species of alternaria identified in previous studies and whole genome samples of 145 individuals of the corresponding species. The target sequence samples may serve as positive samples.
To enhance the ability of the model to distinguish subtle differences, the inventors of the present disclosure also add difficult negative samples (hard negative samples) to the dataset.
In some embodiments, the difficult negative samples may be determined by statistically analyzing the positive samples (corresponding to the target sequence samples) to obtain filtering rules. Illustratively, the following indices are computed over the positive samples: the GC content of the whole sequence; the GC content of the 3-mers and 6-mers at the 5' end and the 3' end; and the lengths of consecutive single-base, double-base and triple-base repeats (for example, AAA counts as a repeat of 3 and AAAA as a repeat of 4), together with the number of such repeats contained in each sequence. The value range of each index is then constructed, and these ranges constitute the filtering rules.
The whole genome sequences are filtered with the filtering rules, and gene fragments that satisfy all of the filtering rules but do not belong to the positive samples can be determined to be difficult negative samples.
It follows that the difficult negative samples have high sequence similarity to the positive samples without being positive samples; they are the difficult cases in model learning and help improve the ability of the model to distinguish subtle differences.
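The rule derivation described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the feature set is reduced to four indices, the function names (`features`, `derive_rules`, `label_candidate`) and the toy 12-mers are hypothetical, and each rule is assumed to be a simple [min, max] range observed over the positive samples.

```python
import re


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest single-base repeat, e.g. 4 for 'AAAA'."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))


def features(seq: str) -> dict[str, float]:
    """A reduced, illustrative subset of the indices named in the text."""
    seq = seq.upper()
    return {
        "gc_total": gc_content(seq),
        "gc_5prime_6mer": gc_content(seq[:6]),
        "gc_3prime_6mer": gc_content(seq[-6:]),
        "longest_run": float(longest_run(seq)),
    }


def derive_rules(positives: list[str]) -> dict[str, tuple[float, float]]:
    """Assumed rule form: the [min, max] range of each index over the positives."""
    rows = [features(s) for s in positives]
    return {name: (min(r[name] for r in rows), max(r[name] for r in rows))
            for name in rows[0]}


def label_candidate(seq: str, rules, positives) -> str:
    """Positive, difficult negative (all rules pass), or ordinary negative."""
    if seq in positives:
        return "positive"
    feats = features(seq)
    passed = sum(lo <= feats[name] <= hi for name, (lo, hi) in rules.items())
    return "hard_negative" if passed == len(rules) else "negative"


# Toy positives (hypothetical, not real target sequences).
positives = ["ACGTGCATGCAT", "ACGAGCTTGCAA"]
rules = derive_rules(positives)
```

A genome fragment such as `"TCGTGCATCGTA"` falls inside every range derived from these toy positives and is therefore labelled a difficult negative, whereas a homopolymer such as `"AAAAAAAAAAAA"` fails the rules and is labelled an ordinary negative.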
To enhance the model's discrimination against the non-target background, some embodiments of the present disclosure also add negative samples to the dataset.
In some embodiments, the negative samples may be constructed in two ways. For example, random sequences are used as negative samples. As another example, gene fragments that satisfy only some of the plurality of filtering rules (e.g., one or two conditions) are randomly extracted as negative samples. Gene fragments satisfying part of the filtering rules resemble the positive samples in overall sequence characteristics but differ markedly in sequence content, which strengthens the discrimination of the model.
The dataset is then used to perform supervised transfer learning on the initial model. During training, a partial parameter freezing strategy can be adopted, which prevents noise from disrupting the pre-trained parameters on a large scale and reduces overfitting. A label smoothing strategy can also be applied to avoid overfitting to noisy data.
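As a rough illustration of the label smoothing strategy mentioned above, the sketch below computes a smoothed target distribution and the resulting cross-entropy loss in plain Python. The smoothing factor `epsilon=0.1` and the function names are assumptions for illustration; the disclosure does not specify them, and an actual training run would normally rely on a deep learning framework rather than this hand-written loss.

```python
import math


def smooth_labels(true_class: int, n_classes: int, epsilon: float = 0.1) -> list[float]:
    """Standard label smoothing: (1 - eps) * one_hot + eps / n_classes."""
    return [
        (1.0 - epsilon) * (1.0 if i == true_class else 0.0) + epsilon / n_classes
        for i in range(n_classes)
    ]


def cross_entropy(logits: list[float], target_dist: list[float]) -> float:
    """Cross-entropy between softmax(logits) and a target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target_dist))


# Two classes: index 0 = not a target sequence, index 1 = target sequence.
hard_target = [0.0, 1.0]
soft_target = smooth_labels(true_class=1, n_classes=2, epsilon=0.1)

# An over-confident prediction is penalised under the smoothed target.
confident_logits = [-10.0, 10.0]
loss_hard = cross_entropy(confident_logits, hard_target)
loss_soft = cross_entropy(confident_logits, soft_target)
```

With smoothed targets, an extremely confident prediction no longer drives the loss towards zero, which is the mechanism by which label smoothing discourages fitting noisy labels.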
In some embodiments, the ratio of positive samples to difficult negative samples to negative samples can be 1:0.8-1.0:0.05-0.2. This ratio strengthens the model's learning of the difficult negative samples while still ensuring adequate learning of the positive and negative samples, thereby improving classification accuracy. If the proportion of difficult negative samples is raised above this range, the model cannot learn the positive samples sufficiently; if it is lowered below this range, the model cannot learn the difficult negative samples sufficiently.
Alternatively, the ratio of positive samples to difficult negative samples to negative samples may be 1:0.9:0.1.
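For a given number of positive samples, the sample counts implied by such a ratio can be worked out as in the short sketch below. The function name `sample_counts` and the rounding choice are illustrative assumptions; only the ratio ranges come from the text.

```python
def sample_counts(n_positive: int, hard_ratio: float = 0.9,
                  negative_ratio: float = 0.1) -> dict[str, int]:
    """Counts for a positive : difficult negative : negative ratio of 1 : hard : neg."""
    if not 0.8 <= hard_ratio <= 1.0:
        raise ValueError("hard_ratio outside the 0.8-1.0 range given in the text")
    if not 0.05 <= negative_ratio <= 0.2:
        raise ValueError("negative_ratio outside the 0.05-0.2 range given in the text")
    return {
        "positive": n_positive,
        "hard_negative": round(n_positive * hard_ratio),
        "negative": round(n_positive * negative_ratio),
    }


counts = sample_counts(1000)  # 1 : 0.9 : 0.1
```

For 1000 positive samples at 1:0.9:0.1 this gives 900 difficult negative samples and 100 negative samples.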
Through targeted data set construction and training strategy optimization, the target sequence recognition model obtained through training can be accurately adapted to the target sequence characteristics of the species, and finally high-precision recognition and discrimination of the species target sequence are realized, so that the recognition performance of the model on the species specific target sequence is remarkably improved.
With continued reference to FIG. 1B, a local nucleic acid database 202A is constructed. In some embodiments, the local nucleic acid database may be constructed based on DNA sequence data in the GenBank database, the National Genomics Data Center (NGDC), or other public databases.
It should be noted that the GenBank database is a DNA sequence database established by the National Center for Biotechnology Information (NCBI), with the website address https://www.ncbi.nlm.nih.gov/GenBank/. The National Genomics Data Center is a life and health big data center hosted by the Beijing Institute of Genomics, Chinese Academy of Sciences, and jointly built with the Institute of Biophysics, Chinese Academy of Sciences and the Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences; its website address is https://ngdc.cncb.ac.cn/.
For example, if the species to be detected is a eukaryote, the local nucleic acid database may mainly use the Core nucleotide database of GenBank together with all eukaryotic genome data.
Based on the trained target sequence recognition model and the local nucleic acid database, target sequences for species identification can be screened. The method of screening for target sequences is described in detail below with reference to the accompanying drawings.
As shown in fig. 1A and 1B, the screening method 100 includes:
S101, acquiring whole genome data of a target species and determining a plurality of first gene fragments based on the whole genome data.
Here, as shown in fig. 1B, the user may input the name of the target species and its whole genome data storage path 203A, based on which whole genome data may be acquired.
In some embodiments, step S101 may include:
acquiring whole genome data of a plurality of individuals of the target species, and segmenting each whole genome to obtain a plurality of fourth gene fragments.
The length of the fourth gene fragment can be preset, for example to any value from 20 bp to 800 bp, such as 20 bp to 100 bp, 20 bp to 80 bp, 25 bp or 50 bp; the disclosure is not limited thereto.
Alternatively, each whole genome may be cut with a window that advances one base at a time to obtain the plurality of fourth gene fragments; for example, if the length of the fourth gene fragment is K and the length of the whole genome is L, the whole genome can be cut into L-K+1 fourth gene fragments.
Referring to 204B of FIG. 1B, an inverted index is established based on the plurality of fourth gene fragments and each individual, and the number of times each fourth gene fragment occurs in different individuals is counted.
The plurality of fourth gene fragments are then screened based on a preset second screening condition and the number of occurrences to obtain the plurality of first gene fragments.
Illustratively, the second screening condition may be to retain the fourth gene fragments ranked in the top 80% by number of occurrences.
A fourth gene fragment that occurs in few individuals may be an individual-specific fragment rather than a species-specific fragment; applying the preset second screening condition excludes such fragments and prevents individual-specific fourth gene fragments from being input into the target sequence recognition model.
It should be noted that the inverted index is merely an example, and those skilled in the art may also use other ways to count the number of times each fourth gene fragment occurs in different individuals, which is not limited in this disclosure.
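The segmentation and counting in S101 can be sketched as below. This is an illustrative sketch under stated assumptions rather than the disclosed implementation: genomes are held as in-memory strings, the helper names are hypothetical, and the second screening condition is interpreted here as "present in at least a given fraction of individuals", which is only one possible reading of the 80% condition.

```python
from collections import defaultdict


def kmers(sequence: str, k: int):
    """Yield every length-k window of `sequence` (L - k + 1 windows in total)."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]


def build_inverted_index(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of individuals whose genome contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for individual, sequence in genomes.items():
        for fragment in kmers(sequence.upper(), k):
            index[fragment].add(individual)
    return index


def conserved_fragments(index: dict[str, set[str]], n_individuals: int,
                        min_fraction: float = 0.8) -> list[str]:
    """Keep fragments found in at least `min_fraction` of the individuals."""
    return sorted(
        fragment for fragment, owners in index.items()
        if len(owners) / n_individuals >= min_fraction
    )


# Toy genomes for three hypothetical individuals (not real data).
genomes = {
    "ind1": "ACGTACGGA",
    "ind2": "ACGTACGTT",
    "ind3": "TTGTACGGA",
}
index = build_inverted_index(genomes, k=5)
first_fragments = conserved_fragments(index, len(genomes), min_fraction=0.8)
```

In this toy input only `GTACG` occurs in all three individuals, so it is the only fragment that survives the 0.8 threshold; fragments seen in a single individual are dropped as potentially individual-specific. For real genomes, an on-disk k-mer counter would likely be needed instead of a Python dictionary.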
S103, referring to 205B in FIG. 1B, target sequence probability prediction is performed on the plurality of first gene segments by using a target sequence recognition model, and the prediction probability of each first gene segment is output.
S105, screening the plurality of first gene fragments according to the prediction probability and a preset first screening condition to obtain a plurality of second gene fragments;
In some embodiments, the preset first screening condition includes a first ranking range, for example the top 20,000 fragments ranked by prediction probability, and S105 specifically includes:
ranking the plurality of first gene fragments based on the prediction probability;
in response to determining that a first gene fragment falls within the first ranking range, determining that first gene fragment to be a second gene fragment.
In some alternative embodiments, the preset first screening condition includes a first ranking range (for example, ranks 1-500 by prediction probability), a second ranking range (ranks 501-1000), a third ranking range (ranks 1001-1500), and so on. Based on this, steps S105 to S109 may be performed in a loop: for example, when none of the third gene fragments in S109 is determined to be the target sequence, the first gene fragments in the second ranking range may be determined to be the second gene fragments, and S107 and S109 are performed again, and so on until at least one third gene fragment in S109 is determined to be the target sequence.
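The looped variant of S105 to S109 can be sketched as a generator over ranking ranges. The names `ranked_tiers`, `screen_until_hit` and `downstream` are hypothetical; `downstream` stands in for the database screening (S107) and primer verification (S109) steps, which are not modelled here.

```python
from typing import Callable, Iterator


def ranked_tiers(probabilities: dict[str, float], tier_size: int) -> Iterator[list[str]]:
    """Yield fragments in consecutive ranking ranges, highest probability first."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    for start in range(0, len(ranked), tier_size):
        yield ranked[start:start + tier_size]


def screen_until_hit(probabilities: dict[str, float], tier_size: int,
                     downstream: Callable[[list[str]], list[str]]) -> list[str]:
    """Process one ranking range at a time until downstream steps confirm a target."""
    for tier in ranked_tiers(probabilities, tier_size):
        targets = downstream(tier)
        if targets:
            return targets
    return []


# Toy prediction probabilities for four hypothetical fragments.
probabilities = {"frag_a": 0.9, "frag_b": 0.2, "frag_c": 0.7, "frag_d": 0.5}
tiers = list(ranked_tiers(probabilities, tier_size=2))
```

Processing the ranges lazily means the comparatively expensive alignment and wet-lab verification steps are only run on lower-ranked fragments when the higher-ranked ranges yield no confirmed target sequence.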
S107, screening the plurality of second gene fragments based on a pre-constructed nucleic acid database (202A) to obtain a third gene fragment;
In some embodiments, S107 specifically includes:
referring to 206B of FIG. 1B, aligning each of the second gene fragments with the nucleic acid sequences in the nucleic acid database;
Here, a Query library may be constructed from the second gene fragments, BLAST (Basic Local Alignment Search Tool) alignment may be performed between the Query library and the nucleic acid database, and the alignment results may be output.
Referring to 207B of FIG. 1B, in response to determining that a second gene fragment differs by at least N bases from every nucleic acid sequence in the nucleic acid database that is equal in length to the second gene fragment and belongs to any species other than the target species, the second gene fragment is determined to be a third gene fragment, wherein N≥3 and N is a positive integer.
Through S107, second gene fragments that differ only slightly from sequences in the nucleic acid database are filtered out, which improves the specificity of the third gene fragments.
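The "at least N base differences" criterion can be illustrated with a simplified stand-in for the BLAST step. The sketch below is an assumption-laden simplification: it scans every equal-length window with a Hamming distance, ignores reverse complements and gapped alignments, and uses a tiny in-memory dictionary (`database`) in place of a real nucleic acid database; all names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def min_differences(fragment: str, sequence: str) -> int:
    """Smallest mismatch count between `fragment` and any equal-length window."""
    k = len(fragment)
    if len(sequence) < k:
        return k  # assumption: no comparable window, treat as fully different
    return min(hamming(fragment, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))


def is_specific(fragment: str, database: dict[str, list[str]],
                target_species: str, n: int = 3) -> bool:
    """True if the fragment differs by >= n bases from every non-target sequence."""
    return all(
        min_differences(fragment, seq) >= n
        for species, seqs in database.items()
        if species != target_species
        for seq in seqs
    )


# Toy database keyed by species name (hypothetical sequences).
database = {
    "target": ["TTACGTACGTTT"],
    "relative_far": ["TTTTTTTTTTTT"],
}
fragment = "ACGTACGT"
specific = is_specific(fragment, database, target_species="target", n=3)
```

Adding a relative that contains a near-identical window (for example `"GGACGTTCGTGG"`, one mismatch away) makes the same fragment fail the check, which is the behaviour S107 relies on to remove fragments shared with closely related species.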
S109, designing a target primer based on the third gene fragment, verifying the specificity of the third gene fragment using the target primer, the first genome of the target species and the second genome of the non-target species, and determining the verified third gene fragment to be the target sequence, wherein the non-target species may be a species of the same genus as the target species and there may be more than one non-target species; the disclosure is not limited thereto;
It should be noted that the first genome of the target species and the second genome of the non-target species may be obtained by DNA extraction: for example, an organism of the target species is obtained and the first genome is extracted from it, and an organism of the non-target species is obtained and the second genome is extracted from it.
With continued reference to 207B of FIG. 1B, the target primer designed based on the third gene fragment may be a base sequence designed within 500 bp upstream and downstream of the position of the third gene fragment (corresponding to context matching), e.g., SEQ ID NO:2, SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:6.
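Locating the region in which primers may be designed can be sketched as below. The function name `primer_design_window` is hypothetical, the genome is assumed to be a single in-memory string, and only the first occurrence of the fragment is considered; actual primer selection (melting temperature, GC clamp and so on) is outside the scope of this sketch.

```python
def primer_design_window(genome: str, fragment: str,
                         flank: int = 500) -> tuple[int, int, str]:
    """Return (start, end, region) covering the fragment plus `flank` bp each side.

    Coordinates are 0-based and end-exclusive; the window is clipped at the
    ends of the genome.
    """
    pos = genome.find(fragment)
    if pos < 0:
        raise ValueError("fragment not found in genome")
    start = max(0, pos - flank)
    end = min(len(genome), pos + len(fragment) + flank)
    return start, end, genome[start:end]


# Toy genome (hypothetical): the fragment "CCGG" sits at position 10.
genome = "A" * 10 + "CCGG" + "T" * 10
start, end, region = primer_design_window(genome, "CCGG", flank=3)
```

With a 3 bp flank on this toy genome the window is `AAACCGGTTT`; with the 500 bp flank described in the text the window is simply clipped to the whole 24 bp sequence.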
In some embodiments, the verifying the specificity of the third gene fragment using the target primer, the first genome of the target species, and the second genome of the non-target species specifically comprises:
amplifying the first genome by using the target primer to obtain a first amplified product, and amplifying the second genome to obtain a second amplified product;
obtaining gel electrophoresis patterns and sequencing data (e.g., Sanger sequencing) based on the first amplification product and the second amplification product;
Determining the specificity of the third gene fragment based on the gel electrophoresis pattern and the sequencing data.
Specificity is established when the target primer amplifies the target band only from the first genome and not from the second genome, and the sequencing data completely match only the first genome sequence while differing from the second genome by at least N bases.
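The decision rule for specificity can be written out as a small check over per-sample results. The record type `AmplificationResult` and its fields are hypothetical conveniences for recording what is read off the gel and the sequencing trace; they are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AmplificationResult:
    species: str
    is_target: bool
    band_present: bool          # target band visible on the gel
    mismatches: Optional[int]   # base differences vs. the candidate; None if not sequenced


def specificity_established(results: list[AmplificationResult], n: int = 3) -> bool:
    """Apply the rule: band and perfect match in target only, >= n differences elsewhere."""
    for r in results:
        if r.is_target:
            if not r.band_present or r.mismatches != 0:
                return False
        else:
            if r.band_present:
                return False
            if r.mismatches is not None and r.mismatches < n:
                return False
    # At least one target-species sample must have been tested.
    return any(r.is_target for r in results)


results = [
    AmplificationResult("target species", True, True, 0),
    AmplificationResult("closely related species 1", False, False, None),
    AmplificationResult("closely related species 2", False, False, 5),
]
verified = specificity_established(results, n=3)
```

Any band in a non-target sample, or an imperfect match in the target sample, causes the candidate to be rejected.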
In some embodiments, the target sequence sample satisfies a plurality of filtering rules, wherein the filtering rules are statistically derived based on a plurality of the target sequence samples;
the dataset also includes at least one of a first class of sequence samples (corresponding to difficult negative samples) and a second class of sequence samples (corresponding to negative samples);
the first type sequence samples meet the plurality of filtering rules and do not belong to the target sequence, and the second type sequence samples are random sequences or meet part of the plurality of filtering rules.
In some embodiments, the ratio of the number of target sequence samples, the first class of sequence samples, and the second class of sequence samples is 1:0.8-1.0:0.05-0.2.
It should be noted that, the screening method according to the embodiments of the present disclosure may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present disclosure, the devices interacting with each other to accomplish the methods.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In order to make the technical scheme of the present disclosure clearer and easier to understand, the following describes in detail a target sequence screening method based on artificial intelligence species identification provided by the present disclosure with reference to the accompanying drawings and specific examples.
The experimental methods used in the following examples are conventional methods, unless specifically indicated, and are carried out according to techniques or conditions described in the literature in the field or according to the product specifications. Materials, reagents and the like used in the examples described below are commercially available unless otherwise specified.
Example 1
1. Material
Seventeen published whole-genome datasets of Colletotrichum fructicola and 24 published whole-genome datasets of Colletotrichum siamense, both fungi of the genus Colletotrichum, were downloaded from GenBank.
Table 1. Genome data downloaded from GenBank
2. Target sequence prediction for artificial intelligence-based species identification
1) Dividing the genome data into 25-mer fragments (k-mers with k = 25), building an inverted index, and recording the number of times each 25-mer fragment occurs in the genomes of different individuals of the species;
2) Retaining the 25-mer fragments from step 1) that occur in 80% of the different individuals within the species and submitting them to the trained target sequence recognition model, which predicts whether each 25-mer fragment is a species-specific target sequence and outputs a prediction probability;
3) Selecting the 20,000 sequences with the highest species-specific target sequence prediction probability scores from step 2), building a query library, running a BLAST alignment against a local nucleic acid database, and outputting the alignment results. FIG. 2A shows the GenBank alignment result for the species-specific target sequence of Colletotrichum fructicola in Example 2 of the present disclosure, and FIG. 2B shows the GenBank alignment result for the species-specific target sequence of Colletotrichum siamense in Example 2 of the present disclosure;
4) Based on the alignment results from step 3), screening the 25-mers that differ by 3 or more bases from every species other than the target species as species-specific candidate target sequences of that species. This yielded 14 species-specific candidate target sequences for Colletotrichum fructicola and 5 for Colletotrichum siamense.
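The four steps above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the implementation used in the disclosure: the names `build_inverted_index`, `conserved_kmers`, `top_n`, and `differs_from_all` are hypothetical, the 80% criterion is interpreted as "at least 80% of individuals", the trained recognition model is replaced by caller-supplied scores, and the BLAST step is replaced by a plain list of equal-length hits from other species. Reverse-complement strands are ignored.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

K = 25  # k-mer length used in Example 1


def build_inverted_index(genomes: Dict[str, str], k: int = K) -> Dict[str, Counter]:
    """Map each k-mer to a Counter of {individual: occurrences in that genome}."""
    index: Dict[str, Counter] = defaultdict(Counter)
    for individual, sequence in genomes.items():
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]][individual] += 1
    return index


def conserved_kmers(index: Dict[str, Counter], n_individuals: int,
                    min_percent: int = 80) -> List[str]:
    """Keep k-mers present in at least min_percent of individuals (assumed reading of step 2)."""
    return sorted(
        kmer for kmer, owners in index.items()
        if len(owners) * 100 >= min_percent * n_individuals
    )


def top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Return the n k-mers with the highest prediction probability (step 3)."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def differs_from_all(kmer: str, other_species_hits: Iterable[str],
                     min_diff: int = 3) -> bool:
    """Step 4: True if the k-mer differs by >= min_diff bases from every non-target hit."""
    return all(hamming(kmer, hit) >= min_diff for hit in other_species_hits)
```

In practice the scores passed to `top_n` would come from the trained target sequence recognition model, and the hits passed to `differs_from_all` would be parsed from the BLAST alignment output.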
Example 2
This example confirms that the specific candidate target sequences screened in Example 1 are in fact present only in the target species, so that their theoretical properties hold in practical applications. Specific primer pairs were designed based on the species-specific candidate target sequences of Colletotrichum fructicola and Colletotrichum siamense, and PCR amplification and Sanger sequencing verification were carried out on the target species and on other closely related species.
1. Material
The Colletotrichum strains were purchased from the China General Microbiological Culture Collection Center and North Nata-invasive Biological Technology Co., Ltd. Specific species information is as follows:
Table 2. Sample information for the Colletotrichum strains
2. Experimental procedure
2.1 DNA extraction
Genomic DNA of the Colletotrichum test samples was extracted using a kit.
2.2 PCR amplification
Each specific candidate target sequence screened in Example 1 was extended by 100 bp upstream and downstream, and the extended sequence was used for primer design. All samples to be tested were then amplified by PCR with the designed primer pairs. The 25 μL PCR system contained 12.5 μL of 2× Taq PCR Master Mix, 1 μL each of the forward and reverse primers (10 μmol/L), 2 μL of DNA template (about 20 ng), and 8.5 μL of ddH2O. The PCR program was 95 °C for 3 min; 30 cycles of 95 °C for 30 s, 56 °C for 30 s, and 72 °C for 30 s; and a final extension at 72 °C for 10 min.
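The flank extension and the reaction volumes can be checked with a short sketch. The function `extend_target` is a hypothetical helper that takes the first occurrence of the target on the given strand and clips the flanks at the genome ends; the disclosure does not describe how the extension was computed, so this is only one plausible reading. The volumes are those stated for the 25 μL PCR system.

```python
from decimal import Decimal


def extend_target(genome: str, target: str, flank: int = 100) -> str:
    """Return the target plus up to `flank` bases on each side (first occurrence only)."""
    start = genome.find(target)
    if start < 0:
        raise ValueError("target not found in genome")
    left = max(0, start - flank)
    right = min(len(genome), start + len(target) + flank)
    return genome[left:right]


# Component volumes of the PCR system, in microlitres, as listed in section 2.2.
PCR_MIX_UL = {
    "2x Taq PCR Master Mix": Decimal("12.5"),
    "forward primer (10 umol/L)": Decimal("1"),
    "reverse primer (10 umol/L)": Decimal("1"),
    "DNA template (about 20 ng)": Decimal("2"),
    "ddH2O": Decimal("8.5"),
}


def total_volume_ul(mix: dict) -> Decimal:
    """Sum the component volumes; Decimal avoids floating-point rounding."""
    return sum(mix.values(), Decimal("0"))
```

With a 25-mer target and full 100 bp flanks on both sides, the extended template for primer design is 225 bp long, and the listed components sum to the stated 25 μL.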
2.3 Agarose gel electrophoresis
Amplification product specificity and fragment length distribution were assessed by 1.5% agarose gel electrophoresis (120 V, 50 min) against a DL1000 molecular weight standard.
2.4 Sanger sequencing
Bidirectional Sanger sequencing was performed on the specific target band and on all visible amplified bands in the agarose gel image. A specific candidate target sequence was considered valid, and usable as a target sequence for species identification, if the sequencing result matched it exactly only in the target species, and if each non-target species either produced no amplified band or produced visible bands whose sequences differed from the specific candidate target sequence by 3 or more bases.
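The acceptance rule above can be expressed as a small decision function. This is a simplified sketch under stated assumptions: reads are plain strings on the same strand as the candidate, comparison is ungapped, a non-target species with no band is represented by simply having no read, and the names `min_window_differences` and `candidate_is_valid` are hypothetical.

```python
from typing import List, Optional


def min_window_differences(candidate: str, read: str) -> Optional[int]:
    """Smallest mismatch count between the candidate and any same-length window of the read.

    Returns None when the read is shorter than the candidate.
    """
    k = len(candidate)
    if len(read) < k:
        return None
    return min(
        sum(a != b for a, b in zip(candidate, read[i:i + k]))
        for i in range(len(read) - k + 1)
    )


def candidate_is_valid(candidate: str, target_reads: List[str],
                       non_target_reads: List[str], min_diff: int = 3) -> bool:
    """Apply the rule: exact match in the target species, >= min_diff differences elsewhere."""
    # Every sequenced band from the target species must contain the candidate exactly.
    if not target_reads or any(candidate not in read for read in target_reads):
        return False
    # Each visible band from a non-target species must differ by at least min_diff bases.
    for read in non_target_reads:
        differences = min_window_differences(candidate, read)
        if differences is not None and differences < min_diff:
            return False
    return True
```

A real pipeline would first merge forward and reverse reads and would typically use a gapped aligner; the ungapped window scan here is only meant to make the threshold logic concrete.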
The one specific target sequence finally screened for Colletotrichum fructicola and its primer pair are as follows:
Cfr_Target:5’-TTTCAGATTCTAAGCCTACCCTACT-3’,SEQ ID NO:1;
Cfr_F:5’-GAACAAGGAAATCCAGGCCCTACTC-3’,SEQ ID NO:2;
Cfr_R:5’-ATAATCAGGCTTTGCGTGGCTGTAG-3’,SEQ ID NO:3;
The one specific target sequence finally screened for Colletotrichum siamense and its primer pair are as follows:
Csi_Target:5’-TTTCCTGACGAATGGACATGTTGCG-3’,SEQ ID NO:4;
Csi_F:5’-TTTCCAGTCCGGCTCAGTGTATTGG-3’,SEQ ID NO:5;
Csi_R:5’-TGAAAGTCCGTCGAAGTTCAATGGC-3’,SEQ ID NO:6;
Fig. 2C shows the Sanger sequencing results for the species-specific target sequence of Colletotrichum fructicola in Example 2 of the present disclosure, and Fig. 2D shows the Sanger sequencing results for the species-specific target sequence of Colletotrichum siamense in Example 2 of the present disclosure.
It should be noted that, because the CRISPR-Cas12a detection technology is used below to support accurate identification and rapid detection of Colletotrichum fructicola and Colletotrichum siamense, the specific target sequences for these two species have 3 or more base differences from non-target species in the region outside the protospacer adjacent motif (PAM).
It should be appreciated that the PAM need not be considered when other identification and rapid detection techniques are used, such as sequencing, droplet digital PCR (ddPCR), or real-time fluorescent quantitative PCR (qPCR).
The above results show that the target sequences obtained by the artificial intelligence-based target sequence screening method for species identification provided by the embodiments of the present disclosure can be used to identify Colletotrichum species, and that the method is feasible.
Example 3
In this example, the CRISPR-Cas12a detection technology is used to support accurate identification and rapid detection of Colletotrichum fructicola and Colletotrichum siamense.
1. Material
As in example 2.
2. Experimental procedure
2.1 DNA extraction
As in example 2.
2.2 PCR amplification
As in example 2.
2.3 Detection of species-specific target sequences of Colletotrichum species based on the CRISPR/Cas12a gene editing system
A crRNA was designed for Colletotrichum fructicola: Cfr_crRNA, 5'-UAAUUUCUACUAAGUGUAGAUAGAUUCUAAGCCUACCCUACU-3', SEQ ID NO: 7. A crRNA was designed for Colletotrichum siamense: Csi_crRNA, 5'-UAAUUUCUACUAAGUGUAGAUCUGACGAAUGGACAUGUUGCG-3', SEQ ID NO: 8. 10 μL of the PCR product obtained in step 2.2 was mixed with 1.65 μL of crRNA (300 nM), 5 μL of 10× NEBuffer 2.1, 1 μL of EnGen Lba Cas12a (Cpf1) (20 nM), and 30.35 μL of ddH2O, and the mixture was incubated at 37 °C for 10 min. After removal, 2 μL of Poly_C_FQ (5' 6-FAM/CCCCCCCCCC/3' BHQ-1, SEQ ID NO: 9) was added. Fluorescence was then measured on a microplate reader at 37 °C at 0, 5, 10, 15, and 20 min (λex 483 nm/λem 535 nm), or observed directly using a blue light transilluminator.
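The relationship between the target sequences and the crRNAs can be reproduced with a short sketch. It rests on two observations from the listed sequences rather than on a stated design rule: SEQ ID NO: 7 and SEQ ID NO: 8 share the same 21-nt 5' prefix (treated here as the Cas12a direct repeat), and the remainder of each crRNA equals the corresponding target sequence without its first four bases, written as RNA. The function name `design_crrna` is hypothetical.

```python
from decimal import Decimal

# 21-nt prefix shared by SEQ ID NO: 7 and SEQ ID NO: 8 (treated as the direct repeat).
DIRECT_REPEAT = "UAAUUUCUACUAAGUGUAGAU"
PAM_LENGTH = 4  # the first four bases of each target (TTTC) are excluded from the spacer


def design_crrna(target_dna: str) -> str:
    """Build a crRNA as direct repeat + spacer, where the spacer is the target minus its PAM."""
    spacer = target_dna[PAM_LENGTH:].upper().replace("T", "U")
    return DIRECT_REPEAT + spacer


# Component volumes of the Cas12a reaction, in microlitres, as listed in section 2.3.
CAS12A_MIX_UL = {
    "PCR product": Decimal("10"),
    "crRNA (300 nM)": Decimal("1.65"),
    "10x NEBuffer 2.1": Decimal("5"),
    "EnGen Lba Cas12a (20 nM)": Decimal("1"),
    "ddH2O": Decimal("30.35"),
    "Poly_C_FQ reporter": Decimal("2"),
}


def total_volume_ul(mix: dict) -> Decimal:
    """Sum the component volumes; Decimal avoids floating-point rounding."""
    return sum(mix.values(), Decimal("0"))
```

Applied to SEQ ID NO: 1 and SEQ ID NO: 4, `design_crrna` reproduces SEQ ID NO: 7 and SEQ ID NO: 8 respectively, and the listed components give a 48 μL pre-incubation mix and a 50 μL final reaction after the reporter is added.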
In this example, Colletotrichum fructicola and Colletotrichum siamense were selected as the target species, and the remaining test samples served as closely related species.
Fig. 3A shows the microplate reader detection results for Colletotrichum fructicola and other closely related Colletotrichum species in Example 3 of the present disclosure, and Fig. 3B shows the microplate reader detection results for Colletotrichum siamense and other closely related Colletotrichum species in Example 3 of the present disclosure. As shown in Figs. 3A and 3B, the fluorescence value of the target species was significantly higher than that of the other species and of the control group (CK) (P < 0.01).
Fig. 4A shows the visual fluorescence detection results for Colletotrichum fructicola and other closely related Colletotrichum species in Example 3 of the present disclosure, and Fig. 4B shows the visual fluorescence detection results for Colletotrichum siamense and other closely related Colletotrichum species in Example 3 of the present disclosure. Under naked-eye observation, only the target species showed a strong fluorescent signal.
These experimental results show that the two detection approaches, microplate reader detection and visual fluorescence detection, gave consistent evidence, demonstrating that the technical system of the present disclosure can meet the requirements of accurate identification and rapid detection of the target species.
It will be appreciated by persons skilled in the art that the foregoing discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples. Within the spirit of the disclosure, the features of the above embodiments or of different embodiments may be combined, the steps may be implemented in any order, and many other variations of the different aspects of the disclosed embodiments exist that are not described in detail for the sake of brevity.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the disclosure, are intended to be included within the scope of the disclosure.

Claims (10)

1. An artificial intelligence-based method for screening target sequences for species identification, characterized by comprising the following steps:
acquiring and determining a plurality of first gene fragments based on whole-genome data of the target species;
predicting the target sequence probability of the plurality of first gene fragments using a target sequence recognition model, and outputting a prediction probability for each first gene fragment;
screening the plurality of first gene fragments according to the prediction probability and a preset first screening condition to obtain a plurality of second gene fragments;
screening the plurality of second gene fragments based on a pre-constructed nucleic acid database to obtain a third gene fragment;
designing a target primer from the third gene fragment, and verifying the specificity of the third gene fragment using the target primer, a first genome of the target species, and a second genome of a non-target species;
wherein the target sequence recognition model is obtained by training an initial model on a dataset;
wherein the dataset comprises target sequence samples and whole-genome samples of the species to which the target sequence samples correspond.
2. The method of claim 1, wherein the acquiring and determining of a plurality of first gene fragments based on whole-genome data of the target species specifically comprises:
acquiring whole-genome data of a plurality of individuals of the target species, and segmenting each set of whole-genome data to obtain a plurality of fourth gene fragments;
establishing an inverted index based on the plurality of fourth gene fragments and each individual, and counting the number of times each fourth gene fragment appears in different individuals;
screening the plurality of fourth gene fragments based on a preset second screening condition and the number of times to obtain the plurality of first gene fragments.
3. The method of claim 1, wherein the preset first screening condition comprises a first ranking range;
the screening of the plurality of first gene fragments according to the prediction probability and the preset first screening condition to obtain a plurality of second gene fragments specifically comprises:
ranking the plurality of first gene fragments based on the prediction probability;
in response to determining that a first gene fragment falls within the first ranking range, determining that first gene fragment to be a second gene fragment.
4. The method according to claim 1, wherein the screening of the plurality of second gene fragments based on the pre-constructed nucleic acid database to obtain a third gene fragment specifically comprises:
aligning each of the second gene fragments with nucleic acid sequences in the nucleic acid database;
in response to determining that any one of the second gene fragments has at least N base differences from every nucleic acid sequence in the nucleic acid database that is equal in length to that second gene fragment and belongs to a species other than the target species, determining that second gene fragment to be the third gene fragment;
wherein N ≥ 3.
5. The method of claim 1, wherein the target sequence samples satisfy a plurality of filtering rules, wherein the filtering rules are statistically derived based on a plurality of the target sequence samples;
the dataset also includes at least one of a first type of sequence sample and a second type of sequence sample;
the first type sequence samples meet the plurality of filtering rules and do not belong to the target sequence, and the second type sequence samples are random sequences or meet part of the plurality of filtering rules.
6. The method of claim 5, wherein the ratio of the number of the target sequence samples, the first type of sequence samples, and the second type of sequence samples is 1:0.8-1.0:0.05-0.2.
7. The method of claim 1, wherein the initial model comprises a feature extraction module and a classification module, wherein the feature extraction module is a pre-trained model based on a Transformer architecture.
8. A kit for species identification, characterized in that the kit comprises a primer sequence designed for a target sequence selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID NO. 1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID NO. 4.
9. Use of a target sequence selected from at least one of 5'-TTTCAGATTCTAAGCCTACCCTACT-3', SEQ ID No. 1 and 5'-TTTCCTGACGAATGGACATGTTGCG-3', SEQ ID No.4 for species identification.
10. The kit of claim 8 or the use of claim 9, wherein SEQ ID NO: 1 is used for identifying Colletotrichum fructicola and SEQ ID NO: 4 is used for identifying Colletotrichum siamense.
CN202511335015.8A 2025-09-18 2025-09-18 Artificial intelligence-based target sequence screening methods, kits, and applications for species identification Pending CN121237196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511335015.8A CN121237196A (en) 2025-09-18 2025-09-18 Artificial intelligence-based target sequence screening methods, kits, and applications for species identification

Publications (1)

Publication Number Publication Date
CN121237196A true CN121237196A (en) 2025-12-30

Family

ID=98143785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511335015.8A Pending CN121237196A (en) 2025-09-18 2025-09-18 Artificial intelligence-based target sequence screening methods, kits, and applications for species identification

Country Status (1)

Country Link
CN (1) CN121237196A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322931A (en) * 2019-05-29 2019-10-11 南昌大学 A kind of base recognition methods, device, equipment and storage medium
CN112331268A (en) * 2020-10-19 2021-02-05 成都基因坊科技有限公司 Method for obtaining specific sequence of target species and method for detecting target species
CN114317792A (en) * 2022-01-11 2022-04-12 湖南大学 Screening method and application of 16S rRNA gene specificity detection target fragment of bacterial species
CN115087750A (en) * 2022-03-30 2022-09-20 中国医学科学院药用植物研究所 Eukaryotic organism species identification method based on whole genome analysis and application
CN116030881A (en) * 2022-12-13 2023-04-28 北京邮电大学 Gene and gene cluster function prediction method and device based on artificial intelligence
CN116083602A (en) * 2023-01-10 2023-05-09 中国医学科学院药用植物研究所 Species-specific target sequence for identifying deer based on time-base method, kit and application
CN116665777A (en) * 2023-05-15 2023-08-29 予果生物科技(北京)有限公司 Primer design method, system and storage medium based on primer-template binding ability
CN120041596A (en) * 2024-12-26 2025-05-27 中国医学科学院药用植物研究所 Specific target sequence, primer pair, detection method and kit for identifying Alternaria species based on time-series method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU ZH等: "DNABERT-S: pioneering species differentiation with species-aware DNA embeddings", BIOINFORMATICS, 20 July 2025 (2025-07-20), pages 4 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination