WO2023087277A1 - Sequence variation analysis method, system and storage medium - Google Patents
Sequence variation analysis method, system and storage medium
- Publication number
- WO2023087277A1 (application PCT/CN2021/131904)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variation
- data
- sequence
- site
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- The present disclosure relates to the technical field of gene detection, and in particular to a sequence variation analysis method, system, and storage medium.
- NGS: Next Generation Sequencing.
- Gene detection procedures generally include sample processing, gene sequencing, variant identification, variant annotation, variant interpretation, variant verification, and test reporting. At present, variant interpretation mainly has the following problems:
- Results obtained by manual or automatic interpretation may contain subjective differences between interpreters or interpretation errors, so they need to be rechecked. Because the variations that need review cannot be located quickly and accurately, the review workload is heavy and incurs unnecessary labor costs.
- The disclosure proposes a sequence variation analysis method, system, and storage medium that quickly locate the variations needing review, reduce labor costs, improve review efficiency, and improve the accuracy of interpretation in genetic testing reports.
- The present disclosure proposes a sequence variation analysis method, which includes the following steps: obtaining sequence variation data to be analyzed; performing feature extraction on the sequence variation data to be analyzed to obtain a first variation feature set and a second variation feature set; inputting the first variation feature set into a trained first phenotype relationship prediction model to obtain a first phenotype relationship prediction result, and inputting the second variation feature set into a trained second phenotype relationship prediction model to obtain a second phenotype relationship prediction result; and taking the union of the first phenotype relationship prediction result and the second phenotype relationship prediction result to obtain a third phenotype relationship prediction result.
- The present disclosure provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above sequence variation analysis method is implemented.
- The present disclosure proposes a sequence variation analysis system, including: an acquisition module for acquiring sequence variation data to be analyzed; and a first analysis module for performing feature extraction on the sequence variation data to be analyzed to obtain the first variation feature set and the second variation feature set, inputting the first variation feature set into the trained first phenotype relationship prediction model to obtain the first phenotype relationship prediction result, inputting the second variation feature set into the trained second phenotype relationship prediction model to obtain the second phenotype relationship prediction result, and taking the union of the first phenotype relationship prediction result and the second phenotype relationship prediction result to obtain a third phenotype relationship prediction result.
- Fig. 1 is a flow chart of the sequence variation analysis method of the first embodiment of the present disclosure.
- Fig. 2 is a schematic diagram of the structure and application of a prediction model according to an embodiment of the present disclosure.
- Fig. 3 is a flow chart of the sequence variation analysis method of the second embodiment of the present disclosure.
- Fig. 4 is a flow chart of the sequence variation analysis method of the third embodiment of the present disclosure.
- Fig. 5 is a flow chart of the sequence variation analysis method of the fourth embodiment of the present disclosure.
- Fig. 6 is a schematic diagram of the results of the variation type prediction analysis of an example of the present disclosure.
- Fig. 7 is a schematic diagram of the prediction result of phenotypic relationship in an example of the present disclosure.
- Fig. 8 is a structural block diagram of a sequence variation analysis system according to an embodiment of the present disclosure.
- Fig. 1 is a flow chart of the sequence variation analysis method of the first embodiment of the present disclosure.
- The sequence variation analysis method includes the following steps:
- The unique identifier of each variation in the sequence variation data to be analyzed is used for data lookup and processing.
- The unique identifier has a definite, one-to-one mapping to the variation.
- "Chromosome number-genome coordinate-reference sequence-altered sequence" (chr-pos-ref-alt, abbreviated cpra) is used as the unique identifier of a variation, where chr is the chromosome number, pos is the physical position on the reference sequence, ref is the reference sequence, and alt is the variant sequence.
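The cpra identifier described above can be sketched as a small key-building helper. This is a minimal illustration, not the disclosure's implementation: the `Variant` type, the `cpra_key` function, and the normalization choices (stripping a "chr" prefix, upper-casing bases) are assumptions introduced here.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str  # chromosome number, e.g. "1" or "chr1"
    pos: int    # physical position on the reference sequence
    ref: str    # reference sequence
    alt: str    # variant (altered) sequence


def cpra_key(v: Variant) -> str:
    """Build a chr-pos-ref-alt identifier.

    Assumption: the "chr" prefix is dropped and bases are upper-cased so that
    the same variation always maps to the same key.
    """
    chrom = v.chrom[3:] if v.chrom.lower().startswith("chr") else v.chrom
    return f"{chrom}-{v.pos}-{v.ref.upper()}-{v.alt.upper()}"


# Index per-variant records by cpra so later steps can look them up directly.
records = {cpra_key(Variant("chr1", 12345, "a", "G")): {"note": "hypothetical record"}}
print(records[cpra_key(Variant("1", 12345, "A", "g"))])
```

Using one normalized string as the key keeps lookups across databases and result tables consistent, which is the role the text assigns to the unique identifier.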
- The sequence variation data to be analyzed includes information on at least one variation site.
- The variation site information includes at least the variation site and the unique identifier of the variation site.
- The variation site information can be obtained through bioinformatics analysis, which can include various routine steps such as quality control, filtering, and alignment.
- As for the technical means of obtaining the data, specific sequence information can be obtained, for example, through various sequencing technologies.
- Each first variation feature value in the first variation feature set is a published phenotypic relationship determination result for a variation site.
- Each second variation feature value in the second variation feature set is allele frequency data or function prediction data for a variation site.
- The aforementioned phenotypic relationship refers to the relationship between the variation and the clinical phenotype, and more precisely to the pathogenicity of the variation.
- The first variation feature values can come from the records of various public databases, such as HGMD (http://www.hgmd.cf.ac.uk/ac/index.php), ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/), and LOVD (https://www.lovd.nl/3.0/search).
- The source of the first variation feature values is not limited, either in which databases are used or in how many.
- The phenotypic relationship determination results recorded in each database are usually based on that database's own rules and are often described in its own fields. Taking HGMD as an example, its phenotypic relationship determination results include DM (disease-causing mutation), DM?
- The first variation features can be extracted from the public databases by using the unique identifier (such as cpra) of each variation in the sequence variation data to be analyzed.
- The allele frequency data in the second variation feature values can come from the records of various public databases, such as gnomAD (https://gnomad.broadinstitute.org/), EVS (http://evs.gs.washington.edu/EVS), and the 1000 Genomes Project (http://browser.1000genomes.org).
- The source of the allele frequency data in the second variation feature values is not limited, either in which databases are used or in how many.
- The function prediction data in the second variation feature values can come from the prediction results of various existing function prediction software, and one or more specific indicators can be selected from those results as feature values.
- When a public database already includes the function prediction data, the corresponding function prediction data can be extracted directly from that database.
- The source of the function prediction data in the second variation feature values is not limited, either in which software and how many programs are used or in which indicators and how many are selected.
- The prediction data of the function prediction software may include, but is not limited to, at least one of protein conservation prediction data, nucleic acid conservation prediction data, splicing harmfulness prediction data, and variation site harmfulness prediction data.
- The second variation features can be extracted from the public databases by using the unique identifier (such as cpra) of each variation in the sequence variation data to be analyzed.
- After step S4, the method may further include:
- The variation interpretation result is the phenotypic relationship determination result for each variation site in the sequence variation data to be analyzed, obtained by a method different from the one that produced the third phenotype relationship prediction result.
- Manual interpretation results are a common type of such variation interpretation results.
- Conventional means in this field can be used to add custom marks to the variation interpretation results judged credible and to those judged not credible, for subsequent data processing.
- This solution can quickly judge whether each of a large number of variation interpretation results is credible, so that later stages can handle them by type. For example, a review, usually a manual one, may be performed only on the variation sites whose interpretation results are judged not credible. This greatly reduces the workload of later stages, reduces labor costs, and improves interpretation efficiency.
- The technical solution comprising steps S1 to S4 predicts with two relatively independent models (i.e., the first phenotype relationship prediction model and the second phenotype relationship prediction model), each matched to a different data type, and the two models are complementary.
- The first phenotype relationship prediction model can compensate for false negatives that the second phenotype relationship prediction model may produce because it lacks functional data of variation sites, co-segregation data, and new data.
- The second phenotype relationship prediction model can compensate for false negatives that the first phenotype relationship prediction model produces for variation sites with no published phenotypic relationship determination result.
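The source does not spell out how the union is represented, so the following is only one possible reading: each model returns a map from cpra identifier to predicted label, and the third result covers every variation scored by either model, keeping all labels produced. The names `Prediction` and `union_predictions` are hypothetical.

```python
from typing import Dict, Set

Prediction = Dict[str, str]  # cpra identifier -> predicted phenotypic relationship


def union_predictions(first: Prediction, second: Prediction) -> Dict[str, Set[str]]:
    """Merge two per-variant prediction maps, keeping every label either model produced."""
    merged: Dict[str, Set[str]] = {}
    for result in (first, second):
        for cpra, label in result.items():
            merged.setdefault(cpra, set()).add(label)
    return merged


first_result = {"1-100-A-G": "pathogenic"}
second_result = {"1-100-A-G": "pathogenic", "2-200-C-T": "benign"}
third_result = union_predictions(first_result, second_result)
```

Under this reading, a variation missing from the public databases is still covered through the second model, and one lacking usable frequency or function prediction data is still covered through the first.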
- Both the first phenotype relationship prediction model and the second phenotype relationship prediction model can be built on machine learning models. The construction method is as follows:
- Construct a data set comprising a first data set and a second data set.
- The first data set consists of the first variation features of the selected variation sites and the credible phenotypic relationship determination results of those sites.
- The second data set consists of the second variation features of the selected variation sites and the credible phenotypic relationship determination results of those sites.
- The selected variation sites are a plurality of variation sites that together cover all types of phenotypic relationship determination result.
- The credible phenotypic relationship determination result may be the result of manual determination according to industry guidelines or consensus; it may also be a result on which multiple published phenotypic relationship determinations for the variation site agree.
- The first machine learning model and the second machine learning model can each be selected from at least one of logistic regression, naive Bayes, support vector machine, and artificial neural network.
- The first phenotype relationship prediction model is a logistic regression model.
- The second phenotype relationship prediction model is a neural network model.
- The first phenotype relationship prediction model is used to quickly determine, under a unified standard, the pathogenicity of variations recorded in public databases.
- The construction process of the first phenotype relationship prediction model includes:
- The first data set includes the pathogenicity determination results recorded in public variation databases and the pathogenicity determination results made by interpretation experts.
- The public variation databases may include at least one of ClinVar, HGMD, LOVD, UMD (http://umd-predictor.eu/), and other databases.
- Variation annotation tools can be used to annotate the variation data collected from each public variation database to obtain cpra information, which serves as the unique identifier of the variation.
- The variation data of each public variation database is aggregated.
- Interpretation experts combine population frequency databases, prediction software, and medical literature to make a pathogenicity determination for the same variation. That determination is the credible phenotypic relationship determination result and is used as the standard result of the training data.
- The disclosure converts the character data in the first data set into numerical data, as shown in Table 2.
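Converting per-database text labels to numbers could look like the sketch below. Table 2 is not reproduced in this text, so the numeric codes, the `HGMD_CODES` dictionary, and the `encode_record` helper are all hypothetical placeholders rather than the disclosure's actual mapping.

```python
from typing import Dict, List, Optional

# Hypothetical codes only: the real mapping is given in the source's Table 2.
# None stands for "variation not recorded in this database".
HGMD_CODES: Dict[Optional[str], int] = {None: 0, "DM?": 1, "DM": 2}


def encode_record(record: Dict[str, Optional[str]],
                  databases: List[str],
                  codebooks: Dict[str, Dict[Optional[str], int]]) -> List[int]:
    """Turn one variation's per-database text labels into a numeric feature vector.

    Labels missing from a codebook fall back to 0 (an assumption).
    """
    return [codebooks[db].get(record.get(db), 0) for db in databases]


features = encode_record({"HGMD": "DM"}, ["HGMD"], {"HGMD": HGMD_CODES})
```

Keeping one codebook per database reflects the point that each database describes its determinations in its own fields.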
- The first preset ratio may be 8:2.
- The first machine learning model is at least one of machine learning models such as logistic regression, naive Bayes, support vector machine, and artificial neural network.
- The first phenotype relationship prediction model is meant to output a single pathogenicity determination for a variation from the pathogenicity determinations of multiple different public variation databases, and the pathogenicity determination of a variation is a relatively complex feature.
- Machine learning models such as logistic regression, naive Bayes, support vector machine, and artificial neural network can be selected to realize this function.
- Several different machine learning models can be selected and trained independently on the first training set, and the performance of each trained model is then evaluated on the first test set. The evaluation metrics can include accuracy, precision, recall, and so on.
- Based on the performance evaluation results, the best-performing model is selected as the first phenotype relationship prediction model used in this disclosure. For example, the trained model with the highest accuracy can be used as the first phenotype relationship prediction model.
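The 8:2 split, the three metrics, and selection by accuracy can be sketched with the standard library alone. This is an illustrative outline under stated assumptions: candidate models are represented as plain prediction callables, and `split_dataset`, `evaluate`, and `select_best` are names introduced here, not functions from the disclosure.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[List[int], int]  # (feature vector, label)


def split_dataset(data: Sequence[Sample], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle and split into training and test sets (8:2 by default)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> Dict[str, float]:
    """Accuracy, plus precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def select_best(models: Dict[str, Callable[[List[int]], int]],
                test_set: Sequence[Sample]) -> str:
    """Return the name of the candidate with the highest test-set accuracy."""
    y_true = [label for _, label in test_set]
    scores = {
        name: evaluate(y_true, [predict(x) for x, _ in test_set])["accuracy"]
        for name, predict in models.items()
    }
    return max(scores, key=scores.get)
```

Accuracy is used as the selection criterion only because the text gives it as an example; precision or recall could be substituted in `select_best` where false negatives matter more.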
- The first variation feature value of each variation can be extracted and input into the first phenotype relationship prediction model to obtain the first phenotype relationship prediction result.
- For variations included in the public databases, the first phenotype relationship prediction model outputs prediction results according to the preset phenotypic relationships.
- For the pathogenicity of variants related to human Mendelian genetic diseases, the preset phenotypic relationships usually fall into three classes (pathogenic or likely pathogenic, benign or likely benign, and of unclear significance) or five classes (pathogenic, likely pathogenic, benign, likely benign, and of unclear significance), without limitation.
- The construction process of the second phenotype relationship prediction model may include:
- The second data set includes allele frequency data of variation sites, function prediction data, and credible phenotypic relationship determination results.
- The allele frequency data can be derived from gnomAD, the 1000 Genomes database, the ExAC database, etc.; the function prediction data can be derived from various prediction software, such as SIFT, Polyphen2, MutationTaster, GERP++, and DANN. Unlike the first phenotype relationship prediction model, which is built on published phenotypic relationship determination results, the second phenotype relationship prediction model is built on the raw data used to determine the phenotypic relationship.
- The variation sites in the second data set must have consistent results across multiple published phenotypic relationship determinations.
- The consistent determination result is the credible phenotypic relationship determination result and is used as the standard result of the training data.
- The second preset ratio may be 8:2.
- The second machine learning model may be chosen from machine learning models such as logistic regression, naive Bayes, support vector machine, and artificial neural network.
- The optimal second machine learning model is used as the second phenotype relationship prediction model.
- The second variation feature value of each variation can be extracted and input into the second phenotype relationship prediction model to obtain the second phenotype relationship prediction result.
- The variation interpretation result can be obtained by manual interpretation of the sequence variation data to be analyzed.
- The variation interpretation result can be compared with the corresponding third phenotype relationship prediction result. If the two agree, the variation interpretation result is judged credible; otherwise it is judged not credible. Variation interpretation results that are not credible require further manual review.
- The variation interpretation result is the phenotypic relationship determination result for each variation site in the sequence variation data to be analyzed, obtained by any other method, such as manual interpretation by those skilled in the art.
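The comparison step can be sketched as a lookup keyed by the variation's unique identifier. This is a minimal sketch: `flag_for_review` is a hypothetical helper, both inputs are assumed to map cpra identifiers to a single label, and treating a variation with no prediction as not credible is an assumption made here.

```python
from typing import Dict, List


def flag_for_review(interpretation: Dict[str, str],
                    third_prediction: Dict[str, str]) -> List[str]:
    """Return identifiers whose interpretation result disagrees with the prediction.

    Assumption: a variation without any prediction is also sent to review.
    """
    return [cpra for cpra, label in interpretation.items()
            if third_prediction.get(cpra) != label]


manual = {"1-100-A-G": "pathogenic", "2-200-C-T": "benign", "3-300-G-A": "benign"}
predicted = {"1-100-A-G": "pathogenic", "2-200-C-T": "pathogenic"}
to_review = flag_for_review(manual, predicted)
```

Only the returned identifiers would go to manual review, which is where the reduction in review workload comes from.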
- The sequence variation analysis method may further include:
- The simple repeat region refers to a repeat region composed of repeat units of 1 to 5 bases, such as AAA or CAACAACAACAA.
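A check for whether a sequence consists of tandem copies of a 1 to 5 base unit can be sketched as follows. The function name `is_simple_repeat` and the `min_copies` parameter (at least two copies of the unit) are assumptions; the text gives only the unit length range and the two examples.

```python
def is_simple_repeat(seq: str, max_unit: int = 5, min_copies: int = 2) -> bool:
    """True if seq is made of at least min_copies tandem copies of a 1..max_unit base unit."""
    seq = seq.upper()
    for unit_len in range(1, max_unit + 1):
        copies, remainder = divmod(len(seq), unit_len)
        if remainder or copies < min_copies:
            continue  # unit length does not tile the sequence often enough
        if seq == seq[:unit_len] * copies:
            return True
    return False


print(is_simple_repeat("AAA"), is_simple_repeat("CAACAACAACAA"), is_simple_repeat("ACGT"))
```

In practice the check would be applied to the reference sequence window around a variation site; how that window is chosen is not stated in the text.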
- A weak association with phenotypic change can mean that the variation is located in a non-coding region or an intronic region and that its allele frequency is greater than 0.05.
- A variation located in a non-coding or intronic region is one whose functional annotation is - (no functional annotation), intron, 3′-UTR (untranslated region at the 3′ end), 5′-UTR (untranslated region at the 5′ end), or nochange (no change compared to the reference mRNA sequence).
- The first variation site also meets either of the following conditions:
- Condition 1: the total number of detections of the first variation site is greater than the first preset value, and the number of low-quality detections is greater than the second preset value;
- Condition 2: the site is located within the third preset value upstream or downstream of the reference sequence physical position of a variation site meeting condition 1.
- The first variation site data set containing the first variation sites can be established in advance. Its construction steps are as follows:
- This step is the same as the aforementioned step S1. More specifically, sample sequencing data from a fixed sequencing chip (such as the LCY171 chip) on a gene sequencer (such as the MGISEQ-2000 sequencer) can be collected, and the variation sites and the corresponding unique identifiers can then be obtained from the sample sequencing data.
- The first variation site data set is continuously updated as samples accumulate, so that the sequence variation data to be analyzed can be filtered.
- B1: judge whether each obtained variation site is located in a simple repeat region of the reference sequence, and if so, record the unique identifier of the corresponding variation.
- Variation sites that meet the preset filtering conditions can be variations whose functional annotation is - (no functional annotation), intron, 3′-UTR (untranslated region at the 3′ end), 5′-UTR (untranslated region at the 5′ end), or nochange, and whose frequency (the probability of the variation in the population) is greater than 0.05.
- Filtering can be performed on variation function and variation frequency. The filtering conditions are: the functional annotation is -, intron, 3′-UTR, 5′-UTR, or nochange, and the frequency is greater than 0.05.
- This filter condition is set to select the variations that very probably do not cause phenotypic change, that is, high-frequency variations located in non-exon regions.
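The filter on functional annotation and population frequency can be sketched as a single predicate. The names `NON_CODING_ANNOTATIONS` and `is_weakly_associated` are hypothetical, and normalizing the prime character in "3′-UTR" to an ASCII apostrophe is an assumption about how annotation strings might vary.

```python
# Annotation values listed in the text, written with an ASCII apostrophe.
NON_CODING_ANNOTATIONS = {"-", "intron", "3'-UTR", "5'-UTR", "nochange"}


def is_weakly_associated(annotation: str, allele_frequency: float,
                         threshold: float = 0.05) -> bool:
    """True for high-frequency variations in non-exon regions, which are filtered out."""
    normalized = annotation.strip().replace("′", "'")
    return normalized in NON_CODING_ANNOTATIONS and allele_frequency > threshold


print(is_weakly_associated("intron", 0.12))
```

Both conditions must hold: a rare intronic variation or a common variation with a coding annotation is retained for further analysis.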
- D1: count the number of low-quality detections and the total number of detections of each variation site in the results of C1.
- The number of low-quality detections, the number of high-quality detections, and the total number of detections are counted using the unique identifier of the variation as the index.
- The total number of detections is the sum of the number of low-quality detections and the number of high-quality detections.
- The statistical rules are: 1) non-single-base variation sites whose total number of detections ≥ the first preset value (such as 8) and whose number of low-quality detections ≥ the second preset value (such as 1); 2) non-single-base variation sites that lie within the third preset value (such as 3 bp) upstream or downstream of the reference sequence physical position of a variation site from 1) and are still located in the simple repeat region. Note that in a simple repeat region the position of a variation site relative to the reference sequence can shift. Different databases or analysis software record it differently, but it is in essence the same variation.
- For example, with the reference sequence ATATAT and the variant site sequence AT, the variant site sequence can be matched at the first, third, and fifth positions of the reference sequence. Statistical rule 2) is therefore set to ensure that the first variation sites are included comprehensively.
- The first variation site data set is composed of the variation sites satisfying either of the statistical rules 1) and 2) above.
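The two statistical rules can be sketched as below. Several points are assumptions: the garbled comparison symbols in the source are read as "greater than or equal to", the input sites are taken to be already restricted to simple repeat regions and already filtered by annotation and frequency, and `SiteStats` and `build_first_site_set` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SiteStats:
    cpra: str
    chrom: str
    pos: int           # reference sequence physical position
    ref: str
    alt: str
    low_quality: int   # number of low-quality detections
    high_quality: int  # number of high-quality detections

    @property
    def total(self) -> int:
        return self.low_quality + self.high_quality

    @property
    def is_single_base(self) -> bool:
        return len(self.ref) == 1 and len(self.alt) == 1


def build_first_site_set(sites: Iterable[SiteStats], min_total: int = 8,
                         min_low: int = 1, window: int = 3) -> Set[str]:
    """Collect identifiers of sites satisfying rule 1) or rule 2)."""
    sites = list(sites)
    # Rule 1: frequently detected, sometimes low quality, not a single-base variation.
    rule1 = [s for s in sites
             if not s.is_single_base
             and s.total >= min_total and s.low_quality >= min_low]
    selected = {s.cpra for s in rule1}
    # Rule 2: non-single-base sites within `window` bp of any rule-1 site.
    for s in sites:
        if s.is_single_base or s.cpra in selected:
            continue
        if any(s.chrom == r.chrom and abs(s.pos - r.pos) <= window for r in rule1):
            selected.add(s.cpra)
    return selected
```

Rule 2 is what captures the same variation recorded at a shifted position inside a repeat, as in the ATATAT example.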
- The unique identifier of each variation site in the sequence variation data to be analyzed can first be obtained, and filtering is performed on these identifiers.
- The analysis process may be: load the first variation site data set and search it for each unique identifier in the sequence variation data to be analyzed. If a matching identifier exists, the variation site is a first variation site and is filtered out; otherwise it is retained.
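The filtering pass can be sketched as a membership test against the pre-built data set. `filter_variants` is a hypothetical helper, and holding the data set as a Python set (a hash lookup rather than a full traversal per variation) is a design choice made here, not something the text prescribes.

```python
from typing import Iterable, List, Set


def filter_variants(cpras: Iterable[str], first_site_set: Set[str]) -> List[str]:
    """Keep only variations whose identifier is absent from the first variation site data set."""
    return [c for c in cpras if c not in first_site_set]


retained = filter_variants(["1-100-AT-A", "2-200-C-T"], {"1-100-AT-A"})
```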
- The data used for phenotypic relationship prediction may be the data left after the filtering operation in step S5; that is, steps S2-S4 may be performed after step S5 on the data filtered in step S5. Reducing the data passed to phenotypic relationship prediction avoids unnecessary predictions and improves prediction efficiency.
- The sequence variation analysis method may further include:
- The complex variation may include composite events of insertion and substitution, insertion and deletion, or deletion and substitution, and there is no limit on the number of individual variations within each type of complex variation.
- Conventional means in the art can be used to add custom marks to sites judged to have complex variation and/or sites judged not to have it, for subsequent data processing.
- A variation site with complex variation may satisfy the following condition:
- At least two variation sites that lead to amino acid changes have adjacent reference sequence coordinates and affect the same encoded amino acid.
- When a variation detection tool such as GATK software is used, only a simple substitution event or insertion event is detected; a compound event including both insertion and substitution cannot be detected.
- The present disclosure performs a complex variation judgment on the annotations of the variations detected by the variation detection tool, specifically determining whether variations at adjacent positions in the same gene coding region affect the coding of the same amino acid residue. If a complex variation exists, a review is required, for example manual combined interpretation; otherwise no such review is required.
- The HGVS naming rules are usually used to name the annotation results.
- For variations with amino acid changes, a complex variation is judged to exist when either of the following conditions is met: 1) at least two variations of the same gene have overlapping reference sequence coordinates; 2) at least two variations of the same gene have adjacent reference sequence coordinates and affect the coding of the same amino acid.
- In condition 1), "coordinate" refers to the position on the reference sequence corresponding to the variation, and "overlap" refers to a reference sequence coordinate overlap of ≥1 bp; this condition screens complex variations as comprehensively as possible.
- In condition 2), "adjacent" refers to a range of 4 bp upstream and downstream on the reference sequence; this condition screens complex variations accurately. The complex variation judgment thus makes up for the shortcomings of existing variation detection tools and allows complex variations to be interpreted in combination, which improves the efficiency of variation interpretation.
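The two conditions can be sketched as a pairwise check over annotated variations. This is an illustration under assumptions: coordinates are inclusive reference positions, the affected residue is available as an integer index from annotation, and `CodingVariant`, `overlaps`, `adjacent_same_residue`, and `has_complex_variation` are names introduced here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable


@dataclass(frozen=True)
class CodingVariant:
    gene: str
    start: int    # first reference coordinate covered (inclusive)
    end: int      # last reference coordinate covered (inclusive)
    residue: int  # index of the affected amino acid residue


def overlaps(a: CodingVariant, b: CodingVariant) -> bool:
    """Condition 1: reference coordinates share at least 1 bp."""
    return a.start <= b.end and b.start <= a.end


def adjacent_same_residue(a: CodingVariant, b: CodingVariant, window: int = 4) -> bool:
    """Condition 2: within `window` bp of each other and affecting the same residue."""
    gap = max(a.start, b.start) - min(a.end, b.end)  # positive when not overlapping
    return 0 < gap <= window and a.residue == b.residue


def has_complex_variation(variants: Iterable[CodingVariant], window: int = 4) -> bool:
    """True if any pair of variations in the same gene meets condition 1 or 2."""
    return any(
        a.gene == b.gene and (overlaps(a, b) or adjacent_same_residue(a, b, window))
        for a, b in combinations(list(variants), 2)
    )
```

A sample flagged by this check would be routed to combined manual interpretation rather than having each variation interpreted on its own.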
- The sequence variation data to be analyzed in step S6 may be the sequence variation data to be analyzed from step S1, or it may be sequence variation data that already carries a judgment result from comparing the third phenotype relationship prediction result with the corresponding variation interpretation result.
- The sequence variation analysis method may further include:
- The third variation feature values in the third variation feature set may include variation support data and allele frequency (e.g., ESP6500_MAF, G1000_AF). When the sequence variation data to be analyzed is obtained by sequencing, the variation support data may include quality value, sequencing depth, and the proportion of reads supporting the variation.
- The variation type may include homozygous variation, heterozygous variation, and no variation.
- No variation can be the homozygous reference sequence genotype at the locus.
- The variation type prediction result is determined as follows: compare the prediction probabilities of all variation types and find the variation type with the highest prediction probability; compare that maximum prediction probability with a preset threshold; if it is greater than the preset threshold, the variation type corresponding to the maximum prediction probability is the variation type prediction result of the variation site.
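The decision rule reduces to an arg-max followed by a threshold comparison. In this sketch `call_variation_type` is a hypothetical helper, the default threshold of 0.9 is a placeholder rather than a value from the source, and returning `None` when the threshold is not exceeded is an assumption about how "no confident call" is represented.

```python
from typing import Dict, Optional


def call_variation_type(probabilities: Dict[str, float],
                        threshold: float = 0.9) -> Optional[str]:
    """Return the most probable variation type if its probability exceeds the threshold."""
    best_type = max(probabilities, key=probabilities.get)
    return best_type if probabilities[best_type] > threshold else None


call = call_variation_type({"homozygous": 0.02, "heterozygous": 0.95, "none": 0.03})
```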
- The process of building the variation type prediction model could include:
- A4: construct the third data set, which includes quality value, sequencing depth, proportion of reads supporting the variation, allele frequency, and credible variation type determination results.
- The credible variation type determination result may be obtained with the industry's consensus gold-standard method, for example, Sanger verification.
- The third preset ratio may be 8:2.
- A variation type prediction model is constructed by calculating the prior probabilities and conditional probabilities required by the Bayesian classifier.
- The variation type prediction model is tested with the data in the third test set: the model classifies the test set data, and its accuracy is obtained from the number of correct classifications.
- The variation sites verified by Sanger are selected as training samples, and their relevant data are collected to form the third training set.
- These training samples are divided into K variation types $c_1, c_2, \ldots, c_K$; preprocessing yields the variation data of the variation information, denoted $x_i$;
- The prior probability of each feature value is calculated from the selected training set and can be estimated from the number of occurrences of each kind of sample in the third training set; for example, for the feature value $x_1$ when the variation type is $c_1$, it is estimated from how often $x_1$ occurs among the training samples of type $c_1$.
- $P(x)$ is the evidence factor used for normalization, which is a fixed value for all categories.
- The performance of the above machine learning model is evaluated on the third test set, and the evaluation metrics may include accuracy, precision, and recall.
- The constructed variation type prediction model is applied to the variation to be tested to obtain the probability $P(c_k \mid x)$ of each variation type.
- A quality control threshold is set, and if and only if the maximum probability $P_{\max}(c_k \mid x)$ is greater than the threshold is the corresponding variation type taken as the variation type prediction result.
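The count-based estimation described above can be sketched as a small categorical naive Bayes classifier. This is a generic textbook formulation, not the disclosure's exact model: the class `CountingNaiveBayes` is a hypothetical name, continuous features such as sequencing depth are assumed to have been binned into categories beforehand, and no smoothing is applied, so unseen feature values give a probability of zero.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple

Features = Tuple[Hashable, ...]


class CountingNaiveBayes:
    """Categorical naive Bayes whose probabilities are estimated from training counts."""

    def fit(self, X: Sequence[Features], y: Sequence[str]) -> "CountingNaiveBayes":
        self.class_counts = Counter(y)
        self.total = len(y)
        # feature_counts[class][feature index][value] -> number of occurrences
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        for features, label in zip(X, y):
            for i, value in enumerate(features):
                self.feature_counts[label][i][value] += 1
        return self

    def posterior(self, features: Features) -> Dict[str, float]:
        """Return P(c_k | x) for every class, normalized by the evidence P(x)."""
        joint = {}
        for label, n in self.class_counts.items():
            p = n / self.total  # P(c_k)
            for i, value in enumerate(features):
                p *= self.feature_counts[label][i][value] / n  # P(x_i | c_k)
            joint[label] = p
        evidence = sum(joint.values())  # P(x), identical for all classes
        return {c: (p / evidence if evidence else 0.0) for c, p in joint.items()}


# Toy training set: one binned feature (hypothetical "quality bin") per sample.
model = CountingNaiveBayes().fit(
    [("high",), ("high",), ("low",), ("low",)],
    ["heterozygous", "heterozygous", "none", "heterozygous"],
)
posterior = model.posterior(("low",))
```

The largest posterior would then be compared with the quality control threshold; adding Laplace smoothing would be a reasonable extension when some feature values are rare in the training set.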
- The sequence variation data to be analyzed in step S8 may be the sequence variation data to be analyzed from step S1, or it may be sequence variation data that already carries a judgment result from comparing the third phenotype relationship prediction result with the corresponding variation interpretation result.
- The sequence variation data to be analyzed can also be sequence variation data that already carries complex variation results.
- step S10 it may also include: when the variation type prediction result only includes the reference sequence genotype, then it is determined that the variation site is not credible; when the variation type prediction result contains at least one non-reference sequence genotype, it is determined that the variant site is credible.
- conventional means in the art can be adopted to add custom marks to the variant sites judged to be unreliable and/or credible, so as to facilitate subsequent data processing. This solution can quickly determine whether a large number of mutation sites are credible or unreliable, and deal with them by type in subsequent links. For example, only the variant sites judged to be unreliable are verified.
- variant sites associated with phenotypic changes that are judged to be unreliable are verified, where the variant sites associated with phenotypic changes may be pathogenic variants.
- Gold-standard methods recognized in the field are usually used for verification, such as first-generation (Sanger) sequencing. This solution can greatly reduce the workload of the subsequent verification steps, reduce labor costs, and improve interpretation efficiency.
- the sequence variation analysis method of the present disclosure can quickly determine whether the variation needs to be rechecked, whether experimental verification is required, and the like. For the entire interpretation process, the cost of variation experiment verification is reduced to a certain extent, and the results of manual interpretation or automatic interpretation are analyzed, and the variation that needs to be reviewed is quickly output, which reduces the cost of manual review and ensures the accuracy of the results.
- The filter conditions that can be set in the embodiment are: (1) variants whose function annotation is "-", "intron", "3-UTR", "5-UTR" or "nochange"; (2) variants whose frequency is greater than 0.05, which can be judged using population databases such as G1000, ESP6500 and gnomAD.
- For example, the allele frequency of the variation recorded in the three databases G1000, ESP6500 and gnomAD is greater than 0.05, so this variation is filtered.
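The two filter conditions above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name, the dictionary layout and the `"missense"` label are hypothetical, the two conditions are combined with a logical AND (as claim 4 suggests), and a variant is treated as common only when every database that records it reports a frequency above 0.05.

```python
# Annotation labels treated as weakly related to phenotype (taken from the text).
WEAK_ANNOTATIONS = {"-", "intron", "3-UTR", "5-UTR", "nochange"}
AF_THRESHOLD = 0.05


def is_weakly_associated(function_annotation, population_afs):
    """Return True when a variant meets both filter conditions.

    population_afs maps a database name to the recorded allele frequency,
    or to None when the database has no record (hypothetical layout).
    """
    recorded = [af for af in population_afs.values() if af is not None]
    # Assumption: every database that records the variant must exceed the threshold.
    common = bool(recorded) and all(af > AF_THRESHOLD for af in recorded)
    return function_annotation in WEAK_ANNOTATIONS and common


example_afs = {"G1000": 0.21, "ESP6500": 0.18, "gnomAD": 0.19}
print(is_weakly_associated("intron", example_afs))    # True
print(is_weakly_associated("missense", example_afs))  # False
```

Whether a missing database record should count for or against filtering is not stated in the text, so that branch in particular is a design choice of this sketch.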
- the total number of detections is the sum of the number of low-quality detections and the number of high-quality detections.
- Variant sites that meet one of the following two conditions are used to construct the first variant site data set.
- In the first variant site data set, at least the cpra information of the included variant sites must be recorded.
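A minimal sketch of how such a data set might be assembled, assuming the two conditions are those given in claim 5: (1) the total number of detections exceeds a first preset value and the number of low-quality detections exceeds a second preset value; (2) the site lies within a third preset distance of a site meeting condition 1. The record layout, the function name and the preset values in the example are all hypothetical.

```python
def build_first_site_set(sites, min_total, min_low_quality, window):
    """Return the cpra keys of sites meeting condition 1 or condition 2.

    Each site is a dict with keys: cpra, chrom, pos, low_quality, high_quality.
    """
    # Condition 1: total detections (low + high quality) and low-quality
    # detections both exceed their preset values.
    seeds = [
        s for s in sites
        if s["low_quality"] + s["high_quality"] > min_total
        and s["low_quality"] > min_low_quality
    ]
    selected = set()
    for s in sites:
        for seed in seeds:
            # Condition 2: within `window` bases of a condition-1 site on the
            # same chromosome. A seed is at distance 0 from itself, so every
            # condition-1 site is selected as well.
            if s["chrom"] == seed["chrom"] and abs(s["pos"] - seed["pos"]) <= window:
                selected.add(s["cpra"])
                break
    return selected


sites = [
    {"cpra": "chr1-100-A-G", "chrom": "chr1", "pos": 100, "low_quality": 30, "high_quality": 80},
    {"cpra": "chr1-105-C-T", "chrom": "chr1", "pos": 105, "low_quality": 0, "high_quality": 2},
    {"cpra": "chr1-500-G-A", "chrom": "chr1", "pos": 500, "low_quality": 1, "high_quality": 3},
    {"cpra": "chr2-100-T-C", "chrom": "chr2", "pos": 100, "low_quality": 0, "high_quality": 1},
]
first_set = build_first_site_set(sites, min_total=100, min_low_quality=20, window=10)
print(sorted(first_set))  # ['chr1-100-A-G', 'chr1-105-C-T']
```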
- Complex variation determination can also process received data (for example, input data in VCF or FASTQ format) and use bioinformatics software (such as GATK, ANNOVAR or Alamut) to perform annotation at the amino acid level, then determine whether the annotation result of the variation contains a pHGVS entry. For example, for the variation chr7-142458526-A-G, the annotation result at the amino acid level is p.Asn54Ser, which contains a pHGVS entry.
- For example, the detected variants include both chr7-142458526-A-G (c.161A>G, p.Asn54Ser) and chr7-142458527-C-G (c.162C>G, p.Asn54Lys); the former is at cDNA position 161 and the latter at position 162, so the two are adjacent.
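The adjacency check in this example can be sketched as follows, using the criteria of claim 7 (overlapping reference coordinates, or adjacent coordinates that affect the same encoded amino acid). The function names and the third variant are hypothetical, and the parsing assumes simple substitutions written as `chrom-pos-ref-alt` with a three-letter pHGVS annotation.

```python
import re

# Matches the leading "p.Xxx<number>" of a three-letter pHGVS annotation.
PHGVS_RE = re.compile(r"^p\.([A-Za-z]{3})(\d+)")


def residue_number(phgvs):
    """Return the amino-acid position from e.g. 'p.Asn54Ser', or None."""
    match = PHGVS_RE.match(phgvs)
    return int(match.group(2)) if match else None


def find_complex_pairs(variants):
    """variants: list of (cpra, phgvs). Returns pairs that form a complex variation."""
    parsed = []
    for cpra, phgvs in variants:
        chrom, pos, ref, _alt = cpra.split("-")
        residue = residue_number(phgvs)
        if residue is not None:  # only amino-acid-changing variants are considered
            start = int(pos)
            parsed.append((cpra, chrom, start, start + len(ref) - 1, residue))
    pairs = []
    for i in range(len(parsed)):
        for j in range(i + 1, len(parsed)):
            a, b = parsed[i], parsed[j]
            if a[1] != b[1]:
                continue
            overlap = a[2] <= b[3] and b[2] <= a[3]
            adjacent = a[3] + 1 == b[2] or b[3] + 1 == a[2]
            if overlap or (adjacent and a[4] == b[4]):
                pairs.append((a[0], b[0]))
    return pairs


calls = [
    ("chr7-142458526-A-G", "p.Asn54Ser"),
    ("chr7-142458527-C-G", "p.Asn54Lys"),
    ("chr7-142460000-C-T", "p.Thr298Met"),  # hypothetical distant variant
]
print(find_complex_pairs(calls))
# [('chr7-142458526-A-G', 'chr7-142458527-C-G')]
```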
- SNP: single nucleotide polymorphism
- A variation annotation tool is used to annotate the training data and obtain its cpra information (genome version hg19), and this information is used as the unique identifier of the variation.
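As an illustration of using cpra as a unique key, the sketch below assumes that cpra denotes chromosome, position, reference allele and alternate allele, as the example identifier chr7-142458526-A-G suggests. The class and function names are hypothetical.

```python
from typing import NamedTuple


class Cpra(NamedTuple):
    """Chromosome, position, reference allele, alternate allele (assumed meaning)."""
    chrom: str
    pos: int
    ref: str
    alt: str

    def key(self) -> str:
        # String form used as the unique identifier of a variant.
        return f"{self.chrom}-{self.pos}-{self.ref}-{self.alt}"


def parse_cpra(text: str) -> Cpra:
    """Parse an identifier such as 'chr7-142458526-A-G'."""
    chrom, pos, ref, alt = text.split("-", 3)
    return Cpra(chrom, int(pos), ref, alt)


variant = parse_cpra("chr7-142458526-A-G")
print(variant.pos)    # 142458526
print(variant.key())  # chr7-142458526-A-G
```

Because the key also depends on the genome build, identifiers from hg19 and from another build should not be mixed in one index.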
- the variants and their pathogenicity determinations in the public variant database are sorted and summarized into a training summary database.
- the interpretation experts will manually determine the pathogenicity of the mutations in the summary database in combination with the population frequency database, prediction software and medical literature, and use the determination results as the standard results of the training data.
- The structure and content of the preprocessed training summary database are shown in Table 3.
- the character data in the summary database can be converted into numerical data according to the correspondence shown in Table 4 to Table 6.
- This completes the construction of the first training set.
- the structure and content of the first training set after feature extraction are shown in Table 8.
- HGMD pathogenicity determination (character type) → HGMD pathogenicity determination (numerical type): Null (the database has no record of the variant) → 0; R → 1; FP → 2; DP → 3; DFP → 4; DM? → 5; DM → 6.
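The character-to-numeric conversion for HGMD can be written as a lookup table. The mapping values come from the correspondence above; the function name and the use of `None` for a variant with no HGMD record are assumptions of this sketch.

```python
# HGMD class label -> numeric code, as listed in the correspondence above.
# None stands for "Null": the database has no record of the variant.
HGMD_TO_NUMERIC = {None: 0, "R": 1, "FP": 2, "DP": 3, "DFP": 4, "DM?": 5, "DM": 6}


def encode_hgmd(label):
    """Convert an HGMD pathogenicity label to its numeric code.

    Raises KeyError for a label outside the table, so unexpected input
    is not silently turned into a number.
    """
    return HGMD_TO_NUMERIC[label]


print([encode_hgmd(x) for x in ["DM", "DM?", None]])  # [6, 5, 0]
```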
- the first training set is randomly split into a first training set and a first testing set.
- The first training set is used to train a logistic regression model.
- The logistic regression model is trained on the first training set; evaluated on the first test set, its accuracy is 91.0%. The test performance data are shown in Table 8.
- Variants to be predicted, such as NM_000267.3(NF1):c.1722-2A>G, NM_000057.4(BLM):c.893C>T(p.Thr298Met) and NM_000244.3(MEN1):c.670-6C>T, are collected, and their pathogenicity determinations in the public variation databases ClinVar and HGMD are used as the data to be predicted.
- The variants to be predicted can be the variant sites obtained from the above-mentioned first variant site filtering.
- A variant annotation tool is used to annotate the variants to be predicted and obtain their cpra information (genome version hg19), and this information is used as the unique identifier of each variant.
- With the variant cpra as the index, the pathogenicity determinations of the variants to be predicted in the public variant databases ClinVar and HGMD are sorted and summarized into one database. Because the pathogenicity determination results of the public variation databases are ordered discrete data, the character data in the prediction summary database are converted into numerical data according to the conversion logic shown in Table 8, which completes the processing of the prediction data. Table 9 shows the structure and content of the prediction summary database after preprocessing and feature extraction.
- the variation data can be obtained from the ClinVar database.
- The ClinVar database contains 46,585 clearly pathogenic loci. Of these, the loci that are also classified as "DM" in the HGMD database and are present in the dbNSFP (https://sites.google.com/site/jpopgen/dbNSFP) database total 21,105. Half of them, 10,552 in total, are selected as the pathogenic variants in the second training set, and the other 10,553 pathogenic loci constitute the first test subset of the second test set.
- The ClinVar database contains 23,892 benign loci, of which the 4,664 loci present in the dbNSFP database are selected as benign variants in the second training set.
- Missing values in the dbNSFP database are directly assigned a value of 0.
- the model is based on the allele frequency database 1000 Genomes Project, ESP6500, and multiple function prediction software data.
- Function prediction software can be divided into protein conservation prediction (such as SIFT, Polyphen2 and MutationTaster), nucleic acid conservation prediction (such as GERP++), splicing harmfulness prediction (such as dbscSNV), and variant site harmfulness prediction (such as DANN).
- score_rankscore FATHMM_converted_rankscore, PROVEAN_converted_rankscore, MetaSVM_rankscore, MetaLR_rankscore, REVEL_score, CADD_raw_rankscore, DANN_rankscore, GERP++_RS_rankscore, splicing_consensus_ada_score, splicing_consensus_rf_score, phyloP100way_vertebrate_rankscore, phyloP20way_mammalian_rankscore, phastCons100way_vertebrate_rankscore, phastCons20way_mammalian_rankscore, SiPhy_29way_logOdds_rankscore, 1000Gp3_AF, ESP6500_AA_AF.
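The rule that missing dbNSFP values are set to 0 can be sketched as a small feature-vector builder. The field names are taken from the list above; the function name, the record layout and the use of "." as the missing-value token are assumptions of this sketch.

```python
def to_feature_vector(record, feature_names, missing_token="."):
    """Build a numeric feature vector, assigning 0.0 to missing values.

    record maps a dbNSFP field name to its raw string value (hypothetical
    layout); absent keys, empty strings and the missing token become 0.0.
    """
    vector = []
    for name in feature_names:
        raw = record.get(name, missing_token)
        if raw is None or raw == "" or raw == missing_token:
            vector.append(0.0)
        else:
            vector.append(float(raw))
    return vector


features = ["REVEL_score", "DANN_rankscore", "1000Gp3_AF"]
row = {"REVEL_score": "0.91", "DANN_rankscore": "."}
print(to_feature_vector(row, features))  # [0.91, 0.0, 0.0]
```

Note that for allele-frequency fields a value of 0 is also a meaningful observation, so this imputation makes "not recorded" and "never observed" indistinguishable to the model.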
- a neural network model may be selected as the second machine learning model.
- the constructed neural network model can contain three hidden layers, the number of nodes is 16, 128, and 16 respectively; the weight initialization function is uniform, the activation function is hard_sigmoid, the optimizer uses the default parameters of the Adadelta method, and the neuron inactivation probability dropout_rate is 0.05, the number of training iterations is 800, and the batch_size of samples selected for one-time training is 64.
- Test subset 1 is the above-mentioned first test subset
- Test subset 2 can consist of variants accumulated from clinical interpretation and interpreted according to ACMG, screened for those clearly classified as Pathogenic or Benign. For a site that has been interpreted multiple times with inconsistent results, the most frequent interpretation result is selected.
- After screening for variants present in the dbNSFP database and removing the variants in the training set, a total of 728 variants were obtained, including 618 pathogenic variants and 108 benign variants.
- the performance of the trained second machine learning model was evaluated through the above two test subsets.
- The ROC (receiver operating characteristic) curve results are shown in Figure 6.
- the AUC can reach 0.99.
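For reference, an AUC value of this kind can be computed directly from labels and predicted scores. The sketch below uses the pairwise-ranking definition of AUC (the probability that a pathogenic variant scores higher than a benign one, counting ties as one half); it is a generic illustration with made-up scores, not the evaluation code behind Figure 6.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for pathogenic, 0 for benign; scores: predicted probabilities.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The double loop is quadratic in the number of samples, which is acceptable for test subsets of a few thousand variants; a rank-based formula would be preferable for larger sets.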
- the prediction sub-model of the present disclosure can increase AUC by 4-5 percentage points compared with previous methods, and can assist genetic analysis for rapid pathogenicity review, and can be used for mutation pathogenicity quality control, which can achieve higher accuracy.
- 3.2.b1 Collect the feature values of the variation to be predicted, that is, collect for the variation to be predicted the feature values x_n used in the training data, specifically the same multiple feature values as described in 3.2.1.2.
- The variation to be predicted may be a variant site obtained from the above-mentioned first variant site filtering.
- 4.1.1 Build the training set. Select 843 mutation sites from the historical sequencing data of the mutation results verified by Sanger, and randomly divide them into the third training set and the third test set according to the ratio of 4:1.
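The 4:1 random split can be sketched as below. With 843 sites, integer division gives 674 training sites and 169 test sites. The function name and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def split_four_to_one(items, seed=0):
    """Shuffle the items and split them into training and test parts at 4:1."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 4 // 5  # 843 * 4 // 5 == 674
    return shuffled[:n_train], shuffled[n_train:]


train_sites, test_sites = split_four_to_one(range(843))
print(len(train_sites), len(test_sites))  # 674 169
```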
- the third training set is used to train the model.
- the third test set is used to test the model.
- the specific data distribution of the training set with 674 variant sites is: 269 non-variant sites, 212 homozygous variant sites, and 193 heterozygous variant sites.
- the prior probability P(c k ) and class conditional probability are calculated based on the third training set of 674 mutation sites.
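A minimal sketch of the Bayes computation described here, using the class counts of the 674-site training set. The class-conditional likelihood values in the example are purely hypothetical placeholders, since the text does not give them; only the prior estimation and the normalization by the evidence factor P(x) are illustrated.

```python
# Class counts of the third training set, as stated in the text.
CLASS_COUNTS = {"no_variation": 269, "homozygous": 212, "heterozygous": 193}


def estimate_priors(counts):
    """P(c_k) estimated as the fraction of training samples in each class."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def posterior(priors, likelihoods):
    """P(c_k | x) = P(x | c_k) * P(c_k) / P(x), with P(x) the evidence factor."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # same value for every class
    return {c: joint[c] / evidence for c in joint}


priors = estimate_priors(CLASS_COUNTS)
# Hypothetical class-conditional probabilities P(x | c_k) for one site.
example_likelihoods = {"no_variation": 0.01, "homozygous": 0.05, "heterozygous": 0.60}
post = posterior(priors, example_likelihoods)
print(max(post, key=post.get))  # heterozygous
```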
- the performance of the machine learning model obtained through the above training is evaluated through the third test set, and the evaluation indicators may include accuracy rate, precision rate, and recall rate.
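The three indicators named above can be computed as follows. This is a generic sketch: precision and recall are computed for one chosen class against all others, and the labels in the example are invented for illustration.

```python
def evaluate(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


truth = ["het", "het", "hom", "none"]
predicted = ["het", "hom", "hom", "none"]
print(evaluate(truth, predicted, positive="het"))  # (0.75, 1.0, 0.5)
```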
- the 169 mutation data selected above are used as the third test set.
- The variation type whose prediction probability is the maximum and exceeds the preset threshold is selected as the variation type prediction result of the variant site; the preset threshold is set to 0.8.
- the specific test results are shown in Table 11. It should be noted that the preset threshold of the variation type prediction model can be adjusted according to the actual situation.
- the constructed variation type prediction model predicts the probability P(c k
- the first phenotype prediction model and the second phenotype prediction model are judged as credible variant sites that do not require manual review.
- Specifically, the variation type prediction result is determined by selecting the variation type corresponding to the maximum prediction probability, provided that this probability is greater than the preset threshold of 0.8.
- Taking a site whose variation interpretation result is a pathogenic variant as an example: when the variation type prediction result is homozygous or heterozygous, the variant site is determined to be credible and no experimental verification is required; when the variation type prediction result is no variation, the variant site is determined to be not credible and needs to be verified experimentally.
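The threshold rule and the credibility decision can be sketched together. The 0.8 threshold and the three variation types come from the text; the function names, the class labels and the "undetermined" outcome for sites where no type exceeds the threshold are assumptions of this sketch, because the text does not say how such sites are handled.

```python
THRESHOLD = 0.8  # preset threshold from the text; adjustable in practice
NON_REFERENCE_TYPES = {"homozygous", "heterozygous"}


def predict_variation_type(probabilities, threshold=THRESHOLD):
    """Return the most probable type if its probability exceeds the threshold, else None."""
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] > threshold:
        return best_type
    return None


def assess_site(probabilities, threshold=THRESHOLD):
    """Map predicted probabilities to 'credible', 'not credible' or 'undetermined'."""
    predicted = predict_variation_type(probabilities, threshold)
    if predicted is None:
        return "undetermined"  # assumption: not specified in the text
    if predicted in NON_REFERENCE_TYPES:
        return "credible"      # no experimental verification needed
    return "not credible"      # predicted as no variation: verify experimentally


print(assess_site({"no_variation": 0.05, "homozygous": 0.03, "heterozygous": 0.92}))
# credible
```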
- The sequence variation analysis method of the embodiment of the present disclosure can quickly, accurately and comprehensively determine the variations that need to be filtered, checked and verified. Only the interpretation results judged as unreliable need to be reviewed and maintained, and only the pathogenic variants judged as unreliable need to be verified. As a result, labor costs are greatly reduced, the efficiency of review and verification of genetic variation sites is improved, the time for issuing genetic test reports is shortened, the accuracy of interpretation of genetic test reports is also improved, and the entire interpretation and analysis process is optimized.
- the present disclosure proposes a computer-readable storage medium.
- a computer program is stored on a computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned sequence variation analysis method is realized.
- Fig. 8 is a structural block diagram of one of the specific implementations of the sequence variation analysis system of the present disclosure.
- the sequence variation analysis system 100 includes: an acquisition module 110 and a first analysis module 120 .
- the acquisition module 110 is used to obtain the sequence variation data to be analyzed;
- The first analysis module 120 is used to perform feature extraction on the sequence variation data to be analyzed to obtain the first variation feature set and the second variation feature set; input the first variation feature set into the trained first phenotype relationship prediction model to obtain the first phenotype relationship prediction result; input the second variation feature set into the trained second phenotype relationship prediction model to obtain the second phenotype relationship prediction result; and take the union of the first phenotype relationship prediction result and the second phenotype relationship prediction result to obtain the third phenotype relationship prediction result.
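One possible reading of the union step, together with the consistency check of claim 2, is sketched below. It assumes that each model's result is represented as the set of variant identifiers it flags as pathogenic; that representation, the function names and the example identifiers are hypothetical, and the disclosure may define the union differently.

```python
def third_prediction(first_flagged, second_flagged):
    """Union of the variants flagged by the first and second models."""
    return set(first_flagged) | set(second_flagged)


def interpretation_is_credible(cpra, interpreted_pathogenic, third_flagged):
    """An interpretation is credible when it agrees with the third prediction result."""
    predicted_pathogenic = cpra in third_flagged
    return predicted_pathogenic == interpreted_pathogenic


first = {"chr7-142458526-A-G"}
second = {"chr7-142458526-A-G", "chr1-100-A-G"}
third = third_prediction(first, second)
print(sorted(third))  # ['chr1-100-A-G', 'chr7-142458526-A-G']
print(interpretation_is_credible("chr1-100-A-G", False, third))  # False
```

Under this reading a variant flagged by either model is treated as predicted pathogenic, so a disagreement with the interpretation result sends the variant to manual review.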
- the sequence variation analysis system 100 may further include: a second analysis module 130 .
- the second analysis module 130 is used to filter the first variation site in the sequence variation data to be analyzed, the first variation site is located in the repeat region of the reference sequence, is weakly correlated with phenotypic changes, and is not a single base variation.
- the sequence variation analysis system 100 may further include: a third analysis module 140 .
- the third analysis module 140 is used to obtain the amino acid level annotation results of the sequence variation data to be analyzed, and judge whether complex variations exist in the sequence variation data to be analyzed according to the amino acid level annotation results.
- the sequence variation analysis system 100 may further include: a fourth analysis module 150 .
- The fourth analysis module 150 is used to perform feature extraction on the sequence variation data to be analyzed to obtain the third variation feature set; input the third variation feature set into the trained variation type prediction model to obtain, for each variant site in the sequence variation data to be analyzed, the prediction probability of each variation type to which the site may belong; and determine the variation type prediction result of the corresponding variant site according to the prediction probability.
- The sequence variation analysis system of the embodiment of the present disclosure can quickly, accurately and comprehensively determine the variations that need to be filtered, checked and verified. Only the unreliable interpretation results need to be reviewed and maintained, and only the unreliable pathogenic variants need to be verified. As a result, labor costs are greatly reduced, the efficiency of review and verification of genetic variation sites is improved, the time for issuing genetic test reports is shortened, the accuracy of interpretation of genetic test reports is also improved, and the entire interpretation and analysis process is optimized.
Abstract
Description
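| HGMD pathogenicity determination (character type) | HGMD pathogenicity determination (numerical type) |
|---|---|
| Null (the database has no record of the variant) | 0 |
| R | 1 |
| FP | 2 |
| DP | 3 |
| DFP | 4 |
| DM? | 5 |
| DM | 6 |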
Claims (18)
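- A sequence variation analysis method, characterized by comprising the following steps: obtaining sequence variation data to be analyzed; performing feature extraction on the sequence variation data to be analyzed to obtain a first variation feature set and a second variation feature set; inputting the first variation feature set into a trained first phenotype relationship prediction model to obtain a first phenotype relationship prediction result, and inputting the second variation feature set into a trained second phenotype relationship prediction model to obtain a second phenotype relationship prediction result; and taking the union of the first phenotype relationship prediction result and the second phenotype relationship prediction result to obtain a third phenotype relationship prediction result.
- The sequence variation analysis method according to claim 1, characterized in that, after the third phenotype relationship prediction result is obtained, the method further comprises: comparing the third phenotype relationship prediction result with the corresponding variation interpretation result; when the third phenotype relationship prediction result is consistent with the variation interpretation result, determining that the variation interpretation result is credible; and when the third phenotype relationship prediction result is inconsistent with the variation interpretation result, determining that the variation interpretation result is not credible.
- The sequence variation analysis method according to claim 1, characterized in that the method further comprises: filtering a first variation site in the sequence variation data to be analyzed, wherein the first variation site is located in a simple repeat region of the reference sequence, is weakly correlated with phenotypic changes, and is a non-single-base variation.
- The sequence variation analysis method according to claim 3, characterized in that being weakly correlated with phenotypic changes means that the variation is located in a non-coding region or an intron region, and the allele frequency of the variation is greater than 0.05.
- The sequence variation analysis method according to claim 3, characterized in that the first variation site further satisfies either of the following conditions: condition 1: the total number of detections is greater than a first preset value, and the number of low-quality detections is greater than a second preset value; condition 2: the site is located within a third preset value upstream or downstream of the reference-sequence physical position of a variation site that satisfies condition 1.
- The sequence variation analysis method according to claim 1, characterized in that the method further comprises: obtaining amino-acid-level annotation results of the sequence variation data to be analyzed; and determining, according to the amino-acid-level annotation results, whether complex variations exist in the sequence variation data to be analyzed.
- The sequence variation analysis method according to claim 6, characterized in that variation sites with complex variation satisfy the following condition: the reference sequence coordinates of at least two variation sites that cause amino acid changes overlap; or the reference sequence coordinates of at least two variation sites that cause amino acid changes are adjacent and affect the same encoded amino acid.
- The sequence variation analysis method according to claim 1, characterized in that the method further comprises: performing feature extraction on the sequence variation data to be analyzed to obtain a third variation feature set; inputting the third variation feature set into a trained variation type prediction model to obtain, for each variation site in the sequence variation data to be analyzed, the prediction probability of each variation type to which the site may belong; and determining the variation type prediction result of the corresponding variation site according to the prediction probability.
- The sequence variation analysis method according to claim 8, characterized in that determining the variation type prediction result of the corresponding variation site according to the prediction probability specifically comprises: selecting the variation type corresponding to the maximum prediction probability that is greater than a preset threshold as the variation type prediction result of the variation site.
- The sequence variation analysis method according to claim 9, characterized in that, after the variation type prediction result of the corresponding variation site is determined, the method further comprises: when the variation type prediction result contains only the reference-sequence genotype, determining that the variation site is not credible; and when the variation type prediction result contains at least one non-reference-sequence genotype, determining that the variation site is credible.
- The sequence variation analysis method according to claim 8, characterized in that the third variation feature values include variation support data and allele frequency, and the variation types include homozygous variation, heterozygous variation and no variation.
- The sequence variation analysis method according to claim 1, characterized in that each first variation feature value in the first variation feature set is a published phenotype relationship determination result of a variation site, and each second variation feature value in the second variation feature set is allele frequency data and function prediction data of a variation site.
- The sequence variation analysis method according to claim 12, characterized in that the function prediction data in the second variation feature values include at least one of protein conservation prediction data, nucleic acid conservation prediction data, splicing harmfulness prediction data, and variation site harmfulness prediction data.
- A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the sequence variation analysis method according to any one of claims 1-13 is implemented.
- A sequence variation analysis system, characterized by comprising: an acquisition module for obtaining sequence variation data to be analyzed; and a first analysis module for performing feature extraction on the sequence variation data to be analyzed to obtain a first variation feature set and a second variation feature set, inputting the first variation feature set into a trained first phenotype relationship prediction model to obtain a first phenotype relationship prediction result, inputting the second variation feature set into a trained second phenotype relationship prediction model to obtain a second phenotype relationship prediction result, and taking the union of the first phenotype relationship prediction result and the second phenotype relationship prediction result to obtain a third phenotype relationship prediction result.
- The sequence variation analysis system according to claim 15, characterized in that the system further comprises: a second analysis module for filtering a first variation site in the sequence variation data to be analyzed, wherein the first variation site is located in a repeat region of the reference sequence, is weakly correlated with phenotypic changes, and is a non-single-base variation.
- The sequence variation analysis system according to claim 15, characterized in that the system further comprises: a third analysis module for obtaining amino-acid-level annotation results of the sequence variation data to be analyzed, and determining, according to the amino-acid-level annotation results, whether complex variations exist in the sequence variation data to be analyzed.
- The sequence variation analysis system according to claim 15, characterized in that the system further comprises: a fourth analysis module for performing feature extraction on the sequence variation data to be analyzed to obtain a third variation feature set, inputting the third variation feature set into a trained variation type prediction model to obtain, for each variation site in the sequence variation data to be analyzed, the prediction probability of each variation type to which the site may belong, and determining the variation type prediction result of the corresponding variation site according to the prediction probability.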
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202180103963.2A CN118302817A (zh) | 2021-11-19 | 2021-11-19 | Sequence variation analysis method, system and storage medium |
| EP21964431.7A EP4435791A4 (en) | 2021-11-19 | 2021-11-19 | SYSTEM AND METHOD FOR ANALYZING SEQUENCE VARIATION, AND STORAGE MEDIUM |
| AU2021474767A AU2021474767B2 (en) | 2021-11-19 | 2021-11-19 | Sequence variation analysis method and system, and storage medium |
| PCT/CN2021/131904 WO2023087277A1 (zh) | 2021-11-19 | 2021-11-19 | Sequence variation analysis method, system and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2021/131904 WO2023087277A1 (zh) | 2021-11-19 | 2021-11-19 | Sequence variation analysis method, system and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023087277A1 true WO2023087277A1 (zh) | 2023-05-25 |
Family
ID=86396019
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/131904 Ceased WO2023087277A1 (zh) | Sequence variation analysis method, system and storage medium | 2021-11-19 | 2021-11-19 |
Country Status (4)
| Country | Link |
|---|---|
| EP (1) | EP4435791A4 (zh) |
| CN (1) | CN118302817A (zh) |
| AU (1) | AU2021474767B2 (zh) |
| WO (1) | WO2023087277A1 (zh) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119314553A (zh) * | 2024-12-16 | 2025-01-14 | 上海第二工业大学 | Prediction method based on combined feature encoding and DNA binding sites |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110832597A (zh) * | 2018-04-12 | 2020-02-21 | 因美纳有限公司 | Deep neural network-based variant classifier |
| CN111063392A (zh) * | 2019-12-17 | 2020-04-24 | 人和未来生物科技(长沙)有限公司 | Neural network-based method, system and medium for detecting the pathogenicity of gene mutations |
| US20200342955A1 (en) * | 2017-10-27 | 2020-10-29 | Apostle, Inc. | Predicting cancer-related pathogenic impact of somatic mutations using deep learning-based methods |
| CN112687332A (zh) * | 2021-03-12 | 2021-04-20 | 北京贝瑞和康生物技术有限公司 | Method, device and storage medium for determining pathogenic-risk variant sites |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8744982B2 (en) * | 2011-05-12 | 2014-06-03 | University Of Utah Research Foundation | Gene-specific prediction |
-
2021
- 2021-11-19 WO PCT/CN2021/131904 patent/WO2023087277A1/zh not_active Ceased
- 2021-11-19 CN CN202180103963.2A patent/CN118302817A/zh active Pending
- 2021-11-19 AU AU2021474767A patent/AU2021474767B2/en active Active
- 2021-11-19 EP EP21964431.7A patent/EP4435791A4/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4435791A4 * |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
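| CN117711487A (zh) * | 2024-02-05 | 2024-03-15 | 广州嘉检医学检测有限公司 | Method and system for identifying germline SNV and InDel variants, and readable storage medium |
| CN117711487B (zh) * | 2024-02-05 | 2024-05-17 | 广州嘉检医学检测有限公司 | Method and system for identifying germline SNV and InDel variants, and readable storage medium |
| CN118866092A (zh) * | 2024-07-05 | 2024-10-29 | 华中农业大学 | Genomic prediction method for G×E interaction fusing gating and linear attention mechanisms |
| CN119601083A (zh) * | 2024-11-16 | 2025-03-11 | 大连易康生物科技有限公司 | Gene sequencing data processing method and system |
| CN119811479A (zh) * | 2025-03-13 | 2025-04-11 | 北京市农林科学院信息技术研究中心 | Breeding trait analysis method, apparatus, device, medium and computer program product |
| CN120470419A (zh) * | 2025-07-15 | 2025-08-12 | 中南大学湘雅医院 | Method and apparatus for identifying variant-enriched regions, and computer device |
| CN121483394A (zh) * | 2026-01-08 | 2026-02-06 | 中南大学湘雅医院 | Pathogenicity prediction method for de novo variants |
| CN121483394B (zh) * | 2026-01-08 | 2026-03-27 | 中南大学湘雅医院 | Pathogenicity prediction method for de novo variants |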
Also Published As
| Publication number | Publication date |
|---|---|
| EP4435791A4 (en) | 2025-09-10 |
| AU2021474767B2 (en) | 2026-02-05 |
| AU2021474767A1 (en) | 2024-06-06 |
| EP4435791A1 (en) | 2024-09-25 |
| CN118302817A (zh) | 2024-07-05 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21964431 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 202180103963.2 Country of ref document: CN |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2021474767 Country of ref document: AU Ref document number: AU2021474767 Country of ref document: AU |
|
| ENP | Entry into the national phase |
Ref document number: 2021474767 Country of ref document: AU Date of ref document: 20211119 Kind code of ref document: A |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2021964431 Country of ref document: EP |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| ENP | Entry into the national phase |
Ref document number: 2021964431 Country of ref document: EP Effective date: 20240619 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 11202403283V Country of ref document: SG |











