WO2012034251A2 - Method and system for detecting genomic structural variation - Google Patents

Method and system for detecting genomic structural variation

Info

Publication number
WO2012034251A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
variation
sequencing
alignment
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2010/001409
Other languages
English (en)
Chinese (zh)
Inventor
罗锐邦
邵浩靖
林浩翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Priority to CN201080068345.0A (patent CN103080333B, zh)
Priority to PCT/CN2010/001409 (patent WO2012034251A2, fr)
Publication of WO2012034251A2 (fr)


Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2535/00: Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122: Massive parallel sequencing

Definitions

  • The invention relates to the technical field of bioinformatics, and in particular to a method and system for detecting structural variation (SV) of a genome.
  • Background art
  • Structural variation plays an important role in the genome and may lead to changes in the coding and function of an individual's genes.
  • Biologists have identified a large number of candidate genomic regions associated with human disease through genetic linkage or association analysis. However, identifying the disease-causing genes or mutations in these regions requires resequencing them.
  • Existing whole-genome resequencing analysis techniques are costly, and for some studies and for individual medical guidance the information they produce is largely redundant. To obtain useful information more efficiently, focusing existing genetic analysis techniques on regions of high value for genetic research is of great significance for scientific research and medical guidance.
  • One technical problem to be solved by one aspect of the present disclosure is to provide a genomic structural variation detection method with higher accuracy.
  • One aspect of the present disclosure provides a method for detecting genomic structural variation, comprising:
  • an assembly step, in which sequencing sequences are assembled into a scaffold sequence;
  • an alignment step, in which the scaffold sequence is globally pairwise aligned with the reference genome to obtain an alignment result containing variation information;
  • an extraction step, in which the variation information is extracted from the alignment result containing the variation information.
  • Before the assembly step, the method further comprises an optimization step of optimizing the sequencing sequences by aligning them to the reference genome to obtain optimized sequencing sequences;
  • the assembly step then comprises assembling the optimized sequencing sequences into a scaffold sequence.
  • The method further comprises a verification step of verifying the extracted variation information and removing variation information that fails verification.
  • The verification step comprises:
  • for variations shorter than 50 bp in the variation information, constructing a variant sequence and performing a gapped alignment of the sequencing sequences against the variant sequence with a short sequence alignment tool. If the alignment result matches the theoretically expected alignment result, the variation passes verification; otherwise it fails verification and is removed.
  • The extraction step further comprises:
  • processing the alignment result containing the variation information as follows:
  • The optimization step comprises:
  • aligning the sequencing sequences to the reference genome with a short sequence alignment tool to obtain aligned sequences.
  • The optimization step also comprises:
  • removing repetitive sequencing sequences with a short sequence alignment tool;
  • removing sequencing sequences whose average quality in the aligned sequences is below a predetermined value.
  • The assembly step comprises:
  • using the paired-end relationship obtained by sequencing to construct a scaffold sequence from the contigs, and filling the gaps in the scaffold sequence to obtain the final scaffold sequence.
  • The scaffold sequence is obtained and compared with the reference genome, so that an individual-specific genome independent of the reference genome is obtained with high accuracy.
  • One technical problem to be solved by another aspect of the present disclosure is to provide a genomic structural variation detection system with higher accuracy.
  • Another aspect of the present disclosure provides a genomic structural variation detection system, comprising:
  • an assembly device for assembling sequencing sequences into a scaffold sequence; a comparison device configured to perform a global pairwise alignment of the scaffold sequence against the reference genome to obtain an alignment result containing variation information;
  • an extraction device for extracting the variation information from the alignment result containing the variation information.
  • The system further comprises:
  • an optimization device that optimizes the sequencing sequences by aligning them to the reference genome; the assembly device then assembles the optimized sequencing sequences into a scaffold sequence.
  • The system further comprises:
  • a verification device configured to verify the extracted variation information and remove variation information that fails verification.
  • For a variation of 50 bp or longer in the variation information, the verification device determines whether its repeatability is less than 10%. If so, it constructs a variant sequence and aligns the sequencing sequences to the variant sequence; if the depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. If the repeatability is greater than or equal to 10%, the device determines whether the extension sequence of the variation site is free of repeats; if so, it constructs the variant sequence and aligns the sequencing sequences to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed.
  • For a variation shorter than 50 bp in the variation information, the verification device constructs a variant sequence and uses a short sequence alignment tool to perform a gapped alignment of the sequencing sequences against the variant sequence; if the alignment result matches the theoretically expected alignment result, the variation passes verification, otherwise it is removed.
  • The extraction device comprises:
  • a variation information filtering unit for filtering or re-running abnormal results in the alignment result containing the variation information; and/or filtering logically erroneous results; and/or removing uncommon results, and outputting the filtered alignment result;
  • a variation information extraction unit for extracting the variation information from the filtered alignment result output by the variation information filtering unit.
  • The optimization device comprises:
  • a comparison unit configured to align the sequencing sequences to the reference genome to obtain aligned sequences;
  • a filtering unit configured to filter the aligned sequences and remove those whose average quality is lower than a predetermined value.
  • The assembly device comprises:
  • a graph construction unit for cutting the optimized sequencing sequences into N-mers and constructing a de Bruijn graph;
  • a cutting unit for outputting ring structures in the de Bruijn graph and cutting the de Bruijn graph into a plurality of contigs and heterozygous sequences;
  • a scaffold construction unit configured to construct a scaffold sequence from the contigs using the paired-end relationship obtained by sequencing, and to fill the gaps in the scaffold sequence to obtain the final scaffold sequence.
  • The whole-genome sequencing result is assembled by the assembly device to obtain a scaffold sequence, and the scaffold sequence and the reference genome are globally aligned by the comparison device, so that an individual-specific genome independent of the reference genome is obtained with high accuracy.
  • Figure 1 is a flow chart showing one embodiment of the genomic structural variation detection method of the present invention.
  • Figure 2 is a flow chart showing another embodiment of the genomic structural variation detection method of the present invention.
  • Figure 3 is a flow chart showing still another embodiment of the genomic structural variation detection method of the present invention.
  • Figure 4 is a structural diagram showing one embodiment of the genomic structural variation detection system of the present invention.
  • Figure 5 is a structural diagram showing another embodiment of the genomic structural variation detection system of the present invention.
  • Figure 6 is a structural diagram showing still another embodiment of the genomic structural variation detection system of the present invention.
  • Detailed description
  • The assembly-based method and system for detecting structural variation comprise a method for performing a series of bioinformatic analyses on genomic DNA sequence information and a related analysis tool, aiming to address the shortcomings of existing genomic bioinformatics analysis methods and tools.
  • Figure 1 is a flow chart showing one embodiment of the genomic structural variation detecting method of the present invention.
  • Step 102, an assembly step.
  • The sequencing sequences are assembled into a scaffold sequence (scaffold).
  • For example, the sequencing sequences are cut into N-mers and a de Bruijn graph is constructed; partial ring structures in the de Bruijn graph are output, and the de Bruijn graph is cut into a plurality of contigs and heterozygous sequences; the contigs are then processed using the paired-end relationship obtained by sequencing to construct the scaffold sequence.
  • Gaps in the scaffold sequence are filled with the base "N" to obtain the final scaffold sequence.
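  • The graph-construction and cutting idea described above can be illustrated with a minimal sketch. It is not SOAPdenovo's implementation: the function names are hypothetical, error correction and paired-end scaffolding are omitted, and isolated ring structures (cycles with no branch point) are simply not reported here.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs_from_graph(graph):
    """Cut the graph at branch points and merge each unbranched path into a contig."""
    indeg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indeg[dst] += 1
    nodes = set(graph) | set(indeg)

    def is_inner(node):
        # An inner node has exactly one incoming and one outgoing edge.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # contigs start only at path ends or branch points
        for nxt in sorted(graph.get(node, ())):
            path = node + nxt[-1]
            while is_inner(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            contigs.append(path)
    return contigs

reads = ["ATGGCGT", "GGCGTGC"]
print(contigs_from_graph(build_de_bruijn(reads, k=4)))
```

  • With the two overlapping toy reads above and k = 4, the graph is a single unbranched path, so the sketch returns one contig spanning both reads.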
  • Step 104, an alignment step.
  • The scaffold sequence is globally pairwise aligned with the reference genome to obtain an alignment result containing variation information.
  • The assembly result obtained in step 102 is globally aligned with the reference genome using long sequence alignment software.
  • The long sequence alignment software is, for example, LASTZ; see the reference [Harris, R.S. Improved pairwise alignment of genomic DNA. PhD thesis, Pennsylvania State University (2007)].
  • Step 106, an extraction step: the variation information is extracted from the alignment result containing the variation information.
  • The variation information includes the position of the variation site, the type of variation, and the sequence of the variation.
  • The whole-genome sequencing results are assembled to obtain a scaffold sequence, which is compared with the reference genome, so that an individual-specific genome independent of the reference genome is obtained with high accuracy.
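  • As an illustration of the extraction step, the sketch below walks two rows of a gapped pairwise alignment and reports the position, type, and sequence of each insertion or deletion. The function name, the 0-based coordinates, and the two-string input format are assumptions for illustration; the patent does not specify the output format of its extraction software, and mismatches are ignored in this sketch.

```python
def extract_variants(ref_aln, qry_aln, ref_start=0):
    """Return (ref_position, type, sequence) tuples from two gapped alignment rows.

    '-' marks an alignment gap. Positions are 0-based reference coordinates.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("alignment rows must have equal length")
    variants = []
    pos = ref_start  # reference coordinate of the current column
    i = 0
    while i < len(ref_aln):
        if ref_aln[i] == "-":
            # Gap in the reference row: the assembled sequence carries an insertion.
            j = i
            while j < len(ref_aln) and ref_aln[j] == "-":
                j += 1
            variants.append((pos, "insertion", qry_aln[i:j]))
            i = j
        elif qry_aln[i] == "-":
            # Gap in the assembled row: reference bases are deleted.
            j = i
            while j < len(qry_aln) and qry_aln[j] == "-":
                j += 1
            variants.append((pos, "deletion", ref_aln[i:j]))
            pos += j - i
            i = j
        else:
            pos += 1
            i += 1
    return variants

print(extract_variants("ACGT--ACGTAC", "ACGTTTAC--AC"))
```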
  • Fig. 2 is a flow chart showing another embodiment of the genomic structural variation detecting method of the present invention.
  • Step 202, an optimization step.
  • The sequencing sequences are optimized by aligning them to the reference genome to obtain optimized sequencing sequences.
  • The sequencing sequences are aligned to the reference genome with a sequence alignment tool to obtain aligned sequences, and the aligned sequences are optimized, for example by deduplication, replacement of erroneous bases, and filtering, and converted into optimized sequencing sequences.
  • The alignment of the sequencing sequences to the reference genome is performed with the BWA software, and the specific BWA parameters are "aln -e O -o O".
  • In these parameters, "aln" is the BWA subcommand that performs the alignment, and "-e" concerns the gaps permitted in the alignment.
  • Deduplication of the aligned sequences refers to removing highly repetitive sequence regions. For example, a region such as ATCATCATCATCATC, which contains multiple copies of ATC, interferes with the alignment, and such regions should be excluded.
  • Replacement of erroneous bases in the aligned sequences means replacing every mismatched base in sequences aligned to the reference genome with the corresponding reference base.
  • Filtering of the aligned sequences removes sequences whose average quality value is lower than a predetermined value X; the parameter X is chosen based on the sequencing sequences.
  • The recommended range is, for example, [10-20], corresponding to an average error rate of [10%-1%].
  • One option is 15. Optimizing the sequencing sequences improves the accuracy of subsequent processing.
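  • The stated correspondence between quality values 10 to 20 and error rates 10% to 1% matches the standard Phred relation, error = 10^(-Q/10). The following sketch, with hypothetical function names and reads represented as (sequence, quality list) pairs, shows that conversion and a mean-quality filter with the threshold X = 15 mentioned above.

```python
def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def mean_quality(quals):
    """Arithmetic mean of per-base quality values."""
    return sum(quals) / len(quals)

def filter_reads(reads, min_mean_q=15):
    """Keep reads whose average quality is not below the threshold X."""
    return [(seq, quals) for seq, quals in reads if mean_quality(quals) >= min_mean_q]

reads = [
    ("ACGT", [30, 30, 28, 32]),  # mean 30, kept
    ("TTGA", [10, 12, 8, 14]),   # mean 11, removed
]
print(filter_reads(reads))
print(phred_to_error(10), phred_to_error(20))
```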
  • Step 204, an assembly step.
  • The optimized sequencing sequences are assembled into a scaffold sequence, for example using the SOAPdenovo software developed by BGI (Huada Gene Research Institute). The specific assembly parameter is "-K 31", where the parameter "-K" sets the K-mer size.
  • The SOAPdenovo software is described in the reference: [Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res (2009)].
  • Step 206, an alignment step.
  • The scaffold sequence and the reference genome are globally pairwise aligned to obtain an alignment result containing variation information.
  • LASTZ is used to globally align the scaffold sequence to the reference genome.
  • The parameter definitions can be found in the LASTZ software documentation.
  • "chain" means chaining, and "ambiguousN" means treating N as any of multiple base types.
  • "gapped" refers to gapped alignment.
  • "noentropy" means that high-precision results are filtered without introducing entropy.
  • "12of19" is the 12-of-19 seed pattern.
  • A seed is a 19-base sequence selected from the reference sequence according to rules set by the software. Whether the target sequence matches the seed is decided using only the 12 base positions within the seed that the software designates. If the seed regions match, the alignment is extended in both directions from the seed region until the alignment is complete, and the alignment results are output.
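  • The seed matching described above can be sketched as a spaced-seed lookup: only the designated 12 of 19 positions are compared. The pattern string below is an assumption (it is believed to correspond to LASTZ's 12-of-19 seed, but the LASTZ documentation should be consulted), the function name is hypothetical, and the extension stage that follows a seed hit is not shown.

```python
# Assumed 12-of-19 pattern: '1' = position must match, '0' = position is ignored.
SEED_12OF19 = "1110100110010101111"

def seed_hits(reference, target, pattern=SEED_12OF19):
    """Return (reference_offset, target_offset) pairs whose '1' positions all agree."""
    care = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    index = {}
    for r in range(len(reference) - span + 1):
        key = "".join(reference[r + i] for i in care)
        index.setdefault(key, []).append(r)
    hits = []
    for t in range(len(target) - span + 1):
        key = "".join(target[t + i] for i in care)
        for r in index.get(key, ()):
            hits.append((r, t))
    return hits

ref = "ACGTACGTACGTACGTACG"
tolerated = "ACGAACGTACGTACGTACG"  # mismatch at an ignored ('0') position
rejected = "CCGTACGTACGTACGTACG"   # mismatch at a required ('1') position
print(seed_hits(ref, tolerated), seed_hits(ref, rejected))
```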
  • Step 208, an extraction step.
  • The alignment result containing the variation information is filtered, and the variation information in the filtered alignment result is extracted. Filtering includes: (1) filtering or re-running abnormal results, (2) filtering logically erroneous results, and (3) removing results that are incomplete.
  • Filtering or re-running abnormal results: abnormal results in the LASTZ output are filtered, meaningless comment sections in the LASTZ output are filtered, and LASTZ is re-run for output that lacks the normal end identifier.
  • Filtering logically erroneous results: these include an assembled sequence that aligns to two or more chromosomes, and the same position on one chromosome being aligned by two or more assembled sequences; among such results, the one of better quality is retained.
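  • One way to picture the logical-error filter is the greedy sketch below. The record fields (scaffold, chrom, start, end, score), the use of a single score as the measure of "better quality", and the half-open coordinates are all assumptions for illustration; the patent does not describe the filter at this level of detail.

```python
def resolve_conflicts(alignments):
    """Keep one chromosome per scaffold and one scaffold per reference position.

    alignments: list of dicts with keys scaffold, chrom, start, end, score,
    where [start, end) is the aligned reference interval.
    """
    # 1) A scaffold aligned to several chromosomes keeps only its best hit.
    best = {}
    for aln in alignments:
        current = best.get(aln["scaffold"])
        if current is None or aln["score"] > current["score"]:
            best[aln["scaffold"]] = aln
    # 2) Where scaffolds overlap on the same chromosome, the higher score wins.
    kept = []
    for aln in sorted(best.values(), key=lambda a: -a["score"]):
        clash = any(
            aln["chrom"] == other["chrom"]
            and aln["start"] < other["end"]
            and other["start"] < aln["end"]
            for other in kept
        )
        if not clash:
            kept.append(aln)
    return kept

hits = [
    {"scaffold": "s1", "chrom": "chr1", "start": 0, "end": 100, "score": 90},
    {"scaffold": "s1", "chrom": "chr2", "start": 0, "end": 100, "score": 50},
    {"scaffold": "s2", "chrom": "chr1", "start": 50, "end": 150, "score": 80},
    {"scaffold": "s3", "chrom": "chr1", "start": 200, "end": 300, "score": 70},
]
print([a["scaffold"] for a in resolve_conflicts(hits)])
```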
  • N: may be any of A, C, G, T
  • -: alignment gap
  • Step 210, a verification step.
  • The extracted variation information is verified, and variation information that fails verification is removed.
  • The candidate variation information can be verified by various computational methods to remove unqualified variation information, for example by depth and sequence-cutting methods. For variations of 50 bp or longer, a variant sequence is first constructed and the sequencing sequences are aligned to it; if the depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. For variations shorter than 50 bp, a variant sequence is first constructed, and sequence alignment software such as BWA is then used to perform a gapped alignment of the short sequencing sequences against the variant sequence with the alignment parameters "-e 50 -o 1 -i 5"; if the alignment result matches the theoretically expected alignment result, the variation passes verification, otherwise it is removed.
  • The depth of the variant sequence conforming to the theoretically expected distribution means the following: if the target sequence is consistent with the reference sequence, the depth at each position in the region should be relatively high and the depths at the individual positions should be close to one another; otherwise the depth is relatively low.
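  • The depth criterion above (high depth, and similar depth at every position) can be expressed as a simple check on the mean and the coefficient of variation. The function name and both thresholds are hypothetical; the patent does not give numeric cut-offs.

```python
from statistics import mean, pstdev

def depth_supports_variant(depths, min_mean=10.0, max_cv=0.5):
    """Return True if per-position depth is both high and even.

    min_mean and max_cv are illustrative thresholds, not values from the source.
    """
    if not depths:
        return False
    average = mean(depths)
    if average < min_mean:
        return False  # depth too low overall
    # Coefficient of variation: small values mean the depths are close together.
    return pstdev(depths) / average <= max_cv

print(depth_supports_variant([30, 32, 29, 31]))     # even, high depth
print(depth_supports_variant([30, 31, 0, 0, 29]))   # depth collapses mid-region
```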
  • Either or both of the optimization step and the verification step may be included as optional steps of the embodiment of the present invention.
  • Optimizing the sequencing sequences improves the accuracy of subsequent processing.
  • Verifying the genome-wide candidate structural variation set by several methods removes variation information that fails verification, so that the false positive rate of the variation information is low. Experiments show that the method of the embodiment of the present invention can obtain a structural variation set with a false positive rate below 10%.
  • Fig. 3 is a flow chart showing still another embodiment of the method for detecting genomic structural variation of the present invention.
  • In step 301, BWA alignment is performed.
  • The sequencing sequences are aligned to the reference genome with the BWA software to obtain aligned sequences.
  • BWA deduplication is performed: highly repetitive sequences are eliminated with the BWA software.
  • Erroneously aligned bases are replaced with reference sequence bases, and the sequences are filtered by quality value: every mismatched base in sequences aligned to the reference genome is replaced with the corresponding reference base, and sequences whose average quality value is below a predetermined value X are removed.
  • A de Bruijn graph for assembly is generated.
  • The contigs and heterozygous sequences are output according to the de Bruijn graph.
  • In step 306, a contig or heterozygous sequence is obtained.
  • Steps 307 through 309 process the contigs and the heterozygous sequences, respectively.
  • The reference sequence and the assembled result sequence are segmented, where the result sequence refers to the contigs and heterozygous sequences.
  • Pairwise alignment of the segments is performed.
  • The reference sequence and the result sequence are each split into multiple segments, and each small segment from the reference sequence is aligned with a small segment from the result sequence until all small segments have been aligned.
  • In step 309, the alignment is performed and the variation information is output.
  • In step 310, the variation information is obtained.
  • In step 311, it is determined whether the length of the variation is greater than or equal to 50 bp (base pairs). If yes, the flow proceeds to step 312; otherwise it continues to step 317.
  • Sequence repeatability is calculated. The information of a given region of the sequence is compared with the information in a repetitive sequence library to determine whether they are consistent; if they are, that region of the sequence is determined to be a repetitive sequence region. It is also possible that the whole sequence is a repeat region. Sequence repeatability can be calculated as the ratio of the length of the repeat regions to the length of the entire sequence.
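  • The repeatability ratio described above can be computed as follows, assuming the repeat regions have already been identified (for example by comparison with a repeat library) and are given as half-open intervals on the sequence. The function name and the interval representation are assumptions for illustration.

```python
def repeatability(seq_len, repeat_intervals):
    """Fraction of [0, seq_len) covered by repeat regions.

    repeat_intervals: iterable of (start, end) half-open intervals; they may
    overlap, and overlapping bases are counted only once.
    """
    if seq_len <= 0:
        raise ValueError("sequence length must be positive")
    covered = 0
    last_end = 0
    for start, end in sorted(repeat_intervals):
        start = max(start, last_end, 0)  # skip bases already counted
        end = min(end, seq_len)
        if end > start:
            covered += end - start
            last_end = end
    return covered / seq_len

print(repeatability(100, [(0, 10), (5, 20), (50, 60)]))  # 30 of 100 bases
print(repeatability(200, [(0, 10)]) < 0.10)              # below the 10% threshold
```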
  • In step 313, it is determined whether the repeatability is less than 10%; if so, the flow continues to step 316, otherwise it continues to step 314.
  • In step 314, it is determined whether the extension sequence of the variation site is non-repetitive; if so, the flow proceeds to step 315.
  • A variant sequence aligned with the reference sequence is obtained.
  • The verified variation is obtained according to the depth feature of the extension sequence, and the variation result is output.
  • A variant sequence aligned with the reference sequence is obtained. If the variant sequence is correct, its alignment depth will be relatively high and relatively even. The verified variation is obtained based on the depth, and the variation result is output.
  • A single-end or paired-end gapped BWA alignment result is obtained.
  • There are two types of sequencing sequences, single-end and paired-end, and BWA is run differently for each. For details, see: http://bio-bwa.sourceforge.net/bwa.shtml.
  • Each variation site has positional information. This position is located in the reference sequence, and sequences of a certain length before and after it are extracted and joined to the variant sequence of the variation site to form a new sequence.
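  • Constructing the new sequence from the flanks and the variant sequence can be sketched as below. The function name, the 0-based position, the ref_allele_len parameter (number of reference bases replaced by the variant), and the default flank length are assumptions; the source only says "a certain length".

```python
def build_variant_sequence(reference, pos, ref_allele_len, variant_seq, flank=100):
    """Join the left flank, the variant sequence, and the right flank.

    pos is the 0-based reference position of the variation site;
    ref_allele_len is 0 for a pure insertion and > 0 for a deletion or replacement.
    """
    left = reference[max(0, pos - flank):pos]
    right_start = pos + ref_allele_len
    right = reference[right_start:right_start + flank]
    return left + variant_seq + right

ref = "AAAACCCCGGGGTTTT"
print(build_variant_sequence(ref, 8, 0, "TA", flank=4))  # insertion of TA
print(build_variant_sequence(ref, 4, 4, "", flank=4))    # deletion of CCCC
```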
  • Gapped BWA alignment is performed.
  • BWA is run with the "-o 1" parameter, which allows the target sequence to be aligned to the reference sequence either with a gap or without gaps.
  • In step 322, the verified variation is obtained according to the gaps and depth distribution of the alignment result, and the variation result is output.
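  • The branching in the flow above (length threshold of 50 bp, repeatability threshold of 10%, repeat-free extension sequence) can be summarised as a small routing function. The function name and the returned labels are hypothetical, and the "unverified" outcome for a repetitive extension sequence is an assumption, since the source does not state what happens in that case.

```python
def choose_verification(variant_len, repeat_fraction, extension_is_repeat_free):
    """Pick the verification route for one candidate variation."""
    if variant_len < 50:
        # Short variations: gapped short-read alignment against the variant sequence.
        return "gapped_alignment"
    if repeat_fraction < 0.10:
        # Long, low-repeat variations: check depth over the variant sequence.
        return "depth_of_variant_sequence"
    if extension_is_repeat_free:
        # Long, repetitive variations: check depth over the extension sequence.
        return "depth_of_extension_sequence"
    return "unverified"  # assumption: not specified in the source

print(choose_verification(12, 0.0, True))
print(choose_verification(300, 0.05, False))
print(choose_verification(300, 0.40, True))
```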
  • Application Example 1: human exon capture sequencing.
  • This example uses the exon sequencing of NA12156 from the International HapMap Project (sample number: NA12156; download address ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX005/SRX005923).
  • The raw data comprise a total of 11346285 short sequences.
  • The sequencing results of the human exon sample NA12156 were filtered and optimized against the reference genome using the BWA tool and filtering software; the filtered and optimized sequences were assembled with SOAPdenovo; and the assembly results were pairwise aligned with the reference genome using the LASTZ software.
  • The alignment results were filtered with the structural variation extraction software, and abnormal results were removed.
  • The structural variation verification software was then used to verify the variations by depth and sequence-cutting methods. For variations of 50 bp or longer, it is determined whether the repeatability is less than 10%. If so, a variant sequence is constructed and the sequencing sequences are aligned to it; if the depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. If the repeatability is greater than or equal to 10%, it is determined whether the extension sequence of the variation site is free of repeats; if so, the variant sequence is constructed and the sequencing sequences are aligned to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed. For variations shorter than 50 bp, a variant sequence is constructed and BWA is used for gapped alignment with the parameters "-e 50 -o 1 -i 5"; if the alignment result matches the theoretically expected result, the variation passes verification, otherwise it is removed. Finally, the two sets are merged to obtain the final result. The specific steps are as follows:
  • The optimized short sequences were assembled. The assembled genome size was 218030396 bp, with 3941732 assembled sequences; the longest assembled sequence was 9042 bp, N50 was 298 bp, and N90 was 122 bp.
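  • For readers unfamiliar with the N50 and N90 figures quoted above, the sketch below shows the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches 50% (or 90%) of the total assembled length. The function name is hypothetical and the lengths are toy values, not data from the application example.

```python
def nxx(lengths, fraction):
    """Return the Nxx statistic (fraction=0.5 for N50, 0.9 for N90)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached through floating-point rounding

toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(nxx(toy_lengths, 0.5), nxx(toy_lengths, 0.9))
```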
  • This application example uses exon sequencing of colon cancer cells (sample number: Yv 090508).
  • The raw data comprise a total of 105,972,839 short sequences (sequencing sequences).
  • The optimized short sequences were assembled.
  • The assembled genome size was 118938172 bp, with 253868 assembled sequences.
  • The longest assembled sequence was 16885 bp, N50 was 793 bp, and N90 was 170 bp.
  • This application example uses Vibrio parahaemolyticus (sample number: VIBydvDlOpoolingIAAPEI-9-l).
  • The raw data comprise a total of 5631982 short sequences.
  • The assembled result has a genome size of 5056512 bp and 684 assembled sequences.
  • The longest assembled sequence is 94989 bp, N50 is 23988 bp, and N90 is 5603 bp.
  • The alignment of the assembled sequences with the reference genome produced 1442 alignment results.
  • Table 3
  • Figure 4 is a block diagram showing one embodiment of the genomic structural variation detecting system of the present invention.
  • the structural variation detecting system 400 of this embodiment includes an assembling device 41, a comparing device 42, and an extracting device 43.
  • The assembly device 41 assembles the sequencing sequences into a scaffold sequence (scaffold) and outputs the scaffold sequence;
  • the comparison device 42 performs a global pairwise alignment of the scaffold sequence output by the assembly device 41 against the reference genome to obtain an alignment result containing variation information;
  • the extraction device 43 extracts the variation information from the alignment result containing the variation information.
  • The whole-genome sequencing result is assembled by the assembly device to obtain the scaffold sequence, and the scaffold sequence and the reference genome are globally aligned by the comparison device, so that an individual-specific genome independent of the reference genome is obtained with high accuracy.
  • Figure 5 is a block diagram showing another embodiment of the genomic structural variation detecting system of the present invention.
  • The structural variation detection system 400 of this embodiment further includes an optimization device 50 and a verification device 54.
  • The optimization device 50 optimizes the sequencing sequences by aligning them to the reference genome to obtain optimized sequencing sequences, and transmits the optimized sequencing sequences to the assembly device 41.
  • The assembly device 41 assembles the optimized sequencing sequences into a scaffold sequence.
  • The optimization device 50 aligns the sequencing sequences to the reference genome with short sequence alignment software to obtain aligned sequences, and then performs optimization such as deduplication, base replacement, and filtering on them to obtain optimized sequencing sequences.
  • The verification device 54 verifies the extracted variation information and removes variation information that fails verification.
  • The verification device 54 can verify the candidate variation information by various computational methods to remove variation information that fails verification, for example by depth and sequence-cutting methods.
  • For a variation of 50 bp or longer in the variation information, the verification device determines whether its repeatability is less than 10%. If so, it constructs a variant sequence and aligns the sequencing sequences to the variant sequence; if the depth of the variant sequence conforms to the theoretically expected distribution, the variation passes verification, otherwise it is removed. If the repeatability is greater than or equal to 10%, the device determines whether the extension sequence of the variation site is free of repeats; if so, it constructs the variant sequence and aligns the sequencing sequences to it, and the variation passes verification if the alignment depth of the extension sequence conforms to the theoretically expected distribution, otherwise it is removed.
  • For a variation shorter than 50 bp in the variation information, the verification device constructs a variant sequence and uses a short sequence alignment tool to perform a gapped alignment of the sequencing sequences against the variant sequence; if the alignment result matches the theoretically expected alignment result, the variation passes verification, otherwise it is removed.
  • Optimizing the sequencing sequences with the optimization device improves the accuracy of subsequent processing.
  • The verification device verifies the genome-wide candidate structural variation set by several methods and removes unqualified variation information, so that the false positive rate of the variation information is low. Experiments show that the method of the embodiment of the present invention can obtain a structural variation set with a false positive rate below 10%.
  • Fig. 6 is a view showing the configuration of still another embodiment of the genomic structural mutation detecting system of the present invention.
  • The optimization device 50 includes a comparing unit 501, a filtering unit 502, and an erroneous-base replacement unit 503.
  • The assembly device 41 includes a graph construction unit 411, a cutting unit 412, and a skeleton construction unit 413.
  • The extraction device 43 includes a variation information filtering unit 431 and a variation information extraction unit 432.
  • The comparing unit 501 aligns the sequencing sequences to the reference genome to obtain the aligned sequences; the filtering unit 502 filters the aligned sequences, removing those whose average quality in the alignment queue is lower than a predetermined value; the erroneous-base replacement unit 503 replaces every mismatched base in the sequences aligned to the reference genome with the base that is identical to the reference genome.
  • The graph construction unit 411 cuts the optimized sequencing sequences into N-mers and constructs a de Bruijn graph; the cutting unit 412 outputs the partial ring structures in the de Bruijn graph and cuts the graph into multiple contigs; the skeleton construction unit 413 constructs skeleton sequences by using the paired-end relationships obtained from sequencing, and fills in the skeleton sequences to obtain the final skeleton sequences.
  • The variation information filtering unit 431 filters out or re-runs abnormal results among the alignment results containing variation information; and/or filters out logically erroneous results; and/or removes incomplete results, and outputs the filtered alignment results. The variation information extraction unit 432 extracts the variation information from the filtered alignment results output by the variation information filtering unit 431.
  • Code can represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, or any combination of instructions, data structures, or program statements.
  • The code can be located on a computer readable medium.
  • The computer readable medium can include one or more storage devices including, for example, RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, mobile hard disk, CD-ROM, or any other form of storage medium known in the art.
  • The computer readable medium can also include a carrier wave that encodes a data signal.
  • The whole-genome sequencing results are assembled to obtain skeleton sequences, which are compared with the reference genome, so that an individual-specific genome that does not depend on the reference genome is obtained with high accuracy.
  • Actual data show that the method of the embodiment of the present invention achieves excellent accuracy for genomes ranging from 1M to 3G in size.
  • A candidate structural variation set is obtained by analyzing the genome-wide sequencing assembly results, which yields more comprehensive results.
  • This candidate structural variation set can be analyzed further in the next step.
  • The present invention applies several additional methods to verify the genome-wide candidate structural variation set and obtains a structural variation set with a false-positive rate below 10%.
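
The verification rule described above for the verification device (the 50 bp length threshold and the 10% repeatability threshold) can be sketched as a small decision function. This is only an illustrative sketch, not the patented implementation: the class `CandidateVariation`, its field names, and the function `passes_verification` are hypothetical, and the outcomes of the depth check and the gapped alignment check are assumed to have been computed elsewhere and are passed in as booleans. The description does not say what happens to a long variation whose repeatability is at least 10% and whose extended sequence is itself repetitive; the sketch assumes such a variation is not verified.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariation:
    """Hypothetical record for one candidate variation (names are assumptions)."""
    length_bp: int                  # length of the variation in base pairs
    repeatability: float            # repeat fraction, 0.0 to 1.0
    extension_repeat_free: bool     # extended sequence around the site has no repeats
    depth_fits_theory: bool         # alignment depth conforms to the theoretical distribution
    gapped_alignment_fits: bool     # gapped alignment agrees with the expected result


def passes_verification(v: CandidateVariation) -> bool:
    """Return True if the candidate variation is kept, False if it is removed."""
    if v.length_bp >= 50:
        if v.repeatability < 0.10:
            # Low repeatability: judge by depth over the variant sequence.
            return v.depth_fits_theory
        if v.extension_repeat_free:
            # High repeatability, repeat-free extension: judge by depth
            # over the extended sequence.
            return v.depth_fits_theory
        # Assumption: this case is not covered by the description.
        return False
    # Shorter than 50 bp: judge by gapped alignment with a short-sequence aligner.
    return v.gapped_alignment_fits
```

A caller would keep only the candidates for which `passes_verification` returns `True`, which corresponds to removing the variation information that fails verification.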

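The graph construction and cutting steps described for units 411 and 412 can be illustrated with a toy de Bruijn graph. The sketch below is a simplified stand-in under stated assumptions, not the assembler used in the embodiment: the function names `build_de_bruijn` and `unbranched_contigs` are hypothetical, reverse complements and sequencing errors are ignored, isolated ring structures are skipped, and the paired-end skeleton construction of unit 413 is not modeled. Sequences are cut into N-mers (here `k`), each N-mer becomes an edge between its prefix and suffix, and contigs are read off as maximal non-branching paths.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unbranched_contigs(graph):
    """Spell maximal non-branching paths as contigs (isolated cycles are skipped)."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    nodes = set(graph) | set(indegree)

    def is_simple(node):
        # Exactly one way in and one way out: the path continues through it.
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for start in sorted(nodes):
        if is_simple(start):
            continue  # contigs only start at branch points, sources, or sinks
        for nxt in sorted(graph.get(start, ())):
            contig = start + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


# Two overlapping toy sequences merge into one contig.
toy_graph = build_de_bruijn(["ATGGCGT", "GGCGTGC"], k=4)
print(unbranched_contigs(toy_graph))
```

When two sequences disagree after a shared prefix, the graph branches and the path is cut at the branch point, which is the sense in which the graph is cut into multiple contigs.
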
PCT/CN2010/001409 2010-09-14 2010-09-14 Méthode et systèmes de détection de changements de structure génomique Ceased WO2012034251A2 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201080068345.0A CN103080333B (zh) 2010-09-14 2010-09-14 一种基因组结构性变异检测方法和系统
PCT/CN2010/001409 WO2012034251A2 (fr) 2010-09-14 2010-09-14 Méthode et systèmes de détection de changements de structure génomique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/001409 WO2012034251A2 (fr) 2010-09-14 2010-09-14 Méthode et systèmes de détection de changements de structure génomique

Publications (1)

Publication Number Publication Date
WO2012034251A2 true WO2012034251A2 (fr) 2012-03-22

Family

ID=45832006

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/001409 Ceased WO2012034251A2 (fr) 2010-09-14 2010-09-14 Méthode et systèmes de détection de changements de structure génomique

Country Status (2)

Country Link
CN (1) CN103080333B (fr)
WO (1) WO2012034251A2 (fr)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093121A (zh) * 2012-12-28 2013-05-08 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN103258145A (zh) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 一种基于De Bruijn图的并行基因拼接方法
CN103810402A (zh) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 用于基因组的数据处理方法和装置
CN104968806A (zh) * 2013-02-01 2015-10-07 Sk电信有限公司 提供与基于基因序列的个人标记有关的信息的方法和装置
CN108140070A (zh) * 2015-02-25 2018-06-08 螺旋遗传学公司 多样品差分变异检测
CN109074429A (zh) * 2016-04-20 2018-12-21 华为技术有限公司 基因组变异检测方法、装置及终端
CN110079589A (zh) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 一种精准获得全基因组范围内结构变异的方法
CN112086131A (zh) * 2020-08-18 2020-12-15 西安医学院 一种高通量测序中假阳性变异位点的筛选方法
CN116884497A (zh) * 2023-07-28 2023-10-13 天津诺禾致源生物信息科技有限公司 基因组组装缺口的填补方法和装置
CN117153248A (zh) * 2023-09-05 2023-12-01 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、系统
WO2024138733A1 (fr) * 2022-12-30 2024-07-04 深圳华大生命科学研究院 Procédé et système de détection de variation structurelle, dispositif et support

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714263B (zh) * 2013-12-10 2017-06-13 深圳先进技术研究院 双向多步De Bruijn图的错误双向边识别与去除方法
CN104751015B (zh) * 2013-12-30 2017-08-29 中国科学院天津工业生物技术研究所 一种基因组测序数据序列组装方法
CN104164479B (zh) * 2014-04-04 2017-09-19 深圳华大基因科技服务有限公司 杂合基因组处理方法
WO2016000267A1 (fr) * 2014-07-04 2016-01-07 深圳华大基因股份有限公司 Procédé permettant de déterminer la séquence d'une sonde et procédé de détection de variation structurale génomique
CN105483244B (zh) * 2015-12-28 2019-10-22 武汉菲沙基因信息有限公司 一种基于超长基因组的变异检测方法及检测系统
DK3982368T3 (da) * 2016-06-07 2024-06-24 Illumina Inc Bioinformatiksystemer, apparater og fremgangsmåder til udførelse af sekundær og/eller tertiær processering
CN110462063B (zh) * 2017-05-23 2023-06-23 深圳华大生命科学研究院 一种基于测序数据的变异检测方法、装置和存储介质
CN110021359B (zh) * 2017-07-24 2021-05-04 深圳华大基因科技服务有限公司 一种二代和三代序列联合组装结果去冗余的方法和装置
CN110349629B (zh) * 2019-06-20 2021-08-06 湖南赛哲医学检验所有限公司 一种利用宏基因组或宏转录组检测微生物的分析方法
CN111724858B (zh) * 2020-05-14 2024-06-07 东北林业大学 利用软件运行基因组序列比对修补gap的方法
CN111863135B (zh) * 2020-07-15 2022-06-07 西安交通大学 一种假阳性结构变异过滤方法、存储介质及计算设备
CN112289376B (zh) * 2020-10-26 2021-07-06 北京吉因加医学检验实验室有限公司 一种检测体细胞突变的方法及装置
CN112599193A (zh) * 2021-03-02 2021-04-02 北京橡鑫生物科技有限公司 结构变异检测模型、其构建方法和装置
CN115602244B (zh) * 2022-10-24 2023-04-28 哈尔滨工业大学 一种基于序列比对骨架的基因组变异检测方法
CN118658524A (zh) * 2024-06-07 2024-09-17 天津必佳生物技术有限公司 一种用于动物饲料的嵌合溶菌酶基因序列数据分析方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1618216A2 (fr) * 2003-04-25 2006-01-25 Sequenom, Inc. Procedes et systemes de fragmentation et systemes de sequencage de novo

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258145A (zh) * 2012-12-22 2013-08-21 中国科学院深圳先进技术研究院 一种基于De Bruijn图的并行基因拼接方法
CN103093121B (zh) * 2012-12-28 2016-01-27 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN103093121A (zh) * 2012-12-28 2013-05-08 深圳先进技术研究院 双向多步deBruijn图的压缩存储和构造方法
CN104968806A (zh) * 2013-02-01 2015-10-07 Sk电信有限公司 提供与基于基因序列的个人标记有关的信息的方法和装置
KR101770962B1 (ko) * 2013-02-01 2017-08-24 에스케이텔레콤 주식회사 유전자 서열 기반 개인 마커에 관한 정보를 제공하는 방법 및 이를 이용한 장치
CN103810402A (zh) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 用于基因组的数据处理方法和装置
CN108140070A (zh) * 2015-02-25 2018-06-08 螺旋遗传学公司 多样品差分变异检测
CN109074429B (zh) * 2016-04-20 2022-03-29 华为技术有限公司 基因组变异检测方法、装置及终端
CN109074429A (zh) * 2016-04-20 2018-12-21 华为技术有限公司 基因组变异检测方法、装置及终端
CN110079589A (zh) * 2019-05-21 2019-08-02 中国农业科学院农业基因组研究所 一种精准获得全基因组范围内结构变异的方法
CN112086131A (zh) * 2020-08-18 2020-12-15 西安医学院 一种高通量测序中假阳性变异位点的筛选方法
CN112086131B (zh) * 2020-08-18 2024-05-24 西安医学院 一种重测序数据库中假阳性变异位点的筛选方法
WO2024138733A1 (fr) * 2022-12-30 2024-07-04 深圳华大生命科学研究院 Procédé et système de détection de variation structurelle, dispositif et support
CN116884497A (zh) * 2023-07-28 2023-10-13 天津诺禾致源生物信息科技有限公司 基因组组装缺口的填补方法和装置
CN117153248A (zh) * 2023-09-05 2023-12-01 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、系统
CN117153248B (zh) * 2023-09-05 2024-05-07 天津极智基因科技有限公司 一种基于泛基因组的基因区变异检测及可视化方法、系统

Also Published As

Publication number Publication date
CN103080333B (zh) 2015-06-24
CN103080333A (zh) 2013-05-01

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 201080068345.0

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10857107

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10857107

Country of ref document: EP

Kind code of ref document: A2