WO2021023142A1 - Gene comparison technology - Google Patents
- Publication number
- WO2021023142A1 (PCT/CN2020/106498)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gene
- sub
- gene sequence
- sequence
- tested
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Definitions
- This application relates to the field of optical technology, in particular to a gene comparison technology.
- DNA Deoxyribonucleic acid
- Gene refers to the DNA sequence that carries genetic information, also called genetic factor, which is the basic structural unit and functional unit of genetic material that controls biological traits. Genes express the genetic information they carry by directing protein synthesis, thereby controlling the performance of individual organisms.
- HGP Human Genome Project
- DNA sequence comparison is the prerequisite for gene identification, information analysis, structure prediction and other problems. By comparing multiple DNA sequences, the identical and differing sites and regions can be found, which helps judge the homology, variation points and sources of the gene to be tested.
- the gene comparison technology provided by this application can improve the efficiency of DNA comparison.
- an embodiment of the present invention provides a gene comparison method, which is applied to a computer system including an optical computing chip.
- the processor of the computer system can obtain the first group of gene fragments from the gene database according to the gene sequence to be tested, and input the gene sequence to be tested and the multiple reference gene fragments in the first group of gene fragments to the optical computing chip for optical comparison.
- the gene database contains a plurality of reference gene fragments of a reference gene sequence.
- the first group of gene fragments includes a plurality of reference gene fragments that match some of the bases of the gene sequence to be tested.
- the gene comparison method provided by the embodiment of the present invention combines two methods of database search and optical autocorrelation comparison.
- the constructed gene database is first matched against the gene sequence to be tested, so as to screen out the first set of reference gene fragments whose sequences may match the gene sequence to be tested. After the gene fragments to be compared are screened through the gene database provided by the embodiments of the present invention, the number of reference gene fragments that need to be compared in detail can be greatly reduced.
- the optical computing chip is further used to perform optical comparison between the gene sequence to be tested and the multiple reference gene fragments in the first set of reference gene fragments. Because the optical computing chip performs the comparison optically, the comparison is faster than electrical comparison of genes. Therefore, the gene comparison method provided by the embodiment of the present invention also greatly improves the comparison efficiency.
- the processor can obtain the first set of gene fragments from the database according to some of the bases of the gene sequence to be tested. For example, the first set of gene fragments is obtained from the database according to the first m bases and the last n bases of the gene sequence to be tested, where m and n are both greater than 0 and the sum of m and n is less than the number of bases in the gene sequence to be tested. Generally, the values of m and n can be determined according to factors such as the length of the gene sequence to be tested and the length of the reference gene sequence.
- the database may be a key-value database, wherein each key is a subset of the bases of a reference gene fragment of the reference gene sequence, and each value is the position of the corresponding reference gene fragments in the reference gene sequence.
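The lookup described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `make_key` and `candidate_positions`, the toy database contents, and the use of a plain dictionary as the key-value store are assumptions, not details given in the application.

```python
from collections import defaultdict

def make_key(sequence: str, m: int, n: int) -> str:
    """Build a key from the first m and the last n bases of a sequence."""
    if m <= 0 or n <= 0 or m + n >= len(sequence):
        raise ValueError("require m > 0, n > 0 and m + n < len(sequence)")
    return sequence[:m] + sequence[-n:]

def candidate_positions(db, read: str, m: int, n: int):
    """Return positions of reference fragments whose key matches the read's key."""
    return db.get(make_key(read, m, n), [])

# Toy database (hypothetical contents): key -> positions in the reference sequence.
db = defaultdict(list)
db["ACGT"].extend([0, 17])

read = "ACTTTTGT"                      # first 2 bases "AC", last 2 bases "GT"
print(candidate_positions(db, read, 2, 2))
```

Only the fragments at the returned positions would then need to be passed on for detailed comparison.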
- the method further includes determining, according to the output result of the optical computing chip, that the degree of similarity between the gene sequence to be tested and the first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold.
- the first threshold is greater than the second threshold.
- a plurality of sub-reference gene sequences are obtained from the reference gene sequence, and the test gene sequence and the first sub-reference gene sequence of the plurality of sub-reference gene sequences are input into the optical computing chip.
- the optical computing chip performs optical comparison to obtain the first degree of similarity between the test gene sequence and the first sub-reference gene sequence, wherein each sub-reference gene sequence is a part of the reference gene sequence.
- when the similarity between the gene sequence to be tested and at least one gene fragment in the first set of gene fragments is less than the first threshold and greater than the second threshold, it indicates that a matching reference gene fragment for the gene sequence to be tested is very likely to be found in the reference gene sequence, and further comparison is required. Therefore, the gene sequence to be tested can be further optically compared with the multiple sub-reference gene sequences of the reference gene sequence, so as to quickly find the reference gene fragment that matches at least a part of the gene sequence to be tested.
- the method may further include determining that the first similarity is greater than a third threshold and less than a fourth threshold, and in response to the above determination, obtaining a first test sub-gene sequence and a second test sub-gene sequence according to the test gene sequence.
- the first test sub-gene sequence and the first sub-reference gene sequence are input into the optical computing chip for optical comparison to obtain a second degree of similarity.
- the second test sub-gene sequence and the first sub-reference gene sequence are input into the optical computing chip for optical comparison to obtain a third degree of similarity.
- when the similarity between the test gene sequence and the first sub-reference gene sequence meets the preset condition, the test gene sequence can be further split, and the resulting first test sub-gene sequence and second test sub-gene sequence are each compared with the first sub-reference gene sequence, so that the partial segment of the test gene sequence that matches the first sub-reference gene sequence can be located as soon as possible.
- because this method of maximum similarity matching can tolerate base deletions, it can precisely locate the missing part or the variant part in the test gene sequence.
- the first test sub-gene sequence may include bases of a first preset length taken from the head of the test gene sequence toward the tail.
- the second test sub-gene sequence may include bases of the first preset length taken from the tail of the test gene sequence toward the head, and part of the bases of the first test sub-gene sequence and the second test sub-gene sequence overlap.
- the method further includes recording the position of the first sub-reference gene sequence in the reference gene sequence when the second degree of similarity is greater than the fourth threshold. In this manner, when the second degree of similarity between the first test sub-gene sequence and the first sub-reference gene sequence is greater than the fourth threshold, it can be confirmed that the first test sub-gene sequence and the first sub-reference gene sequence have a maximum similarity match, so that the position of the first sub-reference gene sequence in the reference gene sequence can be recorded to obtain the maximum similarity matching fragment of the first test sub-gene sequence.
- the method further includes: when the third similarity is greater than the third threshold and less than the fourth threshold, obtaining a first test sub-gene sequence unit and a second test sub-gene sequence unit according to the second test sub-gene sequence, inputting the first test sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison, and inputting the second test sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison, wherein part of the bases of the first test sub-gene sequence unit and the second test sub-gene sequence unit are the same.
- the second test sub-gene sequence can thus continue to be split and compared, so that, according to this recursive search method, the maximum similarity matching segment of at least a partial segment of the second test sub-gene sequence can be quickly located. Since this method of maximum similarity matching can tolerate base deletions, the precise location of gene deletions and gene mutation points can be realized.
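The recursive split-and-compare search described above can be illustrated with a short Python sketch. Several things here are assumptions made purely for illustration: the `similarity` function is a software stand-in for the optical comparison (the best fraction of matching bases over all ungapped placements), the two overlapping pieces are each three quarters of the parent length, and the names `locate`, `low`, `high` and `min_len` are hypothetical (`low` and `high` play the roles of the third and fourth thresholds).

```python
def similarity(query: str, ref: str) -> float:
    """Stand-in for the optical comparison: best fraction of matching
    bases over all ungapped placements of query inside ref."""
    if not query or len(query) > len(ref):
        return 0.0
    best = 0
    for off in range(len(ref) - len(query) + 1):
        hits = sum(q == r for q, r in zip(query, ref[off:off + len(query)]))
        best = max(best, hits)
    return best / len(query)

def locate(query, ref, low, high, min_len=4, start=0, out=None):
    """Recursively split query; collect (start, end, score) for every
    piece whose similarity to ref exceeds the high threshold."""
    if out is None:
        out = []
    score = similarity(query, ref)
    if score > high:
        # Strong match: record where this piece sits in the original query.
        out.append((start, start + len(query), score))
    elif score > low and len(query) >= 2 * min_len:
        # Ambiguous match: split into two overlapping pieces of equal
        # length, one taken from the head and one taken from the tail.
        piece = len(query) // 2 + len(query) // 4
        locate(query[:piece], ref, low, high, min_len, start, out)
        locate(query[-piece:], ref, low, high, min_len,
               start + len(query) - piece, out)
    return out

sub_reference = "ACGTACGTTTGACCAGTA"
print(locate("ACGTACGTCCCCCCCC", sub_reference, low=0.5, high=0.9))
```

Pieces scoring at or below `low` are dropped, so a corrupted or missing region of the query simply produces no record, while the intact regions around it are still located.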
- the method further includes: inputting the gene sequence to be tested and the second sub-reference gene sequence of the plurality of sub-reference gene sequences into the optical computing chip for optical comparison to obtain the fourth degree of similarity between the test gene sequence and the second sub-reference gene sequence, and inputting the test gene sequence and the third sub-reference gene sequence among the plurality of sub-reference gene sequences into the optical computing chip.
- the optical computing chip performs optical comparison to obtain the fifth degree of similarity between the test gene sequence and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence.
- a fourth sub-reference gene sequence is obtained according to the second sub-reference gene sequence and the third sub-reference gene sequence, and the gene sequence to be tested and the fourth sub-reference gene sequence are input into the optical computing chip for optical comparison.
- the fourth sub-reference gene sequence includes a partial base of the second sub-reference gene sequence and a partial base of the third sub-reference gene sequence.
- when it is determined that the similarity between the test gene sequence and the second sub-reference gene sequence does not meet the conditions for further matching with the second sub-reference gene sequence, the position of the sub-reference gene sequence can be adjusted in time: bases are taken consecutively from the second sub-reference gene sequence and the third sub-reference gene sequence to obtain the fourth sub-reference gene sequence, so that the maximum similarity match of the gene sequence to be tested can be found from the fourth sub-reference gene sequence as soon as possible.
- This method of adjusting the sub-reference gene sequence in time according to the partial comparison results can increase the probability and speed of obtaining the largest similar gene fragments and reduce the number of comparisons.
- partial reference gene fragments can be taken from the second sub-reference gene sequence and the third sub-reference gene sequence, according to the ratio of the fourth degree of similarity to the fifth degree of similarity, to compose the fourth sub-reference gene sequence.
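One possible reading of this ratio-based construction is sketched below in Python. The text does not spell out the exact rule, so the following are assumptions: the two sub-reference sequences are adjacent and of equal length, the fourth sequence has that same length, and the number of bases taken from the tail of the second sequence is proportional to the fourth similarity's share of the total. The function name `fourth_sub_reference` is hypothetical.

```python
def fourth_sub_reference(second: str, third: str, s4: float, s5: float) -> str:
    """Compose a window spanning the boundary between two adjacent
    sub-reference sequences, split in the ratio s4 : s5 (assumed rule)."""
    if s4 < 0 or s5 < 0 or s4 + s5 == 0:
        raise ValueError("similarities must be non-negative and not both zero")
    length = len(second)                           # assumed: equal window length
    from_second = round(length * s4 / (s4 + s5))   # bases taken from the tail of `second`
    from_third = length - from_second              # bases taken from the head of `third`
    tail = second[len(second) - from_second:]
    return tail + third[:from_third]

# Higher similarity to the second sub-reference keeps more of its bases.
print(fourth_sub_reference("AAAACCCC", "GGGGTTTT", 0.75, 0.25))
```

Under this reading, the window slides toward whichever neighbour resembled the test sequence more, which is consistent with the stated aim of reducing the number of comparisons.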
- the method further includes determining, according to the output result of the optical computing chip, that the second gene fragment in the first set of gene fragments matches the gene sequence to be tested, and recording the position of the second gene fragment in the reference gene sequence.
- inputting the gene sequence to be tested and the multiple reference gene fragments in the first group of gene fragments into the optical computing chip for optical comparison includes: optically encoding the gene sequence to be tested and the plurality of reference gene fragments in the first set of gene fragments respectively, and inputting the optical code of the gene sequence to be tested and the optical codes of the plurality of gene fragments in the first set of gene fragments into the optical computing chip for optical comparison.
- the gene sequence to be tested and the multiple reference gene fragments can be optically encoded according to light intensity information and/or light spatial information.
- the embodiment of the present invention provides a gene comparison device including a processor and an optical computing chip.
- the processor is used to obtain a first set of gene fragments from a database according to the gene sequence to be tested, wherein the database contains a plurality of reference gene fragments of a reference gene sequence, and the first set of gene fragments includes multiple reference gene fragments that match some of the bases of the gene sequence to be tested.
- the optical computing chip is connected to the processor and is used for optically comparing the gene sequence to be tested with a plurality of reference gene fragments in the first group of gene fragments.
- the processor may obtain the first group of gene fragments from the database according to some of the bases of the gene sequence to be tested. For example, the first set of gene fragments is obtained from the database according to the first m bases and the last n bases of the gene sequence to be tested, where m and n are both greater than 0 and the sum of m and n is less than the number of bases in the gene sequence to be tested.
- the database may be a key-value database, wherein each key is a subset of the bases of a reference gene fragment of the reference gene sequence, and each value is the position of the corresponding reference gene fragments in the reference gene sequence.
- the processor is further configured to determine, according to the output result of the optical computing chip, that the similarity between the gene sequence to be tested and the first gene fragment in the first group of gene fragments is less than the first threshold and greater than the second threshold, and to obtain a plurality of sub-reference gene sequences from the reference gene sequence, wherein each sub-reference gene sequence is a part of the reference gene sequence.
- the optical computing chip is also used for optically comparing the gene sequence to be tested with the first sub-reference gene sequence among the plurality of sub-reference gene sequences to obtain the first degree of similarity between the gene sequence to be tested and the first sub-reference gene sequence.
- the processor is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold, and, in response to the above determination, to obtain the first test sub-gene sequence and the second test sub-gene sequence according to the test gene sequence, wherein part of the bases of the first test sub-gene sequence and the second test sub-gene sequence are the same.
- the optical computing chip is also used for optically comparing the first test sub-gene sequence with the first sub-reference gene sequence to obtain a second degree of similarity, and for optically comparing the second test sub-gene sequence with the first sub-reference gene sequence to obtain a third degree of similarity.
- the processor is further configured to record the position of the first sub-reference gene sequence in the reference gene sequence when the second degree of similarity is greater than the fourth threshold.
- the processor is further configured to obtain a first test sub-gene sequence unit and a second test sub-gene sequence unit according to the second test sub-gene sequence when the third similarity is greater than the third threshold and less than the fourth threshold.
- the optical computing chip is also used for optically comparing the first test sub-gene sequence unit with the first sub-reference gene sequence, and for optically comparing the second test sub-gene sequence unit with the first sub-reference gene sequence.
- the optical computing chip is also used for optically comparing the gene sequence to be tested with a second sub-reference gene sequence among the plurality of sub-reference gene sequences, and for optically comparing the gene sequence to be tested with a third sub-reference gene sequence among the plurality of sub-reference gene sequences, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence.
- the processor is further configured to determine that the sum of a fourth degree of similarity between the test gene sequence and the second sub-reference gene sequence and a fifth degree of similarity between the test gene sequence and the third sub-reference gene sequence is greater than the first threshold; to obtain the fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence; and to input the test gene sequence and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.
- the fourth sub-reference gene sequence includes a partial base of the second sub-reference gene sequence and a partial base of the third sub-reference gene sequence.
- the processor is further configured to determine, according to the output result of the optical computing chip, that the second gene fragment in the first group of gene fragments matches the gene sequence to be tested, and to record the position of the second gene fragment in the reference gene sequence.
- the processor is further configured to optically encode the gene sequence to be tested and the multiple reference gene fragments in the first set of gene fragments, and to input the optical code of the gene sequence to be tested and the optical codes of the multiple gene fragments in the first set of gene fragments into the optical computing chip for optical comparison.
- an embodiment of the present invention provides a comparison device including a processor and an optical computing chip.
- the processor is configured to obtain a first group of reference objects from a database according to the first object to be matched, wherein the first group of reference objects includes a plurality of reference objects that have the same partial characteristics as the first object.
- the optical computing chip is connected to the processor and used for optical comparison of the first object and the multiple reference objects.
- the comparison device provided by the embodiment of the present invention combines two methods of database search and optical comparison. After the reference objects to be compared are screened through the database, the number of reference objects that require detailed comparison can be greatly reduced. Moreover, the use of optical computing chips for comparison can greatly increase the comparison speed.
- the comparison device provided by the implementation of the present invention can not only be used in gene detection scenarios, but also can be applied to various scenarios that require massive data comparison.
- the processor is further configured to determine, according to the output result of the optical computing chip, that the similarity between the first object and the first reference object in the first group of reference objects is less than the first threshold and greater than the second threshold, and to obtain a plurality of sub-reference objects according to the reference object, wherein each sub-reference object is a part of the reference object.
- the optical computing chip is further configured to perform an optical comparison between the first object and the first sub-reference object among the plurality of sub-reference objects to obtain a first similarity between the first object and the first sub-reference object.
- the processor is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, and in response to the above determination, to obtain a first sub-object and a second sub-object according to the first object.
- the fourth threshold is not greater than the first threshold, and partial data of the first sub-object and the second sub-object are the same.
- the optical computing chip is also used to perform optical comparison between the first sub-object and the first sub-reference object to obtain a second degree of similarity, and to compare the second sub-object and the first sub-reference object Perform optical comparison to obtain the third degree of similarity.
- the processor is further configured to record the position of the first sub-reference object in the reference object when the second similarity is greater than the fourth threshold.
- this application also provides a comparison device, including an acquisition module, a comparison module, a result processing module, a judgment module, and other modules used to implement the method of the first aspect or any one of the possible implementations of the first aspect.
- the present application also provides a computer program product, including program code, wherein instructions included in the program code are executed by a computer to implement the gene comparison method described in the first aspect or any one of the possible implementations of the first aspect.
- the present application also provides a computer-readable storage medium for storing program code, wherein instructions included in the program code are executed by a computer to implement the gene comparison method described in the first aspect or any one of the possible implementations of the first aspect.
- Figure 1 is a schematic structural diagram of a gene comparison device provided by an embodiment of the present invention.
- FIG. 2A is a schematic diagram of a gene database provided by an embodiment of the present invention.
- FIG. 2B is a schematic diagram of an optical encoding provided by an embodiment of the present invention.
- FIG. 3A is a schematic structural diagram of an optical computing chip provided by an embodiment of the present invention.
- FIG. 3B is a schematic structural diagram of yet another optical computing chip provided by an embodiment of the present invention.
- FIG. 3C is a schematic diagram of the principle of optical comparison provided by an embodiment of the present invention.
- FIG. 5A, FIG. 5B, FIG. 5C, and FIG. 5D are examples of optical encoding provided by embodiments of the present invention.
- FIG. 6 is a flowchart of another gene comparison method provided by an embodiment of the present invention.
- FIG. 7 is a schematic diagram of the sub-reference gene sequence and the test sub-gene sequence provided by the embodiment of the present invention.
- Fig. 8 is a flowchart of another gene comparison method provided by an embodiment of the present invention.
- FIG. 9 is a schematic structural diagram of a comparison device provided by an embodiment of the present invention.
- FIG. 10 is a schematic structural diagram of yet another comparison device provided by an embodiment of the present invention.
- DNA sequencing data has exploded. Therefore, how to improve the speed of DNA comparison is an urgent technical problem to be solved.
- the search rate is usually accelerated by indexing the reference gene sequence in a computer system.
- the essence of the index is to improve the search efficiency by optimizing the data structure.
- index optimization itself has a bottleneck, and creating a large number of complex indexes is itself costly. Therefore, the efficiency of this gene comparison method cannot keep up with the massive increase in DNA sequencing data.
- the gene comparison scheme provided by the embodiment of the present invention can greatly increase the speed of gene comparison, and can quickly realize gene comparison even when facing massive gene sequencing data.
- Gene refers to the genetic information that controls biological traits, usually carried by DNA sequences. Genes can also be regarded as the basic genetic unit, that is, a functional DNA or ribonucleic acid (RNA) sequence. The process of figuring out the sequence itself is called gene sequencing.
- Gene sequence to be tested: also called a read, this is a short sequence fragment, a kind of sequencing data generated by a high-throughput sequencing platform. In the process of sequencing an entire genome, millions of reads are generated, and these reads can then be spliced together to obtain the full sequence of the genome.
- Reference gene sequence, also called reference sequence: a verified and edited standard sequence.
- the reference gene sequence can provide a basis for the functional annotation of the human genome and a stable reference point for mutation analysis, gene expression research and polymorphism discovery.
- Base pair: the chemical structure that forms DNA and RNA monomers and encodes genetic information.
- the bases that make up a base pair include adenine A, guanine G, thymine T, cytosine C, and uracil U. Strictly speaking, a base pair is a pair of matched bases (i.e., A-T, G-C, A-U interaction) connected by hydrogen bonds. It is often used as a unit to measure the length of DNA and RNA (although RNA is single-stranded).
- FIG. 1 is a schematic diagram of using an optical system to realize gene comparison according to an embodiment of the present invention.
- the gene comparison device 100 may include a processor 102, a memory 104, and an optical computing chip 106.
- the processor 102 and the memory 104 can be regarded as a part of the host 101.
- the optical computing chip 106 may be connected to the host 101 through a host interface.
- the host interface may include a standard host interface and a network interface (network interface).
- the host interface may include a Peripheral Component Interconnect Express (PCIE) interface.
- the data can be sent to the optical computing chip 106 through the host interface, and the data processed by the optical computing chip 106 can also be sent to the processor 102 through the host interface.
- the processor 102 can also monitor the working status of the optical computing chip 106 through the host interface.
- the processor 102 and the memory 104 may not be part of the host, and the processor 102, the memory 104 and the optical computing chip 106 may be part of a system chip (System on a Chip, SOC).
- the processor 102 is the computing core and the control unit of the gene comparison device 100.
- the processor 102 may include multiple processor cores (cores).
- the processor 102 may be a very large-scale integrated circuit.
- An operating system and other software programs are installed in the processor 102, so that the processor 102 can implement access to the memory 1042, cache, disk, and peripheral devices (such as the optical computing chip 106 in FIG. 1).
- a core in the processor 102 may be, for example, a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), or another application specific integrated circuit (ASIC).
- the memory 104 is used to store data.
- the memory 104 may include the main memory 1042, a magnetic disk, and other storage for storing data.
- the memory 1042 is the main memory of the host 101.
- the memory 1042 may be connected to the processor 102 through a double data rate (DDR) bus.
- the memory 1042 is generally used to store various running software in the operating system, input and output data, and information exchanged with external memory. In order to increase the access speed of the processor 102, the memory 1042 needs to have the advantage of fast access speed. In practical applications, a dynamic random access memory (Dynamic Random Access Memory, DRAM) is usually used as the memory 1042.
- the processor 102 can access the memory 1042 at a high speed through a memory controller (not shown in FIG. 1), and perform read and write operations on any storage unit in the memory 1042.
- the memory 104 may be used to store the gene database 1044.
- the gene database 1044 may be a key-value database established based on the reference sequence. The key can be obtained from some of the bases of a gene fragment.
- the value may include the position in the memory of the reference gene fragment corresponding to the key, and may also include the position of that reference gene fragment in the reference gene sequence.
- part of the bases of the reference gene sequence may be used as the key value.
- the first m bases and the last n bases of a reference gene fragment of a preset length may be used as the key value.
- m and n may be the same or different, and are not limited here.
- the key 1044_1 is described by taking 10 bases as an example. Specifically, the 5 bases at the head and the 5 bases at the tail of the reference fragment can be taken together as the key value.
- how to establish the gene database 1044 is described by taking a reference gene fragment length of 150 bases as an example. Specifically, first construct an empty index table (containing only key values) with $4^{5+5}$ rows, in which the keys are arranged in alphabetical order from AAAAAAAAAA to TTTTTTTTTT.
- the mapping method is shown in Figure 2B. Specifically, the sequence is arranged in the order of the head base as the high position and the tail base as the low position.
- the base in each position advances through A, C, G, T; when a position that has reached T is advanced again, it returns to A and carries one into the next higher position.
- when all tail bases are TTTTT, the next key increments the head bases by one.
- the keys are therefore in the following order: AAAAAAAAAA, AAAAAAAAAC, AAAAAAAAAG, AAAAAAAAAT, AAAAAAAACA, AAAAAAAACC, AAAAAAAACG, AAAAAAAACT, etc.
- the key 1044_1 as shown in FIG. 2A can be obtained.
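- As an illustration only, the key ordering described above behaves like counting in base 4. The sketch below assumes the digit assignment A=0, C=1, G=2, T=3 and a hypothetical helper `key_to_index`; neither name comes from the embodiment, and the embodiment's own mapping may differ.

```python
# Assumed digit assignment that reproduces the ordering A, C, G, T.
BASE_DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def key_to_index(key):
    """Interpret a key as a base-4 number with the head base most significant."""
    index = 0
    for base in key:
        index = index * 4 + BASE_DIGITS[base]
    return index


# Row numbers of the first few keys and of the last key in a 10-base index table.
first_rows = [key_to_index(k) for k in ("AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACA")]
last_row = key_to_index("TTTTTTTTTT")
```

- With this assignment, a 10-base key maps to one of 4^10 consecutive row numbers, so the index table can be addressed directly by row.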
- A window of the preset base length is then slid along the reference gene sequence with a step of one unit base (i.e., 1 base), so that multiple reference gene fragments are obtained.
- For each reference gene fragment, the key is obtained from its 5 head bases and 5 tail bases, and the value corresponding to that key is then recorded.
- The position of the reference gene fragment in the reference gene sequence is recorded in the value 1044_2. For example, the position of the first base of the reference gene fragment can be recorded. Sliding in this way to the end of the reference gene sequence yields the values of all the reference gene fragments of the reference gene sequence (i.e., the position information of the reference gene fragments).
- the gene database 1044 as shown in FIG. 2A can be established.
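- The construction steps above can be sketched in software as follows. This is a minimal sketch that uses an in-memory dictionary in place of the gene database 1044; the function name `build_gene_database` and its parameters are hypothetical, and a toy reference with 8-base fragments is used instead of 150-base fragments.

```python
from collections import defaultdict


def build_gene_database(reference, fragment_len=150, m=5, n=5):
    """Map each head+tail key to the start positions of the matching fragments.

    Assumes m >= 1 and n >= 1.
    """
    database = defaultdict(list)
    # Slide a window of fragment_len bases over the reference, one base at a time.
    for start in range(len(reference) - fragment_len + 1):
        fragment = reference[start:start + fragment_len]
        key = fragment[:m] + fragment[-n:]  # first m bases + last n bases
        database[key].append(start)         # value: position of the first base
    return dict(database)


# Toy example: 12-base reference, 8-base fragments, 2 head + 2 tail bases as key.
toy_db = build_gene_database("ACGTACGTACGT", fragment_len=8, m=2, n=2)
```

- In this toy example the fragments starting at positions 0 and 4 share the key ACGT, so both positions are stored under that key.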
- The key mapping method depends on the form of permutation and combination. Assuming that the sequence fragments formed by the first n and the last m bases are Seq1 and Seq2, respectively, the key mapping is defined as:
- The choice of n and m directly affects the efficiency of the algorithm: increasing n and m reduces the number of values (i.e., position information) stored under each key. If hardware factors are not considered, the addressing rate for each gene sequence to be tested increases by a factor of 4 for every additional base. However, because of sequencing errors and gene mutations, n and m cannot be increased indefinitely, since larger n and m may reduce the reliability of the key. Therefore, the values of m and n can be chosen as needed, and the length of the reference gene fragment can also be set according to actual needs.
- the values of m and n can be determined according to factors such as the length of the gene sequence to be tested and the length of the reference gene sequence. In practical applications, the length of the reference gene fragment is usually the same as the base length of the gene sequence to be tested.
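- The effect of m and n on the size of the key space can be checked with a short calculation. The sketch below assumes, purely for illustration, that fragments are spread uniformly over all keys, which real genomes do not guarantee; the function names are hypothetical.

```python
def key_space_size(m, n):
    """Number of distinct keys when the key is m head bases plus n tail bases."""
    return 4 ** (m + n)


def expected_positions_per_key(reference_len, m, n):
    """Average number of positions per key under the uniformity assumption."""
    return reference_len / key_space_size(m, n)


keys_10 = key_space_size(5, 5)           # 4^10 keys for a 5 + 5 base key
growth = key_space_size(5, 6) // keys_10  # one extra base multiplies the key space by 4
average = expected_positions_per_key(3_000_000_000, 5, 5)
```

- Each additional key base multiplies the number of keys by 4 and, under the stated assumption, divides the average number of candidate positions per key by 4.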
- the optical computing chip 106 may be an on-chip optical computing system.
- FIG. 3A is a schematic structural diagram of an optical computing chip provided by an embodiment of the present invention.
- the optical computing chip 106 may include a light source array 202, a modulator array 204, a detector array 206, a first concave mirror 208, and a second concave mirror 210.
- The light source array 202 is located on the object focal plane of the first concave mirror 208.
- the modulator array 204 is located on the image focal plane of the first concave mirror 208, and the modulator array 204 is also located on the object focal plane of the second concave mirror 210.
- the detector array 206 is located on the image focal plane of the second concave mirror 210.
- the light source array 202 is used for data modulation and transmission, as a data input unit of the optical computing chip 106.
- the light source array 202 can generate multiple light signals with different light intensities according to input data.
- the first concave mirror 208 is used to implement a standard Fourier transform on the data light signal sent by the light source array 202.
- the modulator array 204 has two operating modes: recording mode and modulation mode.
- the recording mode is used to obtain an image of the spectral surface of the data light signal sent by the light source array 202 after passing through the first concave mirror 208.
- the modulation mode is used to modulate the spectral surface image of the data light signal sent by the light source array 202 on the modulator array 204.
- the second concave mirror 210 is used to implement a standard inverse Fourier transform on the optical signal after passing through the modulator array 204.
- The detector array 206 is used for light intensity signal detection, as the result output unit of the optical computing chip 106.
- FIG. 3B is a schematic structural diagram of yet another optical computing chip provided by an embodiment of the present invention. Unlike the on-chip integrated optical computing chip shown in FIG. 3A, in the optical computing chip shown in FIG. 3B the light source array 202 and the detector array 206 are arranged on the same side of the chip, which makes the structure of the whole computing chip more compact and reduces the chip size. As shown in FIG. 3B, compared with the optical computing chip shown in FIG. 3A, the positions of the first concave mirror 208, the second concave mirror 210, and the modulator array 204 are unchanged.
- The focal positions of the light source array 202, the modulator array 204, and the detector array 206 with respect to the first concave mirror 208 and the second concave mirror 210 are also unchanged.
- For the implementation of each device shown in FIG. 3B, reference may be made to the description of each device in the optical computing chip shown in FIG. 3A, which is not repeated here.
- FIGS. 3A and 3B are only schematic diagrams of the structure of the optical computing chip provided by the embodiment of the present invention.
- the specific structure of the optical computing chip 106 is not limited, and optical computing chips of other structures may also be used.
- the optical computing chip 106 may also be an optical computing chip of other structure implemented by using the principle of the 4F optical computing system.
- Figure 3C is a schematic diagram of the principle of the 4F optical computing system. As shown in FIG. 3C, the first modulator 302 is located at the object focal plane of the first convex lens 304.
- The second modulator 306 is located at the image focal plane of the first convex lens 304, which is also the object focal plane of the second convex lens 308.
- The interval between the first convex lens 304 and the second convex lens 308 is the sum of the focal lengths of the two convex lenses (304 and 308).
- The detector 310 is located at the image focal plane of the second convex lens 308, so the length of the entire system is 4 times the focal length.
- In this way, the spectrum of the first data and the spectrum of the second data complete a multiplication operation in optical space. In essence, this changes the optical field energy distribution of the spectral optical signal of the first data in optical space.
- the multiplied spectrum optical signal undergoes an inverse Fourier transform through the second convex lens 308 and then becomes a time-domain optical signal.
- the detector 310 can obtain the autocorrelation result of the two data according to the intensity of the time-domain optical signal detected through the second convex lens 308. It should be noted that the first data and the second data loaded on the optical computing chip may both be vectors.
- The data comparison performed by the optical computing chips in FIGS. 3A-3C is thus based on detecting the autocorrelation result of the optical signals of the two data in optical space.
- Autocorrelation, also called serial correlation, is the cross-correlation of a signal with itself at different time points.
- Informally, autocorrelation expresses the similarity between two observations as a function of the time difference between them.
- Autocorrelation is a mathematical tool for finding repeating patterns of random variable sequences.
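- The Fourier transform, multiplication, and inverse Fourier transform chain of the 4F system can be imitated numerically. The following is a minimal sketch using a naive discrete Fourier transform from the Python standard library; the function names are hypothetical, and the complex conjugate of one spectrum is used here as a numerical substitute for flipping one input. When the two inputs differ, the result is usually called a cross-correlation.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)); adequate for short vectors."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def circular_cross_correlation(a, b):
    """Correlate two equal-length real vectors by multiplying their spectra."""
    spectrum_a = dft(a)                       # role of the first lens or mirror
    spectrum_b = dft(b)
    product = [fa * fb.conjugate()            # multiplication in the spectral plane
               for fa, fb in zip(spectrum_a, spectrum_b)]
    # Role of the second lens or mirror: back to the signal domain.
    return [round(v.real, 6) for v in dft(product, inverse=True)]


same = circular_cross_correlation([1, 0, 0, 1], [1, 0, 0, 1])
shifted = circular_cross_correlation([1, 0, 0, 0], [0, 1, 0, 0])
```

- Element 0 of the result is the dot product of the two inputs, so identical inputs give the largest value at zero shift, which is the quantity a detector would read as the match strength.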
- Figure 4 is a flowchart of a gene comparison method provided by an embodiment of the present invention.
- the method shown in FIG. 4 will be specifically introduced below in conjunction with FIG. 1.
- the embodiment of the present invention is described by taking the detection of a gene sequence to be tested as an example. It is understandable that even if multiple gene sequences to be tested are detected at one time in practical applications, each gene sequence to be tested can be compared with reference to the embodiment of the present invention.
- the method includes the following steps.
- The processor 102 obtains the first set of gene fragments from the gene database according to part of the bases of the gene sequence to be tested.
- The key of the gene sequence to be tested is obtained in the same way as the key 1044_1 of the gene database 1044.
- For example, the 5 bases at the head and the 5 bases at the tail of the gene sequence to be tested can be used as its key.
- The gene database 1044 is searched with the key of the gene sequence to be tested to obtain the multiple values that match the key; these values indicate the possible positions of the gene sequence to be tested on the reference gene sequence.
- Multiple reference gene fragments can then be obtained according to the matching values.
- The multiple reference gene fragments that match the key of the gene sequence to be tested are called the first set of gene fragments.
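- The lookup described above can be sketched as follows, with a hand-written toy dictionary standing in for the gene database 1044. The function name `lookup_candidates` and the toy data are hypothetical.

```python
def lookup_candidates(database, reference, read, m=5, n=5):
    """Return (position, fragment) pairs whose head/tail key equals the read's key."""
    key = read[:m] + read[-n:]            # same key rule as used for the database
    positions = database.get(key, [])     # no entry means no candidate fragments
    return [(pos, reference[pos:pos + len(read)]) for pos in positions]


# Toy data: 8-base fragments keyed by 2 head + 2 tail bases.
reference = "ACGTACGTACGT"
database = {"ACGT": [0, 4], "CGTA": [1], "GTAC": [2], "TACG": [3]}
candidates = lookup_candidates(database, reference, "ACTTAGGT", m=2, n=2)
```

- The read ACTTAGGT differs from the reference in its middle bases but shares the key ACGT, so both fragments stored under that key are returned as the first set of gene fragments for later comparison.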
- the optical computing chip 106 is used to optically compare the gene sequence to be tested with multiple reference gene fragments in the first set of gene fragments.
- The processor 102 may optically encode the gene sequence to be tested and the multiple reference gene fragments separately, and load the optical codes of the gene sequence to be tested and of the multiple reference gene fragments into the optical computing chip for comparison.
- The base strings of the gene sequence to be tested and of each reference gene fragment can be encoded separately.
- For example, A, C, G, and T are encoded as 0001, 0010, 0100, and 1000, respectively, as shown in Figure 5A.
- In this way, the optical codes of the gene sequence to be tested and of the multiple reference gene fragments in the first set of gene fragments can be obtained.
- The obtained optical codes can then be sent to the optical computing chip 106 for optical comparison.
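- The coding scheme of Figure 5A can be written down directly. The sketch below only produces the bit pattern; how the bits drive the light source array is outside its scope, and the names `BASE_CODES` and `encode_sequence` are hypothetical.

```python
# Coding scheme from the description: one 4-bit one-hot code per base.
BASE_CODES = {"A": "0001", "C": "0010", "G": "0100", "T": "1000"}


def encode_sequence(seq):
    """Encode a base string as a flat list of 0/1 values, four per base."""
    bits = []
    for base in seq:
        bits.extend(int(b) for b in BASE_CODES[base])
    return bits


code = encode_sequence("AC")
```

- A sequence of L bases therefore becomes a vector of 4L values, which is the form in which it could be loaded onto an array of light sources.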
- the light intensity information and/or light spatial information may also be included in the encoding process.
- The method of encoding using the intensity information of light may be referred to as the intensity encoding method, and the method of encoding using the spatial information of light may be referred to as the spatial encoding method.
- the two encoding methods can also be combined, and this combined method can be called a hybrid encoding method.
- the intensity coding method can use different voltage amplitudes to modulate the light intensity, and express four different bases through light signals of different intensities.
- The intensity coding method can be as shown in FIG. 5B.
- multiple point light sources can be used as single-base unit clusters, and four different bases can be represented by different brightness levels (for example, 0 means light source off, 1 means light source bright).
- the spatial encoding method may be as shown in FIG. 5C, and multiple optical signals with the same voltage and different light intensities may be used to represent different bases.
- the hybrid encoding method can be a combination of intensity encoding and spatial encoding. For example, as shown in FIG. 5D, a plurality of specific light signals with different voltages and different light intensities can be combined to express different bases.
- the specific coding mode is not limited in the embodiment of the present invention.
- The light source array 202 may first send a first light signal according to the flipped optical code of the gene sequence to be tested. The first light signal is reflected by the first concave mirror 208, which applies a Fourier transform to it.
- The modulator array 204 receives the reflected spectral light signal of the first light signal and modulates that spectral light signal onto the modulator array 204.
- The light source array 202 then sends a plurality of light signals according to the optical codes of the plurality of reference gene fragments in the first set of reference gene fragments. After passing through the first concave mirror 208, each light signal sent according to the optical code of a reference gene fragment completes a multiplication operation in optical space with the reflected signal of the first light signal.
- The spectral light signal output by the modulator array 204 is inverse Fourier transformed into a time-domain light signal by the second concave mirror 210.
- The detector array 206 detects the light intensity of the time-domain light signal output by the second concave mirror 210, and the matching results between the first light signal and the light signals of the multiple reference gene fragments are obtained respectively.
- Those skilled in the art will know that multiplying the spectra of two data and applying an inverse Fourier transform to the product yields the autocorrelation result of the two data.
- the processor 102 determines the similarity between the gene sequence to be tested and the multiple reference gene fragments according to the output result of the optical computing chip.
- the optical computing chip 106 may send the matching result to the processor 102.
- some peripheral circuits may collect the light intensity signal detected by the detector array 206, convert the collected light intensity signal into an electric signal, and convert the electric signal into a digital signal and then send it to the processor 102, so that the processor 102 can obtain the comparison result of the optical computing chip 106 of the gene sequence to be tested and the reference gene fragment.
- the detector array 206 may generate feedback every time a comparison result is obtained, or may generate feedback when the similarity reaches a preset threshold. It should be noted that the similarity in the embodiment of the present invention is used to indicate the degree of matching between the gene sequence to be tested and the reference gene fragment.
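- The embodiment does not give a formula for the similarity. One plausible software reading, assumed here for illustration only, is the fraction of positions at which the two sequences carry the same base, which equals the zero-shift correlation of their one-hot codes divided by the sequence length. The names `ONE_HOT` and `similarity` are hypothetical.

```python
# One-hot codes for A, C, G, T (0001, 0010, 0100, 1000).
ONE_HOT = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0), "G": (0, 1, 0, 0), "T": (1, 0, 0, 0)}


def similarity(read, fragment):
    """Fraction of matching bases between two equal-length sequences."""
    if len(read) != len(fragment):
        raise ValueError("sequences must have the same length")
    # The dot product of two one-hot codes is 1 for equal bases and 0 otherwise.
    matches = sum(x * y
                  for a, b in zip(read, fragment)
                  for x, y in zip(ONE_HOT[a], ONE_HOT[b]))
    return matches / len(read)


score = similarity("ACGT", "ACGA")
```

- Under this assumption a score of 1.0 corresponds to a complete match, and the thresholds discussed in the following steps can be read as fractions of matching bases.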
- In step 408, the processor 102 determines whether the similarity between the gene sequence to be tested and a first reference gene fragment of the plurality of reference gene fragments is greater than or equal to a first threshold. In this step, the processor 102 compares the comparison result with the set threshold after obtaining it; the matching result between the gene sequence to be tested and any reference gene fragment can be compared with the set threshold in this way.
- The embodiment of the present invention is described by taking the gene sequence to be tested and the first reference gene fragment in the first set of reference gene fragments as an example, where the first reference gene fragment is any reference gene fragment in the first set of reference gene fragments.
- If the similarity is greater than or equal to the first threshold, the method proceeds to step 410; otherwise, the method proceeds to step 412.
- In step 410, the processor 102 records the position of the first reference gene fragment in the reference gene sequence and ends the matching of the gene sequence to be tested.
- A matching result with a similarity greater than or equal to the first threshold may be considered a successful match.
- When the processor 102 determines that the gene sequence to be tested matches the first reference gene fragment successfully, it may record the position of the first reference gene fragment in the reference gene sequence.
- The matching of the gene sequence to be tested then ends. It can be understood that in the embodiment of the present invention, the similarity is used to indicate the degree of matching between the gene sequence to be tested and the reference gene fragment.
- the first threshold is used to indicate whether the required matching standard is met.
- The first threshold can be used to indicate a perfect match or a maximum similarity match. If the similarity is greater than or equal to the set first threshold, the gene sequence to be tested can be considered to match the reference gene fragment completely or with maximum similarity.
- the first threshold may be 100% or 95%, which is not limited here.
- If, in step 408, the processor 102 determines that the similarity between the gene sequence to be tested and the first reference gene fragment is less than the first threshold, then in step 412 the processor 102 further determines whether that similarity is greater than a second threshold. If the similarity is greater than the second threshold, the method proceeds to step 414 and enters the maximum similarity matching process. Otherwise, the method proceeds to step 416, confirms that the gene sequence to be tested does not match the first reference gene fragment, and ends the matching between the gene sequence to be tested and the first reference gene fragment.
- the second threshold may be set to 50%.
- When the similarity between the gene sequence to be tested and the first reference gene fragment is less than the first threshold and greater than the second threshold, the gene sequence to be tested is fairly likely to match the reference gene sequence, or some fragments of the gene sequence to be tested may match the reference gene sequence. Therefore, the gene sequence to be tested needs to be compared further with the reference gene sequence, and the method enters the maximum similarity matching process.
- steps 408 to 416 in FIG. 4 are described by taking the matching between the gene sequence to be tested and the first reference gene segment as an example.
- In practical applications, the similarity between the gene sequence to be tested and each reference gene fragment can be determined according to step 408 and then handled as described for steps 410 to 416.
- The operations from step 404 to step 416 may also be performed in sequence for the gene sequence to be tested and each reference gene fragment in the first set of reference gene fragments.
- the specific implementation manner is not limited.
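- The decision logic of steps 408 to 416 can be summarized in a few lines. The sketch below uses the example thresholds mentioned in the description (95% and 50%) as defaults; the function name and the returned labels are hypothetical.

```python
def classify_match(similarity, first_threshold=0.95, second_threshold=0.50):
    """Map a similarity score to the next action of the matching flow."""
    if similarity >= first_threshold:
        return "record_position"          # step 410: successful match
    if similarity > second_threshold:
        return "max_similarity_matching"  # step 414: compare further
    return "no_match"                     # step 416: fragment does not match


decisions = [classify_match(s) for s in (1.0, 0.95, 0.7, 0.5, 0.1)]
```

- Note that the first comparison is inclusive (greater than or equal to) while the second is strict (greater than), following the wording of steps 408 and 412.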
- In the gene comparison method provided by the embodiment of the present invention, the constructed gene database is first matched against the gene sequence to be tested, thereby screening out the first set of reference gene fragments that may match the gene sequence to be tested.
- The human reference gene sequence has 3 billion bases, and comparing the gene sequence to be tested directly with the reference gene fragments one by one would take a lot of time.
- With this screening, the number of reference gene fragments that need to be compared can be reduced from the order of 3 billion to the order of hundreds, greatly reducing the number of reference gene fragments to be compared.
- The optical computing chip is then used to optically compare the gene sequence to be tested with the multiple reference gene fragments in the first set of reference gene fragments. Because the optical computing chip performs the comparison optically, it is faster than electrical gene comparison. Therefore, the gene comparison method provided by the embodiment of the present invention also greatly improves the comparison efficiency.
- FIG. 6 is a flowchart of another gene comparison method provided by an embodiment of the present invention. The method shown in FIG. 6 is still executed by the gene matching device 100. As shown in Figure 6, the method may include the following steps.
- the processor 102 obtains a plurality of sub-reference gene sequences from the reference gene sequence. Specifically, the processor 102 obtains multiple sub-reference gene sequences from the reference gene sequence according to the length of the gene sequence to be tested. For example, the length of the gene sequence to be tested can be used as the window and sliding step to obtain multiple sub-reference gene sequences from the reference gene sequence.
- the reference gene sequence can also be split into multiple sub-reference gene sequences according to the base length of the gene sequence to be tested. For example, as shown in FIG. 7, multiple sub-reference gene sequences can be obtained according to the length of the test gene sequence 702 and the reference gene sequence 700. Taking the reference gene sequence of 3 billion bases as an example, if the test gene sequence is 150 bases, then 20 million sub-reference gene sequences can be obtained.
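- Splitting the reference gene sequence with a window and step equal to the read length can be sketched as follows. The function name `split_reference` is hypothetical, and a short toy string replaces the 3-billion-base reference.

```python
def split_reference(reference, read_len):
    """Cut the reference into consecutive, non-overlapping sub-reference sequences.

    The last piece may be shorter when the length is not a multiple of read_len.
    """
    return [reference[i:i + read_len] for i in range(0, len(reference), read_len)]


pieces = split_reference("ACGTACGTAC", 4)

# Count from the example in the description: 3 billion bases, 150-base reads.
sub_reference_count = 3_000_000_000 // 150
```

- The count in the description follows from the division of the reference length by the read length, since window and step are equal and the pieces do not overlap.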
- In step 604, the gene sequence to be tested and the i-th sub-reference gene sequence obtained in step 602 are input into the optical computing chip 106 for optical comparison.
- The initial value of i is 1, and the value of i is not greater than the number of sub-reference gene sequences obtained in step 602.
- In this step, the processor 102 may optically encode the gene sequence to be tested and the i-th sub-reference gene sequence separately, and load the two optical codes into the optical computing chip 106 for optical comparison to obtain the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence. The optical computing chip 106 then sends the comparison result to the processor 102.
- the similarity between the gene sequence to be tested and the first sub-reference gene sequence among the plurality of sub-reference gene sequences may be referred to as the first similarity.
- In step 606, if the processor 102 determines that the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is greater than a third threshold, the method proceeds to step 610.
- In order to find as many reference gene fragments as possible that match at least part of the gene sequence to be tested, the third threshold can be set to a similarity lower than 50%.
- For example, the third threshold can be set to 20%. It can be understood that in practical applications the third threshold may also be the same as the second threshold, which is not limited here.
- In step 610, the processor 102 determines whether the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is greater than a fourth threshold. If so, the method proceeds to step 612.
- The fourth threshold is not greater than the first threshold. The first threshold may be used to indicate a set threshold for complete matching, and the fourth threshold is a threshold used to indicate maximum similarity matching.
- For example, the first threshold can be set to 100%, and the fourth threshold can be set to 95%.
- the fourth threshold may also be the same as the first threshold.
- both the first threshold and the fourth threshold may be set to 95%, which is used to indicate that the maximum similarity matching threshold is reached.
- In step 612, the processor 102 determines that the i-th sub-reference gene sequence is the gene fragment with the greatest similarity to the gene sequence to be tested, records the position of the i-th sub-reference gene sequence in the reference gene sequence, and ends the alignment of the gene sequence to be tested. If the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 614.
- In step 614, the processor 102 obtains a first test sub-gene sequence and a second test sub-gene sequence from the gene sequence to be tested.
- For example, the processor 102 can obtain the first test sub-gene sequence 7022 and the second test sub-gene sequence 7024 from the gene sequence to be tested 702, where part of the bases of the first test sub-gene sequence 7022 and the second test sub-gene sequence 7024 are the same.
- The first test sub-gene sequence 7022 may include bases of a first preset length taken from the head of the gene sequence to be tested 702 toward the tail, and the second test sub-gene sequence 7024 may include bases of the first preset length taken from the tail of the gene sequence to be tested 702 toward the head, so that the first test sub-gene sequence 7022 and the second test sub-gene sequence 7024 share some bases. The method proceeds to step 616.
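- Taking a head piece and a tail piece of the same preset length can be sketched as follows. The function name `split_head_tail` is hypothetical, and the preset length is passed in as a parameter because the description does not fix its value.

```python
def split_head_tail(seq, sub_len):
    """Return the first sub_len bases and the last sub_len bases of seq.

    The two pieces overlap whenever sub_len is more than half of len(seq).
    """
    if not 0 < sub_len <= len(seq):
        raise ValueError("sub_len must be between 1 and len(seq)")
    return seq[:sub_len], seq[len(seq) - sub_len:]


head, tail = split_head_tail("ACGTACGT", 6)
```

- With an 8-base sequence and a preset length of 6, the two pieces share the middle 4 bases, which corresponds to the requirement that part of the bases of the two sub-sequences are the same.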
- In step 616, the optical computing chip 106 optically compares the j-th test sub-gene sequence with the i-th sub-reference gene sequence.
- The initial value of j is 1, and the value of j is not greater than the number of test sub-gene sequences. Since two test sub-gene sequences are obtained from the gene sequence to be tested in the embodiment of the present invention, the value of j is not greater than 2 here. It can be understood that if p (p greater than 2) test sub-gene sequences need to be obtained in practical applications, the value of j is not greater than p.
- In this step, the processor 102 first optically encodes the j-th test sub-gene sequence, and then loads the optical code of the j-th test sub-gene sequence and the optical code of the i-th sub-reference gene sequence onto the optical computing chip 106 for optical comparison, to obtain the similarity between the j-th test sub-gene sequence and the i-th sub-reference gene sequence. The method proceeds to step 618.
- In step 618, if the processor 102 determines that the similarity between the j-th test sub-gene sequence and the i-th sub-reference gene sequence is greater than the third threshold, the method proceeds to step 622 to further determine whether that similarity is greater than the fourth threshold.
- The matching result of the optical computing chip between the first test sub-gene sequence and the first sub-reference gene sequence may be called the second similarity, and the matching result of the optical computing chip between the second test sub-gene sequence and the first sub-reference gene sequence may be called the third similarity.
- In step 622, if the processor 102 determines that the similarity between the j-th test sub-gene sequence and the i-th sub-reference gene sequence is greater than the fourth threshold, the method proceeds to step 624, records the position in the reference gene sequence of the reference gene fragment in the i-th sub-reference gene sequence that matches the j-th test sub-gene sequence, and ends the matching of the gene sequence to be tested.
- If, in step 622, the processor 102 determines that the similarity between the j-th test sub-gene sequence and the i-th sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 626.
- In step 626, the processor 102 obtains a first gene sequence unit to be tested and a second gene sequence unit to be tested from the j-th test sub-gene sequence, where part of the bases of the two units are the same. Specifically, reference may be made to the way the first test sub-gene sequence and the second test sub-gene sequence are obtained from the gene sequence to be tested in step 614.
- The first gene sequence unit to be tested may include bases of a second preset length taken from the head of the j-th test sub-gene sequence toward the tail, and the second gene sequence unit to be tested may include bases of the second preset length taken from the tail of the j-th test sub-gene sequence toward the head.
- In step 628, the optical computing chip 106 optically compares the k-th gene sequence unit to be tested with the i-th sub-reference gene sequence.
- The initial value of k is 1, and the value of k is not greater than the number of gene sequence units to be tested.
- In this step, the processor 102 may optically encode the k-th gene sequence unit to be tested, and load the optical code of the k-th gene sequence unit to be tested and the optical code of the i-th sub-reference gene sequence onto the optical computing chip 106 for optical comparison.
- In step 630, if the processor 102 determines that the similarity between the k-th gene sequence unit to be tested and the i-th sub-reference gene sequence is greater than the third threshold, the method proceeds to step 634 to determine whether that similarity is greater than the fourth threshold. If it is greater than the fourth threshold, the method proceeds to step 636, records the position in the reference gene sequence of the gene fragment in the i-th sub-reference gene sequence that matches the k-th gene sequence unit to be tested, and ends the matching. Specifically, in one case, to improve the matching speed, the matching of the gene sequence to be tested may be ended once the gene fragment with the greatest similarity has been obtained.
- If, in step 634, the processor 102 determines that the similarity between the k-th gene sequence unit to be tested and the i-th sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 638 and continues recursively: the k-th gene sequence unit to be tested is split further, and its sub-units are optically compared with the i-th sub-reference gene sequence until a gene fragment to be tested is found whose similarity with the i-th sub-reference gene sequence is greater than the fourth threshold.
- The reference gene fragment in the sub-reference gene sequence whose similarity with part of the gene sequence to be tested is greater than the fourth threshold may be referred to as the most similar gene fragment.
- The gene comparison method shown in FIG. 6 can be used to further perform maximum similarity matching of the gene sequence to be tested.
- The method shown in Figure 6 allows the gene sequence to be tested to be incompletely consistent with the most similar gene fragment obtained: some bases of the gene sequence to be tested may be missing or may differ from the reference gene fragment. This makes it possible to locate missing or variant genes in the gene sequence to be tested precisely.
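- The recursive splitting of steps 614 to 638 can be sketched in software. In the sketch below the optical comparison is replaced by a hypothetical stand-in, `best_similarity`, that slides the piece along the sub-reference and returns the best fraction of matching bases; the preset lengths are modeled by an assumed `overlap` ratio and a minimum piece length `min_len`, neither of which is specified in the description.

```python
def best_similarity(piece, sub_reference):
    """Stand-in for the optical comparison: best match fraction and its offset."""
    best, best_offset = 0.0, 0
    for offset in range(len(sub_reference) - len(piece) + 1):
        window = sub_reference[offset:offset + len(piece)]
        score = sum(a == b for a, b in zip(piece, window)) / len(piece)
        if score > best:
            best, best_offset = score, offset
    return best, best_offset


def find_most_similar(piece, sub_reference, third=0.2, fourth=0.95,
                      min_len=4, overlap=0.75):
    """Return (piece, offset) for the first piece scoring above `fourth`, else None."""
    score, offset = best_similarity(piece, sub_reference)
    if score <= third:
        return None                       # not similar enough to keep splitting
    if score > fourth:
        return piece, offset              # most similar fragment found
    sub_len = max(min_len, int(len(piece) * overlap))
    if sub_len >= len(piece):
        return None                       # piece is too short to split again
    # Head piece first, then tail piece, each handled recursively.
    for part in (piece[:sub_len], piece[len(piece) - sub_len:]):
        found = find_most_similar(part, sub_reference, third, fourth,
                                  min_len, overlap)
        if found is not None:
            return found
    return None


result = find_most_similar("ACGTACCC", "ACGTACGTTTGA")
```

- In this toy run the full 8-base read matches only 6 of 8 bases, so it is split; its 6-base head piece then matches the start of the sub-reference exactly, and the differing tail bases are thereby localized.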
- the gene comparison method provided by the embodiment of the present invention may also include the method flow shown in FIG. 8.
- the method shown in FIG. 8 may follow step 604 shown in FIG. 6.
- the method may include the following steps.
- In step 802, the processor 102 determines that the first similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is less than a third threshold.
- In step 804, when the processor 102 further determines that the second similarity between the gene sequence to be tested and the (i+1)-th sub-reference gene sequence is greater than the third threshold, the method proceeds to step 806.
- the description of step 802 and step 804 may refer to the description of step 606 in FIG. 6, and the third threshold may be the same as the third threshold set in step 606, for example, it may be 50%.
- In step 806, the processor 102 further determines whether the sum of the first similarity and the second similarity is greater than 100%. If the sum is not greater than 100%, the method proceeds to step 808, in which the optical computing chip 106 optically compares the gene sequence to be tested with the (i+2)-th sub-reference gene sequence. If the sum is greater than 100%, the method proceeds to step 810. In step 810, the processor 102 obtains a new sub-reference gene sequence from the i-th sub-reference gene sequence and the (i+1)-th sub-reference gene sequence.
- a part of the reference gene fragments may be taken from each of the i-th sub-reference gene sequence and the (i+1)-th sub-reference gene sequence, in proportion to the first similarity and the second similarity, to form the new sub-reference gene sequence. For example, if the first similarity is 40%, the second similarity is 80%, and the length of a sub-reference gene sequence is 150 base pairs, then the 50 base pairs at the tail of the i-th sub-reference sequence can be joined with the 100 base pairs at the head of the (i+1)-th sub-reference sequence to form a continuous new sub-reference sequence with a length of 150 base pairs.
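The proportional split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_sub_references`, the use of plain strings for sequences, and the rounding rule are hypothetical and are not taken from the patent.

```python
def merge_sub_references(ref_i: str, ref_next: str,
                         sim_i: float, sim_next: float) -> str:
    """Join the tail of ref_i with the head of ref_next.

    The number of bases taken from each side is proportional to the
    similarity measured against that side (an assumed reading of the
    "ratio of the first similarity to the second similarity").
    """
    if len(ref_i) != len(ref_next):
        raise ValueError("sub-references are assumed to have equal length")
    if sim_i + sim_next <= 0:
        raise ValueError("at least one similarity must be positive")
    length = len(ref_i)
    # Bases taken from the tail of the i-th sub-reference.
    tail_len = round(length * sim_i / (sim_i + sim_next))
    # Remaining bases come from the head of the (i+1)-th sub-reference.
    head_len = length - tail_len
    return ref_i[length - tail_len:] + ref_next[:head_len]
```

With similarities of 40% and 80% and 150-base sub-references, this takes 50 bases from the first sequence and 100 from the second, matching the example in the text.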
- the method then proceeds to step 812, in which the optical computing chip 106 optically compares the gene sequence to be tested with the new sub-reference sequence.
- for the process of comparing the gene sequence to be tested with the new sub-reference sequence, refer to the process of comparing the gene sequence to be tested with the i-th sub-reference sequence in FIG. 6. Accordingly, if the similarity between the gene sequence to be tested and the new sub-reference gene sequence is greater than the third threshold, the method may continue with steps 610 to 638 in FIG. 6.
- a reference gene fragment that is found in the reference gene sequence by the comparison methods shown in FIG. 6 and FIG. 8, and whose similarity with the gene sequence to be tested is greater than the fourth threshold, is called the largest similar gene fragment.
- the method shown in FIG. 8 can be used in combination with the method shown in FIG. 6. For example, when it is determined that the gene sequence to be tested has a low similarity with the i-th sub-reference gene sequence and a high similarity with the (i+1)-th sub-reference gene sequence, the method shown in FIG. 8 can be performed instead: a new sub-reference gene sequence is obtained from the i-th and (i+1)-th sub-reference gene sequences and compared with the gene sequence to be tested. Adjusting the sub-reference gene sequence promptly according to partial comparison results increases the probability and speed of obtaining the largest similar gene fragment and reduces the number of comparisons.
- alternatively, the gene sequence to be tested may first be compared with the multiple sub-reference gene sequences obtained in step 602 according to the method shown in FIG. 6, after which the method shown in FIG. 8 may be executed to adjust the sub-reference gene sequences and compare again.
- the specific execution manner is not limited.
- FIG. 8 is described by taking the gene sequence to be tested and the i-th sub-reference gene sequence as an example.
- the i-th sub-reference gene sequence may be any one of the multiple sub-reference gene sequences.
- take the comparison of the gene sequence to be tested with the second sub-reference gene sequence among the plurality of sub-reference gene sequences as an example.
- in step 802, the processor 102 determines that the similarity between the gene sequence to be tested and the second sub-reference gene sequence is a fourth similarity, and the fourth similarity is less than the third threshold.
- in step 804, the processor 102 determines that the similarity between the gene sequence to be tested and the third sub-reference gene sequence among the plurality of sub-reference gene sequences is a fifth similarity, and the fifth similarity is greater than the third threshold. If, in step 806, the processor further determines that the sum of the fourth similarity and the fifth similarity is greater than 100%, the processor 102 may follow the method shown in FIG. 8 to obtain a new sub-reference gene sequence from the second sub-reference gene sequence and the third sub-reference gene sequence.
- after the largest similar gene fragment of the gene sequence to be tested is found by the methods of FIG. 6 and FIG. 8, the largest similar gene fragment can be extended within the gene sequence to be tested and the reference sequence using the Smith-Waterman local alignment algorithm, so as to obtain a longer largest similar gene fragment and thereby facilitate further genetic analysis of the gene fragment to be tested.
- the method shown in the above embodiment is described by taking the comparison of the sequence of the gene to be tested and the sequence of one sub-reference gene among the plurality of sub-reference gene sequences as an example. In actual applications, it can be compared with multiple sub-reference gene sequences respectively, which is not limited here.
- ordinal numbers such as “first” and “second” are used to distinguish multiple objects, and are not used to limit the order, timing, priority, or importance of multiple objects.
- FIG. 9 is a schematic diagram of a comparison device provided by an embodiment of the present invention.
- the comparison device can be used to implement various data comparison scenarios including gene comparison.
- the comparison device 900 may include a processor 902, a memory 904, and an optical computing chip 906.
- the processor 902 is configured to obtain a first group of reference objects from a database stored in the memory 904 according to the first object to be matched, where the first group of reference objects includes multiple reference objects that share some features with the first object.
- the optical computing chip 906 is connected to the processor and is used to optically compare the first object with the multiple reference objects.
- the processor 902 may be further configured to determine the similarity between the first object and the multiple reference objects according to the output result of the optical computing chip.
- the processor 902 may be further configured to determine, according to the output result of the optical computing chip, that the similarity between the first object and the first reference object in the first group of reference objects is less than a first threshold and greater than a second threshold, and to obtain a plurality of sub-reference objects according to a standard object, wherein each sub-reference object is a part of the reference object.
- the optical computing chip 906 may also be used to perform an optical comparison between the first object and a first sub-reference object among the plurality of sub-reference objects, to obtain a first similarity between the first object and the first sub-reference object.
- the processor 902 may be further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold and, in response to this determination, obtain a first sub-object and a second sub-object according to the first object.
- the optical computing chip 906 may also be used to perform an optical comparison between the first sub-object and the first sub-reference object to obtain a second similarity, and to optically compare the second sub-object with the first sub-reference object to obtain a third similarity.
- the processor 902 may also be configured to record the position of the first sub-reference object in the standard object when the second similarity is greater than the fourth threshold.
- the comparison device shown in FIG. 9 can be used to realize the function of the comparison device shown in FIG. 1; for details, refer to the description of the comparison device in FIG. 1.
- the comparison device shown in FIG. 9 can be applied to various scenarios that require data comparison or feature comparison, including gene comparison. It can be said that the gene comparison device shown in FIG. 1 is a specific application of the comparison device shown in FIG. 9.
- the comparison device shown in FIG. 9 and the comparison method provided by the embodiments of the present invention can also be used for image comparison, image search, sequence comparison, fuzzy matching, and other scenarios, which is not limited here.
- Fig. 10 is a schematic diagram of another comparison device provided by an embodiment of the present invention.
- the comparison apparatus 1000 may include an acquisition module 1002, a comparison module 1004, and a result processing module 1006.
- the acquisition module 1002 is used to obtain a first group of gene fragments from a database according to the gene sequence to be tested, wherein the database contains a plurality of reference gene fragments of the reference gene sequence, and the first group of gene fragments includes multiple reference gene fragments that match some of the bases of the gene sequence to be tested.
- the comparison module 1004 is used for optically comparing the test gene sequence with multiple reference gene fragments in the first set of gene fragments.
- the result processing module 1006 is configured to determine the similarity between the gene sequence to be tested and the plurality of reference gene fragments in the first group of gene fragments according to the output result of the comparison module 1004.
- the comparison apparatus 1000 may further include a judgment module 1008.
- the judgment module 1008 is configured to determine, according to the output result of the comparison module 1004, that the similarity between the test gene sequence and the first gene segment in the first group of gene segments is less than a first threshold and greater than a second threshold.
- the acquisition module 1002 is further configured to obtain a plurality of sub-reference gene sequences from the reference gene sequence when the judgment module 1008 determines that the similarity between the gene sequence to be tested and the first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold, wherein each sub-reference gene sequence is a part of the reference gene sequence.
- the comparison module 1004 is also used to perform optical comparison between the gene sequence to be tested and the first sub-reference gene sequence among the plurality of sub-reference gene sequences.
- the result processing module 1006 is further configured to obtain the first similarity between the test gene sequence and the first sub-reference gene sequence according to the output result of the optical computing chip.
- the judgment module 1008 is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold.
- the acquisition module 1002 is further configured to obtain a first test sub-gene sequence and a second test sub-gene sequence according to the gene sequence to be tested in response to the judgment of the judgment module 1008, wherein some bases of the first test sub-gene sequence and the second test sub-gene sequence are the same.
- the comparison module 1004 is also used to optically compare the first test sub-gene sequence with the first sub-reference gene sequence to obtain a second similarity, and to optically compare the second test sub-gene sequence with the first sub-reference gene sequence to obtain a third similarity.
- the result processing module 1006 is further configured to record the position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold.
- the acquisition module 1002 is further configured to obtain a first test sub-gene sequence unit and a second test sub-gene sequence unit according to the second test sub-gene sequence when the judgment module 1008 determines that the third similarity is greater than the third threshold and less than the fourth threshold, wherein some bases of the first test sub-gene sequence unit and the second test sub-gene sequence unit are the same.
- the comparison module 1004 is also used to optically compare the first test sub-gene sequence unit with the first sub-reference gene sequence, and to optically compare the second test sub-gene sequence unit with the first sub-reference gene sequence.
- the comparison module 1004 is also used to optically compare the gene sequence to be tested with the second sub-reference gene sequence among the plurality of sub-reference gene sequences to obtain a fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence, and to optically compare the gene sequence to be tested with the third sub-reference gene sequence among the plurality of sub-reference gene sequences to obtain a fifth similarity between the gene sequence to be tested and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is a sub-reference gene sequence contiguous with the second sub-reference gene sequence.
- the acquisition module 1002 is further configured to obtain a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence when it is determined that the sum of the fourth similarity and the fifth similarity is greater than the first threshold, wherein the fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence.
- the comparison module 1004 is also used to input the gene sequence to be tested and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.
- the result processing module 1006 is further configured to determine, according to the output result of the optical computing chip, that the second gene fragment in the first group of gene fragments matches the gene sequence to be tested, and to record the position of the second gene fragment in the reference gene sequence.
- the comparison device shown in FIG. 10 can be used to realize the function of the gene comparison device shown in FIG. 1.
- the device embodiments described above are merely illustrative, for example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation. For example, multiple modules or components can be combined or integrated into another system, or some features can be omitted or not implemented.
- the connections between the modules discussed in the above embodiments may be electrical, mechanical or other forms.
- the modules described as separate components may or may not be physically separated.
- the components displayed as modules may be physical modules or may not be physical modules.
- the functional modules in the various embodiments of the application embodiments may exist independently, or may be integrated into one processing module.
- the embodiment of the present invention also provides a computer program product for realizing gene comparison, including a computer-readable storage medium storing program code.
- the program code includes instructions for executing the method flow described in any of the foregoing method embodiments.
- the aforementioned storage media include various non-transitory machine-readable media that can store program code, such as a USB flash drive, removable hard disk, magnetic disk, optical disc, random-access memory (RAM), solid state disk (SSD), or non-volatile memory.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
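A gene comparison technology. The gene comparison technology can be applied to a computer system that includes an optical computing chip. In the process of performing gene comparison, a first group of gene fragments may first be obtained from a gene database according to the gene sequence to be tested, where the first group of gene fragments includes multiple reference gene fragments that match some of the bases of the gene sequence to be tested. After the first group of gene fragments is obtained, the gene sequence to be tested and the multiple reference gene fragments in the first group of gene fragments may be input into the optical computing chip for optical comparison. This technology can greatly increase the speed of gene comparison and reduce the number of comparisons.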
Description
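This application relates to the field of optical technology, and in particular to a gene comparison technology.
Deoxyribonucleic acid (DNA) is the main chemical component of chromosomes and is also the material of which genes are composed. A gene is a DNA sequence that carries genetic information, also called a genetic factor; it is the basic structural and functional unit of the genetic material that controls biological traits. Genes express the genetic information they carry by directing protein synthesis, thereby controlling the traits exhibited by an individual organism. From the advent of DNA sequencing technology through the completion of the Human Genome Project (HGP), the volume of DNA sequence data has grown exponentially. DNA sequence comparison is a prerequisite for gene identification, information analysis, structure prediction, and similar problems: by comparing multiple DNA sequences and finding the sites and regions where they are the same or different, it helps determine the homology, variation points, and origin of the gene to be tested.
With the rapid development of next-generation DNA sequencing technology, DNA sequencing data is accumulating explosively, far faster than it can be processed. Faced with these big-data analysis tasks in bioinformatics and the integration of data across many different dimensions, a fast and convenient DNA comparison method is urgently needed.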
Summary of the Invention
This application provides a gene comparison technology that can improve the efficiency of DNA comparison.
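In a first aspect, an embodiment of the present invention provides a gene comparison method applied to a computer system that includes an optical computing chip. According to the method, in the process of gene comparison, a processor of the computer system may obtain a first group of gene fragments from a gene database according to the gene sequence to be tested, and input the gene sequence to be tested and multiple reference gene fragments in the first group of gene fragments into the optical computing chip for optical comparison. The gene database contains multiple reference gene fragments of a reference gene sequence, and the first group of gene fragments includes multiple reference gene fragments that match some of the bases of the gene sequence to be tested.
The gene comparison method provided by the embodiment of the present invention combines database lookup with optical autocorrelation comparison. An initial match of the gene sequence to be tested against the constructed gene database screens out a first group of reference gene fragments that may match it. Screening the candidate fragments through the gene database greatly reduces the number of reference gene fragments that require detailed comparison. In addition, after the first group of reference gene fragments is obtained, the gene sequence to be tested and the multiple reference gene fragments in the first group are optically compared by the optical computing chip; because the comparison is performed optically, it is faster than gene comparison performed electronically. The gene comparison method provided by the embodiment of the present invention therefore also greatly improves comparison efficiency.
In practice, the processor may obtain the first group of gene fragments from the database according to some of the bases of the gene sequence to be tested, for example according to its first m bases and last n bases, where m and n are both greater than 0 and the sum of m and n is less than the number of bases in the gene sequence to be tested. In general, the values of m and n may be determined according to factors such as the length of the gene sequence to be tested and the length of the reference gene sequence.
In a possible implementation, the database may be a key-value database, where the keys are some of the bases of the multiple reference gene fragments of the reference gene sequence, and the values are the positions of those reference gene fragments in the reference gene sequence.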
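In a possible implementation, the method further includes: when it is determined, according to the output of the optical computing chip, that the similarity between the gene sequence to be tested and a first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold, obtaining multiple sub-reference gene sequences from the reference gene sequence, and inputting the gene sequence to be tested and a first sub-reference gene sequence among the multiple sub-reference gene sequences into the optical computing chip for optical comparison, to obtain a first similarity between the gene sequence to be tested and the first sub-reference gene sequence, where each sub-reference gene sequence is a part of the reference gene sequence.
In the embodiment of the present invention, when the similarity between the gene sequence to be tested and at least one gene fragment in the first group of gene fragments is less than the first threshold and greater than the second threshold, it is quite likely that a matching reference gene fragment can be found in the reference gene sequence, and further comparison is needed. The gene sequence to be tested may therefore be further optically compared with the multiple sub-reference gene sequences of the reference gene sequence, so that a reference gene fragment matching at least part of the gene sequence to be tested can be found quickly.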
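In yet another possible implementation, the method may further include determining that the first similarity is greater than a third threshold and less than a fourth threshold and, in response to this determination, obtaining a first test sub-gene sequence and a second test sub-gene sequence according to the gene sequence to be tested, where the fourth threshold is not greater than the first threshold and some bases of the first test sub-gene sequence and the second test sub-gene sequence are the same. The first test sub-gene sequence and the first sub-reference gene sequence are then input into the optical computing chip for optical comparison to obtain a second similarity, and the second test sub-gene sequence and the first sub-reference gene sequence are input into the optical computing chip for optical comparison to obtain a third similarity. In this way, when the similarity between the gene sequence to be tested and the first sub-reference gene sequence satisfies the preset condition, the gene sequence to be tested can be split further, and the resulting first and second test sub-gene sequences can each be compared with the first sub-reference gene sequence, so that the part of the gene sequence to be tested that matches the first sub-reference gene sequence can be located as quickly as possible. Moreover, because this maximum-similarity matching method tolerates missing bases, missing or variant parts of the gene sequence to be tested can be located precisely. In practice, the first test sub-gene sequence may include bases of a first preset length taken from the head of the gene sequence to be tested toward its tail, the second test sub-gene sequence may include bases of the first preset length taken from the tail of the gene sequence to be tested toward its head, and some bases of the first and second test sub-gene sequences overlap.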
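A minimal sketch of the head/tail split described above, assuming sequences are plain strings. The function name `split_read` and the requirement that the preset length exceed half the read length (so that the two pieces overlap) are illustrative assumptions; the patent only states that the two sub-sequences share some bases.

```python
def split_read(read: str, sub_length: int) -> tuple[str, str]:
    """Return (head piece, tail piece) of equal preset length.

    The head piece is taken from the start toward the end, the tail
    piece from the end toward the start. Choosing sub_length greater
    than half the read length makes the two pieces overlap.
    """
    if not (len(read) / 2 < sub_length < len(read)):
        raise ValueError("sub_length must be between len(read)/2 and len(read)")
    first_sub = read[:sub_length]      # from the head toward the tail
    second_sub = read[-sub_length:]    # from the tail toward the head
    return first_sub, second_sub
```

For a 10-base read and a preset length of 6, the two pieces share the middle 2 bases.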
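In yet another possible implementation, the method further includes recording the position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold. In this way, when the second similarity between the first test sub-gene sequence and the first sub-reference gene sequence is greater than the fourth threshold, it can be confirmed that the first test sub-gene sequence is a maximum-similarity match of the first sub-reference gene sequence. The position of the first sub-reference gene sequence in the reference gene sequence can then be recorded, yielding the maximum-similarity matching fragment of the first test sub-gene sequence.
In yet another possible implementation, the method further includes: when the third similarity is greater than the third threshold and less than the fourth threshold, obtaining a first test sub-gene sequence unit and a second test sub-gene sequence unit according to the second test sub-gene sequence, inputting the first test sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison, and inputting the second test sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison, where some bases of the first test sub-gene sequence unit and the second test sub-gene sequence unit are the same. In this way, if the match between the second test sub-gene sequence and the first sub-reference gene sequence still does not reach the maximum-similarity matching standard, the second test sub-gene sequence can be split and compared again; with this recursive search, the maximum-similarity matching fragment of at least part of the second test sub-gene sequence can be located quickly. Because this maximum-similarity matching method tolerates missing bases, gene deletions and gene variation points can be located precisely.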
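In yet another possible implementation, the method further includes: inputting the gene sequence to be tested and a second sub-reference gene sequence among the multiple sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence, and inputting the gene sequence to be tested and a third sub-reference gene sequence among the multiple sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fifth similarity between the gene sequence to be tested and the third sub-reference gene sequence, where the third sub-reference gene sequence is a sub-reference gene sequence contiguous with the second sub-reference gene sequence. When it is determined that the sum of the fourth similarity and the fifth similarity is greater than the first threshold, a fourth sub-reference gene sequence is obtained according to the second sub-reference gene sequence and the third sub-reference gene sequence, and the gene sequence to be tested and the fourth sub-reference gene sequence are input into the optical computing chip for optical comparison. The fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence.
In this way, when the similarity between the gene sequence to be tested and the second sub-reference gene sequence does not satisfy the condition for further matching against the second sub-reference gene sequence, but the sum of the similarities between the gene sequence to be tested and the second and third sub-reference gene sequences is greater than the first threshold, the position of the sub-reference gene sequence can be adjusted promptly: contiguous parts of the second and third sub-reference gene sequences are taken to form the fourth sub-reference gene sequence. The maximum-similarity matching fragment of the gene sequence to be tested can then be found in the fourth sub-reference gene sequence as quickly as possible, without continuing to compare the gene sequence to be tested with the sub-reference gene sequences that follow the third one. Adjusting the sub-reference gene sequence promptly according to partial comparison results increases the probability and speed of obtaining the largest similar gene fragment and reduces the number of comparisons.
It can be understood that, in practice, a part of the reference gene fragments may be taken from each of the second and third sub-reference gene sequences, in proportion to the fourth similarity and the fifth similarity, to form the fourth sub-reference gene sequence.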
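In yet another possible implementation, the method further includes determining, according to the output of the optical computing chip, that a second gene fragment in the first group of gene fragments matches the gene sequence to be tested, and recording the position of the second gene fragment in the reference gene sequence.
In yet another possible implementation, inputting the gene sequence to be tested and the multiple reference gene fragments in the first group of gene fragments into the optical computing chip for optical comparison includes: optically encoding the gene sequence to be tested and each of the multiple reference gene fragments in the first group of gene fragments, and inputting the optical encoding of the gene sequence to be tested and the optical encodings of the multiple gene fragments into the optical computing chip for optical comparison. In practice, the gene sequence to be tested and the multiple reference gene fragments may be optically encoded according to light intensity information and/or light spatial information.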
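In a second aspect, an embodiment of the present invention provides a gene comparison device including a processor and an optical computing chip. The processor is configured to obtain a first group of gene fragments from a database according to the gene sequence to be tested, where the database contains multiple reference gene fragments of a reference gene sequence, and the first group of gene fragments includes multiple reference gene fragments that match some of the bases of the gene sequence to be tested. The optical computing chip is connected to the processor and is configured to optically compare the gene sequence to be tested with the multiple reference gene fragments in the first group of gene fragments.
In a possible implementation, the processor may obtain the first group of gene fragments from the database according to some of the bases of the gene sequence to be tested, for example according to its first m bases and last n bases, where m and n are both greater than 0 and the sum of m and n is less than the number of bases in the gene sequence to be tested. Specifically, the database may be a key-value database, where the keys are some of the bases of the multiple reference gene fragments of the reference gene sequence, and the values are the positions of those reference gene fragments in the reference gene sequence.
In a possible implementation, the processor is further configured to determine, according to the output of the optical computing chip, that the similarity between the gene sequence to be tested and a first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold, and to obtain multiple sub-reference gene sequences from the reference gene sequence, where each sub-reference gene sequence is a part of the reference gene sequence. The optical computing chip is further configured to optically compare the gene sequence to be tested with a first sub-reference gene sequence among the multiple sub-reference gene sequences, to obtain a first similarity between the gene sequence to be tested and the first sub-reference gene sequence.
In yet another possible implementation, the processor is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, where the fourth threshold is not greater than the first threshold, and, in response to this determination, to obtain a first test sub-gene sequence and a second test sub-gene sequence according to the gene sequence to be tested, where some bases of the first and second test sub-gene sequences are the same. The optical computing chip is further configured to optically compare the first test sub-gene sequence with the first sub-reference gene sequence to obtain a second similarity, and to optically compare the second test sub-gene sequence with the first sub-reference gene sequence to obtain a third similarity.
In yet another possible implementation, the processor is further configured to record the position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold.
In yet another possible implementation, the processor is further configured to, when the third similarity is greater than the third threshold and less than the fourth threshold, obtain a first test sub-gene sequence unit and a second test sub-gene sequence unit according to the second test sub-gene sequence, where some bases of the first and second test sub-gene sequence units are the same. The optical computing chip is further configured to optically compare the first test sub-gene sequence unit with the first sub-reference gene sequence, and to optically compare the second test sub-gene sequence unit with the first sub-reference gene sequence.
In yet another possible implementation, the optical computing chip is further configured to optically compare the gene sequence to be tested with a second sub-reference gene sequence among the multiple sub-reference gene sequences, and to optically compare the gene sequence to be tested with a third sub-reference gene sequence among the multiple sub-reference gene sequences, where the third sub-reference gene sequence is a sub-reference gene sequence contiguous with the second sub-reference gene sequence. The processor is further configured to: determine that the sum of the fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence and the fifth similarity between the gene sequence to be tested and the third sub-reference gene sequence is greater than the first threshold; obtain a fourth sub-reference gene sequence according to the second and third sub-reference gene sequences; and input the gene sequence to be tested and the fourth sub-reference gene sequence into the optical computing chip for optical comparison. The fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence.
In yet another possible implementation, the processor is further configured to determine, according to the output of the optical computing chip, that a second gene fragment in the first group of gene fragments matches the gene sequence to be tested, and to record the position of the second gene fragment in the reference gene sequence.
In yet another possible implementation, the processor is further configured to optically encode the gene sequence to be tested and each of the multiple reference gene fragments in the first group of gene fragments, and to input the optical encoding of the gene sequence to be tested and the optical encodings of the multiple gene fragments into the optical computing chip for optical comparison.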
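In a third aspect, an embodiment of the present invention provides a comparison device including a processor and an optical computing chip. The processor is configured to obtain a first group of reference objects from a database according to a first object to be matched, where the first group of reference objects includes multiple reference objects that share some features with the first object. The optical computing chip is connected to the processor and is configured to optically compare the first object with the multiple reference objects.
The comparison device provided by the embodiment of the present invention combines database lookup with optical comparison. Screening the candidate reference objects through the database greatly reduces the number of reference objects that require detailed comparison, and performing the comparison on an optical computing chip greatly increases comparison speed. The comparison device can be used not only for gene testing but also in any scenario that requires comparison of massive amounts of data.
In a possible implementation, the processor is further configured to determine, according to the output of the optical computing chip, that the similarity between the first object and a first reference object in the first group of reference objects is less than a first threshold and greater than a second threshold, and to obtain multiple sub-reference objects according to a standard object, where each sub-reference object is a part of the reference object. The optical computing chip is further configured to optically compare the first object with a first sub-reference object among the multiple sub-reference objects, to obtain a first similarity between the first object and the first sub-reference object.
In yet another possible implementation, the processor is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold and, in response to this determination, to obtain a first sub-object and a second sub-object according to the first object, where the fourth threshold is not greater than the first threshold and some data of the first sub-object and the second sub-object are the same. The optical computing chip is further configured to optically compare the first sub-object with the first sub-reference object to obtain a second similarity, and to optically compare the second sub-object with the first sub-reference object to obtain a third similarity.
In yet another possible implementation, the processor is further configured to record the position of the first sub-reference object in the standard object when the second similarity is greater than the fourth threshold.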
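In a fourth aspect, this application further provides a comparison device including functional modules, such as an acquisition module, a comparison module, a result processing module, and a judgment module, for implementing the first aspect or any possible implementation of the first aspect.
In a fifth aspect, this application further provides a computer program product including program code, where instructions included in the program code are executed by a computer to implement the gene comparison method described in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, this application further provides a computer-readable storage medium for storing program code, where instructions included in the program code are executed by a computer to implement the gene comparison method described in the first aspect or any possible implementation of the first aspect.
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention.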
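FIG. 1 is a schematic structural diagram of a gene comparison device provided by an embodiment of the present invention;
FIG. 2A is a schematic diagram of a gene database provided by an embodiment of the present invention;
FIG. 2B is a schematic diagram of optical encoding provided by an embodiment of the present invention;
FIG. 3A is a schematic structural diagram of an optical computing chip provided by an embodiment of the present invention;
FIG. 3B is a schematic structural diagram of another optical computing chip provided by an embodiment of the present invention;
FIG. 3C is a schematic diagram of the principle of optical comparison provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a gene comparison method provided by an embodiment of the present invention;
FIG. 5A, FIG. 5B, FIG. 5C, and FIG. 5D are examples of optical encoding provided by an embodiment of the present invention;
FIG. 6 is a flowchart of another gene comparison method provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of sub-reference gene sequences and test sub-gene sequences provided by an embodiment of the present invention;
FIG. 8 is a flowchart of another gene comparison method provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a comparison device provided by an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of another comparison device provided by an embodiment of the present invention.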
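To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention.
As noted above, the rapid development of deoxyribonucleic acid (DNA) sequencing technology has led to explosive growth in DNA sequencing data, so increasing the speed of DNA comparison is a technical problem in urgent need of a solution. In the prior art, lookups are usually accelerated by building an index over the reference gene sequence in a computer system. An index essentially improves lookup efficiency by optimizing the data structure. However, index optimization has its own bottlenecks, and building many complex indexes at the same time consumes a great deal of time, so the efficiency of this kind of gene comparison cannot keep up with the rapid growth of DNA sequencing data. The gene comparison solution provided by the embodiment of the present invention can greatly increase the speed of gene comparison and performs gene comparison quickly even on massive amounts of gene sequencing data.
For a clearer understanding of this solution, several technical terms used in the embodiments of the present invention are described first.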
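Gene: the genetic information that controls biological traits, usually carried by a DNA sequence. A gene can also be regarded as the basic unit of heredity, that is, a functional segment of DNA or ribonucleic acid (RNA) sequence. The process of determining the sequence itself is called gene sequencing.
Gene sequence to be tested: also called a read; a short sequenced fragment produced by a high-throughput sequencing platform. Sequencing an entire genome produces millions of reads, which are then assembled to obtain the full sequence of the genome.
Reference gene sequence (also called a reference sequence): a standard sequence that has been verified and curated. A reference gene sequence can serve as a basis for functional annotation of the human genome and provides a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery.
Base pair: a chemical structure that forms the monomers of DNA and RNA and encodes genetic information. The bases that make up base pairs include adenine (A), guanine (G), thymine (T), cytosine (C), and uracil (U). Strictly speaking, a base pair is a pair of mutually matching bases (that is, A-T, G-C, or A-U interactions) joined by hydrogen bonds. It is often used as a measure of the length of DNA and RNA (even though RNA is single-stranded).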
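The embodiments of the present invention are described in detail below. FIG. 1 is a schematic diagram of gene comparison implemented using an optical system according to an embodiment of the present invention. As shown in the figure, the gene comparison device 100 may include a processor 102, a memory 104, and an optical computing chip 106. The processor 102 and the memory 104 may be regarded as part of a host 101. The optical computing chip 106 may be connected to the host 101 through a host interface. The host interface may include a standard host interface and a network interface; for example, it may include a Peripheral Component Interconnect Express (PCIE) interface. Data may be sent to the optical computing chip 106 through the host interface, and data processed by the optical computing chip 106 may likewise be sent to the processor 102 through the host interface. The processor 102 may also monitor the working state of the optical computing chip 106 through the host interface. In practice, the processor 102 and the memory 104 need not be part of a host; the processor 102, the memory 104, and the optical computing chip 106 may instead be parts of a system on a chip (SOC).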
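The processor 102 is the computing core and control unit of the gene comparison device 100. The processor 102 may include multiple processor cores and may be a very-large-scale integrated circuit. An operating system and other software programs are installed in the processor 102, so that the processor 102 can access the main memory 1042, caches, disks, and peripheral devices (such as the optical computing chip 106 in FIG. 1). It can be understood that, in the embodiment of the present invention, a core in the processor 102 may be, for example, a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), or another application-specific integrated circuit (ASIC).
The memory 104 is used to store data. The memory 104 may include the main memory 1042, disks, and other storage devices. The main memory 1042 is the main memory of the host 101 and may be connected to the processor 102 through a double data rate (DDR) bus. The main memory 1042 is typically used to hold the software currently running in the operating system, input and output data, and information exchanged with external storage. To improve the access speed of the processor 102, the main memory 1042 needs to offer fast access. In practice, dynamic random access memory (DRAM) is usually used as the main memory 1042. The processor 102 can access the main memory 1042 at high speed through a memory controller (not shown in FIG. 1) and perform read and write operations on any storage unit in it.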
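In the embodiment of the present invention, the memory 104 may be used to store a gene database 1044. The gene database 1044 may be a key-value database built from the reference sequence. A key may be derived from some of the bases of a gene fragment. A value may include the position in memory of the reference gene fragment corresponding to the key, and may also include the position of that reference gene fragment in the reference gene sequence.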
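In the embodiment of the present invention, some of the bases of the reference gene sequence may be used as the key; for example, the first m bases and the last n bases of a reference gene fragment of a preset length may be used as the key. m and n may or may not be equal, which is not limited here. By traversing the reference gene sequence, all reference gene fragments that satisfy a given key are located, and the position information of all those fragments is recorded as the value corresponding to that key. FIG. 2A is a schematic diagram of a gene database provided by an embodiment of the present invention. As shown in FIG. 2A, the gene database 1044 may include keys 1044_1 and values 1044_2. The keys 1044_1 are illustrated with 10 bases: the 5 bases at the head and the 5 bases at the tail of a reference fragment are taken as the key. In the embodiment of the present invention, the construction of the gene database 1044 is described using a reference gene fragment length of 150 bases as an example. Specifically, an empty index table (containing only keys) is first constructed; it has $4^{5+5}$ rows, and the keys are ordered alphabetically from AAAAAAAAAA to TTTTTTTTTT. The mapping is shown in FIG. 2B. The head bases are the high-order positions and the tail bases are the low-order positions. The base at each position advances in the order A, C, G, T, and when a position passes T it carries into the next higher position. When the tail bases are all TTTTT, a carry is made into the head bases. This yields bases in the following order: AAAAAAAAAA, AAAAAAAAAC, AAAAAAAAAG, AAAAAAAAAT, AAAAAAAACA, AAAAAAAACC, AAAAAAAACG, AAAAAAAACT, and so on. The keys 1044_1 shown in FIG. 2A are obtained in this way.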
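After the key-value index table is built, a window of the preset base length is slid along the reference gene sequence with a step of one base, yielding multiple reference gene fragments. For each reference gene fragment obtained, its key is derived from the 5 bases at its head and the 5 bases at its tail, and the position of the fragment in the reference gene sequence is recorded in the value 1044_2 corresponding to that key. For example, the position of the first base of the reference gene fragment may be recorded. The window is slid in this way to the end of the reference gene sequence, so that the values (that is, the position information) of all reference gene fragments of the reference gene sequence are obtained. The gene database 1044 shown in FIG. 2A can thus be built.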
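The sliding-window construction described above can be sketched in a few lines. This is a simplified illustration, not the patented implementation: it uses an in-memory dictionary keyed by the head and tail bases as a string, records 0-based start positions, and only creates entries for keys that actually occur. The names `fragment_key`, `build_gene_database`, and `candidate_positions` are hypothetical.

```python
from collections import defaultdict

def fragment_key(fragment: str, head: int = 5, tail: int = 5) -> str:
    """Key = first `head` bases followed by last `tail` bases."""
    return fragment[:head] + fragment[-tail:]

def build_gene_database(reference: str, window: int,
                        head: int = 5, tail: int = 5) -> dict[str, list[int]]:
    """Slide a window one base at a time and index every fragment."""
    database: dict[str, list[int]] = defaultdict(list)
    for start in range(len(reference) - window + 1):
        fragment = reference[start:start + window]
        # Value: position of the fragment's first base in the reference.
        database[fragment_key(fragment, head, tail)].append(start)
    return dict(database)

def candidate_positions(database: dict[str, list[int]], read: str,
                        head: int = 5, tail: int = 5) -> list[int]:
    """Look up the positions of reference fragments sharing the read's key."""
    return database.get(fragment_key(read, head, tail), [])
```

A read whose head and tail bases are unaffected by sequencing errors or mutations will find its true position among the candidates returned by `candidate_positions`; the candidates are then passed on for detailed comparison.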
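In practice, the way keys are mapped depends on how the bases are arranged and combined. Let $\mathrm{Seq}_1$ and $\mathrm{Seq}_2$ denote the sequence fragments formed by the first n bases and the last m bases, respectively. The key mapping is defined as:

$$\mathrm{Key} = \left(\sum_{i=0}^{n-1}\mathrm{Seq}_1[i]\times 4^{i}\right)\times 4^{m} + \sum_{j=0}^{m-1}\mathrm{Seq}_2[j]\times 4^{j}$$

For example, given a DNA sequence GTGGA……..CGAGC, with A, C, G, and T assigned the values 0, 1, 2, and 3 respectively, the key corresponding to this sequence is:

$$\begin{aligned}
\mathrm{Key}_{\mathrm{GTG\ldots AGC}} &= \left(\mathrm{Seq}_1[4]\times 4^{4}+\mathrm{Seq}_1[3]\times 4^{3}+\mathrm{Seq}_1[2]\times 4^{2}+\mathrm{Seq}_1[1]\times 4^{1}+\mathrm{Seq}_1[0]\times 4^{0}\right)\times 4^{5}\\
&\quad+\mathrm{Seq}_2[4]\times 4^{4}+\mathrm{Seq}_2[3]\times 4^{3}+\mathrm{Seq}_2[2]\times 4^{2}+\mathrm{Seq}_2[1]\times 4^{1}+\mathrm{Seq}_2[0]\times 4^{0}\\
&= 728\times 4^{5}+393 = 745865
\end{aligned}$$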
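The base-4 key mapping in this example can be checked with a short sketch. It assumes that the first base of each fragment is the most significant digit, which is the convention under which the tail CGAGC evaluates to 393 as stated. The names `BASE_VALUE`, `base4_value`, and `key_value` are hypothetical. Under the same convention the head GTGGA evaluates to 744, while the stated head value 728 corresponds to GTCGA, so either the head sequence or the value 728 in the published example appears to contain a typo.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def base4_value(bases: str) -> int:
    """Read a base string as a base-4 number, first base most significant."""
    value = 0
    for base in bases:
        value = value * 4 + BASE_VALUE[base]
    return value

def key_value(head: str, tail: str) -> int:
    """Head bases form the high-order digits, tail bases the low-order digits."""
    return base4_value(head) * 4 ** len(tail) + base4_value(tail)
```

Because every key is an integer in the range 0 to 4^(n+m) - 1, it can serve directly as a row index into the pre-built index table.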
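It can be understood that the choice of n and m directly affects the efficiency of the algorithm: increasing n and m reduces the number of values (that is, positions) stored under each key. Ignoring hardware factors, each additional base speeds up addressing for each gene sequence to be tested by a factor of 4 on average. However, sequencing errors and gene mutations mean that n and m cannot be increased without limit; increasing them may reduce the reliability of the key. The values of m and n may therefore be chosen as needed, and the length of the reference gene fragments may also be set according to actual needs. In general, m and n may be determined according to factors such as the length of the gene sequence to be tested and the length of the reference gene sequence. In practice, the length of a reference gene fragment is usually the same as the base length of the gene sequence to be tested.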
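The optical computing chip 106 may be an on-chip optical computing system. FIG. 3A is a schematic structural diagram of an optical computing chip provided by an embodiment of the present invention. As shown in FIG. 3A, the optical computing chip 106 may include a light source array 202, a modulator array 204, a detector array 206, a first concave mirror 208, and a second concave mirror 210. The light source array 202 is located in the object-side focal plane of the first concave mirror 208. The modulator array 204 is located in the image-side focal plane of the first concave mirror 208 and also in the object-side focal plane of the second concave mirror 210. The detector array 206 is located in the image-side focal plane of the second concave mirror 210.
The light source array 202 modulates and transmits data and serves as the data input unit of the optical computing chip 106. It can generate multiple optical signals of different intensities according to the input data. The first concave mirror 208 performs a standard Fourier transform on the data optical signal sent by the light source array 202. The modulator array 204 has two working modes: a recording mode and a modulation mode. The recording mode is used to capture the spectral-plane image of the data optical signal sent by the light source array 202 after it passes the first concave mirror 208. The modulation mode is used to modulate the spectral-plane image of the data optical signal sent by the light source array 202 onto the modulator array 204. The second concave mirror 210 performs a standard inverse Fourier transform on the optical signal that has passed the modulator array 204. The detector array 206 detects light intensity signals and serves as the result output unit of the optical computing chip 106.
FIG. 3B is a schematic structural diagram of another optical computing chip provided by an embodiment of the present invention. Unlike the on-chip integrated optical computing chip of FIG. 3A, the chip shown in FIG. 3B places the light source array 202 and the detector array 206 on the same side of the chip, which makes the structure of the whole computing chip more compact and reduces the chip size. As shown in FIG. 3B, compared with the optical computing chip of FIG. 3A, the positions of the first concave mirror 208, the second concave mirror 210, and the modulator array 204 are unchanged, and the focal positions of the light source array 202, the modulator array 204, and the detector array 206 relative to the first concave mirror 208 and the second concave mirror 210 are also unchanged. For the implementation of each component shown in FIG. 3B, refer to the description of the corresponding component of the optical computing chip shown in FIG. 3A; details are not repeated here.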
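FIG. 3A and FIG. 3B are merely schematic structural diagrams of the optical computing chip provided by the embodiment of the present invention. In practice, the specific structure of the optical computing chip 106 is not limited, and optical computing chips of other structures may be used. For example, the optical computing chip 106 may be a chip of another structure based on the principle of a 4F optical computing system. FIG. 3C is a schematic diagram of the principle of a 4F optical computing system. As shown in FIG. 3C, a first modulator 302 is located at the object-side focal point of a first convex lens 304. A second modulator 306 is located at the image-side focal point of the first convex lens 304 and at the object-side focal point of a second convex lens 308. The distance between the first convex lens 304 and the second convex lens 308 is the sum of the focal lengths of the two lenses. A detector 310 is located at the image-side focal point of the second convex lens 308, so the total length of the system is four focal lengths. When the 4F optical system shown in FIG. 3C is used for data comparison, the first data to be compared is loaded onto the first modulator 302, and the spectrum of the flipped second data is loaded onto the second modulator 306. The optical signal generated from the first data undergoes a Fourier transform as it passes through the first convex lens 304 and becomes a spectral optical signal at the position of the second modulator 306, where it is multiplied in optical space by the spectrum of the flipped second data held on the second modulator 306. In essence, this changes the light-field energy distribution of the spectral optical signal of the first data. The multiplied spectral optical signal then undergoes an inverse Fourier transform as it passes through the second convex lens 308 and returns to a time-domain optical signal. The detector 310 obtains the autocorrelation result of the two data by detecting the intensity of the time-domain optical signal that has passed through the second convex lens 308. Note that the first data and the second data loaded onto the optical computing chip may both be vectors.
It can be understood that the optical computing chips of FIG. 3A to FIG. 3C compare data by detecting the autocorrelation result of the optical signals of the two data in optical space. As those skilled in the art know, autocorrelation, also called serial correlation, is the cross-correlation of a signal with itself at different points in time. Put another way, autocorrelation is the similarity between two observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns in a sequence of random variables. In sequence identification, using the autocorrelation operation guarantees that, when the sequence to be tested and the target sequence are the same, a distinct maximum appears in the autocorrelation result; monitoring for the appearance of this maximum makes sequence comparison relatively easy to implement.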
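The multiply-in-the-spectral-plane principle can be imitated electronically as a rough illustration. The sketch below is not the optical implementation: the lenses are replaced by a naive discrete Fourier transform, the signals are zero-padded so that the circular result equals the linear correlation, and the function names `dft` and `correlate_via_spectrum` are hypothetical.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for a lens)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        result.append(total / n if inverse else total)
    return result

def correlate_via_spectrum(first, second):
    """Correlate two vectors as IDFT(DFT(first) * DFT(flipped second))."""
    n = len(first) + len(second) - 1
    first_padded = list(first) + [0] * (n - len(first))
    # The second data is flipped before its spectrum is taken, as in FIG. 3C.
    second_flipped = list(reversed(second)) + [0] * (n - len(second))
    product = [a * b for a, b in zip(dft(first_padded), dft(second_flipped))]
    # Back to the "time domain"; index len(second) - 1 is the zero-lag term.
    return [round(v.real, 6) for v in dft(product, inverse=True)]
```

When the two inputs are identical, the largest value sits at the zero-lag index and equals the signal energy, which is the distinct maximum the text refers to.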
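The following describes in detail how the gene comparison device shown in FIG. 1 performs gene comparison and increases its speed. FIG. 4 is a flowchart of a gene comparison method provided by an embodiment of the present invention. The method shown in FIG. 4 is described below with reference to FIG. 1. For clarity and brevity, the embodiment is described using the testing of a single gene sequence as an example. It can be understood that, even though multiple gene sequences may be tested at once in practice, each of them can be compared as described in the embodiment of the present invention. As shown in FIG. 4, the method includes the following steps.
In step 402, the processor 102 obtains a first group of gene fragments from the database according to some of the bases of the gene sequence to be tested. Specifically, the key of the gene sequence to be tested may be obtained in the same way as the keys 1044_1 of the gene database 1044. For example, the 5 bases at the head and the 5 bases at the tail of the gene sequence to be tested may be used as its key. The gene database 1044 is searched with this key to obtain the multiple values that match it; these values indicate the possible positions of the gene sequence to be tested in the reference gene sequence. Because the value corresponding to a key in the gene database 1044 indicates the position of the corresponding reference gene fragment in the reference gene sequence, multiple reference gene fragments can be obtained from the matching values. In the embodiment of the present invention, the multiple reference gene fragments that match the key of the gene sequence to be tested are called the first group of gene fragments.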
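In step 404, the optical computing chip 106 optically compares the gene sequence to be tested with the multiple reference gene fragments in the first group of gene fragments. Specifically, the processor 102 may optically encode the gene sequence to be tested and each of the multiple reference gene fragments, and load the optical encodings onto the optical computing chip for comparison. During optical encoding, the base strings of the gene sequence to be tested and of the reference gene fragments are encoded separately. For example, four point light sources may serve as the unit cluster for a single base, with their on/off states (0 for a light source that is off, 1 for one that is on) representing the four bases: A, C, G, and T are encoded as 0001, 0010, 0100, and 1000, as shown in FIG. 5A. Using this encoding of the single bases A, C, G, and T, the optical encodings of the gene sequence to be tested and of the multiple reference gene fragments in the first group can be obtained and sent to the optical computing chip 106 for optical comparison.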
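The four-source encoding above can be modelled as a one-hot bit vector per base. The sketch below also derives a similarity score as the fraction of positions where the two encodings both have a lit source, which equals the zero-lag correlation divided by the sequence length. That similarity definition is an assumption made for illustration; the patent does not spell out how the detector output is converted to a percentage. The names `ENCODING`, `encode`, and `similarity` are hypothetical.

```python
ENCODING = {
    "A": (0, 0, 0, 1),
    "C": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}

def encode(sequence: str) -> list[int]:
    """Concatenate the 4-bit on/off pattern of every base."""
    bits: list[int] = []
    for base in sequence:
        bits.extend(ENCODING[base])
    return bits

def similarity(read: str, fragment: str) -> float:
    """Fraction of aligned positions carrying the same base."""
    if len(read) != len(fragment):
        raise ValueError("sequences are assumed to have equal length")
    matches = sum(a * b for a, b in zip(encode(read), encode(fragment)))
    return matches / len(read)
```

Because each base lights exactly one source, the dot product of two encodings counts the positions where the bases agree.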
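In practice, different encoding schemes directly affect the difficulty of decoding and the reliability of the autocorrelation output. In another case, the encoding may therefore also use light intensity information and/or light spatial information. In the embodiment of the present invention, encoding with light intensity information is called intensity encoding, and encoding with light spatial information is called spatial encoding. The two schemes may also be combined in practice, and the combination is called hybrid encoding. Intensity encoding may modulate the light intensity with different voltage amplitudes, so that optical signals of different intensities represent the four bases; it may be as shown in FIG. 5B. Spatial encoding may use multiple point light sources as the unit cluster for a single base, with their on/off states (for example, 0 for a light source that is off and 1 for one that is on) representing the four bases; it may be as shown in FIG. 5C, where multiple optical signals with the same voltage and different light intensities represent different bases. Hybrid encoding combines intensity encoding and spatial encoding; for example, as shown in FIG. 5D, multiple optical signals with different voltages and different light intensities may be combined to represent different bases. The embodiment of the present invention does not limit the specific encoding scheme.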
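During gene comparison on the optical computing chip 106, the light source array 202 may first send a first optical signal according to the encoding of the flipped gene sequence to be tested. After reflection by the first concave mirror 208, the first optical signal undergoes a Fourier transform and becomes a spectral optical signal; the modulator array 204 receives this reflected spectral optical signal and modulates it onto the modulator array 204. The light source array 202 then sends multiple optical signals according to the optical encodings of the multiple reference gene fragments in the first group of reference gene fragments. Each of these signals, after passing the first concave mirror 208 and becoming a spectral optical signal at the position of the modulator array 204, is multiplied in optical space by the reflected signal of the first optical signal. The spectral optical signal output by the modulator array 204 undergoes an inverse Fourier transform at the second concave mirror 210 and becomes a time-domain optical signal. Finally, by detecting the intensity of the time-domain optical signal output by the second concave mirror 210, the detector array 206 obtains the matching result between the first optical signal and the optical signal of each reference gene fragment. As those skilled in the art know, multiplying the spectra and then applying an inverse Fourier transform yields the autocorrelation result of the two data.
In step 406, the processor 102 determines the similarity between the gene sequence to be tested and the multiple reference gene fragments according to the output of the optical computing chip. In practice, after the detector array 206 obtains a matching result, the optical computing chip 106 may send it to the processor 102. For example, peripheral circuits may collect the light intensity signal detected by the detector array 206, convert it into an electrical signal, convert the electrical signal into a digital signal, and send it to the processor 102, so that the processor 102 obtains the result of the comparison between the gene sequence to be tested and the reference gene fragments. It can be understood that, in practice, the detector array 206 may report each comparison result as it is obtained, or report only when the similarity reaches a preset threshold. Note that the similarity in the embodiment of the present invention indicates the degree of matching between the gene sequence to be tested and a reference gene fragment.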
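In step 408, the processor 102 determines whether the similarity between the gene sequence to be tested and a first reference gene fragment among the multiple reference gene fragments is greater than or equal to a first threshold. If it is, the method proceeds to step 410; if the similarity is less than the first threshold, the method proceeds to step 412. In this step, after obtaining a comparison result, the processor 102 compares it with the configured threshold. The matching result between the gene sequence to be tested and any reference gene fragment can be compared with the threshold in this way. The embodiment of the present invention is described using the first reference gene fragment in the first group of reference gene fragments as an example, where the first reference gene fragment is any one of the reference gene fragments in the first group.
In step 410, the processor 102 records the position of the first reference gene fragment in the reference gene sequence and ends the matching of the gene sequence to be tested. In the embodiment of the present invention, a matching result with a similarity greater than or equal to the first threshold is regarded as a successful match. When the processor 102 determines that the gene sequence to be tested matches the first reference gene fragment, it records the position of the first reference gene fragment in the reference gene sequence, and the matching process ends. It can be understood that, in the embodiment of the present invention, the similarity indicates the degree of matching between the gene sequence to be tested and a reference gene fragment, and the first threshold indicates whether the required matching standard has been reached. In practice, the first threshold may indicate an exact match or a maximum-similarity match. If the similarity is greater than or equal to the configured first threshold, the gene sequence to be tested can be regarded as matching, or as a maximum-similarity match of, the reference gene sequence. For example, the first threshold may be 100% or 95%, which is not limited here.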
若在步骤408中,处理器确定所述待测基因片段与所述第一基因片段的相似度小于所述第一阈值,则在步骤412中,处理器102会进一步判断所述待测基因片段与所述第一基因片段的相似度是否大于第二阈值,当所述待测基因片段与所述第一基因片段的相似度大于第二阈值时,该方法进入步骤414,进入最大相似度匹配的流程。否则该方法进入步骤416,确认所述待测基因序列与所述第一参考基因片段不匹配,结束所述待测基因片段与所述第一基因片段的匹配。在本发明实施例中,可以将第二阈值设置为50%。当待测基因片段与所述第一参考基因片段的相似度小于第一阈值且大于第二阈值时,说明所述待测基因序列与所述参考基因序列能够匹配的可能性较大,或者说,所述待测基因序列中的部分片段可能与所述参考基因序列进行匹配。因此需要进一步的对所述待测基因序列与所述参考基因序列进行比对,该方法进入最大相似度匹配流程。
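Steps 408 to 416 in FIG. 4 are described using the matching between the gene sequence to be tested and the first reference gene fragment as an example. In practice, the similarities between the gene sequence to be tested and the plurality of reference gene fragments may first be obtained through steps 404 and 406, and each similarity may then be processed according to steps 408 to 416. Alternatively, after the first group of reference gene fragments is obtained, steps 404 to 416 may be performed in turn for the gene sequence to be tested and each reference gene fragment in the group. The specific implementation is not limited here.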
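In the gene alignment method provided by the embodiments of the present invention, the gene sequence to be tested is first matched against the constructed gene database, which screens out a first group of reference gene fragments that may match it. A person skilled in the art knows that the human reference gene sequence, for example, has 3 billion bases, so comparing the gene sequence to be tested against it position by position would take a long time. After screening with the gene database provided by the embodiments of the present invention, the number of reference gene fragments to be compared drops from the order of 3 billion to a few hundred, which greatly reduces the amount of comparison. In addition, after the first group of reference gene fragments is obtained, the gene sequence to be tested and the reference gene fragments in the group are compared optically by the optical computing chip, which is faster than electrical comparison. The gene alignment method provided by the embodiments of the present invention therefore greatly improves alignment efficiency.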
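Note that, in the embodiments of the present invention, whenever the similarity between the gene sequence to be tested and any reference gene fragment in the first group of reference gene fragments is less than the first threshold and greater than the second threshold, the gene sequence to be tested may be compared further using the maximum-similarity matching method shown in FIG. 6. FIG. 6 is a flowchart of another gene alignment method provided by an embodiment of the present invention. The method of FIG. 6 is likewise performed by the gene matching apparatus 100. As shown in FIG. 6, the method may include the following steps.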
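In step 602, the processor 102 obtains a plurality of sub-reference gene sequences from the reference gene sequence. Specifically, the processor 102 does so according to the length of the gene sequence to be tested. For example, the length of the gene sequence to be tested may be used as both the window and the sliding stride to obtain the sub-reference gene sequences from the reference gene sequence. Alternatively, the reference gene sequence may be split into sub-reference gene sequences according to the base length of the gene sequence to be tested. As shown in FIG. 7, a plurality of sub-reference gene sequences may be obtained from the reference gene sequence 700 according to the length of the gene sequence to be tested 702. Taking a reference gene sequence of 3 billion bases as an example, if the gene sequence to be tested has 150 bases, 20 million sub-reference gene sequences are obtained.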
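The windowing of step 602 can be sketched as a generator in which the window and the stride are both equal to the length of the gene sequence to be tested. The function name is hypothetical, and the handling of a shorter final piece is an assumption, since the text does not say what happens when the reference length is not a multiple of the window.

```python
def sub_references(reference, window):
    """Yield (start, sub_sequence) pairs using window == stride.

    The last piece may be shorter than `window` (an assumption of this sketch).
    """
    for start in range(0, len(reference), window):
        yield start, reference[start:start + window]
```

Keeping the start offset alongside each piece makes it straightforward to record a position in the reference gene sequence later on. With the figures from the text, 3,000,000,000 bases divided into windows of 150 bases gives 20,000,000 pieces.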
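In step 604, the gene sequence to be tested and the i-th sub-reference gene sequence obtained in step 602 are input into the optical computing chip 106 for optical comparison. The initial value of i is 1, and i does not exceed the number of sub-reference gene sequences obtained in step 602. Specifically, the processor 102 may optically encode the gene sequence to be tested and the i-th sub-reference gene sequence separately and load both optical encodings into the optical computing chip 106 for optical comparison, so as to obtain the similarity between them; the optical computing chip 106 sends the comparison result to the processor 102. In the embodiments of the present invention, the similarity between the gene sequence to be tested and the first sub-reference gene sequence among the plurality of sub-reference gene sequences is called the first similarity.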
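In step 606, the processor 102 determines whether the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is greater than a configured third threshold. If it is not, the gene sequence to be tested does not match the i-th sub-reference gene sequence; the method proceeds to step 608, sets i = i + 1, and returns to step 604 to compare the gene sequence to be tested with the next sub-reference gene sequence, until it has been optically compared by the optical computing chip 106 with all sub-reference gene sequences obtained in step 602. If the processor 102 determines in step 606 that the similarity is greater than the third threshold, the method proceeds to step 610. To find, as far as possible, reference gene fragments that match at least part of the gene sequence to be tested, the third threshold may be set below 50%, for example to 20%. In practice the third threshold may also be equal to the second threshold; this is not limited here.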
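If the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is greater than the third threshold, then in step 610 the processor 102 further determines whether that similarity is greater than a fourth threshold. If it is, the method proceeds to step 612. In the embodiments of the present invention, the fourth threshold is not greater than the first threshold: the first threshold may indicate an exact match, and the fourth threshold indicates a maximum-similarity match. Typically the first threshold may be set to 100% and the fourth threshold to 95%. In practice the fourth threshold may also equal the first threshold; for example, both may be set to 95% to indicate that a maximum-similarity match has been reached. This is not limited here. In step 612, the processor 102 determines that the i-th sub-reference gene sequence is the gene fragment with maximum similarity to the gene sequence to be tested, records the position of the i-th sub-reference gene sequence in the reference gene sequence, and ends the alignment procedure for the gene sequence to be tested. If the similarity is not greater than the fourth threshold, the method proceeds to step 614.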
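The loop over steps 604 to 612 can be sketched as follows. The position-by-position `similarity` function is a software stand-in assumed for illustration (in the described apparatus the similarity comes from the optical computing chip), the default thresholds are the example values of 20% and 95%, and all names and return labels are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two sequences carry the same base."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / longest

def scan_sub_references(query, sub_refs, third=0.20, fourth=0.95):
    """Return (outcome, index, similarity) for the first candidate above `third`."""
    for i, sub_ref in enumerate(sub_refs):
        s = similarity(query, sub_ref)      # step 604: compare with the i-th piece
        if s <= third:                      # step 606 fails, step 608: next piece
            continue
        if s > fourth:                      # step 610 succeeds, step 612: record it
            return "max_similar", i, s
        return "split_query", i, s          # step 614: split the query and retry
    return "no_candidate", None, None
```

For example, `scan_sub_references("ACGTACGT", ["TGCATGCA", "ACGTACGA"])` skips the first piece and reports that the second piece needs the finer-grained comparison of steps 614 onward.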
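In step 614, the processor 102 obtains a first sub-gene sequence to be tested and a second sub-gene sequence to be tested from the gene sequence to be tested. Referring again to FIG. 7, the processor 102 may obtain the first sub-gene sequence to be tested 7022 and the second sub-gene sequence to be tested 7024 from the gene sequence to be tested 702, where the two share some of their bases. For example, the first sub-gene sequence to be tested 7022 may consist of a first preset length of bases taken from the head of the gene sequence to be tested 702 toward its tail, and the second sub-gene sequence to be tested 7024 may consist of a first preset length of bases taken from the tail of the gene sequence to be tested 702 toward its head. The method then proceeds to step 616.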
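A minimal sketch of the split in step 614, assuming the preset length is passed in as a parameter. For the two pieces to share bases, as the text requires, the preset length has to exceed half of the sequence length; the function name and the validation are assumptions of this sketch.

```python
def split_head_tail(seq, preset_length):
    """Return (head_piece, tail_piece), both of `preset_length` bases, overlapping in the middle."""
    if not len(seq) / 2 < preset_length < len(seq):
        raise ValueError("preset_length must exceed half the length and be shorter than seq")
    return seq[:preset_length], seq[-preset_length:]
```

For an 8-base sequence and a preset length of 5, the two pieces overlap in the two middle bases. The same helper could serve step 626, where a sub-gene sequence is split again with a second preset length.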
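In step 616, the j-th sub-gene sequence to be tested is optically compared with the i-th sub-reference gene sequence by the optical computing chip 106. The initial value of j is 1, and j does not exceed the number of sub-gene sequences to be tested. Because two sub-gene sequences to be tested are obtained from the gene sequence to be tested in this embodiment, j is at most 2; if p (p greater than 2) sub-gene sequences to be tested are needed in practice, j is at most p. In this step, the processor 102 first optically encodes the j-th sub-gene sequence to be tested and then loads its optical encoding, together with that of the i-th sub-reference gene sequence, into the optical computing chip 106 for optical comparison, so as to obtain the similarity between the two. The method proceeds to step 618. In step 618, the processor 102 determines whether the similarity between the j-th sub-gene sequence to be tested and the i-th sub-reference gene sequence is greater than the third threshold. If it is not, the method proceeds to step 620, sets j = j + 1, and returns to step 616 to optically compare the next sub-gene sequence to be tested with the i-th sub-reference gene sequence and obtain their similarity. If it is, the method proceeds to step 622 and further determines whether the similarity is greater than the fourth threshold. For clarity, the matching result of the optical computing chip for the first sub-gene sequence to be tested and the first sub-reference gene sequence is called the second similarity, and the matching result for the second sub-gene sequence to be tested and the first sub-reference gene sequence is called the third similarity.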
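If, in step 622, the processor 102 determines that the similarity between the j-th sub-gene sequence to be tested and the i-th sub-reference gene sequence is greater than the fourth threshold, the method proceeds to step 624, records the position in the reference gene sequence of the reference gene fragment within the i-th sub-reference gene sequence that matches the j-th sub-gene sequence to be tested, and ends the matching of the gene sequence to be tested. In practice, once the similarity between the j-th sub-gene sequence to be tested and part of the i-th sub-reference gene sequence is known to exceed the fourth threshold, the (j+1)-th sub-gene sequence to be tested need not be matched against the i-th sub-reference gene sequence, and the alignment procedure for the gene sequence to be tested may end directly, which speeds up matching. Alternatively, the (j+1)-th sub-gene sequence to be tested may still be optically compared with the i-th sub-reference gene sequence if needed.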
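If, in step 622, the processor 102 determines that the similarity between the j-th sub-gene sequence to be tested and the i-th sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 626. In step 626, the processor 102 obtains a first gene sequence unit to be tested and a second gene sequence unit to be tested from the j-th sub-gene sequence to be tested, where the two units share some of their bases. The procedure is analogous to obtaining the first and second sub-gene sequences to be tested from the gene sequence to be tested in step 614. For example, the first gene sequence unit to be tested may consist of a second preset length of bases taken from the head of the j-th sub-gene sequence to be tested toward its tail, and the second gene sequence unit to be tested may consist of a second preset length of bases taken from the tail of the j-th sub-gene sequence to be tested toward its head.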
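In step 628, the k-th gene sequence unit to be tested is optically compared with the i-th sub-reference gene sequence by the optical computing chip 106. The initial value of k is 1, and k does not exceed the number of gene sequence units to be tested. Because two gene sequence units to be tested are obtained from the j-th sub-gene sequence to be tested in this embodiment, k is at most 2. Specifically, in step 628 the processor 102 may optically encode the k-th gene sequence unit to be tested and load its optical encoding and that of the i-th sub-reference gene sequence onto the optical computing chip 106 for optical comparison. The method proceeds to step 630. In step 630, the processor 102 determines whether the similarity between the k-th gene sequence unit to be tested and the i-th sub-reference gene sequence is greater than the third threshold. If it is not, the method proceeds to step 632, sets k = k + 1, and returns to step 628 to optically compare the second gene sequence unit to be tested with the i-th sub-reference gene sequence by the optical computing chip 106.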
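If, in step 630, the processor 102 determines that the similarity between the k-th gene sequence unit to be tested and the i-th sub-reference gene sequence is greater than the third threshold, the method proceeds to step 634 and determines whether that similarity is greater than the fourth threshold. If it is, the method proceeds to step 636, records the position in the reference gene sequence of the gene fragment within the i-th sub-reference gene sequence that matches the k-th gene sequence unit to be tested, and ends the matching. In one case, to speed up matching, the matching of the gene sequence to be tested may end once a maximum-similarity gene fragment has been obtained. In another case, only the matching of the j-th sub-gene sequence to be tested, or of the k-th gene sequence unit to be tested, is ended, and matching continues with the (k+1)-th gene sequence unit to be tested or the (j+1)-th sub-gene sequence to be tested.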
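If, in step 634, the processor 102 determines that the similarity between the k-th gene sequence unit to be tested and the i-th sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 638: the k-th gene sequence unit to be tested is split further in a recursive manner, and its sub-units are optically compared with the i-th sub-reference gene sequence, until a gene fragment to be tested is found whose similarity to the i-th sub-reference gene sequence is greater than the fourth threshold. In the embodiments of the present invention, a reference gene fragment within a sub-reference gene sequence whose similarity to part of the gene sequence to be tested is greater than the fourth threshold is called a maximum-similarity gene fragment.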
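Steps 614 to 638 amount to a recursive search: compare a piece, and if it is promising but not good enough, split it into overlapping head and tail pieces and try again. The sketch below is one possible software rendering under stated assumptions: `best_placement` stands in for the optical comparison by sliding the piece over the sub-reference, and `keep_ratio` (which must lie strictly between 0 and 1) and `min_len` are hypothetical parameters replacing the unspecified preset lengths.

```python
def best_placement(piece, sub_ref):
    """Slide piece over sub_ref; return (similarity, offset) of the best placement."""
    best_sim, best_off = 0.0, 0
    for off in range(len(sub_ref) - len(piece) + 1):
        window = sub_ref[off:off + len(piece)]
        sim = sum(a == b for a, b in zip(piece, window)) / len(piece)
        if sim > best_sim:
            best_sim, best_off = sim, off
    return best_sim, best_off

def find_max_similar(piece, sub_ref, third=0.20, fourth=0.95, keep_ratio=0.6, min_len=4):
    """Return (fragment, offset_in_sub_ref, similarity) or None if nothing qualifies."""
    sim, off = best_placement(piece, sub_ref)
    if sim > fourth:
        return piece, off, sim                    # steps 612 / 624 / 636
    if sim <= third or len(piece) <= min_len:
        return None                               # not promising, or too short to split
    keep = max(min_len, int(len(piece) * keep_ratio))
    for part in (piece[:keep], piece[-keep:]):    # overlapping head and tail pieces
        found = find_max_similar(part, sub_ref, third, fourth, keep_ratio, min_len)
        if found is not None:
            return found
    return None
```

The sketch stops at the first fragment that exceeds the fourth threshold, which corresponds to the faster of the two options described in the text; continuing with the remaining pieces would be an equally valid choice.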
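For a gene sequence to be tested that could not be matched exactly by the method of FIG. 4, the gene alignment method provided by the embodiments of the present invention can continue with the maximum-similarity matching shown in FIG. 6. Because the method of FIG. 6 allows the gene sequence to be tested to differ from the maximum-similarity gene fragment obtained (some bases of the gene sequence to be tested may be missing or may differ from the reference gene fragment), missing or mutated genes in the gene sequence to be tested can be located precisely.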
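In another case, the gene alignment method provided by the embodiments of the present invention may further include the procedure shown in FIG. 8, which may follow step 604 of FIG. 6. As shown in FIG. 8, the method may include the following steps. In step 802, the processor 102 determines that a first similarity, between the gene sequence to be tested and the i-th sub-reference gene sequence, is less than the third threshold. In step 804, the processor 102 further determines that a second similarity, between the gene sequence to be tested and the (i+1)-th sub-reference gene sequence, is greater than the third threshold, and the method proceeds to step 806. For steps 802 and 804, refer to the description of step 606 in FIG. 6; the third threshold may be the same as the one set in step 606, for example 50%.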
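In step 806, the processor further determines whether the sum of the first similarity and the second similarity is greater than 100%. If it is not, the method proceeds to step 808, and the gene sequence to be tested is optically compared with the (i+2)-th sub-reference gene sequence by the optical computing chip 106. If the sum is greater than 100%, the method proceeds to step 810. In step 810, the processor 102 obtains a new sub-reference gene sequence from the i-th and (i+1)-th sub-reference gene sequences: a portion of each is taken in proportion to the first and second similarities, and the portions are joined. For example, if the first similarity is 40%, the second similarity is 80%, and a sub-reference gene sequence is 150 base pairs long, the last 50 base pairs of the i-th sub-reference gene sequence and the first 100 base pairs of the (i+1)-th sub-reference gene sequence may be joined into one contiguous new sub-reference gene sequence of 150 base pairs. The method then proceeds to step 812, in which the gene sequence to be tested is optically compared with the new sub-reference gene sequence by the optical computing chip 106. For the optical comparison itself, see the description of step 604 in FIG. 6; for the overall procedure, see the comparison of the gene sequence to be tested with the i-th sub-reference gene sequence in FIG. 6. Accordingly, if the similarity between the gene sequence to be tested and the new sub-reference gene sequence is greater than the third threshold, steps 610 to 638 of FIG. 6 may be followed to search the new sub-reference gene sequence for a reference gene fragment whose similarity to the gene sequence to be tested is greater than the fourth threshold. In the embodiments of the present invention, any reference gene fragment found in the reference gene sequence by the methods of FIG. 6 and FIG. 8 whose similarity to the gene sequence to be tested is greater than the fourth threshold is called a maximum-similarity gene fragment.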
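The proportional stitching of steps 806 to 810 can be illustrated with the numbers from the text. The sketch assumes similarities are given as fractions, that both sub-reference gene sequences have the same length, and that the split point is rounded to the nearest base; the function name is hypothetical.

```python
def stitch_sub_reference(sub_i, sub_next, sim_i, sim_next):
    """Build a new sub-reference from the tail of sub_i and the head of sub_next.

    Returns None when the similarities do not sum to more than 100% (step 808).
    """
    if sim_i + sim_next <= 1.0:
        return None
    length = len(sub_next)
    from_prev = round(length * sim_i / (sim_i + sim_next))   # share taken from sub_i
    from_next = length - from_prev                           # share taken from sub_next
    tail = sub_i[len(sub_i) - from_prev:]
    return tail + sub_next[:from_next]
```

With similarities of 40% and 80% and 150-base pieces, the ratio 40:80 gives 50 bases from the tail of the first piece and 100 bases from the head of the second, matching the worked example in the description.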
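The method of FIG. 8 may be combined with the method of FIG. 6. For example, when the similarity between the gene sequence to be tested and the i-th sub-reference gene sequence is low while its similarity to the (i+1)-th sub-reference gene sequence is high, the method of FIG. 8 may be performed instead, so that a new sub-reference gene sequence obtained from the i-th and (i+1)-th sub-reference gene sequences is compared with the gene sequence to be tested. Adjusting the sub-reference gene sequences promptly on the basis of partial comparison results increases the probability and speed of obtaining a maximum-similarity gene fragment and reduces the number of comparisons. In practice, the gene sequence to be tested may also first be compared with all sub-reference gene sequences obtained in step 602 according to the method of FIG. 6, after which the method of FIG. 8 is performed and the comparison is repeated with adjusted sub-reference gene sequences. The embodiments of the present invention do not limit the specific order of execution.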
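FIG. 8 is described using the gene sequence to be tested and the i-th sub-reference gene sequence as an example; in practice the i-th sub-reference gene sequence may be any one of the plurality of sub-reference gene sequences. Take, for example, the comparison of the gene sequence to be tested with a second sub-reference gene sequence among the plurality of sub-reference gene sequences: in step 802, the processor determines that the similarity between the gene sequence to be tested and the second sub-reference gene sequence, called the fourth similarity, is less than the third threshold. In step 804, the processor 102 determines that the similarity between the gene sequence to be tested and a third sub-reference gene sequence, called the fifth similarity, is greater than the third threshold. If, in step 806, the processor further determines that the sum of the fourth similarity and the fifth similarity is greater than 100%, the processor 102 may obtain a new sub-reference gene sequence from the second and third sub-reference gene sequences according to the method of FIG. 8.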
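In the embodiments of the present invention, after the maximum-similarity gene fragment of the gene sequence to be tested has been found by the methods of FIG. 6 and FIG. 8, the Smith-Waterman local alignment algorithm may further be used to extend the maximum-similarity gene fragment along the gene sequence to be tested and the reference sequence, so as to obtain a longer maximum-similarity gene fragment and facilitate further genetic analysis of the gene sequence to be tested.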
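For reference, a compact Smith-Waterman implementation with a linear gap penalty is sketched below. The scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, as the text does not specify a scoring scheme, and the function returns only the best score and the half-open index ranges of the aligned region in each sequence.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, (start_a, end_a), (start_b, end_b)) of the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])
```

In the flow described above, `a` would be the gene sequence to be tested and `b` a stretch of the reference sequence around the recorded position of the maximum-similarity gene fragment.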
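The methods of the above embodiments are described using the comparison between the gene sequence to be tested and one of the plurality of sub-reference gene sequences as an example. In practice, the gene sequence to be tested may be compared with several sub-reference gene sequences separately; this is not limited here. Ordinal terms such as "first" and "second" in the embodiments of this application serve only to distinguish objects and do not limit their order, timing, priority, or importance.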
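The comparison method of the embodiments of the present invention uses gene alignment only as an example. In practice, the method provided here, which combines database-based electrical comparison with optical comparison on an optical computing chip, may be applied to various other scenarios. FIG. 9 is a schematic diagram of a comparison apparatus provided by an embodiment of the present invention. The comparison apparatus may be used for various data-comparison scenarios, including gene alignment.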
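As shown in FIG. 9, the comparison apparatus 900 may include a processor 902, a memory 904, and an optical computing chip 906. The processor 902 is configured to obtain, according to a first object to be matched, a first group of reference objects from a database stored in the memory 904, where the first group of reference objects includes a plurality of reference objects that share some features with the first object. The optical computing chip 906 is connected to the processor and configured to optically compare the first object with the plurality of reference objects. The processor 902 may further be configured to determine, according to the output of the optical computing chip, the similarity between the first object and each of the reference objects.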
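In another case, the processor 902 may further be configured to determine, according to the output of the optical computing chip, that the similarity between the first object and a first reference object in the first group of reference objects is less than a first threshold and greater than a second threshold, and to obtain a plurality of sub-reference objects from a standard object, where each sub-reference object is a part of the standard object. The optical computing chip 906 may further be configured to optically compare the first object with a first sub-reference object among the plurality of sub-reference objects to obtain a first similarity between the first object and the first sub-reference object.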
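In another case, the processor 902 may further be configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold and, in response to that determination, to obtain a first sub-object and a second sub-object from the first object, where the fourth threshold is not greater than the first threshold and the first sub-object and the second sub-object share some of their data. The optical computing chip 906 may further be configured to optically compare the first sub-object with the first sub-reference object to obtain a second similarity, and to optically compare the second sub-object with the first sub-reference object to obtain a third similarity. In practice, the processor 902 may further be configured to record the position of the first sub-reference object in the standard object when the second similarity is greater than the fourth threshold.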
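The comparison apparatus of FIG. 9 can implement the functions of the gene alignment apparatus of FIG. 1; for its description, refer to the foregoing descriptions of FIG. 1 to FIG. 8. The comparison apparatus of FIG. 9 can be applied to any scenario that requires data comparison or feature comparison, including gene alignment; the gene alignment apparatus of FIG. 1 is one specific application of it. The comparison apparatus of FIG. 9 and the comparison method provided by the embodiments of the present invention may also be used for image comparison, search by image, sequence alignment, fuzzy matching, and similar scenarios; this is not limited here.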
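FIG. 10 is a schematic diagram of another comparison apparatus provided by an embodiment of the present invention. As shown in FIG. 10, the comparison apparatus 1000 may include an obtaining module 1002, a comparison module 1004, and a result processing module 1006. The obtaining module 1002 is configured to obtain a first group of gene fragments from a database according to a gene sequence to be tested, where the database contains a plurality of reference gene fragments of a reference gene sequence, and the first group of gene fragments includes a plurality of reference gene fragments that match some bases of the gene sequence to be tested. The comparison module 1004 is configured to optically compare the gene sequence to be tested with the plurality of reference gene fragments in the first group of gene fragments. The result processing module 1006 is configured to determine, according to the output of the comparison module 1004, the similarity between the gene sequence to be tested and each of the plurality of reference gene fragments in the first group of gene fragments.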
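In another case, the comparison apparatus 1000 may further include a judging module 1008. The judging module 1008 is configured to determine, according to the output of the comparison module 1004, that the similarity between the gene sequence to be tested and a first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold. The obtaining module 1002 is further configured to obtain, when the judging module 1008 so determines, a plurality of sub-reference gene sequences from the reference gene sequence, where each sub-reference gene sequence is a part of the reference gene sequence. The comparison module 1004 is further configured to optically compare the gene sequence to be tested with a first sub-reference gene sequence among the plurality of sub-reference gene sequences. The result processing module 1006 is further configured to obtain, according to the output of the optical computing chip, a first similarity between the gene sequence to be tested and the first sub-reference gene sequence.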
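In another case, the judging module 1008 is further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, where the fourth threshold is not greater than the first threshold. The obtaining module 1002 is further configured to obtain, in response to the determination of the judging module 1008, a first sub-gene sequence to be tested and a second sub-gene sequence to be tested from the gene sequence to be tested, where the first and second sub-gene sequences to be tested share some of their bases. The comparison module 1004 is further configured to optically compare the first sub-gene sequence to be tested with the first sub-reference gene sequence to obtain a second similarity, and to optically compare the second sub-gene sequence to be tested with the first sub-reference gene sequence to obtain a third similarity.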
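In another case, the result processing module 1006 is further configured to record the position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold.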
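In another case, the obtaining module 1002 is further configured to obtain, when the judging module 1008 determines that the third similarity is greater than the third threshold and less than the fourth threshold, a first sub-gene sequence unit to be tested and a second sub-gene sequence unit to be tested from the second sub-gene sequence to be tested, where the first and second sub-gene sequence units to be tested share some of their bases. The comparison module 1004 is further configured to optically compare the first sub-gene sequence unit to be tested with the first sub-reference gene sequence, and to optically compare the second sub-gene sequence unit to be tested with the first sub-reference gene sequence.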
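In another case, the comparison module 1004 is further configured to optically compare the gene sequence to be tested with a second sub-reference gene sequence among the plurality of sub-reference gene sequences to obtain a fourth similarity between them, and to optically compare the gene sequence to be tested with a third sub-reference gene sequence among the plurality of sub-reference gene sequences to obtain a fifth similarity between them, where the third sub-reference gene sequence is contiguous with the second sub-reference gene sequence. When the judging module 1008 determines that the sum of the fourth similarity and the fifth similarity is greater than the first threshold, the obtaining module 1002 is further configured to obtain a fourth sub-reference gene sequence from the second and third sub-reference gene sequences, where the fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence. The comparison module 1004 is further configured to input the gene sequence to be tested and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.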
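In another case, the result processing module 1006 is further configured to determine, according to the output of the optical computing chip, that a second gene fragment in the first group of gene fragments matches the gene sequence to be tested, and to record the position of the second gene fragment in the reference gene sequence.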
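The comparison apparatus of FIG. 10 can implement the functions of the gene alignment apparatus of FIG. 1; for details, refer to the foregoing description of the functions of the corresponding modules in FIG. 1, which is not repeated here. The apparatus embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: several modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the connections between the modules discussed in the above embodiments may be electrical, mechanical, or of other forms. Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules. The functional modules in the embodiments of this application may exist independently or may be integrated into one processing module.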
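An embodiment of the present invention further provides a computer program product for implementing gene alignment, including a computer-readable storage medium that stores program code, where the instructions included in the program code are used to perform the method procedure of any one of the foregoing method embodiments. A person of ordinary skill in the art will understand that the storage medium includes any non-transitory machine-readable medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid state disk (SSD), or a non-volatile memory.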
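The embodiments provided in this application are merely illustrative. A person skilled in the art will clearly understand that, for convenience and brevity, each of the above embodiments has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments. The features disclosed in the embodiments, claims, and drawings of the present invention may exist independently or in combination. Features described in hardware form in the embodiments of the present invention may be implemented by software, and vice versa. This is not limited here.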
Claims (24)
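- A gene alignment method, characterized in that the method is performed by a computer system including an optical computing chip, and the method comprises: obtaining a first group of gene fragments from a gene database according to a gene sequence to be tested, wherein the gene database contains a plurality of reference gene fragments of a reference gene sequence, and the first group of gene fragments includes a plurality of reference gene fragments that match some bases of the gene sequence to be tested; and inputting the gene sequence to be tested and the plurality of reference gene fragments in the first group of gene fragments into the optical computing chip for optical comparison.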
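- The gene alignment method according to claim 1, characterized in that the method further comprises: determining, according to the output of the optical computing chip, that the similarity between the gene sequence to be tested and a first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold; obtaining a plurality of sub-reference gene sequences from the reference gene sequence, wherein each sub-reference gene sequence is a part of the reference gene sequence; and inputting the gene sequence to be tested and a first sub-reference gene sequence among the plurality of sub-reference gene sequences into the optical computing chip for optical comparison, to obtain a first similarity between the gene sequence to be tested and the first sub-reference gene sequence.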
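- The gene alignment method according to claim 2, characterized in that the method further comprises: determining that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold; in response to that determination, obtaining a first sub-gene sequence to be tested and a second sub-gene sequence to be tested from the gene sequence to be tested, wherein the first sub-gene sequence to be tested and the second sub-gene sequence to be tested share some of their bases; inputting the first sub-gene sequence to be tested and the first sub-reference gene sequence into the optical computing chip for optical comparison, to obtain a second similarity; and inputting the second sub-gene sequence to be tested and the first sub-reference gene sequence into the optical computing chip for optical comparison, to obtain a third similarity.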
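- The gene alignment method according to claim 3, characterized in that the method further comprises: when the second similarity is greater than the fourth threshold, recording the position of the first sub-reference gene sequence in the reference gene sequence.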
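- The gene alignment method according to claim 3 or 4, characterized in that the method further comprises: when the third similarity is greater than the third threshold and less than the fourth threshold, obtaining a first sub-gene sequence unit to be tested and a second sub-gene sequence unit to be tested from the second sub-gene sequence to be tested, wherein the first sub-gene sequence unit to be tested and the second sub-gene sequence unit to be tested share some of their bases; inputting the first sub-gene sequence unit to be tested and the first sub-reference gene sequence into the optical computing chip for optical comparison; and inputting the second sub-gene sequence unit to be tested and the first sub-reference gene sequence into the optical computing chip for optical comparison.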
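- The gene alignment method according to any one of claims 2 to 5, characterized in that the method further comprises: inputting the gene sequence to be tested and a second sub-reference gene sequence among the plurality of sub-reference gene sequences into the optical computing chip for optical comparison, to obtain a fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence; inputting the gene sequence to be tested and a third sub-reference gene sequence among the plurality of sub-reference gene sequences into the optical computing chip for optical comparison, to obtain a fifth similarity between the gene sequence to be tested and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is contiguous with the second sub-reference gene sequence; determining that the sum of the fourth similarity and the fifth similarity is greater than the first threshold; obtaining a fourth sub-reference gene sequence from the second sub-reference gene sequence and the third sub-reference gene sequence, wherein the fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence; and inputting the gene sequence to be tested and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.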
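- The gene alignment method according to claim 1, characterized in that the method further comprises: determining, according to the output of the optical computing chip, that a second gene fragment in the first group of gene fragments matches the gene sequence to be tested; and recording the position of the second gene fragment in the reference gene sequence.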
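- The gene alignment method according to claim 1, characterized in that inputting the gene sequence to be tested and the plurality of reference gene fragments in the first group of gene fragments into the optical computing chip for optical comparison comprises: optically encoding the gene sequence to be tested and each of the plurality of reference gene fragments in the first group of gene fragments; and inputting the optical encoding of the gene sequence to be tested and the optical encodings of the plurality of gene fragments in the first group of gene sequences into the optical computing chip, respectively, for optical comparison.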
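- The gene alignment method according to any one of claims 1 to 7, characterized in that obtaining the first group of gene fragments from the database according to the gene sequence to be tested comprises: obtaining the first group of gene fragments from the database according to the first m bases and the last n bases of the gene sequence to be tested, wherein m and n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be tested.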
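- A gene alignment apparatus, comprising: a processor, configured to obtain a first group of gene fragments from a database according to a gene sequence to be tested, wherein the database contains a plurality of reference gene fragments of a reference gene sequence, and the first group of gene fragments includes a plurality of reference gene fragments that match some bases of the gene sequence to be tested; and an optical computing chip, connected to the processor and configured to optically compare the gene sequence to be tested with the plurality of reference gene fragments in the first group of gene fragments.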
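- The gene alignment apparatus according to claim 10, characterized in that the processor is further configured to: determine, according to the output of the optical computing chip, that the similarity between the gene sequence to be tested and a first gene fragment in the first group of gene fragments is less than a first threshold and greater than a second threshold; and obtain a plurality of sub-reference gene sequences from the reference gene sequence, wherein each sub-reference gene sequence is a part of the reference gene sequence; and the optical computing chip is further configured to: optically compare the gene sequence to be tested with a first sub-reference gene sequence among the plurality of sub-reference gene sequences, to obtain a first similarity between the gene sequence to be tested and the first sub-reference gene sequence.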
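- The gene alignment apparatus according to claim 11, characterized in that the processor is further configured to: determine that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold; and, in response to that determination, obtain a first sub-gene sequence to be tested and a second sub-gene sequence to be tested from the gene sequence to be tested, wherein the first sub-gene sequence to be tested and the second sub-gene sequence to be tested share some of their bases; and the optical computing chip is further configured to: optically compare the first sub-gene sequence to be tested with the first sub-reference gene sequence to obtain a second similarity; and optically compare the second sub-gene sequence to be tested with the first sub-reference gene sequence to obtain a third similarity.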
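- The gene alignment apparatus according to claim 12, characterized in that the processor is further configured to: when the second similarity is greater than the fourth threshold, record the position of the first sub-reference gene sequence in the reference gene sequence.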
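- The gene alignment apparatus according to claim 12 or 13, characterized in that the processor is further configured to: when the third similarity is greater than the third threshold and less than the fourth threshold, obtain a first sub-gene sequence unit to be tested and a second sub-gene sequence unit to be tested from the second sub-gene sequence to be tested, wherein the first sub-gene sequence unit to be tested and the second sub-gene sequence unit to be tested share some of their bases; and the optical computing chip is further configured to: optically compare the first sub-gene sequence unit to be tested with the first sub-reference gene sequence; and optically compare the second sub-gene sequence unit to be tested with the first sub-reference gene sequence.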
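- The gene alignment apparatus according to any one of claims 11 to 14, characterized in that the optical computing chip is further configured to: optically compare the gene sequence to be tested with a second sub-reference gene sequence among the plurality of sub-reference gene sequences; and optically compare the gene sequence to be tested with a third sub-reference gene sequence among the plurality of sub-reference gene sequences, wherein the third sub-reference gene sequence is contiguous with the second sub-reference gene sequence; and the processor is further configured to: determine that the sum of a fourth similarity, between the gene sequence to be tested and the second sub-reference gene sequence, and a fifth similarity, between the gene sequence to be tested and the third sub-reference gene sequence, is greater than the first threshold; obtain a fourth sub-reference gene sequence from the second sub-reference gene sequence and the third sub-reference gene sequence, wherein the fourth sub-reference gene sequence includes some bases of the second sub-reference gene sequence and some bases of the third sub-reference gene sequence; and input the gene sequence to be tested and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.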
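- The gene alignment apparatus according to claim 10, characterized in that the processor is further configured to: determine, according to the output of the optical computing chip, that a second gene fragment in the first group of gene fragments matches the gene sequence to be tested; and record the position of the second gene fragment in the reference gene sequence.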
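- The gene alignment apparatus according to claim 11, characterized in that the processor is further configured to: optically encode the gene sequence to be tested and each of the plurality of reference gene fragments in the first group of gene fragments; and input the optical encoding of the gene sequence to be tested and the optical encodings of the plurality of gene fragments in the first group of gene sequences into the optical computing chip, respectively, for optical comparison.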
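- The gene alignment apparatus according to any one of claims 11 to 17, characterized in that the processor is configured to: obtain the first group of gene fragments from the database according to the first m bases and the last n bases of the gene sequence to be tested, wherein m and n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be tested.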
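- A comparison apparatus, characterized by comprising: a processor, configured to obtain a first group of reference objects from a database according to a first object to be matched, wherein the first group of reference objects includes a plurality of reference objects that share some features with the first object; and an optical computing chip, connected to the processor and configured to optically compare the first object with the plurality of reference objects.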
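- The comparison apparatus according to claim 19, characterized in that the processor is further configured to: determine, according to the output of the optical computing chip, that the similarity between the first object and a first reference object in the first group of reference objects is less than a first threshold and greater than a second threshold; and obtain a plurality of sub-reference objects from a standard object, wherein each sub-reference object is a part of the reference object; and the optical computing chip is further configured to: optically compare the first object with a first sub-reference object among the plurality of sub-reference objects, to obtain a first similarity between the first object and the first sub-reference object.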
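- The comparison apparatus according to claim 19, characterized in that the processor is further configured to: determine that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold; and, in response to that determination, obtain a first sub-object and a second sub-object from the first object, wherein the first sub-object and the second sub-object share some of their data; and the optical computing chip is further configured to: optically compare the first sub-object with the first sub-reference object to obtain a second similarity; and optically compare the second sub-object with the first sub-reference object to obtain a third similarity.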
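- The comparison apparatus according to claim 21, characterized in that the processor is further configured to: when the second similarity is greater than the fourth threshold, record the position of the first sub-reference object in the standard object.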
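- A computer program product, comprising program code, wherein the instructions included in the program code are executed by a computer to perform the gene alignment method according to any one of claims 1 to 9.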
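- A computer-readable storage medium, comprising computer program instructions which, when run on a computer, cause the computer to perform the gene alignment method according to any one of claims 1 to 9.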
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP20849621.6A EP4006908A4 (en) | 2019-08-02 | 2020-08-03 | GENE ALIGNMENT TECHNIQUE |
| JP2022506634A JP7286872B2 (ja) | 2019-08-02 | 2020-08-03 | 遺伝子アライメント技術 |
| US17/587,507 US20220238185A1 (en) | 2019-08-02 | 2022-01-28 | Gene Alignment Technology |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910713689 | 2019-08-02 | ||
| CN201910713689.5 | 2019-08-02 | ||
| CN201911046513.5A CN112309501B (zh) | 2019-08-02 | 2019-10-30 | 基因比对技术 |
| CN201911046513.5 | 2019-10-30 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/587,507 Continuation US20220238185A1 (en) | 2019-08-02 | 2022-01-28 | Gene Alignment Technology |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021023142A1 true WO2021023142A1 (zh) | 2021-02-11 |
Family
ID=74486806
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2020/106498 Ceased WO2021023142A1 (zh) | 2019-08-02 | 2020-08-03 | 基因比对技术 |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20220238185A1 (zh) |
| EP (1) | EP4006908A4 (zh) |
| JP (1) | JP7286872B2 (zh) |
| CN (1) | CN112309501B (zh) |
| WO (1) | WO2021023142A1 (zh) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118053537A (zh) * | 2024-03-04 | 2024-05-17 | 中国医学科学院阜外医院 | 一种心原性猝死疾病遗传变异解读报告系统及应用 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113268461B (zh) * | 2021-07-19 | 2021-09-17 | 广州嘉检医学检测有限公司 | 一种基因测序数据重组封装的方法和装置 |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20140309142A1 (en) * | 2012-04-16 | 2014-10-16 | Jingdong Tian | Method of on-chip nucleic acid molecule synthesis |
| CN107653299A (zh) * | 2016-07-23 | 2018-02-02 | 成都十洲科技有限公司 | 一种基于高通量测序的基因芯片探针序列的获取方法 |
| CN108604260A (zh) * | 2016-01-11 | 2018-09-28 | 艾迪科基因组公司 | 用于现场或基于云的dna和rna处理和分析的基因组学基础架构 |
| CN109690359A (zh) * | 2016-04-22 | 2019-04-26 | 伊鲁米那股份有限公司 | 在像素内的多个位点的发光成像中使用的基于光子结构的设备和组成物及使用其的方法 |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4892408A (en) * | 1988-03-03 | 1990-01-09 | Grumman Aerospace Corporation | Reference input patterns for evaluation and alignment of an optical matched filter correlator |
| US20130091121A1 (en) * | 2011-08-09 | 2013-04-11 | Vitaly L. GALINSKY | Method for rapid assessment of similarity between sequences |
| US10191929B2 (en) * | 2013-05-29 | 2019-01-29 | Noblis, Inc. | Systems and methods for SNP analysis and genome sequencing |
| SG11201707668WA (en) * | 2015-03-17 | 2017-10-30 | Agency Science Tech & Res | Bioinformatics data processing systems |
2019
- 2019-10-30 CN CN201911046513.5A patent/CN112309501B/zh active Active
2020
- 2020-08-03 EP EP20849621.6A patent/EP4006908A4/en active Pending
- 2020-08-03 WO PCT/CN2020/106498 patent/WO2021023142A1/zh not_active Ceased
- 2020-08-03 JP JP2022506634A patent/JP7286872B2/ja active Active
2022
- 2022-01-28 US US17/587,507 patent/US20220238185A1/en active Pending
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP4006908A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112309501A (zh) | 2021-02-02 |
| JP7286872B2 (ja) | 2023-06-05 |
| EP4006908A1 (en) | 2022-06-01 |
| JP2022543094A (ja) | 2022-10-07 |
| EP4006908A4 (en) | 2022-08-31 |
| US20220238185A1 (en) | 2022-07-28 |
| CN112309501B (zh) | 2025-01-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20849621; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2022506634; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020849621; Country of ref document: EP; Effective date: 20220223 |
