WO2025190902A1 - Improving base calling quality scores - Google Patents
Improving base calling quality scores
- Publication number
- WO2025190902A1 (PCT/EP2025/056537)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cluster
- nucleobase
- intensity
- sequencing
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- NGS next-generation sequencing
- SBS sequencing-by-synthesis
- deoxyribonucleic acid analogs conjugated to fluorescent labels are hybridized to the template nucleic acids, and excitation light sources are used to excite the fluorescent labels on the deoxyribonucleic acid analogs.
- Detectors capture fluorescent emissions from the fluorescent labels and identify the deoxyribonucleic acid analogs.
- the sequence of the template nucleic acids may be determined by repeatedly performing such sequencing cycles.
- the identification of the deoxyribonucleic acid analogs is also known as base calling.
- the accuracy of a base call can be determined by calculating a quality score.
- the Phred quality score is a commonly used metric which indicates the probability that a given base is called incorrectly by the sequencer.
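The relationship between a Phred quality score and the error probability it encodes can be sketched as below. The formula Q = -10 * log10(P) is the standard Phred definition; the function names are illustrative and do not come from the application.

```python
import math


def phred_from_error_probability(p_error: float) -> float:
    """Return the Phred quality score Q = -10 * log10(p_error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability_from_phred(q: float) -> float:
    """Invert the Phred relation: p_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

For example, an error probability of 1 in 1000 corresponds to Q30, and Q20 corresponds to an error probability of 1 in 100.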
- a Phred quality score is determined by calculating several parameters relevant to the sequencing cycle and comparing these parameters to an associated quality score using a lookup table.
- the lookup table is conventionally calibrated based on a chastity metric which does not fully represent the signal generated by the cluster.
- lookup tables are generally constructed using simple rules or models, and only a limited set of parameters are typically used as input to the lookup table. Whilst this limits the computational burden in using the lookup table, it also results in a lack of accuracy. 69334895-1 M&C PM363975US 2 US 8932126 B2 describes a Phred based quality scoring method for next generation sequencing applications. Zhang, S., Wang, B., Wan, L. et al.
- Phred-based quality scoring was initially implemented for Sanger sequencing, where a small number of Sanger based sequencing metrics were linked to a known sequencing accuracy in order to generate a lookup table. For next-generation sequencing, a similar scheme is used to generate a Phred-based lookup table, although parameters which are relevant to the particular sequencing chemistry are used to generate the lookup table. Chastity is often used as a metric to calibrate a Phred-based lookup table but this metric may not fully reflect base calling quality. It can be important for a quality scoring model to be accurate since quality scores can be used to determine the quality of a read and assess if the read should be filtered out or used in downstream analysis. Inaccurate quality scoring may therefore negatively impact the quality of downstream analysis performed on the sequencing data.
- Phred-based lookup tables are used for quality scoring because they allow computationally efficient determination of a quality score and therefore they are suitable for implementation in real time during sequencing.
- there are limitations to a Phred-based lookup table.
- the number of input parameters is restricted and the resolution of the Phred-based lookup table is also limited.
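The limited resolution of a binned lookup table can be illustrated with a minimal sketch. The choice of predictors (chastity and a signal-to-noise ratio), the bin edges, and the table values are all hypothetical and chosen only for illustration; they are not taken from the application or from any sequencer.

```python
from bisect import bisect_right

# Hypothetical bin edges for two predictors (illustrative values only).
CHASTITY_EDGES = [0.6, 0.8, 0.9]   # defines 4 chastity bins
SNR_DB_EDGES = [7.0, 10.0, 13.0]   # defines 4 signal-to-noise bins

# Hypothetical calibrated table: Q_TABLE[chastity_bin][snr_bin].
Q_TABLE = [
    [2, 8, 12, 15],
    [8, 15, 22, 27],
    [12, 22, 30, 35],
    [15, 27, 35, 40],
]


def lookup_quality(chastity: float, snr_db: float) -> int:
    """Quantise both predictors into bins and read the score from the table."""
    row = bisect_right(CHASTITY_EDGES, chastity)
    col = bisect_right(SNR_DB_EDGES, snr_db)
    return Q_TABLE[row][col]
```

Any two base calls that fall into the same bins receive the same score, however different their underlying values are, which is the resolution limit described above. Adding predictors or bins multiplies the table size.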
- Long Short-Term Memory Networks may be used. Due to the computational burden of evaluating these deep learning base callers, these models are not currently implemented in real time. By comparison, the non-linear machine learning model described herein relies on selected, relevant, input features, and thus the evaluation of the machine-learning based quality score estimator allows real time or near to real time quality scoring without use of expensive compute hardware. Compared to a Phred based lookup table, a non-linear machine learning model can handle a larger number of predictors which is one of the reasons why a non-linear machine learning model can predict base calling quality with a greater accuracy than a Phred-based model.
- Phred-based lookup tables may be limited in size since, as they get bigger, more memory is required to use them, which means that they cannot be cached effectively.
- a lookup table based on a simple model can be difficult to train to give accurate predictions.
- a non-linear machine learning model, or a lookup table based on the non-linear machine learning model, models the interactions between input features in finer detail and captures the non-linear relationships between them to predict quality scores at a finer resolution and accuracy without significant additional compute.
- the relevant input features can be determined during model building.
- the inputs to the non-linear machine learning model can comprise parameters which are estimated over the whole read (or over a currently read portion of the read when operating in real-time), rather than small windows around the current base call, which can be better predictors of a quality score since these estimates may be a better representation of the cluster.
- the present application describes input features that have never before been used as predictors for a quality score, such as cluster amplification coefficient and cluster scale coefficient, and upstream events relating to homopolymers and repeat sequences.
- the statistical model based on a mixture model may utilise a Gaussian mixture model to determine the probability of a base calling error.
- the Gaussian mixture model is computed during the base calling methodology during sequencing.
- the Gaussian mixture model provides a good approximation for base calling quality since it can assign high quality scores to intensity values that, when plotted on a two-channel intensity plot, are relatively close to one distribution associated with one nucleobase type and far from the remaining distributions associated with other nucleobase types (i.e. the intensity values positioned towards the corners of a two-channel intensity plot).
- a quality score based on chastity assigns a low quality score to such intensity values.
- the chastity metric is essentially a simplification of the Gaussian mixture model and whilst this simplification was beneficial during the early days of base calling, the chastity metric contains less information than a Gaussian mixture model.
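The following sketch shows why a chastity-style metric can penalise intensity values near the corners of a two-channel plot. It assumes one common distance-based formulation, chastity = 1 - d1 / (d1 + d2), where d1 and d2 are the distances to the nearest and second-nearest distribution centres; this section does not define chastity, so the formula, the centroid positions, and the base-to-corner assignment are assumptions.

```python
import math

# Hypothetical centres of the four intensity distributions (illustrative only).
CENTROIDS = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}


def chastity(point):
    """Assumed formulation: 1 - d_nearest / (d_nearest + d_second_nearest)."""
    distances = sorted(math.dist(point, c) for c in CENTROIDS.values())
    nearest, second = distances[0], distances[1]
    return 1.0 - nearest / (nearest + second)
```

Under this formulation a point at (1.6, 1.6), which lies beyond the "A" centre and far from every other centre, scores lower than a point at the centre itself, even though its base identity is unambiguous.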
- the statistical model can further include a probability of a sample preparation error. The statistical model therefore considers errors during both the sample preparation stage and during sequencing.
- a method for determining the quality of a base call wherein the base call is based upon a signal generated in a current sequencing cycle of a sequencing run based upon the incorporation of a nucleobase into a plurality of polynucleotide sequence portions forming a cluster.
- the method comprises: accessing a plurality of features associated with the cluster; and determining a quality score of the base call based upon the plurality of features and a non-linear machine learning algorithm.
- intensity data is detected in at least one channel, and wherein at least one of the plurality of features is a property of the intensity data for the current and/or at least one of a preceding and succeeding cycle.
- at least one of the plurality of features is a corrected intensity in the at least one channel.
- the corrected signal is corrected for cluster and sequencer dependent effects.
- At least one of the plurality of features associated with the cluster is an amplification coefficient and/or an offset coefficient, wherein the amplification coefficient corresponds to a scale of an intensity profile based upon intensity data from a plurality of sequencing cycles, and wherein the offset coefficient corresponds to a shift of an intensity profile based upon intensity data from a plurality of sequencing cycles.
- one of the plurality of features associated with the cluster is a signal-to-noise ratio, wherein the signal-to-noise ratio is determined based upon a plurality of detected signals.
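One way to picture a per-cluster scale and shift is an ordinary least-squares fit of the observed intensities against a reference profile, observed ≈ a * reference + b. This is a minimal sketch under that assumption; the application does not state how the coefficients are estimated, and the function name is hypothetical.

```python
def fit_scale_and_offset(reference, observed):
    """Least-squares fit of observed ~= a * reference + b over several cycles.

    Returns (a, b): a plays the role of an amplification (scale) coefficient
    and b the role of an offset (shift) coefficient. Illustrative only.
    """
    n = len(reference)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 cycles")
    mean_x = sum(reference) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    if sxx == 0.0:
        raise ValueError("reference profile must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, observed))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b
```

Because the fit uses intensity data from many cycles, the estimate describes the cluster as a whole rather than a single base call.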
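A signal-to-noise ratio expressed in decibels can be sketched as a ratio of mean powers. How the signal and noise terms are separated from the detected signals is not specified in this section, so the two input lists and the function name below are assumptions.

```python
import math


def snr_db(signal_values, noise_values):
    """Return 10 * log10(mean signal power / mean noise power).

    signal_values: hypothetical noise-free intensity estimates.
    noise_values: hypothetical residuals (observed minus expected intensity).
    """
    if not signal_values or not noise_values:
        raise ValueError("both inputs must be non-empty")
    signal_power = sum(s * s for s in signal_values) / len(signal_values)
    noise_power = sum(n * n for n in noise_values) / len(noise_values)
    if signal_power <= 0.0 or noise_power <= 0.0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)
```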
- at least one of the plurality of features is based on the chastity of the base call for the current and/or at least one of a preceding or succeeding cycle.
- At least one of the plurality of features is a phasing weight and/or a pre-phasing weight. In embodiments, at least one of the plurality of features is based on a conditional probability that the intensity data belongs to one of the four nucleobase types A, C, G or T, wherein the conditional probability is a likelihood, log-likelihood, or a posterior probability. In embodiments, the conditional probability is calculated based upon a Gaussian mixture model, wherein each of the components of the Gaussian mixture model is an intensity distribution corresponding to a respective nucleobase type.
- one of the plurality of features associated with the cluster is the conditional probability corresponding to the nucleobase type associated with the highest conditional probability out of the four nucleobase types.
- one of the plurality of features associated with the cluster is the base call.
- one of the plurality of features associated with the cluster is a property of a base call of the nucleobase for at least one of a preceding sequencing cycle and/or at least one of a succeeding sequencing cycle.
- at least one of the plurality of features associated with the cluster is the base call for the two preceding sequencing cycles and the succeeding sequencing cycle.
- At least one of the plurality of features associated with the cluster corresponds to the location of the cluster on a flow cell and/or the movement of the cluster relative to a scanner of the sequencer. In embodiments, at least one of the plurality of features is the number of one or more of Gs, GGs, CCs, GGGs, or GCs upstream of the nucleobase. In embodiments, one of the plurality of features is the estimated fraction of GCs in the read up to the current sequencing cycle or the estimated fraction of GCs in the total read. In embodiments, at least one of the plurality of features associated with the cluster is a property of the polynucleotide sequence portions forming the cluster.
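The upstream sequence-context features can be sketched as simple counts over the bases called before the current cycle. The sketch assumes overlapping motif matches and interprets the "fraction of GCs" as the fraction of bases that are G or C; both are assumptions, as are the function and key names.

```python
def count_motif(sequence: str, motif: str) -> int:
    """Count overlapping occurrences of motif in sequence."""
    width = len(motif)
    return sum(
        1 for i in range(len(sequence) - width + 1) if sequence[i:i + width] == motif
    )


def upstream_features(called_bases: str, cycle: int) -> dict:
    """Features from bases called before `cycle` (0-based index of the current call)."""
    upstream = called_bases[:cycle]
    features = {m: count_motif(upstream, m) for m in ("G", "GG", "CC", "GGG", "GC")}
    gc_bases = sum(1 for base in upstream if base in "GC")
    # Assumed interpretation: share of upstream bases that are G or C.
    features["gc_fraction"] = gc_bases / len(upstream) if upstream else 0.0
    return features
```

Calling `upstream_features(read, cycle)` at each cycle gives a running estimate that uses only the portion of the read available so far, which suits real-time operation.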
- At least one of the plurality of features associated with the cluster is based on the arrangement of a homopolymer in the plurality of polynucleotide sequence portions. In embodiments, at least one of the plurality of features associated with the cluster is based on cycle index. In embodiments, at least one of the plurality of features associated with the cluster is an estimate of the polyclonality of the cluster. In embodiments, paired-end sequencing is performed and at least one of the plurality of features associated with the cluster is a read-index.
- At least one of the plurality of features associated with the cluster is the determinant of a covariance matrix for a Gaussian corresponding to a respective nucleobase, wherein the Gaussian is determined based upon a Gaussian mixture model, wherein each of the components of the Gaussian mixture model is an intensity distribution corresponding to a respective nucleobase type.
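A homopolymer-based feature could, for instance, be the length of the homopolymer run that ends immediately upstream of the current base call. This is one possible reading of "the arrangement of a homopolymer"; the function below is a hypothetical sketch, not the feature defined by the application.

```python
def upstream_homopolymer_length(called_bases: str, cycle: int) -> int:
    """Length of the run of identical bases ending just before position `cycle`.

    `cycle` is the 0-based index of the current base call; returns 0 when
    there is no upstream base.
    """
    if cycle < 0 or cycle > len(called_bases):
        raise ValueError("cycle is outside the called portion of the read")
    if cycle == 0:
        return 0
    i = cycle - 1
    base = called_bases[i]
    length = 0
    while i >= 0 and called_bases[i] == base:
        length += 1
        i -= 1
    return length
```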
- at least one of the plurality of features associated with the cluster is an estimate over the entire read.
- At least one of the plurality of features associated with the cluster corresponds to the current sequencing cycle and, additionally or alternatively, to at least one of a preceding and/or succeeding cycle. In embodiments, at least one of the plurality of features associated with the cluster is based on a sample preparation method. In embodiments, the quality score is determined by inputting the plurality of features associated with the cluster into the machine learning model. In embodiments, the quality score is determined by inputting the plurality of features associated with the cluster into a look up table based upon the machine learning model.
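A lookup table based upon a machine learning model can be sketched by evaluating the model once on a grid of feature values and storing the resulting scores. The model below is a toy stand-in with an interaction term, not a trained model; its coefficients, the two features, and the grids are all hypothetical.

```python
import math


def toy_nonlinear_model(snr_db: float, max_posterior: float) -> float:
    """Hypothetical stand-in for a trained model; returns P(base call is wrong)."""
    z = -4.0 + 0.35 * snr_db + 6.0 * max_posterior + 0.2 * snr_db * max_posterior
    return 1.0 / (1.0 + math.exp(z))


def build_lookup_table(snr_grid, posterior_grid):
    """Evaluate the model on every grid point and store rounded Phred scores."""
    return [
        [round(-10.0 * math.log10(toy_nonlinear_model(s, p))) for p in posterior_grid]
        for s in snr_grid
    ]


def nearest_index(grid, value):
    """Index of the grid point closest to value."""
    return min(range(len(grid)), key=lambda i: abs(grid[i] - value))


def lookup(table, snr_grid, posterior_grid, snr_db, max_posterior):
    """Read the precomputed score for the nearest grid point."""
    row = nearest_index(snr_grid, snr_db)
    col = nearest_index(posterior_grid, max_posterior)
    return table[row][col]
```

The table is built offline, so scoring a base call during sequencing costs only two index searches and one memory read, while the stored values still reflect the non-linear interactions captured by the model.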
- the quality of the base call is determined based upon a posterior probability that the first nucleobase is a selected nucleobase type given the intensity data.
- the selected nucleobase type is selected by calculating a plurality of posterior probabilities, wherein each of the plurality of posterior probabilities is a probability that the first nucleobase is a respective nucleobase type given the intensity data, and selecting the posterior probability of the plurality of posterior probabilities with the highest value.
- the base call is the selected nucleobase type.
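The posterior-based base call and quality described above can be sketched with a four-component Gaussian mixture over two-channel intensities. The mixture weights, means, covariances, base-to-corner assignment, and the score cap are hypothetical values chosen for illustration.

```python
import math

# Hypothetical components: base -> (weight, mean, 2x2 covariance).
COMPONENTS = {
    "A": (0.25, (1.0, 1.0), ((0.02, 0.0), (0.0, 0.02))),
    "C": (0.25, (1.0, 0.0), ((0.02, 0.0), (0.0, 0.02))),
    "T": (0.25, (0.0, 1.0), ((0.02, 0.0), (0.0, 0.02))),
    "G": (0.25, (0.0, 0.0), ((0.02, 0.0), (0.0, 0.02))),
}


def gaussian_pdf_2d(x, mean, cov):
    """Density of a bivariate Gaussian with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # determinant of the covariance matrix
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def call_base(intensity, max_q=60.0):
    """Return (base, posterior, quality) for one two-channel intensity value."""
    joint = {
        base: weight * gaussian_pdf_2d(intensity, mean, cov)
        for base, (weight, mean, cov) in COMPONENTS.items()
    }
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("intensity is too far from every component")
    posteriors = {base: value / total for base, value in joint.items()}
    base = max(posteriors, key=posteriors.get)
    # Error probability: posterior mass on every base other than the call.
    p_error = sum(p for b, p in posteriors.items() if b != base)
    quality = max_q if p_error <= 0.0 else min(max_q, -10.0 * math.log10(p_error))
    return base, posteriors[base], quality
```

With these illustrative parameters, `call_base((1.6, 1.6))` returns "A" at the capped score because the point is far from every other component, while a point midway between two components, such as (0.5, 1.0), has a posterior near 0.5 and a score near Q3.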
- the method further comprises: accessing a probability of a sample preparation error; wherein the sample preparation error corresponds to a chemistry error during sample preparation and before sequencing, and wherein the quality of the base call is further based upon a probability of a sample preparation error.
- the probability of the sample preparation error is dependent on the base call.
- the probability of a sample preparation error is learnt based upon a sequencing run performed on one or more known polynucleotide sequence portions and the corresponding intensity data obtained from the sequencing run.
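One simple way to combine a sequencing error probability with a base-dependent sample preparation error probability is to treat the two error sources as independent, so that the base call is correct only if neither error occurred. Independence is an assumption made for this sketch, and the per-base probabilities below are placeholders rather than learnt values.

```python
import math

# Hypothetical per-base sample preparation error probabilities (placeholders).
SAMPLE_PREP_ERROR = {"A": 1e-5, "C": 5e-5, "G": 2e-5, "T": 1e-5}


def combined_quality(base_call: str, p_sequencing_error: float) -> float:
    """Phred score from both error sources, assuming they are independent."""
    p_prep = SAMPLE_PREP_ERROR[base_call]
    p_error = 1.0 - (1.0 - p_sequencing_error) * (1.0 - p_prep)
    return -10.0 * math.log10(p_error)
```

Under this assumption the sample preparation term acts as a ceiling: even a perfect sequencing signal for a "C" call cannot score above roughly Q43 with the placeholder value of 5e-5.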
- said intensity data is a combined intensity of said first signal obtained based upon the incorporation of said respective first nucleobase into a plurality of first polynucleotide sequence portions forming a cluster and a second signal obtained based upon the incorporation of a respective second nucleobase into a plurality of second polynucleotide sequence portions forming the cluster; wherein each of the plurality of components corresponds to an intensity distribution associated with a respective combination of first and second nucleobases.
- the base call is a combined base call for the respective first nucleobase and the respective second nucleobase, and wherein a quality of the combined base call is determined based upon the combined probability distribution and the intensity data.
- the combined base call is the selected first nucleobase type and the selected second nucleobase type.
- the probability of a sample preparation error is learnt during the sequencing run performed on first polynucleotide sequence portions and second polynucleotide sequence portions relating to the same genetic sequence based on one or more detected mismatches between the respective first nucleobase and the respective second nucleobase indicated by the combined intensity data.
- the combined probability distribution is provided by a Gaussian mixture model, wherein each of the plurality of components are Gaussian distributions.
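A combined mixture model with one component per pair of first and second nucleobases can be sketched as follows. The sketch assumes that the combined intensity is the sum of the first signal and a weaker second signal, and that all sixteen components share equal weights and one isotropic variance; the scale factor, variance, and centroid positions are hypothetical.

```python
import math
from itertools import product

# Hypothetical single-signal centroids in two-channel intensity space.
SINGLE = {"A": (1.0, 1.0), "C": (1.0, 0.0), "T": (0.0, 1.0), "G": (0.0, 0.0)}
SECOND_SCALE = 0.5  # assumed relative brightness of the second signal
SIGMA = 0.05        # assumed shared standard deviation of every component


def combined_components():
    """Mean of each (first base, second base) component: m1 + SECOND_SCALE * m2."""
    components = {}
    for b1, b2 in product(SINGLE, repeat=2):
        m1, m2 = SINGLE[b1], SINGLE[b2]
        components[(b1, b2)] = (
            m1[0] + SECOND_SCALE * m2[0],
            m1[1] + SECOND_SCALE * m2[1],
        )
    return components


def call_combined(intensity):
    """Return the most probable (first, second) base pair and its posterior."""
    log_lik = {
        pair: -math.dist(intensity, mean) ** 2 / (2.0 * SIGMA ** 2)
        for pair, mean in combined_components().items()
    }
    top = max(log_lik.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    unnormalised = {pair: math.exp(v - top) for pair, v in log_lik.items()}
    total = sum(unnormalised.values())
    best = max(unnormalised, key=unnormalised.get)
    return best, unnormalised[best] / total
```

With a scale factor of 0.5 the sixteen means are distinct, so a single combined intensity value identifies both bases; the posterior of the selected pair can then be converted to a quality score for the combined base call.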
- Figure 1 illustrates a cross-section of an example biosensor that can be used to obtain intensity data
- Figure 2 illustrates an example flow cell with eight lanes, a close-up of a section of a lane, and a further close-up of a tile comprising clusters
- Figure 3 illustrates a plurality of first polynucleotide sequence portions of interest forming a cluster
- Figure 4 is a scatter-plot of intensity data obtained from a plurality of clusters using two-channel chemistry
- Figure 5 illustrates the chastity metric based on two-channel intensity data
- Figure 6 shows a quality scoring system that implements a non-linear machine learning model in accordance with embodiments
- Figure 7 shows a plurality of features associated with a cluster that can be used to determine the quality of a base call
- Figure 8 illustrates phasing and prephasing effects on a cluster
- Figure 9 illustrates the intensity output of base calls “C” every 15 cycles in
- Figure 10 illustrates the relative shift and scale of intensity profiles from respective clusters
- Figure 11 shows a method for determining the quality of a base call in accordance with embodiments
- Figure 12 is a table showing the percentage of base calls assigned a quality score of greater than Q30, Q35 and Q40 for each of a first and second read of a paired-end read, based on either a Phred-based lookup table or the ML model according to embodiments disclosed herein;
- Figures 13A-C are scatter plots of predicted Q-scores from cycle 25 of a sequencing run based upon respective clusters with a signal-to-noise ratio of 7-10 dB (13A), a signal-to-noise ratio of 10-13 dB (13B), and a signal-to-noise ratio of 13-16 dB (13C);
- the x and y axes correspond to fully corrected intensities in the first and second channels respectively.
- Plot intensity corresponds to the predicted quality score;
- Figure 16 illustrates errors which may be introduced during the sample preparation step before sequencing;
- Figure 17 is a plot of predicted base call quality based on a signal-to-noise error and a sample preparation error according to a second model in accordance with embodiments;
- Figure 18 illustrates a plurality of first and second polynucleotide sequence portions of interest forming a cluster;
- Figure 19 is a schematic of two-channel intensity data which can be obtained based upon sequencing of a cluster in accordance with Figure 18;
- Figure 20 is a plot of predicted base call quality according to a second model based on combined intensity data;
- Figure 21 illustrates a computer system that can be used to implement the technology disclosed.
- Next-generation sequencing experiments comprise the steps of sample preparation, base calling, read alignment and variant calling.
- the accuracy of base calling can be determined using a quality score metric.
- a base calling quality score indicates the probability that a base call indicating a particular base is incorrect, i.e. the nucleobase at the location of the particular base on a template is different to the nucleobase indicated by the base calling process. This can be useful for a number of purposes, for example for assessing the quality of sequencing data, in assessing the confidence in which any variants in the sequence can be called, and in determining whether additional sequencing is required.
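As a sketch of how quality scores feed downstream decisions, the fraction of base calls at or above a threshold such as Q30 can be used to summarise a read and decide whether to keep it. The threshold, the minimum fraction, and the function names are illustrative assumptions, not values from the application.

```python
def fraction_at_or_above(quality_scores, threshold):
    """Fraction of base calls whose quality score is at least `threshold`."""
    if not quality_scores:
        return 0.0
    return sum(1 for q in quality_scores if q >= threshold) / len(quality_scores)


def passes_filter(quality_scores, threshold=30, min_fraction=0.8):
    """Keep a read only if enough of its base calls reach the threshold."""
    return fraction_at_or_above(quality_scores, threshold) >= min_fraction
```

If the scores are miscalibrated, a filter of this kind discards good reads or retains poor ones, which is why scoring accuracy matters for downstream analysis.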
- the first section of this application provides an overview of sequencing technology, base calling, and conventional quality scoring.
- cluster of oligonucleotides refers to a localized group or collection of DNA or RNA molecules on a nucleotide-sample slide, such as a flow cell, or other solid surface.
- a cluster includes tens, hundreds, thousands, or more copies of a cloned or the same DNA or RNA segment.
- a cluster includes a grouping of oligonucleotides immobilized in a section of a flow cell or other nucleotide-sample slide.
- clusters are evenly spaced or organized in a systematic structure within a patterned flow cell.
- clusters are randomly organized within a non-patterned flow cell.
- a cluster of oligonucleotides can be imaged utilizing one or more light signals.
- an oligonucleotide-cluster image may be captured by a camera during a sequencing cycle, recording light emitted by irradiated fluorescent tags incorporated into oligonucleotides from one or more clusters on a flow cell.
- a sequencer uses sequencing by synthesis (SBS) technology for generating sequencing images.
- SBS relies on growing nascent strands complementary to cluster strands with fluorescently-labeled nucleotides, while tracking the emitted signal of each newly added nucleotide.
- the fluorescently-labeled nucleotides have a 3’ removable block that anchors a fluorophore signal of the nucleotide type.
- SBS occurs in repetitive sequencing cycles, each comprising three steps: (a) extension of a nascent strand by adding the fluorescently-labeled nucleotide; (b) excitation of the fluorophore using one or more lasers of an optical system of the sequencer and imaging through different filters of the optical system, yielding sequencing images; and (c) cleavage of the fluorophore and removal of the 3’ block in preparation for the next sequencing cycle. Incorporation and imaging are repeated up to a designated number of sequencing cycles, defining the read length, which refers to the number of base pairs (bp) sequenced from a DNA fragment. Using this approach, each sequencing cycle interrogates a new position along the cluster strands.
- Intensity values can be extracted from different color/intensity channel sequencing images generated by a sequencer at each sequencing cycle during a sequencing run.
- examples of the sequencer include Illumina’s iSeq, HiSeqX, HiSeq 3000, HiSeq 4000, HiSeq 2500, NovaSeq 6000, NextSeq 550, NextSeq 1000, NextSeq 2000, NextSeqDx, MiSeq, and MiSeqDx.
- the tremendous power of Illumina’s sequencers stems from their ability to simultaneously execute and sense millions or even billions of analytes (e.g., clusters).
- a cluster comprises approximately one thousand identical copies of a template strand, though clusters vary in size and shape.
- Clusters are grown from the template strand, prior to the sequencing run, by bridge amplification or exclusion amplification of the input library which is a collection of similarly sized DNA fragments.
- the purpose of the amplification and cluster growth is to increase the intensity of the emitted signal since the imaging device cannot reliably sense the fluorophore signal of a single strand.
- the imaging device perceives a cluster of thousands of template strands as a single spot. For instance, the imaging device can detect such a cluster of thousands of template strands as a spot represented by a single pixel or multiple pixels.
- the sequencing process occurs in a flow cell – a small glass slide that holds the input DNA fragments during the sequencing process.
- FIG. 1 illustrates a cross-section of a biosensor 100 that can be used in various embodiments.
- Biosensor 100 has pixel areas 106’, 108’, 110’, 112’, and 114’ that each can hold more than one cluster during a base calling cycle (e.g., 2 clusters per pixel area).
- the biosensor 100 includes a flow cell 102 that is mounted onto a sampling device 104.
- the flow cell 102 is affixed directly to the sampling device 104.
- the flow cell 102 may be removably coupled to the sampling device 104.
- the sampling device 104 has a sample surface 134 that may be functionalized (e.g., chemically or physically modified in a suitable manner for conducting the desired reactions).
- the sample surface 134 may be functionalized and may include a plurality of pixel areas 106’, 108’, 110’, 112’, and 114’ that can each hold more than one cluster during a base calling cycle (e.g., each having a corresponding cluster pair 106A, 106B; 108A, 108B; 110A, 110B; 112A, 112B; and 114A, 114B immobilized thereto).
- Each pixel area is associated with a corresponding sensor (or pixel or photodiode) 106, 108, 110, 112, and 114, such that light received by the pixel area is captured by the corresponding sensor.
- a pixel area 106’ can be also associated with a corresponding reaction site 106’’ on the sample surface 134 that holds a cluster pair, such that light emitted from the reaction site 106’’ is received by the pixel area 106’ and captured by the corresponding sensor 106.
- the pixel signal in that base calling cycle carries information based on all of the two or more clusters.
- signal processing as described herein is used to distinguish each cluster, where there are more clusters than pixel signals in a given sampling event of a particular base calling cycle.
- the flow cell 102 includes sidewalls 138, 125, and a flow cover 136 that is supported by the sidewalls 138, 125.
- the sidewalls 138, 125 are coupled to the sample surface 134 and extend between the flow cover 136 and the sample surface 134.
- the sidewalls 138, 125 are formed from a curable adhesive layer that bonds the flow cover 136 to the sampling device 104.
- the sidewalls 138, 125 are sized and shaped so that a flow channel 144 exists between the flow cover 136 and the sampling device 104.
- the flow cover 136 may include a material that is transparent to excitation light 101 propagating from an exterior of the biosensor 100 into the flow channel 144.
- the excitation light 101 approaches the flow cover 136 at a non-orthogonal (or orthogonal) angle.
- the flow cover 136 may include inlet and outlet ports 142, 146 that are configured to fluidically engage other ports (not shown).
- the other ports may be from the cartridge or the workstation.
- the flow channel 144 is sized and shaped to direct a fluid along the sample surface 134.
- a height H1 and other dimensions of the flow channel 144 may be configured to maintain a substantially even flow of a fluid along the sample surface 134.
- the dimensions of the flow channel 144 may also be configured to control bubble formation.
- the flow cover 136 may comprise a transparent material, such as glass or plastic.
- the flow cover 136 may constitute a substantially rectangular block having a planar exterior surface and a planar inner surface that defines the flow channel 144.
- the block may be mounted onto the sidewalls 138, 125.
- the flow cell 102 may be etched to define the flow cover 136 and the sidewalls 138, 125.
- a recess may be etched into the transparent material. When the etched material is mounted to the sampling device 104, the recess may become the flow channel 144.
- the sampling device 104 may be similar to, for example, an integrated circuit comprising a plurality of stacked substrate layers 120-126.
- the substrate layers 120-126 may include a base substrate 120, a solid-state imager 122 (e.g., CMOS image sensor), a filter or light-management layer 124, and a passivation layer 126. It should be noted that the above is only illustrative and that other embodiments may include fewer or additional layers. Moreover, each of the substrate layers 120-126 may include a plurality of sub-layers.
- the sampling device 104 may be manufactured using processes that are similar to those used in manufacturing integrated circuits, such as CMOS image sensors and CCDs. For example, the substrate layers 120-126 or portions thereof may be grown, deposited, etched, and the like to form the sampling device 104.
- the passivation layer 126 is configured to shield the filter layer 124 from the fluidic environment of the flow channel 144.
- the passivation layer 126 is also configured to provide a solid surface (i.e., the sample surface 134) that permits biomolecules or other analytes-of-interest to be immobilized thereon.
- each of the reaction sites may include a cluster of biomolecules that are immobilized to the sample surface 134.
- the passivation layer 126 may be formed from a material that permits the reaction sites to be immobilized thereto.
- the passivation layer 126 may also comprise a material that is at least transparent to a desired fluorescent light.
- the passivation layer 126 may include silicon nitride (Si3N4) and/or silica (SiO2). However, other suitable material(s) may be used. In the illustrated embodiment, the passivation layer 126 may be substantially planar. However, in alternative embodiments, the passivation layer 126 may include recesses, such as pits, wells, grooves, and the like. In the illustrated embodiment, the passivation layer 126 has a thickness that is about 150-200 nm and, more particularly, about 170 nm.
- the filter layer 124 may include various features that affect the transmission of light. In some embodiments, the filter layer 124 can perform multiple functions.
- the filter layer 124 may be configured to (a) filter unwanted light signals, such as light signals from an excitation light source; (b) direct emission signals from the reaction sites toward corresponding sensors 106, 108, 110, 112, and 114 that are configured to detect the emission signals from the reaction sites; or (c) block or prevent detection of unwanted emission signals from adjacent reaction sites.
- the filter layer 124 may also be referred to as a light-management layer.
- the filter layer 124 has a thickness that is about 1-5 µm and, more particularly, about 2-4 µm.
- the filter layer 124 may include an array of microlenses or other optical components. Each of the microlenses may be configured to direct emission signals from an associated reaction site to a sensor.
- the solid-state imager 122 and the base substrate 120 may be provided together as a previously constructed solid-state imaging device (e.g., CMOS chip).
- the base substrate 120 may be a wafer of silicon and the solid-state imager 122 may be mounted thereon.
- the solid-state imager 122 includes a layer of semiconductor material (e.g., silicon) and the sensors 106, 108, 110, 112, and 114.
- the sensors are photodiodes configured to detect light.
- the sensors comprise light detectors.
- the solid-state imager 122 may be manufactured as a single chip through CMOS-based fabrication processes.
- the solid-state imager 122 may include a dense array of sensors 106, 108, 110, 112, and 114 that are configured to detect activity indicative of a desired reaction from within or along the flow channel 144.
- each sensor has a pixel area (or detection area) that is about 1-2 square micrometers (µm²).
- the array can include 500,000 sensors, 5 million sensors, 10 million sensors, or even 200 million sensors.
- the sensors 106, 108, 110, 112, and 114 can be configured to detect a predetermined wavelength of light that is indicative of the desired reactions.
- the sampling device 104 includes a microcircuit arrangement, such as the microcircuit arrangement described in U.S. Patent No. 7,595,882, which is incorporated herein by reference in its entirety.
- the sampling device 104 may comprise an integrated circuit having a planar array of the sensors 106, 108, 110, 112, and 114. Circuitry formed within the sampling device 104 may be configured for at least one of signal amplification, digitization, storage, and processing. The circuitry may collect and analyze the detected fluorescent light and generate pixel signals (or detection signals) for communicating detection data to a signal processor. The circuitry may also perform additional analog and/or digital signal processing in the sampling device 104. Sampling device 104 may include conductive vias 130 that perform signal routing (e.g., transmit the pixel signals to the signal processor). The pixel signals may also be transmitted through electrical contacts 132 of the sampling device 104. The sampling device 104 is discussed in further detail with respect to U.S.
- the sampling device 104 is not limited to the constructions or uses described above. In alternative embodiments, the sampling device 104 may take other forms.
- the sampling device 104 may comprise a CCD device, such as a CCD camera, that is coupled to a flow cell or is moved to interface with a flow cell having reaction sites therein.
- Figure 2 depicts an example flow cell 200 where clusters 216 are immobilized and base called during a sequencing process.
- the flow cell 200 is partitioned in a plurality of chambers called lanes, such as lanes 202a, 202b, ..., 202p, i.e., p represents a number of lanes.
- the lanes are physically separated from each other and may contain different tagged sequencing input libraries, distinguishable without sample cross-contamination.
- Each individual lane 202 can further be partitioned into non-overlapping regions called “tiles” 212.
- Figure 2 illustrates a magnified view of section 208 of an example lane. Section 208 is illustrated to comprise a plurality of tiles 212. Hundreds of thousands to millions of clusters 216 can be immobilized on the surface of each tile.
- Sequencing Figure 3 illustrates a cluster 300 comprising a plurality of template strands, also referred to as a plurality of first polynucleotide sequence portions of interest 301.
- the sequencer uses sequencing by synthesis (SBS) for generating the sequencing images. SBS relies on growing nascent strands complementary to cluster strands with fluorescently-labeled nucleotides, illustrated by arrow 302, while tracking the emitted signal of each newly added nucleotide.
- the fluorescently-labeled nucleotides have a 3′ removable block that anchors a fluorophore signal of the nucleotide type.
- SBS occurs in repetitive sequencing cycles, each comprising three steps: (a) extension of a nascent strand by adding the fluorescently-labeled nucleotide; (b) excitation of the fluorophore using one or more lasers of an optical system of the sequencer and imaging through different filters of the optical system, yielding the sequencing images; and (c) cleavage of the fluorophore and removal of the 3′ block in preparation for the next sequencing cycle. Incorporation and imaging are repeated up to a designated number of sequencing cycles, defining the read length.
- nucleic acids can be sequenced by providing four different labeled nucleotide bases to the array of molecules forming the cluster so as to produce four different images, each image comprising signals having a single color, wherein the signal color is different for each of the four different images, thereby producing a cycle of four color images that corresponds to the four possible nucleotides present at a particular position in the nucleic acid.
- such methods can further comprise providing additional labeled nucleotide bases to the array of molecules forming the cluster, thereby producing a plurality of cycles of color images.
- nucleic acids can be sequenced utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety.
- a nucleic acid can be sequenced by providing a first nucleotide type that is detected in a first channel, a second nucleotide type that is detected in a second channel, a third nucleotide type that is detected in both the first and the second channel, and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel.
- such methods can further comprise providing additional labeled nucleotide bases to the array of molecules forming the cluster, thereby producing a plurality of cycles of color images.
- Base calling Base Calling refers to the process of determining a base call (A, C, G, T) for every cluster of a given tile at a specific cycle. In implementations that make use of two-channel detection, base calling is performed by extracting image data from two images, rather than four.
- Base calling methods can comprise iteratively fitting four Gaussian distributions to intensity data from two channels. When signals from channel 1 are plotted against signals from channel 2, signal intensity typically segregates into four general populations of intensity.
- data from a two-channel sequencing system can be plotted as intensity values from channel 1 (x-axis) versus intensity values from channel 2 (y-axis).
- one of the four nucleotides is unlabeled (dark), such as the "G" nucleotide shown in Figure 4, which has near zero signal in both channel 1 and channel 2.
- the signals from a certain portion of the data points are clustered near the zero point in each axis.
- the signals from a certain portion of the data points labeled with one or both labels form identifiable populations when plotted in a two-dimensional scatter-plot such as the one shown in Figure 4.
- the intensity itself of a particular label does not encode the base. Rather, the combination of intensities, [on, off], [off, on], [on, on], [off, off], provides the encoding information for the base identity.
- Base calling can be performed by fitting a mathematical model, such as one based on a set of Gaussian distributions, to a set of intensity data.
- Gaussian distributions can be fit to a set of two-channel intensity data such that one distribution is applied for each of the four nucleotides represented in the data set.
- a mixture of four intensity distributions can be fitted to the intensity values of a target cluster to be called at a given sequencing cycle, in order to determine the likelihoods of the intensity values of the target cluster belonging to each of the four intensity distributions.
- the mixture of intensity distributions is a Gaussian mixture model.
- a Gaussian mixture model comprises multiple Gaussians, each identified by $k \in \{1, \dots, K\}$, where $K$ is the number of components (i.e., groups of data points).
- the Gaussian mixture model can include four intensity distributions, corresponding to four nucleotide bases A, G, C and T.
- Each Gaussian $k$ in the mixture includes the following parameters: a mean value $\mu$ that defines its centroid; and covariances $\Sigma$ that define its width.
- the covariances $\Sigma$ define the dimension of an ellipsoid of the intensity distribution.
- Intensity values can be normalized prior to fitting a Gaussian distribution. For example, intensity values can be normalized so that the 5th and 95th percentiles have values of 0 and 1, respectively.
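The percentile normalization described above can be sketched as follows. This is a minimal illustration, not the sequencer's implementation: the function names and the linear-interpolation percentile rule are assumptions, and the source does not say how ties or outliers are handled.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def normalize_intensities(values, low_q=5.0, high_q=95.0):
    """Rescale so the low_q percentile maps to 0 and the high_q percentile to 1."""
    ordered = sorted(values)
    p_low = percentile(ordered, low_q)
    p_high = percentile(ordered, high_q)
    if p_high == p_low:
        raise ValueError("percentiles coincide; cannot normalize")
    scale = p_high - p_low
    return [(v - p_low) / scale for v in values]
```

Values outside the 5th to 95th percentile range map below 0 or above 1, which keeps outliers visible instead of clipping them.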
- EM expectation maximization
- EM algorithms are known in the art and are useful tools to construct statistical models of the underlying data source and naturally generalize to cluster databases containing both discrete-valued and continuous-valued data.
- an EM algorithm can be applied to iteratively maximize the likelihood of observing the given data.
- an EM algorithm can be applied to iteratively maximize this likelihood over the mean and covariance for each of the Gaussian distributions.
- a subset of the data points in a data set, or all or substantially all data points in the data set can be included in the calculation.
- for each X, Y value (referring to each of the two channel intensities respectively), a value can be generated which represents the likelihood that a certain X, Y intensity value belongs to one of the four distributions.
- each X, Y intensity value will also have four associated likelihood values, one for each of the four bases.
- the maximum of the four likelihood values indicates the base call.
- each of the intensity values for a two-channel data set can be assigned a base call after performing a Gaussian fit to the data set.
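The likelihood step of this approach can be illustrated with a short sketch that evaluates four bivariate Gaussians at one (X, Y) intensity pair and takes the maximum. It assumes the means and covariances have already been fitted (the EM iterations are not shown). The centroid positions, the covariance values, and the assignment of T, C and A to channel combinations in `COMPONENTS` are hypothetical; the source only states that G is the dark base.

```python
import math


def gaussian_pdf_2d(x, y, mean, cov):
    """Density of a bivariate Gaussian at (x, y).

    mean is (mx, my); cov is ((sxx, sxy), (sxy, syy)).
    """
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x - mean[0], y - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    m2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m2) / (2.0 * math.pi * math.sqrt(det))


# Hypothetical fitted components: (centroid, covariance) per base.
_COV = ((0.02, 0.0), (0.0, 0.02))
COMPONENTS = {
    "G": ((0.0, 0.0), _COV),  # dark in both channels
    "T": ((1.0, 0.0), _COV),  # assumed: channel 1 only
    "C": ((0.0, 1.0), _COV),  # assumed: channel 2 only
    "A": ((1.0, 1.0), _COV),  # assumed: both channels
}


def call_base(x, y, components=COMPONENTS):
    """Return (base, likelihoods), where base has the largest likelihood."""
    likelihoods = {
        base: gaussian_pdf_2d(x, y, mean, cov)
        for base, (mean, cov) in components.items()
    }
    return max(likelihoods, key=likelihoods.get), likelihoods
```

For example, `call_base(0.05, -0.02)` returns `"G"` as the call, since the point lies closest to the dark centroid.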
- Quality scoring As used herein, a “quality score” refers to a measure of uncertainty of a base call. Quality scoring refers to the process of assigning a quality score to each base call.
- the quality score can be presented in any suitable format that allows a user to determine the probability of error of any given base call.
- the quality score is presented as a numerical value.
- the quality score can be quoted as QXX, where XX is the score, meaning that the particular call has a probability of error of $10^{-XX/10}$.
- Q30 equates to an error rate of 1 in 1000, or 0.1%.
- Q40 equates to an error rate of 1 in 10,000 or 0.01%.
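The relationship between a Q value and its error probability can be written out directly. The sketch below uses hypothetical function names and only restates the formula given above.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with quality score q is wrong."""
    return 10.0 ** (-q / 10.0)


def error_probability_to_phred(p):
    """Quality score corresponding to an error probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)
```

Under this conversion, Q30 corresponds to an error probability of 0.001 and Q40 to 0.0001, matching the examples above.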
- Quality scoring can be performed by calculating a set of predictors, or features, for each base call, and using those predictor values to look up the quality score in a quality table.
- the quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability.
- the predictor values for each base call can be any suitable aspect that may indicate or predict the quality of the base call in a given sequencing run.
- some predictors can include approximate homopolymer; intensity decay; penultimate chastity; signal overlap with background (SOWB); and shifted purity G adjustment.
- approximate homopolymer refers to a calculation of the number of consecutive identical base calls preceding a base call. In certain embodiments, the calculation can allow one exception, in order to identify problematic sequence contexts such as homopolymer runs and problematic motifs such as “GGCGG”.
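One possible reading of the approximate homopolymer predictor is sketched below. The source does not specify the exact counting rule, so the choices here are assumptions: the run is measured against the call immediately preceding the current position, and the single permitted exception is skipped without being counted.

```python
def approximate_homopolymer(calls, index, max_exceptions=1):
    """Count calls before calls[index] that match the call at index - 1.

    Walks backwards from index - 1 and tolerates up to max_exceptions
    non-matching calls before stopping.
    """
    if index <= 0:
        return 0
    reference = calls[index - 1]
    count = 0
    exceptions = 0
    for pos in range(index - 1, -1, -1):
        if calls[pos] == reference:
            count += 1
        elif exceptions < max_exceptions:
            exceptions += 1
        else:
            break
    return count
```

With one exception allowed, the base following the motif "GGCGG" sees a run length of 4; with `max_exceptions=0` it sees only the 2 trailing Gs.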
- intensity decay refers to the identification of base calls that suffer loss of signal as sequencing progresses. For example, this can be done by comparing the brightest intensity at the current cycle to the brightest intensity at cycle 1.
- penultimate chastity refers to measurement of early read quality in the first 25 bases based on the second worst chastity value. Chastity can be determined as the highest intensity value divided by the sum of the highest intensity value and the second highest intensity value, where the intensity values are obtained from four color channels.
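The four-channel chastity formula and the penultimate chastity predictor can be sketched as follows. Function names are hypothetical, and the sketch assumes the top two intensities sum to a positive value.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    if total <= 0:
        raise ValueError("top two intensities must sum to a positive value")
    return top / total


def penultimate_chastity(per_cycle_intensities, window=25):
    """Second-worst chastity over the first `window` cycles of a read."""
    values = sorted(chastity(c) for c in per_cycle_intensities[:window])
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    return values[1]
```

Taking the second-worst value instead of the worst makes the predictor tolerant of a single poor cycle early in the read.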
- signal overlap with background (SOWB) refers to a measurement of the separation of the signal from the noise in previous and subsequent cycles. In a preferred embodiment, the measurement utilizes the 5 cycles immediately preceding and following the current cycle.
- shifted purity G adjustment refers to a measurement of the separation of the signal from the noise for the current base call only, while also accounting for G quenching effects.
- the intensities in certain color channels may be decreased (quenched) in cycles following those cycles where a G nucleotide was incorporated.
- the measurement of the separation of signal from noise is adjusted for G quenching by multiplying T channel intensity for a cycle following a G incorporation by 1.3, and by multiplying A channel intensity for a cycle following a G incorporation by 1.05.
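The G quenching adjustment can be sketched as a per-cycle rescaling. The factors 1.3 and 1.05 come from the text above. The dictionary layout keyed by channel name and the function name are assumptions made for illustration.

```python
T_GAIN_AFTER_G = 1.3   # factor applied to the T channel after a G incorporation
A_GAIN_AFTER_G = 1.05  # factor applied to the A channel after a G incorporation


def adjust_for_g_quenching(intensities, previous_call):
    """Return a copy of the current cycle's intensities, boosted if the
    previous cycle's base call was G.

    intensities maps channel name ("A", "C", "G", "T") to intensity.
    """
    adjusted = dict(intensities)
    if previous_call == "G":
        adjusted["T"] *= T_GAIN_AFTER_G
        adjusted["A"] *= A_GAIN_AFTER_G
    return adjusted
```

The input dictionary is left unmodified, so the unadjusted intensities remain available to other predictors.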
- additional operations can be performed.
- the method for evaluating the quality of a base call further comprises discounting unreliable quality scores at the end of each read.
- the step of discounting unreliable quality scores comprises using an algorithm to identify a threshold of reliability.
- reliable base calls comprise q-values above the threshold and unreliable base calls comprise q-values below the threshold.
- An algorithm for determining a threshold of reliability can comprise the End Anchored Maximal Scoring Segments (EAMSS) algorithm, for example.
- EAMSS algorithm is an algorithm that identifies transition points where good and reliable base calls (with mostly high q-values) become unreliable base calls (with mostly low q-values). The identification of such transition points can be done, for example, using a Hidden Markov Model that identifies shifts in the local distributions of quality scores.
- unreliable base calls can include base calls with a strong bias toward G base calls.
- Another additional operation that can be performed includes identifying reads where the second worst chastity in the first 25 base calls is below a pre-established threshold, and marking the reads as poor quality data. This is referred to as read filtering.
- chastity can be determined as the highest intensity value divided by the sum of the highest intensity value and the second highest intensity value, where the intensity values are obtained from four color channels.
- Because some of the above-described predictors utilize corrected intensities from future cycles, Quality Scoring will typically lag Base Calling. A tile can be ready for Quality Scoring if a base call file exists for that cycle and if the corrected intensity files exist for the next few cycles (determined by the complexity of the predictors).
- For sequencing performed using a two-channel system, a quality score can also be generated based on the Gaussian distribution approach to base calling.
- the distance of a point to the center of the "called" distribution gives a measure of the purity of the base call. Specifically, the closer a data point lies to the center of the distribution for the called base, the greater the likelihood that the base call is accurate.
- Any suitable method to calculate and express the relationship between distance to the center and the likely purity of the base call can be used.
- the quality or purity of the base call for a given data point can be expressed as the distance to the nearest centroid divided by the sum of all distances to each of the other three centroids.
- the quality or purity of the base call for a given data point can be expressed as the distance to the nearest centroid divided by the distance to the second nearest centroid, which can also be defined as the chastity of the cluster.
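The two distance-based expressions above can be sketched for a two-channel scatter plot. The function names are hypothetical, and the centroids in the usage note are illustrative values, not fitted ones. Under both ratios as literally stated, a smaller value means the point lies closer to its called centroid relative to the alternatives.

```python
import math


def sorted_centroid_distances(point, centroids):
    """Euclidean distances from point to every centroid, nearest first."""
    return sorted(math.dist(point, c) for c in centroids.values())


def nearest_over_others(point, centroids):
    """Distance to the nearest centroid divided by the summed distances
    to the remaining centroids."""
    d = sorted_centroid_distances(point, centroids)
    rest = sum(d[1:])
    if rest == 0:
        raise ValueError("centroids coincide with the point")
    return d[0] / rest


def nearest_over_second(point, centroids):
    """Distance to the nearest centroid divided by the distance to the
    second nearest centroid."""
    d = sorted_centroid_distances(point, centroids)
    if d[1] == 0:
        raise ValueError("two centroids coincide with the point")
    return d[0] / d[1]
```

For centroids at the corners of the unit square, a point at (0.1, 0.0) gives a nearest-over-second ratio of 0.1 / 0.9, while a point midway between two centroids gives 1.0.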
- Chastity for two-channel basecalling takes on a separate meaning from the use of the term in four-channel basecalling.
- in four-channel basecalling, chastity is defined in terms of the intensity of a cluster ("spot") relative to a nearby spot, and can be calculated as the highest intensity value divided by the sum of the highest intensity value and the second highest intensity value, where the intensity values are obtained from four color channels.
- such chastity determinations are unsuitable for two-channel basecalling.
- the chastity of a cluster from a two-channel sequencing system can be computed as a function of relative distances from intensity values to Gaussian centroids. Clusters that are not close enough to one particular Gaussian centroid in a given number of cycles can be given a low chastity value and are filtered out.
- An estimate of a quality score determined by a quality scoring model can be compared to an empirical quality score in order to assess the accuracy of the quality scoring model.
- Empirical quality scores can be determined when base calling is performed on a cluster where the corresponding sequence information is known.
- An empirical quality score can be determined based on a comparison between a read, i.e. the sequence of base called nucleobases from a cluster, and the known sequence of polynucleotide sequence portions forming the cluster.
- the known sequence of polynucleotide sequence portions is obtained by aligning the read to a reference genome. This is known as a mapped read.
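An empirical quality score can be sketched by counting mismatches between mapped reads and their reference sequences and converting the observed error rate to the Phred scale. This is an assumption-laden illustration: it treats each read as already aligned without gaps to a reference string of equal length, and the function name is hypothetical.

```python
import math


def empirical_quality(read_reference_pairs):
    """Phred-scaled observed error rate over (read, reference) pairs.

    Each pair holds two equal-length strings: the base-called read and
    the known sequence it maps to.
    """
    mismatches = 0
    total = 0
    for read, reference in read_reference_pairs:
        if len(read) != len(reference):
            raise ValueError("read and reference must have equal length")
        total += len(read)
        mismatches += sum(a != b for a, b in zip(read, reference))
    if total == 0:
        raise ValueError("no bases to score")
    if mismatches == 0:
        return math.inf  # no observed errors; the score is unbounded
    return -10.0 * math.log10(mismatches / total)
```

In practice the pairs would be grouped by predicted quality score, so that each predicted Q value can be compared with the empirical value observed for the base calls assigned to it.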
- chastity may be used as one of the predictors which are used as an input to the lookup table. As will now be described, chastity does not always accurately indicate the quality of a base call.
- Figure 5 shows a simulated chastity plot for two-channel sequencing data that is determined based on a Gaussian mixture model comprising four Gaussians each mapped to a respective base. Chastity values decrease in an approximately radial manner away from where the centroids of each respective Gaussian would be located on the plot.
- the regions indicated by arrows 510, 520, 530 and 540 correspond to areas of high chastity which are near to where respective centroids of the Gaussians would be located.
- chastity is a reasonable proxy for base calling quality because an intensity data point that is positioned near the centroid of a Gaussian is likely to correspond to the base associated with that Gaussian.
- in regions 550, 560, 570 and 580, which are positioned in the corner regions of the scatter plot, chastity may not always be a reasonable proxy for a quality score.
- the chastity values in regions 550, 560, 570 and 580 are low since intensity data points in these regions are moderately far from the nearest Gaussian and very far from the second nearest Gaussians.
- FIG. 6 schematically illustrates a quality scoring system 600 for determining the quality of a base call which uses a non-linear machine learning model.
- the base call for which a quality score is determined is based upon a signal generated in a current sequencing cycle of a sequencing run obtained based upon the incorporation of a nucleobase into a plurality of polynucleotide sequence portions forming a cluster.
- the base call may be determined using any suitable base calling method such as any one of the methods previously described herein.
- the term “machine learning model” refers to a computer algorithm or a collection of computer algorithms that learn a particular task through experience based on use of data.
- a machine-learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness.
- Example machine-learning models include various types of decision trees, logistic regressions, linear regressions, random forests, support vector machines, Bayesian networks, or neural networks.
- Non-linear machine learning models are a type of machine learning model which can capture complex relationships between features of experimental data by modelling the non-linear interactions between them.
- a non-linear ML model models a response (i.e. a quality score), based on a set of predictor variables, or features.
- non-linear ML model 650 is configured to receive a plurality of features associated with a cluster.
- the non-linear ML model 650 is configured to process a plurality of features 602 and output a quality score 660 of a base call.
- a feature extractor 640 is configured to process one or more features associated with a cluster and output one or more features to the non-linear ML model 650.
- Each of the plurality of features associated with the cluster 602 can be any feature related to, or which can be derived from, the cluster, such as: properties, statistics or metrics related to or derived from the intensity data obtained from the cluster; properties, statistics or metrics related to or derived from a base call associated with the cluster; properties, statistics or metrics related to or derived from the template strands in the cluster; the sample preparation method used to obtain the cluster; the arrangement of the cluster on the flow cell; and cycle index.
- Each of the plurality of features 602 can be associated with any one or more of the sequencing cycles of the sequencing run.
- At least one of the plurality of features is associated with the current sequencing cycle and additionally, or alternatively, with at least one of a preceding and/or succeeding cycle. In some embodiments, at least one of the plurality of features 602 is an estimate over all previously completed cycles and the current sequencing cycle. In some embodiments, at least one of the plurality of features 602 is an estimate over an entire read, i.e. all sequencing cycles in the sequencing run.
- the quality scoring system 600 can be a component of a sequencer. Alternatively, the quality scoring system 600 can be a separate component to the sequencer, for example a computer system.
- the quality scoring system 600 accesses each of the plurality of features 602 from storage associated with a sequencer or any other suitable storage, or from a module of a sequencer, such as a base calling module, an intensity extractor module, or an intensity correction module.
- Figure 7 shows a plurality of features 702.
- the plurality of features associated with a cluster 602 of Figure 6 may comprise the plurality of features 702, or may be a subset of features selected from the plurality of features 702.
- the features associated with the cluster 702 can comprise one or more of: intensity data 704, a base call 706, a conditional probability 708 that a nucleobase is one of the four bases A, T, C and G, chastity 710, the signal-to-noise 712 of the intensity data obtained from the cluster, phasing weights 714 for the cluster, cluster coefficients 716 which relate to characteristics of the intensity profile of the cluster when represented on a scatter plot, base context 718 of the nucleobase, the spatial location of the cluster on the flow cell or movement of the cluster in the sequencer 720, the number of upstream Gs, GGs, CCs and/or GCs in the read 722, GC bias 724, homopolymer length 726, which is the length of the longest homopolymer upstream of the nucleobase in the read, and homopolymer distance 728, which is the distance from the current nucleobase to the start of the longest homopolymer upstream of the current nucleobase in the read,
- the plurality of features associated with the cluster 702 can further comprise one or more features derived from the aforementioned features 704 to 738. Additionally, or alternatively, the plurality of features associated with the cluster 702 can further comprise any other suitable feature associated with the cluster which is suitable for determining a quality score.
- Feature combinations Some combinations of four features associated with a cluster 702 have been found to be particularly effective at improving quality scoring accuracy whilst reducing the memory and compute resources required. These combinations are defined in the following embodiments.
- In some embodiments, the plurality of features associated with the cluster 702 comprise a base call 706 for the current sequencing cycle, the chastity 710, the signal-to-noise ratio 712, and the read index 736.
- the plurality of features associated with the cluster 702 comprise a base call 706 for the current sequencing cycle, a conditional probability 708 comprising the maximum posterior probability (or likelihood) of the posterior probabilities (or likelihoods) calculated for each of the four nucleobases, the signal-to-noise ratio 712, and the read index 736.
- the plurality of features associated with the cluster 702 comprise the chastity 710, the signal-to-noise ratio 712, the base context 718 comprising the two preceding and one succeeding base calls (i.e. a trimer context) and the read index 736.
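As an illustration of how one of these four-feature combinations might be packaged for a model, the sketch below encodes the first combination (base call, chastity, signal-to-noise ratio, read index) as a numeric vector. The class name, the one-hot encoding of the base call, and the treatment of the read index as a plain integer are all assumptions; the source does not prescribe an encoding or a particular model.

```python
from dataclasses import dataclass

BASES = ("A", "C", "G", "T")


@dataclass(frozen=True)
class ClusterFeatures:
    """One cluster's features for the current sequencing cycle."""
    base_call: str
    chastity: float
    snr: float
    read_index: int

    def to_vector(self):
        """One-hot encode the base call and append the numeric features."""
        if self.base_call not in BASES:
            raise ValueError(f"unknown base call: {self.base_call!r}")
        one_hot = [1.0 if self.base_call == b else 0.0 for b in BASES]
        return one_hot + [self.chastity, self.snr, float(self.read_index)]
```

A vector of this kind could then be passed to whichever non-linear model is used, for example a random forest or a small neural network.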
- the intensity data 704 is the detected signal obtained based upon the incorporation of a nucleobase into the cluster.
- the intensity data 704 can comprise the raw intensity data obtained from the sequencing platform, which for a two-channel sequencing system comprises a first channel intensity value and a second channel intensity value.
- the raw intensity data is obtained from an intensity extractor module of the sequencer.
- the intensity data can comprise fully corrected intensity data which is obtained by correcting the raw intensity data for sequencer and cluster dependent effects.
- raw intensity corrections that correct for sequencer dependent effects can include laser ramp correction and camera gain correction.
- raw intensity corrections that correct for cluster dependent effects can include background corrections, scale correction, decay correction and phasing/prephasing correction.
- the fully corrected intensity data comprises a first channel intensity value and a second channel intensity value.
- the phasing/prephasing correction addresses the loss of synchrony in the readout of the sequence copies of an analyte caused by phasing and prephasing.
- Phasing is caused by incomplete removal of 3' terminators and fluorophores as well as sequences in the analyte missing an incorporation cycle.
- Prephasing is caused by the incorporation of nucleotides without effective 3'-blocking. Incomplete extension due to phasing results in lagging strands (e.g., t-1 from the current cycle). Addition of multiple nucleotides or probes in a population of identical strands due to prephasing results in leading strands (e.g., t+1 from the current cycle).
- Phasing and prephasing effects are nonstationary distortions and thus the proportion of sequences in each analyte that is affected by phasing and prephasing increases with cycle number, which hampers correct base identification and limits the length of useful sequence reads.
- the decay correction is to address the signal decay, for example, fading of the intensities of the fluorophores that are incorporated into the template sequences during the sequencing-by-synthesis process. Fading is an exponential decay in fluorescent signal intensity as a function of base calling cycle number. As sequencing proceeds, accurate base calling becomes increasingly difficult, because signal strength decreases and noise increases, resulting in a substantially decreased signal-to-noise ratio. It has been observed that later synthesis steps attach tags in a different position relative to the sensor than earlier synthesis steps.
- the scale correction is to address the variations in the intensities of clusters.
- When clusters are immobilized on the surface of the flow cell, their size and shape may vary. A larger-sized cluster includes more template oligonucleotides than a small-sized cluster and thus may show higher intensity values when more fluorophores are incorporated into the oligonucleotides.
- the scale correction can account for the difference in the scale of the intensities of clusters.
- the background corrections are to address background variation. Background intensity of a particular sensor is relatively steady between cycles, but varies across the sensors.
- Positioning of the illumination source, which can vary by illumination color, creates a spatial pattern of background variation over a field of the sensors. Manufacturing differences among the sensors have been observed to produce different background intensity readouts, even between adjoining sensors. In a first approximation, idiosyncratic variation among sensors can be ignored. In a refinement, the idiosyncratic variation in background intensity among sensors can be taken into account.
- the feature extractor is configured to determine fully corrected intensities and input the fully corrected intensities to the non-linear ML model.
- the non-linear ML model receives fully corrected intensities from a storage or from another component of the sequencer, such as an intensity corrector module.
- Base call The base call 706 is the determined base (A, C, G or T) for a nucleobase incorporated into the cluster for a sequencing cycle of the sequencing run.
- the base call can be determined using any of the methods previously described. For example, for a two-channel system, a mixture model of four Gaussians, each Gaussian corresponding to a respective base, can be iteratively applied to a set of intensity values using, for example, the EM algorithm.
- a value can be generated which represents the likelihood that each intensity value belongs to one of the four distributions.
- Each of the likelihoods is respectively a likelihood that the intensity value belongs to one of the four bases corresponding to the respective distribution.
- the maximum of the four likelihood values indicates the base call.
- the feature extractor is configured to determine the base call 706 and input the base call to the non-linear ML model.
- the non-linear ML model receives the base call from a storage or from another component of the sequencer, such as a base calling module.
- conditional probabilities 708 can be based on four probabilities corresponding to each nucleobase type A, C, G or T calculated based on Bayes' rule.
- the conditional probabilities 708 comprise four probabilities, where each of the four probabilities is a likelihood of the intensity value given a respective nucleobase type, as described above in relation to base calling.
- each of the four probabilities is a posterior probability that the nucleobase is a respective nucleobase type given the intensity data.
- the conditional probabilities 708 are calculated based upon a Gaussian mixture model, wherein each of the components of the Gaussian mixture model is an intensity distribution corresponding to a respective nucleobase type.
- the components of the Gaussian mixture model can be determined using the EM algorithm as previously described.
- the conditional probabilities 708 can comprise the logarithm of the likelihood of the intensity value given a respective nucleobase type. Additionally, or alternatively, the conditional probabilities 708 can comprise a maximum probability.
- This maximum probability corresponds to a maximum of the four probabilities (the four probabilities being either likelihoods, log-likelihoods or posterior probabilities), calculated for each nucleobase type A, C, G or T.
- the conditional probability 708 comprises the probability corresponding to the nucleobase type associated with the highest probability. It will be understood that when a uniform prior is used for each of the nucleobases, the same base that maximises the posterior probability will also maximise the likelihood.
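The step from likelihoods to posterior probabilities via Bayes' rule can be sketched as follows. The function name and the example likelihood values are illustrative assumptions. With a uniform prior the posterior is simply the normalized likelihood, so the base that maximizes one also maximizes the other.

```python
def posteriors(likelihoods, priors=None):
    """Posterior P(base | intensity) from likelihoods P(intensity | base).

    likelihoods and priors map base -> value; a uniform prior is used
    when priors is None.
    """
    bases = list(likelihoods)
    if priors is None:
        priors = {b: 1.0 / len(bases) for b in bases}
    joint = {b: likelihoods[b] * priors[b] for b in bases}
    evidence = sum(joint.values())
    if evidence <= 0:
        raise ValueError("all joint probabilities are zero")
    return {b: joint[b] / evidence for b in bases}


def max_posterior(likelihoods, priors=None):
    """Return (base, probability) for the most probable base."""
    post = posteriors(likelihoods, priors)
    best = max(post, key=post.get)
    return best, post[best]
```

A non-uniform prior can change the most probable base, which is why the equivalence stated above holds only for the uniform case.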
- the feature extractor is configured to determine the conditional probabilities 708 and input the conditional probabilities 708 to the non-linear ML model.
- the non-linear ML model receives the conditional probabilities 708 from a storage or from another component of the sequencer, such as a base calling module.
- Chastity refers to a measure of purity of a received signal from a cluster.
- the chastity 710 of the cluster for a given sequencing cycle can be determined according to the methods previously described herein, according to whether the sequencing is performed based on either a two-channel or four-channel system.
- chastity 710 may be an average chastity value calculated for a plurality of sequencing cycles of a sequencing run.
- chastity 710 may be an average chastity over the previous two, three, four, five, six or more sequencing cycles of a sequencing run. In some embodiments, chastity 710 is an average chastity over the five previous sequencing cycles. In some embodiments, the chastity 710 may be the mean chastity over all previous and current sequencing cycles. In some embodiments, the feature extractor is configured to determine the chastity 710 and input the chastity to the non-linear ML model. In other embodiments, the non-linear ML model receives the chastity 710 from a storage or from a component of the sequencer.
- the signal-to-noise ratio 712 is an estimated measure of the signal power divided by an estimated measure of the noise power for the respective cluster.
- the signal-to-noise ratio 712 may be an estimate for the current sequencing cycle (instantaneous signal-to-noise ratio) or an average over preceding cycles and the current sequencing cycle (averaged signal-to-noise ratio).
- the signal power for a respective cluster can be calculated based on the power detected from the cluster when it is on and/or off. In an idealised system, at each sequencing cycle, all of the fluorescently-labeled nucleotides incorporated into the cluster would emit light energy in a short time frame following the excitation of the cluster with one or more lasers.
- the intensity signal obtained in this time frame can be considered an “on” component of the cluster.
- the signal obtained for the remainder of the time in the sequencing cycle can be considered an “off” component of the cluster.
- the “off” component would correspond to zero signal and the “on” component would be a maximal intensity, i.e. the “on” and “off” components would be binary. It has been found in real world systems however that fluorophores can become “stuck” and therefore the “on” component does not correspond to a light signal from all of the recently incorporated fluorophores in the cluster.
- the noise power for a respective cluster can be calculated based on the power measured at the cluster which derives from sources other than the cluster, such as one or more of crosstalk from other clusters, background noise, jitter, imperfect estimation of amplitude coefficients etc.
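The instantaneous and averaged signal-to-noise ratios can be sketched from per-cycle power estimates. How the signal and noise powers are estimated is not shown here, and averaging the per-cycle ratios (as opposed to dividing averaged powers) is an assumption; the source does not specify which is used.

```python
def instantaneous_snr(signal_power, noise_power):
    """Signal power divided by noise power for a single cycle."""
    if noise_power <= 0:
        raise ValueError("noise power must be positive")
    return signal_power / noise_power


def averaged_snr(signal_powers, noise_powers):
    """For each cycle, the mean of the per-cycle SNRs over all cycles
    up to and including that cycle."""
    averaged = []
    total = 0.0
    pairs = zip(signal_powers, noise_powers)
    for cycle, (signal, noise) in enumerate(pairs, start=1):
        total += instantaneous_snr(signal, noise)
        averaged.append(total / cycle)
    return averaged
```

The averaged form changes more slowly than the instantaneous form, so a single noisy cycle has less influence on the feature value.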
- the feature extractor is configured to determine the signal-to-noise ratio 712 and input the signal-to-noise ratio 712 to the non-linear ML model.
- the non-linear ML model receives the signal-to-noise ratio 712 from a storage or from another component of the sequencer.
- Phasing weights Phasing weights 714 can comprise a phasing weight and/or a pre-phasing weight.
- Phasing refers to the situation where a fluorescently-labeled nucleotide incorporated into a cluster falls at least one base behind other incorporated fluorescently-labeled nucleotides in the same cluster in a sequencing cycle.
- pre-phasing refers to the situation where a fluorescently-labeled nucleotide incorporated into a cluster jumps at least one base ahead of other incorporated fluorescently-labeled nucleotides in the same cluster in a sequencing cycle.
- Figures 8 and 9 illustrate an example of the phasing and pre-phasing effects.
- Figure 8 shows that some fluorescently-labeled nucleotides incorporated into strands of a cluster lead (pre-phasing, 800) while others lag behind (phasing, 801), leading to a mixed signal readout of the cluster.
- Figure 9 depicts the intensity output of the second intensity channel, which corresponds to the incorporation of a “C” into a cluster every 15 cycles in a heterogeneous background.
- sub-panel 910 within Figure 9B which shows the signal intensity around cycle 15
- Fading, or signal decay is an exponential decay in fluorescent signal intensity as a function of sequencing cycle number. Fading, or signal decay, will be described in further detail below in relation to emission stack effects. In order to capture the effects of phasing in a cluster, phasing weights 714 can be calculated.
- the phasing weights 714 can comprise a phasing weight Wx and a pre-phasing weight Wy, identified based on a first order phasing correction for the intensity data of a given sequencing cycle c, expressed in terms of the intensity data for the preceding sequencing cycle and the intensity data for the succeeding sequencing cycle. It will be understood that, utilizing this approach, if suitable values of Wx and Wy are chosen, then the mean chastity of intensity values for a plurality of sequencing cycles is maximized. For example, it is possible to numerically optimize via a pattern search over Wx and Wy to maximize the mean chastity for a population of clusters.
- the Wx and Wy values can be identified as the values that provide for the maximal mean chastity.
- the Wx and Wy values can be used to determine fully corrected intensity data as previously described.
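The optimisation described above can be sketched as a simple compass-style pattern search. Several parts of this sketch are assumptions rather than the method of the source: the linear form of the correction, `(1 + Wx + Wy) * I[c] - Wx * I[c-1] - Wy * I[c+1]`, the chastity definition (brightest intensity divided by the sum of the two brightest), the clamping of the weights to the range 0 to 1, and all function names.

```python
def chastity(vector):
    """Assumed chastity: brightest / (brightest + second brightest).

    Negative corrected intensities are clamped to zero first.
    """
    top, second = sorted((max(v, 0.0) for v in vector), reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def correct(read, wx, wy):
    """Apply the assumed first order correction to one read.

    `read` is a list of per-cycle intensity vectors; the missing neighbours
    at the two ends of the read are treated as zero.
    """
    zero = [0.0] * len(read[0])
    corrected = []
    for c, current in enumerate(read):
        previous = read[c - 1] if c > 0 else zero
        following = read[c + 1] if c + 1 < len(read) else zero
        corrected.append([
            (1.0 + wx + wy) * cur - wx * prev - wy * nxt
            for cur, prev, nxt in zip(current, previous, following)
        ])
    return corrected


def mean_chastity(reads, wx, wy):
    """Mean chastity over every cycle of every read after correction."""
    scores = [chastity(v) for read in reads for v in correct(read, wx, wy)]
    return sum(scores) / len(scores)


def pattern_search(reads, step=0.1, min_step=1e-3):
    """Compass search over (wx, wy) that maximises the mean chastity."""
    wx, wy = 0.0, 0.0
    best = mean_chastity(reads, wx, wy)
    while step >= min_step:
        improved = False
        for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cx = min(max(wx + dx, 0.0), 1.0)
            cy = min(max(wy + dy, 0.0), 1.0)
            score = mean_chastity(reads, cx, cy)
            if score > best + 1e-12:
                wx, wy, best, improved = cx, cy, score, True
        if not improved:
            step /= 2.0  # shrink the pattern once no neighbour improves
    return wx, wy, best
```

With a population of reads in which part of each cycle's signal lags into the next cycle, the search moves away from (0, 0) towards a positive phasing weight because doing so raises the mean chastity.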
- a second order empirical phasing correction can be calculated.
- the calculation is optimized over the corresponding second order phasing and pre-phasing weights to maximise the mean chastity.
- further higher order weights can be calculated.
- phasing weights 714 can be calculated according to traditional phasing correction approaches, for example as described in WO 2015/084985 (PCT/US2014/068409).
- the feature extractor is configured to determine the phasing weights 714 and input the phasing weights 714 to the non-linear ML model.
- the non-linear ML model receives the phasing weights 714 from a storage or from another component of the sequencer, such as an intensity correction module.
- Cluster coefficients Cluster coefficients 716 can relate to characteristics of the intensity profile of a cluster when represented on a scatter plot.
- the cluster coefficients 716 can comprise a scale coefficient and an offset coefficient.
- the cluster coefficients are described in detail in U.S. Patent Publication No. 11361194 B2, the disclosure of which is incorporated herein by reference in its entirety. An overview of the cluster coefficients 716 will now be described with reference to Figure 10.
- the cluster coefficients 716 are estimated at each sequencing cycle of the sequencing run and are iteratively updated for subsequent sequencing cycles.
- the cluster coefficients 716 that represent an estimate over the entire read, i.e. that have been iteratively updated over all sequencing cycles in the sequencing run, are used as input features.
- the intensity profiles of respective clusters in the population of clusters on a flow cell take similar form but differ in scale and shift across a multi-dimensional space 1000.
- the multidimensional space 1000 can be defined by an x-axis and y-axis corresponding respectively to the first intensity channel and second intensity channel.
- the “X” symbol represents the intensity values for cluster 1
- the “+” symbol represents the intensity values for cluster 2
- the symbol represents the intensity values for cluster 3.
- the cluster coefficients comprise a scale coefficient that accounts for scale variation in the inter-cluster intensity profile variation, and two channel-specific offset coefficients that account for shift variation along the first and the second intensity channels in the inter-cluster intensity profile variation, respectively.
- the shift variation is accounted for by using a common offset coefficient for the different intensity channels (e.g., the first and the second intensity channels).
- the cluster coefficients for a cluster can be generated at a current sequencing cycle of the sequencing run based on combining analysis of historic intensity statistics determined for the target cluster at preceding sequencing cycles of the sequencing run with analysis of current intensity statistics determined for the target cluster at the current sequencing cycle.
- the target cluster can be base called according to any of the methods described herein and a base-specific intensity distribution, which is the Gaussian distribution corresponding to the base call, can be defined for the sequencing cycle c.
- Expressions for the scale and offset coefficients can be obtained by evaluating equations (3) and (4) and substituting in a set of intensity correction parameters.
- the intensity correction parameters can be calculated for each sequencing cycle c based on the intensity value y and the base-specific intensity distribution to which the target cluster belongs at the current sequencing cycle c.
- a distribution intensity in the first intensity channel is the intensity value in the first intensity channel at the centroid of the base-specific intensity distribution.
- the base-specific intensity distribution can be the basis for calling the current base call.
- a distribution intensity in the second intensity channel is the intensity value in the second intensity channel at the centroid of the base-specific intensity distribution.
- a distribution centroid-to-origin distance is the Euclidean distance between the centroid of the base-specific intensity distribution and the origin 1032 of the multi-dimensional space 1000 in which the base-specific intensity distributions were fit (e.g., by using the expectation maximization algorithm).
- distance metrics such as the Mahalanobis distance and the minimum covariance determinant (MCD) distances, and their associated centroid estimators can be used.
- the fifth accumulated intensity correction parameter is the sum of distribution centroid-to-origin distances calculated for the target cluster at each of the preceding sequencing cycles 1 to C - 1, and the current sequencing cycle C;
- the sixth accumulated intensity correction parameter is the sum of distribution intensity-to-intensity error similarity measures calculated for the target cluster at each of the preceding sequencing cycles 1 to i - 1, and the current sequencing cycle i;
- the offset coefficient for the first intensity channel i.e.
- the feature extractor 740 is configured to determine intensity correction parameters for a cluster for each sequencing cycle.
- the feature extractor 740 is further configured to determine accumulated intensity correction parameters for each sequencing cycle, and subsequently determine the cluster coefficients 716 based on the accumulated intensity correction parameters.
- the current accumulated intensity correction parameters can be calculated for the target cluster based on accumulating the preceding accumulated intensity correction parameters with the current intensity correction parameters.
- One example of the accumulation is summing, as defined in the present embodiment.
- the current accumulated intensity correction parameters are calculated by summing the accumulated intensity correction parameters for the preceding cycle with the current intensity correction parameters on an intensity correction parameter basis. For example, the first accumulated intensity correction parameter for the twenty-fourth sequencing cycle is the sum of the first accumulated intensity correction parameter for the twenty-third sequencing cycle and the first intensity correction parameter for the twenty-fourth sequencing cycle.
- the accumulation may be an average of intensity correction parameters for all previous sequencing cycles and the current sequencing cycle.
- each of the current accumulated intensity correction parameters for a sequencing cycle c can be calculated by multiplying the respective accumulated intensity correction parameter for the previous sequencing cycle by (c-1), adding the current respective intensity correction parameter, and dividing by the current cycle number.
- the current accumulated intensity correction parameters are cached in a memory during the sequencing run and are accumulated with next intensity correction parameters for the target cluster to generate next accumulated intensity correction parameters.
- the current accumulated intensity correction parameters are stored in a quantized fixed bit width format. For example, one or two bytes can be used to store each preceding accumulated intensity correction parameter in the current accumulated intensity correction parameters.
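The two accumulation schemes described above (summing, and a running average) can be sketched as follows. The function names and the list-of-floats representation are assumptions made for illustration; quantized fixed bit width storage is not modelled here.

```python
def accumulate_sum(previous, current):
    """Running sum: previous accumulated values plus the current parameters."""
    return [a + p for a, p in zip(previous, current)]


def accumulate_mean(previous, current, cycle):
    """Running mean for the 1-based cycle number `cycle`.

    Multiplies the previous mean by (cycle - 1), adds the current parameter
    and divides by the current cycle number, parameter by parameter.
    """
    return [((cycle - 1) * a + p) / cycle for a, p in zip(previous, current)]


def run_accumulation(per_cycle_parameters):
    """Accumulate a whole run and return (sums, means) after the last cycle."""
    width = len(per_cycle_parameters[0])
    sums, means = [0.0] * width, [0.0] * width
    for cycle, params in enumerate(per_cycle_parameters, start=1):
        sums = accumulate_sum(sums, params)
        means = accumulate_mean(means, params, cycle)
    return sums, means
```

Only the previous accumulated values need to be cached between cycles, which is what makes the per-cycle update cheap in memory.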
- other calculations can be used to determine the cluster coefficients 716.
- the cluster coefficients may include a weighting function.
- the weighting function can be used to attenuate the variation correction coefficients in initial sequencing cycles of the sequencing run, and to amplify the variation correction coefficients in later sequencing cycles of the sequencing run.
- the weighting function works as follows. First, an initial amplification coefficient and initial offset coefficients are initialized.
- the initial amplification coefficient is initialized at a first sequencing cycle of the sequencing run with a predetermined value (e.g., “1”) and the initial offset coefficients are initialized at the first sequencing cycle with a predetermined value (e.g., “0”).
- the weighting function combines (e.g., sums) the initial amplification coefficient with the amplification coefficient determined by the least-squares approach, and combines the initial first and second offset coefficients with the first and second offset coefficients determined by the least-squares approach, such that the amplification coefficient and the first and second offset coefficients are attenuated in the initial sequencing cycles and amplified in the later sequencing cycles.
- the expression 1 ⁇ ⁇ equals to “2.”
- the feature extractor is configured to determine the cluster coefficients 716 and input the cluster coefficients 716 to the non-linear ML model.
- the non-linear ML model receives cluster coefficients 716 from a storage or from another component of the sequencer.
- Base context 718 refers to the identity of bases preceding and/or succeeding a nucleobase incorporated into a cluster for a given sequencing cycle.
- Base context can be defined as a sequence of k nucleobases comprising a nucleobase X to be base called and the one or more preceding and/or succeeding nucleobases. It has been found that the error rate of a base call varies according to the base context.
- the base context 718 comprises a trimer context comprising the current base call and the two preceding base calls, i.e. KKX.
- the trimer context is GCA.
- the base context 718 comprises a downstream trimer context comprising the current base call and the two succeeding base calls, i.e. XKK.
- the base context 718 can comprise any other number of preceding and/or succeeding bases, for example KXK, KKXK, KKKXK, etc.
- the base context 718 can be determined by analysing the sequence of previous base calls for the cluster.
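As an illustration of the contexts described above, the sketch below extracts a k-mer around the base being called from a string of base calls. The function name, its parameters and the use of "N" to pad positions that fall outside the read are assumptions.

```python
def base_context(calls, index, upstream=2, downstream=0, pad="N"):
    """Return the context around calls[index].

    upstream=2, downstream=0 gives the KKX trimer context;
    upstream=0, downstream=2 gives the XKK downstream trimer context.
    Positions outside the read are filled with `pad` (an assumption).
    """
    first = index - upstream
    last = index + downstream
    return "".join(
        calls[i] if 0 <= i < len(calls) else pad
        for i in range(first, last + 1)
    )
```

For a read whose calls so far are `"TTGCA"`, `base_context("TTGCA", 4)` yields the trimer context `"GCA"`.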
- the feature extractor is configured to determine the base context 718 and input the base context 718 to the non-linear ML model.
- the non-linear ML model receives the base context 718 from a storage or from another component of the sequencer such as an intensity correction module.
- Spatial location The spatial location 720 comprises one or more features which relate to the location of the cluster on the flow cell and/or the movement of the cluster relative to the scanner of the sequencer. The location of the cluster on the flow cell can have predictive value since it has been found that measured intensity may vary across a flow cell due to variation in illumination across the flow cell.
- the spatial location 720 can comprise a value corresponding to the lane and/or tile to which the cluster belongs on the flow cell. Some sequencers utilise one or more TDI scanners relative to which the flow cell moves during imaging.
- the spatial location 720 can comprise features relating to the movement of the tile, tile region, or lane to which the cluster belongs along the time axis relative to the scanner, and/or the direction of movement of the tile, tile region, lane, or flow cell relative to the scanner.
- the feature extractor is configured to determine the spatial location 720 and input the spatial location 720 to the non-linear ML model.
- the non-linear ML model receives the spatial location 720 from a storage or from another component of the sequencer.
- Upstream Gs, GGs, CCs, GGG, GCs It has been found that positions in a read with low base quality values tend to have an over-representation of G containing motifs, such as GGG, CGG or GGC.
- the number of upstream Gs, GGs, CCs, GGGs and/or GCs 722 can comprise the number of G bases occurring upstream of a nucleotide for a given sequencing cycle and/or the number of two consecutive Gs (GGs) upstream of a nucleotide for a given sequencing cycle and/or the number of two consecutive Cs (CCs) upstream of a nucleotide for a given sequencing cycle and/or the number of G bases subsequently followed by a C base (GCs) upstream of a nucleotide for a given sequencing cycle.
- the number of upstream Gs, GGs, CCs and/or GCs 722 can be determined by analysing the sequence of previous base calls for the cluster.
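A minimal sketch of these counts is given below. It assumes that overlapping occurrences are counted (so "GGG" contains two GGs); the source does not state whether overlaps are counted, and the function names are illustrative.

```python
def count_motif(sequence, motif):
    """Count occurrences of `motif` in `sequence`, overlaps included."""
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


def upstream_motif_counts(calls, index, motifs=("G", "GG", "CC", "GGG", "GC")):
    """Count each motif in the base calls that precede position `index`."""
    upstream = calls[:index]
    return {motif: count_motif(upstream, motif) for motif in motifs}
```

Because only `calls[:index]` is inspected, the counts depend solely on base calls made before the current sequencing cycle.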
- the feature extractor is configured to determine the number of upstream Gs, GGs, CCs and/or GCs 722 and input the number of upstream Gs, GGs, CCs and/or GCs 722 to the non-linear ML model.
- the non-linear ML model receives the number of upstream Gs, GGs, CCs and/or GCs 722 from a storage or from another component of the sequencer.
- GC bias The GC bias 724 is an estimate of the fraction of GCs (i.e. a G base followed immediately by a C base) in a read. It may be calculated as (#GCs in entire read) / (read length), or as (#GCs in entire insert) / (combined length of first and second reads of a paired-end read).
- the GC bias 724 can be determined by analysing the sequence of previous base calls for the cluster.
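The two ratios given for the GC bias can be written out as below. The function names are assumptions, and the paired-end variant simply takes the insert sequence and the two read lengths as inputs without modelling how the insert is obtained.

```python
def count_gc(sequence):
    """Number of positions where a G is immediately followed by a C."""
    return sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == "GC")


def gc_bias(read):
    """(#GCs in entire read) / (read length)."""
    return count_gc(read) / len(read) if read else 0.0


def gc_bias_paired(insert, read1_length, read2_length):
    """(#GCs in entire insert) / (combined length of first and second reads)."""
    combined = read1_length + read2_length
    return count_gc(insert) / combined if combined else 0.0
```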
- the feature extractor is configured to determine the GC bias 724 and input the GC bias 724 to the non-linear ML model.
- the non-linear ML model receives the GC bias 724 from a storage or from another component of the sequencer.
- Homopolymer length and distance A homopolymer is a repeating sequence of the same base and its presence in a read can sometimes be associated with an error.
- the homopolymer length 726 can correspond to the length of the longest homopolymer upstream of the incorporated nucleobase.
- the homopolymer distance 728 can correspond to the upstream distance between the incorporated nucleobase and the start of the homopolymer.
- the homopolymer length 726 and homopolymer distance 728 can be determined by analysing the sequence of previous base calls for the cluster.
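The two homopolymer features can be derived from the previous base calls roughly as follows. The minimum run length of 2, the tie-breaking rule (the run closest to the current base wins) and the use of `None` when no homopolymer is present are assumptions of this sketch.

```python
def homopolymer_features(calls, index, min_run=2):
    """Return (length, distance) for the longest homopolymer before `index`.

    `distance` is measured from the current position back to the start of
    that homopolymer. Returns (0, None) if no run reaches `min_run`.
    """
    upstream = calls[:index]
    best_length, best_start = 0, None
    i = 0
    while i < len(upstream):
        j = i
        while j < len(upstream) and upstream[j] == upstream[i]:
            j += 1  # extend the current run of identical bases
        run = j - i
        if run >= min_run and run >= best_length:
            best_length, best_start = run, i
        i = j
    distance = index - best_start if best_start is not None else None
    return best_length, distance
```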
- the feature extractor is configured to determine the homopolymer length 726 and/or homopolymer distance 728 and input the homopolymer length 726 and/or homopolymer distance 728 to the non-linear ML model.
- the non-linear ML model receives the homopolymer length 726 and/or homopolymer distance 728 from a storage or from a component of the sequencer.
- sample preparation 730 can comprise one or more features relating to the sample preparation method that is used to prepare a sample for sequencing, such as the fragmentation method and/or the type of library preparation kit used. Further features relating to the sample preparation 730 can correspond to either paired-end or single-end sequencing.
- the non-linear ML model receives the sample preparation 730 from a storage or from a component of the sequencer.
- Polyclonality of the cluster It may occur that a single cluster comprises non-identical template strands, which results in noise in the intensity signal obtained from the cluster. The polyclonality of the cluster 732 is an estimate of this variation between the respective template strands within the cluster and can be determined based on the intensity data obtained from the cluster.
- the feature extractor is configured to determine the polyclonality of the cluster 732 and input the polyclonality of the cluster 732 to the non-linear ML model.
- the non-linear ML model receives the polyclonality of the cluster 732 from a storage or from a component of the sequencer.
- Cycle index The cycle index 734 corresponds to the sequencing cycle number for the current sequencing cycle.
- the feature extractor is configured to access the cycle index 734 from a storage or from a component of the sequencer.
- the plurality of features associated with cluster 702 can further comprise the determinant of the covariance matrix which is generated by the expectation maximization method.
- Read index Paired-end sequencing refers to the sequencing of both ends of a fragment.
- the read index 736 corresponds to which read of the pair (i.e. either the first or second read) the cluster corresponds to.
- the feature extractor is configured to determine the read index 736 and input the read index 736 to the non-linear ML model.
- the non-linear ML model receives the read index 736 from a storage or from a component of the sequencer.
- Determinant of the covariance matrix As described with reference to base calling, when two-channel sequencing is performed, a Gaussian can be evaluated for each of the four bases based on the expectation maximization algorithm.
- Each of the Gaussians comprises a respective mean $\mu$ and covariance $\Sigma$.
- a covariance matrix can be generated which corresponds to the size and orientation of the respective Gaussian.
- the determinant of the covariance matrix for a respective base provides a measure of how dispersed the intensity distribution corresponding to the respective base is on a scatter plot.
- the determinant of the covariance matrix 738 can comprise a determinant of the covariance matrix for each of the Gaussians corresponding to a respective base.
- the determinant of the covariance matrix 738 can comprise the determinant of the covariance matrix of the Gaussian corresponding to the called base only.
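To show why the determinant acts as a dispersion measure, the sketch below computes a 2x2 covariance matrix from two-channel intensity points and takes its determinant. Using the sample covariance of the points is an assumption made so that the example is self-contained; in the text the covariance comes from the Gaussians fitted by expectation maximization.

```python
def covariance_2d(points):
    """Sample covariance matrix of (channel 1, channel 2) intensity pairs."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def determinant_2x2(matrix):
    """Determinant of a 2x2 matrix; larger values mean a more dispersed cloud."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
```

A tight, well-separated intensity cloud gives a small determinant, while points that collapse onto a line give a determinant of zero.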
- the feature extractor is configured to determine the determinant of the covariance matrix 738 and input the determinant of the covariance matrix 738 to the non-linear ML model.
- the non-linear ML model receives the determinant of the covariance matrix 738 from a storage or from a component of the sequencer.
- the base call may be determined using any suitable base calling method such as any one of the methods previously described.
- the method 1100 comprises, at 1110, accessing a plurality of features associated with the cluster.
- Each of the plurality of features associated with the cluster 1102 can be any feature related to, or which can be derived from, the cluster, such as: properties, statistics or metrics related to or derived from the intensity data obtained from the cluster; properties, statistics or metrics related to or derived from a base call associated with the cluster; properties, statistics or metrics related to or derived from the template strands in the cluster; the sample preparation method used to obtain the cluster; the arrangement of the cluster on the flow cell; and cycle index.
- Each of the plurality of features associated with the cluster 1102 can be any combination of features described above with reference to Figure 7.
- Each of the plurality of features accessed at 1110 can be associated with any one or more of the sequencing cycles of the sequencing run. In some embodiments, each of the plurality of features is associated with the current sequencing cycle and/or at least one of a preceding and/or succeeding cycle.
- one or more features associated with the cluster are determined based on the plurality of features accessed at 1110. For example, one or more statistics and/or metrics can be derived from the intensity data and/or base call accessed at 1110 and can be processed to result in one or more additional features associated with the cluster.
- the method 1100 further comprises, at 1130, determining the quality of the base call with a non-linear machine learning model.
- the plurality of features accessed at 1110, and optionally the one or more features determined at 1120, are used as input to the non-linear ML model and a quality of the base call is output by the non-linear ML model.
- the quality score may be assigned to a bin. For example, a set of bins, e.g. with ranges (0-5, 5-10, 10-15, 15-20, 20-25, 25-30, 30-35, 35-40, 40-45, 45-50, 50-55, 55-60), may be determined and the quality score can be assigned to the corresponding bin.
- the quality score is represented as a floating point number with a value between 0.0 and 1.0, where 1.0 is complete confidence in the base call and 0.0 is very low confidence.
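The binning described above can be sketched as follows, together with the standard Phred conversion Q = -10 * log10(P_error). The cap at Q60, the rule that a score lying exactly on a bin edge goes to the upper bin, and the function names are assumptions.

```python
import math


def phred_from_error(p_error, q_max=60.0):
    """Phred-scaled quality from an error probability, capped at q_max."""
    if p_error <= 0.0:
        return q_max
    return min(q_max, -10.0 * math.log10(p_error))


def assign_bin(q, width=5, q_max=60):
    """Return the (lower, upper) edges of the bin containing quality `q`.

    Scores below 0 fall in the first bin and scores at or above q_max fall
    in the last bin.
    """
    clamped = min(max(q, 0.0), q_max - 1e-9)
    lower = int(clamped // width) * width
    return lower, lower + width
```

For instance, an error probability of 0.001 corresponds to Q30, which this sketch places in the 30-35 bin.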
- Training The non-linear ML model can be trained based on a dataset comprising ground truth data and associated input features.
- the ground truth data comprises known polynucleotide sequences and corresponding sequences of base calls obtained from sequencing clusters of the respective known polynucleotide sequence portions.
- the associated input features comprise a plurality of features associated with the cluster for each of the base called nucleobases.
- a plurality of non-linear ML models are defined, each of the respective models corresponding to a different combination of input features.
- the best performing non-linear ML model can be selected.
- the dataset can be split into a training dataset, a validation dataset and a test dataset.
- the training dataset can be used to learn the underlying patterns and non-linear relationships between a plurality of features associated with a cluster and a quality score for each of the plurality of non-linear ML models.
- the validation dataset can be used to compare the relative performance of each of the plurality of non-linear ML models and select the best performing non-linear ML model.
- the test dataset can be used to assess the performance of the selected non-linear ML model.
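The split-and-select workflow above can be sketched in a few lines. Everything named here is hypothetical: the split fractions, the fixed seed, and the representation of each candidate model as a plain callable that maps input features to a predicted quality. Training itself is omitted.

```python
import random


def split_dataset(examples, train_fraction=0.8, validation_fraction=0.1, seed=0):
    """Shuffle and split examples into (training, validation, test) lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_fraction)
    n_validation = int(len(items) * validation_fraction)
    training = items[:n_train]
    validation = items[n_train:n_train + n_validation]
    test = items[n_train + n_validation:]
    return training, validation, test


def select_best(models, validation, loss):
    """Return the name of the model with the lowest mean validation loss.

    `models` maps a name to a predict(features) callable and `validation`
    is a list of (features, ground_truth) pairs.
    """
    def mean_loss(predict):
        return sum(loss(predict(x), y) for x, y in validation) / len(validation)

    return min(models, key=lambda name: mean_loss(models[name]))
```

The test split is deliberately not touched by `select_best`, so it remains available for an unbiased assessment of the selected model.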
- a non-linear ML model can be trained to minimise a loss function.
- the loss function evaluates a difference which is determined based on the ground truth data and the model predictions.
- the ML model is trained to minimise a loss function based upon predicted quality scores determined by the ML model based upon an input plurality of features and ground truth quality scores.
- the parameters of the non-linear ML model are updated by back propagation. As more steps of the back propagation algorithm are applied the loss function decreases towards a minimum.
- the non-linear ML model can automatically improve accuracy by selecting resolution details to minimize the loss function.
- features or regions can be split at any particular scale based on interactions between those features or regions which minimise the loss function. This compares to a Phred based model, where there are constraints on the size and orientation of each lookup bin. To prevent overfitting, regularization and/or early stopping can be used. In some embodiments, a Gradient Boosting algorithm is used to generate the non-linear ML model.
- the dataset can be derived from several tiles of one or more flow cells. The dataset can be derived from 100s of millions, or billions, of clusters. In some embodiments, the sequencing of the known polynucleotide sequence portions is performed using a PCR-free library.
- the quality scoring method 1100 is performed on the selected non-linear ML model.
- the selected non-linear ML model is converted to a lookup table and the quality scoring method 1100 is performed using the lookup table.
- the lookup table is an array that maps respective pluralities of features associated with a cluster to a quality score based on the input/output relationship defined by the non-linear ML model.
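One way to picture the conversion of a trained model into a lookup table is shown below. The quantisation of each feature into bins, the evaluation of the model at bin centres, and the clamping of out-of-range values to the outermost bins are all assumptions; `model` is a hypothetical callable standing in for the selected non-linear ML model.

```python
from itertools import product


def build_lookup(model, edges_per_feature):
    """Tabulate `model` at the centre of every combination of feature bins.

    `edges_per_feature` holds one sorted list of bin edges per feature.
    """
    centres = [
        [(low + high) / 2.0 for low, high in zip(edges, edges[1:])]
        for edges in edges_per_feature
    ]
    table = {}
    for index in product(*(range(len(c)) for c in centres)):
        point = [centres[f][i] for f, i in enumerate(index)]
        table[index] = model(point)
    return table


def bin_index(value, edges):
    """Index of the bin containing `value`, clamped to the outermost bins."""
    for i in range(len(edges) - 1):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def lookup(table, edges_per_feature, features):
    """Map a feature vector to its tabulated quality score."""
    index = tuple(bin_index(v, e) for v, e in zip(features, edges_per_feature))
    return table[index]
```

The table size grows as the product of the bin counts per feature, so in practice the number of features and bins would need to be chosen with memory in mind.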
- Figure 12 is a table of the predicted quality scores based on either the non-linear ML method 1100 described herein (ML) or a conventional quality scoring method that is currently implemented using a Phred-based lookup table (Phred).
- the quality scores for each method are determined based on the same set of sequencing data which is obtained using a paired-end sequencing method.
- Read 1 corresponds to sequencing performed in a first (forward) direction on clusters based on a polynucleotide sequence portion
- read 2 corresponds to sequencing performed in a second (reverse) direction, on clusters based on the polynucleotide sequence portion.
- the non-linear ML model receives a plurality of features associated with the cluster 702 as input which comprises the intensity 704 comprising the fully corrected intensity in each of the first and second channel, conditional probabilities 708 comprising the log likelihoods for each nucleobase, the chastity 710, the signal-to-noise ratio 712, the base context 718 comprising the two preceding base calls and the current base call, spatial location 720 comprising a lane number, and the cycle index 734. It can be seen that when compared to a Phred based model, the ML model improves on the percentage of base calls assigned a quality score greater than Q30, Q35 and Q40.
- Figures 13A-13C are scatterplots of fully corrected intensity values each corresponding to a base called nucleobase from a respective cluster. A value is assigned to each of the intensity values according to the determined quality score obtained using method 1100.
- Each of Figures 13A-13C relates to a different cluster: Figure 13A comprises data points from a cluster with a signal-to-noise ratio of 7-10 dB, Figure 13B comprises data points from a cluster with a signal-to-noise ratio of 10-13 dB, and Figure 13C comprises data points from a cluster with a signal-to-noise ratio of 7-10 dB.
- conditional probabilities 708 comprising a likelihood that the intensity data belongs to one of the four nucleobase types A, C, G or T are used as input to the non-linear ML model. It can be seen that compared to using a chastity metric alone, a non-linear ML model which receives the likelihoods as input features is better able to decrease quality scores near the decision boundaries between respective bases and increase quality scores far from decision boundaries. This shows that the likelihoods are a suitable input feature to the non-linear ML model.
- the method comprises, at 1410, accessing intensity data for a current sequencing cycle of a sequencing run, wherein the intensity data comprises a first signal obtained based upon a respective first nucleobase of at least one first polynucleotide sequence portion.
- the intensity data is fully corrected intensity data as described above.
- the method further comprises, at 1420, accessing a combined probability distribution provided by a mixture model comprising a plurality of components, each of the plurality of components corresponding to an intensity distribution associated with a respective nucleobase type.
- the mixture model is a Gaussian mixture model where each of the components of the mixture model is a Gaussian distribution corresponding to a respective nucleobase A, C, G or T.
- the mean $\mu$ and covariance $\Sigma$ of each of the Gaussian distributions can be determined by fitting four Gaussians with a respective mean $\mu$ and covariance $\Sigma$ to a plurality of intensity values from a plurality of clusters obtained from a sequencing run.
- the Gaussian distributions can be fit using the EM algorithm, which has been previously described in this application in relation to base calling.
- the parameters of the Gaussian distributions are determined by a base calling module of the sequencer and stored in a memory of the sequencer.
- the method further comprises, at 1430, accessing a probability of a sample preparation error.
- the sample preparation error corresponds to a probability that a mutation introduced into the template strands was caused by a chemistry error during the preparation of the sample prior to sequencing by synthesis.
- the probability of a sample preparation error is a constant value irrespective of the base call.
- the probability of a sample preparation error is dependent upon the base call.
- the method further comprises, at 1440, determining a quality of the base call for the respective first nucleobase based upon the combined probability distribution and the intensity data. In some embodiments, the quality of the base call is determined based upon a posterior probability that the first nucleobase is a selected nucleobase type given the intensity data.
- the selected nucleobase type is selected by calculating a posterior probability for each respective nucleobase type and selecting the nucleobase type associated with the highest posterior probability. For example, in some embodiments, a posterior probability is determined for each of the four respective nucleobases based on the combined probability distribution and the intensity data. From the four calculated posterior probabilities, the posterior probability with the highest value is selected. In some embodiments, the base call is the selected nucleobase type. Since the selected nucleobase type corresponds to the highest posterior probability, the selected nucleobase type is therefore statistically the most likely identity of the nucleobase given the value of intensity data.
- the quality of the base call can be evaluated based on a first model which determines the probability of a base calling error based on the posterior probability associated with the selected nucleobase type. The quality score can then be determined based on the probability of a base calling error. In embodiments where a probability of a sample preparation error is accessed, the quality of the base call is further based upon the probability of a sample preparation error according to a second model. The quality score can then be determined based on the probability of a base calling error. The determination of a base calling quality based on a mixture model will now be described in further detail.
- the mixture model is a probability density function of intensity values based on k components, where each of the k components correspond to an intensity distribution associated with a respective nucleobase type, i.e. one of A, C, G or T.
- the intensity values are two-channel intensity values and each of the components is a multivariate Gaussian distribution $\mathcal{N}$ with mean $\mu_k$ and covariance $\Sigma_k$.
- the values of parameters $\mu_k$, $\Sigma_k$ and $\pi_k$ may provide the combined probability distribution accessed at step 1420 described above.
- the values of the parameters $\mu_k$, $\Sigma_k$ and $\pi_k$ can be estimated using an expectation maximization algorithm. In some embodiments, this estimate is performed by a base calling module of a sequencer.
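By way of a non-limiting illustration, the expectation maximization estimate described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the base calling module's implementation: the covariance of each component is assumed to be diagonal for brevity, and the function names (`gauss_diag`, `em_fit`), the synthetic intensities, the initial means and the iteration count are all hypothetical.

```python
import math
import random

def gauss_diag(x, mean, var):
    """Density of a 2-D Gaussian with diagonal covariance (a simplifying assumption)."""
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    norm = 2.0 * math.pi * math.sqrt(var[0] * var[1])
    return math.exp(-0.5 * (d0 * d0 / var[0] + d1 * d1 / var[1])) / norm

def em_fit(points, initial_means, n_iter=50, min_var=1e-4):
    """Estimate mixing weights, means and diagonal variances of a k-component mixture."""
    k = len(initial_means)
    means = [list(m) for m in initial_means]
    variances = [[0.05, 0.05] for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each intensity value.
        resp = []
        for x in points:
            p = [weights[j] * gauss_diag(x, means[j], variances[j]) for j in range(k)]
            total = sum(p) or 1e-300
            resp.append([v / total for v in p])
        # M-step: re-estimate the parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(points)
            for d in range(2):
                means[j][d] = sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                var = sum(r[j] * (x[d] - means[j][d]) ** 2
                          for r, x in zip(resp, points)) / nj
                variances[j][d] = max(var, min_var)
    return weights, means, variances

# Synthetic two-channel intensities around four illustrative cluster centres.
random.seed(0)
true_centres = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
points = [(random.gauss(cx, 0.08), random.gauss(cy, 0.08))
          for cx, cy in true_centres for _ in range(200)]
initial = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
weights, means, variances = em_fit(points, initial)
```

A full covariance matrix per component could be substituted for the diagonal variances without changing the structure of the two steps.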
- a posterior probability that a nucleobase is one of the four nucleobase types given intensity data x can be calculated by considering a latent variable z.
- the latent variable z maps an intensity value x to one of the four components by functioning as a one-hot vector.
- z indicates the membership of an intensity value x to one of the four components.
- $k = 4$
- one of the entries of z has a value of 1 and the remainder of the entries have a value of 0.
- $z = (0, 0, 1, 0)$, i.e.
- the posterior probability that a nucleobase is one of the four nucleobase types given intensity data x, $p(z_k = 1 \mid x)$, can be defined as:
- Continuing with the previous example, in order to calculate a posterior probability that a nucleobase is nucleobase type C (i.e.
- the posterior probability can be evaluated by calculating:
- $p(z_1 = 1 \mid x)$, $p(z_2 = 1 \mid x)$, $p(z_3 = 1 \mid x)$ and $p(z_4 = 1 \mid x)$
- $\gamma(z_k)$ will now be used to represent the posterior probabilities.
- the mapping between each of the components and a respective nucleobase type is arbitrary and any one-to-one mapping is suitable.
- the base call corresponds to the selected nucleobase type.
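- the probability of a base calling error is equal to $1 - \gamma(z_k)$, where $\gamma(z_k)$ is the posterior probability associated with the selected nucleobase type.
- $\gamma(z_k)$ is evaluated using the softmax operation when a uniform prior is used.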
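By way of illustration only, the first model might be sketched as below: the posterior for each component is obtained by a softmax over the component log-likelihoods (which is Bayes' rule under a uniform prior), the base call is the nucleobase type with the highest posterior, and the error probability is one minus that posterior. The component means and variances, the mapping of components to nucleobase types, the diagonal covariance and the error floor are assumptions made for this sketch. The conversion of an error probability P to a Phred quality score, Q = -10 log10(P), is the standard definition.

```python
import math

# Hypothetical two-channel components: base -> (mean, diagonal variance).
COMPONENTS = {
    "G": ((0.0, 0.0), (0.01, 0.01)),
    "T": ((0.0, 1.0), (0.01, 0.01)),
    "C": ((1.0, 0.0), (0.01, 0.01)),
    "A": ((1.0, 1.0), (0.01, 0.01)),
}

def log_gauss(x, mean, var):
    """Log-density of a 2-D Gaussian with diagonal covariance (assumption)."""
    return -0.5 * sum((xi - mi) ** 2 / vi + math.log(2.0 * math.pi * vi)
                      for xi, mi, vi in zip(x, mean, var))

def posteriors(x, components=COMPONENTS):
    """Softmax over log-likelihoods, i.e. Bayes' rule with a uniform prior."""
    logs = {b: log_gauss(x, mean, var) for b, (mean, var) in components.items()}
    peak = max(logs.values())  # subtract the maximum for numerical stability
    exps = {b: math.exp(v - peak) for b, v in logs.items()}
    total = sum(exps.values())
    return {b: v / total for b, v in exps.items()}

def call_base(x, components=COMPONENTS, floor=1e-10):
    """Return the base call, its error probability and a Phred-scaled score."""
    post = posteriors(x, components)
    base = max(post, key=post.get)
    p_error = 1.0 - post[base]
    quality = -10.0 * math.log10(max(p_error, floor))  # floor avoids log(0)
    return base, p_error, quality

corner_call = call_base((0.95, 0.98))   # close to one component only
boundary_call = call_base((0.5, 1.0))   # midway between two components
```

An intensity value near a single component yields an error probability close to zero, whereas a value midway between two components yields an error probability close to 0.5.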
- Figure 15 illustrates a plot of quality scores determined using the first model for given intensity values x comprising a first-channel intensity value and a second-channel intensity value. It can be seen that each of the regions 1510, 1520, 1530 and 1540 assign a high quality score to the regions located towards the corners of each respective quadrant of the scatter plot. As has been previously described, it would be expected that intensity values that are located towards the corners of the scatter plot should correspond to reliable base calls since these intensity values are relatively near one of the intensity distributions and very far from the three remaining intensity distributions.
- the quality score is determined according to a second model which considers the probability of sample preparation error.
- the sample preparation error corresponds to an error that can arise during sample preparation and before the commencement of sequencing by synthesis.
- Figure 16 is a schematic of the sample preparation process from extracted DNA sample 1600 to clusters 1610, 1620 and 1630 which are ready for sequencing by synthesis. It can be seen that a mutation, such as an insertion, deletion or substitution of one or more nucleobases, can be introduced during the fragmentation, amplification, and hybridization and cluster generation steps. A mutation that arises in one step of the sample preparation process can permeate through the downstream strands.
- a mutation that arises during a current step is shaded in dark grey, e.g. 1640, and a mutation that is inherited from a previous step is shaded in light grey, e.g. 1650.
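- the final term, $P_{\text{SNR}} \cdot P_{\text{prep}}$ (the product of the probability of an SNR error and the probability of a sample preparation error), reflects a possible double counting of errors when an intensity value comprises both an SNR error and a sample preparation error. In practice, however, this term is negligible due to its small value.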
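As a hedged illustration of the double-counting term, the sketch below assumes that the second model combines the two error sources by inclusion-exclusion, i.e. the sum of the two probabilities minus their product. That form is inferred from the description of the final term rather than quoted from it, and the function names and numerical values are hypothetical.

```python
import math

def combined_error(p_snr, p_prep):
    """Probability that at least one of the two error sources affects the call.

    The subtracted product removes the double counting of cases in which
    both an SNR error and a sample preparation error occur.
    """
    return p_snr + p_prep - p_snr * p_prep

def phred(p_error, floor=1e-10):
    """Phred-scaled quality score, Q = -10 * log10(P)."""
    return -10.0 * math.log10(max(p_error, floor))

p_snr = 1e-6    # illustrative base calling (SNR) error probability
p_prep = 1e-4   # illustrative sample preparation error probability
p_total = combined_error(p_snr, p_prep)
q_first_model = phred(p_snr)
q_second_model = phred(p_total)
```

With these illustrative values the product term is 1e-10, several orders of magnitude below either input, and the sample preparation error keeps the second-model score near Q40 even though the first model alone would report Q60.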
- the probability of a sample preparation error, $P_{\text{prep}}$, can be set as a constant value irrespective of the base call of the nucleobase.
- $P_{\text{prep}}$ is greater than or equal to 0.00001 and less than or equal to 0.0001. In some embodiments, $P_{\text{prep}}$ is equal to 0.0001. In other embodiments, $P_{\text{prep}}$ is dependent on the base call. In such embodiments, $P_{\text{prep}}$ is represented as $P_{\text{prep}}(B)$, which corresponds to the probability, for a base call B, of any of the other three nucleobases transitioning to nucleobase B due to a mutation arising from sample preparation. $P_{\text{prep}}(B)$ can be determined from a 4x4 matrix T of base transition probabilities.
- the columns correspond to base called nucleobases and the rows correspond to the nucleobases that may transition to a base called nucleobase with some probability.
- the diagonal of the matrix T is zero and each of the remaining entries corresponds to a probability of the nucleobase represented by a respective row transitioning to a nucleobase represented by a respective column.
- the probability of a sample preparation error for a given base-called nucleobase B, $P_{\text{prep}}(B)$, can be calculated by summing over the jth column of matrix T, where j is the column index of matrix T corresponding to B.
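The column sum described above can be sketched as follows. The entries of the matrix are hypothetical placeholder values chosen only so that each column sum falls within the 0.00001 to 0.0001 range mentioned earlier, and the ordering of the nucleobases along the rows and columns is likewise an assumption.

```python
BASES = ("A", "C", "G", "T")

# Hypothetical transition probabilities: rows are the original nucleobase,
# columns are the base-called nucleobase, and the diagonal is zero.
T = [
    [0.0,  2e-5, 4e-5, 1e-5],
    [2e-5, 0.0,  1e-5, 4e-5],
    [4e-5, 1e-5, 0.0,  2e-5],
    [1e-5, 4e-5, 2e-5, 0.0],
]

def prep_error_probability(base_call, matrix=T):
    """Sum the column of T that corresponds to the base-called nucleobase."""
    j = BASES.index(base_call)
    return sum(row[j] for row in matrix)

p_prep_a = prep_error_probability("A")
```

Because the diagonal is zero, the sum only accumulates the probabilities of the three other nucleobases transitioning to the base-called nucleobase.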
- the probability of a sample preparation error is learnt during a calibration process, where a sequencing run is performed on a plurality of clusters comprising known polynucleotide sequence portions. A probability of a sample preparation error can be determined by comparing the base called sequences with the known polynucleotide sequences. Sample preparation errors occur before the clonal amplification stage and therefore all strands in a cluster will comprise the same incorrect base identity at a specific base if a sample preparation error has occurred.
- a sample preparation error will therefore result in what looks like a variant in the base called sequence.
- the probability of a sample preparation error is base specific and encoded by a matrix T
- the values of each of the non-diagonal coefficients of the matrix are learnt during the calibration run.
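One possible way of learning the non-diagonal coefficients during a calibration run is sketched below. This sketch assumes that the known and base-called sequences are already aligned position by position (no insertions or deletions) and that every observed substitution is attributable to sample preparation rather than to base calling. In practice the second assumption would need to be enforced, for example by considering only high-confidence base calls. The function name and example sequences are hypothetical.

```python
BASES = "ACGT"

def learn_transition_matrix(known_seqs, called_seqs):
    """Estimate T[i][j]: the fraction of known base i that was called as base j."""
    counts = [[0] * 4 for _ in range(4)]
    totals = [0] * 4
    for known, called in zip(known_seqs, called_seqs):
        for k, c in zip(known, called):
            i, j = BASES.index(k), BASES.index(c)
            totals[i] += 1
            if i != j:  # the diagonal stays zero
                counts[i][j] += 1
    return [[counts[i][j] / totals[i] if totals[i] else 0.0 for j in range(4)]
            for i in range(4)]

# Toy calibration data: one A in the first sequence was read as G.
T_learned = learn_transition_matrix(["AAAA", "CCCC"], ["AAAG", "CCCC"])
```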
- Figure 17 illustrates quality scores for intensity values x comprising a first-channel intensity value and a second-channel intensity value, calculated using the second model. It can be seen that each of the regions 1710, 1720, 1730 and 1740 assign a high quality score to roughly square-shaped areas of the scatter plot which are located towards the corners of the scatter plot. Compared to the first model, it can be seen that the estimated quality scores do not exceed empirical quality scores.
- FIG. 18 illustrates one possible arrangement of a cluster 1800 which allows for simultaneous sequencing.
- the cluster 1800 comprises multiple template strands, which can also be referred to as first polynucleotide sequence portions of interest 1801a and second polynucleotide sequence portions of interest 1801b.
- the cluster may be processed to determine sequence information based upon a signal obtained based upon the first polynucleotide sequence portions of interest 1801a and second polynucleotide sequence portions of interest 1801b.
- the template strands 1801 may be configured on substrate 1810 which may be a flow cell, which may be patterned or unpatterned.
- the substrate 1810 may be a patterned flow cell comprising a number of discrete nanowells 1811, with each well having a single respective sensor associated with the well.
- where a single sensor is associated with the well, signals from the two or more portions of interest cannot be resolved, irrespective of whether the different portions (or respective clusters) are spatially resolved within the well.
- Two or more polynucleotide sequence portions of interest contained within a single well in this way is sometimes referred to herein as a “cluster” irrespective of whether the different portions are spatially resolved in the well given that light emissions from such a well form a single combined signal.
- Simultaneous sequencing information from the first and second polynucleotide sequence portion of interest 1801a, 1801b can be obtained using first primers 1802a specific to the first portion 1801a, or to a region 1803a adjacent to the first portion, and second primers 1802b specific to the second portion 1801b, or to a region 1803b adjacent to the second portion, in the same reaction run.
- the first and second sequence portions 1801a, 1801b may be flanked at one or both ends by respective primer binding sites 1803a, 1803b having a known sequence. Sequencing primers 1802a, 1802b specific to the different primer binding sites 1803a, 1803b can therefore be designed and used for simultaneous sequencing of the two sequence portions 1801a, 1801b.
- Base calling from simultaneous sequencing data
- For a cluster comprising multiple copies of first and second different polynucleotide sequence portions there are sixteen possible combinations of nucleobases at any given position (i.e., an A in the first sequence portion and an A in the second sequence portion, an A in the first sequence portion and a T in the second sequence portion, and so on).
- the light emissions associated with each portion during the relevant base calling cycle will be characteristic of the same nucleobase.
- the cluster behaves as a cluster containing only one sensed sequence portion, and the identity of the bases at that position are uniquely callable.
- the light emissions associated with each portion in the relevant base calling cycle will be characteristic of different nucleobases.
- the fluorescent signal coming from the collection of extended first portion sequencing primers 1802a have substantially the same intensity as the fluorescent signal coming from the collection of extended second portion sequencing primers 1802b in the same cluster.
- the two signals may also be co-localized, and may not be optically resolved. Therefore, when different nucleobases are present at corresponding positions of the sequence portions, the identity of the nucleobases cannot be uniquely called from the combined signal alone. However, it is demonstrated herein that useful sequencing information can still be determined from these signals.
- the scatter plot of Figure 19 shows nine distributions (or bins) of intensity values from the combination of two co-localized signals of substantially equal intensity (e.g.
- the intensity values shown in Figure 19 may be up to a scale or normalization factor; the units of the intensity values may be arbitrary or relative (i.e., representing the ratio of the actual intensity to a reference intensity).
- the sum of the signal from the extended first portion primers 1802a and the signal from the extended second portion primers 1802b results in a combined signal.
- the combined signal may be captured by the first optical channel and the second optical channel.
- the computer system can map the combined signal from a cluster into one of the nine bins, for example by fitting one or more Gaussian mixture models (GMMs), and thus determine sequence information relating to the added nucleobase at the extended first portion primers 1802a and the added nucleobase at the extended second portion primers 1802b.
- Bins are selected based upon the combined intensity of the signals originating from each sequence portion sensed during the base calling cycle. For example, bin 1903 may be selected following the detection of a high-intensity (or “on/on”) signal in the first channel and a high-intensity signal in the second channel.
- Bin 1906 may be selected following the detection of a high-intensity signal in the first channel and an intermediate-intensity (“on/off” or “off/on”) signal in the second channel.
- Bin 1909 may be selected following the detection of a high-intensity signal in the first channel and a low-intensity or zero-intensity (“off/off”) signal in the second channel.
- Bin 1902 may be selected following the detection of an intermediate-intensity signal in the first channel and a high-intensity signal in the second channel.
- Bin 1905 may be selected following the detection of an intermediate-intensity signal in the first channel and an intermediate-intensity signal in the second channel.
- Bin 1908 may be selected following the detection of an intermediate-intensity signal in the first channel and a low-intensity or zero-intensity signal in the second channel.
- Bin 1901 may be selected following the detection of a low-intensity signal in the first channel and a high-intensity signal in the second channel.
- Bin 1904 may be selected following the detection of a low-intensity or zero-intensity signal in the first channel and an intermediate-intensity signal in the second channel.
- Bin 1907 may be selected following the detection of a low-intensity or zero-intensity signal in the first channel and a low-intensity signal in the second channel.
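The bin selection enumerated above follows a regular pattern, which the sketch below captures. It replaces the Gaussian mixture model fitting mentioned earlier with fixed thresholds on normalized intensities purely for brevity. The thresholds (0.25 and 0.75), the normalization to the range 0 to 1 and the function names are assumptions made for illustration.

```python
LOW, MID, HIGH = 0, 1, 2

def level(value, low_cut=0.25, high_cut=0.75):
    """Classify a normalized channel intensity as low, intermediate or high."""
    if value < low_cut:
        return LOW
    if value < high_cut:
        return MID
    return HIGH

def select_bin(first_channel, second_channel):
    """Map a combined two-channel intensity to one of the bins 1901 to 1909.

    The first-channel level advances the bin number by one, and each step
    down in the second-channel level advances it by three, reproducing
    the nine bin assignments listed above.
    """
    return 1901 + level(first_channel) + 3 * (HIGH - level(second_channel))

example_bin = select_bin(1.0, 0.5)  # high first channel, intermediate second channel
```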
- the computer processor may detect a match between the first and second sequence portions at the sensed position.
- the computer processor may base call the respective nucleobases. For example, when the combined signal is mapped to bin 1901 for a base calling cycle, the computer processor base calls both the added nucleobase at the extended first portion primers 1802a and the added nucleobase at the extended second portion primers 1802b as T.
- bins 1902, 1904, 1906, and 1908 each represent two possible combinations of first and second nucleobases.
- Bin 1905, meanwhile, represents four possible combinations.
- mapping the combined signal to an ambiguous bin may still allow for sequencing information to be determined.
- bins 1902, 1904, 1905, 1906, and 1908 represent mismatches between respective nucleobases of the two polynucleotide sequence portions sensed during the cycle. Therefore, in response to mapping the combined signal to a bin representing a mismatch, the computer processor may detect a mismatch between the first and second polynucleotide sequence portions at the sensed position.
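The sequencing information recoverable from each bin, as described above, can be summarized in a short sketch. The grouping of bins follows the text: the four bins not listed as mismatches are treated as matches. The function name and the returned tuple format are hypothetical.

```python
MATCH_BINS = {1901, 1903, 1907, 1909}     # one base combination each
TWO_WAY_BINS = {1902, 1904, 1906, 1908}   # two possible combinations each
FOUR_WAY_BINS = {1905}                    # four possible combinations

def describe_bin(bin_id):
    """Return (match or mismatch, number of base combinations the bin represents)."""
    if bin_id in MATCH_BINS:
        return ("match", 1)
    if bin_id in TWO_WAY_BINS:
        return ("mismatch", 2)
    if bin_id in FOUR_WAY_BINS:
        return ("mismatch", 4)
    raise ValueError(f"unknown bin: {bin_id}")

# The nine bins together account for all sixteen nucleobase combinations.
total_combinations = sum(describe_bin(b)[1] for b in range(1901, 1910))
```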
- the number of classifications which may be selected based upon the combined signal intensities may be predetermined, for example based on the number of sequence portions of interest expected to be present in the nucleic acid cluster. Whilst Figure 19B shows a set of nine possible classifications, the number of classifications may be greater or smaller.
- the cluster is configured such that the signal from one of the portions 1801b is approximately the same as the signal from the other of the portions 1801a. In other embodiments, the cluster may be configured such that the signal from one of the portions 1801b is diminished relative to the other of the portions 1801a.
- the number of classifications will vary according to the ratio of the signal from one of the portions 1801b to the other of the portions 1801a, up to a maximum of sixteen classifications for a cluster comprising first and second polynucleotide sequence portions of interest.
- Quality scores for simultaneous sequencing
- Accordingly, in embodiments where sequencing data corresponds to a simultaneous sequencing run, a quality score is determined for a combined base call $(B_1, B_2)$ of a respective first nucleobase and a respective second nucleobase.
- the intensity data accessed at step 1410 of Figure 14 described above is therefore a combined intensity of said first signal obtained based upon the incorporation of said respective first nucleobase into a plurality of first polynucleotide sequence portions forming a cluster and a second signal obtained based upon the incorporation of a respective second nucleobase into a plurality of second polynucleotide sequence portions forming the cluster.
- the combined probability distribution accessed at step 1420 of Figure 14 described above therefore relates to a combined probability distribution for intensity values associated with the respective first and second nucleobases.
- Each of the components of the combined probability distribution corresponds to an intensity distribution associated with a respective combination of first and second nucleobases.
- the combined probability distribution comprises nine components.
- where the number of classifications is greater or smaller, the number of components will vary accordingly.
- the combined probability distribution can comprise sixteen components.
- the sample preparation error optionally accessed at step 1430 of Figure 14 described above can be a constant value.
- the probability of a sample preparation error is base-combination specific. For example, for a combined base call $(B_1, B_2)$, the probability of a sample preparation error is calculated based on a probability of any of the three nucleobases other than nucleobase $B_1$ transitioning to $B_1$, i.e.
- the probability of a base-combination specific sample preparation error can be represented as $P_{\text{prep}}(B_1, B_2)$. This probability can be the sum of the probability of a mutation in a nucleobase of the first polynucleotide sequence portion, the probability of a mutation in a nucleobase of the second polynucleotide sequence portion, and the probability of a mutation in both the first and second polynucleotide sequence portions.
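The three-term sum described above might be sketched as follows, under the assumption (not stated in the description) that mutations in the first and second polynucleotide sequence portions arise independently. The function name and the per-portion probabilities are hypothetical.

```python
def combined_prep_error(p_first, p_second):
    """Probability of a sample preparation error in at least one portion.

    Assumes independent mutations in the two portions, so the three
    mutually exclusive cases below sum to 1 - (1 - p_first) * (1 - p_second).
    """
    only_first = p_first * (1.0 - p_second)
    only_second = (1.0 - p_first) * p_second
    both = p_first * p_second
    return only_first + only_second + both

p_combined = combined_prep_error(1e-4, 1e-4)
```

For small per-portion probabilities the result is close to their sum, because the case of a mutation in both portions contributes very little.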
- the probability of a sample preparation error can be learnt based upon a sequencing run performed on respective clusters of known first and second polynucleotide sequence portions during a calibration run, analogously to the calibration process previously described. In other embodiments, the probability of a sample preparation error can be determined directly from the intensity data obtained from a sequencing run. In some embodiments, the first polynucleotide sequence portion and the second polynucleotide sequence portion forming a cluster correspond to the same genetic sequence, but may have differences due to sample preparation errors.
- the intensity data will be located in one of the four corners of a two-channel intensity scatter plot, corresponding to combined bases AA, TT, CC or GG (i.e. bins 1901, 1903, 1907 and 1909 in Figure 19). If there is a mutation error at the first or second nucleobase, the intensity data will be located in the regions of a two-channel intensity scatter plot corresponding to a combination of two different bases, e.g. GT, TG, CT, TC, AT, TA, CG, GC, AC, CA, GA, or AG.
- the probability of a sample preparation error can be directly determined during the sequencing run.
- the coefficients of a base transition matrix for combined base calls can be learnt from the intensity data using the Expectation Maximization process or gradient descent. If a mutation occurs due to an indel error, i.e. an insertion or deletion in the sequence, care must be taken to account for this, since an indel effectively results in a “shift” in the first polynucleotide sequence portion relative to the second polynucleotide sequence portion, which will lead to what appear to be mutation errors for successive base calls.
- a quality of the combined base call is determined based upon the combined probability distribution and the intensity data.
- the quality of the combined base call is determined based upon a posterior probability that the respective first nucleobase is a first selected nucleobase type and the respective second nucleobase is a second selected nucleobase type given the intensity data.
- the first selected nucleobase type and the second selected nucleobase type are selected by calculating a plurality of posterior probabilities, wherein each of the plurality of posterior probabilities is a probability that the first nucleobase is a first respective nucleobase type and the second nucleobase is a second respective nucleobase type given the intensity data, and selecting the posterior probability of the plurality of posterior probabilities with the highest value.
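As a non-limiting sketch of the selection described above, the following code evaluates posteriors over nine components laid out on a three-by-three grid of normalized intensities, selects the component with the highest posterior, and combines the resulting error probability with a constant sample preparation error. The grid positions, the shared isotropic standard deviation, the uniform prior, the keying of components by bin number and the inclusion-exclusion combination of the two error terms are all assumptions of this sketch.

```python
import math

LEVELS = (0.0, 0.5, 1.0)  # assumed low, intermediate and high intensities
# Component centres keyed by bin number (1901 to 1909).
CENTRES = {1901 + i + 3 * (2 - j): (LEVELS[i], LEVELS[j])
           for i in range(3) for j in range(3)}

def bin_posteriors(x, sigma=0.08):
    """Posterior of each component under a uniform prior (softmax of log-likelihoods)."""
    logs = {b: -((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) / (2.0 * sigma ** 2)
            for b, c in CENTRES.items()}
    peak = max(logs.values())
    weights = {b: math.exp(v - peak) for b, v in logs.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

def combined_call(x, p_prep=1e-4):
    """Select the most probable component and return it with its error probability."""
    post = bin_posteriors(x)
    best = max(post, key=post.get)
    p_snr = 1.0 - post[best]
    p_error = p_snr + p_prep - p_snr * p_prep
    return best, p_error

corner = combined_call((0.02, 0.97))    # near a corner of the scatter plot
boundary = combined_call((0.25, 1.0))   # on the boundary between two components
```

Intensity values near a corner yield an error probability dominated by the sample preparation term, whereas values on a boundary between two components yield an error probability near 0.5.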
- each of the regions 2010, 2020, 2030 and 2040 assign a high quality score to generally rectangular-shaped areas of the scatter plot which are located towards the corners of the scatter plot. These regions correspond to combined bases AA, TT, CC or GG and are assigned high quality scores. The remaining regions of the scatter plot are assigned lower quality scores. The boundary regions between each of the regions, indicated by reference 2050, are assigned the lowest quality scores which corresponds to areas of greatest uncertainty in the combined base call.
- Figure 20 shows that the second model when applied to combined sequencing data is able to capture the difference in quality scores per ninth of the intensity plot.
- peripheral devices can include a storage subsystem 2110 including, for example, memory devices and a file storage subsystem 2136, user interface input devices 2138, user interface output devices 2176, and a network interface subsystem 2174.
- the input and output devices allow user interaction with computer system 2100.
- Network interface subsystem 2174 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
- one or more of the quality scoring systems and methods disclosed herein is communicably linked to the storage subsystem 2110 and the user interface input devices 2138.
- User interface input devices 2138 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
- input device is intended to include all possible types of devices and ways to input information into computer system 2100.
- User interface output devices 2176 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem can also provide a non-visual display such as audio output devices.
- output device is intended to include all possible types of devices and ways to output information from computer system 2100 to the user or to another machine or computer system.
- Storage subsystem 2110 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2178. Processors 2178 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs).
- Processors 2178 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
- processors 2178 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX15 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
- Memory subsystem 2121 used in the storage subsystem 2110 can include a number of memories including a main random access memory (RAM) 2132 for storage of instructions and data during program execution and a read only memory (ROM) 2134 in which fixed instructions are stored.
- a file storage subsystem 2136 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of some implementations can be stored by file storage subsystem 2136 in the storage subsystem 2110, or in other machines accessible by the processor.
- Bus subsystem 2155 provides a mechanism for letting the various components and subsystems of computer system 2100 communicate with each other as intended.
- although bus subsystem 2155 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
- Computer system 2100 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2100 depicted in Figure 21 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 2100 are possible having more or fewer components than the computer system depicted in Figure 21.
- Each of the processors or modules discussed herein may include an algorithm (e.g., instructions stored on a tangible and/or non-transitory computer readable storage medium) or sub-algorithms to perform particular processes.
- the base calling pipeline can be implemented utilizing any combination of dedicated hardware boards, DSPs, processors, etc.
- the base calling pipeline may be implemented utilizing an off-the-shelf PC with a single processor or multiple processors, with the functional operations distributed between the processors.
- the modules described below may be implemented utilizing a hybrid configuration in which some modular functions are performed utilizing dedicated hardware, while the remaining modular functions are performed utilizing an off-the-shelf PC and the like.
- the modules also may be implemented as software modules within a processing unit.
- Various processes and steps of the methods set forth herein can be carried out using a computer.
- the computer can include a processor that is part of a detection device, networked with a detection device used to obtain the data that is processed by the computer or separate from the detection device.
- a local area network (LAN) or wide area network (WAN) may be a corporate computing network, including access to the Internet, to which computers and computing devices comprising the system are connected.
- the LAN conforms to the transmission control protocol/internet protocol (TCP/IP) industry standard.
- the information (e.g., image data) is input to a system disclosed herein via an input device (e.g., disk drive, compact disk player, USB port etc.).
- the information is received by loading the information, e.g., from a storage device such as a disk or flash drive.
- a processor that is used to run an algorithm or other process set forth herein may comprise a microprocessor.
- the microprocessor may be any conventional general purpose single- or multi-chip microprocessor such as a Pentium™ processor made by Intel Corporation.
- a particularly useful computer can utilize an Intel Ivybridge dual-12 core processor, LSI raid controller, having 128 GB of RAM, and 2 TB solid state disk drive.
- the processor may comprise any conventional special purpose processor such as a digital signal processor or a graphics processor.
- the processor typically has conventional address lines, conventional data lines, and one or more conventional control lines.
- the implementations disclosed herein may be implemented as a method, apparatus, system or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof.
- article of manufacture refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or non-volatile memory devices.
- Such hardware may include, but is not limited to, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), complex programmable logic devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar processing devices.
- One or more implementations of the technology disclosed, or elements thereof can be implemented in the form of a computer product including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated.
- an apparatus including a memory and at least one processor that is coupled to the memory
- one or more implementations of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
- cluster refers to a group of molecules, e.g., a group of DNA, or a group of signals.
- the signals of a cluster are derived from different features.
- a signal clump represents a physical region covered by one amplified oligonucleotide.
- a physical region may be a tile, a sub-tile, a lane or a sub-lane on a flow cell, etc.
- Each signal clump could be ideally observed as several signals. Accordingly, duplicate signals could be detected from the same clump of signals.
- a cluster or clump of signals can comprise one or more signals or spots that correspond to a particular feature.
- a cluster can comprise one or more signals that together occupy the physical region occupied by an amplified oligonucleotide (or other polynucleotide or polypeptide with a same or similar sequence).
- a cluster can be the physical region covered by one amplified oligonucleotide.
- a cluster or clump of signals need not strictly correspond to a feature.
- spurious noise signals may be included in a signal cluster but not necessarily be within the feature area.
- a cluster of signals from four cycles of a sequencing reaction could comprise at least four signals.
- a “flow cell” can include a device having a lid extending over a reaction structure to form a flow channel therebetween that is in communication with a plurality of reaction sites of the reaction structure, and can include a detection device that is configured to detect designated reactions that occur at or proximate to the reaction sites.
- a flow cell may include a solid-state light detection or “imaging” device, such as a Charge-Coupled Device (CCD) or Complementary Metal-Oxide Semiconductor (CMOS) (light) detection device.
- a flow cell may be configured to fluidically and electrically couple to a cartridge (having an integrated pump), which may be configured to fluidically and/or electrically couple to a bioassay system.
- a cartridge and/or bioassay system may deliver a reaction solution to reaction sites of a flow cell according to a predetermined protocol (e.g., sequencing-by-synthesis), and perform a plurality of imaging events.
- a cartridge and/or bioassay system may direct one or more reaction solutions through the flow channel of the flow cell, and thereby along the reaction sites.
- At least one of the reaction solutions may include four types of nucleotides having the same or different fluorescent labels.
- the nucleotides may bind to the reaction sites of the flow cell, such as to corresponding oligonucleotides at the reaction sites.
- the cartridge and/or bioassay system may then illuminate the reaction sites using an excitation light source (e.g., solid-state light sources, such as light-emitting diodes (LEDs)).
- the excitation light may have a predetermined wavelength or wavelengths, including a range of wavelengths.
- the fluorescent labels excited by the incident excitation light may provide emission signals (e.g., light of a wavelength or wavelengths that differ from the excitation light and, potentially, each other) that may be detected by the light sensors of the flow cell.
- Flow cells described herein may be configured to perform various biological or chemical processes.
- flow cells described herein may be used in various processes and systems where it is desired to detect an event, property, quality, or characteristic that is indicative of a designated reaction.
- flow cells described herein may include or be integrated with light detection devices, biosensors, and their components, as well as bioassay systems that operate with biosensors.
- the flow cells may be configured to facilitate a plurality of designated reactions that may be detected individually or collectively.
- the flow cells may be configured to perform numerous cycles in which the plurality of designated reactions occurs in parallel.
- the flow cells may be used to sequence a dense array of DNA features through iterative cycles of enzymatic manipulation and light or image detection/acquisition.
- the flow cells may be in fluidic communication with one or more microfluidic channels that deliver reagents or other reaction components in a reaction solution to a reaction site of the flow cells.
- the reaction sites may be provided or spaced apart in a predetermined manner, such as in a uniform or repeating pattern. Alternatively, the reaction sites may be randomly distributed.
- Each of the reaction sites may be associated with one or more light guides and one or more light sensors that detect light from the associated reaction site.
- light guides include one or more filters for filtering certain wavelengths of light.
- the light guides may be, for example, an absorption filter (e.g., an organic absorption filter) such that the filter material absorbs a certain wavelength (or range of wavelengths) and allows at least one predetermined wavelength (or range of wavelengths) to pass therethrough.
- the reaction sites may be located in reaction recesses or chambers, which may at least partially compartmentalize the designated reactions therein.
- “spot radius” or “cluster radius” refers to a defined radius which encompasses a diffraction-limited spot or a cluster of signals. Accordingly, by defining a cluster radius as larger or smaller, a greater or lesser number of signals can fall within the radius for subsequent ordering and selection.
- a cluster radius can be defined by any distance measure, such as pixels, meters, millimeters, or any other useful measure of distance.
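The effect of the cluster radius on how many signals are selected can be illustrated with a minimal sketch. The coordinates, the radius values, and the function name `signals_in_cluster` are hypothetical and chosen only for illustration; the distance unit is pixels, one of the measures mentioned above.

```python
from math import dist

def signals_in_cluster(signals, center, cluster_radius):
    """Return the signals whose position lies within cluster_radius of center."""
    return [s for s in signals if dist(s, center) <= cluster_radius]

# Hypothetical signal positions (x, y) in pixels.
signals = [(10.0, 10.0), (10.5, 9.5), (12.0, 10.0), (20.0, 20.0)]
center = (10.0, 10.0)

small = signals_in_cluster(signals, center, cluster_radius=1.0)
large = signals_in_cluster(signals, center, cluster_radius=2.5)
print(len(small), len(large))
```

A larger radius admits more candidate signals (including possible spurious ones) for the subsequent ordering and selection step, while a smaller radius admits fewer.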
- a “signal” refers to a detectable event such as an emission, such as light emission, for example, in an image.
- a signal can represent any detectable light emission that is captured in an image (i.e., a “spot”).
- signal can refer to an actual emission from a feature of the specimen, or can refer to a spurious emission that does not correlate to an actual feature. Thus, a signal could arise from noise and could be later discarded as not representative of an actual feature of a specimen.
- an “intensity” of an emitted light refers to the intensity of the light transferred per unit area, where the area is measured on the plane perpendicular to the direction of propagation of the light ray, and where the intensity is the amount of energy transferred per unit time.
- signal “strength”, “amplitude”, “magnitude” or “level” may be used synonymously with signal intensity.
- an image taken by a detector is approximately equal to, or proportional to, an intensity map integrated over some amount of time.
- the signal of a diffraction-limited spot of a DNA cluster is extracted from the image as the total intensity included in the spot, up to a factor of the integration time.
- the signal of a DNA cluster may be defined as the intensity included within the spot radius of the DNA cluster, up to a factor of the integration time.
- the peak intensity value found within the spot radius may be used to represent the signal of the DNA cluster, up to a factor of the integration time.
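The two ways of reducing a spot to a single signal value described above (total intensity within the spot radius, or the peak value within it) can be sketched as follows. This is only an illustration under assumptions: the image is a small hypothetical grid of integer pixel values, the spot center is given as (row, column), and the integration-time factor is ignored.

```python
from math import hypot

def pixels_in_radius(image, center, radius):
    """Yield pixel values whose (row, col) lies within radius of center."""
    center_row, center_col = center
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if hypot(r - center_row, c - center_col) <= radius:
                yield value

def total_intensity(image, center, radius):
    """Signal defined as the summed intensity inside the spot radius."""
    return sum(pixels_in_radius(image, center, radius))

def peak_intensity(image, center, radius):
    """Signal defined as the largest pixel value inside the spot radius."""
    return max(pixels_in_radius(image, center, radius), default=0)

# Hypothetical 4x4 image; the bright pixel at (3, 3) belongs to another spot.
image = [
    [0, 1, 0, 0],
    [1, 5, 2, 0],
    [0, 2, 1, 0],
    [0, 0, 0, 9],
]
print(total_intensity(image, (1, 1), 1.0), peak_intensity(image, (1, 1), 1.0))
```

With a radius of one pixel, only the center pixel and its four direct neighbors contribute, so the unrelated bright pixel at (3, 3) does not affect either value.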
- the process of aligning the template of signal positions onto a given image is referred to as “registration”, and the process for determining an intensity value or an amplitude value for each signal in the template for a given image is referred to as “intensity extraction”.
- the methods and systems provided herein may take advantage of the random nature of signal clump positions by using image correlation to align the template to the image.
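Registration by image correlation can be sketched as a brute-force search over integer shifts: the template of signal positions is moved across the image and the shift with the highest summed intensity is kept. This is a simplified illustration, not the method of the application; the template positions, image size, and function names are hypothetical.

```python
def correlation_score(image, template, shift):
    """Sum the image values found at the template positions moved by shift."""
    d_row, d_col = shift
    rows, cols = len(image), len(image[0])
    score = 0
    for r, c in template:
        rr, cc = r + d_row, c + d_col
        if 0 <= rr < rows and 0 <= cc < cols:
            score += image[rr][cc]
    return score

def register(image, template, max_shift=2):
    """Return the (d_row, d_col) shift that best aligns template to image."""
    shifts = [
        (d_row, d_col)
        for d_row in range(-max_shift, max_shift + 1)
        for d_col in range(-max_shift, max_shift + 1)
    ]
    return max(shifts, key=lambda s: correlation_score(image, template, s))

# Hypothetical template of irregularly placed signal positions (row, col).
template = [(0, 0), (1, 3), (3, 1)]

# Build a 6x6 image in which the same pattern appears shifted by (1, 2).
image = [[0] * 6 for _ in range(6)]
for r, c in template:
    image[r + 1][c + 2] = 10

best_shift = register(image, template)
print(best_shift)
```

Because the positions are irregular, only the true offset places every template position on a bright pixel, which is why random clump positions give a sharp correlation peak.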
- nucleotide includes a nitrogen containing heterocyclic base, a sugar, and one or more phosphate groups. Nucleotides are monomeric units of a nucleic acid sequence. Examples of nucleotides include, for example, ribonucleotides or deoxyribonucleotides. In ribonucleotides (RNA), the sugar is a ribose, and in deoxyribonucleotides (DNA), the sugar is a deoxyribose, i.e., a sugar lacking a hydroxyl group that is present at the 2' position in ribose.
- the nitrogen containing heterocyclic base can be a purine base or a pyrimidine base.
- Purine bases include adenine (A) and guanine (G), and modified derivatives or analogs thereof.
- Pyrimidine bases include cytosine (C), thymine (T), and uracil (U), and modified derivatives or analogs thereof.
- the C-1 atom of deoxyribose is bonded to N-1 of a pyrimidine or N-9 of a purine.
- the phosphate groups may be in the mono-, di-, or tri-phosphate form.
- nucleotides may be natural nucleotides, but it is to be further understood that non-natural nucleotides, modified nucleotides or analogs of the aforementioned nucleotides can also be used.
- nucleobase is a heterocyclic base such as adenine, guanine, cytosine, thymine, uracil, inosine, xanthine, hypoxanthine, or a heterocyclic derivative, analog, or tautomer thereof.
- a nucleobase can be naturally occurring or synthetic.
- nucleobases are adenine, guanine, thymine, cytosine, uracil, xanthine, hypoxanthine, 8-azapurine, purines substituted at the 8 position with methyl or bromine, 9-oxo-N6-methyladenine, 2-aminoadenine, 7-deazaxanthine, 7-deazaguanine, 7-deaza-adenine, N4-ethanocytosine, 2,6-diaminopurine, N6-ethano-2,6-diaminopurine, 5-methylcytosine, 5-(C3-C6)-alkynylcytosine, 5-fluorouracil, 5-bromouracil, thiouracil, pseudoisocytosine, 2-hydroxy-5-methyl-4-triazolopyridine, isocytosine, isoguanine, inosine, 7,8-dimethylalloxazine, 6-dihydrothymine, 5-
- “nucleic acid” or “polynucleotide” refers to a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise limited, encompasses known analogs of natural nucleotides that hybridize to nucleic acids in a manner similar to naturally occurring nucleotides, such as peptide nucleic acids (PNAs) and phosphorothioate DNA. Unless otherwise indicated, a particular nucleic acid sequence includes the complementary sequence thereof.
- Nucleotides include, but are not limited to, ATP, dATP, CTP, dCTP, GTP, dGTP, UTP, TTP, dUTP, 5-methyl-CTP, 5-methyl-dCTP, ITP, dITP, 2-amino-adenosine-TP, 2-amino-deoxyadenosine-TP, 2-thiothymidine triphosphate, pyrrolo-pyrimidine triphosphate, and 2-thiocytidine, as well as the alphathiotriphosphates for all of the above, and 2′-O-methyl-ribonucleotide triphosphates for all the above bases.
- Modified bases include, but are not limited to, 5-Br-UTP, 5-Br-dUTP, 5-F-UTP, 5-F-dUTP, 5-propynyl dCTP, and 5-propynyl-dUTP.
- the polymerase used is an enzyme generally for joining 3'-OH 5'-triphosphate nucleotides, oligomers, and their analogs.
- Polymerases include, but are not limited to, DNA-dependent DNA polymerases, DNA-dependent RNA polymerases, RNA-dependent DNA polymerases, RNA-dependent RNA polymerases, T7 DNA polymerase, T3 DNA polymerase, T4 DNA polymerase, T7 RNA polymerase, T3 RNA polymerase, SP6 RNA polymerase, DNA polymerase I, Klenow fragment, Thermus aquaticus DNA polymerase, Tth DNA polymerase, VentR® DNA polymerase (New England Biolabs), Deep VentR® DNA polymerase (New England Biolabs), Bst DNA Polymerase Large Fragment, Stoffel Fragment, 9°N DNA Polymerase, Pfu DNA Polymerase, Tfl DNA Polymerase, RepliPHI Phi29 Polymerase, Tli DNA polymerase, eukaryotic DNA polymerase beta, telomerase, Therminator™ polymerase
- Nucleosides and nucleotides may be labeled at sites on the sugar or nucleobase.
- a dye may be attached to any position on the nucleotide base, for example, through a linker.
- Watson-Crick base pairing can still be carried out for the resulting analog.
- Particular nucleobase labeling sites include the C5 position of a pyrimidine base or the C7 position of a 7-deaza purine base.
- a linker group may be used to covalently attach a dye to the nucleoside or nucleotide.
- covalently attached or “covalently bonded” refers to the forming of a chemical bonding that is characterized by the sharing of pairs of electrons between atoms.
- a covalently attached polymer coating refers to a polymer coating that forms chemical bonds with a functionalized surface of a substrate, as compared to attachment to the surface via other means, for example, adhesion or electrostatic interaction. It will be appreciated that polymers that are attached covalently to a surface can also be bonded via means in addition to covalent attachment.
- linker encompasses any moiety that is useful to connect one or more molecules or compounds to each other, to other components of a reaction mixture, and/or to a reaction site.
- a linker can attach a reporter molecule or “label” (e.g., a fluorescent dye) to a reaction component.
- the linker is a member selected from substituted or unsubstituted alkyl (e.g., a 2-5 carbon chain), substituted or unsubstituted heteroalkyl, substituted or unsubstituted aryl, substituted or unsubstituted heteroaryl, substituted or unsubstituted cycloalkyl, and substituted or unsubstituted heterocycloalkyl.
- the linker moiety is selected from straight- and branched carbon-chains, optionally including at least one heteroatom (e.g., at least one functional group, such as ether, thioether, amide, sulfonamide, carbonate, carbamate, urea and thiourea), and optionally including at least one aromatic, heteroaromatic or non-aromatic ring structure (e.g., cycloalkyl, phenyl).
- molecules that have trifunctional linkage capability are used, including, but not limited to, cyanuric chloride, melamine, diaminopropanoic acid, aspartic acid, cysteine, glutamic acid, pyroglutamic acid, S-acetylmercaptosuccinic anhydride, carbobenzoxylysine, histidine, lysine, serine, homoserine, tyrosine, piperidinyl-1,1-amino carboxylic acid, diaminobenzoic acid, etc.
- a hydrophilic PEG (polyethylene glycol) linker is used.
- Reactive functional group refers to groups including, but not limited to, olefins, acetylenes, alcohols, phenols, ethers, oxides, halides, aldehydes, ketones, carboxylic acids, esters, amides, cyanates, isocyanates, thiocyanates, isothiocyanates, amines, hydrazines, hydrazones, hydrazides, diazo, diazonium, nitro, nitriles, mercaptans, sulfides, disulfides, sulfoxides, sulfones, sulfonic acids, sulfinic acids, acetals, ketals, anhydrides, sulfates, sulfenic acids, isonitriles, amidines, imides, imidates, nitrones, hydroxylamines, oximes, hydroxamic acids, thiohydroxamic acids, allenes, ortho
- Reactive functional groups also include those used to prepare bioconjugates, e.g., N-hydroxysuccinimide esters, maleimides and the like.
- Cleavable linkers may be, by way of non-limiting example, electrophilically cleavable linkers, nucleophilically cleavable linkers, photocleavable linkers, cleavable under reductive conditions (for example disulfide or azide containing linkers), oxidative conditions, cleavable via use of safety-catch linkers and cleavable by elimination mechanisms.
- one or more dye or label molecules may attach to the nucleotide base by non-covalent interactions, or by a combination of covalent and non-covalent interactions via a plurality of intermediating molecules.
- a nucleotide or a nucleotide analog being newly incorporated by the polymerase synthesizing from a target polynucleotide is initially unlabeled.
- Each of the four types of nucleotides may have a 3′ hydroxy blocking group to ensure that only a single base can be added by a polymerase to the 3′ end of a copy polynucleotide being synthesized from the target polynucleotide.
- an affinity reagent may be then introduced that specifically binds to the incorporated dNTP to provide a labeled extension product comprising the incorporated dNTP.
- the affinity reagent may be designed to specifically bind to the incorporated dNTP via antibody-antigen interaction or ligand-receptor interaction, for example.
- the dNTP may be modified to include a specific antigen, which will pair with a specific antibody included in the corresponding affinity reagent.
- one, two, three or each of the four different types of nucleotides may be specifically labeled via their corresponding affinity reagents.
- the affinity reagents may include small molecules or protein tags that may bind to a hapten moiety of the nucleotide (such as streptavidin-biotin, anti-DIG and DIG, anti-DNP and DNP), antibody (including but not limited to binding fragments of antibodies, single chain antibodies, bispecific antibodies, and the like), aptamers, knottins, affimers, or any other known agent that binds an incorporated nucleotide with a suitable specificity and affinity.
- the hapten moiety of the unlabeled nucleotide may be attached to the nucleobase through a cleavable linker, which may be cleaved under the same reaction condition as that for removing the 3’ blocking group.
- one affinity reagent may be labeled with multiple copies of the same fluorescent dye, for example, 1, 2, 3, 4, 5, 6, 8, 10, 12, 15 copies of the same dye.
- each affinity reagent may be labeled with a different number of copies of the same fluorescent dye.
- a first affinity reagent may be labeled with a first number of a first fluorescent dye
- a second affinity reagent may be labeled with a second number of a second fluorescent dye
- a third affinity reagent may be labeled with a third number of a third fluorescent dye
- a fourth affinity reagent may be labeled with a fourth number of a fourth fluorescent dye.
- each affinity reagent may be labeled with a distinct combination of one or more types of dye, where each type of dye has a certain copy number.
- different affinity reagents may be labeled with different dyes that can be excited by the same light source, but each dye will have a distinguishable fluorescent intensity or a distinguishable emission spectrum.
- affinity reagents may be labeled with the same dye in different molar ratios to create measurable differences in their fluorescent intensities.
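How different copy numbers of the same dye could make the four nucleotide types distinguishable by intensity alone can be sketched as below. Everything specific here is an assumption for illustration: the copy numbers assigned to each base, the per-copy intensity, and the premise that intensity scales linearly with copy number are not taken from the text.

```python
# Hypothetical copy number of the same dye on each base's affinity reagent.
COPIES_PER_BASE = {"A": 2, "C": 4, "G": 8, "T": 12}

# Hypothetical intensity contributed by one dye copy (arbitrary units).
UNIT_INTENSITY = 100.0

def expected_intensity(base):
    """Expected intensity, assuming it scales linearly with copy number."""
    return COPIES_PER_BASE[base] * UNIT_INTENSITY

def decode_base(measured_intensity):
    """Pick the base whose expected intensity is closest to the measurement."""
    return min(
        COPIES_PER_BASE,
        key=lambda base: abs(measured_intensity - expected_intensity(base)),
    )

print(decode_base(410.0), decode_base(1150.0))
```

The wider the spacing between expected levels relative to measurement noise, the more reliably a single light source and detector can separate the four labels.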
- a nucleotide analog may be attached to or associated with one or more photo-detectable labels to provide a detectable signal.
- a photo-detectable label may be a fluorescent compound, such as a small molecule fluorescent label.
- Fluorescent molecules (fluorophores) suitable as a fluorescent label include, but are not limited to: 1,5 IAEDANS; 1,8-ANS; 4-methylumbelliferone; 5-carboxy-2,7-dichlorofluorescein; 5-carboxyfluorescein (5-FAM); fluorescein amidite (FAM); 5-carboxynaphthofluorescein; tetrachloro-6-carboxyfluorescein (TET); hexachloro-6-carboxyfluorescein (HEX); 2,7-dimethoxy-4,5-dichloro-6-carboxyfluorescein (JOE); VIC®; NED™; tetramethylrhodamine (TMR); 5-carboxytetramethylrhodamine (5-TAMRA); 5-HAT (hydroxytryptamine); 5-ROX (carboxy-X-rhodamine); 6-carboxyrhodamine 6G;
- a first photo-detectable label interacts with a second photo-detectable moiety to modify the detectable signal, e.g., via fluorescence resonance energy transfer (“FRET”; also known as Förster resonance energy transfer).
- the fluorescent labels utilized by the systems and methods disclosed herein can have different peak absorption wavelengths, for example, ranging from 400 nm to 800 nm.
- the peak absorption wavelengths of the fluorescent labels can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values.
- the peak emission wavelengths of the fluorescent labels can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values.
- the peak emission wavelengths of the fluorescent labels can be at least, or at most, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, or 800 nm.
- the fluorescent labels can have different Stokes shifts, for example, ranging from 10 nm to 200 nm.
- the Stokes shift can be, or be about, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200 nm, or a number or a range between any two of these values. In some embodiments, the Stokes shift can be at least, or at most, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nm. In some embodiments, the distance between the peak emission wavelengths of any two fluorescent labels can vary, for example, ranging from 10 nm to 200 nm.
- the distance between the peak emission wavelengths of any two fluorescent labels can be, or be about, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200 nm, or a number or a range between any two of these values. In some embodiments, the distance between the peak emission wavelengths of any two fluorescent labels can be at least, or at most, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200 nm.
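As a worked illustration of the quantities above, the Stokes shift of a label is the difference between its peak emission and peak absorption wavelengths, and the separation between labels is the gap between their emission peaks. The dye names and wavelengths below are hypothetical values chosen to fall in the stated 400 nm to 800 nm range; they do not describe any real dye.

```python
def stokes_shift_nm(peak_absorption_nm, peak_emission_nm):
    """Stokes shift: peak emission wavelength minus peak absorption wavelength."""
    return peak_emission_nm - peak_absorption_nm

def min_emission_separation_nm(peak_emissions_nm):
    """Smallest gap between neighbouring emission peaks in a label set."""
    peaks = sorted(peak_emissions_nm)
    return min(b - a for a, b in zip(peaks, peaks[1:]))

# Hypothetical labels: name -> (peak absorption nm, peak emission nm).
labels = {
    "dye_a": (490, 520),
    "dye_b": (550, 570),
    "dye_c": (640, 670),
}

shifts = {name: stokes_shift_nm(ab, em) for name, (ab, em) in labels.items()}
separation = min_emission_separation_nm(em for _, em in labels.values())
print(shifts, separation)
```

Both results can then be checked against the 10 nm to 200 nm ranges given above when choosing a label set for a given filter and detector arrangement.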
- a “light source” may be any device capable of emitting energy along the electromagnetic spectrum.
- a light source may be a source of visible light (VIS), ultraviolet light (UV) and/or infrared light (IR).
- VIS generally refers to the band of electro-magnetic radiation with a wavelength from about 400 nm to about 750 nm.
- Ultraviolet (UV) light generally refers to electromagnetic radiation with a wavelength shorter than that of visible light, or from about 10 nm to about 400 nm range.
- Infrared light or infrared radiation (IR) generally refers to electromagnetic radiation with a wavelength greater than the VIS range, or from about 750 nm to about 50,000 nm.
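The three bands defined above can be turned into a small lookup. The text gives only approximate limits ("about"), so the exact handling of the 400 nm and 750 nm boundaries in this sketch is an assumption, as is the function name.

```python
def classify_band(wavelength_nm):
    """Map a wavelength in nm to the UV, VIS or IR band described in the text."""
    if 10 <= wavelength_nm < 400:
        return "UV"
    if 400 <= wavelength_nm <= 750:
        return "VIS"
    if 750 < wavelength_nm <= 50_000:
        return "IR"
    return "outside the listed ranges"

for wavelength in (365, 532, 850, 5):
    print(wavelength, classify_band(wavelength))
```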
- a light source may also provide full spectrum light. Light sources may output light from a selected wavelength or a range of wavelengths.
- the light source may be configured to provide light above or below a predetermined wavelength, or may provide light within a predetermined range.
- a light source may be used in combination with a filter, to selectively transmit or block light of a selected wavelength from the light source.
- a light source may be connected to a power source by one or more electrical connectors; an array of light sources may be connected to a power source in series or in parallel.
- a power source may be a battery, or a vehicle electrical system or a building electrical system.
- the light source may be connected to a power source via control electronics (control circuit); control electronics may comprise one or more switches. The one or more switches may be automated, or controlled by a sensor, timer or other input, or may be controlled by a user, or a combination thereof.
- a user may operate a switch to turn on a UV light source; the light source may be applied on a constant basis until it is turned off, or it may be pulsed (repeated on/off cycles) until it is turned off.
- the light source may be switched from a continuously-on state to a pulsed state, or vice versa.
- the light source may be configured to brighten or dim over time.
- the light source may be connected to a power source capable of providing sufficient intensity to illuminate the sample.
- Control electronics may be used to switch the intensity on or off based on input from a user or some other input, and can also be used to modulate the intensity to a suitable level (e.g. to control brightness of the output light).
- Control electronics may be configured to turn the light source on and off as desired.
- Control electronics may include a switch for manual, automatic, or semi-automatic operation of the light sources.
- the one or more switches may be, for example, a transistor, a relay or an electromechanical switch.
- the control circuit may further comprise an AC-DC and/or a DC-DC converter for converting the voltage from the voltage source to an appropriate voltage for the light source.
- the control circuit may comprise a DC-DC regulator for regulation of the voltage.
- the control circuit may further comprise a timer and/or other circuitry elements for applying electric voltage to the optical filter for a fixed period of time following the receipt of input.
- light output from a light source may be from about 350 to about 750 nm, or any amount or range therebetween, for example from about 350 nm to about 360, 370, 380, 390, 400, 410, 420, 430 or about 450 nm, or any amount or range therebetween.
- light from a light source may be from about 550 to about 700 nm, or any amount or range therebetween, for example from about 550 to about 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690 or about 700 nm, or any amount or range therebetween.
- the wavelength of the light generated by the light source can vary, for example, ranging from 400 nm to 800 nm. In some embodiments, the wavelength of the light generated by the light source can be, or be about, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800 nm, or a number or a range between any two of these values.
- the wavelength of the light generated by the light source can be at least, or at most, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570, 580, 590, 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, or 800 nm.
- the light source may be capable of emitting electromagnetic waves in any spectrum.
- the light source may have a wavelength falling between 10 nm and 100 µm. In some embodiments, the wavelength of light may fall between 100 nm and 5000 nm, 300 nm and 1000 nm, or 400 nm and 800 nm.
- the wavelength of light may be less than or equal to 10 nm, 100 nm, 200 nm, 300 nm, 400 nm, 500 nm, 600 nm, 700 nm, 800 nm, 900 nm, 1000 nm, 1100 nm, 1200 nm, 1300 nm, 1500 nm, 1750 nm, 2000 nm, 2500 nm, 3000 nm, 4000 nm, or 5000 nm.
- a light source may be a light-emitting diode (LED) (e.g., gallium arsenide (GaAs) LED, aluminum gallium arsenide (AlGaAs) LED, gallium arsenide phosphide (GaAsP) LED, aluminum gallium indium phosphide (AlGaInP) LED, gallium(III) phosphide (GaP) LED, indium gallium nitride (InGaN)/gallium(III) nitride (GaN) LED, or aluminum gallium phosphide (AlGaP) LED).
- a light source can be a laser, for example a vertical cavity surface emitting laser (VCSEL) or other suitable light emitter such as an Indium-Gallium-Aluminum-Phosphide (InGaAlP) laser, a Gallium-Arsenic Phosphide/Gallium Phosphide (GaAsP/GaP) laser, or a Gallium-Aluminum-Arsenide/Gallium-Aluminum-Arsenide (GaAlAs/GaAs) laser.
- light sources may include but are not limited to electron stimulated light sources (e.g., Cathodoluminescence, Electron Stimulated Luminescence (ESL light bulbs), Cathode ray tube (CRT monitor), Nixie tube), incandescent light sources (e.g., Carbon button lamp, Conventional incandescent light bulbs, Halogen lamps, Globar, Nernst lamp), electroluminescent (EL) light sources (e.g., Light-emitting diodes, Organic light-emitting diodes, Polymer light-emitting diodes, Solid-state lighting, LED lamp, Electroluminescent sheets, Electroluminescent wires), gas discharge light sources (e.g., Fluorescent lamps, Inductive lighting, Hollow cathode lamp, Neon and argon lamps, Plasma lamps, Xenon flash lamps), or high-intensity discharge light sources (e.g., Carbon arc lamps, Ceramic discharge metal halide lamps, Hydrargyrum medium-arc iodide lamps, Hydr
- a light source may be a bioluminescent, chemiluminescent, phosphorescent, or fluorescent light source.
- an “optical channel” is a predefined profile of optical frequencies (or equivalently, wavelengths).
- a first optical channel may have wavelengths of 500 nm–600 nm.
- for example, to capture the first optical channel, one may use a detector which is only responsive to 500 nm–600 nm light, or use a bandpass filter having a transmission window of 500 nm–600 nm to filter the incoming light onto a detector responsive to 300 nm–800 nm light.
- a second optical channel may have wavelengths of 300 nm–450 nm and 850 nm–900 nm.
- to capture the second optical channel, one may use a detector responsive to 300 nm–450 nm light and another detector responsive to 850 nm–900 nm light and then combine the detected signals of the two detectors.
- alternatively, one may use a bandstop filter which rejects 451 nm–849 nm light in front of a detector responsive to 300 nm–900 nm light.
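The optical channel examples above can be modeled as lists of wavelength intervals. The sketch below, with hypothetical names and wavelengths sampled on a 1 nm grid, checks that a bandstop filter rejecting 451 nm–849 nm in front of a 300 nm–900 nm detector passes the same wavelengths as the two-band second channel.

```python
# Each optical channel is a list of inclusive (low_nm, high_nm) intervals.
FIRST_CHANNEL = [(500, 600)]
SECOND_CHANNEL = [(300, 450), (850, 900)]

def in_channel(wavelength_nm, channel):
    """True if the wavelength falls inside any interval of the channel."""
    return any(low <= wavelength_nm <= high for low, high in channel)

def passes_bandstop_detector(wavelength_nm, detector=(300, 900), stop=(451, 849)):
    """True if the detector responds and the bandstop filter does not reject."""
    in_detector = detector[0] <= wavelength_nm <= detector[1]
    in_stop_band = stop[0] <= wavelength_nm <= stop[1]
    return in_detector and not in_stop_band

# Compare both realisations of the second channel on a 1 nm grid.
equivalent = all(
    in_channel(w, SECOND_CHANNEL) == passes_bandstop_detector(w)
    for w in range(250, 951)
)
print(equivalent)
```

The comparison is made on integer wavelengths only; between 450 nm and 451 nm (and between 849 nm and 850 nm) the two descriptions in the text leave a 1 nm gap that this sketch does not try to resolve.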
- DSP digital signal processor
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- a processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like.
- a processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
- systems described herein may be implemented using a discrete memory chip, a portion of memory in a microprocessor, flash, EPROM, or other types of memory.
- the elements of a method, process, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.
- Disjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y or Z, or any combination thereof (e.g., X, Y and/or Z).
- phrases such as “a device configured to” or “a device to” are intended to include one or more recited devices. Such one or more recited devices can also be collectively configured to carry out the stated recitations.
- a processor to carry out recitations A, B and C can include a first processor configured to carry out recitation A working in conjunction with a second processor configured to carry out recitations B and C.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention relates to a method for determining the quality of a base call, the base call being based on a signal generated in a current sequencing cycle of a sequencing run based on the incorporation of a nucleobase into a plurality of polynucleotide sequence portions forming a cluster, the method comprising: accessing a plurality of features associated with the cluster; and determining a quality score of the base call from the plurality of features and a nonlinear machine learning algorithm.
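For orientation, a Phred quality score expresses the probability that a base call is wrong on a logarithmic scale, Q = -10·log10(p). The sketch below shows only that conversion. It assumes, hypothetically, that some model yields an error probability for the call; how the claimed nonlinear machine learning algorithm produces its score is not described here, and the cap `q_max` is an illustrative choice.

```python
import math

def phred_quality(p_error, q_max=60.0):
    """Convert an error probability into a Phred-scaled quality score."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must lie in [0, 1]")
    if p_error == 0.0:
        return q_max  # avoid log10(0); cap the score instead
    return min(q_max, -10.0 * math.log10(p_error))

def error_probability(quality):
    """Inverse conversion: Phred-scaled quality back to an error probability."""
    return 10.0 ** (-quality / 10.0)

q = phred_quality(0.001)
print(q, error_probability(q))
```

On this scale an error probability of one in a thousand corresponds to a score of about 30, and each additional 10 points corresponds to a tenfold lower error probability.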
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463564906P | 2024-03-13 | 2024-03-13 | |
| US63/564,906 | 2024-03-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025190902A1 (fr) | 2025-09-18 |
Family
ID=94974020
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2025/056537 (Pending; WO2025190902A1, fr) | Improving base call quality scores | 2024-03-13 | 2025-03-11 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025190902A1 (fr) |
Citations (17)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO1992002258A1 (fr) | 1990-07-27 | 1992-02-20 | Isis Pharmaceuticals, Inc. | Nuclease-resistant, pyrimidine-modified oligonucleotides that detect and modulate gene expression |
| WO1993010820A1 (fr) | 1991-11-26 | 1993-06-10 | Gilead Sciences, Inc. | Enhanced triple-helix and double-helix formation using oligomers containing modified pyrimidines |
| WO1994022892A1 (fr) | 1993-03-30 | 1994-10-13 | Sterling Winthrop Inc. | Modified oligonucleotides containing 7-deazapurine nucleosides |
| WO1994024144A2 (fr) | 1993-04-19 | 1994-10-27 | Gilead Sciences, Inc. | Triple-helix and double-helix formation using oligomers containing modified purines |
| US5432272A (en) | 1990-10-09 | 1995-07-11 | Benner; Steven A. | Method for incorporating into a DNA or RNA oligonucleotide using nucleotides bearing heterocyclic bases |
| US6150510A (en) | 1995-11-06 | 2000-11-21 | Aventis Pharma Deutschland Gmbh | Modified oligonucleotides, their preparation and their use |
| US6329178B1 (en) | 2000-01-14 | 2001-12-11 | University Of Washington | DNA polymerase mutant having one or more mutations in the active site |
| US6395524B2 (en) | 1996-11-27 | 2002-05-28 | University Of Washington | Thermostable polymerases having altered fidelity and method of identifying and using same |
| US20070048748A1 (en) | 2004-09-24 | 2007-03-01 | Li-Cor, Inc. | Mutant polymerases for sequencing and genotyping |
| US7595882B1 (en) | 2008-04-14 | 2009-09-29 | General Electric Company | Hollow-core waveguide-based raman systems and methods |
| US20120020537A1 (en) | 2010-01-13 | 2012-01-26 | Francisco Garcia | Data processing system and methods |
| US20130079232A1 (en) | 2011-09-23 | 2013-03-28 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
| US8932126B2 (en) | 2007-12-20 | 2015-01-13 | Aristocrat Technologies Australia Pty Limited | Method of gaming, a game controller and a gaming system |
| WO2015084985A2 (fr) | 2013-12-03 | 2015-06-11 | Illumina, Inc. | Methods and systems for analyzing image data |
| US20200302297A1 (en) * | 2019-03-21 | 2020-09-24 | Illumina, Inc. | Artificial Intelligence-Based Base Calling |
| AU2020240141A1 (en) * | 2019-03-21 | 2021-01-14 | Illumina, Inc. | Artificial intelligence-based sequencing |
| US11361194B2 (en) | 2020-10-27 | 2022-06-14 | Illumina, Inc. | Systems and methods for per-cluster intensity correction and base calling |
- 2025-03-11: WO application PCT/EP2025/056537 filed, published as WO2025190902A1 (fr); status: active, pending
Patent Citations (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO1992002258A1 (fr) | 1990-07-27 | 1992-02-20 | Isis Pharmaceuticals, Inc. | Nuclease-resistant, pyrimidine-modified oligonucleotides that detect and modulate gene expression |
| US5432272A (en) | 1990-10-09 | 1995-07-11 | Benner; Steven A. | Method for incorporating into a DNA or RNA oligonucleotide using nucleotides bearing heterocyclic bases |
| WO1993010820A1 (fr) | 1991-11-26 | 1993-06-10 | Gilead Sciences, Inc. | Enhanced triple-helix and double-helix formation using oligomers containing modified pyrimidines |
| WO1994022892A1 (fr) | 1993-03-30 | 1994-10-13 | Sterling Winthrop Inc. | Modified oligonucleotides containing 7-deazapurine nucleosides |
| WO1994024144A2 (fr) | 1993-04-19 | 1994-10-27 | Gilead Sciences, Inc. | Triple-helix and double-helix formation using oligomers containing modified purines |
| US6150510A (en) | 1995-11-06 | 2000-11-21 | Aventis Pharma Deutschland Gmbh | Modified oligonucleotides, their preparation and their use |
| US6395524B2 (en) | 1996-11-27 | 2002-05-28 | University Of Washington | Thermostable polymerases having altered fidelity and method of identifying and using same |
| US6602695B2 (en) | 2000-01-14 | 2003-08-05 | University Of Washington | DNA polymerase mutant having one or more mutations in the active site |
| US6329178B1 (en) | 2000-01-14 | 2001-12-11 | University Of Washington | DNA polymerase mutant having one or more mutations in the active site |
| US20070048748A1 (en) | 2004-09-24 | 2007-03-01 | Li-Cor, Inc. | Mutant polymerases for sequencing and genotyping |
| US8932126B2 (en) | 2007-12-20 | 2015-01-13 | Aristocrat Technologies Australia Pty Limited | Method of gaming, a game controller and a gaming system |
| US7595882B1 (en) | 2008-04-14 | 2009-09-29 | General Electric Company | Hollow-core waveguide-based raman systems and methods |
| US20120020537A1 (en) | 2010-01-13 | 2012-01-26 | Francisco Garcia | Data processing system and methods |
| US20130079232A1 (en) | 2011-09-23 | 2013-03-28 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
| WO2015084985A2 (fr) | 2013-12-03 | 2015-06-11 | Illumina, Inc. | Methods and systems for analyzing image data |
| US20180274023A1 (en) | 2013-12-03 | 2018-09-27 | Illumina, Inc. | Methods and systems for analyzing image data |
| US20200302297A1 (en) * | 2019-03-21 | 2020-09-24 | Illumina, Inc. | Artificial Intelligence-Based Base Calling |
| AU2020240141A1 (en) * | 2019-03-21 | 2021-01-14 | Illumina, Inc. | Artificial intelligence-based sequencing |
| US11361194B2 (en) | 2020-10-27 | 2022-06-14 | Illumina, Inc. | Systems and methods for per-cluster intensity correction and base calling |
Non-Patent Citations (4)
| Title |
|---|
| LAWRENCE R. RABINER: "A tutorial on Hidden Markov Models and selected applications in speech recognition", PROCEEDINGS OF THE IEEE, vol. 77, no. 2, February 1989 (1989-02-01), pages 257 - 286, XP002550447, DOI: 10.1109/5.18626 |
| SAMBROOK ET AL.: "Practical Handbook of Biochemistry and Molecular Biology", 1989, COLD SPRING HARBOR PRESS, pages: 385 - 394 |
| SINGLETON ET AL.: "Dictionary of Microbiology and Molecular Biology", 1994, J. WILEY & SONS |
| ZHANG, S., WANG, B., WAN, L. ET AL.: "Estimating Phred scores of Illumina base calls by logistic regression and sparse modeling", BMC BIOINFORMATICS, vol. 18, 2017, pages 335 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| KR102515638B1 (ko) | Systems and methods for secondary analysis of nucleotide sequencing data | |
| US20220403450A1 (en) | Systems and methods for sequencing nucleotides using two optical channels | |
| KR20210047980A (ko) | Single light source, two-optical-channel sequencing | |
| EP4500536A1 (fr) | Sequence-by-sequence base calling | |
| US20250084402A1 (en) | Methods of preparing libraries for sequencing and methods of analysis | |
| US20230101253A1 (en) | Amplitude modulation for accelerated base calling | |
| US20240212791A1 (en) | Context-dependent base calling | |
| WO2025190902A1 (fr) | Improving base-call quality scores | |
| WO2025061922A1 (fr) | Sequencing methods | |
| WO2025061942A1 (fr) | Sequencing error identification and correction | |
| US20230295719A1 (en) | Paired-end sequencing | |
| US20240177807A1 (en) | Cluster segmentation and conditional base calling | |
| US20230183799A1 (en) | Parallel sample and index sequencing | |
| US20250210137A1 (en) | Directly determining signal-to-noise-ratio metrics for accelerated convergence in determining nucleotide-base calls and base-call quality | |
| WO2025006466A1 (fr) | Systems and methods for sequencing polynucleotides with four labeled nucleotides | |
| WO2025006464A1 (fr) | Systems and methods for sequencing polynucleotides with alternative scatter plots | |
| WO2025006460A9 (fr) | Systems and methods for sequencing polynucleotides with modified bases | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25711690; Country of ref document: EP; Kind code of ref document: A1 |