WO2003100541A2 - Profilage de molecules - Google Patents
Profilage de molecules Download PDFInfo
- Publication number
- WO2003100541A2 WO2003100541A2 PCT/IB2003/002634 IB0302634W WO03100541A2 WO 2003100541 A2 WO2003100541 A2 WO 2003100541A2 IB 0302634 W IB0302634 W IB 0302634W WO 03100541 A2 WO03100541 A2 WO 03100541A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- assignment
- molecules
- gene
- candidate
- link
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/11—Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
Definitions
- the present invention relates to profiling and/or identifying molecules in a sample. It is particularly concerned with profiling chemical or biological molecules contained in an experimental sample using measured data about molecules actually present and known information about candidate molecules that may be present.
- Such analysis may comprise measurement or determination of one or more of properties of molecules in that sample, e.g. absolute number, concentration, fluorescent intensity, etc..
- the information about the possible candidate molecular species is incomplete. For instance there may be other candidate species which are not known about. Alternatively or additionally, the knowledge of the relevant properties of each candidate (such as its DNA sequence, amino acid sequence or atomic composition) may be incomplete .
- the present invention provides a method of assigning tags to candidates. This may be achieved with a high degree of accuracy and a low false positive rate by minimising the effect of one or more possible sources of error.
- an objective goal is optimised by linear programming.
- assignment is optimised by mixed integer programming.
- the present invention provides a method of providing a profile of molecular species present in a mixture contained in a sample and/or identifying the presence of multiple molecular species in a sample, the method comprising: generating a dataset comprising information measured or determined for molecules of the sample, including, for one or more fractions out of a plurality of different fractions of molecules wherein each fraction has a different property, a measured or determined sum total of an activity over all molecules that have a particular property, the sum total being assigned to the fraction of molecules with that particular property, and assigning to molecules for which measured or determined information is present in the dataset a candidate molecular species which may be contained in the sample and for which information is present in a database; wherein there is uncertainty over at least one of the following: a) the information for each candidate molecular species in the database, b) completeness of the database, c) accuracy of generation of the fractions, 2 be physical fractions obtained by separating the sample using physical properties of the molecules (e.g.
- the measurement relates to the amount of nucleic acid fragments arising from cutting transcripts with a given restriction enzyme and which fragments have a given length.
- Measurement of each property may not be exact, and there may be an error in the sum total activity. As all such measurements are experimental, they may be subject to errors, which may be errors of fractionation, such as an uncertainty as to which fraction a particular measured species should be placed in or derives from, or of measurement, such as an error as to the precise property or quantity measured. Such errors, if they occur, may complicate resolution of which actual molecules in the sample correspond to which candidate.
- tags which are the physical properties that are used to define the fractions
- tags quantities which are the measured quantity of each tag in each fraction.
- the desire is to create a number of "assignments” between tags and candidates in order to create a profile of the sample under examination and/or identify which candidates are actually present.
- assigning to molecules for which measured or determined information is present in the dataset a candidate molecular species which may be contained in the sample and for which information is present in the database comprises : l) generating a set of constraints for a number of candidates based on possible assignments of candidates to measured fractions and resultant limitations on total activity of each candidate potentially present in the sample, 11) creating an objective function which it is desired to either maximise or minimise, m) optimising the objective function with regard to the set of constraints to provide a profile of candidate molecular species actually present in the sample and/or identify candidate molecular species actually present in the sample, and optionally to determining amounts of different candidate molecular species actually present in the sample .
- a profile of molecular species present in a mixture may comprise information on the abundance of each molecular species or on some activity of each molecular species.
- activities may include but are not limited to enzymatic activity or molecular modifications such as protein phosphorylation, or experimentally introduced labels such as fluorochromes or isotope labels.
- a method of the invention as claimed may comprise measuring or determining information for molecules of the sample.
- a method of the invention as claimed may comprise measuring or determining a sum total of an activity over all molecules that have a particular property (e.g. sum total of abundance of nucleic acid fragments of a particular length) .
- the objective function may be created in any of various suitable forms, each of which may focus on minimising the impact of a particular form or source of error.
- an objective function may be used which minimises the part of the sum total quantity in each fraction which is not assigned to candidate species.
- an objective function may be used which minimises the errors arising from mutually exclusive assignment of candidate species to fractions, as discussed below.
- a general objective function can be created which contains contributions from each possible source of error, such as those discussed in the previous paragraph, and these contributions can be independently weighted relative to each other using constants.
- some of the constraints take the form of equalities which are constructed using the knowledge that for each tag the quantities of the candidates assigned to the tag together make up the tag-quantity. If the database of candidates is incomplete, then there is a possibility that some of the tag-quantity derives from an unknown candidate. Therefore, an equality can be derived for each tag using a slack variable to represent the amount of the tag-quantity which is not accounted for by the candidates assigned to the tag.
- the assignment of candidates to tags may also be ambiguous.
- SAGE is based on comparison with sequence databases having an expected error rate of more than 1% per base (or about 10% per 9 base-pair (bp) tag)
- bp base-pair
- fragments are generated by cutting nucleic acid copied from a sample using restriction enzymes in different fractions, and sorted according to partial sequence information and length/fragment size, optionally with measurement of abundance
- ambiguity may occur when measured fragment sizes are not perfectly accurate. If a gene g in a sequence database would give rise to a 123 bp fragment, and fragments A and B are observed at 122 and 124 bp respectively with a 1% error in the measurement, then it is impossible to decide on this information alone which fragment should be assigned to the gene.
- This ambiguity may be incorporated into the constraints using the fact that the quantity of the candidate is less than or equal to the sum of the tag-quantities associated with all the tags which that candidate could possibly be assigned to. Whilst this is certainly true, it results in a loss of information about the system, namely that a given candidate can in reality only be assigned to one tag, but here it is allowed to contribute a portion to several. In a situation where several tags are closely spaced, or the possible error is comparatively large, this can result in the combining of a significant proportion of the tags in the sample, resulting in insufficient information being available to resolve the possible assignments. Therefore it is preferable to work from the fact that the assignment of the candidate to each of the tags is normally mutually exclusive (a "mutex assignment") - i.e.
- An embodiment of the present invention proposes a solution to this problem by the introduction of pseudo-Boolean variables (constrained to be either 0 or 1) .
- These variables are "slack" variables (i.e. they are introduced as a way of expressing an additional constraint) which can be used to express this mutual exclusion in the constraints and thereby create a more powerful model of the true assignment problem.
- MIP Mixed integer programming
- candidate transcripts in a database are assigned to genes actually transcribed in a sample on the basis of determination of partial sequence information ("subsequence") and length of fragments generated in different fractions for the transcribed genes, along with abundance of fragments of determined partial sequence and length. This kind of system is discussed in a great deal more detail further below.
- candidate transcripts in a database are assigned to genes actually transcribed in a sample on the basis of partial sequence information ("subsequence") alone - e.g. as in SAGE (Velculescu et al) - along with the abundance of fragments of a determined subsequence measured by counting the number of occurrences of each subsequence in the dataset.
- candidate proteins in a database are assigned to proteins actually present in a sample on the basis of partial structure information obtained by digesting the protein sample into peptides and measuring the mass of each peptide fragment with mass spectroscopy, along with the abundance of each peptide fragment so measured.
- Differential display invented by Liang and Pardee in 1990, was one of the first methods capable of analyzing a large proportion of the expressed genome (Liang et al., 1992, Science 257, 967-971) . It is also the first example of an open system, so called because it does not rely on specific primers or probes and is thus not limited to detecting a pre-determined set of genes.
- Microarrays invented by Ed Southern (Southern et al., (1994) J Biotechnol 35, 217-227, Southern et al., (1992) Genomics 13, 1008-1017), introduced a highly parallel closed system, where gene-specific probes are synthesized or deposited on a surface which is then used to probe an mRNA sample.
- Microarrays have been improved in various ways. Brown and colleagues pioneered contact printing of cDNAs (Schena et al., (1995) Science 270, 467-470) while Fodor and colleagues introduced direct oligonucleotide synthesis m si tu using a mask-based light-directed system (Fodor et al. (1991) Science 251, 767-773) , commercialized by Affymetrix as the GeneChip. Since then, oligonucleotide and cDNA microarrays have dominated the market for closed systems. Recently, microarrays carrying oligonucleotides synthesized in si tu using an ink jet system have become a viable alternative (Hughes et al . , (2001) Nat Biotechnol 19, 342- 347) .
- Real-time PCR is a sensitive, accurate and reproducible closed system for gene expression analysis.
- target sequences are amplified and the yield of product is monitored during each cycle.
- the approach permits measuring exactly the PCR amplification efficiency and yield and thereby to calculate the initial concentration of the target.
- the procedure is expensive, however, and low- throughput, because each gene must be analyzed separately. Hence it is not suitable for global gene expression analysis .
- SAGE Serial analysis of gene expression
- Velculescu et al. (1995) Science 270, 484-487) .
- SAGE is an improvement over differential display in that it at least provides a 9-10 bp clue to the identity of each gene measured, such a tag is still not specific enough to unambiguously identify a gene in most cases.
- MPSS Massively Parallel Signature Sequencing
- TOGA Total Gene Expression Analysis
- a further approach called DNA Indexing uses a Type IIS enzyme to cut the target.
- Type IIS enzymes bind a recognition sequence but cut outside it so that a sequence-specific cohesive end is generated.
- This activity can in principle be used to ligate adaptors consisting of a universal primer template sequence and a cohesive end which is varied between sub-reactions. If a four-base overhang is generated by the enzyme, 256 adaptors would be needed. There have however been problems with finding conditions where adaptor ligation discriminates a four base overhang (Kato (1995)
- RNA or DNA species present in the sample including some measure of the quantity (such as a tag count or a fluorescent intensity measured for the fragments in a size-fractionated sample) and some data related to the identity of each molecule (such as a tag sequence or the position of a band on a gel) .
- some measure of the quantity such as a tag count or a fluorescent intensity measured for the fragments in a size-fractionated sample
- data related to the identity of each molecule such as a tag sequence or the position of a band on a gel
- SAGE serial analysis of gene expression
- a library of short sequence tags typically 9 bp long
- 9 bp multiple tags can be sequenced s multaneoulsy, thus reducing cost to a manageable level.
- 9 bp is normally not enough to uniquely identify a gene corresponding to the tag.
- TOGA Digital Gene Technology's TOGA, fragments are isolated and displayed on a capillary electrophoresis system in such a way that 8 bp of information is revealed about each fragment. Again, 8 bp is not enough to find a unique candidate in a sequence database.
- the system is thus a global gene expression analysis method which combines whole-genome scope with direct and robust identification in silico.
- DNA indexing a comprehensive RNA profile is generated in which each gene is represented once, by a single fragment.
- the indexing procedure reveals a 10 bp sequence tag for each fragment, plus its length.
- tags can not normally be unambiguosly assigned to a gene in a sequence database.
- a combinatorial identification algorithm based on three 10 bp tags per gene, is then used to simultaneously quantify and identify virtually all genes.
- Combinatorial identification is a procedure that uses length and/or partial sequence information obtained for a set of fragments - where each gene is represented by more than one fragment - to identify in a sequence database those genes (or other sequences) which produced the observed fragments.
- the key to combinatorial identification is that each gene is probed more than once. This has the consequence that, even though one may find multiple candidate genes for each fragment (as in SAGE) , there is collectively enough information to unambiguously identify each gene's contribution to a particular fragment.
- Figure la shows the four possible assignments between candidates (transcripts) and tags: (a) unambiguous, (b) unresolved, (c) mutually exclusive, with an error measure indicated, (d) multiple, unambiguous, (e) mutex candidates.
- Figure lb shows the schematic map of the problem of Example I.
- Figure 2 shows a portion of a schematic map of an assignment problem.
- Figure 3 shows a scatter graph of the experiment described in Example III.
- transcribed genes in a sample are analysed on the basis of partial sequence information, abundance and length, with different fractions being generated by means of cutting to create fragments using different restriction enzymes and amplification using different adaptors and primers. Details of embodiments of a method of providing a profile for transcribed genes in a sample are found in WO02/08461 and GB patent 2365124.
- Tags and tag-quantities produced from the sample may be analysed in accordance with the present invention, employing a database of candidates.
- Candidates may first be assigned to the tags. Then any tag-quantity of zero, or of less than a threshold value may be used as evidence that the candidates assigned to it were not present in the sample at all. These candidates are given an activity of zero, and are removed from all other assignments.
- mapping of assignments may include one or more of each of the following:
- a decision variable B ag for each link a in each mutex assignment g which takes the value 0 if the link is not used, and 1 if the link is used.
- a slack variable S p for each tag p denoting the part of its tag-quantity that is not contributed by the expression E ag of all candidates assigned to it.
- This slack variable can be used in the objective function, for example to minimise the fraction of tag-quantity not explained by assigned candidates - i.e. minimising S p /Q p .
- S p must be zero or positive.
- Each mutex assignment can only have one active link. That is to say that in each mutex assignment at most one decision variable B ag can be non-zero. This is expressed as
- the total contribution of all candidates assigned to a particular tag may not exceed the tag-quantity. This is expressed by stating that the sum of all such contributions to each tag ⁇ E ap for all a) plus the remainder S p must equal the tag-quantity:
- each candidate is the same in every assignment it occurs in. In other words, all the E ag for a given candidate must be equal, and can be set equal to the total expression for that candidate Et .
- This constraint can be expressed as
- This constraint operates with the constraint involving M above to ensure that in each assignment for which a candidate' s link is active, its contribution to the tag quantity is its expression E t , the sum totals of the contributions of all active assignments to a particular tag being constrained to be less than the tag quantity as described in the previous condition.
- an additional decision variable D ⁇ may be introduced for each candidate and similar constraints: E 1 ⁇ M x D 2 and ⁇ . . D x ⁇ 1 (with one such sum for each set of mutually exclusive candidates) used.
- additional constraints can be useful either to address a particular source of error, or to reflect a known situation. For example, in gene expression analysis, it may be known that one candidate is longer than another, although there is minimal (1 bp) separation between the two. An additional constraint can be used to enforce this proper ordering when assigning these candidates to closely spaced tags. To illustrate more specifically, if candidates Cj and C 2 are 199 and 200 bp in length, and they are to be assigned to three tags T l t T 2 and T 3 at 199, 200 and 201 bp with a known error of ⁇ 2 bp, there would normally be nothing to prevent assigning Cj to T 2 and C 2 to T ⁇ .
- Such assignments can be explicitly prevented by forcing their decision variables to be mutually exclusive, i.e. by constraining B 12 + B 21 ⁇ 1 in this case.
- One example of the objective function contains three terms independently weighted using constants Ki, K 2 , K, wherein the terms are: a) the negative of the ratio between the tag- quantity which is not assigned to a candidate species S p and the total tag-quantity Q p for all tags p - i.e.
- K l r K 2 l and K 3 can be used to weight the objective function in favour of one or more of the terms, for example if a particular source of error is known to be small. It will be appreciated that the same result will be obtained by minimising the negative of the objective function.
- the following section provides further description of embodiments of the invention as applied to a tissue sample containing mRNA treated as follows: take a tissue sample, extract mRNA, translate that into cDNA attached to a solid support such as magnetic beads, and then pass them through restriction enzymes and amplification with selected ligation fragments (see e.g. WO02/08461) .
- the restriction enzymes may recognize DNA sequences 5 base pairs long. A more generic name for these recognition sequences is, e.g., a restriction group.
- Each of these may be referred to as a frame .
- All the discrete information on fragments, frames and enzymes, may be referred to, e.g., as a frameset.
- a head-frame as the four letters overhang (bases resulting from restriction enzyme digest)
- a sub-frame as the single letter before the polyadenylation sequence.
- the frame is the combination of the head-frame and the su--fra- ⁇ e.
- a gene (candidate) is (ID; s; stop; pal), where ID is a unique identifier, s is a string of nucleotides, stop is a stop position and pal is a pointer to a list of alternative poly-adenylation sites.
- Such a list contains items (ID, pos; q) , where ID is a unique identifier, pos is the position of the poly-adenylation site, and q is a quality indicator.
- ID is a unique identifier
- pos is the position of the poly-adenylation site
- q is a quality indicator.
- a fragment is (enzyme; frame; length). This represents a string of integer length length, where we only know the information that it is produced by enzyme, and occurs m frame.
- a peak (tag) is (ID; enzyme; frame; length; Area) , where ID is an identifier provided with the processed experimental data.
- a dark peak is (dark; enzyme; frame; length; any) . This models a peak which was not observed because of technological limitations , e.g. machine failure, or because length too long. It is not supposed to have been listed in the experimental data, but to have identifier dark. Since its properties are not observed, it can have any area.
- a zero peak is (zero; enzyme; frame; length; bg) . This models a peak which was not observed because its area was below the detection limit. It is not supposed to have been listed m the experimental data, but to have identifier zero. Its area must be less than bg (background) .
- a t-gene for transcribed or terminated gene, is (ID; s; pos; min; max) here ID is a unique identifier, s is a string of nucleotides , pos is the most likely position of the poly-adenylation site, min is furthermost possible position of the poly-adenylation site in the 5' direction, and max the furthermost possible position of the poly-adenylation site in the 3' direction.
- ID is a unique identifier
- s is a string of nucleotides
- pos is the most likely position of the poly-adenylation site
- min is furthermost possible position of the poly-adenylation site in the 5' direction
- max the furthermost possible position of the poly-adenylation site in the 3' direction.
- a t-gene is formed from the information in one gene, and in one entry in its list of poly-adenylation sites. If there is no information in the quality indicator q on two hard bounds min and max,
- a u-gene, for unambiguous gene, is a tuple (ID; s; pos) where pos is the fixed position of the poly-A site, and the other fields are the same as in a t-gene.
- the bipartite graph (e.g. that of Figure 2) encodes two sources of uncertainty.
- the genomic information linking genes and fragments may be incomplete.
- One possibility is that the set of genes in itself is incorrect or incomplete, for (at least) the following reasons:
- the sequence in the database may not be complete all the way to the end.
- a second possibility is sequencing errors, If those errors are at the bases specifying a frame, it would mean that a gene would consistently show up in the wrong frame.
- SNPs single nucleotide polymorphisms
- Celera quotes e.g. an average rate of SNP in the human of one per 1500 base pairs. There are ten base pairs in the frame and in the recognition sequence of the enzyme. If we assume (1/1500) to be an error rate of a genomic sequence, the chance that all of ten base pairs are right is (1 - 1/1500) 10 * 0:94. This rate is of course highly dependent on the accuracy of the genomic sequence, and on the rate of SNPs.
- the second source of uncertainty is that of experimental errors for example in the determination of length and Area of a peak, and in the comparison with fragments. We can assume that Area is reproducible up to a factor about two.
- True DNA string length is an integer (a whole number of nucleotides) , while the observed length is approximately given with one decimal position. (Resolution of the sequencing machine is about 11 data points per base, run- to-run variability is 15% of one base.)
- mapping depends somewhat on the actual sequence.
- DNA may have a tendency to curl up, which varies with the sequence. That will effectively make pieces of DNA of the same length travel at different speeds in the capillary.
- This error may, as a first approximation, be modeled as a random spread. Nevertheless, it is not really random, because if one knows the letters in the DNA piece in question, and measures the passage of this sequence directly through the sequencing machine, one can determine the actual passage time. Thus, these length corrections may be determined and may be included as corrections in a data base.
- the actual experimental data from a tissue sample is a collection of peaks. Every peak can be the observation of fragments of one or several genes.
- the bipartite graph of Figure 2 expresses all possible such observations; if a peak p can be an observation of a gene g there is a link ( g,p) .
- Let the set of links be denoted ⁇ ⁇ (g,p) ⁇ g e ⁇ Genes ⁇ ; p e ⁇ Peaks ⁇ (3.1)
- ⁇ c ⁇ (t,p); t e ⁇ t-Genes ⁇ ; p e ⁇ Peaks ⁇ (3.2)
- Enzyme enzyme (I) cleaves the string s in t-gene t at a well-defined site, and leaves a fragment which agrees in frame with -ra-ne(I). Its length is close to length ( 1) .
- the predicted length of link 1 is the length of this fragment. Definition 3 . 1 Defini tion : an assignmen t is a subset ⁇ of
- ⁇ * 1 such tha t for each t-gene tg, ei ther there are no links in ⁇ or there is a link from tg to one distinct peak per enzyme .
- the sub-frames of these peaks must be iden tical .
- assignments can be of the type that only some links from a given gene can be assigned together. They can also be of the type that the links that can be assigned from two genes vary together, e.g. that they have a common offset in length .
- J tg expression level of t-gene tg (4.1; Given an assignment, we can then compute predicted peak areas :
- the predicted peak areas are functions of the assignment ⁇ and the set of gene expression levels ⁇ E tg ⁇ . Given an element in an assignment, we also have from above its predicted length L p ( l ) .
- One example is a quadratic target function for lengths
- a special case to consider is an assignment to a dark peak.
- One idea would be to estimate the expected distribution of peaks in the range of the dark peak (e.g. around 1200 bp, for example) and then calculate the expected distance from an average such peak as above using some distribution. That can be done by finding the known frequency of fragment lengths in the database across the entire range of lengths
- the peak areas are the outcomes of a multiplicative process (the PCR) . Such processes often give rise to log-normal distributions . We may therefore also list a log-normal target function for areas
- Hard bounds on the length differences should be taken care of in the definition of ⁇ . Hard bounds on differences in expression levels may be taken care of by modifying the target function on areas.
- the matrix M tg is hence the inverse of the correlation matrix.
- Definition 5.1 Represent the expected FRAGMENT LENGTH (in discrete base pairs) of all known t-genes as a n x matrix where each column represents the expected lengths of the final fragment of the i ; th t-gene from the last occurrence of the recognition sequence to the end of the t-gene for each of the m RESTRICTION GROUPS.
- each t-gene will occur either in a particular head-frame or not at all in each of the restriction groups and since we know the sequence of the (known) t-genes we can define a MAPPING from a t-gene i and a restriction group j to the initial sequence of the final fragment for that restriction group.
- ⁇ be a function from a t-gene l and an restriction group j to the initial sequence of the fragment immediately following the last occurrence of the recognition sequence in the t-gene.
- the expected fragment length given by L 1D for restriction group j may be off by one or more bases. We cannot therefore predict with certainty the sub-frame m which the expression of a particular t-gene can be observed. We can however for each t-gene and restriction group give the sub-frame for any given offset from the expected fragment length.
- ⁇ be a function from a t-gene l and a (positive or negative) offset o -from the expected length given by L J to a sub-frame s such that if the actual polyadenylation site would give a fragment of length of L J + o m restriction group j, then the expression of t-gene l for that restriction group will occur m subframe ⁇ ( ⁇ ,o) (of head-frame ⁇ ( ⁇ ,j)) .
- ⁇ and ⁇ correspond to the length and enzyme function of section 3 and that the frame function of that section is here split into the two functions and ⁇ .
- Definition 5.6 Represent a MATCHING as an assignment for each t-gene i of a (discrete) offset variable 0 L and of m
- any assigned peak occurs in the correct restriction group.
- the peak also has to occur in the correct head-frame and for a given offset O ⁇ the correct sub-frame:
- ⁇ and v ⁇ be parameters that depend on the probability of finding the polyadenylation site for t-gene i at an offset O. from the one used to compute the expected fragment lengths ⁇ J in each of the restriction groups. Note that with the definition of t-gene used we can expect the value of 0 1 to be quite close to zero. A first guess for ⁇ x would be between -5 and 0 and for v between 0 and 15 depending on the confidence put in the expected fragment length L J . Asymmetry of the parameters can be used to encode the case where we know e.g. the first likely polyadenylation site but are uncertain if there are more.
- the O ⁇ variable captures the uncertainty in the exact position of the polyadenylation site while the possible assignments for the C J variables should depend on the position r-i of each candidate peak 1.
- Me J Since Me J encodes the absolute offset from the expected fragment length we can use it to constrain the matching variables as follows:
- ⁇ J and v XJ are parameters encoding the maximum acceptable absolute error in detected fragment length in an experiment for t-gene i in restriction group j .
- ⁇ XJ and v XJ depend on the sequence dependent probabili ty of detecting a fragment of t -gene i a t posi tion ⁇ ⁇ C XJ ) in restriction group j .
- a first guess is tha t typical values for ⁇ XJ should be between -l ⁇ and -15 ⁇ depending on fragment length (and possibly actual sequence) and for v XJ -between 0 and 2 ⁇ depending (probably only) on fragment length .
- the asymmetry of the parameters is motiva ted by the observa tion tha t sequence dependent errors usually manifests as apparently shorter fragmen ts lengths .
- ⁇ x represents a maximum acceptable total error over the m restriction groups for t-gene i.
- ⁇ XJ are function parameters encoding the relative weight we give to each individual match error for a given t-gene i and restriction group j
- ⁇ ⁇ :i k a e constant parameters encoding the relative weight given to error discrepancies for each pair (j,k) of restriction groups
- ⁇ x is a constant parameter encoding the relative penalty given to offsets in polyadenylation site for t-gene i.
- the ⁇ lk should probably depend on the size and difference in expected fragment lengths for the gene i in enzyme groups j and k.
- Definition 5.16 Represent the quantitative level of expression of the i : th t-gene with a variable E x in the same units as that used to express the area ⁇ L of each peak i.
- the matching procedure should however aim to minimize the sum of such "unexplained" peak areas.
- V(l ⁇ p) (( ⁇ E ) ⁇ ⁇ t ⁇ d)) ti ⁇ C, r ,i) U
- ⁇ i is a parameter representing a maximum acceptable AREA UNDERESTIMATE of the area of the peak 1.
- ⁇ and ⁇ are sui table weights expressing the rela tive contribution to the penal ty from area underestima te and unexplained area of observed peaks respectively
- the set of candidate t-genes of the candidate peaks of a particular t-gene can extended by recursively considering candidate t-genes and peaks.
- the goal of the MIP model is to compute an assignment ⁇ — ⁇ u and gene expression levels with minimal total cost, which we take as a weighted sum of the peak area error plus the peak length error.
- p ( ug,p) which gives the penalty of (ug,p) C ⁇ .
- the key dea of the MIP model is to capture the search space by decision (0-1) variables B ug ⁇ P , one per link ( ug,p) C ⁇ u .
- B ug/P 1 iff ug contributes to peak p.
- the model decomposes into independent subproblems, one per connected component of ⁇ u .
- M is a very large constant (chosen so as to be larger than any possible value of E ug ) .
- Vp ⁇ E ug , p ⁇ Area (p) (6.4) ug Consistent expression levels.
- the expression level of a u- gene is bounded above by the expression level admitted by each enzyme.
- a subproblem may easily be underdetermined. That is, the subproblem may admit many optimal solutions that vary only in the expression levels of some groups of u-genes, reflecting a lack of data to resolve ambiguities. Rather than reporting an arbitrarily chosen solution, we propose to identify each such group ⁇ of u-genes and to replace its expression variables by a single expression variable for the whole group. The idea is to reflect the ambiguous evidence in the solutions reported.
- ID (ug) ID (g) E ug .
- each node V ug , z can have non-zero flow on at most one outgoing link; - for each u-gene ug, the flow from the source node to each V ug/ Z node must be equal.
- a given solution can be improved step-wise.
- Pick a cluster for example, the one with the largest error E
- redo the peak assignment for one or more t-genes in the cluster for example, those corresponding to the peaks with the largest errors
- pick one t-gene assigned to the peak with the largest error in the cluster with the largest error Re-assign it by picking the second-most- probable link from its set of possible links.
- pick another gene assigned to the same peak then pick the second-most erroneous peak etc.
- ( tg) be the set of links in ⁇ fc that start at t-gene tg. Associate to each of these links 1 a variable s l r equal to 0 or 1. We can refer to this variable as the spin of the link; it is analogous to the decision variables B Ug , p n section 6.
- the state variables are the link spins.
- auxiliary variables we have the gene expression levels of (4.1). A set of link spins is identified with an assignment ⁇ according to
- the state space is hence the set of configurations of the link spins that can be identified with an assignment.
- the auxiliary gene expression variables must be non-negative.
- Equations (7.2) and (7.3) form the basis for simulated annealing.
- the procedure is as follows:
- the following example shows a sample input fil e for the set of three tags shown in schematic map form in Figure lb , along with the resulting mixed-integer programming (MI P) problem and the result .
- MI P mixed-integer programming
- the MIP approach has also been tested by generating a simulated dataset based on based on a mouse gene database.
- the database contained approximately 18,000 known genes, derived from the non-redundant set of genes that can be obtained from Refseq, Unigene, Riken and Genbank.
- Penalties were assigned based on the distance from observed to expected fragment length.
- the algorithm was capable of correctly identifying 99% of the ⁇ 18,000 genes designated as "known", with a false- positive rate of 8%.
- the result was also quantitatively correct: in 96% of the cases, the expression level of the gene was within 2% of the correct value.
- Table 2 The results of a number of additional simulations with different parameter settings are shown in Table 2.
- 7.5 ⁇ g mouse muscle mRNA was analyzed using three different Type IIS enzymes to generate three complete fragment profiles, each consisting of approximately 30,000 fragments in a total of 768 subreactions . Due to the experimental protocol, each fragment revealed 10 bp of sequence plus its length.
- a sequence database was constructed from several sources, including annotated GeneBank sequences, RIKEN full length cDNAs, Unigene and RefSeq as well as some internally generated 3' -end cDNA sequences. Sequences were aligned to each other, and when possible, ambiguities were corrected. Sequences that had a closed overlap were joined into a longer contig.
- PolyA sites were assigned using annotated data, polyA tails, alignment of 3' ends, the occurrence of polyA signals and additional accumulated knowledge.
- PolyA sites were scored according to how strong evidence that support them, and the scores were used in optimizations during the combinatorial identification.
- the database contained -48 000 transcripts. Only the -18 000 transcripts which had good evidence for an open reading frame coding for a protein were used in the following analysis.
- Peak linking is the process of building a model that represents all possible assignments of genes to peaks.
- the model can then be optimized using various methods to find an assignment (and thus a gene expression profile) that minimizes some measure of the error.
- the algorithm starts with a file containing all peaks, each obtained in a particular sub-reaction which determines the frame ("subsequence") and restriction enzyme used. So for each peak, the size (in base pairs) of the corresponding fragment is known to within ⁇ 2 bp, together with a 9 bp sequence tag.
- a gene was scanned and gene candidates were assigned to each peak. For each such assignment, there can be additional constraints. For example, a gene may be a candidate for one or the other of two nearby peaks, but not both at the same time (since it has a definite length) . Such constraints were compiled as described above and added to the model.
- the model was then submitted to a combinatorial solver (Hog CPLEX) which optimizes an error measure.
- the objective was to minimize the assignment error (i.e. the difference between observed and expected fragment sizes) while maximizing explanatory power (i.e. the extent to which the pattern of peaks is Explained' by the calculated gene expression profile) .
- the constraints and objective function were as described above.
- Figure 3 is a scatterplot of a replicated experiment, where 511 transcripts that were detected in both experiments are plotted against each other, illustrating the reproducibility of the method. The variance observed was not greater than that observed for the raw data (electrophoretic peaks) , providing indication that the method of th s invention adds insignificantly to the experimental error while affording a direct and robust identification and quantification of molecules present in a sample.
- Table 2 shows results of a large number of simulations involving a set of genes ('Simulated genes') that were treated in silico to generate three fragments each in accordance with WO02/08461. Each fragment was quantified and sized and a 9 bp tag was obtained. Noise was added to both the quantity and the fragment size. The resulting dataset was then analysed by means of algorithm as described in Example II, identifying and quantifying the genes in the sample ('Detected genes'). The detection ratio ('Sensitivity') and the inverse of the false- positive ratio ('Specificity') are shown.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Genetics & Genomics (AREA)
- Mathematical Analysis (AREA)
- Biophysics (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Mathematical Optimization (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Operations Research (AREA)
- Signal Processing (AREA)
- Algebra (AREA)
- General Engineering & Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2003239736A AU2003239736A1 (en) | 2002-05-24 | 2003-05-20 | Methods for profiling molecules with an objective function |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US38262802P | 2002-05-24 | 2002-05-24 | |
| US60/382,628 | 2002-05-24 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2003100541A2 true WO2003100541A2 (fr) | 2003-12-04 |
| WO2003100541A3 WO2003100541A3 (fr) | 2004-02-05 |
Family
ID=29584433
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2003/002634 Ceased WO2003100541A2 (fr) | 2002-05-24 | 2003-05-20 | Profilage de molecules |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2003239736A1 (fr) |
| WO (1) | WO2003100541A2 (fr) |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6081766A (en) * | 1993-05-21 | 2000-06-27 | Axys Pharmaceuticals, Inc. | Machine-learning approach to modeling biological activity for molecular design and to modeling other characteristics |
| US5871697A (en) * | 1995-10-24 | 1999-02-16 | Curagen Corporation | Method and apparatus for identifying, classifying, or quantifying DNA sequences in a sample without sequencing |
| US6132969A (en) * | 1998-06-19 | 2000-10-17 | Rosetta Inpharmatics, Inc. | Methods for testing biological network models |
| AU3482200A (en) * | 1999-02-02 | 2000-08-25 | Bernhard Palsson | Methods for identifying drug targets based on genomic sequence data |
| GB2365124B (en) * | 2000-07-21 | 2002-05-01 | Karolinska Innovations Ab | Methods for analysis and identification of transcribed genes and fingerprinting |
| WO2002010456A2 (fr) * | 2000-07-31 | 2002-02-07 | The Institute For Systems Biology | Analyse a parametres multiples en medecine predictive |
| US8898021B2 (en) * | 2001-02-02 | 2014-11-25 | Mark W. Perlin | Method and system for DNA mixture analysis |
-
2003
- 2003-05-20 WO PCT/IB2003/002634 patent/WO2003100541A2/fr not_active Ceased
- 2003-05-20 AU AU2003239736A patent/AU2003239736A1/en not_active Abandoned
Also Published As
| Publication number | Publication date |
|---|---|
| AU2003239736A8 (en) | 2003-12-12 |
| WO2003100541A3 (fr) | 2004-02-05 |
| AU2003239736A1 (en) | 2003-12-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12087401B2 (en) | Using cell-free DNA fragment size to detect tumor-associated variant | |
| US8594951B2 (en) | Methods and systems for nucleic acid sequence analysis | |
| CN110770840B (zh) | 用于对来自已知或未知基因型的多个贡献者的dna混合物分解和定量的方法和系统 | |
| CN110914911B (zh) | 压缩分子标记的核酸序列数据的方法 | |
| CN112397149A (zh) | 无参考基因组序列的转录组分析方法及系统 | |
| EP3483286B1 (fr) | Procédé d'analyse de séquence, appareil d'analyse de séquence, et appareil de génération de séquence de référence | |
| US20260018252A1 (en) | Machine learning techniques for analysis of structural variants | |
| Stokes et al. | Transcriptomics for clinical and experimental biology research: Hang on a Seq | |
| CN103168118A (zh) | 用减少数量的转录物测量进行的基因表达概况分析 | |
| Ahmed et al. | Identifying A-and P-site locations on ribosome-protected mRNA fragments using Integer Programming | |
| Selega et al. | Robust statistical modeling improves sensitivity of high-throughput RNA structure probing experiments | |
| WO2023244983A1 (fr) | Méthodes et compositions de validation de processus de séquence | |
| CN116246703A (zh) | 一种核酸测序数据的质量评估方法 | |
| Melov et al. | Microarrays as a tool to investigate the biology of aging: a retrospective and a look to the future | |
| Hong et al. | Random Forest model reveals the interaction between N6-methyladenosine modifications and RNA-binding proteins | |
| Cleary et al. | Compressed sensing for imaging transcriptomics | |
| JP5403563B2 (ja) | 網羅的フラグメント解析における遺伝子同定方法および発現解析方法 | |
| Scharpf et al. | Statistical modeling and visualization of molecular profiles in cancer | |
| Chong et al. | SeqControl: process control for DNA sequencing | |
| WO2003100541A2 (fr) | Profilage de molecules | |
| Dannemann et al. | The effects of probe binding affinity differences on gene expression measurements and how to deal with them | |
| CN107533588B (zh) | 估计dna芯片探针-靶亲和性的方法和制造dna芯片的方法 | |
| Zuo et al. | Pm 6 A: an Integrated Classification Algorithm for 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Identifying m 6 A Sites | |
| Nieuwenhuis | DISCOVERING TECHNICAL AND BIOLOGICAL DRIVERS OF VARIATION IN BULK RNA-SEQUENCING | |
| Akeson et al. | tRNA isodecoder analysis using Nanopore ionic current signals and deep learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PH PL PT RO RU SC SD SE SG SK SL TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: JP |
|
| WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |