HK1246830B - Mutational analysis of plasma DNA for cancer detection

Info

Publication number: HK1246830B
Application number: HK18106285.1A
Authority: HK
Inventors: 赵慧君; 卢煜明; 陈君赐; 江培勇
Original assignee: 香港中文大学
Priority date: 2012-06-21
Filing date: 2018-05-15
Publication date: 2023-01-13

Description

Plasma DNA mutation analysis for cancer detection

Cross Reference to Related Applications

This application claims the benefit of U.S. provisional patent application No. 61/662,878, filed on June 21, 2012, entitled "MUTATIONAL ANALYSIS OF PLASMA DNA FOR CANCER DETECTION"; U.S. provisional patent application No. 61/682,725, filed on 31/2012, entitled "MUTATIONAL ANALYSIS OF PLASMA DNA FOR CANCER DETECTION"; U.S. provisional patent application No. 61/695,795, filed on 31/2012, entitled "MUTATIONAL ANALYSIS OF PLASMA DNA FOR CANCER DETECTION"; and U.S. provisional patent application No. 61/711,172, filed on 8/2012, entitled "MUTATIONAL ANALYSIS OF PLASMA DNA FOR CANCER DETECTION", each of which is incorporated herein by reference in its entirety for all purposes.

Background

It has been shown that tumor-derived DNA is present in the cell-free plasma/serum of cancer patients (Chen XQ et al. Nat Med 1996; 2: 1033-). Most current methods are based on the direct analysis of mutations known to be associated with cancer (Diehl F et al. Proc Natl Acad Sci 2005; 102: 16368-16373; Forshew T et al. Sci Transl Med 2012; 4: 136ra68). Another approach investigates cancer-associated genomic copy number variations detected by random sequencing of plasma DNA (Lo et al., U.S. patent publication 2013/0040824).

It is known that, over time, more than one cancer cell will acquire a growth advantage and produce multiple clones of progeny cells. Eventually, the tumor growth and/or its metastases will contain a conglomerate of multiple clonal cancer cell populations. This phenomenon is typically referred to as tumor heterogeneity (Gerlinger M et al. N Engl J Med 2012; 366: 883-892; Yap TA et al. Sci Transl Med 2012; 4: 127ps10).

It is known that cancers are highly heterogeneous, i.e., the mutational profile of cancers of the same tissue type can vary widely. Thus, direct analysis of specific mutations can typically detect only the subset of cases within a particular cancer type that are known to be associated with those specific mutations. In addition, tumor-derived DNA is usually a minor species of DNA in human plasma, and the absolute concentration of DNA in plasma is low. Thus, even in patients whose cancers are known to carry the targeted mutations, direct detection of one or a small group of cancer-associated mutations in plasma or serum may achieve low analytical sensitivity. Furthermore, it has been shown that there is significant intratumoral heterogeneity with respect to mutations even within a single tumor. A mutation may be found in only a subpopulation of the tumor cells. The difference in mutational profile between the primary tumor and its metastatic lesions is even greater. One example of intratumoral and primary-metastasis heterogeneity involves the KRAS, BRAF and PIK3CA genes in patients with colorectal cancer (Baldus et al. Clin Cancer Res 2010; 16: 790-9).

Consider a patient who has a primary tumor (carrying a KRAS mutation but no PIK3CA mutation) and an occult metastatic lesion (carrying a PIK3CA mutation but no KRAS mutation). If the analysis focuses only on detecting the KRAS mutation found in the primary tumor, the occult metastatic lesion cannot be detected. However, if both mutations are included in the analysis, both the primary tumor and the occult metastatic lesion can be detected. Thus, a test covering both mutations will have a higher sensitivity for detecting residual tumor tissue. Such a simple example becomes more complicated when one is screening for cancer and has little or no information about the types of mutations that may occur.

There is therefore a need to provide new techniques in order to perform extensive screening, detection or assessment against cancer.

Disclosure of Invention

Embodiments may observe the frequency of somatic mutations in a biological sample (e.g., plasma or serum) of a subject undergoing screening or monitoring for cancer, when compared to the frequency of somatic mutations in the constitutive DNA of the same subject. Random sequencing can be used to determine these frequencies. Parameters may be derived from these frequencies and used to determine a classification of a cancer grade. False positives can be filtered out by requiring that any variant locus have at least a specified number of variant sequence reads (tags), thereby providing more accurate parameters. The relative frequency of different variant loci can be analyzed to determine the level of heterogeneity of the tumor in the patient.

In one embodiment, the parameter may be compared to the same parameter derived from a group of subjects who do not have cancer or who have a low risk of cancer. A significant difference between the parameter obtained from the test subject and that obtained from this group may indicate that the test subject has cancer or a pre-cancerous condition, or is at increased risk of developing cancer in the future. Thus, in one embodiment, plasma DNA analysis can be performed without prior genomic information about the tumor. This embodiment is therefore particularly suitable for cancer screening.

Other embodiments may also be used to monitor cancer patients after treatment, to see whether residual tumor is present or whether the tumor has recurred. For example, a patient with residual or recurrent tumor will have a higher frequency of somatic mutations than a patient in whom no residual tumor or recurrence is observed. The monitoring may include obtaining samples from the cancer patient at various time points after treatment in order to determine the temporal changes in tumor-associated genetic aberrations in bodily fluids or other samples containing cell-free nucleic acids (e.g., plasma or serum).

According to one embodiment, a method detects cancer or a pre-cancerous change in a subject. A constitutive genome of the subject is obtained. One or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, wherein the biological sample comprises cell-free DNA. Genomic positions of the sequence tags are determined. The sequence tags are compared to the constitutive genome to determine a first number of first loci. At each first locus, the number of sequence tags having a sequence variant relative to the constitutive genome is above a cut-off value, wherein the cut-off value is greater than one. A parameter is determined based on the counts of sequence tags having a sequence variant at the first loci. The parameter is compared to a threshold to determine a classification of a cancer grade in the subject.

According to another embodiment, a method analyzes the heterogeneity of one or more tumors in a subject. A constitutive genome of the subject is obtained. One or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, wherein the biological sample comprises cell-free DNA. Genomic positions of the sequence tags are determined. The sequence tags are compared to the constitutive genome to determine a first number of first loci. At each first locus, the number of sequence tags having a sequence variant relative to the constitutive genome is above a cut-off value, wherein the cut-off value is greater than one. A measure of heterogeneity of the one or more tumors is calculated based on the respective numbers of variant sequence tags at the first loci.

According to another embodiment, a method determines the percent concentration of tumor DNA in a biological sample comprising cell-free DNA. One or more sequence tags are received for each of a plurality of DNA fragments in the biological sample. Genomic positions of the sequence tags are determined. For each of a plurality of genomic regions, a respective amount of DNA fragments within the genomic region is determined from the sequence tags having genomic positions within that region. The respective amounts are normalized to obtain respective densities. Each respective density is compared to a reference density to identify whether the genomic region exhibits a 1-copy loss or a 1-copy gain. A first density is calculated from the respective densities identified as exhibiting a 1-copy loss, or from the respective densities identified as exhibiting a 1-copy gain. The percent concentration is calculated by comparing the first density to another density to obtain a difference, wherein the difference is normalized to the reference density.
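The last step of this method can be illustrated with a short sketch. It is one possible reading of the description, not the claimed formula: it assumes that the affected regions have two copies in normal cells, that tumor cells carry exactly one extra or one fewer copy there, and that the "other density" is the density of the oppositely altered regions. Under those assumptions a tumor fraction f shifts the density to (1 + f/2) or (1 - f/2) times the reference, so the gain-minus-loss difference divided by the reference density gives f. The function name and the sample numbers are hypothetical.

```python
from statistics import mean

def tumor_fraction(gain_densities, loss_densities, reference_density):
    """Estimate the tumor DNA fraction from region densities.

    Assumption (not stated in the text): regions are two-copy in normal cells
    and differ by exactly one copy in tumor cells, so
        gain = reference * (1 + f/2),  loss = reference * (1 - f/2)
    and therefore f = (gain - loss) / reference.
    """
    first_density = mean(gain_densities)   # regions called as 1-copy gain
    other_density = mean(loss_densities)   # regions called as 1-copy loss
    return (first_density - other_density) / reference_density

# Hypothetical densities (e.g. reads per region after normalization).
f = tumor_fraction([110, 112, 108], [90, 89, 91], 100)
print(f"estimated tumor DNA fraction: {f:.1%}")
```

If only loss regions (or only gain regions) are available, the same assumptions would give f as twice the difference between the reference density and the first density, divided by the reference density.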

Other embodiments relate to systems and computer-readable media related to the methods described herein.

A better understanding of the nature and advantages of the present invention may be gained with reference to the following detailed description and the accompanying drawings.

Drawings

Fig. 1 is a flow diagram of a method 100 of detecting cancer or a pre-cancerous change in a subject according to an embodiment of the present invention.

Fig. 2 shows a flow chart of a method for directly comparing a Sample Genome (SG) with a Constitutive Genome (CG) according to an embodiment of the present invention.

Fig. 3 shows a flow diagram of a method 300 for comparing a Sample Genome (SG) to a Constitutive Genome (CG) using a Reference Genome (RG) according to an embodiment of the present invention.

Fig. 4 is a table 400 showing the number of correctly identified cancer-associated single nucleotide mutations using different numbers of occurrences as criteria for classifying mutations present in a sample according to an embodiment of the present invention, assuming a percentage concentration of tumor-derived DNA in the sample of 10%.

Fig. 5 is a table showing the expected number of false positive loci and the expected number of mutations identified when the percentage concentration of tumor-derived DNA in a sample is assumed to be 5%.

Fig. 6A is a graph 600 showing the detection rates of cancer-associated mutations in plasma with plasma percent concentrations of tumor-derived DNA of 10% and 20%, using four and six occurrences (r) as criteria for calling potential cancer-associated mutations. Fig. 6B is a graph 650 showing false classification versus sequencing depth when 4, 5, 6, and 7 occurrences (r) of the mutant nucleotide, respectively, are used as criteria for identifying mutation sites.

Fig. 7A is a graph 700 showing the variation in the number of true cancer-associated mutation sites and false positive sites relative to different sequencing depths when the percentage concentration of tumor-derived DNA in the sample is assumed to be 5%. Fig. 7B is a graph 750 showing the predicted number of false positive sites, including analysis of Whole Genome (WG) and all exons.

Fig. 8 is a table 800 showing the results for 4 HCC patients before and after treatment, including the percentage concentration of tumor-derived DNA in plasma, according to an embodiment of the present invention.

Figure 9 is a table 900 showing the detection of HCC-associated SNVs in 16 healthy control subjects according to an embodiment of the invention.

Figure 10A shows a profile of sequence read densities of tumor samples from HCC patients according to an embodiment of the invention. Figure 10B shows a profile 1050 of z-scores of all genomic segments in plasma of an HCC patient according to an embodiment of the invention.

Figure 11 shows a profile 1100 of the z-scores of plasma of an HCC patient according to an embodiment of the invention.

Fig. 12 is a flow diagram of a method 1200 of determining the percent concentration of tumor DNA in a biological sample comprising cell-free DNA, according to an embodiment of the invention.

Fig. 13A shows a table 1300 of analyzing mutations in the plasma of patients with ovarian and breast cancer at the time of diagnosis according to an embodiment of the present invention.

Fig. 13B shows a table 1350 of analyzing mutations in plasma after tumor resection for patients with bilateral ovarian and breast cancer, according to embodiments of the present invention.

Figure 14A is a table 1400 showing the detection of single nucleotide variations in plasma DNA of HCC 1. Figure 14B is a table 1450 showing the detection of single nucleotide variations in plasma DNA of HCC 2.

Figure 15A is a table 1500 showing the detection of single nucleotide variations in plasma DNA of HCC 3. Figure 15B is a table 1550 showing the detection of single nucleotide variations in plasma DNA of HCC 4.

Figure 16 is a table 1600 showing the detection of single nucleotide variations in plasma DNA of patients with ovarian (and breast) cancer.

Fig. 17 is a table 1700 showing predicted sensitivities to different requirements for mutation frequency of occurrence and different sequencing depths.

Fig. 18 is a table 1800 showing the predicted number of false positive loci for different cut-off values and different sequencing depths.

Figure 19 shows a tree diagram illustrating the number of mutations detected at different tumor sites.

Figure 20 is a table 2000 showing the number of fragments carrying tumor-derived mutations in pre-and post-treatment plasma samples.

Fig. 21 is a graph 2100 showing the frequency distribution of mutations detected at a single tumor site in plasma and mutations detected at all four tumor sites.

Fig. 22 is a graph 2200 showing the predicted frequency distribution of mutations from heterogeneous tumors in plasma.

Fig. 23 demonstrates the specificity of embodiments of the invention among 16 enrolled healthy control subjects.

Fig. 24 is a flow diagram of a method 2400 of analyzing heterogeneity of one or more tumors in a subject, according to an embodiment of the invention.

FIG. 25 shows a block diagram of an exemplary computer system 2500 that may be used with systems and methods according to embodiments of the invention.

Detailed Description

Definitions

As used herein, the term "locus" or its plural form "loci" is a location or address of any length of nucleotides (or base pairs) that may vary across genomes. A "bin" is a region of predetermined length in a genome. A plurality of bins may have the same first length (resolution), while a different plurality may have the same second length. In one embodiment, the bins do not overlap each other.

As used herein, the term "random sequencing" refers to sequencing in which the nucleic acid fragments sequenced have not been specifically identified or predetermined before the sequencing procedure. Sequence-specific primers to target specific loci are not required. The term "universal sequencing" refers to sequencing that can start on any fragment. In one embodiment, adapters are added to the ends of the fragments and the sequencing primers attach to the adapters. Thus, any fragment can be sequenced with the same primer, and the sequencing can therefore be random.

As used herein, the term "sequence tag" (also referred to as a sequence read) refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence tag can be a short string of nucleotides (e.g., about 30 nucleotides) sequenced from a nucleic acid fragment, a short string of nucleotides at both ends of a nucleic acid fragment, or the entire nucleic acid fragment present in the sequenced biological sample. A nucleic acid fragment is any portion of a larger nucleic acid molecule. A fragment (e.g., a gene) may exist separately from (i.e., not linked to) the other portions of the larger nucleic acid molecule.

The term "constitutive genome" (also referred to as CG) consists of the consensus nucleotides at loci within the genome and can therefore be viewed as a consensus sequence. The CG may cover the entire genome of the subject (e.g., the human genome), or only a portion of the genome. The constitutive genome (CG) may be obtained from cellular DNA as well as from cell-free DNA (e.g., as may be found in plasma). Ideally, the consensus nucleotides should indicate whether a locus is homozygous for one allele or heterozygous for two alleles. A heterozygous locus typically contains two alleles that are members of a genetic polymorphism. As one example, the criterion for determining that a locus is heterozygous may be a threshold of two alleles each appearing in at least a predetermined percentage (e.g., 30% or 40%) of the reads aligned to the locus. If one nucleotide appears at a sufficiently high percentage (e.g., 70% or greater), the locus can be determined to be homozygous in the CG. Although the genome of one healthy cell may differ from the genome of another healthy cell because of random mutations that occur spontaneously during cell division, the CG should not vary when such a consensus is used. Some cells, such as B and T lymphocytes, have genomes with genomic rearrangements involving, for example, the antibody and T cell receptor genes. Cells with such large-scale differences will still be a relatively small fraction of the total population of nucleated cells in the blood, and thus such rearrangements will not affect the determination of the constitutive genome from blood cells with sufficient sampling (e.g., sequencing depth). Other cell types, including buccal cells, skin cells, hair follicles, or biopsies of various normal body tissues, may also serve as sources of the CG.
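The consensus criterion described above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the function name, the input format (the bases observed at one locus across the aligned reads) and the handling of ambiguous loci are assumptions, and the 30% and 70% defaults are simply the example percentages given in the text.

```python
from collections import Counter

def call_constitutive_locus(bases, het_min=0.30, hom_min=0.70):
    """Return the consensus allele(s) at one locus, or None if no call is made.

    bases: the nucleotides observed at this locus in the aligned reads.
    het_min / hom_min: example thresholds from the text (30% and 70%).
    """
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return None  # no coverage, no call
    ranked = counts.most_common(2)
    top_allele, top_count = ranked[0]
    if top_count / depth >= hom_min:
        return (top_allele,)  # homozygous in the CG
    if len(ranked) == 2 and all(n / depth >= het_min for _, n in ranked):
        return tuple(sorted(allele for allele, _ in ranked))  # heterozygous
    return None  # ambiguous; left uncalled in this sketch

print(call_constitutive_locus("AAAAAAAAAT"))  # one stray T among ten reads
print(call_constitutive_locus("AAAAACCCCC"))  # balanced A/C reads
```

Returning `None` for loci that satisfy neither criterion is a design choice of this sketch; a real pipeline might instead flag such loci for deeper sequencing.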

The term "constitutive DNA" refers to DNA from any source that reflects the genetic makeup with which a subject is born. Examples of "constitutive samples" from which a subject's constitutive DNA can be obtained include healthy blood cell DNA, buccal cell DNA, and hair root DNA. The DNA from these healthy cells defines the subject's CG. Cells can be identified as healthy in a number of ways, for example when the person is known not to have cancer, or when the sample is obtained from tissue that is unlikely to contain cancerous or pre-cancerous cells (e.g., hair root DNA when liver cancer is suspected). As another example, a plasma sample can be obtained when a patient is cancer-free, and the constitutive DNA thus determined can be compared against the results from subsequent plasma samples (e.g., a year or more later). In another embodiment, a single biological sample containing < 50% tumor DNA can be used to deduce both the constitutive genome and the tumor-associated genetic alterations. In such a sample, the concentrations of tumor-associated single nucleotide mutations will be lower than the concentration of each allele of a heterozygous SNP in the CG. Such a sample may be the same as the biological sample, described below, that is used to determine the sample genome.

As used herein, the term "biological sample" refers to any sample that is obtained from a subject (e.g., a human, a person having cancer, a person suspected of having cancer, or another organism) and that contains one or more cell-free nucleic acid molecules of interest. The biological sample may comprise cell-free DNA, some of which may be derived from healthy cells and some of which may be derived from tumor cells. For example, tumor DNA can be found in blood or in other body fluids such as urine, pleural fluid, ascites fluid, peritoneal fluid, saliva, tears, or cerebrospinal fluid. An example of a non-fluid sample is a stool sample, which may be mixed with diarrheal fluid. For some such samples, the biological sample can be obtained non-invasively. In some embodiments, the biological sample may be used as a constitutive sample.

The term "sample genome" (also referred to as SG) is a collection of sequence reads that have been aligned to locations of a genome (e.g., the human genome). The sample genome (SG) is not a consensus sequence, but includes nucleotides that may appear in only a sufficient number of reads (e.g., at least 2 or 3, or a higher cut-off value). If an allele appears a sufficient number of times and is not part of the CG (i.e., not part of the consensus sequence), that allele may indicate a "single nucleotide mutation" (also referred to as SNM). Other types of mutations can also be detected using the invention, for example, mutations involving two or more nucleotides (e.g., affecting the number of tandem repeat units in a microsatellite or simple tandem repeat polymorphism), chromosomal translocations (which can be intra-chromosomal or inter-chromosomal), and sequence inversions.

The term "reference genome" (also referred to as RG) refers to a haploid or diploid genome to which sequence reads from the biological sample and the constitutive sample can be aligned and compared. For a haploid genome, there is only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; such a locus has two alleles, and either allele can allow a match for alignment to the locus.

The term "cancer grade" may refer to whether cancer exists, the stage of the cancer, the size of a tumor, and/or another measure of the severity of the cancer. The cancer grade may be a number or other characters. The grade may be zero. The cancer grade also includes premalignant or precancerous conditions (states) associated with a mutation or a number of mutations. The cancer grade can be used in various ways. For example, screening can check whether someone not previously known to have cancer has cancer. Assessment can investigate someone who has already been diagnosed with cancer. Detection can mean screening, or can mean checking whether someone with features suggestive of cancer (e.g., symptoms or other positive tests) has cancer.

Description of the preferred embodiment

Embodiments are provided for detecting cancer by analyzing a biological sample (e.g., a plasma/serum sample) that is not taken directly from a tumor and that includes cell-free nucleic acids. Cell-free nucleic acids can originate from various types of tissue throughout the body. In this way, a broad analysis for detecting various cancers can be performed.

Genetic aberrations, including single nucleotide mutations, deletions, amplifications and rearrangements, accumulate in tumor cells during the development of cancer. In embodiments, massively parallel sequencing can be used to detect and quantify single nucleotide mutations (SNMs), also known as single nucleotide variants (SNVs), in body fluids such as plasma, serum, saliva, ascites fluid, pleural fluid, and cerebrospinal fluid, in order to detect and monitor cancer. Quantifying the number of SNMs (or other types of mutations) may provide a mechanism for identifying early stages of cancer as part of a screening test. In various embodiments, care is taken to distinguish cancer-associated mutations from sequencing errors and from spontaneous mutations that occur in healthy cells (e.g., by requiring that an SNM be identified multiple times at a particular locus, e.g., at least 3, 4, or 5 times).

Some embodiments also provide non-invasive methods of analyzing tumor heterogeneity, which may involve cells within the same tumor (i.e., intratumoral heterogeneity) or cells of different tumors (from the same site or from different sites) in the body. For example, the clonal structure of such tumor heterogeneity can be analyzed non-invasively, including estimating the relative tumor cell mass containing each mutation. Mutations present at higher relative concentrations are present in a larger number of malignant cells in the body (Cell 2012; 150: 264-). Because such a mutation has a higher relative abundance, it is expected to provide higher diagnostic sensitivity for detecting cancer DNA than a mutation with a lower relative abundance. Serial monitoring of changes in the relative abundance of mutations will allow non-invasive monitoring of changes in the clonal architecture of a tumor, whether they occur spontaneously as the disease progresses or in response to treatment. Such information will be useful for assessing prognosis and for the early detection of tumor resistance to treatment.

I. Introduction

Mutations may occur during cell division due to errors in DNA replication and/or DNA repair. One type of such mutation involves the alteration of single nucleotides, and can affect multiple sequences from different parts of the genome. Cancer is generally believed to be due to the clonal expansion of a single cancer cell that has acquired a growth advantage. This clonal expansion results in the accumulation of mutations (e.g., single nucleotide mutations) in all of the cancer cells derived from the ancestral cancer cell. These progeny tumor cells will share a set of mutations (e.g., single nucleotide mutations). As described herein, cancer-associated single nucleotide mutations are detectable in the plasma/serum of cancer patients.

Some embodiments can effectively screen for all mutations in a biological sample (e.g., plasma or serum). Because the number of mutations is not fixed (hundreds, thousands, or millions of cancer-associated mutations from different tumor cell subpopulations can be detected), embodiments can provide better sensitivity than techniques that detect specific mutations. The number of mutations can be used to detect cancer.

To provide such screening for many or all mutations, embodiments can perform a search (e.g., a random search) for genetic variations in a biological sample (e.g., a bodily fluid, including plasma and serum) that may contain tumor-derived DNA. Using such a sample (e.g., plasma) eliminates the need to perform an invasive biopsy of a tumor or cancer. Furthermore, because the screening can cover all or large regions of the genome, it is not limited to any enumerated, known set of mutations, but can make use of any mutation that is present. In addition, higher sensitivity can be obtained because the number of mutations is summed across the whole genome or across large regions of it.

However, there are polymorphic sites in the human genome, including single nucleotide polymorphisms (SNPs), which should not be counted as mutations. Embodiments can determine whether a detected genetic variation is likely to be a cancer-associated mutation or a polymorphism in the genome. For example, as part of distinguishing between cancer-associated mutations and polymorphisms in the genome, embodiments may determine a constitutive genome, which may include the polymorphisms. The polymorphisms of the constitutive genome (CG) can be limited to those that appear at a sufficiently high percentage (e.g., 30-40%) in the sequencing data.

The sequences obtained from the biological sample can then be aligned to the constitutive genome, and variations that are single nucleotide mutations (SNMs) or other types of mutations can be identified. These SNMs are variations that are not among the known polymorphisms and can therefore be labeled as cancer-associated and not part of the constitutive genome. A healthy individual may have a certain number of SNMs due to random mutations in healthy cells, e.g., arising during cell division, but an individual with cancer will have more.

For example, for an individual with cancer, the number of SNMs detectable in a bodily fluid will be higher than the number of polymorphisms present in the constitutive genome of the same person. A comparison can be made between the amount of variation detected in a bodily fluid sample containing tumor-derived DNA and that detected in a DNA sample containing mostly constitutive DNA. In one embodiment, the term "mostly" means more than 90%. In another preferred embodiment, the term "mostly" means more than 95%, 97%, 98% or 99%. When the amount of variation in the bodily fluid exceeds that of the sample with mostly constitutive DNA, the likelihood that the bodily fluid contains tumor-derived DNA increases.

One method that can be used to randomly search for variations in a DNA sample is random or shotgun sequencing (e.g., using massively parallel sequencing). Any massively parallel sequencing platform may be used, including sequencing-by-ligation platforms (e.g., the Life Technologies SOLiD platform), Ion Torrent/Ion Proton, semiconductor sequencing, Roche 454, and single molecule sequencing platforms (e.g., Helicos, Pacific Biosciences, and nanopore). However, it is known that sequencing errors can occur and may be misinterpreted as variations in the constitutive DNA or as mutations derived from tumor DNA. Thus, to improve the specificity of the proposed method, the probability of sequencing errors or other analytical errors can be taken into account, e.g., by using an appropriate sequencing depth and by requiring that an allele be detected at least a specified number of times (e.g., 2 or 3) at a locus before it is counted as an SNM.
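Why requiring several occurrences suppresses sequencing errors can be seen with a simplified calculation. The sketch below assumes that errors are independent across reads, occur at a uniform per-base rate, and substitute each of the three wrong bases with equal probability; real error profiles are not this simple. The error rate and depth shown are hypothetical values chosen only for illustration.

```python
from math import comb

def prob_at_least(r, depth, p):
    """P(X >= r) for X ~ Binomial(depth, p), summed over the upper tail."""
    return sum(comb(depth, k) * p**k * (1 - p) ** (depth - k)
               for k in range(r, depth + 1))

# Hypothetical values: 0.1% per-base error rate, 50-fold depth at the locus.
error_rate = 0.001
p_same_wrong_base = error_rate / 3  # assumes the three wrong bases are equally likely

# Chance that errors alone show the same wrong base at least r times at one locus.
for r in (1, 2, 3):
    print(r, prob_at_least(r, 50, p_same_wrong_base))
```

Under these assumptions each additional required occurrence lowers the per-locus false-positive probability by roughly two orders of magnitude, which matters because the test is applied across a very large number of loci.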

As described herein, embodiments can provide evidence of the presence of tumor-derived DNA in a biological sample (e.g., a bodily fluid) when the amount of randomly detected genetic variation present in the sample exceeds the amount of variation expected from the constitutive DNA and from variations detected inadvertently because of analytical errors (e.g., sequencing errors). This information can be used for the screening, diagnosis, prognostication, and monitoring of cancer. In the following sections, we describe analytical steps that can be used to detect single nucleotide mutations in plasma/serum or other samples (e.g., body fluids). The body fluids may include plasma, serum, cerebrospinal fluid, pleural fluid, ascites fluid, nipple discharge, saliva, bronchoalveolar lavage fluid, sputum, tears, sweat, and urine. In addition to body fluids, the technique can also be applied to stool samples, as the latter have been reported to contain tumor DNA from colorectal cancer (Berger BM, Ahlquist DA. Pathology 2012; 44: 80-88).

General screening method

Fig. 1 is a flow diagram of a method 100 of detecting cancer or a pre-cancerous change in a subject according to an embodiment of the present invention. Embodiments may analyze cell-free DNA in a biological sample of a subject to detect variations in cell-free DNA likely to be produced by a tumor. Analysis can use the subject's constitutive genome to account for polymorphisms that are part of healthy cells and to account for sequencing errors. Method 100 and any methods described herein may be performed in whole or in part with a computer system that includes one or more processors.

In step 110, a constitutive genome of the subject is obtained. The constitutive genome (CG) can be determined from the constitutive DNA of the tested subject. In various embodiments, the CG can be read from memory or actively determined, for example, by analyzing sequence reads of constitutive DNA, which may be in cells from the sample that includes the cell-free DNA. For example, when a non-hematological malignancy is suspected, blood cells can be analyzed to determine the subject's constitutive DNA.

In various embodiments, the analysis of the constitutive DNA can be performed using massively parallel sequencing, array-based hybridization, probe-based in-solution hybridization, ligation-based assays, primer extension reaction assays, and mass spectrometry. In one embodiment, the CG may be determined at one point in the subject's life, e.g., at birth or even in the prenatal period (which may be done using fetal cells or via cell-free DNA fragments, see U.S. publication 2011/0105353), and then referred to when bodily fluids or other samples are obtained at other times in the subject's life. Thus, the CG can simply be read from computer memory. The constitutive genome can be read as a list of loci at which the constitutive genome differs from a reference genome.

In step 120, one or more sequence tags are obtained for each of a plurality of DNA fragments in a biological sample of the subject, wherein the biological sample comprises cell-free DNA. In one embodiment, the one or more sequence tags are generated by random sequencing of DNA fragments in the biological sample. When paired-end sequencing is performed, more than one sequence tag may be obtained per fragment, with one tag corresponding to each end of the DNA fragment.

Cell-free DNA in a sample (e.g., plasma, serum, or other bodily fluid) can be analyzed to search for genetic variations. Cell-free DNA can be analyzed using the same analysis platform that has been used to analyze the constitutive DNA. Alternatively, a different analysis platform may be used. For example, a cell-free DNA sample may be sequenced using massively parallel sequencing, or portions of the genome may be captured or enriched prior to massively parallel sequencing. If enrichment is used, then solution-phase or solid-phase capture of selected portions of the genome can be used, for example. The captured DNA can then be subjected to massively parallel sequencing.

In step 130, the genomic positions of the sequence tags are determined. In one embodiment, the sequence tags are aligned to a reference genome obtained from one or more other subjects. In another embodiment, the sequence tags are aligned to the constitutive genome of the tested subject. The alignment can be performed using techniques known to those skilled in the art, for example, using the Basic Local Alignment Search Tool (BLAST).

In step 140, a first number of loci is determined at which at least N of the sequence tags have a variant sequence relative to the constitutive genome (CG). N is equal to or greater than two. As discussed in more detail below, sequencing errors and randomly occurring somatic mutations in cells (e.g., due to cell division) can be removed by setting N to 2, 3, 4, 5, or higher. Loci that meet one or more specified criteria can be identified as mutations (variants) or mutated loci (variant loci), whereas loci that show a variant but do not meet the one or more criteria (e.g., loci with only one variant sequence tag) are referred to as potential or putative mutations. A sequence variant may involve only one nucleotide or multiple nucleotides.

N can be determined as a percentage of the total sequence tags at the locus, as opposed to an absolute value. For example, a variant locus can be identified when the fractional concentration of tumor DNA inferred from the variant reads is determined to be equal to or greater than 10% (or some other percentage). In other words, when a locus is covered by 200 sequence reads, a criterion of at least 10 sequence reads displaying the variant allele may be required to define the variant as a mutation. 10 sequence reads of the variant allele and 190 reads of the wild-type allele yield a fractional tumor DNA concentration of 10% (2 × 10/(10 + 190)).
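The percentage criterion above can be sketched in a few lines of Python. This is only an illustration under the stated assumption that the mutation is heterozygous in the tumor, so that the inferred tumor fraction is twice the variant allele fraction; the function names and the 10% default are chosen here for the example and are not part of the described method.

```python
def tumor_fraction_from_reads(variant_reads: int, wild_type_reads: int) -> float:
    """Fractional tumor DNA concentration inferred at one locus.

    Assumes a heterozygous tumor mutation, i.e. the variant sits on one of
    two homologous chromosomes, hence the factor of 2.
    """
    total = variant_reads + wild_type_reads
    if total == 0:
        raise ValueError("locus has no coverage")
    return 2 * variant_reads / total


def is_variant_locus(variant_reads: int, wild_type_reads: int,
                     min_fraction: float = 0.10) -> bool:
    """Apply a percentage-based cutoff instead of an absolute read count N."""
    return tumor_fraction_from_reads(variant_reads, wild_type_reads) >= min_fraction


# Worked example from the text: 10 variant reads and 190 wild-type reads.
print(tumor_fraction_from_reads(10, 190))  # 2 * 10 / (10 + 190)
print(is_variant_locus(10, 190))
```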

In one embodiment, the sequence tags (collectively referred to as the sample genome) can be directly compared to CG to determine variants. In another embodiment, the sample genome (SG) is compared to CG via a reference genome (RG) to determine variants. For example, both CG and SG can be compared to RG to determine the respective number (e.g., set) of loci exhibiting variants, and then a difference can be taken to obtain the first number of loci. The first number may simply be obtained as a number or may correspond to a set of specific loci, which may then be further analyzed to determine a parameter from the sequence tags at the first loci.

In one embodiment, the sequencing results of the constitutive DNA and the plasma DNA are compared to determine whether a single nucleotide mutation is present in the plasma DNA. Regions where the constitutive DNA is homozygous can be analyzed. For purposes of illustration, assume that the genotype of a particular locus is homozygous in the constitutive DNA and is AA. Then, in plasma, the presence of an allele other than A would indicate that a single nucleotide mutation (SNM) may be present at that locus. Loci indicating the likely presence of an SNM may form the first number of loci in step 140.

In one embodiment, it may be useful to target portions of the genome that are known to be particularly susceptible to mutation in a particular cancer type or in a particular subset of the population. In connection with the latter aspect, embodiments may look for mutation types that are particularly prevalent in a particular population group, such as mutations that are particularly common in subjects who are carriers of hepatitis B virus (for liver cancer) or human papillomavirus (for cervical cancer), who have a genetic predisposition to somatic mutations, or who have germline mutations in DNA mismatch repair genes. The techniques will also be applicable to screening for mutations in ovarian and breast cancer in subjects with BRCA1 or BRCA2 mutations. The techniques will similarly be applicable to screening for mutations in colorectal cancer in subjects with APC mutations.

In step 150, a parameter is determined based on a count of the sequence tags having a variant sequence at the first loci. In one example, the parameter is the first number of loci at which at least N DNA fragments have a variant sequence relative to the constitutive genome. Thus, the counting can be used simply to ensure that a locus has at least N copies of a particular variant identified before the locus is included in the first number. In another embodiment, the parameter may be or include the total number of sequence tags having a sequence variant at the first loci relative to the constitutive genome.
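Steps 140 and 150 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that alignment has already reduced each sequence tag to a (locus, base) observation and that the constitutive genome is available as a mapping from locus to its set of alleles; these data structures and the function name are hypothetical and not specified by the method.

```python
from collections import Counter


def count_variant_loci(tag_observations, constitutive_genome, n_min=2):
    """Return (number of variant loci, total variant tags at those loci).

    tag_observations: iterable of (locus, base), one per aligned sequence tag.
    constitutive_genome: dict mapping locus -> set of constitutive alleles.
    n_min: minimum number of tags that must show the same variant (N >= 2).
    """
    variant_counts = Counter()
    for locus, base in tag_observations:
        cg_alleles = constitutive_genome.get(locus)
        if cg_alleles is None:
            continue  # locus not genotyped in CG; ignored in this sketch
        if base not in cg_alleles:
            variant_counts[(locus, base)] += 1

    # Keep only variants seen in at least n_min tags (drops singletons).
    passing = {key: c for key, c in variant_counts.items() if c >= n_min}
    loci = {locus for locus, _base in passing}
    return len(loci), sum(passing.values())


cg = {"chr1:100": {"A"}, "chr1:200": {"A", "T"}, "chr2:50": {"G"}}
tags = [
    ("chr1:100", "A"), ("chr1:100", "C"), ("chr1:100", "C"),
    ("chr1:200", "T"), ("chr1:200", "G"),
    ("chr2:50", "T"), ("chr2:50", "T"), ("chr2:50", "T"), ("chr2:50", "G"),
]
print(count_variant_loci(tags, cg, n_min=2))
```

Either element of the returned pair can serve as the parameter: the first corresponds to the number of loci, the second to the total number of variant sequence tags at those loci.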

In step 160, the parameter of the subject is compared to a threshold (e.g., derived from one or more other subjects) to determine a classification of a cancer grade in the subject. Examples of cancer grade include whether the subject has cancer or a pre-cancerous condition, or an increased likelihood of developing cancer. In one embodiment, the threshold value may be determined from a previously obtained sample from the subject.

In another embodiment, one or more other subjects may be determined to be not suffering from cancer or to have a low risk of cancer. Thus, the threshold may be a normal value, a normal range, or indicate a statistically significant deviation from a normal value or range. For example, the number of detectable mutations in the plasma of a subject that does not have cancer or has a low risk of cancer relative to the CG of a particular subject can be used as a normal range to determine whether the number of mutations detected in the tested subject is normal. In another embodiment, other subjects may be known to have cancer, and thus the number of similar mutations may be indicative of cancer.

In one embodiment, the other subjects may be selected to have a clinical profile matching that of the test subject, e.g., sex, age, diet, smoking habit, drug history, previous disease, family history, genotypes of selected genomic loci, status for viral infections (e.g., hepatitis B or C virus, human papillomavirus, human immunodeficiency virus, or Epstein-Barr virus infection), or infection by other infectious agents such as bacteria (e.g., Helicobacter pylori) and parasites (e.g., Clonorchis sinensis). For example, subjects who are carriers of hepatitis B or C virus have an increased risk of developing hepatocellular carcinoma. Thus, test subjects with a similar number or pattern of mutations as carriers of hepatitis B or C may be considered to have an increased risk of developing hepatocellular carcinoma. On the other hand, a hepatitis B or C patient exhibiting more mutations than another hepatitis patient may properly be identified as having a higher cancer grade classification, provided that a suitable baseline (i.e., relative to another hepatitis patient) is used. Similarly, subjects who are carriers of human papillomavirus infection have an increased risk of cervical cancer and head and neck cancer. Epstein-Barr virus infection has been associated with nasopharyngeal carcinoma, gastric carcinoma, Hodgkin's lymphoma, and non-Hodgkin's lymphoma. Helicobacter pylori infection has been associated with gastric cancer. Clonorchis sinensis infection has been associated with cholangiocarcinoma.

Monitoring changes in the number of mutations at different time points can be used to monitor cancer progression and response to treatment. The monitoring may also be used to record the progression of a pre-cancerous condition or the change in the risk that a subject will develop cancer.

The amount of sequence tags displaying variation can also be monitored. For example, the percent concentration of variant reads at the locus may be used. In one embodiment, an increase in the percent concentration of tumor-associated genetic aberrations in the sample during continuous monitoring may indicate disease progression or impending recurrence. Similarly, a decrease in the percent concentration of tumor-associated genetic aberration in a sample during continuous monitoring may indicate a therapeutic response and/or remission and/or good prognosis.

Determination of genomes

The various genomes discussed above are described in more detail below. For example, reference, constitutive, and sample genomes are discussed.

A. Reference genome

A reference genome (RG) refers to a haploid or diploid genome of a subject, or a consensus genome of a population. The reference genome is known and can therefore be used as a basis for comparing sequencing reads from new patients. Sequence reads from a patient's sample can be aligned with RG and compared to identify variations in the reads. For a haploid genome, there is only one nucleotide at each locus, and thus each locus can be considered hemizygous. For a diploid genome, heterozygous loci can be identified; such a locus has two alleles, either of which can be accepted as a match when aligning to the locus.

The reference genome can be identical among a population of subjects. This same reference genome can be used in healthy subjects to determine an appropriate threshold for classifying patients (e.g., with or without cancer). However, different reference genomes may be used for different populations, e.g. for different ethnicities or even for different families.

B. Constitutive genome

A constitutive genome (CG) of a subject (e.g., a human or other diploid organism) refers to a diploid genome of the subject. The CG may specify heterozygous loci at which a first allele is from a first haplotype and a different second allele is from a second haplotype. Note that the structure of the two haplotypes covering two heterozygous loci need not be known, i.e., it need not be known which allele at one heterozygous locus is on the same haplotype as an allele at another heterozygous locus. It is sufficient to know that two alleles are present at each heterozygous locus.

CG may differ from RG due to polymorphisms. For example, a locus on RG can be homozygous for T, while CG is heterozygous for T/A. Thus, the CG will exhibit a variation at this locus. CG may also differ from RG due to inherited mutations (e.g., mutations that run in a family) or de novo mutations (which occur in a fetus but are not present in its parents). Inherited mutations are typically referred to as 'germline mutations'. Some of these mutations are associated with cancer predisposition, such as a BRCA1 mutation that runs in a family. Such mutations are different from 'somatic mutations', which can occur due to cell division during a person's lifetime and can push a cell and its progeny toward becoming cancerous.

The goal of CG determination is to remove germline and de novo mutations from the mutations found in the sample genome (SG), in order to identify somatic mutations. The amount of somatic mutations in SG can then be used to assess the likelihood of cancer in the subject. These somatic mutations can be further filtered to remove sequencing errors and, potentially, to remove somatic mutations that occur rarely (e.g., a variant displayed by only one read), since such somatic mutations are unlikely to be associated with cancer.

In one embodiment, CG may be determined using cells (e.g., leukocyte DNA). However, CG can also be determined from cell-free DNA (e.g., from plasma or serum). For a sample type in which the majority of cells are non-malignant, such as white blood cells from a healthy subject, the majority or consensus genome is the CG. For CG, each genomic locus consists of the DNA sequence possessed by the majority of cells in the sampled tissue. The sequencing depth should be sufficient to elucidate the heterozygous sites within the constitutive genome.

As another example, plasma may be used as the constitutive sample to determine CG. For example, for a situation in which tumor DNA makes up less than 50% of the plasma DNA and the SNM is in a heterozygous state (e.g., the mutation is a newly added allele), the new allele would have a concentration of less than 25%. In contrast, the concentration of each allele of a heterozygous SNP in CG should be about 50%. Thus, it is possible to distinguish between somatic mutations and polymorphisms in CG. In one embodiment, when plasma or another mixture with a significant tumor concentration is used, a suitable cut-off between 30% and 40% can be used to distinguish somatic mutations from polymorphic sites. Measuring the tumor DNA concentration may be useful to ensure that tumor DNA makes up less than 50% of the plasma DNA. Examples of determining tumor DNA concentration are described herein.

C. Sample genome

The sample genome (SG) is not simply a haploid or diploid genome, as is the case for RG and CG. SG is a collection of reads from the sample and may include: reads from constitutive DNA corresponding to CG, reads from tumor DNA, reads from healthy cells that show random mutations relative to CG (e.g., mutations resulting from cell division), and sequencing errors. Various parameters may be used to control exactly which reads are included in the SG. For example, requiring an allele to appear in at least 5 reads reduces the sequencing errors present in the SG, as well as the reads derived from random mutations.

As an example, assume that the subject is healthy, i.e., does not have cancer. For illustration purposes, DNA from 1000 cells (i.e., 1000 genome equivalents of DNA) is in 1 ml of plasma obtained from this subject. Plasma DNA typically consists of DNA fragments of about 150 bp. Since the human genome is 3 × 10⁹ bp, there will be about 2 × 10⁷ DNA fragments per haploid genome. Since the human genome is diploid, there will be about 4 × 10⁷ DNA fragments per ml of plasma.

Since millions to billions of cells release their DNA into plasma per unit time, and fragments from these cells mix together during circulation, the 4 × 10⁷ DNA fragments may come from 4 × 10⁷ different cells. If these cells do not bear a close (as opposed to distant, e.g., the original zygote) clonal relationship to each other (i.e., they do not share a recent ancestral cell), then it is statistically likely that no mutation will be seen more than once among these fragments.

On the other hand, if among the 1000 genome equivalents per ml of plasma DNA there is a certain percentage of cells that share a recent ancestral cell (i.e., they bear a close clonal relationship to each other), then mutations from this clone can be seen to be preferentially represented in the plasma DNA (e.g., exhibiting a clonal mutation profile in plasma). Such clonally related cells may be cancer cells, or cells that are on their way to becoming cancerous but are not yet so (i.e., preneoplastic). Thus, requiring that a mutation be seen more than once can remove this natural variation when identifying "mutations" in a sample, leaving more of the mutations that are associated with cancer cells or preneoplastic cells and thereby allowing detection, particularly early detection, of cancer or a precancerous condition.

In one approximation, it has been stated that on average one mutation will accumulate in the genome after each cell division. Previous studies have shown that most plasma DNA is derived from hematopoietic cells (Lui YY et al. Clin Chem 2002; 48: 421-). It is estimated that hematopoietic stem cells replicate once every 25-50 weeks (Catlin SN et al. Blood 2011; 117: 4460-). Thus, as a simplified approximation, a healthy 40-year-old subject will have accumulated about 40 to 80 mutations per hematopoietic stem cell.

If there are 1000 genome equivalents per ml of plasma of this person, and if each of these cells is derived from a different hematopoietic stem cell, then 40,000 to 80,000 mutations can be expected among the 4 × 10¹⁰ DNA fragments (i.e., 4 × 10⁷ DNA fragments per genome and 1000 genome equivalents per ml of plasma). However, because each mutation will be seen only once, each mutation may still be below the detection limit (e.g., when the cutoff value N is greater than 1), and thus these mutations may be filtered out, thereby allowing the analysis to focus on mutations that are more likely to result from cancerous conditions. The cutoff value can be any value (integer or non-integer) greater than one and can be dynamic for different loci and regions. The sequencing depth and the fractional concentration of tumor DNA can also affect the sensitivity of detecting mutations (e.g., the percentage of mutations that can be detected) from cancer cells or preneoplastic cells.
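The back-of-the-envelope figures in this example can be checked with a short Python calculation. All inputs are the illustrative values quoted in the text (150 bp fragments, a 3 × 10⁹ bp haploid genome, 1000 genome equivalents per ml, one division every 25 to 50 weeks, one mutation per division); the variable and function names are chosen here for the sketch.

```python
GENOME_BP = 3_000_000_000          # haploid human genome size in bp
FRAGMENT_BP = 150                  # typical plasma DNA fragment length in bp
GENOME_EQUIVALENTS_PER_ML = 1000   # illustrative amount of plasma DNA

fragments_per_haploid_genome = GENOME_BP // FRAGMENT_BP          # 2 x 10^7
fragments_per_diploid_genome = 2 * fragments_per_haploid_genome  # 4 x 10^7
fragments_per_ml = fragments_per_diploid_genome * GENOME_EQUIVALENTS_PER_ML


def stem_cell_divisions(age_years, weeks_per_division, weeks_per_year=52):
    """Approximate number of divisions, assuming a constant division interval."""
    return age_years * weeks_per_year / weeks_per_division


# One mutation per division: 41.6 to 83.2 divisions at age 40, which the
# text rounds to about 40 to 80 mutations per hematopoietic stem cell.
fewest = stem_cell_divisions(40, 50)
most = stem_cell_divisions(40, 25)

# Using the rounded figures from the text, with every genome equivalent
# assumed to come from a different stem cell:
mutations_per_ml_low = 40 * GENOME_EQUIVALENTS_PER_ML
mutations_per_ml_high = 80 * GENOME_EQUIVALENTS_PER_ML

print(fragments_per_ml, fewest, most, mutations_per_ml_low, mutations_per_ml_high)
```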

Comparing SG directly with CG

Some embodiments can identify nucleotide positions where CG is homozygous but where a minority species in SG (e.g., tumor DNA) is heterozygous. When a position is sequenced at high depth (e.g., more than 50-fold coverage), it can be detected whether one or two alleles are present at that position in a DNA mixture from healthy and cancer cells. When two alleles are detected, either (1) the CG is heterozygous, or (2) the CG is homozygous but the SG is heterozygous. These two cases can be distinguished by looking at the relative counts of the major and minor alleles. In the former case, both alleles will have similar counts; in the latter case, there will be a large difference between the counts. Comparing the relative allele counts of the reads of the test sample in this way is one embodiment of comparing sequence tags to the constitutive genome. The first loci of method 100 can consist of loci at which the count of an allele is below an upper threshold (corresponding to a threshold for polymorphisms in CG) and above a lower threshold (corresponding to a threshold for errors and for somatic mutations that are not associated with a cancerous condition and occur at a sufficiently low rate). Thus, the constitutive genome and the first loci can be determined simultaneously.
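The two-threshold idea can be sketched as a simple per-locus classifier in Python. The specific threshold values are assumptions made for this example only: an upper cutoff of 35% on the minor allele fraction (within the 30-40% range mentioned earlier for separating polymorphisms from somatic mutations) and a lower cutoff of two minor-allele reads. The function name and the returned labels are likewise hypothetical.

```python
def classify_locus(major_count: int, minor_count: int,
                   min_minor_reads: int = 2,
                   upper_fraction: float = 0.35) -> str:
    """Classify a locus from the read counts of its major and minor alleles."""
    total = major_count + minor_count
    if total == 0:
        return "no_coverage"
    if minor_count == 0:
        return "homozygous"
    minor_fraction = minor_count / total
    if minor_fraction >= upper_fraction:
        # Similar counts for both alleles: CG itself is heterozygous here.
        return "constitutive_heterozygous"
    if minor_count < min_minor_reads:
        # Too few reads: likely a sequencing error or a sporadic mutation.
        return "below_lower_threshold"
    # Between the two thresholds: candidate for the first loci of method 100.
    return "candidate_somatic_mutation"


for counts in [(52, 48), (95, 5), (99, 1), (100, 0)]:
    print(counts, classify_locus(*counts))
```

In practice both thresholds would depend on sequencing depth and on the tumor DNA concentration, so fixed defaults like these are only a starting point.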

In another embodiment, the method of identifying mutations may first determine CG, and then determine loci that have a sufficient number of mutations relative to CG. CG can be determined from a constitutive sample different from the test sample.

Fig. 2 shows a flow diagram of a method 200 for directly comparing a sample genome (SG) with a constitutive genome (CG) according to an embodiment of the invention. At block 210, a constitutive genome of the subject is obtained. The constitutive genome may be obtained, for example, from a sample taken at a previous time or from a constitutive sample obtained and analyzed immediately prior to performing method 200.

At block 220, one or more sequence tags are obtained for each of a plurality of DNA fragments in a biological sample of the subject. Sequencing can be performed using various techniques as mentioned herein. A sequence tag is a measurement of the sequence content of a fragment, but one or more bases of the sequence tag may be erroneous.

At block 230, at least a portion of the sequence tags are aligned to the constitutive genome. The alignment can take into account that CG is heterozygous at many loci. The alignment does not require an exact match, so that variants can be detected.

At block 240, a sequence tag is identified that has a variant sequence at a locus relative to a constitutive genome. It is possible that a sequence tag may have more than one variant. Variants can be tracked for each locus and each sequence tag. A variant may be any allele that is not in the CG. For example, CG is heterozygous for A/T and variants can be G or C.

At block 250, for each locus having a variant, the computer system can count a respective first number of sequence tags that align to the locus and have a variant sequence at the locus. Thus, each locus can have an associated count of the variants seen at that locus. Typically, fewer sequence tags will show a variant at a locus than will match the CG, for example because the tumor DNA concentration is less than 50%. However, some samples may have tumor DNA concentrations greater than 50%.

At block 260, a parameter is determined based on the respective first numbers. In one embodiment, if a respective number is greater than a cutoff value (e.g., greater than two), then the respective number may be added to a sum, which is the parameter or is used to determine the parameter. In another embodiment, the number of loci whose respective number is greater than the cut-off value is used as the parameter.

At block 270, the parameters are compared to a threshold to classify the cancer grade. As described above, the threshold value may be determined by analyzing samples from other subjects. Depending on the health or cancer status of these other subjects, a classification may be determined. For example, if the other subject has stage 4 cancer, then the current subject may be classified as having stage 4 cancer if the current parameter is close (e.g., within a particular range) to the value of the parameter obtained from the other subject. However, if the parameter exceeds a threshold (i.e., is greater or less than, depending on how the parameter is defined), then the classification may be identified as less than stage 4. Similar analysis can be performed when other subjects do not have cancer.

Multiple thresholds may be used to determine the classification, with each threshold determined from a different set of subjects. Each set of subjects may have a common cancer grade. Thus, the current parameter may be compared to the value of each set of subjects, which may provide a match to one of the sets or provide a range. For example, the parameter may be approximately equal to the parameter obtained for a subject that is pre-cancerous or at stage 2. As another example, the current parameters may be in a range that may match several different cancer grades. Thus, a classification may include more than one cancer grade.
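A minimal Python sketch of comparing the parameter against several reference groups is given below. The numeric ranges are placeholders invented for the example and carry no clinical meaning; in an actual embodiment they would be derived from sets of subjects sharing a cancer grade. The function name is also hypothetical.

```python
def matching_cancer_grades(parameter, reference_ranges):
    """Return every cancer grade whose reference range contains the parameter.

    reference_ranges: dict mapping grade label -> (low, high), inclusive.
    The result may hold zero, one, or several grades.
    """
    return [grade for grade, (low, high) in reference_ranges.items()
            if low <= parameter <= high]


# Placeholder ranges (hypothetical values for illustration only).
reference_ranges = {
    "no cancer": (0, 20),
    "pre-cancerous": (15, 60),
    "stage 2": (50, 200),
}

print(matching_cancer_grades(5, reference_ranges))
print(matching_cancer_grades(55, reference_ranges))  # overlapping ranges
```

Because the ranges are allowed to overlap, a single parameter value can match several grades, which corresponds to a classification that includes more than one cancer grade.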

V. Use of reference genome

The genomic sequences of both the constitutive DNA and the DNA from the biological sample can be compared to a human reference genome. When there are more changes in the plasma sample than in the constitutive DNA relative to the reference genome, there is a higher probability of cancer. In one embodiment, loci that are homozygous in the reference genome are studied. The number of heterozygous loci is compared between the constitutive DNA and the DNA from the biological sample. When the number of heterozygous sites detected in the DNA of the biological sample exceeds that in the constitutive DNA, there is a higher probability of cancer.

The analysis can also be limited to loci that are homozygous in CG. SNMs can also be defined for heterozygous loci, but this will generally require the appearance of a third variant. In other words, if the heterozygous locus is A/T, then the new variant will be C or G. Identifying SNMs at homozygous loci is generally easier.

The extent of the increase in the number of heterozygous loci in the biological sample DNA relative to the constitutive DNA, when compared to the rate of change seen in healthy subjects, can be suggestive of cancer or a pre-cancerous state. For example, if the degree of increase in such sites exceeds a certain threshold relative to the degree observed in healthy subjects, the data may be considered to suggest cancer or a pre-cancerous state. In one embodiment, the distribution of mutations in subjects not having cancer is determined, and the threshold can be set at a certain number of standard deviations (e.g., 2 or 3 standard deviations).
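One way to derive such a threshold is sketched below in Python using the standard library. The sketch assumes that the threshold is the control mean plus a chosen number of sample standard deviations, which is one reading of the passage; the control values are made-up numbers and the function name is hypothetical.

```python
import statistics


def threshold_from_controls(control_values, n_sd=3.0):
    """Mean of the healthy-control values plus n_sd sample standard deviations."""
    mean = statistics.mean(control_values)
    sd = statistics.stdev(control_values)  # sample SD; needs >= 2 values
    return mean + n_sd * sd


# Hypothetical increases in heterozygous loci seen in five healthy controls.
controls = [10, 12, 9, 11, 13]
threshold = threshold_from_controls(controls, n_sd=3.0)

test_subject_value = 20  # hypothetical value for the tested subject
print(threshold, test_subject_value > threshold)
```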

One embodiment may require that there be at least a specified number of variants at a locus before the locus is counted. Another embodiment can provide a test even on data where a variation is seen only once. For example, when the total number of variations (errors plus true mutations or polymorphisms) seen in plasma is statistically significantly higher than the total number in constitutive DNA, there is evidence of cancer.

Fig. 3 shows a flow diagram of a method 300 for comparing a Sample Genome (SG) to a Constitutive Genome (CG) using a Reference Genome (RG) according to an embodiment of the present invention. The method 300 assumes that RG has been obtained and that a sequence tag has been obtained for the biological sample.

At block 310, at least a portion of the sequence tags are aligned to a reference genome. The alignment can allow mismatches, since variations are to be detected. The reference genome can be from a population similar to the subject. The aligned sequence tags effectively constitute the sample genome (SG).

At block 320, a first number (A) of potential variants, e.g., single nucleotide mutations (SNMs), is identified. A potential SNM is a locus at which the sequence tags of SG display a nucleotide different from RG. Other criteria can be used, e.g., that the number of sequence tags displaying the variation must be greater than a cut-off value, or that the locus is homozygous in RG. When specific loci are identified and tracked by storing them in memory, the set of potential SNMs may be denoted as set A. Either the specific loci or simply the number of SNMs may be determined.

At block 330, the constitutive genome is determined by aligning sequence tags, obtained by sequencing DNA fragments from a constitutive sample, to the reference genome. This step can have been performed at any earlier time, using a constitutive sample obtained at any earlier time. The CG may simply be read from memory when the alignment was done previously. In one embodiment, the constitutive sample may be blood cells.

At block 340, a second number (B) of loci is identified at which the aligned sequence tags of the CG have a variant (e.g., an SNM) relative to the reference genome. If a set of loci is specifically tracked, then B may denote the set, as opposed to just a number.

At block 350, set B is subtracted from set A to identify variants (SNMs) that are present in the sample genome but not in the CG. In one embodiment, the set of SNMs may be limited to nucleotide positions where the CG is homozygous. To achieve this filtering, specific loci where the CG is homozygous can be identified in set C. In another embodiment, if the CG is not homozygous at a locus, then the locus is not counted in the first number A or the second number B. In another embodiment, any known polymorphism (e.g., by virtue of its presence in a SNP database) may be filtered out.

In one embodiment, the subtraction in block 350 may be purely numerical, and thus does not remove specific potential SNMs but simply subtracts the values. In another embodiment, the subtraction takes the difference between set A and set B (e.g., where set B is a subset of set A) to identify the specific SNMs that are not in set B. In logical terms, this can be expressed as [A AND NOT(B)]. The resulting set of identified variants may be labeled C. The parameter may be determined as the number C or from the set C.
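The two forms of subtraction can be illustrated with Python sets. Loci are represented as plain strings and the example loci are made up; the optional restriction to loci where the CG is homozygous follows the filtering described for block 350. The function name is hypothetical.

```python
def sample_specific_variants(set_a, set_b, cg_homozygous_loci=None):
    """Set C = A AND NOT(B), optionally limited to loci where CG is homozygous."""
    set_c = set_a - set_b
    if cg_homozygous_loci is not None:
        set_c &= cg_homozygous_loci
    return set_c


# A: loci where SG differs from RG; B: loci where CG differs from RG.
set_a = {"chr1:100", "chr2:50", "chr5:900", "chr7:12"}
set_b = {"chr2:50", "chr7:12"}

numeric_c = len(set_a) - len(set_b)             # purely numerical subtraction
set_c = sample_specific_variants(set_a, set_b)  # set difference

print(numeric_c, sorted(set_c))
```

The numerical form equals the size of the set difference only when set B is a subset of set A, as in this example; otherwise the set-based form is needed to know which loci remain.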

In some embodiments, the nature of the mutations may be taken into account, with different weights assigned to different classes of mutations. For example, mutations that are commonly associated with cancer may be given a higher weight (also referred to as an importance value when referring to the relative weights of loci). Such mutations can be found in databases of tumor-associated mutations, such as the Catalogue of Somatic Mutations in Cancer (COSMIC) (www.sanger.ac.uk/genetics/CGP/COSMIC/). As another example, mutations associated with non-synonymous changes may be given a higher weight.

Thus, the first number A may be determined as a weighted sum, where the count of tags displaying a variant at one locus may have a different weight than the count of tags at another locus. The first number A can reflect this weighted sum. A similar calculation can be performed for B, and thus for the number C, so that the parameter reflects this weighting. In another embodiment, the weighting is taken into account when the set C of specific loci is determined. For example, a weighted sum may be determined over the counts at the loci of set C. Such weights may be used in other methods described herein.
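A weighted sum of this kind can be sketched as follows in Python. The locus labels and weight values are hypothetical; in practice the weights might come from a lookup against a mutation database or from an annotation of non-synonymous changes, neither of which is modeled here.

```python
def weighted_variant_sum(variant_tag_counts, weights, default_weight=1.0):
    """Sum per-locus variant tag counts, each multiplied by the locus weight.

    variant_tag_counts: dict mapping locus -> number of tags showing a variant.
    weights: dict mapping locus -> weight; loci not listed use default_weight.
    """
    return sum(count * weights.get(locus, default_weight)
               for locus, count in variant_tag_counts.items())


# Hypothetical counts; one locus is assumed to be a known cancer-associated site.
counts = {"locus_in_cancer_db": 4, "other_locus": 3}
weights = {"locus_in_cancer_db": 2.0}

print(weighted_variant_sum(counts, weights))
```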

Thus, the parameter that is compared to a threshold to determine the classification of the cancer grade may be the number of loci that exhibit SG and CG variation relative to RG. In other embodiments, the total number of DNA fragments that exhibit variation (e.g., as counted via sequence tags) can be counted. In other embodiments, the number may be used in another formula to obtain a parameter.

In one embodiment, the concentration of the variant at each locus may be a parameter and be compared to a threshold. This threshold can be used to determine whether a locus is a potential variant locus (in addition to a cutoff on the number of reads showing the variant), and then the loci are counted. The concentration may also be used as a weighting factor in the sum of SNMs.

Reduction of false positives using cut-off values

As mentioned above, single nucleotide mutations can be searched for across large genomic regions (e.g., the entire genome) or multiple genomic regions in many cell-free DNA fragments (e.g., circulating DNA in plasma) to improve the sensitivity of the method. However, analytical errors (e.g., sequencing errors) can affect the feasibility, accuracy, and specificity of this approach. Here, we use a massively parallel sequencing platform as an example to illustrate the impact of sequencing errors. The sequencing error rate of the Illumina sequencing-by-synthesis platform is about 0.1% to 0.3% per sequenced nucleotide (Minoche et al. Genome Biol 2011, 12: R112). Any massively parallel sequencing platform may be used, including sequencing-by-ligation platforms (e.g., the Life Technologies SOLiD platform), Ion Torrent/Ion Proton, semiconductor sequencing, Roche 454, and single molecule sequencing platforms (e.g., Helicos, Pacific Biosciences, and nanopore).

In a previous study on hepatocellular carcinoma, it was shown that there are about 3,000 single nucleotide mutations across the entire cancer genome (Tao Y et al. Proc Natl Acad Sci USA 2011; 108: 12042-12047). Assuming that only 10% of the total DNA in the circulation is derived from tumor cells and that we sequence the plasma DNA at an average sequencing depth of one-fold haploid genome coverage, we would encounter 9 million (3 × 10⁹ × 0.3%) single nucleotide variations (SNVs) due to sequencing errors. However, most single nucleotide mutations are expected to occur on only one of the two homologous chromosomes. At a sequencing depth of one-fold haploid genome coverage of a sample with 100% tumor DNA, we would expect to detect only half of the 3,000 mutations, i.e., 1,500 mutations. When we sequence a plasma sample containing 10% tumor-derived DNA at one-fold haploid genome coverage, we would expect to detect only 150 (1,500 × 10%) cancer-associated single nucleotide mutations. Thus, the signal-to-noise ratio for detection of cancer-associated mutations is 1 to 60,000. This very low signal-to-noise ratio indicates that if we simply used all single nucleotide changes in a biological sample (e.g., plasma) as a parameter, the accuracy of distinguishing between normal and cancer cases would be very low.
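The signal-to-noise estimate in this paragraph can be reproduced with integer arithmetic in Python. All inputs are the figures quoted in the text (a 3 × 10⁹ bp genome, a 0.3% error rate, 3,000 mutations, 10% tumor DNA, one-fold haploid coverage); the variable names are chosen for this sketch.

```python
GENOME_BP = 3_000_000_000
ERROR_RATE_PER_1000 = 3       # 0.3% per sequenced nucleotide
TUMOR_MUTATIONS = 3_000       # single nucleotide mutations in the tumor genome
TUMOR_PERCENT = 10            # tumor-derived share of plasma DNA, in percent

# Noise: erroneous bases expected at one-fold haploid genome coverage.
error_snvs = GENOME_BP * ERROR_RATE_PER_1000 // 1000

# Signal: mutations are assumed heterozygous, so one-fold haploid coverage
# of pure tumor DNA samples only half of them.
detectable_in_pure_tumor = TUMOR_MUTATIONS // 2
detectable_in_plasma = detectable_in_pure_tumor * TUMOR_PERCENT // 100

noise_per_signal = error_snvs // detectable_in_plasma

print(error_snvs, detectable_in_plasma, f"1 to {noise_per_signal:,}")
```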

It is expected that the sequencing error rate will continue to decrease as sequencing technology advances. It is also possible to analyze the same sample on more than one sequencing platform and to flag reads that may be affected by sequencing errors by comparing the results across platforms. Another approach is to analyze two samples taken from the same subject at different times; however, this approach is time-consuming.

In one embodiment, one method of enhancing the signal-to-noise ratio in detecting single nucleotide mutations in the plasma of a cancer patient is to count a mutation only when the same mutation occurs multiple times in a sample. On a given sequencing platform, sequencing errors involving particular nucleotide substitutions may be more common, and these will affect the sequencing results of the test sample and the constitutional DNA sample for both the test subject and the control subjects. In general, however, sequencing errors occur randomly.

When identical changes are observed at the same nucleotide position in multiple DNA fragments, the chance that they are due to sequencing errors is exponentially lower. On the other hand, the probability of detecting a true cancer-associated mutation in a sample is affected by the sequencing depth and the fractional concentration of tumor DNA in the sample. The probability of observing a mutation in multiple DNA fragments will increase with the sequencing depth and the fractional concentration of tumor DNA. In various embodiments using samples with cell-free tumor DNA (e.g., in plasma), the fractional concentration can be 5%, 10%, 20%, or 30%. In one embodiment, the fractional concentration is less than 50%.

Fig. 4 is a table 400 showing the number of cancer-associated single nucleotide mutations that are correctly identified when different numbers of occurrences are used as the criterion for classifying a mutation as present in a sample, according to an embodiment of the present invention. The number of nucleotide positions erroneously identified as having a mutation due to sequencing errors under the same classification criteria is also shown. The sequencing error rate was assumed to be 0.1% (Minoche AE et al. Genome Biol 2011; 12:R112). The fractional concentration of tumor-derived DNA in the sample was assumed to be 10%.

Fig. 4 shows that, when the fractional concentration of tumor-derived DNA in the sample is assumed to be 10%, the ratio between the number of cancer-associated mutations detected in plasma and the number of false positives increases exponentially with the number of identical changes required to define a mutation in the sample. In other words, both the sensitivity and the specificity of cancer mutation detection will be improved. In addition, the sensitivity of detecting cancer-associated mutations is affected by the sequencing depth. At 100-fold haploid genome coverage, 2,205 (73.5%) of the 3,000 mutations can be detected even when using the criterion that a specific mutation occurs in at least 4 DNA fragments in a sample. Other values for the minimum number of such fragments may be used, such as 3, 5, 8, 10, and greater than 10.

Fig. 5 is a table 500 showing the expected number of false positive loci and the expected number of mutations identified when the percentage concentration of tumor-derived DNA in the sample is assumed to be 5%. Where the percent concentration of tumor-derived DNA in a sample is low, a higher sequencing depth will be required to achieve the same sensitivity of detecting cancer-associated mutations. More stringent criteria will also be required to maintain specificity. For example, it would be desirable to use a criterion that a particular mutation occurs in at least 5 DNA fragments, rather than at least 4 times in a sample with a 10% tumor DNA concentration. Tables 400 and 500 provide guidance for using cut-off values at given fold-sequencing coverage and tumor DNA concentrations, which can be assumed or measured as described herein.

Another advantage of using criteria that detect a single nucleotide change more than once to define a mutation is that it is expected that this will minimize false positive detection due to single nucleotide changes in non-malignant tissue. Because nucleotide changes can occur during mitosis of normal cells, each healthy cell in the body can have multiple single nucleotide changes. These changes may produce false positive results. However, when the cells die, changes in the cells will be present in the plasma/serum. While different normal cells are expected to carry different sets of mutations, it is unlikely that mutations present in one cell will be present in numerous copies in plasma/serum. This is in contrast to mutations within tumor cells, where multiple copies are expected to be found in plasma/serum, since tumor growth is essentially clonal. Thus, multiple cells from a clone will die and release the marker mutation representing the clone.

In one embodiment, target enrichment of a particular genomic region may be performed prior to sequencing. This target enrichment step can increase the sequencing depth of the region of interest if the same amount of sequencing is performed. In another embodiment, sequencing at a relatively low sequencing depth may be performed first in a first round. The regions displaying at least one single nucleotide change can then be enriched for a second round of sequencing, with higher fold coverage. Multiple occurrence criteria can then be applied to define mutations for sequencing results with target enrichment.

VII. Dynamic cutoff

As described above, a cutoff value N for the number of reads supporting a variant (a potential mutation) may be used to determine whether a locus qualifies as a mutation (e.g., an SNM) to be counted. The use of such a cutoff value can reduce false positives. The following discussion provides methods for selecting cutoffs for different loci. In the following embodiments, we assume that there is a single major cancer clone. A similar analysis can be performed for situations involving multiple cancer cell clones releasing different amounts of tumor DNA into plasma.

A. Number of cancer-associated mutations detected in plasma

The number of cancer-associated mutations detectable in plasma can be influenced by a number of parameters, for example: (1) the number of mutations in the tumor tissue (N_T): the total number of mutations present in the tumor tissue is the maximum number of tumor-associated mutations detectable in the patient's plasma; (2) the percent concentration of tumor-derived DNA in plasma (f): the higher the percent concentration of tumor-derived DNA in plasma, the higher the probability that a tumor-associated mutation is detected in plasma; (3) the sequencing depth (D), which refers to the number of times the sequenced region is covered by sequence reads. For example, a 10-fold average sequencing depth means that each nucleotide within the sequenced region is covered by 10 sequence reads on average. When the sequencing depth is increased, the probability of detecting cancer-associated mutations will increase; and (4) the minimum number of times (r) a nucleotide change must be detected in plasma to be defined as a potential cancer-associated mutation, which is a cutoff value used to distinguish sequencing errors from true cancer-associated mutations.

In one embodiment, the Poisson distribution is used to predict the number of cancer-associated mutations that can be detected in plasma. Assuming that a mutation is present at a nucleotide position on one of the two homologous chromosomes and that the sequencing depth is D, the expected number of times the mutation appears in plasma (M_P) is calculated as follows: M_P = D × f/2.

The probability (Pb) of detecting a mutation in plasma at a particular mutation site is calculated as follows:

$$Pb = 1 - \sum_{i=0}^{r-1} \mathrm{Poisson}(i, M_P)$$

where r (the cutoff) is the number of times a nucleotide change must be seen in plasma for it to be defined as a potential tumor-associated mutation, and Poisson(i, M_P) is the Poisson probability of i occurrences when the mean number of occurrences is M_P.

The total number of cancer-associated mutations expected to be detected in plasma (N_P) can be calculated as follows: N_P = N_T × Pb, where N_T is the number of mutations present in the tumor tissue. The following graph shows the percentage of tumor-associated mutations expected to be detected in plasma using different numbers of occurrences (r) as the criterion for calling a potential mutation and different sequencing depths.

Fig. 6A is a graph 600 showing how the detection rate of cancer-associated mutations in plasma changes with sequencing depth when the fractional concentration of tumor-derived DNA in plasma is 10% or 20% and four or six occurrences (r) are used as the criterion to identify potential cancer-associated mutations. For the same r, a higher percentage concentration of tumor-derived DNA in plasma results in a higher number of cancer-associated mutations detectable in plasma. For the same percentage concentration of tumor-derived DNA in plasma, a higher r results in a lower number of mutations detected.
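The Poisson model of this subsection can be sketched in Python as follows. The sketch takes Pb to be the Poisson probability of observing at least r mutant reads when the mean is M_P = D × f/2; under that reading it gives about 73.5% (roughly 2,205 of 3,000 mutations) for D = 100, f = 10%, and r = 4, in line with the figure quoted earlier for table 400. The function names are illustrative, not part of the described method.

```python
import math


def poisson_pmf(i: int, mean: float) -> float:
    """Poisson probability of exactly i occurrences for the given mean."""
    return math.exp(-mean) * mean ** i / math.factorial(i)


def detection_probability(depth: float, tumor_fraction: float, r: int) -> float:
    """Pb: probability of seeing a heterozygous tumor mutation in at least r reads."""
    mean_mutant_reads = depth * tumor_fraction / 2  # M_P = D x f / 2
    return 1.0 - sum(poisson_pmf(i, mean_mutant_reads) for i in range(r))


def expected_detected(n_tumor_mutations: int, depth: float,
                      tumor_fraction: float, r: int) -> float:
    """N_P = N_T x Pb."""
    return n_tumor_mutations * detection_probability(depth, tumor_fraction, r)


# Combinations described for Fig. 6A (f = 10% or 20%, r = 4 or 6), at 100-fold depth.
for f in (0.10, 0.20):
    for r in (4, 6):
        pb = detection_probability(100, f, r)
        print(f"f={f:.0%} r={r}: Pb={pb:.3f}")
```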

B. The number of false positives detected due to errors

Single nucleotide changes in plasma DNA sequencing data can arise from sequencing and alignment errors. The number of nucleotide positions with false positive single nucleotide changes can be predicted mathematically based on a binomial distribution. The parameters affecting the number of false positive sites (N_FP) may include: (1) the sequencing error rate (E), defined as the proportion of sequenced nucleotides that are incorrect; (2) the sequencing depth (D): the higher the sequencing depth, the greater the number of nucleotide positions showing a sequencing error; (3) the minimum number of occurrences (r) of the same nucleotide change required to define a potential cancer-associated mutation; and (4) the total number of nucleotide positions within the region of interest (N_I).

The occurrence of sequencing errors can generally be regarded as a random process. Thus, as the number of occurrences required to define a potential mutation increases, the number of false positive nucleotide positions will decrease exponentially with r. On some existing sequencing platforms, certain sequence contexts are more prone to sequencing errors. Examples of such sequence contexts include GGC motifs, homopolymers (e.g., AAAAAAA), and simple repeats (e.g., ATATATATAT). These sequence contexts substantially increase single nucleotide change or insertion/deletion artifacts (Nakamura K et al. Nucleic Acids Res 2011; 39:e90 and Minoche AE et al. Genome Biol 2011; 12:R112). In addition, repeated sequences (e.g., homopolymers and simple repeats) introduce computational ambiguity in alignment and thus cause false positive results for single nucleotide variations.

The larger the region of interest, the higher the number of false positive nucleotide positions that will be observed. If mutations are sought in the whole genome, the region of interest will be the whole genome and the number of nucleotides involved will be 3 billion. On the other hand, if the focus is on the exome, the number of nucleotides encoding the exons (i.e., about 45 million) will constitute the region of interest.

The number of false positive nucleotide positions associated with sequencing errors can be determined based on the following calculation. The probability of having the same nucleotide change at the same position due to sequencing errors (P_Er) can be calculated as follows:

$$P_{Er} = C(D, r)\, E^{r} \left(\frac{1}{3}\right)^{r-1}$$

where C(D, r) is the number of possible combinations of r elements selected from a total of D elements; r is the number of occurrences required to define a potential mutation; D is the sequencing depth; and E is the sequencing error rate. C(D, r) can be calculated as follows:

$$C(D, r) = \frac{D!}{r!\,(D - r)!}$$

The number of nucleotide positions that are false positive for a mutation (N_FP) can be calculated as follows:

$$N_{FP} = N_I \times P_{Er}$$

where N_I is the total number of nucleotide positions in the region of interest.

Fig. 6B is a graph 650 showing the expected number of false positive sites versus sequencing depth when nucleotide changes are defined using occurrence criteria (r) of 4, 5, 6, and 7. The region of interest was assumed in this calculation to be the whole genome (3 billion nucleotide positions). The sequencing error rate was assumed to be 0.3% of the nucleotides sequenced. As can be seen, the value of r has a significant effect on the number of false positives. However, as can be seen from Fig. 6A, higher r values also reduce the number of mutations detected, at least until a significantly higher sequencing depth is used.
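A minimal Python sketch of the false positive estimate follows. It assumes that the per-position probability takes the form C(D, r) × E^r × (1/3)^(r−1), where the (1/3)^(r−1) factor reflects the requirement that all r erroneous reads show the same one of the three alternative bases; this form and the function names are assumptions made for illustration. With E = 0.3% and 3 billion positions, the sketch yields roughly 23,000 sites for r = 5 and roughly 740 for r = 6 at 200-fold depth.

```python
import math


def error_probability(depth: int, r: int, error_rate: float) -> float:
    """P_Er: chance that r of `depth` reads carry the same erroneous base.

    Assumed form: C(D, r) * E**r * (1/3)**(r - 1). The (1 - E)**(D - r) term
    for the remaining reads is close to 1 and is ignored here.
    """
    return math.comb(depth, r) * error_rate ** r * (1 / 3) ** (r - 1)


def expected_false_positives(n_positions: int, depth: int, r: int,
                             error_rate: float) -> float:
    """N_FP = N_I * P_Er."""
    return n_positions * error_probability(depth, r, error_rate)


WHOLE_GENOME = 3_000_000_000
for r in (4, 5, 6, 7):
    n_fp = expected_false_positives(WHOLE_GENOME, 200, r, 0.003)
    print(f"r={r}: about {n_fp:,.0f} false positive sites at 200-fold depth")
```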

C. Selecting the minimum number of occurrences (r)

As discussed above, the number of true cancer-associated mutation sites and false positive sites due to sequencing errors will increase with the depth of sequencing. However, the rate of increase will be different. Thus, it is possible to utilize the selection of sequencing depth and r-value to maximize the detection of true cancer-associated mutations while keeping the number of false positive sites at a low level.

Fig. 7A is a graph 700 showing the number of true cancer-associated mutation sites and false positive sites at different sequencing depths. The total number of cancer-associated mutations in the tumor tissue was assumed to be 3,000, and the percentage concentration of tumor-derived DNA in plasma was assumed to be 10%. The sequencing error rate was assumed to be 0.3%. In the legend, TP denotes true positive sites, at which the corresponding mutation is present in the tumor tissue, and FP denotes false positive sites, at which no corresponding mutation is present in the tumor tissue and the nucleotide change in the sequencing data is due to sequencing errors.

According to graph 700, if we use a minimum of 6 occurrences (r = 6) as the criterion to define potential mutation sites in plasma, then at 110-fold sequencing depth about 1,410 true cancer-associated mutations will be detected. Using this criterion, only about 20 false positive sites will be detected. If we use a minimum of 7 occurrences (r = 7) as the criterion to define potential mutations, the number of cancer-associated mutations that can be detected would be reduced by 470, i.e., to about 940. Thus, the criterion of r = 6 makes the detection of cancer-associated mutations in plasma more sensitive.

On the other hand, at 200-fold sequencing depth, if we use minimums of 6 and 7 occurrences (r) as the criteria to define potential mutations, the numbers of true cancer-associated mutations detected would be about 2,800 and 2,600, respectively. With these two r values, the numbers of false positive sites will be about 740 and 20, respectively. Thus, using the more stringent criterion of r = 7 to define potential mutations at 200-fold sequencing depth can greatly reduce the number of false positive sites without significantly compromising the sensitivity of detecting true cancer-associated mutations.
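The trade-off discussed for graph 700 can be explored with the sketch below, which combines a Poisson estimate of true positives with a combinatorial estimate of false positives and picks the smallest r that keeps false positives within a budget. The false positive formula C(D, r) × E^r × (1/3)^(r−1), the budget of 25 sites, and all names are assumptions chosen for illustration; they are not values prescribed by the text.

```python
import math
from typing import Optional

GENOME_POSITIONS = 3_000_000_000  # whole-genome region of interest
TUMOR_MUTATIONS = 3_000           # N_T assumed for graph 700
TUMOR_FRACTION = 0.10             # f assumed for graph 700
ERROR_RATE = 0.003                # E assumed for graph 700


def true_positives(depth: int, r: int) -> float:
    """Expected detected mutations: N_T * P(at least r mutant reads), Poisson model."""
    mean = depth * TUMOR_FRACTION / 2
    p_below = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(r))
    return TUMOR_MUTATIONS * (1.0 - p_below)


def false_positives(depth: int, r: int) -> float:
    """Expected false positive sites under the assumed error model."""
    p_err = math.comb(depth, r) * ERROR_RATE ** r * (1 / 3) ** (r - 1)
    return GENOME_POSITIONS * p_err


def best_cutoff(depth: int, fp_budget: float, max_r: int = 20) -> Optional[int]:
    """Smallest r whose expected false positives fit the (hypothetical) budget.

    True positives fall as r rises, so the smallest admissible r is also the
    most sensitive one.
    """
    for r in range(1, max_r + 1):
        if false_positives(depth, r) <= fp_budget:
            return r
    return None


for depth in (110, 200):
    r = best_cutoff(depth, fp_budget=25)
    print(f"depth={depth}: r={r}, TP~{true_positives(depth, r):.0f}, "
          f"FP~{false_positives(depth, r):.0f}")
```

With a budget of 25 sites this selects r = 6 at 110-fold depth and r = 7 at 200-fold depth, the two operating points compared in the text.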

D. Dynamic cutoff for sequencing data to define potential mutations in plasma

The sequencing depth will vary from nucleotide to nucleotide within the region of interest. If we apply a fixed cutoff value for the number of occurrences of a nucleotide change to define potential mutations in plasma, nucleotides covered by more sequence reads (i.e., with a higher sequencing depth) will have a higher probability than nucleotides with a lower sequencing depth of being falsely scored as carrying a nucleotide variation when no such change is present in the tumor tissue. One embodiment that addresses this problem is to apply a dynamic cutoff value of r to different nucleotide positions, based on the actual sequencing depth at the particular nucleotide position and on the desired upper limit of the probability of calling a false positive variation.

In one embodiment, the maximum allowable false positive rate may be fixed at 1 in 1.5 × 10⁸ nucleotide positions. At this maximum allowable false positive rate, the total number of false positive sites identified in the whole genome will be less than 20. The r values for different sequencing depths can be determined according to the curves shown in Fig. 6B, and these cutoff values are shown in Table 1. In other embodiments, other maximum allowable false positive rates may be used, such as 1 in 3 × 10⁸, 1 in 10⁸, or 1 in 6 × 10⁷. The corresponding total numbers of false positive sites will be less than 10, 30, and 50, respectively.

Table 1. Minimum number of occurrences (r) of a nucleotide change in plasma required to define a potential mutation, for different sequencing depths at a particular nucleotide position. The maximum false positive rate is fixed at 1 in 1.5 × 10⁸ nucleotides.
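One way a depth-dependent cutoff could be derived is sketched below: for the depth observed at a position, take the smallest r whose error probability does not exceed the maximum allowable false positive rate. The error model C(D, r) × E^r × (1/3)^(r−1), the 0.3% error rate, and the function names are assumptions for illustration; the output is not claimed to reproduce the values of Table 1.

```python
import math
from typing import Optional

MAX_FP_RATE = 1 / 1.5e8  # 1 false positive per 1.5 x 10^8 nucleotide positions


def error_probability(depth: int, r: int, error_rate: float) -> float:
    """Assumed per-position probability of r identical erroneous reads."""
    return math.comb(depth, r) * error_rate ** r * (1 / 3) ** (r - 1)


def dynamic_cutoff(depth: int, error_rate: float = 0.003,
                   max_fp_rate: float = MAX_FP_RATE) -> Optional[int]:
    """Smallest r meeting the false positive bound at this depth.

    Returns None when the depth is too low for any r to satisfy the bound,
    in which case the position cannot be called under this rule.
    """
    for r in range(1, depth + 1):
        if error_probability(depth, r, error_rate) <= max_fp_rate:
            return r
    return None


for depth in (10, 50, 100, 200):
    print(f"depth {depth:>3}: r >= {dynamic_cutoff(depth)}")
```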

E. Target enrichment sequencing

As shown in fig. 7A, a higher sequencing depth may result in better sensitivity for detecting cancer-associated mutations while keeping the number of false positive sites low by allowing higher r-values to be used. For example, at 110-fold sequencing depth, 1,410 true cancer-related mutations can be detected in plasma using r-value 6, while when the sequencing depth is increased to 200-fold and r-value 7 is applied, the number of true cancer-related mutations detected will be 2,600. Two data sets are expected to yield about 20 false positive sites.

Because sequencing a whole genome at 200-fold depth is currently relatively expensive, one possible way to achieve this sequencing depth is to focus on a smaller region of interest. Analysis of a target region can be achieved, for example but not limited to, by capturing the genomic region of interest by hybridization using DNA or RNA baits. The captured regions are then pulled down, for example by magnetic means, and sequenced. Such target capture can be performed, for example, using the Agilent SureSelect target enrichment system, the Roche NimbleGen target enrichment system, or the Illumina targeted resequencing system. Another approach is to perform PCR amplification of the target region followed by sequencing. In one embodiment, the region of interest is the exome. In such an embodiment, target capture of all exons can be performed on plasma DNA, and the plasma DNA enriched for exonic regions can then be sequenced.

In addition to having a higher sequencing depth, focusing on a specific region rather than analyzing the whole genome would significantly reduce the number of nucleotide positions in the search space and would result in a reduction in the number of false positive positions given the same sequencing error rate.

Fig. 7B is a graph 750 showing the predicted number of false positive sites for analyses of the whole genome (WG) and of all exons. For each type of analysis, two different r values, 5 and 6, were used. At 200-fold sequencing depth, if r = 5 is used to define mutations in plasma, the predicted numbers of false positive sites are about 23,000 and 230 for the whole genome and all exons, respectively. If r = 6 is used to define mutations in plasma, the predicted numbers of false positive sites are 750 and 7, respectively. Thus, limiting the number of nucleotides in the region of interest can significantly reduce the number of false positives in plasma mutation analysis.

In exon capture or even exome capture sequencing, the number of nucleotides in the search space is reduced. Thus, even though we allow for a higher false positive rate for the detection of cancer-associated mutations, the absolute number of false positive sites can be kept at a relatively low level. Allowing a higher false positive rate would allow less stringent criteria to be used that define the minimum occurrence (r) of single nucleotide variations in plasma. This will result in a higher sensitivity of detecting true cancer-related mutations.

In one embodiment, we can use a maximum allowable false positive rate of 1 in 1.5 × 10⁶ nucleotide positions. At this false positive rate, the total number of false positive sites within the target exons will be only 20. The r values for different sequencing depths at a maximum allowable false positive rate of 1 in 1.5 × 10⁶ are shown in Table 2. In other embodiments, other maximum allowable false positive rates may be used, such as 1 in 3 × 10⁶, 1 in 10⁶, or 1 in 6 × 10⁵. The corresponding total numbers of false positive sites will be less than 10, 30, and 50, respectively. In one embodiment, as described above, different classes of mutations may be given different weights.

Table 2. Minimum number of occurrences (r) of a nucleotide change in plasma required to define a potential mutation, for different sequencing depths at a particular nucleotide position. The maximum false positive rate is fixed at 1 in 1.5 × 10⁶ nucleotides.

VIII. Cancer detection

As mentioned above, the count of sequence tags at variant loci can be used in various ways to determine parameters that are compared to thresholds to classify cancer grade. Percent concentration of variant reads at one locus or many loci relative to all reads is another parameter that may be used. The following are some examples of calculating parameters and thresholds.

A. Measuring parameters

If the constitutional genome (CG) is homozygous for a first allele at a particular locus and a variant allele is found in a biological sample (e.g., plasma), the percent concentration can be calculated as 2p/(p + q), where p is the number of sequence tags with the variant allele and q is the number of sequence tags with the first allele of the CG. This formula assumes that only one of the tumor's haplotypes carries the variant, as will typically be the case. Thus, a percent concentration can be calculated for each homozygous locus, and the percent concentrations may be averaged. In another embodiment, the count p may include the number of sequence tags across all loci, and similarly for the count q, to determine the percent concentration. An example is now described.
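The two ways of applying 2p/(p + q) described above, averaging per-locus values and pooling counts across loci, can be sketched as follows. The read counts in the example are hypothetical and are not taken from any table in this document; the function names are likewise illustrative.

```python
from typing import Iterable, List, Tuple

Counts = Tuple[int, int]  # (p, q): variant tags, first-allele tags at one locus


def fractional_concentration(p: int, q: int) -> float:
    """2p / (p + q), assuming the variant sits on one tumor haplotype only."""
    if p + q == 0:
        raise ValueError("no sequence tags at this locus")
    return 2 * p / (p + q)


def mean_fractional_concentration(loci: Iterable[Counts]) -> float:
    """Average of the per-locus estimates."""
    values: List[float] = [fractional_concentration(p, q) for p, q in loci]
    return sum(values) / len(values)


def pooled_fractional_concentration(loci: Iterable[Counts]) -> float:
    """Single estimate from counts summed over all loci."""
    loci = list(loci)
    return fractional_concentration(sum(p for p, _ in loci), sum(q for _, q in loci))


# Hypothetical counts at two loci with different depths.
example = [(5, 95), (30, 170)]
print(mean_fractional_concentration(example))    # per-locus values are 0.10 and 0.30
print(pooled_fractional_concentration(example))  # pooled counts: p = 35, q = 265
```

The pooled estimate weights each locus by its depth, whereas the average treats every locus equally, so the two can differ when coverage is uneven.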

Tumor-derived single nucleotide variants (SNVs) in the plasma of 4 HCC patients were studied at the whole-genome level. We sequenced tumor DNA and leukocyte DNA to average depths of 29.5-fold (range, 27-fold to 33-fold) and 43-fold (range, 39-fold to 46-fold) haploid genome coverage, respectively. For each of the 4 HCC patients, the MPS data of the tumor DNA were compared with those of the leukocyte DNA, and SNVs present in the tumor DNA but absent in the leukocyte DNA were sought with a stringent bioinformatics algorithm. This algorithm requires that a candidate SNV be present in at least a threshold number of sequenced tumor DNA fragments (i.e., in the corresponding sequence tags) before it is classified as a true SNV. For example, as described herein, the threshold number is determined by taking into account the sequencing depth at the particular nucleotide and the sequencing error rate.

Fig. 8 is a table 800 showing the results for 4 HCC patients before and after treatment, including the percentage concentration of tumor-derived DNA in plasma, according to an embodiment of the present invention. In the 4 HCC cases, the number of tumor-associated SNVs ranged from 1,334 to 3,171. The proportions of these SNVs detectable in plasma before and after treatment are listed. Before treatment, 15% to 94% of the tumor-associated SNVs were detected in plasma. After treatment, the percentage detected was between 1.5% and 5.5%. Thus, the number of SNVs detected did correlate with the cancer grade. This shows that the number of SNVs can be used as a parameter to classify the cancer grade.

The percent concentration of tumor-derived DNA in plasma was determined from the fraction of mutant sequences relative to total (i.e., mutant plus wild-type) sequences. The formula is 2p/(p + q), where the factor of 2 accounts for a tumor mutation occurring on only one haplotype. These percent concentrations correlated well with those determined by genomewide aggregated allelic loss (GAAL) analysis (Chan KC et al. Clin Chem 2013; 59:211-24) and decreased after surgery. Thus, the percent concentration is also a parameter that can be used to determine the cancer grade.

The percent concentration from the SNV analysis may reflect tumor burden. Cancer patients with higher tumor burden (e.g., higher inferred percent concentration) will have a higher frequency of somatic mutations than cancer patients with lower tumor burden. Thus, embodiments may also be used for prognosis. In general, cancer patients with higher tumor burden have a poorer prognosis than cancer patients with lower tumor burden. The former group will therefore have a higher chance of death due to the disease. In some embodiments, if the absolute concentration of DNA in a biological sample (e.g., plasma) can be determined (e.g., using real-time PCR or fluorimetry), the absolute concentration of tumor-associated genetic aberrations can be determined and used for clinical detection and/or monitoring and/or prognosis.

B. Determination of threshold value

Table 800 may be used to determine the threshold. As mentioned above, the number and percent concentration of SNVs determined by SNV analysis correlates with cancer grade. The threshold value may be determined on an individual basis. For example, the pre-treatment parameter value may be used to determine a threshold value. In various embodiments, the threshold may be a relative change from a pre-treatment absolute value. A suitable threshold may be a value corresponding to a 50% reduction in the number or percentage concentration of SNVs. Such a threshold would provide a classification of lower cancer grade for each case in table 800. Note that the threshold may depend on the sequencing depth.

In one embodiment, a single threshold may be used across samples, and it may or may not take into account a pre-treatment value of the parameter. For example, a threshold of 100 SNVs may be used to classify a subject as not having cancer or having a low cancer grade. In table 800, each of the four cases met this threshold of 100 SNVs. If percent concentration is used as the parameter, a threshold of 1.0% classifies HCC1-HCC3 as having an almost zero cancer grade, and a second threshold of 1.5% classifies HCC4 as having a low cancer grade. Thus, more than one threshold may be used to obtain more than two classifications.

To illustrate other possible thresholds, we analyzed the plasma of healthy controls for tumor-associated SNVs. Numerous measurements can be made on healthy subjects to determine the range of variation expected in a biological sample relative to the constitutional genome.

Figure 9 is a table 900 showing the detection of HCC-associated SNVs in 16 healthy control subjects according to an embodiment of the invention. Table 900 can be used to estimate the specificity of the SNV analysis method. 16 healthy controls are listed in different rows. Each column is the SNV detected for a particular HCC patient and shows the number of sequence reads with a variant allele at the variant locus and the number of sequence reads with the wild-type allele (i.e., the allele from CG). For example, for HCC1, control C01 had 40 variant reads at the variant locus, but 31,261 wild-type allele reads. The last column shows the total percent concentration of all SNVs across HCC patients. Because HCC-associated SNVs are specific for HCC patients, the presence of HCC-associated SNVs represents a false positive. As described herein, if a cutoff value is applied to these apparent sequence variants, then all of these false positives will be filtered out.

These small numbers of putative tumor-associated mutations in the plasma of the 16 healthy controls represent the "random noise" of this method and may be due to sequencing errors. The average percent concentration estimated from this noise is 0.38%. These values show the range for healthy subjects. Thus, the threshold for classifying a zero cancer grade for HCC may be about 0.5%, since the highest percent concentration is 0.43%. Accordingly, if all cancer cells are removed from an HCC patient, the patient would be expected to show a similarly low percent concentration.

Referring back to table 800, if 0.5% is used as the threshold for zero cancer grade, the post-treatment plasma data for HCC1 and HCC3 would be determined to have zero grade based on SNV analysis. HCC2 may be classified as one level from zero up. HCC4 may also be classified as one grade from zero up, or some higher grade, but still at a relatively low grade compared to the pre-treatment sample.
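Classification against one or more thresholds can be sketched as a simple lookup. In the sketch below, the 0.5% threshold comes from the discussion above, while the second threshold of 1.5% and the parameter values passed in are hypothetical inputs chosen for illustration; they are not the post-treatment values of table 800.

```python
from bisect import bisect_right
from typing import Sequence


def classify_cancer_grade(parameter: float, thresholds: Sequence[float]) -> int:
    """Return the number of thresholds the parameter meets or exceeds.

    0 means below every threshold (zero grade); each further threshold that is
    reached moves the classification one grade up.
    """
    return bisect_right(sorted(thresholds), parameter)


ZERO_GRADE_THRESHOLD = 0.005   # 0.5%, from the healthy-control analysis
SECOND_THRESHOLD = 0.015       # hypothetical second threshold for illustration

for value in (0.004, 0.012, 0.030):  # hypothetical percent concentrations
    grade = classify_cancer_grade(value, [ZERO_GRADE_THRESHOLD, SECOND_THRESHOLD])
    print(f"{value:.1%} -> grade {grade}")
```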

In one embodiment, where the parameter corresponds to the number of variant loci, the threshold can be zero (i.e., one variant locus can indicate a non-zero cancer grade). However, under many settings (e.g. sequencing depth) the threshold will be higher, e.g. the absolute value is 5 or 10. In one embodiment, where a person is monitored after treatment, the threshold may be a percentage of the SNV present in the sample (identified by direct analysis of the tumor). If the cut-off value for the number of variant reads required at a locus is large enough, having only one variant locus can indicate a non-zero cancer grade.

Thus, quantitative analysis of variations (e.g., single nucleotide variations) in the DNA of a biological sample (e.g., plasma) can be used for the diagnosis, monitoring, and prognostication of cancer. To detect cancer, the number of single nucleotide variations detected in the plasma of a test subject can be compared with the number in a group of healthy subjects. In healthy subjects, apparent single nucleotide variations in plasma can be due to sequencing errors and to non-clonal mutations from blood cells and other organs. It has been shown that cells in normal healthy subjects can carry a small number of mutations (Conrad DF et al. Nat Genet 2011; 43:712-4), as shown in table 900. Thus, the total number of apparent single nucleotide variations in the plasma of a group of apparently healthy subjects can be used as a reference range to determine whether the tested patient has an abnormally high number of single nucleotide variations in plasma, corresponding to a non-zero cancer grade.

The healthy subjects used to determine the reference range can be matched to the tested subject with respect to age and sex. Previous studies have shown that the number of mutations in somatic cells increases with age (Cheung NK et al. JAMA 2012; 307:1062-71). Thus, as we grow older, it would be 'normal' to accumulate cell clones, even though they are relatively benign most of the time or would take an extremely long time to become clinically significant. In one embodiment, reference levels may be generated for different groups of subjects, e.g., by age, sex, ethnicity, and other parameters (e.g., smoking status, hepatitis status, alcohol use, drug history).

The reference range may vary based on the cutoff value used (i.e., the number of variant sequence tags required at the locus) as well as the assumed false positive rate and other variables (e.g., age). Thus, the reference range may be determined with respect to one or more criteria, and the same criteria used to determine the parameters of the sample. The parameter can then be compared to a reference range, since both are determined using the same criteria.

As mentioned above, embodiments may use multiple thresholds to determine the cancer grade. For example, a parameter below a first threshold may be determined to show no sign of cancer, and a parameter above it may indicate at least a first cancer grade, which may be a pre-neoplastic grade. Other grades may correspond to different cancer stages.

C. Dependence on experimental variables

The sequencing depth can be important for establishing the minimum detection threshold for the minority (e.g., tumor) genome. For example, if the sequencing depth is 10 haploid genomes, then the lowest tumor DNA concentration that can be detected, even with an error-free sequencing technique, is 1/5, i.e., 20%. On the other hand, if the sequencing depth is 100 haploid genomes, the minimum detectable concentration falls to 2%. This analysis refers to the scenario in which only one mutated locus is analyzed. However, when more mutated loci are analyzed, the lowest detectable tumor DNA concentration can be lower and is governed by a binomial probability function. For example, if the sequencing depth is 10-fold and the percentage concentration of tumor DNA is 20%, then the probability of detecting a mutation is 10%. However, if we have 10 mutations, then the probability of detecting at least one mutation will be 1 − (1 − 10%)¹⁰ = 65%.
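The two calculations in this paragraph can be reproduced with the short sketch below. It reads the single-locus limit as one mutant read out of D reads with the mutation on one of two haplotypes, i.e., a minimum fraction of 2/D, and takes the 10% per-mutation detection probability as given in the text. The function names are illustrative.

```python
def min_detectable_fraction(depth: int) -> float:
    """Lowest tumor fraction giving one mutant read at a heterozygous site: 2 / D."""
    return 2 / depth


def prob_at_least_one(p_single: float, n_mutations: int) -> float:
    """Probability that at least one of n independent mutations is detected."""
    return 1 - (1 - p_single) ** n_mutations


print(min_detectable_fraction(10))    # 0.2, i.e., 20% at 10-fold depth
print(min_detectable_fraction(100))   # 0.02, i.e., 2% at 100-fold depth
print(f"{prob_at_least_one(0.10, 10):.0%}")  # about 65% for 10 mutations
```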

Increasing the sequencing depth has several effects. The higher the sequencing depth, the more sequencing errors will be encountered (see FIGS. 4 and 5). However, at higher sequencing depths it becomes easier to distinguish sequencing errors from mutations arising from the clonal expansion of a subpopulation of cells (e.g., cancer cells), because sequencing errors occur at random positions in the genome, whereas the mutations of a given cell population occur at the same location.

The higher the sequencing depth, the more mutations from "healthy cells" will be identified. However, when there is no clonal expansion of these healthy cells and their mutation profiles differ from one another, the mutations in these healthy cells can be distinguished from cancer-related mutations by their frequency of occurrence in plasma (e.g., by using a cutoff value N for the required number of reads exhibiting the mutation, where N is equal to 2, 3, 4, 5, or greater).

As mentioned above, the threshold may depend on the number of mutations in healthy cells that have clonally expanded and therefore may not be filtered out by other mechanisms. This expected variation can be obtained by analyzing healthy subjects. Because clonal expansion occurs over time, the age of the patient can affect the variation seen in healthy subjects, and thus the threshold may depend to some extent on age.

D. Combination with targeted approaches

In some embodiments, random sequencing can be used in combination with targeted approaches. For example, random sequencing of a plasma sample can be performed at the time a cancer patient presents. The plasma DNA sequencing data can be analyzed for copy number aberrations and SNVs. Regions exhibiting aberrations (e.g., amplification/deletion or a high density of SNVs) can then be targeted for serial monitoring. The monitoring can be carried out over a period of time, or immediately after the random sequencing, effectively as a single procedure. For targeted analysis, solution-phase hybridization-based capture has been used successfully to enrich plasma DNA for noninvasive prenatal diagnosis (Liao GJ et al. Clin Chem 2011; 57: 92-101). Such techniques are mentioned above. Thus, targeted and random approaches can be used in combination for cancer detection and monitoring.

Thus, targeted sequencing can be performed on the loci of potential mutations found by the non-targeted, genome-wide approach mentioned above. Such targeted sequencing can be performed using solution-phase or solid-phase hybridization techniques (e.g., Agilent SureSelect, NimbleGen Sequence Capture, or the Illumina targeted resequencing system) followed by massively parallel sequencing. Another approach is to perform targeted sequencing with an amplification-based (e.g., PCR-based) system (Forshew T et al. Sci Transl Med 2012; 4: 136ra68).

IX. Percent concentration

The percent concentration of tumor DNA can be used to determine the cutoff for the number of variant reads required at a locus before the locus is identified as a mutation. For example, if the percent concentration is known to be relatively high, a high cutoff value may be used to filter out more false positives, since a true SNV should then show a relatively high number of variant reads. On the other hand, if the percent concentration is low, a lower cutoff value may be needed so that some SNVs are not missed. In this case, the percent concentration is determined by a method other than the SNV analysis, and is then used as a parameter of that analysis.

Various techniques can be used to determine the percent concentration, some of which are described herein. These techniques can be used to determine the percent concentration of tumor-derived DNA in a mixture, such as a biopsy sample containing a mixture of tumor cells and non-malignant cells or a plasma sample from a cancer patient containing DNA released by tumor cells and DNA released by non-malignant cells.

A. GAAL

Genome-wide aggregated allelic loss (GAAL) analysis focuses on loci that have lost heterozygosity (Chan KC et al. Clin Chem 2013; 59: 211-24). For sites that are heterozygous in the constitutional genome (CG), a tumor often has loci where one of the alleles is deleted. Thus, sequence reads for such loci will show more of one allele than the other, with the difference being proportional to the percent concentration of tumor DNA in the sample. An example of such a calculation is as follows.

DNA extracted from the leukocytes and tumor tissues of HCC patients was genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 system. Microarray data were processed with the Affymetrix Genotyping Console version 4.1. Genotyping analysis and single nucleotide polymorphism (SNP) calling were performed with the Birdseed v2 algorithm. The genotyping data for the leukocytes and the tumor tissue were used to identify regions with loss of heterozygosity (LOH) and to perform copy number analysis. Copy number analysis was performed with the Affymetrix Genotyping Console using default parameters, a minimum genomic segment size of 100 bp, and a minimum of 5 genetic markers within a segment.

Regions with LOH were identified as regions having 1 copy in tumor tissue and 2 copies in leukocytes, with the SNPs within these regions being heterozygous in leukocytes but homozygous in tumor tissue. For genomic regions exhibiting LOH in tumor tissue, the SNP alleles present in leukocytes but absent or of reduced intensity in tumor tissue were considered to be the alleles on the deleted segment of the chromosomal region. The alleles present in both leukocytes and tumor tissue were considered to be derived from the non-deleted segment of the chromosomal region. For all chromosomal regions with a single copy loss in the tumor, the total numbers of sequence reads carrying the deleted alleles and the non-deleted alleles were counted. The difference between these two values was used to infer the percent concentration of tumor-derived DNA in the sample (F_GAAL) using the following equation:

$$F_{GAAL} = \frac{N_{\text{non-del}} - N_{\text{del}}}{N_{\text{non-del}}}$$

where N_non-del represents the total number of sequence reads carrying the non-deleted alleles, and N_del represents the total number of sequence reads carrying the deleted alleles.
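A minimal sketch of this calculation follows, assuming the GAAL estimate is the difference between the non-deleted and deleted allele read counts divided by the non-deleted count. The function name and the read counts in the example are hypothetical and are not data from this document.

```python
def f_gaal(n_non_del: int, n_del: int) -> float:
    """Percent concentration of tumor DNA (as a fraction) from LOH regions.

    n_non_del: reads carrying the non-deleted alleles
    n_del:     reads carrying the deleted alleles
    """
    if n_non_del <= 0:
        raise ValueError("n_non_del must be positive")
    return (n_non_del - n_del) / n_non_del


# Hypothetical counts: the deleted alleles are seen 40% as often as the
# non-deleted alleles, so about 60% of the DNA is inferred to be tumor-derived.
print(f_gaal(10_000, 4_000))
```

The reasoning behind the ratio: in a region with single copy loss, every genome contributes the non-deleted allele, but only the non-tumor genomes contribute the deleted allele, so the relative shortfall of the deleted allele equals the tumor fraction.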

B. Assessment based on genomic representation

One limitation of the GAAL technique is that particular loci (i.e., loci exhibiting LOH) must be identified, and only the sequence reads aligning to those loci are used. This requirement may add extra steps and thus cost. One embodiment is now described that uses only copy number information, e.g., sequence read density.

Chromosomal aberrations such as amplification and deletion are often observed in cancer genomes. Chromosomal aberrations observed in cancer tissues usually involve sub-chromosomal regions, and these aberrations can be shorter than 1 Mb. Also, cancer-related chromosomal aberrations are heterogeneous in different patients, and thus different regions can be affected in different patients. It is not uncommon for tens, hundreds, or even thousands of copy number aberrations to be found in cancer genomes. All these factors make it difficult to determine tumor DNA concentration.

Embodiments include analyzing the quantitative changes resulting from tumor-associated chromosomal aberrations. In one embodiment, a DNA sample containing DNA derived from both cancer cells and normal cells is sequenced using massively parallel sequencing, for example on the Illumina HiSeq 2000 sequencing platform. The source DNA may be cell-free DNA in plasma or in another suitable biological sample.

An amplified chromosomal region in tumor tissue will have an increased probability of being sequenced, and a deleted region in tumor tissue will have a reduced probability of being sequenced. Thus, the read density corresponding to sequence tags aligned to amplified regions will increase, and the read density corresponding to sequence tags aligned to deleted regions will decrease. The degree of variation is directly proportional to the percent concentration of tumor-derived DNA in the DNA mixture. The higher the proportion of DNA from the tumor tissue, the greater the changes that will be caused by chromosomal aberrations.

1. Evaluation in samples with high tumor concentration

DNA was extracted from the tumor tissues of four patients with hepatocellular carcinoma. The DNA was fragmented using a Covaris DNA sonication system and sequenced on the Illumina HiSeq 2000 platform as described (Chan KC et al. Clin Chem 2013; 59: 211-24). The sequence reads were aligned to the human reference genome (hg18). The genome was then divided into 1 Mb intervals (bins), and the sequence read density of each interval was calculated after adjusting for GC bias as described (Chen EZ et al. PLoS One 2011; 6: e21791).

After the sequence reads are aligned to the reference genome, the sequence read density of each region can be calculated. In one embodiment, the sequence read density is a proportion, calculated as the number of reads mapped to a particular interval (e.g., a 1 Mb region) divided by the total number of sequence reads that can be aligned to the reference genome (e.g., to a unique location in the reference genome). Intervals that overlap with chromosomal regions amplified in tumor tissue are expected to have higher sequence read densities than intervals without such overlap. Conversely, intervals that overlap with deleted chromosomal regions are expected to have lower sequence read densities than intervals without such overlap. The magnitude of the difference in sequence read density between regions with and without chromosomal aberrations is influenced mainly by the proportion of tumor-derived DNA in the sample and the degree of amplification/deletion in the tumor cells.
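The density calculation can be sketched as follows, assuming each uniquely aligned read is available as a (chromosome, position) pair. The function name and input format are hypothetical, and the GC-bias adjustment mentioned above is deliberately left out of this sketch.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 Mb intervals, as in the text


def read_densities(alignments, bin_size=BIN_SIZE):
    """Return {(chromosome, bin_index): proportion of all aligned reads}.

    alignments: iterable of (chromosome, position) for uniquely aligned reads.
    """
    counts = Counter((chrom, pos // bin_size) for chrom, pos in alignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bin_key: n / total for bin_key, n in counts.items()}


# Four hypothetical reads: two fall in the second 1 Mb interval of chr1.
reads = [("chr1", 100), ("chr1", 1_500_000), ("chr1", 1_700_000), ("chr2", 50)]
print(read_densities(reads)[("chr1", 1)])  # 0.5
```

Dividing by the total number of aligned reads makes densities comparable between samples sequenced to different depths.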

Various statistical models can be used to identify the intervals whose sequence read densities correspond to different types of chromosomal aberrations. In one embodiment, a normal mixture model (McLachlan G and Peel D. Multivariate normal mixtures. In: Finite Mixture Models 2004: pp. 81-116. John Wiley & Sons Press) can be used. Other statistical models, such as binomial mixture models and Poisson regression models (McLachlan G and Peel D. Mixtures with non-normal components. In: Finite Mixture Models 2004: p. 135. John Wiley & Sons Press), can also be used.

The sequence read density of an interval can be normalized using the sequence read density of the same interval determined by sequencing leukocyte DNA. The sequence read densities of different intervals can be affected by the sequence context of the particular chromosomal region, and normalization can therefore help to identify regions exhibiting aberrations more accurately. For example, the mappability (the probability of aligning a sequence back to its original position) of different chromosomal regions may differ. In addition, copy number polymorphisms (i.e., copy number variations) also affect the sequence read density of a region. Thus, normalization with leukocyte DNA may minimize the variation associated with differences in sequence context between chromosomal regions.

Figure 10A shows a profile 1000 of sequence read densities of a tumor sample from an HCC patient according to an embodiment of the invention. Tumor tissue was obtained after surgical resection from HCC patients. The x-axis represents the log base two value of the ratio (R) of the sequence read densities between tumor tissue and leukocyte tissue of the patient. The y-axis represents the number of corresponding intervals.

The peaks can be fitted to distribution curves using a normal mixture model to represent regions with deletions, regions with amplifications, and regions without chromosomal aberrations. In one embodiment, the number of peaks can be determined using Akaike's information criterion (AIC) across different plausible values. The central peak at log₂R = 0 (i.e., R = 1) represents the regions without any chromosomal aberration. The left peak (relative to the central peak) represents regions with one copy loss. The right peak (relative to the central peak) represents regions with one copy amplification.

The percent concentration of tumor-derived DNA is reflected by the distance between the peaks representing the amplified and deleted regions. The greater the distance, the higher the percent concentration of tumor-derived DNA in the sample. The percent concentration of tumor-derived DNA determined by this genomic representation method, denoted F_GR, can be calculated using the following equation:

$$F_{GR} = R_{right} - R_{left}$$

where R_right is the R value of the right peak and R_left is the R value of the left peak. The maximum difference is 1, corresponding to 100%. The percent concentration of tumor-derived DNA in the tumor sample obtained from patient HCC 1 was estimated to be 66%, with R_right and R_left being 1.376 and 0.712, respectively.
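The peak-distance calculation can be illustrated with the R values quoted above. The helper `expected_ratio` is a hypothetical addition that assumes a diploid (2-copy) non-tumor background; it is included only to show why the distance between the one-copy-gain and one-copy-loss peaks equals the tumor fraction under that assumption.

```python
def fraction_from_peaks(r_right: float, r_left: float) -> float:
    """F_GR as the distance between the amplification and deletion peaks."""
    return r_right - r_left


def expected_ratio(fraction: float, tumor_copies: int, normal_copies: int = 2) -> float:
    """Expected density ratio R for a region with tumor_copies copies in the
    tumor, given a tumor DNA fraction (assumes a diploid background)."""
    mixed = fraction * tumor_copies + (1.0 - fraction) * normal_copies
    return mixed / normal_copies


# R values quoted in the text for the HCC 1 tumor sample.
f = fraction_from_peaks(1.376, 0.712)
print(f"{f:.1%}")  # 66.4%

# Under the diploid assumption, a 50% tumor fraction puts the gain peak at
# R = 1.25 and the loss peak at R = 0.75, a distance of 0.5.
print(expected_ratio(0.5, 3), expected_ratio(0.5, 1))  # 1.25 0.75
```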

To confirm this result, the percent concentration of tumor DNA was determined independently using genome-wide aggregated allelic loss (GAAL) analysis (Chan KC et al. Clin Chem 2013; 59: 211-24). Table 3 shows the percent concentrations of tumor-derived DNA in the tumor tissues of the four HCC patients obtained by the genomic representation (F_GR) and GAAL (F_GAAL) methods. The values determined by these two different methods are in good agreement with each other.

HCC tumor	F_GAAL	F_GR
1	60.0%	66.5%
2	60.0%	61.4%
3	58.0%	58.9%
4	45.7%	42.2%

Table 3 shows the percent concentrations determined by GAAL and by genomic representation (GR).

2. Evaluation in samples with low tumor concentration

The above analysis has shown that our genomic representation method can be used to measure the percent concentration of tumor DNA when more than 50% of the sample DNA is tumor-derived, i.e., when tumor DNA is in the majority. In previous analyses, we have shown that this method can also be applied to samples in which tumor-derived DNA is present in only a minor proportion (i.e., below 50%). Samples that may contain a minor proportion of tumor DNA include, but are not limited to, blood, plasma, serum, urine, pleural fluid, cerebrospinal fluid, tears, saliva, ascitic fluid, and stool from cancer patients. In some samples, the percent concentration of tumor-derived DNA may be 49%, 40%, 30%, 20%, 10%, 5%, 2%, 1%, 0.5%, 0.1% or lower.

For such samples, the peaks in sequence read density representing regions with amplification and deletion may not be as apparent as in samples containing relatively high concentrations of tumor-derived DNA, as described above. In one embodiment, regions with chromosomal aberrations in cancer cells can be identified by comparison with reference samples known not to contain cancer DNA. For example, plasma from subjects without cancer can be used as a reference to determine the normal range of sequence read densities for each chromosomal region. The sequence read densities of the tested subject can then be compared with the values of the reference group. In one embodiment, the mean and standard deviation (SD) of the sequence read densities of the reference group are determined. For each interval, the sequence read density of the tested subject is compared with the mean of the reference group to determine a z-score using the following formula:

$$z\text{-score} = \frac{GR_{test} - \overline{GR}_{ref}}{SD_{ref}}$$

where GR_test represents the sequence read density of the cancer patient; \overline{GR}_{ref} represents the mean sequence read density of the reference subjects; and SD_ref represents the SD of the sequence read densities of the reference subjects.

Regions with a z-score < -3 show a significant under-representation of sequence read density for the particular interval in the cancer patient, indicating the presence of a deletion in the tumor tissue. Regions with a z-score > 3 show a significant over-representation of sequence read density for the particular interval in the cancer patient, indicating the presence of an amplification in the tumor tissue.
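The per-interval z-score test can be sketched as below, assuming the z-score is the tested density minus the reference mean, divided by the reference SD. The use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator is used; the function names and density values are hypothetical.

```python
from statistics import mean, stdev


def z_score(gr_test: float, gr_refs: list[float]) -> float:
    """z-score of one interval's density against the reference subjects."""
    return (gr_test - mean(gr_refs)) / stdev(gr_refs)


def classify(z: float, cutoff: float = 3.0) -> str:
    """Label an interval using the +/-3 z-score cutoffs from the text."""
    if z < -cutoff:
        return "loss"
    if z > cutoff:
        return "gain"
    return "no change"


# Hypothetical normalized densities for one interval in five reference subjects.
refs = [0.98, 1.00, 1.02, 1.00, 1.00]
for test_density in (1.06, 1.01, 0.95):
    print(test_density, classify(z_score(test_density, refs)))
```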

The distribution of z-scores over all intervals can then be constructed to identify regions with different numbers of copy additions and losses, e.g., deletions of 1 or 2 copies of chromosomes; and amplification, yielding 1, 2, 3, and 4 additional copies of the chromosome. In some cases, more than one chromosome or more than one chromosome region may be involved.

Figure 10B shows a distribution plot 1050 of the z-scores for all intervals in the plasma of an HCC patient according to an embodiment of the invention. Peaks representing (from left to right) 1 copy loss, no copy change, 1 copy increase, and 2 copy increase were fitted to the z-score distribution. Regions with different types of chromosomal aberrations can then be identified, for example, using a normal mixture model as described above.

The percent concentration (F) of cancer DNA in the sample can then be inferred from the sequence read densities of the intervals exhibiting one copy increase or one copy loss. The percent concentration determined for a particular interval can be calculated as:

$$F = \frac{|GR_{test} - \overline{GR}_{ref}|}{\overline{GR}_{ref}} \times 2$$

This can also be expressed as:

$$F = |z\text{-score}| \times \frac{SD_{ref}}{\overline{GR}_{ref}} \times 2$$

which can be rewritten as F = |z-score| × CV × 2, where CV is the coefficient of variation of the sequence read density measured in the reference subjects, and

$$CV = \frac{SD_{ref}}{\overline{GR}_{ref}}$$
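The two forms of the per-interval estimate can be compared numerically. This sketch assumes F is twice the relative deviation of the tested density from the reference mean, equivalently |z-score| × CV × 2 with CV taken as the reference SD divided by the reference mean; the sample standard deviation and all values shown are assumptions for illustration.

```python
from statistics import mean, stdev


def fraction_from_density(gr_test: float, gr_refs: list[float]) -> float:
    """F = 2 * |GR_test - mean(GR_ref)| / mean(GR_ref)."""
    ref_mean = mean(gr_refs)
    return 2.0 * abs(gr_test - ref_mean) / ref_mean


def fraction_from_z(z: float, gr_refs: list[float]) -> float:
    """F = |z| * CV * 2, with CV = SD_ref / mean(GR_ref)."""
    cv = stdev(gr_refs) / mean(gr_refs)
    return abs(z) * cv * 2.0


# Hypothetical interval with a one-copy gain: density 5% above the reference mean.
refs = [0.98, 1.00, 1.02, 1.00, 1.00]
gr_test = 1.05
z = (gr_test - mean(refs)) / stdev(refs)
print(round(fraction_from_density(gr_test, refs), 3))  # 0.1
print(round(fraction_from_z(z, refs), 3))              # 0.1
```

The factor of two reflects that a single copy change in a two-copy region shifts the density by half of the tumor fraction.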

In one embodiment, the results for the individual intervals are combined. For example, the z-scores of the intervals exhibiting a 1 copy increase, or the resulting F values, may be averaged. In another embodiment, the z-score used to infer F is determined by a statistical model and is represented by the peaks shown in FIG. 10B and FIG. 11. For example, the z-score of the right peak can be used to determine the percent concentration from the regions exhibiting a 1 copy increase.

In another embodiment, all intervals of z-score < -3 and z-score >3 can be assigned to regions with single copy loss and single copy gain, respectively, since these two types of chromosomal aberrations are most common. This approximation works best when the number of intervals with chromosomal aberrations is relatively small and the fit of a normal distribution may not be accurate.

Figure 11 shows a distribution plot 1100 of the z-scores for the plasma of an HCC patient according to an embodiment of the invention. Although the number of intervals overlapping with chromosomal aberrations is relatively small, all intervals with z-scores < -3 and z-scores > 3 were fitted to normal distributions for single copy loss and single copy gain, respectively.

The percent concentrations of tumor-derived DNA in the plasma of four HCC patients were determined using GAAL analysis and this GR-based method. The results are shown in Table 4. As can be seen, the inferred percent concentrations correlate well between the GAAL and GR analyses.

TABLE 4 percent concentration of tumor-derived DNA in plasma inferred by analysis of chromosomal aberrations.

C. Method for determining percent concentration

Fig. 12 is a flow diagram of a method 1200 of determining the percent concentration of tumor DNA in a biological sample comprising cell-free DNA, according to an embodiment of the invention. The method 1200 may be performed via various embodiments, including the embodiments described above.

At block 1210, one or more sequence tags are obtained for each of a plurality of DNA fragments in a biological sample. Block 1210 may be performed as described herein with respect to other methods. For example, one end of a DNA fragment may be sequenced from a plasma sample. In another embodiment, both ends of a DNA fragment may be sequenced, thereby allowing the length of the fragment to be estimated.

At block 1220, genomic positions are determined for the sequence tags. The genomic positions can be determined, for example, by aligning the sequence tags to a reference genome as described herein. If both ends of a fragment are sequenced, the paired tags can be aligned as a pair, with the distance between the two tags constrained to be less than a specified distance (e.g., 500 or 1,000 bases).

At block 1230, for each of a plurality of genomic regions, a respective amount of DNA fragments within the genomic region is determined from the sequence tags having genomic positions within that region. The genomic regions may be non-overlapping regions of equal length in the reference genome. In one embodiment, the number of tags aligned to an interval is counted, so that each interval has a corresponding number of aligned tags. A histogram may be computed that shows how frequently intervals have a given number of aligned tags. Method 1200 may be performed for genomic regions that each have the same length (e.g., 1 Mb intervals) and do not overlap. In other embodiments, regions of different lengths may be used, and the regions may overlap.

At block 1240, the respective amounts are normalized to obtain respective densities. In one embodiment, normalizing the respective amounts includes using the same total number of aligned tags to determine both the respective densities and the reference density. In another embodiment, each respective amount is divided by the total number of aligned tags.

At block 1250, the respective densities are compared to a reference density to identify whether each genomic region exhibits a 1 copy loss or a 1 copy increase. In one embodiment, the difference between the respective density and the reference density is calculated (e.g., as part of determining a z-score) and compared to a cutoff value. In various embodiments, the reference density may be obtained from a sample of healthy cells (e.g., from white blood cells) or from the respective amounts themselves (e.g., by taking a median or mean value, on the assumption that most regions exhibit neither a loss nor an increase).

At block 1260, a first density is calculated from one or more respective densities identified as exhibiting a 1 copy loss, or from one or more respective densities identified as exhibiting a 1 copy increase. The first density may correspond to a single genomic region, or may be determined from the densities of multiple genomic regions. For example, the first density may be calculated from the respective densities with a 1 copy loss. These densities provide a measure of the difference in density produced by a deleted region in the tumor at a given tumor concentration. Similarly, if the first density is calculated from respective densities with a 1 copy increase, a measure of the difference in density produced by an amplified region in the tumor is obtained. The sections above describe various examples of how the densities of multiple regions can be combined, e.g., averaged, to determine the first density.

At block 1270, the percent concentration is calculated by comparing the first density to another density to obtain a difference. The difference is normalized by the reference density, which may be done in block 1270. For example, the difference may be normalized by dividing it by the reference density. In another embodiment, the difference may be normalized in an earlier block.

In one embodiment, the other density is the reference density, e.g., as in section 2 above. In that case, calculating the percent concentration may include multiplying the difference by two. In another embodiment, the other density is a second density calculated from respective densities identified as exhibiting a 1 copy loss (where the first density is calculated from respective densities identified as exhibiting a 1 copy increase), e.g., as described in section 1 above. In this case, the normalized difference can be determined by calculating a first ratio (e.g., R_right) of the first density to the reference density and a second ratio (e.g., R_left) of the second density to the reference density, the difference being between the first ratio and the second ratio. As described above, genomic regions exhibiting a 1 copy loss or a 1 copy increase can be identified by fitting peaks to a distribution curve of a histogram of the respective densities.
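Blocks 1230 to 1270 can be strung together in a compact sketch. This is one possible reading of the method under stated assumptions, not a definitive implementation: tag counts per interval are assumed to be already available, the reference density for each interval is the mean over a panel of reference subjects, every interval beyond the ±3 z-score cutoff is treated as a single copy change, and the per-interval estimates are averaged. All names are hypothetical.

```python
from statistics import mean, stdev


def to_densities(counts: dict) -> dict:
    """Block 1240: divide each interval's tag count by the total tag count."""
    total = sum(counts.values())
    return {interval: n / total for interval, n in counts.items()}


def estimate_fraction(test_counts: dict, reference_counts: list, z_cutoff: float = 3.0) -> float:
    """Estimate the tumor DNA fraction from per-interval tag counts."""
    test = to_densities(test_counts)
    refs = [to_densities(c) for c in reference_counts]
    per_interval = []
    for interval, density in test.items():
        ref_values = [r[interval] for r in refs]
        ref_mean, ref_sd = mean(ref_values), stdev(ref_values)
        if ref_sd == 0:
            continue  # no reference spread, so no z-score can be computed
        z = (density - ref_mean) / ref_sd            # block 1250
        if abs(z) > z_cutoff:                        # assumed 1 copy loss or increase
            difference = abs(density - ref_mean)     # blocks 1260 and 1270
            per_interval.append(2.0 * difference / ref_mean)
    return mean(per_interval) if per_interval else 0.0
```

Because each sample is normalized by its own total tag count, a large amplified region slightly lowers the densities of all other intervals, so the estimate from this sketch tends to sit a little below the true fraction.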

In summary, embodiments can analyze the genomic representation of plasma DNA in different chromosomal regions to determine simultaneously whether a chromosomal region is amplified or deleted in the tumor tissue and, if it is, to use its genomic representation to infer the percent concentration of tumor-derived DNA. Some embodiments use a normal mixture model to analyze the overall distribution of the genomic representations of different regions in order to determine the genomic representation associated with different types of aberrations, i.e., increases of 1, 2, 3, or 4 copies and losses of 1 or 2 copies.

Embodiments have several advantages over other methods, such as the genome-wide aggregated allelic loss (GAAL) method (U.S. patent application 13/308,473; Chan KC et al. Clin Chem 2013; 59: 211-24) and the analysis of tumor-associated single nucleotide mutations (Forshew T et al. Sci Transl Med 2012; 4: 136ra68). All sequence reads that map to a region with a chromosomal aberration can be used to determine the sequence read density of the region, and are thus informative about the percent concentration of tumor DNA. In contrast, in GAAL analysis only the sequence reads covering single nucleotides that are heterozygous in the individual and located within chromosomal regions with a chromosomal gain or loss are informative. Similarly, in the analysis of cancer-associated mutations, only the sequence reads covering the mutations are useful for inferring tumor DNA concentration. Thus, embodiments may allow more cost-effective use of sequencing data, since relatively fewer sequence reads may be required to achieve the same degree of accuracy in assessing the percent concentration of tumor-derived DNA when compared with other methods.

X. Alternative methods

In addition to using the number of times a particular mutation is seen on sequence tags as a criterion for calling a locus a true mutation (thereby adjusting the positive predictive value), other techniques can be used instead of, or in addition to, a cutoff value to improve the predictive value in identifying cancer-related mutations. For example, bioinformatic filters of different stringency can be applied when processing the sequencing data, e.g., by taking into account the quality scores of the sequenced nucleotides. In one embodiment, DNA sequencers and sequencing chemistries with different sequencing error profiles may be used. Sequencers and chemistries with lower sequencing error rates will yield higher positive predictive values. Repeated sequencing of the same DNA fragment can also be used to increase sequencing accuracy. One possible strategy is the circular consensus sequencing strategy of Pacific Biosciences.

In another embodiment, size information on the sequenced fragments may be incorporated into the interpretation of the data. Because tumor-derived DNA in plasma is shorter than non-tumor-derived DNA (see U.S. patent application No. 13/308,473), a potential tumor-derived mutation carried on a shorter plasma DNA fragment has a higher positive predictive value than one carried on a longer plasma DNA fragment. Size data are readily available if paired-end sequencing is performed on the plasma DNA. Alternatively, a DNA sequencer with a long read length can be used, so that the full length of each plasma DNA fragment is read. Size fractionation may also be performed on plasma DNA samples prior to DNA sequencing. Examples of methods that can be used for size fractionation include gel electrophoresis, microfluidic methods (e.g., the Caliper LabChip XT system), and size exclusion chromatography.

In another embodiment, the percentage concentration of tumor-associated mutations in the plasma of a patient with a non-hematologic cancer is expected to increase if one concentrates on shorter DNA fragments in the plasma. In one embodiment, the percentage concentrations of tumor-associated mutations in two or more DNA fragments of different size distributions in plasma may be compared. Patients with non-hematologic cancers will have a higher percentage concentration of tumor-associated mutations in the shorter fragments when compared to the larger fragments.
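The size-based comparison can be sketched as follows. The 150 bp boundary between "short" and "long" fragments, the function name, and the input format are all assumptions made for illustration; the text does not specify a size cutoff.

```python
def mutant_fraction_by_size(fragments, size_cutoff=150):
    """Return (short_fraction, long_fraction) of mutation-carrying fragments.

    fragments: iterable of (size_bp, carries_mutation) for fragments covering
    candidate loci. size_cutoff is a hypothetical boundary in base pairs.
    A fraction is None when no fragment falls in that size class.
    """
    totals = {"short": 0, "long": 0}
    mutants = {"short": 0, "long": 0}
    for size_bp, carries_mutation in fragments:
        size_class = "short" if size_bp <= size_cutoff else "long"
        totals[size_class] += 1
        if carries_mutation:
            mutants[size_class] += 1

    def fraction(size_class):
        if totals[size_class] == 0:
            return None
        return mutants[size_class] / totals[size_class]

    return fraction("short"), fraction("long")


# Hypothetical fragments: mutations are enriched among the shorter fragments.
example = [(120, True), (130, False), (140, True),
           (166, False), (170, False), (180, False), (200, True)]
print(mutant_fraction_by_size(example))
```

A higher mutant fraction in the short class than in the long class is the pattern the text associates with non-hematologic cancers.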

In some embodiments, two or more aliquots from the same blood sample or sequencing results from two or more blood samples taken at the same or different occasions may be combined. Potential mutations that can be seen in more than one aliquot or sample will have a higher positive predictive value for tumor-associated mutations. Positive predictive value will increase with increasing number of samples showing such mutations. Potential mutations present in plasma samples taken at different time points may be considered potential mutations.

XI. Examples

The following are example techniques and data, which should not be construed as limiting embodiments of the present invention.

A. Materials and methods

For sample collection, patients with hepatocellular carcinoma (HCC), carriers of chronic hepatitis B, and patients with both breast and ovarian cancer were recruited. All HCC patients had Barcelona Clinic Liver Cancer stage A1 disease. Peripheral blood samples from all participants were collected into EDTA-containing tubes. Tumor tissues of the HCC patients were obtained during cancer resection surgery.

Peripheral blood samples were centrifuged at 1,600 g for 10 minutes at 4 °C. The plasma portion was centrifuged again at 16,000 g for 10 minutes at 4 °C and then stored at -80 °C. Cell-free DNA molecules were extracted from 4.8 mL of plasma according to the blood and body fluid protocol of the QIAamp DSP DNA Blood Mini Kit (Qiagen). For each case, the plasma DNA was concentrated to a final volume of 40 μl with a SpeedVac concentrator (Savant DNA120; Thermo Scientific) for subsequent preparation of the DNA sequencing library.

Genomic DNA was extracted from patient buffy coat samples according to the blood and body fluid protocol of the QIAamp DSP DNA blood mini kit. DNA was extracted from tumor tissue using a QIAamp DNA mini kit (Qiagen).

Sequencing libraries of the genomic DNA samples were constructed with the Paired-End Sample Preparation Kit (Illumina) according to the manufacturer's instructions. Briefly, 1-5 μg of genomic DNA was first sheared into 200 bp fragments with a Covaris S220 focused ultrasonicator. The DNA molecules were then end-repaired with T4 DNA polymerase and Klenow polymerase, and T4 polynucleotide kinase was used to phosphorylate the 5' ends. A 3' overhang was created with a Klenow fragment lacking 3'-to-5' exonuclease activity. Illumina adaptor oligonucleotides were ligated to the sticky ends. The adaptor-ligated DNA was enriched with 12 cycles of PCR. Because plasma DNA molecules are short fragments and the amount of total DNA in the plasma samples was relatively small, we omitted the fragmentation step and used 15 cycles of PCR when constructing DNA libraries from plasma samples.

An Agilent 2100 Bioanalyzer (Agilent Technologies) was used to check the quality and size of the adaptor-ligated DNA libraries. The DNA libraries were then quantified with a KAPA Library Quantification Kit (Kapa Biosystems) according to the manufacturer's instructions. The DNA libraries were diluted and hybridized to paired-end sequencing flow cells. DNA clusters were generated on a cBot cluster generation system (Illumina) with the TruSeq PE Cluster Generation Kit v2 (Illumina), followed by 51 x 2 cycles or 76 x 2 cycles of sequencing on a HiSeq 2000 system (Illumina) with the TruSeq SBS Kit v2 (Illumina).

Paired-end sequencing data were analyzed in paired-end mode using the Short Oligonucleotide Alignment Program 2 (SOAP2). For each paired-end read, 50 bp or 75 bp from each end was aligned to the reference human genome (hg18). Up to 2 nucleotide mismatches were allowed for the alignment of each end. The genomic coordinates of the potential alignments for the 2 ends were then analyzed to determine whether any combination would allow the 2 ends to be aligned to the same chromosome in the correct orientation, spanning an insert size of less than or equal to 600 bp, and mapping to a single location in the reference human genome. Duplicated reads were defined as paired-end reads in which the insert DNA molecule showed identical start and end positions in the human genome; duplicated reads were removed as previously described (Lo et al. Sci Transl Med 2010; 2:61ra91).
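The pairing criteria and the duplicate-removal rule described above can be sketched as follows. This is a minimal illustration, not the SOAP2 implementation: the `PairedAlignment` record, its field names, and the use of `end - start + 1` as the insert span are assumptions made for the example.

```python
from typing import List, NamedTuple


class PairedAlignment(NamedTuple):
    """Hypothetical summary of one aligned paired-end read."""
    chrom1: str               # chromosome of the first end
    chrom2: str               # chromosome of the second end
    start: int                # leftmost coordinate of the insert
    end: int                  # rightmost coordinate of the insert
    correct_orientation: bool
    unique: bool              # pair maps to a single genomic location


MAX_INSERT_BP = 600  # insert-size limit stated in the text


def passes_pair_criteria(a: PairedAlignment) -> bool:
    """Same chromosome, correct orientation, insert <= 600 bp, unique mapping."""
    span = a.end - a.start + 1
    return (a.chrom1 == a.chrom2 and a.correct_orientation
            and a.unique and 0 < span <= MAX_INSERT_BP)


def remove_duplicates(pairs: List[PairedAlignment]) -> List[PairedAlignment]:
    """Keep one pair per (chromosome, start, end); later copies are duplicates."""
    seen = set()
    kept = []
    for a in pairs:
        key = (a.chrom1, a.start, a.end)
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept
```

Filtering first and de-duplicating afterwards keeps the duplicate key well defined, because every surviving pair lies on a single chromosome.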

In some embodiments, paired tumor and constitutive DNA samples are sequenced to identify tumor-associated single nucleotide variants (SNVs). In some embodiments, we focus on SNVs occurring at homozygous sites in the constitutive DNA (in this example, buffy coat DNA). In principle, any nucleotide variation detected in the sequencing data of the tumor tissue but absent from the constitutive DNA can be a potential mutation (i.e., an SNV). However, if any single occurrence of a nucleotide change in the tumor sequencing data were regarded as a tumor-associated SNV, millions of false positives would be identified in the genome because of sequencing errors (which affect 0.1%-0.3% of sequenced nucleotides). One way to reduce the number of false positives is to require that the same nucleotide change be observed multiple times in the sequencing data of the tumor tissue before it is called a tumor-associated SNV.

Because sequencing errors occur randomly, the number of false positives due to sequencing errors decreases exponentially as the number of occurrences required for an observed SNV to qualify as tumor-associated increases. On the other hand, the number of false positives increases with sequencing depth. These relationships can be predicted using Poisson and binomial distribution functions. Embodiments can determine a dynamic cutoff for the number of occurrences required to qualify an observed SNV as tumor-associated. Embodiments may take into account the actual coverage of the particular nucleotide in the tumor sequencing data, the sequencing error rate, the maximum false positive rate allowed, and the desired sensitivity of mutation detection.
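The relationship between depth, error rate and required occurrences can be illustrated with a binomial tail probability. The sketch below is an assumption-laden illustration rather than the procedure used to build the cut-off tables of this disclosure: the function names are hypothetical, and `error_rate` is taken to be the probability that a read shows one specific wrong base at a position.

```python
from math import exp, lgamma, log
from typing import Optional


def binom_pmf(n: int, k: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space (0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1.0 - p))


def false_positive_prob(depth: int, error_rate: float, min_occurrences: int) -> float:
    """Probability that the same erroneous base is seen at least
    `min_occurrences` times at a position covered `depth` times."""
    return sum(binom_pmf(depth, k, error_rate)
               for k in range(min_occurrences, depth + 1))


def dynamic_cutoff(depth: int, error_rate: float, max_fp_rate: float) -> Optional[int]:
    """Smallest occurrence count whose false positive probability is below
    `max_fp_rate`; None if no count up to `depth` achieves it."""
    for r in range(1, depth + 1):
        if false_positive_prob(depth, error_rate, r) < max_fp_rate:
            return r
    return None
```

Under this model the tail probability falls steeply with each extra required occurrence and rises with depth, so the cut-off returned by `dynamic_cutoff` grows as coverage grows.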

In some examples, we set very stringent criteria to reduce false positives. For example, the mutation may be required to be completely absent from the constitutive DNA sequencing, and the sequencing depth at the particular nucleotide position must be 20-fold. In some embodiments, the cut-off for the occurrence frequency of the mutated sequence tags is chosen to achieve a false positive detection rate of less than 10^-7. In some examples, we also filtered out SNVs within centromeric, telomeric and low-complexity regions to minimize false positives due to misalignment. In addition, putative SNVs that mapped to known SNPs in the dbSNP build 135 database were also removed.

B. Before and after resection

Fig. 13A shows a table 1300 analyzing mutations in the plasma of a patient with ovarian and breast cancer at the time of diagnosis, according to an embodiment of the present invention. Here we demonstrate an example of a patient with bilateral ovarian cancers and a breast cancer. The sequencing data of the plasma were compared with the sequencing results of the patient's constitutive DNA (white blood cells). Single nucleotide changes that were present in plasma but not in the constitutive DNA were regarded as potential mutations. The ovarian cancers on the right and left sides of the patient were each sampled at two sites, giving a total of four tumor samples. Tumor mutations were defined as mutations detected in all four ovarian tumor tissues from the different sites.

More than 3.6 million single nucleotide changes were detected at least once in plasma by sequencing. Of these changes, only 2,064 were also detected in the tumor tissues, giving a positive predictive value of 0.06%. Using the criterion of detection at least twice in plasma, the number of potential mutations was reduced by 99.5% to 18,885. The number of tumor mutations decreased by only 3% to 2,003, and the positive predictive value increased to 11%.

Using the criterion of detecting at least five times in plasma, only 2,572 potential mutations were detected, and of them, 1,814 were mutations detected in all tumor tissues, thus yielding a positive predictive value of 71%. Other criteria for the number of occurrences (e.g., 2, 3, 4, 6, 7, 8, 9, 10, etc.) may be used to define potential mutations depending on the desired sensitivity and positive predictive value. The higher the number of occurrences used as a criterion, the higher the positive predictive value but the corresponding reduced sensitivity.
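The positive predictive values quoted above follow directly from the reported counts. The short sketch below recomputes them; the figure of 3,600,000 stands in for "more than 3.6 million" and is therefore only an approximation, and the function and variable names are illustrative.

```python
def ppv_percent(confirmed_in_tumor: int, potential_mutations: int) -> float:
    """Positive predictive value, in percent."""
    return 100.0 * confirmed_in_tumor / potential_mutations


# minimum occurrences in plasma -> (confirmed in tumor, potential mutations)
reported = {
    1: (2_064, 3_600_000),  # approximate: the text says "more than" 3.6 million
    2: (2_003, 18_885),
    5: (1_814, 2_572),
}

for min_occurrences, (confirmed, potential) in reported.items():
    ppv = ppv_percent(confirmed, potential)
    # Fraction of the 2,064 once-detected tumor mutations still retained.
    retained = 100.0 * confirmed / reported[1][0]
    print(f">= {min_occurrences} occurrences: PPV {ppv:.2f}%, "
          f"tumor mutations retained {retained:.1f}%")
```

The retained fraction shows the other side of the trade-off: raising the occurrence criterion improves the positive predictive value while discarding some true tumor mutations.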

Fig. 13B shows a table 1350 analyzing mutations in the plasma of the patient with bilateral ovarian and breast cancer after tumor resection, according to embodiments of the invention. Surgical resection was performed on the patient. A blood sample was taken one day after removal of the ovarian tumors and the breast cancer, and the plasma DNA was then sequenced. For this example, only mutations from the ovarian cancers were analyzed. More than 3 million potential mutations were detected at least once in the plasma sample. However, using the criterion of at least five occurrences, the number of potential mutations was reduced to 238. This is a marked reduction compared with the number of potential mutations obtained with the same five-occurrence criterion for the sample taken at diagnosis.

In one embodiment, the number of single nucleotide changes detected in plasma can be used as a parameter for the detection, monitoring and prognostication of cancer patients. Different numbers of occurrences may be used as the criterion to achieve the desired sensitivity and specificity. Patients with a higher tumor burden, and thus a poorer prognosis, are expected to have a higher mutational load in plasma.

For such analysis, a mutational load profile can be established for different types of cancer. For monitoring purposes, the mutational load in the plasma of a patient who responds to treatment is expected to fall. If the tumor recurs, for example during a relapse, the mutational load is expected to rise. Such monitoring would make it possible to follow the efficacy of the selected treatment modality for the patient and to detect the emergence of resistance to a particular treatment.

By analyzing the specific mutations seen in the plasma DNA sequencing results, one can also identify and predict sensitivity (e.g., mutations in the epidermal growth factor receptor gene and response to tyrosine kinase inhibitor treatment) and resistance (e.g., KRAS mutations in colorectal cancer and resistance to treatment with panitumumab and cetuximab) to particular targeted treatments, which can guide the planning of treatment regimens.

The above example is for bilateral ovarian cancer. The same analysis can also be performed on mutations of breast cancer and then it will be possible to track mutations of both of these cancer types in plasma. Similar strategies can also be used to track primary cancers and their metastatic cancers.

Embodiments would be useful for cancer screening of apparently healthy subjects or of subjects with particular risk factors, such as smoking status or viral status (e.g., hepatitis virus carriers, subjects infected with human papillomavirus). The mutational load seen in the plasma of such a subject would indicate the risk that the subject will develop symptomatic cancer within a particular time frame. Thus, subjects with a higher mutational load in plasma are expected to be at higher risk than those with a lower mutational load. Furthermore, the temporal profile of the mutational load in plasma would also be a strong indicator of risk. For example, if a subject has the plasma mutational load analyzed once a year and the load is progressively increasing, then the subject should be referred for additional cancer screening modalities, such as chest X-ray, ultrasound, computed tomography, magnetic resonance imaging, or positron emission tomography.

C. Dynamic cut-off to infer mutations from sequencing plasma

Four patients with hepatocellular carcinoma (HCC) and one patient with ovarian and breast cancer were recruited for this study. For the latter patient, we focused on the analysis of the ovarian cancer. Blood samples were collected from each patient before and after surgical resection of the tumor. The resected tumor tissue was also collected. DNA extracted from the tumor tissue, the white blood cells of the pre-operative blood sample, and the pre- and post-operative plasma samples was sequenced using a HiSeq 2000 sequencing system (Illumina). Sequencing data were aligned to the reference human genome sequence (hg18) using the Short Oligonucleotide Analysis Package 2 (SOAP2) (Li R et al. Bioinformatics 2009; 25:1966-). The DNA sequence of the white blood cells was regarded as the constitutive DNA sequence of each study subject.

In this example, tumor-associated single nucleotide mutations (SNMs) were first inferred from the plasma DNA sequencing data and the constitutive genome (CG) without reference to the tumor tissue. The inferences from plasma were then compared with sequencing data generated from the tumor tissue (as the standard) to determine the accuracy of the inferences. For this purpose, the standard was established by comparing sequencing data from the tumor tissue with the constitutive sequence to identify mutations in the tumor tissue. In this analysis, we focused on nucleotide positions at which the constitutive DNA of the studied subject was homozygous.

1. Non-target whole genome analysis

The sequencing depth of leukocyte, tumor tissue and plasma DNA for each patient is shown in table 5.

Table 5 median sequencing depth for different samples of four HCC cases.

The dynamic cut-off values used to define the minimum frequency of occurrence (r) of plasma mutations as shown in table 1 were used to identify mutations in the plasma of each patient. Because the sequencing depth of each locus can vary, the cut-off value can vary, which effectively provides a dependency of the cut-off value on the total number of reads of the locus. For example, although the median depth was less than 50 (table 5), the sequencing depth of individual loci can vary greatly and be covered >100 times.

In addition to sequencing errors, another source of error is alignment error. To minimize this type of error, sequence reads carrying a mutation were re-aligned to the reference genome using the Bowtie alignment program (Langmead B et al. Genome Biol 2009; 10:R25). Only reads that could be aligned to a unique position of the reference genome by both SOAP2 and Bowtie were used for downstream analysis of plasma mutations. Other combinations of alignment packages based on different algorithms may also be used.

To further minimize sequencing and alignment errors in the actual sequencing data, we applied two additional filtering algorithms to nucleotide positions showing single nucleotide variations in the sequence reads: (1) at least 70% of the sequence reads carrying the mutation could be re-aligned to the same genomic coordinates using Bowtie with a mapping quality of at least Q20 (i.e., a misalignment probability < 1%); (2) at least 70% of the sequence reads carrying the mutation did not have the mutation within 5 bp of either end (i.e., the 5' or 3' end) of the read. The latter filtering rule was established because sequencing errors are more prevalent at both ends of a sequence read.
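The two filtering rules can be expressed as a per-position check over the reads that carry the variant. This is a minimal sketch under stated assumptions: the `MutantRead` record and its fields are hypothetical, the variant offset is 0-based, and "within 5 bp of an end" is interpreted as the first or last five bases of the read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MutantRead:
    """Hypothetical summary of one read carrying the variant base."""
    realigned_same_locus: bool  # Bowtie re-alignment hit the same coordinates
    mapq: int                   # mapping quality of the re-alignment
    pos_in_read: int            # 0-based offset of the variant within the read
    read_length: int


def passes_filters(reads: List[MutantRead], min_fraction: float = 0.70,
                   min_mapq: int = 20, end_margin: int = 5) -> bool:
    """Apply both rules: >= 70% re-aligned at >= Q20, and >= 70% with the
    variant outside the first and last `end_margin` bases of the read."""
    if not reads:
        return False
    n = len(reads)
    realigned = sum(r.realigned_same_locus and r.mapq >= min_mapq for r in reads)
    interior = sum(end_margin <= r.pos_in_read < r.read_length - end_margin
                   for r in reads)
    return realigned / n >= min_fraction and interior / n >= min_fraction
```

Both fractions are computed over the same set of mutation-carrying reads, so a position fails if either rule alone is not met.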

We also investigated the factors that affect the inference of tumor-associated mutations without prior knowledge of the tumor genome. One such parameter is the percentage concentration of tumor-derived DNA in plasma. This parameter can be regarded as another standard parameter and, for reference purposes, was inferred using GAAL with prior knowledge of the tumor genome.

Table 6 shows the nucleotide variations detected in plasma before and after treatment. For HCC1, a total of 961 single nucleotide variations were detected without prior knowledge of the tumor genome. Of these nucleotide variations detected in plasma, 828 were cancer-associated mutations. After surgical resection of the HCC, the total number of nucleotide variations fell to 43, and none of them were cancer-associated mutations.

For reference purposes, the percentage concentration of tumor-derived DNA in the preoperative plasma sample was 53%, and was inferred with a priori knowledge of the tumor genome. For HCC2, HCC3, and HCC4, the number of single nucleotide variations in plasma was inferred to be in the range of 27 to 32 for pre-operative plasma samples without prior knowledge of the tumor genome. These results are consistent with the mathematical prediction that at about 20-fold sequencing depth, a very low percentage of cancer-associated mutations can be detected in plasma and most of the sequence variation detected in plasma is due to sequencing errors. There was no significant change in the number of sequence variations detected after tumor resection. For reference purposes, the percent concentration of tumor-derived DNA in plasma was inferred to be in the range of 2.1% to 5%, and was inferred with prior knowledge of the tumor genome.

TABLE 6 nucleotide variations detected in plasma.

2. Targeted enrichment of exons

As discussed above, increasing the sequencing depth of the region of interest can increase both the sensitivity and the specificity of identifying cancer-associated mutations in plasma, and thus increase the power to discriminate between cancer patients and non-cancer subjects. While increasing the sequencing depth of the whole genome is still very costly, an alternative is to enrich certain regions for sequencing. In one embodiment, selected exons, or indeed the whole exome, can be target-enriched for sequencing. This approach can significantly increase the sequencing depth of the target region without increasing the total number of sequence reads.

Sequencing libraries of the plasma DNA of the HCC patients and of the patient with ovarian (and breast) cancer were captured using the Agilent SureSelect All Exon kit for target enrichment of the exome. The exon-enriched sequencing libraries were then sequenced using the HiSeq 2000 sequencing system. The sequence reads were aligned to the human reference genome (hg18). After alignment, sequence reads uniquely mapped to the exons were analyzed for single nucleotide variations. To identify single nucleotide variations in plasma in the exome capture analysis, the dynamic cut-off values shown in table 2 were used.

Figure 14A is a table 1400 showing the detection of single nucleotide variations in plasma DNA of HCC 1. Without prior knowledge of the tumor genome, we inferred a total of 57 single nucleotide variations in plasma from the target sequencing data. In subsequent confirmation from sequencing data obtained from tumor tissue, 55 were found to be true tumor-associated mutations. As previously discussed, the percentage concentration of tumor-derived DNA in preoperative plasma was 53%. After tumor resection, no single nucleotide variation was detected in target sequencing data obtained from plasma. These results indicate that quantitative analysis of the number of single nucleotide variations in plasma can be used to monitor disease progression in cancer patients.

Figure 14B is a table 1450 showing detection of single nucleotide variations in plasma DNA of HCC 2. Without prior knowledge of the tumor genome, we inferred a total of 18 single nucleotide variations in plasma from the target sequencing data. All of these mutations were found in the tumor tissue. As previously discussed, the percentage concentration of tumor-derived DNA in preoperative plasma was 5%. After tumor resection, no single nucleotide variation was detected in plasma. Fewer single nucleotide variations were detected in the plasma of HCC2 than in that of HCC1, which had a higher percentage concentration of tumor-derived DNA in plasma. These results indicate that the number of single nucleotide variations in plasma can be used as a parameter to reflect the percentage concentration of tumor-derived DNA in plasma, and hence the tumor burden of a patient, because the concentration of tumor-derived DNA in plasma has been shown to correlate positively with tumor burden (Chan KC et al. Clin Chem 2005; 51:2192-5).

Figure 15A is a table 1500 showing the detection of single nucleotide variations in plasma DNA of HCC 3. Without prior knowledge of the tumor genome, we did not observe any single nucleotide variation from the target sequencing data in both pre-and post-excision plasma samples. This may be due to the relatively low percentage concentration (2.1%) of tumor-derived DNA in the plasma of this patient. Further increases in sequencing depth are predicted to improve the sensitivity of detecting cancer-associated mutations in cases where the percentage concentration of tumor-derived DNA is low.

Figure 15B is a table 1550 showing the detection of single nucleotide variations in plasma DNA of HCC 4. Without prior knowledge of the tumor genome, we inferred a total of 3 single nucleotide variations in plasma from the target sequencing data. All of these mutations were found in the tumor tissue. Fewer single nucleotide variations were detected in the plasma of HCC4, in which the percentage concentration of tumor-derived DNA was 2.6%, than in HCC1 and HCC2, which had higher percentage concentrations of tumor-derived DNA in plasma. These results indicate that the number of single nucleotide variations in plasma can be used as a parameter to reflect the percentage concentration of tumor-derived DNA in plasma and the tumor burden of the patient.

Figure 16 is a table 1600 showing the detection of single nucleotide variations in plasma DNA of patients with ovarian (and breast) cancer. Without prior knowledge of the tumor genome, we inferred a total of 64 single nucleotide variations in plasma from the target sequencing data. Among them, 59 were found in ovarian tumor tissue. The estimated percent concentration of ovarian tumor-derived DNA in plasma was 46%. Following resection of ovarian cancer, a significant reduction in the total number of single nucleotide variations in plasma was detected.

In addition to using the SureSelect target enrichment system (Agilent), we also enriched sequences from exons for sequencing using the Nimblegen SeqCap EZ exome + UTR target enrichment system (Roche). The Nimblegen SeqCap system covers the exon regions of the genome as well as the 5' and 3' untranslated regions. Pre-treatment plasma samples of four HCC patients, two healthy control subjects, and two chronic hepatitis B carriers without cancer were analyzed (table 7). In other embodiments, other target enrichment systems can be used, including (but not limited to) those based on liquid phase or solid phase hybridization.

TABLE 7 results of exome sequencing of four HCC patients (HCC1-4) using the Nimblegen SeqCap EZ exome + UTR target enrichment system for sequence capture. Sequencing analysis of the pre-treatment plasma of HCC3 was suboptimal because of a higher percentage of PCR-duplicated reads.

In the two chronic hepatitis B carriers and the two healthy control subjects, one or fewer single nucleotide variations meeting the dynamic cut-off criteria were detected (table 8). In three of the four HCC patients, at least 8 sequence variations meeting the dynamic cut-off requirement were detected in plasma. In HCC3, no SNV meeting the dynamic cut-off was detected. In this sample, a high proportion of the sequence reads were PCR-duplicated reads, resulting in a lower number of non-duplicated sequence reads. After surgical resection of the tumor, a marked reduction in the SNVs detectable in plasma was observed.

TABLE 8 results of exome sequencing of 2 chronic hepatitis B carriers (HBV1 and HBV2) and 2 healthy control subjects (Ctrl1 and Ctrl2) using the Nimblegen SeqCap EZ exome + UTR target enrichment system for sequence capture.

XII. Tumor heterogeneity

Quantification of single nucleotide mutations in biological samples (e.g., plasma/serum) is also useful for analyzing tumor heterogeneity, both intratumoral and intertumoral. Intratumoral heterogeneity refers to the presence of multiple clones of tumor cells within the same tumor. Intertumoral heterogeneity refers to the presence of multiple clones of tumor cells in two or more tumors of the same histological type located at different sites (in the same organ or in different organs). The presence of tumor heterogeneity is an indicator of poor prognosis in certain types of tumors (J Clin Oncol 2012; 30:3932-3938; Merlo LMF et al. Cancer Prev Res 2010; 3:1388-1397). In certain types of tumors, the greater the degree of tumor heterogeneity, the higher the chance that the tumor will progress or that resistant clones will emerge after targeted treatment.

Although cancer is thought to result from clonal expansion of one tumor cell, the growth and evolution of cancer will result in the accumulation of new and different mutations in different parts of the cancer. For example, when a cancer patient develops metastasis, the tumor localized at the original organ and the metastatic tumor will share multiple mutations. However, cancer cells at both sites will also carry a unique set of mutations that are not present in another tumor site. Mutations shared by both sites are expected to be present at higher concentrations than those observed in only one tumor site.

A. Example

We analyzed plasma from a patient with bilateral ovarian cancers and a breast cancer. Both ovarian tumors were serous adenocarcinomas. In their longest dimension, the left tumor measured 6 cm and the right tumor 12 cm. There were also multiple metastatic lesions in the colon and omentum. DNA extracted from white blood cells was sequenced to an average of 44-fold haploid genome coverage using the sequencing-by-synthesis platform from Illumina. Nucleotide positions showing only one allele, i.e. homozygous positions, were further analyzed for single nucleotide mutations in plasma.

DNA was extracted from four different sites of the left and right tumors and sequenced using the Illumina sequencing platform. Two sites (sites A and B) were from the right tumor and the other two sites (sites C and D) were from the left tumor. Sites A and B were about 4 cm apart. The distance between sites C and D was also about 4 cm. Plasma samples were collected from the patient before and after surgical removal of the ovarian tumors, and DNA was then extracted from the plasma. The sequencing depths of the tumors at sites A, B, C and D and of the plasma samples are shown in table 9.

TABLE 9 sequencing depth of tumors at positions A, B, C and D.

In the present example, to define a tumor-associated single nucleotide mutation, the nucleotide position had to be sequenced at least 20 times in the tumor tissue and at least 30 times in the constitutive DNA. In other embodiments, other sequencing depths can be used, such as 35, 40, 45, 50, 60, 70, 80, 90, 100 and > 100-fold. Falling sequencing costs will make it much easier to increase the sequencing depth. The nucleotide position had to be homozygous in the constitutive DNA, with a nucleotide change observed in the tumor tissue. The required number of occurrences of the nucleotide change in the tumor tissue depended on the total sequencing depth at that nucleotide position in the tumor tissue: at least five occurrences for 20- to 30-fold coverage, at least six for 31- to 50-fold coverage, and at least seven for 51- to 70-fold coverage. These criteria were derived from Poisson-distribution predictions of the sensitivity for detecting true mutations and of the expected number of false positive loci.
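The depth-dependent criteria of this example can be written as a small lookup. The sketch below is illustrative only: the function names are hypothetical, homozygosity of the constitutive DNA is approximated as the absence of any variant read, and depths outside the stated 20- to 70-fold range are left undefined because the text gives no cut-off for them.

```python
from typing import Optional


def tumor_cutoff(tumor_depth: int) -> Optional[int]:
    """Minimum occurrences of the nucleotide change in tumor tissue,
    by depth band, as stated in the example."""
    if 20 <= tumor_depth <= 30:
        return 5
    if 31 <= tumor_depth <= 50:
        return 6
    if 51 <= tumor_depth <= 70:
        return 7
    return None  # no cut-off is stated for this depth


def is_tumor_associated(tumor_depth: int, tumor_variant_reads: int,
                        constitutive_depth: int,
                        constitutive_variant_reads: int) -> bool:
    """Apply the depth, homozygosity and occurrence criteria together."""
    cutoff = tumor_cutoff(tumor_depth)
    if cutoff is None or constitutive_depth < 30:
        return False
    # Assumption: homozygous constitutive DNA means zero variant reads there.
    if constitutive_variant_reads != 0:
        return False
    return tumor_variant_reads >= cutoff
```

For instance, a position covered 25 times in the tumor needs five variant reads, whereas the same five reads at 40-fold coverage would fall short of the required six.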

FIG. 17 is a table 1700 showing predicted sensitivity for different frequency of occurrence requirements and sequencing depths. Sensitivity will correspond to the number of true mutations detected at a particular fold depth using a particular cut-off value. The higher the sequencing depth, the more likely a mutation is to be detected for a given cut-off value, as more mutant sequence reads will be obtained. For higher cut-off values, it will be less likely that mutants will be detected because the criteria are more stringent.

Fig. 18 is a table 1800 showing the predicted number of false positive loci for different cut-off values and different sequencing depths. The number of false positives increases with increasing sequencing depth as more sequence reads are obtained. However, for a cut-off value of five or more, there were no false positives even up to a sequencing depth of 70. In other embodiments, different occurrence criteria may be used in order to achieve the desired sensitivity and specificity.

Figure 19 shows a tree diagram illustrating the number of mutations detected at different tumor sites. Mutations were determined by direct sequencing of the tumors. Site A had 71 mutations specific to that site and site B had 122 site-specific mutations, even though the two sites were only 4 cm apart. 10 mutations were seen in both sites A and B. Site C had 168 mutations specific to that site, and site D had 248 site-specific mutations, even though the two sites were only 4 cm apart. 12 mutations were seen in both sites C and D. There was significant heterogeneity in the mutation distribution at different tumor sites. For example, 248 mutations were detected only in the site D tumor, but not in the other three tumor sites. A total of 2,129 mutations were seen across all sites. Thus, many mutations were common to the different tumors. There were therefore seven SNV classes. There were no observable differences in copy number aberrations among these four regions.

Figure 20 is a table 2000 showing the number of tumor-derived mutant fragments carried in plasma samples before and after treatment. The deduced percentage concentrations of tumor-derived DNA carrying the respective mutations are also shown. The class of mutation refers to the tumor site at which the mutation is detected. For example, a class a mutation refers to a mutation that is present only at site a, while a class ABCD mutation refers to a mutation that is present at all four tumor sites.

For 2,129 mutations present at all four tumor sites, 2,105 (98.9%) were detectable in at least one plasma DNA fragment. On the other hand, for 609 mutations present in only one of the four tumor sites, only 77 (12.6%) were detectable in at least one plasma DNA fragment. Thus, quantification of single nucleotide mutations in plasma can be used to reflect the relative abundance of these mutations in tumor tissue. This information would be useful for studying cancer heterogeneity. In this example, the potential mutation only requires at least one occurrence in the sequencing data.
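The detection rates above can be re-derived from the counts reported for Figure 19. The sketch below sums the site-specific mutation counts to recover the 609 single-site mutations and recomputes both percentages; the variable and function names are illustrative.

```python
# Site-specific mutation counts reported for the four tumor sites.
site_specific = {"A": 71, "B": 122, "C": 168, "D": 248}
single_site_total = sum(site_specific.values())


def detection_rate_percent(detected_in_plasma: int, total: int) -> float:
    """Share of mutations seen in at least one plasma DNA fragment."""
    return 100.0 * detected_in_plasma / total


shared_rate = detection_rate_percent(2_105, 2_129)            # all four sites
single_rate = detection_rate_percent(77, single_site_total)   # one site only

print(f"single-site mutations: {single_site_total}")
print(f"shared by all sites: {shared_rate:.1f}% detected in plasma")
print(f"single-site only:    {single_rate:.1f}% detected in plasma")
```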

The percentage concentration of circulating tumor DNA corresponding to each SNV class was determined. The percentage concentrations of tumor DNA in plasma before and after surgery, as determined from the SNVs common to all 4 regions (i.e., group ABCD), were 46% and 0.18%, respectively. These percentages correlate well with those obtained in the GAAL analysis, 46% and 0.66%. The mutations shared by all 4 regions (i.e., group ABCD) made the highest percentage contribution of tumor-derived DNA to plasma.

The percentage concentrations of tumor-derived DNA in preoperative plasma determined from the SNVs of groups AB and CD were 9.5% and 1.1%, respectively. These concentrations are consistent with the relative sizes of the right and left ovarian tumors. The percentage concentrations of tumor-derived DNA determined from the region-unique SNVs (i.e., those in groups A, B, C and D) were generally low. These data indicate that, for accurate measurement of the total tumor burden in cancer patients, a genome-wide shotgun approach can provide a more representative picture than more traditional methods that target specific tumor-associated mutations. With the latter approach, if only a subset of the tumor cells carries the targeted mutation, important information about impending recurrence or disease progression caused by tumor cells without that mutation, or about the emergence of a treatment-resistant clone, may be missed.

Fig. 21 is a graph 2100 showing the distribution of the number of occurrences in plasma of mutations detected at a single tumor site and of mutations detected at all four tumor sites. Bar graph 2100 shows data for two types of mutations: (1) mutations detected in only one site, and (2) mutations detected in all four tumor sites. The horizontal axis is the number of times a mutation was detected in plasma. The vertical axis shows the percentage of mutations corresponding to a particular value on the horizontal axis. For example, about 88% of type (1) mutations occurred only once in plasma. As can be seen, the mutations present in one site were mostly detected once, and not more than four times. Mutations present in a single tumor site were detected much less frequently in plasma than mutations present in all four tumor sites.

One application of this technique would be to allow clinicians to estimate the burden of tumor cells carrying different classes of mutations. A proportion of these mutations would potentially be treatable with targeted agents. Agents targeting mutations carried by a higher proportion of tumor cells are expected to have a more pronounced therapeutic effect.

Fig. 22 is a graph 2200 showing the predicted distribution of the number of occurrences in plasma of mutations from a heterogeneous tumor. The tumor contains two sets of mutations. One set of mutations is present in all tumor cells and the other set is present in only 1/4 of the tumor cells, an approximation based on two sites being sampled from each ovarian tumor. The total percentage concentration of tumor-derived DNA in plasma was assumed to be 40%. The plasma sample was assumed to be sequenced to an average depth of 50-fold per nucleotide position. According to this predicted distribution, mutations present in all tumor cells can be distinguished from those present in only 1/4 of the tumor cells by their number of occurrences in plasma. For example, 6 occurrences may be used as a cut-off value. For mutations present in all tumor cells, 92.3% of the mutations would be present in plasma at least 6 times. In contrast, for mutations present in 1/4 of the tumor cells, only 12.4% of the mutations would be present in plasma at least 6 times.
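The reasoning behind such a cut-off can be illustrated with a simple Poisson tail calculation. The sketch below rests on assumptions that are not stated in this description: mutations are heterozygous (half of the tumor-derived fragments at a locus carry the mutation), read counts follow a Poisson distribution, and the function names are hypothetical. Because the exact model behind Fig. 22 is not given here, the values it produces need not equal the 92.3% and 12.4% quoted above; only the qualitative separation is illustrated.

```python
from math import exp, factorial


def poisson_tail(lam: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam ** k / factorial(k) for k in range(k_min))


def expected_mutant_reads(depth: float, tumor_fraction: float,
                          cell_fraction: float) -> float:
    """Expected number of mutant reads at one locus.
    Assumption: the mutation is heterozygous, hence the division by 2."""
    return depth * tumor_fraction * cell_fraction / 2.0


DEPTH = 50             # average depth assumed in the text
TUMOR_FRACTION = 0.40  # percentage concentration assumed in the text
CUTOFF = 6             # occurrences used as the example cut-off

lam_all = expected_mutant_reads(DEPTH, TUMOR_FRACTION, 1.0)
lam_quarter = expected_mutant_reads(DEPTH, TUMOR_FRACTION, 0.25)

p_all = poisson_tail(lam_all, CUTOFF)
p_quarter = poisson_tail(lam_quarter, CUTOFF)
print(f"present in all tumor cells: {p_all:.3f}")
print(f"present in 1/4 of cells:    {p_quarter:.3f}")
```

Under these simplifying assumptions most mutations carried by all tumor cells reach the cut-off while few of those carried by a quarter of the cells do, which is the separation the cut-off is meant to exploit.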

FIG. 23 is a table 2300 demonstrating the specificity of embodiments of the invention among 16 healthy control subjects. Their plasma DNA samples were sequenced to a median coverage of 30-fold. The mutations present in the plasma of the above ovarian cancer patient were searched for in the plasma samples of these healthy subjects. Mutations present in the tumors of the ovarian cancer patient were very rarely detected in the sequencing data of the plasma of healthy control subjects, and none of the classes of mutations had an apparent percentage concentration of > 1%. These results show that this detection method is highly specific.

B. Method

Figure 24 is a flow diagram of a method 2400 of analyzing heterogeneity of one or more tumors in a subject, according to an embodiment of the invention. Certain steps of method 2400 may be performed as described herein.

At block 2410, a constitutive genome of the subject is obtained. At block 2420, for each of a plurality of DNA fragments in a biological sample of the subject, one or more sequence tags are obtained, wherein the biological sample comprises cell-free DNA. At block 2430, the genomic position of the sequence tag is determined. At block 2440, the sequence tags are compared to a constitutive genome to determine a first number of first loci. At each first locus, the number of sequence tags having a variant sequence relative to the constitutive genome is above a cut-off, wherein the cut-off is greater than one.
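Blocks 2420 to 2440 can be sketched in Python as follows. This is a minimal illustration only, under several assumptions: the constitutive genome is represented as a dictionary from position to a single base (ignoring heterozygous positions), each sequence tag is an already-aligned `(start, sequence)` pair, and the function name `count_variant_loci` is hypothetical.

```python
from collections import Counter


def count_variant_loci(constitutive, tags, cutoff=2):
    """Return {position: variant_read_count} for loci above the cutoff.

    constitutive: dict mapping genomic position -> constitutive base
                  (simplifying assumption: one base per position).
    tags: iterable of (start_position, sequence) pairs, already aligned.
    cutoff: a locus qualifies when its variant read count is above this
            value; the text requires the cutoff to be greater than one.
    """
    if cutoff <= 1:
        raise ValueError("cutoff must be greater than one")
    variant_reads = Counter()
    for start, sequence in tags:
        for offset, base in enumerate(sequence):
            position = start + offset
            reference_base = constitutive.get(position)
            if reference_base is not None and base != reference_base:
                variant_reads[position] += 1
    return {pos: n for pos, n in variant_reads.items() if n > cutoff}


# Toy data: three tags support a variant at position 3, one tag at position 6.
constitutive = dict(enumerate("ACGTACGT"))
tags = [(0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (0, "ACGT"), (4, "ACTT")]
first_loci = count_variant_loci(constitutive, tags, cutoff=2)
print(first_loci)
```

In this toy input only position 3 qualifies, because the single variant read at position 6 does not exceed the cutoff. Requiring more than one supporting read is what keeps isolated sequencing errors out of the set of first loci.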

At block 2450, a measure of heterogeneity of the one or more tumors is calculated based on the respective first numbers of the first set of genomic locations. In one aspect, the metric can provide a value representing the number of mutations that are shared by the tumors relative to the number of mutations that are not shared. Here, multiple tumors may be present in a single subject, and different tumors within the subject may represent what is commonly referred to as intratumoral heterogeneity. The metric may also indicate whether some mutations are present in only one or a few tumors, as opposed to mutations present in many or most tumors. More than one heterogeneity metric may be calculated.

At block 2460, the heterogeneity metric may be compared to a threshold to determine a classification for the heterogeneity level. One or more metrics may be used in various ways. For example, one or more heterogeneity metrics may be used to predict the probability of tumor progression. In some tumors, the greater the heterogeneity, the higher the chance of progression and the higher the chance that a resistant clone emerges after treatment (e.g., targeted therapy).

C. Tumor heterogeneity metric

One example of a heterogeneity metric is the number of 'concentration bands' of different groups of mutations in plasma. For example, if two major tumor clones are present in a patient, and if these clones are present at different concentrations, we would expect to see two different groups of mutations at different concentrations in the plasma. These different values can be calculated by determining the percentage concentration of the different sets of mutations, where each set corresponds to one of the tumors.

Each of these concentrations may be referred to as a 'concentration band' or 'concentration class'. If the patient has more clones, more concentration bands/classes will be observed. Thus, the more bands, the greater the heterogeneity. The number of concentration bands can be seen by plotting the percentage concentrations of the various mutations. The individual concentrations can be plotted as a histogram, with different peaks corresponding to different tumors (or different clones of a tumor). A large peak will likely correspond to mutations shared by all or some of the tumors (or clones of a tumor). These peaks can be analyzed to determine which smaller peaks combine to form the larger peaks. A fitting procedure, for example similar to that of FIGS. 10B and 11, may be used.

In one embodiment, the histogram is a plot of the amount of loci (e.g., number or proportion) on the y-axis against the percentage concentration on the x-axis. Mutations shared by all or some of the tumors will result in higher percentage concentrations. The peak size represents the amount of loci that produce a particular percentage concentration. The relative sizes of the peaks at low and high concentrations reflect the degree of heterogeneity of the tumors (or clones of a tumor). A larger peak at high concentrations reflects that most mutations are shared by most or all tumors (or clones of a tumor) and indicates a lower degree of tumor heterogeneity. If the peak at low concentrations is larger, then most mutations are shared by only a few tumors (or a few clones of a tumor). This would indicate a higher degree of tumor heterogeneity.
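The histogram of percentage concentrations and the counting of its peaks can be sketched in Python. This is only an illustration: the bin width, the simple local-maximum rule, and the function names are assumptions, and a real analysis might instead use a fitting procedure such as the one mentioned above.

```python
def concentration_histogram(concentrations, bin_width=5.0, max_value=100.0):
    """Count loci per percentage-concentration bin (x-axis: concentration)."""
    n_bins = int(max_value / bin_width)
    counts = [0] * n_bins
    for value in concentrations:
        index = min(int(value / bin_width), n_bins - 1)
        counts[index] += 1
    return counts


def find_peaks(counts, min_height=1):
    """Return indices of bins that are local maxima (a naive peak rule)."""
    peaks = []
    for i, height in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if height >= min_height and height > left and height >= right:
            peaks.append(i)
    return peaks


# Hypothetical percentage concentrations for eleven mutation loci.
concentrations = [4, 5, 6, 5, 4, 6, 19, 20, 21, 22, 20]
counts = concentration_histogram(concentrations)
peaks = find_peaks(counts)
number_of_bands = len(peaks)
low_to_high_ratio = counts[peaks[0]] / counts[peaks[-1]]
print(number_of_bands, low_to_high_ratio)
```

Here the number of peaks serves as the number of concentration bands, and the ratio of the lowest-concentration peak height to the highest-concentration peak height is one possible way to compare peak sizes.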

The more peaks present, the more site-specific mutations present. Each peak may correspond to a different set of mutations, where the set of mutations is from a subset of the tumors (e.g., only one or two tumors, as explained above). For the example of fig. 19, there may be a total of 7 peaks, 4 site-specific peaks that may have the smallest concentration (depending on the relative size of the tumor), two peaks for the AB and CD sites, and one peak for mutations shared by all sites.

The location of the peak may also provide the relative size of the tumor. Larger concentrations will be associated with larger tumors, as larger tumors will release more tumor DNA into the sample, e.g., into plasma. Thus, the burden of tumor cells carrying different classes of mutations can be estimated.

Another example of a heterogeneity metric is the proportion of mutation sites with relatively few variant reads (e.g., 4, 5, or 6) compared to the proportion of mutation sites with relatively many variant reads (e.g., 9-13). Referring back to FIG. 22, it can be seen that the site-specific mutations have fewer variant reads (which also results in a smaller percentage concentration). The shared mutations have more variant reads (which also results in a greater percentage concentration). The ratio of the first proportion at 6 (smaller count) to the second proportion at 10 (larger count) yields a heterogeneity metric. If the ratio is small, few mutations are site-specific and therefore the level of heterogeneity is low. If the ratio is large (or at least larger than a value calibrated from known samples), then the level of heterogeneity is greater.
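The ratio metric described above can be sketched in Python as follows. The function name, the default count sets, and the choice to return infinity when no locus falls in the high-count set are assumptions made for illustration.

```python
def heterogeneity_ratio(variant_counts, low_counts=(6,), high_counts=(10,)):
    """Ratio of the proportion of loci with few variant reads to the
    proportion of loci with many variant reads.

    variant_counts: number of variant reads observed at each mutation locus.
    low_counts / high_counts: read counts treated as 'few' and 'many';
    ranges such as (4, 5, 6) and range(9, 14) could be passed instead.
    """
    total = len(variant_counts)
    if total == 0:
        raise ValueError("no mutation loci supplied")
    low = sum(1 for n in variant_counts if n in low_counts) / total
    high = sum(1 for n in variant_counts if n in high_counts) / total
    if high == 0:
        return float("inf")  # assumption: treat as maximally heterogeneous
    return low / high


# Hypothetical samples: one dominated by shared mutations, one by
# site-specific mutations.
mostly_shared = [10, 10, 10, 10, 6, 9, 11, 10]
mostly_site_specific = [6, 6, 6, 6, 10, 5, 4, 6]
print(heterogeneity_ratio(mostly_shared))
print(heterogeneity_ratio(mostly_site_specific))
```

A small ratio corresponds to few site-specific mutations and hence low heterogeneity, while a large ratio points to greater heterogeneity.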

D. Determination of threshold value

A threshold can be determined from a subject whose tumor is biopsied (e.g., as described above) to directly determine the level of heterogeneity. The level can be defined in various ways, such as the ratio of site-specific mutations to shared mutations. The biological sample (e.g., a plasma sample) can then be analyzed to determine a heterogeneity metric, wherein the heterogeneity metric of the biological sample can be correlated to the heterogeneity level determined by directly analyzing cells of the tumor.

Such a procedure may provide for calibration of the threshold relative to the level of heterogeneity. If the test heterogeneity metric is between two thresholds, then the heterogeneity level may be estimated to be between the levels corresponding to the thresholds.

In one embodiment, a calibration curve may be calculated between the level of heterogeneity determined from a biopsy and the corresponding heterogeneity metric determined from a plasma sample (or other sample). In such instances, the heterogeneity levels are numerical values, where the numerical levels may correspond to different classifications. Different numerical grade ranges may correspond to different diagnoses, e.g. different cancer stages.
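One simple form of such a calibration curve is piecewise-linear interpolation between calibration points, sketched below in Python. The interpolation scheme, the function name `calibrate`, and the numerical calibration points are all hypothetical and chosen only to illustrate how a plasma metric lying between two calibrated values maps to an intermediate heterogeneity level.

```python
from bisect import bisect_right


def calibrate(metric, calibration_points):
    """Map a plasma heterogeneity metric to a numerical heterogeneity level.

    calibration_points: (plasma_metric, biopsy_level) pairs from subjects
    whose tumors were biopsied. Values outside the calibrated range are
    clamped to the nearest end point (an assumption of this sketch).
    """
    points = sorted(calibration_points)
    xs = [x for x, _ in points]
    if metric <= xs[0]:
        return points[0][1]
    if metric >= xs[-1]:
        return points[-1][1]
    i = bisect_right(xs, metric)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (metric - x0) / (x1 - x0)


# Hypothetical calibration data: (plasma metric, level from biopsy).
calibration_points = [(0.2, 1.0), (1.0, 2.0), (5.0, 3.0)]
level = calibrate(3.0, calibration_points)
print(level)
```

The resulting numerical level could then be compared against ranges that correspond to different classifications.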

E. Method using percent concentration from genomic representation

Tumor heterogeneity may also be analyzed using, for example, the percent concentration as determined using embodiments of method 1200. Genomic regions exhibiting a one-copy loss may come from different tumors. Thus, the percent concentration determined for each genomic region may vary depending on whether the amplification (or 1-copy deletion) is present in only one tumor or in multiple tumors. Thus, the heterogeneity metrics described above may also be applied to percent concentrations determined via embodiments of method 1200.

For example, one genomic region may be identified as corresponding to a 1-copy loss, and the percent concentration may be determined from the respective density at that genomic region alone (the respective density itself may be used as the percent concentration). A histogram can be determined by counting the number of regions having various densities. If only one tumor, one tumor clone, or one tumor deposit has a gain in a particular region, the density of that region will be lower than the density of a region where multiple tumors, tumor clones, or tumor deposits have a gain (i.e., the percent concentration of tumor DNA will be greater in the shared region than in the site-specific region). The heterogeneity metrics described above can therefore be applied to peaks identified using copy number gain or loss in various regions, just as they were applied to the distribution of percentage concentrations of the different sites.

In one embodiment, if the respective densities are used for the histogram, the regions of gain and loss may be analyzed separately. One histogram may be generated from the respective densities of only the regions of gain, and a separate histogram from the respective densities of only the regions of loss. If percent concentrations are used, the peaks corresponding to loss and gain can be analyzed together. For example, the percent concentration uses the difference (e.g., as an absolute value) from the reference density, and thus the percent concentrations for gains and losses may contribute to the same peak.
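The pooling of gains and losses through an absolute difference can be sketched in Python. The conversion used here, percent concentration equal to twice the absolute relative deviation from the reference density, is an assumption for a single-copy change and is not stated in this section; the function names and the example densities are likewise hypothetical.

```python
def percent_concentration(density, reference_density):
    """Percent concentration from a region's density.

    Assumption: a single-copy gain or loss, so the tumor fraction is
    twice the absolute relative deviation from the reference density.
    """
    deviation = abs(density - reference_density) / reference_density
    return round(2.0 * deviation * 100.0, 6)  # rounding avoids bin-edge noise


def combined_histogram(region_densities, reference_density, bin_width=5.0):
    """Histogram over gain and loss regions together, keyed by bin start."""
    histogram = {}
    for density in region_densities:
        pct = percent_concentration(density, reference_density)
        bin_start = int(pct // bin_width) * bin_width
        histogram[bin_start] = histogram.get(bin_start, 0) + 1
    return histogram


# Hypothetical normalized densities: losses below 1.0, gains above 1.0.
regions = [0.89, 1.11, 0.88, 0.97, 1.03]
histogram = combined_histogram(regions, reference_density=1.0)
print(histogram)
```

Because the absolute value discards the direction of the change, a loss and a gain of the same magnitude land in the same bin and therefore contribute to the same peak.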

XIII. Computer system

Any of the computer systems mentioned herein may utilize any suitable number of subsystems. An example of such a subsystem is shown in computer apparatus 2500 in fig. 25. In some embodiments, the computer system comprises a single computer device, wherein the subsystems may be components of the computer device. In other embodiments, the computer system may include multiple computer devices with internal components, each of which is a subsystem.

The subsystems shown in FIG. 25 are interconnected via a system bus 2575. Other subsystems such as a printer 2574, keyboard 2578, fixed disk 2579, monitor 2576 coupled to display adapter 2582, and the like, are shown. Peripheral devices and input/output (I/O) devices coupled to I/O controller 2571 may be connected to the computer system by any number of means known in the art, such as serial port 2577. For example, serial port 2577 or external interface 2581 (e.g., ethernet, Wi-Fi, etc.) may be used to connect computer system 2500 to a wide area network (e.g., the internet), a mouse input device, or a scanner. The interconnection via system bus 2575 allows a central processor 2573 to communicate with each subsystem and to control the execution of instructions from system memory 2572 or fixed disk 2579 and the exchange of information between subsystems. System memory 2572 and/or fixed disk 2579 may embody a computer readable medium. Any value mentioned herein may be output by one component to another component and may be output to a user.

The computer system may include multiple identical components or subsystems connected together, for example, through external interface 2581 or through an internal interface. In some embodiments, the computer systems, subsystems, or devices may be connected via a network. In that case, one computer may be considered a client and the other a server, each of which may be part of the same computer system. The client and server may each include multiple systems, subsystems, or components.

It should be understood that any embodiment of the invention may be implemented in the form of control logic using hardware (e.g., an application specific integrated circuit or a field programmable gate array) and/or computer software using a general purpose programmable processor in a modular or integrated manner. As used herein, a processor includes a multi-core processor on the same integrated chip, or multiple processing units on a single circuit board or network-connected circuit board. Based on the present invention and the teachings provided herein, one of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.

Any of the software components or functions described in this application may be implemented as software code executed by a processor using any suitable computer language (e.g., Java, C++, or Perl), using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission, suitable media including Random Access Memory (RAM), Read Only Memory (ROM), magnetic media such as a hard drive or floppy disk, or optical media such as a Compact Disc (CD) or Digital Versatile Disc (DVD), flash memory, etc. The computer readable medium can be any combination of the storage or transmission means.

The program may also be encoded and transmitted using a carrier wave signal adapted for transmission via a wired, optical and/or wireless network (including the internet) in accordance with various schemes. Thus, a computer readable medium according to an embodiment of the present invention may be generated using a data signal encoded with the program. The computer readable medium encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via internet download). Any such computer readable medium may reside on or within a single computer program product (e.g., a hard drive, a CD, or an entire computer system), and may reside on or within different computer program products within a system or network. The computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.

Any of the methods described herein may be performed in whole or in part with a computer system that includes one or more processors that may be configured to perform the steps. Thus, implementations may relate to a computer system configured to perform the steps of any of the methods described herein, possibly with different components performing separate steps or separate groups of steps. Although the steps of the methods herein are presented in numbered steps, they may be performed simultaneously or in a different order. Additionally, portions of these steps may be used with portions of other steps of other methods. Further, all or part of the steps may be optional. Additionally, any of the steps of any of the methods may be performed by modules, circuits, or other means for performing the steps.

The particular details of the particular embodiments may be combined in any suitable manner without departing from the spirit and scope of the embodiments of the invention. However, other embodiments of the invention may relate to specific embodiments relating to each individual aspect or specific combinations of these individual aspects.

The foregoing description of exemplary embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.

The recitation of "a" or "the" is intended to mean "one or more" unless specifically indicated to the contrary.

All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. No admission is made that any is prior art.

Claims

1. A system for determining a classification of a grade of cancer in a subject, the system comprising:

means for obtaining a consensus sequence of the genome of the subject;

means for receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample comprising cell-free DNA;

means for determining the genomic position of the sequence tag;

means for comparing the sequence tags to the consensus sequence to determine a first number of first loci, wherein:

at each first locus, the number of sequence tags having a single nucleotide variant relative to the consensus sequence is above a cut-off value, the cut-off value being greater than one;

means for determining a parameter based on a count of sequence tags having a single nucleotide variant at the first locus; and

means for comparing the parameter to a threshold to determine a classification of a grade of cancer in the subject.

2. The system of claim 1, wherein the threshold is determined from one or more samples from one or more other subjects.

3. The system of claim 1, wherein the cutoff value for a locus depends on the total number of sequence tags having genomic positions at the locus.

4. The system of claim 1, wherein different cut-off values are used for at least two of the first loci.

5. The system of claim 4, further comprising:

means for dynamically determining a first cutoff value for one of the first loci, the one of the first loci residing within a first region.

6. The system of claim 5, wherein the first cutoff value is determined based on a sequencing depth of one of the first loci.

7. The system of claim 5, wherein the first cutoff value is determined based on a false positive rate that depends on a sequencing error rate, a sequencing depth of the first region, and a number of nucleotide positions in the first region.

8. The system of claim 7, wherein the first cutoff value is determined based on a number of true positives in the first region.

9. The system of claim 8, further comprising:

means for calculating the number of true positives for the first cut-off value based on the sequencing depth D of the first region in the biological sample and the tumor-derived DNA percentage concentration f.

10. The system of claim 9, wherein the number of true positives is calculated using a Poisson distribution probability according to: $P_b = 1 - \sum_{i=0}^{r-1} \mathrm{Poisson}(i, M_P)$, where $P_b$ is the probability of detecting a true positive, $r$ is the first cutoff value, and $M_P = D \times f/2$.

11. The system of claim 5, wherein the first cutoff value is determined using any one of the following criteria:

if the sequencing depth is less than 50, then the first cutoff value is 5,

if the sequencing depth is 50-110, then the first cutoff value is 6,

if the sequencing depth is 111-200, then the first cut-off value is 7,

if the sequencing depth is 201-310, then the first cutoff value is 8,

if the sequencing depth is 311-450, then the first cutoff value is 9,

if the sequencing depth is 451-620, then the first cutoff value is 10, and

if the sequencing depth is 621-800, then the first cutoff value is 11.

12. The system of claim 1, wherein the parameter is a weighted sum of the first number of first loci, wherein the contribution value for each first locus is derived based on the weight value assigned to the respective first locus.

13. The system of claim 1, wherein the parameter comprises a sum of the sequence tags that contain single nucleotide variants at the first number of first loci.

14. The system of claim 13, wherein the sum is a weighted sum, and wherein a first weight in the first one of the first loci is different from a second weight in a second one of the first loci.

15. The system of claim 14, wherein the first weight is greater than the second weight, and wherein a first one of the first loci is associated with cancer and a second one of the first loci is not associated with cancer.

16. The system of claim 1, wherein the parameter is the first number of first loci.

17. The system of claim 1, wherein the means for determining the genomic position of the sequence tag comprises:

means for aligning at least a portion of the sequence tags to a reference genome, wherein the alignment of sequence tags allows for one or more mismatches between the sequence tags and the reference genome.

18. The system of claim 17, wherein the means for comparing the sequence tag to the consensus sequence comprises:

means for comparing the consensus sequence to the reference genome to determine a second number of second loci having variant sequences relative to the reference genome;

means for determining a third number of third loci based on the alignment, wherein:

at each third locus, the number of sequence tags having a single nucleotide variant relative to the reference genome is above a cut-off value; and

means for obtaining a difference between the third number and the second number to obtain the first number of first loci.

19. The system of claim 18, wherein obtaining the difference between the third number and the second number identifies the first locus.

20. The system of claim 19, wherein the means for determining the parameter comprises:

for each locus of the first number of loci:

means for counting the number of sequence tags aligned to and having a single nucleotide variant at the locus; and

means for determining the parameter based on the respective counts.

21. The system of claim 1, wherein the consensus sequence is derived from a constitutive sample of the subject that contains more than 50% constitutive DNA.

22. The system of claim 1, wherein the means for determining the genomic position of the sequence tag comprises:

means for aligning at least a portion of the sequence tags with the consensus sequence, wherein the alignment of sequence tags allows for one or more mismatches between the sequence tags and the consensus sequence.

23. The system of claim 22, wherein the means for comparing the sequence tag to the consensus sequence comprises:

means for identifying, based on the aligning, a sequence tag having a single nucleotide variant at a genomic position relative to the consensus sequence of the subject;

for each genomic position exhibiting a single nucleotide variant:

means for calculating respective numbers of sequence tags aligned to and having a single nucleotide variant at the genomic position;

means for determining a parameter based on the respective numbers.

24. The system of claim 23, wherein the means for determining a parameter based on the respective numbers comprises:

means for summing the respective numbers to obtain a first sum; and

means for using the first sum to determine the parameter.

25. The system of claim 24, wherein means for using the first sum to determine the parameter comprises:

means for subtracting from the first sum the number of genomic positions exhibiting the single nucleotide variant.

26. The system of claim 24, wherein means for using the first sum to determine the parameter comprises:

means for normalizing the first sum based on a number of sequence tags on the alignment.

27. The system of claim 1, further comprising:

means for obtaining one or more second sequence tags for each of a plurality of DNA fragments, wherein the one or more second sequence tags are obtained by random sequencing of DNA fragments from a constitutive sample containing more than 90% constitutive DNA;

means for aligning at least a portion of the second sequence tag to a reference genome, wherein the alignment of second sequence tags allows for mismatches between the second sequence tag and the reference genome at M genomic positions, wherein M is an integer equal to or greater than one; and

means for constructing the consensus sequence based on the second sequence tag and the alignment.

28. The system of claim 27, wherein said constitutive sample is said biological sample, and wherein said means for constructing said consensus sequence comprises:

means for determining a homozygous locus or a heterozygous locus having two alleles.

29. The system of claim 27, wherein:

the biological sample is plasma or serum obtained from a blood sample, and

the constitutive sample is white blood cells obtained from the blood sample.

30. The system of claim 1, wherein the one or more sequence tags are generated by random sequencing of DNA fragments in the biological sample.

31. The system of claim 1, wherein the parameter is a percent concentration of tumor-derived DNA.

32. The system of claim 1, wherein the sequence tags provide single nucleotide variant detection at the whole genome level.

33. A system for analyzing heterogeneity of one or more tumors in a subject, the system comprising:

means for obtaining a consensus sequence for the subject;

means for receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample comprising cell-free DNA molecules;

means for determining the genomic position of the sequence tag;

means for comparing the sequence tags to the consensus sequence to determine the number of first loci, wherein:

at each first locus, a first number of sequence tags having a single nucleotide variant relative to the consensus sequence is above a cut-off value, the cut-off value being greater than one; and

means for calculating a heterogeneity metric of the one or more tumors based on the respective first numbers of sequence tags of the first loci.

34. The system of claim 33, further comprising:

means for comparing the heterogeneity metric to one or more thresholds to determine a classification of a heterogeneity level.

35. The system of claim 34, wherein the one or more thresholds are determined from one or more other subjects whose tumors have been biopsied and analyzed to determine mutations in the biopsied tumors and thereby a tumor heterogeneity level, and wherein heterogeneity metrics of biological samples comprising cell-free DNA of the one or more other subjects are used to determine the thresholds.

36. The system of claim 35, wherein means for comparing the heterogeneity metric to one or more thresholds comprises:

means for inputting the heterogeneity metric to a calibration function that outputs a heterogeneity level based on the heterogeneity metric.

37. The system of claim 33, wherein the heterogeneity metric comprises a total number of first loci at which more than one DNA fragment aligns to a consensus sequence while having a single nucleotide variant at that locus.

38. The system as in claim 33, wherein a plurality of heterogeneity metrics are calculated, wherein the means for calculating the heterogeneity metrics comprises:

for each of the first loci, means for calculating a ratio of sequence tags having variant sequences;

means for generating a histogram of percentage values for a plurality of first loci; and

means for identifying a plurality of peaks in the histogram.

39. The system of claim 38, wherein one of the plurality of heterogeneity metrics corresponds to a plurality of identified peaks.

40. The system of claim 38, wherein one of the plurality of heterogeneity metrics comprises a ratio of heights of two of the plurality of peaks.

41. The system of claim 38, wherein the ratios each represent a percent concentration of tumor DNA measured at a particular first locus.

42. The system of claim 33, wherein the heterogeneity metric corresponds to a ratio of a first proportion of first loci having a first particular number of sequence tags with variant sequences to a second proportion of first loci having a second particular number of sequence tags with variant sequences.

43. The system of claim 42, wherein the first particular number is less than the second particular number.

44. The system of claim 43, wherein the first particular number is a first range and the second particular number is a second range, the first range being lower than the second range.

45. The system of claim 42, wherein the first particular number and the second particular number correspond to a percent concentration or an absolute number of sequence tags having variant sequences.

46. The system of claim 33, wherein the one or more tumors comprise multiple clones in a single subject, and wherein the heterogeneity comprises intratumoral heterogeneity.

47. The system of claim 33, wherein the one or more tumors are a plurality of tumors, and wherein the heterogeneity comprises inter-tumor heterogeneity.

48. The system of claim 33, wherein the heterogeneity metric is determined from histograms corresponding to a first number of each of the first loci.

49. The system as claimed in claim 48, wherein a plurality of heterogeneity metrics are calculated, wherein a first locus comprises a first subset and a second subset, wherein a plurality of heterogeneity metrics comprises a first histogram corresponding to each of said first number of said first subset and a second histogram corresponding to each of said second number of said second subset.

50. The system of claim 33, wherein the heterogeneity metric comprises a proportion of first loci having a corresponding first number above a prescribed value.

51. A computer-readable medium storing a plurality of instructions for controlling a computer system to perform operations for determining a classification of a grade of cancer in a subject, the operations comprising:

obtaining a consensus sequence for the subject;

receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample comprising cell-free DNA;

determining the genomic position of the sequence tag;

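comparing the sequence tags to the consensus sequence to determine a first number of first loci, wherein:

at each first locus, the number of sequence tags having a single nucleotide variant relative to the consensus sequence is above a cut-off value, the cut-off value being greater than one;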

determining a parameter based on a count of sequence tags having a single nucleotide variant at the first locus; and

comparing the parameter to a threshold to determine a classification of a grade of cancer in the subject.

52. The computer readable medium of claim 51, wherein the threshold is determined from one or more samples from one or more other subjects.

53. The computer-readable medium of claim 51, wherein the cutoff value for a locus depends on the total number of sequence tags that have genomic positions at the locus.

54. The computer-readable medium of claim 51, wherein different cutoff values are used for at least two of the first loci.

55. The computer-readable medium of claim 54, the operations further comprising:

dynamically determining a first cutoff value for one of the first loci, the one of the first loci residing within a first region.

56. The computer readable medium of claim 55, wherein the first cutoff value is determined based on a sequencing depth of one of the first loci.

57. The computer readable medium of claim 55, wherein the first cutoff value is determined based on a false positive rate that depends on a sequencing error rate, a sequencing depth of the first region, and a number of nucleotide positions in the first region.

58. The computer readable medium of claim 57, wherein the first cutoff value is determined based on a number of true positives in the first region.

59. The computer-readable medium of claim 58, the operations further comprising:

calculating the number of true positives for the first cut-off value based on the sequencing depth D of the first region in the biological sample and the tumor-derived DNA percentage concentration f.

60. The computer-readable medium of claim 59, wherein the number of true positives is calculated using a Poisson distribution probability according to: $P_b = 1 - \sum_{i=0}^{r-1} \mathrm{Poisson}(i, M_P)$, where $P_b$ is the probability of detecting a true positive, $r$ is the first cutoff value, and $M_P = D \times f/2$.

61. The computer readable medium of claim 55, wherein the first cutoff value is determined using any one of the following criteria:

if the sequencing depth is less than 50, then the first cutoff value is 5,

if the sequencing depth is 50-110, then the first cutoff value is 6,

if the sequencing depth is 111-200, then the first cut-off value is 7,

if the sequencing depth is 201-310, then the first cutoff value is 8,

if the sequencing depth is 311-450, then the first cutoff value is 9,

if the sequencing depth is 451-620, then the first cut-off value is 10, and

if the sequencing depth is 621-800, the first cut-off value is 11.

62. The computer-readable medium of claim 51, wherein the parameter is a weighted sum of the first number of first loci, wherein a contribution value for each first locus is derived based on a weight value assigned to the respective first locus.

63. The computer readable medium of claim 51, wherein said parameter comprises a sum of said sequence tags that contain single nucleotide variants at said first number of first loci.

64. The computer-readable medium of claim 63, wherein the sum is a weighted sum, and wherein a first weight assigned to a first one of the first loci is different from a second weight assigned to a second one of the first loci.

65. The computer-readable medium of claim 64, wherein the first weight is greater than the second weight, and wherein the first one of the first loci is associated with cancer and the second one of the first loci is not associated with cancer.
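By way of illustration only, the weighted sum of claims 63 to 65 may be sketched as follows. The locus keys, the weight values, and the default weight of 1.0 are hypothetical and are chosen only to show a cancer-associated locus contributing more than another locus.

```python
def weighted_variant_parameter(variant_tag_counts: dict, weights: dict,
                               default_weight: float = 1.0) -> float:
    """Sum the variant-carrying tag counts per locus, scaled by a per-locus weight.

    variant_tag_counts maps a locus key to the number of sequence tags that
    carry a single nucleotide variant at that locus. Loci without an explicit
    weight receive default_weight (an assumption of this sketch).
    """
    return sum(count * weights.get(locus, default_weight)
               for locus, count in variant_tag_counts.items())
```

With equal weights the function reduces to the unweighted sum of claim 63.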

66. The computer-readable medium of claim 51, wherein the parameter is the first number of first loci.

67. The computer readable medium of claim 51, wherein determining the genomic position of the sequence tag comprises:

aligning at least a portion of the sequence tags to a reference genome, wherein the alignment of sequence tags allows for one or more mismatches between the sequence tags and the reference genome.

68. The computer readable medium of claim 67, wherein comparing the sequence tag to the consensus sequence comprises:

comparing the consensus sequence to the reference genome to determine a second number of second loci having variant sequences relative to the reference genome;

determining a third number of third loci based on the alignment, wherein:

taking a difference between the third number and the second number to obtain the first number of first loci.

69. The computer readable medium of claim 68, wherein taking the difference between the third number and the second number identifies the first locus.
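By way of illustration only, the difference of claims 68 and 69 may be sketched as a set difference. The sketch assumes that the loci variant in the sample tags (the third loci) and the loci variant in the consensus sequence (the second loci) have already been determined relative to the same reference genome. The tuple representation of a locus is hypothetical.

```python
def identify_first_loci(third_loci: set, second_loci: set) -> set:
    """Loci that are variant in the sample tags but not in the consensus sequence."""
    return third_loci - second_loci


def first_number(third_loci: set, second_loci: set) -> int:
    """The first number of first loci, i.e. the size of the difference."""
    return len(identify_first_loci(third_loci, second_loci))
```

The size of the result equals the third number minus the second number only when every second locus is also among the third loci, which this sketch does not enforce.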

70. The computer-readable medium of claim 69, wherein determining the parameter comprises:

for each locus of the first number of loci:

counting the number of sequence tags aligned to the locus and having a single nucleotide variant at the locus; and

determining the parameter based on the respective counts.

71. The computer readable medium of claim 51, wherein said consensus sequence is derived from a constitutive sample of the subject that contains more than 50% constitutive DNA.

72. The computer readable medium of claim 51, wherein determining the genomic position of the sequence tag comprises:

aligning at least a portion of the sequence tags to the consensus sequence, wherein the alignment of sequence tags allows for one or more mismatches between the sequence tags and the consensus sequence.

73. The computer readable medium of claim 72, wherein comparing the sequence tag to the consensus sequence comprises:

based on the aligning, identifying a sequence tag having a single nucleotide variant at a genomic position relative to the consensus sequence of the subject;

for each genomic position exhibiting a single nucleotide variant:

calculating respective numbers of sequence tags aligned to the genomic position and having a single nucleotide variant at the genomic position;

determining a parameter based on the respective numbers.

74. The computer readable medium of claim 73, wherein determining a parameter based on the respective numbers comprises:

summing the respective numbers to obtain a first sum; and

using the first sum to determine the parameter.

75. The computer readable medium of claim 74, wherein using the first sum to determine the parameter comprises:

subtracting from the first sum the number of genomic positions exhibiting single nucleotide variants.

76. The computer readable medium of claim 74, wherein using the first sum to determine the parameter comprises:

normalizing the first sum based on the number of aligned sequence tags.
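By way of illustration only, the counting of claim 73 and the alternative uses of the first sum in claims 74 to 76 may be sketched as follows. The input format, one (chromosome, position) pair per sequence tag that carries a single nucleotide variant, is an assumption of this sketch, as are the function names.

```python
from collections import Counter


def variant_counts_by_position(variant_tag_positions) -> Counter:
    """Count, per genomic position, the sequence tags carrying a variant there."""
    return Counter(variant_tag_positions)


def first_sum(counts: Counter) -> int:
    """Claim 74: sum of the respective numbers over all variant positions."""
    return sum(counts.values())


def first_sum_minus_positions(counts: Counter) -> int:
    """Claim 75: subtract the number of positions exhibiting a variant."""
    return first_sum(counts) - len(counts)


def normalized_first_sum(counts: Counter, n_aligned_tags: int) -> float:
    """Claim 76: normalize the first sum by the number of aligned sequence tags."""
    if n_aligned_tags <= 0:
        raise ValueError("n_aligned_tags must be positive")
    return first_sum(counts) / n_aligned_tags
```

Subtracting the number of positions leaves, for each position, only the tags beyond the first one, so positions supported by a single tag contribute nothing.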

77. The computer-readable medium of claim 51, the operations further comprising:

obtaining one or more second sequence tags for each of a plurality of DNA fragments, wherein the one or more second sequence tags are obtained by random sequencing of DNA fragments from a constitutive sample containing more than 90% constitutive DNA;

aligning at least a portion of the second sequence tag to a reference genome, wherein the alignment of second sequence tags allows for mismatches between the second sequence tag and the reference genome at M genomic positions, wherein M is an integer equal to or greater than one; and

constructing the consensus sequence based on the second sequence tag and the alignment.

78. The computer readable medium of claim 77, wherein said constitutive sample is said biological sample, and wherein constructing said consensus sequence comprises:

determining a homozygous locus or a heterozygous locus having two alleles.
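By way of illustration only, determining whether a locus of the consensus sequence is homozygous or heterozygous may be sketched from per-allele tag counts. The claims do not specify a decision rule. The minor-allele fraction threshold of 0.2 used here is purely hypothetical, as is the function name.

```python
def call_consensus_genotype(allele_counts: dict, min_minor_fraction: float = 0.2):
    """Return a two-allele genotype for one locus, or None if it is uncovered.

    allele_counts maps each observed base to the number of sequence tags
    showing that base at the locus. The locus is called heterozygous when the
    second most frequent allele reaches min_minor_fraction of all tags.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return None
    ranked = sorted(allele_counts.items(), key=lambda item: item[1], reverse=True)
    major = ranked[0][0]
    if len(ranked) > 1 and ranked[1][1] / total >= min_minor_fraction:
        return tuple(sorted((major, ranked[1][0])))
    return (major, major)
```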

79. The computer-readable medium of claim 77, wherein:

the constitutive sample is white blood cells obtained from the blood sample.

80. The computer readable medium of claim 51, wherein the one or more sequence tags are generated by random sequencing of DNA fragments in the biological sample.

81. The computer readable medium of claim 51, wherein the parameter is the percent concentration of tumor-derived DNA.

82. The computer-readable medium of claim 51, wherein the sequence tags provide single nucleotide variant detection at the whole genome level.

83. A computer-readable medium storing a plurality of instructions for controlling a computer system to perform operations for analyzing heterogeneity of one or more tumors of a subject, the operations comprising:

obtaining a consensus sequence of the subject;

receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample comprising cell-free DNA molecules;

determining the genomic position of the sequence tag;

comparing the sequence tags to the consensus sequence to determine the number of first loci, wherein:

calculating a heterogeneity metric of the one or more tumors based on the respective first numbers of sequence tags of the first loci.

84. The computer-readable medium of claim 83, the operations further comprising:

comparing the heterogeneity metric to one or more thresholds to determine a classification of a heterogeneity level.

85. The computer-readable medium of claim 84, wherein the one or more thresholds are determined from one or more other subjects whose tumors have been biopsied and analyzed for mutations to determine a tumor heterogeneity level, and wherein heterogeneity metrics of biological samples comprising cell-free DNA of the one or more other subjects are used to determine the thresholds.

86. The computer-readable medium of claim 85, wherein comparing the heterogeneity metric to one or more thresholds comprises:

inputting the heterogeneity metric to a calibration function that outputs a heterogeneity level based on the heterogeneity metric.
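By way of illustration only, the threshold comparison of claims 84 to 86 may be sketched as a step-wise calibration function. The threshold values and level labels in the usage note are hypothetical. In practice they would be derived from the other subjects described in claim 85.

```python
import bisect


def classify_heterogeneity(metric: float, thresholds: list, labels: list) -> str:
    """Map a heterogeneity metric to a level using ascending thresholds.

    labels must contain exactly one more entry than thresholds. A metric equal
    to a threshold is assigned to the higher level (a choice of this sketch).
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    return labels[bisect.bisect_right(thresholds, metric)]
```

For example, with the hypothetical thresholds `[0.2, 0.5]` and labels `["low", "intermediate", "high"]`, a metric of 0.1 is classified as "low" and a metric of 0.7 as "high".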

87. The computer-readable medium of claim 83, wherein the heterogeneity metric comprises a total number of first loci at which more than one DNA fragment aligns to a consensus sequence while having a single nucleotide variant at that locus.

88. The computer-readable medium of claim 83, wherein a plurality of heterogeneity metrics are computed, wherein computing the heterogeneity metrics comprises:

calculating, for each of the first loci, a ratio of sequence tags having variant sequences;

generating a histogram of percentage values for a plurality of first loci; and

identifying a plurality of peaks in the histogram.

89. The computer-readable medium of claim 88, wherein one of the plurality of heterogeneity metrics corresponds to the number of identified peaks.

90. The computer-readable medium of claim 88, wherein one of the plurality of heterogeneity metrics comprises a ratio of heights of two of the plurality of peaks.

91. The computer readable medium of claim 88, wherein the ratios each represent a percent concentration of tumor DNA measured at a particular first locus.
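By way of illustration only, the histogram analysis of claims 88 to 91 may be sketched as follows. The bin count and the simple local-maximum rule for finding peaks are assumptions of this sketch, since the claims do not prescribe a peak-detection method. Flat-topped peaks spanning several equal bins are not detected by this rule.

```python
def variant_ratios(variant_counts, total_counts):
    """Per-locus ratio of tags carrying the variant to all tags covering the locus."""
    return [v / t for v, t in zip(variant_counts, total_counts) if t > 0]


def histogram(ratios, n_bins: int = 20):
    """Bin ratios in [0, 1] into n_bins equal-width bins."""
    bins = [0] * n_bins
    for r in ratios:
        bins[min(int(r * n_bins), n_bins - 1)] += 1
    return bins


def find_peaks(bins):
    """Indices of bins strictly higher than both neighbours (edges compare to 0)."""
    peaks = []
    for i, height in enumerate(bins):
        left = bins[i - 1] if i > 0 else 0
        right = bins[i + 1] if i < len(bins) - 1 else 0
        if height > left and height > right:
            peaks.append(i)
    return peaks


def peak_height_ratio(bins, peak_a: int, peak_b: int) -> float:
    """Ratio of the heights of two identified peaks (claim 90)."""
    return bins[peak_a] / bins[peak_b]
```

The number of entries returned by `find_peaks` gives one metric, and `peak_height_ratio` applied to two of those entries gives another.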

92. The computer-readable medium of claim 83, wherein the heterogeneity metric corresponds to a ratio of a first proportion of first loci having a first particular number of sequence tags having variant sequences to a second proportion of first loci having a second particular number of sequence tags having variant sequences.

93. The computer readable medium of claim 92, wherein said first particular number is less than said second particular number.

94. The computer readable medium of claim 93, wherein the first particular number is a first range and the second particular number is a second range, the first range being lower than the second range.

95. The computer-readable medium of claim 92, wherein the first particular number and the second particular number correspond to a percent concentration or an absolute number of sequence tags having variant sequences.

96. The computer-readable medium of claim 83, wherein the one or more tumors comprise multiple clones in a single subject, and wherein the heterogeneity comprises intratumoral heterogeneity.

97. The computer-readable medium of claim 83, wherein said one or more tumors is a plurality of tumors, and wherein said heterogeneity comprises inter-tumor heterogeneity.

98. The computer-readable medium of claim 83, wherein said heterogeneity metric is determined from a histogram of the respective first numbers of said first loci.

99. The computer-readable medium of claim 98, wherein a plurality of heterogeneity metrics are calculated, wherein the first loci comprise a first subset and a second subset, and wherein the plurality of heterogeneity metrics correspond to a first histogram of the respective first numbers of said first subset and a second histogram of the respective first numbers of said second subset.

100. The computer-readable medium of claim 83, wherein the heterogeneity metric comprises a proportion of first loci having a corresponding first number above a prescribed value.
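By way of illustration only, the proportion of claim 100 may be sketched as follows. The sketch assumes that "above" means strictly greater than the prescribed value. The function name is hypothetical.

```python
def proportion_above(first_numbers, prescribed_value: float) -> float:
    """Fraction of first loci whose first number exceeds the prescribed value."""
    first_numbers = list(first_numbers)
    if not first_numbers:
        raise ValueError("at least one first locus is required")
    above = sum(1 for n in first_numbers if n > prescribed_value)
    return above / len(first_numbers)
```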