WO2014064584A1 - Comparative analysis and interpretation of genomic variation in an individual or in collections of sequence data - Google Patents

Comparative analysis and interpretation of genomic variation in an individual or in collections of sequence data

Info

Publication number
WO2014064584A1
Authority
WO
WIPO (PCT)
Prior art keywords
situation under
feature
clinical
features
clinical situation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IB2013/059421
Other languages
English (en)
Inventor
Angel Janevski
Sitharthan Kamalakaran
Nilanjana Banerjee
Vinay Varadan
Nevenka Dimitrova
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips NV

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the following relates to the genetic analysis arts, medical arts, and to applications of same such as the medical arts including oncology arts, veterinary arts, and so forth.
  • Genetic testing typically employs standard tests that have been developed and validated for diagnosing particular medical conditions, for assessing whether a particular therapy is indicated, or for other clinical purposes.
  • The Oncotype DX® test (available from Genomic Health, Inc., Redwood City, CA, USA) measures the levels of 21 molecular markers that have been clinically validated as being probative of breast cancer.
  • Another advanced breast cancer test, the MammaPrint® test, combines 70 molecular measurements into a prognostic marker.
  • Various treatments, for example a regimen combining chemotherapy and tamoxifen, may be ordered based on the results of such tests.
  • a molecular marker test typically includes a precise specification regarding acquisition of the genetic markers, and equivalent molecular data acquired by another approach (e.g., a different sequencing technology, or an entirely different technology such as gene expression measurement rather than sequencing data) is usually not useable in the molecular marker test.
  • existing molecular marker tests typically answer a specific clinical question, and may therefore miss other relevant clinical implications of the analyzed molecular markers (possibly in combination with other available markers that were not analyzed by the molecular marker test).
  • an apparatus comprises an electronic data processing device configured to perform a method including: generating feature values for a set of features from data including molecular marker data acquired from a set of subjects to generate feature vectors representing the subjects; deriving a sub-set of discriminative features from the feature vectors and representing the subjects using reduced-dimensionality feature vectors with the sub-set of discriminative features; and identifying a set of probative features and feature values for the probative features that are indicative of a clinical situation under analysis. The identifying is based on comparison of the reduced-dimensionality feature vectors with feature values representing subjects in one or more subject populations that include subjects identified as being in the clinical situation under analysis.
  • the method performed by the electronic data processing device may further comprise: generating input feature values for the set of features from data including molecular marker data acquired from a person of interest to generate an input feature vector; performing a comparative analysis that computes a likelihood that the person of interest is in the clinical situation under analysis by comparing the input feature values with the feature values for the probative features that are indicative of the clinical situation under analysis; and displaying a result of the comparative analysis including at least an indication of the computed likelihood.
  • a method comprises: developing an in silico test for assessing likelihood that a patient is in a clinical situation under test by performing operations including (1) generating feature vectors representing subjects of a set of subjects wherein the feature vectors include feature values derived from molecular marker data and (2) performing feature reduction to generate reduced-dimensionality feature vectors representing the subjects and (3) identifying a set of probative features and feature values for the probative features that are indicative of the clinical situation under test based on comparison of the reduced-dimensionality feature vectors with a reference data set including feature values representing subjects identified as being in the clinical situation under test; generating an input feature vector from data including molecular marker data acquired from a person to be tested; performing the in silico test by comparing the input feature vector with the feature values for the probative features that are indicative of the clinical situation under test; and displaying a result of the performed in silico test.
  • the developing and generating operations are performed by an electronic data processing device.
  • a non-transitory storage medium stores instructions executable by an electronic data processing device to perform the method set forth in the immediately preceding paragraph.
  • One advantage resides in more holistic use of available genetic data for patient assessment.
  • Another advantage resides in leveraging unlabeled subject data to enhance the usefulness of clinical study results.
  • the invention may take form in various components and arrangements of components, and in various process operations and arrangements of process operations.
  • the drawings are only for the purpose of illustrating preferred embodiments and are not to be construed as limiting the invention.
  • FIG. 1 diagrammatically shows a system for developing an in silico test for assessing likelihood that a patient is in a clinical situation under test.
  • FIG. 2 diagrammatically shows a system for applying the in silico test developed by the system of FIGURE 1.
  • FIG. 3 diagrammatically shows a suitable embodiment of the
  • FIG. 4 diagrammatically shows an example of the feature
  • FIGS. 5-8 diagrammatically show some illustrative examples of the feature extraction/annotation approach of FIGURE 4.
  • FIG. 9 diagrammatically shows a consensus alignment of variations.
  • FIGS. 10-13 diagrammatically show an example of identifying probative variations based on feature vector subsets generated by clustering of feature vectors.
  • FIG. 14 diagrammatically shows an approach for identifying
  • FIG. 15 diagrammatically shows a processing sequence including identification of a subset of probative features followed by enrichment including determining an associated clinical implication and corresponding clinical decision
  • FIG. 16 diagrammatically shows performing a plurality of
  • FIGS. 17 and 18 diagrammatically show another illustrative example.
  • the clinical situation under test can be substantially any type of clinical situation that is expected to manifest in molecular marker data.
  • Some examples include: a cancer of a specified organ or tissue (e.g., breast cancer, lung cancer, leukemia or other blood cancers, or so forth), various genetic disorders, and so forth.
  • the clinical situation under test is considered to be hierarchical, that is, a more particular clinical situation under test may be encompassed by the (broader) clinical situation under test.
  • the situation under test is cancer of a specified organ or tissue
  • the more particular situation under test may be a particular type of cancer of the specified organ or tissue.
  • There may be more than one "more particular" situation under test, e.g., several different types of cancer of the specified organ or tissue may be under test or analysis.
  • the in silico test development is based on a set of subjects from whom data including at least molecular marker data are derived.
  • one illustrative subject 4 undergoes a procedure in a sample extraction laboratory 6 to extract an oral swab, biopsy sample, or other tissue sample 10 (diagrammatically indicated in FIGURE 1 by a vial, but suitably may be carried by a slide or other suitable tissue sample container or support) that is processed by a sequencer apparatus 14 to generate sequencing reads.
  • the sequencer apparatus 14 may be a next generation sequencing (NGS) apparatus or a more conventional sequencing apparatus such as a Sanger sequencing facility.
  • the sequencer apparatus 14 may in some embodiments be a commercial sequencing apparatus such as are available from Illumina, San Diego, CA, USA; Knome, Cambridge, MA, USA; Ion Torrent Inc., Guilford, CT, USA; or other NGS system vendors; however, a noncommercial or custom-built sequencer is also
  • the sequencing reads are suitably filtered to remove duplicate reads or reads of unacceptable base quality score, and the remaining reads are processed by a sequence alignment and annotation module 16 to generate aligned (and optionally annotated) sequencing data.
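  • The read filtering step can be illustrated with a minimal sketch. The `Read` type, the mean-quality threshold of 20, and the choice to treat identical sequences as duplicates are all assumptions made for illustration; the text does not specify how duplicates or unacceptable quality are defined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Read:
    """One sequencing read; the field names are hypothetical."""
    sequence: str
    qualities: Tuple[int, ...]  # per-base quality scores (Phred-like, assumed)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0) -> List[Read]:
    """Drop low-quality reads and repeated sequences, keeping first occurrences."""
    seen = set()
    kept: List[Read] = []
    for read in reads:
        if not read.qualities:
            continue  # no quality information: treated as unacceptable (assumption)
        mean_quality = sum(read.qualities) / len(read.qualities)
        if mean_quality < min_mean_quality:
            continue  # unacceptable base quality
        if read.sequence in seen:
            continue  # duplicate read (identical sequence, an assumed criterion)
        seen.add(read.sequence)
        kept.append(read)
    return kept


reads = [
    Read("ACGT", (30, 30, 30, 30)),
    Read("ACGT", (35, 35, 35, 35)),  # duplicate sequence
    Read("TTGA", (5, 8, 10, 7)),     # low quality
    Read("GGCC", (25, 22, 28, 30)),
]
filtered = filter_reads(reads)
```

  • Only the first `ACGT` read and the `GGCC` read survive here; a production pipeline would more likely identify duplicates by alignment position rather than by raw sequence.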
  • the alignment can be de novo alignment of overlapping portions of sequencing reads, and/or can include mapping of the sequencing reads to a reference sequence (e.g., a human reference sequence) while allowing for a certain fraction (e.g., 5-10%) of base mismatches.
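  • Mapping with a tolerated fraction of base mismatches can be sketched as follows. This is a deliberately naive, ungapped scan written for illustration only; the function name `map_read` and the example sequences are hypothetical, and real aligners use indexed, gap-aware algorithms.

```python
from typing import Optional, Tuple


def map_read(read: str, reference: str,
             max_mismatch_fraction: float = 0.10) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the best ungapped placement, or None."""
    allowed = int(max_mismatch_fraction * len(read))
    best: Optional[Tuple[int, int]] = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        # Keep the earliest placement with the fewest mismatches.
        if mismatches <= allowed and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "TTACGGATCCATGCAAGT"
hit = map_read("ACGGATCGAT", reference)   # one mismatch in ten bases
miss = map_read("GGGGGGGGGG", reference)  # too many mismatches anywhere
```

  • With a 10% tolerance and a ten-base read, one mismatch is accepted, so the first read maps at offset 2 and the second is left unmapped.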
  • the resulting aligned sequence provides substantial information about the subject 4, especially if a whole genome sequence (WGS) was obtained. Additionally, other information may be obtained about the subject 4.
  • other molecular marker data may be acquired by proteomic analysis using a microarray or other process. Data other than molecular marker data may optionally also be obtained, such as test data from non-molecular marker tests (e.g., imaging studies, histopathology tests, or so forth). Such data may, for example, be stored in and retrieved from an electronic patient record 18. (As indicated by a dashed arrow in FIGURE 1, in some embodiments the WGS or other aligned sequence may also be stored in and retrieved from the electronic patient record 18).
  • the (processed) genetic sequencing data output by the alignment/annotation module 16 and other patient data for the subject 4 constitutes a large knowledge base for the subject 4.
  • the data are generally in different formats.
  • a features extraction module 20 receives the data and constructs a feature vector for the subject 4.
  • the elements of the feature vector store feature values for a set of features, where the feature values for the feature vector representing the subject 4 are generated from data, including at least molecular marker data, acquired from the subject 4.
  • the features are all binary features.
  • Such binary features can store substantial data - for example, a binary feature value for a single nucleotide variant (SNV) may store a "1" if the variant is present in the WGS of the subject 4 and a "0" if the variant is not present.
  • An imaging study can be represented by one or more binary features indicating whether the study identified potentially malignant lesions.
  • a histopathology study can be represented by binary values indicating whether a positive ("1") or negative ("0") result was obtained.
  • the feature vector may additionally or alternatively include other types of features, such as integer values (e.g., different values may be used to represent different possible SNV for a given gene location), or text or symbol values (e.g., a feature representing an image test may include a letter value where different letters correspond to different test outcomes), or so forth.
  • the in silico test development is based on a set of subjects, of which the illustrative subject 4 is a single illustrative example.
  • Other subjects may be processed by the same laboratory 6 and components 14, 16, 20 to generate a feature vector for each subject. Additional data may optionally be obtained from one or more external databases 22.
  • the feature vectors representing all subjects of the set of subjects are suitably in the same "vector space", in the sense that each feature vector has the same elements in the same order representing the same features of the subject. (For example, if the fifth element of the feature vector for subject 4 represents a particular SNV, then the fifth element of each feature vector representing a subject of the set of subjects should represent that particular SNV using the same representation format).
  • the vector element includes a designated value (e.g., a NULL value) that indicates the feature is not available for that subject.
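  • A minimal sketch of building feature vectors in a shared vector space, with binary values and a designated missing value, might look like the following. The feature names in `FEATURE_ORDER` are hypothetical, and Python's `None` stands in for the NULL value.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical feature ordering shared by every subject (the common "vector space").
FEATURE_ORDER: Sequence[str] = (
    "snv_A", "snv_B", "imaging_lesion", "histopathology_positive",
)


def build_feature_vector(observations: Dict[str, bool],
                         feature_order: Sequence[str] = FEATURE_ORDER
                         ) -> List[Optional[int]]:
    """Encode observations as 1/0 in a fixed order; None marks an unavailable feature."""
    vector: List[Optional[int]] = []
    for name in feature_order:
        if name in observations:
            vector.append(1 if observations[name] else 0)
        else:
            vector.append(None)  # designated value: feature not available
    return vector


subject_a = build_feature_vector({"snv_A": True, "snv_B": False, "imaging_lesion": True})
subject_b = build_feature_vector({"snv_B": True, "histopathology_positive": False})
```

  • Because both vectors are built from the same ordered feature list, element k always refers to the same feature regardless of which data were available for a given subject.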
  • the output of the feature extraction module 20 applied to all subjects of the set of subjects is a set of feature vectors 24
  • the feature vector includes elements representing numerous features, some of which may be probative for the clinical situation under test or analysis, and some of which may be irrelevant for the clinical situation under test or analysis. Further, the probative features have certain feature values that are indicative of the clinical situation under test or analysis, while other feature values are not indicative of the clinical situation under test or analysis. That is, a subject whose feature vector has numerous feature values indicative of the clinical situation under test for the probative features has a higher likelihood of being in the clinical situation under test than does a subject whose feature vector has fewer feature values indicative of the clinical situation under test. However, the probative features and the feature values for those probative features that are indicative of the clinical situation under test are not known.
  • a probative features and feature values selection module 30 receives the clinical situation under test 32 and selects the probative features and the feature values for those probative features that are indicative of the clinical situation under test. In general, this is done by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known to be in the clinical situation under test, and by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known not to be in the clinical situation under test, and thereby identifying the probative features and the indicative feature values.
  • medical literature 34 may have identified biological pathways that are correlated with the feature values that are indicative of the clinical situation under test. Additionally or alternatively, the medical literature 34 may identify clinical implications that are associated with the feature values that are indicative of the clinical situation under test, or with biological pathways correlated with those feature values. In such cases, an enrichment module 36 associates enrichment data from the medical literature 34 (e.g., information on the correlated biological pathways and/or associated clinical implications) with the feature values that are indicative of the clinical situation under test.
  • the enrichment module 36 operates on an electronic medical literature database and suitably performs keyword searching or other data mining to identify relevant medical literature.
  • the enrichment data may be relatively general, for example providing citations to published clinical studies that include terms associated with the feature values indicative of the clinical situation under test (e.g., if the feature value indicates a particular SNV, the associated terms may include the name of the gene and the name of the SNV). Additionally or alternatively, the enrichment data may be more specific, e.g. expressly identifying a biological pathway correlating with the SNV. In some embodiments the enrichment module 36 is semi-automatic rather than fully automatic.
  • the enrichment module 36 may present a human operator with the feature values indicative of the clinical situation under test and links to published clinical studies that include terms associated with those feature values, and provide a viewer window via which the human operator can review the linked clinical studies and a dialog box via which the human operator can manually input enrichment data based on the human operator's review of the linked clinical studies and, optionally, further based on the human operator's medical expertise.
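  • The keyword-searching variant of the enrichment step can be sketched as a simple term match over literature records. The record layout (`id`, `title`, `abstract`), the function name, and the placeholder records below are all hypothetical; no real database or publication is implied.

```python
from typing import Dict, List, Sequence


def find_related_studies(terms: Sequence[str],
                         literature: Sequence[Dict[str, str]]) -> List[str]:
    """Return ids of records whose title or abstract mentions every search term."""
    wanted = [term.lower() for term in terms]
    hits: List[str] = []
    for record in literature:
        text = f"{record['title']} {record['abstract']}".lower()
        if all(term in text for term in wanted):
            hits.append(record["id"])
    return hits


# Placeholder records for illustration only.
literature = [
    {"id": "study-1", "title": "GENE1 variants in tissue samples",
     "abstract": "We report variantX in GENE1."},
    {"id": "study-2", "title": "Imaging outcomes",
     "abstract": "No molecular markers were measured."},
    {"id": "study-3", "title": "GENE1 expression",
     "abstract": "Expression levels only."},
]
links = find_related_studies(["GENE1", "variantX"], literature)
```

  • In a semi-automatic arrangement, the returned ids would be the links presented to the human operator for review.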
  • the output of the selection module 30 and the enrichment module 36 is an in silico test dataset 40 comprising the selected set of probative features for the clinical situation under test, the selected feature values indicative of the clinical situation under test for these probative features, and any added enrichment data.
  • the set of feature vectors 24 can be used in developing different in silico tests for a plurality of different clinical situations under test or analysis. For example, tests may be developed for cancers of different organs or tissues, and/or for different types of those cancers.
  • the processing for each different clinical situation under test entails inputting that clinical situation as the input 32 and invoking the selection module 30 and the enrichment module 36 for that clinical situation.
  • An input feature vector 46 is generated for a person of interest 44, e.g. a single medical patient or a cohort of patients with similar clinical symptoms.
  • the person of interest 44 is typically not one of the subjects contributing to the set of feature vectors 24 used in developing the in silico test. Rather, the person of interest 44 is typically a current medical patient undergoing clinical diagnosis or treatment.
  • the input feature vector 46 may include molecular marker data generated by genetic sequencing using the same sample extraction laboratory 6, sequencer apparatus 14, and alignment/annotation module 16 as was used for generating the set of feature vectors 24 used in developing the in silico test.
  • Other data for generating feature values may come from the entry for the patient 44 in the electronic patient record 18 (and, again, the genetic sequencing data may be stored in and retrieved from the electronic patient record 18 as indicated by a dashed arrow in FIGURE 2).
  • the feature extraction module 20 is applied to the data for the patient of interest 44 to generate the input feature vector 46. Again, if the available data for the patient 44 is insufficient to compute the feature value for any element of the feature vector, that vector element is suitably filled with the designated value (e.g., NULL value) indicating the feature is not available.
  • a comparative analysis module 50 compares the input feature vector 46 with the in silico test dataset 40, and more particularly compares the feature values of the input feature vector 46 for the probative features identified in the in silico test dataset 40 with the feature values indicative of the clinical situation under test (also from the in silico test dataset 40).
  • the comparative analysis module 50 computes a likelihood that the person of interest 44 is in the clinical situation under test or analysis.
  • the likelihood is computed by comparing the input feature values with the feature values for the probative features that are indicative of the clinical situation under analysis. It should be noted that the likelihood is typically not a medical diagnosis; rather, it is an intermediate result typically provided as an item of information for consideration by a medical doctor in making a medical diagnosis based on the likelihood and possibly other information.
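  • One simple way to turn this comparison into a number is sketched below, assuming binary features and a plain match fraction. The scoring rule is an assumption chosen for illustration; the text does not state how the likelihood is computed, and the result is an uncalibrated score, not a probability or a diagnosis.

```python
from typing import Dict, Optional, Sequence


def likelihood_score(input_vector: Sequence[Optional[int]],
                     indicative_values: Dict[int, int]) -> Optional[float]:
    """Fraction of available probative features whose input value equals the indicative value."""
    matches = 0
    available = 0
    for index, indicative in indicative_values.items():
        value = input_vector[index]
        if value is None:
            continue  # feature not available for this person
        available += 1
        if value == indicative:
            matches += 1
    if available == 0:
        return None  # nothing to compare against
    return matches / available


# Hypothetical in silico test dataset: feature index -> indicative value.
indicative_values = {0: 1, 2: 1, 3: 0}
score = likelihood_score([1, 0, 1, 1], indicative_values)
unknown = likelihood_score([None, 1, None, None], indicative_values)
```

  • Here two of the three probative features match, giving a score of 2/3; when none of the probative features is available the function returns `None` rather than guessing.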
  • a comparative analysis results visualization module 52 displays the results, including at least an indication of the computed likelihood. If the computed likelihood is high, then the visualization module 52 may optionally also display any enrichment data (e.g., correlated biological pathways and/or associated clinical implications) for the clinical situation under analysis.
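  • The conditional display of enrichment data can be sketched as a small formatting helper. The 0.8 threshold for a "high" likelihood, the function name, and the example enrichment strings are assumptions for illustration.

```python
from typing import List, Sequence


def format_result(likelihood: float, enrichment: Sequence[str],
                  high_threshold: float = 0.8) -> List[str]:
    """Always report the likelihood; append enrichment data only when it is high."""
    lines = [f"Likelihood of clinical situation under analysis: {likelihood:.0%}"]
    if likelihood >= high_threshold:
        lines.extend(f"Enrichment: {item}" for item in enrichment)
    return lines


enrichment = ["correlated biological pathway (placeholder)",
              "associated clinical implication (placeholder)"]
high = format_result(0.9, enrichment)
low = format_result(0.4, enrichment)
```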
  • the various processing components are suitably implemented by one or more computers or other electronic data processing devices 55.
  • the electronic data processing device or devices 55 may include: a notebook computer; a desktop computer; a network server computer accessible via the Internet and/or a local wired/wireless data network; various combinations thereof; or so forth.
  • the electronic data processing device 55 includes or has operative access to a display device or screen 56 for displaying the visualization generated by the visualization module 52.
  • the same computer 55 implements both the in silico test development system of FIGURE 1 and the in silico testing system of FIGURE 2.
  • the alignment/annotation module 16 may be implemented by a computer associated with the sequencer apparatus 14 that is different from the computer that implements the features extraction module 20, the selection module 30, and the enrichment module 36.
  • Other arrangements of electronic data processing devices are also possible.
  • the disclosed in silico test development and implementation techniques are also suitably embodied as a non-transitory storage medium storing instructions executable by the computer or other electronic data processing device 55 to perform the disclosed techniques.
  • the non-transitory storage medium storing the executable instructions may, for example, include: a hard disk or other magnetic storage medium; an optical disk or other optical storage medium; a flash memory, random access memory (RAM), read-only memory (ROM), or other electronic storage medium; or so forth.
  • the probative features and feature values selection module 30 operates by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known to be in the clinical situation under test, and by comparing the feature vectors of the set of feature vectors 24 with features of subjects who are known not to be in the clinical situation under test, and thereby identifying the probative features and the indicative feature values.
  • each subject of the set of subjects from which the set of feature vectors 24 is derived is annotated to indicate whether the subject is in the clinical situation under test.
  • the probative features are identified as features having certain values for those subjects annotated as being in the clinical situation under test and having certain other (different) values for those subjects annotated as not being in the clinical situation under test.
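  • For the fully annotated case, a minimal sketch of identifying probative binary features is to compare how often each feature is set among subjects in and not in the clinical situation. The frequency-difference rule and the `min_difference` threshold are assumptions for illustration, and no statistical significance test is applied.

```python
from typing import Dict, Optional, Sequence


def select_probative_features(vectors: Sequence[Sequence[Optional[int]]],
                              in_situation: Sequence[bool],
                              min_difference: float = 0.5) -> Dict[int, int]:
    """Map feature index -> indicative binary value for features that separate the two groups."""
    probative: Dict[int, int] = {}
    for j in range(len(vectors[0])):
        pos = [v[j] for v, label in zip(vectors, in_situation)
               if label and v[j] is not None]
        neg = [v[j] for v, label in zip(vectors, in_situation)
               if not label and v[j] is not None]
        if not pos or not neg:
            continue  # cannot compare without data in both groups
        pos_rate = sum(pos) / len(pos)
        neg_rate = sum(neg) / len(neg)
        if abs(pos_rate - neg_rate) >= min_difference:
            # The indicative value is whichever value is more typical of the positives.
            probative[j] = 1 if pos_rate > neg_rate else 0
    return probative


vectors = [
    [1, 0, 1, None],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels = [True, True, False, False]  # annotation: in the clinical situation or not
probative = select_probative_features(vectors, labels)
```

  • Features 0 and 2 are indicative when set, feature 3 is indicative when not set, and feature 1 is dropped because it looks the same in both groups.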
  • this approach can be difficult or impossible to implement.
  • the available pertinent clinical studies may investigate populations that are too small for the selection module 30 to generate statistically significant results, and/or the clinical studies may not identify a sufficient number of features for the subjects.
  • In some published studies, only those features that the researchers determined to be relevant are identified, and this may exclude numerous other features that would be identified by the selection module 30 if the full feature sets were available.
  • Some studies may publish only summaries, rather than providing full WGS or other individualized patient data.
  • the format may be incompatible with the sequences output by the sequencing apparatus 14 that is available for characterizing the patient of interest 44, or may be obtained by an entirely different technology (e.g., proteomic analysis rather than sequencing). Still further, there may be known or unknown population biases present in the clinical study populations. For example, a given clinical study may have been restricted to women, while the test under development may be intended to be applicable to both women and men.
  • the illustrative selection module 30 operates on the set of feature vectors representing subjects 24 in which the subjects are not (in general) annotated as to whether the subjects are in the clinical situation under test or analysis.
  • the illustrative selection module 30 also has available to it a database of feature values for subjects of one or more populations 60 that include subjects annotated as to whether they are in the clinical situation under test.
  • the database may be relatively incomplete as compared with the set of features represented in the feature vector.
  • subjects of the population(s) 60 may be labeled with only a few discrete molecular marker values, for example obtained in a standard test employing a fixed set of markers, rather than with a WGS or other substantial set of molecular marker data.
  • the database 60 may be limited in other ways, for example being biased toward a particular gender, age group, or other demographic due to constraints on study pools imposed by study parameters, and/or having an undesirably small population size, or so forth.
  • the population 60 typically includes both positive and negative samples (i.e. some subjects in the clinical situation under test, and some subjects not in the clinical situation under test), although a population with only positive samples may be employed.
  • the selection module 30 utilizes the substantially larger quantity of data contained in the set of feature vectors representing subjects 24 to effectively generalize data of the more limited study population(s) 60.
  • a discriminative features subset extraction operation 62 analyzes the feature vectors of the set of feature vectors 24 to identify a subset of discriminative features and to discard non-discriminative or minimally discriminative features so as to generate reduced-dimensionality feature vectors 64 that are effective for discriminating amongst the subjects represented by the feature vectors 24.
  • Differential (i.e., discriminative) feature subset extraction is a process by which a set of features describing a set of entities is reduced to a subset of features based on their ability to maintain the differential information between the entities (that is, the ability to discriminate between entities). More formally, each entity $E_1, \ldots, E_M$ is described with a vector of $N$ values. In other words, $E_i$ will be represented by feature vector $\langle f_{i1}, f_{i2}, \ldots, f_{iN} \rangle$, where $f_{ij}$ is the j-th feature for the i-th entity. These feature vectors need not contain values for each feature (i.e. some vector elements can have a NULL/missing value).
  • the differential feature extraction 62 can be tuned by selection of the "comparable range" and/or the percentage of entities within that range required for elimination. In general, increasing the size of the comparable range increases the number of features that are eliminated, and lowering the percentage of entities in the comparable range required for elimination increases the number of features that are eliminated.
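  • The two tuning parameters can be illustrated with a sketch under a stated assumption: a feature is eliminated when at least a given fraction of the entities with an available value fall inside a single window whose width is the comparable range. This elimination rule is an interpretation adopted for illustration, and the names `comparable_range` and `elimination_fraction` are hypothetical.

```python
from typing import List, Optional, Sequence


def is_discriminative(values: Sequence[Optional[float]],
                      comparable_range: float,
                      elimination_fraction: float) -> bool:
    """False if enough of the available values fit inside one window of the given width."""
    present = sorted(v for v in values if v is not None)
    if not present:
        return False  # nothing observed, so nothing to discriminate on
    best = 0
    start = 0
    for end, value in enumerate(present):
        # Shrink the window from the left until it is no wider than the range.
        while value - present[start] > comparable_range:
            start += 1
        best = max(best, end - start + 1)
    return best / len(present) < elimination_fraction


def reduce_features(vectors: Sequence[Sequence[Optional[float]]],
                    comparable_range: float,
                    elimination_fraction: float) -> List[int]:
    """Return the indices of the features that are kept."""
    return [j for j in range(len(vectors[0]))
            if is_discriminative([v[j] for v in vectors],
                                 comparable_range, elimination_fraction)]


vectors = [
    [1, 0, 5.0],
    [1, 1, 5.1],
    [1, 0, 9.0],
    [1, 1, None],
]
kept_tight = reduce_features(vectors, comparable_range=0.0, elimination_fraction=0.75)
kept_wide = reduce_features(vectors, comparable_range=5.0, elimination_fraction=0.75)
kept_low_pct = reduce_features(vectors, comparable_range=0.0, elimination_fraction=0.5)
```

  • On this toy data the constant first feature is always eliminated, widening the comparable range eliminates all three features, and lowering the required fraction additionally eliminates the second feature, which matches the tuning behavior described above.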
  • information about the entities and/or the features can be employed to determine how informative (i.e., discriminative) individual features or feature subsets are for distinguishing amongst individuals of the population 24. For example, features may be ranked on the properties/distribution of the values and a top set of features (e.g. top 25%) is selected, or subsets may be evaluated as a group by their ability to stratify entities into sub-categories, after which the top performing subsets are selected.
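  • The ranking alternative can be sketched by scoring each feature on the spread of its values and keeping the top fraction. Using population variance as the ranking property is an assumption for illustration; the text leaves the ranking criterion open.

```python
import math
from statistics import pvariance
from typing import List, Optional, Sequence


def top_features_by_variance(vectors: Sequence[Sequence[Optional[float]]],
                             top_fraction: float = 0.25) -> List[int]:
    """Rank features by the variance of their available values and keep the top fraction."""
    n_features = len(vectors[0])
    scored = []
    for j in range(n_features):
        present = [v[j] for v in vectors if v[j] is not None]
        variance = pvariance(present) if len(present) > 1 else 0.0
        scored.append((variance, j))
    # Highest variance first; ties are broken by the lower feature index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    keep = max(1, math.ceil(top_fraction * n_features))
    return sorted(j for _, j in scored[:keep])


vectors = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
top_quarter = top_features_by_variance(vectors, top_fraction=0.25)
top_half = top_features_by_variance(vectors, top_fraction=0.5)
```

  • With four features, the top 25% is the single most variable feature (index 1), and the top 50% adds the next most variable one (index 2); the two constant features are never selected first.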
  • the foregoing examples are merely illustrative, and other feature reduction techniques can also be employed.
  • the variations in the discriminative subset of features (i.e. the reduced-dimensionality feature vectors) 64 are characterized in the context of populations 60 in the identification operation 66.
  • the output of the identification operation 66 is probative features and the feature values for those probative features that are indicative of the clinical situation under test.
  • This output serves as the input to the enrichment module 36, which performs an enrichment operation 68 that enriches the test with enrichment data, e.g. pathway data, clinical implication data, decision recommendation data, or so forth.
  • the feature subset extraction operation 62 is not dependent on the annotated reference population(s) 60. Accordingly, if there are several clinical situations under test for which in silico tests are to be developed, the same reduced features set 64 can be used for each test development, and so the feature subset extraction operation 62 can be run only once.
  • Input to the comparative variation assessment tool is sequencing profiles obtained from one or more tissue samples. These samples can come from a group of patients, can originate in normal and/or cancer tissue, and can be obtained with various degrees of invasiveness: from a saliva swab, through a blood sample, to biopsy and surgery.
  • a group of samples may be obtained from a single patient, e.g. one normal sample and one or more diseased tissue samples, e.g. several biopsied points in suspicious nodules, plus optionally secondary sites such as lymph nodes may be considered.
  • The sequencer apparatus 14 acquires single-base-level, high-coverage reads of the DNA or RNA molecules from a specimen.
  • The end result after several standard low-level processing steps is a collection of reads of given lengths which are then aligned to a reference (e.g. human genome for human DNA or RNA sequencing) by the alignment/annotation module 16.
  • The alignment is typically imperfect (by design) and this is captured in the output by the confidence of the match for a single base or a region, coverage of a location on the reference genome, and other quality metrics. Given an alignment, it is possible to characterize various types of variations that exist in all individuals. These variations can also be called with some certainty, which allows for filtering out noise in the biological signal or in the measurement.
  • With reference to FIGURE 4, a simplified diagrammatic view of a processing pipeline is presented.
  • One or more of these pipelines can be used to analyze each sample to provide higher-level information on discovered variations.
  • candidate variations are obtained (e.g. triangle and square locations on the genome which is indicated with the full line in the top diagram of FIGURE 4).
  • Not all variations are of interest to the clinician (for example, they may be known to be irrelevant to the clinical situation under test).
  • The variations are annotated based on some repository of variations and these are indicated in FIGURE 4 (middle diagram) with the symbols A, B, C and i, j, k, and l.
  • Such variations can be grouped based on higher-level grouping (e.g. biological pathways, disease-associated genes, population-specific variants). In FIGURE 4, this higher-level annotation is shown in the bottom diagram using a further set of three symbols.
  • Typical variations comprise single nucleotide variations, copy number variations, or so forth. Furthermore, these can be interpreted with respect to their homo- or heterozygosity, equivalence to a reference population, et cetera.
  • The samples can be grouped in various subsets (when there are multiple sequencing outputs/samples to consider) or by variations on the annotation of the same measurement, relative to the question explored.
  • This is achieved in the approach of FIGURE 3 by the differential features subset operation 62 which produces the reduced-dimensionality feature vectors 64.
  • the annotated reduced profiles (represented by the reduced feature vectors 64) are then analyzed in the context of a reference population or populations 60 with equivalent annotation to identify features that are probative of the clinical situation under test (identification operation 66).
  • Each sample can be represented by a feature vector of $N$ values (that is, an $N$-dimensional feature vector) which can for example correspond to the union of all variations found, $v_1, v_2, \ldots, v_N$.
  • The feature vector elements have binary values, i.e. each sample is represented by a feature vector of $N$ "0" and "1" values indicating presence or absence of variation $v_i$ at the $i$-th position.
  • the reduced-dimensionality feature vectors 64 can then be used to compute pairwise distances between samples and with this perform hierarchical clustering as part of the identification operation 66 to find sample clusters or subsets.
  • In this example, N = 8 and the samples fall perfectly into two clusters.
  • Hierarchical clustering can, for example, identify a larger cluster (e.g. corresponding to breast cancer generally) and smaller clusters contained in the larger cluster that correspond to specific types of breast cancer. More generally, a larger cluster "higher" in the hierarchy corresponds to a more general clinical situation under analysis while the contained smaller clusters "lower" in the hierarchy correspond to more particular clinical situations under analysis (which are subsumed by or encompassed by the more general clinical situation).
  • While hierarchical clustering advantageously enables such stratification from more general to more particular clinical situations, it is alternatively contemplated to employ non-hierarchical (i.e. flat) clustering.
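  • The pairwise-distance and hierarchical clustering step described above can be sketched as follows, using binary feature vectors with N = 8. This is an illustrative sketch only: the Hamming distance, the average-linkage rule, the function names, and the six sample vectors are assumptions chosen for the example, and the text does not prescribe any of them.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two binary feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering on Hamming distances.

    Returns the final clusters (lists of sample indices) and the ordered
    list of merged clusters, which records the hierarchy bottom-up.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []

    def linkage(c1, c2):
        total = sum(hamming(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Pick the closest pair of clusters and merge it.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters), merges


# Six hypothetical samples, each a presence/absence vector over N = 8 variations.
samples = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
]
clusters, merges = agglomerate(samples, n_clusters=2)
```

  • Stopping at a larger `n_clusters` yields the smaller, more particular clusters, while continuing to merge yields the larger, more general ones; the `merges` list records that nesting.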
  • With reference to FIGURES 5 and 6, output is shown of one analysis that produces data analogous to that discussed with reference to FIGURE 4.
  • the data shown in FIGURES 5 and 6 were extracted from copy number variation (CNV) analysis of DNA sequencing data from eight individuals where seven genomes were analyzed using the eighth genome as a normalization (control) with the CNV-seq tool (Xie et al, "CNV-seq, a new method to detect copy number variation using high-throughput sequencing", BMC Bioinformatics 2009, 10:80).
  • FIGURES 5 and 6 show output of this tool visualized using the UCSC Genome Browser (Kent et al, "The human genome browser at UCSC", Genome Res. 2002 Jun;12(6):996-1006). In these results, boldface is used to signify amplification and italics to signify deletion.
  • For example, a CNV on chromosome 1 may be discovered in the range 1,000,000-1,000,100 in one sample and 1,000,100-1,000,900 in another. Such discrepancies can be consolidated across all samples to establish a more robust call of the existence and the quantitative characterization of these (shared) variations.
  • In FIGURE 9, the CNV calls from four genomes are broken into eight merged segments $MS_1$ to $MS_8$, each of which is characterized with a vector of four values. Additionally, a step is performed that combines neighboring and overlapping segments into consolidated segments. Here, from the beginning of $MS_1$ to the end of $MS_8$, all eight segments are combined into one consolidated segment.
  • aggregate segments are used as units of shared variation that can be used to compare occurrence in individual samples in the subsequent analysis.
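  • The segment merging and consolidation described above can be sketched as follows. This is a minimal sketch under assumptions: intervals are treated as half-open `(start, end)` pairs, each call carries a hypothetical copy-number value, `None` marks "no call in this sample", and the function names are illustrative rather than taken from any tool named in the text.

```python
def merged_segments(calls_by_sample):
    """Split all calls at every breakpoint seen in any sample.

    Each resulting segment carries one value per sample (None if that
    sample has no call covering the segment).
    """
    breakpoints = sorted(
        {p for calls in calls_by_sample for start, end, _ in calls for p in (start, end)}
    )
    segments = []
    for seg_start, seg_end in zip(breakpoints, breakpoints[1:]):
        vector = [
            next((v for s, e, v in calls if s <= seg_start and seg_end <= e), None)
            for calls in calls_by_sample
        ]
        if any(v is not None for v in vector):
            segments.append((seg_start, seg_end, vector))
    return segments


def consolidate(segments):
    """Combine neighboring or overlapping merged segments into regions."""
    regions = []
    for start, end, _ in segments:
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(end, regions[-1][1]))
        else:
            regions.append((start, end))
    return regions


# Hypothetical calls on one chromosome: (start, end, copy-number value).
calls_by_sample = [
    [(1_000_000, 1_000_100, 1.5)],
    [(1_000_100, 1_000_900, 1.4)],
    [(2_000_000, 2_000_500, 0.5)],
]
segments = merged_segments(calls_by_sample)
regions = consolidate(segments)
```

  • With these inputs, the two abutting calls from the first two samples fall into one consolidated region, while the distant call in the third sample stays separate.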
  • With reference to FIGURES 10-13, an illustrative example is shown for a subset A, a subset B, and sequences for one or more populations (see FIGURE 10).
  • FIGURE 11 highlights a first set of probative annotations (i.e., probative features in the context of the feature vector). These features are present in subset A but are absent in Subset B.
  • FIGURE 12 highlights a second set of (one) probative annotation (i.e., another probative feature in the context of the feature vector). This annotation is absent in subset A but is present in subset B.
  • FIGURE 13 shows the aggregate set of probative annotations (i.e., probative features in the context of the feature vector).
  • the output of this reference-population-based annotation is a set of variations that characterize the sample(s) relative to a patient population.
  • In this example, there are two sample subsets A and B that each differ from some reference population in a different fashion.
  • all samples in subset A have two single nucleotide variations found in a reference population but not in subset B (see also FIGURE 11).
  • subset B contains a single-nucleotide variation found in the reference but not in subset A (see also FIGURE 12).
  • With reference to FIGURE 14, a diagrammatic example of a suitable probative feature selection is described.
  • all possible difference and intersection sets are examined for differential presence and/or absence of variations and such sets are returned.
  • This step can be implemented by measuring the distance between profiles, where each profile is represented as a vector of values and distance metrics such as correlation and Euclidean distance provide information on which samples are "closest" and thus form one or more subsets.
  • the example of FIGURE 14 can be expanded to multiple populations and to more than two subsets.
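  • The examination of difference and intersection sets described above can be sketched with plain set operations. This is an illustrative sketch only: the function name `probative_features`, the variant identifiers, and the rule "present in every sample of one subset, absent from every sample of the other, and found in the reference population" are assumptions modelled on the subset A / subset B example.

```python
def probative_features(subset_a, subset_b, reference):
    """Return variations that separate subset A from subset B.

    Each sample is a set of variation identifiers. A variation is reported
    for a subset when every sample of that subset carries it, no sample of
    the other subset carries it, and it occurs in the reference population.
    """
    in_all_a = set.intersection(*subset_a)
    in_all_b = set.intersection(*subset_b)
    in_any_a = set.union(*subset_a)
    in_any_b = set.union(*subset_b)
    a_only = (in_all_a - in_any_b) & reference
    b_only = (in_all_b - in_any_a) & reference
    return a_only, b_only


# Hypothetical variation identifiers.
subset_a = [{"v1", "v2", "v5"}, {"v1", "v2", "v6"}]
subset_b = [{"v3", "v5"}, {"v3", "v6"}]
reference = {"v1", "v2", "v3", "v4"}

a_only, b_only = probative_features(subset_a, subset_b, reference)
```

  • Here subset A is characterized by two variations and subset B by one, mirroring the pattern described for FIGURES 11 and 12; the same function can be applied pairwise to extend the idea to more subsets or populations.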
  • the enrichment operation 68 is suitably performed by the enrichment processor 36.
  • the enrichment may include: based on variant genes, identifying biological pathways implicated with these genes; determining clinical implication based on the enrichment data; and obtaining possible clinical decisions to be presented to the clinician.
  • FIGURE 15 diagrammatically shows this process.
  • features reduction 62 can be grouped under one comparative analysis and each such analysis has one or more clinical
  • A dataset (i.e., reduced features set 64, see FIGURE 3) can be analyzed multiple times in the context of different clinical implications or different patient populations.
  • the same clinical implication can be assessed using one or more datasets.
  • The feature extraction 20 and feature reduction 62 can be performed only once, and the identification and enrichment operations 66, 68 repeated for each different clinical situation under test. Every comparative analysis combination of a dataset and a clinical question results in a clinical decision characterized by a fitness score (e.g. on a scale of 0 to 1, with 0 meaning the dataset is not informative for characterizing the clinical question, and 1 meaning the dataset is directly relevant to characterizing the clinical question).
  • the disclosed processing is captured by the following pipeline: Sample; Measurements; Data Analysis; Post-processing; Annotation (respective to a reference sequence with biological and clinical annotations); Interpretation (referenced to the clinical study population(s) 60 and medical literature 34 specifying clinical implications and clinical outcomes).
  • Multiple such pipelines can be implemented and executed depending on the type of measurement, choice of tools to perform analysis and the data repositories used to annotate the data.
  • Measurements produces raw data
  • Data Analysis performs the initial processing like alignment and QA (e.g., performed by sequence
  • Post-processing involves determination of the subsets and their properties, and Annotation determines the relationships between the subsets and the populations (e.g., the identification operation 66). Finally, Interpretation connects the molecular characterization with clinical implications (e.g., the enrichment operation 68).
  • the comparative analysis module 50 provides characterization of all sets - the clinical implication/decision, and also the underlying molecular profile(s) and annotation that contributed towards that conclusion. For example, centroids of the feature vector subsets generated by clustering or other probative features identification processing 66 (see FIGURE 3) can be used to characterize or classify new (individual) patient samples from the patient of interest 44 when they are analyzed in the clinic.
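  • The use of centroids to classify a new patient sample can be sketched as follows. This is a minimal sketch under assumptions: Euclidean distance and a nearest-centroid rule are chosen for illustration, and the subset names and feature values are hypothetical.

```python
from math import dist


def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(sample, centroids):
    """Return the name of the subset whose centroid is nearest to the sample."""
    return min(centroids, key=lambda name: dist(sample, centroids[name]))


# Hypothetical feature vector subsets found during test development.
subsets = {
    "subset A": [[1, 1, 0, 0], [1, 1, 1, 0]],
    "subset B": [[0, 0, 1, 1], [0, 1, 1, 1]],
}
centroids = {name: centroid(vectors) for name, vectors in subsets.items()}

# A new sample from the patient of interest (hypothetical values).
label = classify([1, 1, 0, 0], centroids)
```

  • Only the centroids need to be retained from test development, so classifying a new sample in the clinic requires one distance computation per subset.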
  • the population P may be ovarian cancer patients that responded to therapy.
  • the set S ∩ P will be one set for which the Sample; Measurements; Data Analysis;
  • FIGURE 18 shows one suitable presentation.
  • patient type of measurements, measured tissue
  • a number of comparative analysis instances can be instantiated and presented for example in a matrix format based on Table 1.
  • each column is a different analysis pipeline dependent on the data type and/or the annotation databases (corresponding to a different probative feature subset in the context of feature vectors)
  • each row is an application of a comparative analysis instance applied to a particular analysis pipeline corresponding to a particular clinical implication.
  • Each comparative analysis is scored with respect to fitness to the analyzed sample from a new patient.
  • the fitness is an indication of the computed likelihood.
  • the fitness scores can be accumulated for each row providing a total score for each comparative analysis with respect to a clinical implication.
  • The visualization module 52 highlights each comparative analysis that results in a match for the current patient (e.g., using thick cell borders in the matrix of FIGURE 18), and also provides a ranking of the clinical implications where the molecular profiles provide insight into the patient sample, thus providing the clinician with tools that enable prioritization and easy overview of the relevant clinical categories and possible clinical actions.
  • In the example of FIGURE 18, "Clinical Implication #3" is most likely, as all three probative feature subsets #1, #2, #3 provide a match with the patient of interest.
  • The clinical implications are then compared, for example to find the cells where the clinical implication is in agreement across the analyses, and the clinical actions are ranked (using a star system at the right in FIGURE 18), for example based on the strength of evidence obtained by consensus in the analysis output.
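  • The matrix of fitness scores, its per-row accumulation, and the resulting ranking can be sketched as follows. This is an illustrative sketch only: the scores, the match threshold of 0.5, and the function name `rank_implications` are hypothetical, and the text does not specify how the fitness values themselves are computed.

```python
def rank_implications(fitness, match_threshold=0.5):
    """Rank clinical implications by accumulated fitness.

    `fitness` maps each clinical implication (matrix row) to its scores,
    one per analysis pipeline / probative feature subset (matrix column).
    Returns (name, total score, per-column match flags), best first.
    """
    rows = []
    for name, scores in fitness.items():
        matches = [score >= match_threshold for score in scores]
        rows.append((name, sum(scores), matches))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical fitness scores on a 0-to-1 scale for three pipelines.
fitness = {
    "Clinical Implication #1": [0.2, 0.7, 0.1],
    "Clinical Implication #2": [0.6, 0.3, 0.4],
    "Clinical Implication #3": [0.8, 0.9, 0.7],
}
ranked = rank_implications(fitness)
for name, total, matches in ranked:
    print(f"{name}: total={total:.2f} matches={matches}")
```

  • The per-column match flags correspond to the highlighted cells of the matrix, and a row in which every flag is set is the consensus case described for "Clinical Implication #3".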
  • the annotation of the variations may be aimed at selecting the best therapy for a breast cancer patient. Copy number, transcription levels, and single nucleotide variations are all measured and the annotation is compared with known variations relevant to breast cancer implicated genes. All comparative analyses then focus on selecting the pathways implicated with each measurement and the targets of therapies are identified in each row of the matrix. Based on the ranking of the therapies, the clinician can decide how to proceed, order another line of analysis, explore the underlying evidence, re-evaluate the data with respect to another reference population, et cetera.
  • the analysis up to this point facilitates focusing on a subset of features based on which implication can be derived at a higher level. For example, given a set of genes which are interesting due to the differential CNV features discussed earlier, a subsequent analysis may be applied to derive pathway regulation profiles to indicate which biological pathways are enriched with gene amplifications or gene deletions in the context of the eight genomes CNV analysis.
  • Table 2 shows two sets of biological pathways derived for the genome of a person of Central European origin, listing pathways that indicate relative differences in CNV profiles which may, based on the clinical question asked, indicate susceptibility to a disease, suitability for a therapy, or a candidate for inclusion in a population for broader analysis of samples.
  • a further example from the literature is considered.
  • molecular profiling data is used to assess carboplatinum-based chemotherapy resistance in ovarian cancer patients.
  • The key genes identified in each patient subgroup are used to further determine which biological pathways are primarily affected in the cancer tissue of the two sample subsets.
  • DNA methylation information as well as gene expression data in a sample obtained from tumor biopsies (see Banerjee et al, "Pathway and network analysis probing epigenetic influences on chemosensitivity in ovarian cancer", IEEE GENSIPS 2010).
  • the central genes are identified in two subsets with matching expression and methylation profiles in platinum resistant patients.
  • the annotation (gene names, in this case) is obtained to then identify the biological pathways enriched in each subset.
  • The population is determined based on clinical studies that implicate various biological pathways involved in, for example, therapy resistance and cancer proliferation. Three populations are identified based on the three degrees of resistance to therapy: platinum-sensitive (PFI > 6 months), platinum-resistant (PFI < 6 months), and platinum-refractory (no PFI), where PFI stands for progression-free interval, a surrogate marker for intrinsic chemosensitivity.
  • The subsets A and B are two groups of pathways determined to have distinct profiles in the given patient cohort.
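  • The assignment of patients to the three platinum-response populations by progression-free interval (PFI) can be sketched as follows. This is a minimal sketch under assumptions: the function name is hypothetical, "no PFI" is represented as `None`, and a PFI of exactly 6 months is assigned to the resistant group, a boundary the text does not specify.

```python
def resistance_group(pfi_months):
    """Map a progression-free interval in months to a reference population."""
    if pfi_months is None:
        return "platinum-refractory"  # no progression-free interval at all
    if pfi_months > 6:
        return "platinum-sensitive"
    return "platinum-resistant"  # assumption: exactly 6 months counts as resistant


# Hypothetical patients and PFI values in months.
cohort = {"patient 1": 14, "patient 2": 3, "patient 3": None}
groups = {patient: resistance_group(pfi) for patient, pfi in cohort.items()}
```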

Abstract

According to the invention, an in silico test is developed to estimate the probability that a patient is in a clinical situation under test. The development includes generating feature vectors representing subjects of a set of subjects, the feature vectors comprising feature values derived from molecular marker data; performing feature reduction on the feature vectors to generate discriminative feature vectors; and identifying a set of probative features, and feature values for the probative features, that are indicative of the clinical situation under test, based on comparison of the discriminative feature vectors with a reference data set comprising feature values representing subjects identified as being in the clinical situation under test. An input feature vector is generated from data including molecular marker data acquired from a person to be tested. The in silico test is performed by comparing the input feature vector with the feature values for the probative features that are indicative of the clinical situation under test.
PCT/IB2013/059421 2012-10-23 2013-10-17 Comparative analysis and interpretation of genomic variation in an individual or in collections of sequence data Ceased WO2014064584A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261717256P 2012-10-23 2012-10-23
US61/717,256 2012-10-23

Publications (1)

Publication Number Publication Date
WO2014064584A1 true WO2014064584A1 (fr) 2014-05-01

Family

ID=49956254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2013/059421 Ceased WO2014064584A1 (fr) 2012-10-23 2013-10-17 Comparative analysis and interpretation of genomic variation in an individual or in collections of sequence data

Country Status (1)

Country Link
WO (1) WO2014064584A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6647341B1 (en) * 1999-04-09 2003-11-11 Whitehead Institute For Biomedical Research Methods for classifying samples and ascertaining previously unknown classes
US20040053317A1 (en) * 2002-09-10 2004-03-18 Sidney Kimmel Cancer Center Gene segregation and biological sample classification methods
US20060034508A1 (en) * 2004-06-07 2006-02-16 Zhou Xiang S Computer system and method for medical assistance with imaging and genetics information fusion
WO2009047700A2 (fr) * 2007-10-10 2009-04-16 Koninklijke Philips Electronics N.V. Medical system for aiding the diagnosis of cancer

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BANERJEE ET AL.: "Pathway and network analysis probing epigenetic influences on chemosensitivity in ovarian cancer", IEEE GENSIPS, 2010
BOEVA ET AL.: "Control-free calling of copy number alterations in deep- sequencing data using GC-content normalization", BIOINFORMATICS, vol. 27, no. 2, 2011, pages 268 - 269
KENT ET AL.: "The human genome browser at UCSC", GENOME RES., vol. 12, no. 6, June 2002 (2002-06-01), pages 996 - 1006, XP007901725, DOI: doi:10.1101/gr.229102. Article published online before print in May 2002
XIE ET AL.: "CNV-seq, a new method to detect copy number variation using high-throughput sequencing", BMC BIOINFORMATICS, vol. 10, 2009, pages 80, XP021047346, DOI: doi:10.1186/1471-2105-10-80

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497561A (zh) * 2022-09-01 2022-12-20 北京吉因加医学检验实验室有限公司 Method and device for hierarchical screening of methylation markers
CN115497561B (zh) * 2022-09-01 2023-08-29 北京吉因加医学检验实验室有限公司 Method and device for hierarchical screening of methylation markers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13821153

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13821153

Country of ref document: EP

Kind code of ref document: A1