WO2012107786A1 - System and method for blind extraction of features from measurement data - Google Patents

System and method for blind extraction of features from measurement data

Info

Publication number
WO2012107786A1
Authority
WO
WIPO (PCT)
Prior art keywords
pairing
features
sets
reference sample
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/HR2011/000006
Other languages
English (en)
Inventor
Ivica Kopriva
Ivanka Jeric
Mirko Hadzija
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
RUDJER BOSKOVIC INSTITUTE
Boskovic Rudjer Institute
Original Assignee
RUDJER BOSKOVIC INSTITUTE
Boskovic Rudjer Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by RUDJER BOSKOVIC INSTITUTE
Priority to PCT/HR2011/000006
Publication of WO2012107786A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2115 Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/20 Identification of molecular entities, parts thereof or of chemical compositions
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • the present invention relates to a system and method for blind extraction of features from a test sample of measurement data, in particular for the purpose of detecting substances that may be indicative of a disease, for biomarker detection, gene analysis, and/or compound activity prediction.
  • sensitivity and specificity of these methods depend on a number of factors, among them the type of biological fluid used for the analysis, the type of spectroscopic method used for the characterization of the sample, and the type of data analysis method employed for disease diagnosis or biomarker detection.
  • two typical problems arise: (i) when pattern recognition methods, such as support vector machines or artificial neural networks, are applied for disease diagnosis, spectroscopic or spectrometric data comprise a large number of features (even up to 30,000), compared to a much smaller number of available samples, quite often fewer than 100 (S. Rogers et al., Lecture Notes in Computer Science 2005, 3686: 183-191).
  • Unless some type of feature selection method (also known as variable selection) is used, this leads to overfitting, i.e. causing a pattern recognition machine (classifier) to generalize (learn) on uninformative features. This decreases sensitivity.
  • biomarker detection from spectra of biological samples is a highly complex problem because in some biological fluids biomarkers can be hidden among several hundred substances with concentrations that can vary by up to a few orders of magnitude (H. Mischak et al., Mass Spectrom Rev. 2009, 28: 703-724).
  • the selected features represent peaks that point to interesting molecules, wherein some of them could possibly be biomarkers.
  • the possible drawback of this type of feature elimination is that selected features (peaks in the spectra) are not associated with any particular molecule, i.e., various combinations of selected peaks can appear in different molecules. Hence, additional knowledge is necessary to identify molecules that can possibly be biomarkers.
  • Another disadvantage of this approach to feature selection is that classifier design and feature selection are part of the same process. Thus, this concept cannot be applied to other types of classifiers.
  • US Patent 7,318,051, entitled Methods for feature selection in a learning machine, presents a method that also exploits sparseness in a feature selection process. This is done by minimizing the number of non-zero parameters of the system through $\ell_0$-norm minimization. Under certain conditions, minimization of $\ell_p$-norms, with $0 < p \le 1$, is equivalent to the $\ell_0$-norm minimization method presented in US Patent 7,318,051, and the method described in the preceding paragraph belongs to the same group of sparseness-based feature selection methods. Thus, selected features are useful for learning classifiers that are robust to overfitting. However, the relation between selected features and molecules that are possible candidates for biomarkers is not straightforward. It is hence expected that the method described in US 7,318,051 is not useful for biomarker detection.
  • US Patent 7,676,442 B2 entitled Selection of Features Predictive of Biological Conditions Using Protein Mass Spectrographic Data presents a method for kernel selection for SV classifiers invariant with respect to noise present in the data.
  • the kernel selection process is related to the preprocessing of the mass spectral data prior to classification. This method works with SVM classifiers only.
  • US Patent Application US 2010/0205124 A1, entitled Support Vector Machine-Based Method for Analysis of Spectral Data. The only difference is that in US Patent 7,676,442 B2 the application target is mass spectrographic data, while in US 2010/0205124 A1 the application target is infrared spectral data.
  • Patent application WO2008037479 entitled Feature selection on proteomic data for identifying biomarkers presents a subset feature selection method on proteomic data, in particular high-throughput mass spectrometry data for disease diagnosis with high sensitivity and specificity as well as for biomarker discovery.
  • the method is developed for prostate and ovarian cancer. It is a three-step process that at each step reduces the number of features and at least maintains the classification accuracy. In each step it combines existing methods for feature extraction, ranking and evaluation through use of the classifiers.
  • the method proposed in WO2008037479 yields high accuracy (100% on ovarian cancer data set from the NCI and over 97% for prostate cancer) by linear SVM and 10-fold cross-validation.
  • In patent application WO 2007015459, entitled Gene set for use in prediction of occurrence of lymph node metastasis of colorectal cancer, a method is proposed that is specifically developed for detecting the presence or absence of lymph node metastasis of colorectal cancer.
  • Patent application WO2008035286, entitled Advanced computer-aided diagnosis of lung nodules, presents feature selection from multi-slice computed tomography images for detection and diagnosis of lung cancer.
  • United States Application US2008033899 A1, entitled Feature selection using support vector machine classifier, presents a method for feature selection that is based on a classifier weight, whereby the feature with the smallest weight is removed from the feature set, and a support vector machine (SVM) classifier is trained again to evaluate the effect of the removed feature on the classification performance.
  • This feature selection scheme is, in principle, equivalent to a recursive feature elimination method. It is intimately related to the SVM classifier that is used as evaluation cost function in a features subset identification and is not applicable with other types of classifiers.
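The weight-based elimination loop described above can be sketched in a few lines. This is only an illustration of the general recursive feature elimination idea, not the method of US2008033899 A1: a plain least-squares linear scorer is swapped in for the SVM, and the function name `recursive_feature_elimination` and its parameters are hypothetical.

```python
import numpy as np

def recursive_feature_elimination(X, y, n_keep):
    """Drop the feature with the smallest absolute weight until n_keep remain.

    Assumption: a least-squares linear model stands in for the SVM that the
    cited application uses; only the elimination loop is illustrated here.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the surviving features only.
        weights, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Remove the feature whose weight has the smallest magnitude.
        del remaining[int(np.argmin(np.abs(weights)))]
    return remaining
```

Because the ranking comes from the weights of one particular model, the selected subset is tied to that model, which is the limitation the text points out.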
  • Patent application US2008025591 A1, entitled Method and system for robust classification strategy for cancer detection from mass spectrometry data, presents a feature selection method for mass spectrometry data.
  • the feature selection principle is based on selection of peaks of the mass spectra that are most suited to discriminate between cancer and non-cancer cases in the training set.
  • Patent application WO2005040739 entitled System and method for spectral analysis presents an independent component analysis (ICA) based approach to separation of spectral data into independent components.
  • ICA cannot be applied to a two-spectral-data model. This is because the model assumes that the spectral data are composed of a linear combination of three sets of features ("supercomponents"). Thus, two spectra would have to be decomposed blindly into three supercomponents, which is a problem that cannot be solved by ICA.
  • ICA requires that the number of spectral data available be equal to or greater than the number of supercomponents.
  • blind source separation in the form of ICA or non-negative matrix factorization is mentioned formally as a method that could possibly be used to extract features from multiple electrode signals. This is considered as an alternative to features that are based on the average power contained in five frequency bands. The features are intended to be used for detection of hypoglycemia.
  • the ICA-based blind source separation approach to feature extraction is not based on any specific data model and is not applicable to spectral data of biological samples.
  • it is the objective of the present invention to provide a method and system for blind extraction of features from a test sample of measurement data, such as spectra or gene expression profiles of biological samples or a set of molecular descriptors of compounds, that allows for a more reliable and more accurate detection of the substances in the sample.
  • a method for blind extraction of features from a test sample of measurement data comprises the steps of pairing said test sample with a first reference sample for said measurement data to obtain a first pairing, said first reference sample pertaining to a first group of features, and pairing said test sample with a second reference sample for said measurement data to obtain a second pairing, said second reference sample pertaining to a second group of features.
  • the method further comprises the steps of decomposing said first pairing into a plurality of N sets and N corresponding weights, N being an integer no smaller than two, wherein each said set corresponds to a group of features, and wherein at least a first set corresponds to said first group of features and at least a second set corresponds to said second group of features.
  • the method further comprises the step of decomposing said second pairing into a corresponding plurality of N sets and N corresponding weights.
  • each said set may correspond to a group of features, wherein at least a first set may correspond to said first group of features and wherein at least a second set may correspond to said second group of features.
  • the test sample may be a mass spectrum obtained from a body fluid of a patient under investigation.
  • the first reference sample may be a corresponding mass spectrum acquired from a healthy subject, so that the first group of features are features associated with a healthy subject.
  • the second reference sample may be a corresponding mass spectrum obtained from a subject having a certain disease, said second group of features thereby corresponding to features characteristic for said disease.
  • said first reference sample may be a control sample
  • said second reference sample may be a case sample, or vice versa.
  • Decomposing said first pairing, which combines said test sample data with data from said first reference sample, into said plurality of N sets and corresponding weights, and decomposing said second pairing, which combines said test sample data with data from said second reference sample, into a corresponding plurality of N sets and N corresponding weights greatly simplifies blind extraction of disease-expressive or healthy-state-expressive substances from a very large number of features. This is because in the decomposition according to the present invention, disease-expressive substances share similar concentrations while healthy-state-expressive substances likewise share similar concentrations.
  • the decomposition according to the present invention allows substances with similar concentration profiles to be grouped, thereby effectively reducing the number of features that need to be extracted for a reliable analysis of the mass spectrum acquired from the patient under investigation. As a result, the method according to the present invention helps to test for a certain disease more reliably and with greater accuracy.
  • test sample and/or first reference sample and/or plurality of sets and/or weights may each have the form of multi-component (row or column) vectors. In practical applications, they may have a large number of components, with the number of components of the test sample and reference samples corresponding to the respective (discretized) sample sizes.
  • the method and system according to the present invention may be applied to a wide range of measurement data, for instance mass spectra, nuclear magnetic resonance (NMR) spectra, infrared spectra, ultraviolet (UV) spectra, Raman spectra, and electronic paramagnetic resonance spectra.
  • test sample and reference samples may be gene expression profiles acquired from a patient under investigation and corresponding control (healthy) and case (disease) groups, respectively.
  • the data may also be sets of collections of molecular descriptors for compound activity prediction, and again the same advantages of a more reliable extraction of features result.
  • the number N of sets and corresponding weights is larger than 2, and in a particularly preferred embodiment N equals 3.
  • the features extracted from said test sample may be reliably grouped into features pertaining to a healthy subject, features pertaining to a subject diagnosed with a particular disease, and neutral features that may not be easily associated with either a healthy or a disease-diagnosed subject.
  • This particular grouping makes it possible to focus any subsequent analysis on those features that can be clearly associated either with a disease or with a healthy state, and to disregard and discard those features that do not provide a clear indication in either direction.
  • the number of features under consideration can be further reduced, thereby simplifying and speeding up the analysis.
  • said step of decomposing comprises a step of factorizing said first pairing and/or factorizing said second pairing into a plurality of N sets and N corresponding weights.
  • Blind extraction is thereby reduced to a pair of matrix factorization problems, the first matrix factorization problem being based on a set of data obtained from said test sample and said first reference sample, and the second matrix factorization problem having the same structure, but being based on a set of data obtained from said test sample and said second reference sample.
  • This makes it possible to split up the task of blind extraction of features into two separate matrix factorization problems, the first being based on the "healthy" reference data and the second on the "disease-diagnosed" reference data. Since disease-expressive substances and healthy-state-expressive substances share similar relative concentrations, splitting up the task of feature extraction into corresponding partial factorization problems allows for a significant simplification and speedup.
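As a minimal sketch of how the two factorization problems could be set up, the snippet below stacks each reference vector with the test vector into a 2 x n pairing matrix. The row order (reference first, test second), the function name `make_pairings`, and the synthetic model `X = A @ S` with N = 3 are assumptions made for illustration; the text does not prescribe a concrete data layout.

```python
import numpy as np

def make_pairings(x_test, x_healthy, x_disease):
    """Return the first and second pairing as 2 x n matrices.

    Assumed layout: row 0 holds the reference sample, row 1 the test sample.
    """
    x_test = np.asarray(x_test, dtype=float)
    x_healthy = np.asarray(x_healthy, dtype=float)
    x_disease = np.asarray(x_disease, dtype=float)
    if x_test.ndim != 1 or not (x_test.shape == x_healthy.shape == x_disease.shape):
        raise ValueError("all three samples must be 1-D vectors of equal length")
    first_pairing = np.vstack([x_healthy, x_test])   # "healthy" problem
    second_pairing = np.vstack([x_disease, x_test])  # "disease-diagnosed" problem
    return first_pairing, second_pairing

# Synthetic illustration of the assumed linear model with N = 3 sets:
# a 2 x n pairing equals (2 x 3 weights) times (3 x n sets).
rng = np.random.default_rng(0)
S = rng.random((3, 8))   # three hypothetical sets (supercomponents), n = 8
A = rng.random((2, 3))   # hypothetical weights, one column per set
X = A @ S                # a pairing that satisfies the model exactly
```

Each pairing has only two rows but is explained by three sets, which is why the factorization is under-determined.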
  • the step of decomposing said first pairing preferably comprises the step of selecting said first set corresponding to said first group of features by determining the set of weights that most resembles said first reference sample, and/or selecting said second set corresponding to said second group of features by determining the set of weights that most resembles said test sample.
  • said step of decomposing said second pairing preferably comprises the step of selecting said first set corresponding to said first group of features by determining the set of weights that most resembles said test sample, and/or selecting said second set corresponding to said second group of features by determining the set of weights that most resembles said second reference sample.
  • the preferred embodiment makes it possible to reliably identify the features representative of a healthy state by comparing the corresponding weight vector obtained from the first decomposition with a healthy reference sample, while identifying the features representative of a disease by comparing the corresponding weight vector with the reference sample obtained from a disease-diagnosed subject.
  • the inventors have found that this provides a fast and yet reliable way of extracting those features representative of a disease, and distinguishing them from those features representative of a healthy state.
  • the degree of resemblance or similarity may be evaluated in terms of an angle between a weight vector and a vector representing said first reference sample, second reference sample, or test sample, respectively.
  • a smaller angle may correspond to a greater degree of resemblance or similarity.
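The angle criterion can be illustrated with a short sketch. It assumes that each weight vector has two components, one per row of a pairing ordered as (reference, test), so that the "axis" of the reference sample is the first coordinate axis and that of the test sample the second. This geometric reading, and the names `angle_to_axis` and `select_sets`, are assumptions rather than definitions taken from the text.

```python
import numpy as np

def angle_to_axis(weight, axis):
    """Angle in radians between a weight vector and an axis vector."""
    w = np.asarray(weight, dtype=float)
    a = np.asarray(axis, dtype=float)
    cosine = np.dot(w, a) / (np.linalg.norm(w) * np.linalg.norm(a))
    # Clip guards against round-off pushing the cosine outside [-1, 1].
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))

def select_sets(A):
    """Pick the sets closest to the reference axis and to the test axis.

    A is a 2 x N weight matrix whose column j belongs to set j.
    Returns (index nearest the reference axis, index nearest the test axis).
    """
    A = np.asarray(A, dtype=float)
    reference_axis = np.array([1.0, 0.0])  # assumed: row 0 is the reference
    test_axis = np.array([0.0, 1.0])       # assumed: row 1 is the test sample
    to_reference = [angle_to_axis(A[:, j], reference_axis) for j in range(A.shape[1])]
    to_test = [angle_to_axis(A[:, j], test_axis) for j in range(A.shape[1])]
    return int(np.argmin(to_reference)), int(np.argmin(to_test))
```

With N = 3, the column that is closest to neither axis would correspond to the neutral set.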
  • the method according to the present invention preferably also comprises a step of training at least one classifier on at least four training sets gathered from said first set extracted from said first pairing, said second set extracted from said first pairing, said first set extracted from said second pairing, and said second set extracted from said second pairing.
  • Said classifier may be applied to indicate whether a selected set of features that is present in the spectrum of a test sample relates to a disease or a healthy state.
  • the classifier may include a pattern recognition machine or a Bayes classifier, a support vector machine classifier, a relevance vector machine classifier, a Gaussian process classifier, a classifier based on Fisher's discriminant, a boosted classifier, a naive Bayes classifier, a K-nearest neighbour classifier, or a neural network classifier.
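As one hedged example of the classifier step, the sketch below pools four labelled training sets and applies a K-nearest-neighbour rule, one of the classifier types listed above. Pooling the four sets into a single labelled collection is an assumption made here for brevity; the text only requires that at least one classifier be trained on them. The names `pool_training_sets` and `knn_predict` are hypothetical.

```python
import numpy as np

def pool_training_sets(healthy_1, disease_1, healthy_2, disease_2):
    """Stack four training sets (rows are feature vectors) and label them."""
    groups = [(healthy_1, "healthy"), (disease_1, "disease"),
              (healthy_2, "healthy"), (disease_2, "disease")]
    X = np.vstack([np.atleast_2d(np.asarray(g, dtype=float)) for g, _ in groups])
    y = np.concatenate([[label] * np.atleast_2d(g).shape[0] for g, label in groups])
    return X, y

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    distances = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    nearest = train_y[np.argsort(distances, kind="stable")[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

A call such as `knn_predict(X, y, extracted_set)` would then return "healthy" or "disease" for a set of features extracted from a new test sample, where `extracted_set` is a placeholder for that vector.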
  • the method according to the present invention further comprises the step of pairing said second set with M-1 corresponding sets obtained from M-1 distinct test samples of measurement data to obtain a third pairing, wherein M is a positive integer no smaller than two, and decomposing said third pairing into a plurality of P sets and P corresponding weights, P being an integer no smaller than 2, wherein each said set corresponds to a substance associated with one of said features.
  • said step of decomposing said third pairing comprises a step of factorizing said third pairing into a plurality of P sets and P corresponding weights. This reduces the determination of substances relating to the disease to a matrix factorization problem.
  • said step of decomposing said first, second, and/or third pairing comprises a blind source separation, in particular an under-determined blind source separation.
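One way to write the third pairing as a matrix factorization is sketched below. The notation is an assumption: the M paired sets are stacked as the rows $u_1, \ldots, u_M$ of a matrix $U$, the P substances are the rows $z_1, \ldots, z_P$ of $Z$, and the P weights are the columns $v_1, \ldots, v_P$ of $V$. The exact equation of the original description is not reproduced in this excerpt.

```latex
% Assumed linear model for the third pairing (n = number of data points per set)
U =
\begin{bmatrix} u_1 \\ \vdots \\ u_M \end{bmatrix}
= V Z
= \sum_{p=1}^{P} v_p \, z_p ,
\qquad
U \in \mathbb{R}^{M \times n},\;
V \in \mathbb{R}^{M \times P},\;
Z \in \mathbb{R}^{P \times n}.
```

Under this reading the problem is under-determined whenever $P > M$, that is, when more substances are sought than paired sets are available.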
  • the present invention relates to a method for blind extraction of groups of features, henceforth called supercomponents, from two sets of acquired spectra, wherein said blind extraction comprises the following steps:
  • $x_1$ represents a reference spectrum acquired from a sample that is obtained from a healthy subject
  • $x_2$ represents a reference spectrum acquired from a sample that is obtained from a disease-diagnosed subject
  • $\{s_{11}, s_{12}, s_{13}\}$ and $\{s_{21}, s_{22}, s_{23}\}$ are row vectors that represent two sets of three supercomponents, and $\{a_{11}, a_{12}, a_{13}\}$ and $\{a_{21}, a_{22}, a_{23}\}$ are column vectors that represent two sets of concentration profiles associated with the related supercomponents;
  • $\{v_1, v_2, \ldots, v_P\}$ represent column vectors of concentration profiles associated with the substances $\{z_1, z_2, \ldots, z_P\}$ of which the disease-expressive supercomponents $\{u_1, u_2, \ldots, u_M\}$ are composed;
  • the latter method may be applied to the detection of disease-specific chemical compounds, such as biomarkers, which may be present in biological fluids such as urine, blood plasma, cerebrospinal fluid, saliva, amniotic fluid, bile, tears, or tissue extracts.
  • Said method may be used for the detection of diabetes, leukaemia, hepatitis C, Alzheimer's disease, HIV infection, coronary artery disease, depression, renal cell carcinoma, carcinoma of the urinary tract, prostate neoplasia III, ovarian cancer, prostate cancer, colon cancer, kidney cancer, Kaposi's sarcoma, benign prostatic hyperplasia, urinary tract obstruction, vasculitis, diabetic nephropathy, IgA nephropathy, membranous glomerulonephritis, kidney stones, focal segmental glomerulosclerosis, Fanconi's syndrome, systemic lupus erythematosus, Henoch-Schoenlein purpura, or undetected kidney disease.
  • Said method may also be used for the diagnosis of the state of organs during transplantation of post transplant lymphoproliferative condition, transplantation of stem cells, transplantation of hematopoietic tissue, kidney transplantation, liver transplantation, or pancreas transplantation.
  • the present invention relates to a method for blind extraction of three groups of features, henceforth supercomponents, from two sets of two collections of gene expression profiles for disease diagnosis and biomarker detection, with the following steps:
  • $x_1$ represents a reference gene expression profile acquired from a sample that is obtained from a healthy subject
  • $x_2$ represents a reference gene expression profile acquired from a sample that is obtained from a disease-diagnosed subject
  • $\{s_{11}, s_{12}, s_{13}\}$ and $\{s_{21}, s_{22}, s_{23}\}$ are row vectors that represent two sets of three supercomponents, and $\{a_{11}, a_{12}, a_{13}\}$ and $\{a_{21}, a_{22}, a_{23}\}$ are column vectors that represent two sets of concentration profiles associated with the related supercomponents;
  • a disease-expressive supercomponent is extracted from the first set by associating it with the concentration profile vector that makes the smallest angle with the axis defined by a spectrum of said test sample $x$
  • a supercomponent that is expressive for a healthy state is extracted from the first set by associating it with the concentration profile vector that makes the smallest angle with the axis defined by a spectrum of said reference sample $x_1$
  • a disease-expressive supercomponent is extracted from the second set by associating it with the concentration profile vector that makes the smallest angle with the axis defined by a spectrum of a reference sample $x_2$
  • a supercomponent that is expressive for a healthy state is extracted
  • said latter method may be employed in the detection of diabetes, leukaemia, hepatitis C, Alzheimer's disease, HIV infection, coronary artery disease, depression, renal cell carcinoma, carcinoma of the urinary tract, prostate neoplasia III, ovarian cancer, prostate cancer, colon cancer, kidney cancer, Kaposi's sarcoma, benign prostatic hyperplasia, urinary tract obstruction, vasculitis, diabetic nephropathy, IgA nephropathy, membranous glomerulonephritis, kidney stones, focal segmental glomerulosclerosis, Fanconi's syndrome, systemic lupus erythematosus, Henoch-Schoenlein purpura, or undetected kidney disease.
  • Said method may be employed to assist in the analysis of organs during transplantation of post transplant lymphoproliferative condition, transplantation of stem cells, transplantation of hematopoietic tissue, kidney transplantation, liver transplantation, or pancreas transplantation.
  • the present invention is directed at a method for blind extraction of three groups of features, henceforth supercomponents, from two sets of two collections of molecular descriptors for compound activity prediction, with the following steps:
  • $x_1$ represents reference molecular descriptors collected from a sample that is obtained from an inactive chemical compound
  • $x_2$ represents reference molecular descriptors collected from a sample that is obtained from an active compound
  • $\{v_1, v_2, \ldots, v_P\}$ represent column vectors of concentration profiles associated with the substances $\{z_1, z_2, \ldots, z_P\}$; and applying a blind source separation algorithm to $\{u_1, u_2, \ldots, u_M\}$ in Equation (3) to extract the substances $\{z_1, z_2, \ldots, z_P\}$ of which the active-state-expressive supercomponents $\{u_1, u_2, \ldots, u_M\}$ are composed.
  • said method may be employed in the detection of a pattern of active state specific molecular descriptors present in a collection of molecular descriptors of a chemical compound.
  • an underdetermined blind source separation method may extract three supercomponents $\{s_{11}, s_{12}, s_{13}\}$ and $\{s_{21}, s_{22}, s_{23}\}$ from the two pairings $\{x_1, x\}$ and $\{x_2, x\}$ by means of a sparse component analysis algorithm and single component points as described in: I. Kopriva, I. Jeric, Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis, Anal. Chem., vol. 82, pp. 1911-1920, 2010, and I. Kopriva, I. Jeric, Method of and system for blind extraction of more pure components than mixtures in 1D and 2D NMR spectroscopy and mass spectrometry combining sparse component analysis and single component points, PCT/HR2009/000028.
  • other methods developed for solving underdetermined blind source separation problems can be used for the same purpose as well.
  • substances $\{z_1, z_2, \ldots, z_P\}$ in Equation (3) may be extracted from the set of disease-expressive supercomponents $\{u_1, u_2, \ldots, u_M\}$ or active-state-expressive supercomponents, respectively, by means of a blind source separation that combines sparse component analysis and single component points as described in: I. Kopriva, I. Jeric, Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis, Anal. Chem., vol. 82, pp. 1911-1920, 2010.
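To give a feel for how an under-determined separation can work with only two mixtures, the sketch below estimates the weight (mixing) directions of a 2 x n non-negative pairing by clustering the angles of its columns. It relies on the simplifying assumption that most data points are dominated by a single set, so that their direction in the plane equals the direction of that set's weight vector. This is a generic direction-clustering heuristic from sparse component analysis, not the algorithm of the cited publications, and the name `estimate_mixing_directions` and its parameters are hypothetical.

```python
import numpy as np

def estimate_mixing_directions(X, n_sets=3, rel_threshold=0.05, n_iter=50):
    """Estimate unit-norm weight vectors of a 2 x n non-negative mixture.

    Returns a 2 x n_sets matrix whose columns are sorted by increasing angle.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)
    keep = norms > rel_threshold * norms.max()     # ignore near-zero columns
    theta = np.arctan2(X[1, keep], X[0, keep])     # direction of each column
    # 1-D k-means on the angles, initialised at evenly spaced quantiles.
    centers = np.quantile(theta, (np.arange(n_sets) + 0.5) / n_sets)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(theta[:, None] - centers[None, :]), axis=1)
        updated = np.array([
            theta[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(n_sets)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    centers = np.sort(centers)
    return np.vstack([np.cos(centers), np.sin(centers)])
```

Once the directions are known, the sets themselves still have to be recovered, for example by a sparsity-constrained fit per data point; that step is omitted here.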
  • said method may be employed to assist in disease diagnoses by means of pattern recognition algorithms to determine whether a suspected patient has the disease.
  • said method may be applied to detect chemical entities directly correlated with a disease that are sufficiently specific to detect said investigated disease confidently - biomarkers.
  • said method may be applied to disease diagnoses and detection of biomarkers present in biological fluids such as urine, blood plasma, cerebrospinal fluid, saliva, amniotic fluid, bile, tears, and others, or tissues or organ extracts.
  • the present invention likewise relates to a system for blind extraction of features from a test sample of measurement data, comprising a data input unit adapted for receiving said test sample of measurement data, a storage unit adapted to store a first reference sample for said measurement data, said first reference sample pertaining to a first group of features, and further adapted to store a second reference sample for said measurement data, said second reference sample pertaining to a second group of features, as well as a data processing unit adapted to pair said test sample with said first reference sample to obtain a first pairing, and to pair said test sample with said second reference sample to obtain a second pairing.
  • Said data processing unit is adapted to decompose said first pairing into a plurality of N sets and N corresponding weights, N being an integer no smaller than 2, wherein each said set corresponds to a group of features, and wherein at least a first set corresponds to said first group of features and at least a second set corresponds to said second group of features.
  • Said data processing unit is further adapted to decompose said second pairing into said plurality of N sets and N corresponding weights.
  • each said set may correspond to a group of features, wherein at least a first set may correspond to said first group of features and wherein at least a second set may correspond to said second group of features.
  • said data processing unit may be adapted to execute a method with some or all of the features as described above.
  • a system for blind extraction of supercomponents from two sets of two spectra may comprise:
  • the present invention likewise relates to a system for blind extraction of three groups of features, henceforth supercomponents, from two sets of two gene expression profiles, said system comprising:
  • processor (7) is adapted to implement code for executing a method according to any one of the previously described embodiments based on the gene expression profile data stored in/on the input storing device or medium (6).
  • the present invention likewise relates to a system for blind extraction of three groups of features, henceforth supercomponents, from two sets of two collections of molecular descriptors for compound activity prediction, said system comprising:
  • processor (11) is adapted to implement code for executing a method according to any of the previously described embodiments based on said collections of molecular descriptors data stored in/on the input storing device or medium (10).
  • the present invention relates to a computer-readable medium having computer-executable instructions stored thereon, which, when executed on a computer, will cause the computer to carry out a method of the present invention according to any of the preceding embodiments.
  • Figure 1 schematically illustrates a block diagram of a system for blind extraction of supercomponents from spectral data and their use for assisting in disease diagnosis and biomarker detection according to a first embodiment of the present invention;
  • Figures 2A and 2B illustrate positions of vectors of concentration profiles in a plane spanned by two spectral data sets, where the second spectral data is acquired from a test sample, and the first spectral data is a reference acquired from a sample obtained from a healthy subject (Figure 2A) or a reference acquired from a sample obtained from a disease-diagnosed subject (Figure 2B);
  • Figure 3 schematically illustrates blind extraction of three supercomponents (denoted symbolically by squares, rhombuses and circles) from two spectra, wherein each supercomponent is further composed of substances;
  • Figure 4 schematically illustrates blind extraction of disease-expressive substances from a plurality of supercomponents extracted from spectra of samples of disease-diagnosed subjects;
  • Figures 5A and 5B show a reference mass spectrum of urine of healthy mice (Figure 5A) and mass spectrum of a sample of diabetes-
  • Figure 9 schematically illustrates a block diagram of a system for blind extraction of supercomponents from a collection of molecular descriptors data and their use for compound activity prediction according to a third embodiment of the present invention.
  • A schematic block diagram of a system for blind extraction of groups of features (henceforth called "supercomponents" in accordance with the standard terminology) from two sets of two spectra by employing methods for underdetermined blind source separation according to an embodiment of the present invention is shown in Figure 1.
  • the system consists of: a spectrometer 1 employed to acquire a spectrum from the biological sample (such as a body fluid or tissue); a storing device 2 employed to store gathered spectral data; a CPU 3 or computer where algorithms are implemented for blind extraction of supercomponents, disease diagnosis based on classification of disease-expressive and healthy-state-expressive supercomponents, blind extraction of substances from a set of disease-expressive supercomponents, and biomarker detection; and an output device 4 used to store and present disease diagnosis results and biomarker candidates.
  • $\{s_{11}, s_{12}, s_{13}\}$ and $\{s_{21}, s_{22}, s_{23}\}$ are row vectors that represent the unknown supercomponents
  • $\{a_{11}, a_{12}, a_{13}\}$ and $\{a_{21}, a_{22}, a_{23}\}$ are column vectors that represent the unknown concentration profiles of the related supercomponents.
  • x represents a spectrum of a test sample. These may be data points indicating a relative abundance for several selected values of mass divided by charge (m/z), as acquired from a mass spectrometer, and may be written in form of a (possibly very large) vector.
  • first set $x_1$ represents a reference spectrum acquired from a sample that is obtained from a healthy subject
  • second set $x_2$ represents a reference spectrum acquired from a sample that is obtained from a disease-diagnosed subject.
  • Both reference spectra can be in the same vector format as the test sample, wherein the components of a vector represent the relative abundance of a mass spectrum for various values of mass divided by charge.
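The pairing of the test spectrum with each reference spectrum can be sketched as follows. This is a minimal illustration, assuming all three spectra are sampled on the same m/z grid; the function names `pair_with_reference` and `build_pairings` are hypothetical and do not appear in the source.

```python
from typing import List, Tuple

Spectrum = List[float]


def pair_with_reference(reference: Spectrum, test: Spectrum) -> List[Spectrum]:
    """Stack a reference spectrum and a test spectrum into a 2 x n pairing."""
    if len(reference) != len(test):
        raise ValueError("reference and test must share the same m/z grid")
    # Row 0 holds the reference, row 1 holds the test sample.
    return [list(reference), list(test)]


def build_pairings(
    x1: Spectrum, x2: Spectrum, x: Spectrum
) -> Tuple[List[Spectrum], List[Spectrum]]:
    """Return the first pairing {x1, x} and the second pairing {x2, x}."""
    return pair_with_reference(x1, x), pair_with_reference(x2, x)
```

Each pairing is then a two-row matrix whose columns are the features to be decomposed.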
  • The supercomponents in Equations (1) and (2) are further composed of groups of components, henceforth called substances, with similar concentrations.
  • disease-expressive and healthy-state-expressive supercomponents may be identified by associating them with the concentration vectors that make the smallest and the largest angles with respect to the axis defined by said reference spectra $x_1$ and $x_2$, respectively.
  • Figure 2A corresponds to the data model of Equation (1), wherein second spectral data is acquired from a test sample, and first spectral data is a reference acquired from a sample obtained from a healthy subject.
  • Figure 2B corresponds to the data model according to Equation (2), wherein second spectral data is acquired from a test sample and first spectral data is a reference acquired from a sample obtained from a disease-diagnosed subject.
  • the supercomponent that corresponds to the concentration vector closest to the reference spectrum is a combination of features that are expressive for a healthy state
  • the supercomponent that corresponds to the concentration vector closest to the test spectrum is a combination of disease-expressive features.
  • a supercomponent that corresponds to the concentration vector closest to the reference spectrum is composed of disease-expressive features
  • a supercomponent that corresponds to the concentration vector closest to the test spectrum is a combination of features that are expressive for a healthy state.
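The angle-based labelling described for Figures 2A and 2B can be sketched as below. This is an illustrative sketch, not the patented implementation: it assumes each concentration vector is given as a non-negative pair (reference coordinate, test coordinate), and the function name `label_supercomponents` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def label_supercomponents(
    concentration_vectors: Sequence[Tuple[float, float]],
    reference_is_healthy: bool,
) -> List[str]:
    """Label each supercomponent from the angle of its concentration vector.

    Angles are measured from the reference axis, so the smallest angle is
    closest to the reference spectrum and the largest is closest to the test.
    """
    angles = [math.atan2(a_test, a_ref) for a_ref, a_test in concentration_vectors]
    order = sorted(range(len(angles)), key=angles.__getitem__)
    closest_to_reference, closest_to_test = order[0], order[-1]

    labels = ["neutral"] * len(angles)
    if reference_is_healthy:  # data model with a healthy reference (Figure 2A)
        labels[closest_to_reference] = "healthy"
        labels[closest_to_test] = "disease"
    else:  # data model with a disease reference (Figure 2B)
        labels[closest_to_reference] = "disease"
        labels[closest_to_test] = "healthy"
    return labels
```

Any vector that is neither the closest to the reference axis nor the closest to the test axis is treated as neutral.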
  • the interpretation of the supercomponents $s_{11}$, $s_{12}$, and $s_{13}$ in Equation (1) is as follows.
  • the first supercomponent $s_{11}$ collects the disease-expressive features
  • the second supercomponent $s_{12}$ collects the healthy-state-expressive features.
  • the third supercomponent $s_{13}$ collects those features that cannot be reliably classified as either disease-expressive or healthy-state-expressive.
  • Equation (1) is a decomposition of the first set of spectra $\{x_1, x\}$ into disease-expressive $s_{11}$, healthy-state-expressive $s_{12}$, and neutral $s_{13}$ supercomponents, with corresponding weights (or concentrations) $\{a_{11}, a_{12}, a_{13}\}$.
  • Equation (2) is a decomposition of the second set of spectra $\{x_2, x\}$ into disease-expressive s 2
  • the features relating to disease-expressive and healthy-state-expressive substances can be reliably identified and extracted. Since disease-expressive substances and healthy-state-expressive substances share similar relative concentrations, substances with similar concentrations can be extracted together as one supercomponent. This greatly simplifies blind extraction of disease-expressive and healthy-state-expressive substances from a very large number of features. It also makes feature extraction robust with respect to biological variability of the sample.
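Equations (1) and (2) are referenced but not reproduced in this passage. Assuming, as the definitions above suggest, that each concentration profile $a_{ik}$ is a $2 \times 1$ column vector and each supercomponent $s_{ik}$ is a $1 \times n$ row vector over the $n$ features, the two decompositions would take roughly the following form. This is a reconstruction from the surrounding text, not a quotation of the original equations.

```latex
\begin{aligned}
X_1 = \begin{bmatrix} x_1 \\ x \end{bmatrix}
  &= \begin{bmatrix} a_{11} & a_{12} & a_{13} \end{bmatrix}
     \begin{bmatrix} s_{11} \\ s_{12} \\ s_{13} \end{bmatrix}
   = \sum_{k=1}^{3} a_{1k}\, s_{1k} \qquad (1) \\[4pt]
X_2 = \begin{bmatrix} x_2 \\ x \end{bmatrix}
  &= \begin{bmatrix} a_{21} & a_{22} & a_{23} \end{bmatrix}
     \begin{bmatrix} s_{21} \\ s_{22} \\ s_{23} \end{bmatrix}
   = \sum_{k=1}^{3} a_{2k}\, s_{2k} \qquad (2)
\end{aligned}
```

Under this reading each pairing is a $2 \times n$ matrix expressed through three sources, which is why the problem is underdetermined.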
  • Blind extraction of three supercomponents from two spectra may be achieved by means of underdetermined blind source separation (uBSS).
  • uBSS: underdetermined blind source separation
  • SCA: sparse component analysis
  • Theoretical foundations of the solution of the uBSS problem employing SCA are laid down in: P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representation," Signal Processing 81, 2353-2362, 2001; Y. Li, A. Cichocki, S. Amari, "Analysis of Sparse Representation and Blind Source Separation," Neural Computation 16, pp. 1193-1234, 2004; Y. Li, S. Amari, A.
  • Any standard clustering procedure (k-means clustering, c-means clustering, fuzzy c-means clustering, spectral clustering, hierarchical clustering, etc.) can be employed to cluster the set of features with single component dominance.
  • Concentration vectors $(a_{11}, a_{12}, a_{13})$ and/or $(a_{21}, a_{22}, a_{23})$ are represented by the cluster centers (centroids). Since in Equations (1) and (2) the number of concentration vectors corresponds to the number of supercomponents and equals 3, the number of clusters is known in advance and equals 3.
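As a rough sketch of this clustering step, the following applies a plain one-dimensional k-means to the directions of the data points in the (reference, test) plane and returns the three centroids as unit-length concentration vectors. It assumes the input already contains only features with single-component dominance and non-negative intensities. The function name and the quantile-based initialisation are illustrative choices, not taken from the source.

```python
import math
from typing import List, Sequence, Tuple


def estimate_concentration_vectors(
    points: Sequence[Tuple[float, float]],
    n_clusters: int = 3,
    n_iter: int = 50,
) -> List[Tuple[float, float]]:
    """Cluster point directions and return centroids as unit vectors."""
    # Direction of each (reference, test) point; all-zero points carry no direction.
    angles = sorted(math.atan2(t, r) for r, t in points if r > 0 or t > 0)
    if len(angles) < n_clusters:
        raise ValueError("not enough non-zero points to form the clusters")

    # Deterministic initialisation at evenly spaced quantiles of the angles.
    centers = [
        angles[(2 * k + 1) * len(angles) // (2 * n_clusters)]
        for k in range(n_clusters)
    ]
    for _ in range(n_iter):
        groups: List[List[float]] = [[] for _ in centers]
        for angle in angles:
            nearest = min(range(n_clusters), key=lambda k: abs(angle - centers[k]))
            groups[nearest].append(angle)
        # Empty clusters keep their previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return [(math.cos(c), math.sin(c)) for c in centers]
```

Because the number of clusters is fixed at 3 in advance, no model selection is needed. Any of the clustering procedures listed above could replace the k-means loop.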
  • supercomponents $\{s_{11}, s_{12}, s_{13}\}$ in Equation (1) and $\{s_{21}, s_{22}, s_{23}\}$ in Equation (2) are obtained by solving a linear system of two equations in three unknowns at each feature. This system is solvable if at each feature at least one supercomponent has zero value, i.e. the vector comprised of the entries of three supercomponents at the particular feature should be sparse.
  • 1, 606-617 may be used to solve this $\ell_1$-regularized least squares problem.
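To illustrate the per-feature system of two equations in three unknowns, the sketch below uses a brute-force stand-in for the interior-point solver cited above. It is an assumption made for illustration, not the method of the source. For a noiseless 2 x 3 system, a minimum $\ell_1$-norm solution can always be found among the solutions that use at most two of the three columns, so the sketch enumerates the column pairs, solves each 2 x 2 system by Cramer's rule, and keeps the candidate with the smallest $\ell_1$ norm. The name `recover_sparse_sources` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def recover_sparse_sources(
    A: Sequence[Sequence[float]],
    b: Sequence[float],
    tol: float = 1e-12,
) -> List[float]:
    """Return a minimum l1-norm solution s of A s = b with at most two non-zeros.

    A has 2 rows (reference, test) and one column per supercomponent;
    b holds the two observed intensities at a single feature.
    """
    n_sources = len(A[0])
    best: List[float] = []
    best_norm = float("inf")
    for i, j in combinations(range(n_sources), 2):
        det = A[0][i] * A[1][j] - A[0][j] * A[1][i]
        if abs(det) < tol:
            continue  # columns i and j are (nearly) collinear
        # Cramer's rule for the 2 x 2 subsystem built from columns i and j.
        s_i = (b[0] * A[1][j] - A[0][j] * b[1]) / det
        s_j = (A[0][i] * b[1] - A[1][i] * b[0]) / det
        norm = abs(s_i) + abs(s_j)
        if norm < best_norm:
            best = [0.0] * n_sources
            best[i], best[j] = s_i, s_j
            best_norm = norm
    if not best:
        raise ValueError("no pair of linearly independent columns found")
    return best
```

Applying the function to every column of a pairing yields the three supercomponents row by row. With noisy data a regularized solver, as in the cited reference, would be the more appropriate choice.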
  • the four supercomponents relating to disease-expressive and healthy-state-expressive features may be used for assisting in disease diagnosis by applying a previously trained pattern recognition algorithm or classifier to them.
  • the supercomponents relating to the neutral features may be discarded at this stage, but they may be stored for later analysis in variants of the preferred embodiment described herein.
  • examples of pattern recognition algorithms are a Bayes classifier, a support vector machine (SVM), a relevance vector machine (RVM), a Gaussian process classifier, a classifier based on Fisher's discriminant, a boosted classifier, a naive Bayes classifier, a K-nearest neighbour classifier, a neural network classifier, etc.
  • a diagnosis method that is robust with respect to biological variability of the sample is obtained by selecting as the output of the four classifiers the one with the highest accuracy achieved in the cross-validation phase.
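The selection rule can be sketched in a few lines. The sketch assumes each of the four classifiers is summarised by its cross-validation accuracy and its prediction for the test sample; the dictionary layout and the name `select_diagnosis` are hypothetical.

```python
from typing import Dict, Tuple


def select_diagnosis(results: Dict[str, Tuple[float, str]]) -> Tuple[str, str]:
    """Pick the classifier with the highest cross-validation accuracy.

    `results` maps a classifier name to (cross-validation accuracy, prediction).
    Returns the chosen classifier name and its prediction.
    """
    if not results:
        raise ValueError("at least one classifier result is required")
    best = max(results, key=lambda name: results[name][0])
    return best, results[best][1]
```

If two classifiers tie on accuracy, `max` keeps the first one in the dictionary's iteration order. A real system may want an explicit tie-breaking rule.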
  • the set of supercomponents composed of the disease-expressive features may be decomposed further into less complex combinations of features that are referred to herein as substances and may be used for identification of disease-specific biomarkers.
  • Figure 3 schematically illustrates blind extraction of three supercomponents (denoted symbolically by squares, rhombuses, and circles, respectively) from two spectra.
  • Each supercomponent is further composed of substances that are expressive for disease (squares), healthy state (circles), or neutral (rhombuses). Different shading and texture within each supercomponent symbolically denote different substances. While the upper mixture corresponds to the first set of spectra $\{x_1, x\}$, the lower mixture corresponds to the second set of spectra $\{x_2, x\}$.
  • Blind extraction of supercomponents according to Equations (1) and (2) yields a decomposition into supercomponents of disease-expressive, healthy-state-expressive, and neutral substances.
  • Figure 4 schematically illustrates the blind extraction of disease-expressive substances from a plurality of supercomponents extracted from spectra of samples of disease-diagnosed subjects according to Equation (3). Different shading and texture symbolically denote different substances.
  • blind extraction according to Equation (3) yields a decomposition of the supercomponents associated with disease-diagnosed samples into less complex combinations of individual substances, which can then be used further for identification of disease-specific biomarkers.
  • Figures 5A and 5B respectively show experimental mass spectra of urine samples of healthy mice and diabetes-diagnosed mice.
  • the mass spectrum shown in Figure 5A can be used as a healthy reference, while the mass spectrum shown in Figure 5B represents the spectrum of a test sample.
  • the mass spectrum shown in Figure 5B can be used as a disease reference, while the mass spectrum shown in Figure 5A then represents the spectrum of a test sample.
  • Figures 6A and 6B show two supercomponents extracted from the two mass spectra shown in Figures 5A and 5B in accordance with the data model of Equation (1), where the mass spectrum shown in Figure 5A serves as a healthy reference and the mass spectrum shown in Figure 5B serves as a test.
  • the supercomponent that contains features which are expressive for a healthy state as shown in Figure 6A corresponds to the concentration vector which is closest to the reference spectrum, as explained above with reference to Figure 2A.
  • the disease-expressive supercomponent is shown in Figure 6B and corresponds to the concentration vector which is closest to the spectrum of a test sample, as likewise explained with reference to Figure 2A.
  • Figures 7A and 7B show substances extracted from nine diabetes-expressive supercomponents.
  • the substances shown in Figure 7A are extracted by means of a sparse component analysis that combines single component points and linear programming, while the substances shown in Figure 7B are extracted by means of $\ell_1$-norm based non-negative matrix underapproximation, as explained above.
  • the invention has been described above with reference to the blind extraction of supercomponents from two sets of spectra, such as mass spectra.
  • the invention is by no means limited to this specific example, and may be employed whenever the blind extraction of features from a test sample of measurement data is desired.
  • the method according to the present invention may likewise be employed for the blind extraction of supercomponents from collections of gene expression profiles for disease diagnosis and biomarker detection, or for blind extraction of supercomponents from sets of collections of molecular descriptors for compound activity prediction.
  • Various further applications will become apparent to those skilled in the art.
  • the method according to the present invention equally applies to all such applications. Only the physical interpretation of the reference and test samples, the groups of features or supercomponents, and the features themselves may be different. For instance, a supercomponent relating to healthy features in the analysis of spectra or gene expression profiles may correspond to an inactive state of a chemical compound, whereas a disease-diagnosed supercomponent may correspond to an active state of a chemical compound.
  • A schematic block diagram of a system for blind extraction of supercomponents from two sets of two gene expression profiles that is defined by Equations (1) and (2) and employing methods for underdetermined blind source separation according to an embodiment of the present invention is shown in Figure 8.
  • the system consists of: a gene chip 5 used to acquire gene expression profiles from a biological sample; a storing device 6 used to store gathered gene expression profile data; a CPU 7 or computer where algorithms are implemented for blind extraction of supercomponents, disease diagnosis based on classification of disease-expressive and healthy-state-expressive supercomponents, blind extraction of substances from a set of disease-expressive supercomponents, and biomarker detection; and an output device 8 used to store and present disease diagnosis results and biomarker candidates.
  • A schematic block diagram of a corresponding system for blind extraction of supercomponents from two sets of two collections of molecular descriptors of chemical compounds that is defined by Equations (1) and (2) and employing methods for underdetermined blind source separation according to an embodiment of the present invention is shown in Figure 9.
  • the system consists of: a collector of molecular descriptors 9 used to acquire molecular descriptor data from the samples of chemical compounds; a storing device 10 used to store gathered molecular descriptor data; a CPU 11 or computer where algorithms are implemented for blind extraction of supercomponents, compound activity prediction based on classification of active-state-expressive and inactive-state-expressive supercomponents, blind extraction of substances from a set of active-state-expressive supercomponents, and detection of a pattern of molecular descriptors that is specific for the active state; and an output device 12 used to store and present activity prediction results and candidates for an active-state-specific pattern of molecular descriptors.
  • the feature selection method proposed in the patent application herein has been successfully tested on the ovarian cancer mass spectra of serum samples.
  • the data were downloaded from: http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp, and have been published in: E.F. Petricoin et al., "Use of proteomic patterns in serum to identify ovarian cancer," The Lancet, 359, 572-577.
  • the data set contains 100 control samples and 100 cancer samples. To extract supercomponents, one control and one cancer sample were used as references. All the classifiers were applied to standardized data having zero mean and unit variance.
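Standardization to zero mean and unit variance can be sketched as follows. The sketch assumes samples are rows and features are columns, and that the statistics are estimated on the training samples and then reused for held-out samples. The source does not state how the statistics were computed, so this split and the function names are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def fit_standardizer(
    samples: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Estimate the per-feature mean and standard deviation."""
    columns = list(zip(*samples))
    means = [fmean(column) for column in columns]
    # A constant feature has zero deviation; scale it by 1 to avoid dividing by 0.
    scales = [pstdev(column) or 1.0 for column in columns]
    return means, scales


def standardize(
    samples: Sequence[Sequence[float]],
    means: Sequence[float],
    scales: Sequence[float],
) -> List[List[float]]:
    """Shift and scale every feature with the supplied statistics."""
    return [
        [(value - m) / s for value, m, s in zip(row, means, scales)]
        for row in samples
    ]
```

Fitting on the training samples only keeps information from the held-out sample out of the scaling step during cross-validation.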
  • a special embodiment of the method for blind extraction of three groups of features from two sets of two spectra has been further tested on colon cancer gene expression profiles data of the tissue samples.
  • the data were downloaded from: http://genomics-pubs.princeton.edu/oncology/affydata/index.html, and have been published in: U. Alon et al., "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays", Proc. Natl. Acad. Sci. USA, vol. 96, pp. 6745-6750, 1999.
  • the data set contains 22 control samples and 40 cancer samples, and each sample contains 2000 gene expression levels.
  • a special embodiment of the method for blind extraction of three groups of features from two sets of two spectra has furthermore been tested on diabetes mass spectra data of urine samples of NOD mice. The data were prepared in the in-house laboratories at the Ruder Boskovic Institute and comprised samples from 10 control and 10 diabetic mice.
  • Mass spectra were acquired by an HPLC-MS triple quadrupole instrument equipped with an autosampler (Agilent Technologies, USA) operating in positive ion mode. To extract supercomponents, one control and one diabetes sample were used as references. Thus, 9 control and 9 diabetes samples remained for cross-validation. Due to the very small sample size, leave-one-out cross-validation was performed in this case.
  • a nonlinear SVM classifier with an RBF kernel achieved 100% sensitivity and 100% specificity.
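Leave-one-out cross-validation and the two reported metrics can be sketched as below. To stay self-contained the sketch uses a nearest-centroid classifier as a stand-in; the source used an SVM with an RBF kernel, which is not reproduced here. All function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Predictor = Callable[[List[Sample], List[int], Sample], int]


def nearest_centroid_predict(
    train_X: List[Sample], train_y: List[int], sample: Sample
) -> int:
    """Stand-in classifier: assign the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]

    def squared_distance(centroid: Sample) -> float:
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def leave_one_out(
    X: List[Sample],
    y: List[int],
    predict: Predictor = nearest_centroid_predict,
    positive: int = 1,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from leave-one-out cross-validation."""
    tp = tn = fp = fn = 0
    for i in range(len(X)):
        # Hold out sample i and train on all remaining samples.
        predicted = predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
        if y[i] == positive:
            tp += predicted == positive
            fn += predicted != positive
        else:
            tn += predicted != positive
            fp += predicted == positive
    # Assumes both classes are present; otherwise a denominator would be zero.
    return tp / (tp + fn), tn / (tn + fp)
```

Sensitivity is the fraction of diseased samples classified as diseased, and specificity is the fraction of control samples classified as controls. With 9 + 9 samples, each fold trains on 17 samples and tests on the remaining one.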

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention relates to a method and a system for blind extraction of features from a test sample of measurement data and from first and second reference samples, which method and system allow substances having similar masses or concentrations to be extracted together as respective supercomponents. This greatly simplifies the blind extraction of disease-expressive substances and healthy-state-expressive substances from a sample of measurement data, and thus makes it possible to effectively reduce the number of features for disease diagnosis. The invention may also be applied to compound activity prediction.
PCT/HR2011/000006 2011-02-09 2011-02-09 System and method for blind extraction of features from measurement data Ceased WO2012107786A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/HR2011/000006 WO2012107786A1 (fr) 2011-02-09 2011-02-09 System and method for blind extraction of features from measurement data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/HR2011/000006 WO2012107786A1 (fr) 2011-02-09 2011-02-09 System and method for blind extraction of features from measurement data

Publications (1)

Publication Number Publication Date
WO2012107786A1 true WO2012107786A1 (fr) 2012-08-16

Family

ID=43971451

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/HR2011/000006 Ceased WO2012107786A1 (fr) 2011-02-09 2011-02-09 System and method for blind extraction of features from measurement data

Country Status (1)

Country Link
WO (1) WO2012107786A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107249434A (zh) * 2015-02-12 2017-10-13 皇家飞利浦有限公司 Robust classifier
WO2018159833A1 (fr) * 2017-03-02 2018-09-07 株式会社ニコン Cell discrimination method, cancer inspection method, measurement device, cancer inspection device, and inspection program
CN109344851A (zh) * 2018-08-01 2019-02-15 迈克医疗电子有限公司 Image classification display method and apparatus, analysis instrument, and storage medium
CN112116952A (zh) * 2020-08-06 2020-12-22 温州大学 Gene selection method using a grey wolf optimization algorithm based on diffusion and chaotic local search
US11593680B2 (en) 2020-07-14 2023-02-28 International Business Machines Corporation Predictive models having decomposable hierarchical layers configured to generate interpretable results
CN118730953A (zh) * 2024-08-30 2024-10-01 陕西德丞电子科技有限公司 Food safety detection system based on spectral analysis

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020095260A1 (en) 2000-11-28 2002-07-18 Surromed, Inc. Methods for efficiently mining broad data sets for biological markers
US20030216868A1 (en) * 1999-09-28 2003-11-20 Affymetrix, Inc. Methods and computer software products for multiple probe gene expression analysis
US20030225526A1 (en) * 2001-11-14 2003-12-04 Golub Todd R. Molecular cancer diagnosis using tumor gene expression signature
WO2005040739A2 (fr) 2003-10-22 2005-05-06 Softmax, Inc. System and method for spectral analysis
WO2007015459A1 (fr) 2005-08-01 2007-02-08 Osaka University Gene set for predicting the occurrence of lymph node metastasis of colorectal cancer
US20070176088A1 (en) 2006-02-02 2007-08-02 Xiangdong Don Li Feature selection in mass spectral data
WO2007145789A2 (fr) 2006-05-18 2007-12-21 John Zhang Method and implementation of reliable consensus feature selection in biomedical discovery
US7318051B2 (en) 2001-05-18 2008-01-08 Health Discovery Corporation Methods for feature selection in a learning machine
US20080025591A1 (en) 2006-07-27 2008-01-31 International Business Machines Corporation Method and system for robust classification strategy for cancer detection from mass spectrometry data
US20080033899A1 (en) 1998-05-01 2008-02-07 Stephen Barnhill Feature selection method using support vector machine classifier
WO2008035286A2 (fr) 2006-09-22 2008-03-27 Koninklijke Philips Electronics N.V. Advanced computer-aided diagnosis of lung nodules
WO2008037479A1 (fr) 2006-09-28 2008-04-03 Private Universität Für Gesundheitswissenschaften Medizinische Informatik Und Technik - Umit Feature selection on proteomic data for identifying candidate biomarkers
US7457048B2 (en) 2006-10-24 2008-11-25 Samsung Techwin Co., Ltd. High magnification zoom lens system
WO2009067655A2 (fr) 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
US20090287107A1 (en) 2006-06-15 2009-11-19 Henning Beck-Nielsen Analysis of eeg signals to detect hypoglycaemia
US20100002929A1 (en) 2004-05-13 2010-01-07 The Charles Stark Draper Laboratory, Inc. Image-based methods for measuring global nuclear patterns as epigenetic markers of cell differentiation
US7676442B2 (en) 1998-05-01 2010-03-09 Health Discovery Corporation Selection of features predictive of biological conditions using protein mass spectrographic data
US20100205124A1 (en) 2000-08-07 2010-08-12 Health Discovery Corporation Support vector machine-based method for analysis of spectral data

Non-Patent Citations (22)

* Cited by examiner, † Cited by third party
Title
"Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing)", 2006, SPRINGER
I. KOPRIVA; I. JERIC: "Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis", ANAL. CHEM., vol. 82, 2010, pages 1911 - 1920, XP002571604, DOI: doi:10.1021/ac902640y
A. HYVARINEN; J. KARHUNEN; E. OJA: "Independent Component Analysis", 2001, JOHN WILEY
BYUNG-SOO KIM ET AL: "Prostate cancer classification processor using DNA computing technique", IEICE ELECTRONICS EXPRESS IEICE JAPAN, vol. 6, no. 10, 2009, pages 581 - 586, XP002660799, ISSN: 1349-2543 *
COMPUTER SCIENCE, vol. 3686, 2005, pages 183 - 191
E.F. PETRICOIN: "Use of proteomic patterns in serum to identify ovarian cancer", THE LANCET, vol. 359, pages 572 - 577
H. MISCHAK ET AL., MASS SPECTROM REV., vol. 28, 2009, pages 703 - 724
I. KOPRIVA; I. JERIC: "Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis", ANAL. CHEM., vol. 82, 2010, pages 1911 - 1920, XP002571604, DOI: doi:10.1021/ac902640y
I. KOPRIVA; I. JERIC: "Blind separation of analytes in nuclear magnetic resonance spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis", ANAL. CHEM., vol. 82, 2010, pages 1911 - 1920, XP002571604, DOI: doi:10.1021/ac902640y
I. KOPRIVA; I. JERIC: "Method of and system for blind extraction of more pure components than mixtures", 1D AND 2D NMR
KIM S.J. ET AL.: "An interior-point method for large-scale ℓ1-regularized least squares", IEEE J SEL. TOPICS SIGNAL PROC., vol. 1, 2007, pages 606 - 617, XP011199168, DOI: doi:10.1109/JSTSP.2007.910971
P. BOFILL; M. ZIBULEVSKY: "Underdetermined blind source separation using sparse representation", SIGNAL PROCESSING, vol. 81, 2001, pages 2353 - 2362
P. GEORGIEV; F. THEIS; A. CICHOCKI: "Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures", IEEE TRANS. ON NEURAL NETWORKS, vol. 16, no. 4, 2005, pages 992 - 996, XP011135679, DOI: doi:10.1109/TNN.2005.849840
PETRICOIN III ET AL.: "Serum Proteomic Patterns for Detection of Prostate Cancer", JOURNAL OF THE NATIONAL CANCER INSTITUTE, vol. 94, 2002, pages 1576 - 1578, XP002975918
R. MADSEN ET AL., ANAL. CHIM. ACTA, vol. 659, 2010, pages 23 - 33
S. ROGERS ET AL., LECTURE NOTES IN COMPUTER SCIENCE, vol. 3686, 2005, pages 183 - 191
TREVOR HASTIE; ROBERT TIBSHIRANI; JEROME FRIEDMAN: "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", 2009, SPRINGER SERIES IN STATISTICS, article "High-Dimensional Problems: p ≫ N"
TROPP, J.A.; WRIGHT, S.J.: "Computational Methods for Sparse Solution of Linear Inverse Problems", PROC. OF THE IEEE, vol. 98, 2010, pages 948 - 958, XP011308338
U. ALON ET AL.: "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays", PROC. NATL. ACAD. SCI. USA, vol. 96, 1999, pages 6745 - 6750
Y. LI; A. CICHOCKI; S. AMARI: "Analysis of Sparse Representation and Blind Source Separation", NEURAL COMPUTATION, vol. 16, 2004, pages 1193 - 1234, XP008075412, DOI: doi:10.1162/089976604773717586
Y. LI; S. AMARI; A. CICHOCKI; D.W.C. HO; S. XIE: "Underdetermined Blind Source Separation Based on Sparse Representation", IEEE TRANS. ON SIGNAL PROCESSING, vol. 54, no. 2, 2006, pages 423 - 437
Z. LIU ET AL., IEEE/ACM TRANS. COMPUT. BIOLOGY AND BIOINFORMATICS, vol. 7, 2010, pages 100 - 107

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107249434A (zh) * 2015-02-12 2017-10-13 皇家飞利浦有限公司 Robust classifier
CN107249434B (zh) * 2015-02-12 2020-12-18 皇家飞利浦有限公司 Robust classifier
WO2018159833A1 (fr) * 2017-03-02 2018-09-07 株式会社ニコン Cell discrimination method, cancer inspection method, measurement device, cancer inspection device, and inspection program
CN109344851A (zh) * 2018-08-01 2019-02-15 迈克医疗电子有限公司 Image classification display method and apparatus, analysis instrument, and storage medium
CN109344851B (zh) * 2018-08-01 2020-11-10 迈克医疗电子有限公司 Image classification display method and apparatus, analysis instrument, and storage medium
US11593680B2 (en) 2020-07-14 2023-02-28 International Business Machines Corporation Predictive models having decomposable hierarchical layers configured to generate interpretable results
CN112116952A (zh) * 2020-08-06 2020-12-22 温州大学 基于扩散及混沌局部搜索的灰狼优化算法的基因选择方法
CN112116952B (zh) * 2020-08-06 2024-02-09 温州大学 基于扩散及混沌局部搜索的灰狼优化算法的基因选择方法
CN118730953A (zh) * 2024-08-30 2024-10-01 陕西德丞电子科技有限公司 一种基于光谱分析的食品安全检测系统

Similar Documents

Publication Publication Date Title
Hu et al. Emerging computational methods in mass spectrometry imaging
Xi et al. Statistical analysis and modeling of mass spectrometry-based metabolomics data
Smolinska et al. Current breathomics—a review on data pre-processing techniques and machine learning in metabolomics breath analysis
Listgarten et al. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry
Labory et al. Benchmarking feature selection and feature extraction methods to improve the performances of machine-learning algorithms for patient classification using metabolomics biomedical data
Hilario et al. Processing and classification of protein mass spectra
Carvalho et al. Identifying differences in protein expression levels by spectral counting and feature selection
Ledesma et al. Advancements within modern machine learning methodology: impacts and prospects in biomarker discovery
Liu et al. Feature selection method based on support vector machine and shape analysis for high-throughput medical data
Seddiki et al. Early diagnosis: end-to-end CNN–LSTM models for mass spectrometry data classification
Kusonmano et al. Informatics for metabolomics
Tian et al. Towards enhanced metabolomic data analysis of mass spectrometry image: Multivariate Curve Resolution and Machine Learning
Huang et al. A new strategy for analyzing time-series data using dynamic networks: identifying prospective biomarkers of hepatocellular carcinoma
WO2012107786A1 (en) System and method for blind extraction of features from measurement data
Mahmoud et al. An enhanced machine learning approach with stacking ensemble learner for accurate liver cancer diagnosis using feature selection and gene expression data
Gong et al. Evaluating machine learning methods of analyzing multiclass metabolomics
Ye et al. Multi-omics clustering for cancer subtyping based on latent subspace learning
Bowling et al. Analyzing the metabolome
Dutkowski et al. On consensus biomarker selection
CN110890130A (en) Method for identifying biological network module markers based on multi-type relationships
Koo et al. Analysis of metabolomic profiling data acquired on GC–MS
Skawinski et al. A comprehensive guide to volatolomics data analysis
Phan et al. Functional genomics and proteomics in the clinical neurosciences: data mining and bioinformatics
Wu et al. Applications of gene pair methods in clinical research: advancing precision medicine
Abdel Samee et al. Detection of biomarkers for hepatocellular carcinoma using a hybrid univariate gene selection methods

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 11710853

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 11710853

Country of ref document: EP

Kind code of ref document: A1