WO2023150729A2 - Systems and methods for identifying, searching, and comparing glycomolecules and analyzing the results thereof - Google Patents
- Publication number
- WO2023150729A2 (PCT/US2023/062001)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- glycan
- format
- search
- glycomolecule
- data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/20—Heterogeneous data integration
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Definitions
- the instant disclosure is directed to a system for managing, searching, curating, comparing and/or analyzing sample data and/or search results associated with glycans, glycopeptides, or other glycomolecules (e.g., glycopeptides, glycoDNA, glycoRNA, glycolipids) that are predicted to be present in a sample.
- glycomolecules include glycans and glycoconjugates (molecules that are conjugated to glycans, e.g., glycopeptides, glycoproteins, glycolipids, glycoRNA, glycoDNA).
- There are currently a number of glycomolecule search engines.
- Search engines such as BYONIC, PGLYCO3, MSFRAGGER-GLYCO, METAMORPHEUS (also known as O-PAIR), STRUCGP, and other similar search engines can be used to match a mass spectrometry signature to known glycomolecule signatures.
- the identification process used by these search engines may involve inputting mass spectrometry data, a glycan database, and/or a conjugate database to perform a search.
- a researcher may input, into a glycopeptide search engine, a mass spectrometry data file (e.g., an mzML file), a glycan database file, and a protein/peptide database file.
- the same search conducted on two different glycopeptide search engines may output search results in two different formats, each specific to the respective platform.
- Many research applications can involve running searches using different glycan databases (the inputs) and/or comparing glycomolecule search results (the outputs) from different search engines.
- this comparison can be exceedingly difficult and time consuming.
- the same set of raw input data cannot be searched across multiple search engines.
- the outputs of different search engines cannot be directly compared.
- because each search engine uses a different “confidence score” to rank the confidence and credibility of each result returned from a glycomolecule search, it is difficult to compare the search results from these different search engines.
- different search engines rank their identifications/annotations using different strategies, such that even the same set of raw data can have different glycopeptide annotations based on the search engines used, and these annotations can even be mutually exclusive. This leads to challenges when attempting to search multiple glycan databases and compare the results across the different databases.
- a glycan data management system for handling data associated with glycans, the system comprising: a non-transitory memory comprising instructions for converting among a plurality of platform-specific glycan formats and a universal glycan format by applying at least one of a plurality of format conversion rule sets, the plurality of format conversion rule sets comprising: a first rule set configured to convert glycan representations of a first platform-specific glycan format to and from the universal glycan format; and a second rule set configured to convert glycan representations of a second platform-specific glycan format to and from the universal glycan format; and a processor configured to: access an input glycan data comprising one or more glycan representations in the first platform-specific glycan format, the second platform-specific glycan format, or the universal glycan format; and execute the instructions to convert the input glycan data between: the first platform-specific glycan
- the processor is configured to execute the instructions to convert the input glycan data from the first platform-specific glycan format to the second platform-specific glycan format by applying: (1) the first rule set to convert the input glycan data from the first platform-specific glycan format to the universal glycan format; and (2) the second rule set to convert the input glycan data from the universal glycan format to the second platform-specific glycan format.
- the non-transitory memory further comprises instructions for obtaining the input glycan data from a first glycan database in the first platform-specific glycan format, and wherein execution of the instructions by the processor generates a second glycan database comprising glycan data in the second platform-specific glycan format.
- the first platform-specific glycan format is compatible with a first glycomolecule search engine.
- the glycomolecule search engine comprises a glycopeptide search engine.
- the second platform-specific glycan format is compatible with a second glycomolecule search engine.
- the universal glycan format is not compatible with the first or the second glycomolecule search engine.
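The two-step conversion described in the items above (first platform format to universal format, then universal format to second platform format) can be sketched as a small registry of rule sets. This is a minimal illustration under stated assumptions, not the disclosed implementation: the format names `fmt_a` and `fmt_b`, their string syntaxes, and the names `RuleSet`, `RULE_SETS`, and `convert` are all hypothetical, and the universal format is modeled here as a plain dictionary of monosaccharide counts.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

Composition = Dict[str, int]  # hypothetical universal form: monosaccharide -> count


@dataclass(frozen=True)
class RuleSet:
    """One rule set: converts a platform format to and from the universal form."""
    to_universal: Callable[[str], Composition]
    from_universal: Callable[[Composition], str]


# Hypothetical format "fmt_a", e.g. "HexNAc(2)Hex(5)Fuc(1)"
def a_to_universal(text: str) -> Composition:
    return {name: int(n) for name, n in re.findall(r"([A-Za-z0-9]+)\((\d+)\)", text)}


def a_from_universal(comp: Composition) -> str:
    return "".join(f"{name}({n})" for name, n in comp.items() if n > 0)


# Hypothetical format "fmt_b", e.g. "Fuc=1;Hex=5;HexNAc=2" (sorted by name)
def b_to_universal(text: str) -> Composition:
    pairs = (item.split("=") for item in text.split(";") if item)
    return {name: int(n) for name, n in pairs}


def b_from_universal(comp: Composition) -> str:
    return ";".join(f"{name}={n}" for name, n in sorted(comp.items()) if n > 0)


RULE_SETS: Dict[str, RuleSet] = {
    "fmt_a": RuleSet(a_to_universal, a_from_universal),
    "fmt_b": RuleSet(b_to_universal, b_from_universal),
}


def convert(text: str, source: str, target: str) -> str:
    """Convert between two platform formats by passing through the universal form."""
    universal = RULE_SETS[source].to_universal(text)
    return RULE_SETS[target].from_universal(universal)


print(convert("HexNAc(2)Hex(5)Fuc(1)", "fmt_a", "fmt_b"))
```

With this hub-style layout, supporting an additional platform format only requires registering one more `RuleSet` entry; no pairwise converters between platform formats are needed.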
- the universal glycan format comprises compositional information only.
- the universal glycan format comprises a single-letter representation for different monosaccharides.
- the universal glycan format comprises a zero for a monosaccharide that is not present in the glycan data.
- the universal glycan format comprises: a first number l corresponding to an amount of hexose in the glycan data; a second number m corresponding to an amount of N-acetylhexosamine in the glycan data; a third number n corresponding to an amount of fucose in the glycan data; and a fourth number o corresponding to an amount of N-acetylneuraminic acid in the glycan data, wherein l, m, n, and o are each an integer 0 or greater.
- the universal glycan format further comprises a fifth number p corresponding to an amount of N-glycolylneuraminic acid (Neu5Gc), wherein p is an integer 0 or greater.
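One way to picture a composition-only universal format with single-letter codes, explicit zeros, and the five counts described above is the sketch below. The specific letters (`H`, `N`, `F`, `A`, `G`), the string layout, and the function names `encode` and `decode` are assumptions made for illustration; the source does not specify them here.

```python
import re
from typing import Dict

# Assumed single-letter codes, in the order the five counts are listed above:
# hexose, N-acetylhexosamine, fucose, N-acetylneuraminic acid,
# N-glycolylneuraminic acid.
LETTERS = {"Hex": "H", "HexNAc": "N", "Fuc": "F", "NeuAc": "A", "NeuGc": "G"}


def encode(counts: Dict[str, int]) -> str:
    """Write every monosaccharide, using an explicit 0 for any that is absent."""
    unknown = set(counts) - set(LETTERS)
    if unknown:
        raise ValueError(f"unsupported monosaccharides: {sorted(unknown)}")
    if any(value < 0 for value in counts.values()):
        raise ValueError("counts must be integers 0 or greater")
    return "".join(f"{letter}{counts.get(name, 0)}" for name, letter in LETTERS.items())


def decode(text: str) -> Dict[str, int]:
    """Parse the assumed layout back into a name -> count mapping."""
    pattern = "".join(f"{letter}(\\d+)" for letter in LETTERS.values())
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"not in the assumed universal layout: {text!r}")
    return {name: int(value) for name, value in zip(LETTERS, match.groups())}


print(encode({"Hex": 5, "HexNAc": 4, "Fuc": 1, "NeuAc": 2}))
```

Because every position is always written, two strings can be compared directly without first checking which monosaccharides were omitted.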
- At least one of the first and second platform-specific glycan formats comprises glycan structural information.
- a method of expanding the cross-compatibility of a glycan database comprising: providing to the glycan data management system of any one of claims 1-11 a glycan database comprising input glycan data comprising glycan representations in a first platform-specific glycan format; and adding to the plurality of format conversion rule sets a third rule set configured to convert glycan representations of a third platform-specific glycan format to and from the universal glycan format, wherein none of the other rule sets among the plurality of format conversion rule sets is configured to convert glycan representations of the third platform-specific glycan format to and from the universal glycan format, wherein the third platform-specific glycan format is compatible with a third glycomolecule search engine and wherein the first platform-specific glycan format is not compatible with the third glycomolecule search engine.
- the method further includes causing the processor to access the input glycan data of the glycan database and executing the
- a method of converting a glycan data format comprising: receiving first glycan data comprising one or more glycan representations in a first platform-specific glycan format; and converting the first glycan data from the first platform-specific glycan format to a second platform-specific glycan format via a universal glycan format, wherein the second platform-specific glycan format, but not the first platform-specific glycan format, is compatible with a first glycomolecule search engine.
- the first glycomolecule search engine comprises a first glycopeptide search engine.
- the first platform-specific glycan format is compatible with a second glycomolecule search engine.
- the second glycomolecule search engine comprises a second glycopeptide search engine.
- converting the first glycan data comprises: determining that the first glycan data comprises the one or more glycan representations in the first platform-specific glycan format; and applying to the first glycan data: (1) a first rule set for converting the first platform-specific glycan format to the universal glycan format; and (2) a second rule set for converting the universal glycan data format to the second platform-specific glycan format.
- the universal glycan format comprises compositional information only. In some embodiments, the universal glycan format comprises a single-letter representation for different monosaccharides. In some embodiments, the universal glycan format comprises a zero for a monosaccharide that is not present in the glycan data.
- the universal glycan format comprises: a first number l corresponding to an amount of hexose in the glycan data; a second number m corresponding to an amount of N-acetylhexosamine in the glycan data; a third number n corresponding to an amount of fucose in the glycan data; and a fourth number o corresponding to an amount of N-acetylneuraminic acid in the glycan data, wherein l, m, n, and o are each an integer 0 or greater.
- the universal glycan data format further comprises a fifth number p corresponding to an amount of N-glycolylneuraminic acid, wherein p is an integer 0 or greater.
- At least one of the first and second platform-specific glycan formats comprises structural information.
- a glycan data management system for handling data associated with glycans, the system comprising: a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any of the methods of the present disclosure.
- a method of converting a glycan format comprising: receiving glycan data comprising a glycan representation in a first format, the first format comprising compositional information; and converting the glycan representation to a second format, the second format comprising compositional information, wherein either: (1) the first format further comprises glycan structural information, and the second format comprises compositional information only, and wherein converting the glycan representation comprises enumerating an amount of each of the monosaccharides in the glycan; or (2) the first format comprises compositional information only, and the second format further comprises glycan structural information, and wherein converting the representation comprises: receiving information about a source of the glycan data; and generating a list of potential glycan structures based on the compositional information and the source of the glycan data.
- the source comprises human urine, human stool, human blood, or human serum.
- the source comprises a yeast, a plant,
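The two conversion directions in the method above can be sketched as follows: going from a structural representation to a composition is a matter of counting residues, while going the other way requires extra knowledge, here a per-source library of structures. Everything concrete in this sketch is an assumption made for illustration: the parenthesized notation (one letter per residue), the source names `source_a` and `source_b`, the placeholder library contents, and the function names.

```python
from collections import Counter
from typing import Dict, List


def enumerate_composition(structure: str) -> Dict[str, int]:
    """Count each residue letter in a structural string, ignoring the brackets."""
    return dict(Counter(ch for ch in structure if ch.isalpha()))


# Placeholder library (made-up entries): structures assumed to be plausible
# for each sample source. A real system would load curated data instead.
STRUCTURES_BY_SOURCE: Dict[str, List[str]] = {
    "source_a": ["(N(N(H(H)(H))))", "(N(F)(N(H(H)(H))))"],
    "source_b": ["(N(N(H(H(H)))))", "(N(N(H(H)(H))))"],
}


def candidate_structures(composition: Dict[str, int], source: str) -> List[str]:
    """List library structures for this source whose residue counts match."""
    return [
        structure
        for structure in STRUCTURES_BY_SOURCE.get(source, [])
        if enumerate_composition(structure) == composition
    ]


print(enumerate_composition("(N(F)(N(H(H)(H))))"))
print(candidate_structures({"N": 2, "H": 3}, "source_b"))
```

The asymmetry is the point of the example: a single composition can correspond to several structures, so the reverse direction returns a list of candidates narrowed by the sample source.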
- a glycomolecule data analysis system for comparing glycomolecule search outputs from two or more glycomolecule search engines, the system comprising: a processor; and a non-transitory memory configured to cause the processor to: access a first search output associated with a first glycomolecule search engine and a second search output associated with a second glycomolecule search engine, wherein the first search output comprises first glycomolecular data for each of one or more identified first glycomolecules in a first format, and wherein the second search output comprises second glycomolecular data for each of one or more identified second glycomolecules in a second format; convert one or more of the first and second glycomolecular data to a common format, wherein the common format comprises the first format, the second format, or a universal glycomolecular data format; and perform a comparison between the first and second glycomolecular data in the common format.
- one or more of the first and second glycomolecular data is converted to the common format by: extracting one or more parameters associated with each of the one or more identified first and second glycomolecules in the first and second glycomolecular data, respectively, wherein the one or more parameters are in a search engine-specific format and comprise one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score; and converting the one or more parameters associated with each of the one or more identified first and second glycomolecules to the common format.
- the common format is the universal glycomolecular data format.
- the first and second search outputs correspond to the same glycomolecule mass spectrometry data run on the first and second glycomolecule search engines.
- the non-transitory memory is configured to cause the processor to: access a third search output associated with a third glycomolecule search engine, wherein the third search output comprises third glycomolecular data for each of one or more identified third glycomolecules in a third format; convert one or more of the first, second, and third glycomolecular data to a common format, wherein the common format comprises the first format, the second format, the third format, or the universal glycomolecular data format; and perform a comparison between any two or more of the first, second, and third glycomolecular data in the common format.
- the first, second, and third search outputs correspond to the same glycomolecule mass spectrometry data run on the first, second, and third glycomolecule search engines, respectively.
- the third glycomolecular data is converted to the universal glycomolecular data format by extracting one or more parameters associated with each of the one or more identified third glycomolecules in the third glycomolecular data, wherein the one or more parameters comprise one or more of glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score; and converting the one or more parameters associated with each of the one or more identified third glycomolecules to the common format.
- a glycomolecule data analysis system for comparing glycomolecule search outputs from two or more glycomolecule search engines, the system comprising: a processor; and a non-transitory memory configured to cause the processor to: access two or more search outputs from two or more glycomolecule search engines, each search output (1) being associated with one of the two or more glycomolecule search engines and (2) comprising glycomolecular data in a search engine-specific format for each of one or more glycomolecules identified by the glycomolecule search engine; extract one or more parameters associated with each of the one or more identified glycomolecules in the glycomolecular data of each of the two or more search outputs, wherein the one or more parameters comprise an identifier; and perform a comparison between the glycomolecular data from the two or more search outputs based on the one or more extracted parameters.
- the comparison is performed by identifying glycomolecules associated with the same identifier in the glycomolecular data from the two or more search outputs.
- the one or more parameters further comprise one or more of: glycan data, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
- the two or more search outputs correspond to the same glycomolecule mass spectrometry data run on the two or more glycomolecule search engines, respectively.
- the non-transitory memory is further configured to cause the processor to convert glycomolecular data of one or more of the two or more search outputs to a common format, wherein the common format comprises one of the search engine-specific formats or a universal glycomolecular data format.
- the common format is the universal glycomolecular data format.
- the universal glycomolecular data format comprises identifying information for the glycomolecule search engine.
- the universal glycomolecular data format comprises, for each identified glycomolecule, one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
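The extraction of per-glycomolecule parameters into a universal record can be sketched as a column mapping per search engine. This is an illustrative sketch only: the engine names `engine_a` and `engine_b`, every column header in `COLUMN_MAPS`, and the `UniversalRecord` and `to_universal` names are hypothetical and do not describe the output of any real search engine. Rows are assumed to arrive as string dictionaries, for example from `csv.DictReader`.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class UniversalRecord:
    engine: str            # identifying information for the search engine
    identifier: str        # assumed here to be shared across engines, e.g. a scan number
    glycan: str
    peptide: str
    protein_id: str
    retention_time: float
    mz: float
    charge: int
    score: float           # engine-specific confidence score, not yet normalized


# Hypothetical column headers for two made-up engines.
COLUMN_MAPS: Dict[str, Dict[str, str]] = {
    "engine_a": {"identifier": "Scan", "glycan": "Glycan", "peptide": "Peptide",
                 "protein_id": "Protein", "retention_time": "RT", "mz": "Mz",
                 "charge": "Z", "score": "Score"},
    "engine_b": {"identifier": "scan_id", "glycan": "composition", "peptide": "sequence",
                 "protein_id": "accession", "retention_time": "rt_min",
                 "mz": "precursor_mz", "charge": "charge", "score": "q_score"},
}


def to_universal(engine: str, rows: List[Mapping[str, str]]) -> List[UniversalRecord]:
    """Extract the mapped columns from each row and coerce them to common types."""
    cols = COLUMN_MAPS[engine]
    return [
        UniversalRecord(
            engine=engine,
            identifier=str(row[cols["identifier"]]),
            glycan=str(row[cols["glycan"]]),
            peptide=str(row[cols["peptide"]]),
            protein_id=str(row[cols["protein_id"]]),
            retention_time=float(row[cols["retention_time"]]),
            mz=float(row[cols["mz"]]),
            charge=int(row[cols["charge"]]),
            score=float(row[cols["score"]]),
        )
        for row in rows
    ]
```

Keeping the engine name on each record preserves the provenance of every identification after the outputs are pooled for comparison.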
- the comparison comprises identifying an overlap between any two or more of the glycomolecular data.
- the overlap comprises a set of glycomolecules that are identified in any two or more of the glycomolecular data.
- the overlap comprises one or more of peptide sequences, protein IDs, retention times, m/z values, or charge numbers.
- the comparison comprises identifying a difference between any two or more of the glycomolecular data.
- the difference comprises one or more sets of glycomolecules that are identified in one of the any two or more of the glycomolecular data as being different from the other one of the any two or more of the glycomolecular data.
- the difference comprises one or more of peptide sequences, protein IDs, retention times, m/z values, or charge numbers.
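Identifying overlap and difference between two search outputs reduces to set operations once both outputs are keyed by a shared identifier. The sketch below assumes each output has already been reduced to a mapping from identifier to glycan annotation; the function name `compare_outputs`, the result keys, and the sample identifiers are hypothetical.

```python
from typing import Dict, Set


def compare_outputs(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, Set[str]]:
    """Split identifiers into agreeing, disagreeing, and engine-exclusive groups."""
    shared = first.keys() & second.keys()
    agree = {ident for ident in shared if first[ident] == second[ident]}
    return {
        "overlap": agree,                              # same identifier, same annotation
        "disagree": shared - agree,                    # same identifier, different annotation
        "only_first": first.keys() - second.keys(),    # reported by the first engine only
        "only_second": second.keys() - first.keys(),   # reported by the second engine only
    }


# Made-up example data: identifier -> glycan annotation.
engine_1 = {"scan1": "H5N4F1", "scan2": "H5N4A1", "scan3": "H6N5"}
engine_2 = {"scan1": "H5N4F1", "scan2": "H5N4F2", "scan4": "H3N2"}
print(compare_outputs(engine_1, engine_2))
```

The same pattern applies to other extracted parameters, such as peptide sequences or protein IDs, by changing what the mapping values hold.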
- the glycomolecule is a glycopeptide, glycoprotein, glycolipid, glycoRNA, or glycoDNA.
- the glycomolecule search engine is a glycopeptide search engine, glycoprotein search engine, glycolipid search engine, glycoRNA search engine, or glycoDNA search engine.
- Also provided herein is a method of comparing glycomolecule search outputs from two or more glycomolecule search engines, comprising: receiving a first search output associated with a first glycomolecule search engine and a second search output associated with a second glycomolecule search engine, wherein the first search output comprises first glycomolecular data for each of one or more identified first glycomolecules in a first format, and wherein the second search output comprises second glycomolecular data for each of one or more identified second glycomolecules in a second format; converting one or more of the first and second glycomolecular data to a common format, wherein the common format comprises the first format, the second format, or a universal glycomolecular data format; and comparing the first and second glycomolecular data in the common format.
- the first and second glycomolecular data comprise one or more parameters associated with each of the one or more identified first and second glycomolecules in the first and second glycomolecular data, respectively, wherein the one or more parameters are in a search engine-specific format and comprise one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score, and wherein converting the one or more of the first and second glycomolecular data to the common format comprises converting the one or more parameters to the common format.
- the first and second search outputs correspond to the same glycomolecule mass spectrometry data run on the first and second glycomolecule search engines.
- the method further comprises providing a data set for running on each of the two or more glycomolecule search engines, wherein the data sets comprise the same glycomolecule mass spectrometry data.
- the method further comprises: generating the glycomolecule mass spectrometry data from a biological sample; and providing the data set comprising the generated glycomolecule mass spectrometry data.
- the biological sample is from human urine, human stool, human blood, or human serum.
- the two or more glycomolecule search engines are two or more glycopeptide search engines, the method further comprising providing a data set for running on each of the two or more glycopeptide search engines, wherein the data sets comprise the same glycopeptide mass spectrometry data.
- each data set comprises a glycan database comprising glycan data in a glycan format that is compatible with a corresponding glycopeptide search engine of the two or more glycopeptide search engines.
- providing the data set further comprises: providing the glycan database in a universal glycan format; and converting the glycan database to a glycan format compatible with one of the two or more glycopeptide search engines.
- providing the data set further comprises: receiving the glycan database in a glycan format compatible with one of the two or more glycopeptide search engines; and converting the glycan database to the universal glycan format.
- the method further comprises: receiving a third search output associated with a third glycomolecule search engine, wherein the third search output comprises third glycomolecular data for each of one or more identified third glycomolecules in a third format; converting one or more of the first, second, and third glycomolecular data to a common format, wherein the common format comprises the first format, the second format, the third format, or the universal glycomolecular data format; and comparing any two or more of the first, second, and third glycomolecular data, each in the common format.
- the first, second, and third search outputs correspond to the same glycomolecule mass spectrometry data run on the first, second, and third glycomolecule search engines.
- the common format is the universal glycomolecular data format.
- the universal glycomolecular data format comprises identifying information for the glycomolecule search engine.
- the universal glycopeptide data format comprises, for each identified glycomolecule, one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, or a confidence score.
- the glycomolecule is a glycopeptide, glycoprotein, glycolipid, glycoRNA, or glycoDNA.
- the glycomolecule search engine is a glycopeptide search engine, glycoprotein search engine, glycolipid search engine, glycoRNA search engine, or glycoDNA search engine.
- a method for determining a glycomolecule profile of a biological sample comprising: providing to each of two or more glycomolecule search engines a data set comprising glycomolecule mass spectrometry data obtained from a biological sample, wherein the same glycomolecule mass spectrometry data are provided to each of the two or more glycomolecule search engines; receiving an output from each of the two or more glycomolecule search engines run on the provided data set, wherein each output from the two or more glycomolecule search engines comprises glycomolecular data for each of one or more identified glycomolecules in a search engine-specific format; converting glycomolecular data of one or more of the outputs from the two or more glycomolecule search engines to a common format, wherein the common format comprises one of the search engine-specific formats or a universal glycomolecular data format; and determining a glycomolecule profile of the sample based on the glycomolecular data of the outputs from the two or more glycomolecule search engines in the common format.
- an electronic system for comparing glycomolecule search outputs from two or more glycomolecule search engines comprising: a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform a method of the present disclosure.
- a method of identifying glycomolecules in a biological sample comprising: receiving a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine; determining that the first search output identifies a first set of glycomolecules as being of a first type and a second set of glycomolecules as being of a second type; receiving a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine; determining that the second search output identifies a third set of glycomolecules as being of the first type and a fourth set of glycomolecules as being of the second type; and identifying among the first and second search outputs a conflicting subset of glycomolecules, wherein the conflicting subset of glycomolecules comprises: (a) glycomolecules that are present in both the first set and the fourth set; or (b) glycomolecules that are present in both the second set and the third set.
- the first search output comprises: a first set of identifiers associated with each glycomolecule of the first set of glycomolecules; and a second set of identifiers associated with each glycomolecule of the second set of glycomolecules
- the second search output comprises: a third set of identifiers associated with each glycomolecule of the third set of glycomolecules; and a fourth set of identifiers associated with each glycomolecule of the fourth set of glycomolecules
- identifying the conflicting subset of glycomolecules comprises: (a) determining an identifier in the first set of identifiers that is the same as an identifier in the fourth set of identifiers; or (b) determining an identifier in the second set of identifiers that is the same as an identifier in the third set of identifiers.
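The conflicting subset defined above is the union of two intersections: identifiers in both the first and fourth sets, or in both the second and third sets. A minimal sketch, assuming each set holds identifiers that are comparable across engines; the function name and the sample identifiers are hypothetical.

```python
from typing import Set


def conflicting_identifiers(
    first: Set[str],   # engine 1, called as the first type
    second: Set[str],  # engine 1, called as the second type
    third: Set[str],   # engine 2, called as the first type
    fourth: Set[str],  # engine 2, called as the second type
) -> Set[str]:
    """Return identifiers that the two engines assign to opposite types."""
    return (first & fourth) | (second & third)


# Made-up example: scan2 and scan5 are typed differently by the two engines.
conflicts = conflicting_identifiers(
    first={"scan1", "scan2"},
    second={"scan5", "scan6"},
    third={"scan1", "scan5"},
    fourth={"scan2", "scan6"},
)
print(sorted(conflicts))
```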
- Also provided is a method of identifying glycomolecules in a biological sample comprising: receiving a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine; receiving a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine; and determining that the first and second search outputs identify a conflicting subset of glycomolecules, wherein the conflicting subset of glycomolecules comprises one or more glycomolecules determined as being both of a first type in the first search output and of a second type in the second search output.
- the one or more glycomolecules of the conflicting subset of glycomolecules are determined as not being of the second type in the first search output.
- the one or more glycomolecules of the conflicting subset of glycomolecules are determined as not being of the first type in the second search output.
- the first search output comprises a set of identifiers associated with each glycomolecule of a first set of glycomolecules
- the second search output comprises a second set of identifiers associated with each glycomolecule of a second set of glycomolecules
- determining that the first and second search outputs identify the conflicting subset of glycomolecules comprises: identifying a first identifier in the first set of identifiers associated with a glycomolecule of the first type in the first set of glycomolecules; and identifying a second identifier in the second set of identifiers associated with a glycomolecule of the second type in the second set of glycomolecules, wherein the first and second identifiers are the same.
- the method further includes generating a notification comprising a listing of the conflicting subset of glycomolecules. In some embodiments, the method further includes: selecting a subset of data corresponding to the conflicting subset of glycomolecules; and performing an additional analysis on the selected data to confirm an identity of the one or more glycomolecules of the conflicting subset of glycomolecules.
- the first search output comprises a first set of confidence scores associated with the first or the second set of glycomolecules
- the second search output comprises a second set of confidence scores associated with the third or the fourth set of glycomolecules.
- each confidence score of the first set of confidence scores is equal to or greater than a first confidence score threshold
- each confidence score of the second set of confidence scores is equal to or greater than a second confidence score threshold.
- the method further includes resolving the conflicting subset of glycomolecules based on a comparison of confidence scores of the first and second sets of confidence scores.
- resolving the conflicting subset of glycomolecules comprises, for each glycomolecule identified as being present in both the first set and the fourth set, or in both the second set and the third set, determining that the glycomolecule is of the type identified by the search output comprising the higher confidence score associated with the glycomolecule.
- the first and second sets of confidence scores are normalized.
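Resolving a conflict by the higher confidence score only makes sense once the two engines' scores are on a common scale. The sketch below assumes min-max scaling within each engine's output as the normalization; the source does not state which normalization is used, so this choice, the tie-breaking rule, and all names and sample values are assumptions.

```python
from typing import Dict, Iterable


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one engine's scores to the range [0, 1] (assumed normalization)."""
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    return {key: (value - low) / span if span else 1.0 for key, value in scores.items()}


def resolve_conflicts(
    conflicts: Iterable[str],
    first_calls: Dict[str, str], first_scores: Dict[str, float],
    second_calls: Dict[str, str], second_scores: Dict[str, float],
) -> Dict[str, str]:
    """For each conflicting identifier, keep the call with the higher normalized score."""
    norm_first, norm_second = min_max(first_scores), min_max(second_scores)
    return {
        ident: first_calls[ident] if norm_first[ident] >= norm_second[ident]
        else second_calls[ident]  # ties go to the first engine (arbitrary choice)
        for ident in conflicts
    }


# Made-up scores on two very different raw scales.
first_scores = {"s1": 300.0, "s2": 100.0, "s3": 200.0}
second_scores = {"s1": 0.2, "s2": 0.9, "s3": 0.1}
first_calls = {"s1": "sialylated", "s2": "sialylated", "s3": "sialylated"}
second_calls = {"s1": "fucosylated", "s2": "fucosylated", "s3": "fucosylated"}
resolved = resolve_conflicts(["s1", "s2"], first_calls, first_scores, second_calls, second_scores)
print(resolved)
```

Other normalizations, such as rank percentiles, would slot into the same structure by replacing `min_max`.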
- glycomolecules of the first type and glycomolecules of the second type have masses within about 0-8 daltons of each other.
- the first type comprises a sialic acid containing glycomolecule and the second type comprises a fucosylated glycomolecule.
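A short calculation illustrates why a sialic acid-containing glycan and a fucosylated glycan can be confused. Using standard monoisotopic residue masses, one N-acetylneuraminic acid residue (291.0954 Da) and two fucose residues (2 × 146.0579 Da) differ by only about 1.02 Da. The two example compositions and the function name below are chosen for illustration and are not taken from the source.

```python
from typing import Dict

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "Fuc": 146.0579,
    "NeuAc": 291.0954,
}


def glycan_residue_mass(composition: Dict[str, int]) -> float:
    """Sum residue masses for a composition (no reducing-end water added)."""
    return sum(RESIDUE_MASS[name] * count for name, count in composition.items())


# Illustrative pair: same core, one NeuAc versus two Fuc.
sialylated = {"Hex": 5, "HexNAc": 4, "NeuAc": 1}
fucosylated = {"Hex": 5, "HexNAc": 4, "Fuc": 2}

delta = glycan_residue_mass(fucosylated) - glycan_residue_mass(sialylated)
print(f"mass difference: {delta:.4f} Da")
```

A difference this small is close to the spacing between isotope peaks, which is one reason two engines can assign the same spectrum to different types.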
- the first search output is in a first format, wherein the second search output is in a second format, and wherein the method further comprises converting one or both of the first and second search outputs to a common format, wherein the common format comprises the first format, the second format, or a universal glycomolecular data format.
- Also provided is a method of determining a glycomolecule profile of a biological sample comprising: receiving a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine, wherein the first search output comprises a first set of glycomolecules associated with a first set of confidence scores; receiving a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine, wherein the second search output comprises a second set of glycomolecules associated with a second set of confidence scores; and determining a glycomolecule profile of the biological sample based on (1) an overlap between the first and second search outputs and (2) the first and second set of confidence scores.
- the first search output comprises a first set of identifiers
- the second search output comprises a second set of identifiers
- determining the glycomolecule profile further comprises: determining the first set of identifiers associated with the first set of glycomolecules; and determining the second set of identifiers associated with the second set of glycomolecules.
- the overlap comprises a set of glycomolecules in both the first and second search outputs, wherein each glycomolecule in the first set of glycomolecules is associated with a first confidence score of the first set of confidence scores that is equal to or greater than a first confidence score threshold, and each glycomolecule in the second set of glycomolecules is associated with a second confidence score of the second set of confidence scores that is equal to or greater than a second confidence score threshold.
- the method further includes setting the value of the first confidence score threshold and the value of the second confidence score threshold to optimize a percentage overlap between the search outputs among all the first and second glycomolecules having, respectively, first confidence scores equal to or greater than the first confidence score threshold and second confidence scores equal to or greater than the second confidence score threshold.
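The threshold-setting step above can be illustrated with a minimal sketch. It assumes each search output has already been reduced to a mapping from a shared glycomolecule key to a confidence score, and it assumes "percentage overlap" means the shared identifications divided by all identifications that pass either threshold. The function names and the exhaustive grid search are illustrative assumptions, not the method of the disclosure.

```python
from itertools import product


def percent_overlap(first, second, t1, t2):
    """Percentage of identifications shared by both engines after thresholding.

    first, second: dict mapping a glycomolecule key to a confidence score.
    t1, t2: engine-specific confidence score thresholds (inclusive).
    """
    kept_first = {key for key, score in first.items() if score >= t1}
    kept_second = {key for key, score in second.items() if score >= t2}
    union = kept_first | kept_second
    if not union:
        return 0.0
    return 100.0 * len(kept_first & kept_second) / len(union)


def best_thresholds(first, second, grid1, grid2):
    """Try every threshold pair on the grids; return (t1, t2, overlap_percent)."""
    best = None
    for t1, t2 in product(grid1, grid2):
        pct = percent_overlap(first, second, t1, t2)
        if best is None or pct > best[2]:
            best = (t1, t2, pct)
    return best
```

Because very strict thresholds can trivially yield a high overlap on a tiny set, a practical variant would likely also require a minimum number of retained glycomolecules.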
- determining the glycomolecule profile is based on an overlap between the first set of glycomolecules having first confidence scores within a first highest percentage and the second set of glycomolecules having second confidence scores within a second highest percentage.
- the method further includes generating a consensus list of glycomolecules based on the overlap.
- generating the consensus list comprises applying a weighting factor to either one or both of the first and second sets of confidence scores.
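A consensus list with weighting factors might be sketched as follows. The sketch assumes both sets of confidence scores have already been rescaled to a comparable range, and that the weighted sum is the ranking criterion; the function name, the default weights, and the tie-breaking rule are hypothetical.

```python
def consensus_list(first, second, w1=1.0, w2=1.0):
    """Rank glycomolecules found by both engines by a weighted combined score.

    first, second: dict mapping a glycomolecule key to a confidence score,
    assumed to be on comparable scales. w1 and w2 are the weighting factors.
    Returns a list of (key, combined_score), highest score first.
    """
    shared = first.keys() & second.keys()
    scored = [(key, w1 * first[key] + w2 * second[key]) for key in shared]
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Raising one weight relative to the other lets a user express greater trust in one search engine without discarding the other engine's scores.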
- the first search output is in a first format
- the second search output is in a second format
- the method further comprises, before determining the glycomolecule profile, converting one or more of the first and second search outputs to a common format
- the common format comprises the first format, the second format, or a universal glycomolecular data format.
- the common format is the universal glycomolecular data format.
- the overlap comprises one or more of peptide sequences, protein IDs, retention times, m/z values, or charge states.
- the first search output and second search output are generated by running the respective glycomolecule search engines on the same glycomolecule mass spectrometry data of the biological sample.
- the glycomolecule is a glycopeptide, glycoprotein, glycolipid, glycoRNA, or glycoDNA.
- the glycomolecule search engine is a glycopeptide search engine, glycoprotein search engine, glycolipid search engine, glycoRNA search engine, or glycoDNA search engine.
- the biological sample is from human urine, human stool, human blood, or human serum.
- an electronic system comprising: a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any of the methods of the present disclosure.
- a system for curating a consensus list of glycomolecules identified in a biological sample on a user interface comprising: a processor; and a memory comprising instructions that, when executed by the processor, cause the processor to: retrieve a plurality of glycomolecule search results sets for a biological sample from a plurality of glycomolecule search engines, wherein each glycomolecule search results set comprises glycomolecule search results identifying one or more predicted glycomolecules as being present, wherein each predicted glycomolecule is associated with a unique identifier; determine a consensus set comprising, for each unique identifier, a consensus search result identifying one of the predicted glycomolecules associated with the unique identifier in the glycomolecule search results sets; provide a user interface configured to display: one or more search results elements configured to display one or more representations of the one or more predicted glycomolecules from each of the plurality of glycomolecule search results sets; and a consensus list element configured to display the consensus set.
- representations corresponding to the glycomolecule search results for predicted glycomolecules associated with the same unique identifiers are displayed in each of the search result elements.
- the user interface is further configured to display the unique identifier in association with the one or more representations.
- the consensus set is determined at least in part by: identifying conflicting search results and non-conflicting search results among the glycomolecule search results; and populating the consensus list element with the non-conflicting search results.
- the glycomolecule search results are further evaluated by: resolving the conflicting search results by applying a conflict resolution rule set to generate resolved conflicting search results; and populating the consensus list element with the resolved conflicting search results.
- the conflict resolution rule set comprises identifying, for a particular conflicting search result associated with a particular unique identifier, one or more plurality glycomolecules from the predicted glycomolecules of the conflicting search results associated with the particular unique identifier.
- the search results comprise a confidence score for each of the glycomolecules in the conflict search results
- the conflict resolution rule set comprises identifying a consensus search result based on a comparison of the confidence scores of each search result among the conflict search results.
- determining the consensus set comprises determining an initial consensus set, wherein the initial consensus set comprises a single predicted glycomolecule for each unique identifier.
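The consensus-set determination described above can be sketched as grouping results by unique identifier, passing through non-conflicting results, and resolving conflicts by comparing confidence scores. The data layout, the engine names in the usage, and the "highest score wins" rule are assumptions for illustration; the rule also assumes the engines' scores are directly comparable, which in practice may require rescaling.

```python
from collections import defaultdict


def build_consensus(result_sets):
    """Build an initial consensus set keyed by unique identifier.

    result_sets: {engine_name: {identifier: (glycan, confidence_score)}}
    Returns (consensus, conflicts), where consensus maps each identifier to a
    single predicted glycan and conflicts holds the disagreeing calls.
    """
    by_identifier = defaultdict(dict)
    for engine, results in result_sets.items():
        for identifier, (glycan, score) in results.items():
            by_identifier[identifier][engine] = (glycan, score)

    consensus, conflicts = {}, {}
    for identifier, calls in by_identifier.items():
        distinct = {glycan for glycan, _ in calls.values()}
        if len(distinct) == 1:
            # Non-conflicting: every reporting engine agrees.
            consensus[identifier] = distinct.pop()
        else:
            # Conflicting: keep the calls for review, pick the top-scoring one.
            conflicts[identifier] = calls
            consensus[identifier] = max(calls.values(), key=lambda c: c[1])[0]
    return consensus, conflicts
```

In this sketch an identifier reported by only one engine is treated as non-conflicting; a stricter rule set could instead flag it for manual review.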
- the instructions, when executed by the processor, further cause the processor to provide the plurality of search result elements, each search result element configured to recognize a user input to select one predicted glycomolecule among the predicted glycomolecules associated with the same unique identifier as the consensus search result for the unique identifier.
- the user input comprises a touch and a scrolling activity
- the user interface is further configured to scroll through the representations of the predicted glycomolecules displayed in the search results element in response to the scrolling activity.
- the user interface is configured to arrange the representations of the predicted glycomolecules of each search results set with a respective search results element into an ordered list for each of the search results set.
- the user interface is further configured to scroll through the ordered lists in a coordinated manner in response to a scrolling activity on one of the ordered lists by the user.
- each of the ordered lists is identically ordered.
- the unique identifier is a mass spectrometry scan number, and wherein the mass spectrometry scan number identifies a glycomolecule query in each of the glycomolecule search engines.
- the system further comprises a human interface device.
- the system comprises a display.
- the display further comprises a sensor.
- the display further comprises an actuator.
- the user interface is configured to: recognize an input of a user; recognize a selected glycomolecule when the input of the user hovers or interacts with the glycomolecule shown in the glycomolecule search results sets or the consensus set; and present a visualized glycomolecule for the selected glycomolecule.
- the visualized glycomolecule comprises one or more potential structures of the selected glycomolecule.
- the visualized glycomolecule is shown in a modal window in the user interface.
- the visualized glycomolecule is shown in a popup window in the user interface.
- the visualized glycomolecule is shown in a dedicated region of the user interface.
- the user interface is configured to: recognize an input of a user and permit the user to manually edit one or more of the consensus search result; and optionally permit the user to drag and drop one or more of the predicted glycomolecules from one or more of the glycomolecule search results sets into the consensus set.
- the user interface is configured to: recognize an input of a user and permit the user to drag and drop one or more of the predicted glycomolecules from one or more of the glycomolecule search results sets into the consensus set.
- the glycomolecules comprise glycans, glycoproteins, glycolipids, glycoRNAs, glycoDNAs, or any combination thereof. In some embodiments, the glycomolecules comprise glycans.
- the instructions, when executed by the processor, further cause the processor to convert one or more of the glycomolecule search results to a common format.
- the common format is a universal glycan format.
- the universal glycan format comprises: a first number l corresponding to an amount of hexose; a second number m corresponding to an amount of N-acetylhexosamine; a third number n corresponding to an amount of fucose; and a fourth number o corresponding to an amount of N-acetylneuraminic acid, and wherein l, m, n, o are each an integer ≥ 0.
- the glycomolecule search results from each of the glycomolecule search engines is in a platform-specific format.
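A universal glycan format built from four integer counts might be represented as in the sketch below. The string layout follows the example N(3)H(3)F(1)A(0) used in this document; the class name, field names, and the strict parsing pattern are assumptions for illustration.

```python
import re
from typing import NamedTuple


class GlycanComposition(NamedTuple):
    hexose: int   # H
    hexnac: int   # N, N-acetylhexosamine
    fucose: int   # F
    neu5ac: int   # A, N-acetylneuraminic acid


_UNIVERSAL = re.compile(r"N\((\d+)\)H\((\d+)\)F\((\d+)\)A\((\d+)\)")


def to_universal(comp):
    """Serialize the four counts in the N(..)H(..)F(..)A(..) layout."""
    return f"N({comp.hexnac})H({comp.hexose})F({comp.fucose})A({comp.neu5ac})"


def from_universal(text):
    """Parse a universal-format string back into its four counts."""
    match = _UNIVERSAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a universal glycan string: {text!r}")
    n, h, f, a = (int(group) for group in match.groups())
    return GlycanComposition(hexose=h, hexnac=n, fucose=f, neu5ac=a)
```

Keeping every residue count explicit, including zero counts, means two strings are equal exactly when the compositions are equal, which simplifies comparison across search outputs.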
- Also provided herein is a method for curating a consensus list of glycomolecules identified in a biological sample on a user interface, comprising: at a computer system including one or more processors, and memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: retrieving a plurality of glycomolecule search results sets for a biological sample from a plurality of glycomolecule search engines wherein each glycomolecule search results set comprises glycomolecule search results identifying one or more predicted glycomolecules as being present, wherein each predicted glycomolecule is associated with a unique identifier; determining a consensus set, wherein the consensus set comprises, for each unique identifier, a consensus search result identifying one of the predicted glycomolecules associated with the unique identifier in the glycomolecule search results sets; providing a user interface configured to display: one or more search results elements configured to display one or more representations of the one or more predicted glycomolecules from each of the plurality of glycomolecule search results sets; and a consensus list element configured to display the consensus set
- the determining the consensus list step comprises: identifying conflicting search results and non-conflicting search results among the glycomolecule search results; and populating the consensus list element with the non-conflicting search results.
- the one or more programs further including instructions for: resolving the conflicting search results to generate resolved conflicting search results; and populating the consensus list element with the resolved conflicting search results.
- the conflict resolution rule set comprises identifying, for a particular conflicting search result associated with a particular unique identifier, one or more plurality glycomolecules from the predicted glycomolecules of the conflicting search results associated with the particular unique identifier.
- the conflict resolution rule set further comprises identifying a consensus search result based on relative confidence scores of each search result among the conflict search results.
- the consensus set comprises an initial consensus set, and wherein the initial consensus set comprises a single predicted glycomolecule for each unique identifier.
- the one or more programs further include instructions for providing the plurality of search result elements, each of the search result elements configured to recognize a user input to select one predicted glycomolecule among the predicted glycomolecules associated with the glycomolecule search engine that is associated with the same unique identifier.
- the one or more programs further include instructions for arranging, in the user interface, the representations of the predicted glycomolecules of each search results set with a respective search results element into an ordered list for each of the search results set.
- the one or more programs further include instructions for ordering each of the ordered lists identically.
- the input comprises a touch and a scrolling activity
- the user interface is further configured to scroll through the ordered lists at about a same scrolling speed and about a same scrolling direction in response to the scrolling activity on one of the ordered lists.
- the unique identifier is a mass spectrometry scan number, and wherein the mass spectrometry scan number identifies a glycomolecule query in each of the glycomolecule search engines.
- the user interface is configured to: recognize an input of a user and permit the user to manually edit one or more of the consensus search result; and optionally permit the user to drag and drop one or more of the predicted glycomolecules from one or more of the glycomolecule search results sets into the consensus set.
- the user interface is configured to: recognize an input of a user and permit the user to drag and drop one or more of the predicted glycomolecules from one or more of the glycomolecule search results sets into the consensus set.
- the search results further comprise a confidence score for each of the glycomolecules in the conflict search results
- the user interface is configured to: recognize an input of a user and permit the user to apply a custom confidence score to any of the glycomolecules in the conflict search results.
- the one or more programs further include instructions to update a machine learning process with the custom confidence scores, wherein the machine learning process is configured to output recommendations and predictions.
- the user interface is configured to: recognize an input of a user; recognize a selected glycomolecule when the input of the user hovers or interacts with the glycomolecule shown in the glycomolecule search results sets or the consensus set; and present a visualized glycomolecule for the selected glycomolecule.
- the visualized glycomolecule comprises one or more potential structures of the selected glycomolecule.
- the visualized glycomolecule is shown in a modal window in the user interface.
- the visualized glycomolecule is shown in a popup window in the user interface.
- the visualized glycomolecule is shown in a dedicated region of the user interface.
- the glycomolecules comprise glycans, glycolipids, glycoRNAs, glycoDNAs, or any combination thereof.
- the representation of the at least one glycomolecule comprises a glycan representation in a universal glycan format.
- the glycomolecules comprise glycans.
- the representation of the at least one glycomolecule comprises a glycan representation in a universal glycan format.
- the universal glycan format comprises: a first number l corresponding to an amount of hexose; a second number m corresponding to an amount of N-acetylhexosamine; a third number n corresponding to an amount of fucose; and a fourth number o corresponding to an amount of N-acetylneuraminic acid, and wherein l, m, n, o are each an integer ≥ 0.
- Fig. 1 is a block diagram which illustrates a glycan database input converter connected to a series of glycan databases connected to various search engines along with a glycopeptide data converter that converts the output of each search engine to a universal glycan format, according to some non-limiting embodiments of the present disclosure.
- Fig. 2 is a schematic diagram which illustrates the conversion of a particular universal glycan format into the different input formats found within search engine-specific glycan databases, according to some non-limiting embodiments of the present disclosure.
- Fig. 3 is a schematic diagram showing direct conversions of glycan representations between different glycan formats that are each compatible with a glycomolecule search engine.
- Fig. 4 is a schematic diagram showing conversion of glycan representations between different glycan formats via a universal glycan format, according to some non-limiting embodiments of the present disclosure.
- FIG. 5 is a block diagram showing a method of converting a glycan data format, according to some non-limiting embodiments of the present disclosure.
- FIG. 6 is a block diagram showing a method of converting a glycan data format, according to some non-limiting embodiments of the present disclosure.
- FIG. 7 is a block diagram showing a computer system for implementing a glycan data management system and a method of converting a glycan data format, according to some non-limiting embodiments of the present disclosure.
- FIG. 8 is a block diagram showing a network system for implementing a glycan data management system and a method of converting a glycan data format, according to some non-limiting embodiments of the present disclosure.
- Fig. 9 is a schematic diagram showing conversion of glycan representations between different glycan formats via a universal glycan format, according to some non-limiting embodiments of the present disclosure.
- Fig. 10 is a table showing glycan representations in different glycan formats, including a universal glycan format, according to some non-limiting embodiments of the present disclosure.
- Fig. 11 is a schematic diagram showing comparison of search outputs from different glycomolecule search engines using a common format, according to some non-limiting embodiments of the present disclosure.
- Fig. 12 is a schematic diagram showing a method of comparing glycomolecule search outputs from two or more glycomolecule search engines, according to some non-limiting embodiments of the present disclosure.
- Fig. 13 is a schematic diagram showing a method for analyzing a biological sample to determine a glycomolecule profile, according to some non-limiting embodiments of the present disclosure.
- Fig. 14 is a block diagram showing a computer system for implementing a glycomolecule data analysis system and a method of comparing glycomolecule search outputs, according to some non-limiting embodiments of the present disclosure.
- Fig. 15 is a block diagram showing a network system for implementing a glycomolecule data analysis system and a method of comparing glycomolecule search outputs, according to some non-limiting embodiments of the present disclosure.
- Fig. 16 is a collection of tables showing portions of search outputs from a PGLYCO database search and a BYONIC database search run on the same glycopeptide mass spectrometry data, each output having a search engine-specific format.
- Fig. 17 is a collection of tables showing a search output from a glycopeptide search engine in a search engine-specific format and a universal glycomolecular data format, where the Scan number of the PGLYCO search result is matched to the MS2 protein number of the universal glycopeptide format, according to some non-limiting embodiments of the present disclosure.
- Fig. 18 is a collection of tables showing a search output from a glycopeptide search engine in a search engine-specific format and a universal glycomolecular data format, according to some non-limiting embodiments of the present disclosure.
- Fig. 19 is a collection of tables showing a search output from a glycopeptide search engine in a search engine-specific format and a universal glycomolecular data format, where the Scan number of the METAMORPHEUS search result is matched to the MS2 protein number of the universal glycopeptide format, according to some non-limiting embodiments of the present disclosure.
- Fig. 20 is a schematic diagram showing extraction of glycopeptide data parameters for an identified glycopeptide in a search output of a glycopeptide search engine run on glycopeptide mass spectrometry data, according to some non-limiting embodiments of the present disclosure.
- Fig. 21 is a schematic diagram showing extraction of glycopeptide data parameters for an identified glycopeptide in a search output of a glycopeptide search engine run on glycopeptide mass spectrometry data, according to some non-limiting embodiments of the present disclosure.
- Fig. 22 is a schematic diagram showing extraction of glycopeptide data parameters for an identified glycopeptide in a search output of a glycopeptide search engine run on glycopeptide mass spectrometry data, according to some non-limiting embodiments of the present disclosure.
- FIG. 23 is a schematic diagram showing a method of identifying glycomolecules in a biological sample, according to some non-limiting embodiments of the present disclosure.
- FIG. 24A is a schematic diagram showing a method of identifying glycomolecules in a biological sample, according to some non-limiting embodiments of the present disclosure.
- FIGs. 24B-24D illustrate examples of different pairs of glycans with similar masses.
- Gal D-Galactose
- Man D-Mannose
- Fuc L-Fucose
- GlcNAc N-Acetyl-D-Glucosamine
- Neu5Ac N-Acetylneuraminic Acid
- Neu5Gc N-Glycolylneuraminic acid.
- FIG. 25 is a schematic diagram showing a method of determining a glycomolecule profile of a biological sample, according to some non-limiting embodiments of the present disclosure.
- FIG. 26 is a block diagram showing a computer system for implementing a method of the present disclosure, according to some non-limiting embodiments of the present disclosure.
- FIG. 27 is a block diagram showing a network system for implementing a method of the present disclosure, according to some non-limiting embodiments of the present disclosure.
- FIGs. 28A and 28B are a collection of diagrams showing overlapping and non-overlapping sets of glycomolecules identified as containing sialic acid or as being fucosylated by two different glycopeptide search engines, according to some non-limiting embodiments of the present disclosure.
- FIGs. 29A and 29B are a collection of diagrams showing the change in the overlapping set of glycomolecules identified by two different glycopeptide search engines based on search engine-specific confidence scores, when the confidence score thresholds (“Cut-off score”) are altered in a search engine-specific manner, according to some non-limiting embodiments of the present disclosure.
- FIG. 30 is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- FIG. 31A is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of dynamic interaction with a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- FIG. 31B is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface that presents a glycomolecule visualization.
- FIG. 31C is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- FIG. 32 is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- FIG. 33 is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- FIG. 34 is a block diagram showing a non-limiting example of a computer system implementing systems and methods of the disclosure.
- FIG. 35 is a block diagram showing a non-limiting example of a method for populating a consensus list element according to some non-limiting embodiments of the disclosure.
- FIG. 36 is a schematic diagram showing conversion of glycan representations between different glycan formats via a universal glycan format, according to some non-limiting embodiments of the disclosure.
- FIG. 37 is a schematic diagram showing conversion of glycan representations between different glycan formats via a universal glycan format, according to some non-limiting embodiments of the disclosure.
- a user may introduce a sample into an analyzer device to determine identities or properties of analytes within the sample.
- the user may introduce a biological sample including glycomolecules into a mass spectrometer.
- the user may then initiate a multiple reaction monitoring mass spectrometry (MRM-MS) process.
- the MRM-MS process may include ionizing glycomolecule analytes (precursor ions) within the sample, fragmenting the ionized analytes (product ions), and measuring charge signals from the precursor and/or product ions. These measured signals may be used to identify the analytes and determine relative abundances of the analytes in the sample.
- This identification and determination may include querying one or more glycan databases (along with other databases, e.g., protein databases in the case of glycopeptides) by inputting the measured signals into an appropriate search engine.
- the system includes a first Glycan Database Converter, which may be a glycan format converter, which is configured to convert naming formats or representations used in a particular glycan database to a different naming format or representation.
- the Glycan Database Converter communicates with various search engine databases.
- other public glycan databases may also communicate with the Glycan Database Converter.
- Each of the glycan databases may communicate and provide input to the various search engines as shown in Fig. 1.
- search results from each of the search engines may then be outputted and directed into a Glycomolecule Search Converter, which is configured to obtain the specific output formats from each search engine and convert them into a universal glycomolecule format so that the results from each search can be compared to one another.
- Fig. 2 illustrates how a glycan representation can be converted between a universal glycan format (e.g., N(3)H(3)F(1)A(0)) and various platform-specific formats compatible with particular search engines (e.g., PGLYCO3, BYONIC, and METAMORPHEUS search engines).
- Fig. 3 illustrates an example method of converting directly between platform-specific formats.
- direct conversion of the first format 310 of the glycan data in a database of interest (e.g., one that is compatible with a first search engine) to second and third formats 320, 330 compatible with one or more other search engines requires conversion from the first format to each of the other formats 315, 325.
- where conversion of glycan data formats between the second and third formats is desired (e.g., to use a glycan database for a second search engine that takes in glycan data in the second format on a third search engine that takes in glycan data in the third format, but not the second format), another conversion 345 is required.
- when the number of glycan data formats grows to N, where N is an integer greater than 1, the number of conversion rule sets (e.g., for converting each of the existing glycan formats to the new format, and vice versa, 335, 355, 365) increases by N-1.
- the total number of conversion rule sets for converting glycan data formats among all of the N glycan data formats will be N*(N-1)/2.
- embodiments of the present disclosure can reduce the number of conversion rule sets required to convert glycan data formats across different glycan databases.
- the disclosed glycan format converter may be configured to convert among the different formats so that the user can search each of the different glycan databases using any desired search engine. Additionally or alternatively, the glycan format converter may be configured to convert any of the different formats into a newly developed “universal” format as described herein, so that users can rely on this single universal glycan format rather than having to convert and otherwise work with multiple different formats. Similarly, the glycan format converter may also be configured to convert from this universal glycan format to any other suitable or compatible format (e.g., the format associated with BYONIC, PGLYCO3, or any other suitable search engine).
- the total number of conversion rules for converting glycan data formats among all of the N glycan data formats will be reduced as compared to a case where a universal glycan format is not used in the conversion process.
- the total number of conversion rules in such cases would be N (rather than N*(N-1)/2), where N is an integer greater than 1.
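The difference between direct pairwise conversion and conversion via a universal format can be sketched as a small hub-style registry, in which each format supplies only one rule set to and from the hub. The class, the two toy formats ("long" and "short"), and the dictionary used as the hub representation are hypothetical and serve only to show the routing.

```python
import re


def pairwise_rule_sets(n):
    """Rule sets needed when every pair of formats is converted directly."""
    return n * (n - 1) // 2


def hub_rule_sets(n):
    """Rule sets needed when every format converts only to and from a hub."""
    return n


class HubConverter:
    """Routes every conversion through a shared intermediate representation."""

    def __init__(self):
        self._to_hub = {}
        self._from_hub = {}

    def register(self, name, to_hub, from_hub):
        self._to_hub[name] = to_hub
        self._from_hub[name] = from_hub

    def convert(self, value, source, target):
        return self._from_hub[target](self._to_hub[source](value))


# Two toy formats; the hub is a dict of residue name -> count.
_SHORT = {"N": "HexNAc", "H": "Hex", "F": "Fuc"}
_LONG = {name: letter for letter, name in _SHORT.items()}


def long_to_hub(text):
    return {name: int(n) for name, n in re.findall(r"([A-Za-z]+)\((\d+)\)", text)}


def hub_to_long(hub):
    return "".join(f"{name}({count})" for name, count in hub.items())


def short_to_hub(text):
    return {_SHORT[k]: int(n) for k, n in re.findall(r"([NHF])(\d+)", text)}


def hub_to_short(hub):
    return "".join(f"{_LONG[name]}{count}" for name, count in hub.items())


converter = HubConverter()
converter.register("long", long_to_hub, hub_to_long)
converter.register("short", short_to_hub, hub_to_short)
```

Adding a further format to this registry requires one new `register` call regardless of how many formats already exist. Under this counting, the hub approach needs fewer rule sets than the pairwise approach once N exceeds 3.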
- the glycan format converter may be capable of converting between structural formats (such as the PGLYCO3 format), which indicate the structure of a glycan, and purely compositional formats (such as the METAMORPHEUS format), which may indicate only the chemical composition of a glycan.
- the system may include a second converter, which may be a glycomolecule search result format converter (e.g., a glycopeptide search result format converter).
- Different search engines output the results of each search in a different glycopeptide format (see, for example, Fig. 16).
- the BYONIC search engine reports the amino acid sequence both before and after the peptide structure (see, e.g., Fig. 16, bottom panel).
- the PGLYCO3 search engine reports the actual glycosylation site as amino acid “J” instead of the real amino acid “N”, which is asparagine (see, e.g., Fig. 16, top panel).
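The two peptide-reporting differences described above can be normalized with a short sketch. It assumes the flanking residues are written in a `K.PEPTIDE.R` layout with periods as separators and no other periods in the string, and that "J" appears only as the glycosylation-site placeholder; the function names and the example sequences in the comments are illustrative, not taken from the figures.

```python
def strip_flanking_residues(peptide):
    """Drop residues reported before and after the peptide.

    Assumes a 'K.PEPTIDE.R' layout; any other input is returned unchanged.
    """
    parts = peptide.split(".")
    if len(parts) == 3:
        return parts[1]
    return peptide


def restore_glycosite(peptide):
    """Replace the 'J' glycosylation-site placeholder with asparagine 'N'."""
    return peptide.replace("J", "N")


def normalize_peptide(peptide):
    """Apply both normalizations so sequences from both engines compare equal."""
    return restore_glycosite(strip_flanking_residues(peptide))


# Illustrative inputs: "K.EEQYNSTYR.V" and "EEQYJSTYR" normalize to the same string.
```

Restoring "N" discards the site annotation, so a converter would likely record the position of "J" separately before replacing it.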
- a BYONIC search output may represent a glycan of a glycomolecule as “HexNAc(3)Hex(3)Fuc(1),” while a PGLYCO3 search output may represent the same glycan as “(N(F) (N(H(H) (H(N))))),” where N corresponds to HexNAc and H corresponds to Hex.
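Both glycan representations above can be reduced to the same residue counts and written in a common layout, as in this sketch. The letter F for fucose and the `NeuAc` key for the fourth count are assumptions, and the output layout follows the universal-format example N(3)H(3)F(1)A(0) used in this document. Counting letters in the nested string discards the branching information, so this direction of conversion is lossy.

```python
import re
from collections import Counter

# Letters used in the nested structural string (F for fucose is assumed).
_LETTERS = {"N": "HexNAc", "H": "Hex", "F": "Fuc"}


def counts_from_composition(text):
    """Parse a compositional string such as 'HexNAc(3)Hex(3)Fuc(1)'."""
    pairs = re.findall(r"([A-Za-z][A-Za-z0-9]*)\((\d+)\)", text)
    return Counter({name: int(count) for name, count in pairs})


def counts_from_structure(text):
    """Count residues in a nested string such as '(N(F) (N(H(H) (H(N)))))'."""
    counts = Counter()
    for char in text:
        if char in "() ":
            continue
        counts[_LETTERS[char]] += 1  # KeyError signals an unknown residue letter
    return counts


def to_universal(counts):
    """Write residue counts in the N(..)H(..)F(..)A(..) layout."""
    return "N({})H({})F({})A({})".format(
        counts["HexNAc"], counts["Hex"], counts["Fuc"], counts["NeuAc"]
    )
```

With both outputs in one layout, a plain string comparison is enough to decide whether two engines reported the same glycan composition for a scan.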
- the methods and systems disclosed herein enable the running of searches on sample data (e.g., mass spectrometry data) across multiple glycomolecule search engines and comparison of the search outputs.
- Each glycomolecule search engine may have its own unique algorithms for predicting or determining properties or identities of analytes in a sample, such that different search engines may arrive at different conclusions of varying confidence levels about the properties or identities of the same set of analytes.
- identification of glycomolecules by one search engine may be in conflict with identification by another search engine.
- By comparing the search results a user can find potentially mis-identified glycomolecules in the search results. Closer analysis of the underlying data may allow resolution of the conflicting identification.
- search engines may be better for certain types of searches than other search engines. Comparing search outputs for the same sample data across multiple search engines can provide a more complete picture, can help validate or corroborate the identity and properties of analytes, and can generally increase the confidence level a researcher, clinician, or other user has in making conclusions based on the search outputs. Comparing search outputs for the same sample data across multiple search engines may provide guidance as to the pros and cons of different search engines. For example, systematic mis-identification of glycomolecules by one search engine may reveal biases in the search algorithm used by the search engine. Additional routines may then be developed to account for such biases for analyzing future searches.
- Relying on multiple search engines may also provide a more comprehensive annotation of the sample data.
- when a user searches the same sample data across multiple search engines and attempts to compare the search outputs, the varying formats in the search outputs of the different search engines make this comparison difficult.
- the methods and systems disclosed herein can, at least in part, enable conversion of search outputs to a common format (e.g., to a format specific to one of the search engines, a universal glycomolecular data format) in order to facilitate the comparison process.
- although this disclosure focuses on comparing searches based on mass spectrometry data to predict/determine properties or identities of glycomolecules, the disclosure contemplates comparing searches based on any sample data suitable for predicting/determining properties or identities of glycomolecules.
- a user can use the glycopeptide search result format converter to directly compare their search results despite using multiple search engines, because each search result will be converted from the specific format output by each search engine into a common format, e.g., a universal format.
- the system can correlate and manage the different confidence scores that are reported for each search result from the individual search engines, to provide a system for comparing the confidence of each result across the databases. These correlations can be determined by, in part, comparing the search results of multiple engines to determine the overlap, and then deriving equivalent confidence values.
- a common format, such as a universal format, to which outputs from different glycomolecule search engines are converted (e.g., as provided in the present disclosure) allows a user to more easily compare the search results for the same experiment or mass spectrometry run.
- Embodiments provide options to correlate and manage the different confidence scores that are reported for each search result from the individual search engines, to compare the confidence that each respective search engine has in the identity of the search result. These correlations can be determined by, in part, comparing the search results of multiple search engines to determine the overlap, and then deriving equivalent confidence values.
- systems and methods that facilitate curation of a consensus list of glycomolecules identified in a biological sample by providing a user interface for displaying one or more glycomolecule search results from one or more glycomolecule search engines and the consensus list of the predicted glycomolecules.
- the consensus list determined by one or more conflict resolution rules may be automatically populated by the system.
- the systems and methods further facilitate curation of the consensus list by permitting manual curation thereof via user input, thereby allowing further determination and selection of the predicted glycomolecules in the consensus list.
- the presentation of search results and a consensus list on a user interface can allow a user to compare search results across different search engines and have increased confidence in the quality of the identity of glycomolecules present in a sample.
- comparison of search results across different search engines may allow identification of strengths and/or systematic biases of each search engine compared to others.
- manual curation of the consensus list not only allows for independent assessment of the glycomolecules shown therein but also improves accuracy and streamlines the process of identifying glycomolecules in a biological sample, e.g., by using the manual curation to inform conflict resolution rules applied to future analyses.
- although aspects of this disclosure relate to curation of a consensus list from glycomolecules identified in biological samples used in medical diagnostics and life sciences research, the disclosure is not limited to mass spectrometry data analysis, as the disclosure contemplates any other suitable systems or methods that utilize a comparative analysis of results from multiple sources as described in more detail below.
- biological sample refers to a sample derived from, obtained by, generated from, provided from, taken from, or removed from an organism, or from fluid or tissue from the organism.
- Biological samples include, but are not limited to synovial fluid, whole blood, blood serum, blood plasma, urine, sputum, tissue, saliva, tears, spinal fluid, tissue section(s) obtained by biopsy; cell(s) that are placed in or adapted to tissue culture; sweat, mucous, fecal material, gastric fluid, abdominal fluid, amniotic fluid, cyst fluid, peritoneal fluid, pancreatic juice, breast milk, lung lavage, marrow, gastric acid, bile, semen, pus, aqueous humor, transudate, and the like including derivatives, portions and combinations of the foregoing.
- biological samples include, but are not limited to, blood and/or plasma. In some examples, biological samples include, but are not limited to, urine or stool. In some examples, biological samples include, but are not limited to, saliva. In some examples, biological samples include, but are not limited to, tissue dissections and tissue biopsies. In some examples, biological samples include, but are not limited to, any derivative or fraction of the aforementioned biological samples.
- glycan refers to the carbohydrate residue of a glycoconjugate, such as the carbohydrate portion of a glycopeptide, glycoprotein, glycolipid, or proteoglycan.
- Glycans can be monomers or polymers of sugar residues, but typically contain at least three sugars, and can be linear or branched.
- a glycan may include natural sugar residues (e.g., glucose, N-acetylglucosamine, N-acetylhexosamine (HexNAc), N-acetylneuraminic acid (NeuAc), galactose, mannose, fucose, hexose (Hex), arabinose, ribose, xylose, etc.) and/or modified sugars (e.g., 2'-fluororibose, 2'-deoxyribose, phosphomannose, 6'-sulfo N-acetylglucosamine, etc.).
- the term “glycan” includes homo and heteropolymers of sugar residues.
- the term encompasses free glycans, including glycans that have been cleaved or otherwise released from a glycoconjugate.
- Glycan structures (as compared to glycan data formats or representations) are described by a glycan reference code number, and also illustrated in International PCT Patent Application No. PCT/US2020/016286, filed January 31, 2020, which is herein incorporated by reference in its entirety for all purposes.
- glycomolecule as used herein includes glycans and glycoconjugates (such as, but not limited to, glycopeptides, glycoproteins, glycolipids, glycoRNA, glycoDNA, etc.). Glycomolecule includes fragments of glycoconjugates.
- glycopeptide refers to a peptide having at least one glycan residue bonded thereto.
- a glycopeptide can be an intact protein (e.g., a glycoprotein) or any fragment thereof that has at least one glycan residue covalently bonded thereto.
- glycoform refers to a unique primary, secondary, tertiary, and quaternary structure of a protein with an attached glycan of a specific structure.
- glycosylated peptides refers to a peptide bonded to a glycan.
- Glycosylated peptides include peptides that have been covalently modified by glycosylation to become bonded to a glycan.
- glycopeptide fragment or “glycosylated peptide fragment” or “glycopeptide” refers to a glycosylated peptide (or glycopeptide) having an amino acid sequence that is the same as part (but not all) of the amino acid sequence of the glycosylated protein (or glycoprotein) from which the glycosylated peptide (or glycopeptide) is obtained, e.g., ion fragmentation within a MRM-MS instrument.
- MRM refers to multiple-reaction-monitoring.
- glycopeptide fragments or “fragments of a glycopeptide” refer to the fragments produced directly by using a mass spectrometer optionally after the glycoprotein has been digested enzymatically to produce the glycopeptides.
- glycoprotein refers to the glycosylated protein from which the glycosylated peptide is obtained.
- “Glycoprotein” refers to a protein that contains a peptide backbone covalently linked to one or more sugar moieties (i.e., glycans).
- the peptide backbone typically comprises a linear chain of amino acid residues.
- the sugar moiety(ies) may be in the form of monosaccharides, disaccharides, oligosaccharides, and/or polysaccharides.
- the sugar moiety(ies) may comprise a single unbranched chain of sugar residues or may comprise one or more branched chains.
- sugar moieties may include sulfate and/or phosphate groups. Alternatively or additionally, sugar moieties may include acetyl, glycolyl, propyl or other alkyl modifications.
- glycoproteins contain O-linked sugar moieties; in certain embodiments, glycoproteins contain N-linked sugar moieties.
- peptide is meant to include glycopeptides unless stated otherwise.
- MRM-MS refers to multiple reaction monitoring mass spectrometry.
- MRM transition refers to the mass to charge (m/z) peaks or signals observed when a glycomolecule, or a fragment thereof, is detected by MRM-MS.
- the MRM transition is detected as the transition of the precursor and product ions.
- “Representation” or “format” as used herein with reference to a glycan refers to any linear string of characters intended to convey compositional and/or structural features of a glycan.
- a glycan representation can be a string that includes symbols and/or alphanumerical characters.
- a glycan representation or format can be constructed using pre-defined rules for representing the compositional and/or structural features of a glycan. For example, a glycan representation in an example format is N(3)H(3)F(1)A(0)G(0).
- platform-specific glycan format refers to any glycan format that is associated with one or more specific glycomolecule search engines.
- a platform-specific glycan format can be used by, or be compatible with, the glycomolecule search engine.
- the platform-specific glycan format is a conventional glycan format used by, or compatible with, conventional glycan databases and/or conventional glycan or glycopeptide search engines.
- conventional glycan formats include formats used by PGLYCO3, BYONIC and METAMORPHEUS.
- Conversion rule as used herein with reference to a glycan format refers to any step-wise operation for changing a glycan representation in one format to a different format, e.g., when executed by a processor on an input glycan data.
- a “conversion rule set” as used herein with reference to a glycan format refers to a collection of conversion rules that provide the step-wise operations for changing a glycan representation in one format to and from another format.
- Search engine refers to any computer-implemented program configured to receive a query (e.g., an input string) and implement algorithms to identify entries within one or more databases that provide a match to the query that meets certain predefined criteria and/or user-specified criteria.
- a search engine is typically associated with its own proprietary glycan database and can rely on one or more statistical tests to determine the quality of any given match, and may provide a confidence score that reflects the quality of a match.
- a “glycomolecule search engine” refers to any search engine for identifying one or more of glycans, glycopeptides, glycolipids, glycoRNA, glycoDNA, etc., in sample data (e.g., mass spectrometry sample data).
- Non-limiting examples of glycopeptide search engines include PGLYCO3, BYONIC and METAMORPHEUS.
- a platform-specific glycan format can include any glycan format that can be converted to and/or from a universal glycan format, as disclosed herein.
- the term “about” indicates and encompasses an indicated value and a range above and below that value. In certain embodiments, the term “about” indicates the designated value ±10%, ±5%, or ±1%. In certain embodiments, the term “about” indicates the designated value ± one standard deviation of that value.
- there are a number of glycomolecule search engines available, including, for example, BYONIC, PGLYCO3, and METAMORPHEUS. Comparison of search results from different glycomolecule search engines can inform the user about the reliability of the search results and about characteristics of a glycomolecule search engine.
- the search results from different search engines can be compared between each other to identify conflicting subsets of glycomolecules identified in the search results. For example, one search engine may identify an analyte as being associated with one type of glycan (e.g., fucose), while another search engine may identify the same analyte as being associated with another type of glycan (e.g., sialic acid).
- the search results from different search engines can be compared between each other to create a consensus list of identified glycomolecules.
- a universal glycan format can be used as an input to a Glycan Database Converter.
- the Glycan Database Converter can convert the format of a glycan database between a glycomolecule search engine-specific format and a universal glycan format, and/or between glycomolecule search engine-specific formats via the universal glycan format.
- Fig. 2 shows a non-limiting example of taking a universal format for a molecule designated as “N(3)H(3)F(1)A(0)” and converting that universal format into PGLYCO3, BYONIC and METAMORPHEUS formats.
- the disclosure contemplates any suitable order.
- the order may be “H(3)N(3)F(1)A(0).”
- the disclosure contemplates any number of sugars in a glycan representation in the universal format.
- the glycan representation may include types of sugars (e.g., H(3)N(3)F(1)A(0)G(1)).
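- As a non-limiting illustration of why token order need not matter, the following Python sketch parses a universal-format string into counts and writes it back out in a fixed order. The function names and the chosen canonical order are assumptions made for this example; the disclosure itself contemplates any suitable order.

```python
import re

# Assumed canonical token order, for illustration only.
CANONICAL_ORDER = ["N", "H", "F", "A", "G"]


def parse_universal(text):
    """Parse a string such as 'N(3)H(3)F(1)A(0)' into {token: count}, ignoring spaces."""
    pairs = re.findall(r"([A-Z])\((\d+)\)", text.replace(" ", ""))
    return {token: int(count) for token, count in pairs}


def format_universal(counts, order=CANONICAL_ORDER):
    """Write counts in a fixed order; tokens missing from `order` follow alphabetically."""
    extras = sorted(token for token in counts if token not in order)
    ordered = [token for token in list(order) + extras if token in counts]
    return "".join(f"{token}({counts[token]})" for token in ordered)


a = format_universal(parse_universal("N(3)H(3)F(1)A(0)"))
b = format_universal(parse_universal("H(3)N(3) F(1)A(0)"))
same_glycan = a == b
```

- Comparing the parsed counts (or the re-serialized strings) rather than the raw text lets two representations that differ only in token order or spacing be treated as the same glycan.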
- a glycan data management system can include a non-transitory memory that contains instructions for converting among a plurality of glycan formats (such as a plurality of platform-specific glycan formats, e.g., 410, 420, 430, 440) and a universal glycan format 400 by applying at least one of a plurality of format conversion rule sets (e.g., 415, 425, 435, 445)
- the plurality of format conversion rule sets can include at least a first rule set 415 configured to convert glycan representations of a first platform-specific glycan format 410 to and from the universal glycan format 400; and a second rule set 425 configured to convert glycan representations of a second platform-specific glycan format 420 to and from the universal glycan format 400.
- the glycan data management system can further include a processor, e.g., a central processing unit (CPU), configured to: access an input glycan data comprising one or more glycan representations in the first platform-specific glycan format, the second platform-specific glycan format, or the universal glycan format; and execute the instructions to convert the input glycan data between: the first platform-specific glycan format and the universal glycan format; the second platform-specific glycan format and the universal glycan format; or the first platform-specific glycan format and the second platform-specific glycan format (e.g., via the universal glycan format).
- the processor is configured to execute the instructions to convert the input glycan data from the first platform-specific glycan format 410 to the second platform-specific glycan format 420 by applying: (1) the first rule set 415 to convert the input glycan data from the first platform-specific glycan format to the universal glycan format 400; and (2) the second rule set 425 to convert the input glycan data from the universal glycan format to the second platform-specific glycan format.
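- As a non-limiting illustration of the two-step conversion, the Python sketch below registers one rule set per platform-specific format, each consisting of a "to universal" and a "from universal" function, and converts between two formats by passing through the universal format. Both platform formats here ("format_a" and "format_b") are hypothetical stand-ins and are not the actual formats of any named search engine; the residue names and all function names are likewise assumptions.

```python
import re

# Hypothetical residue names shared by both illustrative platform formats.
TOKEN_TO_NAME = {"N": "HexNAc", "H": "Hex", "F": "Fuc"}
NAME_TO_TOKEN = {name: token for token, name in TOKEN_TO_NAME.items()}


def a_to_universal(text):
    """'HexNAc(3)Hex(3)Fuc(1)' -> 'N(3)H(3)F(1)'"""
    pairs = re.findall(r"([A-Za-z]+)\((\d+)\)", text)
    return "".join(f"{NAME_TO_TOKEN[name]}({count})" for name, count in pairs)


def universal_to_a(text):
    """'N(3)H(3)F(1)' -> 'HexNAc(3)Hex(3)Fuc(1)'"""
    pairs = re.findall(r"([A-Z])\((\d+)\)", text)
    return "".join(f"{TOKEN_TO_NAME[token]}({count})" for token, count in pairs)


def b_to_universal(text):
    """'HexNAc:3,Hex:3,Fuc:1' -> 'N(3)H(3)F(1)'"""
    pairs = (item.split(":") for item in text.split(","))
    return "".join(f"{NAME_TO_TOKEN[name]}({count})" for name, count in pairs)


def universal_to_b(text):
    """'N(3)H(3)F(1)' -> 'HexNAc:3,Hex:3,Fuc:1'"""
    pairs = re.findall(r"([A-Z])\((\d+)\)", text)
    return ",".join(f"{TOKEN_TO_NAME[token]}:{count}" for token, count in pairs)


# One rule set per platform-specific format: (to universal, from universal).
RULE_SETS = {
    "format_a": (a_to_universal, universal_to_a),
    "format_b": (b_to_universal, universal_to_b),
}


def convert(text, source, target):
    """Convert via the universal format: source -> universal -> target."""
    to_universal, _ = RULE_SETS[source]
    _, from_universal = RULE_SETS[target]
    return from_universal(to_universal(text))


converted = convert("HexNAc(3)Hex(3)Fuc(1)", "format_a", "format_b")
round_trip = convert(converted, "format_b", "format_a")
```

- Because every conversion passes through the universal representation, no rule set needs any knowledge of any other platform-specific format.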
- the non-transitory memory further comprises instructions for obtaining the input glycan data from a first glycan database in the first platform-specific glycan format, and wherein execution of the instructions by the processor generates a second glycan database comprising glycan data in the second platform-specific glycan format.
- the first platform-specific glycan format is compatible with a first glycomolecule search engine, e.g., a first glycopeptide search engine, but not compatible with a second glycomolecule search engine, e.g., a second glycopeptide search engine; and the second platform-specific glycan format is compatible with the second glycomolecule search engine, e.g., the second glycopeptide search engine.
- the plurality of format conversion rule sets further includes a third rule set 435 configured to convert glycan representations of a third platform-specific glycan format 430 to and from the universal glycan format 400.
- the glycan data management system includes a processor configured to: access an input glycan data comprising one or more glycan representations in the first platform-specific glycan format 410, the second platform-specific glycan format 420, the third platform-specific glycan format 430, and/or the universal glycan format 400; and execute the instructions to convert the input glycan data between: the first platform-specific glycan format and the universal glycan format; the second platform-specific glycan format and the universal glycan format; the third platform-specific glycan format and the universal glycan format; the first platform-specific glycan format and the second platform-specific glycan format (e.g., via the universal glycan format
- the plurality of format conversion rule sets further includes an Nth rule set 445 configured to convert glycan representations of an Nth platform-specific glycan format 440 to and from the universal glycan format 400, where N is an integer greater than 1.
- the glycan data management system includes a processor configured to: access an input glycan data comprising one or more glycan representations in one or more of the first-to-Nth platform-specific glycan formats; and execute the instructions to convert the input glycan data between: any one of the first-to-Nth platform-specific glycan formats and the universal glycan format; or between any two of the first-to-Nth platform-specific glycan formats (e.g., via the universal glycan format).
- the glycan data management system can have instructions for converting among any suitable number of platform-specific glycan formats.
- the number of platform-specific glycan formats is 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values.
- the method involves converting a glycan data format from a first platform-specific glycan format to a second platform-specific glycan format via a universal glycan format.
- the glycan data format can initially be converted from the first platform-specific glycan format 410 to the universal glycan format 400, and then from the universal glycan format to the second platform-specific glycan format 420.
- the method can be expanded to provide conversion among the existing platform-specific glycan formats and a new platform-specific glycan format by adding a conversion rule set for converting the new platform-specific glycan format to the universal glycan format.
- a method 500 of converting a glycan data format can include receiving 510 first glycan data that includes one or more glycan representations in a first platform-specific glycan format.
- the method can further include converting 520 the first glycan data from the first platform-specific glycan format to a second platform-specific glycan format via a universal glycan format, where the second platform-specific glycan format, but not the first platform-specific glycan format, is compatible with a first glycomolecule search engine.
- the first glycomolecule search engine is a first glycopeptide search engine.
- the first platform-specific glycan format is compatible with a second glycomolecule search engine, e.g., a second glycopeptide search engine.
- the second platform-specific glycan format is not compatible with the second glycomolecule search engine, e.g., the second glycopeptide search engine.
- the method includes converting the first glycan data from the first platform-specific glycan format to a third platform-specific glycan format via the universal glycan format, where the third platform-specific glycan format, but not the first platform-specific glycan format, is compatible with a third glycomolecule search engine, e.g., a third glycopeptide search engine.
- the method includes converting the first glycan data from the first platform-specific glycan format to an Nth platform-specific glycan format via the universal glycan format, where N is an integer greater than 1.
- a given glycan data can be converted to one or more of any suitable number of platform-specific glycan formats.
- the number of platform-specific glycan formats is 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values.
- converting the first glycan data involves determining that the first glycan data includes the one or more glycan representations in the first platform-specific glycan format; and applying to the first glycan data: (1) a first rule set for converting the first platform-specific glycan format to the universal glycan format; and (2) a second rule set for converting the universal glycan data format to the second platform-specific glycan format.
- the type of platform-specific glycan format of the first glycan data is provided by a user, e.g., as an input made by the user performing the method on a computer system.
- the method includes detecting one or more features of the one or more glycan representations in the first glycan data and identifying the detected feature(s) as being distinguishing feature(s) of the first platform-specific glycan format. In some embodiments, determining that the first glycan data includes the one or more glycan representations in the first platform-specific glycan format is automated, e.g., does not require user input.
- determining that the first glycan data includes the one or more glycan representations in the first platform-specific glycan format is done automatically by comparing the glycan representations with one or more known platform-specific glycan formats (e.g., a known search engine-specific format, such as, but not limited to, PGLYCO3, BYONIC, and METAMORPHEUS).
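- As a non-limiting illustration of automated format determination, the Python sketch below matches a glycan representation against a small table of distinguishing patterns. The format labels and regular expressions are hypothetical and are modeled only on the example strings appearing in this disclosure, not on the full specification of any search engine's format.

```python
import re

# Hypothetical distinguishing features, one pattern per known format.
FORMAT_PATTERNS = {
    # e.g. N(3)H(3)F(1)A(0): single capital letters, each followed by a count
    "universal": re.compile(r"^(?:[A-Z]\(\d+\))+$"),
    # e.g. HexNAc(3)Hex(3)Fuc(1): multi-letter residue names, each with a count
    "composition": re.compile(r"^(?:[A-Z][A-Za-z]+\(\d+\))+$"),
    # e.g. (N(F) (N(H(H) (H(N))))): nested parentheses with no counts
    "nested_structure": re.compile(r"^\([A-Z()\s]+\)$"),
}


def detect_format(representation):
    """Return the label of the first matching pattern, or None if nothing matches."""
    text = representation.strip()
    for label, pattern in FORMAT_PATTERNS.items():
        if pattern.match(text):
            return label
    return None


detected = [
    detect_format("N(3)H(3)F(1)A(0)"),
    detect_format("HexNAc(3)Hex(3)Fuc(1)"),
    detect_format("(N(F) (N(H(H) (H(N)))))"),
    detect_format("not a glycan"),
]
```

- A result of None could be used to fall back to the user-selected format described elsewhere herein.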
- determining that the first glycan data includes the one or more glycan representations in the first platform-specific glycan format involves the user selecting the first platform-specific glycan format, e.g., from a list of known platform-specific glycan formats.
- a method, e.g., a computer-implemented method, of expanding the cross-compatibility of a glycan database, e.g., for use in a glycomolecule search engine, is provided.
- a glycan database may be compatible with a particular search engine but may not be compatible with a different search engine.
- glycan data management systems and methods of the present disclosure can provide cross-compatibility among different search engines for which the conversion rules for converting between the platform-specific glycan format and the universal glycan format are known.
- the cross-compatibility of a glycan database with the new glycomolecule search engine can be implemented by adding a new rule set for converting the new glycan format to and from the universal glycan format.
- the method includes providing to a glycan data management system of the present disclosure a glycan database that includes input glycan data having glycan representations in a first platform-specific glycan format; and adding to the plurality of format conversion rule sets a third rule set configured to convert glycan representations of a third platform-specific glycan format to and from the universal glycan format, wherein none of the other rule sets among the plurality of format conversion rule sets is configured to convert glycan representations of the third platform-specific glycan format to and from the universal glycan format, wherein the third platform-specific glycan format is compatible with a third glycomolecule search engine and wherein the first platform-specific glycan format is not compatible with the third glycomolecule search engine.
- the method includes causing the processor to access the input glycan data of the glycan database and executing the instructions to convert the input glycan data to the third platform-specific glycan format via the universal glycan format.
- the cross-compatibility of glycan formats can be expanded for any number of new glycan formats by adding a single conversion rule set for each new glycan format to the system.
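- As a non-limiting illustration of the scaling benefit, the Python sketch below counts the bidirectional rule sets needed when every format converts through the universal format, and compares that with a hypothetical alternative (not described in this disclosure) in which each pair of formats has its own dedicated rule set.

```python
def rule_sets_via_universal(n_formats):
    """One bidirectional rule set per platform-specific format (format <-> universal)."""
    return n_formats


def rule_sets_pairwise(n_formats):
    """Hypothetical alternative: one bidirectional rule set per unordered pair of formats."""
    return n_formats * (n_formats - 1) // 2


def added_for_new_format(n_existing):
    """Rule sets that must be written when one new format joins n_existing formats."""
    via_universal = rule_sets_via_universal(n_existing + 1) - rule_sets_via_universal(n_existing)
    pairwise = rule_sets_pairwise(n_existing + 1) - rule_sets_pairwise(n_existing)
    return via_universal, pairwise


ten_formats = (rule_sets_via_universal(10), rule_sets_pairwise(10))
one_more = added_for_new_format(10)
```

- Under these assumptions, ten formats need 10 rule sets via the universal format versus 45 pairwise, and an eleventh format adds 1 rule set rather than 10.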
- a non-limiting embodiment of a method 2300 of identifying glycomolecules in a biological sample is provided.
- the method can include receiving 2310 a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine; and determining 2320 that the first search output identifies a first set of glycomolecules as being of a first type and a second set of glycomolecules as being of a second type.
- the first type may be different from the second type.
- the method can further include receiving 2330 a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine; and determining 2340 that the second search output identifies a third set of glycomolecules as being of the first type and a fourth set of glycomolecules as being of the second type.
- the method can also include identifying 2350 among the first and second search outputs a conflicting subset of glycomolecules, where the conflicting subset of glycomolecules includes: (a) glycomolecules that are present in both the first set and the fourth set; or (b) glycomolecules that are present in both the second set and the third set.
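- As a non-limiting illustration, the conflicting subset can be expressed with set operations. In the Python sketch below, the glycomolecule labels ("g1" to "g5") and the function name are hypothetical; the sketch takes the union of cases (a) and (b) so that either kind of conflict is reported.

```python
def conflicting_subset(first, second, third, fourth):
    """first/second: sets of the first/second type in the first search output.
    third/fourth: sets of the first/second type in the second search output."""
    case_a = set(first) & set(fourth)   # first type in output 1, second type in output 2
    case_b = set(second) & set(third)   # second type in output 1, first type in output 2
    return case_a | case_b


# Hypothetical glycomolecule labels for illustration.
first_set = {"g1", "g2", "g3"}
second_set = {"g4", "g5"}
third_set = {"g1", "g4"}
fourth_set = {"g2", "g3", "g5"}

conflicts = conflicting_subset(first_set, second_set, third_set, fourth_set)
```

- Here "g2" and "g3" fall under case (a) and "g4" under case (b), while "g1" and "g5" are assigned the same type by both search outputs and are therefore not in the conflicting subset.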
- a sample data (e.g., a mass spectrum) for an analyte may be associated with the first type of glycomolecule in the first search output, and the sample data (e.g., the same mass spectrum) for the same analyte may be associated with the second type of glycomolecule in the second search output.
- a sample data for an analyte may be associated with the second type of glycomolecule in the first search output, and the sample data for the same analyte may be associated with the first type of glycomolecule in the second search output.
- At least one or more of the glycomolecules in the conflicting subset of glycomolecules are present in both the first set and the fourth set, and at least one or more of some other glycomolecules in the conflicting subset of glycomolecules are present in both the second set and the third set.
- Figs. 6A and 6B and the associated description below illustrate an example of identifying a conflicting subset present in two search outputs.
- the conflicting subset of glycomolecules includes only glycomolecules that are present in both the first set and the fourth set. In some embodiments, the conflicting subset of glycomolecules includes only glycomolecules that are present in both the second set and the third set.
- the first set of glycomolecules is mutually exclusive of the third set of glycomolecules. For example, a sample data for an analyte may be associated with the first type of glycomolecule in the first search output, and the sample data for the same analyte is not associated with the first type of glycomolecule in the second search output.
- the second set of glycomolecules is mutually exclusive of the fourth set of glycomolecules. For example, a sample data for an analyte may be associated with the second type of glycomolecule in the first search output, and the sample data for the same analyte is not associated with the second type of glycomolecule in the second search output.
- Identifying the conflicting subsets can be done using any suitable option.
- the correspondence between a glycomolecule identified in one search output with a glycomolecule identified in another search output involves retrieving an identifier (in mass spectrometry, for example, this identifier may sometimes be referred to as a “scan number”) for the glycomolecule identified in one search output, and using the identifier to find the corresponding glycomolecule identified in the other search output.
- the first search output includes: a first set of identifiers (e.g., scan numbers) associated with each glycomolecule of the first set of glycomolecules; and a second set of identifiers (e.g., scan numbers) associated with each glycomolecule of the second set of glycomolecules
- the second search output includes: a third set of identifiers (e.g., scan numbers) associated with each glycomolecule of the third set of glycomolecules; and a fourth set of identifiers (e.g., scan numbers) associated with each glycomolecule of the fourth set of glycomolecules.
- identifying the conflicting subset of glycomolecules includes: (a) determining an identifier in the first set of identifiers that is the same as an identifier in the fourth set of identifiers; or (b) determining an identifier in the second set of identifiers that is the same as an identifier in the third set of identifiers.
- a sample data for an analyte may be associated with a specific identifier and with the first type of glycomolecule in the first search output, and the sample data for the same analyte may be associated with the same specific identifier and with the second type of glycomolecule in the second search output.
- a sample data for an analyte may be associated with a specific identifier and with the second type of glycomolecule in the first search output, and the sample data for the same analyte may be associated with the same specific identifier and with the first type of glycomolecule in the second search output.
- the identifier is a scan number associated with mass spectrometry data for analytes of a sample.
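- As a non-limiting illustration of identifier-based matching, the Python sketch below represents each search output as a mapping from scan number to the assigned glycomolecule type and reports the scan numbers on which the two outputs disagree. The scan numbers, type labels, and function name are hypothetical.

```python
def conflicts_by_scan(first_output, second_output):
    """Each output maps scan number -> assigned type.
    Returns {scan: (type in first output, type in second output)} for disagreements."""
    shared_scans = first_output.keys() & second_output.keys()
    return {
        scan: (first_output[scan], second_output[scan])
        for scan in sorted(shared_scans)
        if first_output[scan] != second_output[scan]
    }


# Hypothetical search outputs keyed by scan number.
engine_1 = {1001: "fucosylated", 1002: "sialylated", 1003: "fucosylated"}
engine_2 = {1001: "fucosylated", 1002: "fucosylated", 1004: "sialylated"}

scan_conflicts = conflicts_by_scan(engine_1, engine_2)
```

- Scans reported by only one search output (1003 and 1004 here) are not treated as conflicts in this sketch, since there is no second assignment to compare against.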
- the method further includes receiving an Nth search output for the biological sample, the Nth search output being associated with an Nth glycomolecule search engine; and determining that the Nth search output identifies a (2N-1)th set of glycomolecules as being of a first type and a 2Nth set of glycomolecules as being of a second type, where N is an integer greater than 0 (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 50 or more, or a number in a range defined by any two of the preceding values).
- the first type may be different from the second type.
- the method then includes identifying, among any two or more (e.g., 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 50 or more, or a number in a range defined by any two of the preceding values) of the N search outputs, a conflicting subset of glycomolecules. For example, in the case where there are three search outputs, a conflicting subset may be identified among sets corresponding to two of the three search outputs, or among sets corresponding to all three search outputs.
- the conflicting subset includes glycomolecules that are present in one or more (2k-1)th set of the 2N sets of glycomolecules, and one or more 2mth set of the 2N sets of glycomolecules, where k is an integer greater than 0, (2k-1) is less than or equal to N, m is an integer greater than 0, 2m is less than or equal to N, m.
- At most 95%, e.g., at most 90%, at most 80%, at most 70%, at most 60%, at most 50%, or at most a percentage in a range defined by any two of the preceding values, of the search outputs identify a glycomolecule of the conflicting set as being of the same type (e.g., the first type), and one or more of the other search outputs identify the glycomolecule of the conflicting set as being of a different type (e.g., the second type).
- a non-limiting example of a method 2400 of identifying glycomolecules in a biological sample can include receiving 2410 a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine; and receiving 2420 a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine.
- the method can further include determining 2430 that the first and second search outputs identify a conflicting subset of glycomolecules, where the conflicting subset of glycomolecules includes one or more glycomolecules determined as being both of a first type in the first search output and of a second type in the second search output.
- the first type may be different from the second type.
- a sample data for an analyte may be associated with the first type of glycomolecule in the first search output, and the sample data for the same analyte may be associated with the second type of glycomolecule in the second search output.
- the one or more glycomolecules of the conflicting subset of glycomolecules are determined as not being of the second type in the first search output. In some embodiments, the one or more glycomolecules of the conflicting subset of glycomolecules are determined as not being of the first type in the second search output.
- identifying the conflicting subsets can be done using any suitable option, e.g., by using a suitable identifier (e.g., scan number) associated with an analyte identified in multiple search outputs.
- the first search output includes a set of identifiers (e.g., scan numbers) associated with each glycomolecule of a first set of glycomolecules
- the second search output includes a second set of identifiers (e.g., scan numbers) associated with each glycomolecule of a second set of glycomolecules.
- determining that the first and second search outputs identify the conflicting subset of glycomolecules involves: identifying a first identifier in the first set of identifiers associated with a glycomolecule of the first type in the first set of glycomolecules; and identifying a second identifier in the second set of identifiers associated with a glycomolecule of the second type in the second set of glycomolecules, where the first and second identifiers are the same.
- a sample data for an analyte may be associated with a specific identifier and with the first type of glycomolecule in the first search output, and the sample data for the same analyte may be associated with the specific identifier and with the second type of glycomolecule in the second search output.
- the identifier is a scan number.
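The identifier-based comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes each search output has already been reduced to a mapping from scan number to a type label, and the function name `find_conflicts` and the example labels are hypothetical, not part of any search engine's output format.

```python
from typing import Dict, List, Tuple

# Assumed, simplified representation: scan number -> glycomolecule type label.
SearchOutput = Dict[int, str]


def find_conflicts(first: SearchOutput, second: SearchOutput) -> List[Tuple[int, str, str]]:
    """Return (scan, type_in_first, type_in_second) for shared scans whose types differ."""
    conflicts = []
    # Only identifiers present in both outputs can be compared.
    for scan in sorted(first.keys() & second.keys()):
        if first[scan] != second[scan]:
            conflicts.append((scan, first[scan], second[scan]))
    return conflicts


# Hypothetical example data for two search engines run on the same sample data.
first_output = {101: "sialylated", 102: "fucosylated", 103: "sialylated"}
second_output = {101: "fucosylated", 102: "fucosylated", 104: "sialylated"}
conflicts = find_conflicts(first_output, second_output)
```

Here only scan 101 is reported, because it is the only identifier present in both outputs with differing type assignments; scans 103 and 104 appear in one output only and are therefore not compared.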
- the method further includes receiving an Nth search output for the biological sample, the Nth search output being associated with an Nth glycomolecule search engine; and receiving an (N+1)th search output for the biological sample, the (N+1)th search output being associated with an (N+1)th glycomolecule search engine, where N is an integer greater than 0, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 50 or more, or a number in a range defined by any two of the preceding values.
- the method then includes identifying, among any two or more (e.g., 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 50 or more, or a number in a range defined by any two of the preceding values) of the N search outputs, a conflicting subset of glycomolecules.
- the conflicting subset of glycomolecules includes one or more glycomolecules determined as being of a first type in at most 95% (e.g., at most 90%, at most 80%, at most 70%, at most 60%, at most 50%, or at most a percentage in a range defined by any two of the preceding values) of the search outputs, and as being of a second type in one or more of the other search outputs.
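With more than two search outputs, the agreement criterion can be sketched as a vote count per identifier. The sketch below makes two assumptions that the text does not fix: the agreement fraction is computed over the engines that reported the scan, and each output is a mapping from scan number to a type label. The name `flag_conflicts` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def flag_conflicts(outputs: List[Dict[int, str]], max_agreement: float = 0.95) -> Dict[int, Counter]:
    """Flag scans where engines disagree and the majority type holds at most max_agreement of the votes."""
    flagged: Dict[int, Counter] = {}
    all_scans = set().union(*outputs)  # every scan number seen in any output
    for scan in sorted(all_scans):
        votes = Counter(out[scan] for out in outputs if scan in out)
        if len(votes) < 2:
            continue  # all reporting engines agree (or only one engine reported it)
        top_count = votes.most_common(1)[0][1]
        if top_count / sum(votes.values()) <= max_agreement:
            flagged[scan] = votes
    return flagged


# Hypothetical outputs from three search engines.
engine_outputs = [
    {7: "sialylated", 8: "fucosylated"},
    {7: "sialylated", 8: "fucosylated"},
    {7: "fucosylated", 8: "fucosylated"},
]
flagged = flag_conflicts(engine_outputs)
```

In this example scan 7 is flagged (two of three engines agree, about 67%), while scan 8 is not, since all three engines report the same type.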
- the search output obtained from the search engines can include any suitable information related to or associated with each identified glycomolecule.
- the search output includes information about a glycan identified in the sample, e.g., glycan data, based on the glycomolecule sample data, e.g., mass spectrometry data, of the biological sample.
- the glycan data includes glycan compositional information.
- the glycan data includes glycan structural information.
- the glycan data does not include glycan structural information.
- the glycan data includes glycan compositional information only.
- the glycan data includes the amount of hexose, N-acetylhexosamine, fucose, sialic acid (e.g., N-acetylneuraminic acid, N-glycolylneuraminic acid), and/or any other suitable sugar associated with a glycomolecule.
- the search output includes, without limitation, one or more of: glycan data, an identifier (in the field of mass spectrometry for example, the identifier is sometimes referred to as a “scan number”), a peptide sequence, a protein ID, a retention time, an m/z value, a charge state (also called “charge number”), and a confidence score.
- the identifier can be used to identify an analyte for which sample data, e.g., mass spectrometry data, was obtained, and can serve as a label to identify the same analyte in search outputs obtained from different glycomolecule search engines, e.g., glycopeptide search engines, such as, without limitation, PGLYCO, BYONIC, METAMORPHEUS, STRUCGP, etc., that are run on the same mass spectrometry data.
- the glycan data provides compositional and/or structural information for glycans determined as being likely present in a sample (e.g., glycans of glycopeptides, glycolipids, glycoRNA, glycoDNA) based on the sample data, e.g., mass spectrometry data.
- the search output may include glycan data and the peptide sequence of the polypeptide to which a glycan is conjugated, based on the sample data.
- the retention time, the m/z value, and/or the charge state are provided for the analyte (which may be, for example, a glycopeptide) that is identified based on the sample data.
- the search output can be in any suitable format that provides confidence scores and that can be used to determine the overlap.
- the search output is in a search engine-specific format.
- the search output is converted to a common format, e.g., to facilitate the analysis.
- the common format is a universal glycomolecular data format.
- the first search output is in a first format
- the second search output is in a second format
- the method further includes, before determining the glycomolecule profile, converting one or more of the first and second search outputs to a common format, where the common format comprises the first format, the second format, or a universal glycomolecular data format.
- the confidence score provides an indication of how well the data for a particular analyte matches the glycomolecule identified by the search engine.
- the confidence score may be an indication of a level of confidence the search engine’s algorithm has that the correct glycomolecule was identified.
- the confidence score in the search output is specific to the search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the entire glycomolecule (e.g., the glycan and the non-glycan component of the glycomolecule) identified by the search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the non-glycan component of the glycomolecule identified by the search engine.
- the confidence score may be determined by one or more algorithms of a respective search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the peptide component of the glycopeptide identified by the search engine. Due to differences in search engine algorithms and formats, different search engines may return different confidence scores for the same analyte such that search output confidence scores cannot be easily matched or compared among search outputs from different search engines. This is discussed in further detail below.
- the method includes generating a notification comprising a listing of the conflicting subset of glycomolecules.
- the notification can be generated using any suitable option.
- the notification may be displayed on a screen, a computer monitor, a touch screen, a smartphone, or any other suitable device; printed on paper; read out through a speaker; etc.
- the notification can provide any suitable information associated with the listed conflicting subset of glycomolecules.
- the notification includes, without limitation, one or more of: glycan data, an identifier, a peptide sequence, a precursor mass, a peptide sequence with other post-translational modifications (phosphorylation, acetylation, oxidation, carbamidomethylation, and other modifications), a protein ID, a retention time, an m/z value, a charge state, a confidence score (alternatively, a peptide confidence score and/or a glycan confidence score and/or an overall confidence score), and search engine identity associated with each of the identified glycomolecules in the listed conflicting subset of glycomolecules.
- the notification can provide the listing of the conflicting subset of glycomolecules in any suitable format.
- the listing of the conflicting subset of glycomolecules is provided in the notification in a search engine-specific format. In some embodiments, the listing of the conflicting subset of glycomolecules is provided in the notification in a common format, such as a universal glycomolecular data format.
- the method includes providing a link to the subset of the sample data underlying the conflicting subset of glycomolecules that allows a user to pull out the relevant data for further review (e.g., further review of the mass spectrometry data), as disclosed herein. In some embodiments, the method includes providing a selectable option for the user to provide confirmation that the glycomolecule is of the first type, the second type, or neither type (which can be a third option, or may be an undetermined type).
- the method includes options for resolving the conflicting identification of glycomolecules between the different glycomolecule search engines.
- the data underlying the search results e.g., the glycomolecule sample data or mass spectrometry data used as an input to the glycomolecule search engines.
- a subset that corresponds to the conflicting subset of glycomolecules is extracted and one or more additional analyses are performed on the subset of data.
- the one or more additional analyses resolve the conflicting identification of glycomolecules and confirm their identity.
- the method includes selecting a subset of data corresponding to the conflicting subset of glycomolecules; and performing an additional analysis on the selected data to confirm an identity of the one or more glycomolecules of the conflicting subset of glycomolecules.
- the additional analysis confirms that a glycomolecule of the conflicting subset of glycomolecules is of the first type or the second type.
- the additional analysis confirms that a glycomolecule of the conflicting subset of glycomolecules is neither the first type nor the second type.
- the additional analysis confirms that a glycomolecule of the conflicting subset of glycomolecules is of a third type that is different from the first or the second type.
- resolving the conflicting subset of glycomolecules involves an analysis of confidence scores associated with glycomolecules of the conflicting subset.
- the first search output includes a first set of confidence scores associated with the first or the second set of glycomolecules
- the second search output includes a second set of confidence scores associated with the third or the fourth set of glycomolecules.
- the confidence score can be a search engine-specific confidence score.
- the confidence score in the first search output may be a number on a scale of 0 to 100
- the confidence score in the second search output may be a number on a scale of 0 to 1000.
- correspondence between confidence scores from different glycomolecule search engines is determined empirically.
- the confidence scores may first be normalized before further analysis, e.g., before comparison of confidence scores. For example, in the case of the confidence score in the first search output on a scale of 0 to 100 and the confidence score in the second search output on a scale of 0 to 1000, the scores may be normalized so that they are both on a scale of 0 to 100 or on a scale of 0 to 1000.
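As a worked illustration of the normalization step, the sketch below linearly rescales a score from its engine-specific range onto a common 0 to 100 scale. Linear rescaling is an assumption made here for simplicity; the correspondence between engines may instead be determined empirically, as noted above. The function name `normalize_score` is hypothetical.

```python
def normalize_score(score: float, scale_min: float, scale_max: float,
                    target_max: float = 100.0) -> float:
    """Linearly map a score from [scale_min, scale_max] onto [0, target_max]."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (score - scale_min) / (scale_max - scale_min) * target_max


# A score of 42 on a 0-100 scale and a score of 850 on a 0-1000 scale,
# both expressed on a common 0-100 scale before comparison.
first_normalized = normalize_score(42, 0, 100)
second_normalized = normalize_score(850, 0, 1000)
```

After rescaling, the second score (about 85) can be compared directly with the first (about 42).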
- a weighting function can be applied to the confidence scores before further analysis.
- the weighting function depends on the glycomolecule search engine from which the search output is obtained.
- the weighting function depends on the type or source of the sample from which the sample data underlying the search output was obtained and the glycomolecule search engine used to generate the search output.
- the weighting function may suppress or reduce the confidence scores in the search output from that particular search engine for sample data obtained from a stool sample compared to a serum sample.
- a weighting function for a glycomolecule search engine is determined empirically.
- a confidence score threshold can be used to set a minimum confidence score required to be considered a reliable identification of the glycomolecule by the search engine.
- each confidence score of the first set of confidence scores is equal to or greater than a first confidence score threshold
- each confidence score of the second set of confidence scores is equal to or greater than a second confidence score threshold.
- the confidence score threshold is specific to each search engine.
- the confidence score thresholds are scaled to the total range of each search engine-specific confidence score.
- a confidence score threshold of 20 for a search engine-specific confidence score that has a total range of 0 to 100 may be comparable to a confidence score threshold of 200 for another search engine-specific confidence score that has a total range of 0 to 1000.
- the confidence scores are compared to resolve the conflicting subset of glycomolecules.
- the method includes resolving the conflicting subset of glycomolecules based on a comparison of confidence scores of the first and second sets of confidence scores.
- the confidence score associated with the glycomolecule in the first search output can be compared with the confidence score associated with the glycomolecule in the second search output.
- the first and second sets of confidence scores are normalized, e.g., to facilitate the comparison.
- resolving the conflicting subset of glycomolecules comprises, for each glycomolecule identified as being present in both the first set and the fourth set, or in both the second set and the third set, determining that the glycomolecule is of the type identified by the search output comprising the higher confidence score associated with the glycomolecule.
- the confidence score is higher (e.g., on a normalized scale) by at least 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 150%, 200%, 250%, 300%, 400%, 500%, 750%, 1000% or more, or is higher by a percentage in a range defined by any two of the preceding values.
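The score-based resolution described above can be sketched as follows. The sketch assumes both scores are already on a common, non-negative scale and interprets "higher by at least X%" as a relative margin over the lower score; both are assumptions for illustration, and `resolve_by_score` is a hypothetical name.

```python
from typing import Optional


def resolve_by_score(type_first: str, score_first: float,
                     type_second: str, score_second: float,
                     min_margin: float = 0.10) -> Optional[str]:
    """Return the type backed by the clearly higher score, or None if the call is too close."""
    if score_first < 0 or score_second < 0:
        raise ValueError("scores are assumed to be non-negative")
    if score_first == score_second:
        return None
    if score_first > score_second:
        winner, high, low = type_first, score_first, score_second
    else:
        winner, high, low = type_second, score_second, score_first
    if low == 0:
        return winner  # any positive score exceeds a zero score by an unbounded margin
    # Relative margin of the higher score over the lower score.
    return winner if (high - low) / low >= min_margin else None


clear_call = resolve_by_score("sialylated", 85.0, "fucosylated", 60.0)
close_call = resolve_by_score("sialylated", 62.0, "fucosylated", 60.0)
```

With a 10% margin, the first conflict resolves to the sialylated assignment, while the second stays unresolved and could be passed to one of the other analyses (for example, manual review of the spectra).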
- the additional analysis to resolve the conflicting subset of glycomolecules involves review of the sample data underlying the search output by a researcher or user. For example, a user may visually inspect the mass spectra associated with the analyte that gave rise to the conflicting identification by different glycomolecule search engines, and confirm that the glycomolecule is of the first type or the second type. In some embodiments, the glycomolecule is confirmed to be neither the first nor second type. In some embodiments, the glycomolecule is confirmed to be a third type that is different from the first or second type.
- the additional analysis to resolve the conflicting subset of glycomolecules involves running a search with the same sample data, or the subset thereof associated with the conflicting subset of glycomolecules, on a different glycomolecule search engine (e.g., a glycomolecule search engine different from the first and second glycomolecule search engines).
- when the third glycomolecule search engine’s output identifies the glycomolecule as being the first type, the glycomolecule is confirmed as being the first type.
- when the third glycomolecule search engine’s output identifies the glycomolecule as being the second type, the glycomolecule is confirmed as being the second type. Any number of additional glycomolecule search engines can be relied upon to further analyze the sample data.
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50 or more, or a number in a range defined by any two of the preceding values, of different glycomolecule search engines can be used to further analyze the sample data and resolve the conflicting subset of glycomolecules.
- the conflicting subset of glycomolecules can include any suitable types of glycomolecules that can be identified as a first type by a glycomolecule search engine and can be identified as a second type by another glycomolecule search engine, based on a suitable sample data (e.g., glycomolecule mass spectrometry data).
- the features of the sample data that a search engine relies on to predict or determine the properties or identities of glycomolecules in a sample can, in some cases, lead to mis-identification of the type of glycomolecule.
- different search engines will typically rely on different algorithms, and may rely on different features of the data and/or weigh features differently to perform the search.
- Conflicting subsets can then be flagged for further validation or action, as described further below.
- the methods and systems disclosed herein may consider particular types of glycomolecule pairs more closely than others to identify conflicting subsets that are more prone to mis-identification. For example, the methods and systems may weight conflicting subsets of particular types more than others in flagging for further validation or action. Different types of glycomolecules that are characterized by features in the sample data that are similar to each other are more likely to be misidentified, and may thus be considered more closely. In some embodiments, glycomolecules of the first type and glycomolecules of the second type are similar in one or more features of the sample data, including, without limitation, mass, retention time, m/z value, charge state, or any combination thereof.
- glycomolecules of the first type and glycomolecules of the second type have a difference in mass of 0 to 1 daltons. In some embodiments, glycomolecules of the first type and glycomolecules of the second type have a difference in mass of 0 to 2 daltons. In some embodiments, glycomolecules of the first type and glycomolecules of the second type have a difference in mass of 0 to 3 daltons. In some embodiments, glycomolecules of the first type and glycomolecules of the second type have a difference in mass of 1, 2, 3, 4, 5, 6, 7, or 8 daltons, or a number in a range defined by any two of the preceding values.
- glycomolecules of the first type and glycomolecules of the second type have different types of glycans associated with (e.g., conjugated to) the glycomolecule. In some cases, these glycomolecules may have similar masses such that they may be easily confused. For example, in an MRM-MS experiment, a glycopeptide having a glycan of the first type may have a mass that would be similar to the mass of a glycopeptide having a glycan of the second type.
- Figs. 24B-24D show examples of different pairs of glycans with similar masses.
- the first type is a sialic acid-containing glycomolecule (e.g., a sialic acid-containing glycopeptide) and the second type is a fucosylated glycomolecule (e.g., a fucosylated glycopeptide).
- the sialic acid is N-acetylneuraminic acid.
- An example structure of such a pairing is illustrated in Fig. 24B, where the glycans are identical except that the first glycan includes two fucoses and the second glycan includes one N-acetylneuraminic acid.
- Although these glycans are different, they have very similar masses (resulting in m/z of 1955.6972 and 1954.6768) because the mass of two fucoses is approximately the same as the mass of one N-acetylneuraminic acid. Due to the similar masses, one may be confused for the other. This confusion can have a significant impact in disease diagnostics, for example, where a glycopeptide having sialic acid may be a biomarker for a disease while a glycopeptide having fucose may not be a biomarker for that disease. Other pairs of glycans that have similar masses (and may thus cause confusion) are illustrated in Figs. 24C and 24D.
- 5 hexose has a similar mass as 4 N-acetylhexosamine, such that a glycan such as N6H7 may be confused with N2H12.
- 2 N-acetylhexosamine + 1 fucose + 2 N-acetylneuraminic acid has a similar mass as 7 hexose, such that a glycan such as N2H12F0S0G0 may be confused with N4H5F1S2.
- 2 N-acetylhexosamine + 2 hexose have similar mass as 1 fucose + 2 N-acetylneuraminic acid.
- the disclosure contemplates other suitable glycomolecule pairs that may be misidentified.
- Although the disclosure focuses on pairs of easily misidentified glycomolecules, the disclosure contemplates any suitable grouping of misidentified glycomolecules (e.g., three types of glycomolecules that are easily mistaken for each other).
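The near-equal masses cited for these pairs can be checked with simple arithmetic on monosaccharide residue masses. The sketch below uses standard monoisotopic residue masses (in daltons) and assumes the single-letter codes N = N-acetylhexosamine, H = hexose, F = fucose, S = N-acetylneuraminic acid, and G = N-glycolylneuraminic acid; that letter mapping is inferred from the composition strings and is an assumption, and the function names are hypothetical.

```python
from typing import Dict

# Monoisotopic residue masses in daltons (monosaccharide minus water).
RESIDUE_MASS = {
    "N": 203.0794,  # N-acetylhexosamine
    "H": 162.0528,  # hexose
    "F": 146.0579,  # fucose (deoxyhexose)
    "S": 291.0954,  # N-acetylneuraminic acid
    "G": 307.0903,  # N-glycolylneuraminic acid
}


def glycan_mass(composition: Dict[str, int]) -> float:
    """Sum of residue masses for a composition such as {"N": 4, "H": 5, "F": 1, "S": 2}."""
    return sum(RESIDUE_MASS[code] * count for code, count in composition.items())


def mass_difference(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Absolute mass difference between two compositions, in daltons."""
    return abs(glycan_mass(a) - glycan_mass(b))


two_fuc_vs_one_neuac = mass_difference({"F": 2}, {"S": 1})
n6h7_vs_n2h12 = mass_difference({"N": 6, "H": 7}, {"N": 2, "H": 12})
n4h5f1s2_vs_n2h12 = mass_difference({"N": 4, "H": 5, "F": 1, "S": 2}, {"N": 2, "H": 12})
n2h2_vs_f1s2 = mass_difference({"N": 2, "H": 2}, {"F": 1, "S": 2})
```

Under these assumptions the differences come out at roughly 1.02 Da (two fucoses versus one N-acetylneuraminic acid), 2.05 Da (N6H7 versus N2H12), 0.04 Da (N4H5F1S2 versus N2H12), and 2.02 Da (2 N-acetylhexosamine + 2 hexose versus 1 fucose + 2 N-acetylneuraminic acid), all small relative to the total glycan mass.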
- the method can include receiving 2510 a first search output for a biological sample, the first search output being associated with a first glycomolecule search engine.
- the first search output can include a first set of glycomolecules associated with a first set of confidence scores.
- the method can also include receiving 2520 a second search output for the biological sample, the second search output being associated with a second glycomolecule search engine.
- the second search output can include a second set of glycomolecules associated with a second set of confidence scores.
- the method can further include determining 2530 a glycomolecule profile of the biological sample based on (1) an overlap between the first and second search outputs, and (2) the first and second set of confidence scores.
- agreement in the identification of glycomolecules between two or more glycomolecule search engines run on the same sample data can indicate that the identification is reliable. This may increase the confidence a user has in the reliability of the search result relative to when search results from only one glycomolecule search engine are used.
- the number of search outputs received is not particularly limited. In some embodiments, the method includes receiving 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or more, or a number in a range defined by any two of the preceding values, search outputs, each associated with a different glycomolecule search engine.
- the overlap can be determined for any suitable number of search outputs. In some embodiments, the overlap is between any two or more of three search outputs. For example, the overlap can be between the first and second search outputs; between the second and third search outputs; between the first and third search outputs; or between the first, second, and third search outputs. In some embodiments, determining the glycomolecule profile is based on an overlap between any 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, or more, or a number in a range defined by any two of the preceding values, of the search outputs, or all of the search outputs, and the corresponding sets of confidence scores. As noted above, the search output obtained from the search engines can include any suitable information related to or associated with each identified glycomolecule, and can be in any suitable format.
- the confidence score can provide an indication of how well the data for a particular analyte matches the glycomolecule identified by the search engine.
- a weighting function can be applied to the confidence scores.
- the weighting function depends on the glycomolecule search engine from which the search output is obtained.
- the weighting function depends on the type or source of the sample from which the sample data underlying the search output was obtained and the glycomolecule search engine used to generate the search output.
- a weighting function for a glycomolecule search engine is determined empirically.
- the overlap between the search outputs includes glycomolecules that are identified as being the same among the search outputs.
- an overlap is found between different search outputs when there is agreement in identification of one or more parameters of the search outputs for the same analyte, e.g., for identified glycomolecules having the same identifier.
- the same glycomolecules are identified when the identified glycans are the same.
- the same glycomolecules are identified when the identified peptide sequences are the same.
- the same glycomolecules are identified when the identified glycans and peptide sequences are both the same.
- the overlap includes one or more of peptide sequences, protein IDs, retention times, m/z values, or charge states.
- a threshold score for the confidence score for each search output can be used to focus the analysis on those glycomolecules identified by the respective search engines with a desired degree of confidence.
- the overlap includes a set of glycomolecules in both the first and second search outputs.
- each glycomolecule in the first set of glycomolecules is associated with a first confidence score of the first set of confidence scores that is equal to or greater than a first confidence score threshold.
- each glycomolecule in the second set of glycomolecules is associated with a second confidence score of the second set of confidence scores that is equal to or greater than a second confidence score threshold.
- the confidence score thresholds (e.g., the first confidence score threshold, second confidence score threshold) can be any suitable value.
- the confidence score thresholds are specific to each search engine. In some embodiments, the confidence score thresholds are scaled to the total range of each search engine-specific confidence score. For example, in some embodiments a confidence score threshold of 20 for a search engine-specific confidence score that has a total range of 0 to 100 may be comparable to a confidence score threshold of 200 for another search engine-specific confidence score that has a total range of 0 to 1000.
- the method includes adjusting confidence score thresholds of the first and/or second search outputs to change the proportion or percentage of overlap between the first and second sets of glycomolecules relative to the total number of glycomolecules identified with a confidence score above the threshold.
- the confidence score thresholds of the first and/or second search outputs are increased to increase the proportion or percentage of overlap between the first and second sets of glycomolecules relative to the total number of glycomolecules identified with a confidence score above a confidence score threshold.
- the confidence score thresholds of the first and/or second search outputs are decreased to increase the proportion or percentage of overlap between the first and second sets of glycomolecules relative to the total number of glycomolecules identified with a confidence score above a confidence score threshold.
- the method may include performing any suitable combination of increasing or decreasing the confidence score thresholds of the first and/or second search outputs to increase the proportion or percentage of overlap.
- the method includes setting the value of the first confidence score threshold and the value of the second confidence score threshold to optimize a percentage overlap between the search outputs among all the first and second glycomolecules having, respectively, first confidence scores equal to or greater than the first confidence score threshold and second confidence scores equal to or greater than the second confidence score threshold.
- the method may include optimizing an overlap between two or more search outputs.
- the optimized overlap is at least 50%, at least 60%, at least 70%, at least 75%, at least 80%, at least 90%, or at least 95%, or a percentage in a range defined by any two of the preceding values.
- the confidence score threshold is set to include glycomolecules identified with confidence score in the top 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more, or a percentage in a range defined by any two of the preceding values, in the search output.
- Figs. 29A and 29B and the associated description below illustrate an example of adjusting threshold scores to optimize overlaps.
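One way to picture the threshold adjustment is a small grid search over candidate threshold pairs. The sketch below is illustrative only: it measures overlap as the shared identifications divided by all identifications passing either threshold (a Jaccard-style fraction, which is an assumption), treats two identifications as the same when scan number and glycomolecule label both match, and uses hypothetical names and data throughout.

```python
from itertools import product
from typing import Dict, Iterable, Optional, Set, Tuple

# Assumed representation: scan number -> (glycomolecule label, confidence score).
Output = Dict[int, Tuple[str, float]]


def passing(output: Output, threshold: float) -> Set[Tuple[int, str]]:
    """Identifications whose confidence score is equal to or greater than the threshold."""
    return {(scan, label) for scan, (label, score) in output.items() if score >= threshold}


def best_thresholds(first: Output, second: Output,
                    grid_first: Iterable[float], grid_second: Iterable[float],
                    min_shared: int = 1) -> Optional[Tuple[float, float, float]]:
    """Return (overlap_fraction, t_first, t_second) maximizing overlap, or None if no pair qualifies."""
    best = None
    for t1, t2 in product(list(grid_first), list(grid_second)):
        a, b = passing(first, t1), passing(second, t2)
        shared = a & b
        if len(shared) < min_shared:
            continue  # avoid trivially "perfect" overlap on empty sets
        fraction = len(shared) / len(a | b)
        if best is None or fraction > best[0]:
            best = (fraction, t1, t2)
    return best


# Hypothetical outputs: first engine scores on 0-100, second engine on 0-1000.
first_output = {1: ("N4H5S2", 90), 2: ("N4H5F1", 30), 3: ("N2H8", 70)}
second_output = {1: ("N4H5S2", 800), 2: ("N4H5S1", 250), 3: ("N2H8", 650)}
best = best_thresholds(first_output, second_output, [20, 50], [200, 500])
```

In this toy example, raising both thresholds removes the low-scoring, disagreeing identifications for scan 2, so the overlap rises from 50% to 100%. The `min_shared` guard reflects the practical concern that very strict thresholds can inflate the overlap percentage while discarding most identifications.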
- the search output includes an identifier (e.g., a scan number), that can be used to trace back a search result regarding a particular glycomolecule identified in one search output to the subset of the sample data underlying the search result, and to find the corresponding search result in a different search output (from a different glycomolecule search engine run on the same sample data).
- the first search output includes a first set of identifiers
- the second search output includes a second set of identifiers.
- determining the glycomolecule profile further includes: determining the first set of identifiers associated with the first set of glycomolecules; and determining the second set of identifiers associated with the second set of glycomolecules.
- the method further includes generating a consensus list of glycomolecules based on the overlap.
- the consensus list of glycomolecules can be generated using any suitable option.
- a consensus list of glycomolecules is generated by determining an overlap between search outputs and/or determining differences (e.g., non-overlapping portions) between search outputs.
- the consensus list can include all glycomolecules identified as being in the first set and the second set.
- generating the consensus list includes applying a weighting factor to either one or both of the first and second sets of confidence scores.
- the weight of the contribution of each glycomolecule search engine to the consensus list of glycomolecules differs between different search engines.
- determining the glycomolecule profile includes applying a weighting factor to the search output according to the glycomolecule search engine associated therewith.
- the consensus list of glycomolecules includes a confidence score for each identified glycomolecule.
- the confidence score in the consensus list is a search-engine specific confidence score.
- the confidence score in the consensus list is a consensus confidence score derived from the search engine-specific confidence scores of two or more search engines. For example, in the case of two search outputs from two search engines, for each analyte (corresponding to the same identifier in both search outputs), a consensus confidence score may be determined by applying a function to the two search engine-specific confidence scores of the two search outputs. For example, the consensus confidence score may be obtained by computing an average of the search engine-specific confidence scores for an analyte.
- the consensus confidence score may be obtained by computing a median of search engine-specific confidence scores for an analyte.
- the consensus confidence score may be obtained by computing a weighted function (e.g., a weighted average) of search engine-specific confidence scores for an analyte.
- Computing a weighted function may include applying one or more different weighting factors to different search engines.
- the confidence scores may first be normalized before applying a desired function (e.g., an average function, a median function, a weighted average function) to derive the consensus confidence score.
- the scores may be normalized so that they are both on a scale of 0 to 100 or on a scale of 0 to 1000.
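As a non-limiting illustration of the normalization and consensus scoring described above, the following Python sketch rescales each search engine-specific confidence score onto a 0 to 100 scale and then applies an average, a median, or a weighted average. The function names, the engine labels, the linear min-max rescaling, and the score ranges are assumptions made for illustration only; the disclosure does not specify them.

```python
from statistics import mean, median


def normalize(score, low, high, scale=100.0):
    """Linearly rescale a score from [low, high] onto [0, scale]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (score - low) / (high - low) * scale


def consensus_score(scores, ranges, method="mean", weights=None):
    """Combine engine-specific scores for one analyte into a consensus score.

    scores  -- {engine label: raw confidence score} (labels are hypothetical)
    ranges  -- {engine label: (lowest, highest) possible raw score} (assumed known)
    weights -- {engine label: weighting factor}, used only when method="weighted"
    """
    normalized = {
        engine: normalize(raw, *ranges[engine]) for engine, raw in scores.items()
    }
    values = list(normalized.values())
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "weighted":
        if weights is None:
            raise ValueError("weights are required for the weighted method")
        total = sum(weights[engine] for engine in normalized)
        return sum(normalized[e] * weights[e] for e in normalized) / total
    raise ValueError(f"unknown method: {method}")


# Hypothetical scores for one analyte (same identifier in both search outputs).
scores = {"engine_a": 50.0, "engine_b": 0.5}
ranges = {"engine_a": (0.0, 200.0), "engine_b": (0.0, 1.0)}
average = consensus_score(scores, ranges)
weighted = consensus_score(
    scores, ranges, "weighted", {"engine_a": 1.0, "engine_b": 3.0}
)
```

Normalizing before combining keeps a search engine with a numerically larger score range from dominating the consensus value; the weighting factors then express how much each search engine is trusted.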
- the sample or biological sample in any method or system of the present disclosure can be any suitable biological source of glycomolecules (e.g., glycopeptide, glycoprotein, glycolipid, glycoRNA, glycoDNA).
- the biological sample is a urine, stool, blood, or serum sample.
- the sample is from human urine, human stool, human blood, or human serum.
- the sample is from yeast, a plant, a non-human mammal, or a human.
- a method of the present disclosure includes generating the glycomolecule sample data or mass spectrometry data from the biological sample.
- Any method of the present disclosure can be performed on a suitable system configured to perform the various steps. Any method of the present disclosure can be implemented on a suitable computer system.
- an electronic system that includes: a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any method of the present disclosure.
- Glycan data e.g., input glycan data
- Glycan data for use in the systems and methods of the present disclosure can be any suitable data providing compositional and/or structural information of glycans, e.g., in a glycan database.
- the glycan format (e.g., universal glycan format 400, any one or more of the platform-specific glycan formats 410, 420, 430, 440) can include a representation of any suitable aspect of a glycan in the glycan data.
- the glycan format (e.g., universal glycan format 400, any one or more of the platform-specific glycan formats 410, 420, 430, 440) can include glycan compositional information.
- the universal glycan format includes compositional information only.
- the universal glycan format includes glycan structural information.
- the universal glycan format includes glycan compositional information and glycan structural information.
- At least one of the platform-specific glycan formats includes compositional information only. In some embodiments, at least one of the platform-specific glycan formats includes glycan structural information. In some embodiments, at least one of the platform-specific glycan formats includes glycan compositional information and glycan structural information (e.g., the format compatible with the PGLYCO3 search engine).
- the universal glycan format 400 can include any suitable representation for different monosaccharides that are or may be present in a glycan.
- the universal glycan format includes a single-letter representation for different monosaccharides.
- the universal glycan format uses 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more characters to represent each of the different monosaccharides.
- the universal glycan format represents hexose as “H”.
- the universal glycan format represents N-acetylhexosamine as “N”.
- the universal glycan format represents fucose as “F”.
- the universal glycan format represents N-acetylneuraminic acid as “A”. In some embodiments, the universal glycan format represents N-glycolylneuraminic acid as “G”. In some embodiments, each representation of a monosaccharide in the universal glycan format is associated with the amount of the monosaccharide in a glycan.
- the glycan format (e.g., the universal glycan format 400, or any one or more of the platform-specific glycan formats 410, 420, 430, 440) provides the amount of a monosaccharide that is present in a glycan.
- the glycan format (e.g., the universal glycan format 400, any one or more of the platform-specific glycan formats 410, 420, 430, 440) lists the number of each monosaccharide that is present in the glycan.
- the universal glycan format 400 lists the amount of a monosaccharide that is not present in the glycan, e.g., as “0”. Inclusion of monosaccharides that are not present in the glycan representation is advantageous because it ensures uniformity in the format of the representation. That is, in the universal glycan format, glycan representations for each glycan include information about all the same sugars of interest (whether present or absent). This is in contrast to conventional formats, where glycan representations omit the monosaccharides that are absent. This uniformity of the universal glycan format is readily evident in Figs.
- the universal glycan format includes the amount of fucose and/or sialic acid to represent a glycan, even if there is no fucose and/or sialic acid present in the glycan.
- the universal glycan format includes the amount of N-acetylneuraminic acid and/or N-glycolylneuraminic acid to represent a glycan, even if there is no N-acetylneuraminic acid and/or N-glycolylneuraminic acid present in the glycan.
- the universal glycan format includes a first number l corresponding to an amount of hexose in the glycan data; a second number m corresponding to an amount of N-acetylhexosamine in the glycan data; a third number n corresponding to an amount of fucose in the glycan data; and a fourth number o corresponding to an amount of N-acetylneuraminic acid in the glycan data, where l, m, n, and o are each an integer 0 or greater.
- the universal glycan format further comprises a fifth number p corresponding to an amount of N-glycolylneuraminic acid, wherein p is an integer 0 or greater.
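The compositional part of the universal glycan format described above can be sketched in Python as follows. The single-letter symbols (H, N, F, A, G) and the listing of absent monosaccharides as 0 come from the description; the concrete string layout (symbol followed directly by its count, in the fixed order H, N, F, A, G) and the function names are assumptions made for illustration.

```python
import re

# Single-letter symbols from the description; the fixed ordering is an assumption.
UNIVERSAL_ORDER = ("H", "N", "F", "A", "G")


def to_universal(counts):
    """Render monosaccharide counts as a string that lists every symbol.

    Symbols missing from `counts` are written with a count of 0 so that
    every glycan representation has the same shape.
    """
    unknown = set(counts) - set(UNIVERSAL_ORDER)
    if unknown:
        raise ValueError(f"unsupported monosaccharide symbols: {sorted(unknown)}")
    return "".join(f"{symbol}{counts.get(symbol, 0)}" for symbol in UNIVERSAL_ORDER)


def from_universal(text):
    """Parse a string produced by to_universal back into a count mapping."""
    pattern = "".join(rf"{symbol}(\d+)" for symbol in UNIVERSAL_ORDER)
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"not in the assumed universal layout: {text!r}")
    return {s: int(n) for s, n in zip(UNIVERSAL_ORDER, match.groups())}


# A glycan with 5 hexose, 4 N-acetylhexosamine and 2 N-acetylneuraminic acid.
example = to_universal({"H": 5, "N": 4, "A": 2})
```

Because every representation carries all five counts, two representations can be compared field by field without first checking which monosaccharides were omitted.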
- a glycan format (e.g., the universal glycan format 400, any one or more of the platform-specific glycan formats 410, 420, 430, 440) may provide glycan structural information.
- the PGLYCO (Structure) format uses the sequence of letters to represent the relative position of associated monosaccharides and further uses parentheses to denote the branching structure of the glycan.
- the glycan format includes, for a given glycan, glycan structural information.
- At least one of the platform-specific glycan formats includes glycan structural information.
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values, of the platform-specific glycan formats include glycan structural information.
- none of the platform-specific glycan formats includes glycan structural information.
- the universal glycan format 400 includes glycan structural information. In some embodiments, the universal glycan format 400 does not include glycan structural information.
- the glycan structural information is, for a given glycan, a single glycan structural arrangement. In some embodiments, the glycan structural information is, for a given glycan, the most likely glycan structural arrangement. In some embodiments, the glycan structural information is, for a given glycan, a set of all possible glycan structural arrangements. In some embodiments, the glycan structural information is, for a given glycan, a set of all biologically consistent glycan structural arrangements (e.g., consistent with the type of structural arrangements known to be found in glycans from the same biological source).
- the platform-specific glycan format (e.g., 410, 420, 430, 440) can be suitable for representing glycan data as an input to a computer-implemented glycomolecule analysis routine, such as, but not limited to, a glycomolecule search engine.
- the platform-specific glycan format (e.g., 410, 420, 430, 440) is a glycan format used by or compatible with a glycomolecule search engine.
- a glycan database can be an input to a glycomolecule search engine when glycan data of the glycan database are represented in the glycan format used by or compatible with the glycomolecule search engine.
- a glycan database cannot be an input to a glycomolecule search engine when glycan data of the glycan database are represented in a glycan format that is not used by or is incompatible with the glycomolecule search engine.
- the first platform-specific glycan format 410 is compatible with (e.g., a glycan database in the first platform-specific glycan format can be an input 460 to) a glycomolecule search engine 450.
- the second platform-specific glycan format 420 is not compatible with (e.g., a glycan database in the second platform-specific glycan format cannot be an input 470 to) the glycomolecule search engine 450 with which the first platform-specific glycan format is compatible. In some embodiments, only the first platform-specific glycan format 410 is compatible with the glycomolecule search engine 450. In some embodiments, the second platform-specific glycan format 420 is compatible with (e.g., a glycan database in the second platform-specific glycan format can be an input to) a glycomolecule search engine with which the first platform-specific glycan format is not compatible.
- the platform-specific glycan format is a glycan format used by or compatible with a particular glycomolecule search engine or a particular subset of glycomolecule search engines that is available for public and/or commercial use.
- the glycomolecule search engine is a glycopeptide search engine.
- the platform-specific glycan format is a glycan format used by or compatible with a glycopeptide search engine, such as, but not limited to BYONIC, PGLYCO (e.g., PGLYCO3), or METAMORPHEUS.
- the universal glycan format is different from at least one of the platform-specific glycan formats. In some embodiments, the universal glycan format is different from any of the platform-specific glycan formats. In some embodiments, the universal glycan format is the same as one of the platform-specific glycan formats.
Format conversion rule sets
- the format conversion rule sets can include any suitable collection of conversion rules for converting representation of glycan data from one format to another.
- the first rule set 415 includes conversion rules for converting the first platform-specific glycan format 410 into the universal glycan format 400.
- the first rule set includes conversion rules for converting the universal glycan format 400 into the first platform-specific glycan format 410.
- the second rule set 425 includes conversion rules for converting the second platform-specific glycan format 420 into the universal glycan format 400.
- the second rule set 425 includes conversion rules for converting the universal glycan format 400 into the second platform-specific glycan format 420.
- the third rule set 435 includes conversion rules for converting the third platform-specific glycan format 430 into the universal glycan format 400.
- the third rule set 435 includes conversion rules for converting the universal glycan format 400 into the third platform-specific glycan format 430.
- the Nth rule set 445 includes conversion rules for converting the Nth platform-specific glycan format 440 into the universal glycan format 400.
- the Nth rule set 445 includes conversion rules for converting the universal glycan format 400 into the Nth platform-specific glycan format 440.
- a glycan data management system of the present disclosure can include any suitable number of format conversion rule sets in the plurality of format conversion rule sets, where each distinct rule set provides for conversion of a distinct platform-specific glycan format to and from the universal glycan format.
- the number of format conversion rule sets in the plurality of format conversion rule sets is 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values.
- the number of rule sets R in the plurality of format conversion rule sets is equal to the number of platform-specific glycan formats, e.g., the number of distinct platform-specific glycan formats.
- the instructions do not include a rule set for directly converting the input glycan data between different platform-specific glycan formats. In some embodiments, the instructions do not include a rule set for directly converting the input glycan data between the first platform-specific glycan format and the second platform-specific glycan format. In some embodiments, the instructions do not include a rule set for directly converting the input glycan data between the first platform-specific glycan format and the third platform-specific glycan format. In some embodiments, the instructions do not include a rule set for directly converting the input glycan data between the second platform-specific glycan format and the third platform-specific glycan format.
- the instructions do not include a rule set for directly converting the input glycan data between at least two of the N platform-specific glycan formats. In some embodiments, the instructions do not include a rule set for directly converting the input glycan data between any two of the N platform-specific glycan formats.
- the instructions include a rule set for directly converting the input glycan data between different platform-specific glycan formats. In some embodiments, the instructions include a rule set for directly converting the input glycan data between the first platform-specific glycan format and the second platform-specific glycan format. In some embodiments, the instructions include a rule set for directly converting the input glycan data between the first platform-specific glycan format and the third platform-specific glycan format. In some embodiments, the instructions include a rule set for directly converting the input glycan data between the second platform-specific glycan format and the third platform-specific glycan format.
- the instructions include a rule set for directly converting the input glycan data between at least two of the first to Nth platform-specific glycan formats. In some embodiments, the instructions include a rule set for directly converting the input glycan data between any two of the N platform-specific glycan formats.
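The embodiment in which each rule set converts one platform-specific glycan format to and from the universal glycan format, with no direct platform-to-platform rule set, can be sketched in Python as below. The two platform formats shown ("format_a" and "format_b") are invented for illustration and are not the formats of any named search engine; the class and function names are likewise assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

Composition = Dict[str, int]
SYMBOLS = ("H", "N", "F", "A", "G")


@dataclass(frozen=True)
class RuleSet:
    """One rule set: a platform-specific format to and from the universal form."""
    to_universal: Callable[[str], Composition]
    from_universal: Callable[[Composition], str]


def _fill(counts):
    # The universal form lists every symbol, using 0 for absent monosaccharides.
    return {symbol: counts.get(symbol, 0) for symbol in SYMBOLS}


# Hypothetical format A, e.g. "H(5)N(4)A(2)"; absent monosaccharides omitted.
def a_to_universal(text):
    return _fill({s: int(n) for s, n in re.findall(r"([A-Z])\((\d+)\)", text)})


def a_from_universal(comp):
    return "".join(f"{s}({comp[s]})" for s in SYMBOLS if comp[s] > 0)


# Hypothetical format B, e.g. "H=5;N=4;A=2"; absent monosaccharides omitted.
def b_to_universal(text):
    pairs = (item.split("=") for item in text.split(";") if item)
    return _fill({symbol: int(count) for symbol, count in pairs})


def b_from_universal(comp):
    return ";".join(f"{s}={comp[s]}" for s in SYMBOLS if comp[s] > 0)


# One rule set per platform-specific format.
RULE_SETS = {
    "format_a": RuleSet(a_to_universal, a_from_universal),
    "format_b": RuleSet(b_to_universal, b_from_universal),
}


def convert(text, source, target):
    """Convert between platform formats by passing through the universal form."""
    universal = RULE_SETS[source].to_universal(text)
    return RULE_SETS[target].from_universal(universal)


converted = convert("H(5)N(4)A(2)", "format_a", "format_b")
```

With this arrangement, supporting an additional platform-specific format requires one new rule set rather than one for every existing format, which is consistent with the number of rule sets equaling the number of platform-specific glycan formats.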
- a first glycan format includes glycan structural information
- the rule set is configured to convert the first glycan format to and from a second glycan format that does not include glycan structural information (e.g., a format that includes compositional information only).
- a glycan representation in a first format (e.g., a first platform-compatible format or a universal glycan format) having glycan structural information is converted to a second format (e.g., a second platform-compatible format or a universal glycan format) that includes compositional information only (and no structural information) by enumerating the amount of each of the monosaccharides in the glycan.
- enumerating the amount of each of the monosaccharides in the glycan data includes counting the number of each type of monosaccharide in the glycan, and representing the total number for each type of monosaccharide in the glycan.
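A minimal Python sketch of this enumeration step is given below. It assumes a structural representation in which each monosaccharide is a single letter and parentheses denote branching, in the manner described above for the PGLYCO (Structure) format; the example string, the symbol set, and the function name are illustrative assumptions.

```python
from collections import Counter


def enumerate_composition(structure, symbols=("H", "N", "F", "A", "G")):
    """Count each type of monosaccharide in a parenthesized structure string.

    Parentheses carry only branching information, so they are skipped; every
    other character is treated as one monosaccharide. The result lists every
    symbol, with 0 for monosaccharides that do not occur.
    """
    counts = Counter(ch for ch in structure if ch not in "()")
    unknown = set(counts) - set(symbols)
    if unknown:
        raise ValueError(f"unsupported symbols in structure: {sorted(unknown)}")
    return {symbol: counts.get(symbol, 0) for symbol in symbols}


# Hypothetical structure string with two N and three H residues.
composition = enumerate_composition("(N(N(H(H)(H))))")
```

The conversion in this direction loses information: distinct structural arrangements that share a composition all map to the same compositional representation.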
- converting a glycan representation in a first format, e.g., a first platform-compatible format or a universal glycan format, that does not have glycan structural information, to a second format, e.g., a second platform-compatible format, that includes structural information includes receiving information about a source of the glycan data; and generating a list of potential glycan structures based on the compositional information and the source of the glycan data.
- the possible glycan structures that may be present in a sample can depend on the source of the sample. For example, for a given composition, the list of potential glycan structures in human serum is different from the list of potential glycan structures in human stool.
- generating the list of potential glycan structures includes determining the source of the sample and determining the possible or likely glycan structures for the sample. The potential structures may then be listed as conversions of the first format to the second format in this example.
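One way to picture the source-dependent conversion from a composition-only format to a format with structural information is a lookup keyed by sample source and composition, sketched in Python below. The table contents are toy placeholders chosen only to make the example run and are not curated biological data; the table layout, the tuple ordering (H, N, F, A, G), and the function name are assumptions.

```python
from typing import Dict, List, Tuple

# Composition as counts in the assumed order (H, N, F, A, G).
Composition = Tuple[int, int, int, int, int]
StructureTable = Dict[str, Dict[Composition, List[str]]]

# Placeholder table: NOT real glycobiology, only an illustration of the shape
# such a source-specific table could take.
TOY_TABLE: StructureTable = {
    "human serum": {(3, 2, 0, 0, 0): ["(N(N(H(H)(H))))"]},
    "human stool": {(3, 2, 0, 0, 0): ["(N(N(H(H)(H))))", "(N(N(H(H(H)))))"]},
}


def candidate_structures(composition, source, table):
    """Return the potential structures for a composition in a given source.

    Raises KeyError when the source is unknown and returns an empty list when
    the source is known but has no entry for the composition.
    """
    if source not in table:
        raise KeyError(f"no structure table for source: {source!r}")
    return list(table[source].get(composition, []))


serum = candidate_structures((3, 2, 0, 0, 0), "human serum", TOY_TABLE)
stool = candidate_structures((3, 2, 0, 0, 0), "human stool", TOY_TABLE)
```

The same composition yields different lists for different sources, and each listed structure can be emitted as one converted representation in the second format.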
- a non-limiting method 600 of converting a glycan format, e.g., between formats that include and do not include glycan structural information.
- the method can include receiving 610 glycan data comprising a glycan representation in a first format that includes compositional information.
- the method can further include converting 620 the glycan representation to a second format that includes compositional information, where either: (1) the first format further comprises glycan structural information, and the second format comprises compositional information only, and wherein converting the glycan representation comprises enumerating an amount of each of the monosaccharides in the glycan data; or (2) the first format comprises compositional information only, and the second format further comprises glycan structural information, and wherein converting the representation comprises: receiving information about a source of the glycan data; and generating a list of potential glycan structures based on the compositional information and the source of the glycan data.
- the method can be implemented in any suitable system, such as a glycan data management system of the present disclosure.
- the source of the glycan data can be any suitable biological or environmental source.
- the glycan data is from a glycan database that lists the potential glycan structures that are relevant for the type of sample from which glycomolecule mass spectrometry data is obtained and for which a glycan search is being performed.
- the source includes, without limitation, urine, stool, blood, or serum, e.g., human urine, human stool, human blood, or human serum.
- the source includes yeast, a plant, a non-human mammal, or a human.
- the glycan database for use in the systems and methods of the present disclosure can be any suitable database of glycan data.
- a glycan database serves as an input to a glycomolecule search engine, e.g., a glycopeptide search engine.
- the glycan database includes representations of glycan data for one or more glycans.
- the glycan database can include representations of glycan data in any suitable glycan format (e.g., platform-specific glycan format, universal glycan format), as disclosed herein.
- the glycan database includes representations of glycan data for 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10^4, 10^5, 10^6 or more, or for a number in a range defined by any two of the preceding values, glycans.
- the glycan database is specific for the type of biological sample or source of the glycans.
- the source or biological sample is blood, serum, urine, or stool.
- the source is a human, a non-human mammal, a plant or yeast, or the biological sample is from a human, a non-human mammal, a plant or yeast.
- the glycan database for use in the systems and methods of the present disclosure can be provided to the system or in the method by any suitable option.
- the glycan database is from a third-party source.
- the glycan database is obtained from a remote server, such as a third-party server, e.g., via a network interface of the system.
- the glycan database is stored locally, e.g., in a data storage device of the system, or a local server.
- the glycan database is provided by the user, e.g., by typing in or copy-pasting glycan representations into an interface.
- the glycan database is compatible with a glycomolecule search engine, e.g., can be an input to a glycomolecule search engine, as disclosed herein.
- the glycan database is compatible with a glycopeptide search engine, such as BYONIC, PGLYCO (e.g., PGLYCO3), or METAMORPHEUS.
Additional embodiments
- the system includes a Glycopeptide Data Converter, as shown in Fig. 1, which can take the results from each database search and convert those results into a universal output format.
- the output from each database may include a confidence score for each result which can also be converted into the universal glycan format so that each output, and its confidence value, can be compared.
- embodiments provide a glycan database converter for conversion of glycan database formats (the inputs), and a novel glycopeptide result converter for conversion of glycopeptide results from different search engines (the outputs).
- a glycomolecule data analysis system can include a processor, e.g., a hardware processor such as a central processing unit (CPU); and a non-transitory memory configured to cause the processor to (e.g., having instructions thereon that, when executed, cause the processor to): access a first search output associated with a first glycomolecule search engine 1111 and a second search output associated with a second glycomolecule search engine 1112, wherein the first search output comprises first glycomolecular data for each of one or more identified first glycomolecules in a first format 1121, and wherein the second search output comprises second glycomolecular data for each of one or more identified second glycomolecules in a second format 1122.
- the non-transitory memory can further be configured to cause the processor to (e.g., have instructions thereon that, when executed, cause the processor to): convert one or more (e.g., both) of the first and second glycomolecular data to a common format 1131, 1132, wherein the common format comprises the first format, the second format, or a universal glycomolecular data format.
- the non-transitory memory can further be configured to cause the processor to (e.g., have instructions thereon that, when executed, cause the processor to): perform a comparison between the first and second glycomolecular data in the common format.
- the non-transitory memory is further configured to cause the processor to: access a third search output associated with a third glycomolecule search engine, wherein the third search output comprises third glycomolecular data for each of one or more identified third glycomolecules in a third format; convert one or more of the first, second, and third glycomolecular data to a common format, wherein the common format comprises the first format, the second format, the third format, or the universal glycomolecular data format; and perform a comparison between any two or more (e.g., between all three) of the first, second, and third glycomolecular data, the comparison performed in the common format.
- the non-transitory memory is further configured to cause the processor to: access an Nth search output associated with an Nth glycomolecule search engine 1113, wherein the Nth search output comprises Nth glycomolecular data for each of one or more identified Nth glycomolecules in an Nth format 1123; convert one or more (e.g., two or more, or all N) of the first to Nth glycomolecular data to a common format (e.g., 1131, 1132, 1133), wherein the common format comprises any one of the first to Nth format, or the universal glycomolecular data format; and perform a comparison between any two or more of the first to Nth glycomolecular data, the comparison performed in the common format, where N is an integer that is two or greater.
- the glycomolecule search engines can generate the search outputs based on a suitable input data set 1150, which can include glycomolecule sample data, e.g., mass spectrometry data.
- the search outputs accessed or retrieved in the systems and methods of the present disclosure are those obtained from different glycomolecule search engines that are run on the same glycomolecule sample data, e.g., mass spectrometry data in the input data set.
- the input data set provided to each glycomolecule search engine, from which a search output is obtained includes the same glycomolecule sample data, e.g., obtained from the same experiment or mass spectrometry run.
- the glycomolecular data in the search output can provide information about a glycomolecule that the search engine identifies in the input data set.
- the glycomolecular data includes information about a glycan identified in the sample, e.g., glycan data, based on the glycomolecule sample data, e.g., mass spectrometry data, in the input data set.
- the glycan data includes glycan compositional information.
- the glycan data includes glycan structural information.
- the glycan data does not include glycan structural information.
- the glycan data includes glycan compositional information only.
- the glycan data includes the amount of hexose, N-acetylhexosamine, fucose, and/or sialic acid (e.g., N-acetylneuraminic acid, N-glycolylneuraminic acid) associated with a glycomolecule.
- the glycomolecular data includes one or more parameters associated with each identified glycomolecule in the data.
- the glycomolecular data includes, without limitation, one or more of: glycan data, an identifier (in the field of mass spectrometry for example, the identifier is sometimes referred to as a “scan number”), a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
- the identifier can be used to identify an analyte for which sample data, e.g., mass spectrometry data, was obtained, and can serve as a label to identify the same analyte in search outputs obtained from different glycomolecule search engines, e.g., glycopeptide search engines, such as, without limitation, PGLYCO, BYONIC, METAMORPHEUS, STRUCGP, etc., that are run on the same mass spectrometry data.
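The use of the identifier as a shared label can be sketched in Python as follows: two search outputs, each reduced to a mapping from identifier (scan number) to the identified glycomolecule, are compared to find the overlap and the non-overlapping portions. The data layout, the example values, and the function name are assumptions for illustration, and both outputs are assumed to have already been converted to a common format so that equal glycomolecules compare equal.

```python
def compare_outputs(first, second):
    """Compare two search outputs keyed by identifier (e.g., scan number).

    Returns four dictionaries:
      overlap     -- identifiers for which both engines report the same glycomolecule
      conflicts   -- identifiers present in both outputs with different results
      only_first  -- identifiers reported only in the first output
      only_second -- identifiers reported only in the second output
    """
    shared = first.keys() & second.keys()
    overlap = {i: first[i] for i in shared if first[i] == second[i]}
    conflicts = {i: (first[i], second[i]) for i in shared if first[i] != second[i]}
    only_first = {i: first[i] for i in first.keys() - shared}
    only_second = {i: second[i] for i in second.keys() - shared}
    return overlap, conflicts, only_first, only_second


# Hypothetical outputs: scan number -> (peptide sequence, glycan composition).
engine_a = {101: ("NGTK", "H5N4F0A2G0"), 102: ("NVSR", "H3N2F0A0G0"), 103: ("NLTK", "H5N2F0A0G0")}
engine_b = {101: ("NGTK", "H5N4F0A2G0"), 102: ("NVSR", "H4N2F0A0G0"), 104: ("NASK", "H6N2F0A0G0")}

overlap, conflicts, only_a, only_b = compare_outputs(engine_a, engine_b)
```

The overlap can seed a consensus list, while the conflicts and the single-engine identifications are the non-overlapping portions that may be reviewed or weighted separately.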
- the glycan data provides compositional and/or structural information for glycans determined as being likely present in a sample (e.g., glycans of glycopeptides, glycolipids, glycoRNA, glycoDNA) based on the sample data, e.g., mass spectrometry data.
- the search output may include glycan data and the peptide sequence of the polypeptide to which a glycan is conjugated, based on the sample data.
- the retention time, the m/z value, and/or the charge number are provided for the analyte (e.g., a glycopeptide) that is identified based on the sample data.
- the confidence score provides an indication of how well the data for a particular analyte matches the glycomolecule identified by the search engine.
- the confidence score in the search output is specific to the search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the entire glycomolecule (e.g., the glycan and the non-glycan component of the glycomolecule) identified by the search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the non-glycan component of the glycomolecule identified by the search engine.
- the confidence score may be determined by one or more algorithms of a respective search engine.
- the confidence score provides an indication of how well the data for a particular analyte matches the peptide component of the glycopeptide identified by the search engine. Due to differences in search engine algorithms and formats, different search engines may return different confidence scores for the same analyte such that search output confidence scores cannot be easily matched or compared among search outputs from different search engines. This is discussed in further detail below.
- the glycomolecular data (e.g., the glycan data and/or the parameters associated with each identified glycomolecule, as disclosed herein) in the search output are in a search engine-specific format (e.g., 1121, 1122, 1123) that is associated with the search engine corresponding to the search output.
- the search output may be in a search engine-specific format that is associated with the particular search engine.
- the glycomolecular data (e.g., any one or all the parameters associated with each identified glycomolecule) in the search output from the PGLYCO3 search engine is in a PGLYCO-specific format.
- the glycomolecular data (e.g., any one or all the parameters associated with each identified glycomolecule) in the search output from the BYONIC search engine is in a BYONIC-specific format.
- the glycomolecular data (e.g., any one or all the parameters associated with each identified glycomolecule) in the search output from the METAMORPHEUS search engine is in a METAMORPHEUS-specific format.
- the glycomolecular data (e.g., any one or all the parameters associated with each identified glycomolecule) in the search output from the STRUCGP search engine is in a STRUCGP-specific format.
- the identifier in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the peptide sequence in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the protein ID in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the retention time in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the m/z value in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the charge number in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the confidence score in the search output is in a search engine-specific format (e.g., a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or STRUCGP-specific format).
- the disclosed system may be configured to automatically detect a format of the search output. For example, the system may compare the text or arrangement of the glycomolecular data to known formats to determine that the search output matches a particular format. In some embodiments, the user may submit a user input selecting a particular format to indicate that an associated search output is in the particular format.
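Automatic format detection of the kind described above could, for tabular search outputs, compare the column names in the header row against the columns expected for each known format. The Python sketch below assumes tab-delimited outputs; the format labels and column names are invented for illustration and do not reproduce the actual headers of any named search engine.

```python
# Hypothetical required columns per known format (not real engine headers).
KNOWN_FORMATS = {
    "engine_a": {"ScanNo", "Peptide", "GlycanComposition", "TotalScore"},
    "engine_b": {"Scan", "Sequence", "Glycans", "Score", "z"},
}


def detect_format(header_line, known_formats=KNOWN_FORMATS, delimiter="\t"):
    """Return the label of the single known format whose columns all appear.

    Returns None when no format matches or when the match is ambiguous, in
    which case the system could fall back to a user-selected format.
    """
    columns = {name.strip() for name in header_line.rstrip("\r\n").split(delimiter)}
    matches = [
        label for label, required in known_formats.items() if required <= columns
    ]
    return matches[0] if len(matches) == 1 else None


detected = detect_format("ScanNo\tPeptide\tGlycanComposition\tTotalScore\tRT\n")
unknown = detect_format("id\tvalue\n")
```

Returning None rather than guessing leaves room for the user-input path, in which the user selects the format associated with a given search output.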
- the glycomolecular data (e.g., glycan compositional and/or structural information) in the search output is in a search engine-specific format, such as when the search output is accessed or retrieved by a system or in a method of the present disclosure.
- the glycan data in the search output from PGLYCO (e.g., PGLYCO3) is in a PGLYCO-specific format.
- the glycan data in the search output from BYONIC is in a BYONIC-specific format.
- the glycan data in the search output from METAMORPHEUS is in a METAMORPHEUS-specific format.
- the glycan data in the search output from STRUCGP is in a STRUCGP-specific format.
- a number of searches may be conducted on multiple search engines, resulting in multiple search outputs having glycomolecular data (e.g., the first, second, and/or third glycomolecular data) of different search engine-specific formats.
- the glycomolecular data from the multiple search outputs may be converted to a common format (e.g., 1131, 1132, 1133). That is, the data may all be converted to one of the search engine-specific formats such that they are all in the same format.
- the common format is a search engine-specific format
- the glycomolecular data is converted from a search engine-specific format that is not the common format to the search engine-specific format that is the common format.
- the common format is a search output format employed by one of the glycomolecule search engines, e.g., glycopeptide search engines, such as, without limitation, PGLYCO, BYONIC, METAMORPHEUS, STRUCGP, etc.
- the common format is a PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or a STRUCGP-specific format.
- the common format is not the PGLYCO-specific format.
- the common format is not the BYONIC-specific format.
- the common format is not the METAMORPHEUS-specific format.
- the common format is not the STRUCGP-specific format.
- the glycomolecular data e.g., the first, second, and/or third glycomolecular data
- a common format e.g., 1131, 1132, 1133
- the universal glycomolecular data format is described in greater detail below.
- the common format is not a search engine-specific format (e.g., not the PGLYCO-specific format, BYONIC-specific format, METAMORPHEUS-specific format, or the STRUCGP-specific format).
- the glycomolecular data can be converted to the common format using any suitable option.
- the glycomolecular data e.g., the first, second, and/or third glycomolecular data
- the parameters can include one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
- an output that includes the glycomolecular data (e.g., first, second and/or third glycomolecular data) in the common format can be generated (e.g., as a new data file, a printed output, a display on a screen, etc.).
- the common format is a universal glycomolecular data format, as disclosed herein.
- the glycomolecular data e.g., the first, second, and/or third glycomolecular data, is converted to the universal glycomolecular data format, as disclosed herein.
- the glycomolecular data e.g., the first, second, and/or third glycomolecular data
- the glycomolecular data is converted to a common format by removing one or more parameters in a glycomolecule search engine-specific format.
- the glycomolecular data e.g., the first, second, and/or third glycomolecular data
- the first glycomolecular data is converted to a common format by removing one or more parameters that are present in search output from the first glycomolecule search engine but are absent from search output from a second glycomolecule search engine.
- the first glycomolecular data is converted to a common format by removing one or more parameters that are present in search output from the first glycomolecule search engine but cannot be derived from the search output from a second glycomolecule search engine.
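The parameter-removal approach can be sketched as follows, assuming each search output has already been read into a list of dictionaries and that parameter names have been harmonized across engines. The function names and the sample parameter `engine_only_score` are hypothetical.

```python
def parameter_names(records):
    """Collect every parameter name that appears in a list of records."""
    names = set()
    for record in records:
        names.update(record)
    return names


def drop_unshared_parameters(first_records, second_records):
    """Return copies of first_records limited to parameters that the
    second search output also provides."""
    shared = parameter_names(first_records) & parameter_names(second_records)
    return [
        {key: value for key, value in record.items() if key in shared}
        for record in first_records
    ]
```

This sketch handles only parameters that are absent outright; deciding whether a parameter can be derived from another output would need engine-specific logic that is not shown here.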
- converting the glycomolecular data, e.g., the first, second, and/or third glycomolecular data, to a common format involves deriving a consensus confidence score from one or more search engine-specific confidence scores associated with the identified glycomolecule, as described in further detail below.
- a glycomolecule data analysis system for comparing glycomolecule search outputs from two or more glycomolecule search engines, where the system includes a processor, e.g., a hardware processor such as a central processing unit (CPU); and a non-transitory memory configured to cause the processor to (e.g., having instructions thereon that, when executed, cause the processor to): access two or more search outputs from two or more glycomolecule search engines, each search output (1) being associated with one of the two or more glycomolecule search engines and (2) comprising glycomolecular data in a search engine-specific format for each of one or more glycomolecules identified by the glycomolecule search engine.
- the non-transitory memory can further be configured to cause the processor to (e.g., have instructions thereon that, when executed, cause the processor to): extract one or more parameters (e.g., as discussed herein) associated with each of the one or more identified glycomolecules in the glycomolecular data of each of the two or more search outputs, where the one or more parameters includes an identifier.
- the non-transitory memory can further be configured to cause the processor to (e.g., have instructions thereon that, when executed, cause the processor to): perform a comparison between the glycomolecular data from the two or more search outputs based on the one or more extracted parameters.
- the identifier provides a label to identify the same analyte in search outputs from different search engines when they are run on the same sample data, e.g., the same mass spectrometry data.
- the comparison is performed by identifying glycomolecules associated with the same identifier in the glycomolecular data from the two or more search outputs.
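As an illustration of matching by identifier, the sketch below groups records from several search outputs under their shared identifier. It assumes each record is a dictionary with an `identifier` key (for example, a scan number); the function and engine names are hypothetical.

```python
from collections import defaultdict


def group_by_identifier(outputs):
    """Group records from several search outputs by shared identifier.

    outputs maps a search engine name to a list of records, each a dict
    with at least an "identifier" key.
    Returns {identifier: {engine_name: record}}.
    """
    grouped = defaultdict(dict)
    for engine, records in outputs.items():
        for record in records:
            grouped[record["identifier"]][engine] = record
    return dict(grouped)


def identified_by_all(grouped, engines):
    """Identifiers for which every listed engine reported a glycomolecule."""
    return sorted(
        identifier
        for identifier, by_engine in grouped.items()
        if all(engine in by_engine for engine in engines)
    )
```

Once grouped this way, any extracted parameter can be compared across engines for the same analyte.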
- the extracted parameter further includes one or more additional features of the glycomolecular data. Suitable extracted parameters include, without limitation, one or more of: glycan data, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
- the search outputs accessed can be any suitable search outputs, as disclosed herein, e.g., from two or more glycomolecule search engines that are run on the same glycomolecule sample data, e.g., mass spectrometry data.
- the non-transitory memory is further configured to cause the processor to convert glycomolecular data of one or more of the two or more search outputs to a common format, such as a particular search engine-specific format or a universal glycomolecular data format, as disclosed herein.
- the common format is the universal glycomolecular data format.
- the converted search outputs are combined to generate a combined database that includes glycomolecular data from two or more of the search outputs in the common format, e.g., in the universal glycomolecular data format.
- the use of a common format to which search outputs from different glycomolecule search engines are converted can facilitate comparison of the search outputs across different search engine platforms (e.g., PGLYCO, BYONIC, METAMORPHEUS, STRUCGP), including, but not limited to, when the search engines are run on the same glycomolecule mass spectrometry data.
- different search outputs are compared on an analyte basis.
- glycomolecular data for the same identifier is compared across different search outputs from different search engines. By doing so, glycomolecular data determined by the different search engines for the same analyte may be compared. As explained above, this comparison may allow for validation or corroboration of the identity of analytes in a sample from mass spectrometry data.
- the comparison includes identifying one or more overlaps between any two or more of the glycomolecular data.
- overlap between search results from different glycomolecule or glycopeptide search engines can in some cases indicate which of the identified glycomolecules or glycopeptides are more reliable identifications.
- an overlap is found between different glycomolecular data when there is agreement in identification of one or more parameters of the glycomolecular data for the same analyte, e.g., for identified glycomolecules having the same identifier.
- the overlap includes a set of glycomolecules that are identified in any two or more of the glycomolecular data.
- the overlap includes a set of glycomolecules having the same glycan composition in different search outputs, e.g., different identified glycomolecules associated with the same identifier. In some embodiments, the overlap includes a set of glycomolecules having the same glycan composition and peptide sequence in different search outputs, e.g., different identified glycomolecules associated with the same identifier. In some embodiments, the overlap includes one or more of peptide sequences, protein IDs, retention times, m/z values, or charge numbers.
- the comparison includes identifying a difference between any two or more of the glycomolecular data.
- differences between search results from different glycomolecule or glycopeptide search engines can in some cases indicate which of the identified glycomolecules or glycopeptides are less reliable identifications.
- a difference is identified between different glycomolecular data when there is disagreement as to the identity or a property of an analyte.
- a first search engine may identify (e.g., to a threshold confidence level) a first analyte as a first glycopeptide while a second search engine may identify the same analyte as a second glycopeptide that is different from the first glycopeptide (e.g., to a threshold confidence level).
- the difference includes one or more sets of glycomolecules that are identified in one of the search results but not in another. In some embodiments, the difference includes one or more sets of glycomolecules having different glycan compositions in different search outputs, e.g., different identified glycomolecules associated with the same identifier. In some embodiments, the difference includes one or more sets of glycomolecules having different glycan compositions and peptide sequences in different search outputs, e.g., different identified glycomolecules associated with the same identifier. In some embodiments, the difference includes one or more of peptide sequences, protein IDs, retention times, m/z values, or charge numbers.
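The overlap and difference concepts can be sketched together. The example assumes two search outputs already converted to a common format and keyed by identifier; the key names `glycan` and `peptide` and the result labels are hypothetical.

```python
def compare_outputs(first, second, keys=("glycan", "peptide")):
    """Compare two search outputs keyed by identifier.

    first and second map identifier -> record (dict). Records agree when
    every parameter named in keys has the same value in both.
    """
    overlap, conflicts = [], []
    for identifier in sorted(first.keys() & second.keys()):
        a, b = first[identifier], second[identifier]
        if all(a.get(key) == b.get(key) for key in keys):
            overlap.append(identifier)
        else:
            conflicts.append(identifier)
    return {
        "overlap": overlap,                                    # same identification
        "conflicts": conflicts,                                # same analyte, different call
        "only_first": sorted(first.keys() - second.keys()),    # found by first engine only
        "only_second": sorted(second.keys() - first.keys()),   # found by second engine only
    }
```

Passing a different `keys` tuple (for instance, glycan composition alone) changes how strict the agreement test is.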
- the method 1200 can include receiving 1210 a first search output associated with a first glycomolecule search engine and a second search output associated with a second glycomolecule search engine, wherein the first search output comprises first glycomolecular data for each of one or more identified first glycomolecules in a first format, and wherein the second search output comprises second glycomolecular data for each of one or more identified second glycomolecules in a second format.
- the method can further include converting 1220 one or more of the first and second glycomolecular data to a common format, wherein the common format comprises the first format, the second format, or a universal glycomolecular data format.
- the method can also include comparing 1230 the first and second glycomolecular data in the common format.
- the method can include receiving search outputs from any suitable number of glycomolecule search engines.
- the method includes receiving a third search output associated with a third glycomolecule search engine, where the third search output comprises third glycomolecular data for each of one or more identified third glycomolecules in a third format; converting one or more of the first, second, and third glycomolecular data to a common format, wherein the common format comprises the first format, the second format, the third format, or the universal glycomolecular data format; and comparing any two or more of the first, second, and third glycomolecular data, each in the common format.
- the method includes receiving an Nth search output associated with an Nth glycomolecule search engine, where the Nth search output comprises Nth glycomolecular data for each of one or more identified Nth glycomolecules in an Nth format; converting one or more of the first to Nth glycomolecular data to a common format, wherein the common format comprises any one of the first to Nth formats, or the universal glycomolecular data format; and comparing any two or more (e.g., two or more, all N) of the first to Nth glycomolecular data, each in the common format, where N is an integer of 2 or greater.
- the common format is the universal glycomolecular data format as disclosed herein. In some embodiments, the common format is a search engine-specific format.
- the glycomolecular data includes one or more parameters associated with each of the glycomolecules identified in the glycomolecular data, where the parameters are in a search engine-specific format and include one or more of glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score. Any one or more of the parameters can be converted to the common format.
- the method includes providing a data set for running on each of the two or more glycomolecule search engines, wherein the data sets include the same glycomolecule sample data, e.g., mass spectrometry data.
- the glycomolecule sample data, e.g., mass spectrometry data, in the data sets provided to the different glycomolecule search engines are obtained from the same experiment or glycomolecule mass spectrometry run.
- the method includes generating the glycomolecule sample data from a biological sample, e.g., by running a glycomolecule mass spectrometry experiment on the sample; and providing the data set containing the generated glycomolecule sample data.
- the data set can include any suitable component of an input to a glycomolecule search engine.
- the data set further includes, without limitation, one or more of a glycan database and a protein/peptide database (or any other suitable database, e.g., a lipid database, an RNA database, a DNA database).
- the two or more glycomolecule search engines are two or more glycopeptide search engines
- the method further includes providing a dataset for running on each of the two or more glycopeptide search engines, where the datasets include the same glycopeptide sample data, e.g., glycopeptide sample data from the same experiment or mass spectrometry run.
- the method can include providing 1310 to each of two or more glycomolecule search engines a data set comprising glycomolecule sample data, e.g., mass spectrometry data, obtained from a biological sample, wherein the same glycomolecule sample data, e.g., mass spectrometry data, are provided to each of the two or more glycomolecule search engines.
- the method can further include receiving 1320 an output from each of the two or more glycomolecule search engines run on the provided data set, wherein each output from the two or more glycomolecule search engines comprises glycomolecular data for each of one or more identified glycomolecules in a search engine-specific format.
- the method can include converting 1330 glycomolecular data of one or more of the outputs from the two or more glycomolecule search engines to a common format, wherein the common format comprises one of the search engine-specific formats or a universal glycomolecular data format.
- the method can also include determining 1340 a glycomolecule profile of the sample based on the glycomolecular data of the outputs from the two or more glycomolecule search engines in the common format.
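One option the disclosure names for determining the profile is a consensus list of glycomolecules identified by at least two of the search engines. A minimal sketch, assuming each output has been reduced to (peptide, glycan) tuples in the common format; the function name and the `min_engines` parameter are hypothetical.

```python
from collections import defaultdict


def consensus_list(outputs, min_engines=2):
    """Glycomolecules reported by at least min_engines search engines.

    outputs maps engine name -> iterable of (peptide, glycan) tuples in
    the common format. Returns a sorted list of the qualifying tuples.
    """
    support = defaultdict(set)
    for engine, glycomolecules in outputs.items():
        for glycomolecule in glycomolecules:
            # A set ensures each engine is counted once per glycomolecule.
            support[glycomolecule].add(engine)
    return sorted(
        glycomolecule
        for glycomolecule, engines in support.items()
        if len(engines) >= min_engines
    )
```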
- the common format is the universal glycomolecular data format.
- the common format is a search engine-specific format, as provided herein.
- the glycomolecule profile can be determined based on the glycomolecular data of the glycomolecule search engines in the common format using any suitable option. In some embodiments, determining the glycomolecule profile comprises generating a consensus list of glycomolecules identified in the glycomolecular data associated with at least two of the glycomolecule search engines.
- each data set includes a glycan database containing glycan data in a glycan format that is compatible with a corresponding glycomolecule search engine of the two or more glycomolecule search engines, e.g., a corresponding glycopeptide search engine of the two or more glycopeptide search engines.
- the glycan data can be in any suitable glycan format that is compatible with, or is used by, a glycomolecule search engine.
- the glycan format is compatible with a glycopeptide search engine such as, but not limited to, BYONIC, PGLYCO (e.g., PGLYCO3), or METAMORPHEUS search engines.
- Converting glycomolecular data in one format, e.g., a search engine-specific format, to a common format can be done using any suitable option.
- converting glycomolecular data to a common format includes extracting the one or more parameters, as disclosed herein, associated with each of the one or more identified glycomolecules in the glycomolecular data; and converting the one or more parameters to the common format.
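The extract-then-convert step can be sketched as a column mapping. All engine names and most column names below are hypothetical placeholders rather than the actual export layout of any search engine; the sketch also assumes rows arrive as dictionaries of strings, as from a CSV reader.

```python
# Hypothetical mappings: common parameter name -> engine-specific column name.
COLUMN_MAPS = {
    "engine_a": {"identifier": "Scan", "peptide": "Peptide",
                 "glycan": "GlycanComposition", "mz": "PrecursorMZ",
                 "charge": "Charge"},
    "engine_b": {"identifier": "Scan #", "peptide": "Sequence",
                 "glycan": "Glycans", "mz": "Observed m/z", "charge": "Z"},
}


def to_common_format(row, engine):
    """Extract the mapped parameters from one engine-specific row."""
    mapping = COLUMN_MAPS[engine]
    record = {common: row[column] for common, column in mapping.items()}
    # Normalize types so that values from different engines compare equal.
    record["identifier"] = int(record["identifier"])
    record["mz"] = float(record["mz"])
    record["charge"] = int(record["charge"])
    # Keep track of which search engine produced the record.
    record["search_engine"] = engine
    return record
```

Columns that are not in the mapping are simply dropped, which is one way to arrive at a shared set of parameters.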
- the common format is a universal glycomolecular data format, as disclosed herein.
- the converted glycomolecular data of the search outputs are combined to generate a combined database that includes glycomolecular data from two or more of the search outputs in the common format, e.g., in the universal glycomolecular data format.
- each of the glycomolecular data can include any suitable number of identified glycomolecules.
- the number of identified glycomolecules is 100 or more, 200 or more, 500 or more, 1,000 or more, 2,000 or more, 5,000 or more, 10,000 or more, 20,000 or more, 50,000 or more, 100,000 or more, 500,000 or more, 10^6 or more, or a number in a range defined by any two of the preceding values.
- the method can involve search outputs from any suitable number of glycomolecule search engines.
- the number of glycomolecule search engines is 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values.
- the glycomolecule is a glycopeptide, glycoprotein, glycolipid, glycoRNA, or glycoDNA. In some embodiments, the glycomolecule is a glycopeptide. In any system or method of the present disclosure, in some embodiments, the glycomolecule search engine is a glycopeptide search engine, glycoprotein search engine, glycolipid search engine, glycoRNA search engine, or glycoDNA search engine. In some embodiments, the glycomolecule search engine is a glycopeptide search engine.
- the glycomolecule search engine can be any suitable software platform configured to receive input data, e.g., glycomolecule sample data or mass spectrometry data, perform a search of the input data against a suitable database, and generate an output of the search results, e.g., with glycomolecular data.
- the glycomolecule search engine is a publicly available search engine.
- the glycomolecule search engine is a commercially available search engine.
- the glycomolecule search engine is a third-party search engine.
- the sample or biological sample in any system or method of the present disclosure can be any suitable biological source of glycomolecules (e.g., glycopeptide, glycoprotein, glycolipid, glycoRNA, glycoDNA).
- the biological sample is a urine, stool, blood, or serum sample.
- the sample is from human urine, human stool, human blood, or human serum.
- the sample is from yeast, a plant, a non-human mammal, or a human.
- a method of the present disclosure includes generating the glycomolecule sample data or mass spectrometry data from the biological sample.
- the results of the comparison can be further processed in any suitable manner for the intended use.
- the results of the comparison are provided as an output, e.g., in the common format, for the user.
- the results of the comparison are provided on a screen, as an electronic file, as a printout, an audio output, etc.
- a report is generated with the results of the comparison.
- the results of the comparison are stored in a database.
- the results of the comparison from different experiments e.g., having different glycomolecule sample data or mass spectrometry data
- the results of the comparison from different experiments are stored in the same database.
- the universal glycomolecular data format can include any suitable representation of glycomolecular data that conveys information desired and/or parameters extracted from search outputs from two or more glycomolecule search engines. As disclosed herein, the universal glycomolecular data format can facilitate comparison of the search results from two or more glycomolecule search engines. In some embodiments, the universal glycomolecular data format includes, for each identified glycomolecule in the glycomolecular data, without limitation, one or more of: glycan data, an identifier, a peptide sequence, a protein ID, a retention time, an m/z value, a charge number, and a confidence score.
- the universal glycomolecular data format includes glycan data comprising representations, with compositional and/or structural information, of the glycans identified as being associated with a particular glycomolecule.
- the glycan data includes glycan compositional information.
- the glycan data includes the amount of hexose, N-acetylhexosamine, fucose, and/or sialic acid (e.g., N-acetylneuraminic acid, N-glycolylneuraminic acid) associated with a glycomolecule.
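A composition-only representation of this kind might be modeled as a small record of monosaccharide counts. In the sketch below, the class, the field names, and the `Label(count)` input notation are assumptions for illustration, not a format defined by this disclosure or by any particular search engine.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GlycanComposition:
    hexose: int = 0
    hexnac: int = 0   # N-acetylhexosamine
    fucose: int = 0
    neuac: int = 0    # N-acetylneuraminic acid
    neugc: int = 0    # N-glycolylneuraminic acid


# Hypothetical residue labels mapped to the dataclass fields above.
LABELS = {"Hex": "hexose", "HexNAc": "hexnac", "Fuc": "fucose",
          "NeuAc": "neuac", "NeuGc": "neugc"}


def parse_composition(text):
    """Parse a string such as 'HexNAc(4)Hex(5)Fuc(1)NeuAc(2)'."""
    counts = {}
    for label, number in re.findall(r"([A-Za-z]+)\((\d+)\)", text):
        if label not in LABELS:
            raise ValueError(f"unknown residue label: {label}")
        name = LABELS[label]
        counts[name] = counts.get(name, 0) + int(number)
    return GlycanComposition(**counts)
```

Because the dataclass is frozen and compares by value, two compositions written in different residue orders parse to equal objects, which is convenient when comparing outputs.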
- the glycan data includes glycan structural information.
- the glycan data does not include glycan structural information. In some embodiments, the glycan data includes glycan compositional information only. In some embodiments, the universal glycomolecular data format includes glycan data in more than one representation. In some embodiments, the universal glycomolecular data format includes glycan data in one or more representations with compositional information only, and in one or more representations with glycan structural information. The glycan data can be in any suitable format.
- the glycan data in the universal glycomolecular data format is in a search engine-specific format, e.g., BYONIC, PGLYCO (e.g., PGLYCO3), STRUCGP, and/or METAMORPHEUS format.
- the glycan data is in a universal glycan format.
- the universal glycomolecular data format includes glycan data in more than one format.
- the universal glycomolecular data format includes glycan data in the universal glycan format and any one or more of the search engine-specific formats.
- the universal glycomolecular data format includes identifiers associated with each analyte detected and measured in an experiment.
- an analyte (e.g., a glycopeptide) that passes through a mass spectrometer may be assigned a unique identifier, and the detected mass spectrum that results from this analyte passing through the mass spectrometer is associated with the assigned identifier.
- the identifier can be used to identify an analyte for which sample data (e.g., mass spectrometry data) was obtained, and can serve as a label to identify the same analyte in search outputs obtained from different glycomolecule search engines, e.g., glycopeptide search engines, such as, without limitation, PGLYCO, BYONIC, METAMORPHEUS, STRUCGP, etc., that are run on the same sample data, e.g., mass spectrometry data.
- the identifier is generally in a common format, or may otherwise be converted to a common format.
- the universal glycomolecular data format includes information about the molecules to which the glycans of each analyte are conjugated.
- the universal glycomolecular data format includes amino acid sequences of the peptides associated with the analyte (i.e., glycopeptides to which glycans are conjugated).
- the peptide sequence is in a common format.
- the peptide sequence in the universal glycomolecular data format represents the linear sequence of amino acids in the identified glycopeptide.
- the peptide sequence in the universal glycomolecular data format includes one or more amino acids flanking the identified glycopeptide in a glycoprotein from which the glycopeptide was taken. In other embodiments, the peptide sequence in the universal glycomolecular data format does not include any such amino acids flanking the identified glycopeptide in the glycoprotein. In some embodiments, the peptide sequence in the universal glycomolecular data format does not include a numerical character. In some embodiments, the peptide sequence in the universal glycomolecular data format includes any suitable abbreviation of amino acids. In some embodiments, the peptide sequence in the universal glycomolecular data format includes any suitable single-letter amino acid abbreviations.
- the peptide sequence in the universal glycomolecular data format includes only the standard single-letter amino acid abbreviations. In some embodiments, the peptide sequence in the universal glycomolecular data format does not include non-standard single-letter amino acid abbreviations. In some embodiments, the peptide sequence in the universal glycomolecular data format indicates the site of glycosylation using a marker.
- the marker includes an ASCII character, an image (e.g., an emoji), or some other suitable symbol positioned within the peptide sequence at the site of glycosylation (e.g., positioned before and/or after the amino acid to which a glycan is conjugated).
- the amino acid to which a glycan is conjugated may be formatted differently (e.g., it may be bolded, underlined, highlighted, colored differently).
- the peptide sequence in the universal glycomolecular data format specifies the site of glycosylation (e.g., by providing a numerical location within the peptide sequence or the protein sequence).
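A sketch of one way to produce such a peptide sequence: strip non-letter characters, accept only the 20 standard single-letter codes, and place an ASCII marker after the glycosylated residue. The function name, the choice of `*` as the marker, and the 1-based `site` convention are assumptions.

```python
import re

# The 20 standard single-letter amino acid codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def mark_glycosite(raw_sequence, site, marker="*"):
    """Return a peptide with a marker placed after the glycosylated residue.

    raw_sequence may contain digits or other non-letter characters, which
    are stripped. site is the 1-based position of the glycosylated residue
    within the cleaned peptide.
    """
    peptide = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    unknown = set(peptide) - STANDARD_RESIDUES
    if unknown:
        raise ValueError(f"non-standard residue codes: {sorted(unknown)}")
    if not 1 <= site <= len(peptide):
        raise ValueError("site lies outside the peptide")
    return peptide[:site] + marker + peptide[site:]
```

For example, `mark_glycosite("EEQYN5STYR", 5)` yields `EEQYN*STYR`: the digit is removed and the marker follows the asparagine at position 5.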
- the universal glycomolecular data format includes a protein ID.
- the protein ID can provide the identity of the protein of which the glycopeptide identified in the glycopeptide sample data or mass spectrometry data is determined to be a part.
- the protein ID can be any suitable identifier used in a database of proteins, including, without limitation, UNIPROT, GENBANK, and ENSEMBL.
- the universal glycomolecular data format includes the retention time.
- the retention time can be in any suitable format. In some embodiments, the retention time is in minutes. In some embodiments, the retention time is in seconds.
- the universal glycomolecular data format includes the mass-to-charge (m/z) ratio for each analyte (e.g., corresponding to the ratio between the mass and charge of the precursor ion in a multiple reaction monitoring (MRM) experiment), which may be determined, for example, based on a mass spectrum associated with the analyte.
- the m/z value can be in any suitable format. For example, referencing Figs. 8 and 9, the numbers in the “PrecursorMZ” or “Observed m/z” column indicate the (m/z) ratio of each respective analyte.
- the m/z value can be rounded to any suitable number of decimal places. In some embodiments, the m/z value is rounded to 0, 1, 2, 3, 4, 5, 6, 7, 8, or more decimal places.
- the universal glycomolecular data format includes charge numbers for each analyte, indicating a charge associated with the respective analyte.
- the charge number can be in any suitable format. For example, referencing Figs. 8 and 9, the numbers in the “Charge” or “Z” column indicate the charge numbers of each respective analyte.
- the universal glycomolecular data format includes confidence scores for each analyte.
- the confidence score can be any suitable value that indicates how well the data matches the glycomolecule identified by a search engine.
- the confidence score may be an indication of a level of confidence the search engine’s algorithm has that the correct glycomolecule was identified.
- the confidence score of the universal glycomolecular data format is a confidence score selected from the search output of a particular search engine (e.g., from one of the search engines that identified the glycomolecule based on the underlying data).
- the format of the confidence score of the universal glycomolecular format may similarly be conformed to a particular one of the search engines.
- a first search engine-specific confidence score may be a number on a scale of 0 to 100, while a second search engine-specific confidence score may be a number on a scale of 0 to 1000.
- the confidence score of the universal glycomolecular format may be made to conform to the format of the first search engine (e.g., 0 to 100).
- the format of the confidence score of the universal glycomolecular format may be a fixed format.
- the confidence score of the universal glycomolecular format may always be on a scale from 0 to 100.
- the confidence score for the universal glycomolecular data format is a consensus confidence score derived from the search engine-specific confidence scores of two or more search engines.
- a consensus confidence score may be determined by applying a function to the two search engine-specific confidence scores of the two search outputs.
- the consensus confidence score may be obtained by computing an average of the search engine-specific confidence scores for an analyte.
- the consensus confidence score may be obtained by computing a median of search engine-specific confidence scores for an analyte.
- the consensus confidence score may be obtained by computing a weighted function (e.g., a weighted average) of search engine-specific confidence scores for an analyte.
- Computing a weighted function may include applying one or more different weighting factors to different search engines.
- the confidence scores may first be normalized before applying a desired function (e.g., an average function, a median function, a weighted average function) to derive the consensus confidence score.
- the scores may be normalized so that they are both on a scale of 0 to 100 or on a scale of 0 to 1000.
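The normalize-then-combine procedure can be sketched as follows. The sketch assumes every engine-specific score runs from 0 to a known maximum and that higher is better; the function names, the `method` argument, and the default target scale of 0 to 100 are hypothetical choices.

```python
from statistics import mean, median


def normalize(score, scale_max, target_max=100.0):
    """Rescale a score from [0, scale_max] to [0, target_max]."""
    return score * target_max / scale_max


def consensus_score(scores, scale_maxima, weights=None, method="mean"):
    """Derive one consensus score from engine-specific scores.

    scores and scale_maxima map engine name -> value. weights, if given,
    maps engine name -> weighting factor and selects a weighted average.
    """
    normalized = {
        engine: normalize(score, scale_maxima[engine])
        for engine, score in scores.items()
    }
    if weights is not None:
        total = sum(weights[engine] for engine in normalized)
        return sum(normalized[e] * weights[e] for e in normalized) / total
    values = list(normalized.values())
    if method == "median":
        return median(values)
    return mean(values)
```

With a first engine scoring 80 on a 0 to 100 scale and a second scoring 900 on a 0 to 1000 scale, the normalized scores are 80 and 90, so the unweighted average is 85; weighting the second engine three times as heavily gives 87.5.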
- the universal glycomolecular data format includes identifying information for the glycomolecule search engine.
- the identifying information for the glycomolecule search engine can specify the search engine from which the search output was obtained.
- the identifying information for the glycomolecule search engine can be in any suitable format.
- the identifying information for the glycomolecule search engine specifies whether the search engine is PGLYCO, BYONIC, METAMORPHEUS, or STRUCGP, or any other search engine from which the search output was obtained.
- the universal glycomolecular data format can include one or more parameters for each of the identified glycomolecules in any suitable order. In some embodiments, the identifier is listed first.
- the universal glycomolecular data format lists parameters in the following order: identifier, protein ID, peptide sequence, glycan data, m/z value, charge number, retention time, confidence score, search engine. In some embodiments, the universal glycomolecular data format lists parameters in the following order: identifier, protein ID, peptide sequence, glycan data, charge number, retention time, search engine.
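The first parameter ordering listed above could be captured in a record type whose field order doubles as the column order of an exported file. The class name, field names, CSV layout, and the default of four decimal places for m/z are assumptions for illustration.

```python
import csv
import io
from dataclasses import dataclass, astuple, fields


@dataclass
class UniversalRecord:
    # Field order follows the first ordering listed in the text.
    identifier: int
    protein_id: str
    peptide_sequence: str
    glycan_data: str
    mz_value: float
    charge_number: int
    retention_time: float
    confidence_score: float
    search_engine: str


def write_records(records, mz_decimals=4):
    """Serialize records to CSV text, rounding m/z to mz_decimals places."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    names = [f.name for f in fields(UniversalRecord)]
    writer.writerow(names)
    mz_index = names.index("mz_value")
    for record in records:
        row = list(astuple(record))
        row[mz_index] = round(record.mz_value, mz_decimals)
        writer.writerow(row)
    return buffer.getvalue()
```

Deriving the header from the dataclass keeps the column order and the record definition in one place, so reordering fields changes the export consistently.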
- Non-limiting examples of a universal glycomolecular data format are shown in Figs. 17-22.
- FIG. 30 illustrates a schematic diagram showing a non-limiting example of a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list.
- a system for curating a consensus list of glycomolecules identified in a biological sample on a user interface including a processor and a memory (e.g., non-transitory memory) having instructions that, when executed by the processor, cause the processor to retrieve a plurality of glycomolecule search results sets for a biological sample from a plurality of glycomolecule search engines 3011a, b, ... i (wherein i ≥ 3, e.g., 3, 4, 5, 6, 7, 8, 9, 10, or more, hereafter).
- retrieval of these search results may include accessing search results that were uploaded, imported, or otherwise inputted by the user into the system.
- the system may be at least partially cloud-based, in which case the user may upload search results from multiple search engines (e.g., exported files from such search engines) via an online portal.
- the exported files may be in plain-text or formatted as a csv (comma-separated values), xml (extensible markup language), or any other suitable machine-readable format.
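- As a non-limiting sketch of reading one such exported file, the following Python function parses CSV text into per-scan records using the standard `csv` module. The column names `scan`, `glycan`, and `score` are hypothetical; real exports from a given search engine may use different headers and would need their own mapping.

```python
import csv
import io


def load_results(csv_text, engine):
    """Parse exported CSV text into a list of per-scan result dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    results = []
    for row in reader:
        results.append({
            "identifier": int(row["scan"]),   # assumed column name
            "glycan": row["glycan"],          # assumed column name
            "score": float(row["score"]),     # assumed column name
            "engine": engine,
        })
    return results


# Hypothetical export contents.
sample = "scan,glycan,score\n1,N(1)H(1)F(1)A(0),85\n2,N(2)H(2)F(2)A(0),90\n"
parsed = load_results(sample, "Search Engine 1")
```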
- the system may be a local client device, in which case, the user may simply submit the search results as an input into the system.
- each of the glycomolecule search results sets 3012a, b, ... i includes glycomolecule search results 3014a, b, ... i (wherein i ≥ 3, e.g., 3, 4, 5, 6, 7, 8, 9, 10, or more, hereafter) that identify one or more predicted glycomolecules as being present.
- the predictions are made based on associated mass spectrometry data.
- each predicted glycomolecule may be predicted based on an associated mass spectrum, which is itself associated with a unique identifier, typically called a mass spectrometry scan number.
- the mass spectrum is a spectrum created by measuring signals from fragmented ions of a single type of glycomolecule that are sent through a mass spectrometer.
- a multiple reaction monitoring setup may be used to ionize and filter for glycomolecules of a single type (e.g., filtered by retention times in a liquid chromatography column and the mass-to-charge ratio from a first mass spectrometry measurement), apply an energy to the ionized glycomolecules to fragment them into fragmented ions, and then record the spectrum of the fragmented ions using a second mass spectrometry measurement.
- a unique identifier typically called an MS2 scan number is assigned to the spectrum of the fragmented ions, and each such MS2 scan number and corresponding spectrum ideally corresponds to a single type of glycomolecule. Referencing FIG. 30, the identifier 3013 may reference the MS2 scan number associated with a particular spectrum.
- the instructions, when executed by the processor, further cause the processor to determine a consensus set which includes, for each unique identifier 3013, a consensus search result identifying one of the predicted glycomolecules associated with the unique identifier 3013 in the glycomolecule search results sets.
- the consensus list element 3020 provides a glycan portion 3024a of the predicted glycomolecule for Scan #1 (a glycomolecule, for example, may include a glycan and a peptide).
- the instructions, when executed by the processor, further cause the processor to provide a user interface 3002, as shown in FIG. 30, configured to display one or more search results elements 3010a, b, ... i (wherein i ≥ 3, e.g., 3, 4, 5, 6, 7, 8, 9, 10, or more, hereafter) configured to display one or more representations of the one or more predicted glycomolecule search results 3014a, b, ... i from each of the plurality of glycomolecule search results sets 3012a, b, ... i.
- the glycomolecule search results set 3012a from Search Engine 1 3011a is summarized and at least partially shown in the first search results element 3010a as search results 3014a-l ...
- the representations of the one or more predicted glycomolecule search results 3014a, b, ... i are textual, graphical, or toggled between textual and graphical representations.
- the one or more search results elements do not display the one or more representations of the one or more predicted glycomolecule search results 3014a, b, ... i from each of the plurality of glycomolecule search results sets 3012a, b, ... i that are summarized in the search results elements 3010a, b, ... i respectively.
- the glycomolecule search results sets 3012a, b, ... i may be displayed in a plurality of respective search results elements 3010a, b, ... i.
- information from all the glycomolecule search results or at least multiple glycomolecule search results is displayed in a single search result element (e.g., a single table element that displays all search results).
- the respective locations of the plurality of search results elements 3010a, b, ... i may be determined automatically on the user interface 3002 based on a default setting, or a user modification of the default setting that is read and executed by the processor.
- the default setting may be based on a rank for the associated search engine, such rank being determined by the system disclosed herein.
- the rank may be based on usage data (e.g., usage by the user or by a population of users) associated with each of the search engines.
- the rank may be based on a type of sample that is associated with the data that is being searched. That is, the system may determine that a first search engine is more optimal for a particular type of sample (e.g., as may be determined based on how commonly used it is for the particular type of sample) than a second search engine, and may thus rank the search engines accordingly.
- the search results elements may be arranged in an order based on the rank associated with their respective search engines.
- the user modification of the default setting takes precedence.
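- As a non-limiting sketch of the arrangement logic described above, the following Python function orders search results elements by the rank of their search engines and lets a user-specified order take precedence. The function name, the convention that a lower rank number is placed first, and the handling of engines that the user did not list are assumptions made for this example.

```python
def arrange_elements(engines, rank, user_order=None):
    """Return the display order of search results elements.

    engines:    iterable of search engine names
    rank:       {engine name: rank}, lower rank is displayed first (assumed)
    user_order: optional list of engine names chosen by the user
    """
    engines = list(engines)
    default_order = sorted(engines, key=lambda engine: rank[engine])
    if not user_order:
        return default_order
    # The user modification takes precedence; unlisted engines keep rank order.
    listed = [engine for engine in user_order if engine in engines]
    remaining = [engine for engine in default_order if engine not in listed]
    return listed + remaining
```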
- settings modifications such as this or any other suitable modification may be made via a user input element such as the settings element 3050.
- the user may be able to reorder the arrangement of the search results elements on the user interface 3002 by submitting a suitable user input (e.g., by dragging and dropping or some other suitable means).
- the user interface 3002 is further configured to recognize a user input and allow scrolling through the one or more of the search results elements 3010a, b, ... i.
- the user interface 3002 is configured such that scrolling activity in one of the search result elements 3010a automatically causes the scrolling of the other search results elements 3010b ... i such that sets of the glycomolecules associated with the same unique identifiers 3013 are displayed across all of the search results elements 3010a, b, ... i as shown in FIG. 31A.
- the user interface 3002 may enable the user to make direct comparisons among the different search results sets displayed in the search results elements 3010a, b, ... i.
- the same set includes the same representations.
- the plurality of search results elements 3010a, b, ... i may be arranged horizontally on the user interface as shown in FIGS. 30, 31A-31B, and 32. In some embodiments, the plurality of search results elements 3010a, b, ... i may be vertically arranged on the user interface as shown in FIG. 33. Any suitable arrangement is contemplated by this disclosure.
- the user interface 3002 is further configured to display a scroll bar 3042 that is activated and controlled by a user input, the scroll bar 3042 further permitting scrolling activity through the plurality of the search results elements 3010a, b, ... i, and thus allowing a potentially large number of search results 3012 to be available for display on the user interface 3002.
- the representations corresponding to the predicted glycomolecule search results 3014a, b, ... i are displayed in each of the search result elements.
- the unique identifier 3013 associated with each search result is displayed to allow users to easily reference and compare search results across the different search results elements.
- although FIG. 30 and the following figures only depict search results elements 3010a, b, ... i that present scan numbers, compositional information of glycans, and confidence scores from search results, the disclosure contemplates that the search results elements 3010a, b, ... i may also display other information from the search results, for example in additional columns.
- where the glycomolecule is a glycoconjugate that includes a glycan and another molecule bonded together (e.g., a glycopeptide that includes a glycan and a peptide, a glycoRNA that includes a glycan and an RNA, a glycoDNA that includes a glycan and a DNA, a glycolipid that includes a glycan and a lipid), information such as sequence information of the other molecule may also be presented. As another example, information such as mass, mass-to-charge ratio (m/z), and/or any other suitable information from the search results may also be presented in the search results elements 3010a, b, ... i. A consensus list element 3020 configured to display the consensus set 3022 may also be displayed. Consensus sets are described in further detail below.
- the search results 3012 include a confidence score 3015 for each of the glycomolecules in the conflicting search results 3520, and wherein the conflict resolution rule set 3540 includes identifying a consensus search result 3024 based on a comparison of the confidence scores 3015 of each search result among the conflicting search results 3520.
- the confidence score can be any suitable value that indicates the degree of confidence a search engine may have that a predicted glycomolecule search result corresponds to a particular mass spectrum. For example, referencing FIG. 30, Search Engine 1 reports a confidence score of 85 for its prediction that the mass spectrum associated with Scan #1 corresponds to its predicted glycomolecule search result (which includes the glycan N(1)H(1)F(1)A(0)).
- the confidence score 3015 is a value that is reported in the search results of each search engine, and each value may thus be on its own respective search engine-specific scale.
- the confidence score 3015 is a normalized and/or scaled confidence score that facilitates comparison of confidence scores across different search engines. For example, referencing FIG. 30, the confidence scores of two or more of the Search Engines 1 to i may have been scaled differently in the original search results that may have been exported from their respective search engines (e.g., Search Engine 1 may have been on a scale of 1 to 100, Search Engine 2 may have been on a scale of 1 to 500). But as shown in FIG. 30, the system may have scaled all the confidence scores to a common scale of 1 to 100 to allow for a more straightforward comparison.
- the confidence scores of two or more of the search engines may have different relative values. For example, a score of 61 reported by Search Engine 1 may have a relative value that is higher than a score of 65 reported by Search Engine 2, because the algorithm for Search Engine 1 may apply a different standard in scoring.
- the scores may be normalized by the disclosed system to account for this difference in relative values by adjusting the confidence scores. Normalization of the scores may be based on a predetermined weighting system that weights scores of different search engines differently, may be set by manual input of a user, or may be dynamically determined based on, for example, historical performance of the different search engines.
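- As a non-limiting sketch of the scaling and relative-value adjustment described above, the following Python functions map an engine-specific scale (e.g., 1 to 500) onto the common scale of 1 to 100 and then apply a per-engine weighting factor. The function names and the example weights are assumptions; in practice the weights might come from a predetermined table, from user input, or from historical performance, as noted above.

```python
def to_common_scale(score, lo, hi, common_lo=1.0, common_hi=100.0):
    """Linearly map a score from [lo, hi] onto [common_lo, common_hi]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return common_lo + (score - lo) * (common_hi - common_lo) / (hi - lo)


def adjust_relative_value(score, engine, weights):
    """Apply a per-engine weighting factor; unknown engines keep weight 1.0."""
    return score * weights.get(engine, 1.0)


# Hypothetical weights: Search Engine 2 is assumed to score more leniently.
example_weights = {"Search Engine 1": 1.0, "Search Engine 2": 0.9}
```

With these assumed weights, a score of 61 from Search Engine 1 stays at 61 while a score of 65 from Search Engine 2 is adjusted to 58.5, matching the situation in which the lower raw score carries the higher relative value.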
- FIG. 35 is a block diagram showing a non-limiting example of a method for populating a consensus list element 3020 according to some non-limiting embodiments of the disclosure.
- the consensus set is determined, at least in part, by identifying 3510 conflicting search results 3520 and nonconflicting search results 3530 among the predicted glycomolecule search results from different search engines, and populating the consensus list element 3020 with the nonconflicting search results 3530.
- the predicted glycomolecule search results are further evaluated by resolving the conflicting search results 3520, wherein the one or more programs further include instructions that, when executed by the processor, apply a conflict resolution rule set 3540 to generate resolved conflicting search results 3550, and populate the consensus list element 3020 with the resolved conflicting search results 3550.
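- As a non-limiting sketch of the identifying step 3510, the following Python function groups predictions by unique identifier and separates identifiers on which every reporting search engine agrees from identifiers with conflicting predictions. The data layout (a dictionary per search engine keyed by scan number) and the function name are assumptions made for this example.

```python
from collections import defaultdict


def partition_results(results_sets):
    """Split predictions into nonconflicting and conflicting groups.

    results_sets: {engine name: {identifier: predicted glycomolecule}}
    Returns (nonconflicting, conflicting), where nonconflicting maps an
    identifier to the agreed glycomolecule and conflicting maps an
    identifier to {engine name: predicted glycomolecule}.
    """
    by_identifier = defaultdict(dict)
    for engine, results in results_sets.items():
        for identifier, glycomolecule in results.items():
            by_identifier[identifier][engine] = glycomolecule

    nonconflicting, conflicting = {}, {}
    for identifier, predictions in by_identifier.items():
        if len(set(predictions.values())) == 1:
            nonconflicting[identifier] = next(iter(predictions.values()))
        else:
            conflicting[identifier] = predictions
    return nonconflicting, conflicting
```

The nonconflicting entries can populate the consensus list directly, while the conflicting entries are handed to whichever conflict resolution rule set is in effect.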
- the conflict resolution rule set 3540 can include any suitable conflict resolution rules for resolving conflicting search results.
- the conflict resolution rule set 3540 takes into account one or more factors associated with the search results, such as, without limitation, the number of search engines that are in agreement among a conflicting search result, the relative confidence scores (after normalization/scaling of the confidence scores) given by each search engine for the conflicting search result, etc.
- the different factors are weighted differently to resolve a conflict.
- the system is configured to apply a conflict resolution rule set 3540 that is a default or predetermined rule set (e.g., taking into account a default factor or set of factors, and/or using a default weighting of the factors).
- the user interface is configured to allow a user to determine which factor to use, and/or to select how different factors are to be weighted, for the conflict resolution rule set 3540.
- the conflict resolution rule set 3540 comprises identifying, for a particular conflicting search result 3520 associated with a particular unique identifier 3013, one or more plurality glycomolecules (e.g., a majority glycomolecule) from the predicted glycomolecule search results of the conflicting search results 3520 associated with the particular unique identifier 3013.
- a predicted glycomolecule is a plurality glycomolecule when the same glycomolecule is predicted by more search engines than the number of search engines identifying any one other glycomolecule.
- a predicted glycomolecule is a majority glycomolecule when the same glycomolecule is predicted by more than half the search engines (e.g., by 2 out of 3 search engines, by 3 out of 4 search engines, by 3 or more out of five search engines, etc.).
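- As a non-limiting sketch of the plurality and majority definitions above, the following Python functions count how many search engines predicted each glycomolecule for one unique identifier. The function names are assumptions, and returning every top-voted glycomolecule in the case of a tie is a design choice made for this example.

```python
from collections import Counter


def plurality_glycomolecules(predictions):
    """Return the glycomolecule(s) predicted by the most search engines.

    A single returned entry is a plurality glycomolecule in the strict
    sense; several entries indicate a tie for the most votes.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return [glyco for glyco, count in counts.items() if count == top]


def majority_glycomolecule(predictions):
    """Return the glycomolecule predicted by more than half, else None."""
    glyco, count = Counter(predictions).most_common(1)[0]
    return glyco if count > len(predictions) / 2 else None
```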
- resolving the conflict for a search result involves selecting one of the conflicting search results. For example, the system may select the plurality glycomolecule.
- the search result among the conflicting search results 3520 having the highest confidence score is chosen as the resolved conflicting search results 3550 and therefore is populated 3560 in the consensus list element 3020 as shown in FIG. 35.
- the conflicting search results 3520 are resolved based on relative confidence scores 3015 (e.g., calculating a weighted score for each potential glycomolecule). In a non-limiting example, as shown in FIG.
- the predicted glycomolecule from the third search engine 3011i may be selected and displayed on the consensus list element 3020 despite a majority of the search engines (2 out of 3 in this example) indicating a different glycomolecule.
- confidence scores in this example may have been normalized and/or scaled by the system to a common scale of 1 to 100 as explained earlier.
- confidence scores 3015 may be weighted based on known characteristics of the search algorithms employed by the search engines 3011 (e.g., some search engine algorithms are tuned to be better at classifying certain glycomolecules, better at analyzing certain sample types, etc.). In some embodiments, such characteristics may be based on historical performance metrics of each search engine 3011. In some embodiments, users may be presented with characteristics/features that contributed to the confidence score 3015 for a particular scanned glycomolecule, which in some embodiments can advantageously assist the user in independent assessment of the confidence level of the associated prediction.
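- As a non-limiting sketch of resolving a conflict by confidence score, the following Python function scores each candidate glycomolecule by its best weighted confidence score and selects the highest. Treating the weighted score of a candidate as the maximum over the engines that predicted it is one assumed rule among many possible ones; the function name, the engine weights, and the example scores other than 99 are likewise assumptions.

```python
def resolve_by_confidence(candidates, weights=None):
    """Pick the candidate with the highest weighted confidence score.

    candidates: list of (engine name, glycomolecule, normalized score)
    weights:    optional {engine name: weighting factor}, default 1.0
    Returns (glycomolecule, weighted score).
    """
    weights = weights or {}
    best = {}
    for engine, glycomolecule, score in candidates:
        weighted = score * weights.get(engine, 1.0)
        if glycomolecule not in best or weighted > best[glycomolecule]:
            best[glycomolecule] = weighted
    return max(best.items(), key=lambda item: item[1])


# Two engines agree on one glycan, a third reports another with a score of 99.
conflict = [
    ("Search Engine 1", "glycan_X", 70.0),
    ("Search Engine 2", "glycan_X", 72.0),
    ("Search Engine i", "glycan_Y", 99.0),
]
chosen = resolve_by_confidence(conflict)
```

Under this rule the single high-confidence prediction is selected over the two-engine majority, which mirrors the example in which the third search engine's result is displayed in the consensus list element.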
- determining the consensus set includes determining an initial consensus set, wherein the initial consensus set includes a single predicted glycomolecule search result for each unique identifier 3013.
- the consensus set 3022 includes an initial consensus set having subsets of the full set of glycomolecules identified in all the search results sets that are input into the disclosed system (e.g., the search results identified by all the search engines 3011).
- search results (e.g., displayed in search results elements 3011a, 3011b, and 3011i) may be obtained from search engines 3011 (e.g., Search Engine 1, Search Engine 2, ... Search Engine i) to determine what glycomolecules may be present in a biological sample.
- mass spectrometry data sent to the search engines 3011 may have included four MS2 scan numbers (numbered 1 to 4 for simplicity) corresponding to four MS2 mass spectra that may each be associated with a particular type of glycomolecule.
- each of the search engines 3011 may have returned its own search results that predict the glycomolecules in an associated sample.
- all search engines (Search Engines 1 to i) may have predicted the same type of glycomolecule (which includes the glycan N(2)H(2)F(2)A(0)). In this example, the system may ensure that the consensus set includes this predicted glycomolecule search result for Scan #2.
- Search Engine 1 and Search Engine 3 may have predicted the same type of glycomolecule (which includes the glycan N(1)H(1)F(1)A(0)), while Search Engine i predicted a different type of glycomolecule (which includes the glycan N(1)H(1)F(1)A(0)).
- the disclosed system may attempt to resolve such conflict by applying a conflict resolution rule set to determine what glycomolecule should correspond to the associated scan number in the consensus set, as provided herein.
- the initial consensus set may include all search results on which all the search engines agree, as well as the search results that the system determines following application of the system’s conflict resolution rule set (in the cases where there are conflicts among the search engines).
- a user may additionally or alternatively manually select among the different conflicting glycomolecules and may thus modify the initial consensus set. When the user is finished, this modified consensus set can be exported or otherwise used as a final consensus set.
- the disclosed system may include an adaptive learning process (e.g., a suitable neural network process or other machine learning process) which may adapt the conflict resolution rule set in real-time based on the user’s modifications to the consensus set. For example, if a user modifies the initial consensus set by substituting a particular search result for a different search result, the adaptive learning process may use that information to over time adapt the conflict resolution rule set based on detected patterns.
- the modified consensus set may be exported or transferred into a separate adaptive machine learning process for a similar purpose.
- the user interface 3002 is further configured to display the unique identifier 3013 in association with the one or more representations.
- the unique identifier 3013 is a mass spectrometry scan number (e.g., an MS2 scan number).
- the instructions, when executed by the processor, further cause the processor to provide the plurality of search result elements 3010a, b, ... i, each search result element 3010 configured to recognize a user input to select one predicted glycomolecule among the identified search results 3014a associated with the same unique identifier 3013 as the consensus search result 3024 for the unique identifier 3013.
- the user input comprises a touch and a scrolling activity, and wherein the user interface is further configured to scroll through the representations of the predicted glycomolecule search results 3014a displayed in the search results element 3010 in response to the scrolling activity.
- the user interface 3002 is configured to arrange the representations of the predicted glycomolecule search result 3014a of each search results set 3012 with a respective search results element 3010 into an ordered list for each of the search results set 3012.
- the ordered lists are ordered by the unique identifier 3013.
- the unique identifier 3013 is the mass spectrometry scan number as shown in the non-limiting examples of FIGS. 30, 31A-31B, 32, and 33.
- each of the ordered lists is identically ordered.
- the user interface 3002 is further configured to scroll through the ordered lists in a coordinated manner in response to a scrolling activity on one of the ordered lists by the user as shown in FIG. 31A.
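- As a non-limiting sketch of the coordinated scrolling described above, the following Python functions build identically ordered lists keyed by unique identifier and then apply a single scroll offset to every list, so that each search results element shows the same scan numbers. The function names, the data layout, and the use of `None` for a scan that an engine did not report are assumptions made for this example; an actual user interface would bind the offset to scroll events.

```python
def build_ordered_lists(results_sets):
    """Order every engine's results by the same sorted identifier list.

    results_sets: {engine name: {identifier: representation}}
    Returns (identifiers, {engine name: [representation or None, ...]}).
    """
    identifiers = sorted({i for results in results_sets.values() for i in results})
    ordered = {
        engine: [results.get(identifier) for identifier in identifiers]
        for engine, results in results_sets.items()
    }
    return identifiers, ordered


def visible_rows(identifiers, ordered, offset, window):
    """Apply one shared scroll offset to all ordered lists."""
    span = slice(offset, offset + window)
    return identifiers[span], {engine: rows[span] for engine, rows in ordered.items()}
```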
- FIG. 31A is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of dynamic interaction with a user interface presenting glycomolecule search results displayed in search result elements and a consensus set displayed in a consensus list element.
- the system further includes a display 3001.
- the display includes a touchscreen.
- the display 3001 further includes a sensor.
- the sensor includes capacitive sensing to detect and measure proximity, pressure, position and displacement, force, humidity, fluid level, acceleration, or any combination thereof.
- the display 3001 further includes an actuator.
- the actuator provides haptic feedback.
- the actuator provides haptic feedback in the form of a vibration.
- the system further includes a human interface device.
- human interface devices may include, but are not limited to, a keyboard, a mouse, a touchpad, the touchscreen, a digital pen, a finger of the user interfacing directly with a touchscreen, and/or the like.
- the system or method recognizes a human interface device and the user interface 3002 may render a user interface element 3040, which may be a cursor or similar element of any shape, that enables the user to interact with the user interface.
- the user interface element 3040 may change its appearance on the user interface 3002 dynamically depending upon the user input that is registered by the human interface device.
- FIG. 31B is a schematic diagram showing a non-limiting example, according to some embodiments of the disclosure, of a user interface 3002 that presents a glycomolecule visualization.
- the user interface is configured to recognize an input of a user, recognize a selected glycomolecule search result 3014a when the input of the user hovers over, clicks, taps, or otherwise interacts with a glycomolecule search result 3014a shown in the search results elements 3010a, b, ... i or in the consensus list element 3020, and present a glycomolecule visualization 3016 for the corresponding glycomolecule search result 3014a as shown in FIG. 31B.
- the glycomolecule visualization 3016 is configured to depict only the glycan portion of a selected glycomolecule. In some cases, this may be because it is generally more difficult to predict the glycan portions with a high degree of confidence (e.g., using mass spectrometry data), partly because glycans have a great degree of variability in structure (e.g., due to a large number of possible structure configurations).
- the identities of conjugates (e.g., peptide, RNA, DNA, lipid) of glycans in a glycoconjugate may be relatively easy to predict. As such, users may find it of more interest to inspect the glycan portions of glycomolecules when trying to understand or validate a search result.
- the glycomolecule visualization 3016 can be configured to depict an entire glycomolecule (in the case where the glycomolecule is a glycoconjugate) or a portion thereof that includes more than just the glycan portion.
- the glycomolecule visualization 3016 may depict the entire glycopeptide.
- the input of the user includes, but is not limited to, a touch, a click, a press, a voice, or any combination thereof.
- the glycomolecule visualization 3016 includes one or more potential structures of the selected glycomolecule search results 3014a.
- a glycomolecule visualization 3016 includes viewing the intact glycopeptide (i.e., the peptide backbone and all of its attached glycans), glycolipids, glycoRNA, glycoDNA, and the like.
- the glycomolecule visualization 3016 is shown in a modal window 3003 in the user interface 3002 as shown in FIG. 31B.
- the glycomolecule visualization 3016 is shown in a popup window 3003 in the user interface 3002, as in FIG. 31B.
- the glycomolecule visualization 3016 is shown in a dedicated region 3003 of the user interface 3002 as shown in FIG. 31C.
- Presenting the glycomolecule visualizations in the user interface 3002 may be advantageous in that such visualizations may provide additional context to the user.
- these visualizations may be informative for a user engaged in manually curating the consensus list within the consensus element 3020.
- a user may gain additional insight from the glycomolecule visualization 3016 and may determine that a first glycomolecule suggested by a first search engine 3011a is unlikely to be present in the particular sample associated with the respective search result, which prompts the user to review the other glycomolecules identified by the other search engines 3011b ... i, including the glycomolecule visualizations provided for these other search engines 3011b ... i.
- different glycomolecule search engines use their respective algorithms to make predictions about what glycomolecules are in a sample based on data (e.g., mass spectrometry data) associated with the sample. These algorithms are not perfect and sometimes make erroneous predictions. In some cases, users may be able to identify these erroneous predictions when the glycomolecules (or at least the glycan portion of the glycomolecules) are visualized. In some cases, a user may be able to make a subjective judgment or conclusion about a prediction of a particular glycomolecule after inspecting a potential structure of the glycomolecule.
- the user may inspect a potential structure of a glycan portion of the glycomolecule as depicted in a visualization and determine that such a structure is unlikely to occur (or at least unlikely to be present in the sample).
- researchers using glycomolecule search platforms typically work with search results within spreadsheets or tables of text, making it difficult to make such judgments.
- researchers may attempt to manually draw a structure or else input compositional information of individual glycans in separate software to visualize particular glycans of interest.
- the user interface 3002 disclosed here presents such visualizations within the user interface as the user is making such judgments when validating a search results set, or when curating a consensus list from multiple search results as further described herein.
- the process of search results validation and the curation of a consensus list may be greatly enhanced by the disclosed user interface 3002.
- the user interface 3002 in these embodiments also facilitates understanding of a search results set (or the associated sample) by providing immediate access to the depictions of the glycomolecules for side-by-side viewing as the user inspects the search results.
- the glycomolecule may be visualized using any suitable format.
- glycans are visualized using Symbol Nomenclature for Glycans (SNFG), as illustrated in FIG. 31B.
- the user interface 3002 further includes a notifications element 3030 that presents notifications to the user.
- the notifications element 3030 may be set up to inform the user of a glycomolecule that is identified in the consensus list element 3020 as having a confidence score 3015 above a default threshold or a user-set threshold.
- the notifications element 3030 may additionally or alternatively be configured to display a subset of the search results from the consensus list that correspond to glycomolecules of particular characteristics, which may be set by the user. For example, the user may submit a user input specifying a criterion requesting a list of the glycomolecules that were determined by the search engines to be present at a high relative abundance (or within a specified range of abundances). In this example, search results (e.g., from the consensus list) that satisfy this criterion may be displayed within the notifications element 3030.
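- As a non-limiting sketch of how entries for the notifications element 3030 might be selected, the following Python functions filter the consensus list by a confidence threshold and by a user-specified criterion such as a range of relative abundances. The function names, the dictionary keys `score` and `abundance`, and the default threshold of 90 are assumptions made for this example.

```python
def above_threshold(consensus, threshold=90.0):
    """Return consensus entries whose confidence score exceeds the threshold."""
    return [entry for entry in consensus if entry["score"] > threshold]


def matching_criterion(consensus, criterion):
    """Return consensus entries that satisfy a user-specified criterion."""
    return [entry for entry in consensus if criterion(entry)]


def abundance_between(low, high):
    """Build a criterion selecting entries within a range of abundances."""
    return lambda entry: low <= entry["abundance"] <= high


# Hypothetical consensus entries.
consensus_list = [
    {"identifier": 1, "score": 85.0, "abundance": 0.40},
    {"identifier": 2, "score": 95.0, "abundance": 0.05},
    {"identifier": 4, "score": 99.0, "abundance": 0.30},
]
high_confidence = above_threshold(consensus_list)
abundant = matching_criterion(consensus_list, abundance_between(0.25, 1.0))
```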
- the notifications element 3030 further displays the criteria used by the conflict resolution rule set to populate the consensus list element 3020.
- the notifications element 3030 may indicate to the user that the glycomolecule corresponding to Scan #1 on the consensus list element 3020 was determined at least in part by the majority (or largest plurality) of Search Engines 3011.
- the notifications element 3030 may indicate to the user that the glycomolecule corresponding to Scan #4 on the consensus list element 3020 was determined at least in part by associated confidence scores (e.g., the associated confidence score of 99 of search result 3014i). Any number of different such factors may be displayed in the notifications element 3030.
- This display, by presenting the rationale behind why the system recommended a particular search result in the initial consensus set, may inform the user in deciding whether or not to agree with the search result recommendation in the initial consensus set. For example, if the user determines that a particular glycomolecule corresponding to a search result in the initial consensus set is unlikely to be present in the associated sample, and finds that the rationale presented in the notifications element 3030 is unpersuasive or unjustified, the user may modify the consensus set.
- the notifications element 3030 may include a link, a file link, a local system hyperlink, a local network hyperlink, a cloud hyperlink, or the like to various log data generated from the decision-making process by the system and the user during automated and manual curation activity respectively, which may include various metadata generated from the searching and curation activity.
- notifications element 3030 may display error messages.
- the display 3001 includes an actuator. In these embodiments, when a notification element 3030 is caused to appear on the user interface 3002, the actuator is triggered to vibrate the display 3001 or any other suitable device so as to alert the user.
- the user interface 3002 is configured to recognize an input of a user and permit the user to manually edit one or more of the consensus search results 3024 displayed in the consensus list element 3020. In some embodiments, the user interface 3002 is further configured to permit the user to drag and drop one or more of the predicted glycomolecule search results 3014a from one or more of the glycomolecule search results sets 3012 into the consensus list element 3020, and thus incorporate those predicted glycomolecule search results 3014a into the consensus set. In other words, in some embodiments the user interface 3002 is constructed and configured for user interaction.
- the user interface 3002 is configured to recognize an input of a user and permit the user to drag and drop one or more of the predicted glycomolecule search results 3014 from one or more of the glycomolecule search results sets 3012 into the consensus set.
- the search results further comprise a confidence score 3015 for each of the glycomolecules in the conflict search results 3220, and wherein the user interface 3002 is configured to recognize an input of a user and permit the user to apply a custom confidence score 3015 to any of the glycomolecules in the conflict search results 3220.
- the one or more programs further include instructions, when executed by the processor, that update a machine learning process with the custom confidence scores. The machine learning process is configured to output recommendations and predictions.
- the user interface 3002 permits the user to rate their own confidence score 3015 for each glycomolecule presented during the manual curation steps, which can inform the machine learning process and thereby further update the machine learning process to increase its ability for future recommendations/predictions.
- the machine learning process may include adaptive machine learning for real-time collection of user-provided information (which includes, but is not limited to, custom confidence scores) to provide real-time recommendations and predictions to the user.
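The feedback loop described above, in which custom confidence scores update a machine learning process, can be sketched as a small online learner. This is only an illustration under stated assumptions: the disclosure does not specify a model, so the logistic model, the `OnlineScorer` class, the feature vector, and the learning rate are all hypothetical choices.

```python
import math


class OnlineScorer:
    """Hypothetical online logistic model updated one user score at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # one weight per feature
        self.b = 0.0                 # bias term
        self.lr = lr                 # learning rate (assumed value)

    def predict(self, x):
        """Return a predicted confidence in (0, 1) for feature vector x."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, user_score):
        """Take one gradient step toward a user-supplied score in [0, 1]."""
        err = self.predict(x) - user_score  # cross-entropy gradient w.r.t. z
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


# Hypothetical features: [engine score, fraction of engines in agreement].
scorer = OnlineScorer(n_features=2)
features = [1.0, 0.5]
before = scorer.predict(features)
for _ in range(200):
    scorer.update(features, user_score=0.9)
after = scorer.predict(features)
```

Because each update is a single inexpensive step, a learner of this kind could in principle be refreshed as the user curates, which is one way to read the "real-time" language above.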
- the computer system can include a memory 3420, e.g., non-transitory memory, and a processor 3410 (e.g., a hardware processor, a CPU), as disclosed herein.
- the system can include a human interface device 3430 (e.g., a display device 3435 to provide a user interface configured to display one or more search results elements and a consensus list element 3020, as disclosed herein) and/or one or more other devices (e.g., a keyboard, mouse, touchpad, touchscreen, microphone, etc.) that allow a user to interact with the user interface, and/or to input data and/or instructions, as disclosed herein.
- the system includes a network interface 3440.
- the network interface can be configured to allow the processor to access the internet or a remote server, such as a cloud-based server 3470.
- the processor is configured to retrieve search results via the network interface 3440.
- the system includes a database 3450, e.g., a local database to store and access data (such as usage data or historical data, as disclosed herein).
- the system includes a communication bus for the processor to communicate with components of the system.
- the glycomolecules include glycans, glycoproteins, glycolipids, glycoRNAs, glycoDNAs, or any combination thereof.
- the glycomolecules comprise glycans.
- the instructions, when executed by the processor, further cause the processor to convert one or more of the glycomolecule search results to a common format as shown in FIG. 36.
- the common format is a universal glycan format 3700.
- the universal glycan format 3700 can include any suitable representation for different monosaccharides that are or may be present in a glycan.
- the universal glycan format includes a single-letter representation for different monosaccharides.
- the universal glycan format uses 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more characters to represent each of the different monosaccharides.
- the universal glycan format lists the amount of a monosaccharide that is not present in the glycan, e.g., as “0”. Including monosaccharides that are absent from the glycan is advantageous because it makes the representation uniform. That is, in the universal glycan format 3700, the representation of every glycan includes information about the same set of sugars of interest (whether present or absent). This is in contrast to conventional formats, where glycan representations omit the monosaccharides that are absent. This uniformity of the universal glycan format is readily evident in FIG. 36, which shows a number of glycan representations in conventional platform-specific formats and their counterparts in the universal glycan format. The uniformity allows for better readability, and also simplifies comparison and analysis of multiple glycomolecules.
- the universal glycan format 3700 includes the amount of fucose and/or sialic acid to represent a glycan, even if there is no fucose and/or sialic acid present in the glycan. In some embodiments, the universal glycan format includes the amount of N-acetylneuraminic acid and/or N-glycolylneuraminic acid to represent a glycan, even if there is no N-acetylneuraminic acid and/or N-glycolylneuraminic acid present in the glycan.
- the universal glycan format includes a first number l corresponding to an amount of hexose; a second number m corresponding to an amount of N-acetylhexosamine; a third number n corresponding to an amount of fucose; and a fourth number o corresponding to an amount of N-acetylneuraminic acid, wherein l, m, n, and o are each an integer ≥ 0.
- each representation of a monosaccharide in the universal glycan format is associated with the amount of the monosaccharide in a glycan.
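A zero-filled, fixed-order composition of this kind can be sketched as follows. The single-letter codes, the field order, and the `Composition` class are assumptions made for illustration. They are not the notation of FIG. 36, which is not reproduced in this text.

```python
from dataclasses import dataclass

# Hypothetical single-letter codes, in a fixed order:
# hexose, N-acetylhexosamine, fucose, N-acetylneuraminic acid.
ORDER = ("H", "N", "F", "A")


@dataclass(frozen=True)
class Composition:
    hexose: int = 0   # l
    hexnac: int = 0   # m
    fucose: int = 0   # n
    neuac: int = 0    # o

    def __post_init__(self):
        # Every amount must be a non-negative integer.
        for value in (self.hexose, self.hexnac, self.fucose, self.neuac):
            if not isinstance(value, int) or value < 0:
                raise ValueError("counts must be non-negative integers")

    def to_universal(self) -> str:
        """Emit every monosaccharide, including those with a count of 0."""
        counts = (self.hexose, self.hexnac, self.fucose, self.neuac)
        return "".join(f"{code}{n}" for code, n in zip(ORDER, counts))


example = Composition(hexose=5, hexnac=4).to_universal()
```

Since absent sugars are written out explicitly, every string produced this way has the same fields in the same order, so two glycans can be compared position by position.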
- a glycan format (e.g., the universal glycan format 3700, any one or more of the platform-specific glycan formats 3710, 3720, 3730, 3740) may provide glycan structural information.
- the PGLYCO3 (Structure) format uses the sequence of letters to represent the relative position of associated monosaccharides and further uses parentheses to denote the branching structure of the glycan.
- the glycan format includes, for a given glycan, glycan structural information.
- At least one of the platform-specific glycan formats includes glycan structural information.
- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more, or a number in a range defined by any two of the preceding values, of the platform-specific glycan formats include glycan structural information.
- none of the platform-specific glycan formats includes glycan structural information.
- the universal glycan format 3700 includes glycan structural information. In some embodiments, the universal glycan format 3700 does not include glycan structural information.
- the glycan structural information is, for a given glycan, a single glycan structural arrangement. In some embodiments, the glycan structural information is, for a given glycan, the most likely glycan structural arrangement. In some embodiments, the glycan structural information is, for a given glycan, a set of all possible glycan structural arrangements. In some embodiments, the glycan structural information is, for a given glycan, a set of all biologically consistent glycan structural arrangements (e.g., consistent with the type of structural arrangements known to be found in glycans from the same biological source).
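To illustrate how a letter-and-parenthesis notation can carry branching information, the sketch below parses a nested string into a tree and then reduces the tree to a composition. The grammar assumed here (each residue written as `(X...)`, with child branches nested inside) is hypothetical and is not claimed to be the exact PGLYCO3 (Structure) syntax.

```python
from collections import Counter


def parse_structure(text):
    """Parse '(X(child)(child))' into nested (letter, [children]) tuples."""
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(text) or text[pos] != "(":
            raise ValueError(f"expected '(' at position {pos}")
        pos += 1
        if pos >= len(text) or not text[pos].isalpha():
            raise ValueError(f"expected a residue letter at position {pos}")
        letter = text[pos]
        pos += 1
        children = []
        # Each nested '(' opens one branch attached to this residue.
        while pos < len(text) and text[pos] == "(":
            children.append(node())
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1
        return (letter, children)

    tree = node()
    if pos != len(text):
        raise ValueError("trailing characters after structure")
    return tree


def composition(tree):
    """Count residues in a parsed tree, discarding the branching."""
    letter, children = tree
    total = Counter({letter: 1})
    for child in children:
        total += composition(child)
    return total


tree = parse_structure("(N(N(H(H)(H))))")
counts = composition(tree)
```

Going from structure to composition loses information: many different trees share one composition, which is why a format may carry a single arrangement, the most likely arrangement, or a set of arrangements.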
- the glycomolecule search results from each of the glycomolecule search engines are in a platform-specific format.
- the platform-specific format is a platform-specific glycan format.
- the platform-specific glycan format (e.g., 3710, 3720, 3730, 3740) can be suitable for representing glycan data as an input to a computer-implemented glycomolecule analysis routine, such as, but not limited to, a glycomolecule search engine.
- the platform-specific glycan format (e.g., 3710, 3720, 3730, 3740) is a glycan format used by or compatible with a glycomolecule search engine 3010.
- a glycan database can be an input to a glycomolecule search engine 3010 when glycan data of the glycan database are represented in the glycan format used by or compatible with the glycomolecule search engine.
- a glycan database cannot be an input to a glycomolecule search engine when glycan data of the glycan database are represented in a glycan format that is not used by or is incompatible with the glycomolecule search engine.
- the first platform-specific glycan format 3710 is compatible with (e.g., a glycan database in the first platform-specific glycan format can be an input 3760 to) a glycomolecule search engine 3750.
- the second platform-specific glycan format 3720 is not compatible with (e.g., a glycan database in the second platform-specific glycan format cannot be an input 3770 to) the glycomolecule search engine 3750 with which the first platform-specific glycan format is compatible. In some embodiments, only the first platform-specific glycan format 3710 is compatible with the glycomolecule search engine 3750. In some embodiments, the second platform-specific glycan format 3720 is compatible with (e.g., a glycan database in the second platform-specific glycan format can be an input to) a glycomolecule search engine with which the first platform-specific glycan format is not compatible.
- the platform-specific glycan format is a glycan format used by or compatible with a particular glycomolecule search engine or a particular subset of glycomolecule search engines that is available for public and/or commercial use.
- the glycomolecule search engine is a glycopeptide search engine.
- the platform-specific glycan format is a glycan format used by or compatible with a glycopeptide search engine, such as, but not limited to, BYONIC, PGLYCO3, MSFRAGGER-GLYCO, METAMORPHEUS (also known as O-PAIR), and STRUCGP.
- the universal glycan format 3700 is different from at least one of the platform-specific glycan formats. In some embodiments, the universal glycan format 3700 is different from any of the platform-specific glycan formats. In some embodiments, the universal glycan format 3700 is the same as one of the platform-specific glycan formats.
- a method for curating a consensus list of glycomolecules identified in a biological sample on a user interface includes, at a computer system shown in FIG. 34 including one or more processors 3410, and memory 3420 storing one or more programs configured to be executed by the one or more processors 3410, the one or more programs including instructions for retrieving a plurality of glycomolecule search results sets 3012 (with reference to FIG. 30) for a biological sample from a plurality of glycomolecule search engines 3011a, b, ... i, wherein each glycomolecule search results set 3012 comprises glycomolecule search results 3014 identifying one or more predicted glycomolecule search results 3014a as being present, wherein each predicted glycomolecule search result 3014a is associated with a unique identifier 3013.
- the one or more programs further include instructions for determining a consensus set 3022, wherein the consensus set 3022 includes, for each unique identifier 3013, a consensus search result 3024 identifying one of the predicted glycomolecules 3024a associated with the unique identifier 3013 in the glycomolecule search results sets 3012.
- the one or more programs further include instructions for providing a user interface 3002 configured to display one or more search results elements 3010a, b, ... i configured to display one or more representations of the one or more predicted glycomolecule search results 3014a from each of the plurality of glycomolecule search results sets 3012.
- the user interface 3002 is further configured to display a consensus list element 3020 configured to display the consensus set 3022.
- the user interface is further configured to display the unique identifier 3013 associated with the one or more representations.
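One way to picture the determination of a consensus set is a per-identifier vote across search engines. The sketch below is an assumption-laden illustration: this passage does not say how the consensus search result is chosen, so the majority-vote rule, the `consensus_set` function, the engine names, and the identifiers are all hypothetical.

```python
from collections import Counter


def consensus_set(results_sets):
    """Pick one predicted glycomolecule per unique identifier.

    results_sets maps engine name -> {unique_id: predicted glycomolecule}.
    Returns {unique_id: (consensus or None, vote counts)}; None marks a
    tie that is left for manual curation.
    """
    votes = {}
    for results in results_sets.values():
        for uid, glycan in results.items():
            votes.setdefault(uid, Counter())[glycan] += 1

    consensus = {}
    for uid, counter in votes.items():
        ranked = counter.most_common(2)
        tie = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        consensus[uid] = (None if tie else ranked[0][0], counter)
    return consensus


# Hypothetical search results sets keyed by hypothetical identifiers.
sets = {
    "engine_a": {"scan1": "H5N4F0A0", "scan2": "H5N2F0A0"},
    "engine_b": {"scan1": "H5N4F0A0", "scan2": "H6N2F0A0"},
    "engine_c": {"scan1": "H5N4F1A0"},
}
result = consensus_set(sets)
```

Keeping the vote counts next to each choice gives a user interface something to display as the rationale for a recommendation, and an unresolved tie maps naturally onto an entry the user resolves by hand.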
- the one or more programs further include instructions for any of the embodiments described above.
- a computer-readable medium comprising instructions thereon, which when executed by a processor cause the processor to perform the method of any of the embodiments described above.
- the computer-readable medium can be any suitable medium that can store instructions for performing a method of the present disclosure. Suitable media include, without limitation, a hard drive, an SSD hard drive, a CD-ROM drive, a DVD-ROM drive, a floppy disk, a tape, or a flash memory stick or card.
- a glycan data management or glycomolecule analysis system of the present disclosure can include any suitable non-transitory memory for holding the instructions, which when executed, perform any of the methods disclosed herein.
- the non-transitory memory is structured to include one or more of a read-only memory (ROM), flash memory, dynamic random access memory (DRAM) (such as synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), or Rambus DRAM (RDRAM), and so forth), static memory (e.g., flash memory, static random access memory (SRAM), and so forth), and a data storage device (e.g., hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read).
- the glycan data management or glycomolecule analysis system can include any suitable processor configured to access the input glycan data and execute the instructions.
- the processor is structured to include one or more general-purpose processing devices such as a microprocessor, central processing unit (CPU), and the like.
- the processor includes a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- the glycan data management system can include a non-transitory memory 720 and a processor 710 (e.g., a hardware processor, a CPU), as disclosed herein.
- the glycan data management system includes a user interface 730, e.g., to allow a user to input data and/or instructions (e.g., a keyboard, mouse, touchpad, touchscreen, microphone, etc.), and/or to allow a user to receive output (e.g., visual output, such as a screen or monitor; audio output, such as a speaker; etc.) from the system.
- the non-transitory memory 720 can include instructions for implementing the methods described herein with the processor 710.
- the glycan data management system includes a network interface 740.
- the network interface can be configured to allow the processor to access the internet or a remote server, such as a cloud-based server 770.
- the processor is configured to access the input glycan data via the network interface 740.
- the glycan data management system includes a database 750, e.g., a local database to store and access data.
- the glycan data management system includes a communication bus 715 for the processor to communicate with components of the system.
- the input glycan data is from a glycan database stored in a data storage device (e.g., hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read).
- the input glycan data can be stored at any suitable location.
- the input glycan data and/or the glycan database is stored on a server.
- the server is a local server 820, e.g., on the same local area network (LAN) as the glycan data management system 810.
- one or more servers are located remotely over the internet 830, across one or more server locations, e.g., 840, 850, 860, each having one or more remote servers.
- the input glycan data and/or the glycan database is stored at a first remote server location 840.
- the first remote server location also hosts a first glycomolecule search engine, e.g., a first glycoprotein search engine, where the glycan format of the input glycan data and/or the glycan database is compatible with the first glycomolecule search engine.
- a processor of the glycan data management system 810 is configured to access the input data from the first remote server location 840, and to execute instructions to convert the input glycan data from the first platform-specific glycan format, compatible with the first glycomolecule search engine at the first remote server location, via the universal glycan format, to a second platform-specific glycan format.
- the second platform-specific glycan format is compatible with a second glycomolecule search engine at the second remote server location 850.
- a processor of the glycan data management system 810 is configured to access the input data from the first remote server location 840, and to execute instructions to convert the input glycan data from the first platform-specific glycan format, compatible with the first glycomolecule search engine at the first remote server location, via the universal glycan format, to a third platform-specific glycan format.
- the third platform-specific glycan format is compatible with a third glycomolecule search engine at the third remote server location 860.
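Converting between platform-specific formats "via the universal glycan format" amounts to a hub-and-spoke design: each format needs only one reader into the shared representation and one writer out of it. The sketch below assumes two made-up formats, `format_a` and `format_b`. They are illustrative stand-ins and are not claimed to match the syntax of any named search engine.

```python
import re

SUGARS = ("Hex", "HexNAc", "Fuc", "NeuAc")
LETTERS = {"Hex": "H", "HexNAc": "N", "Fuc": "F", "NeuAc": "A"}
NAMES = {letter: name for name, letter in LETTERS.items()}


def parse_format_a(text):
    """Hypothetical format A, e.g. 'HexNAc(4)Hex(5)Fuc(1)'."""
    counts = dict.fromkeys(SUGARS, 0)  # zero-filled universal form
    for name, n in re.findall(r"([A-Za-z]+)\((\d+)\)", text):
        if name not in counts:
            raise ValueError(f"unknown monosaccharide: {name}")
        counts[name] += int(n)
    return counts


def parse_format_b(text):
    """Hypothetical format B, e.g. 'H5N4F1' (absent sugars omitted)."""
    counts = dict.fromkeys(SUGARS, 0)
    for letter, n in re.findall(r"([HNFA])(\d+)", text):
        counts[NAMES[letter]] += int(n)
    return counts


def write_format_a(counts):
    return "".join(f"{s}({counts[s]})" for s in SUGARS if counts[s] > 0)


def write_format_b(counts):
    return "".join(f"{LETTERS[s]}{counts[s]}" for s in SUGARS if counts[s] > 0)


PARSERS = {"format_a": parse_format_a, "format_b": parse_format_b}
WRITERS = {"format_a": write_format_a, "format_b": write_format_b}


def convert(text, src, dst):
    """Convert src -> universal (zero-filled dict) -> dst."""
    return WRITERS[dst](PARSERS[src](text))


converted = convert("HexNAc(4)Hex(5)Fuc(1)", "format_a", "format_b")
```

With N formats this needs 2N small functions rather than N×(N−1) pairwise converters, so supporting a third search engine means adding one parser and one writer.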
- any method of the present disclosure can be performed on any suitable system as disclosed herein.
- a glycan data management system for handling data associated with glycans, where the system includes a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any one of the methods, e.g., computer-implemented methods, disclosed herein.
- an electronic system for comparing glycomolecule search outputs from two or more glycomolecule search engines, where the system includes a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any one of the methods, e.g., computer-implemented methods, disclosed herein.
- the glycomolecule data analysis system can include a non-transitory memory 1420 and a processor 1410 (e.g., a hardware processor, a CPU), as disclosed herein.
- the glycomolecule data analysis system includes a user interface 1430, e.g., to allow a user to input data and/or instructions (e.g., a keyboard, mouse, touchpad, touchscreen, microphone, etc.), and/or to allow a user to receive output (e.g., visual output, such as a screen or monitor; audio output, such as a speaker; etc.) from the system.
- the glycomolecule data analysis system includes a network interface 1440.
- the network interface can be configured to allow the processor to access the internet or a remote server, such as a cloud-based server 1470.
- the processor is configured to access the search output via the network interface 1440.
- the glycomolecule data analysis system includes a database 1450, e.g., a local database to store and access data.
- the glycomolecule data analysis system includes a communication bus 1415 for the processor to communicate with components of the system.
- the search output is stored in a data storage device (e.g., hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read).
- the search output can be stored at any suitable location.
- the search output is stored on a server.
- the server is a local server 1520, e.g., on the same local area network (LAN) as the glycomolecule data analysis system 1510.
- one or more servers are located remotely over the internet 1530, across one or more server locations, e.g., 1540, 1550, 1560, each having one or more remote servers.
- a first search output is stored at a first remote server location 1540.
- the first remote server location also hosts a first search output, e.g., a search output from a first glycomolecule search engine, where the search output is in a format specific to the first glycomolecule search engine.
- a processor of the glycomolecule data analysis system 1510 is configured to access the first search output from the first remote server location 1540, and to execute instructions to convert the first search output from the first search engine-specific format to a common format.
- a second search output is stored at the second remote server location 1550.
- a processor of the glycomolecule data analysis system 1510 is configured to access the second search output from the second remote server location 1550, and to execute instructions to convert the second search output from the second search engine-specific format to the common format.
- a third search output is stored at the third remote server location 1560.
- a processor of the glycomolecule data analysis system 1510 is configured to access the third search output from the third remote server location 1560, and to execute instructions to convert the third search output from the third search engine-specific format to the common format.
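Bringing several engine-specific search outputs into one common format can be sketched as a per-engine field mapping. Everything named below is hypothetical: the engine labels, the native column names, and the common field names are placeholders and do not describe the real output of any search engine.

```python
# Hypothetical mapping from each engine's native field names
# to the field names of the common format.
COLUMN_MAPS = {
    "engine_a": {"Scan": "scan_id", "Glycan": "glycan", "Score": "score"},
    "engine_b": {"spectrum": "scan_id", "composition": "glycan", "q": "score"},
}


def to_common(engine, rows):
    """Rename engine-specific fields and tag each record with its source."""
    mapping = COLUMN_MAPS[engine]
    records = []
    for row in rows:
        record = {common: row[native] for native, common in mapping.items()}
        record["engine"] = engine  # keep provenance for later comparison
        records.append(record)
    return records


first = to_common("engine_a", [{"Scan": 17, "Glycan": "H5N4F0A0", "Score": 0.93}])
second = to_common("engine_b", [{"spectrum": 17, "composition": "H5N4F1A0", "q": 0.71}])
merged = first + second
```

Once every record has the same fields, outputs retrieved from different server locations can be concatenated and compared directly, for example by grouping on `scan_id`.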
- An electronic system of the present disclosure can include any suitable non-transitory memory, e.g., configured as disclosed herein, or for holding instructions, which when executed, perform any of the methods disclosed herein.
- the non-transitory memory is structured to include one or more of a read-only memory (ROM), flash memory, dynamic random access memory (DRAM) (such as synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), or Rambus DRAM (RDRAM), and so forth), static memory (e.g., flash memory, static random access memory (SRAM), and so forth), and a data storage device (e.g., hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read).
- the electronic system can include any suitable processor configured to receive the search output, and analyze the search output, as provided herein.
- the processor is structured to include one or more general-purpose processing devices such as a microprocessor, central processing unit (CPU), and the like.
- the processor includes a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- the electronic system can include a non-transitory memory 2620 and a processor 2610 (e.g., a hardware processor, a CPU), as disclosed herein.
- the electronic system includes a user interface 2630, e.g., to allow a user to input data and/or instructions (e.g., a keyboard, mouse, touchpad, touchscreen, microphone, etc.), and/or to allow a user to receive output (e.g., visual output, such as a screen or monitor; audio output, such as a speaker; etc.) from the system.
- the electronic system includes a network interface 2640.
- the network interface can be configured to allow the processor to access the internet or a remote server, such as a cloud-based server 2670.
- the processor is configured to access the search output via the network interface 2640.
- the electronic system includes a database 2650, e.g., a local database to store and access data.
- the electronic system includes a communication bus 2615 for the processor to communicate with components of the system.
- the search output is stored in a data storage device (e.g., hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read).
- the search output can be stored at any suitable location.
- the search output is stored on a server.
- the server is a local server 2720, e.g., on the same local area network (LAN) as the electronic system 2710.
- one or more servers are located remotely over the internet 2730, across one or more server locations, e.g., 2740, 2750, 2760, each having one or more remote servers.
- a first search output is stored at a first remote server location 2740.
- the first remote server location also hosts a first search output, e.g., a search output from a first glycomolecule search engine.
- a processor of the electronic system 2710 is configured to access the first search output from the first remote server location 2740, and to execute instructions to analyze the search output as provided herein (e.g., determine that the first search output identifies a first set of glycomolecules as being of a first type and a second set of glycomolecules as being of a second type).
- a second search output is stored at the second remote server location 2750.
- a processor of the electronic system 2710 is configured to access the second search output from the second remote server location 2750, and to analyze the search output as provided herein (e.g., determine that the second search output identifies a third set of glycomolecules as being of the first type and a fourth set of glycomolecules as being of the second type).
- a third search output is stored at the third remote server location 2760.
- a processor of the electronic system 2710 is configured to access the third search output from the third remote server location 2760, and to execute instructions to analyze the search output as provided herein.
- any method of the present disclosure can be performed on any suitable system as disclosed herein.
- an electronic system that includes: a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any method of the present disclosure (e.g., a method of identifying glycomolecules in a biological sample, as disclosed herein; a method of determining a glycomolecule profile of a biological sample, as disclosed herein).
- any method of the present disclosure can be performed on any suitable system as disclosed herein.
- the system includes a processor; and non-transitory memory comprising instructions, which when executed by the processor cause the processor to perform any one of the methods, e.g., computer-implemented methods, disclosed herein.
- a system is a medical diagnostic or life sciences research method to identify the glycomolecule in a sample.
- the method includes 5, 10, 15, 20, 25, 30, 40, 50 or more steps, or a number of steps in a range defined by any two of the preceding numbers.
- at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50 or more steps, or a number of steps in a range defined by any two of the preceding numbers, of the method are designed to be performed in sequence.
- at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50 or more steps, or a number of steps in a range defined by any two of the preceding numbers, of the method are designed to be performed in parallel.
- a system of the present disclosure is configured to provide notifications to the user of particular tasks that are time consuming or tasks that may create bottlenecks. In some embodiments, a system of the present disclosure is configured to suggest alternative tasks, or suggest abandoning a particular step. In some embodiments, a method of the present disclosure includes providing notifications to the user of particular tasks that are time consuming or tasks that may create bottlenecks. In some embodiments, a method of the present disclosure includes suggesting alternative tasks, or abandoning a particular step.
- machine learning may be used for the predictions/notifications.
- any of the systems of the present disclosure may be implemented in a cloud computing environment or cloud-based network.
- User interaction with the databases may be mediated via a central hub that stores and controls access to various interactions with the data.
- the cloud computing environment may also provide sharing of protocols, analysis methods, libraries, as well as distributed processing for performing the analysis, and generating output reports.
- the system may be implemented in a computer browser, on-demand or online.
- instructions or software written to perform the systems or methods as described herein is stored in some form of computer readable medium, such as memory (e.g., non-transitory memory), CD-ROM, DVD-ROM, memory stick, flash drive, hard drive, SSD hard drive, server, mainframe storage system and the like.
- systems and methods may be written in any of various suitable programming languages, for example compiled languages such as C, C#, C++, Fortran, and Java.
- Other programming languages could be script languages, such as Perl, MatLab, SAS, SPSS, Python, Ruby, Pascal, Delphi, R and PHP.
- the system or method is written in C, C#, C++, Fortran, Java, Perl, R, or Python.
- the system may include an independent application with data input and data display modules.
- the system may include a computer software product and may include classes wherein distributed objects comprise applications including computational methods as described herein.
- An assay instrument, desktop computer, laptop computer, or server which may contain a processor in operational communication with accessible memory may comprise the instructions for implementation of the systems and/or methods of the present disclosure.
- a desktop computer or a laptop computer is in operational communication with one or more computer readable storage media or devices and/or outputting devices.
- An assay instrument, desktop computer and a laptop computer may operate under a number of different computer-based operational languages, such as those utilized by Apple® based computer systems or PC-based computer systems.
- An assay instrument, desktop and/or laptop computers and/or server system may further provide a computer interface for creating or modifying experimental definitions and/or conditions, viewing data results and monitoring experimental progress.
- an outputting device may be a graphic user interface such as a computer monitor or a computer screen, a printer, a hand-held device such as a personal digital assistant (i.e., PDA, smartphone), a tablet computer, a hard drive, a server, a memory stick, a flash drive and the like.
- a computer readable storage device or medium may be any device such as a server, a mainframe, a supercomputer, a magnetic tape system and the like.
- a storage device may be located onsite in a location proximate to the assay instrument, for example adjacent to or in close proximity to, an assay instrument.
- a storage device may be located in the same room, in the same building, in an adjacent building, on the same floor in a building, on different floors in a building, etc., in relation to the assay instrument.
- a storage device may be located off-site, or distal, to the assay instrument.
- a storage device may be located in a different part of a city, in a different city, in a different state, in a different country, etc., relative to the assay instrument.
- communication between the assay instrument and one or more of a desktop, laptop, or server is typically via Internet connection, either wireless or by a network cable through an access point.
- a storage device may be maintained and managed by the individual or entity directly associated with an assay instrument, whereas in other embodiments a storage device may be maintained and managed by a third party, typically at a distal location to the individual or entity associated with an assay instrument.
- an outputting device may be any device for visualizing data.
- An assay instrument, desktop, laptop and/or server system may be used itself to store and/or retrieve computer implemented software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like.
- One or more of an assay instrument, desktop, laptop and/or server may comprise one or more computer readable storage media for storing and/or retrieving software programs incorporating computer code for performing and implementing computational methods as described herein, data for use in the implementation of the computational methods, and the like.
- Computer readable storage media may include, but are not limited to, one or more of a hard drive, a solid-state drive (SSD), a CD-ROM drive, a DVD-ROM drive, a floppy disk, a tape, a flash memory stick or card, and the like.
- a network including the Internet may be the computer readable storage media.
- computer readable storage media refers to computational resource storage accessible by a computer network via the Internet or a company network offered by a service provider rather than, for example, from a local desktop or laptop computer at a distal location to the assay instrument.
- computer readable storage media for storing and/or retrieving computer implemented software programs incorporating computer code for performing and implementing methods, e.g., computational methods, as described herein, data for use in the implementation of the methods, e.g., computational methods, and the like is operated and maintained by a service provider in operational communication with an assay instrument, desktop, laptop and/or server system via an Internet connection or network connection.
- a hardware platform for providing a computational environment comprises a processor (e.g., a CPU), wherein processor time and memory layout, such as random access memory (RAM), are systems considerations.
- smaller computer systems offer inexpensive, fast processors and large memory and storage capabilities.
- graphics processing units (GPUs)
- hardware platforms for performing computational methods as described herein comprise one or more computer systems with one or more processors.
- smaller computers are clustered together to yield a supercomputer network.
- methods e.g., computational methods, as described herein are carried out on a collection of inter- or intra-connected computer systems (i.e., grid technology) which may run a variety of operating systems in a coordinated manner.
- the CONDOR framework (University of Wisconsin-Madison)
- systems available through United Devices are exemplary of the coordination of multiple stand-alone computer systems for the purpose of dealing with large amounts of data.
- These systems may offer Perl interfaces to submit, monitor and manage large sequence analysis jobs on a cluster in serial or parallel configurations.
- Glycan Database Converter to obtain PGLYCO3 glycan database input files for the BYONIC search engine’s default N-glycan human plasma database.
- One human serum sample was enriched for glycopeptides using the AssayMAP Bravo Platform and analyzed on an Orbitrap™ Exploris 480 mass spectrometer coupled with an Ultimate 3000 nanoLC system.
- Glycopeptide identification was achieved using searches on PGLYCO3 and BYONIC search engines with the same searching parameters.
- Our Glycomolecule Search Converter was later applied after search engine identifications to extract specific parameters including: MS2 Scan number, protein ID, peptide sequence, glycoform annotations, retention time, and confidence score output for each search engine.
- PGLYCO3 identified 4498 glycopeptide-containing spectra from 120 glycoproteins, while BYONIC identified 8615 glycopeptide-containing spectra from 125 glycoproteins.
- the confidence scores reported by PGLYCO3 ranged from 5.03 to 164.36, while the confidence scores reported by BYONIC ranged from 0.04 to 1610.4.
- the magnitudes of the confidence score ranges differed drastically between PGLYCO3 and BYONIC.
- because the Glycomolecule Search Converter can readily retrieve MS2 scan numbers and confidence scores, conversions between confidence scores for different search engines can be analyzed based on the same MS2 scan.
- glycopeptide converter enabled a direct comparison of the outputted results from PGLYCO3 and BYONIC while accounting for differences in the respective confidence scores between PGLYCO3 and BYONIC.
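
The scan-based pairing described above can be sketched as follows. This is a minimal illustration, not the converter itself: `pair_scores_by_scan` and `min_max_scale` are hypothetical helpers, the scan numbers and scores are made-up values, and min-max scaling is only one assumed way to put the two score ranges on a comparable footing (the source does not specify how scores are converted).

```python
from typing import Dict, List, Tuple


def pair_scores_by_scan(
    engine_a: Dict[int, float], engine_b: Dict[int, float]
) -> List[Tuple[int, float, float]]:
    """Return (scan, score_a, score_b) for every MS2 scan reported by both engines."""
    shared = sorted(engine_a.keys() & engine_b.keys())
    return [(scan, engine_a[scan], engine_b[scan]) for scan in shared]


def min_max_scale(score: float, low: float, high: float) -> float:
    """Map a score onto [0, 1] given the score range reported for its engine.

    Assumption: linear scaling; the source does not state a conversion method.
    """
    return (score - low) / (high - low)


# Illustrative values only (not taken from the experiment).
pglyco_scores = {1001: 42.0, 1002: 120.5, 1007: 15.3}
byonic_scores = {1001: 310.0, 1007: 88.2, 1010: 5.1}

pairs = pair_scores_by_scan(pglyco_scores, byonic_scores)

# Score ranges quoted in the text: PGLYCO3 5.03-164.36, BYONIC 0.04-1610.4.
scaled = [
    (scan, min_max_scale(a, 5.03, 164.36), min_max_scale(b, 0.04, 1610.4))
    for scan, a, b in pairs
]
```

Only scans present in both outputs are paired, so spectra identified by a single engine drop out of the score comparison.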
- This non-limiting example shows conversion of glycan representations between platform-specific glycan formats via a universal glycomolecular data format.
- glycan data with glycan representations in a universal glycan format (“Universal”) is provided, where the sequential order of glycans was listed as the single letter representation of N, H, F, A, and G (from left to right) where each single letter representation had an associated integer value of 0 or greater corresponding to the amount of that glycan.
- the single letter representations N, H, F, A, and G correspond to HexNAc, Hex, fucose, NeuAc, and Neu5Gc, respectively.
- a BYONIC/Universal conversion rule set configured to convert each of the glycan representations in the Universal glycan format to and from a BYONIC glycan format (“BYONIC”) is provided.
- the BYONIC glycan format is compatible with the BYONIC glycopeptide search engine.
- the glycans HexNAc, Hex, NeuAc in the BYONIC format were converted to the single letter representation of N, H, and A, respectively, for the universal glycan format.
- a METAMORPHEUS/Universal conversion rule set configured to convert each of the glycan representations in the Universal glycan format to and from a METAMORPHEUS glycan format (“METAMORPHEUS”) is provided.
- the METAMORPHEUS glycan format is compatible with the METAMORPHEUS glycopeptide search engine.
- a PGLYCO (Composition)/Universal conversion rule set configured to convert each of the glycan representations in the Universal glycan format to and from a PGLYCO (Composition) glycan format that does not include glycan structural information (“PGLYCO (Composition)”) is provided.
- the PGLYCO (Composition) glycan format is compatible with the PGLYCO glycopeptide search engine.
- a PGLYCO (Structure)/Universal conversion rule set configured to convert each of the glycan representations in the Universal glycan format to and from a PGLYCO (Structure) glycan format that includes glycan structural information (“PGLYCO (Structure)”) is provided.
- the PGLYCO (Structure) glycan format is compatible with the PGLYCO glycopeptide search engine.
- a database listing each glycan representation of the database in the corresponding PGLYCO (Structure), PGLYCO (Composition), Universal, BYONIC, and METAMORPHEUS glycan formats can be constructed.
- the BYONIC glycan format can be converted to the METAMORPHEUS glycan format by first converting the BYONIC glycan format to the Universal glycan format using the BYONIC/Universal conversion rule set, and then to the METAMORPHEUS glycan format using the METAMORPHEUS/Universal conversion rule set.
- a glycan database in the BYONIC glycan format can be converted to the METAMORPHEUS glycan format for use as an input to run the METAMORPHEUS glycopeptide search engine.
- a glycan database in any one of the platform-specific glycan formats can be converted to any one of the other platform-specific glycan formats via the Universal glycan format for use as an input to run the respective glycopeptide search engine compatible with the platform-specific glycan format.
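
The hub-style conversion in this example can be sketched as below, under stated assumptions. The Universal strings follow the `N(3)H(3)F(1)A(0)` pattern used in this document, extended with `G`; the BYONIC-style residue names (`HexNAc`, `Hex`, `Fuc`, `NeuAc`, `NeuGc`) and the `compact` format (`N2H5A1`) are illustrative stand-ins, and the latter is explicitly not the real METAMORPHEUS format. All function names are hypothetical.

```python
import re
from typing import Callable, Dict

UNIVERSAL_ORDER = ["N", "H", "F", "A", "G"]  # order stated in the example

# Assumed BYONIC-style residue names mapped to Universal letters.
BYONIC_TO_LETTER = {"HexNAc": "N", "Hex": "H", "Fuc": "F", "NeuAc": "A", "NeuGc": "G"}
LETTER_TO_BYONIC = {letter: name for name, letter in BYONIC_TO_LETTER.items()}


def parse_universal(text: str) -> Dict[str, int]:
    """Read 'N(2)H(5)...' into a letter -> count mapping (missing letters are 0)."""
    counts = {letter: 0 for letter in UNIVERSAL_ORDER}
    for letter, n in re.findall(r"([NHFAG])\((\d+)\)", text):
        counts[letter] = int(n)
    return counts


def format_universal(counts: Dict[str, int]) -> str:
    return "".join(f"{letter}({counts.get(letter, 0)})" for letter in UNIVERSAL_ORDER)


def byonic_to_universal(text: str) -> str:
    counts = {letter: 0 for letter in UNIVERSAL_ORDER}
    for name, n in re.findall(r"([A-Za-z]+)\((\d+)\)", text):
        counts[BYONIC_TO_LETTER[name]] += int(n)
    return format_universal(counts)


def universal_to_byonic(text: str) -> str:
    counts = parse_universal(text)
    return "".join(
        f"{LETTER_TO_BYONIC[l]}({counts[l]})" for l in UNIVERSAL_ORDER if counts[l] > 0
    )


def compact_to_universal(text: str) -> str:
    """Hypothetical 'N2H5A1' format, used only to show a second spoke."""
    counts = {letter: 0 for letter in UNIVERSAL_ORDER}
    for letter, n in re.findall(r"([NHFAG])(\d+)", text):
        counts[letter] = int(n)
    return format_universal(counts)


def universal_to_compact(text: str) -> str:
    counts = parse_universal(text)
    return "".join(f"{l}{counts[l]}" for l in UNIVERSAL_ORDER if counts[l] > 0)


# One rule set (a to/from pair) per platform format; Universal is the hub.
TO_UNIVERSAL: Dict[str, Callable[[str], str]] = {
    "universal": lambda s: s, "byonic": byonic_to_universal, "compact": compact_to_universal,
}
FROM_UNIVERSAL: Dict[str, Callable[[str], str]] = {
    "universal": lambda s: s, "byonic": universal_to_byonic, "compact": universal_to_compact,
}


def convert(text: str, source: str, target: str) -> str:
    """Convert between any two registered formats by passing through Universal."""
    return FROM_UNIVERSAL[target](TO_UNIVERSAL[source](text))
```

With this layout, supporting n platform-specific formats takes n rule sets, whereas direct pairwise conversion would take n(n-1)/2 bidirectional rule sets.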
- This non-limiting example shows conversion of a glycan representation from a format that does not include structural information, to a format that includes structural information.
- a glycan database for searching a sample from human serum is obtained in a BYONIC-specific format.
- One of the glycans in the database is represented by its composition in the universal glycan format as: N(3)H(3)F(1)A(0).
- the universal glycan format does not include any structural information.
- a potential glycan structure for a glycan having the composition N(3)H(3)F(1)A(0) in a human serum sample is generated: (N(F)(N(H(H)(H(N))))).
- because the glycan database was for a human serum sample, glycan structures that would only be found in other sample types, e.g., a human stool sample, were not included in the list of potential glycan structures.
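
As a small consistency check on this example, the composition implied by a parenthesized structure string can be recovered by counting its monosaccharide letters. The sketch below assumes a hypothetical pre-built list of sample-appropriate structures; it filters candidates by composition and does not attempt to enumerate structures itself. The second structure in the list is an invented entry for illustration.

```python
from collections import Counter
from typing import Dict, List

LETTERS = "NHFAG"


def composition_of_structure(structure: str) -> Dict[str, int]:
    """Count each monosaccharide letter in a structure such as '(N(F)(N(H)))'."""
    counts = Counter(ch for ch in structure if ch in LETTERS)
    return {letter: counts.get(letter, 0) for letter in LETTERS}


def candidates_for(composition: Dict[str, int], known_structures: List[str]) -> List[str]:
    """Keep only the structures whose letter counts equal the target composition."""
    return [s for s in known_structures if composition_of_structure(s) == composition]


# Target composition N(3)H(3)F(1)A(0) from the example (G taken as 0).
target = {"N": 3, "H": 3, "F": 1, "A": 0, "G": 0}

# Hypothetical serum-appropriate structure list; the first entry is from the text.
serum_structures = ["(N(F)(N(H(H)(H(N)))))", "(N(N(H(H)(H))))"]

matches = candidates_for(target, serum_structures)
```

Restricting `serum_structures` to structures expected in the sample type mirrors the exclusion of structures found only in other sample types.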
- Glycan Database Converter on PGLYCO3 glycan database input files for 154 N-Linked glycans from the BYONIC search engine’s N-glycan human plasma database.
- Glycopeptide identification was achieved using searches on PGLYCO3, BYONIC, and METAMORPHEUS search engines with the same search parameters.
- Our Glycopeptide Data Converter was applied after search engine identifications to extract specific parameters including: MS2 Scan number, peptide sequence, glycoform annotations, retention time, and confidence score output for each search engine.
- PGLYCO3 identified 3,038-6,801 glycopeptide-containing spectra (GCS)
- BYONIC identified 4,660-9,737 GCS
- METAMORPHEUS identified 3,140-6,818 GCS.
- the Glycopeptide Data Converter can retrieve MS2 scan numbers and confidence scores
- conversions between confidence scores for the different search engines can be analyzed based on the same MS2 scan.
- 30.1% (3,266 of 10,868 PSMs) of glycopeptide containing spectra reported a shared space at FAIMS CV 45V by the PGLYCO3, BYONIC, and METAMORPHEUS search engines.
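
The shared-identification figure can be illustrated with set operations on MS2 scan numbers. This sketch assumes the denominator is the union of scans reported by all engines, which the source does not state explicitly; `shared_fraction` is a hypothetical helper and the scan sets are toy data.

```python
from typing import Set


def shared_fraction(*scan_sets: Set[int]) -> float:
    """Fraction of all reported scans (union) that every engine reported (intersection)."""
    union = set().union(*scan_sets)
    shared = set.intersection(*scan_sets)
    return len(shared) / len(union)


# Toy scan-number sets, not experimental data.
pglyco = {1, 2, 3, 4}
byonic = {2, 3, 4, 5, 6}
metamorpheus = {3, 4, 7}
toy_fraction = shared_fraction(pglyco, byonic, metamorpheus)  # 2 shared of 7 total

# Arithmetic check of the figure quoted in the text: 3,266 of 10,868.
reported_percent = round(100 * 3266 / 10868, 1)
```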
- This non-limiting example shows conversion of search outputs from three glycopeptide search engines to a universal glycomolecular data format.
- search outputs from PGLYCO3, BYONIC, and METAMORPHEUS for the same experiment were obtained.
- the search results from PGLYCO3, BYONIC, and METAMORPHEUS are in different formats, and cannot be easily compared.
- each search output was converted to a universal glycomolecular data format.
- the PGLYCO3 search results were converted to the universal glycomolecular data format (Fig. 17).
- the universal glycomolecular data format included the identifier (“MS2#”), protein ID, peptide sequence, glycan data (“Glycan”), m/z value (“Precursor MZ”), charge number (“Z”), retention time (“RT”), confidence score (“Total score”), and the search engine. Parameters in the PGLYCO3 format that are not in the universal glycomolecular data format were removed.
- the peptide sequence from the PGLYCO3 format was converted to a common format by substituting “J” with “N”.
- the glycan structural information in the PGLYCO format was not included in the universal glycomolecular data format.
- the BYONIC search results were converted to the universal glycomolecular data format (Fig. 18). Parameters in the BYONIC format that are not in the universal glycomolecular data format were removed. The modification brackets and numbers in the peptide sequence of the BYONIC format were removed to convert to the universal glycomolecular data format. The amino acids flanking the peptide sequences provided in the BYONIC format were removed in the universal glycomolecular data format.
- the METAMORPHEUS search results were converted to the universal glycomolecular data format (Fig. 19). Parameters in the METAMORPHEUS format that are not in the universal glycomolecular data format were removed. The format of the protein ID was the same between the METAMORPHEUS format and the universal glycomolecular data format.
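
A minimal sketch of a record in the universal glycomolecular data format, with the two peptide clean-ups described above. The field names follow the columns listed in this example; the input layouts are assumptions: PGLYCO3 peptides are taken to mark the glycosite with `J`, and BYONIC peptides are taken to look like `K.N[+1913.677]GTR.S` (flanking residues separated by dots, modifications in square brackets). The helper names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class UniversalRecord:
    ms2_scan: int          # "MS2#"
    protein_id: str
    peptide: str
    glycan: str            # universal glycan format, e.g. "N(3)H(3)F(1)A(0)"
    precursor_mz: float    # "Precursor MZ"
    charge: int            # "Z"
    rt_min: float          # "RT"
    total_score: float     # "Total score"
    search_engine: str


def normalize_pglyco_peptide(peptide: str) -> str:
    """Replace the glycosite marker 'J' with asparagine 'N'."""
    return peptide.replace("J", "N")


def normalize_byonic_peptide(peptide: str) -> str:
    """Drop flanking residues (assumed 'X.PEPTIDE.X') and '[...]' modification brackets."""
    match = re.fullmatch(r"[A-Z\-]\.(.+)\.[A-Z\-]", peptide)
    core = match.group(1) if match else peptide
    return re.sub(r"\[[^\]]*\]", "", core)


# Made-up values for illustration only.
record = UniversalRecord(
    ms2_scan=4321,
    protein_id="P01009",
    peptide=normalize_byonic_peptide("K.N[+1913.677]GTR.S"),
    glycan="N(4)H(5)F(0)A(1)",
    precursor_mz=1012.4567,
    charge=3,
    rt_min=25.0,
    total_score=310.0,
    search_engine="BYONIC",
)
```

Once every engine's rows are mapped into the same record type, results can be compared field by field.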
- This non-limiting example shows conversion of specific parameters from a search engine-specific format to a universal glycomolecular data format.
- Conversion of search results from PGLYCO3 to a universal glycopeptide data format involved conversion of protein ID, peptide sequence, retention time, and glycan data from the PGLYCO3 format to the universal glycopeptide data format (Fig. 20).
- the protein identifier was converted to include only the portion required to uniquely identify the protein.
- the peptide sequence was converted to replace “J”, indicating the asparagine residue to which the glycan is found to be attached, with “N”.
- the retention time was converted from seconds to minutes.
- the glycan data was converted to a universal glycan format.
- Conversion of search results from BYONIC to the universal glycopeptide data format involved conversion of the identifier, protein ID, peptide sequence, and glycan data from the BYONIC format to the universal glycopeptide data format (Fig. 21).
- the identifier was converted by removing extraneous information and leaving the identifier that is sufficient to identify the same analyte in the search results from other search engines.
- the protein identifier was converted to include only the portion required to uniquely identify the protein.
- the peptide sequence was converted to remove modification brackets and numbers, and to remove amino acids flanking the peptide sequence of the identified glycopeptide.
- the glycan data was converted to the universal glycan format.
- Conversion of search results from METAMORPHEUS to the universal glycopeptide data format involved conversion of the m/z number (or precursor MZ), protein ID, and glycan data from the METAMORPHEUS format to the universal glycopeptide data format (Fig. 22).
- the m/z number (or precursor MZ) was converted by rounding the value to four decimal places.
- the protein identifier was converted typographically to the universal glycopeptide data format.
- the glycan data was converted to the universal glycan format.
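
The per-parameter conversions in this example can be sketched as small functions. The retention-time and m/z rules come from the text; the input layouts for the protein ID (`sp|ACCESSION|NAME`) and the identifier (a longer string containing `scan=NNNN`) are assumptions made for illustration, and all function names are hypothetical.

```python
import re


def rt_seconds_to_minutes(rt_seconds: float) -> float:
    """PGLYCO3 retention time: seconds to minutes."""
    return rt_seconds / 60.0


def round_mz(mz: float) -> float:
    """METAMORPHEUS precursor m/z: round to four decimal places."""
    return round(mz, 4)


def short_protein_id(raw: str) -> str:
    """Keep only the accession from an assumed 'sp|ACCESSION|NAME' style ID."""
    parts = raw.lstrip(">").split("|")
    return parts[1] if len(parts) >= 3 else raw


def scan_from_identifier(raw: str) -> int:
    """Extract the scan number from an assumed identifier containing 'scan=NNNN'."""
    match = re.search(r"scan=(\d+)", raw)
    if match is None:
        raise ValueError(f"no scan number found in {raw!r}")
    return int(match.group(1))
```

For example, `rt_seconds_to_minutes(1500.0)` gives `25.0`, and `short_protein_id("sp|P01009|A1AT_HUMAN")` gives `"P01009"`.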
- This non-limiting example shows identification of a conflicting subset of glycomolecules (e.g., glycopeptides) in search results from different glycomolecule search engines.
- PGLYCO3 identifies a first set of glycopeptides as containing sialic acid (Fig. 28A, circle on the right), and BYONIC identifies a second set of glycopeptides as containing sialic acid (Fig. 28A, circle on the left).
- PGLYCO3 identifies a third set of glycopeptides as being fucosylated (Fig. 28B, circle on the right), and BYONIC identifies a fourth set of glycopeptides as being fucosylated (Fig. 28B, circle on the left).
- in Figs. 28A and 28B, there is a conflicting subset of glycopeptides 610.
- a number of analytes are identified in the BYONIC search output as having sialic acid as illustrated in Fig. 28A, and the same analytes (e.g., as may be determined by their shared identifiers or “scan numbers”) are identified in the PGLYCO3 search output as having fucose, thus creating a conflict.
- Analyte “X” may be one of the glycopeptides that is within this conflicting subset.
- Analyte X is identified as being a fucosylated glycopeptide in the PGLYCO3 search output (and not in the BYONIC search output) (Fig. 28B).
- the scan number for analyte “X” is used to determine that analyte “X” is identified as a glycopeptide containing sialic acid in the BYONIC search output (and not in the PGLYCO3 search output) (Fig. 28A).
- peptides with sialic acid and peptides with fucose are commonly confused for each other. This is due to the similarity in mass of some commonly occurring glycans, such as fucose and sialic acid, as illustrated in Fig.
- the disclosed method may determine that there is a conflict.
- the raw mass spectrometry data corresponding to analyte “X” with conflicting identification between the two search engines is pulled out of the sample data used to run the searches, and further analysis of the raw data is carried out to resolve the conflict. Based on further analysis, the analyte “X” is determined to be a fucosylated glycopeptide.
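
The conflict check in this example can be sketched as a comparison of glycan annotations that share a scan number. The sketch assumes both outputs have already been converted to the universal glycan format and keyed by MS2 scan; `find_conflicts` is a hypothetical helper, and the scans and glycans are made-up values. It only flags conflicts; resolving them against raw spectra is outside its scope.

```python
import re
from typing import Dict, List


def count_of(letter: str, glycan: str) -> int:
    """Return the count for one residue letter in a string like 'N(4)H(5)F(0)A(1)'."""
    match = re.search(rf"{letter}\((\d+)\)", glycan)
    return int(match.group(1)) if match else 0


def find_conflicts(sialic_engine: Dict[int, str], fucose_engine: Dict[int, str]) -> List[int]:
    """Scans one engine reports as sialylated (no fucose) and the other as fucosylated (no sialic acid)."""
    conflicts = []
    for scan in sorted(sialic_engine.keys() & fucose_engine.keys()):
        a, b = sialic_engine[scan], fucose_engine[scan]
        a_sialic_only = count_of("A", a) > 0 and count_of("F", a) == 0
        b_fucose_only = count_of("F", b) > 0 and count_of("A", b) == 0
        if a_sialic_only and b_fucose_only:
            conflicts.append(scan)
    return conflicts


# Illustrative annotations only.
byonic = {501: "N(4)H(5)F(0)A(1)", 502: "N(4)H(5)F(1)A(0)", 503: "N(2)H(8)F(0)A(0)"}
pglyco = {501: "N(4)H(5)F(2)A(0)", 502: "N(4)H(5)F(1)A(0)", 504: "N(2)H(5)F(0)A(0)"}

conflicting_scans = find_conflicts(byonic, pglyco)
```

Scan 501 plays the role of analyte “X” here: the two engines agree on the spectrum but disagree on whether the glycan carries sialic acid or fucose.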
- This non-limiting example shows generating a glycopeptide profile based on a consensus list of overlapping sets of glycomolecules in search results from different glycomolecule search engines.
- Raising the threshold score for PGLYCO3 from 50 to 80 results in a third set of glycomolecules identified by PGLYCO3 (Fig. 29B, circle on the right) that is a subset of the first set of glycomolecules identified using the lower threshold score.
- the overlap between the third set of glycomolecules identified by PGLYCO3 using the higher threshold score and the second set of glycomolecules identified by BYONIC identifies a second consensus list of glycomolecules in the sample from which the sample data used to run the searches were obtained.
- the proportion of glycomolecules in the third set identified by PGLYCO3 that are in the second consensus list is larger than the proportion of glycomolecules in the first set identified by PGLYCO3 that are in the first consensus list.
- the proportion of glycomolecules in the second set identified by BYONIC that are in the second consensus list is no larger than the proportion of glycomolecules in the second set identified by BYONIC that are in the first consensus list, because the second consensus list is a subset of the first.
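
The effect of raising a score threshold on the consensus list can be sketched with set intersections. The thresholds 50 and 80 are from the text; the scores, scan numbers, and helper names are made up for illustration.

```python
from typing import Dict, Set


def identified(scores: Dict[int, float], threshold: float) -> Set[int]:
    """Scans whose confidence score meets the threshold."""
    return {scan for scan, score in scores.items() if score >= threshold}


def proportion_in_consensus(engine_set: Set[int], consensus: Set[int]) -> float:
    """Share of one engine's identifications that also appear in the consensus list."""
    return len(engine_set & consensus) / len(engine_set) if engine_set else 0.0


# Toy data: PGLYCO3 scores by scan, and a fixed BYONIC identification set.
pglyco_scores = {1: 55.0, 2: 60.0, 3: 85.0, 4: 90.0, 5: 70.0}
byonic_set = {2, 3, 4, 6}

first_set = identified(pglyco_scores, 50)   # lower threshold
third_set = identified(pglyco_scores, 80)   # higher threshold, subset of first_set

first_consensus = first_set & byonic_set
second_consensus = third_set & byonic_set

pglyco_low = proportion_in_consensus(first_set, first_consensus)
pglyco_high = proportion_in_consensus(third_set, second_consensus)
byonic_low = proportion_in_consensus(byonic_set, first_consensus)
byonic_high = proportion_in_consensus(byonic_set, second_consensus)
```

In this toy data the PGLYCO3 proportion rises from 0.6 to 1.0 when the threshold goes from 50 to 80, while the BYONIC proportion falls from 0.75 to 0.5 because the consensus list itself shrinks.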
Abstract
The present invention relates to a system and method for converting glycan representations between different platform-specific glycan formats via a universal glycan format. Using a universal glycan format for conversion between different platform-specific glycan formats, such as different glycomolecule search engine-specific formats, can reduce the number of conversion rule sets required to convert glycan data formats among the different platform-specific glycan formats. The universal glycan format can also improve readability and simplify the analysis of glycomolecules. The invention also relates to a system and method for comparing or analyzing glycomolecule search results from different glycomolecule search engines, and to a user interface, system, and method for processing, on a user interface, a consensus list of glycomolecules (e.g., glycopeptides, glycoDNA, glycoRNA, glycolipids) identified in a biological sample from sets of glycomolecule search results from multiple glycomolecule search engines, where the sets of glycomolecule search results may include conflicting identifications between the glycomolecule search engines.
Applications Claiming Priority (12)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263306426P | 2022-02-03 | 2022-02-03 | |
| US63/306,426 | 2022-02-03 | ||
| US202263269804P | 2022-03-23 | 2022-03-23 | |
| US63/269,804 | 2022-03-23 | ||
| US202263269884P | 2022-03-24 | 2022-03-24 | |
| US63/269,884 | 2022-03-24 | ||
| US202263362303P | 2022-03-31 | 2022-03-31 | |
| US63/362,303 | 2022-03-31 | ||
| US202263365850P | 2022-06-03 | 2022-06-03 | |
| US63/365,850 | 2022-06-03 | ||
| US202363479909P | 2023-01-13 | 2023-01-13 | |
| US63/479,909 | 2023-01-13 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2023150729A2 (fr) | 2023-08-10 |
| WO2023150729A3 (fr) | 2023-09-28 |
Family
ID=87553046
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2023/062001 Ceased WO2023150729A2 (fr) | 2022-02-03 | 2023-02-03 | Systems and methods for glycomolecule identification, searching, and comparison, and for analysis of the results thereof |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023150729A2 (fr) |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2002074233A2 (fr) * | 2001-03-16 | 2002-09-26 | Glycominds Ltd. | System and method for creating a series of databases of three-dimensional glycan structures and applications thereof |
| EP3519832B1 (fr) * | 2016-10-03 | 2024-04-03 | Waters Technologies Corporation | Labeled glycan amino acid complexes useful in LC-MS analysis and methods of making the same |
-
2023
- 2023-02-03 WO PCT/US2023/062001 patent/WO2023150729A2/fr not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| WO2023150729A3 (fr) | 2023-09-28 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23750475; Country of ref document: EP; Kind code of ref document: A2 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 23750475; Country of ref document: EP; Kind code of ref document: A2 |