WO2012101151A1 - A novel method of providing a library of n-mers or biopolymers - Google Patents

A novel method of providing a library of n-mers or biopolymers

Info

Publication number
WO2012101151A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
mers
biopolymers
sequences
mer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2012/051099
Other languages
English (en)
Inventor
Peter Kamp Busk
Lene M. LANGE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aalborg Universitet AAU
Original Assignee
Aalborg Universitet AAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aalborg Universitet AAU
Publication of WO2012101151A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides

Definitions

  • TITLE A novel method of providing a library of n-mers or biopolymers
  • the present invention relates to a method of providing a library of n-mer sequences, in particular primers and/or probes, a method of providing a library of biopolymer sequences involving the use of the n-mers, a method for providing an extended biopolymer fragment or full length sequence involving use of the primers, and use of the selected primers and/or probes for i.a. amplification of nucleic acids.
  • the complexity of the available biomass places high demands on microbial products: most agricultural products will have to be reserved for feeding the 9 billion people as well as for feeding the animals in the food chain.
  • the biomass available for industrial purposes will in future consist largely of crop residue and biowaste materials. Such materials are primarily composed of plant lignocelluloses, a highly recalcitrant structure that needs a host of enzymes for full decomposition. This places even higher demands on the discovery of new and improved enzymes of microbial origin.
  • Protein and enzyme discovery can be based on genome sequencing (confined to one organism at a time and dependent on time-consuming annotation), activity screening (requiring cloning and available high-throughput assays), and searching for novelty through sequence similarity (e.g. a Polymerase Chain Reaction (PCR) based approach).
  • PCR primers for discovering novel xylanases (e.g. GH10 and GH11) and discovering novel endoglucanases (e.g. GH45).
  • the 3D protein structure has through evolution maintained longer stretches of rather highly conserved regions, suitable for primer construction.
  • other enzyme types needed for cellulose decomposition, such as the cellobiohydrolases or the auxiliary proteins belonging to GH61, have very high sequence variation within each protein family and/or only limited areas of sufficiently conserved sequence.
  • the resulting invention has, for biopolymers such as proteins, RNA and DNA, been developed into a spectrum of methods allowing for improved discovery of novel proteins/peptides (from biological materials as well as from databases / in silico), for discovering subgrouping of protein families, for identifying micro RNA target sites, and for pinpointing important sequence stretches in known and unknown biopolymers.
  • Methods for creation of degenerated primers are typically based on sequence alignment (reviewed by (Chakravorty & Vigoreaux 2010)).
  • the design of the primers is critically dependent on finding the relevant sequences for alignment. These sequences are selected according to the problem at hand. For example, when looking for new members of a fungal gene family in an Aspergillus species it makes sense to limit the alignment to known genes from related Aspergillus species. However, the number and divergence of sequences that can be aligned is often limited by the ability to perform a correct alignment and by the ability to identify the most conserved sequence motifs in the aligned sequences.
  • MEME and related algorithms (Bailey & Elkan 1995; Price et al. 2003) are bioinformatic tools that can be used for discovery of conserved motifs in protein sequences.
  • the motif length is not fixed and the motifs identified do not have to be 100 % identical in the sequences (http://meme.nbcr.net/meme4_5_0/cgi-bin/meme.cgi).
  • the method is well suited to finding sequence motifs where the requirement for the exact sequence of the motif is not absolute. For example, native transcription factor binding sites will often be variations of a sequence motif (Busk & Pages 1998).
  • the present invention is more suitable for finding sequences that are 100 % identical to the motif as is necessary for degenerated primers.
  • the requirement for a non-redundant sequence reduces the degree of freedom of the search but allows for larger freedom of input sequences.
  • Glycosyl hydrolases have been classified into families based on sequence alignment and alignment of hydrophobic stretches (Henrissat 1991). However, further
  • the present invention can be used to generate efficient PCR primers for the gh61 protein family and to classify the gh61s into 13 subfamilies.
  • chaos game representation creates a picture based on the biological sequences and pictures representing different sequences can be compared (Jeffrey 1990).
  • An important limitation of chaos game representation is that the method is only able to accommodate four different words. This makes the method suitable for nucleotide sequence comparison but difficult to adapt to protein sequences made up of 20 different words/amino acids (Davies et al. 2008; Deschavanne & Tuffery 2008).
  • Another alignment-independent approach for sequence comparison is to count the frequency of all words of a certain length (for example trimers) in each sequence and classify the sequences according to word frequencies (Blaisdell 1986; D'Auria et al.
  • Variations of this method include dividing the sequences into subsequences with different chemical properties (for example hydrophilic and hydrophobic) (Strope & Moriyama 2007). These alignment-independent methods require less computation than alignment and can be used for comparison of distantly related sequences (Vinga & J. Almeida 2003) but do not produce the precise and easily comprehensible overview of sequence similarities and differences that is the hallmark of successful alignment (Arakawa et al. 2009; J. S. Almeida & Vinga 2009; Deschavanne & Tuffery 2008).
  • Word frequency methods for alignment-independent sequence comparison are inspired by text analysis methods (reviewed by (Vinga & J. Almeida 2003)). These methods look for short sequences (words) within protein or DNA sequences and count the number of times each word is repeated within the sequences. The similarity between two or more sequences is calculated by comparing the frequency of each word within the sequences (Tomovic et al. 2006; D'Auria et al. 2006; Cheng et al. 2005). A sophistication of this approach is to calculate the statistical probability that a word will occur in a training set of sequences to find the words that have the highest probability of being found in the selected sequences but not in randomly chosen sequences (Vries et al. 2004).
  • the best words are not the sequences that have the highest frequency in the training set, but the words that have the best discriminating power. These words can be used to calculate the likelihood that sequences not included in the training set have the same properties as the training set.
  • the method uses word lengths of up to 4 amino acids. For example, a set of 4-mer words can be derived from a training set of G-protein Coupled Receptor (GPCR) sequences and used to predict whether other protein sequences are GPCRs (Vries et al. 2004).
  • a problem with this method is that it is dependent on the quality of the underlying models in the training sets.
  • the training sets are defined by application of Hidden Markov Models to curated seed alignments and several parameters such as reliability of alignment and family size affect the selection of training set and hence the outcome of the method.
  • Families of related proteins often contain short amino acid motifs that are conserved between the family members (Marchler-Bauer et al. 2011). It is assumed that these motifs are conserved because they are functionally or structurally important for the family. Although the motifs are important, they will normally only occur once within each protein, e.g. a conserved motif forming an active site. Word frequency methods that calculate n-mer frequencies within sequences are not designed to find such unique amino acid motifs that are conserved between sequences.
  • the present invention distinguishes itself from all other prior art approaches for grouping and discovery of biopolymers.
  • Prior art can in short be described as follows. Alignment is a sequence-based one-to-one comparison, fixed in both direction and spacing. Domain finding is based on sequence recognition of known domain structures. BLAST searches are based on sequence homology and a one-to-one comparison. The inherent potential of BLAST searches for finding similarities has not been transformed into a method (e.g. through primer construction) for discovery of novel biopolymers.
  • PPR: Peptide Pattern Recognition
  • the core method of PPR consists of two steps: 1. Find a limited number of n-mer short sequences that are highly conserved in a group of longer biological sequences such as proteins or nucleic acid sequences. 2. Select the longer biological sequences that contain more than a threshold number of the n-mer short sequences. Any input sequence that is unrelated to the other input sequences will be discarded.
  • the output of PPR is a group of related sequences selected from the input and a list of the n-mer short sequences that are most conserved in this group.
  • PPR measures word frequencies as the number of sequences that contain a given n-mer but ignores the number of times that the n-mer occurs within each sequence. This makes PPR fundamentally different from traditional word frequency methods, which measure the number of times a given n-mer occurs within each sequence.
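The two PPR steps and the frequency definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy sequences and the parameters `top` and `threshold` are hypothetical.

```python
from collections import Counter


def nmer_frequencies(sequences, n):
    """Count, for each n-mer, how many sequences contain it at least once."""
    freq = Counter()
    for seq in sequences:
        # A set collapses repeats inside one sequence, so each sequence
        # contributes at most 1 to the frequency of any n-mer.
        freq.update({seq[i:i + n] for i in range(len(seq) - n + 1)})
    return freq


def ppr_core(sequences, n, top, threshold):
    """Step 1: pick the `top` most conserved n-mers (hypothetical cut-off).
    Step 2: keep sequences containing more than `threshold` of them."""
    freq = nmer_frequencies(sequences, n)
    ranked = sorted(freq, key=lambda m: (-freq[m], m))  # ties broken alphabetically
    conserved = ranked[:top]
    kept = [s for s in sequences if sum(m in s for m in conserved) > threshold]
    return kept, conserved


# Hypothetical toy input: three related sequences and one unrelated one.
seqs = ["MKTAYIAKQR", "GGTAYIAKWW", "PPTAYIAKLL", "QQQQQQQQQQ"]
kept, conserved = ppr_core(seqs, n=3, top=4, threshold=2)
print(conserved)  # the four 3-mers shared by the related sequences
print(kept)       # the unrelated poly-Q sequence is discarded
```

Note that the 3-mer QQQ occurs eight times inside the last toy sequence yet receives frequency 1, because only the number of sequences containing it is counted.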
  • PPR is free from bias introduced by training set selection and can be used with any word length (n-mer) and does not depend on removing statistically frequent words from the data set. Another important difference is in the practical outcome of the methods: Vries and coworkers show that their method is able to correctly classify 70 % of unknown proteins but claim that the method can be improved to reach about 85 % (Vries et al. 2004).
  • PPR is able to classify proteins with 90 % - 95 % accuracy (according to enzymatic function). This considerably higher accuracy in functional prediction is highly valuable for elucidating the function of unknown proteins, e.g. enzymes of potential industrial use.
  • the current invention builds on a subdivision of the biopolymer (e.g. a DNA, RNA or amino acid sequence) not into individual amino acids, nucleotides or base pairs but into blocks of n-mers. The subdivision is placed freely, in all permutations, along each of the biopolymer sequences. Further, all subdivisions of all included sequences are compared to each other, not one to one and not in any specific order. Such multiple n-mer subdivision, multiple comparison, identity finding and ranking are made possible by the proposed algorithm-directed, program-based computation.
  • the invention can be viewed as a method of generating all possible primers for a group of input sequences and testing the primers to find the optimal primers that will identify and provide as many as possible of the input sequences. However, in a more general form the invention generates a number of n-mers that characterize a group of biopolymers.
  • identifying the level of identity for each of the n-mers allows for ranking by level of identity and for inserting thresholds. This identifies what is unifying, and thus usable for discovery of novelty among nature's own variants (novelty in sequence or novelty in subgrouping), and what is different and excluding.
  • "n-mer" or "n-mer sequence" as used herein is intended to refer to 2 or more consecutive monomeric units of a biopolymer, which is identified in the biopolymer.
  • the n-mers may be identified from the biopolymer starting from one or the other end of the biopolymer.
  • a biopolymer having N building blocks, wherein N is an integer, will naturally consist of N-1 2-mers, N-2 3-mers, N-3 4-mers and so forth. Thus, for instance, a protein having 100 amino acids will have 99 2-mers, 98 3-mers, 97 4-mers and so forth.
  • Some of the n-mers in the biopolymer may be identical.
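The counting rule above (a biopolymer of N units yields N - n + 1 n-mers, some of which may be identical) can be checked with a short Python sketch. The function name and the 100-residue toy sequence are hypothetical.

```python
def all_nmers(sequence, n):
    """Return every window of n consecutive units, left to right (duplicates kept)."""
    if n < 2:
        raise ValueError("an n-mer has 2 or more units")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Hypothetical 100-residue sequence built from a repeated 10-residue block.
protein = "ACDEFGHIKL" * 10
counts = {n: len(all_nmers(protein, n)) for n in (2, 3, 4)}
distinct_2mers = len(set(all_nmers(protein, 2)))

print(counts)          # number of 2-, 3- and 4-mers in a 100-residue sequence
print(distinct_2mers)  # far fewer distinct 2-mers, since many are identical
```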
  • the term "having one or more n-mers in common" as used herein such as in connection with a library of biopolymer sequences means that the provided biopolymer sequences of the library are related in such a way that the group of n-mers identified in the library of biopolymers define and characterize the library, thus all the identified n-mers will be found in the generated library of biopolymers. After selecting a first library of biopolymers the method may be repeated with a biopolymer from the remaining mixture of biopolymers.
  • library as used herein is intended to refer to a well defined group of n-mers (e.g. hexapeptides or nucleic acids of 18 base pairs) or biopolymers, which have been identified and selected from a larger group of n-mers or biopolymers.
  • environment means from nature or a predefined source, e.g. a gene bank, a known micro organism, a mammal (e.g. human), or from an unknown source such as a new micro organism, a pool of unidentified biopolymers, or a mixture of known and unknown sources, an environmental sample e.g. from a mammal or a microorganism.
  • "biopolymer" or "biopolymer sequence" as used herein means a biological molecule, including macromolecules, and molecules produced by a living organism, composed of two or more monomeric subunits, or derivatives thereof, which are linked by a bond or a macromolecule.
  • a biopolymer can be, for example, a polynucleotide or a polypeptide, or derivatives or combinations thereof, for example, a nucleic acid molecule containing a peptide nucleic acid portion or a glycoprotein, respectively.
  • Biopolymers include, but are not limited to, nucleic acids or proteins. Nucleic acids include DNA, RNA, and fragments thereof. Nucleic acids can be derived from genomic DNA, RNA, mitochondrial nucleic acid, chloroplast nucleic acid and other organelles with separate genetic material.
  • a specified group of biopolymers having a mixture of different biopolymers is intended to refer to a group of biopolymers of known or unknown origin or mixtures thereof, which comprises a mixture of different biopolymers, which group is analyzed for the purpose of generating a library.
  • a specified frequency is intended to mean that each individual n-mer as selected is accorded a specific number based on how many biopolymers the n-mer can be identified in. Thus, for instance, a hexapeptide (6-mer) identified in 7 different biopolymers will be given the frequency 7 and a dipeptide (2-mer) identified in 40 different biopolymers will be given the frequency 40.
  • polypeptide refers to a biopolymer that comprises more than about 20 consecutive amino acids.
  • polypeptide encompasses proteins, fragments of proteins, cleaved forms of proteins, partially digested proteins, and the like, which are greater than about 20 consecutive amino acids.
  • peptide refers to a biopolymer comprising fewer than about
  • polynucleotide refers to a biopolymer that comprises more than about 100 consecutive nucleotides or modified nucleotides.
  • Polynucleotides include DNA, RNA, m-RNA, r-RNA, t-RNA, cDNA, DNA-RNA duplexes, non-coding RNA etc.
  • primer as used herein is intended to refer to a strand of nucleic acid that serves as a starting point for DNA or RNA synthesis.
  • probe as used herein is intended to refer to a fragment of nucleic acids or amino acid residues of variable length.
  • the probe is typically a single-stranded nucleic acid that may recognize a sequence complementary to the sequence in the probe.
  • signal sequence is intended to refer to a short sequence of amino acid residues, usually at the amino terminus of the nascent polypeptide chain, that marks the protein for translocation across a membrane. Such sequences may consist of about 3-60 amino acid residues that direct the transport of a protein. Signal sequences may also be called signal peptides, targeting signals, transit peptides, or localization signals. Signal sequences may consist of one or more subunits. When referring to a polypeptide or protein herein such polypeptide or protein may also include a signal sequence.
  • genetic code as used herein is intended to refer to the set of rules by which information encoded in genetic material (DNA or RNA sequences) is translated into proteins (amino acid sequences) by living cells.
  • the code defines a mapping between tri-nucleotide sequences, called codons, and amino acids.
  • the genetic code consists of 64 triplets of nucleotides. These triplets are called codons. With three exceptions, each codon encodes one of the 20 amino acids used in the synthesis of proteins. That produces some redundancy in the code: most of the amino acids are encoded by more than one codon.
  • the genetic code is almost universal.
  • overlapping is intended to mean that two biopolymer fragments possess sequences in common such that the relative order of linked biopolymers can be assembled.
  • the two biopolymer fragments have a 100 % identical sequence, and this sequence starts from one end of one biopolymer and from the other end of the other biopolymer, such that the two biopolymers can be assembled into a longer sequence.
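The notion of overlap defined above, a 100 % identical stretch at the end of one fragment and the start of the other, can be illustrated with a small Python sketch. The function name, the `min_overlap` parameter and the example fragments are hypothetical, and choosing the longest exact overlap is an assumption made for the illustration.

```python
def merge_overlap(left, right, min_overlap=1):
    """Join two fragments when a suffix of `left` equals a prefix of `right`.

    Tries the longest possible overlap first and requires an exact
    (100 % identical) match. Returns None when no overlap of at least
    `min_overlap` units exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Hypothetical fragments sharing the stretch "TGCA".
extended = merge_overlap("ACGTTGCA", "TGCAGGT", min_overlap=3)
no_match = merge_overlap("ACGT", "TTTT", min_overlap=2)
print(extended)  # the two fragments assembled into one longer sequence
print(no_match)  # None: the fragments do not overlap
```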
  • nucleic acid consensus sequence as used herein is intended to mean a mixture of sequences that at each position may have several possible nucleotides, since the genetic code is degenerate, i.e. more than one codon can specify the same amino acid. Thus, for example, each of these 4 codons (CCC, CCG, CCT and CCA) specifies the same amino acid, proline.
  • degenerate primer as used herein describes mixtures of similar, but not identical, nucleotide sequences that, if translated, will encode the same amino acid sequence in at least one of the six reading frames.
  • the nucleotide sequence corresponding to the amino acid isoleucine might be "ATH", where A stands for adenine, T for thymine, and H for adenine, thymine, or cytosine, according to the genetic code for each codon, using the IUPAC symbols for degenerate bases.
  • ATA, ATT and ATC thus constitute a set of degenerate nucleotide sequences in which the third position of the sequence is degenerate.
  • One or several of the degenerated positions may be substituted with a modified base that can base pair with one or more natural nucleotides.
  • inosine which can base pair with A, C, G and T, may be used instead of A, C, G and T in a degenerated nucleotide sequence used as degenerated primer.
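As an illustration of the IUPAC notation described above, the following Python sketch expands a degenerate sequence such as "ATH" into the individual sequences it stands for and counts them. The function names are hypothetical. Treating inosine (written "I" here) as a single physical base that does not multiply the number of oligonucleotides in the mixture is an assumption made for the sketch.

```python
from itertools import product

# IUPAC nucleotide codes; "I" (inosine) is added as one physical base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "I": "I",
}


def expand(degenerate):
    """List every concrete sequence represented by a degenerate sequence."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in degenerate))]


def degeneracy(degenerate):
    """Number of distinct sequences in the degenerate mixture."""
    total = 1
    for base in degenerate:
        total *= len(IUPAC[base])
    return total


def with_inosine(degenerate):
    """Replace four-fold degenerate positions (N) by inosine."""
    return degenerate.replace("N", "I")


print(expand("ATH"))                     # the three isoleucine codons
print(degeneracy("CCN"))                 # four proline codons
print(degeneracy(with_inosine("CCN")))   # a single oligonucleotide with inosine
```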
  • in silico is an expression used to mean performed on computer or via computer simulation.
  • the phrase was coined in 1989 as an analogy to the Latin phrases in vivo and in vitro, which are commonly used in biology and refer to experiments done in living organisms and outside of living organisms, respectively.
  • suitable distance is intended to refer to the distance between the two primers which is necessary to perform a meaningful PCR.
  • extended biopolymer as used herein is intended to refer to biopolymer fragments which are extended by alignment of the fragments.
  • unordered biopolymer fragments as used herein is intended to mean populations of fragments of biopolymers, such as polynucleotides or polypeptides, that may form part of a larger biopolymer, such as a polynucleotide or a polypeptide fragment, which has not been assembled into such larger biopolymer fragments.
  • degeneracy as used herein is intended to mean that the level of degeneracy of a degenerate primer having a specific degree of degeneracy may be altered either by incorporating for example inosine into the primers at positions of three- and four-base degeneracy or to introduce preferential biases in codon usages depending on the tRNA pool of the organism of interest.
  • in DNA, adenine (A) bases complement thymine (T) bases and vice versa;
  • guanine (G) bases complement cytosine (C) bases and vice versa.
  • in RNA, adenine (A) bases complement uracil (U) bases instead of thymine (T) bases.
  • the complementary strand of the DNA sequence is complementary to the complementary strand of the DNA sequence
  • the latter sequence is called the reverse complementary strand to the DNA sequence 5' A G T C A T G 3' when it is written with the 5' end on the left and the 3' end on the right
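For the example sequence 5' A G T C A T G 3' named above, the complementary and reverse complementary strands can be computed with a short Python sketch based on the DNA pairing rules (A with T, G with C). The function names are hypothetical.

```python
# Translation table implementing the DNA pairing rules A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Base-by-base complement, read in the opposite (3' to 5') direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complementary strand rewritten with its 5' end on the left."""
    return complement(dna)[::-1]


seq = "AGTCATG"  # 5' A G T C A T G 3'
print(complement(seq))          # complementary strand, written 3' to 5'
print(reverse_complement(seq))  # the same strand, written 5' to 3'
```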
  • the present invention relates in a broad aspect to a method of providing a library of n-mer sequences, the method comprising the steps of:
  • (v) optionally use the generated n-mers or a selection of the generated n-mers according to a specified frequency to provide a second group of one or more sequences from the specified group of biopolymers having one or more of the n-mer(s) in the sequences,
  • step (vii) group all provided n-mers from the biopolymers into the library of n-mer sequences. In an embodiment hereof further including the step (viii) define specific use of the n-mers of the library.
  • the library of n-mer sequences may contain one or more n-mers, such as one single n-mer from a 3' untranslated region of an mRNA when looking for a miRNA binding site, or two or more n-mers when one or more biopolymer sequences are analyzed and all possible n-mers are identified.
  • the library generated will contain a huge number of n-mers which have the common feature that they all originate from the biopolymers used to generate such n-mers, and which biopolymers have been grouped by identification of one or more conserved n-mers being found in the biopolymers.
  • the specified group of biopolymers having a mixture of different biopolymers are from a predefined source, such as a gene bank, a known organism, sample, protein, gene family, chromosomes from one organism or selection of chromosome sequences or parts hereof from several organisms or from an unknown source such as a new microorganism, a pool of unidentified biopolymers, or a mixture of known and unknown sources, an environmental sample e.g. from a mammal, microorganism, plant sample, mixture of organisms, sample of unordered sequence reads e.g., from one or several organisms, from a database.
  • When selecting a biopolymer sequence from a specified group of biopolymers one or more biopolymers may be selected, such as one biopolymer which will then be used to generate all possible n-mers from the biopolymer sequence, and such group of n-mers may then be designated to a library. If two or more biopolymer sequences are selected from a specified group of biopolymers then such biopolymers may be known and selected because they belong to the same family of biopolymers, or such biopolymers may not be known.
  • If the biopolymers are not known, they may be identified and selected because they have at least one n-mer in common, or the method may be continued with two or more biopolymers in parallel and result in one library, if such biopolymers belong to the same family, or in several libraries, each of which is identified and classified individually according to the present invention.
  • a library may comprise one or more libraries according to the n-mers used, and a selected biopolymer used for generation of n-mers may belong to more than one library depending on the n-mers used, for instance, a protein having two different domains may generate one library covering the first domain, based on n-mers generated from the first domain, and a different library covering the second domain, based on n-mers generated from the second domain.
  • the next step is to generate all possible n-mers from the biopolymer sequence(s). All the n-mers may then be used in the next step or, optionally, for each n-mer sequence the frequency is calculated as the number of biopolymers that contain the n-mer sequence at least once and is assigned to the n-mer. By calculating frequency, each n-mer will be assigned an integer based on how many biopolymers contain the specific n-mer. Each n-mer thus has a frequency number that indicates whether it lies in the lower or the higher end, and the n-mers can then be ranked accordingly if desired.
  • the n-mers are ranked according to frequency and only the 90%, such as the 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, most frequent n-mers are used in step (iii) or (v) or both.
  • the n-mers are ranked according to frequency and only the 2, such as the 5, 10, 20, 50, 100, 200 most frequent n-mers are used in step (iii) or (v) or both. In an additional embodiment these two selection criteria may be combined.
  • the n-mers are ranked according to frequency and only the 90%, such as the 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, least frequent n-mers are used in step (iii) or (v) or both.
  • the n-mers are ranked according to frequency and only the 2, such as the 5, 10, 20, 50, 100, 200 least frequent n-mers are used in step (iii) or (v) or both. In an additional embodiment these two selection criteria may be combined.
  • the 3 to 150 n-mers such as 10 to 120, 20 to 100, 50 to 100, typically 100 n-mers, having the highest frequency are used in step (iii) or (v) or both.
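The ranking and selection embodiments above (a percentage of the ranked n-mers, a fixed number, or both combined, taken from the most or the least frequent end) can be sketched in Python as follows. The function name, its parameters and the toy frequencies are hypothetical. Rounding the percentage up to a whole number of n-mers is an assumption.

```python
import math


def select_nmers(freq, percent=None, count=None, most_frequent=True):
    """Rank n-mers by frequency and keep a percentage, a fixed count, or both."""
    sign = -1 if most_frequent else 1
    # Ties are broken alphabetically so that the ranking is reproducible.
    ranked = sorted(freq, key=lambda m: (sign * freq[m], m))
    limit = len(ranked)
    if percent is not None:
        limit = min(limit, math.ceil(len(ranked) * percent / 100))
    if count is not None:
        limit = min(limit, count)
    return ranked[:limit]


# Hypothetical frequencies: number of biopolymers containing each n-mer.
freq = {"AAA": 7, "CCC": 40, "DDD": 3, "EEE": 40, "FFF": 1}
top_40_percent = select_nmers(freq, percent=40)
top_three = select_nmers(freq, count=3)
combined = select_nmers(freq, percent=40, count=1)
rarest_two = select_nmers(freq, count=2, most_frequent=False)
```

When both criteria are given, the sketch applies the stricter of the two, which is one possible reading of combining the selection criteria.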
  • step (iii) the generated n-mers or a selection of the generated n-mers according to a specified frequency are used to provide a first group, from the specified group of biopolymers, of 2 or more biopolymers having one or more of the n-mer(s) in the biopolymer sequence(s), if only one biopolymer is found then it is the same biopolymer as selected from the start, and then this may be used to define a library of n-mers.
  • the next step (iv) is to generate all possible n-mers from the first group, or alternatively as explained above for each n-mer sequence calculate frequency as the number of biopolymers that contain the n-mer sequence at least once and assign the frequency to the n-mer.
  • the identified n-mers may then be grouped into the library of n-mer sequences, which may then be used for a suitable purpose, such as primers or probes.
  • the generated n-mers or a selection of the generated n-mers according to a specified frequency from step (iv) may be used in step (v) to provide a second group of one or more sequences from the specified group of biopolymers having one or more of the n-mer(s) in the sequences. If no new biopolymers are identified or if desired to stop, then the identified n-mers may then be grouped into the library of n-mer sequences, which may then be used for a suitable purpose, such as primers or probes.
  • it may be desired to stop and select the identified n-mer library before reaching the situation where no further biopolymers and thus no further new n-mers are identified.
  • it will always be an option to stop looking for further or new n-mers, and then group all provided n-mers from the biopolymers into the library of n-mer sequences, and optionally define specific use of the n-mers of the library, such as primers or probes.
  • steps (iv) and (v) may be repeated until no further biopolymers of the specified group of biopolymers are retrieved, and a definite number of biopolymers are identified.
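The repetition of steps (iv) and (v) until no further biopolymers are retrieved can be sketched as a simple loop in Python. This is an illustration under stated assumptions, not the claimed procedure: the function names, the toy pool and the parameters `top` (how many of the most frequent n-mers are used) and `min_shared` (how many of them a biopolymer must contain to be retrieved) are hypothetical, and the seed is assumed to be a member of the pool.

```python
from collections import Counter


def nmers(seq, n):
    """Set of distinct n-mers in one sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def build_library(pool, seed, n, top, min_shared):
    """Grow a group from `seed` until a round retrieves no new sequence."""
    group = {seed}
    while True:
        # Generate n-mers from the current group and assign each a
        # frequency: the number of group members containing it.
        freq = Counter()
        for seq in group:
            freq.update(nmers(seq, n))
        library = sorted(freq, key=lambda m: (-freq[m], m))[:top]
        # Retrieve every pool sequence containing enough of the n-mers.
        retrieved = {s for s in pool if sum(m in s for m in library) >= min_shared}
        if retrieved <= group:  # nothing new: stop and report the library
            return library, [s for s in pool if s in group]
        group |= retrieved


# Hypothetical pool: three related sequences and one unrelated one.
pool = ["MKTAYIAKQR", "GGTAYIAKWW", "PPTAYIAKLL", "QQQQQQQQQQ"]
library, group = build_library(pool, seed=pool[0], n=3, top=8, min_shared=3)
print(group)    # the related sequences retrieved from the pool
print(library)  # n-mers of the final group, most frequent first
```

The loop always terminates because the group can only grow and is bounded by the size of the pool.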
  • all provided n-mers from the biopolymers are grouped into the library of n-mer sequences, and optionally define specific use of the n-mers of the library, such as primers or probes.
  • the method comprises the steps of:
  • step (iv) generate all possible n-mers from the first group and for each n-mer sequence calculate the frequency as the number of biopolymers that contain the n-mer sequence at least once and assign the frequency to the n-mer,
  • step (viii) decide whether the n-mers are peptides or nucleic acids, if the n-mers are peptides reverse translate the n-mers into a corresponding nucleic acid consensus sequence, and provide at least one nucleic acid consensus sequence from one of the most frequent occurring n-mers, and define whether it is a probe or a primer.
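Reverse translation of a peptide n-mer into a nucleic acid consensus sequence, as in step (viii), can be sketched in Python using one IUPAC-coded codon per amino acid. The table and function name are illustrative assumptions based on the standard genetic code. For leucine, arginine and serine, which each have six codons, a single three-letter pattern cannot be exact, so the patterns used here also match a few codons of other amino acids.

```python
# One IUPAC consensus codon per amino acid (standard genetic code).
# Caveat: YTN (Leu), MGN (Arg) and WSN (Ser) are supersets that also
# match some codons of other amino acids.
CONSENSUS_CODON = {
    "A": "GCN", "C": "TGY", "D": "GAY", "E": "GAR", "F": "TTY",
    "G": "GGN", "H": "CAY", "I": "ATH", "K": "AAR", "L": "YTN",
    "M": "ATG", "N": "AAY", "P": "CCN", "Q": "CAR", "R": "MGN",
    "S": "WSN", "T": "ACN", "V": "GTN", "W": "TGG", "Y": "TAY",
}


def reverse_translate(peptide):
    """Return the degenerate nucleotide consensus encoding a peptide n-mer."""
    return "".join(CONSENSUS_CODON[aa] for aa in peptide)


# Hypothetical peptide n-mers.
tripeptide_consensus = reverse_translate("MPI")
hexapeptide_consensus = reverse_translate("WMDCHK")
print(tripeptide_consensus)
print(hexapeptide_consensus)
```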
  • the n-mer is a primer.
  • the n-mer is a probe.
  • the nucleic acid consensus sequence of step (iv) is translated according to the genetic code.
  • two primers are provided separated by a suitable distance and the reverse primer is complementary to the nucleic acid consensus sequence.
  • the forward and reverse primers can be used for PCR that will generate a sequence comprising the primers and any sequence located between the primers (Sambrook, J. & Russell, D.W., 2001. Molecular Cloning: A Laboratory Manual, 3rd ed., Cold Spring Harbor Laboratory Press).
  • the set of primers is selected to be degenerated primers consisting of as few similar, but not identical, nucleotide sequences as possible.
  • each degenerated primer may consist of 2, such as 4, 8, 16, 32, 64, 128, 256, 512 or 1024 similar, but not identical, nucleotide sequences.
  • n-mers provided according to the above described method of the present invention have many uses, such as for amplification of nucleic acids, as hybridization probes for screening a library of nucleotide sequences or an expression library of peptides or polypeptides, as antigens for generating an antibody for screening of an expression library of peptides or polypeptides, or as an expression library of peptides or polypeptides, both in wet lab and in silico screening.
  • the present invention relates to use of primers obtainable from the method of the present invention for amplification of nucleic acids.
  • the present invention relates to use of primers obtainable from the method of the present invention as hybridization probes for screening a library of nucleotide sequences.
  • the present invention relates to use of primers obtainable from the method of the present invention as hybridization probes for screening an expression library of peptides or polypeptides. In a still further aspect the present invention relates to use of primers obtainable from the method of the present invention as antigens for generating an antibody for screening of an expression library of peptides or polypeptides.
  • the present invention relates to use of primers obtainable from the method of the present invention as an expression library of peptides or polypeptides, both in wet lab and in silico screening.
  • the probes used are obtained from the method of the present invention.
  • the present invention relates to use of probes obtainable from the method of the present invention for amplification of nucleic acids.
  • the present invention relates to use of probes obtainable from the method of the present invention as hybridization probes for screening a library of nucleotide sequences.
  • the present invention relates to use of probes obtainable from the method of the present invention as an expression library of peptides or polypeptides, both in wet lab and in silico screening.
  • the probes used are obtained from the method of the present invention.
  • the present invention relates to a method of providing a library of biopolymer sequences having one or more n-mers in common, wherein the library is composed of at least 2 biopolymer sequences, the method comprising the steps of:
  • the method can be repeated on the remaining mixture of different biopolymers, and so forth until all biopolymers from the mixture of different biopolymers have been assigned to a library or have been identified as not belonging to a library. Whether or not all libraries of biopolymers in the mixture of different biopolymers are identified and selected is a matter of choice.
  • biopolymer sequences of the library have sequence similarity of at least 1 % identity, such as at least 5%, from 1 % to 25%, or from 5% to
  • Preferred methods to determine identity are designed to give the largest match between the sequences tested. Methods to determine identity are described in publicly available computer programs. Preferred computer program methods to determine identity between two sequences include the GCG program package, including GAP (Devereux et al., Nucl. Acid. Res., 12, 387, (1984)); Genetics Computer Group,
  • BLASTX Altschul et al., J. Mol. Biol., 215, 403-410, (1990)
  • the BLASTX program is publicly available from the National Center for Biotechnology Information (NCBI) and other sources (BLAST Manual, Altschul et al. NCB/NLM/NIH Bethesda, Md. 20894; Altschul et al., supra).
  • NCBI National Center for Biotechnology Information
  • the well known Smith Waterman algorithm may also be used to determine identity.
  • GAP Genetics Computer Group, University of Wisconsin, Madison, Wis.
  • two proteins for which the percent sequence identity is to be determined are aligned for optimal matching of their respective amino acids (the "matched span", as determined by the algorithm).
  • a gap opening penalty (which is calculated as 3 times the average diagonal; the "average diagonal" is the average of the diagonal of the comparison matrix being used; the "diagonal" is the score or number assigned to each perfect amino acid match by the particular comparison matrix) and a gap extension penalty (which is usually 1/10 times the gap opening penalty), as well as a comparison matrix such as PAM 250 or BLOSUM 62 are used in conjunction with the algorithm.
  • a standard comparison matrix see Dayhoff et al.
  • biopolymer sequences of the library have the same bioactivity.
  • biopolymer sequences of the library are from the same gene family.
  • biopolymer sequences of the library are from the same type or phylogenetic class of organisms.
  • biopolymer sequences of the library are from the same organism.
  • biopolymer sequences of the library are from the same sample containing biopolymers from one or more organisms.
  • when a library of biopolymers has been selected, for instance because the biopolymers have the same bioactivity, are from the same gene family or share another desired property, or for further testing to identify the property or properties, such a library may be subjected to the method of the present invention again, for instance by subjecting the provided library to steps (i) to (vii) with n at least one higher than the previously defined n.
  • n is an integer above 1 which may be as high as the number of building blocks in the biopolymers although this may in many cases only provide one or a few biopolymers from the mixture of different biopolymers, typically n is an integer from 2 to 100, such as 2-75, 3-60, 3-18, 4-50, 5-20, 5-8, 2-10, or e.g. 2-6.
  • the length of the n-mer may be a fraction of the length of the biopolymer, such as 1/10, 1/100, 1/1000, 1/10000 or an even smaller fraction when the biopolymers are very long, such as chromosomes.
  • the n-mer is composed of n amino acids.
  • the biopolymer sequences are selected from nucleic acids (e.g. DNA or RNA) then the n-mer is composed of n nucleotides.
  • biopolymer sequence is selected from polypeptides, such as proteins, or fragments thereof.
  • the biopolymer sequence is selected from nucleic acids such as polydeoxyribonucleic acids (DNA) and polyribonucleic acids (RNA), or fragments thereof.
  • the nucleic acid is an RNA it is selected from mRNA and non-coding RNA, e.g. microRNA.
  • the present method for providing n-mers or biopolymers may be performed in a lab, on paper or by using a computer, that is, in silico. Typically, in order to generate and handle large data sets, the method is performed in silico.
  • the present invention relates to use of an n-mer from a biopolymer sequence having an identified sequence for searching and collecting fragments of biopolymers from an environment comprising the biopolymer fragments.
  • a further use of n-mers generated from a known biopolymer or an identified sequence of a biopolymer is the possibility of looking for biopolymer fragments in an environment where the sequences of such fragments have not yet been identified as belonging to a certain host.
  • n-mers generated from a biopolymer sequence having an identified sequence it may be possible to search and find biopolymer fragments which have the n-mer in the sequence.
  • Such identified and collected biopolymer fragments may then be assembled to a larger fragment or full length biopolymer, which can then be tested for a specific activity.
  • n-mers are generated and then such n-mers are used to provide a group of one or more biopolymer fragments from the specified environment, wherein such fragments have one or more of the n-mers in the sequence. If the n-mers are amino acid sequences and the biopolymer fragment from the environment consists of nucleotide sequences then the biopolymer fragment is considered to have the n-mer in the sequence if the biopolymer fragment sequence translated in any of the three reading frames encodes the n-mer.
  • reverse complementary sequence to the biopolymer fragment is considered to have the n-mer in the sequence if the biopolymer fragment reverse complementary sequence translated in any of the three reading frames encodes the n-mer.
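The reading-frame rule above can be sketched in Ruby as follows. This is an illustration under stated assumptions, not the original program: the names `CODON_TABLE`, `reverse_complement`, `translate` and `encodes_nmer?` are hypothetical, the standard genetic code is assumed, and only unambiguous A, C, G and T bases are handled (any other codon is translated as "X").

```ruby
# Standard genetic code, written in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {}
BASES.each_char.with_index do |b1, i|
  BASES.each_char.with_index do |b2, j|
    BASES.each_char.with_index do |b3, k|
      CODON_TABLE[b1 + b2 + b3] = AMINO_ACIDS[16 * i + 4 * j + k]
    end
  end
end

# Reverse complementary strand of a DNA sequence
def reverse_complement(dna)
  dna.reverse.tr("ACGT", "TGCA")
end

# Translate one reading frame (0, 1 or 2); incomplete trailing codons are dropped
def translate(dna, frame)
  dna[frame..-1].to_s.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon, "X") }.join
end

# True if the peptide n-mer is encoded in any of the three reading frames
# of the fragment or of its reverse complementary sequence
def encodes_nmer?(fragment, nmer)
  dna = fragment.upcase
  [dna, reverse_complement(dna)].any? do |strand|
    (0..2).any? { |frame| translate(strand, frame).include?(nmer) }
  end
end
```

For example, `encodes_nmer?("CCATGAAAACTGG", "MKT")` finds the peptide in the third reading frame of the forward strand, and the reverse complementary fragment `"CCAGTTTTCATGG"` is also accepted.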
  • the collected biopolymer fragments are aligned if possible to generate a longer sequence consisting of overlapping, collected biopolymer fragments. Then repeat, if this is considered suitable, the search for biopolymer fragments by using n-mers generated from the identified biopolymer fragments until the extended biopolymer fragments reach the expected length of the known
  • the present invention concerns use of a library of n-mer
  • the present invention relates to use of a library of n-mer sequences generated from a library of one or more known biopolymer sequences having a known function in common to identify a different biopolymer sequence comprising the n-mer sequences and having the same function as the known biopolymer sequence(s), by comparing the library of n-mer sequences of the known biopolymer sequence(s) with the n-mer sequences from the different biopolymer sequence(s), and selecting the different biopolymer sequence(s) having at least 1 of the n-mers in common.
  • the different biopolymer sequence(s) having at least 2 of the n-mers in common, such as at least 3, at least 4, at least 5, at least 6, such as 10, such as 20, such as 30 in common.
  • the sum of the frequencies of the common n-mers (frequency is the number of biopolymers, e.g. proteins, in the library that contain the n-mer divided by the total number of proteins in the library) will typically be used to increase the likelihood that the function of the different biopolymer is the same as that of the known biopolymer sequence(s), such that the sum of the frequencies of the common n-mers should be at least 0.5, such as at least 0.8, at least 1.0, at least 1.5, at least 2.0, or at least 3.0.
  • other functions of the n-mers such as the product of their frequencies may also be used to predict the functions of the different biopolymer sequence.
  • the function of the different biopolymer sequence may be predicted from a relative comparison of the number of n-mers and/or the sum of their frequencies or other property of the n-mers from one library of n-mers relative to the same function for the n-mers from another library of n-mers.
  • the different biopolymers to be compared with the library of n-mers from the known biopolymers may be preselected based on sequence identity (e.g. by alignment to the known biopolymers used to generate the library of n-mers), species of origin, expression pattern or other function of the biopolymers.
  • the known function as used herein is to be understood in its broadest interpretation to mean any function such as and without limitation: sequence identity, species of origin, expression pattern, enzymatic activity, structural role, helper function, epitope, recognition site for proteins, secondary or tertiary structure or any other function that a peptide-, protein-, or nucleic acid sequence may have and which is determined by the sequence of the biopolymer.
  • the use further comprises the step of screening the selected different biopolymer sequence(s) in a relevant assay to confirm that the function is the same as the known biopolymer sequence(s).
  • the library of n-mer sequences of the known biopolymer sequence(s) may be compared with the n-mer sequences from the different biopolymer sequence(s) in different ways which are all intended to be encompassed by the present invention without limiting the scope thereof, such as by defining a score of the selected biopolymer for each n-mer library calculated as the number of the n-mers that are included in the selected sequence of the biopolymer or the sum of the frequency of the n-mers that are included in the selected sequence of the biopolymer or as another value associated to the n-mers that are included in the selected sequence of the biopolymer (e.g. multiplication of the frequencies of the n-mers).
  • the score of each selected biopolymer is used to associate the biopolymer to the library of n-mers and to infer the probability that the selected biopolymer has similar properties as the known biopolymers used to generate the library of n-mers.
  • the score is an absolute number, however the score of a biopolymer for one library of n-mers may be compared to the score for another library of n-mers to decide which library (and thus known biopolymers) are most related to the different biopolymer.
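The three scores mentioned above (number of shared n-mers, sum of their frequencies, product of their frequencies) can be sketched in Ruby as follows. The method name `score_sequence`, the hexapeptides and the frequency values are invented for illustration; frequencies are taken here as the fraction of library proteins containing the n-mer. Returning 0.0 as the product when no n-mer is shared is an assumption of this sketch.

```ruby
# Hypothetical scoring helper. library maps each n-mer to its frequency
# (fraction of the known biopolymers that contain the n-mer).
def score_sequence(sequence, library)
  hits = library.select { |nmer, _freq| sequence.include?(nmer) }
  freqs = hits.values
  {
    count: freqs.length,                                        # number of shared n-mers
    frequency_sum: freqs.inject(0.0) { |sum, f| sum + f },      # sum of their frequencies
    frequency_product: freqs.empty? ? 0.0 : freqs.inject(1.0) { |prod, f| prod * f }
  }
end

# Invented example libraries and candidate sequence
library_a = { "HGPVIN" => 0.75, "WFKIDE" => 0.5, "QCNGAR" => 0.25 }
library_b = { "QCNGAR" => 0.5, "LLDEYT" => 0.5 }
candidate = "MKHGPVINAAWFKIDELL"

score_a = score_sequence(candidate, library_a)
score_b = score_sequence(candidate, library_b)

# Relative comparison: the library with the higher frequency sum is taken
# as the more closely related one
closest = score_a[:frequency_sum] >= score_b[:frequency_sum] ? :library_a : :library_b
```

With these example values the candidate shares two n-mers with the first library (frequency sum 1.25) and none with the second, so it would be associated with the first library.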
  • a method of providing a library of n-mer sequences, wherein the library is composed of an n-mer sequence comprising the steps of:
  • (v) optionally use the generated n-mers or a selection of the generated n-mers according to a specified frequency to provide a second group of one or more sequences from the specified group of biopolymers having one or more of the n-mer(s) in the sequences,
  • step (iv) generate all possible n-mers from the first group and for each n-mer sequence calculate the frequency as the number of biopolymers that contain the n-mer sequence at least once and assign the frequency to the n-mer,
  • step (viii) decide whether the n-mers are peptides or nucleic acids, if the n-mers are peptides reverse translate the n-mers into a corresponding nucleic acid consensus sequence, and provide at least one nucleic acid consensus sequence from one of the most frequent occurring n-mers, and define whether it is a probe or a primer.
  • step (iv) is translated according to the genetic code.
  • primer set further is selected to have a desired redundancy.
  • unknown source such as a new micro organism, a pool of unidentified biopolymers, or a mixture of known and unknown sources
  • an environmental sample e.g. from a mammal, microorganism, plant sample, mixture of organisms, sample of unordered sequence reads e.g., from one or several organisms, or from a database.
  • n is an integer from 2 to 75.
  • biopolymer sequence is selected from polypeptides, proteins, nucleic acids, or fragments thereof.
  • the program is written in the Ruby programming language version 1.8.6 and normally executed on a machine with the Microsoft Windows XP version 2002 operating system, but it can also be executed under other operating systems and would easily be adapted to other versions of Ruby.
  • seq = seq.slice(seq.index(x)+1, seq.length)
  • score += 1 if @seq.include?(pep) } end
  • score += 1 if @seq.include?(pep.seq) } end
  • array = @family_array.sort.reverse
  • attr_accessor :seq, :name, :score, :degeneracy, :degeneracy_w_inosine, :average_position, :frequency, :prot_score
  • score += 1 if p.seq.include?(@seq) }
  • prot_array.fetch_proteins_from("peptidcycler_excluded_proteins2.txt", pep_length) array_output = [] prot_array.each do
  • master_cut_off = master_prot.peptides.length * (1 + cut_off)/limit
  • prot_array.each do
  • #Make group of proteins similar to the master p.score = 0 #zero the hexa_score
  • p.score += 1 if p.seq.include?(peptide) } if p.score > master_cut_off
  • prot_score = 0
  • prot_score += p.score
  • sort_peptide_array.each {
  • peptide_array << a[2] } #Best peptides in array peptideprofile = :evolving
  • prot_array.each do
  • sort_peptide_array = [] array.each do
  • prot_score = 0
  • prot_score += p.score
  • sort_peptide_array.each {
  • peptide_array = [peptide_array2].flatten
  • prot_array.fetch_proteins_from("peptidcycler_excluded_proteins2.txt", "nothing") excluded_file = File.new("peptidcycler_excluded_proteins2.txt", "w") #non-family members in back to pool
  • prot_array.each do
  • p.score += 1 if p.seq.include?(peptide) }
  • pep.score = pep.calc_score(family_array)
  • pep.average_position = position_array[position_array.length/2]
  • pep.degeneracy = pep.calc_degeneracy
  • pep.degeneracy_w_inosine = pep.calc_degeneracy_w_inosine
  • n_input_prots = 0
  • classifier_peptides = File.new("new_family_classifying_peptides.txt", "w")
  • info = best_family[1].
  • guf = Peptidegenerator.calc(family_array, peptide_array, rounds)
  • the input consists of a text file (".txt" in Windows format) containing biopolymer sequences in FASTA format.
  • the input file is called "six_families.txt" and contains 105 different protein sequences.
  • classify_family3.rb can be opened in a text editor such as Notepad, SciTE, WordPad, MS Word or another editor to define a number of parameters. The most frequently used parameters are:
  • cut_off The number of selected n-mers that are present in a biopolymer should be larger than this value to include the biopolymer in the group that is defined by the n-mers.
  • pep_length Length of the n-mers.
  • the parameters are:
  • cut_off 9 (A protein should contain at least 10 of the selected peptides to be included in the group).
  • pep_length 6 (The peptides are hexamers (six amino acids long)).
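How the cut_off parameter decides group membership can be sketched as below. The method name `in_group?` and the toy 3-mers are hypothetical and not taken from classify_family3.rb; the strict "larger than" comparison follows the parameter description above.

```ruby
# Hypothetical membership test: a biopolymer joins the group when the number
# of selected n-mers it contains is larger than cut_off.
def in_group?(sequence, selected_nmers, cut_off)
  hits = selected_nmers.count { |nmer| sequence.include?(nmer) }
  hits > cut_off
end

# Toy example with 3-mers and cut_off 2 (at least 3 hits are needed)
selected = %w[ABC BCD CDE DEF]
member = in_group?("ABCDEF", selected, 2)      # contains all four 3-mers
non_member = in_group?("ABCD", selected, 2)    # contains only ABC and BCD
```

With cut_off set to 9, as in the example run, a protein therefore needs at least 10 of the selected hexapeptides to be included.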
  • amino acid amino acid
  • peptide amino acid
  • protein protein or similar referring to peptide and amino acid nomenclature but the program works just as well for biopolymers and n-mers consisting of nucleotide sequences.
  • Position The median position in the selected biopolymers that contain the n-mer.
  • Degeneracy_w_l Same as degeneracy but with nucleotide positions that can include all four bases (A,C,G,T) substituted with an inosine that is not degenerated.
  • Degeneracy and degeneracy_w_l are only relevant when the n-mers are peptides.
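As a rough illustration of what a degeneracy value expresses, the number of nucleotide sequences that can encode a peptide under the standard genetic code is the product of the codon counts of its amino acids. The Ruby sketch below computes only this plain product; it deliberately ignores the refinements applied by the program (dropping a degenerate last nucleotide and substituting inosine), and the names `CODON_COUNTS` and `degeneracy` are hypothetical.

```ruby
# Number of codons per amino acid in the standard genetic code
CODON_COUNTS = {
  "A" => 4, "R" => 6, "N" => 2, "D" => 2, "C" => 2,
  "Q" => 2, "E" => 2, "G" => 4, "H" => 2, "I" => 3,
  "L" => 6, "K" => 2, "M" => 1, "F" => 2, "P" => 4,
  "S" => 6, "T" => 4, "W" => 1, "Y" => 2, "V" => 4
}

# Simplified degeneracy: how many different nucleotide sequences encode the peptide.
# Raises KeyError for characters that are not one of the 20 standard amino acids.
def degeneracy(peptide)
  peptide.each_char.inject(1) { |product, aa| product * CODON_COUNTS.fetch(aa) }
end

plain = degeneracy("HHGPV")         # 2 * 2 * 4 * 4 * 4
with_serine = degeneracy("SHHGPV")  # a leading serine multiplies the value by 6
```

The comparison of the two values shows why residues with six codons, such as serine, leucine or arginine, are expensive to include in a degenerate primer.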
  • group_n_peps.txt is a text file that can be opened as such or opened or imported into MS Excel, OpenOffice Calc or another spreadsheet program.
  • the information written to the logfile is the date of the run, number of input sequences, values for limit, cut_off and pep_length, number of groups generated, and for each group: Group number, activity of the protein used to generate the first set of hexapeptides for generation of the group and number of proteins included in the group.
  • Each protein was assigned a number between ">" and "gi" in the name line of the FASTA formatted sequence. The number can be used for manual tracking of the origin of the protein:
  • EDF-1 Endothelial Differentiation Factor 1
  • SP1 Sp1 transcription factor
  • PLC Protein Kinase C
  • the input file contained animal protein sequences of six different types. Between 11 and 23 protein sequences of each type were included.
  • the number of most frequently occurring peptides necessary to define a group could be as low as 3 and still lead to successful classification.
  • for 10-mer peptides, only a few peptides with a frequency higher than 1 (occurring in more than 1 protein) were generated and the limit lost its relevance. (Testing with a limit of 1 million peptides worked successfully).
  • Table 4 Cross comparison of the hexapeptide signatures for each group (group) of gh61 proteins.
  • the two cysteines that form a cysteine bridge in the crystal structure were found in this region but were only conserved in groups 1, 8 and 11.
  • in group 11 the second of the two cysteines was found in the protein sequences but was located outside the conserved hexapeptides.
  • Region 2 was conserved in all 13 groups and contains a conserved histidine that does not participate in coordination of the nickel atom (Karkehabadi, S. et al., 2008. The first structure of a glycoside hydrolase family 61 member, Cel61B from Hypocrea jecorina, at 1.6 Å resolution. Journal of Molecular Biology, 383(1), 144-154) but is nevertheless located on the nickel-binding surface of gh61 together with two other conserved residues (Q/E49 and Y51) in region 2. Region 3 is outside the reported crystal structure and contains a conserved proline-glycine dipeptide.
  • a library of 92 amino acid sequences that were characterized as ABRE binding factors (ABF) or had high sequence similarity to ABFs (accession numbers
  • the primers should have the smallest possible redundancy and redundant bases at the 3' end are not allowed.
  • CGGAC CGGAC
  • Reverse primers were designed to be reverse complementary to the DNA sequence encoding the hexapeptide and according to the same rules.
  • a protein sequence is defined as found if it contains a sequence corresponding to at least one of the forward primers and a sequence corresponding to one of the reverse primers.
  • Four hexamers (table 5) were chosen for generation of primers based on the chosen criteria such as high frequency, low degeneracy (with inosine incorporation) and generation of an amplicon of an informative length.
  • Table 5 Hexamer peptides chosen for generation of degenerated primers. Position: The median position of the hexapeptide in the proteins that contain the hexapeptide.
  • Frequency Number of proteins that contain the hexapeptide.
  • Degeneracy_w_l Number of nucleotide sequences that will encode the n-mer, last nucleotide not included if this position is degenerate, but with nucleotide positions that can include all four bases (A,C,G,T) substituted with an inosine that is not degenerated.
  • DNA sequence Sequence of the degenerated primer designed according to the design criteria.
  • Primer type Designates whether the primer will be used as forward or reverse primer in PCR.
  • This example demonstrates that the algorithm can be used to find peptide sequences suitable for degenerate primers of a family of plant transcription factors without aligning the input sequences.
  • 25 of the input sequences from Streptococcus pneumoniae were classified into one group with a common set of hexapeptides that could be used for generation of primers. Neither the protein from Salmonella enterica nor the protein from Yersinia pestis was included in the group, and the remaining six sequences from Streptococcus pneumoniae had no significant sequence similarity to the 25 sequences in group 1.
  • Two hexamers (table 6) were chosen for generation of primers based on the chosen criteria such as high frequency, low degeneracy (with inosine incorporation) and generation of an amplicon of an informative length.
  • Position The median position of the hexapeptide in the proteins that contain the hexapeptide.
  • Frequency Number of proteins that contain the hexapeptide.
  • Degeneracy_w_l Number of nucleotide sequences that will encode the n-mer, last nucleotide not included if this position is degenerate, but with nucleotide positions that can include all four bases (A,C,G,T) substituted with an inosine that is not degenerated.
  • DNA sequence Sequence of the degenerated primer designed according to the design criteria.
  • Primer type Designates whether the primer will be used as forward or reverse primer in PCR.
  • nucleotide sequences Performed by searching in nucleotide sequences for shorter nucleotide sequences defined as forward and reverse primers.
  • a nucleotide sequence is defined as found if it contains a sequence corresponding to at least one of the forward primers and a sequence corresponding to one of the reverse primers.
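The "found" criterion for in silico PCR of nucleotide sequences can be sketched in Ruby as below. Several points are assumptions of this sketch and are not stated in the text: primers are given 5' to 3' as synthesized, so the reverse primer is matched through its reverse complement on the template strand; matching is exact, without degenerate bases; and the reverse primer site must lie downstream of the forward primer. The names `reverse_complement` and `found_by_primers?` are hypothetical.

```ruby
# Reverse complementary strand of a DNA sequence
def reverse_complement(dna)
  dna.reverse.tr("ACGT", "TGCA")
end

# Hypothetical check: the template counts as found when at least one forward
# primer matches and the binding site of at least one reverse primer
# (its reverse complement) lies downstream of that forward primer.
def found_by_primers?(template, forward_primers, reverse_primers)
  forward_primers.any? do |fwd|
    start = template.index(fwd)          # first occurrence is sufficient
    next false unless start
    downstream = template[(start + fwd.length)..-1]
    reverse_primers.any? { |rev| downstream.include?(reverse_complement(rev)) }
  end
end

# Invented example: forward primer ACCCG, reverse primer GTAAAC (site GTTTAC)
amplified = found_by_primers?("AAACCCGGGTTTACGT", ["ACCCG"], ["GTAAAC"])
not_amplified = found_by_primers?("GTTTACAAACCCGG", ["ACCCG"], ["GTAAAC"])
```

In the second example both primer sites are present, but the reverse primer site lies upstream of the forward primer, so no amplicon would be expected under the assumptions of this sketch.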
  • Reovirus contains 10 - 12 DNA segments that can be considered as independent viral chromosomes. Thus any structurally meaningful conserved sequence is expected to map to a specific segment whereas conserved sequences
  • sequences from reovirus were classified into the largest group. These sequences (accession numbers: 294854040, 499863, 32479528, 258549709, 18031637, 18031611, 18031609, 18031607, 18031605, 18031597, 18031589, 18031587, 18031581, 18031579, 18031639, 18031563, 18031615, 18031617,
  • Table 7 20-mer nucleotides chosen as primers Position: The median position of the 20-mer nucleotide in the DNA sequence that contains the 20-mer.
  • Frequency sequence that contains the 20-mer.
  • Primer type Designates whether the primer will be used as forward or reverse primer in PCR.
  • the primers were synthesized and HPLC-purified by Sigma-Aldrich (UK/Europe). Peptide Primer name Final primer sequence
  • Table 8 List of degenerated primers.
  • Fungal mycelium was first scraped off the top of a wheat bran agar plate, frozen in N2(l) and ground with a mortar and pestle. DNA was extracted from the homogenized mycelium with the Fungal DNA Mini Kit (Omega Bio-Tek, USA) according to the manufacturer's instructions.
  • PCR was performed using standard conditions and the PCR products were cycle sequenced by Eurofins-MWG (Germany) or StarSEQ (Germany) with one of the degenerated primers used for PCR.
  • the resulting sequences were translated to amino acid sequence and used for BLAST search against the non-redundant protein sequence database at NCBI and inspected for conserved domains (Marchler-Bauer et al., 2009) in the CDD database at NCBI to identify sequences encoding glycohydrolase family 61 -like proteins.
  • the most frequently occurring hexapeptides defining group 1 of gh61s were used for design of degenerated primers (Table 8). As the two most conserved hexapeptides (occurring in 80 and 78 % of the proteins) could be used for design of reverse primers, we did not find it necessary to design a third reverse primer.
  • One of the three hexapeptides used for forward primer design contains one serine residue that is coded by 6 different codons at the N-terminal. A degenerate primer to serine does not contribute significantly to specificity and therefore, the primer was made by reverse translation of the peptide HHGPV. In in silico PCR the three forward and two reverse primers were able to amplify 66 of the 85 proteins in group 1 and no proteins from other groups.
  • the primers were used in all six possible combinations for PCR of DNA from the 14 thermophilic fungi. For all the fungi at least one of the primer sets gave an amplification product of the expected size, and for some fungi all the primer sets gave a positive product. For each fungus, the longest amplicon that had the expected size was sequenced and analyzed for open reading frames. All the amplicons yielded a sequence that encodes a novel, putative GH61 gene fragment. Although the isolated sequences are only partial, it was possible to classify all except one as belonging to group 1. The unassigned sequence from C. senegalense was the shortest of the sequences, only 37 amino acids long, but was up to 73 % identical to known gh61 sequences and 78 % identical to the new sequence from R. thermophila.
  • the PCR result showed that degenerated primers based on the hexapeptide finder algorithm could be used to find new gh61 proteins.
  • a library of 51 amino acid sequences that were annotated as glycosyl-hydrolase family 45 proteins in CAZy (accession numbers 189577959, 62770092, 62770095, 62770085, 520823, 62770091, 6179891, 151303713, 151303715, 38492164, 26516781, 40739414, 222103626, 222103630, 224434578, 224434580, 37732125, 126697302, 4249556, 4249558, 4756863, 39951371, 32526553, 62821724, 62821722, 15384734, 28881412, 194143489, 4210808, 195547039, 56410394, 238033880, 8052314, 170943791, 170936906, 8926975, 27530542, 27530617, 27530615, 116001534, 197267671, 158138919,
  • a library of 34-mers starting at random positions in the input sequences was generated.
  • the coverage was ten-fold, defined as the total number of nucleotides in the library of random 34-mers being ten times the number of nucleotides in the library of input sequences.
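Generating such a library of random 34-mers at a given coverage can be sketched in Ruby as follows. The method name `random_kmer_library` is hypothetical, and picking the source sequence uniformly (rather than in proportion to its length) is an assumption of this sketch that the text does not specify.

```ruby
# Hypothetical generator: draw k-mers starting at random positions until the
# total number of nucleotides in the library is at least coverage times the
# number of nucleotides in the input sequences.
def random_kmer_library(sequences, k, coverage, rng = Random.new)
  total_nt = sequences.inject(0) { |sum, seq| sum + seq.length }
  eligible = sequences.select { |seq| seq.length >= k }
  return [] if eligible.empty?

  target = (coverage * total_nt / k.to_f).ceil   # number of k-mers needed
  Array.new(target) do
    seq = eligible[rng.rand(eligible.length)]    # assumption: uniform choice of sequence
    start = rng.rand(seq.length - k + 1)         # random start position
    seq[start, k]
  end
end

# Invented input: two sequences of 200 nucleotides each
inputs = ["ACGT" * 50, "TTGCA" * 40]
library = random_kmer_library(inputs, 34, 10, Random.new(1))
```

Passing a seeded `Random` object, as in the example, makes a run reproducible.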
  • All selected 34-mers were assembled into longer sequences in all possible combinations if they overlapped by at least 17 nucleotides. These longer sequences were further extended if they overlapped by at least 17 nucleotides with each other or with a 34-mer.
  • E.g. if the letters A, B, C and D represent sequences that are 17 nucleotides long, then a first 34-mer with the sequence AB overlaps with a 34-mer with the sequence BC but not with a 34-mer with the sequence CB.
  • AB and BC will form the contig ABC, which overlaps with CD but not with AC.
  • ABC and CD can be assembled to the contig ABCD.
  • the contigs ABC and BCD can be overlapped to form ABCD. This process was continued until no new nucleotide sequences were generated.
  • the assembled 34-mers were used as template and extended by 34-mers from the pool of random 34-mers in all possible combinations if they overlapped by at least 17 nucleotides. These longer sequences were further extended if they overlapped by at least 17 nucleotides with each other or with a 34-mer from the pool of random 34-mers. E.g., if the letters A, B, C, D and E represent sequences that are 17 nucleotides long, then the contig ABCD can be overlapped with a 34-mer with the sequence DE from the pool of random 34-mers to form the contig ABCDE. This process was continued until no new nucleotide sequences were generated.
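The overlap-and-extend procedure described above can be sketched in Ruby as follows. It is a minimal illustration, not the program used in the example: the names `merge` and `assemble` are hypothetical, overlaps must match exactly, and the `max_rounds` guard is an addition of this sketch to guarantee termination on repetitive input.

```ruby
require "set"

# Join a and b when a suffix of a equals a prefix of b over at least
# min_overlap nucleotides; the longest such overlap is used. Returns nil otherwise.
def merge(a, b, min_overlap)
  max_k = [a.length, b.length].min
  max_k.downto(min_overlap) do |k|
    return a + b[k..-1] if a[-k, k] == b[0, k]
  end
  nil
end

# Combine sequences in all possible ordered pairs, round after round,
# until no new sequence is generated (or max_rounds is reached).
def assemble(reads, min_overlap, max_rounds = 100)
  contigs = Set.new(reads)
  max_rounds.times do
    new_contigs = Set.new
    contigs.each do |a|
      contigs.each do |b|
        merged = merge(a, b, min_overlap)
        new_contigs << merged if merged && !contigs.include?(merged)
      end
    end
    break if new_contigs.empty?
    contigs.merge(new_contigs)
  end
  contigs
end

# The A, B, C, D example: four blocks of 17 nucleotides each
a, b, c, d = "A" * 17, "C" * 17, "G" * 17, "T" * 17
contigs = assemble([a + b, b + c, c + d], 17)
longest = contigs.max_by { |contig| contig.length }
```

In this toy run AB and BC form ABC, ABC and CD form ABCD, and the loop stops in the round where no new contig appears.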
  • the library with the two nucleotide sequences from Melanocarpus albomyces was used to generate a library of random 34-mer sequences.
  • This library can be viewed as a file containing 34 nucleotides sequencing reads with a ten times coverage of the two input sequences and where the gh7 and the gh45 sequences are mixed at random.
  • the second sequence was 263 nucleotides and identical to nucleotides 194-462 of the GH45 from M. albomyces (accession number 27125829) except for a gap at nucleotides 419-423.
  • the last sequence was 126 nucleotides and identical to nucleotides 1-126 of the GH45 from M. albomyces (accession number 27125829).
  • sequences can be used for further extension in silico by other methods or as probes, primers or for other ways of further studies of the sequences and the putative proteins they encode.
  • Each protein subfamily was assigned a function corresponding to the function of the most abundant enzyme type in the subfamily.
  • the proteins were assigned to the subfamily with the highest subfamily-specific frequency score and the function assigned to the subfamily was taken as the function predicted for the protein.
  • the 118 eukaryotic GH5 proteins have very divergent sequences, with an average pairwise identity of 9 % and only 23 % of the pairwise sequence comparisons producing any significant alignment.
  • these proteins were analyzed by PPR with different parameters (peptide length, number of peptides and cut off); each analysis resulted in a number of protein subfamilies that were assigned a function
  • the parameters tested were from peptides of length 3 -10, peptide lists with 30 -200 conserved peptides and cut off from 5 - 40 peptides.
  • the cut off is the number of peptides from a list of conserved peptides that a protein should contain to be part of the subfamily.
  • the other half of the GH5 proteins were assigned a function by:
  • a library of proteins with the same functions can be used to make a library of n-mer peptides and calculate their frequency.
  • This library of n-mers can predict the function of complicated protein families such as GH5 with high accuracy (more than 90 %).
  • Example 11 Peptide Pattern Analysis of many proteins from the GH13 CAZy family and comparison with other methods.
  • This step removes all peptides that occur in only one protein from the subsequent
  • All the seed proteins were ranked according to this number with the seed protein with the highest number first.
  • the 100 highest ranked proteins on this list were used as seed proteins for PPR analysis and the largest group of proteins that came out of the analysis was selected as a subfamily. These proteins were removed from the list of proteins and from the list of seed proteins before repeating the PPR analysis. This step significantly reduced the number of calculations when many proteins were used as input as only 100 seed proteins were used in each round of PPR instead of using all the proteins.
  • GH 13 glycoside hydrolase family 13 proteins
  • Cross comparison of 5195 proteins that were assigned to both a CAZy subfamily ((Stam et al. 2006), www.cazy.org) and a PPR subfamily showed that on average 89 % of the proteins in each CAZy subfamily belonged to one PPR subfamily and vice versa.
  • PPR analysis of 8138 GH13 sequences took 7 hours with a script written in Ruby, a relatively slow programming language, run on a powerful desktop computer (Intel® Core™ i7-2600 CPU @ 3.40 GHz; 16 GB RAM). A PPR analysis of 1691 sequences chosen at random from the 8138 GH13 sequences took less than 25 minutes.
  • The Schizophyllum commune whole-genome nucleotide sequence was divided into fragments 2000 bases long. This was done once starting at base number 1 and once starting at base number 1000, generating two sets of fragments that overlap by 1000 bases. Each fragment was translated in all three reading frames on one strand to generate one set of possible open reading frames (forward reading frames), and in all three reading frames on the other strand to generate the other set of possible open reading frames (reverse reading frames).
  • GH5 proteins: by scoring the eight annotated GH5 proteins against the subfamily-specific peptide lists (see example 9), they were assigned as 2 cellulases (EC 3.2.1.4), 5 glucan 1,3-β-glucosidases (EC 3.2.1.58) and 1 mannan endo-1,4-β-mannosidase (EC 3.2.1.78).
  • Peptide lists generated with Peptide Pattern Recognition can be used for searching genomes and fragments of genomes or other sequences for open reading frames encoding proteins resembling the peptide lists. This is a fast method to find new family members and designate subfamilies and predict the function of the proteins encoded by the nucleotide sequences.
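
The genome search described in the snippets above can be illustrated with a minimal Python sketch. It is not the authors' Ruby script. It assumes that a plain count of matching peptides stands in for the subfamily-specific frequency score, that the standard genetic code applies, and that a sliding window with a 1000-base step is an acceptable equivalent of the two offset fragment sets. All function names, the toy peptide lists and the toy sequence are hypothetical.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def fragments(genome, size=2000, step=1000):
    """Yield (start, fragment) windows of up to `size` bases every `step` bases."""
    for start in range(0, len(genome), step):
        yield start, genome[start:start + size]


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame):
    """Translate one reading frame; stops become '*', unknown codons 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Frames 0-2 are on the given strand, frames 3-5 on the opposite strand."""
    rc = reverse_complement(dna)
    return [translate(strand, f) for strand in (dna, rc) for f in range(3)]


def count_hits(protein, peptides):
    """Number of peptides from the list that occur in the protein."""
    return sum(1 for p in peptides if p in protein)


def assign_subfamily(protein, peptide_lists, cutoff):
    """Return (subfamily, hits) for the best-matching list, or None below the cut-off.

    Assumption: the hit count is a simplified stand-in for the
    subfamily-specific frequency score mentioned in the text.
    """
    scores = {name: count_hits(protein, peps) for name, peps in peptide_lists.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= cutoff else None


def scan_genome(genome, peptide_lists, cutoff):
    """Return (fragment start, frame index, subfamily, hits) for every call."""
    calls = []
    for start, frag in fragments(genome):
        for frame, protein in enumerate(six_frames(frag)):
            call = assign_subfamily(protein, peptide_lists, cutoff)
            if call:
                calls.append((start, frame, *call))
    return calls


# Toy example: two hypothetical subfamily peptide lists and a 12-base sequence.
toy_lists = {"A": ["MKV", "KVL"], "B": ["GGG"]}
print(scan_genome("ATGAAAGTTCTT", toy_lists, cutoff=2))  # [(0, 0, 'A', 2)]
```

Because stop codons are kept as `*` in the translated frames, a peptide from a list cannot match across a stop, so no separate open-reading-frame splitting is needed in this sketch. Real peptide lists would contain 30-200 peptides with cut-offs of 5-40, as in the parameter ranges quoted above.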

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biochemistry (AREA)
  • Library & Information Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for providing a library of n-mer sequences, wherein the library is composed of an n-mer sequence. The invention further relates to a method for providing a library of biopolymer sequences that have one or more n-mers in common. Finally, the invention relates to specific primers and/or probes for, e.g., nucleic acid amplification.
PCT/EP2012/051099 2011-01-26 2012-01-25 Novel method for providing a library of n-mers or biopolymers Ceased WO2012101151A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP11152232.2 2011-01-26
EP11152232 2011-01-26

Publications (1)

Publication Number Publication Date
WO2012101151A1 (fr) 2012-08-02

Family

ID=44148513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/051099 Ceased WO2012101151A1 (fr) Novel method for providing a library of n-mers or biopolymers 2011-01-26 2012-01-25

Country Status (1)

Country Link
WO (1) WO2012101151A1 (fr)

Non-Patent Citations (37)

* Cited by examiner, † Cited by third party
Title
ALMEIDA, J.S.; VINGA, S.: "Biological sequences as pictures: a generic two dimensional solution for iterated maps", BMC BIOINFORMATICS, vol. 10, 2009, pages 100, XP021047241, DOI: doi:10.1186/1471-2105-10-100
ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410
ARAKAWA, K.; OSHITA, K.; TOMITA, M.: "A web server for interactive and zoomable Chaos Game Representation images", SOURCE CODE FOR BIOLOGY AND MEDICINE, vol. 4, 2009, pages 6, XP021060333, DOI: doi:10.1186/1751-0473-4-6
BAILEY, T.L.; ELKAN, C.: "The value of prior knowledge in discovering motifs with MEME", ISMB. INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, vol. 3, 1995, pages 21 - 29, XP009115995
BETTY YEE MAN CHENG ET AL: "Protein Classification based on Text Document Classification techniques", PROTEINS: STRUCTURE, FUNCTION, AND BIOINFORMATICS, vol. 58, 11 January 2005 (2005-01-11), pages 955 - 970, XP055001791, DOI: 10.1002/prot.20373 *
BLAISDELL, B.E.: "A measure of the similarity of sets of sequences not requiring sequence alignment", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 83, no. 14, 1986, pages 5155 - 5159
BUSK, P.K.; PAGES, M.: "Regulation of abscisic acid-induced transcription", PLANT MOLECULAR BIOLOGY, vol. 37, no. 3, 1998, pages 425 - 435, XP002185939, DOI: doi:10.1023/A:1006058700720
CHAKRAVORTY, S.; VIGOREAUX, J.O.: "Amplification of orthologous genes using degenerate primers", METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.), vol. 634, 2010, pages 175 - 185
CHENG, B.Y.M.; CARBONELL, J.G.; KLEIN-SEETHARAMAN, J.: "Protein classification based on text document classification techniques", PROTEINS, vol. 58, no. 4, 2005, pages 955 - 970, XP055001791, DOI: doi:10.1002/prot.20373
CLARKE, A.J.; DRUMMELSMITH, J.; YAGUCHI, M.: "Identification of the catalytic nucleophile in the cellulase from Schizophyllum commune and assignment of the enzyme to Family 5, subtype 5 of the glycosidases", FEBS LETTERS, vol. 414, no. 2, 1997, pages 359 - 361, XP004366347, DOI: doi:10.1016/S0014-5793(97)01049-1
D'AURIA, G.; PUSHKER, R.; RODRIGUEZ-VALERA, F.: "IWoCS: analyzing ribosomal intergenic transcribed spacers configuration and taxonomic relationships", BIOINFORMATICS (OXFORD, ENGLAND), vol. 22, no. 5, 2006, pages 527 - 531, XP055001789, DOI: doi:10.1093/bioinformatics/btk033
DAVIES, M.N. ET AL.: "Alignment-Independent Techniques for Protein Classification", CURRENT PROTEOMICS, vol. 5, 2008, pages 217 - 223
DAYHOFF ET AL., ATLAS OF PROTEIN SEQUENCE AND STRUCTURE, vol. 5, 1978
DESCHAVANNE, P.; TUFFÉRY, P.: "Exploring an alignment free approach for protein classification and structural class prediction", BIOCHIMIE, vol. 90, no. 4, 2008, pages 615 - 625, XP022576203
DEVEREUX ET AL., NUCL. ACID. RES., vol. 12, 1984, pages 387
DO, C.B.; KATOH, K.: "Protein multiple sequence alignment", METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.), vol. 484, 2008, pages 379 - 413
GIUSEPPE D'AURIA ET AL: "IWoCS: analyzing ribosomal intergenic transcribed spacers configuration and taxonomic relationships", BIOINFORMATICS, vol. 22, no. 5, 10 January 2006 (2006-01-10), pages 527 - 531, XP055001789, DOI: 10.1093/bioinformatics/btk033 *
HARRIS, P.V. ET AL.: "Stimulation of lignocellulosic biomass hydrolysis by proteins of glycoside hydrolase family 61: structure and function of a large, enigmatic family", BIOCHEMISTRY, vol. 49, no. 15, 2010, pages 3305 - 3316
HENIKOFF ET AL., PROC. NATL. ACAD. SCI USA, vol. 89, 1992, pages 10915 - 10919
HENRISSAT, B.: "A classification of glycosyl hydrolases based on amino acid sequence similarities", THE BIOCHEMICAL JOURNAL, vol. 280, 1991, pages 309 - 316
INNIS, M.A. ET AL.: "PCR Protocols: A Guide to Methods and Applications", 1990, ACADEMIC PRESS
JABADO, O.J. ET AL.: "Greene SCPrimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments", NUCLEIC ACIDS RESEARCH, vol. 34, no. 22, 2006, pages 6605 - 6611
JEFFREY, H.J.: "Chaos game representation of gene structure", NUCLEIC ACIDS RESEARCH, vol. 18, no. 8, 1990, pages 2163 - 2170
JOHN K VRIES ET AL: "A Sequence Alignment-Independent Method for Protein Classification", APPLIED BIOINFORMATICS, vol. 3, no. 2-3, 1 January 2004 (2004-01-01), pages 137 - 148, XP055001794 *
KARKEHABADI, S. ET AL.: "The first structure of a glycoside hydrolase family 61 member, Cel61B from Hypocrea jecorina, at 1.6 Å resolution", JOURNAL OF MOLECULAR BIOLOGY, vol. 383, no. 1, 2008, pages 144 - 154, XP025433363, DOI: doi:10.1016/j.jmb.2008.08.016
LO LEGGIO, L.; LARSEN, S.: "The 1.62 Å structure of Thermoascus aurantiacus endoglucanase: completing the structural picture of subfamilies in glycoside hydrolase family 5", FEBS LETTERS, vol. 523, no. 1-3, 2002, pages 103 - 108, XP004371155, DOI: doi:10.1016/S0014-5793(02)02954-X
MARCHLER-BAUER, A. ET AL.: "CDD: a Conserved Domain Database for the functional annotation of proteins", NUCLEIC ACIDS RESEARCH, vol. 39, 2011, pages D225 - 229
PRICE, A.; RAMABHADRAN, S.; PEVZNER, P.A.: "Finding subtle motifs by branching from sample strings", BIOINFORMATICS (OXFORD, ENGLAND), vol. 19, no. 2, 2003, pages II149 - 155
SAMBROOK, J.; RUSSELL, D.W.: "Molecular Cloning: A Laboratory Manual", 2001, COLD SPRING HARBOR LABORATORY PRESS
STAM, M.R. ET AL.: "Dividing the large glycoside hydrolase family 13 into subfamilies: towards improved functional annotations of alpha-amylase-related proteins", PROTEIN ENGINEERING, DESIGN & SELECTION: PEDS, vol. 19, no. 12, 2006, pages 555 - 562, XP002444482, DOI: doi:10.1093/protein/gzl044
STROPE, P.K.; MORIYAMA, E.N.: "Simple alignment-free methods for protein classification: a case study from G-protein-coupled receptors", GENOMICS, vol. 89, no. 5, 2007, pages 602 - 612, XP022030055, DOI: doi:10.1016/j.ygeno.2007.01.008
TOMOVIĆ, A.; JANIČIĆ, P.; KEŠELJ, V.: "n-gram-based classification and unsupervised hierarchical clustering of genome sequences", COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, vol. 81, no. 2, 2006, pages 137 - 153, XP024895751, DOI: doi:10.1016/j.cmpb.2005.11.007
TOMOVIC A ET AL: "n-Gram-based classification and unsupervised hierarchical clustering of genome sequences", COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE, ELSEVIER, AMSTERDAM, NL, vol. 81, no. 2, 1 February 2006 (2006-02-01), pages 137 - 153, XP024895751, ISSN: 0169-2607, [retrieved on 20060201], DOI: DOI:10.1016/J.CMPB.2005.11.007 *
VAN PETEGEM, F. ET AL.: "Atomic resolution structure of the major endoglucanase from Thermoascus aurantiacus", BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS, vol. 296, no. 1, 2002, pages 161 - 166, XP005152866, DOI: doi:10.1016/S0006-291X(02)00775-1
VINGA, S.; ALMEIDA, J.: "Alignment-free sequence comparison-a review", BIOINFORMATICS, vol. 19, no. 4, 2003, pages 513 - 523, XP003009542, DOI: doi:10.1093/bioinformatics/btg005
VRIES, J.K. ET AL.: "A sequence alignment-independent method for protein classification", APPLIED BIOINFORMATICS, vol. 3, no. 2-3, 2004, pages 137 - 148

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019140353A1 (fr) * 2018-01-12 2019-07-18 Camena Bioscience Limited Compositions and methods for template-free geometric enzymatic nucleic acid synthesis
US11667941B2 (en) 2018-01-12 2023-06-06 Camena Bioscience Limited Compositions and methods for template-free geometric enzymatic nucleic acid synthesis
US12590324B2 (en) 2019-01-14 2026-03-31 Camena Bioscience Limited Compositions and methods for template-free geometric enzymatic nucleic acid synthesis

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 12701117

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 12701117

Country of ref document: EP

Kind code of ref document: A1