WO2025212384A2 - Methods and compositions for analyzing nucleic acid - Google Patents

Methods and compositions for analyzing nucleic acid

Info

Publication number
WO2025212384A2
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
scaffold
ssna
species
adapters
Prior art date
Legal status
Pending
Application number
PCT/US2025/021867
Other languages
English (en)
Inventor
Camille Rebecca SCHWARTZ
Tobin Escher GROTH
Current Assignee
Claret Bioscience LLC
Original Assignee
Claret Bioscience LLC
Priority date
Filing date
Publication date
Application filed by Claret Bioscience LLC
Publication of WO2025212384A2

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing

Definitions

  • the technology relates in part to methods and compositions for analyzing nucleic acid.
  • the technology relates to methods and compositions for preparing a nucleic acid library from single-stranded nucleic acid fragments and analyzing fragment end sequences.
  • the technology relates to identifying a disease according to a fragment end sequence analysis.
  • the technology relates to identifying a disease according to a k-mer analysis.
  • Genetic information of living organisms e.g., animals, plants and microorganisms
  • other forms of replicating genetic information e.g., viruses
  • nucleic acid i.e., deoxyribonucleic acid (DNA) or ribonucleic acid (RNA)
  • Genetic information is a succession of nucleotides or modified nucleotides representing the primary structure of chemical or hypothetical nucleic acids.
  • a variety of high-throughput sequencing platforms are used for analyzing nucleic acid.
  • the ILLUMINA platform involves clonal amplification of adapter-ligated DNA fragments.
  • Another platform is nanopore-based sequencing, which relies on the transition of nucleic acid molecules or individual nucleotides through a small channel.
  • Library preparation for certain sequencing platforms often includes fragmentation of DNA, modification of fragment ends, and ligation of adapters, and may include amplification of nucleic acid fragments (e.g., PCR amplification).
  • nucleic acid ends may contain useful information. Accordingly, methods that modify nucleic acid ends (e.g., for library preparation) while preserving the information contained in the nucleic acid ends are useful for processing and analyzing nucleic acid.
  • Another aspect of library preparation includes capturing single stranded nucleic acid fragments.
  • single-stranded library preparation methods can generate better and more complex libraries compared to traditional double-stranded DNA (dsDNA) preparation methods.
  • single-stranded library preparation methods can be useful for capturing single-stranded nucleic acid fragments in a mixture of single-stranded DNA fragments and single-stranded RNA fragments.
  • Drawbacks to producing single-stranded DNA (ssDNA) libraries include labor intensive, expensive, and time-consuming protocols, and exotic or custom reagent requirements.
  • nucleic acid e.g., single-stranded nucleic acid, denatured double-stranded nucleic acid, or mixtures containing single-stranded nucleic acid.
  • Described herein are methods for detecting cancer from cell-free nucleic acid fragment end analysis. In particular, described herein are methods for detecting cancer from cell-free nucleic acid according to a fragment end k-mer analysis.
  • methods comprising a) obtaining nucleic acid sequence reads mapped to a reference genome, where the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject; b) generating a k-mer profile for the subject, where the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and c) detecting the presence or absence of cancer in the subject according to the k-mer profile generated in (b).
  • systems comprising one or more microprocessors and memory, which memory comprises instructions executable by the one or more microprocessors and which memory comprises nucleic acid sequence reads mapped to a reference genome, where the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject, and where the instructions executable by the one or more microprocessors are configured to a) generate a k-mer profile for the subject, where the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and b) detect the presence or absence of cancer in the subject according to the k-mer profile generated in (a).
  • machines comprising one or more microprocessors and memory, which memory comprises instructions executable by the one or more microprocessors and which memory comprises nucleic acid sequence reads mapped to a reference genome, where the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject, and where the instructions executable by the one or more microprocessors are configured to a) generate a k-mer profile for the subject, where the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and b) detect the presence or absence of cancer in the subject according to the k-mer profile generated in (a).
  • Fig. 2 shows an example k-mer analysis overview.
  • Fig. 3 shows an example CpG island.
  • Fig. 5A shows 2-mer features that separate prostate cancer plasma samples (university samples) from healthy plasma samples (in-house samples).
  • Fig. 5B shows the samples do not cluster solely by collection center (i.e., in-house samples vs. university samples).
  • Fig. 6A shows 2-mer features that separate prostate cancer plasma samples (university samples and publicly available single-stranded prostate cancer sample data) from healthy plasma samples (in-house samples).
  • Fig. 6B shows publicly available single-stranded prostate cancer sample data (generated from non-SRSLY libraries) cluster together.
  • Fig. 10A and 10B show random forest classifier results for prostate cancer.
  • Figs. 16A and 16B show random forest classifier results (70/30 and LOO) for whole genome sequencing reads vs. CpG filtered reads.
  • Fig. 18 shows median number of reads going into k-mer script for the various genomic subsets.
  • Fig. 19 shows sample cohorts for limit of detection (LOD) analysis.
  • Fig. 21 shows limit of detection (LOD) results.
  • Fig. 22 shows an example workflow for a 3-group classification.
  • Fig. 23 shows 3 group classification results.
  • Fig. 24A shows classification results with different values of K (2-mers, 3-mers, 4-mers) for the full feature set.
  • Fig. 24B shows classification results with different values of K (2-mers, 3-mers, 4-mers) for the top 20 feature set.
  • Fig. 25A shows top 20 2-mer values for healthy and cancer samples.
  • Fig. 25B shows top 20 3-mer values for healthy and cancer samples.
  • Fig. 25C shows top 20 4-mer values for healthy and cancer samples.
  • Figs. 26A and 26B show construction of 3-mers and 4-mers from top 20 2-mers (based on read2 only).
  • Fig. 27 shows constructed k-mers compared to top 20 k-mers.
  • Fig. 28A shows a description of a sample cohort.
  • Fig. 28B shows molecular and analytical workflow for a study.
  • Fig. 30A shows classification performance across multiple values of K.
  • Fig. 30B shows a comparison between the median precision-recall curve for prostate cancer versus myelodysplastic syndrome (MDS).
  • Fig. 31A shows prostate cancer classification performance based on read depth.
  • Fig. 31B shows prostate cancer classification performance is uniform across genomic location with similar depth of coverage.
  • Fig. 32 shows that the cfDNA fragmentomic signal is dispersed across the genome.
  • Fig. 33 shows that the 3' end of cfDNA contains a robust signal for prostate cancer classification.
  • k-mer sequences are analyzed at nucleic acid sequence read ends and/or at nucleic acid sequence read mapping junctions.
  • Certain methods and compositions useful for analyzing nucleic acid fragment ends are described in International PCT Publication No. WO2019/140201, International PCT Publication No. WO2020/206143, and International PCT Publication No. WO2021/262805, each of which is incorporated by reference in its entirety.
  • K-mers generally refer to substrings of length k contained within a biological sequence.
  • k-mers may be composed of nucleotides (e.g., A, T, G, and C), and an example sequence “AGAT” has four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT).
  • a method herein comprises identifying one or more k-mer species.
  • a k-mer species generally refers to a k-mer having a particular sequence and length. The number of possible sequences increases with length of a k-mer.
  • 2-mers may include 4^2 (i.e., 16) possible sequences such as AA, AT, AG, AC, TA, TT, TG, TC, GA, GT, GG, GC, CA, CT, CG, and CC.
  • 3-mers may include 4^3 (i.e., 64) possible sequences
  • 4-mers may include 4^4 (i.e., 256) possible sequences, and so on.
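As an illustration of the definitions above, the following minimal Python sketch lists the k-mers contained in a sequence and enumerates every possible k-mer species over a four-letter DNA alphabet. The function names `kmers_in` and `all_kmer_species` are hypothetical and do not come from the source.

```python
from itertools import product


def kmers_in(sequence, k):
    """Return every substring of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def all_kmer_species(k, alphabet="ATGC"):
    """Enumerate every possible k-mer species over the given alphabet."""
    return ["".join(bases) for bases in product(alphabet, repeat=k)]


# The example sequence "AGAT" from the text, for k = 1 to 4.
for k in (1, 2, 3, 4):
    print(k, kmers_in("AGAT", k), len(all_kmer_species(k)))
```

With four bases, the number of possible species of length k is 4^k, so the enumeration yields 16, 64, and 256 species for k = 2, 3, and 4.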
  • a method herein comprises identifying a plurality of k-mer species.
  • a method herein may comprise identifying about 2 k-mer species, about 3 k-mer species, about 4 k-mer species, about 5 k-mer species, about 6 k-mer species, about 7 k-mer species, about 8 k-mer species, about 9 k-mer species, about 10 k-mer species, about 11 k-mer species, about 12 k-mer species, about 13 k-mer species, about 14 k-mer species, about 15 k-mer species, about 16 k-mer species, about 17 k-mer species, about 18 k-mer species, about 19 k-mer species, about 20 k-mer species, about 21 k-mer species, about 22 k-mer species, about 23 k-mer species, about 24 k-mer species, about 25 k-mer species, about 30 k-mer species, about 35 k-mer species, about 40 k-mer species, about 45 k-mer species,
  • a plurality of k-mer species comprises 2-mers. In some embodiments, a plurality of k-mer species comprises 3-mers. In some embodiments, a plurality of k-mer species comprises 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of 2-mers, 3-mers, and 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of one or more 2-mers and one or more 3-mers. In some embodiments, a plurality of k-mer species comprises a combination of one or more 2-mers and one or more 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of one or more 3-mers and one or more 4-mers.
  • a plurality of k-mer species comprises a combination of one or more 2-mers, one or more 3-mers, and one or more 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of two or more 2-mers and two or more 3-mers. In some embodiments, a plurality of k-mer species comprises a combination of two or more 2-mers and two or more 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of two or more 3-mers and two or more 4-mers. In some embodiments, a plurality of k-mer species comprises a combination of two or more 2-mers, two or more 3-mers, and two or more 4-mers.
  • a method herein comprises identifying one or more k-mer species at one or more sequence read-genome junctions. In some embodiments, a method herein comprises identifying a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions.
  • a sequence read-genome junction refers to the location of an end of a sequence read mapped to a reference genome. On one side of a junction is a mapped sequence read and on the other side of a junction is genome sequence beyond the mapped sequence read.
  • CA sequence read-genome junction
  • AT sequence read-genome junction
  • k-mers at the example sequence read-genome junctions above may be read “CA” or “AC” and “AT” or “TA.”
  • a method herein comprises identifying one or more k-mer species adjacent to one or more sequence read-genome junctions. In some embodiments, a method herein comprises identifying a plurality of k-mer species adjacent to a plurality of sequence read-genome junctions. A k-mer species adjacent to a sequence read-genome junction abuts but does not span a sequence read-genome junction. A k-mer species adjacent to a sequence read-genome junction may be on the sequence read side or the genome side of a junction.
  • k-mer species adjacent to a sequence read-genome junction on the genome side are underlined: CCCCAAAAATTTT, and k-mer species adjacent to a sequence read-genome junction on the sequence read side are underlined: CCCCAAAAATTTT.
  • a method herein comprises identifying one or more k-mer species near one or more sequence read-genome junctions. In some embodiments, a method herein comprises identifying a plurality of k-mer species near a plurality of sequence read-genome junctions. A k-mer species near a sequence read-genome junction may be within a certain distance of nucleotides upstream or downstream of a sequence read-genome junction. In some embodiments, a k-mer species near a sequence read-genome junction may be within three nucleotides upstream or downstream of a sequence read-genome junction.
  • a k-mer species near a sequence read-genome junction may be within four nucleotides upstream or downstream of a sequence read-genome junction. In some embodiments, a k-mer species near a sequence read-genome junction may be within five nucleotides upstream or downstream of a sequence read-genome junction.
  • a k-mer species near a sequence read-genome junction may be on the sequence read side or the genome side of a junction. In the example sequence above, k-mer species near a sequence read-genome junction on the genome side are underlined: CCCCAAAAATTTT, and an example k-mer species near a sequence read-genome junction on the sequence read side is underlined: CCCCAAAAATTTT.
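The distinction between k-mers at a junction and k-mers adjacent to a junction can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the example sequence CCCCAAAAATTTT is treated as reference sequence in which the mapped read is the AAAAA stretch (0-based, half-open coordinates 4 to 9), which is inferred from the "CA" and "AT" junctions named in the text because the original underlining is not available. The helper names are hypothetical.

```python
def spanning_kmers(reference, junction, k):
    """k-mers that cross the boundary between reference[junction - 1]
    and reference[junction], i.e. k-mers located at the junction."""
    starts = range(max(0, junction - k + 1), junction)
    return [reference[s:s + k] for s in starts if s + k <= len(reference)]


def adjacent_kmers(reference, junction, k):
    """k-mers that abut the junction without spanning it.

    Returns (left_of_junction, right_of_junction); an entry is None
    when the reference is too short on that side.
    """
    left = reference[junction - k:junction] if junction >= k else None
    right = (reference[junction:junction + k]
             if junction + k <= len(reference) else None)
    return left, right


reference = "CCCCAAAAATTTT"
read_start, read_end = 4, 9  # assumed read coordinates (AAAAA)

print(spanning_kmers(reference, read_start, 2))  # junction at the read start
print(spanning_kmers(reference, read_end, 2))    # junction at the read end
print(adjacent_kmers(reference, read_start, 2))  # (genome side, read side)
print(adjacent_kmers(reference, read_end, 2))    # (read side, genome side)
```

Which side of a junction is the read and which is the genome depends on whether the junction is at the start or the end of the mapped read, so the caller interprets the tuple returned by `adjacent_kmers` accordingly.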
  • a method herein comprises generating a k-mer profile for a subject.
  • a k-mer profile may contain a plurality of k-mer species (e.g., one or more 2-mer species, one or more 3-mer species, one or more 4-mer species).
  • a k-mer profile contains a map category for each k-mer (e.g., read, genome, junction).
  • a k-mer profile contains a location for each k-mer (e.g., position relative to a sequence read-genome junction).
  • a k-mer profile contains a map category for each k-mer and a location for each k-mer.
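One way to picture a k-mer profile that carries both a map category and a location for each k-mer is the Python sketch below. It is an assumption-laden illustration, not the method of the source: the `KmerRecord` type, the `window` parameter (how many nucleotides on each side of a junction are scanned), the `read_on_right` flag, and the use of relative frequencies are all hypothetical choices.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class KmerRecord:
    kmer: str
    category: str  # "read", "genome", or "junction"
    offset: int    # k-mer start position relative to the junction


def records_near_junction(reference, junction, k, window, read_on_right):
    """Collect k-mers lying within `window` nucleotides of a junction."""
    records = []
    for start in range(junction - window, junction + window - k + 1):
        end = start + k
        if start < 0 or end > len(reference):
            continue  # k-mer would run off the reference
        if end <= junction:
            category = "genome" if read_on_right else "read"
        elif start >= junction:
            category = "read" if read_on_right else "genome"
        else:
            category = "junction"  # k-mer spans the junction
        records.append(KmerRecord(reference[start:end], category,
                                  start - junction))
    return records


def kmer_profile(records):
    """Relative frequency of each (k-mer, category) pair."""
    counts = Counter((r.kmer, r.category) for r in records)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


reference = "CCCCAAAAATTTT"  # read assumed to be AAAAA at [4, 9)
records = (records_near_junction(reference, 4, 2, 3, read_on_right=True)
           + records_near_junction(reference, 9, 2, 3, read_on_right=False))
for record in records:
    print(record)
print(kmer_profile(records))
```

In practice the records would be accumulated over all mapped reads of a sample, and the resulting frequencies would serve as the per-subject features.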
  • a method herein comprises sequencing single-stranded nucleic acid (ssNA) from a test sample from a subject. Certain methods herein comprise combining single stranded nucleic acid (ssNA) with scaffold adapters, or components thereof.
  • Scaffold adapters generally include a scaffold polynucleotide and an oligonucleotide. Accordingly, a “component” of a scaffold adapter may refer to a scaffold polynucleotide and/or an oligonucleotide, or a subcomponent or region thereof.
  • the oligonucleotide and/or the scaffold polynucleotide can be composed of pyrimidine (C, T, U) and/or purine (A, G) nucleotides. Additional components or subcomponents may include one or more of an index polynucleotide, a unique molecular identifier (UMI), one or more regions that flank a unique molecular identifier (UMI), primer binding site (e.g., sequencing primer binding site, P5 primer binding site, P7 primer binding site), flow cell binding region, and the like, and complements thereto. Scaffold adapters comprising a P5 primer binding site may be referred to as P5 adapters or P5 scaffold adapters.
  • a scaffold polynucleotide is a single-stranded component of a scaffold adapter.
  • a polynucleotide herein generally refers to a single-stranded multimer of nucleotides from 5 to 500 nucleotides in length, e.g., 5 to 100 nucleotides.
  • Polynucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are about 5 to 50 nucleotides in length.
  • a scaffold polynucleotide may include an ssNA hybridization region (also referred to as a scaffold, scaffold region, single-stranded scaffold, single-stranded scaffold region, or splint) and an oligonucleotide hybridization region.
  • An ssNA hybridization region and an oligonucleotide hybridization region may be referred to as subcomponents of a scaffold polynucleotide.
  • An ssNA hybridization region typically comprises a polynucleotide that hybridizes, or is capable of hybridizing, to an ssNA terminal region.
  • An oligonucleotide hybridization region typically comprises a polynucleotide that hybridizes, or is capable of hybridizing, to all or a portion of the oligonucleotide component of the scaffold adapter.
  • an ssNA hybridization region comprises a random sequence. In some embodiments, an ssNA hybridization region comprises a sequence complementary to an ssNA terminal region sequence of interest (e.g., targeted sequence). In certain embodiments, an ssNA hybridization region comprises one or more nucleotides that are all capable of non-specific base pairing to bases in the ssNA. Nucleotides capable of non-specific base pairing may be referred to as universal bases. A universal base is a base capable of indiscriminately base pairing with each of the four standard nucleotide bases: A, C, G and T.
  • the ssNA hybridization region is from 4 to 20 nucleotides in length, e.g., from 5 to 15, 5 to 10, 5 to 9, 5 to 8, or 5 to 7 (e.g., 6 or 7) nucleotides in length. In some embodiments, the ssNA hybridization region is 7 nucleotides in length.
  • the ssNA hybridization region comprises or consists of a random nucleotide sequence, such that when a plurality of heterogeneous scaffold polynucleotides having various random ssNA hybridization regions are employed, the collection is capable of acting as scaffold polynucleotides for a heterogeneous population of ssNAs irrespective of the sequences of the terminal regions of the ssNAs.
  • Each scaffold polynucleotide having a unique ssNA hybridization region sequence may be referred to as a scaffold polynucleotide species and a collection of multiple scaffold polynucleotide species may be referred to as a plurality of scaffold polynucleotide species (e.g., for a scaffold polynucleotide designed to have 7 random bases in the ssNA hybridization region, a plurality of scaffold polynucleotide species would include 4^7 unique ssNA hybridization region sequences).
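The size of a fully random pool of ssNA hybridization regions, and the reason such a pool can pair with any ssNA terminal region, can be illustrated with the Python sketch below. It assumes a simplified model in which hybridization means perfect antiparallel Watson-Crick complementarity over the whole region; real hybridization tolerates mismatches and depends on reaction conditions. All names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hybridization_region_species(n_random_bases):
    """Yield every possible random ssNA hybridization region sequence."""
    for bases in product("ACGT", repeat=n_random_bases):
        yield "".join(bases)


def matching_species(ssna_terminal_region):
    """Species in the pool that are perfectly complementary to a terminus."""
    target = reverse_complement(ssna_terminal_region)
    pool = hybridization_region_species(len(ssna_terminal_region))
    return [species for species in pool if species == target]


pool_size = sum(1 for _ in hybridization_region_species(7))
print(pool_size)                    # 4**7 species for 7 random bases
print(matching_species("GATTACA"))  # hypothetical 7-nt ssNA terminal region
```

Under this model every possible 7-nucleotide terminal region has exactly one perfectly complementary species in the pool of 4^7 sequences.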
  • each scaffold adapter having a unique scaffold polynucleotide may be referred to as a scaffold adapter species and a collection of multiple scaffold adapter species may be referred to as a plurality of scaffold adapter species.
  • a species of scaffold polynucleotide generally contains a feature that is unique with respect to other scaffold polynucleotide species.
  • a scaffold polynucleotide species may contain a unique sequence feature.
  • a unique sequence feature may include a unique sequence length, a unique nucleotide sequence (e.g., a unique random sequence, a unique targeted sequence), or a combination of a unique sequence length and nucleotide sequence.
  • Sequencing adapter generally refers to one or more nucleic acid domains that include at least a portion of a nucleotide sequence (or complement thereof) utilized by a sequencing platform of interest, such as a sequencing platform provided by Illumina® (e.g., the HiSeq™, MiSeq™ and/or Genome Analyzer™ sequencing systems); Oxford Nanopore™ Technologies (e.g., the MinION™ sequencing system); Ion Torrent™ (e.g., the Ion PGM™ and/or Ion Proton™ sequencing systems); Pacific Biosciences (e.g., a Sequel or PACBIO RS II sequencing system); Life Technologies™ (e.g., a SOLiD™ sequencing system); Roche (e.g., the 454 GS FLX+ and/or GS Junior sequencing systems); Genapsys; BGI; or any sequencing platform of interest.
  • Illumina® e.g., the HiSeq™, MiSeq™ and/or Genome Analyzer™ sequencing systems
  • Nucleic acid fragments may be combined with scaffold adapters, or components thereof, thereby generating combined products.
  • Combining ssNA fragments with scaffold adapters, or components thereof, may comprise hybridization and/or ligation (e.g., ligation of hybridization products).
  • a combined product may include an ssNA fragment connected to (e.g., hybridized to and/or ligated to) a scaffold adapter, or component thereof, at one or both ends of the ssNA fragment.
  • a combined product may include an ssNA fragment hybridized to a scaffold adapter, or component thereof, at one or both ends of the ssNA fragment, which may be referred to as a hybridization product.
  • a combined product may include an ssNA fragment ligated to a scaffold adapter, or component thereof, at one or both ends of the ssNA fragment, which may be referred to as a ligation product.
  • products from a cleavage step i.e., cleaved products
  • scaffold adapters, or components thereof may be combined with scaffold adapters, or components thereof, thereby generating combined products.
  • Certain methods herein comprise generating sets of combined products (e.g., a first set of combined products and a second set of combined products).
  • a first set of combined products includes ssNAs connected to (e.g., hybridized to and/or ligated to) scaffold adapters, or components thereof, from a first set of scaffold adapters, or components thereof.
  • a second set of combined products includes the first set of combined products connected to (e.g., hybridized to and/or ligated to) scaffold adapters, or components thereof, from a second set of scaffold adapters, or components thereof.
  • a set of combined products includes ssNAs connected to (e.g., hybridized to) scaffold adapters, or components thereof, from a first set of scaffold adapters, or components thereof.
  • the set of combined products further includes ssNAs connected to (e.g., hybridized to) scaffold adapters, or components thereof, from a second set of scaffold adapters, or components thereof.
  • ssNAs may be combined with scaffold adapters, or components thereof, under hybridization conditions, thereby generating hybridization products.
  • the scaffold adapters are provided as pre-hybridized products and the hybridization step includes hybridizing the scaffold adapters to the ssNA.
  • the scaffold adapter components are provided as individual components and the hybridization step includes hybridizing the scaffold adapter components 1) to each other and 2) to the ssNA.
  • the scaffold adapter components i.e., oligonucleotides and scaffold polynucleotides
  • the hybridization step includes 1) hybridizing the scaffold polynucleotides to the ssNA, and then 2) hybridizing the oligonucleotides to the oligonucleotide hybridization region of the scaffold polynucleotides.
  • the conditions during the combining step are those conditions in which scaffold adapters, or components thereof (e.g., single-stranded scaffold regions), specifically hybridize to ssNAs having a terminal region or terminal regions that are complementary in sequence with respect to the single-stranded scaffold regions.
  • the conditions during the combining step also may include those conditions in which components of the scaffold adapters (e.g., oligonucleotides and oligonucleotide hybridization regions within the scaffold polynucleotides), specifically hybridize, or remain hybridized, to each other.
  • Specific hybridization may be affected or influenced by factors such as the degree of complementarity between the single-stranded scaffold regions and the ssNA terminal region(s), or between the oligonucleotides and oligonucleotide hybridization regions, the length thereof, and the temperature at which the hybridization occurs, which may be informed by melting temperatures (Tm) of the single-stranded scaffold regions.
  • Melting temperature generally refers to the temperature at which half of the single-stranded scaffold regions/ssNA terminal regions remain hybridized and half of the single-stranded scaffold regions/ssNA terminal regions dissociate into single strands.
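To give a rough sense of how sequence composition and length influence melting temperature, the Python sketch below applies the Wallace rule of thumb for short oligonucleotides, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). The rule is not taken from the source and is only a coarse estimate; it ignores salt concentration, strand concentration, and mismatches, and the function name is hypothetical.

```python
def wallace_tm_celsius(sequence):
    """Estimate Tm of a short DNA duplex with the Wallace rule.

    Tm = 2 * (number of A and T) + 4 * (number of G and C), in degrees C.
    Intended only as an approximation for short oligonucleotides.
    """
    sequence = sequence.upper()
    at_count = sequence.count("A") + sequence.count("T")
    gc_count = sequence.count("G") + sequence.count("C")
    if at_count + gc_count != len(sequence):
        raise ValueError("sequence must contain only A, C, G, and T")
    return 2 * at_count + 4 * gc_count


# Hypothetical 7-nucleotide hybridization regions of differing GC content.
for region in ("AAAAAAA", "ACGTACG", "GCGCGCG"):
    print(region, wallace_tm_celsius(region))
```

The estimate rises with both length and GC content, which is consistent with the statement that the degree of complementarity, the length, and the hybridization temperature all bear on specific hybridization.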
  • a method herein comprises exposing hybridization products to conditions under which an end of an ssNA is joined to an end of a scaffold adapter to which it is hybridized.
  • a method herein may comprise exposing hybridization products to conditions under which an end of an ssNA is joined to an end of an oligonucleotide component of a scaffold adapter to which it is hybridized. Joining may be achieved by any suitable approach that permits covalent attachment of ssNA to the scaffold adapter and/or oligonucleotide component of a scaffold adapter to which it is hybridized.
  • When one end of an ssNA is joined to an end of a scaffold adapter and/or oligonucleotide component of a scaffold adapter to which it is hybridized, typically one of two attachment events is conducted: 1) the 3’ end of the ssNA to the 5’ end of the oligonucleotide component of the scaffold adapter, or 2) the 5’ end of the ssNA to the 3’ end of the oligonucleotide component of the scaffold adapter.
  • When both ends of an ssNA are each joined to an end of a scaffold adapter and/or oligonucleotide component of a scaffold adapter to which it is hybridized, typically two attachment events are conducted: 1) the 3’ end of the ssNA to the 5’ end of the oligonucleotide component of a first scaffold adapter, and 2) the 5’ end of the ssNA to the 3’ end of the oligonucleotide component of a second scaffold adapter.
  • a method herein comprises contacting hybridization products with an agent comprising a ligase activity under conditions in which an end of an ssNA is covalently linked to an end of a scaffold adapter and/or oligonucleotide component of a scaffold adapter to which the target nucleic acid (ssNA) is hybridized.
  • Ligase activity may include, for example, blunt-end ligase activity, nick-sealing ligase activity, sticky end ligase activity, circularization ligase activity, cohesive end ligase activity, DNA ligase activity, RNA ligase activity, single-stranded ligase activity, and double-stranded ligase activity.
  • Ligase activity may include ligating a 5’ phosphorylated end of one polynucleotide to a 3’ OH end of another polynucleotide (5’P to 3’OH).
  • Ligase activity may include ligating a 3’ phosphorylated end of one polynucleotide to a 5’ OH end of another polynucleotide (3’P to 5’OH).
  • Ligase activity may include ligating a 5’ end of an ssNA to a 3’ end of a scaffold adapter and/or oligonucleotide component of a scaffold adapter hybridized thereto in a ligation reaction.
  • Ligase activity may include ligating a 3’ end of an ssNA to a 5’ end of a scaffold adapter and/or oligonucleotide component of a scaffold adapter hybridized thereto in a ligation reaction.
  • Suitable reagents e.g., ligases
  • kits for performing ligation reactions are known and available.
  • Instant Sticky-end Ligase Master Mix available from New England Biolabs (Ipswich, MA) may be used.
  • Ligases that may be used include but are not limited to, for example, T3 ligase, T4 DNA ligase (e.g., at low or high concentration), T7 DNA Ligase, E.
  • RNA ligases T4 RNA ligase 1, T4 RNA ligase 2, truncated T4 RNA ligase 2, thermostable 5' App DNA/RNA ligase, SplintR® Ligase, RtcB ligase, Taq ligase, and the like and combinations thereof.
  • a phosphate group may be added at the 5’ end of the oligonucleotide component or ssNA fragment using a suitable kinase, for example, such as T4 polynucleotide kinase (PNK).
  • PNK polynucleotide kinase
  • Such kinases and guidance for using such kinases to phosphorylate 5’ ends are available, for example, from New England BioLabs, Inc. (Ipswich, MA).
  • a method comprises covalently linking the adjacent ends of an oligonucleotide component and an ssNA terminal region, thereby generating covalently linked hybridization products.
  • the covalently linking comprises contacting the hybridization products (e.g., ssNA fragments hybridized to at least one scaffold adapter herein) with an agent comprising a ligase activity under conditions in which the end of an ssNA terminal region is covalently linked to an end of the oligonucleotide component.
  • a method comprises covalently linking the adjacent ends of a first oligonucleotide component and a first ssNA terminal region, and covalently linking the adjacent ends of a second oligonucleotide component and a second ssNA terminal region, thereby generating covalently linked hybridization products.
  • the covalently linking comprises contacting hybridization products (e.g., ssNA fragments each hybridized to two scaffold adapters herein) with an agent comprising a ligase activity under conditions in which an end of a first ssNA terminal region is covalently linked to an end of a first oligonucleotide component and an end of a second ssNA terminal region is covalently linked to an end of a second oligonucleotide component.
  • the agent comprising a ligase activity is a T4 DNA ligase.
  • the T4 DNA ligase is used at an amount between about 1 unit/µl to about 50 units/µl.
  • the T4 DNA ligase is used at an amount between about 5 units/µl to about 30 units/µl. In some embodiments, the T4 DNA ligase is used at an amount between about 5 units/µl to about 15 units/µl. In some embodiments, the T4 DNA ligase is used at about 10 units/µl. In some embodiments, the T4 DNA ligase is used at an amount less than 25 units/µl. In some embodiments, the T4 DNA ligase is used at an amount less than 20 units/µl. In some embodiments, the T4 DNA ligase is used at an amount less than 15 units/µl. In some embodiments, the T4 DNA ligase is used at an amount less than 10 units/µl.
  • a method comprises sequentially covalently linking the adjacent ends of a first oligonucleotide component to a first ssNA terminal region, and the adjacent ends of a second oligonucleotide component to a second ssNA terminal region.
  • the covalently linking comprises contacting a set of hybridization products with one or more agents comprising a ligase activity.
  • the second oligonucleotide is 3’ phosphorylated, which blocks the second oligonucleotide from ligating to the second ssNA terminal region during the initial ligation reaction.
  • the ligation blocking mechanism is removed, allowing the second oligonucleotide to ligate to the second ssNA terminal region.
  • a second set of ligation products is formed where the adjacent ends of the second oligonucleotide and the second ssNA terminal region are ligated.
  • adding 9 g of PEG 8000 in a 50 ml SPRI bead solution may be referred to as “18% SPRI.”
  • adding 19 g of PEG 8000 in a 50 ml SPRI solution may be referred to as “38% SPRI.”
  • the higher the proportion of PEG, the smaller the DNA fragments retained.
  • a purifying or washing step may enrich for nucleic acid fragments, or amplification products thereof, having a particular length or range of lengths.
  • an SPRI purification may enrich for nucleic acid fragments, or amplification products thereof, having a particular length or range of lengths.
  • the amount of PEG 8000 in an SPRI bead solution used in an SPRI purification may affect the length or range of lengths of fragments that are enriched. For example, an SPRI purification at 1.5x v/v ratio may recover more fragments in the ⁇ 100 base range than an SPRI purification at 1.2x because the final concentration of PEG 8000 is higher in 1.5x than in 1.2x.
  • a method herein may be performed in a suitable reaction volume and/or with a suitable amount of ssNA and/or suitable ratio of ssNA to scaffold adapters (or components thereof).
  • a suitable reaction volume and/or a suitable amount of ssNA and/or a suitable ratio of ssNA to scaffold adapters (or components thereof) may include reaction volumes, amounts of ssNA, and/or ratios of ssNA and scaffold adapters that reduce or prevent adapter dimer formation.
  • a suitable amount of ssNA may range from about 250 pg to about 5 ng of ssNA.
  • 1 ng ssNA may be combined with between about 1.0 and 2.0 picomoles of each scaffold adapter (i.e., about 1.0 to 2.0 picomoles of scaffold adapters (pool of scaffold adapters that contains a plurality of scaffold adapter species) that hybridize to the 5’ end of ssNA terminal regions, and about 1.0 to 2.0 picomoles of scaffold adapters (pool of scaffold adapters that contains a plurality of scaffold adapter species) that hybridize to the 3’ end of ssNA terminal regions).
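To put the amounts above in molar terms, the following is a minimal sketch that converts a mass of ssNA into picomoles of molecules and estimates the molar excess of adapter over ssNA. The average nucleotide mass of 330 g/mol and the 160-nucleotide mean fragment length are assumptions chosen for illustration, not values from the disclosure, and the function names are hypothetical.

```python
# Assumption: average mass of one ssDNA nucleotide, a common rule-of-thumb value.
AVG_NT_MASS_G_PER_MOL = 330.0


def ssna_picomoles(mass_ng: float, mean_length_nt: float) -> float:
    """Approximate picomoles of ssNA molecules for a given mass and mean length."""
    grams = mass_ng * 1e-9
    moles = grams / (mean_length_nt * AVG_NT_MASS_G_PER_MOL)
    return moles * 1e12


def adapter_excess(adapter_pmol: float, mass_ng: float, mean_length_nt: float) -> float:
    """Molar ratio of one scaffold adapter pool to the ssNA molecules."""
    return adapter_pmol / ssna_picomoles(mass_ng, mean_length_nt)


if __name__ == "__main__":
    # 1 ng of ssNA with an assumed mean length of 160 nt, and 1.0 to 2.0 pmol of adapter.
    for adapter_pmol in (1.0, 2.0):
        ratio = adapter_excess(adapter_pmol, mass_ng=1.0, mean_length_nt=160)
        print(f"{adapter_pmol} pmol adapter: roughly {ratio:.0f}-fold molar excess")
```

Under these assumptions, shorter fragments give more molecules per nanogram and therefore a lower adapter excess for the same adapter amount.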
  • an SPRI bead solution may be added to a sample solution, often with instructions for a v/v ratio.
  • 1.2x 18% SPRI means that, given a 50 µl sample, 60 µl (50 x 1.2) of 18% SPRI beads are added.
  • This v/v ratio leads to a final PEG concentration of 9.8%, assuming there is no PEG in the sample solution.
  • there is an existing amount of PEG present in the sample solution (i.e., ligation products). Accordingly, a user may adjust the volume of added SPRI beads to reach the desired final concentration of PEG.
  • a desired final concentration of PEG may range from about 5% final PEG to about 15% final PEG.
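The PEG arithmetic described for SPRI purification can be sketched as follows. This is an illustration only: it assumes volumes are additive and that PEG percentages are w/v, and the function names are hypothetical rather than taken from the disclosure.

```python
def spri_percent(peg_g: float, solution_ml: float) -> float:
    """PEG 8000 percentage (w/v) of an SPRI bead solution, e.g. 9 g in 50 ml is 18%."""
    return 100.0 * peg_g / solution_ml


def final_peg_percent(sample_ul: float, bead_ratio: float,
                      bead_peg_percent: float, sample_peg_percent: float = 0.0) -> float:
    """Final PEG percentage after adding beads at a given v/v ratio to a sample."""
    bead_ul = sample_ul * bead_ratio
    total_peg = bead_ul * bead_peg_percent + sample_ul * sample_peg_percent
    return total_peg / (bead_ul + sample_ul)


def bead_volume_for_target(sample_ul: float, target_percent: float,
                           bead_peg_percent: float, sample_peg_percent: float = 0.0) -> float:
    """Bead volume (in µl) needed to reach a target final PEG percentage."""
    if not sample_peg_percent <= target_percent < bead_peg_percent:
        raise ValueError("target must lie between the sample and bead PEG percentages")
    return sample_ul * (target_percent - sample_peg_percent) / (bead_peg_percent - target_percent)


if __name__ == "__main__":
    print(spri_percent(9, 50))                           # 18% SPRI
    print(round(final_peg_percent(50, 1.2, 18.0), 1))    # 1.2x 18% SPRI, no PEG in the sample
    print(round(final_peg_percent(50, 1.5, 18.0), 1))    # 1.5x gives a higher final PEG
    # If the sample already contains PEG (here an assumed 5%), fewer beads are needed.
    print(round(bead_volume_for_target(50, 9.8, 18.0, sample_peg_percent=5.0), 1))
```

The last call shows the adjustment described above: when the ligation products already carry PEG, the bead volume is reduced to land on the same final concentration.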
  • Y-scaffold adapters may comprise a plurality of nucleic acid components and subcomponents.
  • Y-scaffold adapters comprise a first nucleic acid strand and a second nucleic acid strand.
  • a first nucleic acid strand is complementary to a second nucleic acid strand.
  • a portion of a first nucleic acid strand is complementary to a portion of a second nucleic acid strand.
  • a scaffold adapter comprises one strand capable of forming a hairpin structure having a single-stranded loop. In some embodiments, a scaffold adapter consists of one strand capable of forming a hairpin structure having a single-stranded loop.
  • a scaffold adapter having a hairpin structure generally comprises a double-stranded “stem” region and a single stranded “loop” region.
  • a scaffold adapter comprises one strand (i.e., one continuous strand) capable of adopting a hairpin structure. In some embodiments, a scaffold adapter consists essentially of one strand (i.e., one continuous strand) capable of adopting a hairpin structure.
  • Hairpin scaffold adapters may comprise a plurality of nucleic acid components and subcomponents within the one strand.
  • a hairpin scaffold adapter comprises an oligonucleotide and a scaffold polynucleotide.
  • the oligonucleotide is complementary to an oligonucleotide hybridization region in the scaffold polynucleotide.
  • a portion of the oligonucleotide is complementary to a portion of the oligonucleotide hybridization region in the scaffold polynucleotide.
  • a hairpin scaffold adapter comprises a complementary region and a non-complementary region.
  • the complementary region often forms the stem of the hairpin adapter and the non-complementary region often forms the loop, or part thereof, of the hairpin scaffold adapter.
  • the oligonucleotide and the scaffold polynucleotide may comprise subcomponents (e.g., subcomponents of scaffold polynucleotides, subcomponents of oligonucleotides, and subcomponents of sequencing adapters described herein, such as, for example, UMIs, UMI flanking regions, amplification priming sites and/or specific sequencing adapters (e.g., P5, P7 adapters)).
  • the oligonucleotide and the scaffold polynucleotide do not comprise certain subcomponents of sequencing adapters described herein, such as, for example, amplification priming sites and specific sequencing adapters (e.g., P5, P7 adapters).
  • Hairpin scaffold adapters may comprise one or more cleavage sites capable of being cleaved under cleavage conditions.
  • a cleavage site is located between an oligonucleotide and a scaffold polynucleotide. Cleavage at a cleavage site often generates two separate strands from the hairpin scaffold adapter.
  • cleavage at a cleavage site generates a partially double stranded scaffold adapter with two unpaired strands forming a “Y” structure.
  • Cleavage sites may include any suitable cleavage site, such as cleavage sites described herein, for example.
  • a hairpin scaffold adapter comprises a single-stranded scaffold region (ssNA hybridization region).
  • the single-stranded scaffold region of a hairpin scaffold adapter typically is located adjacent to the double-stranded stem portion and at the opposite end of the loop portion.
  • the single-stranded scaffold region of a hairpin scaffold adapter typically is complementary to a terminal region of a target nucleic acid (e.g., a terminal region of a single-stranded nucleic acid).
  • a hairpin scaffold adapter comprises in a 5' to 3' orientation: an oligonucleotide, one or more cleavage sites, and a scaffold polynucleotide comprising an oligonucleotide hybridization region and a scaffold region (ssNA hybridization region).
  • a hairpin oligonucleotide comprises in a 5' to 3' orientation: a scaffold polynucleotide comprising a scaffold region (ssNA hybridization region) and an oligonucleotide hybridization region, one or more cleavage sites, and an oligonucleotide.
  • a plurality or pool of hairpin scaffold adapter species comprises a mixture of: 1) hairpin scaffold adapters comprising in a 5' to 3' orientation: an oligonucleotide, one or more cleavage sites, and a scaffold polynucleotide comprising an oligonucleotide hybridization region and a scaffold region (ssNA hybridization region); and 2) hairpin scaffold adapters comprising in a 5' to 3' orientation: a scaffold polynucleotide comprising a scaffold region (ssNA hybridization region) and an oligonucleotide hybridization region, one or more cleavage sites, and an oligonucleotide.
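The two hairpin orientations described above can be illustrated with a small sketch that assembles a single strand from its components. All sequences are hypothetical placeholders: the loop is modeled as a short stretch containing a "U" that stands in for a cleavage site, the scaffold region is modeled as a run of N bases, and cleavage is simplified to removing the marked base. None of these choices is taken from the disclosure.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def hairpin_oligo_first(oligo: str, loop: str, scaffold_region: str) -> str:
    """5'-oligonucleotide, loop with cleavage site, oligonucleotide hybridization
    region, scaffold region-3'."""
    return oligo + loop + reverse_complement(oligo) + scaffold_region


def hairpin_scaffold_first(oligo: str, loop: str, scaffold_region: str) -> str:
    """5'-scaffold region, oligonucleotide hybridization region, loop with
    cleavage site, oligonucleotide-3'."""
    return scaffold_region + reverse_complement(oligo) + loop + oligo


def cleave(hairpin: str, site: str = "U") -> tuple[str, str]:
    """Simplified cleavage: drop the marked base and return the two resulting strands."""
    left, _, right = hairpin.partition(site)
    return left, right


if __name__ == "__main__":
    oligo, loop, scaffold = "AGATCGGAAG", "TTUTT", "NNNNNNN"  # hypothetical sequences
    hp = hairpin_oligo_first(oligo, loop, scaffold)
    print(hp)
    print(cleave(hp))
    print(hairpin_scaffold_first(oligo, loop, scaffold))
```

In both orientations the oligonucleotide and its hybridization region form the stem, and the scaffold region is left single-stranded at the end opposite the loop.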
  • a nucleic acid comprises one or more modified nucleotides.
  • a DNA molecule comprises one or more modified nucleotides.
  • an unligated DNA molecule (e.g., not ligated to one or more scaffold adapters described herein) comprises one or more modified nucleotides.
  • a scaffold adapter, or component thereof, comprises one or more modified nucleotides.
  • an unligated scaffold adapter, or component thereof, (e.g., not ligated to a DNA molecule) comprises one or more modified nucleotides.
  • Modified nucleotides may be referred to as modified bases or non-canonical bases and may include, for example, nucleotides conjugated to a member of a binding pair, blocked nucleotides, non-natural nucleotides, nucleotide analogues, peptide nucleic acid (PNA) nucleotides, Morpholino nucleotides, locked nucleic acid (LNA) nucleotides, bridged nucleic acid (BNA) nucleotides, glycol nucleic acid (GNA) nucleotides, threose nucleic acid (TNA) nucleotides, and the like and combinations thereof.
  • a scaffold adapter, or component thereof, comprises one or more nucleotides with modifications chosen from one or more of amino modifier, biotinylation, thiol, alkynes, 2’-O-methoxy-ethyl bases (2’-MOE), RNA, fluoro bases, iso (iso-dG, iso-dC), inverted, methyl, nitro, phos, and the like.
  • a scaffold adapter, or component thereof, comprises one or more modified nucleotides within a duplex region, within a scaffold region, at one end, or at both ends of the scaffold adapter, or component thereof.
  • a scaffold adapter, or component thereof, comprises one or more unpaired modified nucleotides.
  • a scaffold adapter, or component thereof, comprises one or more unpaired modified nucleotides at one end of the adapter. In some embodiments, a scaffold adapter, or component thereof, comprises one or more unpaired modified nucleotides at the end of the adapter opposite to the end that hybridizes to a target nucleic acid (e.g., an end comprising a single-stranded scaffold region).
  • a modified nucleotide may be present at the end of the strand having a 3’ terminus or at the end of the strand having a 5’ terminus.
  • a nucleic acid molecule comprises one or more modified nucleotides.
  • the one or more modified nucleotides are capable of blocking covalent linkage of the nucleic acid molecule to another nucleic acid molecule, oligonucleotide, polynucleotide, or scaffold adapter.
  • a nucleic acid molecule comprises one or more modified nucleotides at one or more unligated ends.
  • an oligonucleotide component comprises one or more modified nucleotides.
  • the one or more modified nucleotides are capable of blocking covalent linkage of the oligonucleotide component to another oligonucleotide, polynucleotide, or nucleic acid molecule.
  • an oligonucleotide component comprises one or more modified nucleotides at an end not adjacent to the ssNA.
  • a scaffold polynucleotide comprises one or more modified nucleotides.
  • the one or more modified nucleotides are capable of blocking covalent linkage of the scaffold polynucleotide to another oligonucleotide, polynucleotide, or nucleic acid molecule.
  • a scaffold polynucleotide may comprise the one or more modified nucleotides at one or both ends of the polynucleotide.
  • the one or more modified nucleotides comprise a ligation-blocking modification.
  • a nucleic acid molecule comprises one or more blocked nucleotides.
  • a DNA molecule may comprise one or more blocked nucleotides at one or more unligated ends.
  • a scaffold adapter, or component thereof, comprises one or more blocked nucleotides.
  • a scaffold adapter, or component thereof, may comprise one or more modified nucleotides that are capable of blocking hybridization to a nucleotide in another scaffold adapter, or component thereof. In some instances, the one or more modified nucleotides are capable of blocking ligation to a nucleotide in another scaffold adapter, or component thereof.
  • a scaffold adapter may comprise one or more modified nucleotides that are capable of blocking hybridization to a nucleotide in a target nucleic acid (e.g., ssNA).
  • the one or more modified nucleotides are capable of blocking ligation to a nucleotide in a target nucleic acid.
  • one or both ends of a scaffold polynucleotide include a blocking modification and/or the end of an oligonucleotide component not adjacent to an ssNA fragment may include a blocking modification.
  • a blocking modification refers to a modified end that cannot be linked to the end of another nucleic acid component using an approach employed to covalently link the adjacent ends of an oligonucleotide component and an ssNA fragment.
  • the blocking modification is a ligation-blocking modification. Examples of blocking modifications which may be included at one or both ends of a nucleic acid molecule, a scaffold polynucleotide and/or the end of an oligonucleotide component not adjacent to the ssNA, include the absence of a 3’ OH, and an inaccessible 3’ OH.
  • a sample can be a liquid sample.
  • a liquid sample can comprise extracellular nucleic acid (e.g., circulating cell-free DNA).
  • liquid samples include, but are not limited to, blood or a blood product (e.g., serum, plasma, or the like), urine, cerebral spinal fluid, saliva, sputum, biopsy sample (e.g., liquid biopsy for the detection of cancer), a liquid sample described above, the like or combinations thereof.
  • a sample is a liquid biopsy, which generally refers to an assessment of a liquid sample from a subject for the presence, absence, progression or remission of a disease (e.g., cancer).
  • a liquid biopsy can be used in conjunction with, or as an alternative to, a solid biopsy (e.g., tumor biopsy).
  • extracellular nucleic acid is analyzed in a liquid biopsy.
  • a biological sample may be blood, plasma or serum.
  • blood encompasses whole blood, blood product or any fraction of blood, such as serum, plasma, buffy coat, or the like as conventionally defined. Blood or fractions thereof often comprise nucleosomes. Nucleosomes comprise nucleic acids and are sometimes cell-free or intracellular. Blood also comprises buffy coats. Buffy coats are sometimes isolated by utilizing a ficoll gradient. Buffy coats can comprise white blood cells (e.g., leukocytes, T-cells, B-cells, platelets, and the like). Blood plasma refers to the fraction of whole blood resulting from centrifugation of blood treated with anticoagulants.
  • Blood serum refers to the watery portion of fluid remaining after a blood sample has coagulated. Fluid or tissue samples often are collected in accordance with standard protocols hospitals or clinics generally follow. For blood, an appropriate amount of peripheral blood (e.g., between 3 to 40 milliliters, between 5 to 50 milliliters) often is collected and can be stored according to standard procedures prior to or after preparation.
  • An analysis of nucleic acid found in a subject’s blood may be performed using, e.g., whole blood, serum, or plasma.
  • An analysis of tumor or cancer DNA found in a patient’s blood may be performed using, e.g., whole blood, serum, or plasma.
  • An analysis of pathogen DNA found in a patient’s blood may be performed using, e.g., whole blood, serum, or plasma.
  • An analysis of transplant DNA found in a transplant recipient’s blood, for example, may be performed using, e.g., whole blood, serum, or plasma.
  • a subject’s blood (e.g., a patient’s blood, a cancer patient's blood, a pregnant woman's blood) may be collected in a tube containing EDTA or a specialized commercial product such as Cell-Free DNA BCT (Streck, Omaha, NE) or Vacutainer SST (Becton Dickinson, Franklin Lakes, N.J.) to prevent blood clotting, and plasma can then be obtained from whole blood through centrifugation. Serum may be obtained with or without centrifugation following blood clotting. If centrifugation is used then it is typically, though not exclusively, conducted at an appropriate speed, e.g., 1,500-3,000 times g.
  • Plasma or serum may be subjected to additional centrifugation steps before being transferred to a fresh tube for nucleic acid extraction.
  • nucleic acid may also be recovered from the cellular fraction, enriched in the buffy coat portion, which can be obtained following centrifugation of a whole blood sample from the subject and removal of the plasma.
  • a sample may be a tumor nucleic acid sample (i.e., a nucleic acid sample isolated from a tumor).
  • tumor generally refers to neoplastic cell growth and proliferation, whether malignant or benign, and may include pre-cancerous and cancerous cells and tissues.
  • cancer and “cancerous” generally refer to the physiological condition in mammals that is typically characterized by unregulated cell growth/proliferation.
  • cancer examples include, but are not limited to, carcinoma, lymphoma, blastoma, sarcoma, leukemia, squamous cell cancer, small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the lung, squamous carcinoma of the lung, cancer of the peritoneum, hepatocellular cancer, gastrointestinal cancer, pancreatic cancer, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine carcinoma, salivary gland carcinoma, kidney cancer, liver cancer, prostate cancer, vulval cancer, thyroid cancer, hepatic carcinoma, various types of head and neck cancer, and the like.
  • a sample may be heterogeneous.
  • a sample may include more than one cell type and/or one or more nucleic acid species.
  • a sample may include (i) fetal cells and maternal cells, (ii) cancer cells and non-cancer cells, and/or (iii) pathogenic cells and host cells.
  • a sample may include (i) cancer and non-cancer nucleic acid, (ii) pathogen and host nucleic acid, (iii) fetal derived and maternal derived nucleic acid, and/or more generally, (iv) mutated and wild-type nucleic acid.
  • a sample may include a minority nucleic acid species and a majority nucleic acid species, as described in further detail below.
  • a sample may include cells and/or nucleic acid from a single subject or may include cells and/or nucleic acid from multiple subjects.
  • nucleic acid(s), nucleic acid molecule(s), nucleic acid fragment(s), target nucleic acid(s), nucleic acid template(s), template nucleic acid(s), nucleic acid target(s), polynucleotide(s), polynucleotide fragment(s), target polynucleotide(s), polynucleotide target(s), and the like may be used interchangeably throughout the disclosure.
  • RNA e.g., message RNA (mRNA), short inhibitory RNA (siRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), microRNA, transacting small interfering RNA (ta-siRNA), natural small interfering RNA (nat-siRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), long non-coding RNA (lncRNA), non-coding RNA (ncRNA), transfer-messenger RNA (tmRNA), precursor messenger RNA (pre-mRNA), small Cajal body-specific RNA (scaRNA), piwi-interacting RNA (piRNA), endoribonucleas
  • a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated.
  • degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues.
  • nucleic acid is used interchangeably with locus, gene, cDNA, and mRNA encoded by a gene.
  • the term also may include, as equivalents, derivatives, variants and analogs of RNA or DNA synthesized from nucleotide analogs, single-stranded (“sense” or “antisense,” “plus” strand or “minus” strand, “forward” reading frame or “reverse” reading frame) and double-stranded polynucleotides.
  • a nucleotide or base generally refers to the purine and pyrimidine molecular units of nucleic acid (e.g., adenine (A), thymine (T), guanine (G), and cytosine (C)).
  • Target nucleic acids may be any nucleic acids of interest.
  • Nucleic acids may be polymers of any length composed of deoxyribonucleotides (i.e., DNA bases), ribonucleotides (i.e., RNA bases), or combinations thereof, e.g., 10 bases or longer, 20 bases or longer, 50 bases or longer, 100 bases or longer, 200 bases or longer, 300 bases or longer, 400 bases or longer, 500 bases or longer, 1000 bases or longer, 2000 bases or longer, 3000 bases or longer, 4000 bases or longer, 5000 bases or longer.
  • nucleic acids are polymers composed of deoxyribonucleotides (i.e., DNA bases), ribonucleotides (i.e., RNA bases), or combinations thereof, e.g., 10 bases or less, 20 bases or less, 50 bases or less, 100 bases or less, 200 bases or less, 300 bases or less, 400 bases or less, 500 bases or less, 1000 bases or less, 2000 bases or less, 3000 bases or less, 4000 bases or less, or 5000 bases or less.
  • Nucleic acid may be single or double stranded.
  • Single-stranded DNA, for example, can be generated by denaturing double-stranded DNA by heating or by treatment with alkali.
  • ssDNA is derived from double-stranded DNA (dsDNA).
  • a method herein comprises, prior to combining a nucleic acid composition comprising dsDNA with the scaffold adapters herein, or components thereof, denaturing the dsDNA, thereby generating ssDNA.
  • nucleic acid is in a D-loop structure, formed by strand invasion of a duplex DNA molecule by an oligonucleotide or a DNA-like molecule such as peptide nucleic acid (PNA).
  • D-loop formation can be facilitated by addition of E. coli RecA protein and/or by alteration of salt concentration, for example, using methods known in the art.
  • Nucleic acid (e.g., nucleic acid targets, single-stranded nucleic acid (ssNA), oligonucleotides, overhangs, scaffold polynucleotides and hybridization regions thereof (e.g., ssNA hybridization region, oligonucleotide hybridization region)) may be described herein as being complementary to another nucleic acid, having a complementarity region, being capable of hybridizing to another nucleic acid, or having a hybridization region.
  • complementary or complementarity or “hybridization” generally refer to a nucleotide sequence that base-pairs by non-covalent bonds to a region of a nucleic acid (e.g., the nucleotide sequence of an ssNA hybridization region that hybridizes to the terminal region of an ssNA fragment, and the nucleotide sequence of an oligonucleotide hybridization region that hybridizes to an oligonucleotide component of a scaffold adapter).
  • adenine (A) forms a base pair with thymine (T), and guanine (G) pairs with cytosine (C) in DNA.
  • a mixture of nucleic acids comprises single-stranded nucleic acid and double-stranded nucleic acid. In some embodiments, a mixture of nucleic acids comprises DNA and RNA. In some embodiments, a mixture of nucleic acids comprises ribosomal RNA (rRNA) and messenger RNA (mRNA).
  • Nucleic acid provided for processes described herein may contain nucleic acid from one sample or from two or more samples (e.g., from 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, or 20 or more samples).
  • target nucleic acids comprise degraded DNA.
  • Degraded DNA may be referred to as low-quality DNA or highly degraded DNA.
  • Degraded DNA may be highly fragmented, and may include damage such as base analogs and abasic sites subject to miscoding lesions and/or intermolecular crosslinking. For example, sequencing errors resulting from deamination of cytosine residues may be present in certain sequences obtained from degraded DNA (e.g., miscoding of C to T and G to A).
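As an illustration of the miscoding pattern mentioned above, the following sketch tallies substitutions between a reference and an aligned read and reports what fraction of them are C to T or G to A. It assumes the two sequences are already aligned and of equal length, with no gaps; the function names and example sequences are hypothetical.

```python
from collections import Counter


def count_miscoding(reference: str, read: str) -> Counter:
    """Count substitutions between two pre-aligned, equal-length sequences."""
    if len(reference) != len(read):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base != read_base:
            counts[f"{ref_base}>{read_base}"] += 1
    return counts


def deamination_fraction(reference: str, read: str) -> float:
    """Fraction of all substitutions that are C>T or G>A."""
    counts = count_miscoding(reference, read)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["C>T"] + counts["G>A"]) / total


if __name__ == "__main__":
    reference = "ACGTCCGA"  # hypothetical reference segment
    read = "ATGTCTAC"       # hypothetical read with three deamination-type changes
    print(dict(count_miscoding(reference, read)))
    print(deamination_fraction(reference, read))
```

A high fraction is only suggestive of deamination damage; a real analysis would also consider position within the read and strand.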
  • target nucleic acids are derived from nicked double-stranded nucleic acid fragments.
  • Nicked double-stranded nucleic acid fragments may be denatured (e.g., heat denatured) to generate ssNA fragments.
  • nucleic acid is provided for conducting methods described herein without prior processing of the sample(s) containing the nucleic acid.
  • nucleic acid may be analyzed directly from a sample without prior extraction, purification, partial purification, and/or amplification.
  • target nucleic acids are not contacted with an exonuclease (e.g., DNAse) prior to combining with the scaffold adapters herein, or components thereof.
  • target nucleic acids are not amplified prior to combining with the scaffold adapters herein, or components thereof.
  • target nucleic acids are not attached to a solid support prior to combining with the scaffold adapters herein, or components thereof.
  • target nucleic acids are not conjugated to another molecule prior to combining with the scaffold adapters herein, or components thereof.
  • target nucleic acids are not cloned into a vector prior to combining with the scaffold adapters herein, or components thereof.
  • target nucleic acids may be subjected to dephosphorylation prior to combining with the scaffold adapters herein, or components thereof. In some embodiments, target nucleic acids may be subjected to phosphorylation prior to combining with the scaffold adapters herein, or components thereof.
  • combining target nucleic acids with the scaffold adapters herein, or components thereof comprises isolating the target nucleic acids, phosphorylating the isolated target nucleic acids, and combining the phosphorylated target nucleic acids with the scaffold adapters herein, or components thereof.
  • combining target nucleic acids with the scaffold adapters herein, or components thereof comprises isolating the target nucleic acids, dephosphorylating the scaffold adapters herein, or components thereof, and combining the isolated target nucleic acids with the dephosphorylated scaffold adapters herein, or dephosphorylated components thereof.
  • combining target nucleic acids with the scaffold adapters herein, or components thereof comprises isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, and combining the phosphorylated target nucleic acids with the scaffold adapters herein, or components thereof.
  • combining target nucleic acids with the scaffold adapters herein, or components thereof comprises isolating the target nucleic acids, dephosphorylating the isolated target nucleic acids, phosphorylating the dephosphorylated target nucleic acids, dephosphorylating the scaffold adapters, or components thereof, and combining the phosphorylated target nucleic acids with the dephosphorylated scaffold adapters herein, or dephosphorylated components thereof.
  • Single-stranded nucleic acid or ssNA generally refers to a collection of polynucleotides which are single-stranded (i.e., not hybridized intermolecularly or intramolecularly) over 70% or more of their length.
  • ssNA is single-stranded over 75% or more, 80% or more, 85% or more, 90% or more, 95% or more, or 99% or more, of the length of the polynucleotides.
  • the ssNA is single-stranded over the entire length of the polynucleotides.
  • Single-stranded nucleic acid may be referred to herein as target nucleic acid.
  • ssNA may include single-stranded deoxyribonucleic acid (ssDNA).
  • ssDNA includes, but is not limited to, ssDNA derived from double-stranded DNA (dsDNA).
  • ssDNA may be derived from double-stranded DNA which is denatured (e.g., heat denatured and/or chemically denatured) to produce ssDNA.
  • a method herein comprises, prior to combining ssDNA with scaffold adapters described herein, or components thereof, generating the ssDNA by denaturing dsDNA.
  • ssNA includes single-stranded ribonucleic acid (ssRNA).
  • RNA may include, for example, messenger RNA (mRNA), microRNA (miRNA), small interfering RNA (siRNA), transacting small interfering RNA (ta-siRNA), natural small interfering RNA (nat-siRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), small nucleolar RNA (snoRNA), small nuclear RNA (snRNA), long non-coding RNA (lncRNA), non-coding RNA (ncRNA), transfer-messenger RNA (tmRNA), precursor messenger RNA (pre-mRNA), small Cajal body-specific RNA (scaRNA), piwi-interacting RNA (piRNA), endoribonuclease-prepared siRNA (esiRNA), small temporal RNA (stRNA), signal recognition RNA, telomere RNA, ribozyme, or a combination thereof.
  • a method herein comprises contacting ssNA with a single-stranded nucleic acid binding agent. In some embodiments, a method herein comprises contacting ssNA with single-stranded nucleic acid binding protein (SSB) to produce SSB-bound ssNA. In some embodiments, a method herein comprises contacting sscDNA with single-stranded nucleic acid binding protein (SSB) to produce SSB-bound sscDNA. In some embodiments, a method herein comprises contacting ssDNA with single-stranded nucleic acid binding protein (SSB) to produce SSB-bound ssDNA.
  • ET SSB Extreme Thermostable Single-Stranded DNA Binding Protein
  • Tth Thermus thermophilus
  • RPA - replication protein A
  • ET SSB, Tth RecA, E. coli RecA, T4 Gene 32 Protein, as well as buffers and detailed protocols for preparing SSB-bound ssNA using such SSBs, are commercially available (e.g., New England Biolabs, Inc. (Ipswich, MA)).
  • a method herein does not comprise contacting ssNA with single-stranded nucleic acid binding protein (SSB) to produce SSB-bound ssNA. Accordingly, a method herein may omit the step of producing SSB-bound ssNA.
  • a method herein may comprise combining ssNA with scaffold adapters described herein, or components thereof, without contacting the ssNA with SSB.
  • a method herein may be referred to as an “SSB-free” method for producing a nucleic acid library.
  • Certain SSB-free methods described herein may produce libraries having parameters similar to parameters for libraries prepared using SSB, as shown in the Drawings and discussed in the Examples.
  • a method herein comprises contacting ssNA with a single-stranded nucleic acid binding agent other than SSB.
  • Such single-stranded nucleic acid binding agents can stably bind single-stranded nucleic acids, can prevent or reduce formation of nucleic acid duplexes, can still allow the bound nucleic acids to be ligated or otherwise terminally modified, and can be thermostable.
  • Example single-stranded nucleic acid binding agents include but are not limited to topoisomerases, helicases, domains thereof, and fusion proteins comprising domains thereof.
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) generally includes ssNA and no additional protein or nucleic acid components.
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) may exclude double-stranded nucleic acid (dsNA) or may include a low percentage of dsNA (e.g., less than 10% dsNA, less than 5% dsNA, less than 1% dsNA).
  • dsNA double-stranded nucleic acid
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) may exclude proteins.
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid may exclude single-stranded binding proteins (SSBs) or other proteins useful for stabilizing ssNA.
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) may include chemical components typically present in nucleic acid compositions such as buffers, salts, alcohols, crowding agents (e.g., PEG), and the like; and may include residual components (e.g., nucleic acids, proteins, cell membrane components) from the nucleic acid source (e.g., sample) or nucleic acid extraction.
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) may include ssNA fragments having one or more phosphates (e.g., a terminal phosphate, a 5’ terminal phosphate).
  • a nucleic acid composition “consisting essentially of” single-stranded nucleic acid (ssNA) may include ssNA fragments comprising one or more modified nucleotides.
Enriching nucleic acids
  • nucleic acid (e.g., extracellular nucleic acid) is enriched or relatively enriched for a subpopulation or species of nucleic acid.
  • Nucleic acid subpopulations can include, for example, cancer nucleic acid, tumor nucleic acid, fetal nucleic acid, maternal nucleic acid, patient nucleic acid, host nucleic acid, pathogen nucleic acid, transplant nucleic acid, microbiome nucleic acid, nucleic acid comprising fragments of a particular length or range of lengths, or nucleic acid from a particular genome region (e.g., single chromosome, set of chromosomes, and/or certain chromosome regions).
  • methods of the technology comprise an additional step of enriching for a subpopulation of nucleic acid in a sample.
  • nucleic acid from normal tissue e.g., non-cancer cells, host cells
  • maternal nucleic acid is selectively removed (partially, substantially, almost completely or completely) from the sample.
  • enriching for a particular low copy number species nucleic acid may improve quantitative sensitivity.
  • Non-limiting examples of methods for enriching for a nucleic acid subpopulation in a sample include methods that exploit epigenetic differences between nucleic acid species (e.g., methylation-based fetal nucleic acid enrichment methods described in U.S. Patent Application Publication No. 2010/0105049, which is incorporated by reference herein); restriction endonuclease enhanced polymorphic sequence approaches (e.g., such as a method described in U.S. Patent Application Publication No.
  • Nucleic acids comprising one or more modifications can be enriched for by a variety of methods, including but not limited to antibody-based pulldown. Modified nucleic acid enrichment can be conducted before or after denaturation of dsDNA. Enrichment prior to denaturation can result in also enriching for the complementary strand which may lack the modification, while enrichment after denaturation does not enrich for complementary strands lacking modification.
  • nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome) using one or more sequence-based separation methods described herein.
  • Sequence-based separation generally is based on nucleotide sequences present in the fragments of interest (e.g., target and/or reference fragments) and substantially not present in other fragments of the sample or present in an insubstantial amount of the other fragments (e.g., 5% or less).
  • sequence-based separation can generate separated target fragments and/or separated reference fragments. Separated target fragments and/or separated reference fragments often are isolated away from the remaining fragments in the nucleic acid sample.
  • scaffold adapters are used to enrich for target nucleic acids.
  • scaffold adapters can be designed such that some or all of the bases in the ssNA hybridization region are defined or known bases. These scaffold adapters can hybridize preferentially to target nucleic acids with sequences complementary to the defined or known bases of the scaffold adapter ssNA hybridization region, thereby enriching for the target nucleic acids in the resulting library.
  • including a GC dinucleotide in the ssNA hybridization region can be used to enrich for target nucleic acids that have terminal CG (also called CpG) dinucleotides.
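The terminal-sequence selection described above can be illustrated with a small sketch. It assumes plain Watson-Crick pairing, exact matching, and that both the fragment and the defined portion of the hybridization region are written 5'→3'; under that convention a region that reads "CG" pairs with a fragment whose 3' terminus is CG. The function names are hypothetical and are not taken from this document.

```python
# Illustrative sketch only: exact Watson-Crick matching at the 3' terminus.
# Function names are hypothetical; real hybridization tolerates mismatches
# and depends on reaction conditions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def pairs_with_3prime_end(fragment: str, hyb_region: str) -> bool:
    """True if hyb_region (5'->3') can base-pair with the fragment's 3' terminus."""
    n = len(hyb_region)
    if len(fragment) < n:
        return False
    return fragment.upper()[-n:] == reverse_complement(hyb_region)


def enrich(fragments, hyb_region):
    """Keep only fragments whose 3' terminus pairs with the defined bases."""
    return [f for f in fragments if pairs_with_3prime_end(f, hyb_region)]


fragments = ["ATTGACCG", "ATTGACCA", "CGTTAGCG"]
cpg_terminated = enrich(fragments, "CG")
```

In this toy input only the first and third fragments end in CG, so only those two would be carried into the modeled library.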
  • a selective nucleic acid capture process is used to separate target and/or reference fragments away from a nucleic acid sample.
  • nucleic acid capture systems include, for example, Nimblegen sequence capture system (Roche NimbleGen, Madison, WI); ILLUMINA BEADARRAY platform (Illumina, San Diego, CA); Affymetrix GENECHIP platform (Affymetrix, Santa Clara, CA); Agilent SureSelect Target Enrichment System (Agilent Technologies, Santa Clara, CA); and related platforms.
  • Such methods typically involve hybridization of a capture oligonucleotide to a part or all of the nucleotide sequence of a target or reference fragment and can include use of a solid phase (e.g., solid phase array) and/or a solution-based platform.
  • Capture oligonucleotides (sometimes referred to as “bait”) can be selected or designed such that they preferentially hybridize to nucleic acid fragments from selected genomic regions or loci, or a particular sequence in a nucleic acid target.
  • a hybridization-based method e.g., using oligonucleotide arrays
  • nucleic acid is enriched for a particular nucleic acid fragment length, range of lengths, or lengths under or over a particular threshold or cutoff using one or more length-based separation methods.
  • Nucleic acid fragment length typically refers to the number of nucleotides in the fragment.
  • Nucleic acid fragment length also is sometimes referred to as nucleic acid fragment size.
  • a length-based separation method is performed without measuring lengths of individual fragments.
  • a length-based separation method is performed in conjunction with a method for determining length of individual fragments.
  • length-based separation refers to a size fractionation procedure where all or part of the fractionated pool can be isolated (e.g., retained) and/or analyzed.
  • a method herein includes enriching an RNA species in a mixture of RNA species. In some embodiments, a method herein includes enriching an sscDNA species in a mixture of sscDNA species. For example, a method herein may comprise enriching messenger RNA (mRNA) present in a mixture of mRNA and ribosomal RNA (rRNA), or enriching sscDNA corresponding to mRNA present in a mixture of sscDNA corresponding to mRNA and rRNA.
  • mRNA messenger RNA
  • rRNA ribosomal RNA
  • Enrichment strategies can increase the relative abundance (e.g., as assessed by percent of sequencing reads) of the targeted nucleic acids by at least 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%, 200%, 300%, 400%, 500%, 600%, 700%, 800%, 900%, 1000%, 1100%, 1200%, 1300%, 1400%, 1500%, 1600%, 1700%, 1800%, 1900%, 2000%, 3000%, 4000%, 5000%, 6000%, 7000%, 8000%, 9000%, 10000%, or more.
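As a worked illustration of the percent-of-reads measure mentioned above, the following sketch computes the relative increase in abundance of a target between an unenriched and an enriched library. The read counts are invented for illustration and the function names are hypothetical.

```python
# Illustrative sketch: percent increase in relative abundance, assessed as
# the fraction of sequencing reads assigned to the target. Counts are made up.
def relative_abundance(target_reads: int, total_reads: int) -> float:
    """Fraction of all reads that map to the targeted nucleic acids."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return target_reads / total_reads


def percent_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    if before <= 0:
        raise ValueError("baseline abundance must be positive")
    return (after - before) / before * 100.0


before = relative_abundance(2_000, 100_000)   # 2% of reads before enrichment
after = relative_abundance(30_000, 100_000)   # 30% of reads after enrichment
increase = percent_increase(before, after)    # roughly a 1400% increase
```

Note that a 100% increase corresponds to a doubling of the target's share of reads, so the larger values in the list above describe many-fold enrichment.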
  • a method herein comprises contacting under hybridization conditions a set of ligation products (e.g., a set of ssNAs ligated to one or more scaffold adapters herein) with one or more probe species.
  • each probe species comprises a sequence complementary to a ribosomal RNA (rRNA) sequence.
  • a probe sequence may be complementary to an rRNA sequence, or complement, or reverse complement thereof, and a probe sequence may be complementary to an sscDNA sequence generated from an rRNA fragment, or complement, or reverse complement thereof.
  • a probe sequence may be complementary to an entire rRNA sequence or an entire sscDNA sequence, or may be complementary to a portion of an rRNA sequence or a portion of an sscDNA sequence.
  • a set of probes comprises one or more probes targeting adapter dimers, such as one or more probes targeting -P5 and -P7 regions.
  • a set of probes comprises one or more probes targeting transcripts from genes that are commonly highly expressed across many or all samples of a given type, such as ribonucleoprotein complex genes like RN7SL1 and RN7SL2.
  • a set of probes comprises one or more probes comprising nucleic acid sequences set forth in SEQ ID NOs: 1-174.
  • a probe species can specifically hybridize to a ssNA, or portion thereof, comprising a ribosomal RNA (rRNA) sequence, or a corresponding sscDNA sequence, complement thereof, or reverse complement thereof.
  • Specific hybridization may be affected or influenced by factors such as the degree of complementarity between the ssNA and the probe, the length thereof, and the temperature at which the hybridization occurs, which may be informed by melting temperatures (Tm) of the probe and/or ssNA.
  • Melting temperature generally refers to the temperature at which half of the probes/ssNAs remain hybridized and half of the probes/ssNAs dissociate into single strands.
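For a rough sense of how probe composition relates to Tm, the sketch below applies the Wallace rule, a common rule of thumb for short oligonucleotides (Tm ≈ 2 °C per A/T plus 4 °C per G/C). This rule is not taken from this document; it is swapped in here purely as an illustration, and it ignores salt concentration, probe concentration, and mismatches.

```python
# Illustrative sketch: Wallace-rule Tm estimate for a short DNA oligo.
# This is a rule of thumb, not the method of this document.
def wallace_tm(oligo: str) -> int:
    """Estimate Tm in degrees Celsius as 2*(A+T) + 4*(G+C)."""
    seq = oligo.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("oligo must be a non-empty string of A, C, G, T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


tm_estimate = wallace_tm("ACGTACGTACGTACGT")  # 8 A/T and 8 G/C bases
```

Longer or more GC-rich probes give higher estimates, which is consistent with the dependence on length and complementarity noted above.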
  • the binding pair is biotin and streptavidin.
  • a probe species comprises a first member of a binding pair (e.g., biotin); and a second member of a binding pair (e.g., streptavidin) is conjugated to a solid support or substrate.
  • a solid support or substrate can be any physically separable solid to which a member of a binding pair can be directly or indirectly attached including, but not limited to, surfaces provided by microarrays and wells, and particles such as beads (e.g., paramagnetic beads, magnetic beads, microbeads, nanobeads), microparticles, and nanoparticles.
  • Solid supports also can include, for example, chips, columns, optical fibers, wipes, filters (e.g., flat surface filters), one or more capillaries, glass and modified or functionalized glass (e.g., controlled-pore glass (CPG)), quartz, mica, diazotized membranes (paper or nylon), polyformaldehyde, cellulose, cellulose acetate, paper, ceramics, metals, metalloids, semiconductive materials, quantum dots, coated beads or particles, other chromatographic materials, magnetic particles; plastics (including acrylics, polystyrene, copolymers of styrene or other materials, polybutylene, polyurethanes, TEFLON™, polyethylene, polypropylene, polyamide, polyester, polyvinylidenedifluoride (PVDF), and the like), polysaccharides, nylon or nitrocellulose, resins, silica or silica-based materials including silicon, silica gel, and modified silicon, Sephadex®, Sepharose®, carbon, metals (
  • a solid support or substrate may be coated using passive or chemically-derivatized coatings with any number of materials, including polymers, such as dextrans, acrylamides, gelatins or agarose. Beads and/or particles may be free or in connection with one another (e.g., sintered).
  • a solid support can be a collection of particles.
  • the particles can comprise silica, and the silica may comprise silicon dioxide.
  • the silica can be porous, and in certain embodiments the silica can be non-porous.
  • the particles further comprise an agent that confers a paramagnetic property to the particles.
  • the agent comprises a metal
  • the agent is a metal oxide (e.g., iron or iron oxides, where the iron oxide contains a mixture of Fe2+ and Fe3+).
  • a member of a binding pair may be linked to a solid support by covalent bonds or by non-covalent interactions and may be linked to a solid support directly or indirectly (e.g., via an intermediary agent such as a spacer molecule or biotin).
  • Hybridized ligation products and unhybridized ligation products may be separated using any suitable separation method.
  • hybridized ligation products and unhybridized ligation products may be separated using a suitable probe pull-down method.
  • hybridized ligation products comprising a first member of a binding pair may be separated using a solid support conjugated to a second member of a binding pair.
  • hybridized ligation products comprising a biotinylated probe may be separated using a solid support (e.g., magnetic bead) conjugated to a streptavidin.
  • unhybridized ligation products are retained for further analysis (e.g., analysis of mRNA sequences).
  • hybridized ligation products are retained for further analysis (e.g., analysis of rRNA sequences or a specific subset thereof).
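The probe pull-down logic above can be modeled in a few lines. The sketch below assumes, as a deliberate simplification, that a ligation product hybridizes to a probe only if it contains the exact reverse complement of the probe; real hybridization tolerates mismatches and depends on temperature. The sequences and function names are hypothetical.

```python
# Illustrative sketch: in-silico partition of ligation products into
# probe-hybridized (pulled down) and unhybridized (left in solution) pools.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def partition_by_probe(products, probes):
    """Split products into (hybridized, unhybridized) by exact probe match."""
    targets = [reverse_complement(p) for p in probes]
    hybridized, unhybridized = [], []
    for product in products:
        if any(t in product.upper() for t in targets):
            hybridized.append(product)      # e.g., captured on streptavidin beads
        else:
            unhybridized.append(product)    # e.g., retained for mRNA analysis
    return hybridized, unhybridized


probes = ["GGTTAA"]                          # made-up probe sequence
products = ["ACGTTAACCGT", "ACGTACGTACG"]    # made-up ligation products
pulled_down, flow_through = partition_by_probe(products, probes)
```

Either pool can be carried forward: the flow-through when the probes target unwanted species, or the pulled-down pool when the probes target the species of interest.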
  • length-based separation refers to a size fractionation procedure where all or part of the fractionated pool can be isolated (e.g., retained) and/or analyzed. Size fractionation procedures are known in the art (e.g., separation on an array, separation by a molecular sieve, separation by gel electrophoresis, separation by column chromatography (e.g., size-exclusion columns), and microfluidics-based approaches).
  • length-based separation approaches can include fragment circularization, chemical treatment (e.g., formaldehyde, polyethylene glycol (PEG)), mass spectrometry and/or size-specific nucleic acid amplification, for example.
  • length-based separation is performed using Solid Phase Reversible Immobilization (SPRI) beads.
  • SPRI Solid Phase Reversible Immobilization
  • nucleic acid fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are separated from the sample.
  • fragments having a length under a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “short” fragments, and fragments having a length over a particular threshold or cutoff are referred to as “long” fragments, large fragments, and/or high molecular weight (HMW) fragments.
  • fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are retained for analysis while fragments of a different length or range of lengths, or lengths over or under the threshold or cutoff are not retained for analysis.
  • fragments that are less than about 500 bp are retained.
  • fragments that are less than about 400 bp are retained.
  • fragments that are less than about 300 bp are retained.
  • fragments that are less than about 200 bp are retained.
  • fragments that are less than about 150 bp are retained.
  • fragments that are in the range of about 110 bp to about 190 bp, 130 bp to about 180 bp, 140 bp to about 170 bp, 140 bp to about 150 bp, 150 bp to about 160 bp, or 145 bp to about 155 bp are retained.
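The retention rules above amount to a simple length filter when applied computationally (for example, to sequenced or simulated fragments). The sketch below is a minimal illustration; treating both bounds as inclusive is an assumption made here, and the function name is hypothetical.

```python
# Illustrative sketch: retain fragments whose length falls within a window.
# Inclusive bounds are an assumption of this example.
def retain_by_length(fragments, min_len=None, max_len=None):
    """Return fragments with min_len <= length <= max_len (None = no bound)."""
    kept = []
    for seq in fragments:
        n = len(seq)
        if min_len is not None and n < min_len:
            continue
        if max_len is not None and n > max_len:
            continue
        kept.append(seq)
    return kept


# Made-up fragments of 100, 150, 165 and 300 nucleotides.
fragments = ["A" * 100, "A" * 150, "A" * 165, "A" * 300]
window = retain_by_length(fragments, min_len=140, max_len=170)
short_only = retain_by_length(fragments, max_len=199)
```

Here `window` keeps the 150- and 165-nucleotide fragments, and `short_only` keeps every fragment under 200 nucleotides.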
  • target nucleic acids (e.g., ssNAs) having fragment lengths of less than about 1000 bp are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • target nucleic acids (e.g., ssNAs) having fragment lengths of less than about 500 bp are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • target nucleic acids (e.g., ssNAs) having fragment lengths of less than about 400 bp are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • target nucleic acids (e.g., ssNAs) having fragment lengths of about 100 bp or more are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • target nucleic acids (e.g., ssNAs) having fragment lengths of about 200 bp or more are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • target nucleic acids (e.g., ssNAs) having fragment lengths of about 300 bp or more are combined with a plurality or pool of scaffold adapter species, or components of scaffold adapter species, described herein.
  • Certain length-based separation methods that can be used with methods described herein employ a selective sequence tagging approach, for example.
  • a fragment size species e.g., short fragments
  • Such methods typically involve performing a nucleic acid amplification reaction using a set of nested primers which include inner primers and outer primers.
  • one or both of the inner primers can be tagged to thereby introduce a tag onto the target amplification product.
  • the outer primers generally do not anneal to the short fragments that carry the (inner) target sequence.
  • the inner primers can anneal to the short fragments and generate an amplification product that carries a tag and the target sequence.
  • tagging of the long fragments is inhibited through a combination of mechanisms which include, for example, blocked extension of the inner primers by the prior annealing and extension of the outer primers.
  • Enrichment for tagged fragments can be accomplished by any of a variety of methods, including for example, exonuclease digestion of single stranded nucleic acid and amplification of the tagged fragments using amplification primers specific for at least one tag.
  • Another length-based separation method that can be used with methods described herein involves subjecting a nucleic acid sample to polyethylene glycol (PEG) precipitation.
  • PEG polyethylene glycol
  • Examples of methods include those described in International Patent Application Publication Nos. WO2007/140417 and WO2010/115016.
  • This method in general entails contacting a nucleic acid sample with PEG in the presence of one or more monovalent salts under conditions sufficient to substantially precipitate large nucleic acids without substantially precipitating small (e.g., less than 300 nucleotides) nucleic acids.
  • Another length-based enrichment method that can be used with methods described herein involves circularization by ligation, for example, using circligase. Short nucleic acid fragments typically can be circularized with higher efficiency than long fragments. Non-circularized sequences can be separated from circularized sequences, and the enriched short fragments can be used for further analysis.
  • Methods herein may include preparing a nucleic acid library and/or modifying nucleic acids for a nucleic acid library.
  • ends of nucleic acid fragments are modified such that the fragments, or amplified products thereof, may be incorporated into a nucleic acid library.
  • a nucleic acid library refers to a plurality of polynucleotide molecules (e.g., a sample of nucleic acids) that are prepared, assembled and/or modified for a specific process, non-limiting examples of which include immobilization on a solid phase (e.g., a solid support, a flow cell, a bead), enrichment, amplification, cloning, detection and/or for nucleic acid sequencing.
  • a nucleic acid library is prepared prior to or during a sequencing process.
  • a nucleic acid library (e.g., sequencing library) can be prepared by a suitable method as known in the art.
  • a nucleic acid library can be prepared by a targeted or a non-targeted preparation process.
  • a library of nucleic acids is modified to comprise one or more polynucleotides of known composition, non-limiting examples of which include an identifier (e.g., a tag, an indexing tag), a capture sequence, a label, an adapter, a restriction enzyme site, a promoter, an enhancer, an origin of replication, a stem loop, a complementary sequence (e.g., a primer binding site, an annealing site), a suitable integration site (e.g., a transposon, a viral integration site), a modified nucleotide, a unique molecular identifier (UMI) described herein, a palindromic sequence described herein, the like or combinations thereof.
  • the resulting blunt end repaired nucleic acid can then be extended by a single nucleotide, which is complementary to a single nucleotide overhang on the 3’ end of an adapter/primer. Any nucleotide can be used for the extension/overhang nucleotides.
  • end repair is omitted and scaffold adapters (e.g., scaffold adapters described herein) are ligated directly to the native ends of nucleic acids (e.g., single-stranded nucleic acids, fragmented nucleic acids, and/or cell-free DNA).
  • nucleic acid library preparation comprises ligating a scaffold adapter, or component thereof, (e.g., to a sample nucleic acid, to a sample nucleic acid fragment, to a template nucleic acid, to a target nucleic acid, to an ssNA), such as a scaffold adapter described herein.
  • Scaffold adapters, or components thereof may comprise sequences complementary to flow-cell anchors, and sometimes are utilized to immobilize a nucleic acid library to a solid support, such as the inside surface of a flow cell, for example.
  • scaffold adapters, or components thereof, when used in combination with amplification primers are designed to generate library constructs comprising one or more of: universal sequences, molecular barcodes (UMIs), UMI flanking sequence, sample ID sequences, spacer sequences, and a sample nucleic acid sequence (e.g., ssNA sequence).
  • amplification primers e.g., universal amplification primers
  • UMIs molecular barcodes
  • scaffold adapters, or components thereof, when used in combination with universal amplification primers are designed to generate library constructs comprising an ordered combination of one or more of: universal sequences, molecular barcodes (UMIs), sample ID sequences, spacer sequences, and a sample nucleic acid sequence (e.g., ssNA sequence).
  • a library construct may comprise a first universal sequence, followed by a second universal sequence, followed by first molecular barcode (UMI), followed by a spacer sequence, followed by a template sequence (e.g., sample nucleic acid sequence; ssNA sequence), followed by a spacer sequence, followed by a second molecular barcode (UMI), followed by a third universal sequence, followed by a sample ID, followed by a fourth universal sequence.
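The ordered layout described above can be made concrete with a small data-structure sketch. All sequences below are invented placeholders chosen only to make the element boundaries visible; they are not the universal sequences, UMIs, or sample IDs of this document, and the class and function names are hypothetical.

```python
# Illustrative sketch: assemble a library construct from ordered elements.
# Every sequence here is a made-up placeholder.
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    sequence: str


def build_construct(elements):
    """Concatenate element sequences 5'->3' in the order given."""
    return "".join(e.sequence for e in elements)


layout = [
    Element("universal_1", "AAAAAA"),
    Element("universal_2", "CCCCCC"),
    Element("umi_1", "ACGTAC"),
    Element("spacer_1", "TT"),
    Element("template", "GATTACAGATTACA"),   # sample ssNA sequence
    Element("spacer_2", "TT"),
    Element("umi_2", "TGCATG"),
    Element("universal_3", "GGGGGG"),
    Element("sample_id", "CATCAT"),
    Element("universal_4", "TTTTTT"),
]

construct = build_construct(layout)
element_order = [e.name for e in layout]
```

Keeping each element as a named record makes it straightforward to locate the UMIs and sample ID by offset when reads from such a construct are parsed.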
  • scaffold adapters, or components thereof, when used in combination with amplification primers are designed to generate library constructs for each strand of a template molecule (e.g., sample nucleic acid molecule; ssNA molecule).
  • scaffold adapters are duplex adapters.
  • An identifier can be a suitable detectable label incorporated into or attached to a nucleic acid (e.g., a polynucleotide) that allows detection and/or identification of nucleic acids that comprise the identifier.
  • an identifier is incorporated into or attached to a nucleic acid during a sequencing method (e.g., by a polymerase).
  • an identifier is incorporated into or attached to a nucleic acid prior to a sequencing method (e.g., by an extension reaction, by an amplification reaction, by a ligation reaction).
  • identifiers are six or more contiguous nucleotides.
  • a multitude of fluorophores are available with a variety of different excitation and emission spectra. Any suitable type and/or number of fluorophores can be used as an identifier.
  • 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more or 50 or more different identifiers are utilized in a method described herein (e.g., a nucleic acid detection and/or sequencing method).
  • one or two types of identifiers are linked to each nucleic acid in a library.
  • Detection and/or quantification of an identifier can be performed by a suitable method, apparatus or machine, nonlimiting examples of which include flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, a luminometer, a fluorometer, a spectrophotometer, a suitable gene-chip or microarray analysis, Western blot, mass spectrometry, chromatography, cytofluorimetric analysis, fluorescence microscopy, a suitable fluorescence or digital imaging method, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, a suitable nucleic acid sequencing method and/or nucleic acid sequencing apparatus, the like and combinations thereof.
  • qPCR quantitative polymerase chain reaction
  • a nucleic acid library or parts thereof are amplified (e.g., amplified by a PCR-based method) under amplification conditions.
  • a sequencing method comprises amplification of a nucleic acid library.
  • a nucleic acid library can be amplified prior to or after immobilization on a solid support (e.g., a solid support in a flow cell).
  • Nucleic acid amplification includes the process of amplifying or increasing the numbers of a nucleic acid template and/or of a complement thereof that are present (e.g., in a nucleic acid library), by producing one or more copies of the template and/or its complement. Amplification can be carried out by a suitable method.
  • a nucleic acid library can be amplified by a thermocycling method or by an isothermal amplification method. In some embodiments, a rolling circle amplification method is used. In some embodiments, amplification takes place on a solid support (e.g., within a flow cell) where a nucleic acid library or portion thereof is immobilized. In certain sequencing methods, a nucleic acid library is added to a flow cell and immobilized by hybridization to anchors under suitable conditions. This type of nucleic acid amplification is often referred to as solid phase amplification. In some embodiments of solid phase amplification, all or a portion of the amplified products are synthesized by an extension initiating from an immobilized primer.
  • Solid phase amplification reactions are analogous to standard solution phase amplifications except that at least one of the amplification oligonucleotides (e.g., primers) is immobilized on a solid support.
  • modified nucleic acid e.g., nucleic acid modified by addition of adapters
  • Non-limiting examples of solid phase nucleic acid amplification reactions include interfacial amplification, bridge amplification, emulsion PCR, WildFire amplification (e.g., U.S. Patent Application Publication No. 2013/0012399), the like or combinations thereof.
  • a nucleic acid library comprises nucleic acid originating from an RNA source and nucleic acid originating from a DNA source, where both types of nucleic acid molecules comprise a common priming site at one end and a different priming site at the other end.
  • both types of nucleic acid molecules may have priming site A at one end
  • nucleic acid originating from the RNA source may have priming site B at the opposite end
  • nucleic acid originating from the DNA source may have priming site C at the opposite end.
  • An amplification reaction that includes primers binding to A and B, and excludes a primer binding to C, will result in exponential amplification of nucleic acid originating from the RNA source and linear amplification of nucleic acid originating from the DNA source.
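The difference between exponential and linear amplification can be seen with an idealized calculation. The sketch below assumes a fixed per-cycle efficiency and no reagent depletion: molecules with both priming sites (A and B) multiply each cycle, while molecules with only one usable priming site (A, lacking a primer for C) gain a fixed number of copies per cycle. Function names and numbers are illustrative assumptions.

```python
# Illustrative sketch: idealized copy numbers after PCR cycling.
# Assumes constant efficiency and unlimited reagents.
def exponential_copies(start: float, cycles: int, efficiency: float = 1.0) -> float:
    """Both primers bind: copies multiply by (1 + efficiency) each cycle."""
    return start * (1 + efficiency) ** cycles


def linear_copies(start: float, cycles: int, efficiency: float = 1.0) -> float:
    """Only one primer binds: each original template is copied once per cycle."""
    return start * (1 + efficiency * cycles)


rna_derived = exponential_copies(100, 10)   # priming sites A and B
dna_derived = linear_copies(100, 10)        # priming site A only
fold_difference = rna_derived / dna_derived
```

With 100 starting molecules of each type and 10 ideal cycles, the RNA-derived molecules reach 102,400 copies while the DNA-derived molecules reach 1,100, which is why the primer choice effectively selects one population.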
  • nucleic acid e.g., nucleic acid fragments, sample nucleic acid, cell-free nucleic acid, single-stranded nucleic acid, single-stranded DNA, single-stranded RNA
  • hybridization products are sequenced by a sequencing process.
  • ssNA ligated to oligonucleotide components provided herein single-stranded ligation products
  • single-stranded ligation products are sequenced by a sequencing process.
  • hybridization products and/or single-stranded ligation products are amplified by an amplification process, and the amplification products are sequenced by a sequencing process.
  • hybridization products and/or single-stranded ligation products are not amplified by an amplification process, and the hybridization products and/or single-stranded ligation products are sequenced without prior amplification by a sequencing process.
  • the sequencing process generates sequence reads (or sequencing reads).
  • a method herein comprises determining the sequence of a single-stranded nucleic acid molecule based on the sequence reads.
  • generating sequence reads may include generating forward sequence reads (also referred to herein as read1) and generating reverse sequence reads (also referred to herein as read2).
  • read1 forward sequence reads
  • read2 reverse sequence reads
  • certain paired-end sequencing platforms sequence each nucleic acid fragment from both directions, generally resulting in two reads per nucleic acid fragment, with the first read in a forward orientation (forward read) and the second read in reverse-complement orientation (reverse read).
  • a forward read is generated off a particular primer within a sequencing adapter (e.g., ILLUMINA adapter, P5 primer), and a reverse read is generated off a different primer within a sequencing adapter (e.g., ILLUMINA adapter, P7 primer).
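The orientation of the two reads can be illustrated with a toy simulation. The sketch below assumes error-free reads of a fixed length taken from an insert with adapters already removed; the fragment sequence and function names are hypothetical.

```python
# Illustrative sketch: simulate read1 (forward) and read2 (reverse-complement)
# from a single fragment, ignoring adapters and sequencing errors.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int):
    """Return (read1, read2) for a fragment given as its forward strand."""
    read1 = fragment.upper()[:read_length]               # starts at the 5' end
    read2 = reverse_complement(fragment)[:read_length]   # starts at the far end
    return read1, read2


read1, read2 = paired_end_reads("AACCGGTTAC", 4)
```

For fragments shorter than twice the read length, the two reads overlap, so reverse-complementing read2 and aligning it to read1 can recover the full insert.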
  • Nucleic acid may be sequenced using any suitable sequencing platform including a Sanger sequencing platform, a high throughput or massively parallel sequencing (next generation sequencing (NGS)) platform, or the like, such as, for example, a sequencing platform provided by Illumina® (e.g., HiSeq™, MiSeq™ and/or Genome Analyzer™ sequencing systems); Oxford Nanopore™ Technologies (e.g., MinION sequencing system), Ion Torrent™ (e.g., Ion PGM™ and/or Ion Proton™ sequencing systems); Pacific Biosciences (e.g., PACBIO RS II sequencing system); Life Technologies™ (e.g., SOLiD sequencing system); Roche (e.g., 454 GS FLX+ and/or GS Junior sequencing systems); or any other suitable sequencing platform.
  • the nominal, average, mean or absolute length of single-end reads sometimes is about 10 contiguous nucleotides to about 250 or more contiguous nucleotides, about 15 contiguous nucleotides to about 200 or more contiguous nucleotides, about 15 contiguous nucleotides to about 150 or more contiguous nucleotides, about 15 contiguous nucleotides to about 125 or more contiguous nucleotides, about 15 contiguous nucleotides to about 100 or more contiguous nucleotides, about 15 contiguous nucleotides to about 75 or more contiguous nucleotides, about 15 contiguous nucleotides to about 60 or more contiguous nucleotides, about 15 contiguous nucleotides to about 50 or more contiguous nucleotides, about 15 contiguous nucleotides to about 40 or more contiguous nucleotides, and sometimes about 15 contiguous nucleotides or about 36 or more contiguous nucleo
  • the nominal, average, mean or absolute length of single-end reads is about 20 to about 30 bases, or about 24 to about 28 bases in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28 or about 29 bases or more in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 20 to about 200 bases, about 100 to about 200 bases, or about 140 to about 160 bases in length.
  • the nominal, average, mean or absolute length of single-end reads is about 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or about 200 bases or more in length.
  • the nominal, average, mean or absolute length of paired-end reads sometimes is about 10 contiguous nucleotides to about 25 contiguous nucleotides or more (e.g., about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length or more), about 15 contiguous nucleotides to about 20 contiguous nucleotides or more, and sometimes is about 17 contiguous nucleotides or about 18 contiguous nucleotides.
  • Reads generally are representations of nucleotide sequences in a physical nucleic acid. For example, in a read containing an ATGC depiction of a sequence, “A” represents an adenine nucleotide, “T” represents a thymine nucleotide, “G” represents a guanine nucleotide and “C” represents a cytosine nucleotide, in a physical nucleic acid.
  • Sequence reads obtained from a sample from a subject can be reads from a mixture of a minority nucleic acid and a majority nucleic acid. For example, sequence reads obtained from the blood of a cancer patient can be reads from a mixture of cancer nucleic acid and non-cancer nucleic acid.
  • sequence reads obtained from the blood of a pregnant female can be reads from a mixture of fetal nucleic acid and maternal nucleic acid.
  • sequence reads obtained from the blood of a patient having an infection or infectious disease can be reads from a mixture of host nucleic acid and pathogen nucleic acid.
  • sequence reads obtained from the blood of a transplant recipient can be reads from a mixture of host nucleic acid and transplant nucleic acid.
  • sequence reads obtained from a sample can be reads from a mixture of nucleic acid from microorganisms collectively comprising a microbiome (e.g., microbiome of gut, microbiome of blood, microbiome of mouth, microbiome of spinal fluid, microbiome of feces) in a subject.
  • sequence reads obtained from a sample can be reads from a mixture of nucleic acid from microorganisms collectively comprising a microbiome (e.g., microbiome of gut, microbiome of blood, microbiome of mouth, microbiome of spinal fluid, microbiome of feces), and nucleic acid from the host subject.
  • a mixture of relatively short reads can be transformed by processes described herein into a representation of genomic nucleic acid present in the subject, and/or a representation of genomic nucleic acid present in a tumor, a fetus, a pathogen, a transplant, or a microbiome.
  • “obtaining” nucleic acid sequence reads of a sample from a subject and/or “obtaining” nucleic acid sequence reads of a biological specimen from one or more reference persons can involve directly sequencing nucleic acid to obtain the sequence information. In some embodiments, “obtaining” can involve receiving sequence information obtained directly from a nucleic acid by another.
  • nucleic acids in a sample are enriched and/or amplified (e.g., non-specifically, e.g., by a PCR based method) prior to or during sequencing.
  • specific nucleic acid species or subsets in a sample are enriched and/or amplified prior to or during sequencing.
  • a species or subset of a pre-selected pool of nucleic acids is sequenced randomly.
  • nucleic acids in a sample are not enriched and/or amplified prior to or during sequencing.
  • a representative fraction of a genome is sequenced and is sometimes referred to as “coverage” or “fold coverage.”
  • a 1-fold coverage indicates that roughly 100% of the nucleotide sequences of the genome are represented by reads.
  • fold coverage is referred to as (and is directly proportional to) “sequencing depth.”
  • “fold coverage” is a relative term referring to a prior sequencing run as a reference. For example, a second sequencing run may have 2-fold less coverage than a first sequencing run.
  • a genome is sequenced with redundancy, where a given region of the genome can be covered by two or more reads or overlapping reads (e.g., a “fold coverage” greater than 1, e.g., a 2-fold coverage).
  • a genome (e.g., a whole genome) is sequenced with about 0.01-fold to about 100-fold coverage, about 0.1-fold to 20-fold coverage, or about 0.1-fold to about 1-fold coverage (e.g., about 0.015-, 0.02-, 0.03-, 0.04-, 0.05-, 0.06-, 0.07-, 0.08-, 0.09-, 0.1-, 0.2-, 0.3-, 0.4-, 0.5-, 0.6-, 0.7-, 0.8-, 0.9-, 1-, 2-, 3-, 4-, 5-, 6-, 7-, 8-, 9-, 10-, 15-, 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90-fold or greater coverage).
  • a minimum fold coverage is determined according to a limit of detection (LOD) analysis (e.g., a LOD analysis as described in Example 1 and shown in Figs. 19-21).
  • One or more of accuracy, sensitivity, and specificity may be measured for various total read counts and fold coverages.
  • a LOD is between about 10,000 reads to about 100,000 reads.
  • a LOD may be about 10,000 reads, 20,000 reads, 30,000 reads, 40,000 reads, 50,000 reads, 60,000 reads, 70,000 reads, 80,000 reads, 90,000 reads, or 100,000 reads.
  • a LOD is between about 5,000 reads to about 50,000 reads.
  • a LOD may be about 5,000 reads, 10,000 reads, 15,000 reads, 20,000 reads, 25,000 reads, 30,000 reads, 35,000 reads, 40,000 reads, 45,000 reads, or 50,000 reads.
  • a minimum fold coverage is about 0.0001 to about 0.001.
  • a minimum fold coverage may be about 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, or 0.001.
  • a minimum fold coverage is about 0.0005.
  • a fold coverage is as low as 0.0005. In some embodiments, a minimum fold coverage is about 0.001 to about 0.01. For example, a minimum fold coverage may be about 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, or 0.01. In some embodiments, a minimum fold coverage is about 0.0025. In some embodiments, a fold coverage is as low as 0.0025.
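  • The relationship between read count and fold coverage discussed above can be sketched as follows. This is a minimal illustration that assumes the common approximation of average fold coverage as total sequenced bases divided by genome size; the function names, and any read length or genome size passed in, are hypothetical and not taken from the specification.

```python
import math


def fold_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Approximate average fold coverage: total sequenced bases / genome size."""
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return read_count * read_length / genome_size


def reads_for_coverage(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest whole number of reads whose total bases reach the target coverage."""
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical inputs: 150-base reads and a 3,000,000,000-base genome.
    print(fold_coverage(10_000, 150, 3_000_000_000))
```

  • Under this approximation, fold coverage scales linearly with read count for a fixed read length, which is why a minimum fold coverage and a minimum read count can be stated interchangeably once read length and genome size are fixed.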
  • a sequencing method utilizes identifiers that allow multiplexing of sequence reactions in a sequencing process.
  • a sequencing process can be performed using any suitable number of unique identifiers (e.g., 4, 8, 12, 24, 48, 96, or more).
  • a sequencing process sometimes makes use of a solid phase, and sometimes the solid phase comprises a flow cell on which nucleic acid from a library can be attached and reagents can be flowed and contacted with the attached nucleic acid.
  • a flow cell sometimes includes flow cell lanes, and use of identifiers can facilitate analyzing a number of samples in each lane.
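  • As a rough sketch of how identifiers allow several samples to share a lane, the snippet below groups reads by a sample identifier. It assumes, purely for illustration, that the identifier is a fixed-length sequence at the start of each read; the function name, the "undetermined" label, and the example identifiers are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, identifier_to_sample, identifier_length):
    """Group reads by the sample identifier found at the start of each read.

    Reads whose identifier is not recognized are collected under "undetermined".
    The identifier is stripped from the read before it is stored.
    """
    by_sample = defaultdict(list)
    for read in reads:
        identifier = read[:identifier_length]
        insert = read[identifier_length:]
        sample = identifier_to_sample.get(identifier, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


if __name__ == "__main__":
    identifiers = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical identifiers
    reads = ["ACGTGGGAAA", "TGCACCCTTT", "ACGTTTTTTT", "NNNNAAAAAA"]
    print(demultiplex(reads, identifiers, identifier_length=4))
```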
  • a flow cell often is a solid support that can be configured to retain and/or allow the orderly passage of reagent solutions over bound analytes.
  • Flow cells frequently are planar in shape, optically transparent, generally in the millimeter or sub-millimeter scale, and often have channels or lanes in which the analyte/reagent interaction occurs.
  • Non-limiting examples of commercially available multiplex sequencing kits include Illumina’s multiplexing sample preparation oligonucleotide kit and multiplexing sequencing primers and PhiX control kit (e.g., Illumina’s catalog numbers PE-400-1001 and PE-400-1002, respectively).
  • any suitable method of sequencing nucleic acids can be used, non-limiting examples of which include Maxam & Gilbert, chain-termination methods, sequencing by synthesis, sequencing by ligation, sequencing by mass spectrometry, microscopy-based techniques, the like or combinations thereof.
  • a first-generation technology such as, for example, Sanger sequencing methods including automated Sanger sequencing methods, including microfluidic Sanger sequencing, can be used in a method provided herein.
  • sequencing technologies that include the use of nucleic acid imaging technologies (e.g., transmission electron microscopy (TEM) and atomic force microscopy (AFM)), can be used.
  • a high-throughput sequencing method is used.
  • a targeted approach often isolates, selects and/or enriches a subset of nucleic acids in a sample for further processing by use of sequence-specific oligonucleotides.
  • a library of sequence-specific oligonucleotides is utilized to target (e.g., hybridize to) one or more sets of nucleic acids in a sample.
  • Sequence-specific oligonucleotides and/or primers are often selective for particular sequences (e.g., unique nucleic acid sequences) present in one or more chromosomes, genes, exons, introns, and/or regulatory regions of interest.
  • targeted sequences are isolated and/or enriched by capture to a solid phase (e.g., a flow cell, a bead) using one or more sequence-specific anchors.
  • targeted sequences are enriched and/or amplified by a polymerase-based method (e.g., a PCR-based method, by any suitable polymerase-based extension) using sequence-specific primers and/or primer sets. Sequence specific anchors often can be used as sequence-specific primers.
  • MPS sequencing sometimes makes use of sequencing by synthesis and certain imaging processes.
  • a nucleic acid sequencing technology that may be used in a method described herein is sequencing-by-synthesis and reversible terminator-based sequencing (e.g., Illumina’s Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ 2500 (Illumina, San Diego, CA)). With this technology, millions of nucleic acid (e.g., DNA) fragments can be sequenced in parallel.
  • a flow cell is used which contains an optically transparent slide with 8 individual lanes on the surfaces of which are bound oligonucleotide anchors (e.g., adapter primers).
  • Sequencing by synthesis generally is performed by iteratively adding (e.g., by covalent addition) a nucleotide to a primer or preexisting nucleic acid strand in a template directed manner. Each iterative addition of a nucleotide is detected and the process is repeated multiple times until a sequence of a nucleic acid strand is obtained. The length of a sequence obtained depends, in part, on the number of addition and detection steps that are performed. In some embodiments of sequencing by synthesis, one, two, three or more nucleotides of the same type (e.g., A, G, C or T) are added and detected in a round of nucleotide addition.
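  • The iterative add-and-detect cycle described above can be illustrated with a toy model. This sketch assumes one complementary nucleotide is added and detected per cycle and ignores chemistry, signal detection, and errors; the function name and the convention that the template is given in the order it is traversed are assumptions made only for the example.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_by_synthesis(template: str, cycles: int) -> str:
    """Toy model: each cycle adds and 'detects' one nucleotide complementary
    to the next template position, so read length is bounded by the cycle count."""
    read = []
    for template_base in template[:cycles]:
        added = COMPLEMENT[template_base]  # template-directed addition
        read.append(added)                 # detection of the added nucleotide
    return "".join(read)


if __name__ == "__main__":
    print(sequence_by_synthesis("ATGCATGC", cycles=5))
```

  • In this simplified picture the read never exceeds the number of cycles performed, which mirrors the statement that the length of the sequence obtained depends in part on the number of addition and detection steps.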
  • Nucleotides can be added by any suitable method (e.g., enzymatically or chemically). For example, in some embodiments a polymerase or a ligase adds a nucleotide to a primer or to a preexisting nucleic acid strand in a template directed manner. In some embodiments of sequencing by synthesis, different types of nucleotides, nucleotide analogues and/or identifiers are used. In some embodiments, reversible terminators and/or removable (e.g., cleavable) identifiers are used. In some embodiments, fluorescent labeled nucleotides and/or nucleotide analogues are used.
  • sequencing by synthesis comprises a cleavage (e.g., cleavage and removal of an identifier) and/or a washing step.
  • a suitable method described herein or known in the art, non-limiting examples of which include any suitable imaging apparatus, a suitable camera, a digital camera, a CCD (Charge-Coupled Device) based imaging apparatus (e.g., a CCD camera), a CMOS (Complementary Metal-Oxide-Semiconductor) based imaging apparatus (e.g., a CMOS camera), a photo diode (e.g., a photomultiplier tube), electron microscopy, a field-effect transistor (e.g., a DNA field-effect transistor), an ISFET ion sensor (e.g., a CHEMFET sensor), the like or combinations thereof.
  • MPS platforms include ILLUMINA/SOLEXA/HISEQ (e.g., Illumina’s Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ), SOLiD, Roche/454, PACBIO and/or SMRT, Helicos True Single Molecule Sequencing, Ion Torrent and Ion semiconductor-based sequencing (e.g., as developed by Life Technologies), WildFire, 5500, 5500xl W and/or 5500xl W Genetic Analyzer based technologies (e.g., as developed and sold by Life Technologies, U.S. Patent Application Publication No.
  • nucleic acid is sequenced and the sequencing product (e.g., a collection of sequence reads) is processed prior to, or in conjunction with, an analysis of the sequenced nucleic acid.
  • sequence reads may be processed according to one or more of the following: aligning, mapping, filtering, counting, normalizing, weighting, generating a profile, and the like, and combinations thereof. Certain processing steps may be performed in any order and certain processing steps may be repeated.
  • Sequence reads can be mapped and the number of reads mapping to a specified nucleic acid region (e.g., a chromosome or portion thereof) are referred to as counts.
  • Any suitable mapping method e.g., process, algorithm, program, software, module, the like or combination thereof
  • mapping processes are described hereafter.
  • Mapping nucleotide sequence reads can be performed in a number of ways, and often comprises alignment of the obtained sequence reads with a matching sequence in a reference genome.
  • sequence reads generally are aligned to a reference sequence and those that align are designated as being “mapped,” as “a mapped sequence read” or as “a mapped read.”
  • a mapped sequence read is referred to as a “hit” or “count.”
  • mapped sequence reads are grouped together according to various parameters and assigned to particular genomic portions, which are discussed in further detail below.
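  • The grouping of mapped reads into counts per genomic portion can be sketched as below. The example assumes fixed-size, non-overlapping portions and mapped reads reduced to a (chromosome, position) pair; both are simplifying assumptions, and the function name is hypothetical.

```python
from collections import Counter


def count_reads_per_portion(mapped_reads, portion_size):
    """Count mapped reads in fixed-size genomic portions.

    mapped_reads: iterable of (chromosome, zero_based_position) tuples.
    Returns a Counter keyed by (chromosome, portion_index).
    """
    if portion_size <= 0:
        raise ValueError("portion_size must be positive")
    counts = Counter()
    for chromosome, position in mapped_reads:
        counts[(chromosome, position // portion_size)] += 1
    return counts


if __name__ == "__main__":
    reads = [("chr1", 10), ("chr1", 999), ("chr1", 1000), ("chr2", 5)]
    print(count_reads_per_portion(reads, portion_size=1000))
```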
  • aligning generally refer to two or more nucleic acid sequences that can be identified as a match (e.g., 100% identity) or partial match. Alignments can be done manually or by a computer (e.g., a software, program, module, or algorithm), non-limiting examples of which include the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the ILLUMINA Genomics Analysis pipeline. Alignment of a sequence read can be a 100% sequence match. In some instances, an alignment is less than a 100% sequence match (i.e., non-perfect match, partial match, partial alignment).
  • an alignment is about a 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76% or 75% match.
  • an alignment comprises a mismatch.
  • an alignment comprises 1, 2, 3, 4 or 5 mismatches. Two or more sequences can be aligned using either strand (e.g., sense or antisense strand).
  • a nucleic acid sequence is aligned with the reverse complement of another nucleic acid sequence.
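  • The notions of percent match, mismatch count, and alignment against either strand can be illustrated with a small sketch. It assumes an ungapped, equal-length comparison, which is far simpler than what alignment programs such as those named in this document do; all function names are hypothetical.

```python
_COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT_TABLE)[::-1]


def count_mismatches(a: str, b: str) -> int:
    """Number of mismatched positions in an ungapped, equal-length comparison."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions that match in an ungapped comparison."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return 100.0 * (len(a) - count_mismatches(a, b)) / len(a)


def best_strand(read: str, reference_segment: str):
    """Compare the read as given ("+") and as its reverse complement ("-");
    return the strand with fewer mismatches and that mismatch count."""
    forward = count_mismatches(read, reference_segment)
    reverse = count_mismatches(reverse_complement(read), reference_segment)
    return ("+", forward) if forward <= reverse else ("-", reverse)


if __name__ == "__main__":
    print(percent_identity("ATGC", "ATGA"))
    print(best_strand("GCAT", "ATGC"))
```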
  • sequence reads can be aligned with sequences in a reference genome.
  • sequence reads can be found and/or aligned with sequences in nucleic acid databases known in the art including, for example, GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Databank of Japan).
  • BLAST or similar tools can be used to search identified sequences against a sequence database. Search hits can then be used to sort the identified sequences into appropriate portions (described hereafter), for example.
  • reference genome can refer to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject.
  • a reference genome used for human subjects as well as many other organisms can be found at the National Center for Biotechnology Information at World Wide Web URL ncbi.nlm.nih.gov.
  • a “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
  • a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals.
  • a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals.
  • a reference genome comprises sequences assigned to chromosomes.
  • reads may be mapped to a reference genome by use of a suitable mapping and/or alignment program or algorithm, non-limiting examples of which include BWA (Li H. and Durbin R. (2009) Bioinformatics 25, 1754-60), Novoalign [Novocraft (2010)], Bowtie (Langmead B, et al., (2009) Genome Biol. 10:R25), SOAP2 (Li R, et al., (2009) Bioinformatics 25, 1966-67), BFAST (Homer N, et al., (2009) PLoS ONE 4, e7767), GASSST (Rizk, G. and Lavenier, D.
  • Reads that do not overlap or that do not overlap sufficiently can remain unmerged and be mapped as paired reads. Paired-end reads may be mapped and/or aligned using a suitable short read alignment program or algorithm.
  • short read alignment programs include BarraCUDA, BFAST, BLASTN, BLAT, Bowtie, BWA, CASHX, CUDA-EC, CUSHAW, CUSHAW2, drFAST, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP, Geneious Assembler, iSAAC, LAST, MAQ, mrFAST, mrsFAST, MOSAIK, MPscan, Novoalign, NovoalignCS, Novocraft, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RTG, Segemehl, SeqMap, Shrec,
  • Paired-end reads are often mapped to opposing ends of the same polynucleotide fragment, according to a reference genome.
  • read mates are mapped independently.
  • information from both sequence reads i.e., from each end
  • a reference genome is often used to determine and/or infer the sequence of nucleic acids located between paired-end read mates.
  • the term “discordant read pairs” as used herein refers to a paired-end read comprising a pair of read mates, where one or both read mates fail to unambiguously map to the same region of a reference genome defined, in part, by a segment of contiguous nucleotides.
  • discordant read pairs are paired-end read mates that map to unexpected locations of a reference genome.
  • unexpected locations of a reference genome include (i) two different chromosomes, (ii) locations separated by more than a predetermined fragment size (e.g., more than 300 bp, more than 500 bp, more than 1000 bp, more than 5000 bp, or more than 10,000 bp), (iii) an orientation inconsistent with a reference sequence (e.g., opposite orientations), the like or a combination thereof.
  • discordant read mates are identified according to a length (e.g., an average length, a predetermined fragment size) or expected length of template polynucleotide fragments in a sample. For example, read mates that map to a location that is separated by more than the average length or expected length of polynucleotide fragments in a sample are sometimes identified as discordant read pairs. Read pairs that map in opposite orientation are sometimes determined by taking the reverse complement of one of the reads and comparing the alignment of both reads using the same strand of a reference sequence. Discordant read pairs can be identified by any suitable method and/or algorithm known in the art or described herein (e.g., SVDetect, Lumpy, BreakDancer, BreakDancerMax, CREST, DELLY, the like or combinations thereof).
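  • The three criteria for discordant read pairs listed above (different chromosomes, separation beyond a predetermined fragment size, inconsistent orientation) can be sketched as a simple check. This is not how the named tools (e.g., SVDetect, Lumpy, BreakDancer) work internally; it assumes an inward-facing forward/reverse library as the expected orientation, measures separation between leftmost mapped positions, and uses a hypothetical `MateAlignment` record and a hypothetical default threshold of 500 bp.

```python
from dataclasses import dataclass


@dataclass
class MateAlignment:
    chrom: str   # chromosome the mate mapped to
    pos: int     # leftmost mapped position
    strand: str  # "+" or "-"


def discordance_reasons(mate1, mate2, max_fragment_size=500):
    """Return the reasons a read pair is discordant; an empty list means concordant."""
    if mate1.chrom != mate2.chrom:
        # Distance and orientation are not meaningful across chromosomes.
        return ["different_chromosomes"]
    reasons = []
    if abs(mate1.pos - mate2.pos) > max_fragment_size:
        reasons.append("separation_exceeds_fragment_size")
    # Assumed expected orientation: leftmost mate on "+", rightmost mate on "-".
    left, right = sorted((mate1, mate2), key=lambda mate: mate.pos)
    if not (left.strand == "+" and right.strand == "-"):
        reasons.append("unexpected_orientation")
    return reasons


if __name__ == "__main__":
    pair = (MateAlignment("chr1", 100, "+"), MateAlignment("chr1", 5000, "-"))
    print(discordance_reasons(*pair))
```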
  • noisy data refers to (a) data that has a significant variance between data points when analyzed or plotted, (b) data that has a significant standard deviation (e.g., greater than 3 standard deviations), (c) data that has a significant standard error of the mean, the like, and combinations of the foregoing.
  • noisy data sometimes occurs due to the quantity and/or quality of starting material (e.g., nucleic acid sample), and sometimes occurs as part of processes for preparing or replicating DNA used to generate sequence reads.
  • noise results from certain sequences being over represented when prepared using PCR-based methods. Methods described herein can reduce or eliminate the contribution of noisy data, and therefore reduce the effect of noisy data on the provided outcome.
  • one or more processing steps can comprise one or more normalization steps. Normalization can be performed by a suitable method described herein or known in the art. In certain embodiments, normalization comprises adjusting values measured on different scales to a notionally common scale. In certain embodiments, normalization comprises a sophisticated mathematical adjustment to bring probability distributions of adjusted values into alignment. In some embodiments, normalization comprises aligning distributions to a normal distribution. In certain embodiments, normalization comprises mathematical adjustments that allow comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences (e.g., error and anomalies). In certain embodiments, normalization comprises scaling. Normalization sometimes comprises division of one or more data sets by a predetermined variable or formula. Normalization sometimes comprises subtraction of one or more data sets by a predetermined variable or formula.
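  • Two of the normalization styles mentioned above, division by a predetermined quantity and adjustment to a common scale, can be sketched as follows. Dividing by the total count and converting to z-scores are illustrative choices only; the specification does not prescribe these particular formulas, and the function names are hypothetical.

```python
import statistics


def scale_by_total(counts):
    """Divide each count by the total so that values from runs of
    different sequencing depth fall on a common (fractional) scale."""
    total = sum(counts)
    if total == 0:
        raise ValueError("total count must be non-zero")
    return [count / total for count in counts]


def z_scores(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # requires at least two values
    if sd == 0:
        raise ValueError("standard deviation is zero")
    return [(value - mean) / sd for value in values]


if __name__ == "__main__":
    print(scale_by_total([1, 1, 2]))
    print(z_scores([1, 2, 3]))
```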
  • references are disease subjects (e.g., subjects with cancer). In some embodiments, references are healthy subjects (e.g., non-cancer subjects).
  • Non-limiting examples of statistical methods suitable for comparing data sets, relationships and/or profiles include Behrens-Fisher approach, bootstrapping, Fisher's method for combining independent tests of significance, Neyman-Pearson testing, confirmatory data analysis, exploratory data analysis, exact test, F-test, Z-test, T-test, calculating and/or comparing a measure of uncertainty, a null hypothesis, counternulls and the like, a chi-square test, omnibus test, calculating and/or comparing level of significance (e.g., statistical significance), a meta-analysis, a multivariate analysis, a regression, simple linear regression, robust linear regression, the like or combinations of the foregoing.
  • comparing two or more data sets, relationships and/or profiles comprises determining and/or comparing a measure of uncertainty.
  • a “measure of uncertainty” as used herein refers to a measure of significance (e.g., statistical significance), a measure of error, a measure of variance, a measure of confidence, the like or a combination thereof.
  • a measure of uncertainty can be a value (e.g., a threshold) or a range of values (e.g., an interval, a confidence interval, a Bayesian confidence interval, a threshold range).
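  • One example of a measure of uncertainty expressed as a range of values is a confidence interval for a mean. The sketch below assumes a normal approximation with a multiplier of 1.96 for roughly 95% confidence; this is one common choice rather than a method specified here, and the function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (lower, upper) bounds of an approximate confidence interval
    for the mean, using mean +/- z * standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean - z * standard_error, mean + z * standard_error


if __name__ == "__main__":
    print(mean_confidence_interval([1, 2, 3, 4, 5]))
```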
  • two or more data sets, relationships and/or profiles can be analyzed and/or compared by utilizing multiple (e.g., 2 or more) statistical methods (e.g., least squares regression, principal component analysis, linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, logistic regression and/or loss smoothing) and/or any suitable mathematical and/or statistical manipulations (e.g., referred to herein as manipulations).
  • Methods described herein can provide an outcome indicative of one or more characteristics of a sample or source described above. Methods described herein sometimes provide an outcome indicative of a phenotype and/or presence or absence of a medical condition for a test sample (e.g., providing an outcome determinative of the presence or absence of a medical condition and/or phenotype; providing an outcome determinative of the presence or absence of cancer). An outcome often is part of a classification process, and a classification (e.g., classification of one or more characteristics of a sample or source; and/or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample) sometimes is based on and/or includes an outcome.
  • An outcome and/or classification sometimes is based on and/or includes a result of data processing for a test sample that facilitates determining one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition in a classification process (e.g., a statistic value).
  • An outcome and/or classification sometimes includes or is based on a score determinative of, or a call of, one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition.
  • an outcome and/or classification includes a conclusion that predicts and/or determines one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation, genetic alteration, and/or medical condition in a classification process.
  • an outcome and/or classification comprises a plot (e.g., a profile plot).
  • multiple values are analyzed together, sometimes in a profile for such values (e.g., z-score profile, p-value profile, chi value profile, phi value profile, result of a t-test, value profile, the like, or combination thereof).
  • a consideration of probability can facilitate determining one or more characteristics of a sample or source and/or whether a subject is at risk of having, or has, a genotype, phenotype, genetic variation and/or medical condition, and an outcome and/or classification determinative of the foregoing sometimes includes such a consideration.
  • a statistics value indicative of probability, certainty and/or uncertainty e.g., standard deviation, median absolute deviation (MAD), measure of certainty, measure of confidence, measure of certainty or confidence that a value obtained for a test sample is inside or outside a particular range of values, measure of
  • An outcome and/or classification sometimes is expressed in a laboratory test report for particular test sample as a probability (e.g., odds ratio, p-value), likelihood, or risk factor, associated with the presence or absence of a genotype, phenotype, genetic variation and/or medical condition.
  • An outcome and/or classification for a test sample sometimes is provided as “positive” or “negative” with respect to a particular genotype, phenotype, genetic variation and/or medical condition (e.g., positive or negative for cancer).
  • an outcome and/or classification sometimes is designated as “positive” in a laboratory test report for a particular test sample where presence of a genotype, phenotype, genetic variation and/or medical condition is determined, and sometimes an outcome and/or classification is designated as “negative” in a laboratory test report for a particular test sample where absence of a genotype, phenotype, genetic variation and/or medical condition is determined.
  • An outcome and/or classification sometimes is determined and sometimes includes an assumption used in data processing.
  • Two measures of performance for a classification process can be calculated based on the ratios of these occurrences: (i) a sensitivity value, which generally is the fraction of actual positives that are correctly identified as being positive; and (ii) a specificity value, which generally is the fraction of actual negatives that are correctly identified as being negative.
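  • Sensitivity and specificity can be computed from classification counts as sketched below. The example uses the conventional confusion-matrix definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the four counts (true positives, false negatives, true negatives, false positives) are assumed to be the "occurrences" the passage refers to, and the function names are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positives that were classified as positive."""
    actual_positives = true_positives + false_negatives
    if actual_positives == 0:
        raise ValueError("no actual positives")
    return true_positives / actual_positives


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of actual negatives that were classified as negative."""
    actual_negatives = true_negatives + false_positives
    if actual_negatives == 0:
        raise ValueError("no actual negatives")
    return true_negatives / actual_negatives


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(sensitivity(true_positives=90, false_negatives=10))
    print(specificity(true_negatives=80, false_positives=20))
```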
  • a laboratory test report generated for a classification process includes a measure of test performance (e.g., sensitivity and/or specificity) and/or a measure of confidence (e.g., a confidence level, confidence interval).
  • a measure of test performance and/or confidence sometimes is obtained from a clinical validation study performed prior to performing a laboratory test for a test sample.
  • one or more of sensitivity, specificity and/or confidence are expressed as a percentage.
  • Coefficient of variation in some embodiments is expressed as a percentage, and sometimes the percentage is about 10% or less (e.g., about 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1%, or less than 1% (e.g., about 0.5% or less, about 0.1% or less, about 0.05% or less, about 0.01% or less)).
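  • A coefficient of variation expressed as a percentage can be computed as the standard deviation divided by the mean, multiplied by 100. The sketch below uses the sample standard deviation, which is an assumption; the function name is hypothetical.

```python
import statistics


def coefficient_of_variation_percent(values) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    mean = statistics.fmean(values)
    if mean == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return 100.0 * statistics.stdev(values) / mean


if __name__ == "__main__":
    print(coefficient_of_variation_percent([9, 10, 11]))
```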
  • a probability e.g., that a particular outcome and/or classification is not due to chance
  • a measured variance, confidence level, confidence interval, sensitivity, specificity and the like e.g., referred to collectively as confidence parameters
  • confidence parameters for an outcome and/or classification can be generated using one or more data processing manipulations described herein.
  • An outcome and/or classification for a test sample often is ordered by, and often is provided to, a health care professional or other qualified individual (e.g., physician or assistant) who transmits an outcome and/or classification to a subject from whom the test sample is obtained.
  • an outcome and/or classification is provided using a suitable visual medium (e.g., a peripheral or component of a machine, e.g., a printer or display).
  • a classification and/or outcome often is provided to a healthcare professional or qualified individual in the form of a report.
  • a report typically comprises a display of an outcome and/or classification (e.g., a value, one or more characteristics of a sample or source, or an assessment or probability of presence or absence of a genotype, phenotype, genetic variation and/or medical condition), sometimes includes an associated confidence parameter, and sometimes includes a measure of performance for a test used to generate the outcome and/or classification.
  • a report sometimes includes a recommendation for a follow-up procedure (e.g., a procedure that confirms the outcome or classification).
  • a report can be displayed in a suitable format that facilitates determination of presence or absence of a genotype, phenotype, genetic variation and/or medical condition by a health professional or other qualified individual.
  • Non-limiting examples of formats suitable for use for generating a report include digital data, a graph, a 2D graph, a 3D graph, and 4D graph, a picture (e.g., a jpg, bitmap (e.g., bmp), pdf, tiff, gif, raw, png, the like or suitable format), a pictograph, a chart, a table, a bar graph, a pie graph, a diagram, a flow chart, a scatter plot, a map, a histogram, a density chart, a function graph, a circuit diagram, a block diagram, a bubble map, a constellation diagram, a contour diagram, a cartogram, spider chart, Venn diagram, nomogram, and the like, or combination of the foregoing.
  • a report may be generated by a computer and/or by human data entry, and can be transmitted and communicated using a suitable electronic medium (e.g., via the internet, via computer, via facsimile, from one network location to another location at the same or different physical sites), or by another method of sending or receiving data (e.g., mail service, courier service and the like).
  • Non-limiting examples of communication media for transmitting a report include auditory file, computer readable file (e.g., pdf file), paper file, laboratory file, medical record file, or any other medium described in the previous paragraph.
  • a laboratory file or medical record file may be in tangible form or electronic form (e.g., computer readable form), in certain embodiments.
  • a report can be received by obtaining, via a suitable communication medium, a written and/or graphical representation comprising an outcome and/or classification, which upon review allows a healthcare professional or other qualified individual to make a determination as to one or more characteristics of a sample or source, or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample.
  • An outcome and/or classification may be provided by and obtained from a laboratory (e.g., obtained from a laboratory file).
  • a laboratory file can be generated by a laboratory that carries out one or more tests for determining one or more characteristics of a sample or source and/or presence or absence of a genotype, phenotype, genetic variation and/or medical condition for a test sample.
  • Laboratory personnel (e.g., a laboratory manager) can analyze information associated with test samples (e.g., test profiles, reference profiles, test values, reference values, level of deviation, patient information) underlying an outcome and/or classification.
  • laboratory personnel can re-run the same procedure using the same (e.g., aliquot of the same sample) or different test sample from a test subject.
  • a laboratory may be in the same location or different location (e.g., in another country) as personnel assessing the presence or absence of a genotype, phenotype, genetic variation and/or a medical condition from the laboratory file.
  • a laboratory file can be generated in one location and transmitted to another location in which the information for a test sample therein is assessed by a healthcare professional or other qualified individual, and optionally, transmitted to the subject from which the test sample was obtained.
  • provided herein are methods for diagnosing presence or absence of a genotype, phenotype, a genetic variation and/or a medical condition for a test sample according to an outcome or classification generated by methods described herein, and optionally according to generating and transmitting a laboratory report that includes a classification for presence or absence of the genotype, phenotype, a genetic variation and/or a medical condition for the test sample.
  • Certain processes and methods described herein e.g., mapping sequence reads, processing sequence read data, determining one or more characteristics of a sample based on sequence read data, generating k-mer data, processing k-mer data, determining one or more characteristics of a sample based on k-mer data
  • Methods described herein may be computer-implemented methods, and one or more portions of a method sometimes are performed by one or more processors (e.g., microprocessors), computers, systems, apparatuses, or machines (e.g., microprocessor-controlled machine).
  • Computers, systems, apparatuses, machines and computer program products suitable for use often include, or are utilized in conjunction with, computer readable storage media.
  • Non-limiting examples of computer readable storage media include memory, hard disk, CD-ROM, flash memory device and the like.
  • Computer readable storage media generally are computer hardware, and often are non-transitory computer-readable storage media.
  • Computer readable storage media are not computer readable transmission media, the latter of which are transmission signals per se.
  • systems, machines and apparatuses that include computer readable storage media with an executable program module stored thereon, where the program module instructs a microprocessor to perform part of a method described herein.
  • a computer program product often includes a computer usable medium that includes a computer readable program code embodied therein, the computer readable program code adapted for being executed to implement a method or part of a method described herein.
  • Computer usable media and readable program code are not transmission media (i.e., transmission signals per se).
  • Computer readable program code often is adapted for being executed by a processor, computer, system, apparatus, or machine.
  • methods described herein are performed by automated methods.
  • one or more steps of a method described herein are carried out by a microprocessor and/or computer, and/or carried out in conjunction with memory.
  • an automated method is embodied in software, modules, microprocessors, peripherals and/or a machine comprising the like, that perform methods described herein.
  • software refers to computer readable program instructions that, when executed by a microprocessor, perform computer operations, as described herein.
  • Machines, software and interfaces may be used to conduct methods described herein. Using machines, software and interfaces, a user may enter, request, query or determine options for using particular information, programs or processes (e.g., processing sequence read data, processing k-mer data, and/or providing an outcome), which can involve implementing statistical analysis algorithms, statistical significance algorithms, statistical algorithms, iterative steps, validation algorithms, and graphical representations, for example.
  • a data set may be entered by a user as input information, a user may download one or more data sets by suitable hardware media (e.g., flash drive), and/or a user may send a data set from one system to another for subsequent processing and/or providing an outcome (e.g., send sequence read data from a sequencer to a computer system for sequence read processing; send processed sequence read data to a computer system for further processing and/or yielding an outcome and/or report; send k-mer data to a computer system for k-mer processing; send processed k-mer data to a computer system for further processing and/or yielding an outcome and/or report).
  • a system includes one or more machines, which may be local or remote with respect to a user. More than one machine in one location or multiple locations may be accessed by a user, and data may be mapped and/or processed in series and/or in parallel. Thus, a suitable configuration and control may be utilized for mapping and/or processing data using multiple machines, such as in local network, remote network and/or "cloud" computing platforms.
  • Data may be input by a suitable device and/or method, including, but not limited to, manual input devices or direct data entry devices (DDEs).
  • manual devices include keyboards, concept keyboards, touch sensitive screens, light pens, mouse, tracker balls, joysticks, graphic tablets, scanners, digital cameras, video digitizers and voice recognition devices.
  • DDEs include bar code readers, magnetic strip codes, smart cards, magnetic ink character recognition, optical character recognition, optical mark recognition, and turnaround documents.
  • output from a sequencing apparatus or machine may serve as data that can be input via an input device.
  • sequence read information may serve as data that can be input via an input device.
  • mapped sequence reads may serve as data that can be input via an input device.
  • k-mer sequences may serve as data that can be input via an input device.
  • simulated data is generated by an in-silico process and the simulated data serves as data that can be input via an input device.
  • in silico refers to research and experiments performed using a computer. In silico processes include, but are not limited to, mapping sequence reads and processing mapped sequence reads according to processes described herein.
  • a system may include software useful for performing a process or part of a process described herein, and software can include one or more modules for performing such processes (e.g., sequencing module, logic processing module, data display organization module).
  • software refers to computer readable program instructions that, when executed by a computer, perform computer operations. Instructions executable by the one or more microprocessors sometimes are provided as executable code, that when executed, can cause one or more microprocessors to implement a method described herein.
  • a module described herein can exist as software, and instructions (e.g., processes, routines, subroutines) embodied in the software can be implemented or performed by a microprocessor.
  • a module e.g., a software module
  • a module can be a part of a program that performs a particular process or task.
  • the term “module” refers to a self-contained functional unit that can be used in a larger machine or software system.
  • a module can comprise a set of instructions for carrying out a function of the module.
  • a module can transform data and/or information.
  • Data and/or information can be in a suitable form.
  • data and/or information can be digital or analogue.
  • data and/or information sometimes can be packets, bytes, characters, or bits.
  • data and/or information can be any gathered, assembled or usable data or information.
  • a computer program product sometimes is embodied on a tangible computer-readable medium, and sometimes is tangibly embodied on a non-transitory computer-readable medium.
  • a module sometimes is stored on a computer readable medium (e.g., disk, drive) or in memory (e.g., random access memory).
  • a module and microprocessor capable of implementing instructions from a module can be located in a machine or in a different machine.
  • a module and/or microprocessor capable of implementing an instruction for a module can be located in the same location as a user (e.g., local network) or in a different location from a user (e.g., remote network, cloud system).
  • the modules can be located in the same machine, one or more modules can be located in different machines in the same physical location, and one or more modules may be located in different machines in different physical locations.
  • a machine comprises at least one microprocessor for carrying out the instructions in a module.
  • Sequence read data and/or k-mer data sometimes are accessed by a microprocessor that executes instructions configured to carry out a method described herein.
  • Sequence read data and/or k-mer data that are accessed by a microprocessor can be within memory of a system, and the sequence read data and/or k-mer data can be accessed and placed into the memory of the system after they are obtained.
  • a machine includes a microprocessor (e.g., one or more microprocessors) which microprocessor can perform and/or implement one or more instructions (e.g., processes, routines and/or subroutines) from a module.
  • a machine includes multiple microprocessors, such as microprocessors coordinated and working in parallel.
  • a machine operates with one or more external microprocessors (e.g., an internal or external network, server, storage device and/or storage network (e.g., a cloud)).
  • a machine comprises a module (e.g., one or more modules).
  • a machine comprising a module often is capable of receiving and transferring one or more of data and/or information to and from other modules.
  • a machine comprises peripherals and/or components.
  • a machine can comprise one or more peripherals or components that can transfer data and/or information to and from other modules, peripherals and/or components.
  • a machine interacts with a peripheral and/or component that provides data and/or information.
  • peripherals and components assist a machine in carrying out a function or interact directly with a module.
  • Non-limiting examples of peripherals and/or components include a suitable computer peripheral, I/O or storage method or device including but not limited to scanners, printers, displays (e.g., monitors, LED, LCD or CRTs), cameras, microphones, pads (e.g., ipads, tablets), touch screens, smart phones, mobile phones, USB I/O devices, USB mass storage devices, keyboards, a computer mouse, digital pens, modems, hard drives, jump drives, flash drives, a microprocessor, a server, CDs, DVDs, graphic cards, specialized I/O devices (e.g., sequencers, photo cells, photo multiplier tubes, optical readers, sensors, etc.), one or more flow cells, fluid handling components, network interface controllers, ROM, RAM, wireless transfer methods and devices (e.g., Bluetooth, WiFi, and the like), the world wide web (www), the internet, a computer and/or another module.
  • Software often is provided on a program product containing program instructions recorded on a computer readable medium, including, but not limited to, magnetic media including floppy disks, hard disks, and magnetic tape; and optical media including CD-ROM discs, DVD discs, magneto-optical discs, flash memory devices (e.g., flash drives), RAM, floppy discs, the like, and other such media on which the program instructions can be recorded.
  • a server and web site maintained by an organization can be configured to provide software downloads to remote users, or remote users may access a remote system maintained by an organization to remotely access software. Software may obtain or receive input information.
  • Software may include a module that specifically obtains or receives data (e.g., a data receiving module that receives sequence read data, mapped read data, and/or k-mer data) and may include a module that specifically processes the data (e.g., a processing module that processes received data (e.g., filters, normalizes, provides an outcome and/or report)).
  • “Obtaining” and “receiving” input information refer to receiving data (e.g., sequence reads, mapped reads, k-mers) by computer communication means from a local or remote site, by human data entry, or by any other method of receiving data.
  • the input information may be generated in the same location at which it is received, or it may be generated in a different location and transmitted to the receiving location.
  • input information is modified before it is processed (e.g., placed into a format amenable to processing (e.g., tabulated)).
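The receiving and tabulating steps described above can be illustrated with a minimal sketch. The tab-separated record layout (`sample`, `kmer`, `count`) and the function names `receive_kmer_records` and `tabulate` are assumptions made for illustration; the source does not specify a file format or module interface.

```python
from collections import defaultdict


def receive_kmer_records(lines):
    """Data receiving step: parse 'sample<TAB>kmer<TAB>count' text lines.

    The three-column layout is a hypothetical input format, not one
    defined by the source document.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sample, kmer, count = line.split("\t")
        yield sample, kmer.upper(), int(count)


def tabulate(records):
    """Processing step: place records into a per-sample k-mer count table."""
    table = defaultdict(dict)
    for sample, kmer, count in records:
        table[sample][kmer] = table[sample].get(kmer, 0) + count
    return dict(table)


raw_lines = ["# comment", "s1\tac\t3", "s1\tAC\t2", "s2\tGT\t5", ""]
kmer_table = tabulate(receive_kmer_records(raw_lines))
```

Keeping the receiving and processing steps in separate functions mirrors the separate receiving and processing modules described in the text, so either step could be swapped out independently.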
  • Software can include one or more algorithms in certain embodiments.
  • An algorithm may be used for processing data and/or providing an outcome or report according to a finite sequence of instructions.
  • An algorithm often is a list of defined instructions for completing a task. Starting from an initial state, the instructions may describe a computation that proceeds through a defined series of successive states, eventually terminating in a final ending state. The transition from one state to the next is not necessarily deterministic (e.g., some algorithms incorporate randomness).
  • an algorithm can be a search algorithm, sorting algorithm, merge algorithm, numerical algorithm, graph algorithm, string algorithm, modeling algorithm, computational geometric algorithm, combinatorial algorithm, machine learning algorithm, cryptography algorithm, data compression algorithm, parsing algorithm and the like.
  • An algorithm can include one algorithm or two or more algorithms working in combination.
  • An algorithm can be of any suitable complexity class and/or parameterized complexity.
  • An algorithm can be used for calculation and/or data processing, and in some embodiments, can be used in a deterministic or probabilistic/predictive approach.
  • An algorithm can be implemented in a computing environment by use of a suitable programming language, non-limiting examples of which are C, C++, Java, Perl, Python, Fortran, and the like.
  • an algorithm can be configured or modified to include margins of error, statistical analysis, statistical significance, and/or comparison to other information or data sets (e.g., applicable when using a neural net or clustering algorithm).
  • several algorithms may be implemented for use in software. These algorithms can be trained with raw data in some embodiments. For each new raw data sample, the trained algorithms may produce a representative processed data set or outcome. A processed data set sometimes is of reduced complexity compared to the parent data set that was processed. Based on a processed set, the performance of a trained algorithm may be assessed based on sensitivity and specificity, in some embodiments. An algorithm with the highest sensitivity and/or specificity may be identified and utilized, in certain embodiments.
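As a rough illustration of assessing trained algorithms by sensitivity and specificity, the sketch below scores several sets of predictions against known labels and picks one. Ranking by the sum of sensitivity and specificity is an assumption made here; the source only says the algorithm with the highest sensitivity and/or specificity may be identified. The function names are hypothetical, and the code assumes both classes are present in the labels.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


def pick_best(truth, candidates):
    """Pick the candidate with the largest sensitivity + specificity.

    candidates maps an algorithm name to its list of predictions.
    The additive ranking rule is an illustrative assumption.
    """
    def combined(name):
        sens, spec = sensitivity_specificity(truth, candidates[name])
        return sens + spec

    return max(candidates, key=combined)


truth = [1, 1, 0, 0]
candidates = {
    "algorithm_a": [1, 1, 0, 1],  # sensitive but less specific
    "algorithm_b": [1, 0, 0, 0],  # specific but less sensitive
    "algorithm_c": [1, 1, 0, 0],  # matches the known labels
}
best = pick_best(truth, candidates)
```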
  • simulated (or simulation) data can aid data processing, for example, by training an algorithm or testing an algorithm.
  • simulated data includes various hypothetical samplings of different groupings of sequence reads and/or k-mers. Simulated data may be based on what might be expected from a real population or may be skewed to test an algorithm and/or to assign a correct classification. Simulated data also is referred to herein as “virtual” data. Simulations can be performed by a computer program in certain embodiments. One possible step in using a simulated data set is to evaluate the confidence of identified results, e.g., how well a random sampling matches or best represents the original data.
  • p-value a probability value
  • an empirical model may be assessed, in which it is assumed that at least one sample matches a reference sample (with or without resolved variations).
  • another distribution such as a Poisson distribution for example, can be used to define the probability distribution.
  • a system may include one or more microprocessors in certain embodiments.
  • a microprocessor can be connected to a communication bus.
  • a computer system may include a main memory, often random-access memory (RAM), and can also include a secondary memory.
  • Memory in some embodiments comprises a non-transitory computer-readable storage medium.
  • Secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, memory card and the like.
  • a removable storage drive often reads from and/or writes to a removable storage unit.
  • Non-limiting examples of removable storage units include a floppy disk, magnetic tape, optical disk, and the like, which can be read by and written to by, for example, a removable storage drive.
  • a removable storage unit can include a computer-usable storage medium having stored therein computer software and/or data.
  • a microprocessor may implement software in a system.
  • a microprocessor may be programmed to automatically perform a task described herein that a user could perform. Accordingly, a microprocessor, or algorithm conducted by such a microprocessor, can require little to no supervision or input from a user (e.g., software may be programmed to implement a function automatically).
  • the complexity of a process is so large that a single person or group of persons could not perform the process in a timeframe short enough for determining one or more characteristics of a sample.
  • secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system.
  • a system can include a removable storage unit and an interface device.
  • Non-limiting examples of such systems include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units and interfaces that allow software and data to be transferred from the removable storage unit to a computer system.
  • one entity maps sequence reads to a reference genome, and utilizes the mapped reads in a method, system, machine, apparatus or computer program product described herein.
  • Sequence reads mapped to a reference genome sometimes are transferred by one entity to a second entity for use by the second entity in a method, system, machine, apparatus or computer program product described herein, in certain embodiments.
  • one entity generates sequence reads and a second entity maps those sequence reads to a reference genome.
  • the second entity sometimes utilizes the mapped reads in a method, system, machine or computer program product described herein.
  • the second entity transfers the mapped reads to a third entity
  • the third entity utilizes the mapped reads in a method, system, machine or computer program product described herein.
  • the third entity sometimes is the same as the first entity. That is, the first entity sometimes transfers sequence reads to a second entity, which second entity can map sequence reads to a reference genome, and the second entity can transfer the mapped reads to a third entity.
  • a third entity sometimes can utilize the mapped reads in a method, system, machine or computer program product described herein, where the third entity sometimes is the same as the first entity, and sometimes the third entity is different from the first or second entity.
  • one entity obtains a sample from a subject, optionally isolates nucleic acid from the sample (e.g., from plasma or urine), and transfers the sample or nucleic acid to a second entity that generates sequence reads from the nucleic acid.
  • Systems, methods, and data structures described herein are operational with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • Any type of computer-readable media that can store data that is accessible by a computer such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the operating environment.
  • a system comprising one or more microprocessors and memory, which memory comprises instructions executable by the one or more microprocessors and which memory comprises nucleic acid sequence reads mapped to a reference genome, where the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject, and which instructions executable by the one or more microprocessors are configured to a) generate a k-mer profile for the subject, where the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and b) detect the presence or absence of cancer in the subject according to the k-mer profile generated in (a).
  • a non-transitory computer-readable storage medium with an executable program stored thereon, where the program instructs a microprocessor to perform the following: a) access nucleic acid sequence reads mapped to a reference genome, where the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject; b) generate a k-mer profile for the subject, where the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and c) detect the presence or absence of cancer in the subject according to the k-mer profile generated in (b).
  • a method comprising: a) obtaining nucleic acid sequence reads mapped to a reference genome, wherein the sequence reads are reads of single-stranded cell-free nucleic acid from a test sample from a subject; b) generating a k-mer profile for the subject, wherein the profile comprises a plurality of k-mer species at, adjacent to, and/or near a plurality of sequence read-genome junctions; and c) detecting the presence or absence of cancer in the subject according to the k-mer profile generated in (b).
  • A1.2. The method of embodiment A1 or A1.1, further comprising after (b) quantifying the plurality of k-mer species, thereby generating k-mer species quantifications.
  • A1.5. The method of embodiment A1.4, wherein the random forest classifier is trained on PWMs from samples in a training set.
  • A1.6. The method of embodiment A1.5, wherein the training set comprises known cancer samples and non-cancer samples.
  • A5. The method of any one of embodiments A1 to A1.7, wherein the plurality of k-mer species comprises two or more of 2-mers, 3-mers, and 4-mers.
  • A7. The method of any one of embodiments A1-A6, wherein the plurality of k-mer species comprise k-mers from the read sequence, from the genome sequence, and/or from the read-genome junction sequence.
  • A8. The method of any one of embodiments A1-A7, wherein the plurality of k-mer sequences comprise a plurality of k-mer species that are informative for cancer.
  • sequence reads comprise reads from genome-wide sequencing.
  • A11. The method of any one of embodiments A1-A10, wherein the sequence reads comprise reads that map to CpG islands.
  • A12. The method of any one of embodiments A1-A11, further comprising prior to (a), producing a library from the single-stranded cell-free nucleic acid.
  • the method of embodiment A12, wherein producing the library comprises combining the single-stranded cell-free nucleic acid with a first set of sequencing adapters, or components thereof, and a second set of sequencing adapters, or components thereof.
  • a molecule of the first scaffold polynucleotide species is hybridized to a first single-stranded nucleic acid terminal region and a molecule of the first oligonucleotide, and
  • a molecule of the second scaffold polynucleotide species is hybridized to a second single-stranded nucleic acid terminal region and a molecule of the second oligonucleotide.
  • A17. The method of embodiment A16, wherein the sequencing is non-targeted sequencing.
  • A21. The method of any one of embodiments A16-A20, wherein the sequencing is performed at about 0.0025-fold coverage or more.
  • A21.1. The method of any one of embodiments A16-A20, wherein the sequencing is performed at about 0.0005-fold coverage or more.
  • A25. The method of any one of embodiments A16-A24, further comprising mapping the sequence reads to the reference genome.
  • test sample comprises plasma.
  • A28. The method of any one of embodiments A1-A27, further comprising after (c) providing a report of the presence or absence of cancer in the subject.
  • A29. The method of any one of embodiments A1-A28, wherein the cancer is prostate cancer.
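To give a sense of scale for the coverage thresholds in embodiments A21 and A21.1, the sketch below converts fold coverage into an approximate read count using the usual relation coverage = (number of reads × read length) / genome size. The genome size (3.1 billion bases) and read length (150 bases) are illustrative assumptions, not values stated in the source.

```python
import math


def reads_for_coverage(fold_coverage, genome_size_bp, read_length_bp):
    """Smallest whole number of reads giving at least the requested coverage."""
    return math.ceil(fold_coverage * genome_size_bp / read_length_bp)


# Assumed values for illustration only.
GENOME_SIZE_BP = 3_100_000_000
READ_LENGTH_BP = 150

reads_a21 = reads_for_coverage(0.0025, GENOME_SIZE_BP, READ_LENGTH_BP)
reads_a21_1 = reads_for_coverage(0.0005, GENOME_SIZE_BP, READ_LENGTH_BP)
```

Under these assumptions, 0.0025-fold coverage corresponds to roughly 52 thousand reads and 0.0005-fold coverage to roughly 10 thousand reads.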
  • Example 1 K-mer analysis of prostate cancer samples
  • Fig. 1 shows an example workflow for identifying cancer based on a fragmentomics and k-mer data analysis. For a given test sample of unknown cancer status, whole genome sequencing is performed on a SRSLY library prepared from plasma cfDNA in the test sample. The sample undergoes bioinformatic processing and then is put through a k-mer script described herein. The resulting position weight matrix (PWM) from the test sample is input to a random forest classifier that has been trained on the PWM outputs from known prostate cancer and healthy samples. The classifier outputs the predicted health status of the test sample.
  • Fig. 5B shows the samples do not cluster solely by collection center (i.e., in-house samples vs. university samples).
  • Fig. 6A shows 2-mer features that separate prostate cancer plasma samples (university samples and publicly available single-stranded prostate cancer sample data) from healthy plasma samples (in-house samples).
  • Fig. 6B shows publicly available single-stranded prostate cancer sample data (generated from non-SRSLY libraries) cluster together.
  • the heatmap in Fig. 7A shows how the k-mers correlate with the principal components. The greatest separation of cancer and healthy samples is observed when plotting PC2 vs PC3, highlighted by the dashed boxes.
  • Fig. 7B shows a principal component analysis (PCA) with clear separation of prostate cancer samples (triangle) from healthy samples (circle) when plotting PC2 vs PC3.
  • Figs. 8A and 8B show the same analysis as Figs. 7A and 7B except not restricted to CpG islands.
  • the separation in the principal component analysis (PCA) is not as clear when looking at the whole genome rather than restricting the analysis to the k-mers within CpG islands.
  • the sequence is iterated over in segments of length K to extract the K+1 k-mers.
  • a count table is then updated that keeps track of the different k-mer counts at each possible position for read1 and read2.
  • the count table is converted to a position weight matrix (PWM) by dividing the total reads by the background probability of each k-mer (1/4^k) in each table cell.
  • the log2 of each entry is then calculated.
  • the resulting PWM is used for the random forest analysis and PCA.
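The count table and PWM conversion described above can be sketched as follows. This is one reading of the text, not the authors' script: it assumes each cell is the observed k-mer frequency at a position (count divided by total reads at that position), divided by the uniform background probability 1/4^k, followed by log2. The function names are hypothetical, and k-mers never observed at a position are simply left out of the matrix.

```python
import math
from collections import defaultdict


def build_count_table(kmers_per_fragment):
    """Count k-mers by position.

    kmers_per_fragment: iterable of lists, where element i of each list
    is the k-mer retrieved at position i for one fragment end.
    """
    table = defaultdict(lambda: defaultdict(int))  # table[position][kmer]
    for kmers in kmers_per_fragment:
        for position, kmer in enumerate(kmers):
            table[position][kmer] += 1
    return table


def counts_to_pwm(table, k):
    """Convert counts to log2(frequency / background), background = 1/4^k."""
    background = 1.0 / 4 ** k
    pwm = {}
    for position, counts in table.items():
        total = sum(counts.values())
        pwm[position] = {
            kmer: math.log2((count / total) / background)
            for kmer, count in counts.items()
        }
    return pwm


# Four fragment ends, two positions each, with k = 1.
table = build_count_table([["A", "T"], ["A", "T"], ["A", "G"], ["C", "T"]])
pwm = counts_to_pwm(table, k=1)
```

A separate table could be built for read1 and for read2 by calling `build_count_table` once per read end. Positive entries mark k-mers that are enriched at a position relative to the uniform background; an entry of 0 means the k-mer occurs at exactly the background rate.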
  • a method was developed to classify between healthy and prostate cancer individuals using a random forest classifier based on short K length sequences (k-mers) obtained from the fragment ends of the sequenced cell-free DNA from plasma samples.
  • Plasma cfDNA samples were analyzed from 43 prostate cancer patients and 83 healthy individuals that were prepared using SRSLY, a single-stranded library preparation method that retains the native sequence on the 5' and 3' ends unlike traditional library preparation methods that lose the 3' end sequence during the prerequisite end-repair step.
  • the libraries were sequenced to a depth of 5-300 million reads per sample. Then, k-mers were retrieved - specifically all K length sequences between K bases into the read, and K bases into the genome.
  • Fig. 29A shows a comparison of log odds ratio for nucleotides at the 5' and 3' ends of read fragments.
  • the top panels show results for the 5' and 3' ends of cfDNA captured using SRSLY.
  • the bottom panels show results from a typical double stranded prep and the effects of end repair which generates blunt-ended molecules making the 5' and 3' end appear complementary to each other, thereby losing information that is distinct between the two ends.
  • Fig. 29B shows k-mer retrieval from fragment ends of cfDNA. The equation shows the overall number of k-mers retrieved for a given size of K. As K increases, the number of read and genomic k-mers stays constant but the number of junction-spanning k-mers increases.
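The retrieval described above can be sketched under one assumption about the window: the K genomic bases immediately outside the fragment end are joined to the first K bases of the read, and every K-length substring of that 2K-base window is taken. This yields K+1 k-mers: one entirely genomic, one entirely within the read, and K-1 spanning the junction, which is consistent with the counts stated in the text. The function name and the 5'-end orientation (genome flank on the left) are illustrative choices; a 3' end would place the genomic flank on the other side.

```python
def junction_kmers(genome_flank, read_start, k):
    """Return (origin, kmer) pairs across a read-genome junction.

    genome_flank: genomic bases ending at the junction (outside the fragment).
    read_start:   read bases beginning at the junction (inside the fragment).
    origin is 'genome', 'junction', or 'read'.
    """
    if len(genome_flank) < k or len(read_start) < k:
        raise ValueError("need at least k bases on each side of the junction")
    window = genome_flank[-k:] + read_start[:k]  # 2k bases centered on junction
    kmers = []
    for offset in range(k + 1):
        kmer = window[offset:offset + k]
        if offset == 0:
            origin = "genome"
        elif offset == k:
            origin = "read"
        else:
            origin = "junction"
        kmers.append((origin, kmer))
    return kmers


example = junction_kmers("ACGT", "TTGA", 3)
```

With K = 1 the window holds only one genomic and one read base, so no junction-spanning k-mers exist; each increment of K adds one more.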
  • MDS myelodysplastic syndrome
  • Fig. 33 shows that the 3' end of cfDNA contains a robust signal for prostate cancer classification. 4-fold classification was performed 20 times and the area under the precision-recall curve was captured at each iteration. A random forest model was built using 1- to 4-mers from the 5' or 3' ends. We observed that the 3' cfDNA fragment ends consistently demonstrated better classification performance (AUPRC >0.92) than the 5' end of the fragment, a signal that is retained in SRSLY.
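The evaluation loop described above (4-fold classification repeated 20 times, with the area under the precision-recall curve recorded each time) can be sketched as below. Several choices are assumptions: AUPRC is estimated as average precision, held-out scores from the four folds are pooled before scoring each repeat, and the model is supplied through a hypothetical `fit_predict` callback. No random forest is implemented here; the usage example plugs in a trivial placeholder scorer.

```python
import random


def average_precision(labels, scores):
    """Estimate AUPRC as the mean precision at each positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / positives


def stratified_folds(labels, n_folds, rng):
    """Shuffle indices within each class and deal them across the folds."""
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % n_folds].append(i)
    return folds


def repeated_cv_auprc(features, labels, fit_predict, n_folds=4, n_repeats=20, seed=0):
    """Return one pooled held-out AUPRC estimate per repeat."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        scores = [0.0] * len(labels)
        for held_out in stratified_folds(labels, n_folds, rng):
            held = set(held_out)
            train = [i for i in range(len(labels)) if i not in held]
            predicted = fit_predict(
                [features[i] for i in train],
                [labels[i] for i in train],
                [features[i] for i in held_out],
            )
            for i, score in zip(held_out, predicted):
                scores[i] = score
        results.append(average_precision(labels, scores))
    return results


# Placeholder scorer: returns the single feature value as the score.
toy_features = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
auprcs = repeated_cv_auprc(toy_features, toy_labels, lambda xtr, ytr, xte: xte)
```

In practice `fit_predict` would train a classifier on the training PWM features and return predicted cancer probabilities for the held-out samples.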

Abstract

The present invention relates in part to methods and compositions for analyzing nucleic acid. In certain aspects, the invention relates to methods and compositions for preparing a nucleic acid library from single-stranded nucleic acid fragments and analyzing fragment end sequences. In certain aspects, the invention also relates to identifying a disease based on fragment end sequence analysis. In certain aspects, the invention further relates to identifying a disease based on k-mer analysis.
PCT/US2025/021867 2024-04-03 2025-03-27 Methods and compositions for analyzing nucleic acid Pending WO2025212384A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202463574121P 2024-04-03 2024-04-03
US63/574,121 2024-04-03
US202463715333P 2024-11-01 2024-11-01
US63/715,333 2024-11-01

Publications (1)

Publication Number Publication Date
WO2025212384A2 (fr) 2025-10-09

Family

ID=97267871

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/021867 Pending WO2025212384A2 (fr) 2024-04-03 2025-03-27 Methods and compositions for analyzing nucleic acid

Country Status (1)

Country Link
WO (1) WO2025212384A2 (fr)

Similar Documents

Publication Publication Date Title
JP7542672B2 (ja) Methods and compositions for analyzing nucleic acid
EP3947723B1 (fr) Methods and compositions for analyzing nucleic acids
EP4428244B1 (fr) Methods and compositions for analyzing nucleic acid
CA3049455C (fr) Sequencing adapter manufacture and use
CA3049682A1 (fr) Methods for non-invasive assessment of genetic alterations
US20230014607A1 (en) Methods and compositions for analyzing nucleic acid
JPWO2021262805A5 (fr)
WO2025212384A2 (fr) Methods and compositions for analyzing nucleic acid
US20250019693A1 (en) Methods and compositions for analyzing nucleic acid
EP4185715A1 (fr) Cancer detection, monitoring, and reporting from sequencing cell-free DNA
US20240150825A1 (en) Methods and compositions for analyzing nucleic acid
WO2024054517A1 (fr) Methods and compositions for analyzing nucleic acid
EP4724637A1 (fr) Methods and compositions for analyzing nucleic acid

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 25782630

Country of ref document: EP

Kind code of ref document: A2