WO2021202910A1 - Methods and systems for determining pigmentation phenotypes - Google Patents
- Publication number
- WO2021202910A1 (PCT/US2021/025433)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dogs
- phenotype
- pigmentation
- canine subject
- dog
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/124—Animal traits, i.e. production traits, including athletic performance or the like
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Definitions
- Consumer genomics may enable genetic discovery on an unprecedented scale by linking very large databases of personal genomic data with phenotype information voluntarily submitted via web-based surveys. These databases may have a transformative effect on human genomics research, yielding insights on increasingly complex traits, behaviors, and disease by including many thousands of individuals in genome-wide association studies (GWAS).
- the promise of consumer genomic data may not be limited to human research, however. Genomic tools for canine subjects (e.g., dogs) may be readily available, with hundreds of causal Mendelian variants already characterized, because selection and breeding may lead to dramatic phenotypic diversity underlain by a simple genetic structure.
- the present disclosure provides methods, systems, and media for determining a pigmentation phenotype of a canine subject.
- the present disclosure provides a computer-implemented method for determining a pigmentation phenotype of a canine subject, comprising (a) receiving genotype data for the canine subject, wherein the genotype data comprises quantitative values of each of a plurality of genetic markers, wherein the plurality of genetic markers comprises genetic variants; (b) applying a trained machine learning classifier to the genotype data to determine a predicted pigmentation phenotype based at least in part on the quantitative values of the plurality of genetic variants; and (c) identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 70%.
- the canine subject is a dog.
- the dog is a purebred dog or a mixed breed dog.
- the dog is a purebred dog.
- the purebred dog is selected from Labrador retriever and golden retriever.
- the dog is a mixed breed dog.
- the dog has a breed selected from Labrador retriever and golden retriever.
- the genotype data is obtained by assaying a biological sample obtained from the canine subject.
- the biological sample comprises a blood sample, a saliva sample, a swab sample, a cell sample, or a tissue sample.
- the assaying comprises sequencing the biological sample or derivatives thereof.
- the plurality of genetic markers comprises at least 5 distinct genetic markers.
- the plurality of genetic markers comprises at least 10 distinct genetic markers.
- the quantitative values are indicative of a presence or absence in the genotype data of each of the plurality of genetic variants.
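As an illustration of quantitative values that indicate presence or absence of a variant, the following minimal sketch encodes genotype calls as 0/1 indicators. The marker names, genotype strings, and variant alleles are hypothetical, and the allele-count (dosage) coding is shown only as one possible alternative, not as the encoding used in the disclosure.

```python
from typing import Dict, Tuple


def encode_presence(genotype: str, variant_allele: str) -> int:
    """Return 1 if the variant allele appears in the genotype call, else 0."""
    return int(variant_allele in genotype)


def encode_dosage(genotype: str, variant_allele: str) -> int:
    """Return the number of copies (0, 1, or 2) of the variant allele."""
    return genotype.count(variant_allele)


# Hypothetical calls for one dog: marker name -> (genotype, variant allele).
calls: Dict[str, Tuple[str, str]] = {
    "marker_1": ("AG", "G"),
    "marker_2": ("CC", "T"),
    "marker_3": ("TT", "T"),
}

presence = {m: encode_presence(g, a) for m, (g, a) in calls.items()}
dosage = {m: encode_dosage(g, a) for m, (g, a) in calls.items()}
print(presence)
print(dosage)
```

Either dictionary can serve as the per-marker quantitative input to a downstream classifier.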
- the plurality of genetic variants is selected from the group consisting of single nucleotide polymorphisms (SNPs), insertions or deletions (indels), microsatellites, and structural variants.
- the pigmentation phenotype comprises a coat color intensity phenotype, a ticking phenotype, a roaning phenotype, or a tongue pigmentation phenotype.
- the pigmentation phenotype comprises a coat color intensity phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 8.
- the plurality of genetic markers comprises one or more SNPs of a genetic locus selected from canFam3.1 chr2: 74.7Mb, chr20: 55.8Mb, and chr21: 10.9Mb. In some embodiments, the plurality of genetic markers comprises canFam3.1 chr2: 74.7Mb or chr21: 10.9Mb. In some embodiments, the plurality of genetic markers comprises canFam3.1 chr2: 74.7Mb and chr21: 10.9Mb. In some embodiments, the pigmentation phenotype comprises a ticking phenotype. In some embodiments, the pigmentation phenotype comprises a roaning phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 11.
- the pigmentation phenotype comprises a tongue pigmentation phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 13.
- applying the trained machine learning classifier comprises determining a weighted sum of the quantitative values of the plurality of genetic markers.
- the weighted sum is determined using a plurality of pre-determined weights associated with the plurality of genetic markers.
- the plurality of pre-determined weights associated with the plurality of genetic markers is determined by performing a genome-wide association study (GWAS) comprising a multiple linear regression.
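The weighted sum described above can be sketched as follows. The marker names, weights, and genotype values are hypothetical placeholders. In practice the weights would come from the regression fit, which is not reproduced here.

```python
from typing import Mapping


def weighted_sum(values: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Sum weight * value over every marker that has a pre-determined weight."""
    missing = set(weights) - set(values)
    if missing:
        raise KeyError(f"missing marker values: {sorted(missing)}")
    return sum(weights[m] * values[m] for m in weights)


# Hypothetical pre-determined weights (e.g., regression coefficients).
weights = {"marker_1": 0.5, "marker_2": -0.25, "marker_3": 1.0}
# Hypothetical quantitative values for one dog (here, allele counts).
values = {"marker_1": 2, "marker_2": 1, "marker_3": 0}

score = weighted_sum(values, weights)
print(score)  # 0.75
```

Raising an error on a missing marker, rather than silently treating it as zero, is a design choice of this sketch so that incomplete genotype data is not mistaken for an absent variant.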
- applying the trained machine learning classifier comprises applying a multiple logistic regression to the quantitative values of the plurality of genetic markers.
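A minimal sketch of applying a multiple logistic regression to marker values is shown below. The coefficients, intercept, marker names, and the 0.5 decision threshold are all assumptions chosen for illustration and are not taken from the disclosure.

```python
import math
from typing import Mapping


def predict_probability(
    values: Mapping[str, float],
    coefficients: Mapping[str, float],
    intercept: float,
) -> float:
    """Logistic model: sigmoid of intercept plus the weighted sum of marker values."""
    z = intercept + sum(coefficients[m] * values[m] for m in coefficients)
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(
    values: Mapping[str, float],
    coefficients: Mapping[str, float],
    intercept: float,
    threshold: float = 0.5,
) -> bool:
    """Return True when the predicted probability reaches the threshold."""
    return predict_probability(values, coefficients, intercept) >= threshold


# Hypothetical fitted coefficients and one dog's marker values.
coefficients = {"marker_1": 1.0, "marker_2": -1.0}
p_balanced = predict_probability({"marker_1": 1, "marker_2": 1}, coefficients, 0.0)
p_high = predict_probability({"marker_1": 2, "marker_2": 0}, coefficients, 0.0)
print(p_balanced, round(p_high, 3))
```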
- the method further comprises determining a second pigmentation phenotype of a second canine subject, and determining an expected range of pigmentation phenotypes of a potential offspring of the canine subject and the second canine subject. In some embodiments, the method further comprises determining a recommendation indicative of whether or not to breed the first canine subject and the second canine subject together, based on the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject.
- the method further comprises determining a recommendation indicative of breeding the first canine subject and the second canine subject together, when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject includes a pre-determined pigmentation phenotype. In some embodiments, the method further comprises determining a recommendation against breeding the first canine subject and the second canine subject together, when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject does not include a pre-determined pigmentation phenotype.
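The disclosure does not spell out how the expected range is computed. One way to sketch it, assuming an additive weighted-sum score, Mendelian transmission of one allele per parent at each marker, and independent markers, is to enumerate the possible offspring allele counts per marker and sum the smallest and largest contributions. All names, weights, and genotypes below are hypothetical.

```python
from itertools import product
from typing import Mapping, Set, Tuple


def transmitted(dosage: int) -> Set[int]:
    """Allele counts (0 or 1) that a parent with this dosage can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[dosage]


def offspring_dosages(d1: int, d2: int) -> Set[int]:
    """All allele counts an offspring can inherit at one marker."""
    return {a + b for a, b in product(transmitted(d1), transmitted(d2))}


def expected_score_range(
    parent1: Mapping[str, int],
    parent2: Mapping[str, int],
    weights: Mapping[str, float],
) -> Tuple[float, float]:
    """Lowest and highest additive score achievable by a potential offspring."""
    low = high = 0.0
    for marker, w in weights.items():
        contributions = [w * d for d in offspring_dosages(parent1[marker], parent2[marker])]
        low += min(contributions)
        high += max(contributions)
    return low, high


def recommend_breeding(score_range: Tuple[float, float], target_score: float) -> bool:
    """Recommend the pairing only if the target score lies inside the range."""
    low, high = score_range
    return low <= target_score <= high


weights = {"m1": 1.0, "m2": -0.5}          # hypothetical weights
dog_a = {"m1": 1, "m2": 2}                 # hypothetical allele counts
dog_b = {"m1": 0, "m2": 1}
score_range = expected_score_range(dog_a, dog_b, weights)
print(score_range, recommend_breeding(score_range, 0.25))
```

Here the pre-determined phenotype is represented as a target score. A categorical phenotype would need an additional mapping from score to category, which this sketch leaves out.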
- the method further comprises generating a social connection between a first person associated with the first canine subject and a second person associated with the second canine subject, based at least in part on the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject.
- the social connection is generated when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject includes a pre-determined pigmentation phenotype.
- the social connection is generated through a social media network.
- the first person is a pet owner of the first canine subject, and wherein the second person is a pet owner of the second canine subject.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among at least 3 different categorical or quantitative values of pigmentation phenotypes. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among at least 6 different categorical or quantitative values of pigmentation phenotypes. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 75%. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 80%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among 2 different categorical or quantitative values of pigmentation phenotypes.
- the 2 different categorical or quantitative values of pigmentation phenotypes comprise a darker coat color and a lighter coat color.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 85%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 90%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 95%.
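The accuracy levels recited above can be checked against a validation set with a simple comparison of predicted and observed phenotypes. The following sketch uses made-up labels and treats accuracy as the fraction of exact matches, which is an assumption about how the figure would be computed.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Fraction of subjects whose predicted phenotype equals the observed one."""
    if len(predicted) != len(observed) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Hypothetical two-category validation data.
predicted = ["darker", "lighter", "darker", "darker"]
observed = ["darker", "lighter", "lighter", "darker"]

acc = accuracy(predicted, observed)
meets_70_percent = acc >= 0.70
print(acc, meets_70_percent)
```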
- the trained machine learning classifier comprises a linear regression or a logistic regression. In some embodiments, the trained machine learning classifier comprises the linear regression. In some embodiments, the trained machine learning classifier comprises the logistic regression.
- the present disclosure provides a computer system for determining a pigmentation phenotype of a canine subject, comprising: a database that is configured to store genotype data for the canine subject, wherein the genotype data comprises quantitative values of each of a plurality of genetic markers, wherein the plurality of genetic markers comprises genetic variants; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to:
- the present disclosure provides a non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining a pigmentation phenotype of a canine subject, the method comprising (a) receiving genotype data for the canine subject, wherein the genotype data comprises quantitative values of each of a plurality of genetic markers, wherein the plurality of genetic markers comprises genetic variants; (b) applying a trained machine learning classifier to the genotype data to determine a predicted pigmentation phenotype based at least in part on the quantitative values of the plurality of genetic variants; and (c) identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 70%.
- Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
- Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto.
- the computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
- FIG. 1 illustrates an example of a method of determining a pigmentation phenotype of a canine subject.
- FIG. 2 illustrates a computer system that is programmed or otherwise configured to implement methods provided herein.
- FIGs. 3A-3B show Manhattan plots of association with roaning (FIG. 3A) and ticking (FIG. 3B). Red and blue horizontal lines mark the thresholds for significant (P < 5 × 10⁻⁸) and suggestive (P < 1 × 10⁻⁵) associations, respectively.
- FIGs. 4A-4B show a Q-Q plot of the association with roaning (FIG. 4A) and ticking (FIG. 4B).
- the GWAS of 320 roaned dogs (cases) and 357 non-ticked, non-roaned dogs (controls) identified two highly significant and two suggestive markers (FIG. 3A and FIGs. 4A-4B).
- FIGs. 5A-5B show Manhattan plots of association with roaning, including roaning for herding breeds (FIG. 5A) and roaning for non-herding breeds (FIG. 5B). Red and blue horizontal lines mark the thresholds for significant (P < 5 × 10⁻⁸) and suggestive (P < 1 × 10⁻⁵) associations, respectively.
- FIGs. 6A-6B show Manhattan plots of association with ticking, including ticking for herding breeds (FIG. 6A) and ticking for non-herding breeds (FIG. 6B). Red and blue horizontal lines mark the thresholds for significant (P < 5 × 10⁻⁸) and suggestive (P < 1 × 10⁻⁵) associations, respectively.
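The two thresholds drawn on the Manhattan plots (significant at P < 5 × 10⁻⁸, suggestive at P < 1 × 10⁻⁵) can be applied to per-marker P values as in this minimal sketch. The function names and the example P values are illustrative only.

```python
import math

SIGNIFICANT = 5e-8   # genome-wide significance threshold from the captions
SUGGESTIVE = 1e-5    # suggestive threshold from the captions


def classify_association(p_value: float) -> str:
    """Label a GWAS P value relative to the two plotted thresholds."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < SIGNIFICANT:
        return "significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"


def manhattan_height(p_value: float) -> float:
    """Vertical position on a Manhattan plot: -log10 of the P value."""
    return -math.log10(p_value)


for p in (1e-9, 1e-6, 0.01):  # made-up P values
    print(p, classify_association(p), round(manhattan_height(p), 2))
```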
- FIG. 7 shows normalized read depth in 5-kb sliding windows across the significant GWAS locus on CFA38 for Australian Cattle Dogs (red), German Wirehaired Pointer (pink), and Border Collies (grey). Filled circles show the corresponding region of the Manhattan plot shown in FIGs. 3A-3B.
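A windowed read-depth profile of the kind plotted in FIG. 7 could be computed roughly as follows. The caption does not state the normalization, so dividing each window mean by the average of all window means is an assumption of this sketch, as are the toy depth values and the small window size used to keep the example readable.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def normalized_window_depth(
    depth: Sequence[float], window: int, step: int
) -> List[Tuple[int, float]]:
    """Mean depth per window, divided by the mean over all windows.

    Returns (window start index, normalized depth) pairs.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(depth) - window + 1, step)
    raw = [(s, mean(depth[s:s + window])) for s in starts]
    if not raw:
        return []
    baseline = mean(d for _, d in raw)
    if baseline == 0:
        raise ValueError("cannot normalize: baseline depth is zero")
    return [(s, d / baseline) for s, d in raw]


# Toy per-position depths; real data would use window=5000 over a locus.
toy_depth = [10, 10, 20, 20, 10, 10]
profile = normalized_window_depth(toy_depth, window=2, step=2)
print(profile)
```

Windows with normalized depth well above their neighbours are the kind of signal that would prompt a closer look for a copy-number change.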
- FIG. 8 shows haplotype structure near the tandem duplication on CFA38 (position 11,031,835-11,243,237).
- Border Collies: grey.
- Breeds with high frequency of ticking (Brittany, Clumber spaniel, and English setter): purple.
- Breeds with high frequency of roaning (Australian Cattle Dog, German Wirehaired Pointer, Wirehaired Pointer, and Wirehaired Pointing Griffon): brown.
- Dalmatians: red.
- Rows correspond to haplotypes (two rows/individual), and columns correspond to markers.
- +/-: presence and absence of the 11-kb duplication based on Manta.
- Red box: 11-kb duplication (CFA38:11,131,835-11,143,234).
- Orange box: a core haplotype (CFA38:11,122,646-11,167,876).
- FIG. 9 shows discordant read pairs at the duplication breakpoint on CFA38 identified in Münsterlander (top panel), Australian Cattle Dog (middle: SRR7107580), and Border Collie (bottom: SRR7107950).
- Outward-facing read pairs (green) indicate that this is a tandem duplication found in ticked and roaned dogs but not in Border Collie.
- FIGs. 10A-10B show PCR genotyping of the tandem duplication on CFA38 associated with roaning.
- FIG. 10A shows a schematic view of the design of the PCR genotyping assay. Single headed arrows indicate three pairs of primers to amplify three regions. The first (black) and the third (yellow) primer pairs should produce amplicons in all dogs regardless of the presence or absence of the duplication, while the second pair in the middle should produce an amplicon only in dogs carrying the duplication.
- FIG. 10B shows PCR genotyping of roaned and control dogs. Each gel lane corresponds to one of the PCR primer pairs depicted in FIG. 10A.
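The logic of the three-primer-pair assay can be sketched as a small decision rule. Treating the first and third primer pairs as positive controls, and returning an inconclusive call when either fails, is an interpretation made for this sketch. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LaneResult:
    """Whether each primer pair produced an amplicon for one dog."""
    first_pair: bool    # expected in all dogs
    middle_pair: bool   # expected only in dogs carrying the duplication
    third_pair: bool    # expected in all dogs


def call_duplication(result: LaneResult) -> str:
    """Interpret the three amplicons as a duplication presence/absence call."""
    if not (result.first_pair and result.third_pair):
        # A flanking amplicon is missing, so the reaction itself is suspect.
        return "inconclusive"
    return "duplication present" if result.middle_pair else "duplication absent"


print(call_duplication(LaneResult(True, True, True)))
print(call_duplication(LaneResult(True, False, True)))
print(call_duplication(LaneResult(False, True, True)))
```

As written, the call reports only presence or absence. It does not distinguish one copy of the duplication from two.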
- FIG. 11 shows a density distribution of ALRR for the discovery panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively).
- Vertical ticks indicate individual ALRR of dogs with roaning (orange) and without roaning (grey). Density plots with the number of individuals less than 10 are not shown, but individual ALRR is indicated with longer vertical ticks.
- FIG. 12 shows genotype frequency of the marker near MITF (CFA20:21836232) in roaned and non-roaned dogs.
- Roaned dogs were mostly “GG” homozygous (89%) or “AG” heterozygous (10%) at this marker, while “AA” homozygotes were most common in non-roaned dogs (66%), consistent with the requirement that a dog be capable of having white areas for roaning to be visible.
- FIGs. 13A-13B show a density distribution of ALRR for the validation panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively), including target breeds (FIG. 13A) and mixed breeds (FIG. 13B).
- Vertical ticks indicate individual ALRR of dogs with roaning (orange) and without roaning (grey). Density plots with the number of individuals less than 10 are not shown, but individual ALRR is indicated with longer vertical ticks.
- FIGs. 14A-14D show a signature of selection in the region on CFA38 associated with roaning.
- FIG. 14A shows nucleotide diversity (π) for Wirehaired Pointing Griffon (orange), Border Collies (grey squares), and Labrador Retriever (black triangles) in 500-kb sliding windows.
- FIG. 14B shows pairwise genetic differentiation (FST) for Wirehaired Pointing Griffon (red) and Labrador Retriever (black). Border Collies were used as a reference.
- FIG. 14C shows ROH in Australian Cattle Dog (orange), Dalmatians (red), and Border Collies (grey).
- FIG. 14D shows XP-EHH in Australian Cattle Dog (orange), Dalmatians (red), and Labrador Retrievers (black). Border Collies were used as a reference. Wirehaired Pointing Griffons and Australian Cattle Dogs are commonly associated with roaning. Blue rectangle: position of the 11-kb duplication. π and FST were estimated by using whole-genome resequencing data, while ROH and XP-EHH were estimated by using Illumina genotyping data.
- FIG. 15 shows the human orthologous region (hg38) of the CFA38 region associated with roaning (UCSC genome browser). The area highlighted in blue is the region orthologous to the tandem duplication identified in dogs with roaning, which is located within intron 61 of USH2A.
- GeneHancer Regulatory Elements are located at chr1:215,715,579-215,717,032 (green line), which corresponds to CFA38:11,146,170-11,147,605 in the dog genome.
- DNase I hypersensitive sites: grey and black boxes.
- Open Regulatory Annotation (ORegAnno): orange and blue boxes.
- FIGs. 16A-16H show representative coat phenotypes, including German Wirehaired Pointer (roaned) (FIG. 16A); Australian Cattle Dog (roaned) (FIG. 16B); a mixed breed of Treeing Walker Coonhound and Bluetick Coonhound (ticked) (FIG. 16C); a Border Collie (ticked) (FIG. 16D); an English Setter (both roaned and ticked) (FIG. 16E); an Australian Cattle Dog (both roaned and ticked) (FIG. 16F); a Pointer (without roaning and ticking) (FIG. 16G); and an Australian Cattle Dog (without roaning and ticking) (FIG. 16H).
- FIGs. 16A, 16C, 16E, and 16G are non-herding breeds, while FIGs. 16B, 16D, 16F, and 16H are herding breeds.
- FIGs. 17A-17B show Manhattan plots of association with roaning and ticking, including for Roaning (FIG. 17A) and Ticking (FIG. 17B).
- FIG. 18 shows normalized read depth in 5-kb sliding windows across the significant GWAS locus on CFA38 for Australian Cattle Dogs, German Wirehaired Pointer, and Border Collies.
- FIG. 19 shows haplotypes near the marker on CFA38 significantly associated with roaning, for Border Collies, breeds with high frequency of ticking, breeds with high frequency of roaning, and Dalmatians.
- FIGs. 20A-20B show PCR genotyping of the tandem duplication on CFA38 associated with roaning.
- FIG. 21 shows density distribution of the array signal intensity (ALRR) for the discovery panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively). Vertical ticks indicate individual ALRR of dogs with roaning (heterozygote and homozygote) and without roaning (no haplotype).
- FIGs. 22A-22D show a signature of selection in the region on CFA38 associated with roaning.
- FIGs. 23A-23C show the six-point coat pheomelanin intensity scale.
- FIGs. 24A-24B show quantitative coat pheomelanin intensity GWAS results.
- FIGs. 25A-25B show species and breed allele frequencies at top GWAS markers.
- FIGs. 26A-26B show dominance and epistatic interactions.
- FIGs. 27A-27B show performance of the best fit multivariate linear regression classifier model for pheomelanin intensity phenotypes in validation cohort.
- FIG. 28 shows phenotyping validation on 350 randomly selected dogs.
- FIGs. 29A-29C show Manhattan plots for additional GWAS, including 6-point phenotype, no covariates (FIG. 29A); binary phenotype, with covariates (FIG. 29B); and binary phenotype, no covariates (FIG. 29C).
- FIGs. 30A-30E show detailed views of regions surrounding top GWAS SNPs (e.g., on CFA2, CFA15, CFA18, CFA20, and CFA21), including CFA2 Association Region (74,465,672-75,100,435) (FIG. 30A); CFA15 Association Region (29,575,066-29,973,539) (FIG. 30B); CFA18 Association Region (12,410,382-13,410,382) (FIG. 30C); CFA20 Association Region (55,783,410-55,960,115) (FIG. 30D); and CFA21 Association Region (10,698,290-11,165,504) (FIG. 30E).
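The association regions listed for FIGs. 30A-30E can be held in a small lookup table to test whether a marker position falls inside one of them. The coordinates below are copied from the caption. Treating both interval ends as inclusive, and the function name, are assumptions of this sketch.

```python
from typing import Dict, Tuple

# Association regions from the FIG. 30 caption: chromosome -> (start, end).
ASSOCIATION_REGIONS: Dict[str, Tuple[int, int]] = {
    "CFA2": (74_465_672, 75_100_435),
    "CFA15": (29_575_066, 29_973_539),
    "CFA18": (12_410_382, 13_410_382),
    "CFA20": (55_783_410, 55_960_115),
    "CFA21": (10_698_290, 11_165_504),
}


def in_association_region(chrom: str, position: int) -> bool:
    """Return True if the position lies within the listed region for chrom."""
    region = ASSOCIATION_REGIONS.get(chrom)
    if region is None:
        return False
    start, end = region
    return start <= position <= end


print(in_association_region("CFA2", 74_700_000))
print(in_association_region("CFA2", 1))
print(in_association_region("CFA38", 11_131_835))
```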
- FIG. 31A shows that the CFA15 top marker genotype correlates with sequencing coverage in a known CNV.
- FIG. 31B shows SRA run ID and sample name, breed, BICF2G630433130 genotype (coded as number of red-associated alleles), and CFA15 CNV mean normalized depth of coverage for all dogs shown in FIG. 31A.
- a sample includes a plurality of samples, including mixtures thereof.
- the term “subject” generally refers to an entity or a medium that has testable or detectable genetic information.
- a subject can be a person, individual, or patient.
- a subject can be a vertebrate, such as, for example, a mammal.
- Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets (e.g., canines such as dogs, or felines such as cats).
- the subject may have a normal or abnormal health or physiological state or condition or be suspected of having a normal or abnormal health or physiological state or condition.
- the subject may be displaying a symptom(s) indicative of a health or physiological state or condition.
- the subject can be asymptomatic with respect to such health or physiological state or condition.
- nucleic acid generally refers to a molecule comprising one or more nucleic acid subunits, or nucleotides.
- a nucleic acid may include one or more nucleotides selected from adenosine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), or variants thereof.
- a nucleotide generally includes a nucleoside and at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more phosphate (PO3) groups.
- a nucleotide can include a nucleobase, a five-carbon sugar (either ribose or deoxyribose), and one or more phosphate groups, individually or in combination.
- Ribonucleotides are nucleotides in which the sugar is ribose.
- Deoxyribonucleotides are nucleotides in which the sugar is deoxyribose.
- a nucleotide can be a nucleoside monophosphate or a nucleoside polyphosphate.
- a nucleotide can be a deoxyribonucleoside polyphosphate, such as, e.g., a deoxyribonucleoside triphosphate (dNTP), which can be selected from deoxyadenosine triphosphate (dATP), deoxycytidine triphosphate (dCTP), deoxyguanosine triphosphate (dGTP), deoxyuridine triphosphate (dUTP), and deoxythymidine triphosphate (dTTP), including dNTPs that carry detectable tags, such as luminescent tags or markers (e.g., fluorophores).
- Such subunit can be an A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T or U, or complementary to a purine (i.e., A or G, or variant thereof) or a pyrimidine (i.e., C, T or U, or variant thereof).
- a nucleic acid is deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or derivatives or variants thereof.
- a nucleic acid may be single-stranded or double stranded.
- a nucleic acid molecule may be linear, curved, or circular or any combination thereof.
- nucleic acid molecule generally refers to a polynucleotide that may have various lengths, such as either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof.
- a nucleic acid molecule can have a length of at least about 5 bases, 10 bases, 20 bases, 30 bases, 40 bases, 50 bases, 60 bases, 70 bases, 80 bases, 90 bases, 100 bases, 110 bases, 120 bases, 130 bases, 140 bases, 150 bases, 160 bases, 170 bases, 180 bases, 190 bases, 200 bases, 300 bases, 400 bases, 500 bases, 1 kilobase (kb), 2 kb, 3 kb, 4 kb, 5 kb, 10 kb, or 50 kb, or it may have any number of bases between any two of the aforementioned values.
- oligonucleotide is typically composed of a specific sequence of nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA).
- the terms “nucleic acid molecule,” “nucleic acid sequence,” “nucleic acid fragment,” “oligonucleotide” and “polynucleotide” are at least in part intended to be the alphabetical representation of a polynucleotide molecule. Alternatively, the terms may be applied to the polynucleotide molecule itself.
- Oligonucleotides may include one or more nonstandard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
- sample generally refers to a biological sample.
- biological samples include nucleic acid molecules, amino acids, polypeptides, proteins, carbohydrates, fats, or viruses.
- a biological sample is a nucleic acid sample including one or more nucleic acid molecules.
- the biological sample may comprise or be derived from blood samples, saliva samples, swab samples, cell samples, or tissue samples.
- the nucleic acid molecules may be cell-free nucleic acid molecules, such as cell-free DNA (cfDNA) or cell-free RNA (cfRNA).
- the nucleic acid molecules may be derived from a variety of sources including human, mammal (e.g., dog), non-human mammal, ape, monkey, chimpanzee, reptilian, amphibian, or avian sources. Further, samples may be extracted from a variety of animal fluids, including but not limited to bodily fluid samples such as blood, serum, plasma, vitreous, sputum, urine, tears, perspiration, saliva, semen, mucosal excretions, mucus, spinal fluid, cerebrospinal fluid (CSF), pleural fluid, peritoneal fluid, amniotic fluid, lymph fluid, and the like.
- Biological samples may be obtained or derived from subjects using an ethylenediaminetetraacetic acid (EDTA) collection tube, a cell-free RNA collection tube (e.g., Streck), or a cell-free DNA collection tube (e.g., Streck).
- Biological samples may be derived from whole blood samples by fractionation.
- Biological samples or derivatives thereof may contain cells.
- a biological sample may be a blood sample or a derivative thereof (e.g., blood collected by a collection tube or blood drops) or a cell or tissue sample (e.g., a swab).
- the term “whole blood,” as used herein, generally refers to a blood sample that has not been separated into sub-components (e.g., by centrifugation).
- the whole blood of a blood sample may contain cfDNA and/or germline DNA.
- Whole blood DNA (which may contain cfDNA and/or germline DNA) may be extracted from a blood sample.
- Whole blood DNA sequencing reads (which may contain cfDNA sequencing reads and/or germline DNA sequencing reads) may be extracted from whole blood DNA.
- the present disclosure provides a computer-implemented method for determining a pigmentation phenotype of a canine subject, comprising (a) receiving genotype data for the canine subject, wherein the genotype data comprises quantitative values of each of a plurality of genetic markers, wherein the plurality of genetic markers comprises genetic variants; (b) applying a trained machine learning classifier to the genotype data to determine a predicted pigmentation phenotype based at least in part on the quantitative values of the plurality of genetic variants; and (c) identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 70%.
- FIG. 1 illustrates an example of a method 100 for determining a pigmentation phenotype of a canine subject, in accordance with some embodiments.
- the method 100 may comprise receiving genotype data for the canine subject.
- the genotype data may comprise quantitative values of each of a plurality of genetic markers.
- the plurality of genetic markers comprises genetic variants.
- the method 100 may comprise applying a trained machine learning classifier to the genotype data to determine a predicted pigmentation phenotype based at least in part on the quantitative values of the plurality of genetic markers (e.g., genetic variants).
- the method 100 may comprise identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 70%.
- the canine subject is a dog.
- the dog is a purebred dog or a mixed breed dog.
- the dog is a purebred dog.
- the purebred dog is selected from Labrador retriever and golden retriever.
- the dog is a mixed breed dog.
- the dog has a breed selected from Labrador retriever and golden retriever.
- the genotype data is obtained by assaying a biological sample obtained from the canine subject.
- the biological sample comprises a blood sample, a saliva sample, a swab sample, a cell sample, or a tissue sample.
- the assaying comprises sequencing the biological sample or derivatives thereof.
- the plurality of genetic markers comprises at least 5 distinct genetic markers.
- the plurality of genetic markers comprises at least 10 distinct genetic markers.
- the quantitative values are indicative of a presence or absence in the genotype data of each of the plurality of genetic variants.
- the plurality of genetic variants is selected from the group consisting of single nucleotide polymorphisms (SNPs), insertions or deletions (indels), microsatellites, or structural variants.
- the pigmentation phenotype comprises a coat color intensity phenotype, a ticking phenotype, a roaning phenotype, or a tongue pigmentation phenotype.
- the pigmentation phenotype comprises a coat color intensity phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 8.
- the plurality of genetic markers comprises one or more SNPs of a genetic locus selected from canFam3.1 chr2: 74.7Mb, chr20: 55.8Mb, and chr21: 10.9Mb. In some embodiments, the plurality of genetic markers comprises canFam3.1 chr2: 74.7Mb or chr21: 10.9Mb. In some embodiments, the plurality of genetic markers comprises canFam3.1 chr2: 74.7Mb and chr21: 10.9Mb. In some embodiments, the pigmentation phenotype comprises a ticking phenotype. In some embodiments, the pigmentation phenotype comprises a roaning phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 11.
- the pigmentation phenotype comprises a tongue pigmentation phenotype.
- the plurality of genetic markers comprises one or more markers selected from the group listed in Table 13.
- applying the trained machine learning classifier comprises determining a weighted sum of the quantitative values of the plurality of genetic markers.
- the weighted sum is determined using a plurality of pre-determined weights associated with the plurality of genetic markers.
- the plurality of pre-determined weights associated with the plurality of genetic markers is determined by performing a genome-wide association study (GWAS) comprising a multiple linear regression.
- applying the trained machine learning classifier comprises applying a multiple logistic regression to the quantitative values of the plurality of genetic markers.
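The weighted-sum and logistic-regression embodiments above can be illustrated with a minimal sketch. The marker names, weights, and intercept below are hypothetical placeholders, not values from the disclosure; real weights would come from a GWAS or other model fit. Genotype values are assumed to be allele dosages (0, 1, or 2).

```python
import math

# Hypothetical marker IDs and weights, for illustration only; they are not
# taken from the disclosure or from any table it references.
WEIGHTS = {"marker_a": 0.8, "marker_b": -0.5, "marker_c": 0.3}
INTERCEPT = -0.2


def weighted_sum(genotype, weights=WEIGHTS, intercept=INTERCEPT):
    """Return intercept + sum(weight * value) over the weighted markers.

    Markers missing from `genotype` are treated as 0 (variant absent).
    """
    return intercept + sum(w * genotype.get(m, 0) for m, w in weights.items())


def logistic_probability(score):
    """Map a weighted sum to a probability in (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))


# Assumed encoding: number of copies of the variant allele at each marker.
genotype = {"marker_a": 2, "marker_b": 1, "marker_c": 0}
score = weighted_sum(genotype)
prob = logistic_probability(score)
```

The logistic step is only needed when a probability is wanted; for a quantitative phenotype the weighted sum itself could serve as the prediction.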
- the method further comprises determining a second pigmentation phenotype of a second canine subject, and determining an expected range of pigmentation phenotypes of a potential offspring of the canine subject and the second canine subject. In some embodiments, the method further comprises determining a recommendation indicative of whether or not to breed the first canine subject and the second canine subject together, based on the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject.
- the method further comprises determining a recommendation indicative of breeding the first canine subject and the second canine subject together, when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject includes a pre-determined pigmentation phenotype. In some embodiments, the method further comprises determining a recommendation against breeding the first canine subject and the second canine subject together, when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject does not include a pre-determined pigmentation phenotype.
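One way to picture the expected-range and breeding-recommendation embodiments is the sketch below. It rests on assumptions the disclosure does not state: biallelic markers encoded as allele dosages (0, 1, 2), Mendelian transmission of one allele per parent, independent markers, and an additive score as the phenotype. All names and numbers are hypothetical.

```python
# Alleles a parent with a given dosage can transmit (assumed Mendelian model).
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def offspring_dosages(d1, d2):
    """Return the sorted set of possible offspring dosages at one marker."""
    return sorted({a + b for a in GAMETES[d1] for b in GAMETES[d2]})


def offspring_score_range(parent1, parent2, weights):
    """Return (lowest, highest) additive score reachable by an offspring."""
    low = high = 0.0
    for marker, w in weights.items():
        contributions = [
            w * d for d in offspring_dosages(parent1[marker], parent2[marker])
        ]
        low += min(contributions)
        high += max(contributions)
    return low, high


def recommend_breeding(score_range, desired_range):
    """True when the expected range overlaps the pre-determined desired range."""
    low, high = score_range
    want_low, want_high = desired_range
    return low <= want_high and want_low <= high


weights = {"m1": 1.0, "m2": -0.5}   # hypothetical weights
sire = {"m1": 1, "m2": 2}
dam = {"m1": 1, "m2": 0}
expected = offspring_score_range(sire, dam, weights)
```

Because the score is additive, the extremes can be found marker by marker without enumerating every genotype combination.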
- the method further comprises generating a social connection between a first person associated with the first canine subject and a second person associated with the second canine subject, based at least in part on the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject.
- the social connection is generated when the expected range of pigmentation phenotypes of the potential offspring of the canine subject and the second canine subject includes a pre-determined pigmentation phenotype.
- the social connection is generated through a social media network.
- the first person is a pet owner of the first canine subject, and wherein the second person is a pet owner of the second canine subject.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among at least 3 different categorical or quantitative values of pigmentation phenotypes. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among at least 6 different categorical or quantitative values of pigmentation phenotypes. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 75%. In some embodiments, the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 80%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype among 2 different categorical or quantitative values of pigmentation phenotypes.
- the 2 different categorical or quantitative values of pigmentation phenotypes comprise a darker coat color and a lighter coat color.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 85%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 90%.
- the method further comprises identifying the canine subject as having the predicted pigmentation phenotype with an accuracy of at least about 95%.
- methods and systems of the present disclosure may be used to add a valuable social component to the genetic assay results of dogs.
- By allowing dog owners to directly connect with each other based on a similarity of pigmentation of their pets, owners can gain more information from other dogs’ owners about the suitability of a potential mating pairing between two dogs (e.g., having desired pigmentation traits).
- Methods and systems of the present disclosure may use one or more algorithms to determine a pigmentation phenotype of a canine subject.
- the canine subject is a dog.
- the dog comprises one or more dog breeds selected from the group consisting of: Affenpinscher, Anderson Hound, Africanis, Aidi, Airedale Terrier,
- Bichon Frise, Billy, Bisben, Black and Tan Coonhound, Black and Tan Virginia Foxhound, Bullenbeisser, Black Norwegian Elkhound, Black Russian Terrier, Blackmouth Cur, Grand Bleu de Gascogne, Petit Bleu de Gascogne, Bloodhound, Blue Lacy, Blue Paul Terrier, Bluetick Coonhound, Boerboel, Bohemian Shepherd, Bolognese, Border Collie, Border Terrier, Borzoi, Bosnian Coarse-haired Hound, Boston Terrier, Bouvier des Ardennes, Bouvier des Flandres, Boxer, Boykin Spaniel, Bracco Italiano, Braque d'Auvergne, Braque du Bourbonnais, Braque du Puy, Braque Francais, Braque Saint-Germain, Brazilian Terrier, Briard, Briquet Griffon Vendeen, Brittany, Broholmer, Bruno Jura Hound, Bucovina Shepherd Dog, Bull and Terrier, Bull Terrier, Bull Terrier (Miniature), Bullmastiff, Bully Kut
- the subject is a purebred dog (e.g., having a single breed type) or a mixed-breed dog (e.g., having a plurality of breed types).
- the subject is a mixed-breed dog having DNA from any number (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10) or combination of purebred dogs.
- the method may comprise receiving genotype data as inputs.
- the genotype data may be obtained by assaying biological samples obtained from the population of test individuals.
- the biological samples comprise blood samples, saliva samples, swab samples (e.g., mouth or cheek swab), cell samples, or tissue samples.
- the assaying comprises sequencing the biological samples or derivatives thereof to generate the genotype data.
- sequencing reads may be generated from the biological samples using any suitable sequencing method.
- the sequencing method can be a first-generation sequencing method, such as Maxam-Gilbert or Sanger sequencing, or a high-throughput sequencing (e.g., next-generation sequencing or NGS) method.
- a high-throughput sequencing method may sequence simultaneously (or substantially simultaneously) at least about 10,000, 100,000, 1 million, 10 million, 100 million, 1 billion, or more polynucleotide molecules.
- Sequencing methods may include, but are not limited to: pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, Digital Gene Expression (Helicos), massively parallel sequencing, e.g., Helicos, Clonal Single Molecule Array (Solexa/Illumina), sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms.
- the sequencing comprises whole genome sequencing (WGS).
- the sequencing may be performed at a depth sufficient to generate the desired genotype data with a desired performance (e.g., accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), or the area under curve (AUC) of a receiver operator characteristic (ROC)).
- the sequencing is performed at a depth of about 20X, about 30X, about 40X, about 50X, about 60X, about 70X, about 80X, about 90X, about 100X, about 150X, about 200X, about 250X, about 300X, about 350X, about 400X, about 450X, about 500X, or more than about 500X.
- the sequencing is performed in a “low-pass” manner, for example, at a depth of no more than about 12X, no more than about 11X, no more than about 10X, no more than about 9X, no more than about 8X, no more than about 7X, no more than about 6X, no more than about 5X, no more than about 4X, no more than about 3.5X, no more than about 3X, no more than about 2.5X, no more than about 2X, no more than about 1.5X, or no more than about 1X.
- the sequencing reads may be aligned to a reference genome.
- the reference genome may comprise at least a portion of a genome (e.g., a dog genome or a human genome).
- the reference genome may comprise an entire genome (e.g., an entire dog genome or an entire human genome).
- the reference genome may comprise a database comprising a plurality of genomic regions that correspond to coding and/or non-coding genomic regions of a genome.
- the database may comprise a plurality of genomic regions that correspond to coding and/or non-coding genomic regions of a genome, such as single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), copy number variants (CNVs), insertions or deletions (indels), and fusion genes.
- the alignment may be performed using a Burrows-Wheeler algorithm or another alignment algorithm.
- quantitative measures of the sequencing reads may be generated for each of a plurality of genomic regions. Quantitative measures of the sequencing reads may be generated, such as counts of DNA sequencing reads that are aligned with a given genomic region. Sequencing reads having a portion or all of the sequencing read aligning with a given genomic region may be counted toward the quantitative measure for that genomic region.
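The read-counting step described above can be sketched as follows. This is an illustrative simplification: reads and regions are assumed to be half-open (chromosome, start, end) intervals already produced by an aligner, and the region names and coordinates are made up. A read counts toward a region when any part of it overlaps that region.

```python
from collections import Counter


def count_reads_per_region(reads, regions):
    """Count aligned reads that overlap each genomic region.

    reads:   iterable of (chrom, start, end) half-open intervals.
    regions: dict mapping region name -> (chrom, start, end).
    """
    counts = Counter({name: 0 for name in regions})
    for chrom, start, end in reads:
        for name, (r_chrom, r_start, r_end) in regions.items():
            # Partial or full overlap both count toward the region.
            if chrom == r_chrom and start < r_end and r_start < end:
                counts[name] += 1
    return dict(counts)


# Hypothetical regions and reads, for illustration only.
regions = {"region_1": ("chr2", 100, 200), "region_2": ("chr21", 50, 80)}
reads = [
    ("chr2", 90, 110),    # partially overlaps region_1
    ("chr2", 150, 160),   # inside region_1
    ("chr2", 200, 250),   # starts where region_1 ends: no overlap
    ("chr21", 60, 70),    # inside region_2
    ("chr20", 60, 70),    # different chromosome
]
counts = count_reads_per_region(reads, regions)
```

The nested loop is fine for a handful of regions; a production pipeline would more likely use an interval index.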
- genomic regions may comprise genetic markers such as genetic variants (e.g., single nucleotide polymorphisms (SNPs), insertions or deletions (indels), microsatellites, or structural variants).
- Patterns of specific and non-specific genomic regions may be indicative of pigmentation phenotypes (e.g., coat color intensity, roaning, ticking, or tongue pigmentation).
- measuring the plurality of counts of DNA sequencing reads comprises performing binding measurements of the plurality of DNA molecules at each of the plurality of genomic regions.
- performing the binding measurements comprises assaying the plurality of DNA molecules using probes that are selective for at least a portion of the plurality of genomic regions in the plurality of DNA molecules.
- the probes are nucleic acid molecules having sequence complementarity with nucleic acid sequences of the plurality of genomic regions.
- the nucleic acid molecules are primers or enrichment sequences.
- the assaying comprises use of array hybridization or polymerase chain reaction (PCR), or nucleic acid sequencing.
- the method further comprises enriching the plurality of DNA molecules for at least a portion of the plurality of genomic regions.
- the enrichment comprises amplifying the plurality of DNA molecules.
- the plurality of DNA molecules may be amplified by selective amplification (e.g., by using a set of primers or probes comprising nucleic acid molecules having sequence complementarity with nucleic acid sequences of the plurality of genomic regions).
- the plurality of DNA molecules may be amplified by universal amplification (e.g., by using universal primers).
- the enrichment comprises selectively isolating at least a portion (e.g., mononucleotides and/or dinucleotides) of the plurality of DNA molecules.
- the counts of DNA sequencing reads may be normalized or corrected.
- the counts of DNA sequencing reads may be normalized and/or corrected to account for known biases in sequencing and library preparation.
- a subset of the quantitative measures or counts may be filtered out, e.g., based on a quality score of the sequencing reads.
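As a rough illustration of the filtering and normalization steps above, the sketch below drops reads under a quality threshold and rescales per-region counts to a fixed total. The threshold of 20, the record layout, and the scaling constant are assumptions chosen for the example; bias correction (for instance for library preparation effects) is not modeled.

```python
from collections import Counter


def filter_by_quality(reads, min_quality=20):
    """Keep only reads whose quality score meets the (assumed) threshold."""
    return [read for read in reads if read["quality"] >= min_quality]


def normalize_counts(counts, scale=1_000_000):
    """Rescale raw counts so they sum to `scale` (e.g., counts per million)."""
    total = sum(counts.values())
    if total == 0:
        return {region: 0.0 for region in counts}
    return {region: n * scale / total for region, n in counts.items()}


# Hypothetical read records: the region each read aligned to and its quality.
reads = [
    {"region": "region_1", "quality": 35},
    {"region": "region_1", "quality": 10},  # filtered out
    {"region": "region_1", "quality": 30},
    {"region": "region_1", "quality": 40},
    {"region": "region_2", "quality": 28},
]
kept = filter_by_quality(reads)
raw_counts = Counter(read["region"] for read in kept)
normalized = normalize_counts(raw_counts, scale=100)
```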
- a trained algorithm (e.g., a machine learning classifier) may be applied to the genotype data.
- the genotype data comprises quantitative values of each of a plurality of genetic markers (e.g., genetic variants).
- the trained algorithm may be used to determine quantitative or categorical measures of a predicted pigmentation phenotype of the canine subject.
- the trained algorithm may be configured to determine the predicted pigmentation phenotype with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than 99%.
- the trained algorithm may comprise a supervised machine learning algorithm.
- the trained algorithm may comprise a classification and regression tree (CART) algorithm.
- the supervised machine learning algorithm may comprise, for example, a linear regression, a logistic regression, a Random Forest, a support vector machine (SVM), a neural network, or a deep learning algorithm.
- the trained algorithm may comprise an unsupervised machine learning algorithm.
- the trained algorithm may be configured to accept a plurality of input variables and to produce one or more output values based on the plurality of input variables.
- the plurality of input variables may be generated based on processing genotype data of nucleic acids.
- an input variable may comprise a number of sequences corresponding to or aligning to a reference genome or genomic loci of a reference genome.
- an input variable may comprise analog or digital values of genotype data produced by a sequencer or array.
- the trained algorithm may comprise a classifier, such that each of the one or more output values comprises one of a fixed number of possible values (e.g., a linear classifier, a logistic regression classifier, etc.) indicating a classification of the genotype data by the classifier.
- the trained algorithm may comprise a binary classifier, such that each of the one or more output values comprises one of two values (e.g., {0, 1}, {positive, negative}, {present, absent}, or {light, dark}) indicating a classification of the canine subject based on genotype data by the classifier.
- the trained algorithm may be another type of classifier, such that each of the one or more output values comprises one of more than two values (e.g., {0, 1, 2}, {positive, negative, or indeterminate}, {present, absent, or indeterminate}, or {light, medium, or dark}) indicating a classification of the canine subject based on genotype data by the classifier.
- the output values may comprise descriptive labels, numerical values, or a combination thereof.
- Some of the output values may comprise descriptive labels. Such descriptive labels may provide an identification of predicted pigmentation phenotypes, and may comprise, for example, {light, medium, or dark}. As another example, such descriptive labels may provide a relative assessment of the likelihood of different pigmentation phenotypes being present in the canine subject based on the genotype data. Some descriptive labels may be mapped to numerical values, for example, by mapping “positive” or “present” to 1, and “negative” or “absent” to 0.
- Some of the output values may comprise numerical values, such as binary, integer, or continuous values. Such binary output values may comprise, for example, {0, 1}, {positive, negative}, or {present, absent}. Such integer output values may comprise, for example, {0, 1,
- Such continuous output values may comprise, for example, a probability value of at least 0 and no more than 1 (e.g., indicative of the likelihood of different pigmentation phenotypes being present in the canine subject).
- Such continuous output values may comprise, for example, an unnormalized probability value of at least 0.
- Some numerical values may be mapped to descriptive labels, for example, by mapping 1 to “positive” or “present”, and 0 to “negative” or “absent”.
- Some of the output values may be assigned based on one or more cutoff values. For example, a binary classification of the canine subject based on genotype data may assign an output value of “positive” or 1 if the canine subject has at least a 50% probability of having a given pigmentation phenotype.
- a binary classification of the canine subject based on genotype data may assign an output value of “negative” or 0 if the canine subject has less than a 50% probability of having a given pigmentation phenotype.
- a single cutoff value of 50% is used to classify the canine subject into one of the two possible binary output values based on genotype data.
- Examples of single cutoff values may include about 1%, about 2%, about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, and about 99%.
- a classification of the canine subject based on genotype data may assign an output value of “positive” or 1 if the canine subject has a probability of having a given pigmentation phenotype of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more.
- the classification of the canine subject based on genotype data may assign an output value of “positive” or 1 if the canine subject has a probability of having a given pigmentation phenotype of more than about 50%, more than about 55%, more than about 60%, more than about 65%, more than about 70%, more than about 75%, more than about 80%, more than about 85%, more than about 90%, more than about 91%, more than about 92%, more than about 93%, more than about 94%, more than about 95%, more than about 96%, more than about 97%, more than about 98%, or more than about 99%.
- the classification of genotype data may assign an output value of “negative” or 0 if the canine subject has a probability of having a given pigmentation phenotype of less than about 50%, less than about 45%, less than about 40%, less than about 35%, less than about 30%, less than about 25%, less than about 20%, less than about 15%, less than about 10%, less than about 9%, less than about 8%, less than about 7%, less than about 6%, less than about 5%, less than about 4%, less than about 3%, less than about 2%, or less than about 1%.
- the classification of genotype data may assign an output value of “negative” or 0 if the canine subject has a probability of having a given pigmentation phenotype of no more than about 50%, no more than about 45%, no more than about 40%, no more than about 35%, no more than about 30%, no more than about 25%, no more than about 20%, no more than about 15%, no more than about 10%, no more than about 9%, no more than about 8%, no more than about 7%, no more than about 6%, no more than about 5%, no more than about 4%, no more than about 3%, no more than about 2%, or no more than about 1%.
- the classification of the canine subject based on genotype data may assign an output value of “indeterminate” or 2 if the canine subject is not classified as “positive”, “negative”, 1, or 0.
- a set of two cutoff values is used to classify the canine subject based on genotype data into one of the three possible output values.
- sets of cutoff values may include {1%, 99%}, {2%, 98%}, {5%, 95%}, {10%, 90%}, {15%, 85%}, {20%, 80%}, {25%, 75%}, {30%, 70%}, {35%, 65%}, {40%, 60%}, and {45%, 55%}.
- sets of n cutoff values may be used to classify the canine subject based on genotype data into one of n+1 possible output values, where n is any positive integer.
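The cutoff scheme above, in which n cutoffs split a probability into n+1 output values, can be sketched with a sorted-list lookup. The function name and labels are illustrative. A probability equal to a cutoff falls into the higher class, matching the "at least 50%" convention used for the single-cutoff case.

```python
from bisect import bisect_right


def classify(probability, cutoffs, labels):
    """Map a probability to one of len(cutoffs) + 1 labels.

    cutoffs must be sorted in ascending order; a probability equal to a
    cutoff is assigned to the higher class.
    """
    if len(labels) != len(cutoffs) + 1:
        raise ValueError("need exactly one more label than cutoffs")
    if list(cutoffs) != sorted(cutoffs):
        raise ValueError("cutoffs must be sorted in ascending order")
    return labels[bisect_right(cutoffs, probability)]


# Single cutoff of 50%: binary output.
binary = classify(0.5, [0.5], ["negative", "positive"])

# Two cutoffs {25%, 75%}: three possible outputs.
three_way = [
    classify(p, [0.25, 0.75], ["negative", "indeterminate", "positive"])
    for p in (0.10, 0.50, 0.90)
]
```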
- the trained algorithm may be trained with a plurality of independent training samples.
- Each of the independent training samples may comprise sets of genotype data generated from nucleic acids (e.g., from a biological sample of a canine subject) and one or more known output values corresponding to the genotype data (e.g., a set of known pigmentation phenotypes corresponding to the genotype data, such as that generated from photographs of the canine subjects).
- Independent training samples may be obtained or derived from a plurality of different subjects.
- Independent training samples may comprise sets of genotype data generated from nucleic acids (e.g., from a biological sample of a canine subject) and one or more known output values corresponding to the genotype data (e.g., a set of known pigmentation phenotypes corresponding to the genotype data, such as that generated from photographs of the canine subjects) obtained at a plurality of different time points from the same subject.
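To make the training step concrete, here is a minimal sketch that fits a logistic regression to labeled training samples by batch gradient descent. It is one possible supervised learner, not necessarily the one used in the disclosure; the learning rate, epoch count, marker name, and toy data are all assumptions. Each training sample pairs genotype values with a known binary phenotype label (for example, one derived from a photograph).

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(sample, weights, bias):
    """Predicted probability of the positive phenotype for one sample."""
    return sigmoid(bias + sum(w * sample[m] for m, w in weights.items()))


def train_logistic(samples, labels, markers, lr=0.1, epochs=1000):
    """Fit per-marker weights and a bias with batch gradient descent."""
    weights = {m: 0.0 for m in markers}
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = {m: 0.0 for m in markers}
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = predict(x, weights, bias) - y   # gradient of the log loss
            for m in markers:
                grad_w[m] += err * x[m]
            grad_b += err
        for m in markers:
            weights[m] -= lr * grad_w[m] / n
        bias -= lr * grad_b / n
    return weights, bias


# Toy training set: one hypothetical marker, dosage 0/1/2, binary label.
samples = [{"m1": 0}, {"m1": 0}, {"m1": 1}, {"m1": 2}, {"m1": 2}, {"m1": 1}]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic(samples, labels, ["m1"])
```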
- the trained algorithm may be trained with at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, or at least about 500 independent training samples.
- the trained algorithm may be trained with no more than about 500, no more than about 450, no more than about 400, no more than about 350, no more than about 300, no more than about 250, no more than about 200, no more than about 150, no more than about 100, or no more than about 50 independent training samples.
- the trained algorithm may be configured to determine a predicted pigmentation phenotype based on genotype data at an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the accuracy of identifying a predicted pigmentation phenotype by the trained algorithm may be calculated as the percentage of canine subjects that are correctly identified or classified (e.g., presence or absence of a particular pigmentation phenotype).
- the trained algorithm may be configured to identify predicted pigmentation phenotypes with a positive predictive value (PPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more.
- the PPV of identifying the predicted pigmentation phenotypes using the trained algorithm may be calculated as the percentage of
- the trained algorithm may be configured to identify predicted pigmentation phenotypes with a negative predictive value (NPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more.
- the NPV of identifying the predicted pigmentation phenotypes using the trained algorithm may be calculated as the percentage of
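For reference, the sketch below computes accuracy, PPV, and NPV from binary labels using their conventional definitions: accuracy is the fraction of correct calls, PPV is true positives over all positive calls, and NPV is true negatives over all negative calls. These standard formulas are assumed here; the example labels are invented.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 or 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, PPV, and NPV; undefined ratios are returned as None."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
        "npv": tn / (tn + fn) if (tn + fn) else None,
    }


# Invented example: 1 = phenotype present, 0 = phenotype absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```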
- the trained algorithm may be adjusted or tuned to improve one or more of the performance, accuracy, PPV, or NPV of identifying the predicted pigmentation phenotypes.
- the trained algorithm may be adjusted or tuned by adjusting parameters of the trained algorithm (e.g., a set of cutoff values used to predict pigmentation phenotypes, as described elsewhere herein, or weights of a neural network).
- the trained algorithm may be adjusted or tuned continuously during the training process or after the training process has completed.
- a subset of the inputs may be identified as most influential or most important to be included for making high-quality classifications.
- the plurality of input variables or a subset thereof may be ranked based on classification metrics indicative of each input variable’s importance toward making high-quality classifications or identifications of pigmentation phenotypes.
- Such metrics may be used to reduce, in some cases significantly, the number of input variables (e.g., predictor variables) that may be used to train the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, or NPV, or a combination thereof).
- For example, if training the trained algorithm with a plurality comprising several dozen or hundreds of input variables results in an accuracy of classification of more than 99%, then training the trained algorithm instead with only a selected subset of no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100 such most influential or most important input variables among the plurality can yield decreased but still acceptable accuracy of classification (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%).
- the subset may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best classification metrics.
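The rank-and-select step can be sketched as below. The importance metric used here, the absolute difference in mean marker value between the two phenotype classes, is a stand-in chosen for simplicity; the disclosure does not name a specific metric. Marker names and data are hypothetical.

```python
def mean(values):
    """Arithmetic mean, with 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def rank_markers(samples, labels):
    """Rank markers by |mean(value | label 1) - mean(value | label 0)|."""
    scores = {}
    for marker in samples[0]:
        pos = [s[marker] for s, y in zip(samples, labels) if y == 1]
        neg = [s[marker] for s, y in zip(samples, labels) if y == 0]
        scores[marker] = abs(mean(pos) - mean(neg))
    return sorted(scores, key=scores.get, reverse=True)


def select_top(samples, labels, k):
    """Keep the k markers with the best (largest) importance score."""
    return rank_markers(samples, labels)[:k]


# Hypothetical genotype dosages for four dogs and their binary labels.
samples = [
    {"m1": 2, "m2": 1, "m3": 0},
    {"m1": 2, "m2": 0, "m3": 1},
    {"m1": 0, "m2": 1, "m3": 2},
    {"m1": 0, "m2": 0, "m3": 0},
]
labels = [1, 1, 0, 0]
ranking = rank_markers(samples, labels)
top_two = select_top(samples, labels, 2)
```

The reduced marker set would then be used to retrain the classifier and check whether the accuracy remains acceptable.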
- FIG. 2 shows a computer system 201 that is programmed or otherwise configured to, for example, receive genotype data for a canine subject, apply a trained machine learning classifier to genotype data to determine a predicted pigmentation phenotype, and identify canine subjects as having the predicted pigmentation phenotype.
- the computer system 201 can regulate various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, receiving genotype data for a canine subject, applying a trained machine learning classifier to genotype data to determine a predicted pigmentation phenotype, and identifying canine subjects as having the predicted pigmentation phenotype.
- the computer system 201 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 201 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 205, which can be a single-core or multi-core processor, or a plurality of processors for parallel processing.
- the computer system 201 also includes memory or memory location 210 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 215 (e.g., hard disk), communication interface 220 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 225, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 210, storage unit 215, interface 220 and peripheral devices 225 are in communication with the CPU 205 through a communication bus (solid lines), such as a motherboard.
- the storage unit 215 can be a data storage unit (or data repository) for storing data.
- the computer system 201 can be operatively coupled to a computer network (“network”) 230 with the aid of the communication interface 220.
- the network 230 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 230 in some cases is a telecommunication and/or data network.
- the network 230 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- one or more computer servers may enable cloud computing over the network 230 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, receiving genotype data for a canine subject, applying a trained machine learning classifier to genotype data to determine a predicted pigmentation phenotype, and identifying canine subjects as having the predicted pigmentation phenotype.
- cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud.
- the network 230 in some cases with the aid of the computer system 201, can implement a peer-to-peer network, which may enable devices coupled to the computer system 201 to behave as a client or a server.
- the CPU 205 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 210.
- the instructions can be directed to the CPU 205, which can subsequently program or otherwise configure the CPU 205 to implement methods of the present disclosure. Examples of operations performed by the CPU 205 can include fetch, decode, execute, and writeback.
- the CPU 205 can be part of a circuit, such as an integrated circuit. One or more other components of the system 201 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- the storage unit 215 can store files, such as drivers, libraries and saved programs.
- the storage unit 215 can store user data, e.g., user preferences and user programs.
- the computer system 201 in some cases can include one or more additional data storage units that are external to the computer system 201, such as located on a remote server that is in communication with the computer system 201 through an intranet or the Internet.
- the computer system 201 can communicate with one or more remote computer systems through the network 230.
- the computer system 201 can communicate with a remote computer system of a user (e.g., a pet owner, a kennel owner, a veterinarian, a breeder, an animal shelter employee, a physician, a nurse, a caretaker, a patient, or a subject).
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 201 via the network 230.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 201, such as, for example, on the memory 210 or electronic storage unit 215.
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor 205.
- the code can be retrieved from the storage unit 215 and stored on the memory 210 for ready access by the processor 205.
- the electronic storage unit 215 can be precluded, and machine-executable instructions are stored on memory 210.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- a machine readable medium such as computer-executable code
- a tangible storage medium such as computer-executable code
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 201 can include or be in communication with an electronic display 235 that comprises a user interface (UI) 240 for providing, for example, genotype data, genetic markers, quantitative values of genetic variants, and predicted pigmentation phenotypes.
- UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 205.
- the algorithm can, for example, receive genotype data for a canine subject, apply a trained machine learning classifier to genotype data to determine a predicted pigmentation phenotype, and identify canine subjects as having the predicted pigmentation phenotype.
- Example 1 Statistical models for prediction of roaning phenotypes in the domestic dog from genetic markers
- Consumer genomics may enable genetic discovery on an unprecedented scale by linking very large databases of personal genomic data with phenotype information voluntarily submitted via web-based surveys. These databases may have a transformative effect on human genomics research, yielding insights on increasingly complex traits, behaviors, and disease by including many thousands of individuals in genome-wide association studies (GWAS).
- the promise of consumer genomic data may not be limited to human research, however. Genomic tools for canine subjects (e.g., dogs) may be readily available, with hundreds of causal Mendelian variants already characterized, because selection and breeding may lead to dramatic phenotypic diversity underlain by a simple genetic structure.
- Results are reported of a consumer genomics study conducted in a non-human model: a GWAS of blue eyes based on more than 3,000 customer dogs with validation panels including nearly 3,000 more, a large canine GWAS study.
- melanocortin-1 receptor MC1R
- similar coloration has independently evolved in multiple lineages via mutations in different genes (e.g., LYST and AIM1 in polar bears and KIT in horses with white coats). Understanding the genetic mechanisms of color variation and phenotypic convergence has shed light on how phenotypes evolve under similar selective forces (either natural or artificial).
- high conservation of melanogenesis pathways across vertebrates warrants transformative research in human genomics, such as the case of MC1R, which is strongly associated with risk of melanomas.
- Ticking and roaning are two common coat patterns observed in dogs and other domestic animals. Ticking may be characterized as small pigmented spots of varying numbers and sizes appearing on otherwise unpigmented (white) areas. Roaning is similar to, and sometimes co-occurs with, ticking, but may include pigmented and unpigmented hairs interspersed more evenly without the formation of distinct spots. Typically, individuals are not immediately born with ticking and roaning patterns, but instead these pigmented areas may develop as the individual ages, indicating time-dependent action of underlying pigmentation mechanisms.
- KIT ligand gene KITLG
- Gene interaction or epistasis is one of the key mechanisms in the formation of phenotypic diversity in both wild and domesticated species.
- An example is the three color types of Labrador Retrievers, where tyrosinase-related protein 1 (TYRP1) and MC1R determine their coat colors as black, chocolate, or yellow.
- Modifier genes may constitute a type of epistasis; for example, several variants of microphthalmia-associated transcription factor (MITF) modify the coat color of dogs by preventing the melanocyte development and migration in certain areas of the body and, in some cases, across nearly the entire body.
- Genomic regions associated with ticking and roaning coat patterns in dogs were investigated by using a total of 1,099 dogs that were genotyped at 228,830 markers covering all 38 autosomes and chromosome X. Dog owners contributed to this study by providing photographs of their dogs, from which their phenotypes were classified as ticked, roaned, or lacking these patterns, to identify genomic regions associated with these phenotypes by genome-wide association study (GWAS).
- Results were obtained as follows. A novel association was observed on chromosome 38 with roaning (but not with ticking). Further, an 11-kilobase tandem duplication was identified. Further, phenotype and genotype association was performed. Further, prediction of roaning coat pattern in an 888-dog validation panel was performed. Further, selection on CFA38 was analyzed. Further, functional annotation was performed. Further, genotyping and genome-wide association were performed. Further, identification of tandem duplication was performed.
- Table 1 shows the number of dogs and breeds used for genome-wide association study.
- a total of 1,099 dogs with profile pictures from a database were used by targeting 27 breeds and their mixes (1,000 and 99 purebred and mixed dogs, respectively) (Table 1).
- FIGs. 3A-3B show Manhattan plots of association with roaning (FIG. 3A) and ticking (FIG. 3B). Red and blue horizontal lines indicate significant (P < 5 × 10⁻⁸) and suggestive (P < 1 × 10⁻⁵) associations, respectively.
- FIGs. 4A-4B show a Q-Q plot of the association with roaning (FIG. 4A) and ticking (FIG. 4B).
- the GWAS of 320 roaned dogs (cases) and 357 non-ticked, non-roaned dogs (controls) identified two highly significant and two suggestive markers (FIG. 3A and FIGs. 4A- 4B).
- the second most significant marker overlapped with R-spondin 2 gene (RSPO2) on CFA13 at the position 8,625,896 (P = 1.4 × 10⁻¹⁸).
- the associations with RSPO2 likely resulted from the breeds with contrasting coat texture, such as Border Collies and German Wirehaired Pointers, due to the association between this gene and wiry texture of the fur.
- FGF5 fibroblast growth factor 5
- Pigmented fur may be visible in a white background (e.g., coat patterns known as Irish spotting, piebald, or extreme white), which was likely formed by MITF variants.
- GWAS was re-run by subdividing the dataset into herding breeds (Australian Cattle Dogs, Australian Shepherds, and Border Collies) and the rest of the breeds (hereafter referred to as non-herding breeds).
- FIGs. 5A-5B show Manhattan plots of association with roaning, including roaning for herding breeds (FIG. 5A) and roaning for non-herding breeds (FIG. 5B). Red and blue horizontal lines indicate significant (P < 5 × 10⁻⁸) and suggestive (P < 1 × 10⁻⁵) associations, respectively.
- the non-roaned control group was devoid of the roaning-allele (A) at the CFA38 marker, while the frequency of this allele was 72% and 65% in roaned herding and working dogs, respectively.
- FIG. 7 shows normalized read depth in 5-kb sliding windows across the significant GWAS locus on CFA38 for Australian Cattle Dogs (red), German Wirehaired Pointer (pink), and Border Collies (grey). Filled circles show the corresponding region of the Manhattan plot shown in FIGs. 3A-3B. Note that two Border Collies were heterozygous for the duplication, showing the elevated depth.
- FIG. 8 shows haplotype structure near the tandem duplication on CFA38 (position 11,031,835-11,243,237).
- Border Collies grey
- breeds with high frequency of ticking Brittany, Clumber spaniel, and English setter; purple
- breeds with high frequency of roaning Australian Cattle Dog, German Wirehaired Pointer, Wirehaired Pointer, and Wirehaired Pointing Griffon; brown
- Dalmatians red
- Rows correspond to haplotypes (two rows/individual), and columns correspond to markers.
- +/- presence and absence of the 11-kb duplication based on Manta.
- Red box 11-kb duplication (CFA38:11,131,835-11,143,234).
- Orange box a core haplotype (CFA38:11,122,646-11,167,876).
- Table 2 shows whole genome re-sequencing data used for characterizing the 11-kb tandem duplication on CFA38.
- FIG. 9 shows discordant read pairs at the duplication breakpoint on CFA38 identified in Münsterlander (top panel), Australian Cattle Dog (middle: SRR7107580), and Border Collie (bottom: SRR7107950). Outward-facing read pairs (green) indicate that this is a tandem duplication found in ticked and roaned dogs but not in Border Collie.
- structural variations (SVs) were searched for by using publicly available whole genome re-sequencing data (Table 2).
- FIGs. 10A-10B show PCR genotyping of the tandem duplication on CFA38 associated with roaning.
- FIG. 10A shows a schematic view of the design of the PCR genotyping assay. Single headed arrows indicate three pairs of primers to amplify three regions. The first (black) and the third (yellow) primer pairs should produce amplicons in all dogs regardless of the presence or absence of the duplication, while the second pair in the middle should produce an amplicon only in dogs carrying the duplication.
- FIG. 10B shows PCR genotyping of roaned and control dogs. Each gel lane corresponds to one of the PCR primer pairs depicted in FIG. 10A.
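- The lane-reading logic described for FIG. 10A can be sketched as follows. This is a minimal illustration only: the names `GelLanes` and `call_duplication` and the returned labels are hypothetical and do not come from the disclosure, and the sketch assumes each lane has already been scored as band present or band absent.

```python
from typing import NamedTuple


class GelLanes(NamedTuple):
    """Presence (True) or absence (False) of an amplicon for each primer pair."""
    control_5p: bool  # first primer pair, expected to amplify in all dogs
    midpoint: bool    # second primer pair, amplifies only across the duplication junction
    control_3p: bool  # third primer pair, expected to amplify in all dogs


def call_duplication(lanes: GelLanes) -> str:
    """Interpret one dog's three lanes (hypothetical labels, for illustration)."""
    if not (lanes.control_5p and lanes.control_3p):
        # A missing control band points to a failed reaction rather than a genotype.
        return "assay_failed"
    return "duplication_present" if lanes.midpoint else "duplication_absent"


print(call_duplication(GelLanes(True, True, True)))
print(call_duplication(GelLanes(True, False, True)))
```

- A band/no-band readout of this kind reports presence or absence of the duplication; it does not by itself distinguish one copy from two.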
- Table 3 shows (a) primer sequences used for PCR assays described in FIGs. 10A-10B. (b) Midpoint span product sequence. Nucleotides in bold and italic are likely the end of the first copy and the beginning of the second copy, respectively.
- a breakpoint PCR assay was designed by targeting the region spanning the two copies (forward and reverse primers mapping to CanFam3.1 CFA38:11,143,136-11,143,155 and CFA38:11,131,969-11,131,988, respectively) (FIGs. 10A-10B, Table 3). A total of 99 dogs (73 with roaning and 26 without roaning) were assayed. This primer pair produced a single 400-bp amplicon in 64 dogs (FIGs. 10A-10B). To define haplotypes, six markers were used in the flanking region of the duplication, including the most significant GWAS marker (FIG. 8). About 83% of the PCR-positive dogs (54 dogs) carried at least one copy of the duplication-associated haplotype (hap-1: “AGAGGA”). Six dogs (9%) had at least one copy of the recombinant haplotype identified in the whole genome re-sequencing data (hap-2: “GGAGGA”). A third haplotype associated with the duplication was identified in two dogs (3%), one with one copy and the other with two copies of this haplotype (hap-3: “GAAAAA”).
- the whole-genome variant analysis showed that the haplotypes of the two Dalmatians were identical to the duplication-associated haplotype identified in dogs that are typically associated with roaning in the region 11,031,835- 11,243,237 on CFA38 (FIG. 8, Table 2).
- Manta analysis was used to detect the tandem duplication in the corresponding region in both of these two dogs.
- phenotype and genotype association was performed as follows. Table 4 shows genotype frequencies of the markers associated with roaning in the discovery panel, including A) CFA38 Duplication and B) The top associated GWAS SNP. The presence or absence of the tandem duplication on CFA38 was predicted for the discovery panel dogs based on the three haplotypes associated with the duplication (hap-1, hap-2, and hap-3). A total of 404 dogs had at least one copy of the duplication-associated haplotypes.
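- The haplotype-based prediction of duplication status described above can be sketched as follows. The three six-marker haplotype strings are taken from this disclosure; the function names, the assumption that each dog's two phased haplotypes are available as strings over the same six markers, and the example non-carrier haplotype "GGGGGG" are hypothetical.

```python
# Six-marker haplotypes associated with the CFA38 duplication (hap-1, hap-2, hap-3).
DUPLICATION_HAPLOTYPES = {"AGAGGA", "GGAGGA", "GAAAAA"}


def duplication_copies(hap_a: str, hap_b: str) -> int:
    """Count how many of a dog's two phased haplotypes are duplication-associated."""
    return sum(hap in DUPLICATION_HAPLOTYPES for hap in (hap_a, hap_b))


def genotype_class(hap_a: str, hap_b: str) -> str:
    """Map the copy count to the classes used for the density plots."""
    labels = ("no haplotype", "heterozygote", "homozygote")
    return labels[duplication_copies(hap_a, hap_b)]


# "GGGGGG" is a made-up non-carrier haplotype used only for this example.
print(genotype_class("AGAGGA", "GGGGGG"))
print(genotype_class("GAAAAA", "GGAGGA"))
```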
- FIG. 11 shows a density distribution of ALRR for the discovery panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively).
- Vertical ticks indicate individual ALRR of dogs with roaning (orange) and without roaning (grey). Density plots with the number of individuals less than 10 are not shown, but individual ALRR is indicated with longer vertical ticks.
- Table 5 shows genotype frequencies of four pigmentation genes in roaned and non-roaned dogs, including A) A-locus: ASIP; B) E-locus: MC1R; C) B-locus: TYRP1; and D) K-locus: CBD103.
- FIG. 12 shows genotype frequency of the marker near MITF (CFA20:21836232) in roaned and non-roaned dogs.
- FIGs. 13A-13B show a density distribution of ALRR for the validation panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively), including target breeds (FIG. 13A) and mixed breeds (FIG. 13B).
- Vertical ticks indicate individual ALRR of dogs with roaning (orange) and without roaning (grey). Density plots with the number of individuals less than 10 are not shown, but individual ALRR is indicated with longer vertical ticks.
- FIGs. 14A-14D show a signature of selection in the region on CFA38 associated with roaning.
- FIG. 14A shows nucleotide diversity (π) for Wirehaired Pointing Griffon (orange), Border Collies (grey squares), and Labrador Retriever (black triangles) in 500-kb sliding windows.
- FIG. 14B shows pairwise genetic differentiation (FST) for Wirehaired Pointing Griffon (red) and Labrador Retriever (black). Border Collies were used as a reference.
- FIG. 14C shows ROH in Australian Cattle Dog (orange), Dalmatians (red), and Border Collies (grey).
- FIG. 14D shows XP-EHH in Australian Cattle Dog (orange), Dalmatians (red), and Labrador Retrievers (black). Border Collies were used as a reference. Wirehaired Pointing Griffons and Australian Cattle Dogs are commonly associated with roaning. Blue rectangle: position of the 11-kb duplication. π and FST were estimated by using whole-genome resequencing data, while ROH and XP-EHH were estimated by using Illumina genotyping data.
- Results showed that a loss-of-stop-codon mutation was detected at CFA38:11,111,286 (T -> C), but all of the putatively roan-associated dogs were homozygous for the wild-type allele at this marker.
- FIG. 15 shows human orthologous region (hg38) of the CFA38 associated with roaning (UCSC genome browser). The highlighted area in blue is the orthologous region to the tandem duplication identified in dogs with roaning, which is located within the intron 61 of USH2A.
- GeneHancer Regulatory Elements are located at chr1:215,715,579-215,717,032 (green line), which corresponds to CFA38:11,146,170-11,147,605 in the dog genome.
- DNAse I hypersensitive sites grey and black boxes.
- Open Regulatory Annotation (ORegAnno) orange and blue boxes.
- the CFA38 duplication was detected in an intronic region of USH2A, and the orthologous region in the human reference genome (hg38) was detected at chr1:215,694,945-215,712,452 based on Liftover. At least three clusters of highly conserved sequences were identified in this region (maximum PhyloP scores of 5.5, 4.3, and 4.1), which overlapped with DNAse I hypersensitive sites and transcription factor binding sites annotated by Open Regulatory Annotation (ORegAnno) (FIG. 15). In addition, there were two additional regions of high conservation outside the duplication (maximum PhyloP scores of 9.6 and 9.1), which were annotated as transcription factor binding sites by ORegAnno and interaction regions by GeneHancer based on Hi-C mapping.
- R-locus e.g., the CFA38 duplication
- S-locus e.g., MITF on CFA20
- certain S-locus variants may override the effect of the CFA38 duplication.
- a functional assay performed by using USH2A knockout mice may show that this gene is involved in the maintenance of retinal photoreceptors and the development of cochlear (inner ear) hair cells. Further, a mutation in USH2A may show abnormal pigment deposition and reduced expression of MITF in retinal cells derived from induced pluripotent stem cells. Further, the distribution of usherin in healthy individuals is highly conserved between mice and humans, in which skin was completely devoid of this protein. The duplication of the putative regulatory regions may result in ectopic expression of USH2A in skin melanocytes. Alternatively, the duplication may facilitate alternative splicing and create a novel protein isoform since this complex gene with 73 exons may form several isoforms.
- German Shorthaired Pointers with roaned coat may have been favored by hunters because they blend with forest better than white dogs.
- the duplication-associated haplotypes were sporadically identified in various breeds, including Akitas, Siberian Huskies, and village dogs (e.g., indigenous dogs that accompany humans but are not selectively bred), indicating that selection acted on a variation that existed in the ancestral canine population (e.g., “soft sweep”).
- S-locus may be molecularly characterised, and the .v" variant at MITF may be required to have white fur as a base color.
- T-locus may be a responsible locus for creating “ticks” or pigmented spots to the white coat but, with a modifier locus on CFA3, causing fewer and larger spots.
- the Australian Cattle Dog may have been established in Australia in the 19th century by crossing Collie-type dogs with Dingos (a wild dog in Australia), Bull Terriers, Kelpies and Dalmatians. Therefore, the duplication-associated haplotypes on CFA38 may have been introgressed from Dalmatians to the ancestral population of Australian Cattle Dog during its breed formation, followed by selection that increased the frequency of the duplication. This is in line with the above hypothesis because decoupling the allelic combinations at the modifier locus on CFA3 and the roaning locus on CFA38 revealed the putatively ancestral roaning coat pattern.
- a tentative causal mutation lies in a non-coding region which may modify expression patterns of USH2A.
- Darwin may not have been convinced that all of the domestic dog breeds have descended from any one wild species, but novel epistatic interactions and rewiring regulatory networks can result in a burst of phenotypic divergence.
- FIGs. 16A-16H show representative coat phenotypes, including German Wirehaired Pointer (roaned) (FIG. 16A); Australian Cattle Dog (roaned) (FIG. 16B); a mixed breed of Treeing Walker Coonhound and Bluetick Coonhound (ticked) (FIG. 16C); a Border Collie (ticked) (FIG. 16D); an English Setter (both roaned and ticked) (FIG. 16E); an Australian Cattle Dog (both roaned and ticked) (FIG. 16F); a Pointer (without roaning and ticking) (FIG. 16G); and an Australian Cattle Dog (without roaning and ticking) (FIG. 16H).
- FIGs. 16A, 16C, 16E, and 16G are non-herding breeds, while FIGs. 16B, 16D, 16F, and 16H are herding breeds. Dogs were scored for roaning based on their photographs as follows. If roaning was observed on any part of the body, the dog was scored as roaned. Similarly, dogs were classified as ticked if they had any spots on their body, and the extent of ticking was scored from the scale one (lightly ticked) to five (heavily ticked).
- because ticking and roaning may result from a similar genetic mechanism, roaned dogs were never considered as ‘not ticked’ controls, nor were ticked dogs considered ‘not roaned’ controls. However, dogs could be considered both ticked and roaned if both patterns were clearly visible in the coat.
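- The case and control rules described above can be sketched as a small function. This is an illustrative reading of the text, not the procedure actually used: the function name `gwas_status`, its return values, and the choice to treat a dog showing both patterns as a case in both analyses are assumptions.

```python
from typing import Optional


def gwas_status(roaned: bool, ticked: bool, trait: str) -> Optional[str]:
    """Return "case", "control", or None (excluded) for one dog in one trait's GWAS."""
    if trait not in ("roaning", "ticking"):
        raise ValueError("trait must be 'roaning' or 'ticking'")
    has_trait = roaned if trait == "roaning" else ticked
    if has_trait:
        return "case"
    if roaned or ticked:
        # The dog shows the other pattern, so it is never used as a control.
        return None
    # Only dogs lacking both patterns serve as controls.
    return "control"


print(gwas_status(roaned=True, ticked=False, trait="ticking"))   # excluded
print(gwas_status(roaned=False, ticked=False, trait="roaning"))  # control
```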
- a set of 1,099 adolescent and adult dogs was identified whose coat pattern could be assumed to be developmentally complete (approximately 6 months or older) (FIGs. 16A-16H).
- a total of 27 breeds were included in the discovery panel (Table 1). First-generation crosses of these breeds or more advanced generation crosses with the proportion of the primary breeds higher than 80% were also included in the discovery panel.
- coat phenotype data was collected of 529 herding group dogs, 90 working group dogs, and 302 mixed breed dogs to validate the prediction of coat phenotype based on genetic markers (“validation panel”).
- Mixed breed dogs were selected if the proportion of their primary breeds was less than 50%, and if they were a mix of three or more breeds based on an ancestry prediction algorithm.
- the MITF marker was included to ensure that the dataset had approximately equal number of dogs with and without white areas in their body. A set of 33 dogs was removed because of the presence of both ticked and roaned areas. To reduce observer bias, all dogs’ phenotypes in both discovery and validation panels were scored by the same person, who was blinded to the dog genotypes and their genetic ancestry at the time of phenotyping.
- Genotypes of the dogs were collected by using high-density SNP arrays (232,972 markers, of which 228,830 markers were on autosomes and chromosome X). A mean genotyping rate of 97.4% was obtained across all dogs. After removing markers with minor allele frequency less than 1%, a set of 187,496 markers was obtained, for which the genotyping rate was 99.8%. Genotyping rate calculation and marker filtering were performed by PLINK v1.9.
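- The two quantities used in this filtering step, minor allele frequency and genotyping rate, can be illustrated with a small pure-Python sketch. It is not a substitute for PLINK and makes no claim to reproduce PLINK's exact handling of edge cases; the function names and the 0/1/2/None genotype encoding are assumptions made for the example.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1, or 2 copies of one allele; None marks a missing call


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele at one marker, ignoring missing calls."""
    called = [g for g in calls if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def genotyping_rate(markers: Sequence[Sequence[Genotype]]) -> float:
    """Fraction of non-missing calls over all markers and dogs."""
    total = sum(len(calls) for calls in markers)
    called = sum(g is not None for calls in markers for g in calls)
    return called / total if total else 0.0


def filter_by_maf(markers: Sequence[Sequence[Genotype]],
                  min_maf: float = 0.01) -> List[Sequence[Genotype]]:
    """Keep markers whose minor allele frequency is at least min_maf."""
    return [calls for calls in markers if minor_allele_frequency(calls) >= min_maf]


# Toy data: three markers (rows) genotyped in four dogs (columns).
toy_markers = [[0, 0, 0, 0], [0, 1, 2, None], [2, 2, 2, 2]]
kept = filter_by_maf(toy_markers)
print(len(kept), genotyping_rate(toy_markers))
```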
- LRR probe intensity
- Manta is used, which uses paired and split-read evidence for SV detection in mapped sequencing reads.
- whole genome sequence data was obtained for 38 dogs of the eight breeds from the NCBI Sequence Read Archive (Table 2). They were selected because of the high prevalence of ticking and roaning patterns in these breeds.
- Sequence reads of these samples were mapped to CanFam3.1 reference genome by using the BWA-MEM algorithm in BWA. Read depths for all sites were calculated by using the GATK DepthOfCoverage tool. To visualize the CFA38 duplication breakpoints, mean per site read depths were calculated for non-overlapping 5-kb windows along CFA38 and then were divided by the autosome average read depth for normalization. Discordant read pairs were visualized by Integrative Genomics Viewer (IGV). To identify haplotypes associated with the CFA38 duplication, single nucleotide variants of 722 dogs and other canid species were phased by Beagle v4.1 with default parameter settings. Genetic map positions were derived from an LD-based canine recombination map. Haplotypes of dogs genotyped by a custom microarray were reconstructed by using a reference panel, with missing data imputed using Eagle2.
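- The read-depth normalization described above (mean per-site depth in non-overlapping windows, divided by the autosome average) can be sketched as follows. The function name, the in-memory list of per-site depths, and the 0-based window offsets are assumptions for illustration; in practice the depths would come from a tool such as GATK DepthOfCoverage.

```python
from typing import List, Sequence, Tuple


def windowed_normalized_depth(depths: Sequence[float],
                              autosome_mean: float,
                              window: int = 5000) -> List[Tuple[int, float]]:
    """Return (0-based window offset, normalized mean depth) per non-overlapping window."""
    if autosome_mean <= 0:
        raise ValueError("autosome_mean must be positive")
    result = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        result.append((start, mean_depth / autosome_mean))
    return result


# Toy example with a window of 4 sites; the middle window has doubled depth.
toy_depths = [10] * 4 + [20] * 4 + [10] * 4
print(windowed_normalized_depth(toy_depths, autosome_mean=10.0, window=4))
```

- Under a simple diploid expectation, a window inside a heterozygous tandem duplication would be expected near 1.5 and a homozygous one near 2.0, although real data are noisier.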
- Haplotypes associated with the CFA38 duplication were validated by a breakpoint PCR assay. Three pairs of primers were designed to amplify three regions: 1) the midpoint spanning the duplication (midpoint primer pair), 2) 5’ flanking region of the duplication start region (5’ control primer pair), and 3) 3’ flanking region of the duplication end region (3’ control primer pair) (Table 3). One microliter of total DNA was used for PCR reactions using the following primer combinations: Tick38-F2-2 and Tick38_R1 (midpoint primer pair), Tick38_F1 and Tick38_R1 (5’ control primer pair), and Tick38-F2-2 and Tick38-R2-2 (3’ control primer pair).
- PCR reactions were performed using GoTaq G2 Hot Start Green Master Mix (Promega M7422) in a total volume of 25 microliters (µL) following the manufacturer’s protocol. The following cycling parameters were used: 95 °C for 3 minutes, 35 cycles of (95 °C for 30 seconds, 58 °C for 30 seconds, and 72 °C for 30 seconds), 72 °C for 5 minutes, and a 12 °C hold.
- PCR product was visualized on a 1.5% agarose gel with 1X GelRed (Biotium Cat No 41003); product from one dog was submitted for purification and Sanger sequencing at Genewiz (Genewiz.com).
- signatures of selection were detected as follows. Pairwise nucleotide diversity (π) was calculated using VCFTools v0.1.16 for Wirehaired Pointing Griffons, Border Collies, and Labrador Retrievers, separately in 500-kb sliding windows with 10-kb steps along CFA38. Genetic differentiation was measured as FST between breeds (Wirehaired Pointing Griffon vs. Border Collies and Labrador Retrievers vs. Border Collies) in the same window sizes. Whole-genome variant data were used. Sites with missing genotype rate larger than 50% were excluded.
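- As a rough illustration of windowed nucleotide diversity, the sketch below sums a standard per-site estimate for biallelic sites, 2k(n-k)/(n(n-1)) for k alternate alleles among n sampled chromosomes, and divides by the window length. It is not the VCFTools implementation, which may treat missing data and window boundaries differently; all function names and the input layout are assumptions.

```python
from typing import List, Sequence, Tuple


def site_pi(alt_count: int, n_chrom: int) -> float:
    """Per-site diversity for a biallelic site: 2k(n-k) / (n(n-1))."""
    if n_chrom < 2:
        return 0.0
    return 2 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def sliding_window_pi(sites: Sequence[Tuple[int, int, int]],
                      chrom_length: int,
                      window: int = 500_000,
                      step: int = 10_000) -> List[Tuple[int, int, float]]:
    """sites holds (1-based position, alt allele count, chromosomes sampled).

    Returns (window start, window end, pi per base pair) for each window.
    """
    results = []
    start = 1
    while start <= chrom_length:
        end = min(start + window - 1, chrom_length)
        total = sum(site_pi(k, n) for pos, k, n in sites if start <= pos <= end)
        results.append((start, end, total / (end - start + 1)))
        start += step
    return results


# Toy example: three variant sites on a 20-bp "chromosome", 10-bp windows, 5-bp steps.
toy_sites = [(2, 1, 4), (5, 2, 4), (12, 2, 4)]
toy_windows = sliding_window_pi(toy_sites, chrom_length=20, window=10, step=5)
print(toy_windows)
```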
- ROH runs of homozygosity
- XP-EHH cross-population extended haplotype homozygosity
- the frequency of ROH at each marker position was calculated by dividing the sum of ROH state (absence or presence as 0 or 1, respectively) by the total number of individuals. This indicated the proportion of autozygous individuals at a given marker position along a chromosome.
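- The per-marker ROH frequency described above reduces to a column mean over a 0/1 matrix. A minimal sketch, assuming ROH states have already been called per individual and per marker (the function name and matrix layout are hypothetical):

```python
from typing import List, Sequence


def roh_frequency(roh_states: Sequence[Sequence[int]]) -> List[float]:
    """roh_states[i][j] is 1 if individual i is inside an ROH at marker j, else 0.

    Returns, for each marker, the proportion of individuals in an ROH.
    """
    n_individuals = len(roh_states)
    if n_individuals == 0:
        return []
    n_markers = len(roh_states[0])
    return [sum(ind[j] for ind in roh_states) / n_individuals
            for j in range(n_markers)]


# Toy example: four dogs (rows) scored at three markers (columns).
toy_states = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
print(roh_frequency(toy_states))
```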
- XP-EHH was calculated for Australian Cattle Dogs, Dalmatians, and Labrador Retrievers (with Border Collies as a reference breed) by using rehh R package.
- Example 2 Statistical models for prediction of pigmentation phenotypes in the domestic dog from genetic markers
- the intensity of red/fawn color in the hair coat varies from cream (low intensity) to dark red (high intensity) based on the amount of pheomelanin in the hair cells.
- Three genetic loci may be shown to be significantly associated with this variation in certain breeds, but none of them may be highly predictive of coat color in two of the most popular breeds in the United States, Labrador and Golden Retrievers, or in mixed breed dogs, which comprise the majority of the global canine population.
- 601 purebred Yellow Labrador Retrievers and Golden Retrievers with known coat colors.
- Three genomic loci were identified that showed significant associations with coat color intensity: canFam3.1 chromosome (chr) 2: 74.7Mb, chr20: 55.8Mb, and chr21: 10.9Mb.
- the chr2 and chr21 associations may not have been previously reported.
- the chr2 and chr20 associations were also significant in an independent sample of 630 mixed breed dogs.
- GWAS results showed that in most dog breeds, coat color intensity is a polygenic trait, meaning that it is controlled by multiple loci which may interact with each other.
- a common approach for accurately predicting polygenic trait phenotypes is to fit a statistical model with phenotype as a function of genotypes at many markers.
- a model fit on a sufficiently large and representative “training” sample may be used to accurately predict phenotypes for new individuals given their genotypes, even without knowing the underlying genetic architecture of the trait.
- a multiple linear regression model was developed that uses genotype data at 10 genetic markers (SNPs) that were significantly associated with coat color intensity phenotype in the GWAS as the independent variables, and coat color intensity phenotype as the dependent variable.
- SNPs genetic markers
- a dog's coat color intensity phenotype could be predicted on a scale of 1 (cream) to 6 (dark red) with at least 70% to 80% accuracy, and whether the dog has a cream coat versus a darker coat could be predicted with at least 85% to 95% accuracy (depending on the breed).
- Ticking may refer to another type of canine coat color variation, where small pigmented spots of varying numbers and sizes appear on otherwise unpigmented (white) areas of the hair coat. Roaning is similar to and sometimes co-occurs with ticking, but comprises more evenly intermingled pigmented and unpigmented hairs rather than distinct spots. “T locus” and “R locus” may be responsible loci for ticking and roaning, respectively, although they may not have been precisely mapped or characterized at a molecular level.
- Tongue color variation is similar to coat color variation in the sense that both show phenotypes of unpigmented, partially pigmented (e.g., spotted or blotched), and completely pigmented patterns.
- Completely pigmented “black” tongue is a part of the breed standard for some breeds, such as Chow Chows and Shar-Peis. It is possible that dogs with spotted tongues have some proportion of Chow Chow or Shar-Pei ancestry, but the presence of spotted tongues may also occur in purebred dogs of other breeds.
- the SNP-based test may include numerous advantageous features, including 1) the use of polygenic prediction models for pigmentation phenotypes, 2) a training panel of a large number of purebred and mixed breed dogs, 3) the use of a set of novel, high-effect genomic loci to predict pigmentation phenotypes, 4) the known accuracy of pigmentation phenotype prediction in a number of breeds as well as mixed breed dogs, and 5) prediction of the expected range of pigmentation phenotypes in litters produced by pairs of tested dogs.
- polygenic tests may be developed to predict coat color intensity phenotype in Labrador and Golden retrievers or mixed breed dogs, and/or to use information from more than one genetic locus, and/or to cover the additional pigmentation phenotypes of ticking and/or roaning and tongue pigmentation.
- a dog’s predicted pigmentation phenotypes may be used in conjunction with a matchmaker tool to plan matings between pairs of dogs that are more likely to produce the desired phenotype while minimizing genome-wide inbreeding levels and risk for over 180 genetic health conditions.
- Table 8 Markers used in the model for predicting coat color intensity
- This set of markers may be used to accurately predict coat color intensity phenotype (e.g., by applying the following equation).
- $\hat{y}$ is the predicted numeric phenotype value and $X_1$ through $X_{10}$ are the number of alleles associated with darker coat color that the dog of interest has at BICF2P1302896, BICF2P828524, BICF2G630655699, BICF2G630433130, TIGRP2P30892_rs8643466, TIGRP2P31085_rs8981024, BICF2S245539, BICF2P1392970, BICF2P202986, and BICF2S23541470 (respectively).
- the dog is classified as likely to have a cream coat; and if $\hat{y}$ is greater than 1.5, the dog is classified as likely to have a yellow or red coat.
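The scoring and thresholding step described above may be sketched as follows. The fitted intercept and per-marker coefficients are not reproduced in this text, so `INTERCEPT` and `WEIGHTS` below are hypothetical placeholders; only the ten allele-count inputs and the 1.5 cutoff come from the description. Treating a value of exactly 1.5 as cream is also an assumption.

```python
# Hypothetical coefficients: the fitted values are not reproduced in this text.
INTERCEPT = 1.0
WEIGHTS = [0.25] * 10  # one placeholder weight per marker, in the listed order

CREAM_CUTOFF = 1.5  # threshold stated in the text


def predict_intensity(dark_allele_counts):
    """Return the predicted numeric phenotype for ten allele counts (0, 1, or 2)."""
    if len(dark_allele_counts) != len(WEIGHTS):
        raise ValueError("expected one allele count per marker")
    if any(c not in (0, 1, 2) for c in dark_allele_counts):
        raise ValueError("allele counts must be 0, 1, or 2")
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, dark_allele_counts))


def classify(y_hat):
    """Yellow or red above the cutoff; cream otherwise (tie handling is assumed)."""
    return "yellow or red" if y_hat > CREAM_CUTOFF else "cream"
```

With the placeholder weights, a dog carrying no darker-coat alleles scores 1.0 and is classified as cream, while a dog homozygous for the darker-coat allele at every marker scores 6.0.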
- all possible genotype combinations can be determined in pups produced by mating those dogs.
- the predicted coat color intensity phenotype may be obtained for each genotype combination.
- the predicted range of coat color intensity phenotypes, as well as the expected frequency of each phenotype may be reported in litters produced by that pair of parents.
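The litter prediction described above may be sketched as an enumeration of offspring genotype combinations. This sketch assumes simple Mendelian transmission and independent segregation of the markers, which may not hold for markers on the same chromosome; the `predict` argument stands in for whatever phenotype model is used and is a hypothetical callable.

```python
from collections import Counter
from itertools import product


def offspring_distribution(sire_count, dam_count):
    """Mendelian distribution of offspring allele counts at one biallelic marker.

    Each parent count is the number of copies (0, 1, or 2) of the tracked allele.
    """
    def gametes(count):
        # A gamete carries the tracked allele with probability count / 2.
        return {1: count / 2, 0: 1 - count / 2}

    dist = Counter()
    for (a, pa), (b, pb) in product(gametes(sire_count).items(),
                                    gametes(dam_count).items()):
        if pa * pb > 0:
            dist[a + b] += pa * pb
    return dict(dist)


def litter_phenotypes(sire, dam, predict):
    """Expected frequency of each predicted phenotype, assuming unlinked markers."""
    per_marker = [offspring_distribution(s, d) for s, d in zip(sire, dam)]
    out = Counter()
    for combo in product(*(m.items() for m in per_marker)):
        counts = [c for c, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[predict(counts)] += prob
    return dict(out)
```

For two parents that are both heterozygous at a single marker, `offspring_distribution(1, 1)` gives the familiar 1:2:1 split across allele counts 2, 1, and 0.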
- Table 10 Red-associated allele frequencies by breed in GWAS samples for coat color intensity predictive model SNPs
- roaned coat pattern was predicted as follows.
- Table 11 Markers used in the model for predicting roaned coat pattern
- the marker on chr20 is in the putative regulatory region of MITF (S locus). For roaned (pigmented) hairs to be visible, two copies of the “G” variant are required to make the base coat color white, since this variant is recessive.
- the duplication on chr38 is located between BICF2S23536290 and BICF2P1396284. The following allelic combinations of the six markers on chr38 were strongly associated with the duplication: AGAGAA, GGAGAA, GAAAAA, and GGAAAA.
- the roaning coat pattern is predicted if a dog has: 1) at least one copy of the duplication-associated allelic combinations (AGAGAA, GGAGAA, GAAAAA, and GGAAAA) at R locus on chr 38; and 2) two copies of the G variant at S locus on chr 20.
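The two-part rule above may be expressed as a short check. The four allelic combinations are taken from the text; the function name, the input representation (two phased six-marker haplotype strings and a two-letter chr20 genotype) are assumptions made for illustration.

```python
# Six-marker allelic combinations on chr38 that the text associates with the duplication.
DUPLICATION_HAPLOTYPES = {"AGAGAA", "GGAGAA", "GAAAAA", "GGAAAA"}


def predict_roan(chr38_haplotypes, chr20_genotype):
    """Apply the rule: a duplication-linked haplotype plus G/G at the S locus marker.

    chr38_haplotypes: the dog's two phased six-marker haplotypes (strings).
    chr20_genotype: the dog's two alleles at the chr20 marker, e.g. "GG".
    """
    has_duplication = any(h in DUPLICATION_HAPLOTYPES for h in chr38_haplotypes)
    white_base = sorted(chr20_genotype) == ["G", "G"]
    return has_duplication and white_base
```

A dog with one duplication-associated haplotype but only one copy of the G variant on chr20 is not predicted to show roaning under this rule, since the white base coat is required for the pigmented hairs to be visible.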
- Table 12 shows a model prediction accuracy for roaned coat pattern.
- tongue pigmentation phenotype was predicted as follows.
- a tongue pigmentation prediction model was constructed that uses direct genotyping of the 149-kb tandem duplication based on 21 markers (canFam3.1 position chr37: 28,543,289-28,692,507) or genotyping of linked markers used to predict duplication genotype (e.g. the [G/A] SNP at position chr37:28,616,075), as well as direct genotyping of TMEM40 or linked markers (e.g., the [A/G] SNP at position chr20:5,843,762), MITF or linked markers (e.g.
- the copy number (CN) of the duplicated region on chr 37 was determined based on the signal intensity of the probes on a custom Illumina CanineHD Beadchip as well as the estimated number of risk alleles.
- Table 13 shows the markers used in the model for predicting tongue pigmentation phenotype.
- Table 13 Markers used in the model for predicting tongue pigmentation
- Table 14 Effect sizes (β) of four loci associated with tongue pigmentation (partial or complete) by multinomial logistic regression analysis.
- Example 3 - R-locus for roaned coat is associated with a tandem duplication in an intronic region of USH2A in dogs and also contributes to Dalmatian spotting
- Structural variations (SVs) may represent a large fraction of all genetic diversity, but how this genetic diversity is translated into phenotypic and organismal diversity may be unclear. Explosive diversification of dog coat color and patterns after domestication can provide a unique opportunity to explore this question; however, a significant obstacle is to efficiently collect a sufficient number of individuals with known phenotypes and genotypes of hundreds of thousands of markers.
- a genomic region on chromosome 38 was identified that is strongly associated with a mottled coat pattern (roaning) by genome-wide association study.
- a putative causal variant was identified in this region, an 11-kb tandem duplication (11,131,835-11,143,237) characterized by sequence read coverage and discordant reads of whole-genome sequence data, microarray probe intensity data, and a duplication-specific PCR assay.
- the tandem duplication is in an intronic region of the usherin gene (USH2A); it was perfectly associated with roaning and absent in non-roaned dogs.
- MC1R melanocortin-1 receptor
- Similar coloration has independently evolved in multiple lineages via mutations in different genes (e.g., LYST and AIM1 in polar bears and KIT and MATP in horses with white coats) [refs. 3-5]. Understanding the genetic mechanisms of color variation and phenotypic convergence has shed light on how novel phenotypes evolve under similar selective forces (either natural or artificial).
- Ticking and roaning are two common coat patterns observed in dogs and other domestic animals. Ticking may be characterized as small pigmented spots of varying numbers and sizes appearing on otherwise unpigmented (white) areas. Roaning may be similar to and sometimes co-occurs with ticking but may be characterized with pigmented and unpigmented hairs interspersed more evenly without the formation of distinct spots.
- the distinctive spots of the Dalmatian breed may be believed to be a modified form of ticking in which each tick or spot is enlarged and made distinctive by a modifier locus (flecking locus, F-locus) mapped on canine chromosome (CFA) 3 [ref. 8].
- a modifier locus flecking locus, F-locus
- CFA canine chromosome
- KIT ligand gene KITLG
- Gene interaction or epistasis is a key mechanism in the formation of phenotypic diversity in both wild and domesticated species.
- An example is the three color types of Labrador Retrievers, where tyrosinase-related protein 1 (TYRP1) and MC1R determine their coat colors as black, chocolate, or yellow [ref. 15].
- Modifier genes constitute a type of epistasis; for example, several variants of microphthalmia-associated transcription factor (MITF) modify the coat color of dogs by preventing the melanocyte development and migration in certain areas of the body and, in some cases, across nearly the entire body. This results in a loss of pigmentation leading to white markings in otherwise uniformly colored areas [refs.
- MITF microphthalmia-associated transcription factor
- S-locus is a major locus controlling this white spotting pattern, and several variants within and close to MITF have been identified, including a SINE insertion at 3 kilobase (kb) upstream of the MITF transcription start site (TSS) and a variable length polymorphism (Lp) at 100 bp upstream of MITF TSS [refs. 17- 18] Both T-locus and R-locus are considered as modifier loci by locally changing coat color from white to pigmented through the interaction with S-locus [ref.
- Genomic regions associated with ticking and roaning coat patterns in dogs were investigated by using a total of 1,281 purebred dogs for marker discovery (“discovery panel”) and 274 mixed breed dogs for marker validation (“validation panel”) that were genotyped using an Embark SNP array with 220,484 markers covering all 38 autosomes and chromosome X. Dog owners contributed to this study by providing photographs of their dogs, from which their phenotypes were classified as ticked, roaned, or lacking these patterns to identify genomic regions associated with these phenotypes by genome-wide association study (GWAS).
- GWAS genome-wide association study
- Phenotype data collection was performed as follows. Owner-submitted photographs were used to evaluate coat patterns of dogs in a veterinary database where the owner agreed to participate in scientific research. To ensure a high level of confidence in correctly assessing the coat patterns, the following selection criteria were applied based on the photograph and on the dog itself to determine if each individual was a good candidate for the study. Photographs had to be of high quality, in focus, well-lit, and not show evidence of filter use or image-editing. In addition, photographs that included multiple dogs or that depicted a dog very far from the camera were excluded. A reasonable amount or the entirety of the dog's body had to be shown in the photograph, especially areas where white patterns likely governed by S-locus [refs.
- dogs were classified as ticked if they had any spots on their body, and the extent of ticking was scored on a scale of either one (lightly ticked) or two (heavily ticked). Because ticking and roaning may result from a similar genetic mechanism, roaned dogs were never considered as ‘not ticked’ controls, nor were ticked dogs considered ‘not roaned’ controls. However, dogs could be considered both ticked and roaned if both patterns were clearly visible in the coat. A set of 1,281 adolescent and adult dogs was identified whose coat pattern may be assumed to be developmentally complete (approximately 6 months or older). A total of 27 breeds were included in the discovery panel.
- Genotyping and genome-wide association were performed as follows. DNA was extracted by Illumina, Inc. from buccal swab samples collected by dog owners. Genotypes of the dogs were collected by using custom Illumina Canine high-density SNP arrays (a total of 220,484 markers). Mean genotyping rate was 97.4% across all dogs. After removing markers with minor allele frequency less than 1%, a set of 176,910 markers was used, for which the genotyping rate was 99.8%. Genotyping rate calculation and marker filtering were performed by PLINK v1.9 [ref. 23].
- haplotypes of roaned and non-roaned dogs were reconstructed from the array genotypes by using Beagle v4.1 with default parameter settings [ref. 25]. Genetic map positions were derived from an LD-based canine recombination map [ref. 26].
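The genotyping-rate and minor-allele-frequency filtering described above was performed with PLINK; the sketch below only illustrates the underlying arithmetic on a small in-memory table and does not reproduce PLINK itself. The data layout (a dictionary of marker name to per-dog allele counts, with `None` for a missing call) and the function names are assumptions.

```python
def call_rate(genotypes):
    """Fraction of non-missing calls; genotypes are 0/1/2 allele counts or None."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def filter_markers(matrix, min_maf=0.01):
    """Keep markers (rows of per-dog genotypes) whose MAF is at least min_maf."""
    return {name: row for name, row in matrix.items()
            if minor_allele_frequency(row) >= min_maf}
```

Markers that are monomorphic in the sample have a minor allele frequency of zero and are dropped by the 1% threshold.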
- Haplotypes associated with the CFA38 duplication were validated by a breakpoint PCR assay. Three pairs of primers were designed to amplify three regions in separate PCR reactions: 1) the midpoint spanning the duplication (midpoint primer pair), 2) 5’ flanking region of the duplication start region (5’ control primer pair), and 3) 3’ flanking region of the duplication end region (3’ control primer pair).
- One microliter of total DNA was used for PCR reactions using the following primer combinations: Tick38-F2-2 and Tick38_R1 (midpoint primer pair), Tick38_F1 and Tick38_R1 (5’ control primer pair), and Tick38-F2-2 and Tick38-R2-2 (3’ control primer pair).
- PCR reactions were performed using GoTaq G2 Hot Start Green Master Mix (Promega M7422) in a total volume of 25 µL following the manufacturer’s protocol. The following cycling parameters were used: 95°C for 3 minutes, 35X (95°C for 30 seconds, 58°C for 30 seconds, 72°C for 30 seconds), 72°C for 5 minutes, 12°C hold.
- PCR product was visualized on a 1.5% agarose gel with 1X GelRed (Biotium Cat No 41003); the products from three dogs were submitted for purification and Sanger sequencing at the Biotechnology Resource Center at Cornell University.
- Detecting signatures of selection was performed as follows. Pairwise nucleotide diversity (π) was calculated using VCFTools v0.1.16 [ref. 32] for Wirehaired Pointing Griffons, Border Collies, and Labrador Retrievers, separately in 500-kb sliding windows with 10-kb steps along CFA38. Genetic differentiation was measured as $F_{ST}$ between breeds (Wirehaired Pointing Griffon vs. Border Collies and Labrador Retrievers vs. Border Collies) in the same window sizes. Whole-genome variant data reported in [ref. 31] were used. Sites with missing genotype rates larger than 50% were excluded.
- ROH runs of homozygosity
- XP-EHH cross-population extended haplotype homozygosity
- the frequency of ROH at each marker position was calculated by dividing the sum of ROH state (absence or presence as 0 or 1, respectively) by the total number of individuals. This indicated the proportion of autozygous individuals at a given marker position along a chromosome.
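The per-marker ROH frequency described above may be computed as a column mean over a binary dog-by-marker table. The input layout (one list of 0/1 states per dog) and the function name are assumptions for illustration; how the ROH segments themselves are called is outside this sketch.

```python
def roh_frequency(roh_states):
    """Per-marker proportion of dogs whose genotype lies inside an ROH.

    roh_states: one list per dog, with 1 if the marker is inside an ROH, else 0.
    """
    n_dogs = len(roh_states)
    n_markers = len(roh_states[0])
    if any(len(row) != n_markers for row in roh_states):
        raise ValueError("all dogs must be scored at the same markers")
    return [sum(row[m] for row in roh_states) / n_dogs for m in range(n_markers)]
```

A value of 0.75 at a marker means that three out of four dogs are autozygous at that position.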
- XP-EHH was calculated for Australian Cattle Dogs, Dalmatians, and Labrador Retrievers (with Border Collies as a reference breed) using the rehh R package [ref. 34].
- Participating dogs were part of a veterinary customer base. Owners provided informed consent to use their dogs’ data in scientific research by agreeing to the following statement: “I want this dog’s data to contribute to medical and scientific research”. Ethical approval was not required as non-invasive methods for genotype or phenotype collection were used (buccal swab and photographing, respectively). Dogs were never handled directly by researchers. Owners were given the opportunity to opt out of the study at any time during data collection.
- a novel association on chromosome 38 was observed with roaning, but not with ticking.
- a total of 1,281 purebred dogs was selected with profile pictures where dogs showed white spotting patterns in their bodies. Inspection of customer-provided photographs identified 344 dogs with varying degrees of ticking, 358 dogs with a roaning pattern on some part of the body, and 579 dogs without any noticeable ticking or roaning in any part of their bodies (e.g., “control” dogs). Dogs that exhibited both phenotypes (ticking and roaning) were excluded from the study.
- FIGs. 17A-17B show Manhattan plots of association with roaning (FIG. 17A) and ticking (FIG. 17B). Upper and lower horizontal lines mark the thresholds for significant ($P < 5 \times 10^{-8}$) and suggestive ($P < 1 \times 10^{-5}$) associations, respectively.
- the non-roaned control group was completely devoid of the roan-associated “A” allele at the most significant marker at the position 11,085,443 on CFA38, while 57% and 38% of roaned dogs were AA homozygous and AG heterozygous, respectively, indicating a dominant action of this locus.
- a total of 321 haplotypes were identified in this region based on 52 markers, among which 21 haplotypes had the roan-associated “A” allele at the position 11,085,443.
- VEP Variant Effect Predictor
- the remaining five dogs with the duplication had either one copy of the duplication-associated haplotype or a potential recombinant haplotype of the duplication-associated haplotype by sharing a core haplotype from the positions 11,122,646-11,167,876.
- the dogs without the duplication did not have the duplication-associated haplotype or similar ones.
- two Dalmatians in the WGS data were both homozygous for the duplication-associated haplotype (FIG. 19).
- FIG. 18 shows normalized read depth in 5-kb sliding windows across the significant GWAS locus on CFA38 for Australian Cattle Dogs, German Wirehaired Pointer, and Border Collies. Filled circles show the corresponding markers of the Manhattan plot shown in FIG.
- FIG. 19 shows haplotypes near the marker on CFA38 significantly associated with roaning for Border Collies, breeds with high frequency of ticking, breeds with high frequency of roaning, and Dalmatians. Rows correspond to haplotypes (two rows/individual), and columns correspond to markers. The positions of the first and last markers are 11,031,835 and 11,243,237, respectively. +/-: presence and absence of the 11-kb duplication based on Manta. Red box: 11-kb duplication (CFA38:11,131,835-11,143,234). Yellow box: a core haplotype (CFA38:11,122,646-11,167,876).
- Red triangle: the most significant marker associated with roaning.
- Green triangle: markers used for defining the duplication-associated haplotypes. Photos of representative breeds are shown (from top to bottom: Border Collie, Münsterländer, German Wirehaired Pointer, and Dalmatian).
- five samples were available for the breakpoint PCR assay. All of these five samples produced the 400-bp amplicon. There was one homozygous dog and one heterozygous dog for hap GGOl, indicating a potential recombination event between the markers at 11,120,096 and 11,140,091.
- FIGs. 20A-20B show PCR genotyping of the tandem duplication on CFA38 associated with roaning.
- FIG. 20A Schematic view of the design of the PCR genotyping assay. Yellow boxes indicate the duplicated region. Single-headed arrows indicate pairs of primers to amplify three regions. The first (black) and the third (yellow) primer pairs should produce amplicons in all dogs regardless of the presence or absence of the duplication, while the second pair in the middle should produce an amplicon only in dogs carrying the duplication. Representative coat patterns of non-roaned (top) and roaned dogs (bottom) are shown (left: non herding group, right: herding group).
- FIG. 20B PCR genotyping of roaned and control dogs. Each gel lane corresponds to PCR primer pairs depicted in panel A.
- A total of 357 dogs had at least one copy of the duplication-associated haplotypes.
- the presence of the duplication-associated haplotypes explained all roaned cases (246 homozygous and 112 heterozygous dogs out of 358 roaned dogs), whereas these haplotypes were absent in non-roaned dogs.
- FIG. 21 shows density distribution of the array signal intensity (ALRR) for the discovery panel dogs with zero, one, or two copies of the duplication-associated haplotypes (no haplotype, heterozygote, and homozygote, respectively).
- Vertical ticks indicate individual ALRR of dogs with roaning (heterozygote and homozygote) and without roaning (no haplotype).
- the imputed genotypes of four dogs were confirmed by Sanger sequencing the region (CFA38:11,143,161-11,143,326). They had either small spots (or ticking), a faint roaning pattern in muzzle areas, a limited amount of white marking (e.g., a possible “residual white”), a wolf-like sable pattern without large patches of roaning, or long fur that resulted in inaccurate phenotyping.
- white marking e.g., a possible “residual white”
- the genotype data of the SNP array revealed that about 50 % of Australian Cattle Dogs with roaned coat were autozygous between 10 and 11 Mb on CFA38 (FIG. 22C). Similarly, frequent autozygosity was found in Dalmatians but not in Border Collies in this region, indicating that the duplication-associated haplotype was likely favored by selection in Australian Cattle Dogs and Dalmatians. Moreover, cross-population extended haplotype homozygosity (XP-EHH) [refs.
- FIGs. 22A-22D show a signature of selection in the region on CFA38 associated with roaning.
- FIG. 22A Nucleotide diversity (π) for Wirehaired Pointing Griffon (orange), Border Collies (grey), and Labrador Retriever (black) in 500-kb sliding windows.
- FIG. 22B Pairwise genetic difference ($F_{ST}$) for Wirehaired Pointing Griffon (orange) and Labrador Retriever (black). Border Collies were used as a reference.
- FIG. 22C ROH in Australian Cattle Dog (orange), Dalmatians (red), and Border Collies (grey).
- FIG. 22D XP-EHH in Australian Cattle Dog (orange), Dalmatians (red), and Labrador Retrievers (black). Border Collies were used as a reference. Wirehaired Pointing Griffons and Australian Cattle Dogs are breeds where roaning is common. Blue rectangle: position of the 11-kb duplication. π and $F_{ST}$ were estimated by using whole-genome re-sequencing data, while ROH and XP-EHH were estimated by using Embark genotyping data.
- the duplication-associated haplotypes found in the discovery dataset (FIG. 19) were searched for in the WGS dataset with 722 dogs and other canid species [ref. 31]. In addition to the breeds that were used for the discovery of the duplication (FIG. 19), 16 breeds had at least one copy of the duplication-associated haplotypes.
- haplotypes were fairly common in some breeds, such as German Shepherd Dogs (5 out of 15 dogs) and Belgian Tervurens (4 out of 11 dogs); however, roaning, if any, should not be visible in these breeds because of the lack of white areas (e.g., S/S genotype at S-locus).
- the duplication-associated haplotypes were also found in breeds where roaning was occasionally observed: Portuguese Water Dogs (3 out of 11 dogs), Lagotto Romagnolos (2 out of 5 dogs), and Dachshunds (2 out of 5 dogs). Finally, village dogs in China, Papua New Guinea, and Vietnam also had the duplication-associated haplotypes (6 out of 45 dogs), indicating a potentially ancient origin of the duplication.
- Mapped sequence read coverage within the duplication was about 1.5 times and 2 times higher than the surrounding 100-kb flanking region in dogs with one or two copies of the duplication-associated haplotypes, respectively, confirming the association between the haplotypes and the duplication in these breeds.
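The 1.5-fold and 2-fold coverage ratios reported above follow from counting copies of the interval in a diploid genome: a haplotype with the tandem duplication carries the interval twice, so a heterozygote has three copies and a homozygote four, against two in a dog without the duplication. The sketch below works through that arithmetic; the function names and the nearest-expected-ratio calling rule are assumptions, not the method used in the study.

```python
from statistics import mean


def depth_ratio(depth_in_duplication, depth_in_flanks):
    """Mean read depth inside the duplicated interval divided by mean flanking depth."""
    return mean(depth_in_duplication) / mean(depth_in_flanks)


def expected_ratio(duplication_copies):
    """Expected ratio for a diploid dog carrying 0, 1, or 2 duplicated haplotypes.

    Each duplicated haplotype contributes two copies of the interval instead of one.
    """
    return (2 + duplication_copies) / 2


def call_duplication_copies(ratio):
    """Pick the haplotype count (0, 1, or 2) whose expected ratio is closest."""
    return min((0, 1, 2), key=lambda c: abs(ratio - expected_ratio(c)))
```

In practice the observed ratio is noisy, so a nearest-value rule like this one would normally be checked against an independent assay such as the breakpoint PCR.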
- the presence of the duplication in these haplotypes was confirmed by the breakpoint PCR assay, Sanger sequencing of the PCR amplicon spanning the duplication midpoint, and whole-genome re-sequencing data for the identification of discordant read pairs and abrupt read depth increase.
- the distribution of the array signal intensity in dogs with 0, 1, or 2 copies of the duplication-associated haplotypes was in agreement with the expected distribution. This mutation is nearly completely penetrant, explaining more than 99% of roaning cases in both purebred and mixed breed dogs.
- the haplotype-based linkage test can accurately detect the presence of the CFA38 duplication, which has high predictability for the roaning coat pattern.
- dog_10056, dog_10079, dog_10087, and dog_10166 had a long coat, which makes it difficult to accurately distinguish between ticked and roaned patterning, while the remaining dog had limited white spotting patterns (dog_10028).
- the small white spotting pattern is likely residual white, a pattern that was excluded from the study. Assuming that the phenotypes of these dogs were correctly assigned, there might be additional modifier loci interacting with R-locus and/or S-locus.
- duplication-associated haplotypes were found in other distantly-related breeds (e.g., German Shepherd Dogs and Portuguese Water Dogs) and village dogs (e.g., indigenous dogs that accompany humans but are not selectively bred), indicating that selection acted on a variation that existed in the ancestral canine population (e.g., “soft sweep”).
- distantly-related breeds e.g., German Shepherd Dogs and Portuguese Water Dogs
- village dogs e.g., indigenous dogs that accompany humans but are not selectively bred
- Example 4 Five genetic variants explain over 70% of hair coat pheomelanin intensity variation in purebred and mixed breed domestic dogs
- the pigment molecule pheomelanin may confer red and yellow color to hair, and the intensity of this coloration may be caused by variation in the amount of pheomelanin.
- domestic dogs may exhibit a wide range of pheomelanin intensity, ranging from the white coat of the Samoyed to the deep red coat of the Irish Setter. While several genetic variants may be associated with specific coat intensity phenotypes in certain dog breeds, they may not explain the majority of phenotypic variation across breeds. In order to gain further insight into the extent of multigenicity and epistatic interactions underlying coat pheomelanin intensity in dogs, a large dataset obtained via a direct-to-consumer canine genetic testing service was leveraged.
- the database comprised genome-wide single nucleotide polymorphism (SNP) genotype data and owner-provided photos for 3,057 pheomelanic mixed breed and purebred dogs from 62 breeds and varieties spanning the full range of canine coat pheomelanin intensity.
- SNP single nucleotide polymorphism
- GWAS genome-wide association study
- Canine coat colors and patterns may result from varied expression of two pigment molecules: eumelanin, which is black or brown, and pheomelanin which is reddish-yellow. Most canids have coats containing a mixture of hairs expressing eumelanin, pheomelanin, or both, but many domestic dogs have coats in which only pheomelanin is expressed. These “pheomelanic” coats result from mutations in and around one of two genes that regulate switching between eumelanin and pheomelanin synthesis in hair follicle melanocytes: melanocortin 1 receptor (.
- MC1R known as the E locus
- ASIP agouti signaling protein, known as the A locus
- At least four different recessive mutations in and around the MC1R gene inhibit the synthesis of eumelanin in hair follicle melanocytes, resulting in a solid “recessive red” coat containing only pheomelanin [refs. 5-7 and 17].
- a completely or mostly red coat can also result from carrying a dominant ASIP variant (A^y), which produces “sable” coats with varying amounts of black/brown hairs concentrated around the dorsal midline, and pheomelanic hairs across the rest of the body [refs. 8 and 15].
- the intensity of pheomelanic coloration may vary widely across and within breeds that are fixed for recessive red or sable coats. For example, Irish Setters have consistently deep red coats, while Soft-coated Wheaten Terriers have coats that vary from cream to tan. Additionally, many breeds with solid white or cream coats have been shown to be recessive red, including Bichon Frise, Samoyed, West Highland White Terrier, and White German Shepherd [refs. 5 and 18]. Uncovering the genetic basis of pheomelanin intensity variation in dogs may be unexpectedly challenging.
- Participating dogs were part of a veterinary customer base. Owners provided informed consent to use their dogs’ data in scientific research by agreeing to the following statement: “I want this dog’s data to contribute to medical and scientific research”. Ethical approval was not required as non-invasive methods for genotype or phenotype collection were used (buccal swabbing and photographing, respectively). Dogs were never handled directly by researchers. Owners were given the opportunity to opt out of the study at any time during data collection. The discovery and validation cohorts were selected from data collected between October 2018 and June 2020. All data were de-identified.
- Genotype and phenotype data were collected as follows. Cheek cell samples were collected by dog owners with buccal swabs, and DNA was extracted (using methods by Illumina) and genotyped at 214,634 biallelic autosomal and X chromosome markers on an Embark Veterinary custom Illumina CanineHD SNP array. Dogs that had been genotyped between October 2018 and June 2020 were filtered to those that 1) had owner consent to use of their genetic data and owner-reported data for research, 2) had at least one owner-provided photo, 3) had owner reported breed assignments, and 4) were genetically “recessive red” (e/e at the E locus [ref.
- Phenotyping was performed as follows. To develop a color scale for visual phenotyping, three shades (cream, tan, and red) were selected that encompass the range of coat pheomelanin intensity phenotypes in domestic dogs, and their hexadecimal values (#FFFEF9, #D3A467, and #93471A) were obtained. Then, the Matplotlib [ref. 33] LinearSegmentedColormap and Normalize functions were used to obtain six equally spaced hexadecimal values spanning the range of values defined by these three colors. The six-point coat color scale (FIG. 23A) includes the colors encoded by these hexadecimal values: #FFFEF9 (1), #EDDABF (2), #DCB684 (3), #C69158 (4), #AD6C39 (5), and #93471A (6).
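The construction of the six-point scale may be illustrated with plain piecewise-linear interpolation between the three anchor colors. This sketch deliberately uses only the standard library rather than the Matplotlib functions named above, so it is an approximation of the approach and not the original procedure; the function name is hypothetical. For these anchors it yields the six hexadecimal values listed in the text.

```python
def interpolate_scale(anchors, n_points):
    """Return n_points hex colors equally spaced along a piecewise-linear RGB path.

    anchors: hex strings such as "#FFFEF9", treated as equally spaced control points.
    """
    rgb = [tuple(int(h[i:i + 2], 16) for i in (1, 3, 5)) for h in anchors]
    n_seg = len(rgb) - 1
    out = []
    for k in range(n_points):
        pos = k / (n_points - 1) * n_seg      # position measured in segments
        seg = min(int(pos), n_seg - 1)        # clamp so the last point uses the last segment
        t = pos - seg                         # fraction of the way through the segment
        lo, hi = rgb[seg], rgb[seg + 1]
        chan = [round(a + t * (b - a)) for a, b in zip(lo, hi)]
        out.append("#{:02X}{:02X}{:02X}".format(*chan))
    return out


scale = interpolate_scale(["#FFFEF9", "#D3A467", "#93471A"], 6)
```

Because six points over two segments never land exactly on the middle anchor, the tan anchor (#D3A467) shapes the scale without appearing in it.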
- FIGs. 23A-23C show the six point coat pheomelanin intensity scale.
- FIG. 23A Photos of six purebred dogs that exhibit the full range of coat pheomelanin intensity in canids are shown above a continuous color scale and numbered swatches showing the color of each of the six phenotype values used in this study. From left to right, the breeds of the dogs in these photos are: West Highland White Terrier, Yellow Labrador Retriever, Soft-coated Wheaten Terrier, Golden Retriever, Nova Scotia Duck-Tolling Retriever, Irish Setter. All six dogs pictured were part of the study sample.
- FIG. 23B An example of a dog that displays “countershading”.
- FIG. 23C Histograms showing the number of dogs with each phenotype value in the discovery and validation samples.
- the pheomelanin intensity phenotype could not be confidently typed based on available photos for 215 dogs (due to poor photo quality, positioning of the dog in the photo, multiple dogs shown in the same photo, or lack of red hair on the head or shoulders due to coat patterning) and these were excluded from further analyses.
- In order to achieve a more balanced distribution of phenotypes across the GWAS sample, concordant owner-reported and genetically-determined breed assignments were used to identify an additional 192 genetically pheomelanic, purebred dogs with no owner-provided photo that belonged to breeds that are fixed for red coats (5 or 6 on the phenotype scale).
- Genome-wide association was performed as follows. To identify genomic regions associated with pheomelanin intensity variation, coat color was encoded as both a case-control (cream versus red) and quantitative trait (six-point scale), and a multivariate linear mixed model was fit to the discovery dataset using GEMMA v.0.98 [ref. 36]. To further account for confounding effects of shared ancestry among dogs of the same or closely related breeds, kinship matrices were constructed from array genotypes using the GEMMA -gk command and used as a random effect in the model for each GWAS run.
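The role of the kinship matrix may be illustrated with a centered relatedness matrix computed from allele dosages. This is one common textbook definition, written here in plain Python for small inputs; it is an assumption that it matches what the GEMMA -gk command computes in every detail, and missing genotypes are not handled.

```python
def centered_kinship(genotypes):
    """Centered relatedness matrix from a dogs x markers table of allele dosages.

    K[i][j] is the mean over markers of (x_i - marker mean) * (x_j - marker mean).
    """
    n_dogs = len(genotypes)
    n_markers = len(genotypes[0])
    means = [sum(row[m] for row in genotypes) / n_dogs for m in range(n_markers)]
    centered = [[row[m] - means[m] for m in range(n_markers)] for row in genotypes]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / n_markers
             for j in range(n_dogs)] for i in range(n_dogs)]
```

Dogs from the same breed tend to share alleles and so receive larger off-diagonal values, which is what lets the mixed model absorb breed structure as a random effect.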
- the mean depth of sequencing coverage across all autosomes was calculated using the Genome Analysis Toolkit 3 [ref. 41] DepthOfCoverage tool, and depth of coverage values in regions of interest were divided by the mean autosomal depth of coverage to obtain normalized depth of coverage values.
- To determine which allele at each top GWAS marker was most likely the ancestral allele, genotypes were obtained at these markers across 54 publicly available wild canid whole genome sequencing datasets downloaded from the Sequence Read Archive [ref. 38] (48 Gray Wolves, 3 Coyotes, 1 Dhole, and 1 Golden Jackal). The accession information for these 54 datasets and their genotypes at the top GWAS markers are available in (). The allele frequencies at the top GWAS markers in these populations are shown in FIG. 25A.
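The depth normalization described above amounts to a single division per position. The sketch below shows it in plain Python; `normalized_depth` and the example depths are hypothetical, and the per-position depths would in practice come from a coverage tool's output rather than a hard-coded list.

```python
def normalized_depth(region_depths, autosomal_mean_depth):
    """Divide each depth value in a region by the mean autosomal depth."""
    if autosomal_mean_depth <= 0:
        raise ValueError("mean autosomal depth must be positive")
    return [d / autosomal_mean_depth for d in region_depths]


# Hypothetical values: a 30x genome with elevated coverage in the region.
region = [30.0, 45.0, 60.0]
norm = normalized_depth(region, 30.0)
mean_norm = sum(norm) / len(norm)
```

A normalized value near 1 is what a region at the genome-wide average depth would show, and values well above 1 are what extra copies of a region would be expected to produce.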
- Predictive models for coat pheomelanin intensity were constructed as follows. Using the linear model module in the Python scikit-learn package version 0.21.3 [ref. 43], a multivariate linear regression classifier model was trained on the training set of discovery cohort dogs with coat color phenotypes as the dependent variable. In these models, the independent variables were genotype dosage values (coded additively, or with one allele completely dominant to the other) at the five top GWAS markers, as well as terms representing their pairwise interactions (e.g., the product of the dosage values at the two individual loci). The coefficients, standard error, and t-test values for each independent variable, as well as the y-intercept, adjusted R-squared, and log likelihood values for the best fit model, are provided in Table 17.
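The independent variables described above can be sketched as a feature-encoding step. The block below is a minimal illustration in plain Python, not the authors' code: the function names, the `mode` labels, the toy genotype, and the choice of modes in the example are all assumptions, and the model fitting itself is left out.

```python
from itertools import combinations

MARKERS = ["CFA2", "CFA15", "CFA18", "CFA20", "CFA21"]


def encode(dosage, mode="additive"):
    """Map a red-allele dosage (0, 1, 2) to a single model input."""
    if mode == "additive":
        return float(dosage)
    if mode == "dominant":  # one or two red alleles are treated alike
        return 1.0 if dosage >= 1 else 0.0
    if mode == "recessive":  # only two red alleles count
        return 1.0 if dosage == 2 else 0.0
    raise ValueError(f"unknown mode: {mode}")


def design_row(dosages, modes):
    """Build main-effect terms plus all pairwise product terms for one dog."""
    main = {m: encode(dosages[m], modes.get(m, "additive")) for m in MARKERS}
    row = dict(main)
    for a, b in combinations(MARKERS, 2):
        row[f"{a}x{b}"] = main[a] * main[b]  # interaction as a product
    return row


# Hypothetical dog and hypothetical per-locus encodings.
dog = {"CFA2": 1, "CFA15": 2, "CFA18": 1, "CFA20": 0, "CFA21": 2}
row = design_row(dog, {"CFA15": "recessive", "CFA18": "dominant"})
```

Five main effects and ten pairwise products give fifteen candidate terms per dog; a regression library would then be fit on rows like this one.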
- Results from the GWAS identified five loci associated with coat pheomelanin intensity variation.
- GWAS treating coat pheomelanin intensity phenotypes as a quantitative trait in the discovery dataset identified five significantly associated genomic regions on CFA2, 15, 18, 20, and 21.
- a total of 88 SNPs passed the Bonferroni correction threshold of 2.73 x 10^-7 (6.56 on the -log10 scale) (supp data).
- CFA2: 74,746,906 base pairs (bp) (BICF2P1302896), CFA15: 29,840,789 bp (BICF2G630433130), CFA18: 12,910,382 bp (chr18_12910382), CFA20: 55,850,145 bp (BICF2P828524), and CFA21: 10,864,834 bp (BICF2G630655755) (FIGs. 24A-24B, Table 15).
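The relationship between the Bonferroni threshold and its -log10 value quoted above can be checked with a few lines of Python. The family-wise alpha of 0.05 used to back out the approximate number of tests is an assumption, since the source states only the threshold itself.

```python
import math

THRESHOLD = 2.73e-7  # Bonferroni threshold quoted in the text

# Position of the significance line on a Manhattan plot.
neg_log10_threshold = -math.log10(THRESHOLD)

# Assumption: alpha = 0.05, so the implied number of tests is alpha / threshold.
ASSUMED_ALPHA = 0.05
implied_tests = ASSUMED_ALPHA / THRESHOLD


def is_genome_wide_significant(p_value):
    """True when an unadjusted p-value passes the quoted threshold."""
    return p_value < THRESHOLD
```

Rounded to two decimals, `neg_log10_threshold` reproduces the 6.56 given in the text.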
- FIGs. 24A-24B show quantitative coat pheomelanin intensity GWAS results.
- FIG. 24A GWAS p-values are shown in a Manhattan plot for the autosomes (chromosome 1-38) and the X chromosome (chromosome 39). For each chromosome with one or more genome-wide significant markers, the top marker on the chromosome is highlighted in gold and labeled with its marker ID. The blue dashed line shows the minimum unadjusted -log10(p-value) for genome-wide significance using the Bonferroni correction: 6.56.
- FIG. 24B Bar plots show the number of dogs with each phenotype value (1-6) for each genotype class at each of the top five GWAS markers. The genotype classes are coded according to the dosage of the red-associated alleles at each marker, which are listed in Table 15 as “Allele 1”.
- Table 15 Top GWAS markers at five associated loci
- Table 15 Marker IDs, physical position in the canFam3.1 reference genome, gene symbol (if applicable), the red-associated allele and its frequency (Red Allele, Freq.), effect size (Beta) and standard error (se) of the effect size, uncorrected -log10(Wald’s p-value), and proportion of variance explained (PVE) for the most significant marker at each of the five associated loci.
- FIGs. 25A-25B show species and breed allele frequencies at top GWAS markers.
- FIG. 25A shows the frequencies of the red-associated allele at the top five GWAS markers in 53 public wild canid genomes [ref. 34].
- FIG. 25B shows the same information across 31 breeds with at least 8 individuals in the GWAS sample.
- Each row shows the breed/species phenotype value range and (for phenotyped dogs, e.g., the dogs in the GWAS sample) the mean phenotype value for each breed, with the mean phenotype value colored by the corresponding coat color.
- Mean phenotype and allele frequency values are colored white or black to improve readability.
- the red-associated allele was present in most of the domestic dog breeds examined, but it was only fixed in breeds with consistently high coat pheomelanin intensity such as Brittany, Redbone Coonhound, and Irish Setter (FIG. 25B).
- the cream-associated allele was fixed in several breeds that are fixed for completely cream coats, including American Eskimo Dog, Samoyed, West Highland White Terrier, and White Shepherd (FIG. 25B).
- the top CFA18 variant, chr18_12910382, is a missense mutation p.I487M in a conserved residue of the twelfth exon of the solute carrier family 26 member 4 gene (SLC26A4).
- the top CFA15 variant, BICF2G630433130, is located approximately 8 kilobases (kb) downstream of a 6 kb copy number variant (CNV) near the KIT ligand gene (KITLG) that was previously associated with variation in coat pheomelanin intensity in Nova Scotia Duck Tolling Retrievers and Poodles [ref. 31], as well as squamous cell carcinoma of the digit in eumelanistic, but not recessive red, Standard Poodles [ref. 44].
- the red-associated allele at this marker was present at an intermediate frequency (23%) across 48 Gray Wolves, but not in Coyote, Dhole, or Golden Jackal (FIG.
- the top CFA20 variant is the same variant reported in another coat pheomelanin intensity GWAS using over 90 different breeds, which was used to fine map the peak to a nearby missense mutation in the major facilitator superfamily domain containing 12 gene (MFSD12) at CFA20: 55,856,000 bp [ref. 18]. It was observed that the red-associated allele at BICF2P828524 was segregating at an intermediate frequency in Gray Wolves and carried by the single Dhole for which that data was available, but absent in 3 Coyote genomes, making it difficult to infer which allele is ancestral. Consistent with Hedan et al. [ref.
- FIGs. 26A-26B show dominance and epistatic interactions.
- FIG. 26A For each of the top five GWAS markers, violin plots show the distribution of observed normalized six point phenotype values for each genotype class. The black lines connect the observed means of the three genotype classes, and the blue lines connect the expected means under a perfectly additive model. The estimated dominance coefficient for each marker, d, is shown in the upper left hand corner of each plot. An asterisk indicates that the predicted heterozygote class mean phenotype fell outside the 95% confidence interval of the observed heterozygote mean phenotype, which indicates that d is statistically significant.
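The formula for the dominance coefficient d is not given in this section. One common convention, assumed here, scales the heterozygote's deviation from the homozygote midpoint by half the distance between the two homozygote means, so that 0 means perfectly additive, +1 means the red-associated allele is fully dominant, and -1 means it is fully recessive. The function name and the example means are hypothetical.

```python
def dominance_coefficient(mean_0, mean_1, mean_2):
    """Scaled dominance from genotype-class mean phenotypes.

    mean_0, mean_1, mean_2: mean phenotype for dogs carrying zero, one,
    and two copies of the red-associated allele. Assumed convention only.
    """
    additive_expectation = (mean_0 + mean_2) / 2.0  # point on the additive line
    half_range = (mean_2 - mean_0) / 2.0
    if half_range == 0:
        raise ValueError("homozygote means are equal; d is undefined")
    return (mean_1 - additive_expectation) / half_range


d_additive = dominance_coefficient(1.0, 2.0, 3.0)
d_dominant = dominance_coefficient(1.0, 3.0, 3.0)
d_recessive = dominance_coefficient(1.0, 1.0, 3.0)
```

Under this convention, the gap between the black and blue lines at the heterozygote class in the violin plots is the numerator of d.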
- FIG. 26B Scatter plots showing genotype-phenotype interactions at the seven locus pairs that showed statistically significant interaction effects per the epistasis test.
- the “dosage”, e.g., the diploid genotype coded as the number of red-associated alleles, is displayed on the X axis, and the dosage at the other marker is represented by the three lines connecting the points.
- the Y axis shows the mean 6 point coat pheomelanin intensity phenotype across dogs with each genotype combination.
- Table 16 Pairwise tests for epistatic interaction among top GWAS markers
- Table 16 Interaction term coefficients (β3), test statistic, and p-value for each pair of the top five GWAS variants. Interactions with a p-value < 5 x 10^-2 (marked with an asterisk) were considered statistically significant.
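One common way to set up such a pairwise test, given here as an assumption because the exact model specification is not reproduced in this section, is a regression with a product term. With y the phenotype and x_A, x_B the red-associated allele dosages at the two markers, the coefficient on the product term plays the role of the interaction term coefficient reported in Table 16:

```latex
y = \beta_0 + \beta_1 x_A + \beta_2 x_B + \beta_3 \, x_A x_B + \varepsilon
```

Under this form, a positive third coefficient would mean the two red-associated alleles reinforce each other beyond their separate effects, and a negative one would mean their combined effect is smaller than the sum.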
- Two locus genotype and phenotype combinations for these variant pairs are shown in FIG. 26B.
- the top CFA2 variant exhibits weak negative epistasis with the red-associated alleles at CFA15, 18, and 21 (shown in (i)).
- Two copies of the cream-associated allele at the top CFA20 variant almost entirely mask the effect of the red-associated allele at the top CFA15 variant, and the top CFA15 variant exhibits negative epistasis with the top CFA21 variant (shown in (ii)).
- the top CFA18 variant exhibits positive epistasis with the top CFA20 variant and negative epistasis with the top CFA21 variant (shown in (iii)).
- a multi-locus linear classifier model was constructed and trained to determine coat pheomelanin intensity with high accuracy.
- a common approach for accurately predicting multigenic trait phenotypes such as body weight is to fit a statistical model with phenotype as a function of genotypes at multiple genetic markers.
- a model fit on a sufficiently large and representative training sample can be used to accurately predict phenotypes for new individuals given their genotypes without knowing the true underlying genetic architecture of the trait.
- the phenotypic predictions produced by these models can then be used to learn more about the genetic architecture of the trait.
- a series of multiple linear regression classifier models were trained using genotype values at the top CFA2, 15, 18, 20, and 21 GWAS markers as independent variables.
- a machine learning classifier model was trained on normalized six point phenotype values that split the genotypes at all five loci into two variables each indicating whether or not they were heterozygous (“1”), and whether or not they were homozygous for the red-associated allele (“2”).
- the ratios of the model coefficients (β) for the 1 and 2 variables at each locus provided an additional evaluation of the dominance relationship between the two alleles: loci for which the 1 β was approximately half of the 2 β fit the assumption of additivity, whereas loci for which the 1 β was approximately zero were more consistent with the red-associated allele being recessive to the other allele, and loci for which the 1 and 2 βs were similar were more consistent with the red-associated allele being dominant to the other allele.
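The two-indicator encoding and the coefficient-ratio reading described above can be sketched as follows. The function names and the tolerance `tol` are hypothetical; the source gives only the qualitative rule (about half, about zero, or about equal), not a numeric cutoff.

```python
def indicator_encode(dosage):
    """Split a red-allele dosage into (heterozygous, homozygous-red) flags."""
    return (1 if dosage == 1 else 0, 1 if dosage == 2 else 0)


def classify_dominance(beta_het, beta_hom, tol=0.15):
    """Interpret the ratio of heterozygote to homozygote coefficients.

    tol is an arbitrary illustrative tolerance, not a value from the source.
    """
    if beta_hom == 0:
        raise ValueError("homozygote coefficient is zero; ratio undefined")
    ratio = beta_het / beta_hom
    if abs(ratio - 0.5) <= tol:
        return "additive"
    if abs(ratio) <= tol:
        return "red allele recessive"
    if abs(ratio - 1.0) <= tol:
        return "red allele dominant"
    return "intermediate"


flags = [indicator_encode(d) for d in (0, 1, 2)]
calls = [
    classify_dominance(0.50, 1.0),
    classify_dominance(0.02, 1.0),
    classify_dominance(0.95, 1.0),
]
```

Because the two flags are fit as separate terms, the model does not force the heterozygote to sit halfway between the homozygotes, which is what makes the ratio informative.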
- Table 17 Evaluating additivity at top GWAS markers using linear model coefficients for heterozygotes versus red-associated allele homozygotes
- Table 17 Coefficients, coefficient standard error, t score values, and t test p-values for the y-intercept and each of the independent variables in a predictive model that encodes each dog’s genotype at each of the five top GWAS markers according to whether or not it was heterozygous (“1”), and whether or not it was homozygous for the red-associated allele (“2”).
- PREs represent the fraction of the total sum of squares error that is accounted for by each independent variable.
- Table 18 Best fit linear regression model equations, adjusted R-squared, and log likelihood scores are shown for each of the individual top GWAS SNPs using the dominance encoding most supported by the data in Table 17.
- the “CFA15 2” term encodes CFA15 genotype assuming that the red-associated allele is completely recessive, e.g., 1 if homozygous for the red-associated allele, and 0 if either of the other two genotype classes.
- the “CFA18_red_dom” and “CFA21_red_dom” terms encode CFA18 and CFA21 genotypes assuming that the red-associated allele is completely dominant, e.g., 1 if heterozygous or homozygous for the red-associated allele, and 0 if homozygous for the other allele.
- Table 19 Coefficients, coefficient standard error, t score values, t test p-values, and
- Section A shows the base model that assumes perfect additivity at each locus and no interactions between loci.
- Section B. shows the best fit model incorporating dominance at all five loci.
- Section C. shows a model consisting of only the two previously reported loci (CFA15 and CFA20) using their best dominance encoding, and their pairwise interaction (CFA15 2 x CFA20).
- Section D. shows the best fit model incorporating both the dominance terms in model B. and two pairwise epistasis terms: CFA15 2 x CFA20 and CFA18_red_dom x CFA20.
- Section E. shows a reduced version of model D. that only includes terms that explained > 0.1% of variance (PRE > 1 x 10^-3) in model D. and shows similar performance.
- FIGs. 27A-27B show performance of the best fit multivariate linear regression classifier model for pheomelanin intensity phenotypes in validation cohort.
- FIG. 27A Strip plot of observed versus predicted phenotypes for all dogs in the validation dataset using the predictive model shown in Table 17. The adjusted R-squared value is shown in the top right hand corner. Each point represents a single dog, colored according to its observed six point phenotype.
- FIG. 27B Performance of the multivariate linear regression model within and across breeds. For each row, observed and predicted phenotype averages are shown ± their standard deviation.
- each row shows the fraction of dogs with a predicted phenotype value within one point of their observed phenotype (on the six point phenotype scale).
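The per-row accuracy measure described above, the fraction of dogs predicted within one point of their observed score, is simple to compute. The sketch below uses a hypothetical function name and made-up scores.

```python
def fraction_within_one(observed, predicted):
    """Share of dogs whose prediction is within one point of the observed score."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= 1.0)
    return hits / len(observed)


# Made-up scores on the six point scale; the second dog misses by 1.5 points.
observed_scores = [1, 3, 6, 4]
predicted_scores = [1.4, 4.5, 5.2, 4.0]
accuracy = fraction_within_one(observed_scores, predicted_scores)
```

Applying the same function to each breed's dogs separately would give one value per row of a table like the one in FIG. 27B.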
- the model’s performance was generally high in breeds that are fixed for a narrow range of coat pheomelanin intensity (e.g., Samoyeds and Irish Setters) and lower in breeds with a wide range of coat colors (e.g., Chihuahuas and Poodles).
- Some notable exceptions to this pattern were Bichon Frise, which are fixed for cream or white coats but poorly predicted by this model, and Golden Retrievers and Yellow Labrador Retrievers, which display nearly the full range of coat pheomelanin intensity variation and for which the model is highly predictive.
- the top CFA2 variant falls within a long intergenic non-coding RNA (lincRNA) with unknown functional significance in domestic dog.
- Many mammalian (including dog) lincRNAs are known to modulate the expression of nearby protein-coding genes via cis-regulatory mechanisms [refs. 49-52].
- the closest annotated canine protein-coding gene is RUNX family transcription factor 3 ( RUNX3 ), located approximately 82 kb downstream of ENSCAFG00000042716 at CFA2: 74,829,960-74,856,947.
- RUNX3 encodes a transcription factor that shows reduced expression in hair follicles in human premature hair greying, and appears to regulate expression of several other genes that also show reduced expression in premature greying samples [ref. 53].
- RUNX3 is also known to be a regulator of hair shape determination during murine embryonic development [ref. 54]. Therefore, the CFA2 locus identified in the GWAS may be tagging a cis-regulatory module comprising ENSCAFG00000042716, RUNX3, and possibly other unknown genic variants or functional genomic elements. Identifying the causal mutations underlying this association may be performed by fine mapping of the locus, as well as molecular experiments to directly assess the functional impacts of any candidate mutations.
- the top CFA21 variant is an intronic substitution in the TYR gene.
- This gene encodes the enzyme tyrosinase, which catalyzes the oxidation of L-dihydroxy-phenylalanine (DOPA) to DOPA quinone, a precursor of both eumelanin and pheomelanin.
- the MFSD12 cream-associated variant masks the effect of the KITLG red-associated variant by causing abnormal degradation of melanosomes downstream of pro-melanogenic signalling by KITLG.
- a multigenic predictive model using genotypes at the most strongly-associated single-nucleotide genetic markers on CFA2, 15, 18, 20, and 21, plus two interaction terms was able to explain over 70% of the phenotypic variation across both the GWAS cohort and an independent validation cohort containing individuals from over 60 breeds as well as mixed breed dogs. This represents a gain of approximately 20% variance explained compared to a model using only the two previously discovered loci (Table 19, Section C). Because coat pheomelanin intensity appears to be a truly continuous phenotype across dogs, it is likely that the remaining variation is controlled by multiple additional loci.
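The "variance explained" figures above are R-squared values, and the tables report the adjusted form. The sketch below shows both calculations with their standard formulas in plain Python; the function names and the example numbers are hypothetical and are not results from the study.

```python
def r_squared(observed, predicted):
    """Fraction of phenotypic variance explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of fitted terms."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: 101 dogs, 5 predictors, raw R-squared of 0.75.
example_adjusted = adjusted_r_squared(0.75, 101, 5)
perfect = r_squared([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6])
no_better_than_mean = r_squared([1, 2, 3], [2, 2, 2])
```

The adjustment matters when comparing models with different numbers of terms, such as a two-locus model against a five-locus model with interactions.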
- FIG. 28 shows phenotyping validation on 350 randomly selected dogs.
- a strip plot shows original versus re-scored 6 point phenotypes for a random sample of 350 dogs from the discovery sample.
- the correlation coefficient (Pearson’s Rho) between the original and new phenotype scores is shown in the upper left hand corner of the plot.
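The agreement between original and re-scored phenotypes is summarized by a Pearson correlation. A minimal standard-library sketch is shown below; the function name and the toy scores are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy scores: the second scorer agrees closely with the first.
original = [1, 2, 3, 4, 5, 6]
rescored = [1, 2, 4, 4, 5, 6]
agreement = pearson_r(original, rescored)
```

A value close to 1 would indicate that the six point scale is being applied consistently between scoring passes.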
- FIGs. 29A-29C show Manhattan plots for additional GWAS, including 6-point phenotype, no covariates (FIG. 29A); binary phenotype, with covariates (FIG. 29B); and binary phenotype, no covariates (FIG. 29C).
- FIGs. 30A-30E show detailed views of regions surrounding top GWAS SNPs (e.g., on CFA2, CFA15, CFA18, CFA20, and CFA21), including CFA2 Association Region (74,465,672-75,100,435) (FIG. 30A); CFA15 Association Region (29,575,066-29,973,539) (FIG. 30B); CFA18 Association Region (12,410,382-13,410,382) (FIG. 30C); CFA20 Association Region (55,783,410-55,960,115) (FIG. 30D); and CFA21 Association Region (10,698,290-11,165,504) (FIG. 30E).
- Each panel shows the genomic region defined by the positions of the first upstream marker and last downstream marker with r2 > 0.2 with the most significant GWAS marker on the chromosome (indicated by a red “x”).
- the top panel of each figure shows the GWAS -log10(p-value) and physical position of all GWAS markers in the region, colored by their r2 with the top GWAS marker.
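The region definition used for these panels (from the first to the last marker with r2 > 0.2 to the top marker) can be sketched as below. Here r2 is computed as the squared Pearson correlation between dosage vectors, which is an assumption about how linkage disequilibrium was measured; the function names, marker IDs, positions, and genotypes are all hypothetical.

```python
def r2(x, y):
    """Squared Pearson correlation between two allele-dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var_x * var_y)


def association_region(markers, top_id, min_r2=0.2):
    """Return (start, end) spanning all markers with r2 > min_r2 to the top marker.

    markers: list of (marker_id, position, dosages) on one chromosome.
    """
    top = next(m for m in markers if m[0] == top_id)
    linked = [
        pos
        for marker_id, pos, dosages in markers
        if marker_id == top_id or r2(dosages, top[2]) > min_r2
    ]
    return min(linked), max(linked)


toy_markers = [
    ("m_a", 100, [0, 1, 1, 2, 1, 1]),  # uncorrelated with the top marker
    ("m_b", 200, [0, 1, 2, 0, 1, 1]),  # strongly correlated
    ("top", 300, [0, 1, 2, 0, 1, 2]),
    ("m_c", 400, [0, 1, 2, 0, 1, 2]),  # identical to the top marker
    ("m_d", 500, [1, 1, 1, 1, 1, 1]),  # monomorphic
]
region = association_region(toy_markers, "top")
```

In the toy data the region runs from position 200 to 400, because the markers at 100 and 500 fall below the r2 cutoff.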
- FIG. 31A shows that CFA15 top marker genotype correlates with sequencing coverage in known CNV. Boxplots overlaid with strip plots show the distribution of mean normalized depth of coverage across the CFA15 CNV characterized in [ref. 31] (CFA15: 29,821,450-29,832,950 bp) for dogs with each possible BICF2G630433130 genotype. Each point represents a single dog. Kruskal-Wallis test p-values are shown for each pair of genotypes.
- FIG. 31B shows SRA run ID and sample name, breed, BICF2G630433130 genotype (coded as number of red-associated alleles), and CFA15 CNV mean normalized depth of coverage for all dogs shown in FIG. 31A.
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CA3178467A CA3178467A1 (en) | 2020-04-02 | 2021-04-01 | Methods and systems for determining pigmentation phenotypes |
| EP21781361.7A EP4127224A4 (en) | 2020-04-02 | 2021-04-01 | METHODS AND SYSTEMS FOR DETERMINING PIGMENTATION PHENOTYPES |
| GB2215887.7A GB2612196A (en) | 2020-04-02 | 2021-04-01 | Methods and systems for determining pigmentation phenotypes |
| US17/956,446 US20230106107A1 (en) | 2020-04-02 | 2022-09-29 | Methods and systems for determining pigmentation phenotypes |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202063004204P | 2020-04-02 | 2020-04-02 | |
| US63/004,204 | 2020-04-02 |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/956,446 Continuation US20230106107A1 (en) | 2020-04-02 | 2022-09-29 | Methods and systems for determining pigmentation phenotypes |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021202910A1 true WO2021202910A1 (en) | 2021-10-07 |
Family
ID=77929982
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2021/025433 Ceased WO2021202910A1 (en) | 2020-04-02 | 2021-04-01 | Methods and systems for determining pigmentation phenotypes |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US20230106107A1 (en) |
| EP (1) | EP4127224A4 (en) |
| CA (1) | CA3178467A1 (en) |
| GB (1) | GB2612196A (en) |
| WO (1) | WO2021202910A1 (en) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113862380A (en) * | 2021-11-22 | 2021-12-31 | 广东海洋大学 | pH-related molecular markers of yak Wnt3a gene after slaughter and its application |
| CN114743601A (en) * | 2022-04-18 | 2022-07-12 | 中国农业科学院农业基因组研究所 | Breeding method, device and equipment based on multigroup data and deep learning |
| CN116246701A (en) * | 2023-02-13 | 2023-06-09 | 广州金域医学检验中心有限公司 | Data analysis device, medium and equipment based on phenotype term and variant gene |
| CN116863998A (en) * | 2023-06-21 | 2023-10-10 | 扬州大学 | Genetic algorithm-based whole genome prediction method and application thereof |
| US12322515B2 (en) | 2023-07-14 | 2025-06-03 | Onikoroshi, LLC | Personalized wellness systems and methods of use |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20060147962A1 (en) * | 2003-06-16 | 2006-07-06 | Mars, Inc. | Genotype test |
| US20070020651A1 (en) * | 2001-05-25 | 2007-01-25 | Dnaprint Genomics, Inc. | Compositions and methods for the inference of pigmentation traits |
| US20130246033A1 (en) * | 2012-03-14 | 2013-09-19 | Microsoft Corporation | Predicting phenotypes of a living being in real-time |
| US20150356243A1 (en) * | 2013-01-11 | 2015-12-10 | Oslo Universitetssykehus Hf | Systems and methods for identifying polymorphisms |
| US20160342693A1 (en) * | 2015-05-21 | 2016-11-24 | BarkHappy Inc. | Automated compatibility matching system for dogs and dog owners |
| US20170037482A1 (en) * | 2014-08-04 | 2017-02-09 | Lafayette Christa | Method for Evaluating Health and Genetic Predisposition of Animals |
| US20200175611A1 (en) * | 2018-11-30 | 2020-06-04 | TailTrax LLC | Multi-channel data aggregation system and method for communicating animal breed, medical and profile information among remote user networks |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2543785A1 (en) * | 2003-10-24 | 2005-05-06 | Mmi Genomics, Inc. | Compositions, methods, and systems for inferring canine breeds for genetic traits and verifying parentage of canine animals |
| WO2009134226A1 (en) * | 2008-05-01 | 2009-11-05 | The Board Of Trustees Of The Leland Stanford Junior University | Canine coat color prediction |
-
2021
- 2021-04-01 EP EP21781361.7A patent/EP4127224A4/en not_active Withdrawn
- 2021-04-01 GB GB2215887.7A patent/GB2612196A/en not_active Withdrawn
- 2021-04-01 CA CA3178467A patent/CA3178467A1/en active Pending
- 2021-04-01 WO PCT/US2021/025433 patent/WO2021202910A1/en not_active Ceased
-
2022
- 2022-09-29 US US17/956,446 patent/US20230106107A1/en active Pending
Non-Patent Citations (11)
| Title |
|---|
| BANERJEE ET AL.: "Banerjeeid Saikat, Zeng Lingyao, Schunkertid Heribert, Sö Dingid Johannes", PLOS GENETICS, vol. 14, no. 12, 31 December 2018 (2018-12-31), pages 1 - 27, XP055925164 * |
| BANNASCH DANIKA, SAFRA NOA, YOUNG AMY, KARMI NILI, SCHAIBLE R. S., LING G. V.: "Mutations in the SLC2A9 Gene Cause Hyperuricosuria and Hyperuricemia in the Dog", PLOS GENETICS, vol. 4, no. 11, 7 November 2008 (2008-11-07), pages 1 - 8, XP055925155 * |
| GANBOLD ONOLRAGCHAA, MANJULA PRABUDDHA, LEE SEUNG-HWAN, PAEK WOON KEE, SEO DONGWON, MUNKHBAYAR MUNKHBAATAR, LEE JUN HEON: "Sequence characterization and polymorphism of melanocortin 1 receptor gene in some goat breeds with different coat color of Mongolia", ASIAN- AUSTRALIAN JOURNAL OF ANIMAL SCIENCES, vol. 32, no. 7, 7 February 2019 (2019-02-07), pages 939 - 948, XP055925165 * |
| HEDAN ET AL.: "Identification of a Missense Variant in MFSD12 Involved in Dilution of Phaeomelanin Leading to White or Cream Coat Color in Dogs", GENES, vol. 10, no. 5, 21 May 2019 (2019-05-21), pages 1 - 9, XP055925139 * |
| KAWAKAMI TAKESHI, JENSEN MEGHAN K., SLAVNEY ANDREA, DEANE PETRA E., MILANO AUSRA, RAGHAVAN VANDANA, FORD BRETT, CHU ERIN T., SAMS : "R-locus for roaned coat is associated with a tandem duplication in an intronic region of USH2A in dogs and also contributes to Dalmatian spotting", PLOS ONE, vol. 16, no. 3, 23 March 2021 (2021-03-23), pages 1 - 24, XP055925168 * |
| See also references of EP4127224A4 * |
| SLAVNEY ANDREA J., KAWAKAMI TAKESHI, JENSEN MEGHAN K., NELSON THOMAS C., SAMS AARON J., BOYKO ADAM R.: "Five genetic variants explain over 70% of hair coat pheomelanin intensity variation in purebred and mixed breed domestic dogs", PLOS ONE, vol. 16, no. 5, 27 May 2021 (2021-05-27), pages 1 - 23, XP055925186 * |
| SOMMERLAD SUSAN F, MORTON JOHN M, HAILE-MARIAM MEKONNEN, JOHNSTONE ISOBEL, SEDDON JENNIFER M, O'LEARY CAROLINE A: "Prevalence of congenital hereditary sensorineural deafness in Australian Cattle Dogs and associations with coat characteristics and sex", BMC VETERINARY RESEARCH, vol. 8, 29 October 2012 (2012-10-29), pages 1 - 16, XP021122392 * |
| TRACY CHEW , CALI E. WILLET , BIANCA HAASE AND CLAIRE M. WADE: "Genomic Characterization of External Morphology Traits in Kelpies Does Not Support Common Ancestry with the Australian Dingo", GENES, vol. 10, no. 337, 3 May 2019 (2019-05-03), pages 1 - 12, XP055925162 * |
| WEICH KALIE, AFFOLTER VERENA, YORK DANIEL, REBHUN ROBERT, GRAHN ROBERT, KALLENBERG ANGELICA, BANNASCH DANIKA: "Pigment Intensity in Dogs is Associated with a Copy Number Variant Upstream of KITLG", GENES, vol. 11, no. 1, 9 January 2020 (2020-01-09), pages 1 - 13, XP055925184 * |
| YANG ET AL.: "The origin of chow chows in the light of the East Asian breeds", BMC GENOMICS, vol. 18, 16 February 2017 (2017-02-16), pages 1 - 13, XP021240794, DOI: 10.1186/s12864-017-3525-9 * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113862380A (en) * | 2021-11-22 | 2021-12-31 | 广东海洋大学 | pH-related molecular markers of yak Wnt3a gene after slaughter and its application |
| CN114743601A (en) * | 2022-04-18 | 2022-07-12 | 中国农业科学院农业基因组研究所 | Breeding method, device and equipment based on multigroup data and deep learning |
| CN114743601B (en) * | 2022-04-18 | 2023-02-03 | 中国农业科学院农业基因组研究所 | Breeding method, device and equipment based on multigroup data and deep learning |
| CN116246701A (en) * | 2023-02-13 | 2023-06-09 | 广州金域医学检验中心有限公司 | Data analysis device, medium and equipment based on phenotype term and variant gene |
| CN116246701B (en) * | 2023-02-13 | 2024-03-22 | 广州金域医学检验中心有限公司 | Data analysis devices, media and equipment based on phenotypic terms and variant genes |
| CN116863998A (en) * | 2023-06-21 | 2023-10-10 | 扬州大学 | Genetic algorithm-based whole genome prediction method and application thereof |
| CN116863998B (en) * | 2023-06-21 | 2024-04-05 | 扬州大学 | Genetic algorithm-based whole genome prediction method and application thereof |
| US12322515B2 (en) | 2023-07-14 | 2025-06-03 | Onikoroshi, LLC | Personalized wellness systems and methods of use |
| US12488901B2 (en) | 2023-07-14 | 2025-12-02 | Onikoroshi, LLC | Personalized wellness systems and methods of use |
Also Published As
| Publication number | Publication date |
|---|---|
| GB2612196A (en) | 2023-04-26 |
| CA3178467A1 (en) | 2021-10-07 |
| GB202215887D0 (en) | 2022-12-14 |
| EP4127224A4 (en) | 2024-07-24 |
| US20230106107A1 (en) | 2023-04-06 |
| EP4127224A1 (en) | 2023-02-08 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20230106107A1 (en) | Methods and systems for determining pigmentation phenotypes | |
| Plassais et al. | Whole genome sequencing of canids reveals genomic regions under selection and variants influencing morphology | |
| US11788142B2 (en) | Compositions and methods for discovery of causative mutations in genetic disorders | |
| Pausch et al. | Homozygous haplotype deficiency reveals deleterious mutations compromising reproductive and rearing success in cattle | |
| Brooks et al. | Whole-genome SNP association in the horse: identification of a deletion in myosin Va responsible for Lavender Foal Syndrome | |
| CN105603062B (en) | Methods of Assessing Inherited Conditions | |
| US10522240B2 (en) | Evaluating genetic disorders | |
| Lee et al. | Deciphering the genetic blueprint behind Holstein milk proteins and production | |
| Johnson et al. | Genotyping-by-sequencing (GBS) detects genetic structure and confirms behavioral QTL in tame and aggressive foxes (Vulpes vulpes) | |
| Wang et al. | Multiple ancestral haplotypes harboring regulatory mutations cumulatively contribute to a QTL affecting chicken growth traits | |
| Holl et al. | Variant in the RFWD 3 gene associated with PATN 1, a modifier of leopard complex spotting | |
| JP2023501006A5 (en) | ||
| Slavney et al. | Five genetic variants explain over 70% of hair coat pheomelanin intensity variation in purebred and mixed breed domestic dogs | |
| Cai et al. | SNP markers associated with body size and pelt length in American mink (Neovison vison) | |
| Kaelin et al. | Ancestry dynamics and trait selection in a designer cat breed | |
| Wolfsberger et al. | Genetic diversity and selection in Puerto Rican horses | |
| Xu et al. | Genome-wide association studies and haplotype-sharing analysis targeting the egg production traits in Shaoxing duck | |
| Bubac et al. | Genetic association with boldness and maternal performance in a free-ranging population of grey seals (Halichoerus grypus) | |
| Falchi et al. | Effect of genotyping density on the detection of runs of homozygosity and heterozygosity in cattle | |
| Kawakami et al. | R-locus for roaned coat is associated with a tandem duplication in an intronic region of USH2A in dogs and also contributes to Dalmatian spotting | |
| Palinkas-Bodzsar et al. | Gene conservation of six Hungarian local chicken breeds maintained in small populations over time | |
| Lien et al. | Identification of QTL and loci for egg production traits to tropical climate conditions in chickens | |
| Freitas et al. | Identification of eQTLs using different sets of single nucleotide polymorphisms associated with carcass and body composition traits in pigs | |
| Nxumalo et al. | A review on omics approaches, towards understanding environmental resilience of indigenous Nguni sheep: Implications for their conservation and breeding programs in South Africa | |
| Ukawa et al. | Widespread genetic testing control inherited polycystic kidney disease in cats |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21781361; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 3178467; Country of ref document: CA |
| | ENP | Entry into the national phase | Ref document number: 202215887; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20210401 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2215887.7; Country of ref document: GB |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2021781361; Country of ref document: EP; Effective date: 20221102 |
| | WWP | Wipo information: published in national office | Ref document number: 2215887.7; Country of ref document: GB |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2021781361; Country of ref document: EP |















