WO2025085574A1 - Methods for improved predictions of polygenic phenotypes across diverse populations - Google Patents

Methods for improved predictions of polygenic phenotypes across diverse populations

Info

Publication number
WO2025085574A1
Authority
WO
WIPO (PCT)
Prior art keywords
individual
genetic
trait
individuals
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/051666
Other languages
English (en)
Inventor
Manolis KELLIS
Yosuke Tanigawa
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Massachusetts Institute of Technology
Original Assignee
Massachusetts Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Massachusetts Institute of Technology
Publication of WO2025085574A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • a polygenic score (PGS), also known as a polygenic index, polygenic risk score, genetic risk score, or genome-wide score, is a numeric value used to quantify the estimated effect of many genetic variants on an individual’s phenotype.
  • the individual’s phenotype may include a particular disease and/or other medical traits.
  • a PGS can be used to quantify an individual’s medical trait(s) based on their genetics.
  • a PGS is typically calculated using an additive sum of the effects of genetic variants identified from genome-wide association studies (GWASes).
  • GWASes genome-wide association studies
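
The additive calculation described above can be sketched as follows. This is a minimal illustration, assuming each variant carries a single effect-size weight and each genotype is coded as an allele count (0, 1, or 2). The variant identifiers, the weights, and the function name `polygenic_score` are hypothetical and are not taken from the application.

```python
def polygenic_score(effect_sizes, dosages):
    """Additive score: sum of (effect size x allele count) over shared variants.

    effect_sizes: dict mapping variant ID -> per-allele effect estimate
    dosages:      dict mapping variant ID -> allele count (0, 1, or 2)
    Variants missing from `dosages` are skipped (an assumption of this sketch).
    """
    return sum(beta * dosages[v] for v, beta in effect_sizes.items() if v in dosages)


# Hypothetical weights and genotypes, for illustration only.
effect_sizes = {"variant_a": 0.2, "variant_b": -0.1, "variant_c": 0.05}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_score(effect_sizes, dosages)
```

In practice the weights would come from a trained model or from GWAS summary statistics rather than being written by hand.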
  • the method of predicting a medical trait in an individual includes: obtaining genomic data including information indicative of genetic variants present in a genome of the individual; calculating, using at least one processor and a trained computational model, at least one trait score based on the information indicative of the genetic variants, wherein the trained computational model was obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries; generating, using the at least one processor, a graphical user interface including a visualization of the at least one trait score, the at least one trait score being indicative of a presence of the medical trait in the individual; and displaying, using a display device, the generated graphical user interface. (Attorney Docket No.: M0437.70168WO00)
  • the techniques described herein relate to a system, including: at least one computer hardware processor; and at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of predicting the medical trait in the individual.
  • the techniques described herein relate to at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of predicting the medical trait in the individual.
  • the method further includes altering a course of medical treatment related to the medical trait based on the at least one trait score.
  • altering the course of medical treatment includes reducing a number of medical interventions, reducing a frequency of medical interventions, and/or selecting a different type of medical intervention based on a value of the at least one trait score.
  • altering the course of medical treatment is based on the value of the at least one trait score relative to a threshold value.
  • altering the course of medical treatment includes increasing a number of medical interventions, increasing a frequency of medical interventions, and/or selecting a different type of medical intervention based on the at least one trait score.
  • altering the course of medical treatment is based on a value of the at least one trait score relative to a threshold value.
  • the method further includes identifying at least one therapeutic agent based on a value of the at least one trait score.
  • the method further includes administering the at least one therapeutic agent to the individual based on the value of the at least one trait score.
  • the method further includes altering a recruitment strategy for a clinical trial related to the medical trait based on the at least one trait score.
  • altering the recruitment strategy for the clinical trial includes removing the individual from enrollment in the clinical trial based on a value of the at least one trait score.
  • altering the recruitment strategy for the clinical trial is based on the value of the at least one trait score relative to a threshold value.
  • altering the recruitment strategy for the clinical trial includes including the individual in enrollment in the clinical trial based on a value of the at least one trait score.
  • altering the recruitment strategy for the clinical trial is based on the value of the at least one trait score relative to a threshold value.
  • training the first computational model includes using batch screening iterative lasso (BASIL) regression.
  • training the first computational model includes performing a regression including a loss, the loss including a squared loss and/or a binomial loss.
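
The two loss functions named above can be written out as a short sketch. It assumes the squared loss is used for quantitative traits and the binomial (logistic) loss for binary traits, with `eta` denoting the model's linear predictor; the 1/2 scaling and the averaging over individuals are conventions chosen for this illustration, not details stated in the application.

```python
import math


def squared_loss(y, eta):
    """Mean of 0.5 * (y - eta)^2 over individuals (quantitative trait)."""
    return sum(0.5 * (yi - ei) ** 2 for yi, ei in zip(y, eta)) / len(y)


def binomial_loss(y, eta):
    """Mean negative log-likelihood of a logistic model (binary trait, y in {0, 1}).

    Per individual: log(1 + exp(eta)) - y * eta, evaluated in a numerically
    stable form so that large |eta| does not overflow.
    """
    total = 0.0
    for yi, ei in zip(y, eta):
        softplus = max(ei, 0.0) + math.log1p(math.exp(-abs(ei)))
        total += softplus - yi * ei
    return total / len(y)
```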
  • obtaining the trained computational model further includes using the trained first computational model and performing a regression including a penalty applied to genetic variants with heterogeneous associations between single-ancestry populations.
  • performing the regression further includes including effects of a global ancestry of the individual on genetic variants with heterogeneous associations.
  • including the effects of the global ancestry of the individual includes including, for each instance of genomic data in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships, an interaction described by a number of copies of each genetic variant with heterogeneous associations in each instance of genomic data and principal components of a genomic matrix generated using each instance of genomic data.
  • performing the regression includes performing Elastic Net regression.
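
One way to read the interaction terms described above is as products of allele counts and ancestry principal components. The sketch below builds such features for one individual; treating the interaction as a simple product, and the names `interaction_features` and `design_row`, are assumptions made for illustration.

```python
def interaction_features(dosages, pcs):
    """Products of each allele count with each principal-component score.

    dosages: allele counts for the variants with heterogeneous associations
    pcs:     the individual's genotype principal-component scores
    Output order: variant-major, then principal component.
    """
    return [g * pc for g in dosages for pc in pcs]


def design_row(all_dosages, het_indices, pcs):
    """Main-effect allele counts followed by variant-by-PC interaction terms."""
    het = [all_dosages[j] for j in het_indices]
    return list(all_dosages) + interaction_features(het, pcs)


# Hypothetical individual: three variants, of which the first and third are
# flagged as heterogeneous, and two principal components.
row = design_row([2, 0, 1], het_indices=[0, 2], pcs=[0.5, -1.0])
```

Rows built this way could then be passed to a penalized regression such as Elastic Net, which is what the aspect above names.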
  • training the first computational model includes performing a regression including at least one penalty factor determined based on one or more biological priors.
  • the one or more biological priors include one or more of the following: variant pathogenicity, predicted variant consequences, known causal variants, tissue-specific regulatory genomic annotations, cell-type-specific regulatory genomic annotations, and/or aggregation of one or more of the preceding effects.
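
A penalty factor scales how strongly the regularization penalty applies to each variant, so variants favored by a biological prior are shrunk less. The sketch below shows the idea for an L1 penalty. The annotation labels and the numeric factors are hypothetical placeholders; the application does not specify these values here.

```python
# Hypothetical mapping from predicted consequence to penalty factor.
# Smaller factors mean weaker shrinkage for that class of variant.
PENALTY_BY_ANNOTATION = {
    "protein_truncating": 0.5,
    "protein_altering": 0.75,
    "other": 1.0,
}


def penalty_factors(annotations):
    """Look up one penalty factor per variant; unknown labels default to 1.0."""
    return [PENALTY_BY_ANNOTATION.get(a, 1.0) for a in annotations]


def weighted_l1_penalty(betas, factors, lam):
    """lam * sum_j factor_j * |beta_j|, the penalty term added to the loss."""
    return lam * sum(f * abs(b) for b, f in zip(betas, factors))


factors = penalty_factors(["protein_truncating", "other", "protein_altering"])
penalty = weighted_l1_penalty([1.0, -2.0, 4.0], factors, lam=0.1)
```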
  • the method further includes determining the genetic variants with heterogeneous associations between single-ancestry populations by: determining genetic variants associated with the medical trait for two or more single-ancestry populations using inverse-variance weighted meta-analysis of genome-wide association studies (GWAS) for each of the two or more single-ancestry populations.
  • determining the genetic variants with heterogeneous associations further includes using Cochran's Q test to determine genetic variants associated with p-values smaller than a threshold value.
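
For a single variant, the inverse-variance weighted meta-analysis and Cochran's Q heterogeneity test can be sketched as below, using the standard textbook formulas. The inputs are per-population effect estimates and standard errors; the numbers in the example are made up, and the helper `chi2_sf` is written out by hand only to keep the sketch dependency-free.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        a, sf = 1.0, math.exp(-half)
    else:
        a, sf = 0.5, math.erfc(math.sqrt(half))
    # Recurrence for the regularized upper incomplete gamma function.
    while a < df / 2.0:
        sf += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return sf


def ivw_meta_analysis(betas, ses):
    """Return (pooled effect, pooled SE, Cochran's Q, heterogeneity p-value)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_meta = sum(w * b for w, b in zip(weights, betas)) / total
    se_meta = math.sqrt(1.0 / total)
    q = sum(w * (b - beta_meta) ** 2 for w, b in zip(weights, betas))
    p_het = chi2_sf(q, len(betas) - 1)
    return beta_meta, se_meta, q, p_het


# Hypothetical variant with opposite effect directions in two populations.
beta_meta, se_meta, q, p_het = ivw_meta_analysis([0.10, -0.30], [0.02, 0.05])
```

A variant would be flagged as heterogeneous when `p_het` falls below the chosen threshold; the threshold itself is a parameter of the method and is not fixed by this sketch.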
  • calculating the at least one trait score further includes using information indicative of one or more conventional risk factors and/or genetic variants associated with the individual.
  • obtaining the trained computational model further includes performing a regression including effects on phenotypes in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships of the one or more conventional risk factors and/or the genetic variants associated with the individual.
  • obtaining the genomic data includes obtaining genotyping data previously obtained by genotyping a biological sample obtained from the individual.
  • obtaining the genomic data includes obtaining genotyping data by sequencing a biological sample obtained from the individual.
  • obtaining the genotyping data includes obtaining microarray data, whole-genome sequencing data, whole-exome sequencing data, and/or genotype imputation from partially observed data.
  • training the first computational model further includes using one or more of a linear genetic effect, a genetic dominance effect, and/or a sex-based genetic effect associated with the medical trait.
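
The three kinds of genetic effect listed above correspond to different encodings of the same genotype. The sketch below uses common conventions that are assumptions here: an allele count for the linear (additive) effect, a heterozygote indicator for the dominance effect, and an allele count restricted to male individuals for the sex-based effect.

```python
def encode_variant(dosage, is_male):
    """Encode one genotype (allele count 0, 1, or 2) as three predictor values."""
    return {
        # Linear effect: the allele count itself.
        "additive": float(dosage),
        # Dominance effect: 1 for heterozygotes, 0 otherwise.
        "dominance": 1.0 if dosage == 1 else 0.0,
        # Sex-based effect: allele count for males, 0 for females.
        "gxs_male": float(dosage) if is_male else 0.0,
    }


heterozygous_male = encode_variant(1, is_male=True)
homozygous_female = encode_variant(2, is_male=False)
```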
  • obtaining the trained computational model further includes: determining an optimal regularization parameter value by: training a first plurality of computational models using a plurality of different regularization parameter values and a subset of the data describing phenotypes, genotypes, and/or phenotype-genotype relationships; evaluating a predictive performance of each of the first plurality of trained computational models; and selecting the optimal regularization parameter value based on the evaluated predictive performance of each of the first plurality of trained computational models.
  • obtaining the trained computational model further includes training the first computational model using the optimal regularization parameter value.
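
The selection of a regularization parameter described above can be sketched with a toy model. For brevity this uses a one-predictor lasso without an intercept, which has a closed-form solution, and mean squared error on a validation split as the performance metric; the data, the candidate values, and the final refit on the training split are illustrative assumptions rather than details from the application.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_lasso_1d(x, y, lam):
    """Minimize (1/2n) * sum((y - x*beta)^2) + lam * |beta| for one predictor."""
    n = len(x)
    xy = sum(a * b for a, b in zip(x, y)) / n
    xx = sum(a * a for a in x) / n
    return soft_threshold(xy, lam) / xx


def validation_mse(beta, x, y):
    return sum((b - beta * a) ** 2 for a, b in zip(x, y)) / len(x)


def select_lambda(lambdas, train, val):
    """Fit one model per candidate value and keep the best on validation data."""
    scores = {lam: validation_mse(fit_lasso_1d(*train, lam), *val) for lam in lambdas}
    return min(scores, key=scores.get), scores


train = ([1.0, 2.0, 3.0, 4.0], [1.3, 2.5, 3.7, 5.1])
val = ([1.0, 2.0], [1.0, 2.0])
best_lambda, scores = select_lambda([0.0, 1.0, 2.0, 4.0], train, val)
final_beta = fit_lasso_1d(*train, best_lambda)  # train again at the chosen value
```

In this toy data the unpenalized fit overshoots the validation relationship, so a moderate amount of shrinkage gives the lowest validation error.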
  • the techniques described herein relate to a method of performing a clinical trial. The method includes: obtaining, for a first individual, a first trait score associated with a medical trait by: calculating the first trait score using at least one processor, a trained computational model, and first genomic data including information indicative of genetic variants present in a genome of the first individual, wherein the trained computational model was obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries; enrolling the first individual in the clinical trial based on a value of the first trait score; and altering a course of medical treatment for the first individual in accordance with the clinical trial.
  • the techniques described herein relate to a system, including: at least one computer hardware processor; and at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of performing the clinical trial.
  • the techniques described herein relate to at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of performing the clinical trial.
  • enrolling the first individual in the clinical trial based on the value of the first trait score includes enrolling the first individual in the clinical trial based on the value of the first trait score relative to a threshold value.
  • the method further includes: obtaining, for a second individual, a second trait score associated with the medical trait by calculating the second trait score using at least one processor, the trained computational model, and second genomic data including information indicative of genetic variants present in a genome of the second individual; and declining to enroll the second individual in the clinical trial based on a value of the second trait score.
  • declining to enroll the second individual in the clinical trial based on the value of the second trait score includes declining to enroll the second individual in the clinical trial based on the value of the second trait score relative to a threshold value.
  • training the first computational model includes using batch screening iterative lasso (BASIL) regression.
  • BASIL batch screening iterative lasso
  • training the first computational model includes performing a regression including a loss, the loss including a squared loss and/or a binomial loss.
  • obtaining the trained computational model further includes using the trained first computational model and performing a regression including a penalty applied to genetic variants with heterogeneous associations between single-ancestry populations.
  • performing the regression further includes including effects of a global ancestry of the first individual on genetic variants with heterogeneous associations.
  • including the effects of the global ancestry of the first individual includes including, for each instance of genomic data in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships, an interaction described by a number of copies of each genetic variant with heterogeneous associations in each instance of genomic data and principal components of a genomic matrix generated using each instance of genomic data.
  • performing the regression includes performing Elastic Net regression.
  • training the first computational model includes performing a regression including at least one penalty factor determined based on one or more biological priors.
  • the one or more biological priors include one or more of the following: variant pathogenicity, predicted variant consequences, known causal variants, tissue-specific regulatory genomic annotations, cell-type-specific regulatory genomic annotations, and/or aggregation of one or more of the preceding effects.
  • the method further includes determining the genetic variants with heterogeneous associations between single-ancestry populations by: determining genetic variants associated with the medical trait for two or more single-ancestry populations using inverse-variance weighted meta-analysis of genome-wide association studies (GWAS) for each of the two or more single-ancestry populations.
  • GWAS genome-wide association studies
  • determining the genetic variants with heterogeneous associations further includes using Cochran's Q test to determine genetic variants associated with p-values smaller than a threshold value.
  • calculating the at least one trait score further includes using information indicative of one or more conventional risk factors and/or genetic variants associated with the first individual.
  • obtaining the trained computational model further includes performing a regression including effects on phenotypes in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships of the one or more conventional risk factors and/or the genetic variants associated with the first individual.
  • obtaining the genomic data includes obtaining genotyping data previously obtained by genotyping a biological sample obtained from the first individual.
  • obtaining the first genomic data includes obtaining genotyping data by sequencing a biological sample obtained from the first individual.
  • obtaining the genotyping data includes obtaining microarray data, whole-genome sequencing data, whole-exome sequencing data, and/or genotype imputation from partially observed data.
  • training the first computational model further includes using one or more of a linear genetic effect, a genetic dominance effect, and/or a sex-based genetic effect associated with the medical trait.
  • obtaining the trained computational model further includes: determining an optimal regularization parameter value by: training a first plurality of computational models using a plurality of different regularization parameter values and a subset of the data describing phenotypes, genotypes, and/or phenotype-genotype relationships; evaluating a predictive performance of each of the first plurality of trained computational models; and selecting the optimal regularization parameter value based on the evaluated predictive performance of each of the first plurality of trained computational models.
  • obtaining the trained computational model further includes training the first computational model using the optimal regularization parameter value.
  • FIG. 1-1 is a schematic block diagram of a trait score facility, in accordance with some embodiments of the technology described herein.
  • FIG. 1-2 is a flowchart of a process 200 for determining an individual’s trait score, including an inclusive polygenic score (iPGS), in accordance with some embodiments of the technology described herein.
  • iPGS inclusive polygenic score
  • FIG. 1-3 is a flowchart of a process 300 for generating a computational model for determining an individual’s trait score, including an iPGS, in accordance with some embodiments of the technology described herein.
  • FIG. 1-4 is a flowchart of another process 400 for generating a computational model for determining an individual’s trait score, including an iPGS, in accordance with some embodiments of the technology described herein.
  • FIG. 1-5 is a flowchart of a process 500 for designing a clinical trial using trait scores, in accordance with some embodiments of the technology described herein.
  • FIG. 1-6 is a diagram of an illustrative computer system 600, in accordance with some embodiments.
  • FIGs. 2-2A to 2-2B show inclusive PGS (iPGS) training with diverse ancestry enhances the transferability of polygenic scores in the UK Biobank.
  • FIG. 2-2A shows principal-component projections of the unrelated individuals in the UK Biobank and population-label assignment.
  • FIG. 2-2B shows relative average improvements of PGS model performance against the baseline model trained only with White British individuals (material and methods). Error bars represent 95% confidence intervals of average improvements.
  • FIGs. 2-3A to 2-3G show a systematic predictive performance evaluation of inclusive PGS (iPGS) models and PRS-CSx across 60 anthropometric and hematological traits in the UK Biobank.
  • FIG. 2-3A shows the predictive performance (R²) in White British (WB), South Asian (SA), and African (Afr) groups in the UK Biobank for four select models: (i) WB-only, (ii) inclusive, (iii) inclusive-FixN, and (vii) PRS-CSx.
  • FIG. 2-3B shows the number of approximately LD-independent (R² < 0.2 in the African population in the UK Biobank) variants with heterogeneous genome-wide association studies (GWAS) associations (material and methods).
  • FIGs. 2-3C to 2-3G show the predictive performance of up to eight PGS models in White British (WB) and African (Afr) populations in the UK Biobank for five select traits. The refit models are trained only for the neutrophil and leukocyte counts, where genetic variants with heterogeneous GWAS effects were observed. The predictive performance for other models and ancestry groups is shown in FIGs. 2-8A to 2-8B and FIG. 2-9.
  • BMI body mass index.
  • Vol. volume.
  • Dist. distribution.
  • Impd. impedance.
  • FIGs. 2-4A to 2-4E show enhanced predictive performance with iPGS+refit that additionally accounts for ancestry-dependent genetic effects.
  • FIG. 2-4A shows GWAS meta-analysis heterogeneity test in the UK Biobank for the neutrophil count. Genetic variants with heterogeneity p value < 5 × 10^(…) are highlighted.
  • FIG. 2-4B shows GWAS effect size comparison between White British (x axis) and African (y axis) populations in the UK Biobank. The color indicates whether the variants show heterogeneous GWAS associations.
  • FIG. 2-4C shows allele frequency comparison for 5,890 genetic variants associated with neutrophil count (material and methods). Heterogeneous GWAS associations are shown. The shape and size represent the direction and the magnitude of GWAS associations in the African population in the UK Biobank.
  • FIGs. 2-4D to 2-4E show phenotype mean values of neutrophil count (FIG. 2-4D) and leukocyte count (FIG. 2-4E) stratified by decile of PGS in the held-out test set of individuals of the African population in the UK Biobank. Error bars represent the standard error-of-mean estimates.
  • PTVs protein-truncating variants.
  • PAVs protein-altering variants.
  • FIG. 2-5 shows the size of the inclusive PGS model. The number of genetic variants with non-zero coefficients in the inclusive PGS model trained on up to 284,661 individuals is shown as a histogram across 60 traits.
  • FIGs. 2-6A to 2-6D show a comparison of effect size between GWAS and inclusive polygenic scores (iPGS) for standing height. The effect size estimates from GWAS (x-axis) and iPGS (y-axis) are compared for (FIG. 2-6A) white British, (FIG. 2-6B) non-British white, (FIG. 2-6C) South Asian, and (FIG. 2-6D) African. Error bars represent the standard error of the GWAS effect size estimates.
  • iPGS inclusive polygenic scores
  • FIG. 2-7 shows a comparison of predicted PGS in the held-out test set individuals between the WB-only and the Inclusive PGS models. Across 60 traits, the Pearson’s correlation (R²) was evaluated between the two predicted PGS values across all unrelated held-out test individuals (“All”) as well as a subset of held-out test set individuals (WB: white British, NBW: non-British white, SA: South Asian, Afr: African, and Others).
  • FIG. 2-8A shows a systematic predictive performance evaluation of inclusive PGS (iPGS) and PRS-CSx models across 31 anthropometric traits in UK Biobank.
  • FIG. 2-8B shows a systematic predictive performance evaluation of inclusive PGS (iPGS) and PRS-CSx models across 29 hematological traits in UK Biobank.
  • FIG. 2-9 shows the predictive performance of up to eight PGS models for five select traits.
  • FIG. 2-10 shows average improvements of an inclusive PGS (iPGS) model against a WB-only model.
  • iPGS inclusive PGS
  • R² the predictive performance
  • FIG. 2-11 shows comparisons of heritability and predictive performance of the inclusive PGS model.
  • the estimated heritability (h2) (x-axis) and predictive performance of the inclusive PGS model (y-axis) in the white British population in UK Biobank are shown. Error bars represent standard error.
  • FIG. 2-12 shows cumulative frequency distribution of minor allele frequency of the genetic variants in UK Biobank. The allele frequency was computed using different subsets of unrelated individuals in UK Biobank and different subsets of genetic variants.
  • the “UKB variants” represents all of the 1,316,181 genetic variants considered in the study, consisting of directly genotyped and imputed genetic variants and imputed HLA allotypes.
  • FIG. 2-13 shows a comparison of allele frequency of genetic variants in white British and African in UK Biobank.
  • FIGs. 2-14A to 2-14D show ancestry-biased genetic associations for the neutrophil count in UK Biobank.
  • FIG. 2-14A shows a Manhattan plot for the meta-analyzed p values from UK Biobank GWAS association analysis across white British, non-British white, South Asian, and African ancestry groups.
  • FIGs. 2-14B to 2-14C show Manhattan plots for population-specific GWAS analysis for white British (FIG. 2-14B) and African (FIG. 2-14C) ancestry groups in UK Biobank.
  • FIG. 2-14D shows a Manhattan plot for GWAS heterogeneity p value across white British, non-British white, South Asian, and African ancestry groups in UK Biobank.
  • the shape of the points represents the predicted consequence of genetic variants from Ensembl’s Variant Effect Predictor (VEP).
  • FIG. 2-15 shows allele frequency and minor allele frequency of genetic associations for the neutrophil count in UK Biobank. Cascade plots for population-specific GWAS analysis for white British (top) and African (bottom) ancestry groups in UK Biobank. The shape of the points represents the predicted consequence of genetic variants from Ensembl’s Variant Effect Predictor (VEP).
  • VEP Variant Effect Predictor
  • FIGs. 2-16A to 2-16C show a comparison of PGS coefficients in two PGS models for neutrophil counts.
  • FIGs. 2-16A to 2-16B show the coefficients (BETA) for the inclusive PGS model (FIG. 2-16A) and WB-only model (FIG. 2-16B).
  • the shape of the points represents the predicted consequence of genetic variants from Ensembl’s Variant Effect Predictor (VEP).
  • FIGs. 3-1A to 3-1E show GenESIS integrates validated GxS effects on top of linear effects in a unified framework.
  • FIG. 3-1A shows the number of predictor variables considered in iPGS (x-axis) and GenESIS model (y-axis).
  • FIGs. 3-1B to 3-1C show a comparison of GWAS effect size estimates between females (x-axis) and males (y-axis) for hip circumference in white British individuals.
  • the genetic variants selected in GenESIS only for linear effects are shown in (FIG. 3-1B), and the selected variants with GxS interaction effects are shown in (FIG. 3-1C).
  • FIG. 3-1D shows the distributions of GxS interaction effect test statistics, T_Diff, for hip circumference in white British individuals.
  • FIG. 3-1E shows the number of traits where GenESIS is selected for trait prediction on the basis of the validation set metric. For each population group (y-axis), the number of traits per trait category is shown.
  • FIGs. 3-2A to 3-2D show enhanced PGS transferability in non-European populations with GenESIS.
  • FIG. 3-2A shows the magnitude and statistical significance of the gain in predictive performance of GenESIS over linear-only iPGS across three population groups. 32 traits are shown where GenESIS was selected for at least one of the three populations (white British, South Asian, and African). The full results are shown in FIG. 3-6.
  • FIG. 3-2B shows a fraction of GxS interaction terms in the selected predictor variables in the GenESIS model.
  • FIG. 3-2C shows the predictive performance (R²) of covariate-only model, linear-only iPGS model, and GenESIS model for hip circumferences and body mass index in individuals of White British, South Asian, and African ancestry in the held-out test set.
  • FIG. 3-2D shows, for hip circumference, the predictive performance of the GenESIS model (partial R² for covariate-adjusted phenotype, x-axis) compared for individuals of African ancestry against all publicly available PGS model evaluations in the PGS catalog (y-axis). Error bars represent 95% confidence intervals.
  • FIGs. 3-3A to 3-3D show sparse GenESIS models offer interpretation.
  • FIGs. 3-3A to 3-3B show the linear (FIG. 3-3A) and GxS interaction (FIG. 3-3B) effects in the GenESIS model for hip circumference. Genetic variants with large effects were annotated. The GxS effect size directions are represented for male individuals.
  • FIG. 3-3C shows pleiotropic association of rs1260326, a protein-altering variant in GCKR gene for select traits. The GWAS-based effect size estimates are shown on the left and the statistical significance of the association on the right. Phenome-wide associations with nominal p < 1 × 10^(…) are shown in Table 18.
  • FIG. 3-3D shows enriched ontology terms for genetic variants with GxS effects in GenESIS, nominated by the GREAT enrichment analysis.
  • FIGs. 3-4A to 3-4B show the number and magnitude of linear and gene-by-sex interaction effects in GenESIS.
  • FIG. 3-4A shows a comparison of the number of predictor variables with linear (x-axis) and GxS interaction effects (y-axis) across 99 traits.
  • FIG. 3-4B shows a comparison of the linear and GxS effect sizes across 99 traits. The median of the absolute value of the effect size per standard deviation of trait values is shown along with the interquartile range (IQR).
  • FIGs. 3-5A to 3-5B show validating GxS effects in GenESIS by sex-stratified GWAS for bilirubin traits.
  • FIGs. 3-5A to 3-5B show the distributions of gene-by-sex (GxS) interaction test statistics (Example 2 Methods) for the genetic variants included in the GenESIS models stratified by whether the variant has linear or GxS interaction effects for total bilirubin (FIG. 3-5A) and direct bilirubin (FIG. 3-5B).
  • FIGs. 3-6A to 3-6B show a systematic evaluation of the predictive performance of GenESIS and linear-only iPGS.
  • FIG. 3-6A shows the magnitude and statistical significance of the gain in predictive performance of GenESIS over linear-only iPGS across five population groups across 99 quantitative traits in UK Biobank.
  • FIG. 3-6B shows a fraction of GxS interaction terms in the selected predictor variables in the GenESIS model.
  • FIG. 3-7 shows enhanced PGS transferability in non-European populations with GenESIS.
  • Predictive performance (R²) of the covariate-only model, linear-only iPGS model, and GenESIS model is shown for hip circumference and body mass index for the following five population groups in the held-out test set: White British, non-British White, South Asian, African, and other individuals.
  • FIG. 3-8 shows modeling male- and female-specific genetic effects in GenESIS.
  • the predictive performance (R²) of GenESIS PGS models is shown for seven traits (rows) across five population groups (columns) in the UK Biobank resource.
  • the “GxS_m” model represents the primary analysis, focusing on sex-shared and male-specific effects.
  • FIG. 4-1A shows improvements in predictive performance (R²) for white British individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • FIG. 4-1B shows improvements in predictive performance (AUROC) for white British individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • the dashed line represents the average improvement of 3.94% across 49 binary traits.
  • FIG. 4-2A shows improvements in predictive performance (R²) for non-British white individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • the dashed line represents the average improvement of 7.41% across 177 quantitative traits.
  • FIG. 4-2B shows improvements in predictive performance (AUROC) for non-British white individuals in the held-out test set in UK Biobank.
  • FIG. 4-3A shows improvements in predictive performance (R²) for South Asian individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • FIG. 4-3B shows improvements in predictive performance (AUROC) for South Asian individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • the dashed line represents the average improvement of 1.03% across 10 binary traits.
  • FIG. 4-4A shows improvements in predictive performance (R²) for African individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • the dashed line represents the average improvement of 26.74% across 177 quantitative traits.
  • FIG. 4-4B shows improvements in predictive performance (AUROC) for African individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • FIG. 4-5A shows improvements in predictive performance (R²) for other individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • FIG. 4-5B shows improvements in predictive performance (AUROC) for other individuals in the held-out test set in UK Biobank.
  • the predictive performance of the baseline model, trained only on white British individuals, is shown on the x-axis.
  • the difference between the predictive performance of the iPGS model vs. the baseline model is shown on the y-axis.
  • the dotted horizontal line represents when the two models (iPGS and the baseline model) show the same predictive performance.
  • the dashed line represents the average improvement of 5.58% across 38 binary traits.
  • FIG. 4-6A shows improvements in predictive performance (R²) for white British individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • the dashed line represents the average improvement of 0.45% across 364 quantitative traits.
  • FIG. 4-6B shows improvements in predictive performance (AUROC) for white British individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • FIG. 4-7A shows improvements in predictive performance (R²) for non-British white individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • FIG. 4-7B shows improvements in predictive performance (AUROC) for non-British white individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • the dashed line represents the average improvement of 2.31% across 21 binary traits.
  • FIG. 4-8A shows improvements in predictive performance (R²) for South Asian individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • the dashed line represents the average improvement of 0.47% across 364 quantitative traits.
  • FIG. 4-8B shows improvements in predictive performance (AUROC) for South Asian individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • FIG. 4-9A shows improvements in predictive performance (R²) for African individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • FIG. 4-9B shows improvements in predictive performance (AUROC) for African individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • the difference in the predictive performance between the two iPGS models trained with and without the “training & validation refit” is shown on the y-axis.
  • the dotted horizontal line represents when the two models show the same predictive performance.
  • the dashed line represents the average improvement of 25.12% across 5 binary traits.
  • FIG. 4-10A shows improvements in predictive performance (R²) for other individuals in the held-out test set in UK Biobank.
  • FIG. 4-10B shows improvements in predictive performance (AUROC) for other individuals in the held-out test set in UK Biobank.
  • the predictive performance of the original iPGS model is shown on the x-axis.
  • A polygenic score (PGS) is a quantitative method used to predict an individual’s predisposition to a particular phenotype trait based only on their genome.
  • the traits that may be described using a PGS are often of clinical relevance, and as a result, PGSes have a large number of potential applications in medical practice.
  • PGSes can be used to predict an individual's risk of developing certain diseases (e.g., coronary artery disease, type 2 diabetes, and/or breast cancer, as non-limiting examples). Additionally, PGSes may be used to improve preventative health and to provide personalized medical care. For example, individuals with a high PGS associated with a medical condition may benefit from more aggressive preventative measures or monitoring. In contrast, it may be beneficial to reduce unnecessary tests for “low-risk” individuals (e.g., with a low PGS), thereby reducing invasive testing and associated costs. PGSes may also inform personalized prevention and treatment strategies (e.g., for cancer, cardiovascular conditions, or other diseases, as non-limiting examples). [0114] As another example, PGSes may inform pharmaceutical research and clinical trials.
  • PGSes can also be used to identify individuals with higher genetic risk for a certain condition or medical trait, making them good candidates for clinical trials. Additionally, PGSes can be used to identify individuals with certain genetic predispositions to disease conditions to assist in the development of cellular models (e.g., iPS cells) for pharmaceutical research. Furthermore, PGSes may be used in prenatal and infant health screening to identify health risks early in life. Such genetic screening using PGSes may provide valuable information for genetic counselors, who can then provide more precise advice to parents.
  • GWAS genome-wide association study
  • GWAS summary statistics, which provide information about an association between each genetic variant and a phenotype trait of interest, may be provided as an input. These statistics include the effect size (often a β coefficient) and p-value for each genetic variant.
  • genetic variants of interest for the PGS model may be selected from the GWAS summary statistics.
  • C+T: linkage disequilibrium (LD) clumping and p-value thresholding
  • P+T: pruning and thresholding
  • correlated genetic variants are removed using LD clumping or LD pruning to ensure independence among the selected genetic variants.
  • additional genetic variants are removed that do not meet a selected statistical significance threshold (e.g., genetic variants with p > 0.05).
  • each selected genetic variant is assigned a weight based on its effect size from the GWAS. This weight represents the contribution of the genetic variant to the phenotype trait.
  • the effect size estimates from GWAS are used as the weights.
  • a PGS is determined for each individual in the sample data set. For each individual, the number of risk alleles that are present in their genome for each genetic variant is multiplied by the weight of that genetic variant. The PGS for that individual is then determined by summing across all genetic variants. In some approaches, the first and second steps are performed simultaneously. This procedure results in the selection of relevant genetic variants as well as an estimation of their PGS effect size. [0116] An increased availability of genetic data in recent years has allowed researchers to analyze larger numbers of individual genomes and, correspondingly, to provide more reliable PGS models derived from increased statistical power. In parallel, there have been methodological improvements in training PGS models with higher accuracy. However, most PGS models suffer from limited transferability across populations.
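  • The GWAS-based procedure described above (clumping, thresholding, weighting, and summing) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular tool: the function names, the dictionary-based inputs, the greedy clumping rule, and the default thresholds (`p_max`, `r2_max`) are all hypothetical choices made for the example.

```python
from typing import Dict, FrozenSet, List


def clump_and_threshold(
    p_values: Dict[str, float],
    ld_r2: Dict[FrozenSet[str], float],
    p_max: float = 0.05,   # hypothetical significance threshold
    r2_max: float = 0.1,   # hypothetical LD threshold
) -> List[str]:
    """Greedy LD clumping plus p-value thresholding (C+T)."""
    kept: List[str] = []
    # Visit variants from most to least significant.
    for variant in sorted(p_values, key=p_values.get):
        if p_values[variant] > p_max:
            break  # every remaining variant fails the threshold too
        # Keep the variant only if it is not in LD with an already kept one.
        # Pairs missing from ld_r2 are assumed to be uncorrelated.
        if all(ld_r2.get(frozenset((variant, k)), 0.0) <= r2_max for k in kept):
            kept.append(variant)
    return kept


def polygenic_score(
    allele_counts: Dict[str, int],
    weights: Dict[str, float],
    variants: List[str],
) -> float:
    """Sum of (risk-allele count x GWAS effect size) over the selected variants."""
    return sum(weights[v] * allele_counts.get(v, 0) for v in variants)


# Toy GWAS summary statistics (made-up numbers for illustration only).
p_values = {"rsA": 1e-8, "rsB": 1e-6, "rsC": 0.2, "rsD": 0.01}
effect_sizes = {"rsA": 0.30, "rsB": 0.25, "rsC": 0.05, "rsD": -0.10}
ld_r2 = {frozenset(("rsA", "rsB")): 0.8}  # rsA and rsB are correlated

selected = clump_and_threshold(p_values, ld_r2)
score = polygenic_score({"rsA": 2, "rsB": 1, "rsD": 1}, effect_sizes, selected)
```

  • In this toy input, rsB is dropped because it is correlated with the more significant rsA, and rsC is dropped because it does not meet the significance threshold, so only rsA and rsD contribute to the score.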
  • PGS models are trained from GWAS data collected from specific populations (e.g., Europeans, Africans, etc.), resulting in final PGS models that are population specific. It has been recognized that PGS models trained from a single population show limited predictive performance in individuals from different populations than the original training population. Because non-European individuals are typically underrepresented in genetic studies, they are later underserved by PGS models trained on European populations. Thus, this underrepresentation results in inequitable healthcare improvements derived from advancements in genetic research.
  • the use of genetic data from admixed individuals offers improved PGS modelling for non-European populations and, as a result, improved equity in healthcare.
  • population specific PGS models typically provide reduced PGS model accuracy for populations that are currently underrepresented in genetic studies (e.g., non-European populations).
  • the inventors have developed an inclusive PGS (“iPGS”) model, described herein, in which genetic variants that exhibit genetic effects that are shared across different ancestries are characterized by a score determined by a single PGS model without relying on GWAS analysis of single-ancestry populations.
  • the techniques described herein include systems and methods for predicting a medical trait of an individual using a trait score and/or iPGS.
  • the medical trait may be one or more of (i) continuous variable traits (e.g., blood pressure, weight, etc.), (ii) binary traits (e.g., whether the individual has a disease or a condition and/or a likelihood of the individual developing a disease or a condition), and/or (iii) a time-to-event (e.g., a prediction of how long the individual has before the onset of a disease, condition, or other trait).
  • the medical trait may further include behavioral traits, physiological traits, and/or cognitive traits.
  • the techniques include obtaining genomic data including information indicative of genetic variants present in the individual’s genome.
  • This determination may include training the computational model using supervised learning on data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries.
  • admixed refers to individuals with recent ( ⁇ 100 generations) admixture events between isolated populations.
  • the supervised learning may be performed using, for example, penalized regression techniques including, but not limited to, ℓ1-penalized regression (LASSO), ℓ2-penalized regression (Ridge), and/or ℓ1- and ℓ2-penalized regression (Elastic Net).
  • the supervised learning may be performed using, for example, statistical boosting techniques.
  • batch-based or iterative techniques can be used to perform supervised learning using large-scale individual-level data.
  • the techniques further include determining whether the medical trait is or is not described by an ancestry-dependent genetic effect. That is, the techniques include determining whether one or more genetic variants that are relevant to the medical trait have heterogeneous associations between single-ancestry populations (e.g., whether the genetic variants are more likely to be correlated with the medical trait in one population than another). If one or more genetic variants that are relevant to the medical trait have heterogeneous associations between single-ancestry populations, then a regression may be performed to obtain the computational model.
  • the regression may include an Elastic Net regression including a penalty applied to the determined genetic variants with heterogeneous associations.
  • the techniques next include generating a graphical user interface including a visualization of the calculated trait score.
  • the graphical user interface may include one or more charts and/or tables describing the calculated trait score and its impact on the medical trait of interest.
  • the generated graphical user interface is displayed using a display device (e.g., a computer monitor, television screen, smartphone display, mobile device display, and/or a tablet display).
  • FIG. 1-1 is a schematic block diagram of a trait score facility 100, in accordance with some embodiments of the technology described herein.
  • trait score facility 100 includes a genotyping system 110, a trait score system 120, and a remote system 130. It should be appreciated that trait score facility 100 is illustrative and that a trait score facility may have one or more other components of any suitable type in addition to or instead of the components illustrated in FIG.
  • one or more of the genotyping system 110, the trait score system 120, and the remote system 130 may be communicatively connected by a network 140.
  • the network 140 may be or include one or more local- and/or wide-area, wired and/or wireless networks, including a local-area or wide-area enterprise network and/or the Internet.
  • the network 140 may be, for example, a hard-wired network (e.g., a local area network within a facility), a wireless network (e.g., connected over Wi-Fi and/or cellular networks), a cloud-based computing network, or any combination thereof.
  • the genotyping system 110 and the trait score system 120 may be located within a same facility and connected directly to each other or connected to each other via the network 140, while the remote system 130 may be located in a remote facility and connected to the genotyping system 110 and/or the trait score system 120 through the network 140.
  • the genotyping system 110 may be configured to perform genotyping of a biological sample obtained from an individual 112.
  • the biological sample may be any suitable biological material obtained from the individual 112.
  • the biological sample may include one or more of saliva, blood, biopsied tissue, formalin-fixed paraffin-embedded (FFPE) tissue, fine-needle aspirate (FNA), core needle biopsies (CNBs), liquid biopsies, urine, feces, hair, epithelial cells, bone marrow, or other biological matter obtained from the individual 112.
  • the genotyping system 110 may be configured to perform one or more types of genotyping.
  • the genotyping system 110 may be configured to perform whole genome sequencing, whole exome sequencing, and/or targeted sequencing (e.g., targeting a particular chromosome or genomic region).
  • the genotyping system 110 may be configured to perform genotyping using a DNA microarray (e.g., a “gene chip”).
  • trait score facility 100 includes trait score system 120 communicatively coupled to the genotyping system 110.
  • the trait score system 120 may not be communicatively coupled directly to the genotyping system 110 but may obtain the genotyping results by retrieving the results from a separate computer-readable memory, including but not limited to a separate database (e.g., as associated with an electronic medical record (EMR)).
  • trait score system 120 may be any suitable electronic device configured to receive information from genotyping system 110 and/or to process obtained genomic data.
  • trait score system 120 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device.
  • trait score system 120 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to receive information from genotyping system 110 and/or to process obtained genomic data.
  • trait score system 120 includes a computational facility 122.
  • the computational facility 122 may be configured to analyze genomic data obtained by genotyping system 110.
  • the computational facility 122 may be configured to, for example, use a computational model to analyze the obtained genomic data to determine an iPGS and/or a trait score (e.g., using the iPGS) associated with a medical trait for the individual 112, as described herein.
  • the computational facility 122 may be implemented as hardware, software, or any suitable combination of hardware and software, as aspects of the disclosure provided herein are not limited in this respect.
  • the computational facility 122 may be implemented in the trait score system 120, such as being implemented in software (e.g., executable instructions) executed by one or more processors of the trait score system 120.
  • the computational facility 122 may be additionally or alternatively implemented at one or more other elements of the trait score facility 100 of FIG. 1-1.
  • the computational facility 122 may be implemented at the genotyping system 110 and/or the remote system 130 discussed herein.
  • the computational facility 122 may be implemented at or with another device, such as a device located remote from the trait score facility 100 and receiving data via the network 140.
  • trait score system 120 also interacts with remote system 130 through network 140, in some embodiments.
  • Remote system 130 may be any suitable electronic device configured to receive information (e.g., from genotyping system 110 and/or trait score system 120) and to display generated graphical user interfaces for viewing.
  • the remote system 130 may be remote from the genotyping system 110 and trait score system 120, such as by being located in a different room, wing, or building of a facility (e.g., a healthcare facility) than the genotyping system 110, or being geographically remote from the genotyping system 110 and trait score system 120, such as being located in another part of a city, another city, another state or country, etc.
  • remote system 130 may be a fixed electronic device such as a desktop computer, a rack-mounted computer, or any other suitable fixed electronic device.
  • remote system 130 may be a portable device such as a laptop computer, a smart phone, a tablet computer, or any other portable device that may be configured to receive and view generated graphical user interfaces and/or to send instructions and/or information to trait score system 120.
  • remote system 130 may receive information (e.g., information indicative of one or more iPGSes and/or of trait scores, generated graphics describing one or more iPGSes and/or trait scores) from trait score system 120 and/or genotyping system 110 over the network 140.
  • a remote user 132 (e.g., a medical clinician) may use remote system 130 to view the received information.
  • FIG. 1-2 is a flowchart of a process 200 for predicting a medical trait of an individual by determining one or more trait scores and/or iPGSes, in accordance with some embodiments of the technology described herein.
  • Process 200 may be implemented by a computational facility, such as the computational facility 122 of FIG. 1-1.
  • the process 200 may be performed by a computing device configured to receive and/or obtain information from a genotyping system (e.g., genotyping system 110 described in connection with FIG. 1-1).
  • the process 200 may be performed by one or more processors located remotely (e.g., as part of a cloud computing environment, as connected through a network) from the genotyping system that obtained the input genomic data.
  • Process 200 may begin at act 202, where the computational facility obtains genomic data.
  • the genomic data includes information indicative of genetic variants present in the individual’s genome.
  • the genomic data may be obtained by genotyping the individual (e.g., by performing one or more of whole genome sequencing, whole exome sequencing, and/or targeted sequencing).
  • genotyping the individual may be performed by performing genotyping of a biological sample obtained from an individual 112.
  • the biological sample may be any suitable biological material obtained from the individual 112.
  • the biological sample may include one or more of saliva, blood, biopsied tissue, formalin-fixed paraffin-embedded (FFPE) tissue, fine-needle aspirate (FNA), core needle biopsies (CNBs), liquid biopsies, urine, feces, hair, epithelial cells, bone marrow, or other biological matter obtained from the individual 112.
  • genotyping the individual may be performed prior to and separately from determining the trait score and/or iPGS (e.g., by genotyping a biological sample obtained from the individual to obtain the genomic data prior to determining the trait score and/or iPGS). Alternatively, in some embodiments, genotyping the individual may be performed as a part of determining the trait score and/or iPGS (e.g., by genotyping a biological sample obtained from the individual). In some embodiments, genotyping may be performed using a DNA microarray. [0137] After act 202, process 200 may proceed to act 204, where the computational facility may determine at least one trait score for the individual based on the information indicative of the genetic variants.
  • the at least one trait score may be calculated using at least one processor (e.g., of a computing device implementing the computational facility) and a computational model trained to determine trait scores and/or iPGSes.
  • the computational model may be obtained by calculating effect sizes for one or more genetic variants associated with the medical trait.
  • the computational model may be obtained according to the embodiment of FIG. 1-3.
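  • As a rough illustration of act 204, the sketch below applies an already trained sparse model to one individual's genotype data. The model layout (a dictionary with `family`, `intercept`, `covariate_effects`, and `variant_effects`), the function name, and the treatment of missing genotypes as zero copies of the effect allele are assumptions made for this example and are not taken from the source.

```python
import math
from typing import Dict


def trait_score(
    dosages: Dict[str, float],
    covariates: Dict[str, float],
    model: dict,
) -> float:
    """Apply a fitted linear predictor to one individual.

    Returns the predicted value for a continuous trait ("gaussian") or a
    predicted probability for a binary trait ("binomial").
    """
    eta = model["intercept"]
    # Covariate contribution (e.g., age, sex, genotype principal components).
    eta += sum(b * covariates[name] for name, b in model["covariate_effects"].items())
    # Genetic contribution: effect size times effect-allele dosage.
    # Assumption: a variant missing from `dosages` contributes zero.
    eta += sum(b * dosages.get(v, 0.0) for v, b in model["variant_effects"].items())
    if model["family"] == "binomial":
        # Numerically stable logistic function.
        if eta >= 0:
            return 1.0 / (1.0 + math.exp(-eta))
        e = math.exp(eta)
        return e / (1.0 + e)
    return eta


# Hypothetical model for a continuous trait (made-up coefficients).
example_model = {
    "family": "gaussian",
    "intercept": 1.0,
    "covariate_effects": {"age": 0.01},
    "variant_effects": {"rs1": 0.2, "rs2": -0.1},
}
example_score = trait_score({"rs1": 2.0}, {"age": 50.0}, example_model)
```

  • The same function covers the continuous and binary trait cases; a time-to-event model would need a different output (for example, a relative hazard), which is not shown here.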
  • FIG. 1-3 is a flowchart of a process 300 for generating a computational model for determining a trait score and/or an iPGS.
  • Process 300 may be implemented by a computational facility, such as the computational facility 122 of FIG. 1-1.
  • the process 300 may be performed by a computing device configured to receive and/or obtain information from a genotyping system (e.g., genotyping system 110 described in connection with FIG. 1-1).
  • the process 300 may be performed by one or more processors located remotely (e.g., as part of a cloud computing environment, as connected through a network) from the genotyping system that obtained the input genomic data.
  • process 300 may begin at act 302, in which the computational facility may train an initial computational model using supervised learning and data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries.
  • the term “admixed” refers to individuals with recent ( ⁇ 100 generations) admixture events between isolated populations.
  • the supervised learning may be performed using, for example, penalized regression techniques including, but not limited to, ℓ1-penalized regression (LASSO), ℓ2-penalized regression (Ridge), and/or ℓ1- and ℓ2-penalized regression (Elastic Net).
  • the supervised learning may be performed using, for example, statistical boosting techniques.
  • batch-based or iterative techniques including but not limited to batch screening iterative lasso (BASIL) regression, can be used to perform supervised learning using large-scale individual-level data.
  • the computational facility may implement BASIL using the R snpnet package (version 2) on individual-level data.
  • implementing the supervised learning includes fitting an iPGS model by finding the exact solutions for ℓ1- and ℓ2-penalized multivariate regression (e.g., Elastic Net regression) using BASIL.
  • BASIL directly operates on the individual-level genotype and phenotype data and performs variable selection and effect size estimation simultaneously using penalized generalized linear models (GLMs).
  • the weights and the penalty factors allow different levels of shrinkage.
  • Penalty factors less than 1 indicate prioritization of the predictor variables, whereas penalty factors greater than 1 indicate that the effect of the predictor variables is minimized or not preferred in the penalized regression model.
  • one may assign different degrees of the penalty factor values to prioritize a subset of predictor variables according to biological knowledge (“biological priors”).
  • the source(s) of biological knowledge may include variant pathogenicity (e.g., whether a gene variant is known to be pathogenic, likely pathogenic, likely benign, and/or benign), the predicted consequence of variants (e.g., whether the gene variants are putative loss-of-function variants, non-synonymous variants, etc.), known causal variants, tissue- or cell-type-sample-specific regulatory genomic annotations (e.g., gene variants with specific effects in tissue or cell types), and aggregations of the preceding effects, as non-limiting examples.
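  • One way to turn such biological priors into penalty factors is a simple lookup from variant annotation to a per-variant factor, as sketched below. The annotation labels and the numeric values (0.5, 0.75, 1.0) are hypothetical and chosen only to show the mechanics; the source does not specify them. A vector like this could be supplied, for example, through the `penalty.factor` argument of glmnet.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from annotation to penalty factor. Values below 1
# prioritize a variant (less shrinkage); 1.0 is the neutral default.
PENALTY_BY_ANNOTATION: Dict[str, float] = {
    "pathogenic": 0.5,
    "likely_pathogenic": 0.5,
    "loss_of_function": 0.5,
    "non_synonymous": 0.75,
}
DEFAULT_PENALTY = 1.0


def penalty_factors(
    variants: Iterable[str],
    annotations: Dict[str, List[str]],
) -> List[float]:
    """Return one penalty factor per variant, in the order given.

    When a variant carries several annotations, the smallest (most
    prioritizing) factor is used; unannotated variants get the default.
    """
    factors = []
    for variant in variants:
        labels = annotations.get(variant, [])
        candidates = [PENALTY_BY_ANNOTATION.get(a, DEFAULT_PENALTY) for a in labels]
        factors.append(min(candidates, default=DEFAULT_PENALTY))
    return factors


example_factors = penalty_factors(
    ["rs1", "rs2", "rs3"],
    {"rs1": ["non_synonymous", "pathogenic"], "rs2": ["intronic"]},
)
```

  • Taking the minimum over annotations is one possible aggregation rule; other aggregations of the preceding sources of knowledge are equally compatible with the description above.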
  • the Elastic Net loss is [equation not reproduced in this text], where $\alpha$ is an Elastic Net parameter that controls the balance between the $L_1$ (Lasso) and $L_2$ (Ridge) penalization.
  • the tuning parameter, $\lambda$, may be optimized based on the predictive performance on the validation data set.
  • the tuning parameter, $\lambda$, may be optimized using, for example, validation set metrics or 10-fold cross-validation, as implemented using the cv.glmnet function in the glmnet package in R. Additional aspects of optimizing the tuning parameter, $\lambda$, are described in the Examples.
  • $\alpha$ may be set to a value of 0.99.
  • penalty factors $f_j < 1$ may be set for non-synonymous coding variants or previously characterized pathogenic or likely-pathogenic variants.
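  • As a minimal sketch of how an Elastic Net penalty with per-variant penalty factors may be evaluated, the following Python function assumes the glmnet-style convention in which the mixing parameter weights the $L_1$ term and one half of its complement weights the squared $L_2$ term. That convention, the function name, and the argument names are assumptions made for illustration, not a statement of the expression used in the source.

```python
def elastic_net_penalty(beta, alpha=0.99, penalty_factors=None, lam=1.0):
    """Evaluate lam * sum_j f_j * (alpha*|b_j| + (1 - alpha)/2 * b_j**2).

    beta:            list of regression coefficients
    alpha:           mixing parameter (1.0 is pure Lasso, 0.0 is pure Ridge)
    penalty_factors: per-coefficient factors f_j (defaults to all ones)
    lam:             overall tuning parameter
    """
    if penalty_factors is None:
        penalty_factors = [1.0] * len(beta)
    total = 0.0
    for b, f in zip(beta, penalty_factors):
        total += f * (alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b)
    return lam * total
```

  • Under this convention, a penalty factor of 0 leaves a coefficient unpenalized, and a value of 0.99 for the mixing parameter yields a penalty that is close to the Lasso.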
  • the function for the loss contribution may be selected in the generalized linear model (GLM) framework.
  • the loss contribution may be a squared loss for a continuous phenotype (e.g., the Gaussian family in GLM).
  • the loss may be binomial for binary traits (e.g., the binomial family in GLM).
  • the techniques described herein may be applicable for multiple phenotypes (e.g., the multivariate Gaussian family), time-to-event phenotypes (Cox regression), and/or multiple time-to-event phenotypes. Additional aspects of the loss contribution are described in the Examples.
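  • As one non-limiting illustration of per-observation loss contributions in the GLM framework, the following Python sketch implements a squared loss for a continuous phenotype and a binomial (logistic) negative log-likelihood for a binary phenotype. The factor of one half in the squared loss, the function names, and the use of a linear predictor `eta` are assumptions made for illustration.

```python
import math


def gaussian_loss(y, eta):
    """Squared loss for a continuous phenotype (Gaussian family)."""
    return 0.5 * (y - eta) ** 2


def binomial_loss(y, eta):
    """Negative log-likelihood for a binary phenotype y in {0, 1}.

    Computes log(1 + exp(eta)) - y * eta in a numerically stable form.
    """
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - y * eta


def mean_loss(loss, ys, etas, weights=None):
    """Weighted average of per-observation losses (weights default to 1)."""
    if weights is None:
        weights = [1.0] * len(ys)
    return sum(w * loss(y, e) for w, y, e in zip(weights, ys, etas)) / len(ys)
```

  • Either loss function may be passed to `mean_loss`, so that the same fitting routine can serve continuous and binary traits.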
  • the covariate terms may be unpenalized in the regression. When fitting a PGS model from large-scale cohorts, it is not uncommon to have a large number of individuals (on the order of $n = 300{,}000$ or more) and genetic variants (on the order of $p = 1{,}000{,}000$ or more).
  • BASIL efficiently solves the exact solution of the penalized regression in an iterative procedure by taking advantage of strong rules that guide the variable selection in each iteration step. Additional aspects of BASIL regression are described in “A fast and scalable framework for large-scale and ultrahigh dimensional sparse regression with application to the UK Biobank,” J. Qian, et al. (2020) PLOS Genetics 16(10): e1009141, which is incorporated herein by reference in its entirety.
  • In some embodiments, the computational model may be expanded to introduce genetic variants with linear effects, genetic dominance effects, and/or sex-based effects.
  • the following penalized generalized linear regression may be considered to fit the intercept term $\beta_0$ and regression coefficient vector $\beta$ as a function of the tuning parameter $\lambda$ that controls the sparsity of the solution: [equation not reproduced in this text], where $P_\alpha(\beta_j)$ is the Elastic Net penalization term for the coefficient $\beta_j$.
  • the Elastic Net penalization term is written as follows: [equation not reproduced in this text]
  • sample weights $w_i$ and penalty factor values $f_j$ allow different levels of shrinkage to be applied to variables.
  • One may use $w_i = 1$ and $f_j = 1$ as the default values.
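  • As a sketch only, an objective of the kind described above may be written in the glmnet-style convention as follows, with $w_i$ denoting the sample weights, $f_j$ the penalty factors, $\lambda$ the tuning parameter, and $\alpha$ the Elastic Net mixing parameter. The symbols $n$, $p$, $x_i$, $y_i$, and $\ell$, the $1/n$ normalization, and the factor of one half on the squared term are assumptions, since the original expressions are not reproduced in this text.

```latex
\begin{aligned}
\bigl(\hat{\beta}_0(\lambda), \hat{\beta}(\lambda)\bigr)
  &= \operatorname*{arg\,min}_{\beta_0,\, \beta}\;
     \frac{1}{n} \sum_{i=1}^{n} w_i\, \ell\bigl(y_i,\; \beta_0 + x_i^{\top}\beta\bigr)
     + \lambda \sum_{j=1}^{p} f_j\, P_\alpha(\beta_j), \\
P_\alpha(\beta_j)
  &= \alpha\, \lvert \beta_j \rvert + \frac{1 - \alpha}{2}\, \beta_j^{2}.
\end{aligned}
```

  • Under this form, setting $f_j = 0$ for a predictor leaves that predictor unpenalized, and setting $f_j < 1$ shrinks it less than a predictor with the default value.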
  • Penalty factors less than 1 indicate prioritization of the predictor variables, whereas penalty factors greater than 1 indicate that the effect of the predictor variables is minimized or not preferred in the penalized regression model.
  • one may assign different degrees of the penalty factor values to prioritize a subset of predictor variables according to biological knowledge (“biological priors”).
  • the source(s) of biological knowledge may include variant pathogenicity (e.g., whether a gene variant is known to be pathogenic, likely pathogenic, likely benign, and/or benign), the predicted consequence of variants (e.g., whether the gene variants are putative loss-of-function variants, non-synonymous variants, etc.), known causal variants, tissue- or cell-type-sample-specific regulatory genomic annotations (e.g., gene variants with specific effects in tissue or cell types), and aggregations of the preceding effects, as non-limiting examples.
  • the loss contribution for the i-th observation depends on the types of the exponential family considered in the regression analysis of generalized linear models. For example, it is the squared loss, i.e.
  • the concatenated vectors may be used as predictors: [equation not reproduced in this text]
  • the coefficient vector is also a concatenation of two components, corresponding to covariate effects and linear effects of genetic variants, respectively: [equation not reproduced in this text]
  • the covariate terms are set to be unpenalized (i.e., [expression not reproduced in this text]), and the objective function of the penalized generalized linear regression becomes the following: [equation not reproduced in this text]
  • to incorporate nonlinear genetic dominance effects into the computational model, consider a matrix in which the entry for the i-th individual and the j-th genetic variant is an indicator variable representing whether the i-th individual is homozygous for the effect allele of the j-th genetic variant.
  • the concatenated vectors of three components may be used as predictors: [equation not reproduced in this text]
  • the coefficient vector then has three components: [equation not reproduced in this text]
  • To incorporate sex-based genetic effects, consider two matrices that represent Gene-by-Sex (GxS) interaction terms for male and female individuals, respectively, each defined as the genotype dosage multiplied by an indicator function of the individual's sex. The male matrix has the original genotype dosage for male individuals and is always set to zero for female individuals, and vice versa for the female matrix.
  • the concatenated vectors of four components may be used as predictors: [equation not reproduced in this text]
  • the coefficient vector then has four components: [equation not reproduced in this text]
  • the coefficient vector then has seven components: [equation not reproduced in this text] One may focus on smaller subsets of variants for genetic dominance effects and GxS interaction effects of the linear and genetic dominance effects. Additional aspects related to GenESIS techniques are described in Example 2.
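  • As one non-limiting illustration of the concatenated predictors described above, the following Python sketch builds a seven-component predictor row for each individual: covariates, linear genotype dosages, dominance indicators, and male and female Gene-by-Sex terms for both the linear and the dominance effects. The ordering of the components, the function name, and the assumption of hard-called dosages (0, 1, or 2 copies of the effect allele) are illustrative choices rather than details taken from the source.

```python
def build_predictors(covariates, dosages, is_male):
    """Return one concatenated predictor row per individual.

    covariates: list of per-individual covariate lists
    dosages:    list of per-individual effect-allele counts (0, 1, or 2)
    is_male:    list of booleans, one per individual
    """
    rows = []
    for cov, g, male in zip(covariates, dosages, is_male):
        # Dominance indicator: 1 if homozygous for the effect allele.
        d = [1 if x == 2 else 0 for x in g]
        # Gene-by-Sex terms: original value for one sex, zero for the other.
        g_male = [x if male else 0 for x in g]
        g_female = [0 if male else x for x in g]
        d_male = [x if male else 0 for x in d]
        d_female = [0 if male else x for x in d]
        rows.append(list(cov) + list(g) + d + g_male + g_female + d_male + d_female)
    return rows


X = build_predictors([[1.0], [2.0]], [[0, 2], [1, 2]], [True, False])
```

  • Restricting the dominance and GxS components to a smaller subset of variants would amount to passing only that subset when building those components.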
  • process 300 may proceed to act 304, in which the computational facility may determine genetic variants with heterogeneous associations between single-ancestry populations (e.g., genetic variants that indicate a stronger association with the medical trait in one population than in another population). Such genetic variants may be determined by applying an inverse-variance weighted meta-analysis with a heterogeneity test to GWASes of single-ancestry populations.
  • Cochran’s Q test may be used to determine genetic variants with heterogeneous associations by identifying genetic variants having heterogeneity p-values smaller than a threshold value (e.g., p-values < 5 × 10^[exponent not legible in this text], as one non-limiting example).
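  • As a minimal sketch of an inverse-variance weighted meta-analysis with Cochran’s Q heterogeneity test for a single genetic variant, the following Python code uses only the standard library. The function names and the inputs (per-population effect estimates and standard errors) are assumptions for illustration, and the p-value threshold used to flag a variant is left to the caller.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    y = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def cochran_q(betas, ses):
    """Return (meta-analysis effect, Q statistic, heterogeneity p-value).

    betas: per-population effect estimates for one variant
    ses:   matching standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]
    beta_meta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - beta_meta) ** 2 for w, b in zip(weights, betas))
    return beta_meta, q, chi2_sf(q, len(betas) - 1)
```

  • A variant may then be flagged as having heterogeneous associations when the returned p-value falls below the chosen threshold.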
  • process 300 may proceed to act 306, in which the computational facility may further train the initial computational model by performing a regression.
  • the regression of act 306 may include, for example, Elastic Net regression or another suitable form of penalized regression.
  • the regression may include a penalty applied to the genetic variants with heterogeneous associations.
  • the regression may include covariate effects, representing effects of conventional risk factors that are not genotypic in origin.
  • In some embodiments, the regression may proceed in two steps. In a first step, the regression may be used to fit covariate effects in the training data set. The covariate effects may take into consideration conventional risk factors (e.g., age, sex, weight, or other non-genotypic factors).
  • the covariate-only model may be determined by fitting an unpenalized regression using individual-level data from the single-ancestry population of interest to characterize the covariate effects of: age, sex, age$^2$, age × sex, the Townsend deprivation index, and/or the genotype principal components, as non-limiting examples.
  • the covariate effects may be fit using an unpenalized regression of the form: [equation not reproduced in this text], where a sample-size term in the expression represents the number of individuals in the training data.
  • that number of individuals may be the same as, or different from, a second sample size referenced in the source (both symbols are illegible in this text).
  • the act 306 may proceed to further train the computational model, using an estimate obtained from the above regression performed using the covariate effects.
  • the regression may take the form of an Elastic Net penalized regression: [equation not reproduced in this text], where $L(y_i, \cdot)$ is a loss for the $i$-th individual and $P_\alpha(\beta_j)$ is the Elastic Net regularization term for the coefficient $\beta_j$, as described above; the subscript m indicates covariate effects and the subscript n indicates genetic effects; and the summation over genetic variants is performed over the genetic variants with heterogeneous effects.
  • one term of the expression represents the covariate effects and another term represents the predicted phenotype value from the non-covariate terms in the iPGS model, described in relation to act 302 herein.
  • the final term represents scores determined for genetic variants with heterogeneous associations between single-ancestry populations.
  • suitable penalty factors may be assigned for each term in the above expression. For example, a penalty factor of 1.1 may be assigned to the effects due to the genetic variants and a penalty factor of 1.0 may be assigned to the covariate-only and iPGS scores. Additional aspects are described in Example 1 herein.
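  • As one non-limiting illustration of preparing inputs for the further regression, the following Python sketch places the covariate-only score and the iPGS score alongside the dosages of the genetic variants with heterogeneous associations, and builds the matching penalty factor vector (1.0 for the two scores and 1.1 for the variants, following the example values above). The function name and the input layout are hypothetical.

```python
def stage2_inputs(covariate_scores, ipgs_scores, het_dosages,
                  score_penalty=1.0, variant_penalty=1.1):
    """Build predictor rows and penalty factors for the further regression.

    covariate_scores: per-individual score from the covariate-only model
    ipgs_scores:      per-individual score from the initial iPGS model
    het_dosages:      per-individual dosages of heterogeneous variants
    """
    rows = [[c, s] + list(g)
            for c, s, g in zip(covariate_scores, ipgs_scores, het_dosages)]
    n_variants = len(het_dosages[0]) if het_dosages else 0
    penalty_factors = [score_penalty] * 2 + [variant_penalty] * n_variants
    return rows, penalty_factors
```

  • The returned rows and penalty factors could then be passed to any Elastic Net solver that accepts per-predictor penalty factors.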
  • the regression of act 306 may optionally include taking into consideration effects of the individual’s global ancestry on the genetic variants with heterogeneous associations.
  • including the effects of the global ancestry of the individual comprises including, for each instance of genomic data in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships, an interaction described by a number of copies of each genetic variant with heterogeneous associations in each instance of genomic data and principal components of a genomic matrix generated using each instance of genomic data.
  • These effects may be considered as a measure of how “close” an individual is to a reference single-ancestry population and may be characterized using principal components obtained from genotype matrices generated based on the genomic data associated with each individual included in the training data set and/or in a subset of the training data set.
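  • As a minimal sketch of the global-ancestry interaction terms described above, the following Python function multiplies, for each individual, the number of copies of each genetic variant with heterogeneous associations by each genotype principal component. The function name, the ordering of the resulting terms, and the input layout are assumptions for illustration.

```python
def ancestry_interactions(het_dosages, principal_components):
    """Return per-individual products of variant dosage and ancestry PC.

    het_dosages:          per-individual dosages of heterogeneous variants
    principal_components: per-individual genotype principal components
    Terms are ordered variant by variant, with the PCs varying fastest.
    """
    return [[g * pc for g in g_row for pc in pc_row]
            for g_row, pc_row in zip(het_dosages, principal_components)]


terms = ancestry_interactions([[1, 2]], [[0.5, -1.0]])
```

  • For an individual with two heterogeneous variants and two principal components, four interaction terms are produced.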
  • the regression performed after fitting the covariate effects may be performed using: where Dm (w) ! is the
  • FIG. 1-4 is a flowchart of a process 400 for generating a computational model for determining a trait score and/or an iPGS.
  • Process 400 may be implemented by a computational facility, such as the computational facility 122 of FIG. 1-1.
  • the process 400 may be performed by a computing device configured to receive and/or obtain information from a genotyping system (e.g., genotyping system 110 described in connection with FIG. 1-1).
  • the process 400 may be performed by one or more processors located remotely (e.g., as part of a cloud computing environment, as connected through a network) from the genotyping system that obtained the input genomic data.
  • Process 400 is similar to process 300 as described in FIG. 1-3 but includes an additional act 403.
  • the computational facility may determine if the medical trait is or is not described by an ancestry-dependent genetic effect. The outcome of this determination may guide the form of the computational model that is used in act 204 to calculate the at least one polygenic score.
  • process 400 may begin at act 402, in which the computational facility may perform supervised learning to train the computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries. Thereafter, process 400 may proceed to act 403, in which it is determined whether the medical trait is or is not described by an ancestry-dependent genetic effect. This determination may be made using, for example, a heterogeneity test (e.g., Cochran’s Q test) applied to meta-analysis of GWASes of single-ancestry populations.
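  • if the medical trait is determined not to be described by an ancestry-dependent genetic effect, process 400 may terminate after act 402. In this manner, computational models may be built efficiently for medical traits that are linked to less heterogeneous genetic effects.
  • if the medical trait is determined to be described by an ancestry-dependent genetic effect, process 400 may proceed to act 404.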
  • Act 404 may include two sub-acts, 404a and 404b. In sub-act 404a, the computational facility may determine genetic variants with heterogeneous associations between single-ancestry populations.
  • genetic variants with heterogeneous associations between single-ancestry populations may be determined as a result of the determination performed in act 403.
  • After sub-act 404a, process 400 may proceed to sub-act 404b, in which the computational facility may perform additional regression, as described in connection with act 306 of FIG. 1-3. The additional regression may include a penalty applied to the determined genetic variants with heterogeneous associations between single-ancestry populations. In some embodiments, the additional regression may optionally include taking into consideration effects of the individual’s global ancestry on the genetic variants with heterogeneous associations, also as described in connection with act 306 of FIG. 1-3. After sub-act 404b, process 400 may terminate, and process 200 may proceed to act 206.
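  • As one non-limiting illustration of the control flow of process 400, the following Python sketch trains an initial model, checks for variants whose heterogeneity p-values fall below a threshold, and performs the additional regression only when such variants exist. The callables `train_ipgs` and `refit_with_variants`, the dictionary of p-values, and the rule that a single flagged variant triggers act 404 are hypothetical simplifications.

```python
def build_model(train_ipgs, heterogeneity_pvalues, refit_with_variants, threshold):
    """Sketch of acts 402, 403, 404a, and 404b.

    train_ipgs:            callable returning the initial model (act 402)
    heterogeneity_pvalues: dict mapping variant ID to heterogeneity p-value
    refit_with_variants:   callable(model, variants) for sub-act 404b
    threshold:             p-value cutoff chosen by the caller
    """
    model = train_ipgs()
    # Acts 403 and 404a: identify variants with heterogeneous associations.
    flagged = [v for v, p in heterogeneity_pvalues.items() if p < threshold]
    if not flagged:
        # No ancestry-dependent effect detected: keep the initial model.
        return model
    return refit_with_variants(model, flagged)
```

  • With this structure, traits without detectable heterogeneity skip the additional regression entirely.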
  • process 200 may proceed to act 206, in which the computational facility may generate, using the at least one processor, a graphical user interface in some embodiments.
  • the graphical interface may include a visualization of the calculated at least one trait score and/or iPGS, the calculated at least one trait score being indicative of a prediction of the medical trait in the individual.
  • the graphical interface may include one or more graphs, diagrams, charts, or other graphical illustration of the determined trait scores and/or one or more tables or textual representations of the determined trait scores.
  • the graphical user interface may be integrated into and/or be an electronic medical record (EMR).
  • process 200 may proceed to act 208, in which a display device may display the generated graphical user interface.
  • the graphical user interface may be displayed on one or more of a computer monitor, television screen, smartphone display, mobile device display, and/or a tablet display. In this manner, a clinician and/or the individual may review the determined trait scores.
  • process 200 may optionally proceed to act 210, in which a course of medical treatment related to the medical trait may be altered based on the calculated at least one trait score.
  • the number and/or frequency of medical interventions may be changed based on the at least one trait score.
  • the number and/or frequency of medical interventions in the course of medical treatment may be reduced based on the value of the at least one trait score. If the trait score value is low (e.g., below a threshold value), the individual may be considered “low-risk” such that additional medical interventions may be unnecessary as being invasive and/or cost inefficient.
  • the number and/or frequency of medical interventions in the course of medical treatment may be increased based on the value of the at least one trait score.
  • process 200 may optionally proceed to act 212, in which at least one therapeutic agent may be identified and/or administered to the individual based on the calculated at least one trait score.
  • the at least one trait score may indicate a likelihood of a health risk associated with a particular cancer carrying a specific genetic variant. A treatment designed to combat the specific genetic variant of the cancer may then be identified and/or administered to the individual by a medical provider.
  • the at least one trait score may be used to alter a recruitment strategy for a clinical trial related to the medical trait.
  • the individual may be added to or removed from the clinical trial based on the value of the trait score.
  • the trait score may be compared to a threshold value in order to determine whether the individual should be added to or removed from the clinical trial.
  • the trait score may be greater than a threshold value, indicating a likelihood that the individual exhibits or may develop the medical trait such that the individual will be enrolled in the clinical trial.
  • the trait score may be less than a threshold value, indicating a likelihood that the individual exhibits or may develop the medical trait such that the individual will be enrolled in the clinical trial.
  • FIG. 1-5 is a flowchart of a process 500 for performing a clinical trial, in accordance with some embodiments of the technology described herein.
  • Process 500 may begin at act 502, in which a first trait score associated with a medical trait of a first individual is obtained.
  • the first trait score may be obtained by calculating the first trait score using a trained computational model and first genomic data including information indicative of genetic variants present in a genome of the first individual.
  • the first trait score may be calculated using a computational facility, such as the computational facility 122 of FIG. 1-1.
  • the process 200 may be performed by a computing device configured to receive and/or obtain information from a genotyping system (e.g., genotyping system 110 described in connection with FIG. 1-1).
  • the process 200 may be performed by one or more processors located remotely (e.g., as part of a cloud computing environment, as connected through a network) from the genotyping system that obtained the input genomic data.
  • act 502 may begin at sub-act 502a, in which the trained computational model may be obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries.
  • the trained computational model may be obtained, for example, using any techniques described in connection with FIGs. 1-2 through 1-4 herein (e.g., in connection with act 302 of FIG. 1-3, as one non-limiting example).
  • process 500 may optionally proceed to sub-act 502b, in which the computational facility may perform additional regression, as described in connection with act 306 of FIG. 1-3.
  • the additional regression may include a penalty applied to genetic variants with heterogeneous associations between single-ancestry populations.
  • the second regression may optionally include taking into consideration effects of the individual’s global ancestry on the genetic variants with heterogeneous associations, also as described in connection with act 306 of FIG. 1-3.
  • act 502 may terminate, and process 500 may proceed to act 504.
  • process 500 may proceed to act 504, in which the first individual is enrolled in the clinical trial based on a value of the first trait score.
  • the trait score may be compared to a threshold value in order to determine whether the individual should be added to the clinical trial. For example, the trait score may be greater than a threshold value, indicating a likelihood that the individual exhibits or may develop the medical trait such that the individual will be enrolled in the clinical trial. Alternatively, depending on the medical trait being evaluated, the trait score may be less than a threshold value, indicating a likelihood that the individual exhibits or may develop the medical trait such that the individual will be enrolled in the clinical trial.
  • process 500 may proceed to act 506, in which a course of medical treatment for the first individual is altered in accordance with the clinical trial.
  • the number and/or frequency of medical interventions (e.g., medical tests, medical imaging, at-home monitoring, surgeries, vaccinations, and/or other medical treatments) may be changed.
  • the number and/or frequency of medical interventions in the course of medical treatment may be reduced based on the value of the at least one trait score.
  • a therapeutic agent associated with the clinical trial may be administered to the individual based on the calculated at least one trait score.
  • the at least one trait score may indicate a likelihood of a health risk associated with a particular cancer carrying a specific genetic variant.
  • a treatment designed to combat the specific genetic variant of the cancer may then be identified and/or administered to the individual by a medical provider associated with the clinical trial.
  • the process may further include assessing a second trait score determined for a second individual.
  • the second trait score may be calculated using the trained computational model and second genomic data including information indicative of genetic variants present in a genome of the second individual.
  • the process may include declining to enroll the second individual in the clinical trial based on a value of the second trait score.
  • the second trait score may be compared to a threshold value in order to determine whether the individual should be added to the clinical trial.
  • the trait score may be greater than a threshold value, indicating that the second individual is not likely to exhibit or to develop the medical trait such that the individual will not be enrolled in the clinical trial.
  • the trait score may be less than a threshold value, indicating that the second individual is not likely to exhibit or to develop the medical trait such that the individual will not be enrolled in the clinical trial.
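  • As a minimal sketch of the threshold comparison used for enrollment decisions, the following Python function returns whether an individual would be enrolled. The function name, the `enroll_if_above` flag, and the example scores and threshold are hypothetical; whether scores above or below the threshold qualify depends on the medical trait being evaluated.

```python
def should_enroll(trait_score, threshold, enroll_if_above=True):
    """Return True if the individual would be enrolled in the clinical trial.

    enroll_if_above: True when scores greater than the threshold qualify,
                     False when scores less than the threshold qualify.
    """
    if enroll_if_above:
        return trait_score > threshold
    return trait_score < threshold


first_enrolled = should_enroll(0.8, 0.5)    # first individual is enrolled
second_enrolled = should_enroll(0.3, 0.5)   # second individual is declined
```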
  • one or more computer programs that when executed perform methods of the present disclosure need not reside on a single computer or processor but may be distributed in a modular fashion among a number of different computers or processors to implement various aspects of the present disclosure.
  • Techniques operating according to the principles described herein may be implemented in any suitable manner. Included in the discussion above are a series of flow charts showing the steps and acts of various processes to determine inclusive polygenic risk scores and to use the determined iPGS values to alter an individual’s course of medical treatment. The processing and decision blocks of the flow charts above represent steps and acts that may be included in algorithms that carry out these various processes.
  • Algorithms derived from these processes may be implemented as software integrated with and directing the operation of one or more single- or multi-purpose processors, may be implemented as functionally equivalent circuits such as a Digital Signal Processing (DSP) circuit or an Application-Specific Integrated Circuit (ASIC), or may be implemented in any other suitable manner.
  • the flow charts included herein do not depict the syntax or operation of any particular circuit or of any particular programming language or type of programming language. Rather, the flow charts illustrate the functional information one skilled in the art may use to fabricate circuits or to implement computer software algorithms to perform the processing of a particular apparatus carrying out the types of techniques described herein.
  • the techniques described herein may be embodied in computer-executable instructions implemented as software, including as application software, system software, firmware, middleware, embedded code, or any other suitable type of computer code.
  • Such computer-executable instructions may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine.
  • a “functional facility,” however instantiated, is a structural component of a computer system that, when integrated with and executed by one or more computers, causes the one or more computers to perform a specific operational role.
  • a functional facility may be a portion of or an entire software element.
  • a functional facility may be implemented as a function of a process, or as a discrete process, or as any other suitable unit of processing. If techniques described herein are implemented as multiple functional facilities, each functional facility may be implemented in its own way; all need not be implemented the same way.
  • these functional facilities may be executed in parallel and/or serially, as appropriate, and may pass information between one another using a shared memory on the computer(s) on which they are executing, using a message passing protocol, or in any other suitable way.
  • functional facilities include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the functionality of the functional facilities may be combined or distributed as desired in the systems in which they operate.
  • one or more functional facilities carrying out techniques herein may together form a complete software package.
  • These functional facilities may, in alternative embodiments, be adapted to interact with other, unrelated functional facilities and/or processes, to implement a software program application.
  • inventive concepts may be embodied as a computer readable storage medium (or multiple computer readable storage media) (e.g., a computer memory, one or more floppy discs, compact discs, optical discs, magnetic tapes, flash memories, circuit configurations in Field Programmable Gate Arrays or other semiconductor devices, or other tangible computer storage medium) encoded with one or more programs that, when executed on one or more computers or other processors, perform methods that implement one or more of the various embodiments described above.
  • the computer readable medium or media can be transportable, such that the program or programs stored thereon can be loaded onto one or more different computers or other processors to implement various ones of the aspects described above.
  • computer readable media may be tangible (e.g., non-transitory) computer readable media.
  • the computer readable media may comprise a persistent memory.
  • Computer-executable instructions implementing the techniques described herein may, in some embodiments, be encoded on one or more computer-readable media to provide functionality to the media.
  • Such a computer-readable medium may be implemented in any suitable manner, including as computer-readable storage media 606 of FIG. 1-6 described below (i.e., as a portion of a computing device 600) or as a stand-alone, separate storage medium.
  • these instructions may be executed on one or more suitable computing device(s) operating in any suitable computer system, including the exemplary computer system of FIG. 1-6, or one or more computing devices (or one or more processors of one or more computing devices) may be programmed to execute the computer-executable instructions.
  • a computing device or processor may be programmed to execute instructions when the instructions are stored in a manner accessible to the computing device or processor, such as in a data store (e.g., an on-chip cache or instruction register, a computer-readable storage medium accessible via a bus, a computer-readable storage medium accessible via one or more networks and accessible by the device/processor, etc.).
  • FIG. 1-6 illustrates one exemplary implementation of a computing device in the form of a computing device 600 that may be used in a system implementing techniques described herein, although others are possible. It should be appreciated that FIG.
  • Computing device 600 may comprise at least one processor 602, a network adapter 604, and computer-readable storage media 606.
  • Computing device 600 may be, for example, a desktop or laptop personal computer, a personal digital assistant (PDA), a smart mobile phone, a server, a wireless access point or other networking element, or any other suitable computing device.
  • Network adapter 604 may be any suitable hardware and/or software to enable the computing device 600 to communicate wired and/or wirelessly with any other suitable computing device over any suitable computing network.
  • the computing network may include wireless access points, switches, routers, gateways, and/or other networking equipment as well as any suitable wired and/or wireless communication medium or media for exchanging data between two or more computers, including the Internet.
  • Computer-readable storage media 606 may be adapted to store data to be processed and/or instructions to be executed by processor 602. Processor 602 enables processing of data and execution of instructions. The data and instructions may be stored on the computer-readable storage media 606.
  • the data and instructions stored on computer-readable storage media 606 may comprise computer-executable instructions implementing techniques which operate according to the principles described herein.
  • computer-readable storage media 606 stores computer-executable instructions implementing various facilities and storing various information as described above.
  • Computer-readable storage media 606 may store a functional
  • While not illustrated in FIG. 1-6, a computing device may additionally have one or more components and peripherals, including input and output devices. These devices can be used, among other things, to present a user interface. Examples of output devices that can be used to provide a user interface include printers or display screens for visual presentation of output and speakers or other sound generating devices for audible presentation of output. Examples of input devices that can be used for a user interface include keyboards and pointing devices, such as mice, touch pads, and digitizing tablets.
  • a computing device may receive input information through speech recognition or in other audible format.
  • Embodiments have been described where the techniques are implemented in circuitry and/or computer-executable instructions. It should be appreciated that some embodiments may be in the form of a method, of which at least one example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
  • Computer-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc.
  • data structures may be stored in computer-readable media in any suitable form.
  • data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a computer-readable medium that convey relationship between the fields.
  • any suitable mechanism may be used to establish a relationship between information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationship between data elements.
  • When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
  • a computer may be embodied in any of a number of forms, such as a rack-mounted computer, a desktop computer, a laptop computer, or a tablet computer, as non-limiting examples.
  • a computer may be embedded in a device not generally regarded as a computer but with suitable processing capabilities, including a Personal Digital Assistant (PDA), a smartphone, or any other suitable portable or fixed electronic device.
  • Such computers may be interconnected by one or more networks in any suitable form, including a local area network or a wide area network, such as an enterprise network, an intelligent network (IN), or the Internet.
  • networks may be based on any suitable technology and may operate according to any suitable protocol and may include wireless networks, wired networks or fiber optic networks.
  • PGSs Polygenic scores
  • iPGS is a PGS training strategy that considers individuals across the continuum of genetic ancestry.
  • iPGS captures genetic effects shared across population groups while avoiding the need for LD reference panels and is applicable to admixed individuals.
  • the improved performance of iPGS is indicated in 33 simulation configurations and systematic application across 60 anthropometric and hematological traits in the UK Biobank.
  • An iPGS+refit strategy is also developed to jointly model the ancestry-shared and ancestry-dependent effects and indicate its utility in improving prediction in a few hematological traits in the African population in the UK Biobank.
  • Synthetic genotype and phenotype data [0203] Synthetic genotype and phenotype data was prepared to investigate the behavior of iPGS. The recently released simulated genotypes from the INTERVENE consortium were used and their HAPNEST pipeline was used to generate synthetic quantitative phenotypes. The HAPNEST pipeline is capable of generating synthetic genotypes and phenotypes for hundreds of thousands of individuals across multiple continental ancestry groups, thanks to its computational efficiency. The synthetic genotype data preserves the key statistics, such as minor-allele frequency and LD in ancestry-matched reference panels, and has lower relatedness with the reference panels.
  • the pipeline allowed us to simulate phenotypes under the specified heritability and polygenicity.
  • the HAPNEST synthetic dataset (BioStudies: S-BSST936; EMBL-EBI's BioStudies repository) was downloaded, and the analysis focused on synthetic genotype data on chromosome 22 for 168,000 individuals each in African and European ancestry groups.
  • the training set was used to fit models, the validation set was used to determine the sparsity of the models, and the held-out test sets were used to evaluate the predictive performance of the models.
  • the same training, validation, and test sets were used for all tested synthetic traits.
  • the default value of the trans-ancestry genetic correlation of 1.0 was used, the synthetic phenotypes were assumed to have no covariate effects, and the model was restricted to sample causal variants from chromosome 22 alone.
  • the study population in the UK Biobank [0205] The UK Biobank is a population-based cohort study with genomic and phenotypic datasets across about 500,000 volunteers collected across multiple sites in the United Kingdom. Sample-level quality control (QC) was performed.
  • the training, validation, and test sets were used for model fitting, determination of the sparsity hyperparameter, and predictive performance evaluation, respectively.
  • the same training, validation, and test sets were used for all tested traits in the UK Biobank.
  • For iPGS training, four different subsets of the training set were considered (Tables 1 and 2).
  • For MultiPop, NoAdmixed, and InclusiveFixN PGS models White British individuals in the training set were randomly sub-sampled so that the total number of individuals used in the iPGS training would match that of the WB-only model.
  • a similar procedure was applied to define four subsets of the validation-set individuals for PGS training.
  • VEP-predicted consequences of the variants were grouped into six groups: protein-truncating variants (PTVs), protein-altering variants (PAVs), proximal coding variants (PCVs), intronic variants (intronic), genetic variants on untranslated regions (UTRs), and other non-coding variants (others).
  • variants passing the following criteria were focused on: (1) the missingness of the variant is less than 1%, considering that the two genotyping arrays (the UK BiLEVE Axiom array and the UK Biobank Axiom array) cover a slightly different set of variants and (2) Hardy-Weinberg disequilibrium test p value greater than 1.0 × 10⁻⁷.
  • Phenotype definition in the UK Biobank [0209] 60 anthropometric and hematological traits in the UK Biobank were studied (Table 3). Some of those phenotypes are collected at up to four instances, each of which corresponds to (1) the initial assessment visit (2006-2010), (2) the first repeat assessment visit (2012-2013), (3) the imaging visit (2014-present), and (4) first repeat imaging visit (2019- present). Phenotype data was defined by using the median of non-missing values for each individual across the 60 quantitative traits as described elsewhere.
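  • By way of non-limiting illustration, the per-individual phenotype definition described above (the median of non-missing values across up to four assessment instances) may be sketched as follows. The function name and the example measurements are hypothetical, and missing values are assumed to be encoded as None.

```python
from statistics import median
from typing import Optional, Sequence


def summarize_trait(measurements: Sequence[Optional[float]]) -> Optional[float]:
    """Return the median of the non-missing measurements, or None if all are missing."""
    observed = [m for m in measurements if m is not None]
    return median(observed) if observed else None


# Hypothetical measurements of one trait at four assessment instances (None = missing).
print(summarize_trait([170.0, None, 172.0, None]))
```

  • An individual with no recorded value at any instance is returned as missing rather than being assigned a default value.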
  • Sparse-polygenic-score training from individual-level data Sparse-polygenic-score models were fit by using batch screening iterative lasso (BASIL) implemented in the R snpnet package (version 2) on the individual-level data. The additive effects of genetic variants on the phenotypes were considered, and a polygenic score model was fit by finding the exact solution for L1- and L2-penalized multivariate regression (Elastic Net). Specifically, BASIL directly operates on the individual-level data and performs variable selection and effect-size estimation simultaneously.
  • BASIL efficiently solves the exact solution of the penalized regression in an iterative procedure by taking advantage of strong rules that guide the variable selection in each iteration step.
  • a similar model can be used for binary phenotypes (logistic regression), time-to-event phenotypes (Cox proportional hazards regression), or joint modeling of multiple phenotypes, as shown previously.
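  • For orientation only, one common way to write an L1- and L2-penalized (Elastic Net) objective with per-variable penalty factors is shown below. The symbols are assumptions introduced for this illustration and are not taken from the original equations: \ell_i is the loss contribution of the i-th individual, \lambda is the overall penalty strength that controls sparsity, \alpha is the mixing parameter between the L1 and L2 terms, and w_j is the penalty factor of the j-th predictor. Scaling constants may differ between implementations.

```latex
\hat{\beta} \;=\; \operatorname*{arg\,min}_{\beta}\;
\frac{1}{n}\sum_{i=1}^{n} \ell_i(\beta)
\;+\; \lambda \sum_{j=1}^{d} w_j
\left( \alpha\,\lvert \beta_j \rvert \;+\; \frac{1-\alpha}{2}\,\beta_j^{2} \right)
```

  • Under this form, a predictor with w_j = 0 is unpenalized, and a predictor with a larger w_j is shrunk more strongly toward zero.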
  • Inclusive polygenic scores with synthetic data [0211] In the application of iPGS to the synthetic genotype and phenotype data from the HAPNEST pipeline, the impact of the composition of the training-set individuals was assessed on the predictive performance by using the held-out synthetic individuals of African and European ancestry groups.
  • n_train = 110,000 individuals for each of the three synthetic phenotypes.
  • the training set included individuals with synthetic African and synthetic European ancestry, each with different ratios. The ratios tested were 100%, 95%, 90%, 75%, 60%, 50%, 40%, 25%, 10%, 5%, and 0%.
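  • As a non-limiting sketch, the training-set compositions implied by the listed ratios may be enumerated as follows for a fixed training-set size of 110,000. Which ancestry group a given ratio refers to is not stated here, so the two groups are labeled generically as A and B; the function names are hypothetical. Eleven ratios applied to three synthetic phenotypes give 33 combinations, which is consistent with the 33 simulation configurations mentioned above.

```python
from typing import List, Tuple

N_TRAIN = 110_000  # training-set size stated in the text
# Mixing ratios listed in the text, expressed as fractions of group A.
RATIOS = [1.00, 0.95, 0.90, 0.75, 0.60, 0.50, 0.40, 0.25, 0.10, 0.05, 0.00]


def composition(n_train: int, fraction_a: float) -> Tuple[int, int]:
    """Split n_train into counts for ancestry group A and ancestry group B."""
    n_a = round(n_train * fraction_a)
    return n_a, n_train - n_a


def all_compositions(n_train: int = N_TRAIN) -> List[Tuple[int, int]]:
    """Return the (group A, group B) counts for every listed ratio."""
    return [composition(n_train, r) for r in RATIOS]


for n_a, n_b in all_compositions():
    print(n_a, n_b)
```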
  • the validation set metric was used to select the sparsity of the model.
  • the protein-truncating and protein-altering variants were prioritized as previously described.
  • a penalty factor of 0.5 was assigned to putative protein-truncating variants and pathogenic variants; 0.75 to putative protein-altering variants, likely pathogenic variants, and HLA allelotypes; 1.2 to genetic variants that are not present in the HapMap phase 3 dataset; and 1.0 to the other remaining variants.
  • the specific values of penalty factors are based on heuristics.
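  • The penalty-factor assignment described above may be sketched, in a non-limiting manner, as a simple lookup. The category labels and the function name are hypothetical. The order of precedence (functional annotation first, HapMap phase 3 membership second) is an assumption made for this illustration, because the text does not state how a variant that meets more than one criterion is handled.

```python
from typing import Dict

# Penalty factors stated in the text; the category labels are hypothetical names.
PRIORITIZED: Dict[str, float] = {
    "protein_truncating": 0.5,
    "pathogenic": 0.5,
    "protein_altering": 0.75,
    "likely_pathogenic": 0.75,
    "hla_allelotype": 0.75,
}


def penalty_factor(category: str, in_hapmap3: bool) -> float:
    """Return the per-variant penalty factor (a smaller value means less shrinkage)."""
    if category in PRIORITIZED:  # assumption: annotation takes precedence
        return PRIORITIZED[category]
    # Remaining variants: 1.0 if present in HapMap phase 3, otherwise 1.2.
    return 1.0 if in_hapmap3 else 1.2


print(penalty_factor("protein_truncating", True))
print(penalty_factor("intronic", False))
```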
  • Inclusive polygenic score with population-specific refit [0214] To model the ancestry-dependent genetic effects on top of the ancestry-shared effects captured in iPGS, the iPGS+refit procedure was developed. The analysis focused on the individuals of African ancestry in the training and validation sets, and unpenalized regression was fit by using the individual-level data to characterize the covariate effects: phenotype ~ age + sex + age² + age*sex + Townsend deprivation index + genotype PCs, where genotype PCs represent the first 18 genotype PCs as in the iPGS training. The covariate-only score term was obtained by predicting the phenotype values using the covariate terms alone.
  • the missing values in the genetic variants with heterogeneous associations were imputed in the individual-level data by using the allele frequency computed in the African population in the UK Biobank.
  • a penalty factor of 1.1 was assigned for the genetic variants and 1.0 for the covariate-only score and iPGS.
  • variant * PC1 and variant * PC2 terms were also considered for genetic variants with heterogeneous associations in the penalized regression model: phenotype ~ covariate-only score + iPGS (Equation 1-3)
  • a penalty factor of 1.2 was assigned for the interaction terms, 1.1 for the genetic variants, and 1.0 for both covariate-only and iPGS scores.
  • the elastic-net penalized regression was fit by setting the elastic-net parameter α to 0.99, and the tuning parameter was optimized by using 10-fold cross-validation with the cv.glmnet function implemented in the glmnet package in R.
  • Genome-wide association analysis was applied with PLINK (v. 2.00 alpha). Population-specific genotype PCs were first computed for White British, non-British White, South Asian, and African individuals in the UK Biobank by using the randomized algorithm ("approx" modifier) implemented as the "--pca allele-wts 20 approx vzs" command in PLINK2.
  • the GWAS analysis was subsequently applied by using age, sex, Townsend deprivation index, array, and the top ten population-specific genotype PC loadings as covariates and using the approximation algorithm ("cc-residualize" modifier) implemented as the "--glm zs omit-ref no-x-sex log10 hide-covar skip-invalid-pheno cc-residualize firth-fallback" command in PLINK2.
  • GWAS analysis was applied by using all the individuals in the White British group in the UK Biobank without applying the quantile normalization.
  • GWAS meta-analysis and heterogeneity test [0221] Using the GWAS summary statistics for four analyzed populations (White British, non-British White, South Asian, and African), inverse-variance weighted (IVW) meta-analysis was performed by using METAL (version 2020-05-05) and included a heterogeneity-of-effects analysis.
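  • For illustration only, the textbook form of an inverse-variance weighted meta-analysis with Cochran's Q heterogeneity statistic for a single variant may be sketched as follows. This is not the METAL implementation; the function name and the example effect sizes are hypothetical, and the use of Cochran's Q as the heterogeneity statistic is an assumption of this sketch.

```python
import math
from typing import Sequence, Tuple


def ivw_meta_analysis(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float, int]:
    """Combine per-population effect sizes for one variant.

    Returns (pooled beta, pooled standard error, Cochran's Q, degrees of freedom).
    """
    weights = [1.0 / (se * se) for se in ses]  # inverse-variance weights
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Q is referred to a chi-squared distribution with (k - 1) degrees of freedom.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, q, len(betas) - 1


# Hypothetical effect sizes for one variant in four population groups.
print(ivw_meta_analysis([0.10, 0.12, 0.08, -0.30], [0.01, 0.03, 0.05, 0.06]))
```

  • A large Q relative to its degrees of freedom suggests that the effect size differs between population groups, which is the kind of signal used above to nominate variants with heterogeneous associations.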
  • Heritability analysis [0222] Linkage disequilibrium (LD) score regression (LDSC) was applied, and the SNP-based heritability was estimated.
  • Allele frequency and LD pruning [0223]
  • the non-reference allele frequency was computed with PLINK2 by using the individuals in the training set and in the following population groups in the UK Biobank: White British, non-British White, South Asian, and African.
  • the cumulative frequency of the minor-allele frequency distribution was computed for all the 1,316,181 genetic variants considered.
  • the analysis was also repeated by focusing on the subset of variants selected in at least one of the PGS models across 60 anthropometric and hematological traits.
  • LD pruning was applied with window size 200 kb and pairwise r² threshold of 0.5 by using the "--indep-pairwise 200 kb 0.5" command implemented in PLINK2.
  • Protein-truncating, protein-altering, or proximal-coding variants were prioritized by using the "--indep-preferred" command. The procedure was repeated for White British, non-British White, South Asian, and African individuals in UK Biobank. The selected variants were used as approximately LD-independent variants.
  • PGS training with PRS-CSx [0225] PGS models were fit by using PRS-CSx, a cross-population polygenic prediction method based on Bayesian multivariate regression using continuous shrinkage priors.
  • the precomputed LD reference panels constructed from the UK Biobank data were downloaded from GitHub (github.com/getian107/PRScsx) and were used for the analysis. Specifically, the European (EUR) reference was used for White British and non-British White populations, the South Asian (SAS) reference was used for the South Asian population, and the African (AFR) reference was used for the African population in the UK Biobank.
  • the Bayesian regression model implemented in PRScsx.py was fit for each chromosome independently.
  • a small-scale grid search was applied for the global shrinkage parameter, phi, by fitting four models corresponding to the following phi values: 1 × 10⁻⁶, 1 × 10⁻⁴, 1 × 10⁻², and 1.
  • the default values were used for the other parameters and posterior SNP effect-size estimates were obtained for each discovery super-population (i.e., EUR, SAS, and AFR).
  • the optimal linear combination of the three scores was subsequently learned.
  • the individuals in the validation set were used, the super-population-specific scores for each individual were computed by using the "--score" command implemented in PLINK2, scaling was applied so that the super-population-specific scores have zero mean and unit variance, and the coefficients of linear combinations of the population-specific scores were learned according to the recommendations provided in the GitHub repository.
  • i.e., the White British, non-British White, South Asian, African, and Others groups in the UK Biobank
  • the global shrinkage parameter was selected on the basis of the predictive performance evaluated in the individuals in the validation set.
  • PGS performance evaluation [0227] The held-out test set was used to evaluate the predictive performance (R²) of (1) PGS (genotype-only) models, (2) covariate-only models, and (3) full models that considered both covariates and genotypes (Tables 4 and 5). The 95% confidence interval of predictive performance was evaluated by using the approximate standard error of R².
  • the predictive performance computed for the WB-only model was used as the baseline to evaluate the significance of improvements in predictive performance (R²) in the held-out-test-set individuals.
  • the significance of the difference in R² between the iPGS model and the WB-only model was assessed by using the delta method implemented in the r2redux package in R.
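  • As a non-limiting sketch, the predictive performance of a genotype-only score and an approximate 95% confidence interval may be computed as follows. Two assumptions are made for this illustration and are not taken from the text: R² is computed as the squared Pearson correlation between the phenotype and the score, and the approximate standard error uses a commonly cited large-sample formula in which k denotes the number of predictors. The function names are hypothetical, and this sketch does not reproduce the r2redux delta method.

```python
import math
from typing import Sequence, Tuple


def r_squared(y: Sequence[float], score: Sequence[float]) -> float:
    """Squared Pearson correlation between phenotype values and a score."""
    n = len(y)
    mean_y = sum(y) / n
    mean_s = sum(score) / n
    sxy = sum((a - mean_y) * (b - mean_s) for a, b in zip(y, score))
    syy = sum((a - mean_y) ** 2 for a in y)
    sss = sum((b - mean_s) ** 2 for b in score)
    return (sxy * sxy) / (syy * sss)


def r2_standard_error(r2: float, n: int, k: int = 1) -> float:
    """Large-sample approximation of the standard error of R^2 (assumed formula)."""
    return math.sqrt(
        4.0 * r2 * (1.0 - r2) ** 2 * (n - k - 1) ** 2 / ((n * n - 1) * (n + 3))
    )


def r2_confidence_interval(r2: float, n: int, k: int = 1, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval: R^2 plus or minus z standard errors."""
    se = r2_standard_error(r2, n, k)
    return r2 - z * se, r2 + z * se


# Hypothetical example: R^2 of 0.25 observed in 1,000 held-out individuals.
print(r2_confidence_interval(0.25, 1000))
```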
  • iPGS inclusive polygenic score
  • Ancestry-shared genetic effects were characterized from large-scale individual-level data of more than one million genetic variants across hundreds of thousands of ancestry-diverse individuals by taking advantage of efficient variable screening rules in BASIL.
  • the individuals were randomly split into training, validation, and held-out test sets and a PGS model was fit on the training-set individuals.
  • the validation set was used to select the sparsity of the penalized regression model and the held-out test set for performance evaluation.
  • Application to synthetic data [0230] The approach was first tested with a synthetic individual level dataset generated by HAPNEST.
  • Simulated genotypes were used in chromosome 22 and three synthetic phenotypes were created with different polygenicity and heritability for 168,000 individuals each in African and European ancestry groups (material and methods).
  • Each of the ancestry groups was randomly split into a training set (70%), a validation set (10%), and a held-out test set for evaluation (20%) (Table 1).
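  • The random 70%/10%/20% split described above may be sketched, by way of non-limiting example, as follows. The function name, the identifier format, and the seed are hypothetical; the split would be applied to each ancestry group separately.

```python
import random
from typing import Dict, List, Sequence


def split_individuals(
    ids: Sequence[str], seed: int = 0, train: float = 0.7, validation: float = 0.1
) -> Dict[str, List[str]]:
    """Randomly partition individual IDs into training, validation, and test sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remaining individuals (about 20%)
    }


# Hypothetical IDs for one ancestry group.
parts = split_individuals([f"ind{i}" for i in range(1000)])
print({name: len(members) for name, members in parts.items()})
```

  • Because the same split is reused for every trait, models for different traits are evaluated on identical held-out individuals.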
  • the difference in R² values was statistically significant, with a nominal p value of 2.9 × 10 ⁇ (material and methods).
  • Heritability and allele-frequency analysis [0237] Next, the relationship between heritability and predictive performance of the iPGS model was investigated by focusing on White British individuals because they had the largest sample size in the UK Biobank. Because the additive genetic effects are modeled in the iPGS model, the narrow-sense SNP heritability provides the theoretical upper bound of predictive performance. The heritability was estimated by using LD score regression and it was compared with the predictive performance of the iPGS model. It was found that the predictive performance of the iPGS models for hematological traits was closer to the heritability estimates than it was for the anthropometric traits (FIG. 2-11; Table 8).
  • the neutrophil-count-lowering alleles with heterogeneous associations were of higher allele frequency in individuals of African ancestry in UK Biobank (FIG. 2-4C).
  • the lead GWAS-associated variant for neutrophil counts in African ancestry groups is a well-characterized upstream untranslated region (UTR) variant rs2814778 in ACKR1 (atypical chemokine receptor 1, also known as Duffy blood group gene [DARC] [MIM: 613665]) (FIG. 2-10), which encodes the subunit of the Duffy receptor and serves as the basis of the Duffy blood group system.
  • the UTR variant rs2814778 disrupts binding sites of the GATA1 transcription factor and shuts down expression of the receptor in erythrocytes; thus, it is considered the null allele.
  • the null allele is under positive selection in the African population (allele frequency of 83% in the African population and 0.3% in the non-Finnish European population), given that the Duffy receptor works as the canonical entry point for the malaria parasite, Plasmodium vivax, and the null allele is protective against malaria infection. Beyond its roles in erythrocytes, the null allele is also known as the causal variant for neutrophil-count-lowering associations from admixture mapping studies.
  • the predictive performance of these models was evaluated by using the individuals in the held-out test set.
  • the iPGS+refit model improved predictive performance when the genetic variants were observed with heterogeneous GWAS associations across ancestry groups. For neutrophil counts, further improvements were seen with population-specific iPGS+refit, even beyond the improvements in the inclusive PGS model without the population-specific refit in the African population (FIG. 2-3F and FIG. 2-4D).
  • iPGS model ii
  • the few exceptions were all in the African population and when there were genetic variants with ancestry-dependent effects (FIGs. 2-3A to 2-3B).
  • Ancestry-dependent genetic effects violate the modeling assumption in iPGS; inclusive PGS training works best to capture ancestry-shared genetic effects.
  • iPGS+refit showed the best predictive performance for African individuals (FIG. 2-3F, FIG. 2-3G, FIG. 2-4D, and FIG. 2-4E).
  • Those results highlight the advantage of iPGS and the flexibility of iPGS+refit in jointly modeling ancestry-dependent and ancestry-shared genetic effects. Discussion [0244] Presented herein is inclusive PGS (iPGS), a PGS training strategy that includes ancestry-diverse individuals.
  • iPGS does not require LD reference panels in PGS fitting and naturally provides a way to include admixed individuals in PGS training.
  • iPGS+refit a method to model ancestry-dependent effects on top of the shared effects captured in iPGS, was also developed.
  • the inclusive PGS model captures the causal UTR variant, but that is not the case for the WB-only model.
  • An iPGS+refit strategy was developed to jointly model ancestry-shared and ancestry-dependent effects in a specific population when the modeling assumption in the vanilla iPGS did not hold.
  • the candidate genetic variants with ancestry-dependent effects were selected by using the heterogeneity test implemented in a GWAS meta-analysis. It is empirically reported herein that the interaction effects between the genetic variants and genotype PCs help improve the predictive performance in iPGS+refit.
  • GenESIS enhancing transferability of polygenic scores with gene-by-sex interactions.
  • PGS Polygenic score
  • GenESIS substantially extends the recently developed inclusive PGS (iPGS).
  • GenESIS and iPGS fit predictive models directly on the individual-level data and are thus naturally applicable to individuals across the continuum of genetic ancestry.
  • GenESIS results GENe, Environment, and Sex Interaction Score (GenESIS) methodology
  • In GenESIS, linear and interaction effects of genetic variants, demography, environmental factors, and biological sex are considered by applying supervised learning. The following steps were used as a proof of principle, though a broader set of approaches is possible, as described in the discussion.
  • genetic data was first augmented by constructing variables representing nonlinear genetic effects (Methods).
  • supervised statistical learning was applied directly on the individual-level data while introducing more regularization for nonlinear and context-dependent genetic effects.
  • Third, the incremental utility of nonlinear and context-dependent genetic effects was evaluated in polygenic prediction by benchmarking GenESIS against linear-only inclusive PGS (iPGS) models.
  • L1- and L2-penalized Elastic Net regression was applied directly on the individual-level data considering linear, nonlinear, and genome-wide gene-by-sex (GxS) interaction effects, represented in 2,630,335 predictor variables across 1,316,147 genetic variants (Methods, Table 13).
  • HLA human leukocyte antigens
  • LD complex linkage disequilibrium
  • Linear-only iPGS models were applied as a baseline using the same set of 1.3 million genetic variants.
  • GenESIS GxS effects are highly consistent with sex-stratified genome-wide associations, validating the approach.
  • the predictive performance of GenESIS, linear-only iPGS, and a model that considers covariate terms alone was subsequently evaluated in each of the following population groups: white British (WB), non-British white (NBW), South Asian (SA), African (Afr), and other unrelated individuals (Others).
  • the number of predictor variables increased from 197 to 14,869 for total bilirubin and 278 to 9979 for direct bilirubin in GenESIS.
  • a median of 8.0% of predictors (ranging from 0.2% to 27.7%) captures GxS interaction effects.
  • the variants with GxS interaction effects are distributed genome-wide, not necessarily localized in sex chromosomes. On median across 99 UK Biobank traits, the GxS interaction effect size is 24.8% smaller than that of linear effects (FIGs. 3-1A to 3-1B).
  • Genetic dominance effects at HLA allelotypes are captured in two traits (gamma-glutamyl transferase and C-reactive protein), both on the same allele, HLA-DPA1*0103.
  • the high allele frequency of the allelotype (96.0% in White British) works advantageously in capturing genetic dominance effects, possibly reflecting the limited statistical power to detect and capture genetic dominance effects for other imputed allelotypes.
  • Predictive performance is reported in each population group to minimize the risk of confounding due to population structure. The statistical difference between the two models was subsequently tested in the validation and held-out test set. In the comparison between GenESIS and iPGS, the full models that consider covariate, genetics, and their interactions were studied and their predictive performance was evaluated. [0263] Across 99 traits, substantial heterogeneity was found across traits in the improvements in predictive performance with GenESIS, as expected, given that the magnitude of GxS effects depends on the genetic architecture of traits (FIGs. 3-6A to 3-6B). Model selection was applied between GenESIS vs. linear-only iPGS for each (population, trait) pair based on the validation set metrics (Methods).
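  • As a non-limiting sketch, the per-(population, trait) model selection between GenESIS and linear-only iPGS may be expressed as follows. The text states only that selection is based on validation-set metrics; the use of validation-set R² as that metric, the function name, and the example values are assumptions of this illustration.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (population, trait)


def select_models(validation_r2: Dict[Key, Dict[str, float]]) -> Dict[Key, str]:
    """Pick, for each (population, trait) pair, the model with the higher validation R^2."""
    return {
        key: max(scores, key=scores.get)  # ties resolve to the first listed model
        for key, scores in validation_r2.items()
    }


# Hypothetical validation-set metrics.
example = {
    ("African", "hip circumference"): {"GenESIS": 0.061, "iPGS": 0.054},
    ("South Asian", "standing height"): {"GenESIS": 0.302, "iPGS": 0.305},
}
print(select_models(example))
```

  • Selecting on the validation set and reporting on the held-out test set keeps the final performance estimate separate from the data used to choose the model.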
  • The biological basis of sex-dependent genetic effects was also inferred in the GenESIS model by investigating pleiotropic associations and genome-wide ontology enrichment of genetic variants with GxS effects.
  • The GenESIS model for hip circumference was focused on.
  • protein-altering variants in MC4R (rs2229616) and GIPR (rs1800437) were found and selected for their trait-lowering effects (FIG. 3-3A).
  • Melanocortin 4 receptor encoded in MC4R gene
  • gastric inhibitory polypeptide receptor encoded in GIPR
  • glucose metabolism both known for their relevance in anthropometric traits.
  • GxS interaction effects on the other hand, a non-synonymous variant was found in GCKR (rs1260326), where glucokinase regulator, the encoded protein, plays a regulatory role in glucose metabolism and is associated with central fat accumulation (FIG. 3-3B).
  • the variant shows pleiotropic associations across blood biochemistry (for example, triglycerides, C-reactive protein, and sex hormone-binding globulin [SHBG] levels), anthropometric (whole-body water mass and trunk fat-free mass), blood pressure (pulse rate and position on pulse wave notch), and sex-specific traits (FIG. 3-3C, Table 18).
  • the associations on the sex-specific traits include genetic effects on age at menopause and “had menopause,” a binary questionnaire-based phenotype.
  • the pleiotropic association of the variant across SHBG and sex-specific traits offers insights into the biology behind the sex-biased effects represented in the GenESIS model.
  • GenESIS a unified polygenic score (PGS) modeling framework capable of incorporating linear, nonlinear, and context-dependent effects.
  • GenESIS gene-by-sex
  • iPGS linear-only inclusive polygenic scores
  • The GenESIS model for hip circumference, an illustrative example, outperforms all publicly available model evaluations for Africans in the PGS catalog.
  • biologically plausible hypotheses for context-dependent effects captured in GenESIS are nominated.
  • phenome-wide associations across menopausal age and sex hormone-binding globulin levels are reported on genetic variants with GxS interaction effects in GenESIS.
  • genome-wide enrichment of GxS effects in GenESIS was found for biological processes and pathways, nominating attractive targets for context-dependent interventions.
  • the existing approaches in the field include prioritizing variants that are present in diverse populations or overlap with biosample-specific regulatory elements, integrating the results of statistical fine mapping, and incorporating genetic data from ancestry-diverse individuals.
  • the GenESIS model does not assume the genetic effects to be linear or ubiquitously shared across everyone and allows genetic effects to be modulated in the presence of demography and environmental variables.
  • L1- and L2-penalized Elastic Net regression was applied directly on the individual-level data using the batch screening iterative lasso (BASIL) algorithm implemented in the R snpnet package (version 2), although other statistical and machine-learning approaches, such as statistical boosting, would be applicable as described in the discussion.
  • the loss contribution for the i-th observation depends on the types of the exponential family considered in the regression analysis of generalized linear models. For example, it is the squared loss, i.e., , for quantitative phenotypes (Gaussian family), and it is logistic loss, i.e., , for binary phenotypes (Binomial family).
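  • For reference, the standard textbook forms of the two loss contributions named above are given below. The notation is an assumption introduced for this illustration and may differ from the original equations in symbols and scaling constants: y_i is the phenotype of the i-th individual, x_i is the corresponding predictor vector, and \eta_i is the linear predictor.

```latex
\eta_i = x_i^{\top}\beta, \qquad
\ell_i^{\text{Gaussian}}(\beta) = \tfrac{1}{2}\,\bigl(y_i - \eta_i\bigr)^{2}, \qquad
\ell_i^{\text{Binomial}}(\beta) = -\,y_i\,\eta_i + \log\!\bigl(1 + e^{\eta_i}\bigr),
\quad y_i \in \{0, 1\}
```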
  • a similar model can be used for time-to-event phenotypes (Cox Proportional Hazards regression), or joint modeling of multiple phenotypes as shown previously.
  • GenESIS with linear effects alone For modeling linear effects of genetic variants, a covariate matrix of n individuals and d_cov covariates and a genotype matrix of n individuals and d_g variants representing the allelic count of the effect allele were considered.
  • the concatenated vectors were used as predictors: (Equation 2-3) [0280]
  • the coefficient vector is also a concatenation of two components, corresponding to covariate effects and linear effects of genetic variants, respectively: (Equation 2-4) [0281]
  • Covariate terms were set to be unpenalized (i.e., for the objective function of the penalized generalized linear regression (Equation 2-1) becomes the following, which is equivalent to the polygenic prediction models considered previously: (Equation 2-5) GenESIS with genetic dominance effects [0282]
  • a matrix is considered, where is an indicator variable representing whether the i-th individual is homozygous for effect allele for the j-th genetic variant.
  • The concatenated vectors of three components were used as predictors: (Equation 2-6) [0283]
  • the coefficient vector now has three components: (Equation 2-7) [0284]
  • GxS G-by-Sex
  • predictor variables were augmented. For example, two matrices may be considered, and and represent GxS interaction terms for male and female individuals, respectively, as follows: , and (Equation 2-8) (Equation 2-9) where is an indicator function.
  • a combination of self-reported ethnic background (Data Field 21000) and genetic principal components (Data Field 22009) was subsequently used to define four population groups (white British, non-British white, African, and South Asian).
  • the remaining unrelated individuals were kept as Others. Individuals were focused on whose inferred biological sex is either female or male. In each population, 53.9% of unrelated individuals are female and the remaining 46.1% are male (Table 13).
  • the training set was used for model fitting, the validation set for determining hyperparameters, and the held-out test sets for predictive performance evaluation. The same training, validation, and test sets were used for all tested traits.
  • variants passing the following criteria were focused on: (1) the missingness of the variant is less than 1%, considering that the two genotyping arrays (the UK BiLEVE Axiom array and UK Biobank Axiom array) cover a slightly different set of variants and (2) Hardy-Weinberg disequilibrium test p-value greater than 1.0 × 10⁻⁷.
  • the following criteria were used: (1) the missingness of the variant is less than 1%; (2) minor allele frequency (MAF) greater than 0.01%; (3) imputation quality score (INFO score) greater than 0.3; (4) not present in the directly genotyped dataset; and (5) present in the HapMap Phase 3 dataset.
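  • By way of non-limiting illustration, the five criteria listed above may be expressed as a single filter over per-variant summary statistics. The record type, field names, and function name are hypothetical; the thresholds are those stated in the text, with percentages written as fractions.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant summary statistics (hypothetical field names)."""
    missingness: float          # fraction of individuals with a missing genotype
    maf: float                  # minor allele frequency, as a fraction
    info: float                 # imputation quality (INFO) score
    directly_genotyped: bool    # already present in the genotyped dataset
    in_hapmap3: bool            # present in the HapMap Phase 3 dataset


def passes_qc(v: VariantStats) -> bool:
    """Apply the five stated criteria; a variant must satisfy all of them."""
    return (
        v.missingness < 0.01        # (1) missingness below 1%
        and v.maf > 0.0001          # (2) MAF above 0.01%
        and v.info > 0.3            # (3) INFO score above 0.3
        and not v.directly_genotyped  # (4) not in the directly genotyped dataset
        and v.in_hapmap3            # (5) present in HapMap Phase 3
    )


print(passes_qc(VariantStats(0.002, 0.05, 0.95, False, True)))
```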
  • the imputed allelotype dosage was kept when within [0, 0.1), (0.9, 1.1), or (1.9, 2.0] and was converted to hard calls.
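  • The dosage-to-hard-call conversion described above may be sketched, in a non-limiting manner, as follows. The function name is hypothetical, and treating dosages outside the three stated intervals as missing (None) is an assumption of this illustration.

```python
from typing import Optional


def dosage_to_hard_call(dosage: float) -> Optional[int]:
    """Convert an imputed allelotype dosage to a hard call of 0, 1, or 2 copies."""
    if 0.0 <= dosage < 0.1:      # interval [0, 0.1)
        return 0
    if 0.9 < dosage < 1.1:       # interval (0.9, 1.1)
        return 1
    if 1.9 < dosage <= 2.0:      # interval (1.9, 2.0]
        return 2
    return None                  # ambiguous dosage: assumed to be set to missing


print([dosage_to_hard_call(d) for d in (0.03, 0.5, 1.02, 1.97)])
```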
  • the HLA allelotype was focused on with (1) missingness no more than 1% and (2) Hardy-Weinberg disequilibrium test p-value greater than 1.0 × 10⁻⁴. All variants and allelotypes were concatenated into one dataset using PLINK 2.0 (v2.00a3.3LM 3 Jun 2022). The quality control procedure resulted in 1,316,181 unique genetic variants and allelotypes considered in the analysis. [0294] The following variables were defined to consider nonlinear genetic effects.
  • the imputed HLA allelotypes account for complex LD structure in the major histocompatibility complex (MHC) region.
  • the imputed allelotype dosage was kept within (1.9, 2.0] for genetic dominance effects of imputed HLA allelotypes.
  • for GxS interaction effects, sex-specific effects in males were modeled by keeping the original genotype for male individuals and setting it to zero for female individuals, as in Equation 2-8.
  • variables were prepared for GxS interaction effects of genetic dominance terms for the HLA allelotypes. Variables were dropped when none of the unrelated individuals considered in the analysis had non-zero values.
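A minimal sketch of the male-specific GxS variables and the all-zero filter described above, in plain Python. The function names and the list-of-rows layout (one row per individual, one column per variant) are illustrative assumptions, not the implementation used in the study.

```python
def male_gxs_columns(genotypes, is_male):
    """Keep the genotype for male individuals and set it to zero otherwise."""
    return [[g if male else 0 for g in row]
            for row, male in zip(genotypes, is_male)]


def drop_all_zero_columns(matrix):
    """Drop variables for which no individual has a non-zero value."""
    n_cols = len(matrix[0])
    keep = [j for j in range(n_cols) if any(row[j] != 0 for row in matrix)]
    return [[row[j] for j in keep] for row in matrix], keep


# Toy data: three individuals, three variants (allele counts).
genotypes = [[0, 1, 2], [1, 0, 2], [0, 2, 0]]
is_male = [True, False, True]

gxs = male_gxs_columns(genotypes, is_male)
filtered, kept = drop_all_zero_columns(gxs)
```

In the toy data the first variant is carried only by the female individual, so its male-specific column is all zero and is dropped.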
  • Phenotype definition in UK Biobank [0295] In the UK Biobank resource, 99 anthropometric, blood biochemistry, blood count (hematological), and blood pressure traits were studied (Table 12). Some of those phenotypes are collected at up to four instances, each of which corresponds to (1) the initial assessment visit (2006-2010), (2) the first repeat assessment visit (2012-2013), (3) the imaging visit (2014-), and (4) first repeat imaging visit (2019-). Phenotype data was defined using the median of non-missing values for each individual across the 60 quantitative traits as described elsewhere. Genome-wide association analysis [0296] Sex-stratified genome-wide association analysis was applied with PLINK (v2.00 alpha).
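The phenotype definition above (the median of non-missing values across up to four assessment instances) can be sketched as follows. The function name and the use of `None` for a missing measurement are assumptions made for illustration.

```python
from statistics import median


def phenotype_from_instances(values):
    """Median of the non-missing measurements for one individual and trait.

    `values` holds one entry per assessment instance; None marks a missing
    measurement. Returns None when every instance is missing.
    """
    observed = [v for v in values if v is not None]
    return median(observed) if observed else None


two_visits = phenotype_from_instances([70.0, None, 72.0, None])
three_visits = phenotype_from_instances([1.0, 2.0, 4.0, None])
no_visits = phenotype_from_instances([None, None, None, None])
```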
  • Population-specific genotype PCs were computed for white British individuals in the UK Biobank cohort using the randomized algorithm implemented in PLINK2.
  • the GWAS analysis was subsequently applied using age, Townsend deprivation index, array, and the top ten population-specific genotype PC loadings as covariates, using approximation algorithm implemented as “--glm zs omit-ref no-x-sex log10 hide-covar skip-invalid-pheno cc-residualize firth-fallback” command in PLINK2.
  • the participants of the UK Biobank cohort were genotyped on two different arrays: about 10% of participants were genotyped on the UK BiLEVE Axiom array, whereas the rest were genotyped on the UK Biobank Axiom array.
  • an indicator variable “array” was included in the covariates, denoting whether the UK Biobank Axiom array or UK BiLEVE Axiom array was used in the genotyping.
  • Quantile normalization was applied using the “--pheno-quantile-normalize” option in PLINK2.
  • GWAS analysis was conducted using the male and female individuals separately in the white British population group.
  • genotype PCs represent genetic ancestry and account for trait mean differences associated with the genome-wide genetic ancestry.
  • Analysis of both male- and female-specific GxS effect vectors presents technical challenges in numerical stability, partly due to the collinearity of sex-specific GxS effect vectors.
  • male-specific GxS interaction effects were focused on, given that more female individuals are in the cohort.
  • both male- and female-specific GxS effect vectors were incorporated and the difference in predictive performance was investigated.
  • Protein-truncating and protein-altering variants were prioritized using penalty factors shown in Table 13 to improve the interpretation of selected variants. More penalization was imposed for nonlinear genetic effects to reduce the risk of overfitting. The specific values of penalty factors are based on heuristics, and finding the optimal values of penalty factors would be an important direction of follow-up studies as described in the Discussion. Comparison of linear and GxS effect size [0299] The standard deviation of the phenotype values was computed in the training set for each trait and was used to normalize the GenESIS effect size. The first quartile, median, and third quartile of the absolute values of the normalized effect sizes were computed for linear and GxS effects. They were compared across 99 traits.
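The effect-size comparison can be sketched as below. The text does not state which standard deviation estimator or quantile convention was used; the sample standard deviation and the "inclusive" quantile method are assumptions, and the function name is hypothetical.

```python
from statistics import quantiles, stdev


def normalized_abs_effect_quartiles(effects, training_phenotypes):
    """Quartiles of |effect| / SD(phenotype in the training set)."""
    sd = stdev(training_phenotypes)  # sample SD: an assumption
    normalized = [abs(beta) / sd for beta in effects]
    q1, med, q3 = quantiles(normalized, n=4, method="inclusive")
    return q1, med, q3


# Toy inputs: the phenotype SD is 2.0, so the effects normalize to 1..5.
summary = normalized_abs_effect_quartiles(
    effects=[-2.0, 4.0, -6.0, 8.0, 10.0],
    training_phenotypes=[0.0, 2.0, 4.0],
)
```

The same function would be applied separately to the linear and the GxS coefficients of each trait before comparing the two summaries.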
  • Phenome-wide association analysis [0305] The phenome-wide association of genetic variants was investigated with GxS interaction effects in the GenESIS models using the Global Biobank Engine. The association summary statistics for the select genetic variants were obtained based on their coordinates on the GRCh37 reference genome. For example, the association profile for rs1260326 (a missense variant in GCKR) is from the following webpage: biobankengine.stanford.edu/RIVAS_HG19/variant/2-27730940-T-C.
  • sample weights, , and penalty factor values that allow different magnitudes of shrinkage to variables.
  • a covariate matrix of n individuals and d_cov covariates and a genotype matrix of n individuals and d_g variants, representing the allelic count of the effect allele, were considered.
  • The concatenated vectors were used as predictors: (Equation 3-3) [0314]
  • the coefficient vector is also a concatenation of two components, corresponding to covariate effects and linear effects of genetic variants, respectively: (Equation 3-4) [0315]
  • the objective function of the penalized generalized linear regression (Equation 3-1) becomes the following, which is equivalent to the polygenic prediction models considered previously herein: (Equation 3-5) [0316]
  • the loss contribution for the i-th observation depends on the types of the exponential family considered in the regression analysis of generalized linear models.
  • a penalty factor of 0.5 was assigned to putative protein-truncating variants and pathogenic variants, 0.75 to putative protein-altering variants, likely-pathogenic variants, and HLA allelotypes, 1.2 to variants that are not present in the HapMap Phase 3 dataset, and 1.0 to the remaining variants.
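A minimal sketch of how the penalty factors listed above could enter an L1 penalty, where each coefficient's shrinkage is scaled by its factor. The annotation labels, the one-label-per-variant simplification, and the function names are assumptions for illustration; only the numeric factor values come from the text.

```python
# Penalty factor values as stated in the text; the label strings are hypothetical.
PENALTY_FACTORS = {
    "protein_truncating": 0.5,
    "pathogenic": 0.5,
    "protein_altering": 0.75,
    "likely_pathogenic": 0.75,
    "hla_allelotype": 0.75,
    "not_in_hapmap3": 1.2,
}
DEFAULT_PENALTY_FACTOR = 1.0  # remaining variants


def penalty_factor(annotation):
    """Look up the penalty factor for a single variant annotation."""
    return PENALTY_FACTORS.get(annotation, DEFAULT_PENALTY_FACTOR)


def weighted_l1_penalty(coefficients, annotations, lam):
    """lam * sum_j factor_j * |beta_j|: smaller factors mean less shrinkage."""
    return lam * sum(penalty_factor(a) * abs(b)
                     for b, a in zip(coefficients, annotations))


penalty = weighted_l1_penalty(
    coefficients=[1.0, -2.0, 4.0],
    annotations=["protein_truncating", "other", "hla_allelotype"],
    lam=0.5,
)
```

With these toy inputs the protein-truncating variant contributes half as much penalty per unit of effect size as an unannotated variant, which is how the prioritization described above takes effect.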
  • the specific values of penalty factors are based on heuristics, and finding the optimal values of penalty factors would be an important direction of follow-up studies.
  • Results List of 226 Analyzed Traits [0318] The analysis was expanded from 60 quantitative traits to 226 traits consisting of 177 quantitative and 49 binary traits, grouped in the following 12 groups as listed in Table 20.
  • Fit another iPGS model using a larger number of individuals in the training and the validation set and determine the optimal value of the tuning parameter λ using the regularization criterion in the previous step.
  • obtaining the trained computational model includes techniques to determine an optimal regularization parameter value.
  • the regularization parameter can be the optimal λ value itself, the number of genetic variants and/or predictor variables included in the iPGS model, or information-theoretic metrics, such as Bayesian Information Criterion (BIC).
  • a plurality of computational models may be trained, each having a different value of the regularization parameter, using a subset (e.g., only the training set of data) of the data describing phenotypes, genotypes, and/or phenotype-genotype relationships.
  • a predictive performance of each of the first plurality of trained computational models may be evaluated, and an optimal regularization parameter value may be selected based on the predictive performance metrics.
  • the trained computational model may be obtained by training a computational model using the optimal regularization parameter and a larger data set (e.g., the training set of data and the validation set of data).
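The two-stage procedure described above (select the regularization parameter on the validation set, then refit on the training and validation sets combined) can be sketched as follows. A one-predictor ridge regression is swapped in as a stand-in for the penalized regression so that the example stays self-contained; it is not the model used in the source, and all function names are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge estimate for one predictor without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mean_squared_error(beta, xs, ys):
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_and_refit(train, validation, lambdas):
    """Pick lambda on the validation set, then refit on train + validation."""
    (x_tr, y_tr), (x_va, y_va) = train, validation
    best_lam = min(
        lambdas,
        key=lambda lam: mean_squared_error(
            fit_ridge_1d(x_tr, y_tr, lam), x_va, y_va),
    )
    refit_beta = fit_ridge_1d(x_tr + x_va, y_tr + y_va, best_lam)
    return best_lam, refit_beta


best_lam, refit_beta = select_and_refit(
    train=([1.0, 2.0], [2.0, 4.0]),
    validation=([3.0], [6.0]),
    lambdas=[0.0, 1.0, 10.0],
)
```

The held-out test set is deliberately absent from both stages, matching the split described earlier in which it is reserved for evaluating predictive performance.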
  • Tables 1 and 2. The number of unrelated individuals in UK Biobank analysis. Table 1. Population assignment. The number of training, validation, and test-set individuals across population groups is shown. Table 2. The number of unrelated individuals used in polygenic score training. The number of individuals used to train PGS models is shown. In the iPGS+refit in Afr models (models v and vi), ⁇ train ⁇ 284,661 individuals were used to train the iPGS model, whereas a subset of ⁇ ⁇ 4,853 individuals were used in the population-specific refit model (material and methods). Abbreviations are as follows: WB, White British; NBW: non-British White; SA, South Asian; and Afr, African.
  • the number of selected genetic variants. The number of genetic variants with non-zero coefficients is shown across four PGS models and the 60 traits analyzed in the study. Table 5. Comparison of predicted PGS in the held-out test set individuals. The Pearson’s correlation (R2) between the two predicted PGS values was evaluated across all unrelated held-out test individuals (“All”) as well as a subset of held-out test set individuals (WB: white British, NBW: non-British white, SA: South Asian, Afr: African, and Others). Reported is the median value across 60 traits. Table 6. Average improvements of PGS models over the WB-only models.
  • the four columns under the PGS models correspond to the genetic variants selected for the specified PGS model in at least one of the 60 anthropometric and hematological traits.
  • Table 9A The minor allele frequency threshold of 5%
  • Table 9B The minor allele frequency threshold of 1% Table 10.
  • the number of heterogeneous associations from GWAS meta-analysis. The number of genetic variants with heterogeneous GWAS associations is shown across the 60 traits analyzed in the study. The number of genetic variants reaching the nominal p-value of 5×10⁻⁸ in the GWAS heterogeneity tests is shown in the "Without LD pruning, Heterogeneity p < 5e-8" column. The number of approximately LD-independent variants (r² < 0.5) in each population is also shown.
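As a rough illustration of how heterogeneous associations can be flagged for a single variant, the sketch below combines per-population GWAS estimates with inverse-variance weights and computes Cochran's Q. The caption does not name the heterogeneity test, so the use of Cochran's Q here is an assumption, as are the function name and inputs.

```python
def ivw_meta_analysis(betas, standard_errors):
    """Inverse-variance weighted fixed-effect estimate and Cochran's Q.

    Returns (pooled_beta, pooled_se, q). Under homogeneity, Q approximately
    follows a chi-square distribution with len(betas) - 1 degrees of freedom;
    converting Q to a p-value is left out to keep the sketch dependency-free.
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    return pooled_beta, pooled_se, q


# Toy example: one variant with estimates from two populations.
pooled_beta, pooled_se, q = ivw_meta_analysis([1.0, 3.0], [1.0, 1.0])
```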
  • Table 11 The number of UK Biobank individuals analyzed in the study. The number of individuals in training, validation, and held-out test sets stratified by biological sex are shown.
  • Phenome-wide association across UK Biobank traits for rs1260326.
  • phenome-wide association summary statistics across UK Biobank traits were obtained from the Global Biobank Engine (Methods). For each phenotype (Code, Group, Name), effect sizes (BETA or log odds ratio), the corresponding 95% confidence intervals (L95OR and U95OR), number of individuals (number of individuals with non-missing values for continuous traits and number of case individuals for binary traits, shown in the n_case column), and statistical significance of the association, shown as -log10(p-value), are shown. Traits are sorted by the statistical significance of phenome-wide associations.
  • a method of predicting a medical trait in an individual comprising: obtaining genomic data including information indicative of genetic variants present in a genome of the individual; calculating, using an at least one processor and a trained computational model, at least one trait score based on the information indicative of the genetic variants, wherein the trained computational model was obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries; and generating, using the at least one processor, a graphical user interface including a visualization of the at least one trait score, the at least one trait score being indicative of a presence of the medical trait in the individual; and displaying, using a display device, the generated graphical user interface.
  • Example 2 The method of example 1, further comprising altering a course of medical treatment related to the medical trait based on the at least one trait score.
  • Example 3 The method of example 1 or 2, wherein altering the course of medical treatment comprises reducing a number of medical interventions, reducing a frequency of medical interventions, and/or selecting a different type of medical intervention based on a value of the at least one trait score.
  • Example 4. The method of any one of examples 1-3, wherein altering the course of medical treatment is based on the value of the at least one trait score relative to a threshold value.
  • Example 5.
  • altering the course of medical treatment comprises increasing a number of medical interventions, increasing a frequency of medical interventions, and/or selecting a different type of medical intervention based on the at least one trait score.
  • Example 6 The method of any one of examples 1-5, wherein altering the course of medical treatment is based on a value of the at least one trait score relative to a threshold value.
  • Example 7. The method of any one of examples 1-6, further comprising identifying at least one therapeutic agent based on a value of the at least one trait score.
  • Example 8. The method of any one of examples 1-7, further comprising administering the at least one therapeutic agent to the individual based on the value of the at least one trait score.
  • Example 9 The method of any one of examples 1-8, further comprising altering a recruitment strategy for a clinical trial related to the medical trait based on the at least one trait score.
  • Example 10 The method of any one of examples 1-9, wherein altering the recruitment strategy for the clinical trial comprises removing the individual from enrollment in the clinical trial based on a value of the at least one trait score.
  • Example 11 The method of any one of examples 1-10, wherein altering the recruitment strategy for the clinical trial is based on the value of the at least one trait score relative to a threshold value.
  • Example 12. The method of any one of examples 1-11, wherein altering the recruitment strategy for the clinical trial comprises including the individual in enrollment in the clinical trial based on a value of the at least one trait score.
  • Example 13 The method of any one of examples 1-12, wherein altering the recruitment strategy for the clinical trial is based on the value of the at least one trait score relative to a threshold value.
  • Example 14 The method of any one of examples 1-13, wherein training the first computational model comprises using batch screening iterative lasso (BASIL) regression.
  • Example 15 The method of any one of examples 1-14, wherein training the first computational model comprises performing a regression including a loss, the loss comprising a squared loss and/or a binomial loss.
  • Example 16 The method of any one of examples 1-15, wherein obtaining the trained computational model further comprises using the trained first computational model and performing a regression including a penalty applied to genetic variants with heterogeneous associations between single-ancestry populations.
  • Example 17 The method of any one of examples 1-16, wherein performing the regression further comprises including effects of a global ancestry of the individual on genetic variants with heterogeneous associations.
  • Example 18.
  • Example 19 The method of any one of examples 1-18, wherein performing the regression comprises performing Elastic Net regression.
  • Example 20 The method of any one of examples 1-19, wherein training the first computational model comprises performing a regression including at least one penalty factor determined based on one or more biological priors.
  • Example 21 The method of any one of examples 1-20, wherein the one or more biological priors include one or more of the following: variant pathogenicity, predicted variant consequences, known causal variants, tissue-specific regulatory genomic annotations, cell-type- specific regulatory genomic annotations, and/or aggregation of one or more of the preceding effects.
  • Example 22 The method of any one of examples 1-21, further comprising determining the genetic variants with heterogeneous associations between single-ancestry populations by determining genetic variants associated with the medical trait for two or more single-ancestry populations using inverse-variance weighted meta-analysis of genome-wide association studies (GWAS) for each of the two or more single-ancestry populations.
  • Example 24 The method of any one of examples 1-23, wherein calculating the at least one trait score further comprises using information indicative of one or more conventional risk factors and/or genetic variants associated with the individual.
  • Example 25 The method of any one of examples 1-24, wherein obtaining the trained computational model further comprises performing a regression including effects on phenotypes in the data describing phenotypes, genotypes, and/or phenotype-genotype relationships of the one or more conventional risk factors and/or the genetic variants associated with the individual.
  • Example 26 The method of any one of examples 1-25, wherein obtaining the genomic data comprises obtaining genotyping data previously obtained by genotyping a biological sample obtained from the individual. [0357]
  • Example 27 The method of any one of examples 1-26, wherein obtaining the genomic data comprises obtaining genotyping data by sequencing a biological sample obtained from the individual.
  • Example 28 The method of any one of examples 1-27, wherein obtaining the genotyping data comprises obtaining microarray data, whole-genome sequencing data, whole-exome sequencing data, and/or genotype imputation from partially observed data. [0359]
  • Example 29 The method of any one of examples 1-28, wherein training the first computational model further comprises using one or more of a linear genetic effect, a genetic dominance effect, and/or a sex-based genetic effect associated with the medical trait.
  • Example 30 The method of any one of examples 1-29, wherein obtaining the trained computational model further comprises: determining an optimal regularization parameter value by: training a first plurality of computational models using a plurality of different regularization parameter values and a subset of the data describing phenotypes, genotypes, and/or phenotype-genotype relationships; evaluating a predictive performance of each of the first plurality of trained computational models; and selecting the optimal regularization parameter value based on the evaluated predictive performance of each of the first plurality of trained computational models.
  • Example 31 The method of any one of examples 1-30, wherein obtaining the trained computational model further comprises training the first computational model using the optimal regularization parameter value.
  • Example 32 A system, comprising: at least one computer hardware processor; and at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of any one of examples 1 or 14 to 31.
  • Example 33 At least one non-transitory computer readable medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of any one of examples 1 or 14 to 31.
  • Example 34. A method of performing a clinical trial, comprising: obtaining, for a first individual, a first trait score associated with a medical trait by: calculating the first trait score using at least one processor, a trained computational model, and first genomic data including information indicative of genetic variants present in a genome of the first individual, wherein the trained computational model was obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries; enrolling the first individual in the clinical trial based on a value of the first trait score; and altering a course of medical treatment for the first individual in accordance with the clinical trial.
  • Example 35 The method of example 34, wherein enrolling the first individual in the clinical trial based on the value of the first trait score comprises enrolling the first individual in the clinical trial based on the value of the first trait score relative to a threshold value.
  • Example 36 The method of example 34 or 35, further comprising: obtaining, for a second individual, a second trait score associated with the medical trait by calculating the second trait score using an at least one processor, the trained computational model, and second genomic data including information indicative of genetic variants present in a genome of the second individual; and declining to enroll the second individual in the clinical trial based on a value of the second trait score.
  • Example 37.
  • Example 38 The method of any one of examples 34-37, wherein training the first computational model comprises using batch screening iterative lasso (BASIL) regression.
  • Example 39 The method of any one of examples 34-38, wherein training the first computational model comprises performing a regression including a loss, the loss comprising a squared loss and/or a binomial loss.
  • Example 40 The method of any one of examples 34-36, wherein declining to enroll the second individual in the clinical trial based on the value of the second trait score comprises declining to enroll the second individual in the clinical trial based on the value of the second trait score relative to a threshold value.
  • Example 41 The method of any one of examples 34-40, wherein performing the regression further comprises including effects of a global ancestry of the first individual on genetic variants with heterogeneous associations.
  • Example 42 The method of any one of examples 34-40, wherein performing the regression further comprises including effects of a global ancestry of the first individual on genetic variants with heterogeneous associations.
  • Example 43 The method of any one of examples 34-42, wherein performing the regression comprises performing Elastic Net regression.
  • Example 44 The method of any one of examples 34-43, wherein training the first computational model comprises performing a regression including at least one penalty factor determined based on one or more biological priors. [0375] Example 45.
  • Example 46 The method of any one of examples 34-45, further comprising determining the genetic variants with heterogeneous associations between single-ancestry populations by: determining genetic variants associated with the medical trait for two or more single-ancestry populations using inverse-variance weighted meta-analysis of genome-wide association studies (GWAS) for each of the two or more single-ancestry populations.
  • Example 48 The method of any one of examples 34-47, wherein calculating the at least one trait score further comprises using information indicative of one or more conventional risk factors and/or genetic variants associated with the first individual.
  • Example 49.
  • Example 50 The method of any one of examples 34-49, wherein obtaining the genomic data comprises obtaining genotyping data previously obtained by genotyping a biological sample obtained from the first individual.
  • Example 51 The method of any one of examples 34-50, wherein obtaining the first genomic data comprises obtaining genotyping data by sequencing a biological sample obtained from the first individual.
  • Example 52 The method of any one of examples 34-50, wherein obtaining the first genomic data comprises obtaining genotyping data by sequencing a biological sample obtained from the first individual.
  • obtaining the trained computational model further comprises: determining an optimal regularization parameter value by: training a first plurality of computational models using a plurality of different regularization parameter values and a subset of the data describing phenotypes, genotypes, and/or phenotype-genotype relationships; evaluating a predictive performance of each of the first plurality of trained computational models; and selecting the optimal regularization parameter value based on the evaluated predictive performance of each of the first plurality of trained computational models.
  • Example 55 The method of any one of examples 34-54, wherein obtaining the trained computational model further comprises training the first computational model using the optimal regularization parameter value.
  • Example 56. A system, comprising: at least one computer hardware processor; and at least one non-transitory computer readable medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of any one of examples 34 to 55.
  • Example 57 At least one non-transitory computer readable medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform the method of any one of examples 34 to 55.
  • a reference to “A and/or B,” when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
  • the phrase “at least one,” in reference to a list of one or more elements should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • At least one of A and B can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

Abstract

The invention relates to systems and methods for predicting a medical trait in an individual. The techniques include obtaining genomic data including information indicative of genetic variants present in a genome of the individual and calculating, using at least one processor and a trained computational model, a trait score based on the information indicative of the genetic variants, wherein the trained computational model was obtained by training a first computational model using data describing phenotypes, genotypes, and/or phenotype-genotype relationships for a population including admixed individuals of multiple ancestries. The trait score may be used to alter a course of medical treatment of the individual and/or to conduct a clinical trial.
PCT/US2024/051666 2023-10-16 2024-10-16 Methods for improved predictions of polygenic phenotypes across diverse populations Pending WO2025085574A1 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202363590758P 2023-10-16 2023-10-16
US63/590,758 2023-10-16
US202463639569P 2024-04-26 2024-04-26
US63/639,569 2024-04-26
US202463663654P 2024-06-24 2024-06-24
US63/663,654 2024-06-24

Publications (1)

Publication Number Publication Date
WO2025085574A1 true WO2025085574A1 (fr) 2025-04-24

Family

ID=95449334

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/051666 Pending WO2025085574A1 (fr) 2023-10-16 2024-10-16 Methods for improved predictions of polygenic phenotypes across diverse populations

Country Status (1)

Country Link
WO (1) WO2025085574A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200167914A1 (en) * 2017-07-19 2020-05-28 Altius Institute For Biomedical Sciences Methods of analyzing microscopy images using machine learning
US20230260658A1 (en) * 2020-04-20 2023-08-17 Myriad Genetics, Inc. Polygenic trait prediction using local ancestry
WO2022197968A1 (fr) * 2021-03-19 2022-09-22 Scipher Medicine Corporation Methods of classifying and treating patients
WO2022271724A1 (fr) * 2021-06-22 2022-12-29 Scipher Medicine Corporation Methods and systems for therapeutic monitoring and clinical trial design

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MÄGI REEDIK, HORIKOSHI MOMOKO, SOFER TAMAR, MAHAJAN ANUBHA, KITAJIMA HIDETOSHI, FRANCESCHINI NORA, MCCARTHY MARK I., MORRIS ANDREW: "Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution", HUMAN MOLECULAR GENETICS, vol. 26, no. 18, 15 September 2017 (2017-09-15), pages 3639 - 3650, XP093308144, ISSN: 0964-6906, DOI: 10.1093/hmg/ddx280 *
SINNOTT-ARMSTRONG ET AL.: "Genetics of 35 blood and urine biomarkers in the UK Biobank.", NAT GENET, vol. 53, no. 2, 18 January 2021 (2021-01-18), pages 185 - 194, XP037581977, Retrieved from the Internet <URL:https://pmc.ncbi.nlm.nih.gov/articles/PMC7867639> [retrieved on 20241204], DOI: 10.1038/s41588-020-00757-z *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
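CN120199324A (en) * 2025-05-23 2025-06-24 中国人民解放军海军军医大学第二附属医院 Multi-ethnic PRS dynamic calibration method and system based on transfer learning
CN120199324B (en) * 2025-05-23 2025-08-01 中国人民解放军海军军医大学第二附属医院 Multi-ethnic PRS dynamic calibration method and system based on transfer learning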

Similar Documents

Publication Publication Date Title
Kachuri et al. Principles and methods for transferring polygenic risk scores across global populations
Uffelmann et al. Genome-wide association studies
Enoma et al. Machine learning approaches to genome-wide association studies
US20250266129A1 (en) Machine Learning Platform for Polygenic Models
Wang et al. Polygenic prediction across populations is influenced by ancestry, genetic architecture, and methodology
Hamid et al. Data integration in genetics and genomics: methods and challenges
Shen et al. SHEsisPlus, a toolset for genetic studies on polyploid species
US20220044761A1 (en) Machine learning platform for generating risk models
Vinkhuyzen et al. Estimation and partition of heritability in human populations using whole-genome analysis methods
US7035739B2 (en) Computer systems and methods for identifying genes and determining pathways associated with traits
WO2022087478A1 (en) Machine learning platform for generating risk models
Morris Fine mapping of type 2 diabetes susceptibility loci
Schwarzerova et al. A perspective on genetic and polygenic risk scores—advances and limitations and overview of associated tools
Juang et al. Rare variants discovery by extensive whole-genome sequencing of the Han Chinese population in Taiwan: Applications to cardiovascular medicine
Alireza et al. Enhancing prediction accuracy of coronary artery disease through machine learning-driven genomic variant selection
Tanigawa et al. Power of inclusion: Enhancing polygenic prediction with admixed individuals
Liu et al. TreeMap: a structured approach to fine mapping of eQTL variants
Zaidi et al. The genetic and phenotypic correlates of mtDNA copy number in a multi-ancestry cohort
WO2025085574A1 (en) Methods for improved predictions of polygenic phenotypes across diverse populations
Chen et al. Genomics of drug target prioritization for complex diseases
Wang et al. A comprehensive investigation of statistical and machine learning approaches for predicting complex human diseases on genomic variants
Tremmel et al. Machine learning models for pharmacogenomic variant effect predictions–recent developments and future frontiers
Zhao et al. Adjusting for genetic confounders in transcriptome-wide association studies leads to reliable detection of causal genes
Fu et al. Defining the distance between diseases using SNOMED CT embeddings
Ahmed Multi-omics/genomics in predictive and personalized medicine

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 24880537; Country of ref document: EP; Kind code of ref document: A1