WO2019023644A1 - Designed proteins for ligand binding - Google Patents
- Publication number
- WO2019023644A1 (PCT/US2018/044195)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- amino acid
- protein
- ligand
- acid residue
- ligand binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional [2D] or three-dimensional [3D] molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
Definitions
- a computer-implemented method including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
- a system including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and where
- a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetic
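The identification steps (a) and (b) above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a residue counts as ligand binding when any of its atoms lies within a fixed distance of any ligand atom; the source does not state how residues are identified, so the cutoff value, the function names, and the residue labels are all hypothetical. The optimization step (c) is not shown.

```python
import math

# Hypothetical contact cutoff in angstroms; the source does not specify one.
CONTACT_CUTOFF = 4.0


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom in atoms_a and any atom in atoms_b."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def partition_residues(residues, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Split residues into (ligand_binding, core) sets by distance to the ligand.

    residues: dict mapping a residue label to a list of (x, y, z) atomic coordinates.
    ligand_atoms: list of (x, y, z) ligand atomic coordinates.
    """
    binding, core = {}, {}
    for label, atoms in residues.items():
        # Assumption: proximity to the ligand stands in for "binds the ligand".
        target = binding if min_distance(atoms, ligand_atoms) <= cutoff else core
        target[label] = atoms
    return binding, core


# Toy coordinates with made-up residue labels, for illustration only.
residues = {
    "HIS46": [(0.0, 0.0, 2.0), (0.0, 1.0, 3.0)],
    "LEU12": [(0.0, 0.0, 15.0), (1.0, 0.0, 16.0)],
}
ligand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
binding, core = partition_residues(residues, ligand)
print(sorted(binding), sorted(core))
```

Each residue keeps its coordinate set in the output, mirroring the claim language in which every residue is associated with its own set of atomic coordinates.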
- FIGS. 1A-1C The design strategy.
- FIG. 1A Structures of natural cofactor-binding proteins show a folded core supporting a cofactor-binding region.
- FIG. 1B Examples of previously designed tetra-helical porphyrin-binding proteins; all but PS1 (which is described herein) lack a folded core.
- the α2 protein is from ref 40; the remainder are described in the text.
- FIG. 1C The design process starts with a parameterized backbone, which undergoes
- FIG. 2 The computational design workflow for optimized core packing.
- the abiological porphyrin cofactor, (CF3)4PZn, is shown in the upper left.
- the constrained, parameterized backbone of SCRPZ-2 feeds into a flexible backbone design protocol that allows the interior side chains and backbone to simultaneously conform to the porphyrin (CF3)4PZn.
- On the right are depicted the ab initio folding predictions of the PS1 sequence.
- the Rosetta folding algorithm predicts a shallow folding funnel for the binding region (light gray) and a deep folding funnel shifted toward lower RMSD for the folded core (dark gray) of apo-PS1.
- the RMSD (root mean squared deviation), in Å, is computed against the helical residues within these regions in the designed model. Energy is in Rosetta energy units (r.e.u.).
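As a reminder of what the RMSD axis measures, the following is a minimal sketch of an RMSD calculation between a predicted structure and a reference model. It assumes the two coordinate sets are already superimposed and list the same atoms in the same order; no alignment step is performed, and the function name and toy coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equally sized lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy example: every atom is displaced by exactly 1 Å along z.
designed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(designed, predicted))
```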
- FIGS. 3A-3D Biophysical characterization of apo- and holo-PS1.
- FIG. 3A Electronic absorption and emission
- FIG. 3B Determination of the dissociation constant (KD) by apo-PS1 titration into a buffer solution (100 mM NaCl, 50 mM NaPi, pH 7.5) of (CF3)4PZn with 1% w/v octyl-β-D-glucopyranoside. Inset shows spectral shifts upon porphyrin binding to PS1.
- FIG. 3C Circular dichroism (CD) spectra of apo- and holo-PS1 in 50 mM NaPi, 100 mM NaCl, pH 7.5 as a function of temperature. The transitions appear reversible based on the fact that the spectra are identical after cooling to room
- FIG. 3D Pump-probe transient absorption spectra of (CF3)4PZn bound in the interior of holo-PS1 at 21 °C and 100 °C.
- the black spectrum shows characteristic S1 → Sn absorptions of (CF3)4PZn, which smoothly transitions into the gray spectrum showing characteristic T1 → Tn absorptions of (CF3)4PZn.
- FIG. 4 The structure of holo-PS1 agrees closely with the design.
- the holo-PS1 model shown is the centroid of the NMR structural ensemble. 26 porphyrin-protein nuclear Overhauser effects (NOEs), drawn as sticks, experimentally determine the orientation of the porphyrin within the binding site of PS1. Middle panel compares observed vs.
- Panel shows ~10 Å slices of the holo-PS1 NMR centroid and design in the binding region and folded core, respectively.
- FIGS. 5A-5F Apo- and holo-PS1 share similar folded cores and differ in the binding region.
- FIG. 5A 2D 1H-15N HSQC spectra acquired for apo- and holo-PS1.
- Experimental conditions: 0.78 mM at 298 K, 50 mM NaPi, 100 mM NaCl, pH 7.5, in 5% D2O.
- Resonance assignments are indicated using the one-letter amino acid code. Signals arising from side chains (Asn HD2/ND2, Gln HE2/NE2, Arg HE/NE and Trp HE1/NE1) are also labeled.
- the residues belonging to the binding region and folded core are color-coded as in (FIG. 5B).
- Non-helical residues are labeled in cyan font face.
- the inset in the HSQC spectrum of apo-PS1 shows the chemical shift of the indole proton of Trp68 near 10.2 ppm.
- a dashed box surrounds 90% of the backbone resonances of apo-PS1 and is also placed at the same position in the holo-PS1 spectrum. Arrows point to resonances of residues within the binding region that change dramatically upon binding of the cofactor.
- FIG. 5B Solution NMR structures of apo-PS1 and holo-PS1. The structures were aligned to the backbone of the helical folded core of the lowest energy holo-PS1 model. Terminal residues 1, 108, and 109 are not shown for clarity.
- FIGS. 5D-5F Backbone alignment of the holo- and apo-centroids at the folded core shows, FIG. 5F, agreement of side chain rotamer states far from the binding site and, FIG. 5E, differences in first-shell rotamers (e.g., Trp68, Leu98) accompanied by changes in backbone of the binding region.
- Centroids are from NMR structural ensembles clustered via RMSD of core side chain heavy atoms.
- FIG. 6 PS1 design metrics. PS1 design ensemble resulting from flexible backbone sequence design.
- FIG. 6B Residues (Cα atoms shown as spheres) within the PS1 design that were allowed to vary from the SCRPZ-2 sequence. 40 of the 108 residues were allowed to vary, and, of the 40 residues, 28 were mutated and 12 residues were retained from the original SCRPZ-2 sequence as a result of the computational design process.
- FIGS. 7A-7B Analytical ultracentrifugation and gel filtration analysis show that apo- and holo-PS1 are monomeric in solution.
- FIG. 7A Analytical ultracentrifugation. Solutions of apo- and holo-PS1 were centrifuged at speeds ranging from 25,000 r.p.m. to 45,000 r.p.m. and monitored by absorbance at 280 nm. Parameters were globally fit to the data. Single-species fitting agrees well with the data over the entire range and yields molecular weights of 15.81 ± 0.09 kD for apo-PS1 and 12.24 ± 0.91 kD for holo-PS1, which agrees well with the 12.86 kD weight of PS1.
- FIG. 7B Analytical gel filtration analysis of apo- and holo-PS1. Detection wavelengths are labeled in the same color as their respective curves. Apo shows a small degree ( ⁇ 5%) of dimerization (1.35 ml elution volume) relative to the monomer peak (1.62 ml elution volume). The small peak near 1.05 ml elution volume in holo-PS1 is unbound (excess), aggregated porphyrin eluting in the void volume of the Superdex 75 5/150 column. Samples were run at concentrations of 100 µM and 37 µM for apo and holo, respectively, in 50 mM NaPi, 150 mM NaCl, pH 7.0 buffer.
- FIG. 8 Temperature and GnHCl induced unfolding of apo-PS1.
- FIG. 10 Absorption spectra of (CF3)4PZn/PS1 and (CF3)4PZn/PS2 complexes. Each protein shows 100% porphyrin loading, based on absorbance at 280 nm and 423 nm.
- buffer 100 mM NaCl, 50 mM NaPi, pH 7.5.
- FIG. 11 The NMR structural ensemble of apo-PS1 contains two clusters of conformations, closed and open. Above, color mapping of the pairwise backbone RMSD matrix of each NMR ensemble member of apo-PS1. Apo models with high structural similarity in the region of residues 61-67 and 99-105 (labeled in the open structure shown below) are blue in the plot. Models that are structurally dissimilar (large RMSD) are red in the plot. Below, the model centroids representing the closed and open structures (models 1 and 18, respectively, in the deposited NMR structure). The porphyrin (CF3)4PZn is shown in green, and the holo centroid (orange) is also drawn for comparison.
- FIG. 12 HDX protection factors for apo- and holo-PS1, as described in Table S5. Note that "68 indole" denotes the indole N of Trp68 side chain.
- FIG. 13 Molecular dynamics simulations show the binding region of apo-PS1 is more accessible to solvent. Histogram of number of waters within 3.5 Å of any heavy atom of each buried amino acid side chain (an A or D position of the heptad repeat), from 1000 snapshots of a 1 trajectory of apo-PS1. All histograms are drawn to the same scale and show number of solvating waters normalized by side chain surface area. Binding region shown in light gray, and folded core in dark gray.
- FIG. 14 depicts a flowchart illustrating a process for designing proteins, in accordance with some example embodiments.
- FIG. 15 depicts a block diagram illustrating a computing system, in accordance with some example embodiments.
- FIG. 16 Solution NMR structure of PS1 and computational models of PS1 variants.
- FIG. 17 The PS1 deletion variant binds endogenous heme when expressed in E. coli. Characteristic Soret and Q bands of heme can be seen at 410 and 550 nm in the displayed absorption spectra.
- FIGS. 18A-18B enFold proteins are capable of noncovalently binding endogenous ligands in the cell.
- FIG. 18A Expression in E. coli of the deletion variant of PS1 (PS1 Δ103-109, SEQ ID NO: 8) shows a high loading of endogenous heme in the porphyrin binding site.
- Inset E. coli cultures of after induction).
- FIG. 18B De novo proteins from binary sequence patterning. Previous studies have only been able to incorporate heme exogenously, i.e., after expression and purification, heme is added to the purified apo-protein. See Patel et al., Protein Science, 18: 1388-1400.
- Protein catalysis requires atomic-level orchestration of side chains, substrates, and cofactors, yet the ability to design a small-molecule-binding protein entirely from first principles with a precisely predetermined structure has not been demonstrated.
- The high-resolution structure of holo-PS1 is in sub-Å agreement with the design.
- the structure of apo-PS1 retains the remote core packing of the holo, predisposing a flexible binding region for the desired ligand-binding geometry.
- Our results illustrate the unification of core packing and binding site definition as a central principle of ligand-binding protein design.
- an analog is used in accordance with its plain ordinary meaning within Chemistry and Biology and refers to a chemical compound that is structurally similar to another compound (i.e., a so-called "reference" compound) but differs in composition, e.g., in the replacement of one atom by an atom of a different element, or in the presence of a particular functional group, or the replacement of one functional group by another functional group, or the absolute stereochemistry of one or more chiral centers of the reference compound. Accordingly, an analog is a compound that is similar or comparable in function and appearance but not in structure or origin to a reference compound.
- the terms "a" or "an," as used herein, mean one or more.
- substituted with a[n] means the specified group may be substituted with one or more of any or all of the named substituents.
- a group such as an alkyl or heteroaryl group
- the group may contain one or more unsubstituted C1-C20 alkyls, and/or one or more unsubstituted 2 to 20 membered heteroalkyls.
- a “detectable agent” or “detectable moiety” is a composition detectable by appropriate means such as spectroscopic, photochemical, biochemical, immunochemical, chemical, magnetic resonance imaging, or other physical means.
- useful detectable agents include 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y.
- fluorescent dyes or chromophores
- phosphor e.g., phosphorescent dyes or chromophores
- lumophore luminescent dyes or chromophores
- electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, paramagnetic molecules, paramagnetic nanoparticles, ultrasmall superparamagnetic iron oxide ("USPIO") nanoparticles, USPIO nanoparticle aggregates, superparamagnetic iron oxide ("SPIO") nanoparticles, SPIO nanoparticle aggregates, monocrystalline iron oxide
- USPIO ultrasmall superparamagnetic iron oxide
- SPIO superparamagnetic iron oxide
- Gadolinium chelate
- radioisotopes e.g. carbon-11, nitrogen-13, oxygen-15, fluorine-18, rubidium-82
- fluorodeoxyglucose e.g. fluorine-18 labeled
- any gamma ray emitting radionuclides positron- emitting radionuclide
- radiolabeled glucose e.g. glucose, radiolabeled water, radiolabeled ammonia, biocolloids, microbubbles
- biocolloids e.g.
- microbubble shells including albumin, galactose, lipid, and/or polymers
- microbubble gas core including air, heavy gas(es), perfluorocarbon, nitrogen, octafluoropropane, perflexane lipid microsphere, perflutren, etc.
- iodinated contrast agents e.g.
- a detectable moiety is a monovalent detectable agent or a detectable agent capable of forming a bond with another composition.
- Radioactive substances e.g., radioisotopes
- Radioactive substances include, but are not limited to, 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y, 89Sr, 89Zr, 94Tc, 94Tc, 99mTc, 99Mo, 105Pd, 105Rh, 111Ag, 111In, 123I, 124I, 125I, 131I, 142Pr, 143Pr, 149Pm, 153Sm, 154-158Gd, 161Tb, 166Dy, 166Ho, 169Er, 175Lu, 177Lu, 186Re, 188Re, 189Re, 194Ir, 198Au, 199Au, 211At, 211Pb, 212Bi,
- Paramagnetic ions that may be used as additional imaging agents in accordance with the embodiments of the disclosure include, but are not limited to, ions of transition and lanthanide metals (e.g. metals having atomic numbers of 21-29, 42, 43, 44, or 57-71). These metals include ions of Cr, V, Mn, Fe, Co, Ni, Cu, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu.
- amino acid residue in a protein "corresponds" to a given residue when it occupies the same essential structural position within the protein as the given residue.
- the term "isolated", when applied to a nucleic acid or protein, denotes that the nucleic acid or protein is essentially free of other cellular components with which it is associated in the natural state. It can be, for example, in a homogeneous state and may be in either a dry or aqueous solution. Purity and homogeneity are typically determined using analytical chemistry techniques such as polyacrylamide gel electrophoresis or high performance liquid
- amino acid refers to naturally occurring and synthetic amino acids, as well as amino acid analogs and amino acid mimetics that function in a manner similar to the naturally occurring amino acids.
- Naturally occurring amino acids are those encoded by the genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine.
- Amino acid analogs refers to compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid.
- Amino acid mimetics refers to chemical compounds that have a structure that is different from the general chemical structure of an amino acid, but that function in a manner similar to a naturally occurring amino acid.
- non-naturally occurring amino acid and “unnatural amino acid” refer to amino acid analogs, synthetic amino acids, and amino acid mimetics, which are not found in nature.
- Amino acids may be referred to herein by either their commonly known three letter symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical
- polypeptide refers to a polymer of amino acid residues, wherein the polymer may in embodiments be conjugated to a moiety that does not consist of amino acids.
- the terms apply to amino acid polymers in which one or more amino acid residue is an artificial chemical mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers.
- a “fusion protein” refers to a chimeric protein encoding two or more separate protein sequences that are recombinantly expressed as a single moiety. In embodiments, the protein includes at least 30 amino acid residues.
- a protein may be characterized as having a protein backbone.
- a "protein backbone" is used herein in accordance with its ordinary meaning and refers to the polymer of amino acid residues that create a continuous chain. For example, a protein backbone may refer to the series
- each R independently represents optionally different amino acid side chains.
- the protein backbone includes core amino acid residues and ligand binding amino acid residues. In embodiments, the protein backbone includes core amino acid residues. In embodiments, the protein backbone includes ligand binding amino acid residues.
- As may be used herein, the terms "nucleic acid," "nucleic acid molecule," "nucleic acid oligomer," "oligonucleotide," "nucleic acid sequence," "nucleic acid fragment" and "polynucleotide" are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides covalently linked together that may have various lengths, either deoxyribonucleotides or ribonucleotides, or analogs, derivatives or modifications thereof.
- polynucleotides may have different three-dimensional structures, and may perform various functions, known or unknown.
- Non-limiting examples of polynucleotides include a gene, a gene fragment, an exon, an intron, intergenic DNA (including, without limitation, heterochromatic DNA), messenger RNA (mRNA), transfer RNA, ribosomal RNA, a ribozyme, cDNA, a recombinant polynucleotide, a branched polynucleotide, a plasmid, a vector, isolated DNA of a sequence, isolated RNA of a sequence, a nucleic acid probe, and a primer.
- Polynucleotides useful in the methods of the disclosure may comprise natural nucleic acid sequences and variants thereof, artificial nucleic acid sequences, or a combination of such sequences.
- a polynucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA).
- A adenine
- C cytosine
- G guanine
- T thymine
- U uracil
- polynucleotide sequence is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching.
- Polynucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleo
- "Conservatively modified variants" applies to both amino acid and nucleic acid sequences.
- “conservatively modified variants” refers to those nucleic acids that encode identical or essentially identical amino acid sequences. Because of the degeneracy of the genetic code, a number of nucleic acid sequences will encode any given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every position where an alanine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are "silent variations,” which are one species of conservatively modified variations.
- Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation of the nucleic acid.
- each codon in a nucleic acid except AUG, which is ordinarily the only codon for methionine, and TGG, which is ordinarily the only codon for tryptophan
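The idea of a silent variation can be made concrete with a minimal sketch. The table below is deliberately partial: it holds only the four alanine codons named in the text, AUG for methionine, and the tryptophan codon (written UGG in the RNA alphabet; the text gives its DNA spelling, TGG). The function names are illustrative and not taken from the source.

```python
# Partial codon table in the RNA alphabet; a full table has 64 entries.
CODON_TABLE = {
    "GCA": "Ala", "GCC": "Ala", "GCG": "Ala", "GCU": "Ala",
    "AUG": "Met",
    "UGG": "Trp",
}


def to_rna(codon):
    """Normalize a codon to upper-case RNA letters (T becomes U)."""
    return codon.upper().replace("T", "U")


def synonymous_codons(codon):
    """All codons in the partial table that encode the same amino acid."""
    amino_acid = CODON_TABLE[to_rna(codon)]
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def is_silent(codon_a, codon_b):
    """True when two different codons encode the same amino acid."""
    a, b = to_rna(codon_a), to_rna(codon_b)
    return a != b and CODON_TABLE[a] == CODON_TABLE[b]


print(synonymous_codons("GCC"))  # the four alanine codons
print(synonymous_codons("TGG"))  # tryptophan has a single codon
print(is_silent("GCA", "GCU"))
```

Methionine and tryptophan each return a single codon, which is why the text singles them out as exceptions.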
- amino acid sequences one of skill will recognize that individual substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded sequence is a "conservatively modified variant" where the alteration results in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. Such conservatively modified variants are in addition to and do not exclude polymorphic variants, interspecies homologs, and alleles of the disclosure.
- the following eight groups each contain amino acids that are conservative substitutions for one another: 1) Alanine (A), Glycine (G); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W); 7) Serine (S), Threonine (T); and 8) Cysteine (C), Methionine (M) (see, e.g., Creighton, Proteins (1984)).
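The eight groups can be encoded directly as a lookup. The following is a minimal sketch using one-letter codes; the function name is illustrative, and a substitution is treated as conservative when both residues appear together in at least one group. Methionine belongs to two groups (5 and 8), so it pairs with both the aliphatic residues and cysteine.

```python
# The eight conservative substitution groups listed above, as one-letter codes.
CONSERVATIVE_GROUPS = [
    set("AG"),    # 1) Alanine, Glycine
    set("DE"),    # 2) Aspartic acid, Glutamic acid
    set("NQ"),    # 3) Asparagine, Glutamine
    set("RK"),    # 4) Arginine, Lysine
    set("ILMV"),  # 5) Isoleucine, Leucine, Methionine, Valine
    set("FYW"),   # 6) Phenylalanine, Tyrosine, Tryptophan
    set("ST"),    # 7) Serine, Threonine
    set("CM"),    # 8) Cysteine, Methionine
]


def is_conservative(original, replacement):
    """True when two different residues share at least one group."""
    a, b = original.upper(), replacement.upper()
    return a != b and any(a in group and b in group for group in CONSERVATIVE_GROUPS)


print(is_conservative("D", "E"))  # same group (2)
print(is_conservative("M", "C"))  # same group (8)
print(is_conservative("L", "C"))  # no shared group
```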
- Percentage of sequence identity is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide or polypeptide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence identity.
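The calculation described above reduces to a short function once the alignment is given. This is a minimal sketch that assumes the two sequences have already been optimally aligned (the alignment itself is not performed here) and that gaps are written as "-"; the function name and example sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of window positions at which both sequences carry the same residue.

    Both inputs must be pre-aligned strings of equal length; the comparison
    window is taken to be the full length of the alignment.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A matched position has the identical residue in both sequences (gaps never match).
    matched = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matched / len(aligned_a)


# Five of six window positions match; the fourth position is a gap.
print(percent_identity("ACDEFG", "ACD-FG"))
```

Gapped positions count toward the window length but never toward the matches, which follows the wording "total number of positions in the window of comparison".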
- the terms "identical" or percent "identity," in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using the BLAST or BLAST 2.0 sequence comparison algorithms with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site http://www.ncbi.nlm.nih.gov/BLAST/ or the like).
- Such sequences are then said to be "substantially identical".
- This definition also refers to, or may be applied to, the complement of a test sequence.
- the definition also includes sequences that have deletions and/or additions, as well as those that have substitutions.
- the preferred algorithms can account for gaps and the like.
- identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length.
- amino acid or nucleotide base "position" is denoted by a number that sequentially identifies each amino acid (or nucleotide base) in the reference sequence based on its position relative to the N-terminus (or 5'-end). Due to deletions, insertions, truncations, fusions, and the like that must be taken into account when determining an optimal alignment, in general the amino acid residue number in a test sequence determined by simply counting from the N-terminus will not necessarily be the same as the number of its corresponding position in the reference sequence. For example, in a case where a variant has a deletion relative to an aligned reference sequence, there will be no amino acid in the variant that corresponds to a position in the reference sequence at the site of deletion.
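The difference between counting from the N-terminus and reading a corresponding reference position can be seen in a minimal sketch. It assumes a precomputed pairwise alignment with gaps written as "-"; the function name and the toy sequences are illustrative only.

```python
def map_reference_positions(aligned_ref, aligned_var, gap="-"):
    """Map each 1-based reference position to the 1-based variant position.

    Reference positions that fall on a deletion in the variant map to None.
    """
    if len(aligned_ref) != len(aligned_var):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = var_pos = 0
    for ref_char, var_char in zip(aligned_ref, aligned_var):
        if var_char != gap:
            var_pos += 1  # count residues actually present in the variant
        if ref_char != gap:
            ref_pos += 1  # count residues actually present in the reference
            mapping[ref_pos] = var_pos if var_char != gap else None
    return mapping


# The variant lacks the residue at reference position 3.
print(map_reference_positions("MKTAYI", "MK-AYI"))
```

In this toy case reference position 4 corresponds to variant residue 3, and reference position 3 has no counterpart, matching the deletion example in the text.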
- amino acid side chain refers to the functional substituent contained on amino acids.
- an amino acid side chain may be the side chain of a naturally occurring amino acid.
- Naturally occurring amino acids are those encoded by the genetic code (e.g., alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, or valine), as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine.
- the amino acid side chain may be a non-natural amino acid side chain.
- non-natural amino acid side chain refers to the functional substituent of compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium, allylalanine, 2-aminoisobutyric acid.
- Non-natural amino acids are non-proteinogenic amino acids that either occur naturally or are chemically synthesized.
- Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid.
- Non-limiting examples include exo-cis-3-aminobicyclo[2.2.1]hept-5-ene-2-carboxylic acid hydrochloride, cis-2-aminocycloheptanecarboxylic acid hydrochloride, cis-6-amino-3-cyclohexene-1-carboxylic acid hydrochloride, cis-2-amino-2-methylcyclohexanecarboxylic acid hydrochloride, cis-2-amino-2-methylcyclopentanecarboxylic acid hydrochloride, 2-(Boc-aminomethyl)benzoic acid, 2-(Boc-amino)octanedioic acid, Boc-4,5-dehydro-Leu-OH (dicyclohexylammonium), Boc-4-(Fmo
- expression includes any step involved in the production of the polypeptide including, but not limited to, transcription, post-transcriptional modification, translation, post-translational modification, and secretion. Expression can be detected using conventional techniques for detecting protein (e.g., ELISA, Western blotting, flow cytometry,
- Control or "control experiment” is used in accordance with its plain ordinary meaning and refers to an experiment in which the subjects or reagents of the experiment are treated as in a parallel experiment except for omission of a procedure, reagent, or variable of the experiment.
- the control is used as a standard of comparison in evaluating experimental effects.
- a control is the measurement of the activity of a protein in the absence of a compound as described herein (including embodiments and examples).
- the term "about" means a range of values including the specified value, which a person of ordinary skill in the art would consider reasonably similar to the specified value. In embodiments, about means within a standard deviation using measurements generally acceptable in the art. In embodiments, about means a range extending to +/- 10% of the specified value. In embodiments, about means the specified value.
- the terms "bind" and "bound" as used herein are used in accordance with their plain and ordinary meaning and refer to the association between atoms or molecules. The association can be direct or indirect. For example, bound atoms or molecules may be bound directly, e.g., by covalent bond or linker (e.g., a first linker or second linker), or indirectly, e.g., by non-covalent bond (e.g., electrostatic interactions (e.g., ionic bond, hydrogen bond, halogen bond), van der Waals interactions (e.g., dipole-dipole, dipole-induced dipole, London dispersion), ring stacking (pi or hydrophobic effects), hydrophobic interactions and the like).
- set of ligand binding amino acid residues refers to at least two ligand binding amino acid residues.
- Ligand binding amino acid residues refer to amino acid residues which are capable of binding (e.g., have a measurable dissociation constant of binding, have a dissociation constant of binding less than 1 µM) to a ligand.
- the ligand binding amino acid residues refer to amino acid residues which bind to a ligand.
- Each ligand binding amino acid residue is associated with a set of ligand binding amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, or spherical coordinates) which defines the ligand binding amino acid residue in space (e.g., Euclidean space).
- ligand binding amino acid residues refer to amino acid residues within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or 30 Å from the ligand.
- ligand binding amino acid residues refer to amino acid residues within about 5 Å from the ligand. In determining the set of ligand binding amino acid residues, factors such as the proximity of the amino acid to the ligand or the interactions between the amino acid and the ligand may influence the designation as a "ligand binding amino acid residue."
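The proximity criterion above can be sketched as a simple distance cutoff. This is a minimal illustration, assuming coordinates are Cartesian and expressed in Å; the function name `residues_near_ligand`, the dictionary layout, and the residue labels are hypothetical, and the sketch considers proximity only, not the interaction types that may also inform the designation.

```python
import math

def residues_near_ligand(residue_atoms, ligand_atoms, cutoff=5.0):
    """Return IDs of residues with any atom within `cutoff` Å of any ligand atom.

    residue_atoms: dict mapping a residue ID to a list of (x, y, z) tuples.
    ligand_atoms:  list of (x, y, z) tuples for the ligand.
    """
    selected = []
    for residue_id, atoms in residue_atoms.items():
        # Shortest atom-to-atom distance between this residue and the ligand.
        min_distance = min(
            math.dist(atom, ligand_atom)
            for atom in atoms
            for ligand_atom in ligand_atoms
        )
        if min_distance <= cutoff:
            selected.append(residue_id)
    return selected
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a single small protein; a spatial grid or k-d tree would be the usual choice for larger systems.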
- dissociation constant is used in accordance with its plain ordinary meaning and refers to the ligand concentration at which half of the proteins are occupied (i.e. bound to a ligand) at equilibrium.
- the dissociation constant has molar units (M).
- M molar units
- nM nanomolar
- µM micromolar
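The definition of the dissociation constant as the ligand concentration at half occupancy can be checked with a short sketch. It assumes simple 1:1 binding at equilibrium and that the concentration passed in is the free ligand concentration; the function name `fraction_occupied` is illustrative only.

```python
def fraction_occupied(free_ligand_molar: float, kd_molar: float) -> float:
    """Fraction of protein bound to ligand for 1:1 binding at equilibrium.

    Uses fraction = [L] / (Kd + [L]), with both quantities in molar units (M).
    """
    if free_ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_ligand_molar / (kd_molar + free_ligand_molar)
```

With this expression, a ligand concentration equal to the dissociation constant gives half occupancy, and a lower dissociation constant gives higher occupancy at the same ligand concentration.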
- "ligand" and "cofactor" are synonymous, and used in accordance with their plain ordinary meaning in chemistry and biochemistry and refer to an agent (e.g., compound, metal, ion, biomolecule, agonist, antagonist) which is capable of binding to a protein (e.g., a protein described herein).
- a ligand refers to an agent (e.g., compound, metal, ion, biomolecule) which binds (e.g., covalently or non-covalently) to a protein.
- upon binding, the ligand has an effect on the protein (e.g., structural change of the protein, modulation of signaling pathways).
- a ligand is associated with a set of ligand atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which define the ligand in space (e.g., Euclidean space).
- the ligand may be endogenous or exogenous.
- Non-limiting examples of ligands include a catalyst, detectable agent, therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic (e.g., a combined therapeutic and diagnostic agent), photodynamic therapy (PDT) agent, porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component that is capable of binding a metal ion.
- MRI magnetic resonance imaging
- PET positron emission tomography
- theranostic e.g., a combined therapeutic and diagnostic agent
- PDT photodynamic therapy
- the ligand is a peptide (e.g., 2 to 30 amino acid residues), a protein (e.g., greater than 30 amino acid residues), a small molecule (e.g., a compound with a molecular weight of less than 2000 Daltons), or a small molecule-metal-ion complex (e.g., a metalloporphyrin).
- the ligand is endogenous.
- the ligand is exogenous.
- the ligand is flavin.
- the ligand is heme.
- set of core amino acid residues refers to at least two core amino acid residues.
- Core amino acid residues refer to amino acid residues which are incapable of binding to a ligand (e.g., do not have a measurable dissociation constant of binding, do not have a dissociation constant of binding less than 1 µM).
- core amino acids are amino acids which do not bind a ligand.
- Each core amino acid residue is associated with a set of core amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which defines the core amino acid residue in space (e.g., Euclidean space).
- Core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
- a typical set of core amino acid residues contains at least 6 amino acid residues.
- the set of core amino acid residues includes amino acid residues which are solvent inaccessible as measured by the accessible surface area. Additional information regarding the accessible surface area assessment may be found in Lins et al. (Lins, L., Thomas, A., & Brasseur, R. (2003) Protein Science: A Publication of the Protein Society, 12(7), 1406-141), which is incorporated herein in its entirety for all purposes.
- the core amino acid residue atomic coordinates are greater than 5 Å from any ligand atomic coordinate.
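The burial and distance criteria for core residues can be combined into a simple predicate. This is a sketch under stated assumptions: the accessible and reference surface areas are taken as precomputed by an external solvent-accessibility tool (for example with a 1.8 Å probe), "75% inaccessible" is read as a relative accessibility of at most 25%, and applying both criteria together is a choice made for this example. The name `is_core_residue` and its parameters are hypothetical.

```python
def is_core_residue(
    accessible_area: float,
    reference_area: float,
    min_ligand_distance: float,
    min_buried_fraction: float = 0.75,
    ligand_cutoff: float = 5.0,
) -> bool:
    """Classify a residue as core from burial and distance to the ligand.

    accessible_area:     solvent-accessible area of the residue in the protein (Å^2).
    reference_area:      accessible area of the same residue when fully exposed (Å^2).
    min_ligand_distance: shortest distance from any residue atom to any ligand atom (Å).
    """
    if reference_area <= 0:
        raise ValueError("reference_area must be positive")
    buried_fraction = 1.0 - accessible_area / reference_area
    is_buried = buried_fraction >= min_buried_fraction
    is_far_from_ligand = min_ligand_distance > ligand_cutoff
    return is_buried and is_far_from_ligand
```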
- the set of core amino acid residues is hydrophobic.
- the core amino acids include the sequence:
- LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO: 5).
- Optimizing may employ iterative or heuristic algorithms, such as simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, simulated annealing algorithm, Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
- optimizing typically includes evaluating an energy function (e.g., force field model) and finding the minimum (e.g., global minimum or local minimum).
- Optimizing may include repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates).
- the output of an optimization process may provide a set of ligand binding amino acid residues and a corresponding set of ligand binding amino acid residue atomic coordinates, and a set of core amino acid residues and a corresponding set of core amino acid residue atomic coordinates, which corresponds to an energetically stabilized protein.
- the outcome of the optimization is the global minimum (e.g., the most energetically stabilized protein).
- the outcome of the optimization is a local minimum (e.g., a minimum energy given the domain).
- the optimization is complete when the derivative of the energy with respect to the position of the atoms, $\partial V / \partial r$, is zero and the Hessian matrix has positive eigenvalues.
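The completion criterion above (zero gradient, Hessian with positive eigenvalues) can be tested numerically. The sketch below is one possible check, assuming the energy is available as a Python function of a flat coordinate list; it uses central finite differences and a Cholesky factorization, which succeeds exactly when a symmetric matrix has all positive eigenvalues. All function names and tolerances are assumptions for illustration.

```python
import math

def gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

def hessian(f, x, h=1e-4):
    """Central-difference estimate of the Hessian matrix of f at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pp, pm, mp, mm = list(x), list(x), list(x), list(x)
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            hess[i][j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return hess

def is_positive_definite(matrix, tol=1e-8):
    """True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False  # a non-positive pivot means a non-positive eigenvalue
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True

def is_local_minimum(f, x, grad_tol=1e-4):
    """Check the two conditions: near-zero gradient and positive-definite Hessian."""
    if any(abs(g) > grad_tol for g in gradient(f, x)):
        return False
    return is_positive_definite(hessian(f, x))
```

In practice an exact zero gradient is never reached, so a tolerance such as `grad_tol` stands in for "is zero"; the value used here is arbitrary.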
- optimizing includes a plurality of minimization calculations.
- the optimization includes a finite number of iterations.
- An energy minimization calculation refers to the process of evaluating the energy as a function of the atomic coordinates, V(r).
- the energy function may include intra- and
- $V_{\text{total}}(r) = V_{\text{bonds}}(r) + V_{\text{angles}}(r) + V_{\text{dihedral}}(r) + V_{\text{improper}}(r) + V_{\text{nonbonding}}(r) + V_{\text{electrostatics}}(r)$;
- $V_{\text{total}}(r)$ corresponds to the total energy as a function of the atomic positions
- $V_{\text{bonds}}(r)$ corresponds to the energy contribution from bonded atoms
- $V_{\text{angles}}(r)$ corresponds to the energy contribution from angles
- $V_{\text{dihedral}}(r)$ corresponds to the energy contribution from dihedral torsions
- $V_{\text{improper}}(r)$ corresponds to the energy contribution from out-of-plane torsions
- $V_{\text{nonbonding}}(r)$ corresponds to the energy contribution from nonbonding interactions
- $V_{\text{electrostatics}}(r)$ corresponds to the energy contribution from electrostatic interactions.
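The additive structure of the total energy can be sketched as a sum of independent term functions. The text names the terms but does not give their functional forms, so the harmonic bond and Coulomb expressions below are common molecular-mechanics choices assumed for illustration, as are the function names, the parameter layout, and the Coulomb constant of roughly 332.0637 kcal·Å/(mol·e²).

```python
import math
from functools import partial

def harmonic_bond_energy(coords, bonds):
    """Sum of 0.5 * k * (r - r0)^2 over bonds given as (i, j, k, r0)."""
    return sum(
        0.5 * k * (math.dist(coords[i], coords[j]) - r0) ** 2
        for i, j, k, r0 in bonds
    )

def coulomb_energy(coords, charges, pairs, coulomb_constant=332.0637):
    """Sum of C * q_i * q_j / r over atom pairs (kcal/mol, Å, elementary charges)."""
    return sum(
        coulomb_constant * charges[i] * charges[j] / math.dist(coords[i], coords[j])
        for i, j in pairs
    )

def total_energy(coords, terms):
    """V_total(r): the sum of every energy term evaluated at the same coordinates."""
    return sum(term(coords) for term in terms)

# Hypothetical two-atom system: one stretched bond and one charge pair.
example_terms = [
    partial(harmonic_bond_energy, bonds=[(0, 1, 100.0, 1.0)]),
    partial(coulomb_energy, charges=[1.0, -1.0], pairs=[(0, 1)]),
]
example_energy = total_energy([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)], example_terms)
```

Keeping each contribution as a separate callable makes it straightforward to append further terms, such as a penalty function, without changing `total_energy`.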
- Additional energy function terms may also be included in the total energy function, Vtotal(r), for example additional functions from molecular mechanics, functions from structural bioinformatics (log-odds scores), amino acid sidechain packing functions (e.g., functions and algorithms which vary the identity and rotamer of an amino acid side chain), protein radius of gyration functions, or a penalty function.
- biomolecule refers to a molecule present in living organisms (e.g., proteins, carbohydrates, lipids, and nucleic acids, metabolites) and may be endogenous or exogenous in origin.
- An energetically stabilized protein is thermodynamically stable relative to the protein that has not been energetically stabilized.
- An energetically stabilized protein may be determined to be energetically stabilized by determining the difference in the Gibbs free energy between the folded and unfolded states of the protein, also referred to herein as ΔGfolding.
- the energetically stabilized protein is an enzyme.
- the energetically stabilized protein is an apo protein (e.g., a protein that is not bound to a ligand).
- the energetically stabilized protein is a holo protein (e.g., a protein that is bound to a ligand).
- the energetically stabilized protein is an apo protein which is capable of becoming a holo protein upon ligand binding.
- an energetically stabilized protein refers to a protein which is capable of performing a function (e.g., modulating a signal pathway).
- the energetically stabilized protein resists side-reactions such as aggregation and proteolysis.
- the energetically stabilized protein has a ΔGfolding of about -5 to about -40 kcal/mol in standard physiological conditions (e.g., temperature range of 20-40 degrees Celsius, atmospheric pressure of 1, pH of 6-8, glucose concentration of 1-20 mM, atmospheric oxygen concentration).
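To give a sense of scale for the stated ΔGfolding range, the sketch below converts a folding free energy into the equilibrium fraction of folded protein. It assumes a two-state (folded/unfolded) model, which the text does not state, with ΔGfolding defined as folded minus unfolded so that negative values favor folding. The function name and the default temperature of 298.15 K are illustrative.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal/(mol*K)

def fraction_folded(delta_g_folding_kcal: float, temperature_kelvin: float = 298.15) -> float:
    """Equilibrium folded fraction for a two-state model.

    K_fold = exp(-dG / RT), so fraction folded = K / (1 + K) = 1 / (1 + exp(dG / RT)).
    """
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    rt = GAS_CONSTANT_KCAL * temperature_kelvin
    return 1.0 / (1.0 + math.exp(delta_g_folding_kcal / rt))
```

Under these assumptions, even the least stable end of the stated range (about -5 kcal/mol) corresponds to a population that is almost entirely folded at room temperature.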
- exogenous refers to a molecule or substance (e.g., a compound, ligand, or protein) that originates from outside a given cell or organism.
- a "therapeutic agent" as used herein refers to an agent (e.g., compound or composition) that when administered to a subject in sufficient amounts will have a therapeutic effect, such as an intended prophylactic effect, preventing or delaying the onset (or reoccurrence) of an injury, disease, pathology or condition, or reducing the likelihood of the onset (or reoccurrence) of an injury, disease, pathology, or condition, or their symptoms or the intended therapeutic effect, e.g., treatment or amelioration of an injury, disease, pathology or condition, or their symptoms including any objective or subjective parameter of treatment such as abatement; remission;
- small molecule refers, unless indicated otherwise, to a molecule having a molecular weight of less than about 700 Dalton, e.g., less than about 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 100, or 50 Dalton.
- a computer-implemented method including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
- optimization is performed to improve, relative to a control, the protein-ligand interactions (e.g., decrease the dissociation constant of binding 1-fold, 2-fold, 3-fold, 4-fold or 5-fold).
- the optimization modulates, relative to a control, the non-covalent interactions between the protein and the ligand.
- step c) includes simultaneously optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes concurrently (e.g., performing an optimization iteration on all sets prior to continuing the optimization) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates.
- the optimizing is joint optimizing (e.g., optimizing the set of ligand binding amino acid residues, the set of core amino acid residues, and optionally the ligand simultaneously).
- step c) includes optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates.
- step c) includes optimizing the set of ligand binding amino acid residues and the set of core amino acid residues.
- step c) includes optimizing the set of ligand binding amino acid residues and the set of ligand binding amino acid residue atomic coordinates.
- step c) includes optimizing the set of core amino acid residues and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of ligand binding amino acid residue atomic coordinates and the set of core amino acid residue atomic coordinates.
- step c) includes optimizing the protein backbone.
- Optimizing the protein backbone may refer to repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate, but not the side chain of the residue), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate, but not the side chain of the residue (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates).
- step c) includes simultaneously optimizing the protein backbone and the set of ligand binding amino acid residues. In embodiments, step c) includes simultaneously optimizing the protein backbone and the ligand. In embodiments, step c) includes simultaneously optimizing the protein backbone and the set of core amino acid residues. In embodiments, step c) includes optimizing the protein backbone using known conformational sampling techniques in the art (e.g., rigid-body shifts of helices, backrub algorithms, or crankshaft algorithms). In embodiments, step c) is performed using a protein modeling software suite (e.g., Rosetta). In embodiments, step c) includes an ensemble (e.g., a finite set of proteins, which includes amino acid residue atomic coordinates) of backbones for conformational sampling calculations.
- step c) includes fixing (e.g., not geometrically displacing) an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
- step c) includes fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing an atomic coordinate of at least one ligand atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of the ligand atomic coordinate. In embodiments, step c) includes prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
- step c) includes prohibiting introduction of an additional amino acid residue into the set of core amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of core amino acid residues. In embodiments, the method includes distance and angle constraints (i.e. specifying the distance of a ligand to an amino acid (e.g., a ligand binding amino acid residue) coordinate).
- the optimizing includes fixing (e.g., not geometrically displacing) at least one atomic coordinate of the ligand atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing any atomic coordinates of the core amino acid residue atomic
- the optimizing includes fixing an angle formed by three atoms (e.g., angles formed between atoms of the ligand and the ligand binding amino acid residues) or fixing the distance between atoms (e.g., at least one atomic coordinate of the ligand and at least one atomic coordinate of the ligand binding amino acid residue).
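Distance and angle constraints between ligand atoms and binding-residue atoms can be measured and scored as follows. Note the substitution: the text describes fixing these quantities, whereas this sketch uses a harmonic penalty (a soft restraint) of the kind that could serve as a penalty function in the energy. The function names, the weight, and the example coordinates are hypothetical.

```python
import math

def angle_degrees(a, b, c):
    """Angle at vertex b, in degrees, formed by the three points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm1 = math.sqrt(sum(p * p for p in v1))
    norm2 = math.sqrt(sum(q * q for q in v2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))

def harmonic_penalty(value, target, weight):
    """Penalty that grows quadratically as `value` departs from `target`."""
    return weight * (value - target) ** 2

def restraint_energy(ligand_atom, residue_atom, third_atom,
                     target_distance, target_angle,
                     distance_weight=10.0, angle_weight=0.01):
    """Combined penalty for one distance and one angle restraint."""
    distance = math.dist(ligand_atom, residue_atom)
    angle = angle_degrees(ligand_atom, residue_atom, third_atom)
    return (harmonic_penalty(distance, target_distance, distance_weight)
            + harmonic_penalty(angle, target_angle, angle_weight))
```

A hard constraint would instead hold the chosen coordinates unchanged during each update; the soft form lets the optimizer trade a small violation against other energy terms.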
- the optimizing includes an iterative or heuristic algorithm. In embodiments, the optimizing includes an iterative algorithm. In embodiments, the optimizing includes a heuristic algorithm. In embodiments, the optimizing includes a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm. In embodiments, the optimizing includes a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm. In embodiments, the optimizing includes knobs-into-holes side chain packing. In embodiments, the optimization may begin with an idealized, parameterized backbone. In embodiments, optimization may relax the backbone structure of the protein, for example, by using gradient descent algorithms, while optimizing the protein sequence via rotamer sampling and minimization.
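Of the heuristic algorithms listed above, simulated annealing with Metropolis acceptance is compact enough to sketch. This example optimizes a residue sequence against a toy energy and is not the scoring used by any real design package; the energy function, the five-letter alphabet, the cooling schedule, and all names are assumptions made for illustration.

```python
import math
import random

def toy_core_energy(sequence):
    """Hypothetical energy: one unit per residue that is not A, V, or L."""
    return sum(1 for residue in sequence if residue not in "AVL")

def mutate_one_position(sequence, rng, alphabet="AVLDK"):
    """Propose a new sequence by re-drawing the residue at one random position."""
    position = rng.randrange(len(sequence))
    mutated = list(sequence)
    mutated[position] = rng.choice(alphabet)
    return tuple(mutated)

def simulated_annealing(initial_state, energy_fn, propose_fn,
                        start_temp=2.0, end_temp=0.01, steps=2000, seed=0):
    """Minimize energy_fn with Metropolis acceptance and geometric cooling."""
    rng = random.Random(seed)
    state, energy = initial_state, energy_fn(initial_state)
    best_state, best_energy = state, energy
    for step in range(steps):
        # Temperature decays geometrically from start_temp to end_temp.
        temperature = start_temp * (end_temp / start_temp) ** (step / max(steps - 1, 1))
        candidate = propose_fn(state, rng)
        candidate_energy = energy_fn(candidate)
        delta = candidate_energy - energy
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, energy = candidate, candidate_energy
            if energy < best_energy:
                best_state, best_energy = state, energy
    return best_state, best_energy
```

Tracking the best state separately from the current state means the returned energy can never be worse than the starting energy, even though individual uphill moves are accepted along the way.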
- the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, a geometric
- the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues (e.g., designating an amino acid residue previously designated as a core amino acid residue to a ligand binding amino acid residue). In embodiments, the optimizing includes replacing a ligand binding amino acid residue within the set of ligand binding amino acid residues. In embodiments, the optimizing includes deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues (e.g., designating an amino acid residue previously designated as a ligand binding amino acid residue to a core amino acid residue).
- the optimizing includes a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of at least one of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation (i.e., a geometric transformation that moves a coordinate by the same distance in a given direction) or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
- the geometric transformation includes a translation (e.g., displacing the x coordinate) of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates.
- the geometric transformation includes a translation of all atomic coordinates (e.g., x, y, and z coordinates in Cartesian space) of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least three atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of all atomic coordinates of the ligand binding amino acid residue atomic coordinates.
- the optimizing includes a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- the geometric transformation includes a translation or a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- the geometric transformation includes a translation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- the geometric transformation includes a translation of at least two atomic coordinates of the core amino acid residue atomic coordinates.
- the geometric transformation includes a translation of all atomic coordinates of the core amino acid residue atomic coordinates.
- the geometric transformation includes a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- the geometric transformation includes a rotation of at least two atomic coordinates of the core amino acid residue atomic coordinates.
- the geometric transformation includes a rotation of at least three atomic coordinates of the core amino acid residue atomic coordinates.
- the geometric transformation includes a rotation of all atomic coordinates of the core amino acid residue atomic coordinates.
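The translations and rotations described above act on Cartesian atomic coordinates as follows. This minimal sketch assumes coordinates are stored as (x, y, z) tuples and, for brevity, rotates only about an axis parallel to z; the function names are illustrative.

```python
import math

def translate(coords, shift):
    """Move every coordinate by the same displacement vector."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]

def rotate_about_z(coords, angle_degrees, origin=(0.0, 0.0, 0.0)):
    """Rotate coordinates counterclockwise about a z-parallel axis through `origin`."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    ox, oy, _ = origin
    rotated = []
    for x, y, z in coords:
        rel_x, rel_y = x - ox, y - oy  # position relative to the rotation axis
        rotated.append((ox + cos_t * rel_x - sin_t * rel_y,
                        oy + sin_t * rel_x + cos_t * rel_y,
                        z))
    return rotated
```

Passing a subset of a residue's atoms applies the transformation to "at least one" coordinate, while passing all of them moves the residue as a rigid body.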
- the optimizing includes 1a) calculating the force on each atom in the protein (e.g., the set of ligand binding amino acid residues; the set of core amino acid residues; and the ligand); 2a) evaluating the calculation to determine if it is the minimum or below an acceptable threshold; 3a) if the force is less than a threshold, the optimization is finished, otherwise perform a geometric transformation (e.g., translation) of at least one atomic coordinate on the atoms in the protein; and 4a) repeat.
- the geometric transformation of at least one atomic coordinate includes no greater than a 6 Å displacement of any atomic coordinate.
- the geometric transformation of at least one atomic coordinate includes no greater than a 3 Å displacement of any atomic coordinate.
- the displacement is no greater than 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 Å displacement of any atomic coordinate. In embodiments, the displacement is no greater than 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 Å displacement of any atomic coordinate.
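Steps 1a) to 4a) together with a per-step displacement limit can be sketched as a capped steepest-descent loop. The choice of steepest descent, the step size, the threshold, and the two-atom spring used as a force model are assumptions for this example; a real calculation would obtain forces from the full energy function.

```python
import math

def spring_forces(coords, k=10.0, r0=1.5):
    """Hypothetical force model: two atoms joined by a harmonic spring."""
    (x1, y1, z1), (x2, y2, z2) = coords
    r = math.dist(coords[0], coords[1])
    magnitude = k * (r - r0)  # positive when stretched, pulling the atoms together
    ux, uy, uz = (x2 - x1) / r, (y2 - y1) / r, (z2 - z1) / r
    force_on_first = (magnitude * ux, magnitude * uy, magnitude * uz)
    force_on_second = (-force_on_first[0], -force_on_first[1], -force_on_first[2])
    return [force_on_first, force_on_second]

def minimize(coords, force_fn, force_threshold=1e-3, step_size=0.01,
             max_displacement=0.1, max_iterations=10000):
    """Move atoms along their forces until the largest force is below a threshold."""
    coords = [tuple(c) for c in coords]
    for iteration in range(max_iterations):
        forces = force_fn(coords)                                 # step 1a
        largest = max(math.sqrt(fx * fx + fy * fy + fz * fz)
                      for fx, fy, fz in forces)                   # step 2a
        if largest < force_threshold:                             # step 3a: finished
            return coords, iteration
        updated = []
        for (x, y, z), (fx, fy, fz) in zip(coords, forces):
            dx, dy, dz = step_size * fx, step_size * fy, step_size * fz
            length = math.sqrt(dx * dx + dy * dy + dz * dz)
            if length > max_displacement:
                # Cap the translation so no atom moves more than max_displacement Å.
                scale = max_displacement / length
                dx, dy, dz = dx * scale, dy * scale, dz * scale
            updated.append((x + dx, y + dy, z + dz))
        coords = updated                                          # step 4a: repeat
    return coords, max_iterations

final_coords, iterations_used = minimize([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], spring_forces)
```

The cap matters most in the first few iterations, when large forces on a poorly placed atom would otherwise produce a jump far outside the stated displacement limits.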
- the set of ligand binding amino acids includes at least 50 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 6 amino acid residues.
- the set of ligand binding amino acids includes at least 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 4 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 2 amino acid residues. In embodiments the ligand binding amino acids are apolar. In embodiments the ligand binding amino acids are hydrophilic.
- the set of ligand binding amino acids includes 50 amino acid residues.
- the set of ligand binding amino acids includes 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes 6 amino acid residues. In embodiments, the set of ligand binding amino acids includes 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes 4 amino acid residues.
- the set of ligand binding amino acids includes 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes 2 amino acid residues. In embodiments the ligand binding amino acids are polar. In embodiments the ligand binding amino acids are hydrophilic.
- the energy minimization calculation includes a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof. In embodiments, the energy minimization calculation includes a penalty function.
- the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.0 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.2 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.4 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.6 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 2.0 Å spherical probe.
- the core amino acids are at least 80% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 90% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 95% inaccessible to a 1.8 Å spherical probe. In embodiments, the set of core amino acids includes at least 50 amino acid residues. In embodiments, the set of core amino acids includes at least 40 amino acid residues.
- the set of core amino acids includes at least 30 amino acid residues.
- the set of core amino acids includes at least 20 amino acid residues.
- the set of core amino acids includes at least 12 amino acid residues.
- the set of core amino acids includes at least 10 amino acid residues.
- the set of core amino acids includes at least 8 amino acid residues.
- the set of core amino acids includes at least 6 amino acid residues.
- the core amino acids are apolar. In embodiments, the core amino acids are hydrophobic. In embodiments, the set of core amino acids includes 6 amino acids. In embodiments, the set of core amino acids includes 8 amino acids. In embodiments, the set of core amino acids includes 10 amino acids. In embodiments, the set of core amino acids includes 20 amino acids.
- the set of core amino acids includes 30 amino acids. In embodiments, the set of core amino acids includes 40 amino acids. In embodiments, the set of core amino acids includes 35, 36, 37, 38, 39, or 40 amino acids. In embodiments, the set of core amino acids includes 37 amino acids. In embodiments, the core amino acids include the sequence:
- LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO: 5).
- the core amino acids include the sequence: LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA (SEQ ID NO:6).
- the protein is 99% identical to SEQ ID NO:5. In embodiments, the protein is 98% identical to SEQ ID NO:5. In embodiments, the protein is 95% identical to SEQ ID NO:5. In embodiments, the protein is 90% identical to SEQ ID NO:5. In embodiments, the protein is 85% identical to SEQ ID NO:5. In embodiments, the protein is 80% identical to SEQ ID NO:5. In embodiments, the protein is 60% identical to SEQ ID NO:5. In embodiments, the protein is about 99% identical to SEQ ID NO:5. In embodiments, the protein is about 98% identical to SEQ ID NO:5. In embodiments, the protein is about 95% identical to SEQ ID NO:5. In embodiments, the protein is about 90% identical to SEQ ID NO:5. In embodiments, the protein is about 85% identical to SEQ ID NO:5. In embodiments, the protein is about 80% identical to SEQ ID NO:5. In embodiments, the protein is about 60% identical to SEQ ID NO:5.
- the protein is 99% identical to SEQ ID NO:6. In embodiments, the protein is 98% identical to SEQ ID NO:6. In embodiments, the protein is 95% identical to SEQ ID NO:6. In embodiments, the protein is 90% identical to SEQ ID NO:6. In embodiments, the protein is 85% identical to SEQ ID NO:6. In embodiments, the protein is 80% identical to SEQ ID NO:6. In embodiments, the protein is 60% identical to SEQ ID NO:6. In embodiments, the protein is about 99% identical to SEQ ID NO:6. In embodiments, the protein is about 98% identical to SEQ ID NO:6. In embodiments, the protein is about 95% identical to SEQ ID NO:6.
- the protein is about 90% identical to SEQ ID NO:6. In embodiments, the protein is about 85% identical to SEQ ID NO:6. In embodiments, the protein is about 80% identical to SEQ ID NO:6. In embodiments, the protein is about 60% identical to SEQ ID NO:6.
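The percent-identity thresholds listed above can be illustrated with a short calculation. The following is a minimal sketch, assuming an ungapped, position-by-position comparison of two equal-length sequences; in practice sequence identity is usually computed from an alignment, which this sketch does not attempt. The function names `percent_identity` and `meets_threshold` are hypothetical and are introduced only for illustration.

```python
# Sequences as given in the text.
SEQ_ID_NO_5 = "LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA"
SEQ_ID_NO_6 = "LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA"


def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences.

    Assumption: no alignment is performed; residues are compared
    position by position.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def meets_threshold(candidate: str, reference: str, threshold_percent: float) -> bool:
    """Return True if the candidate is at least threshold_percent identical."""
    return percent_identity(candidate, reference) >= threshold_percent
```

For example, a 37-residue variant of SEQ ID NO:5 carrying a single substitution matches at 36 of 37 positions, so it would satisfy the 95% threshold but not the 98% threshold.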
- the set of core amino acids includes at least 50% of the total number of amino acid residues in the protein.
- the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
- the ligand is a detectable agent.
- the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
- the ligand is a therapeutic agent.
- the ligand is a biological agent.
- the ligand is a cytotoxic agent (e.g., an anticancer agent).
- the ligand is a magnetic resonance imaging (MRI) agent.
- the ligand is a positron emission tomography (PET) agent.
- the ligand is a radiological imaging agent.
- the ligand is a diagnostic agent. In embodiments, the ligand is a theranostic agent. In embodiments, the ligand is a photodynamic therapy (PDT) agent. In embodiments, the ligand is a small molecule.
- the ligand is a catalyst.
- the catalyst catalyzes an abiological or bio-orthogonal reaction.
- the ligand is a molecule that exists within a living system (e.g., within an organism or a cell).
- the ligand is (CF3)4PZn.
- the ligand is (CF3)4PFe.
- the ligand atomic coordinates are optimized using known methods in the art (e.g., density functional theory using the B3-LYP functional).
- the method further includes synthesizing the protein (e.g., utilizing expression vectors such as the plasmid method described in the Example, such as cloning into the IPTG-inducible pET-11a plasmid). In embodiments, the method further includes expressing the protein.
- FIG. 14 depicts a flowchart illustrating a process 1400 for designing proteins, in accordance with some example embodiments.
- the process 1400 can be performed in order to design an energetically stabilized protein (e.g., a protein that is structurally and thermodynamically stable as determined by the difference in the Gibbs free energy between the folded and unfolded states of the protein).
- a set of ligand binding amino acid residues within a protein for binding to a ligand can be identified. These ligand binding amino acid residues can form the backbone of a protein. Each ligand binding amino acid residue within the protein can be associated with a set of ligand binding amino acid residue atomic coordinates, which can define the ligand binding amino acid residue in space. Furthermore, each atom of the ligand can be associated with a set of ligand atomic coordinates, which can define the ligand in space. As noted herein, these coordinates can be Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates, and/or the like.
- a set of core amino acid residues within the protein that do not bind to the ligand can be identified.
- the backbone of the protein can further include core amino acid residues.
- Each core amino acid residue within the protein can be associated with a set of core amino acid residue atomic coordinates, which define the core amino acid residue in space.
- the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can be optimized.
- the optimization can be performed using an energy minimization calculation including, for example, a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, and/or the like.
- Optimizing the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can generate an energetically stabilized protein.
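The quantities handled by process 1400 can be sketched in code. The following is a minimal sketch, assuming Cartesian coordinates in angstroms and an energy expressed as a weighted sum of independent terms; the `Residue` class, the `total_energy` function, and the weights are hypothetical and are not taken from the source. Only one of the named terms, the radius of gyration, is implemented here (unweighted by atomic mass); the molecular mechanics, structural bioinformatics, and sidechain packing terms would be supplied as additional functions of the same shape.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Residue:
    name: str           # one-letter amino acid code
    role: str           # "binding" (ligand binding) or "core"
    atoms: List[Coord]  # atomic coordinates, assumed Cartesian, in angstroms


def radius_of_gyration(coords: List[Coord]) -> float:
    """Root-mean-square distance of the atoms from their centroid (not mass-weighted)."""
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(squared / n)


def total_energy(
    residues: List[Residue],
    terms: Dict[str, Callable[[List[Coord]], float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of energy terms evaluated over all binding and core atoms."""
    coords = [atom for residue in residues for atom in residue.atoms]
    return sum(weights[key] * term(coords) for key, term in terms.items())
```

Because binding and core residues are scored by the same function, changing either set changes a single objective, which is one way to read the requirement that both sets and their coordinates be optimized together.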
- FIG. 15 depicts a block diagram illustrating a computing system 1500 consistent with implementations of the current subject matter.
- the computing system 1500 can be configured to perform the process 1400.
- the computing system 1500 can include a processor 1510, a memory 1520, a storage device 1530, and input/output devices 1540.
- the processor 1510, the memory 1520, the storage device 1530, and the input/output devices 1540 can be interconnected via a system bus 1550.
- the processor 1510 is capable of processing instructions for execution within the computing system 1500. Such executed instructions can implement one or more components of, for example, the database system 100 and/or the multitenant database system 200.
- the processor 1510 can be a single-threaded processor. Alternately, the processor 1510 can be a multi-threaded processor.
- the processor 1510 is capable of processing instructions stored in the memory 1520 and/or on the storage device 1530 to display graphical information for a user interface provided via the input/output device 1540.
- the memory 1520 is a computer-readable medium, such as a volatile or non-volatile memory, that stores information within the computing system 1500.
- the memory 1520 can store data structures representing configuration object databases, for example.
- the storage device 1530 is capable of providing persistent storage for the computing system 1500.
- the storage device 1530 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means.
- the input/output device 1540 provides input/output operations for the computing system 1500.
- the input/output device 1540 includes a keyboard and/or pointing device.
- the input/output device 1540 includes a display unit for displaying graphical user interfaces.
- the input/output device 1540 can provide input/output operations for a network device.
- the input/output device 1540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
- the computing system 1500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats.
- the computing system 1500 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing
- the applications can include various add-in functionalities (e.g., SAP Integrated Business Planning as an add-in for a spreadsheet and/or other type of program) or can be standalone computing products and/or functionalities.
- the functionalities can be used to generate the user interface provided via the input/output device 1540.
- the user interface can be generated and presented to a user by the computing system 1500 (e.g., on a computer screen monitor, etc.).
- a system including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and where
- a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetic
- the protein sequence is:
- the protein sequence is SEQ ID NO:1.
- the protein sequence is SEQ ID NO:2.
- the protein sequence is SEQ ID NO:3.
- the protein sequence is SEQ ID NO:4.
- the protein sequence is SEQ ID NO:5.
- the protein sequence is SEQ ID NO:6.
- the protein sequence is SEQ ID NO:7.
- the protein sequence is SEQ ID NO:1. In embodiments, the protein sequence is SEQ ID NO:2. In embodiments, the protein sequence is SEQ ID NO:3. In embodiments, the protein is 99% identical to SEQ ID NO:1. In embodiments, the protein is 98% identical to SEQ ID NO:1. In embodiments, the protein is 95% identical to SEQ ID NO:1.
- the protein is 90% identical to SEQ ID NO:1. In embodiments, the protein is 85% identical to SEQ ID NO:1. In embodiments, the protein is 80% identical to SEQ ID NO:1. In embodiments, the protein is 60% identical to SEQ ID NO:1. In embodiments, the protein is about 99% identical to SEQ ID NO:1. In embodiments, the protein is about 98% identical to SEQ ID NO:1. In embodiments, the protein is about 95% identical to SEQ ID NO:1. In embodiments, the protein is about 90% identical to SEQ ID NO:1. In embodiments, the protein is about 85% identical to SEQ ID NO:1. In embodiments, the protein is about 80% identical to SEQ ID NO:1. In embodiments, the protein is about 60% identical to SEQ ID NO:1.
- the protein is 99% identical to SEQ ID NO:2. In embodiments, the protein is 98% identical to SEQ ID NO:2. In embodiments, the protein is 95% identical to SEQ ID NO:2. In embodiments, the protein is 90% identical to SEQ ID NO:2. In embodiments, the protein is 85% identical to SEQ ID NO:2. In embodiments, the protein is 80% identical to SEQ ID NO:2. In embodiments, the protein is 60% identical to SEQ ID NO:2. In embodiments, the protein is about 99% identical to SEQ ID NO:2. In embodiments, the protein is about 98% identical to SEQ ID NO:2. In embodiments, the protein is about 95% identical to SEQ ID NO:2. In embodiments, the protein is about 90% identical to SEQ ID NO:2. In embodiments, the protein is about 85% identical to SEQ ID NO:2. In embodiments, the protein is about 80% identical to SEQ ID NO:2. In embodiments, the protein is about 60% identical to SEQ ID NO:2.
- the protein is 99% identical to SEQ ID NO:3. In embodiments, the protein is 98% identical to SEQ ID NO:3. In embodiments, the protein is 95% identical to SEQ ID NO:3. In embodiments, the protein is 90% identical to SEQ ID NO:3. In embodiments, the protein is 85% identical to SEQ ID NO:3. In embodiments, the protein is 80% identical to SEQ ID NO:3. In embodiments, the protein is 60% identical to SEQ ID NO:3. In embodiments, the protein is about 99% identical to SEQ ID NO:3. In embodiments, the protein is about 98% identical to SEQ ID NO:3. In embodiments, the protein is about 95% identical to SEQ ID NO:3. In embodiments, the protein is about 90% identical to SEQ ID NO:3. In embodiments, the protein is about 85% identical to SEQ ID NO:3. In embodiments, the protein is about 80% identical to SEQ ID NO:3. In embodiments, the protein is about 60% identical to SEQ ID NO:3.
- the protein is further bound to a ligand.
- the ligand is bound to the protein via a dative covalent bond.
- the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, which is capable of binding a metal ion.
- the ligand is a detectable agent.
- the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
- the ligand is a catalyst.
- the catalyst catalyzes an abiological or bio-orthogonal reaction.
- the ligand is a molecule that exists within a living system.
- the protein is 99% identical to SEQ ID NO:8. In embodiments, the protein is 98% identical to SEQ ID NO:8. In embodiments, the protein is 95% identical to SEQ ID NO:8. In embodiments, the protein is 90% identical to SEQ ID NO:8. In embodiments, the protein is 85% identical to SEQ ID NO:8. In embodiments, the protein is 80% identical to SEQ ID NO:8. In embodiments, the protein is 60% identical to SEQ ID NO:8. In embodiments, the protein is about 99% identical to SEQ ID NO:8. In embodiments, the protein is about 98% identical to SEQ ID NO:8. In embodiments, the protein is about 95% identical to SEQ ID NO:8.
- the protein is about 90% identical to SEQ ID NO:8. In embodiments, the protein is about 85% identical to SEQ ID NO:8. In embodiments, the protein is about 80% identical to SEQ ID NO:8. In embodiments, the protein is about 60% identical to SEQ ID NO:8.
- LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA SEQ ID NO:5
- LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA SEQ ID NO:6
- Embodiment 1 A computer-implemented method, comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
- Embodiment 2 The method of embodiment 1, wherein step c) comprises simultaneously optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates.
- Embodiment 3 The method of embodiment 1, wherein the energy minimization calculation comprises a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof.
- Embodiment 4 The method of embodiment 1, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
- Embodiment 5 The method of embodiment 1, wherein said set of core amino acids comprises at least six amino acid residues.
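The criteria in Embodiments 4 and 5 can be expressed as a simple filter. The following is a minimal sketch, assuming that a per-residue accessible fraction (0.0 for fully buried, 1.0 for fully exposed) has already been computed with a 1.8 Å probe by some external tool; how that fraction is computed is outside the sketch. The function names and the input dictionary are hypothetical.

```python
from typing import Dict, List

MAX_ACCESSIBLE_FRACTION = 0.25  # "at least 75% inaccessible" read as at most 25% accessible
MIN_CORE_SIZE = 6               # "at least six amino acid residues"


def core_positions(accessible_fraction: Dict[int, float]) -> List[int]:
    """Return the sorted residue positions that qualify as core residues."""
    return sorted(
        position
        for position, fraction in accessible_fraction.items()
        if fraction <= MAX_ACCESSIBLE_FRACTION
    )


def has_sufficient_core(accessible_fraction: Dict[int, float]) -> bool:
    """Return True if at least MIN_CORE_SIZE residues qualify as core."""
    return len(core_positions(accessible_fraction)) >= MIN_CORE_SIZE
```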
- Embodiment 6 The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
- Embodiment 7 The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
- Embodiment 8 The method of any one of embodiments 1 to 7, wherein the energy minimization calculation comprises a penalty function.
- Embodiment 9 The method of any one of embodiments 1 to 8, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
- Embodiment 10 The method of any one of embodiments 1 to 8, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
- Embodiment 11 The method of embodiment 10, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
- Embodiment 12 The method of any one of embodiments 1 to 11, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- Embodiment 13 The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
- Embodiment 14 The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
- Embodiment 15 The method of any one of embodiments 1 to 14, wherein the optimizing comprises an iterative or heuristic algorithm.
- Embodiment 16 The method of any one of embodiments 1 to 14, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
- Embodiment 17 The method of any one of embodiments 1 to 14, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
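Several of the constraints in the preceding embodiments (fixed coordinates, a cap on displacement, and a stochastic search such as simulated annealing with Monte Carlo moves) can be combined in one loop. The following is a minimal sketch, assuming a Metropolis acceptance rule, a geometric cooling schedule, and a caller-supplied energy function; the function `anneal`, its parameters, and its defaults are hypothetical and do not reproduce the patented procedure or any Rosetta protocol.

```python
import math
import random
from typing import Callable, List, Set, Tuple

Coord = Tuple[float, float, float]


def anneal(
    start: List[Coord],
    energy: Callable[[List[Coord]], float],
    fixed: Set[int],
    max_displacement: float = 3.0,  # angstroms from the starting position
    steps: int = 2000,
    step_size: float = 0.2,
    t_start: float = 1.0,
    t_end: float = 0.01,
    seed: int = 0,
) -> Tuple[List[Coord], float]:
    """Simulated annealing over atom positions; returns the best state seen."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    movable = [i for i in range(len(start)) if i not in fixed]
    if not movable:
        return best, e_best
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.choice(movable)
        trial_atom = tuple(c + rng.uniform(-step_size, step_size) for c in current[i])
        # Hard cap: reject moves that stray too far from the starting position.
        if math.dist(trial_atom, start[i]) > max_displacement:
            continue
        trial = current[:i] + [trial_atom] + current[i + 1:]
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best
```

The displacement cap is enforced here by rejecting moves outright. A penalty function, as mentioned in Embodiment 8, would instead add an energy cost that grows with the violation, which lets the search pass through mildly violating states.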
- Embodiment 18 The method of any one of embodiments 1 to 17, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
- Embodiment 19 The method of any one of embodiments 1 to 17, wherein the ligand is a detectable agent.
- Embodiment 20 The method of any one of embodiments 1 to 17, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
- Embodiment 21 The method of any one of embodiments 1 to 17, wherein the ligand is a catalyst.
- Embodiment 22 The method of any one of embodiments 1 to 17, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
- Embodiment 23 The method of any one of embodiments 1 to 17, wherein the ligand is a molecule that exists within a living system.
- Embodiment 24 A system, comprising: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation
- Embodiment 25 The system of embodiment 24, wherein the energy minimization calculation comprises functions from molecular mechanics, functions from structural bioinformatics, amino acid sidechain packing functions, protein radius of gyration functions, or a combination thereof.
- Embodiment 26 The system of embodiment 24, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
- Embodiment 27 The system of embodiment 24, wherein said set of core amino acids comprises at least six amino acid residues.
- Embodiment 28 The system of any one of embodiments 24 to 27, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
- Embodiment 29 The system of any one of embodiments 24 to 28, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
- Embodiment 30 The system of any one of embodiments 24 to 29, wherein the energy minimization calculation comprises a penalty function.
- Embodiment 31 The system of any one of embodiments 24 to 30, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
- Embodiment 32 The system of any one of embodiments 24 to 31, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
- Embodiment 33 The system of embodiment 32, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
- Embodiment 34 The system of any one of embodiments 24 to 33, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
- Embodiment 35 The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
- Embodiment 36 The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
- Embodiment 37 The system of any one of embodiments 24 to 36, wherein the optimizing comprises an iterative or heuristic algorithm.
- Embodiment 38 The system of any one of embodiments 24 to 36, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
- Embodiment 39 The system of any one of embodiments 24 to 36, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
- Embodiment 40 A non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
- Embodiment 41 A protein sequence obtainable based on the energy minimization calculation using the method of any of embodiments 1 to 23, the system of any of embodiments 24 to 39, or the non-transitory computer-readable medium of embodiment 40.
- Embodiment 42 A protein, or conservatively modified variant thereof, having the sequence SEQ ID NO: 1.
- Embodiment 43 The protein of embodiment 42, wherein the protein is 90% identical to SEQ ID NO: 1.
- Embodiment 44 The protein of embodiment 42, bound to a ligand.
- Embodiment 45 The protein of embodiment 42, wherein the ligand is bound to the protein via a dative covalent bond.
- Embodiment 46 The protein of embodiment 44, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
- Embodiment 47 The protein of embodiment 44, wherein the ligand is a detectable agent.
- Embodiment 48 The protein of embodiment 44, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
- Embodiment 49 The protein of embodiment 44, wherein the ligand is a catalyst.
- Embodiment 50 The protein of embodiment 44, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
- Embodiment 51 The protein of embodiment 44, wherein the ligand is a molecule that exists within a living system.
- Example 1 Strategy for designing hyperstable, non-natural protein-cofactor complexes with sub-Å accuracy
- The high-resolution structure of holo-PS1 is in sub-Å agreement with the design.
- the structure of apo-PS1 retains the remote core packing of the holo, predisposing a flexible binding region for the desired ligand-binding geometry.
- Our results illustrate the unification of core packing and binding site definition as a fundamental principle of ligand-binding protein design.
- de novo heme-binding helical bundle proteins have been designed entirely from first principles (17, 20), but these "maquettes" have evaded structural determination, largely due to aggregation or their dynamical properties (17, 21, 22).
- Apart from covalently linked peptide-heme complexes (23), the only structure of a de novo heme-binding protein was solved for an apo-protein, which showed a hydrophobically collapsed binding site with no space for binding heme (21, 24).
- Protein design.
- the design of PS1 (Porphyrin-binding Sequence 1) began with the previously parameterized backbone from the de novo designed protein SCRPZ-2 (28), a protein that bound an extended porphinato(metal)-polypyridyl(metal) cofactor (FIG. 1B).
- the parameters were adjusted to position a single His ligand to receive a second-shell hydrogen bond with Thr from a neighboring helix (see FIG. 2).
- Side chains in the vicinity of the binding site were computationally designed to stabilize the asymmetric ligand environment while maintaining a rigid symmetrical backbone.
- Biophysical characterization of PS1.
- PS1 is monomeric (FIGS. 7A-7B) and binds the water-insoluble cofactor, (CF3)4PZn, forming highly thermostable complexes (extrapolated Tm > 120 °C, FIG. 3C and Fig. S3) that are stable for over a year.
- the complex forms within seconds of adding (CF3)4PZn from organic solution to aqueous PS1, suggesting a small kinetic barrier for assembly (FIG. 3A).
- a tight dissociation constant of binding, ⁇ 45 nM, was measured under conditions where the water-insoluble porphyrin was solubilized with 1% w/v
- PS1 also binds the ferrous iron derivative of the porphyrin, (CF3)4PFe (FIG. 9), despite the abysmal solubility in water of this cofactor.
- (CF3)4PFe is an electron-deficient (porphinato)metal complex capable of molecular oxygen activation for alkane hydroxylation and alkene epoxidation (36).
- Solvent hydrogen-deuterium exchange (HDX) experiments and molecular dynamics simulations of apo-PS1 also show a gradient in conformational stability between the apolar core and the binding site of apo-PS1 (FIG. 5C, FIGS. 12 and 13).
- the backbone surrounding the apolar core of both holo- and apo-PS1 is highly protected from exchange, an important characteristic of cooperatively folded native proteins.
- the protected region extends into the porphyrin-binding site in the holo-protein but not in the apo-structure (FIG. 5C).
- the increased protection in the binding site of holo-PSl is seen at both solvent-exposed and interior positions, indicating increased conformational stability rather than steric restriction from the bound cofactor alone.
- the interior side chains stack into four layers, beginning at the edge of the porphyrin-binding site and extending to the end of the bundle (FIGS. 5D-5F).
- the layers closest to the binding site explore more conformations, accessing rotamers not seen in holo-PS1 (FIG. 5E).
- the packing of the more distal layers is identical in the apo- and holo-structures (FIG. 5F).
- the third- and fourth-shell layers, located up to 20 Å away from the binding site, are precisely pre-organized to stabilize the conformation of the first-shell side chains when PS1 enfolds its cofactor. This finding is consistent with numerous studies on natural proteins 13-16 , which show that variation of residues involved in core packing distant from an active site can have profound influences on binding and catalysis.
- the entire core of the D2-symmetrical parameterized backbone of SCRPZ-2 was redesigned to bind (CF 3 ) 4 PZn via a customized Rosetta script for flexible backbone sequence design.
- the flexible backbone design protocol was as follows: Distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see Example 2) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score > 0.58) were output. 170 designs were output from 500 runs through the protocol (FIG. 6). We analyzed these 170 models for packing, radius of gyration, energy, and rotamer state probability within Matlab to select PS1 for expression. The design of PS2 proceeded in the same fashion.
- Protein expression, purification, and biophysical characterization Details regarding protein expression, purification, and biophysical characterization can be found in the supplement. Briefly, genes for the proteins were ordered from GenScript, cloned into a pET-11a plasmid, and purified via a Ni column, followed by His-tag cleavage by TEV protease. The protein sequence of expressed, purified PS1 after His-tag cleavage is:
- the protein/cofactor solution was then spun at 14000 x g in an Amicon Ultra-0.5 mL centrifuge filter for 10 min, three times, replacing the buffer to 0.5 mL after each 10 min spin. Finally, the protein solution was spun for 4 min at 12000 x g in an Amicon ultrafree-MC GV filter (UFC30GV0S). The holo-PS1 sample was then used for spectroscopic experiments immediately afterward, and diluted to an appropriate concentration if necessary. Binding of (CF 3 ) 4 PFe was carried out in the same fashion, with the exception that the porphyrin was first dissolved in a stock of DMSO/CHCl 3 .
- Sequence-specific backbone (1HN, 15N, 13Cα, 13CO) and 13Cβ resonance assignments were obtained by using 3D HNCACB / CBCA(CO)NH and 3D HNCO / CO(CA)NH along with the program AUTOASSIGN.
- 41 1Hα and 1Hβ assignments were extended by a 3D HAHB(CO)NH experiment, and more peripheral side chain chemical shifts were assigned with aliphatic 3D CCH-TOCSY (mixing time: 75 ms) and simultaneous 3D 15N/13C aliphatic/13C aromatic-resolved
- backbone dihedral angle constraints were derived from chemical shifts using the program TALOS for residues located in well-defined secondary structure elements 44 .
- 2D constant-time [13C, 1H]-HSQC spectra were recorded as described for the 5% fractionally 13C-labeled samples to obtain stereo-specific assignments for isopropyl groups of Val and Leu 45 .
- the NH residual dipolar couplings (RDCs) were measured with 2D 1H-15N IPAP-HSQC in samples aligned using Pf1 phage (ASLA biotech).
- the program CYANA was used to assign long-range NOEs and calculate the structure 46, 47 .
- PS1 design process The design of PS1 began with a D2-symmetrical parameterized backbone of a 4-helix bundle (Tables S1 and S2) 1 . We have previously used this backbone parameterization to create a diheme-binding tetrameric 4-helix bundle, PATET, which was composed of 4 copies of a 25-residue helix containing the requisite metal-coordinating His and second-shell H-bonding Thr residues placed at d and b positions in a heptad repeat, respectively 2 . This tetramer bound two hemes with a bis-His ligation in a D2-symmetrical bundle.
- Trp residue in the protein interior also serves as an absorption handle, as well as a fluorescent indicator of hydrophobic packing.
- Flexible backbone design protocol Flexible backbone design utilized angle and distance constraints between the Zn and His to restrict the design space to those consistent with the DFT-optimized imidazole-Zn distance of 2.0 Å.
- We used an energy term (hack_aro 1) that models quadrupolar interactions between aromatic side chains in every stage of the flexible backbone design protocol.
- We also employed an energy term (rg 2) that penalizes bundles with a large radius of gyration (rg).
- the flexible backbone design protocol was as follows: Distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see below) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score > 0.58) were output. The PackStat score was calculated 3 times per trial to account for its stochastic behavior. 170 designs were output from 500 runs through the protocol (Fig. S1). We analyzed these 170 models for packing, rg, energy, and rotamer state probability within Matlab to select PS1 for expression.
- the flexible backbone design sub-protocol consists of 3 Monte Carlo trials of (i) fixed backbone design with soft weights (decreased vdW interactions, i.e., soft_rep_design weights within Rosetta), (ii) sidechain minimization via MinMover, (iii) fixed backbone design with Score13 weights, where the electrostatic term
- step (v) the model is filtered for native structure-like packing via PackStat (If 1 of 3 trials of PackStat score is > 0.58, the model passes the filter.).
- hack_aro is set to 1 and rg is set to 2.
- the final, designed sequence (PS1) selected for protein expression was the following 108 amino acids:
- Rosetta ab initio folding 8 was performed on the PS1 sequence in Rosetta 3.5.
- Cα RMSD of the folded core was scored against residues 14-23, 32-42, 69-79, and 87-97 of the design model.
- Cα RMSD of the binding region was scored against residues 5-13, 43-50, 61-68, and 98-105 of the design model.
- AUC Analytical ultracentrifugation
- the oligomeric state of apo- and holo-PS1 was determined by analytical equilibrium sedimentation performed at 25 °C using a Beckman XL-I analytical ultracentrifuge. Ultracentrifugation was conducted at speeds of 25K, 30K, 35K, 40K and 45K r.p.m., and the radial gradient profiles were obtained by absorbance at 280 nm.
- a 200 μM solution of the apo- and a 100 μM solution of the holo-protein were prepared in 50 mM NaPi pH 7.5, 100 mM NaCl (apo) and 20 mM NaPi pH 7.5, 125 mM NaCl (holo). Data were globally fit to a single-species model of equilibrium sedimentation by a nonlinear least-squares method using IGOR Pro (Wavemetrics).
- Spectra were collected from 20 to 95 °C with an interval of 5 °C and an increase rate of 1 °C/minute, over a wavelength range from 215 to 250 nm.
- Apo- and holo-PS1 were prepared at 10 μM and 6.6 μM, respectively, in 50 mM NaPi pH 7.5, 100 mM NaCl buffer.
- Temperature melts of apo-PS1 were also performed at varying concentrations of Guanidine HCl denaturant (0 M, 1 M, 2 M, 3 M, 4 M, 5 M, 5.85 M, 7 M).
- Elevated temperature experiments were performed in a custom-made temperature block of anodized aluminum, the temperature of which was controlled by heating rods and monitored by a pair of thermocouples wired to a PID through a solid-state relay.
- Cofactor e.g., ligand
- the geometry of (CF 3 ) 4 PZn was optimized via density functional theory using the B3-LYP functional and 6-31G* basis set implemented in Gaussian03.
- the starting geometry was obtained from the crystal structure of related meso-heptafluoropropyl(porphinato)Zn(II), with the fluoropropyl groups truncated to fluoromethyl 10 .
- Meso-heptafluoropropyl(porphinato)Zn(II) co-crystallized with an axially ligating pyridine; imidazole was computationally substituted for pyridine for the geometry optimization of (CF 3 ) 4 PZn.
- MHHHHHHENLYFQ/SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKH RQLFD RQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIREL AEKKN (SEQ ID NO:4) where the "/" defines the cleavage site of TEV protease.
- the cells were then induced with IPTG and allowed to grow for 4 more hours. Cells were then centrifuged and frozen. The frozen cell pellets were lysed in a French press in the Duke University Biology Department.
- the expressed, His-tagged PS1 protein was purified via a Ni NTA column (Invitrogen) and confirmed by gel electrophoresis.
- the buffer was exchanged to the Sigma-recommended TEV protease buffer (5 mM DTT, 50 mM Tris, 0.5 mM EDTA, pH 8.0), and the PS1/TEV solution (His-tagged TEV protease was ordered from Sigma.) was allowed to rock for 1 day at room temperature.
- PS2 was expressed with the same His-tag as PS1, and cleaved and purified using the same methods. Binding of (CF 3 ) 4 PZn to PS2 was carried out using the same method as for PS1. We found that PS2 bound (CF 3 ) 4 PZn in a homogeneous environment, indicated by the narrow electronic absorption bands of the porphyrin in PS2, nearly indistinguishable from that in PS1 (FIG. 10). PS2 will be structurally characterized in future studies in which we will examine the role of second- and third-shell hydrogen bonds on the photophysical properties of holo-PS proteins. The expressed, purified, His-tag cleaved sequence of PS2 was:
- the molecular dynamics simulation was carried out using ACEMD 13 .
- the system was minimized for 2000 steps, followed by equilibration using the NPT ensemble for 10 ns at 1 atm using a time-step of 2 fs.
- the protein was allowed to move freely and simulated under the NVT ensemble in ACEMD with a Langevin thermostat.
- damping at 0.1 ps-1 and a hydrogen mass repartitioning scheme.
- the simulation was carried out to 1 at 298 K.
- SOCKET Server for assessment of knobs-into-holes packing.
- PDB files of the PS1 design model, holo-PS1 centroid, and apo-PS1 open/closed centroids were individually uploaded to and analyzed by the SOCKET server 14 for knobs-into-holes side chain packing (see Section 4).
- a helical residue was defined as a knob if its side chain was within 8 Å of 4 other side chains from residues on an adjacent helix (a hole).
- Output from the SOCKET server for each of these PDB files is displayed below showing the residues of each knob and hole. Note that the residue number of the PS1 design model is off register by 1 amino acid from the structural sequences, due to the presence of the N-terminal Ser residue from TEV cleavage of the expressed proteins.
- Example 3 - enFold Proteins can bind endogenous ligands
- the computational method described here is capable of producing proteins that noncovalently bind ligands in vivo.
- apo-proteins remain competent to bind an endogenous ligand, for example heme (FIG. 17 and FIGS. 18A-18B). These proteins are the first de novo designed proteins to our knowledge that noncovalently bind heme in vivo.
- Residues are numbered according to the expressed 109-residue PS1 protein. All denotes a mutated residue, and * denotes a retained residue, as shown in Fig. S1.
- LEU 93, ALA 96, LEU 97, ILE 100 (knob: 57 (TRP 67, helix 2))
- LEU 89, GLU 92, LEU 93, ALA 96 (knob: 60 (LEU 70, helix 2))
- LEU 86, LEU 89, LEU 90, LEU 93 (knob: 64 (PHE 74, helix 2))
- LEU 36, ILE 39, GLU 40, ILE 43 (knob: 61 (PHE 72, helix 2))
- GLU 33, LEU 36, GLU 37, GLU 40 (knob: 65 (ARG 76, helix 2))
- LEU 90, GLU 93, LEU 94, ALA 97 (knob: 60 (LEU 71, helix 2))
- LEU 87, LEU 90, LEU 91, LEU 94 (knob: 64 (PHE 75, helix 2))
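The knob/hole lines listed above follow a regular pattern (four hole residues, then a knob entry in parentheses), so they can be read into a structured form for comparison across models. The following is a minimal sketch under stated assumptions: the function name `parse_knob_line` and the dictionary keys are hypothetical, the pattern is inferred only from the lines shown here rather than from any SOCKET specification, and the meaning of the number that follows "knob:" is not stated in this text, so it is stored as an opaque label.

```python
import re

# Pattern inferred from the listed lines, e.g.
# "LEU 93, ALA 96, LEU 97, ILE 100 (knob: 57 (TRP 67, helix 2))"
KNOB_RE = re.compile(r"\(knob:\s*(\d+)\s*\(([A-Z]{3}) (\d+), helix (\d+)\)\)")
RESIDUE_RE = re.compile(r"([A-Z]{3}) (\d+)")


def parse_knob_line(line):
    """Split one listed line into its hole residues and its knob residue."""
    match = KNOB_RE.search(line)
    if match is None:
        raise ValueError("line does not contain a '(knob: ...)' entry")
    # Everything before the knob entry is the list of hole residues.
    hole = [(name, int(num)) for name, num in RESIDUE_RE.findall(line[:match.start()])]
    label, name, num, helix = match.groups()
    return {
        "hole": hole,
        "knob_label": int(label),  # meaning not specified in the source text
        "knob": (name, int(num)),
        "helix": int(helix),
    }


record = parse_knob_line("LEU 93, ALA 96, LEU 97, ILE 100 (knob: 57 (TRP 67, helix 2))")
```

Parsing each model's lines this way would allow, for example, a check of whether the same knob residue packs into the same hole in the design model and in the experimental centroids.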
Abstract
Disclosed herein, inter alia, are methods and systems for optimizing protein ligand interactions for highly accurate de novo protein design.
Description
DESIGNED PROTEINS FOR LIGAND BINDING
CROSS-REFERENCES TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No. 62/537,774, filed on July 27, 2017, which is incorporated herein by reference in its entirety and for all purposes.
REFERENCE TO A "SEQUENCE LISTING," A TABLE, OR A COMPUTER PROGRAM LISTING APPENDIX SUBMITTED AS AN ASCII FILE
[0002] The Sequence Listing written in file 048536-593001WO Sequence Listing_ST25.txt, created July 12, 2018, 7,148 bytes, machine format IBM-PC, MS Windows operating system, is hereby incorporated by reference.
STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
[0003] This invention was made with Government support under grant numbers GM-54616 and GM-071628 awarded by The National Institutes of Health, and grant numbers CHE-1413333, CHE-1413295 and DMR-1120901 awarded by The National Science Foundation. The Government has certain rights in the invention.
BACKGROUND
[0004] Many natural proteins contain precisely oriented cofactors that enable their functions, yet the de novo design of proteins that bind cofactors with atomic-scale precision has remained a significant challenge. De novo protein design critically tests our understanding of protein folding and function, and can provide new frameworks that combine man-made materials with protein scaffolds. Highly accurate design of porphyrin-binding proteins, validated by high-resolution structure determination, has presented a major unsolved challenge. Disclosed herein, inter alia, are solutions to these and other problems in the art.
BRIEF SUMMARY
[0005] In an aspect is provided a computer-implemented method, including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
[0006] In an aspect is provided a system, including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
[0007] In another aspect is provided a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
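As an illustration of steps a) and b) only, the following minimal sketch partitions residues into a ligand-binding set and a core set from atomic coordinates. It rests on assumptions that are not part of the method as stated: a simple distance cutoff is used as the criterion for "binding" (the 4 Å default is borrowed from the alignment criterion in the description of FIG. 4), the function name `split_binding_and_core` is hypothetical, and the coordinates are toy values. The optimization of step c) is not shown.

```python
import math


def split_binding_and_core(residue_atoms, ligand_atoms, cutoff=4.0):
    """Partition residues by proximity to the ligand.

    residue_atoms: dict mapping residue number -> list of (x, y, z) tuples
    ligand_atoms:  list of (x, y, z) tuples
    cutoff:        contact distance in angstroms (an assumed criterion)
    """
    binding, core = set(), set()
    for resnum, atoms in residue_atoms.items():
        # A residue counts as ligand-binding if any of its atoms lies
        # within the cutoff of any ligand atom.
        in_contact = any(
            math.dist(atom, lig) <= cutoff for atom in atoms for lig in ligand_atoms
        )
        (binding if in_contact else core).add(resnum)
    return binding, core


# Toy coordinates, not taken from any real structure.
residues = {
    10: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    11: [(3.0, 0.0, 0.0)],
    50: [(20.0, 0.0, 0.0)],
}
ligand = [(4.0, 0.0, 0.0)]
binding, core = split_binding_and_core(residues, ligand)
```

In this toy case residues 10 and 11 fall in the binding set and residue 50 in the core set; each set keeps its own coordinates, so both could be passed together to a subsequent energy minimization.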
[0008] In an aspect is provided a protein sequence obtainable based on the energy minimization calculation using the method, the system, or the non-transitory computer-readable medium as described herein.
[0009] In an aspect is provided a protein, or conservatively modified variant thereof, having the sequence:
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFD RQEAADTEA AKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO: l).
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] FIGS. 1A-1C. The design strategy. FIG. 1A: Structures of natural cofactor-binding proteins show a folded core supporting a cofactor-binding region. FIG. 1B: Examples of previously designed tetra-helical porphyrin-binding proteins; all but PS1 (which is described herein) lack a folded core. The a2 protein is from ref 40; the remainder are described in the text. FIG. 1C: The design process starts with a parameterized backbone, which undergoes simultaneous optimization of packing of core residues (shown as spheres) in the binding region (light color) and folded core (dark color), with flexible backbone. The resultant holo-protein is tightly packed both in the binding region and in the folded core, whereas the apo-protein is tightly packed only in the folded core, which anchors the under-packed binding region to bind the cofactor. Cytochrome b562 (pdb 256b); DHFR, dihydrofolate reductase (pdb 8dfr); flavodoxin (pdb 1czu).
[0011] FIG. 2. The computational design workflow for optimized core packing. The abiological porphyrin cofactor, (CF3)4PZn, is shown in the upper left. The constrained, parameterized backbone of SCRPZ-2 feeds into a flexible backbone design protocol that allows the interior side chains and backbone to simultaneously conform to the porphyrin (CF3)4PZn. On the right are depicted the ab initio folding predictions of the PS1 sequence. The Rosetta folding algorithm predicts a shallow folding funnel for the binding region (light gray) and a deep folding funnel shifted toward lower RMSD for the folded core (dark gray) of apo-PS1. The RMSD (root mean squared deviation) in Å is against the helical residues within these regions in the designed model. Energy is in Rosetta energy units (r.e.u.).
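The region-specific RMSD plotted in FIG. 2 can be illustrated with a short sketch. The residue ranges for the folded core are those given elsewhere in this document (residues 14-23, 32-42, 69-79, and 87-97 of the design model); everything else is an assumption made for illustration: the function name `residue_range_rmsd` is hypothetical, the two structures are taken to be already superimposed, and the coordinates are toy values rather than PS1 data.

```python
import math


def residue_range_rmsd(model, reference, ranges):
    """RMSD over C-alpha atoms of the residues in the given inclusive ranges.

    model, reference: dict mapping residue number -> (x, y, z) of the C-alpha atom
    ranges:           list of (first, last) residue numbers, inclusive
    Assumes the two structures have already been superimposed.
    """
    selected = [r for first, last in ranges for r in range(first, last + 1)]
    squared = [math.dist(model[r], reference[r]) ** 2 for r in selected]
    return math.sqrt(sum(squared) / len(squared))


FOLDED_CORE = [(14, 23), (32, 42), (69, 79), (87, 97)]

# Toy data: every C-alpha of the model is displaced by 1 angstrom along x.
reference = {r: (float(r), 0.0, 0.0) for r in range(1, 109)}
model = {r: (x + 1.0, y, z) for r, (x, y, z) in reference.items()}
rmsd = residue_range_rmsd(model, reference, FOLDED_CORE)
```

Scoring the binding region separately only requires passing its own residue ranges, which is how two distinct folding funnels can be drawn from one set of predicted models.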
[0012] FIGS. 3A-3D. Biophysical characterization of apo- and holo-PS1. FIG. 3A: Electronic absorption and emission spectra of (CF3)4PZn/PS1 holo-protein and (CF3)4PZn in toluene solvent. Inset shows normalized emission spectrum of (CF3)4PZn upon electronic excitation at 405 nm (OD = 0.1 at excitation wavelength); buffer = 100 mM NaCl, 50 mM NaPi, pH 7.5. FIG. 3B: Determination of KD by apo-PS1 titration into a buffer solution (100 mM NaCl, 50 mM NaPi, pH 7.5) of (CF3)4PZn with 1% w/v octyl-β-D-glucopyranoside. Inset shows spectral shifts upon porphyrin binding to PS1. FIG. 3C: Circular dichroism (CD) spectra of apo- and holo-PS1 in 50 mM NaPi, 100 mM NaCl, pH 7.5 as a function of temperature. The transitions appear reversible because the spectra are identical after cooling to room temperature. Units are in molar residue ellipticity. Electronic absorbance spectra indicate holo-PS1 retains the porphyrin upon cooling. FIG. 3D: Pump-probe transient absorption spectra of (CF3)4PZn bound in the interior of holo-PS1 at 21 °C and 100 °C. The black spectrum shows characteristic S1→SN absorptions of (CF3)4PZn, which smoothly transitions into the gray spectrum showing characteristic T1→TN absorptions of (CF3)4PZn. Inset exemplifies identical transient dynamics (primarily intersystem crossing from S1 to T1) at AAbs. = 482 nm (scaled). Experimental conditions: solvent = 50 mM NaPi, 100 mM NaCl, pH 7.5; excitation wavelength = 600 ± 5 nm; magic-angle polarization between pump and probe pulses; pump-probe cross-correlation of ~250 fs.
[0013] FIG. 4. The structure of holo-PS1 agrees closely with the design. The structure of holo-PS1 superimposed on the design, with mean helical backbone RMSD of 0.8 ± 0.1 Å. The holo-PS1 model shown is the centroid of the NMR structural ensemble. 26 porphyrin-protein nuclear Overhauser effects (NOEs), drawn as sticks, experimentally determine the orientation of the porphyrin within the binding site of PS1. Middle panel compares observed vs. designed orientations. All hydrophobic and helical backbone heavy atoms within 4 Å of porphyrin heavy atoms in the design were used for alignment (0.9 ± 0.1 Å all-atom RMSD). Panel shows ~10 Å slices of the holo-PS1 NMR centroid and design in the binding region and folded core, respectively.
[0014] FIGS. 5A-5F. Apo- and holo-PS1 share similar folded cores and differ in the binding region. FIG. 5A: 2D 1H-15N HSQC spectra acquired for apo- and holo-PS1. Experimental conditions: 0.78 mM at 298 K, 50 mM NaPi, 100 mM NaCl, pH 7.5, in 5% D2O. Resonance assignments are indicated using the one-letter amino acid code. Signals arising from side chains (Asn HD2/ND2, Gln HE2/NE2, Arg HE/NE and Trp HE1/NE1) are also labeled. The residues belonging to the binding region and folded core are color-coded as in FIG. 5B. Non-helical residues are labeled in cyan font face. The inset in the HSQC spectrum of apo-PS1 shows the chemical shift of the indole proton of Trp68 near 10.2 ppm. A dashed box surrounds 90% of the backbone resonances of apo-PS1 and is also placed at the same position in the holo-PS1 spectrum. Arrows point to resonances of residues within the binding region that change dramatically upon binding of the cofactor. FIG. 5B: Solution NMR structures of apo-PS1 and holo-PS1. The structures were aligned to the backbone of the helical folded core of the lowest energy holo-PS1 model. Terminal residues 1, 108, and 109 are not shown for clarity. FIG. 5C: Hydrogen-deuterium exchange protection factors (PF) measured for apo- and holo-PS1, mapped onto the centroid structure of holo-PS1. Backbone amide nitrogens of residues with determined PFs are shown as spheres. Not shown: N of Trp68 indole side chain is protected in holo, but not apo. FIGS. 5D-5F: Backbone alignment of the holo- and apo-centroids at the folded core shows, FIG. 5F, agreement of side chain rotamer states far from the binding site and, FIG. 5E, differences in first-shell rotamers (e.g., Trp68, Leu98) accompanied by changes in backbone of the binding region. Centroids are from NMR structural ensembles clustered via RMSD of core side chain heavy atoms.
[0015] FIG. 6. PS1 design metrics. PS1 design ensemble resulting from flexible backbone sequence design. FIG. 6B: Residues (Cα atoms shown as spheres) within the PS1 design that were allowed to vary from the SCRPZ-2 sequence. 40 of the 108 residues were allowed to vary, and, of the 40 residues, 28 were mutated and 12 residues were retained from the original SCRPZ-2 sequence as a result of the computational design process.
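The design ensemble of FIG. 6 was produced by a protocol whose final step keeps a model only if at least one of three PackStat evaluations exceeds 0.58, with 170 designs output from 500 runs. A minimal sketch of that acceptance rule follows. The threshold, the three-trial rule, and the 170/500 counts come from this document; the function name `passes_packing_filter`, the model names, and the score values are hypothetical, and the PackStat calculation itself (a Rosetta feature) is not reproduced here.

```python
PACKSTAT_THRESHOLD = 0.58


def passes_packing_filter(scores, threshold=PACKSTAT_THRESHOLD):
    """Return True if any one of the repeated PackStat scores exceeds the threshold.

    The score is evaluated several times per model because it is stochastic;
    a single evaluation above the threshold is enough to pass.
    """
    return any(score > threshold for score in scores)


# Made-up scores for three hypothetical models, three evaluations each.
runs = {
    "model_001": [0.55, 0.61, 0.57],
    "model_002": [0.52, 0.56, 0.58],  # 0.58 is not strictly greater than 0.58
    "model_003": [0.63, 0.60, 0.59],
}
accepted = [name for name, scores in runs.items() if passes_packing_filter(scores)]

# Acceptance rate reported for the actual protocol: 170 designs from 500 runs.
acceptance_rate = 170 / 500
```

Under this rule about one run in three (34%) yielded a model that was carried forward to the later ranking by packing, radius of gyration, energy, and rotamer state probability.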
[0016] FIGS. 7A-7B. Analytical ultracentrifugation and gel filtration analysis show that apo- and holo-PS1 are monomeric in solution. FIG. 7A: Analytical ultracentrifugation. Solutions of apo- and holo-PS1 were centrifuged at speeds ranging from 25,000 r.p.m. to 45,000 r.p.m. and monitored by absorbance at 280 nm. Parameters were globally fit to the data. Single-species fitting agrees well with the data over the entire range and yields molecular weights of 15.81 ± 0.09 kD for apo-PS1 and 12.24 ± 0.91 kD for holo-PS1, which agree well with the 12.86 kD weight of PS1. At high concentration, the fit for apo-PS1 is not ideal, suggesting a small degree of aggregation. Partial specific volumes were estimated from SEDNTERP 15 for amino acid side chains. FIG. 7B: Analytical gel filtration analysis of apo- and holo-PS1. Detection wavelengths are labeled in the same color as their respective curves. Apo shows a small degree (< 5%) of dimerization (1.35 ml elution volume) relative to the monomer peak (1.62 ml elution volume). The small peak near 1.05 ml elution volume in holo-PS1 is unbound (excess), aggregated porphyrin eluting in the void volume of the Superdex 75 5/150 column. Samples were run at concentrations of 100 μM and 37 μM for apo and holo, respectively, in 50 mM NaPi, 150 mM NaCl, pH 7.0 buffer.
[0017] FIG. 8. Temperature and GnHCl induced unfolding of apo-PS1. CD spectra at 222 nm of apo-PS1 as a function of temperature and denaturant (Guanidine HCl, GnHCl) concentration in 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer. The midpoint for GnHCl-induced unfolding at 95 °C was approximately 4.5 M.
[0018] FIG. 9. Scaled absorption spectra of (CF3)4PM/PS1 complexes, M = Zn2+, Fe2+. Loading of (CF3)4PFe into PS1 was ~40-50%, likely due to the extreme insolubility of the porphyrin. The featured bands at 550-650 nm indicate that (CF3)4PFe is in a homogeneous environment in the ferrous state. The broad absorbance centered at 350 nm is also observed for (CF3)4PFe dissolved in organic solvent 16 , and does not reflect aggregation in water. The peak at 423 nm is also indicative of a homogeneous binding environment. The absorption spectra were scaled to reflect the relative extinction coefficients of the porphyrins. Buffer = 50 mM NaPi, 100 mM NaCl, pH 7.5.
[0019] FIG. 10. Absorption spectra of (CF3)4PZn/PS1 and (CF3)4PZn/PS2 complexes. Each protein shows 100% porphyrin loading, based on absorbance at 280 nm and 423 nm. Experimental conditions: buffer = 100 mM NaCl, 50 mM NaPi, pH 7.5.
[0020] FIG. 11. The NMR structural ensemble of apo-PS1 contains two clusters of conformations, closed and open. Above, color mapping of the pairwise backbone RMSD matrix of each NMR ensemble member of apo-PS1. Apo models with high structural similarity in the region of residues 61-67 and 99-105 (labeled in the open structure shown below) are blue in the plot. Models that are structurally dissimilar (large RMSD) are red in the plot. Below, the model centroids representing the closed and open structures (models 1 and 18, respectively, in the deposited NMR structure). The porphyrin (CF3)4PZn is shown in green, and the holo centroid (orange) is also drawn for comparison.
[0021] FIG. 12. HDX protection factors for apo- and holo-PS1, as described in Table S5. Note that "68 indole" denotes the indole N of Trp68 side chain.
[0022] FIG. 13. Molecular dynamics simulations show the binding region of apo-PS1 is more accessible to solvent. Histogram of number of waters within 3.5 Å of any heavy atom of each buried amino acid side chain (an A or D position of the heptad repeat), from 1000 snapshots of a 1 trajectory of apo-PS1. All histograms are drawn to the same scale and show number of solvating waters normalized by side chain surface area. Binding region shown in light gray, and folded core in dark gray.
[0023] FIG. 14 depicts a flowchart illustrating a process for designing proteins, in accordance with some example embodiments.
[0024] FIG. 15 depicts a block diagram illustrating a computing system, in accordance with some example embodiments.
[0025] FIG. 16. Solution NMR structure of PS1 and computational models of PS1 variants.
[0026] FIG. 17. The PS1 deletion variant binds endogenous heme when expressed in E. coli. Characteristic Soret and Q bands of heme can be seen at 410 and 550 nm in the displayed absorption spectra.
[0027] FIGS. 18A-18B. enFold proteins are capable of noncovalently binding endogenous ligands in the cell. FIG. 18A: Expression in E. coli of the deletion variant of PS1 (PS1 D103-109, SEQ ID NO: 8) shows a high loading of endogenous heme in the porphyrin binding site. Inset: E. coli cultures of after induction). FIG. 18B: De novo proteins from binary sequence patterning. Previous studies have only been able to incorporate heme exogenously, i.e., after expression and purification, heme is added to the purified apo-protein. See Patel et al., Protein Science, 18: 1388-1400.
DETAILED DESCRIPTION
[0028] Protein catalysis requires atomic-level orchestration of side chains, substrates, and cofactors, yet the ability to design a small-molecule-binding protein entirely from first principles with a precisely predetermined structure has not been demonstrated. Herein we describe a novel protein, PS1, which binds a highly electron-deficient, non-natural porphyrin at temperatures up to 100 °C. The high-resolution structure of holo-PS1 is in sub-Å agreement with the design. The structure of apo-PS1 retains the remote core packing of the holo, predisposing a flexible binding region for the desired ligand-binding geometry. Our results illustrate the unification of core packing and binding site definition as a central principle of ligand-binding protein design.
I. Definitions
[0029] "Analog," or "analogue" is used in accordance with its plain ordinary meaning within Chemistry and Biology and refers to a chemical compound that is structurally similar to another compound (i.e., a so-called "reference" compound) but differs in composition, e.g., in the replacement of one atom by an atom of a different element, or in the presence of a particular functional group, or the replacement of one functional group by another functional group, or the absolute stereochemistry of one or more chiral centers of the reference compound. Accordingly, an analog is a compound that is similar or comparable in function and appearance but not in structure or origin to a reference compound.
[0030] The terms "a" or "an," as used herein, mean one or more. In addition, the phrase "substituted with a[n]," as used herein, means the specified group may be substituted with one or more of any or all of the named substituents. For example, where a group, such as an alkyl or heteroaryl group, is "substituted with an unsubstituted C1-C20 alkyl, or unsubstituted 2 to 20 membered heteroalkyl," the group may contain one or more unsubstituted C1-C20 alkyls, and/or one or more unsubstituted 2 to 20 membered heteroalkyls.
[0031] A "detectable agent" or "detectable moiety" is a composition detectable by appropriate means such as spectroscopic, photochemical, biochemical, immunochemical, chemical, magnetic resonance imaging, or other physical means. For example, useful detectable agents include 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y, 89Sr, 89Zr, 94Tc, 94Tc, 99mTc, 99Mo, 105Pd, 105Rh, 111Ag, 111In, 123I, 124I, 125I, 131I, 142Pr, 143Pr, 149Pm, 153Sm, 154-158Gd, 161Tb, 166Dy, 166Ho, 169Er, 175Lu, 177Lu, 186Re, 188Re, 189Re, 194Ir, 198Au, 199Au, 211At, 211Pb, 212Bi, 212Pb, 213Bi, 223Ra, 225Ac, Cr, V, Mn, Fe, Co, Ni, Cu, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, 32P, fluorophore (e.g., fluorescent dyes or chromophores), phosphor (e.g., phosphorescent dyes or chromophores), lumophore (e.g., luminescent dyes or chromophores), electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, paramagnetic molecules, paramagnetic nanoparticles, ultrasmall superparamagnetic iron oxide ("USPIO") nanoparticles, USPIO nanoparticle aggregates, superparamagnetic iron oxide ("SPIO") nanoparticles, SPIO nanoparticle aggregates, monocrystalline iron oxide nanoparticles, monocrystalline iron oxide, nanoparticle contrast agents, liposomes or other delivery vehicles containing Gadolinium chelate ("Gd-chelate") molecules, Gadolinium, radioisotopes, radionuclides (e.g., carbon-11, nitrogen-13, oxygen-15, fluorine-18, rubidium-82), fluorodeoxyglucose (e.g., fluorine-18 labeled), any gamma ray emitting radionuclides, positron-emitting radionuclide, radiolabeled glucose, radiolabeled water, radiolabeled ammonia, biocolloids, microbubbles (e.g., including microbubble shells including albumin, galactose, lipid, and/or polymers; microbubble gas core including air, heavy gas(es), perfluorocarbon, nitrogen, octafluoropropane, perflexane lipid microsphere, perflutren, etc.), iodinated contrast agents (e.g., iohexol, iodixanol, ioversol, iopamidol, ioxilan, iopromide, diatrizoate, metrizoate, ioxaglate), barium sulfate, thorium dioxide, gold, gold nanoparticles, gold nanoparticle aggregates, two-photon fluorophores, hyperpolarizable chromophores, or haptens and proteins or other entities which can be made detectable, e.g., by incorporating a radiolabel into a peptide or antibody specifically reactive with a target peptide. A detectable moiety is a monovalent detectable agent or a detectable agent capable of forming a bond with another composition.
[0032] Radioactive substances (e.g., radioisotopes) that may be used as imaging and/or labeling agents in accordance with the embodiments of the disclosure include, but are not limited to, 18F, 32P, 33P, 45Ti, 47Sc, 52Fe, 59Fe, 62Cu, 64Cu, 67Cu, 67Ga, 68Ga, 77As, 86Y, 90Y, 89Sr, 89Zr, 94Tc, 94Tc, 99mTc, 99Mo, 105Pd, 105Rh, 111Ag, 111In, 123I, 124I, 125I, 131I, 142Pr, 143Pr, 149Pm, 153Sm, 154-158Gd, 161Tb, 166Dy, 166Ho, 169Er, 175Lu, 177Lu, 186Re, 188Re, 189Re, 194Ir, 198Au, 199Au, 211At, 211Pb, 212Bi, 212Pb, 213Bi, 223Ra, and 225Ac. Paramagnetic ions that may be used as additional imaging agents in accordance with the embodiments of the disclosure include, but are not limited to, ions of transition and lanthanide metals (e.g., metals having atomic numbers of 21-29, 42, 43, 44, or 57-71). These metals include ions of Cr, V, Mn, Fe, Co, Ni, Cu, La, Ce, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb and Lu.
[0033] An amino acid residue in a protein "corresponds" to a given residue when it occupies the same essential structural position within the protein as the given residue.
[0034] The term "isolated" when applied to a nucleic acid or protein denotes that the nucleic acid or protein is essentially free of other cellular components with which it is associated in the natural state. It can be, for example, in a homogeneous state and may be in either a dry or aqueous solution. Purity and homogeneity are typically determined using analytical chemistry techniques such as polyacrylamide gel electrophoresis or high performance liquid chromatography. A protein that is the predominant species present in a preparation is substantially purified.
[0035] The term "amino acid" refers to naturally occurring and synthetic amino acids, as well as amino acid analogs and amino acid mimetics that function in a manner similar to the naturally occurring amino acids. Naturally occurring amino acids are those encoded by the genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. Amino acid analogs refer to compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Amino acid mimetics refer to chemical compounds that have a structure that is different from the general chemical structure of an amino acid, but that function in a manner similar to a naturally occurring amino acid. The terms "non-naturally occurring amino acid" and "unnatural amino acid" refer to amino acid analogs, synthetic amino acids, and amino acid mimetics, which are not found in nature.
[0036] Amino acids may be referred to herein by either their commonly known three-letter symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical Nomenclature Commission. Nucleotides, likewise, may be referred to by their commonly accepted single-letter codes.
[0037] The terms "polypeptide," "peptide" and "protein" are used interchangeably herein to refer to a polymer of amino acid residues, wherein the polymer may in embodiments be conjugated to a moiety that does not consist of amino acids. The terms apply to amino acid polymers in which one or more amino acid residues are an artificial chemical mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers. A "fusion protein" refers to a chimeric protein encoding two or more separate protein sequences that are recombinantly expressed as a single moiety. In embodiments, the protein includes at least 30 amino acid residues. A protein may be characterized as having a protein backbone. A "protein backbone" is used herein in accordance with its ordinary meaning and refers to the polymer of amino acid residues that create a continuous chain. For example, a protein backbone may refer to the series
wherein each R independently represents optionally different amino acid side chains. In embodiments, the protein backbone includes core amino acid residues and ligand binding amino acid residues. In embodiments, the protein backbone includes core amino acid residues. In embodiments, the protein backbone includes ligand binding amino acid residues.
[0038] As may be used herein, the terms "nucleic acid," "nucleic acid molecule," "nucleic acid oligomer," "oligonucleotide," "nucleic acid sequence," "nucleic acid fragment" and "polynucleotide" are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides covalently linked together that may have various lengths, either deoxyribonucleotides or ribonucleotides, or analogs, derivatives or modifications thereof. Different polynucleotides may have different three-dimensional structures, and may perform various functions, known or unknown. Non-limiting examples of polynucleotides include a gene, a gene fragment, an exon, an intron, intergenic DNA (including, without limitation, heterochromatic DNA), messenger RNA (mRNA), transfer RNA, ribosomal RNA, a ribozyme, cDNA, a recombinant polynucleotide, a branched polynucleotide, a plasmid, a vector, isolated DNA of a sequence, isolated RNA of a sequence, a nucleic acid probe, and a primer. Polynucleotides useful in the methods of the disclosure may comprise natural nucleic acid sequences and variants thereof, artificial nucleic acid sequences, or a combination of such sequences.
[0039] A polynucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA). Thus, the term "polynucleotide sequence" is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching. Polynucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
[0040] "Conservatively modified variants" applies to both amino acid and nucleic acid sequences. With respect to particular nucleic acid sequences, "conservatively modified variants" refers to those nucleic acids that encode identical or essentially identical amino acid sequences. Because of the degeneracy of the genetic code, a number of nucleic acid sequences will encode any given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every position where an alanine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are "silent variations," which are one species of conservatively modified variations. Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation of the nucleic acid. One of skill will recognize that each codon in a nucleic acid (except AUG, which is ordinarily the only codon for methionine, and TGG, which is ordinarily the only codon for tryptophan) can be modified to yield a functionally identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a polypeptide is implicit in each described sequence.
[0041] As to amino acid sequences, one of skill will recognize that individual substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded sequence is a "conservatively modified variant" where the alteration results in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. Such conservatively modified variants are in addition to and do not exclude polymorphic variants, interspecies homologs, and alleles of the disclosure.
[0042] The following eight groups each contain amino acids that are conservative substitutions for one another: 1) Alanine (A), Glycine (G); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W); 7) Serine (S), Threonine (T); and 8) Cysteine (C), Methionine (M) (see, e.g., Creighton, Proteins (1984)).
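As an illustration only, the eight groups listed above can be encoded as sets and used to test whether a substitution is conservative. The function names `is_conservative` and `nonconservative_positions` are hypothetical and are not part of the disclosure; the sketch assumes two pre-aligned sequences of equal length with no gaps.

```python
# Conservative-substitution groups as listed in the text (one-letter codes).
# Methionine (M) appears in two groups, exactly as in the source list.
CONSERVATIVE_GROUPS = [
    set("AG"), set("DE"), set("NQ"), set("RK"),
    set("ILMV"), set("FYW"), set("ST"), set("CM"),
]


def is_conservative(a: str, b: str) -> bool:
    """True if residues a and b are identical or share at least one group."""
    a, b = a.upper(), b.upper()
    if a == b:
        return True
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def nonconservative_positions(ref: str, var: str) -> list:
    """1-based positions where equal-length sequences differ non-conservatively."""
    if len(ref) != len(var):
        raise ValueError("sequences must be the same length (pre-aligned, no gaps)")
    return [i for i, (r, v) in enumerate(zip(ref, var), start=1)
            if not is_conservative(r, v)]
```

For example, `nonconservative_positions("ADK", "GEW")` flags only position 3, because A/G and D/E fall within the same groups while K/W do not.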
[0043] "Percentage of sequence identity" is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide or polypeptide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of sequence identity. The terms "identical" or percent "identity," in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using the BLAST or BLAST 2.0 sequence comparison algorithms with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site http://www.ncbi.nlm.nih.gov/BLAST/ or the like). Such sequences are then said to be "substantially identical." This definition also refers to, or may be applied to, the complement of a test sequence. The definition also includes sequences that have deletions and/or additions, as well as those that have substitutions. As described below, the preferred algorithms can account for gaps and the like. Preferably, identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length.
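The calculation described above (matched positions divided by the total number of positions in the comparison window, multiplied by 100) can be sketched as follows. This is a minimal illustration that assumes the two sequences have already been aligned by some other tool and that gaps are written as "-"; the function name `percent_identity` is hypothetical and this is not the BLAST algorithm itself.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over a comparison window of two pre-aligned sequences.

    Matched positions are divided by the total number of positions in the
    window (gap columns included) and multiplied by 100.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    # A column counts as a match only if both residues are identical
    # and the column is not a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)
```

With the aligned pair `"ACDE-G"` and `"ACDQFG"`, four of six columns match, giving roughly 66.7% identity.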
[0044] An amino acid or nucleotide base "position" is denoted by a number that sequentially identifies each amino acid (or nucleotide base) in the reference sequence based on its position relative to the N-terminus (or 5'-end). Due to deletions, insertions, truncations, fusions, and the like that must be taken into account when determining an optimal alignment, in general the amino acid residue number in a test sequence determined by simply counting from the N-terminus will not necessarily be the same as the number of its corresponding position in the reference sequence. For example, in a case where a variant has a deletion relative to an aligned reference sequence, there will be no amino acid in the variant that corresponds to a position in the reference sequence at the site of deletion. Where there is an insertion in an aligned reference sequence, that insertion will not correspond to a numbered amino acid position in the reference sequence. In the case of truncations or fusions there can be stretches of amino acids in either the reference or aligned sequence that do not correspond to any amino acid in the corresponding sequence.
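The numbering behavior described above can be sketched in a few lines. The example assumes a pairwise alignment is already available as two equal-length strings with "-" for gaps; the function name `map_positions` is hypothetical and chosen only for illustration.

```python
def map_positions(aligned_ref: str, aligned_test: str, gap: str = "-") -> dict:
    """Map 1-based test-sequence positions to reference-sequence positions.

    Test residues aligned to a gap in the reference (insertions) map to None;
    reference positions deleted in the test sequence never appear as values.
    """
    if len(aligned_ref) != len(aligned_test):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    ref_pos = 0
    test_pos = 0
    for r, t in zip(aligned_ref, aligned_test):
        if r != gap:
            ref_pos += 1          # advance the reference numbering
        if t != gap:
            test_pos += 1         # advance the test numbering
            mapping[test_pos] = ref_pos if r != gap else None
    return mapping
```

For the alignment `MKT-LV` (reference) against `M-TALV` (test), test position 2 corresponds to reference position 3 because of the deletion, and test position 3 is an insertion with no reference number.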
[0045] The terms "numbered with reference to" or "corresponding to," when used in the context of the numbering of a given amino acid or polynucleotide sequence, refer to the numbering of the residues of a specified reference sequence when the given amino acid or polynucleotide sequence is compared to the reference sequence.
[0046] The term "amino acid side chain" refers to the functional substituent contained on amino acids. For example, an amino acid side chain may be the side chain of a naturally occurring amino acid. Naturally occurring amino acids are those encoded by the genetic code (e.g., alanine, arginine, asparagine, aspartic acid, cysteine, glutamine, glutamic acid, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, or valine), as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. In embodiments, the amino acid side chain may be a non-natural amino acid side chain. In embodiments, the amino acid side
[0047] The term "non-natural amino acid side chain" refers to the functional substituent of compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium, allylalanine, 2-aminoisobutyric acid. Non-natural amino acids are non-proteinogenic amino acids that either occur naturally or are chemically synthesized. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Non-limiting examples include exo-cis-3-aminobicyclo[2.2.1]hept-5-ene-2-carboxylic acid hydrochloride, cis-2-aminocycloheptanecarboxylic acid hydrochloride, cis-6-amino-3-cyclohexene-1-carboxylic acid hydrochloride, cis-2-amino-2-methylcyclohexanecarboxylic acid hydrochloride, cis-2-amino-2-methylcyclopentanecarboxylic acid hydrochloride, 2-(Boc-aminomethyl)benzoic acid, 2-(Boc-amino)octanedioic acid, Boc-4,5-dehydro-Leu-OH (dicyclohexylammonium), Boc-4-(Fmoc-amino)-L-phenylalanine, Boc-β-Homopyr-OH, Boc-(2-indanyl)-Gly-OH, 4-Boc-3-morpholineacetic acid, 4-Boc-3-morpholineacetic acid, Boc-pentafluoro-D-phenylalanine, Boc-pentafluoro-L-phenylalanine, Boc-Phe(2-Br)-OH, Boc-Phe(4-Br)-OH, Boc-D-Phe(4-Br)-OH, Boc-D-Phe(3-Cl)-OH, Boc-Phe(4-NH2)-OH, Boc-Phe(3-NO2)-OH, Boc-Phe(3,5-F2)-OH, 2-(4-Boc-piperazino)-2-(3,4-dimethoxyphenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(2-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(3-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(4-fluorophenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-(4-methoxyphenyl)acetic acid purum, 2-(4-Boc-piperazino)-2-phenylacetic acid purum, 2-(4-Boc-piperazino)-2-(3-pyridyl)acetic acid purum, 2-(4-Boc-piperazino)-2-[4-(trifluoromethyl)phenyl]acetic acid purum, Boc-β-(2-quinolyl)-Ala-OH, N-Boc-1,2,3,6-tetrahydro-2-pyridinecarboxylic acid, Boc-β-(4-thiazolyl)-Ala-OH, Boc-β-(2-thienyl)-D-Ala-OH, Fmoc-N-(4-Boc-aminobutyl)-Gly-OH, Fmoc-N-(2-Boc-aminoethyl)-Gly-OH, Fmoc-N-(2,4-dimethoxybenzyl)-Gly-OH, Fmoc-(2-indanyl)-Gly-OH, Fmoc-pentafluoro-L-phenylalanine, Fmoc-Pen(Trt)-OH, Fmoc-Phe(2-Br)-OH, Fmoc-Phe(4-Br)-OH, Fmoc-Phe(3,5-F2)-OH, Fmoc-β-(4-thiazolyl)-Ala-OH, Fmoc-β-(2-thienyl)-Ala-OH, 4-(hydroxymethyl)-D-phenylalanine.
[0048] The terms "identical" or percent "identity," in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using the BLAST or BLAST 2.0 sequence comparison algorithms with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site http://www.ncbi.nlm.nih.gov/BLAST/ or the like). Such sequences are then said to be "substantially identical." This definition also refers to, or may be applied to, the complement of a test sequence. The definition also includes sequences that have deletions and/or additions, as well as those that have substitutions. As described below, the preferred algorithms can account for gaps and the like. Preferably, identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length.
[0049] The term "expression" includes any step involved in the production of the polypeptide including, but not limited to, transcription, post-transcriptional modification, translation, post-translational modification, and secretion. Expression can be detected using conventional techniques for detecting protein (e.g., ELISA, Western blotting, flow cytometry, immunofluorescence, immunohistochemistry, etc.).
[0050] "Control" or "control experiment" is used in accordance with its plain ordinary meaning and refers to an experiment in which the subjects or reagents of the experiment are treated as in a parallel experiment except for omission of a procedure, reagent, or variable of the experiment. In some instances, the control is used as a standard of comparison in evaluating experimental effects. In some embodiments, a control is the measurement of the activity of a protein in the absence of a compound as described herein (including embodiments and examples).
[0051] As used herein, the term "about" means a range of values including the specified value, which a person of ordinary skill in the art would consider reasonably similar to the specified value. In embodiments, about means within a standard deviation using measurements generally acceptable in the art. In embodiments, about means a range extending to +/- 10% of the specified value. In embodiments, about means the specified value.
[0052] The terms "bind" and "bound" as used herein are used in accordance with their plain and ordinary meaning and refer to the association between atoms or molecules. The association can be direct or indirect. For example, bound atoms or molecules may be bound directly, e.g., by covalent bond or linker (e.g., a first linker or second linker), or indirectly, e.g., by non-covalent bond (e.g., electrostatic interactions (e.g., ionic bond, hydrogen bond, halogen bond), van der Waals interactions (e.g., dipole-dipole, dipole-induced dipole, London dispersion), ring stacking (pi or hydrophobic effects), hydrophobic interactions and the like).
[0053] The term "set of ligand binding amino acid residues" as used herein refers to at least two ligand binding amino acid residues. "Ligand binding amino acid residues" refer to amino acid residues which are capable of binding (e.g., have a measurable dissociation constant of binding, have a dissociation constant of binding less than 1 μM) to a ligand. In embodiments, the ligand binding amino acid residues refer to amino acid residues which bind to a ligand. Each ligand binding amino acid residue is associated with a set of ligand binding amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, or spherical coordinates) which defines the ligand binding amino acid residue in space (e.g., Euclidean space). In embodiments, ligand binding amino acid residues refer to amino acid residues within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or 30 Å from the ligand. In embodiments, ligand binding amino acid residues refer to amino acid residues within about 5 Å from the ligand. In determining the set of ligand binding amino acid residues, factors such as the proximity of the amino acid to the ligand or the interactions between the amino acid and the ligand may influence the designation to be a "ligand binding amino acid residue."
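A distance-based designation of ligand binding residues, such as the 5 Å embodiment above, might be sketched as follows. The data layout (a dictionary from residue identifier to a list of Cartesian atom coordinates) and the function names are assumptions made for illustration; they are not prescribed by the disclosure, and proximity is only one of the factors the text names.

```python
import math


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between two lists of (x, y, z) tuples."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def ligand_binding_residues(residues, ligand_atoms, cutoff=5.0):
    """Return IDs of residues with any atom within `cutoff` angstroms of the ligand.

    `residues` maps a residue ID to a list of (x, y, z) atomic coordinates;
    `ligand_atoms` is a list of (x, y, z) ligand atomic coordinates.
    """
    return sorted(res_id for res_id, atoms in residues.items()
                  if min_distance(atoms, ligand_atoms) <= cutoff)
```

A residue is selected if any one of its atoms falls within the cutoff, so a long side chain can qualify even when its backbone atoms lie farther away.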
[0054] The term "dissociation constant" is used in accordance with its plain ordinary meaning and refers to the ligand concentration at which half of the proteins are occupied (i.e., bound to a ligand) at equilibrium. Typically, the dissociation constant has molar units (M). The smaller the dissociation constant, the more tightly bound the ligand is, or the higher the affinity between ligand and protein. For example, a ligand with a nanomolar (nM) dissociation constant binds more tightly to a particular protein than a ligand with a micromolar (μM) dissociation constant.
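The relationship between dissociation constant and occupancy can be made concrete with the standard single-site binding expression, fraction bound = [L] / (Kd + [L]). The sketch below assumes a single independent binding site and that [L] is the free ligand concentration at equilibrium; the function name is hypothetical.

```python
def fraction_bound(free_ligand_molar: float, kd_molar: float) -> float:
    """Fraction of protein occupied for a single-site binding equilibrium.

    Uses theta = [L] / (Kd + [L]), with both quantities in molar units.
    """
    if free_ligand_molar < 0 or kd_molar <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return free_ligand_molar / (kd_molar + free_ligand_molar)
```

When the free ligand concentration equals the dissociation constant, half of the protein is occupied. At 100 nM free ligand, a ligand with a 1 nM dissociation constant occupies about 99% of the protein, whereas one with a 1 μM dissociation constant occupies about 9%.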
[0055] The terms "ligand" and "cofactor" are synonymous, and used in accordance with their plain ordinary meaning in chemistry and biochemistry and refer to an agent (e.g., compound, metal, ion, biomolecule, agonist, antagonist) which is capable of binding to a protein (e.g., a protein described herein). In embodiments, a ligand refers to an agent (e.g., compound, metal, ion, biomolecule) which binds (e.g., covalently or non-covalently) to a protein. Typically, upon binding the ligand has an effect on the protein (e.g., structural change of the protein, modulation of signaling pathways). A ligand is associated with a set of ligand atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which define the ligand in space (e.g., Euclidean space). The ligand may be endogenous or exogenous. Non-limiting examples of ligands include a catalyst, detectable agent, therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic (e.g., a combined therapeutic and diagnostic agent), photodynamic therapy (PDT) agent, porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component that is capable of binding a metal ion. In embodiments, the ligand is a peptide (e.g., 2 to 30 amino acid residues), a protein (e.g., greater than 30 amino acid residues), a small molecule (e.g., a compound with a molecular weight of less than 2000 Daltons), or a small molecule-metal-ion complex (e.g., a metalloporphyrin). In embodiments, the ligand is endogenous. In embodiments, the ligand is exogenous. In embodiments, the ligand is flavin. In embodiments, the ligand is heme.
[0056] The term "set of core amino acid residues" as used herein refers to at least two core amino acid residues. Core amino acid residues refer to amino acid residues which are incapable of binding to a ligand (e.g., do not have a measurable dissociation constant of binding, do not have a dissociation constant of binding less than 1 μM). In embodiments, core amino acids are amino acids which do not bind a ligand. Each core amino acid residue is associated with a set of core amino acid residue atomic coordinates (e.g., Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates) which defines the core amino acid residue in space (e.g., Euclidean space). Core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe. A typical set of core amino acid residues contains at least 6 amino acid residues. In embodiments, the set of core amino acid residues includes amino acid residues which are solvent inaccessible as measured by the accessible surface area. Additional information regarding the accessible surface area assessment may be found in Lins et al. (Lins, L., Thomas, A., & Brasseur, R. (2003) Protein Science: A Publication of the Protein Society, 12(7), 1406-141), which is incorporated herein in its entirety for all purposes. In embodiments, the core amino acid atomic coordinates are greater than 5 Å from any ligand atomic coordinate. In embodiments, the set of core amino acid residues is hydrophobic. In embodiments, the core amino acids include the sequence:
LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO: 5).
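One way to picture the core designation is as a filter over per-residue descriptors. The sketch below assumes that the buried fraction (relative to a 1.8 Å probe) and the minimum distance to any ligand atom have already been computed by an external tool; the input layout, the default thresholds taken from the text, and the function name `core_residues` are illustrative assumptions rather than the disclosed procedure.

```python
def core_residues(descriptors, min_buried=0.75, min_ligand_distance=5.0):
    """Select residue IDs that satisfy the burial and distance criteria.

    `descriptors` maps a residue ID to a tuple
    (fraction_buried, min_distance_to_ligand_in_angstroms), where
    fraction_buried is 1.0 for a fully probe-inaccessible residue.
    """
    selected = []
    for res_id, (fraction_buried, ligand_distance) in descriptors.items():
        buried_enough = fraction_buried >= min_buried
        far_from_ligand = ligand_distance > min_ligand_distance
        if buried_enough and far_from_ligand:
            selected.append(res_id)
    return sorted(selected)
```

Combining both criteria is a design choice of this sketch; the text presents the 5 Å separation as one embodiment, so a caller could relax it by passing `min_ligand_distance=0.0`.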
[0057] The terms "optimizing" and "optimization" are used in accordance with their ordinary meaning in mathematics and computer science and refer to identifying a favorable outcome subject to certain criteria (e.g., constraints) from a set of available possibilities. Optimizing may employ iterative or heuristic algorithms, such as a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, simulated annealing algorithm, Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm. For example, optimizing typically includes evaluating an energy function (e.g., force field model) and finding the minimum (e.g., global minimum or local minimum). Optimizing may include repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates). The output of an optimization process may provide a set of ligand binding amino acid residues and a corresponding set of ligand binding amino acid residue atomic coordinates, and a set of core amino acid residues and a corresponding set of core amino acid residue atomic coordinates, which corresponds to an energetically stabilized protein. In embodiments the outcome of the optimization is the global minimum (e.g., the most energetically stabilized protein). In embodiments the outcome of the optimization is a local minimum (e.g., a minimum energy given the domain). In embodiments the optimization is complete when the derivative of the energy with respect to the position of the atoms, ∂E/∂r, is zero and the Hessian matrix has positive eigenvalues. In embodiments, optimizing includes a plurality of minimization calculations. In embodiments the optimization is a finite number of iterations.
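To illustrate one of the heuristic algorithms named above, the following is a minimal simulated annealing sketch with a Metropolis acceptance rule. It operates on a generic list of coordinates and a user-supplied energy function; the cooling schedule, step size, fixed iteration count, and all names are assumptions chosen for the example and do not represent the specific optimization used in the disclosure.

```python
import math
import random


def simulated_annealing(energy, x0, steps=5000, t_start=1.0, t_end=1e-3,
                        step_size=0.1, seed=0):
    """Minimize `energy` over a list of floats; return (best_x, best_energy)."""
    rng = random.Random(seed)
    x = list(x0)
    e = energy(x)
    best_x, best_e = list(x), e
    for k in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        # Propose a small random move of one coordinate.
        trial = list(x)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        e_trial = energy(trial)
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-dE / T).
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / t):
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = list(x), e
    return best_x, best_e
```

Because the best state seen so far is tracked separately, the returned energy is never higher than the starting energy, and the fixed `steps` argument corresponds to the embodiment in which the optimization is a finite number of iterations.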
[0058] An energy minimization calculation refers to the process of evaluating the energy as a function of the atomic coordinates, V(r). The energy function may include intra- and intermolecular energy terms within the system (e.g., protein) which may be written as V_total(r) = V_bonds(r) + V_angles(r) + V_dihedral(r) + V_improper(r) + V_nonbonding(r) + V_electrostatics(r); where V_total(r) corresponds to the total energy as a function of the atomic positions; V_bonds(r) corresponds to the energy contribution from bonded atoms; V_angles(r) corresponds to the energy contribution from angles; V_dihedral(r) corresponds to the energy contribution from dihedral torsions; V_improper(r) corresponds to the energy contribution from out-of-plane torsions; V_nonbonding(r) corresponds to the energy contribution from nonbonding interactions; and V_electrostatics(r) corresponds to the energy contribution from electrostatic interactions. Additional energy function terms may also be included in the total energy function, V_total(r), for example additional functions from molecular mechanics, functions from structural bioinformatics (log-odds scores), amino acid sidechain packing functions (e.g., functions and algorithms which vary the identity and rotamer of an amino acid side chain), protein radius of gyration functions, or a penalty function.
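The additive structure of the total energy can be illustrated with a toy function that sums a few of the terms. The sketch below implements only three of the six terms (harmonic bonds, a Lennard-Jones form for nonbonding interactions, and Coulomb electrostatics); the functional forms, the parameter values, and the approximate Coulomb constant of 332.06 kcal·Å/(mol·e²) are common molecular mechanics conventions assumed here for illustration, not values taken from the disclosure.

```python
import math


def v_bonds(coords, bonds, k=300.0, r0=1.5):
    """Harmonic bond energy: sum of 0.5 * k * (r - r0)^2 over bonded pairs."""
    return sum(0.5 * k * (math.dist(coords[i], coords[j]) - r0) ** 2
               for i, j in bonds)


def v_nonbonding(coords, pairs, epsilon=0.1, sigma=3.4):
    """Lennard-Jones 12-6 energy over nonbonded pairs."""
    total = 0.0
    for i, j in pairs:
        sr6 = (sigma / math.dist(coords[i], coords[j])) ** 6
        total += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return total


def v_electrostatics(coords, pairs, charges, coulomb_k=332.06):
    """Coulomb energy over nonbonded pairs (charges in units of e)."""
    return sum(coulomb_k * charges[i] * charges[j]
               / math.dist(coords[i], coords[j]) for i, j in pairs)


def v_total(coords, bonds, pairs, charges):
    """Total energy as the sum of the individual contributions."""
    return (v_bonds(coords, bonds)
            + v_nonbonding(coords, pairs)
            + v_electrostatics(coords, pairs, charges))
```

Angle, dihedral, and improper terms would be added to `v_total` in the same way, each as a function of the atomic coordinates.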
[0059] The term "biomolecule" as used herein refers to a molecule present in living organisms (e.g., proteins, carbohydrates, lipids, nucleic acids, and metabolites) and may be endogenous or exogenous in origin.
[0060] The term "energetically stabilized protein" is used in accordance with its ordinary meaning in the art, and is understood to refer to a protein which is structurally and thermodynamically stable relative to the protein that has not been energetically stabilized. For example, an energetically stabilized protein is determined to be energetically stabilized by determining the difference in the Gibbs free energy between the folded and unfolded states of the protein, also referred to herein as ΔG_folding. An energetically stabilized protein may be characterized by a well-dispersed NMR spectrum and/or the presence of a significantly folded core. In embodiments, the energetically stabilized protein is an enzyme. In embodiments, the energetically stabilized protein is an apo protein (e.g., a protein that is not bound to a ligand). In embodiments, the energetically stabilized protein is a holo protein (e.g., a protein that is bound to a ligand). In embodiments, the energetically stabilized protein is an apo protein which is capable of becoming a holo protein upon ligand binding. In embodiments, an energetically stabilized protein refers to a protein which is capable of performing a function (e.g., modulating a signal pathway). In embodiments, the energetically stabilized protein resists side-reactions such as aggregation and proteolysis. In embodiments, the energetically stabilized protein has a ΔG_folding of about -5 to about -40 kcal/mol in standard physiological conditions (e.g., temperature range of 20-40 degrees Celsius, atmospheric pressure of 1 atm, pH of 6-8, glucose concentration of 1-20 mM, atmospheric oxygen concentration).
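To give a sense of scale for the ΔG_folding range above, the folded population of a two-state system follows from K = exp(-ΔG_folding / RT). The sketch below assumes a simple two-state (folded/unfolded) equilibrium, which is a modeling assumption rather than a statement from the disclosure; R is the gas constant in kcal/(mol·K).

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def fraction_folded(delta_g_folding_kcal: float,
                    temperature_k: float = 298.15) -> float:
    """Folded fraction for a two-state equilibrium.

    With K = [folded]/[unfolded] = exp(-dG/RT), the folded fraction is
    K / (1 + K), written here as 1 / (1 + exp(dG/RT)) for numerical stability
    at strongly negative dG.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    rt = R_KCAL_PER_MOL_K * temperature_k
    return 1.0 / (1.0 + math.exp(delta_g_folding_kcal / rt))
```

At 25 degrees Celsius, a ΔG_folding of -5 kcal/mol already corresponds to a folded population above 99.9%, and more negative values push the population still closer to fully folded.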
[0061] The term "exogenous" refers to a molecule or substance (e.g., a compound, ligand, or protein) that originates from outside a given cell or organism. Conversely, the term
"endogenous" refers to a molecule or substance that is native to, or originates within, a given cell or organism.
[0062] A "therapeutic agent" as used herein refers to an agent (e.g., compound or composition) that when administered to a subject in sufficient amounts will have a therapeutic effect, such as an intended prophylactic effect, preventing or delaying the onset (or reoccurrence) of an injury, disease, pathology or condition, or reducing the likelihood of the onset (or reoccurrence) of an injury, disease, pathology, or condition, or their symptoms or the intended therapeutic effect, e.g., treatment or amelioration of an injury, disease, pathology or condition, or their symptoms including any objective or subjective parameter of treatment such as abatement; remission;
diminishing of symptoms or making the injury, pathology or condition more tolerable to the patient; slowing in the rate of degeneration or decline; making the final point of degeneration less debilitating; or improving a patient's physical or mental well-being.
[0063] The term "small molecule" or the like as used herein refers, unless indicated otherwise, to a molecule having a molecular weight of less than about 700 Dalton, e.g., less than about 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 100, or 50 Dalton.
[0064] In this disclosure, "comprises," "comprising," "containing" and "having" and the like can have the meaning ascribed to them in U.S. Patent law and can mean "includes," "including," and the like. "Consisting essentially of" or "consists essentially" likewise has the meaning ascribed in U.S. Patent law and the term is open-ended, allowing for the presence of more than that which is recited so long as basic or novel characteristics of that which is recited is not changed by the presence of more than that which is recited, but excludes prior art embodiments.
II. Methods
[0065] In an aspect is provided a computer-implemented method, including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein. In embodiments, the
optimization is performed to improve, relative to a control, the protein-ligand interactions (e.g., decrease the dissociation constant of binding 1-fold, 2-fold, 3-fold, 4-fold or 5-fold). In
embodiments, the optimization modulates, relative to a control, the non-covalent interactions between the protein and the ligand.
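As a hedged worked example, a fold-decrease in the dissociation constant can be related to a change in binding free energy through ΔG = RT ln Kd. The sketch below assumes that an "f-fold decrease" means the new dissociation constant equals the control value divided by f; that reading, and the function name, are assumptions for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def ddg_from_kd_fold_decrease(fold, temperature_k=298.15):
    """Change in binding free energy (kcal/mol) when Kd drops by `fold`.

    With dG_bind = RT ln(Kd), dividing Kd by `fold` changes dG_bind by
    -RT ln(fold); a negative result means tighter binding.
    """
    if fold <= 0:
        raise ValueError("fold must be positive")
    return -R_KCAL * temperature_k * math.log(fold)


# Example: tabulate the fold values mentioned in the text.
for f in (1, 2, 3, 4, 5):
    print(f, ddg_from_kd_fold_decrease(f))
```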
[0066] In embodiments, step c) includes simultaneously optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes concurrently (e.g., performing an optimization iteration on all sets prior to continuing the optimization) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates.
[0067] In embodiments, the optimizing is joint optimizing (e.g., optimizing the set of ligand binding amino acid residues, the set of core amino acid residues, and optionally the ligand simultaneously). In embodiments, step c) includes optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of ligand binding amino acid residues and the set of core amino acid residues. In embodiments, step c) includes optimizing the set of ligand binding amino acid residues and the set of ligand binding amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of core amino acid residues and the set of core amino acid residue atomic coordinates. In embodiments, step c) includes optimizing the set of ligand binding amino acid residue atomic coordinates and the set of core amino acid residue atomic coordinates.
[0068] In embodiments, step c) includes optimizing the protein backbone. Optimizing the protein backbone may refer to repeated evaluations of the energy function and may include fixing an atomic coordinate (e.g., fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate, but not the side chain of the residue), introducing additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), restricting the introduction of additional amino acid residues into the set of amino acid residues (e.g., the set of ligand binding amino acid residues), or a geometric transformation (e.g., translation or rotation) of an amino acid residue atomic coordinate, but not the side chain of the residue (e.g., the atomic coordinate of the ligand binding amino acid residue atomic coordinates). In embodiments, step c) includes simultaneously optimizing the protein backbone and the set of ligand binding amino acid residues. In embodiments, step c) includes simultaneously optimizing the protein backbone and the ligand. In embodiments, step c)
includes simultaneously optimizing the protein backbone and the set of core amino acid residues. In embodiments, step c) includes optimizing the protein backbone using known conformational sampling techniques in the art (e.g., rigid-body shifts of helices, backrub algorithms, or crankshaft algorithms). In embodiments, step c) is performed using a protein modeling software suite (e.g., Rosetta). In embodiments, step c) includes an ensemble (e.g., a finite set of proteins, which includes amino acid residue atomic coordinates) of backbones for conformational sampling calculations.
[0069] In embodiments, step c) includes fixing (e.g., not geometrically displacing) an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
[0070] In embodiments, step c) includes fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of at least one ligand binding amino acid residue atomic coordinate. In embodiments, step c) includes fixing an atomic coordinate of at least one ligand atomic coordinate. In embodiments, step c) includes fixing all atomic coordinates of the ligand atomic coordinate. In embodiments, step c) includes prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues. In embodiments, step c) includes prohibiting introduction of an additional amino acid residue into the set of core amino acid residues. In embodiments, step c) includes prohibiting the deletion of an amino acid residue from the set of core amino acid residues. In embodiments, the method includes distance and angle constraints (i.e. specifying the distance of a ligand to an amino acid (e.g., a ligand binding amino acid residue) coordinate).
[0071] In embodiments, the optimizing includes fixing (e.g., not geometrically displacing) at least one atomic coordinate of the ligand atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the optimizing does not include fixing any atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the optimizing includes fixing an angle formed by three atoms (e.g., angles formed between atoms of the ligand and the ligand binding amino acid residues) or fixing the distance between atoms (e.g., at least one atomic coordinate of the ligand and at least one atomic coordinate of the ligand binding amino acid residue).
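One possible way to express fixed coordinates and distance or angle constraints is sketched below. Fixed atoms are skipped when a move is applied, and distances and angles are held near target values by a harmonic penalty. The harmonic form, the force constants, and all names are assumptions for illustration; a hard constraint could be used instead.

```python
import math


def angle_deg(a, b, c):
    """Angle in degrees at atom b formed by atoms a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_t))


def restraint_penalty(coords, distance_restraints=(), angle_restraints=(),
                      k_dist=10.0, k_angle=0.01):
    """Harmonic penalty for deviating from target distances (Å) and angles (degrees).

    distance_restraints: iterable of (i, j, target_distance)
    angle_restraints:    iterable of (i, j, k, target_angle) with the angle at j
    """
    penalty = 0.0
    for i, j, target in distance_restraints:
        penalty += k_dist * (math.dist(coords[i], coords[j]) - target) ** 2
    for i, j, k, target in angle_restraints:
        penalty += k_angle * (angle_deg(coords[i], coords[j], coords[k]) - target) ** 2
    return penalty


def apply_move(coords, displacements, fixed=frozenset()):
    """Displace every atom except those whose index is in `fixed`."""
    moved = []
    for idx, (c, d) in enumerate(zip(coords, displacements)):
        moved.append(tuple(c) if idx in fixed else tuple(x + dx for x, dx in zip(c, d)))
    return moved
```

The penalty returned by `restraint_penalty` could be passed as an extra term to whatever energy function drives the optimization.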
[0072] In embodiments, the optimizing includes an iterative or heuristic algorithm. In embodiments, the optimizing includes an iterative algorithm. In embodiments, the optimizing includes a heuristic algorithm. In embodiments, the optimizing includes a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm. In embodiments, the optimizing includes a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm. In embodiments, the optimizing includes knobs-into-holes side chain packing. In embodiments, the optimization may begin with an idealized, parameterized backbone. In embodiments, optimization may relax the backbone structure of the protein, for example, by using gradient descent algorithms, while optimizing the protein sequence via rotamer sampling and minimization.
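By way of a non-limiting illustration, Monte Carlo sampling with simulated annealing over discrete choices (e.g., an amino acid identity or rotamer at each position) may be sketched as follows. The state encoding, the geometric cooling schedule, the default temperatures, and the toy energy function are all assumptions for illustration; in practice the energy would come from a design energy function such as the one described above.

```python
import math
import random


def simulated_annealing(n_positions, n_choices, energy_fn, steps=2000,
                        t_start=2.0, t_end=0.05, seed=0):
    """Metropolis Monte Carlo with a geometric cooling schedule.

    The state is a list with one integer choice per position. Each step
    changes one position at random and accepts the change if it lowers the
    energy, or with probability exp(-dE/T) otherwise.
    """
    rng = random.Random(seed)
    state = [rng.randrange(n_choices) for _ in range(n_positions)]
    energy = energy_fn(state)
    best_state, best_energy = list(state), energy
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        pos = rng.randrange(n_positions)
        old_choice = state[pos]
        state[pos] = rng.randrange(n_choices)
        new_energy = energy_fn(state)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_state, best_energy = list(state), energy
        else:
            state[pos] = old_choice  # reject the move
    return best_state, best_energy


# Toy energy (hypothetical): number of positions that differ from a target.
target = [1, 0, 2, 1, 0]
best, e = simulated_annealing(5, 3, lambda s: sum(a != b for a, b in zip(s, target)))
print(best, e)
```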
[0073] In embodiments, the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
[0074] In embodiments, the optimizing includes introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues (e.g., designating an amino acid residue previously designated as a core amino acid residue to a ligand binding amino acid residue). In embodiments, the optimizing includes replacing a ligand binding amino acid residue within the set of ligand binding amino acid residues. In embodiments, the optimizing includes deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues (e.g., designating an amino acid residue previously designated as a ligand binding amino acid residue to a core amino acid residue). In embodiments, the optimizing includes a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of at least one of the ligand binding amino acid residue atomic coordinates. In embodiments, the optimizing includes a geometric transformation of the atomic coordinates of the ligand binding amino acid residue atomic coordinates.
[0075] In embodiments, the geometric transformation includes a translation (i.e., a geometric transformation that moves a coordinate by the same distance in a given direction) or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation (e.g., displacing the x coordinate) of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates. In
embodiments, the geometric transformation includes a translation of all atomic coordinates (e.g., x, y, and z coordinates in Cartesian space) of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least two atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least three atomic coordinates of the ligand binding amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of all atomic coordinates of the ligand binding amino acid residue atomic coordinates.
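The translation and rotation described above may be illustrated, under stated assumptions, with the following minimal sketch operating on Cartesian (x, y, z) tuples. The rotation uses Rodrigues' rotation formula about an arbitrary axis through a chosen origin; the function names and the choice of formula are illustrative assumptions only.

```python
import math


def translate(coords, shift):
    """Move every coordinate by the same displacement vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


def rotate(coords, axis, angle_deg, origin=(0.0, 0.0, 0.0)):
    """Rotate coordinates about `axis` through `origin` (Rodrigues' formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for p in coords:
        vx, vy, vz = (p[i] - origin[i] for i in range(3))
        dot = kx * vx + ky * vy + kz * vz
        # cross product k x v
        cx, cy, cz = ky * vz - kz * vy, kz * vx - kx * vz, kx * vy - ky * vx
        rotated.append((
            origin[0] + vx * c + cx * s + kx * dot * (1.0 - c),
            origin[1] + vy * c + cy * s + ky * dot * (1.0 - c),
            origin[2] + vz * c + cz * s + kz * dot * (1.0 - c),
        ))
    return rotated
```

Both operations are rigid-body transformations, so interatomic distances within the transformed set are preserved.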
[0076] In embodiments, the optimizing includes a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation or a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of at least two atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a translation of all atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of at least one atomic coordinate of the core amino acid residue atomic coordinates. In
embodiments, the geometric transformation includes a rotation of at least two atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric
transformation includes a rotation of at least three atomic coordinates of the core amino acid residue atomic coordinates. In embodiments, the geometric transformation includes a rotation of all atomic coordinates of the core amino acid residue atomic coordinates.
[0077] In embodiments, the optimizing includes 1a) calculating the force on each atom in the protein (e.g., the set of ligand binding amino acid residues; the set of core amino acid residues; and the ligand); 2a) evaluating the calculation to determine if it is the minimum or below an acceptable threshold; 3a) if the force is less than a threshold, the optimization is finished, otherwise perform a geometric transformation (e.g., translation) of at least one atomic coordinate on the atoms in the protein; and 4a) repeat.
[0078] In embodiments, the geometric transformation of at least one atomic coordinate includes no greater than a 6 Å displacement of any atomic coordinate. In embodiments, the geometric transformation of at least one atomic coordinate includes no greater than a 3 Å displacement of any atomic coordinate. In embodiments, the displacement is no greater than 0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 Å displacement of any atomic coordinate. In embodiments, the displacement is no greater than 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1.0 Å displacement of any atomic coordinate.
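Steps 1a) through 4a), together with a cap on the per-atom displacement, may be sketched as a simple steepest-descent loop. This is a minimal sketch under stated assumptions: the step size, force tolerance, displacement cap, and the toy two-atom harmonic force function are hypothetical values chosen for illustration, and other minimizers could equally be used.

```python
import math


def minimize(coords, force_fn, force_tol=1e-3, step=0.01, max_disp=0.1, max_iter=10000):
    """Steepest descent with a per-atom displacement cap (in Å)."""
    coords = [tuple(c) for c in coords]
    for iteration in range(max_iter):
        forces = force_fn(coords)                                  # 1a) forces on each atom
        max_force = max(math.sqrt(sum(f * f for f in fv)) for fv in forces)
        if max_force < force_tol:                                  # 2a)/3a) below threshold: done
            return coords, iteration
        new_coords = []
        for c, fv in zip(coords, forces):                          # 3a) otherwise translate atoms
            disp = [step * f for f in fv]
            length = math.sqrt(sum(d * d for d in disp))
            if length > max_disp:                                  # cap the displacement
                disp = [d * max_disp / length for d in disp]
            new_coords.append(tuple(ci + di for ci, di in zip(c, disp)))
        coords = new_coords                                        # 4a) repeat
    return coords, max_iter


def harmonic_bond_forces(coords, k=10.0, r0=1.5):
    """Toy force field (hypothetical): one harmonic bond between two atoms."""
    a, b = coords
    r = math.dist(a, b)
    unit = [(bi - ai) / r for ai, bi in zip(a, b)]
    magnitude = -k * (r - r0)            # negative when stretched: pulls b toward a
    force_b = tuple(magnitude * u for u in unit)
    force_a = tuple(-f for f in force_b)
    return [force_a, force_b]


final, n_iter = minimize([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)], harmonic_bond_forces)
print(math.dist(final[0], final[1]), n_iter)
```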
[0079] In embodiments, the set of ligand binding amino acids includes at least 50 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 6 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 4 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes at least 2 amino acid residues. In embodiments the ligand binding amino acids are apolar. In embodiments the ligand binding amino acids are hydrophilic.
[0080] In embodiments, the set of ligand binding amino acids includes 50 amino acid residues.
In embodiments, the set of ligand binding amino acids includes 40 amino acid residues. In embodiments, the set of ligand binding amino acids includes 30 amino acid residues. In embodiments, the set of ligand binding amino acids includes 20 amino acid residues. In embodiments, the set of ligand binding amino acids includes 12 amino acid residues. In embodiments, the set of ligand binding amino acids includes 10 amino acid residues. In embodiments, the set of ligand binding amino acids includes 8 amino acid residues. In embodiments, the set of ligand binding amino acids includes 6 amino acid residues. In
embodiments, the set of ligand binding amino acids includes 5 amino acid residues. In embodiments, the set of ligand binding amino acids includes 4 amino acid residues. In embodiments, the set of ligand binding amino acids includes 3 amino acid residues. In embodiments, the set of ligand binding amino acids includes 2 amino acid residues. In embodiments the ligand binding amino acids are polar. In embodiments the ligand binding amino acids are hydrophilic.
[0081] In embodiments, the energy minimization calculation includes a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof. In embodiments, the energy minimization calculation includes a penalty function.
[0082] In embodiments, the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.0 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.2 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.4 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 1.6 Å spherical probe. In embodiments, the core amino acids are at least 75% inaccessible to a 2.0 Å spherical probe. In embodiments, the core amino acids are at least 80% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 90% inaccessible to a 1.8 Å spherical probe. In embodiments, the core amino acids are at least 95% inaccessible to a 1.8 Å spherical probe. In embodiments, the set of core amino acids includes at least 50 amino acid residues. In embodiments, the set of core amino acids includes at least 40 amino acid residues. In embodiments, the set of core amino acids includes at least 30 amino acid residues. In embodiments, the set of core amino acids includes at least 20 amino acid residues. In embodiments, the set of core amino acids includes at least 12 amino acid residues. In embodiments, the set of core amino acids includes at least 10 amino acid residues. In embodiments, the set of core amino acids includes at least 8 amino acid residues. In embodiments, the set of core amino acids includes at least 6 amino acid residues. In embodiments the core amino acids are apolar. In embodiments the core amino acids are hydrophobic.
[0083] In embodiments, the set of core amino acids includes 6 amino acids. In embodiments, the set of core amino acids includes 8 amino acids. In embodiments, the set of core amino acids includes 10 amino acids. In embodiments, the set of core amino acids includes 20 amino acids.
In embodiments, the set of core amino acids includes 30 amino acids. In embodiments, the set of
core amino acids includes 40 amino acids. In embodiments, the set of core amino acids includes 35, 36, 37, 38, 39, or 40 amino acids. In embodiments, the set of core amino acids includes 37 amino acids. In embodiments, the core amino acids include the sequence:
LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO: 5). In embodiments, the core amino acids include the sequence: LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA (SEQ ID NO:6).
[0084] In embodiments, the protein is 99% identical to SEQ ID NO:5. In embodiments, the protein is 98% identical to SEQ ID NO:5. In embodiments, the protein is 95% identical to SEQ ID NO:5. In embodiments, the protein is 90% identical to SEQ ID NO:5. In embodiments, the protein is 85% identical to SEQ ID NO:5. In embodiments, the protein is 80% identical to SEQ ID NO:5. In embodiments, the protein is 60% identical to SEQ ID NO:5. In embodiments, the protein is about 99% identical to SEQ ID NO:5. In embodiments, the protein is about 98% identical to SEQ ID NO:5. In embodiments, the protein is about 95% identical to SEQ ID NO:5. In embodiments, the protein is about 90% identical to SEQ ID NO:5. In embodiments, the protein is about 85% identical to SEQ ID NO:5. In embodiments, the protein is about 80% identical to SEQ ID NO:5. In embodiments, the protein is about 60% identical to SEQ ID NO:5.
[0085] In embodiments, the protein is 99% identical to SEQ ID NO:6. In embodiments, the protein is 98% identical to SEQ ID NO:6. In embodiments, the protein is 95% identical to SEQ ID NO:6. In embodiments, the protein is 90% identical to SEQ ID NO:6. In embodiments, the protein is 85% identical to SEQ ID NO:6. In embodiments, the protein is 80% identical to SEQ ID NO:6. In embodiments, the protein is 60% identical to SEQ ID NO:6. In embodiments, the protein is about 99% identical to SEQ ID NO:6. In embodiments, the protein is about 98% identical to SEQ ID NO:6. In embodiments, the protein is about 95% identical to SEQ ID NO:6. In embodiments, the protein is about 90% identical to SEQ ID NO:6. In embodiments, the protein is about 85% identical to SEQ ID NO:6. In embodiments, the protein is about 80% identical to SEQ ID NO:6. In embodiments, the protein is about 60% identical to SEQ ID NO:6.
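As a hedged illustration of a percent identity calculation, the sketch below compares two sequences of equal length position by position. This ungapped comparison is a simplifying assumption for illustration; percent identity is commonly determined from a sequence alignment (which may include gaps), and the disclosure may define it differently elsewhere. The two sequences are SEQ ID NO:5 and SEQ ID NO:6 as recited above; the function name is hypothetical.

```python
SEQ_ID_NO_5 = "LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA"
SEQ_ID_NO_6 = "LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA"


def percent_identity(seq_a, seq_b):
    """Ungapped, position-wise percent identity of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for an ungapped comparison")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


print(percent_identity(SEQ_ID_NO_5, SEQ_ID_NO_6))
```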
[0086] In embodiments, the set of core amino acids includes at least 50% of the total number of amino acid residues in the protein.
[0087] In embodiments, the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion. In embodiments, the ligand is a detectable agent. In embodiments, the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent. In embodiments, the ligand is a therapeutic agent. In embodiments, the ligand is a biological agent. In embodiments, the ligand is a cytotoxic agent (e.g., an anticancer agent). In embodiments, the ligand is a magnetic resonance imaging (MRI) agent. In embodiments, the ligand is a positron emission tomography (PET) agent. In embodiments, the ligand is a radiological imaging agent. In embodiments, the ligand is a diagnostic agent. In embodiments, the ligand is a theranostic agent. In embodiments, the ligand is a photodynamic therapy (PDT) agent. In embodiments, the ligand is a small molecule.
[0088] In embodiments, the ligand is a catalyst. In embodiments, the catalyst catalyzes an abiological or bio-orthogonal reaction. In embodiments, the ligand is a molecule that exists within a living system (e.g., within an organism or a cell). In embodiments, the ligand is (CF3)4PZn. In embodiments, the ligand is (CF3)4PFe. In embodiments, the ligand atomic coordinates are optimized using known methods in the art (e.g., density functional theory using the B3-LYP functional).
[0089] In embodiments, the method further includes synthesizing the protein (e.g., utilizing the expression vectors such as the plasmid method described in the Example, such as cloning into the IPTG-inducible pET-11a plasmid). In embodiments, the method further includes expressing the protein.
[0090] FIG. 14 depicts a flowchart illustrating a process 1400 for designing proteins, in accordance with some example embodiments. Referring to FIG. 14, the process 1400 can be performed in order to design an energetically stabilized protein (e.g., a protein that is structurally and thermodynamically stable as determined by the difference in the Gibbs free energy between the folded and unfolded states of the protein).
[0091] At 1402, a set of ligand binding amino acid residues within a protein for binding to a ligand can be identified. These ligand binding amino acid residues can form the backbone of a protein. Each ligand binding amino acid residue within the protein can be associated with a set of ligand binding amino acid residue atomic coordinates, which can define the ligand binding amino acid residue in space. Furthermore, each atom of the ligand can be associated with a set of ligand atomic coordinates, which can define the ligand in space. As noted herein, these
coordinates can be Cartesian coordinates, internal coordinates, polar coordinates, spherical coordinates, and/or the like.
[0092] At 1404, a set of core amino acid residues within the protein that do not bind to the ligand can be identified. The backbone of the protein can further include core amino acid residues. Each core amino acid residue within the protein can be associated with a set of core amino acid residue atomic coordinates, which define the core amino acid residue in space.
[0093] At 1406, the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can be optimized. For example, the optimization can be performed using an energy minimization calculation including, for example, a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, and/or the like. Optimizing the set of ligand binding amino acid residues, the set of ligand binding amino acid residue atomic coordinates, the set of core amino acid residues, and the set of core amino acid residue atomic coordinates can generate an energetically stabilized protein.
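Operations 1402 and 1404 may be illustrated with the following minimal sketch, which partitions residues into a ligand binding set and a core set. The distance-based contact criterion, the 4.0 Å cutoff, and the `Residue` data structure are assumptions introduced for illustration only; the disclosure does not specify how the sets are identified, and core residues may additionally be characterized by probe inaccessibility as described above.

```python
import math
from dataclasses import dataclass


@dataclass
class Residue:
    index: int   # position in the protein sequence
    name: str    # residue name, e.g. "HIS"
    atoms: list  # atomic coordinates as (x, y, z) tuples in Å


def identify_residue_sets(residues, ligand_atoms, contact_cutoff=4.0):
    """Split residues into (ligand_binding, core) by distance to the ligand.

    A residue is treated as ligand binding if any of its atoms lies within
    `contact_cutoff` Å of any ligand atom (an assumed criterion); all other
    residues are placed in the core set.
    """
    binding, core = [], []
    for res in residues:
        min_dist = min(math.dist(a, l) for a in res.atoms for l in ligand_atoms)
        if min_dist <= contact_cutoff:
            binding.append(res)   # operation 1402
        else:
            core.append(res)      # operation 1404
    return binding, core
```

The two sets and their coordinates would then be passed together to the optimization of operation 1406.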
III. Systems and mediums
[0094] FIG. 15 depicts a block diagram illustrating a computing system 1500 consistent with implementations of the current subject matter. Referring to FIGS. 14-15, the computing system 1500 can be configured to perform the process 1400.
[0095] As shown in FIG. 15, the computing system 1500 can include a processor 1510, a memory 1520, a storage device 1530, and input/output devices 1540. The processor 1510, the memory 1520, the storage device 1530, and the input/output devices 1540 can be interconnected via a system bus 1550. The processor 1510 is capable of processing instructions for execution within the computing system 1500. Such executed instructions can implement one or more components of, for example, the database system 100 and/or the multitenant database system 200. In some example embodiments, the processor 1510 can be a single-threaded processor. Alternately, the processor 1510 can be a multi-threaded processor. The processor 1510 is capable of processing instructions stored in the memory 1520 and/or on the storage device 1530 to display graphical information for a user interface provided via the input/output device 1540.
[0096] The memory 1520 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 1500. The memory 1520 can store data structures representing configuration object databases, for example. The storage device 1530 is capable of providing persistent storage for the computing system 1500. The storage device 1530 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means. The input/output device 1540 provides input/output operations for the computing system 1500. In some example embodiments, the input/output device 1540 includes a keyboard and/or pointing device. In various implementations, the input/output device 1540 includes a display unit for displaying graphical user interfaces.
[0097] According to some example embodiments, the input/output device 1540 can provide input/output operations for a network device. For example, the input/output device 1540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
[0098] In some example embodiments, the computing system 1500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats. Alternatively, the computing system 1500 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing
functionalities, communications functionalities, etc. The applications can include various add-in functionalities (e.g., SAP Integrated Business Planning as an add-in for a spreadsheet and/or other type of program) or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 1540. The user interface can be generated and presented to a user by the computing system 1500 (e.g., on a computer screen monitor, etc.).
[0099] In an aspect is provided a system, including: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy
minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
[0100] In another aspect is provided a non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations including: a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within the protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of the ligand is associated with a set of ligand atomic coordinates; b) identifying a set of core amino acid residues within the protein that do not bind to the ligand, each core amino acid residue within the protein is associated with a set of core amino acid residue atomic coordinates; and c) optimizing the set of ligand binding amino acid residues; the set of ligand binding amino acid residue atomic coordinates; the set of core amino acid residues; and the set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize the protein.
IV. Protein composition
[0101] In an aspect is provided a protein sequence obtainable based on the energy minimization calculation using the method, the system, or the non-transitory computer-readable medium as described herein, including embodiments. In embodiments, the protein sequence is:
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO: 1). In embodiments, the protein sequence is SEQ ID NO: 1. In embodiments, the protein sequence is SEQ ID NO:2. In embodiments, the protein sequence is SEQ ID NO:3. In embodiments, the protein sequence is SEQ ID NO:4. In embodiments, the protein sequence is SEQ ID NO:5. In embodiments, the protein sequence is SEQ ID NO:6. In embodiments, the protein sequence is SEQ ID NO:7.
[0102] In an aspect is provided a protein, or conservatively modified variant thereof, having the sequence:
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO: 1).
[0103] In embodiments, the protein sequence is SEQ ID NO: 1. In embodiments, the protein sequence is SEQ ID NO:2. In embodiments, the protein sequence is SEQ ID NO:3.
[0104] In embodiments, the protein is 99% identical to SEQ ID NO: 1. In embodiments, the protein is 98% identical to SEQ ID NO: 1. In embodiments, the protein is 95% identical to SEQ ID NO: 1. In embodiments, the protein is 90% identical to SEQ ID NO: 1. In embodiments, the protein is 85% identical to SEQ ID NO: 1. In embodiments, the protein is 80% identical to SEQ ID NO: 1. In embodiments, the protein is 60% identical to SEQ ID NO: 1. In embodiments, the protein is about 99% identical to SEQ ID NO: 1. In embodiments, the protein is about 98% identical to SEQ ID NO: 1. In embodiments, the protein is about 95% identical to SEQ ID NO: 1. In embodiments, the protein is about 90% identical to SEQ ID NO: 1. In embodiments, the protein is about 85% identical to SEQ ID NO: 1. In embodiments, the protein is about 80% identical to SEQ ID NO: 1. In embodiments, the protein is about 60% identical to SEQ ID NO: 1.
[0105] In embodiments, the protein is 99% identical to SEQ ID NO:2. In embodiments, the protein is 98% identical to SEQ ID NO:2. In embodiments, the protein is 95% identical to SEQ ID NO:2. In embodiments, the protein is 90% identical to SEQ ID NO:2. In embodiments, the protein is 85% identical to SEQ ID NO:2. In embodiments, the protein is 80% identical to SEQ ID NO:2. In embodiments, the protein is 60% identical to SEQ ID NO:2. In embodiments, the protein is about 99% identical to SEQ ID NO:2. In embodiments, the protein is about 98% identical to SEQ ID NO:2. In embodiments, the protein is about 95% identical to SEQ ID NO:2. In embodiments, the protein is about 90% identical to SEQ ID NO:2. In embodiments, the protein is about 85% identical to SEQ ID NO:2. In embodiments, the protein is about 80% identical to SEQ ID NO:2. In embodiments, the protein is about 60% identical to SEQ ID NO:2.
[0106] In embodiments, the protein is 99% identical to SEQ ID NO:3. In embodiments, the protein is 98% identical to SEQ ID NO:3. In embodiments, the protein is 95% identical to SEQ ID NO:3. In embodiments, the protein is 90% identical to SEQ ID NO:3. In embodiments, the protein is 85% identical to SEQ ID NO:3. In embodiments, the protein is 80% identical to SEQ ID NO:3. In embodiments, the protein is 60% identical to SEQ ID NO:3. In embodiments, the protein is about 99% identical to SEQ ID NO:3. In embodiments, the protein is about 98% identical to SEQ ID NO:3. In embodiments, the protein is about 95% identical to SEQ ID NO:3. In embodiments, the protein is about 90% identical to SEQ ID NO:3. In embodiments, the protein is about 85% identical to SEQ ID NO:3. In embodiments, the protein is about 80% identical to SEQ ID NO:3. In embodiments, the protein is about 60% identical to SEQ ID NO:3.
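As an illustration of the percent-identity thresholds recited above, the following is a minimal sketch only. It assumes an ungapped, position-by-position comparison of two pre-aligned sequences of equal length, with identity taken over the full length; this passage does not specify an alignment procedure, so that definition, and the helper names `percent_identity` and `max_substitutions`, are assumptions introduced for illustration.

```python
# SEQ ID NO:1 as given in the text.
SEQ_ID_NO_1 = (
    "EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEA"
    "AKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN"
)


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles ungapped, equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def max_substitutions(length: int, min_percent: int) -> int:
    """Largest number of substituted positions that still meets min_percent identity."""
    # Integer ceiling of length * min_percent / 100 gives the required matches.
    required_matches = -((-length * min_percent) // 100)
    return length - required_matches


if __name__ == "__main__":
    # Hypothetical single-substitution variant (E1A), used only to exercise the function.
    variant = "A" + SEQ_ID_NO_1[1:]
    print(f"identity to SEQ ID NO:1: {percent_identity(SEQ_ID_NO_1, variant):.2f}%")
    for threshold in (99, 98, 95, 90, 85, 80, 60):
        n = max_substitutions(len(SEQ_ID_NO_1), threshold)
        print(f"{threshold}% identity allows up to {n} substitutions")
```

Under a gapped alignment or a different denominator (for example, the length of the shorter sequence), the permitted number of differences would change accordingly.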
[0107] In embodiments, the protein is further bound to a ligand. In embodiments, the ligand is bound to the protein via a dative covalent bond. In embodiments, the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, which is capable of binding a metal ion. In embodiments, the ligand is a detectable agent. In embodiments, the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent. In embodiments, the ligand is a catalyst. In embodiments, the catalyst catalyzes an abiological or bio-orthogonal reaction. In embodiments, the ligand is a molecule that exists within a living system.
[0108] In embodiments, the protein is 99% identical to SEQ ID NO:8. In embodiments, the protein is 98% identical to SEQ ID NO:8. In embodiments, the protein is 95% identical to SEQ ID NO:8. In embodiments, the protein is 90% identical to SEQ ID NO:8. In embodiments, the protein is 85% identical to SEQ ID NO:8. In embodiments, the protein is 80% identical to SEQ ID NO:8. In embodiments, the protein is 60% identical to SEQ ID NO:8. In embodiments, the protein is about 99% identical to SEQ ID NO:8. In embodiments, the protein is about 98% identical to SEQ ID NO:8. In embodiments, the protein is about 95% identical to SEQ ID NO:8. In embodiments, the protein is about 90% identical to SEQ ID NO:8. In embodiments, the protein is about 85% identical to SEQ ID NO:8. In embodiments, the protein is about 80% identical to SEQ ID NO:8. In embodiments, the protein is about 60% identical to SEQ ID NO:8.
[0109] Informal Sequence Listing:
[0110]
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:1)
[0111]
SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:2)
CATATGCATCACCATCACCATCACGAAAACCTGTATTTTCAGAGCGAATTCGAAAAACTGCGTCAAACCGGCGACGAACTGGTGCAGGCATTTCAACGTCTGCGCGAAATTTTCGATAAAGGTGATGACGATAGTCTGGAACAGGTTCTGGAAGAAATTGAAGAACTGATCCAGAAACATCGTCAACTGTTTGACAATCGCCAGGAAGCGGCCGATACGGAAGCAGCTAAACAGGGCGACCAATGGGTCCAGCTGTTTCAACGTTTCCGCGAAGCCATTGATAAAGGTGACAAAGATAGCCTGGAACAGCTGCTGGAAGAACTGGAACAGGCGCTGCAAAAAATCCGCGAACTGGCCGAAAAGAAAAACTAAGGATCC (SEQ ID NO:3)
[0112]
MHHHHHHENLYFQ/SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:4)
[0113] LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO:5)
[0114] LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA (SEQ ID NO:6)
[0115]
SEFEKLRQTGDEIIQLLQRLREAIDKGDDDSLEQILEELEEAFQKHRQLFE RQEAADTEFAKQGDQWLQLFQRIREAIDKGDKDSLEQLFEESEQGIQKIRELAEKKN (SEQ ID NO:7)
[0116]
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIR (SEQ ID NO:8)
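The relationship between the nucleotide sequence (SEQ ID NO:3) and the tagged protein sequence (SEQ ID NO:4) can be checked with a short translation sketch. The sketch assumes the standard genetic code, assumes translation starts at the first ATG and ends at the first stop codon, and treats the "/" in SEQ ID NO:4 as a cleavage-site marker rather than a residue; these are assumptions made for illustration, and the function name `translate_first_orf` is hypothetical.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases in T, C, A, G order).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# SEQ ID NO:3 with the whitespace from the listing removed.
SEQ_ID_NO_3 = (
    "CATATGCATCACCATCACCATCACGAAAACCTGTATTTTCAGAGCGAATTCGAAAAA"
    "CTGCGTCAAACCGGCGACGAACTGGTGCAGGCATTTCAACGTCTGCGCGAAATTTTC"
    "GATAAAGGTGATGACGATAGTCTGGAACAGGTTCTGGAAGAAATTGAAGAACTGAT"
    "CCAGAAACATCGTCAACTGTTTGACAATCGCCAGGAAGCGGCCGATACGGAAGCAG"
    "CTAAACAGGGCGACCAATGGGTCCAGCTGTTTCAACGTTTCCGCGAAGCCATTGATA"
    "AAGGTGACAAAGATAGCCTGGAACAGCTGCTGGAAGAACTGGAACAGGCGCTGCA"
    "AAAAATCCGCGAACTGGCCGAAAAGAAAAACTAAGGATCC"
)


def translate_first_orf(dna: str) -> str:
    """Translate from the first ATG up to (not including) the first in-frame stop codon."""
    start = dna.find("ATG")
    if start < 0:
        raise ValueError("no start codon found")
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_first_orf(SEQ_ID_NO_3))
```

Under these assumptions the translated product corresponds to SEQ ID NO:4 without the "/" marker, that is, a hexahistidine tag and an ENLYFQ/S site followed by the residues of SEQ ID NO:1.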
V. Embodiments
[0117] Embodiment 1. A computer-implemented method, comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
[0118] Embodiment 2. The method of embodiment 1, wherein step c) comprises simultaneously optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates.
[0119] Embodiment 3. The method of embodiment 1, wherein the energy minimization calculation comprises a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof.
[0120] Embodiment 4. The method of embodiment 1, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
[0121] Embodiment 5. The method of embodiment 1, wherein said set of core amino acids comprises at least six amino acid residues.
[0122] Embodiment 6. The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
[0123] Embodiment 7. The method of any one of embodiments 1 to 5, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
[0124] Embodiment 8. The method of any one of embodiments 1 to 7, wherein the energy minimization calculation comprises a penalty function.
[0125] Embodiment 9. The method of any one of embodiments 1 to 8, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
[0126] Embodiment 10. The method of any one of embodiments 1 to 8, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
[0127] Embodiment 11. The method of embodiment 10, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
[0128] Embodiment 12. The method of any one of embodiments 1 to 11, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
[0129] Embodiment 13. The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
[0130] Embodiment 14. The method of any one of embodiments 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
[0131] Embodiment 15. The method of any one of embodiments 1 to 14, wherein the optimizing comprises an iterative or heuristic algorithm.
[0132] Embodiment 16. The method of any one of embodiments 1 to 14, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
[0133] Embodiment 17. The method of any one of embodiments 1 to 14, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
[0134] Embodiment 18. The method of any one of embodiments 1 to 17, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
[0135] Embodiment 19. The method of any one of embodiments 1 to 17, wherein the ligand is a detectable agent.
[0136] Embodiment 20. The method of any one of embodiments 1 to 17, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
[0137] Embodiment 21. The method of any one of embodiments 1 to 17, wherein the ligand is a catalyst.
[0138] Embodiment 22. The method of any one of embodiments 1 to 17, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
[0139] Embodiment 23. The method of any one of embodiments 1 to 17, wherein the ligand is a molecule that exists within a living system.
[0140] Embodiment 24. A system, comprising: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
[0141] Embodiment 25. The system of embodiment 24, wherein the energy minimization calculation comprises functions from molecular mechanics, functions from structural bioinformatics, amino acid sidechain packing functions, protein radius of gyration functions, or a combination thereof.
[0142] Embodiment 26. The system of embodiment 24, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
[0143] Embodiment 27. The system of embodiment 24, wherein said set of core amino acids comprises at least six amino acid residues.
[0144] Embodiment 28. The system of any one of embodiments 24 to 27, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
[0145] Embodiment 29. The system of any one of embodiments 24 to 28, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
[0146] Embodiment 30. The system of any one of embodiments 24 to 29, wherein the energy minimization calculation comprises a penalty function.
[0147] Embodiment 31. The system of any one of embodiments 24 to 30, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
[0148] Embodiment 32. The system of any one of embodiments 24 to 31, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
[0149] Embodiment 33. The system of embodiment 32, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
[0150] Embodiment 34. The system of any one of embodiments 24 to 33, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
[0151] Embodiment 35. The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
[0152] Embodiment 36. The system of any one of embodiments 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
[0153] Embodiment 37. The system of any one of embodiments 24 to 36, wherein the optimizing comprises an iterative or heuristic algorithm.
[0154] Embodiment 38. The system of any one of embodiments 24 to 36, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
[0155] Embodiment 39. The system of any one of embodiments 24 to 36, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
[0156] Embodiment 40. A non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations comprising: (a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates; (b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing: said set of ligand binding amino acid residues; said set of ligand binding amino acid residue atomic coordinates; said set of core amino acid residues; and said set of core amino acid residue atomic coordinates; wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
[0157] Embodiment 41. A protein sequence obtainable based on the energy minimization calculation using the method of any of embodiments 1 to 23, the system of any of embodiments 24 to 39, or the non-transitory computer-readable medium of embodiment 40.
[0158] Embodiment 42. A protein, or conservatively modified variant thereof, having the sequence SEQ ID NO: 1.
[0159] Embodiment 43. The protein of embodiment 42, wherein the protein is 90% identical to SEQ ID NO: 1.
[0160] Embodiment 44. The protein of embodiment 42, bound to a ligand.
[0161] Embodiment 45. The protein of embodiment 42, wherein the ligand is bound to the protein via a dative covalent bond.
[0162] Embodiment 46. The protein of embodiment 44, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
[0163] Embodiment 47. The protein of embodiment 44, wherein the ligand is a detectable agent.
[0164] Embodiment 48. The protein of embodiment 44, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
[0165] Embodiment 49. The protein of embodiment 44, wherein the ligand is a catalyst.
[0166] Embodiment 50. The protein of embodiment 44, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
[0167] Embodiment 51. The protein of embodiment 44, wherein the ligand is a molecule that exists within a living system.
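Several of the embodiments above recite an energy minimization with a penalty function, fixed ligand coordinates, a cap on the displacement of any atomic coordinate, and a simulated annealing or Monte Carlo search. The following toy sketch shows how those pieces can fit together. It is not the method of the disclosure: the harmonic "energy", the quadratic penalty, the step size, the linear cooling schedule, and all function names are assumptions chosen only to keep the example short and runnable.

```python
import math
import random


def energy(coords, target=1.5, k=1.0):
    """Toy energy: harmonic springs between consecutive points (not a real force field)."""
    return sum(k * (math.dist(a, b) - target) ** 2 for a, b in zip(coords, coords[1:]))


def penalty(coords, start, weight=0.1):
    """Toy penalty function: quadratic restraint toward the starting coordinates."""
    return weight * sum(math.dist(c, s) ** 2 for c, s in zip(coords, start))


def anneal(start, fixed, max_disp=3.0, steps=2000, t0=1.0, seed=0):
    """Simulated annealing with Metropolis acceptance.

    Points whose index is in `fixed` never move, and no point may end up
    more than `max_disp` (in Å) from its starting position.
    """
    rng = random.Random(seed)
    current = [tuple(p) for p in start]
    movable = [i for i in range(len(start)) if i not in fixed]

    def score(c):
        return energy(c) + penalty(c, start)

    e_cur = score(current)
    best, e_best = list(current), e_cur
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-6  # linear cooling schedule
        i = rng.choice(movable)
        trial = list(current)
        trial[i] = tuple(x + rng.uniform(-0.2, 0.2) for x in current[i])
        if math.dist(trial[i], start[i]) > max_disp:
            continue  # reject moves that exceed the displacement cap
        e_new = score(trial)
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / temp):
            current, e_cur = trial, e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best


if __name__ == "__main__":
    # Hypothetical starting coordinates; index 0 stands in for a fixed ligand atom.
    start = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
    best, e_best = anneal(start, fixed={0})
    print(f"initial score {energy(start):.2f}, best score {e_best:.2f}")
```

In a real design calculation the score would come from molecular mechanics or rotamer-packing terms and the moves would include sequence changes as well as coordinate changes; the sketch only illustrates the control flow.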
EXAMPLES
Example 1 - Strategy for designing hyperstable, non-natural protein-cofactor complexes with sub-Å accuracy
[0168] While the de novo design of proteins has seen many successes (1-12), no small molecule ligand- or organic cofactor-binding protein has been designed entirely from first principles to achieve i) a unique structure and ii) a predetermined binding-site geometry with sub-Å accuracy. Such achievements are prerequisites for the design of proteins that control and enable complex reaction trajectories, where the relative placements of cofactors, substrates, and protein side chains must be established within the length scale of a chemical bond. Here, we design a small molecule-binding protein based on the concept that the entire protein contributes to establishing the binding geometry of a ligand (13-16). Mutational studies of natural ligand-binding proteins have highlighted the counter-intuitive importance of distant amino acids (10-20 Å from the binding site) on binding affinity, which work in concert with first-shell amino acids surrounding the bound ligand (13-16). We implement this concept for the first time in de novo protein design. Hence, what are traditionally considered separate sectors (the hydrophobic core and the ligand-binding site) we treat as an inseparable unit. We utilize flexible backbone sequence design of a parametrically defined protein template to simultaneously pack the protein interior both proximal to and remote from the ligand-binding site. Thus, tight interdigitation of core side chains quite removed from the binding site structurally restrains the first- and second-shell packing around the ligand. We apply this principle to the decades-old problem of structural non-uniqueness in de novo-designed heme-binding proteins (17). We designed a novel protein, PS1, which binds a highly electron-deficient, non-natural porphyrin at temperatures up to 100 °C. The high-resolution structure of holo-PS1 is in sub-Å agreement with the design. The structure of apo-PS1 retains the remote core packing of the holo form, predisposing a flexible binding region for the desired ligand-binding geometry. Our results illustrate the unification of core packing and binding site definition as a fundamental principle of ligand-binding protein design.
[0169] Recent successes in the field of de novo design of coiled coils (3, 7) and metalloproteins (4, 8-10) are encouraging, but so far have not translated to more complex cofactors. In fact, attempts at computational design of novel small molecule ligand-binding proteins have been limited in number and generally focused on changing only the binding site of natural proteins, leaving the core of the protein intact (18, 19). For example, the binding site of a natural scaffold was computationally redesigned to bind a hydrophobic organic ligand but required multiple rounds of mutagenesis and experimental selection using yeast display (18). At the other extreme, de novo heme-binding helical bundle proteins have been designed entirely from first principles (17, 20), but these "maquettes" have evaded structural determination, largely due to aggregation or their dynamical properties (17, 21, 22). With the exception of short, covalently linked peptide-heme complexes (23), the only structure of a de novo heme-binding protein was solved for an apo-protein, which showed a hydrophobically collapsed binding site with no space for binding heme (21, 24). The lack of precise, predictive three-dimensional models of heme-binding maquettes, coupled with the failure to determine high-resolution structures, has limited their utility, although maquettes have elucidated electrostatic roles for tuning redox potentials of donors/acceptors in electron-transfer reactions (20). An iterative trial-and-error approach has been shown to incrementally improve NMR spectra of maquette proteins (25), and may ultimately lead to the determination of three-dimensional structures; however, a robust computational method is needed to deliver precisely predetermined structures with sub-Å accuracy.
[0170] Our own work has focused on the development of computational design of cofactor-binding proteins (26-28) with atomic-level accuracy. We used a step-wise strategy in which we first employed a mathematical parameterization of an antiparallel coiled coil to construct a rigid binding site, then, in a separate calculation, introduced side chain packing constrained by this rigid backbone. This approach resulted in de novo porphyrin-binding proteins with the desired tertiary structure and ligand-binding stoichiometry, but not of sufficient conformational uniqueness to yield a high-resolution structure.
[0171] A body of work with natural proteins (13-16) has shown that side chain packing quite distant from the binding site can propagate to significantly affect ligand binding, catalysis, and allosteric regulation. Thus, the entire hydrophobic core, even residues 20 Å away from the binding site, should be considered as an essential extension of the primary and secondary shell interactions with the ligand. We noted that, unlike natural proteins (FIG. 1A), previous de novo designed cofactor-binding proteins lack an extensive, well-defined apolar core. Instead, their interior packing is dominated by interactions with one or more porphyrins or multi-functional cofactors that span the length of the bundle (FIG. 1B). Where a cofactor-free core was included (29), the core was not computationally designed, and high-resolution structures were not determined. Here, we purposefully include a folded core remote from the ligand-binding site and optimize its sequence and structure in concert with the binding site to ensure appropriate coupling (FIG. 1C). As compared to earlier computational design of ligand-binding proteins (11, 18), our approach differs by: 1) beginning with a mathematically parameterized backbone rather than a natural protein; 2) applying flexible backbone design to the entire backbone as well as sequence design to all interior and substrate-binding sites rather than just the first- and second-shell contacting residues; 3) not relying on screening of large numbers of designs or genetic selections to achieve the desired outcome.
[0172] Protein design. The design of PS1 (Porphyrin-binding Sequence 1) began with the previously parameterized backbone from the de novo designed protein SCRPZ-2 (28), a protein that bound an extended porphinato(metal)-polypyridyl(metal) cofactor (FIG. 1B). The backbone of SCRPZ-2 and its di-porphyrin-binding predecessors (26, 30) was designed with a simple equation defining a D2-symmetrical antiparallel coiled coil (31). The parameters were adjusted to position a single His ligand to receive a second-shell hydrogen bond with Thr from a neighboring helix (see FIG. 2). Side chains in the vicinity of the binding site were computationally designed to stabilize the asymmetric ligand environment while maintaining a rigid symmetrical backbone. Interhelical loops were then chosen following previously defined geometric principles (26, 28, 32). Although SCRPZ-2 bound to its desired cofactor, its NMR spectra were not as well dispersed as those for natural heme-containing proteins, and it lacked a cooperatively folded core.
[0173] We used the parameterized backbone of SCRPZ-2 as a starting point for design of a protein that binds a much smaller abiological porphyrin (CF3)4PZn ([5,10,15,20-tetrakis(trifluoromethyl)porphinato]Zn) (FIG. 2), a powerful photo-oxidant with an excited-state reduction potential similar to the ground-state reduction potential of the oxidized special pair of chlorophylls in photosystem II of green plants (34). The reduced size of the (CF3)4PZn cofactor provided space for a hydrophobic core in what was formerly occupied by the large, bulky metal-polypyridyl group. We manually docked (CF3)4PZn in the porphyrin-binding site (FIG. 2) and used Backrub within Rosetta (35) to sample small structural changes of the parameterized backbone; we then employed alternating rounds of fixed backbone sequence design and backbone/sidechain minimization. The models were assessed for packing of the porphyrin as well as the core. To isolate effects of introducing a well-defined hydrophobic core, we allowed sequence changes only in the protein interior and cofactor-binding site, keeping the identities of most solvent-exposed and loop residues fixed from that of SCRPZ-2. The final sequence of PS1 shares no similarity with any known natural protein (BLAST E value < 0.06 against the non-redundant protein sequence database nr). Although the final backbone model of PS1 differed by only 1 Å root mean square deviation (RMSD) from the initial parameterized backbone of SCRPZ-2, fully 70% of the interior residues were changed from SCRPZ-2, and half of those retained were predicted to adopt different rotamers (FIG. 6).
[0174] Biophysical characterization of PS1. PS1 is monomeric (FIGS. 7A-7B) and binds the water-insoluble cofactor, (CF3)4PZn, forming highly thermostable complexes (extrapolated Tm > 120 °C, FIG. 3C and Fig. S3) that are stable for over a year. The complex forms within seconds of adding (CF3)4PZn from organic solution to aqueous PS1, suggesting a small kinetic barrier for assembly (FIG. 3A). A tight dissociation constant of binding, KD = 45 nM, was measured under conditions where the water-insoluble porphyrin was solubilized with 1% w/v octylglucopyranoside detergent (FIG. 3B). PS1 also binds the ferrous iron-derivative of the porphyrin, (CF3)4PFe (FIG. 9), despite the abysmal solubility in water of this cofactor. Loading of PS1 with (CF3)4PFe suggests that the protein could also be used as a platform for engineering ground-state redox chemistry, as (CF3)4PFe is an electron-deficient (porphinato)metal complex capable of molecular oxygen activation for alkane hydroxylation and alkene epoxidation (36).
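To give a sense of what a dissociation constant of 45 nM implies, the following sketch computes the equilibrium concentration of complex for a simple 1:1 binding model. The 1:1 stoichiometry without detergent partitioning, the example concentrations, and the function name `complex_concentration` are assumptions made for illustration; the measurement conditions described above may not satisfy them.

```python
import math

KD_NM = 45.0  # dissociation constant reported in the text, in nM


def complex_concentration(protein_total, ligand_total, kd=KD_NM):
    """Equilibrium [PL] for P + L <-> PL, all concentrations in the same units (nM here).

    Solves [PL]^2 - (Pt + Lt + KD)[PL] + Pt*Lt = 0 for the physical root,
    written in a form that avoids subtractive cancellation.
    """
    b = protein_total + ligand_total + kd
    return 2.0 * protein_total * ligand_total / (
        b + math.sqrt(b * b - 4.0 * protein_total * ligand_total)
    )


if __name__ == "__main__":
    # Hypothetical concentrations chosen only to exercise the formula.
    for pt, lt in ((1000.0, 1000.0), (1000.0, 2000.0), (45.0, 45.0)):
        pl = complex_concentration(pt, lt)
        print(f"Pt={pt:.0f} nM, Lt={lt:.0f} nM -> fraction of protein bound {pl / pt:.2f}")
```

When the protein is present in trace amounts and the free ligand concentration equals KD, the model gives half-occupancy, which is the usual operational meaning of the constant.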
[0175] Time-resolved transient absorption spectroscopy showed that protein/(CF3)4PZn interactions are preserved even at near-boiling temperatures where the protein retains its native structure (FIG. 3D). The excited-state spectra and dynamics of (CF3)4PZn within holo-PS1 at 21 and 100 °C are indistinguishable, which indicates that the protein does not detectably perturb the porphyrin molecular framework; intersystem crossing rates of electronically excited porphyrins are known to be sensitive to temperature and environment (37). Furthermore, these data indicate that encapsulation of (CF3)4PZn in the binding site of PS1 shields the porphyrin from nucleophilic attack that would otherwise occur in water, especially at high temperatures, i.e. the protein safeguards the porphyrin against a wasteful, degradative, photochemical side reaction. Thus, PS1 effectively stabilizes an extraordinarily insoluble cofactor in aqueous solution, even at temperatures considered extreme for hyperthermophiles.
[0176] We also examined another high-scoring sequence (named PS2) of the design process, with a hydrophobic core distinct from that of PS1, which was expressed, purified, and tested for binding to (CF3)4PZn. Electronic absorption spectra of holo-PS2 show narrow absorption bands similar to those evinced by holo-PS1 (FIG. 10), strongly suggesting that these designs analogously enfold the porphyrin in a unique binding environment.
[0177] Structural characterization of holo-PS1. An exceptionally well-resolved NMR structural ensemble of holo-PS1 (FIGS. 4 and 5) was computed using 19 nuclear Overhauser effects (NOEs) per residue and nearly complete NH residual dipolar coupling restraints. The backbone is in excellent agreement with the design (0.8 ± 0.1 Å helical backbone RMSD), and core residues each populate a single rotamer state, almost all in agreement with the design (FIGS. 4A,D,E). While the PS1 design was selected based in part on its featuring an abundance of high-probability rotamer states of core residues, two low-probability rotamers were present in the designed core of PS1: one, Leu98, in the first shell of the binding site, and the other, Leu19, in the remote folded core. Binding of the porphyrin forces Leu98 to adopt this low-probability rotamer, which is not present in the apo-protein (see FIG. 5E), whereas Leu19 adopts a more probable rotamer in both the holo- and apo-proteins. Trp68, fit snugly between two CF3 groups of the porphyrin, can also be seen to adopt its predicted rotamer upon binding of the porphyrin, driving a unique conformation of the cofactor within the binding site.
[0178] The location and orientation of the porphyrin within PS1 was determined by an exceptional number of porphyrin-protein NOEs (26 porphyrin-protein NOEs were used in the structural refinement, FIG. 4). Most importantly, the observed orientation of the cofactor is exactly as designed, within the precision of the NMR structure (FIG. 4). (CF3)4PZn was only displaced in its binding site relative to its predetermined orientation by an average translation (0.4 Å) half the size of a covalent C-H bond, and by a small average rotation (11°) within the porphyrin plane.
[0179] Ab initio folding predictions and NMR structure of apo-PS1. Ab initio folding38 simulations of the apo-PS1 sequence predict a bipartite structure with a conformationally unique folded core, which closely resembles the core of holo-PS1, and a more flexible cofactor-binding region (FIG. 2). Significantly, hydrophobic collapse in the binding region is avoided, because it contains a polar His and also is rich in small Ala and Gly side chains (FIG. 4) to specifically associate with the face of the porphyrin ring, rather than the large hydrophobic residues used to stabilize hemes in maquettes. Thus, "negative design" in PS1 is implicitly achieved through the construction of a relatively polar cofactor-binding site, which creates a cofactor-shaped void in the apo-protein.
[0180] The NMR structure of apo-PS1 was also solved (FIG. 5), and the structural ensemble shows a folded core highly similar to that of holo-PS1. This finding indicates that the folded core both predisposes and anchors the flexible binding region for productive binding of the ligand. The binding region is more dynamic in apo-PS1, which contains two clusters of structures, open and closed. The open conformation likely facilitates binding of the large cofactor, but there is room for water to penetrate into the unoccupied binding site in both conformations.
[0181] Dynamics and structural comparisons of apo- vs holo-PS1. Solvent hydrogen-deuterium exchange (HDX) experiments and molecular dynamics simulations of apo-PS1 also show a gradient in conformational stability between the apolar core and the binding site of apo-PS1 (FIG. 5C, FIGS. 12 and 13). The backbone surrounding the apolar core of both holo- and apo-PS1 is highly protected from exchange, an important characteristic of cooperatively folded native proteins. The protected region extends into the porphyrin-binding site in the holo-protein but not in the apo-structure (FIG. 5C). The increased protection in the binding site of holo-PS1 is seen at both solvent-exposed and interior positions, indicating increased conformational stability rather than steric restriction from the bound cofactor alone.
[0182] In both the apo- and the holo-structures, the interior side chains stack into four layers, beginning at the edge of the porphyrin-binding site and extending to the end of the bundle (FIGS. 5D-5F). In the absence of cofactor to constrain and stabilize the tightly packed conformation of the holo-protein, the layers closest to the binding site explore more conformations, accessing rotamers not seen in holo-PS1 (FIG. 5E). By contrast, the packing of the more distal layers is identical in the apo- and holo-structures (FIG. 5F). Thus, the third- and fourth-shell layers, located up to 20 Å away from the binding site, are precisely pre-organized to stabilize the conformation of the first-shell side chains when PS1 enfolds its cofactor. This finding is consistent with numerous studies on natural proteins13-16, which show that variation of residues
involved in core packing distant from an active site can have profound influences on binding and catalysis.
[0183] The vast improvement in conformational specificity between PS1 and earlier designs illuminates the importance of considering hydrophobic core packing and the construction of ligand-binding sites as a joint optimization problem during computational design. Our previous studies indicate that the use of rigid backbones optimized for ligand-protein interactions alone is insufficient for conformational uniqueness without explicitly considering and designing a backbone that can also accommodate a well-defined apolar core. Similarly, attempts to radically change specificity of natural proteins by varying their binding sites, while treating the surrounding protein matrix as a rigid unit of fixed sequence, have required subsequent
experimental optimization via extensive rounds of random mutagenesis and selection18,19,39. The reliance on experimental methods such as directed evolution and genetic selections, while currently useful in many practical applications19, speaks to our incomplete understanding of protein structure and function, and the need to test and refine this knowledge through design. It is noteworthy that the first sequence designed and tested via our approach succeeded without need for experimental screening. Furthermore, another high-scoring protein design also bound the cofactor, suggesting a possible generality of the method within the helical bundle protein family. These studies bring chemists closer to the ultimate goal of the computational design of fully functional proteins with properties unprecedented in nature.
[0184] PS1 design process. Full methods and scripts regarding the design of PS1 can be found in Example 2. Briefly, the entire core of the D2-symmetrical parameterized backbone of SCRPZ-2 was redesigned to bind (CF3)4PZn via a customized Rosetta script for flexible backbone sequence design. The flexible backbone design protocol was as follows: Distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see Example 2) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score > 0.58) were output. 170 designs were output from 500 runs through the protocol (FIG. 6). We analyzed these 170 models for packing, radius of gyration, energy, and rotamer state probability within Matlab to select PS1 for expression. The design of PS2 proceeded in the same fashion.
[0185] Protein expression, purification, and biophysical characterization. Details regarding protein expression, purification, and biophysical characterization can be found in the supplement. Briefly, genes for the proteins were ordered from GenScript, cloned into a pET-11a
plasmid, and purified via a Ni column, followed by His-tag cleavage by TEV protease. The protein sequence of expressed, purified PS1 after His-tag cleavage is:
SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:2). The sequence for PS2 can be found in Example 2.
[0186] Porphyrin binding to apo-protein. A 2-fold excess of the cofactor (CF3)4PZn was added from a 4 mM DMSO stock solution to a 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer with apo-protein (note that final DMSO concentrations were kept < 1%). A buffer solution of apo-PS1 protein was heated for 5 minutes at 50 °C, (CF3)4PZn was then added from the DMSO stock solution, and the resultant mixture was vortexed for 5 seconds and placed back in the heat block at 50 °C for 15 minutes, with vortexing every 3 minutes. The protein/cofactor solution was then spun at 14000 x g in an Amicon Ultra-0.5 mL centrifuge filter for 10 min, three times, replacing the buffer to 0.5 mL after each 10 min spin. Finally, the protein solution was spun for 4 min at 12000 x g in an Amicon Ultrafree-MC GV filter (UFC30GV0S). The holo-PS1 sample was then used for spectroscopic experiments immediately afterward, and diluted to an appropriate concentration if necessary. Binding of (CF3)4PFe was carried out in the same fashion, with the exception that the porphyrin was first dissolved in a stock of DMSO/CHCl3.
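The DMSO limit stated above can be checked with simple dilution arithmetic. The sketch below is illustrative only: it assumes that "2-fold excess" means two molar equivalents of cofactor per protein and that the DMSO volume fraction equals the ratio of final to stock cofactor concentration; the function name is hypothetical.

```python
def dmso_volume_fraction(protein_uM, excess=2.0, stock_mM=4.0):
    """Volume fraction of DMSO after adding cofactor from a DMSO stock.

    Assumption: cofactor is added at `excess` molar equivalents of protein,
    so V_stock / V_final = C_final / C_stock.
    """
    cofactor_uM = excess * protein_uM
    return cofactor_uM / (stock_mM * 1000.0)


# Hypothetical protein concentrations, in micromolar.
for conc in (5.0, 10.0, 20.0):
    print(conc, dmso_volume_fraction(conc))
```

Under these assumptions, protein concentrations below about 20 μM keep the DMSO content under 1% by volume.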
[0187] Nuclear magnetic resonance spectroscopy. NMR spectra were recorded at 298 K on a 900 MHz Bruker Avance II spectrometer equipped with cryogenic probe for the holo-protein or on a Bruker 600 MHz spectrometer equipped with cryogenic probe for the apo-protein.
Sequence-specific backbone (1HN, 15N, 13Cα, 13CO) and 13Cβ resonance assignments were obtained by using 3D HNCACB / CBCA(CO)NH and 3D HNCO / CO(CA)NH along with the program AUTOASSIGN41. 1Hα and 1Hβ assignments were extended by a 3D HAHB(CO)NH experiment, and more peripheral side chain chemical shifts were assigned with aliphatic 3D CCH-TOCSY (mixing time: 75 ms) and simultaneous 3D 15N/13Caliphatic/13Caromatic-resolved [1H,1H]-NOESY (mixing time: 120 ms). Overall assignments were obtained for 98.1% and 95.9% of the backbone (excluding the N-terminal NH3+) and 13Cβ, and for 97% and 94.6% of the side chain chemical shifts (excluding Lys NH3+, Arg NH2, OH, side chain 13CO and aromatic 13Cγ) for the holo- and apo-proteins, respectively. All spectra were processed and analyzed with the programs NMRPIPE and XEASY, respectively42,43. 1H-1H upper distance limit constraints for structure calculations were extracted from NOESY. In addition, backbone dihedral angle constraints were derived from chemical shifts using the program TALOS for residues located in well-defined secondary structure elements44. 2D constant-time [13C,1H]-HSQC spectra were recorded as was described for the 5% fractionally 13C-labeled samples to obtain stereo-specific assignments for isopropyl groups of Val and Leu45. The 15N-1H residual dipolar couplings (RDCs) were measured with 2D 1H-15N IPAP-HSQC in samples aligned using Pf1 phage (ASLA Biotech). The program CYANA was used to assign long-range NOEs and calculate the structure46,47. Backbone 15N-1H RDCs were used as orientational constraints for the later stages of refinement with XPLOR-NIH48. The final set of structures was further refined by restrained molecular dynamics in explicit water48. NMR structure quality was assessed with the Protein Structure Validation Software Suite (PSVS)49 (Table S4).
[0188] Hydrogen-deuterium exchange measurements. For the measurements of H/D exchange rates, a series of 2D 15N HSQC spectra were obtained on a 900 MHz Bruker Avance II spectrometer. The first spectra were recorded 9 minutes after the dilution of 100 μL of a high concentration sample in H2O (2 mM for apo and 1.2 mM for holo) into 200 μL D2O buffer. 15-min HSQC spectra were recorded successively in the first 12 hours, a 15-min spectrum in every hour in the second 12 hours, a 15-min spectrum in every two hours in the third 12 hours, and so on. The last points were 2730.6 and 4903.5 min for apo and holo, respectively. For the H/D exchange rate analysis, the peak height of each isolated peak was extracted by nmrDraw and fitted to one-phase exponential decay.
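As an illustration of the rate analysis described above, a one-phase exponential decay I(t) = I0·exp(-k·t) can be fit to a series of peak heights. The sketch below is not the analysis actually used; it assumes a log-linear least-squares fit with no plateau term (a simplification of a nonlinear fit), and the function name and synthetic data are hypothetical.

```python
import math


def fit_one_phase_decay(times_min, heights):
    """Fit I(t) = I0 * exp(-k * t) by linear least squares on ln(I).

    Assumptions: all peak heights are positive and the signal decays to
    zero (no plateau), so ln(I) is linear in t.
    Returns (I0, k) with k in 1/min.
    """
    if len(times_min) != len(heights) or len(times_min) < 2:
        raise ValueError("need at least two (time, height) pairs")
    log_heights = [math.log(h) for h in heights]
    n = len(times_min)
    mean_t = sum(times_min) / n
    mean_y = sum(log_heights) / n
    sxx = sum((t - mean_t) ** 2 for t in times_min)
    sxy = sum((t - mean_t) * (y - mean_y)
              for t, y in zip(times_min, log_heights))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_t), -slope


# Synthetic, noise-free peak heights for a hypothetical amide proton.
times = [9.0, 24.0, 39.0, 54.0, 69.0]
heights = [100.0 * math.exp(-0.02 * t) for t in times]
i0, k = fit_one_phase_decay(times, heights)
print(i0, k)
```

With noisy data, a weighted or nonlinear fit would usually be preferable, since taking logarithms distorts the error distribution at low peak heights.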
[0189] Coordinates and data files have been deposited to the Protein Data Bank with accession codes 5TGW (apo-PS1) and 5TGY (holo-PS1) and to the BMRB (chemical shifts) with codes 30185 (apo-PS1) and 30186 (holo-PS1).
[0190] References cited in Example 1.
1. Roy, S. et al. A protein designed by binary patterning of polar and nonpolar amino acids displays native-like properties. J. Am. Chem. Soc. 119, 5302-5306 (1997).
2. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364-1368 (2003).
3. Nanda, V. & Koder, R. L. Designing artificial enzymes by intuition and computation. Nat. Chem. 2, 15-24 (2010).
4. Peacock, A. F. A. Incorporating metals into de novo proteins. Curr. Opin. Chem. Biol. 17, 934-939 (2013).
5. Huang, P.-S. et al. High thermodynamic stability of parametrically designed helical bundles. Science 346, 481 (2014).
6. Thomson, A. R. et al. Computational design of water-soluble α-helical barrels. Science 346, 485 (2014).
7. Woolfson, D. N. et al. De novo protein design: How do we expand into the universe of possible protein structures? Curr. Opin. Struct. Biol. 33, 16-26 (2015).
8. Mocny, C. S. & Pecoraro, V. L. De novo protein design as a methodology for synthetic bioinorganic chemistry. Acc. Chem. Res. 48, 2388-2396 (2015).
9. Ulas, G., Lemmin, T., Wu, Y., Gassner, G. T. & DeGrado, W. F. Designed metalloprotein stabilizes a semiquinone radical. Nat. Chem. 8, 354-359 (2016).
10. Olson, T. L. et al. Design of dinuclear manganese cofactors for bacterial reaction centers. Biochim. Biophys. Acta: Bioenergetics 1857, 539-547 (2016).
11. Huang, P.-S., Boyken, S. E. & Baker, D. The coming of age of de novo protein design. Nature 537, 320-327 (2016).
12. Brunette, T. J. et al. Exploring the repeat protein universe through computational protein design. Nature 528, 580-584 (2015).
13. Bollen, Y. J. M., Westphal, A. H., Lindhoud, S., van Berkel, W. J. H. & van Mierlo, C. P. M. Distant residues mediate picomolar binding affinity of a protein cofactor. Nat. Comm. 3, 1010 (2012).
14. Sela-Culang, I., Kunik, V. & Ofran, Y. The structural basis of antibody-antigen recognition. Front. Immunol. 4 (2013).
15. van den Bedem, H., Bhabha, G., Yang, K., Wright, P. E. & Fraser, J. S. Automated identification of functional dynamic contact networks from X-ray crystallography. Nat. Methods 10, 896-902 (2013).
16. Koulechova, D. A., Tripp, K. W., Horner, G. & Marqusee, S. When the scaffold cannot be ignored: The role of the hydrophobic core in ligand binding and specificity. J. Mol. Biol. 427, 3316-3326 (2015).
17. Reedy, C. J. & Gibney, B. R. Heme protein assemblies. Chem. Rev. 104, 617-650 (2004).
18. Tinberg, C. E. et al. Computational design of ligand-binding proteins with high affinity and selectivity. Nature 501, 212-216 (2013).
19. Prier, C. K. & Arnold, F. H. Chemomimetic biocatalysis: Exploiting the synthetic potential of cofactor-dependent enzymes to create new catalysts. J. Am. Chem. Soc. 137, 13992-14006 (2015).
20. Farid, T. A. et al. Elementary tetrahelical protein design for diverse oxidoreductase functions. Nat. Chem. Biol. 9, 826-833 (2013).
21. Skalicky, J. J. et al. Solution structure of a designed four-α-helix bundle maquette scaffold. J. Am. Chem. Soc. 121, 4941-4951 (1999).
22. Huang, S. S., Koder, R. L., Lewis, M., Wand, A. J. & Dutton, P. L. The HP-1 maquette: From an apoprotein structure to a structured hemoprotein designed to promote redox-coupled proton exchange. Proc. Natl. Acad. Sci. USA 101, 5536-5541 (2004).
23. Lombardi, A., Nastri, F. & Pavone, V. Peptide-based heme-protein models. Chem. Rev. 101, 3165-3190 (2001).
24. Huang, S. S., Gibney, B. R., Stayrook, S. E., Leslie Dutton, P. & Lewis, M. X-ray structure of a maquette scaffold. J. Mol. Biol. 326, 1219-1225 (2003).
25. Gibney, B. R., Rabanal, F., Skalicky, J. J., Wand, A. J. & Dutton, P. L. Iterative protein redesign. J. Am. Chem. Soc. 121, 4952-4960 (1999).
26. Bender, G. M. et al. De novo design of a single-chain diphenylporphyrin metalloprotein. J. Am. Chem. Soc. 129, 10732-10740 (2007).
27. Fry, H. C., Lehmann, A., Saven, J. G., DeGrado, W. F. & Therien, M. J. Computational design and elaboration of a de novo heterotetrameric alpha-helical protein that selectively binds an emissive abiological (porphinato)zinc chromophore. J. Am. Chem. Soc. 132, 3997-4005 (2010).
28. Fry, H. C. et al. Computational de novo design and characterization of a protein that selectively binds a highly hyperpolarizable abiological chromophore. J. Am. Chem. Soc. 135, 13914-13926 (2013).
29. Solomon, L. A., Kodali, G., Moser, C. C. & Dutton, P. L. Engineering the assembly of heme cofactors in man-made proteins. J. Am. Chem. Soc. 136, 3192-3199 (2014).
30. Ghirlanda, G. et al. De novo design of a D2-symmetrical protein that reproduces the diheme four-helix bundle in cytochrome bc1. J. Am. Chem. Soc. 126, 8141-8147 (2004).
31. North, B., Summa, C. M., Ghirlanda, G. & DeGrado, W. F. Dn-symmetrical tertiary templates for the design of tubular proteins. J. Mol. Biol. 311, 1081-1090 (2001).
32. Lahr, S. J. et al. Analysis and design of turns in α-helical hairpins. J. Mol. Biol. 346, 1441-1454 (2005).
33. Goll, J. G., Moore, K. T., Ghosh, A. & Therien, M. J. Synthesis, structure, electronic spectroscopy, photophysics, electrochemistry, and x-ray photoelectron spectroscopy of highly electron-deficient [5,10,15,20-tetrakis(perfluoroalkyl)porphinato]zinc(II) complexes and their free base derivatives. J. Am. Chem. Soc. 118, 8344-8354 (1996).
34. Lubitz, W., Lendzian, F. & Bittl, R. Radicals, radical pairs and triplet states in photosynthesis. Acc. Chem. Res. 35, 313-320 (2002).
35. Kaufmann, K. W., Lemmon, G. H., DeLuca, S. L., Sheehan, J. H. & Meiler, J. Practically useful: What the Rosetta protein modeling suite can do for you. Biochemistry 49, 2987-2998 (2010).
36. Moore, K. T., Horvath, I. T. & Therien, M. J. Mechanistic studies of (porphinato)iron-catalyzed isobutane oxidation. Comparative studies of three classes of electron-deficient porphyrin catalysts. Inorg. Chem. 39, 3125-3139 (2000).
37. Gentemann, S. et al. Variations and temperature dependence of the excited state properties of conformationally and electronically perturbed zinc and free base porphyrins. J. Phys. Chem. B 101, 1247-1254 (1997).
38. Bradley, P., Misura, K. M. S. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868 (2005).
39. Tinberg, C. E. & Khare, S. D. in Computational Design of Ligand Binding Proteins (ed Barry L. Stoddard) 155-171 (Springer New York, 2016).
40. Choma, C. T. et al. Design of a heme-binding four-helix bundle. J. Am. Chem. Soc. 116, 856-865 (1994).
41. Zimmerman, D. E. et al. Automated analysis of protein NMR assignments using methods from artificial intelligence. J. Mol. Biol. 269, 592-610 (1997).
42. Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277-293 (1995).
43. Bartels, C., Xia, T.-h., Billeter, M., Güntert, P. & Wüthrich, K. The program XEASY for computer-supported NMR spectral analysis of biological macromolecules. J. Biomol. NMR 6, 1-10 (1995).
44. Cornilescu, G., Delaglio, F. & Bax, A. Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR 13, 289-302 (1999).
45. Neri, D., Szyperski, T., Otting, G., Senn, H. & Wüthrich, K. Stereospecific nuclear magnetic resonance assignments of the methyl groups of valine and leucine in the DNA-binding domain of the 434 repressor by biosynthetically directed fractional carbon-13 labeling. Biochemistry 28, 7510-7516 (1989).
46. Güntert, P., Mumenthaler, C. & Wüthrich, K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J. Mol. Biol. 273, 283-298 (1997).
47. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209-227 (2002).
48. Schwieters, C. D., Kuszewski, J. J., Tjandra, N. & Marius Clore, G. The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 160, 65-73 (2003).
49. Bagaria, A., Jaravine, V., Huang, Y. J., Montelione, G. T. & Güntert, P. Protein structure validation by generalized linear model root-mean-square deviation prediction. Protein Sci. 21, 229-238 (2012).
Example 2 - Computational and synthesis methods
[0191] PS1 design process. The design of PS1 began with a D2-symmetrical parameterized backbone of a 4-helix bundle (Tables S1 and S2)1. We have previously used this backbone parameterization to create a diheme-binding tetrameric 4-helix bundle, PATET, which was composed of 4 copies of a 25 residue helix containing the requisite metal-coordinating His and second shell H-bonding Thr residues placed at d and b positions in a heptad repeat, respectively2. This tetramer bound two hemes with a bis-His ligation in a D2-symmetrical bundle. Asymmetry of the sequence was later introduced in a single chain diporphyrin-binding design, PAsc (FIG. 1B), where loops were selected to connect the helices via a structural bioinformatics approach3,4. The attachment of these loops cemented the Crick parameters of the helical backbone, which was later employed in another single-chain protein design, SCRPZ-2, that bound an extended cofactor throughout the interior of the bundle (FIG. 1B)5. The design of PS1 utilized the His and Thr positioning of one porphyrin-binding region from these previous designs, with the remainder of the protein then designated as a cofactor-free folded core. Because SCRPZ-2 was soluble and expressed well in E. coli, we elected to retain its exterior-facing amino acids and loops within the PS1 design, while computationally designing the entire core (binding region and folded core simultaneously). In doing so, we also isolate effects on cofactor binding due solely to the creation of a folded core that uniquely predisposes the binding region for cofactor association, which is simultaneously optimized for sequence and side chain packing along with the binding region of the (CF3)4PZn porphyrin. A flexible backbone sequence design protocol was developed (see below) to fine-tune the parameterized backbone to (CF3)4PZn and to achieve optimal side chain packing for creation of the folded core and positioning of (CF3)4PZn within the binding region.
[0192] Flexible backbone sequence design. We wrote a RosettaScript for flexible backbone sequence design, implemented in Rosetta 3.5, that proceeds through a cycle of backbone/side-chain relaxations and fixed backbone design, with a filtering step based on core packing (RosettaScript provided below). Details of the process are provided in the subsections herein.
[0193] Amino acids allowed to vary in the design. Because (CF3)4PZn could potentially act as a photo-oxidant, we disallowed any potentially oxidizable amino acids in the sequence (e.g., Tyr, Cys, Met, Trp, His) other than the single His and Trp residues described below. The initial residue identities of the bundle were chosen from a previous computationally designed 4-helix bundle SCRPZ-2 5, with a few changes, e.g., surface-exposed Tyr residues of the SCRPZ-2 sequence were constrained to be polar or charged during the computational sequence design in Rosetta. The entire core (40 residues in total) of SCRPZ-2 was allowed to vary during the design process, except for His46 and Thr9, which are keystone interactions dedicated to Zn coordination of the porphyrin used in previous designs (see FIG. 2). (63% of the SCRPZ-2 sequence consists of exterior residues and loops, and these were held fixed during the design of PS1.) Ultimately, of the 40 residues that could vary (out of 108), 28 residues were changed and 12 were retained from the SCRPZ-2 sequence, such that 70% of the core was computationally mutated to establish a preferred orientation of the porphyrin cofactor, as well as an interdigitated folded core. This percentage of retained residues can be rationalized based on the expected results of choosing large space-filling amino acids (Phe, Leu, Ile, Val) at random, such that a residue that is Leu in the sequence has a 25% chance of retaining its identity as Leu. Table S3 and FIG. 6B, as well as the residue file (resfile.txt) below, show precisely which residues were allowed to vary during the design process. Below and in the main text, we use residue numbering that is consistent with the expressed holo-protein, which contains an N-terminal Ser residue not present in the design, a remnant from a TEV protease cleavage site (see below).
[0194] Selection of residue 68 as Trp.
A motivation for this work is to position aromatic side chains precisely relative to a photo-excitable cofactor to initiate proton-coupled electron transfer. We asked whether a Trp residue could be held in precise juxtaposition relative to the (CF3)4PZn cofactor, as a prelude to future studies in which "proton wires" are introduced to facilitate proton transfer concomitant with electron transfer from Trp to the photoexcited state of (CF3)4PZn. A Trp residue in the protein interior also serves as an absorption handle, as well as a fluorescent indicator of hydrophobic packing.
[0195] To select the sequence position of the single Trp residue, we used the Rosetta Backrub program 6,7 to create an ensemble of backbones that were relaxed around the (CF3)4PZn cofactor,
after the cofactor was docked in the porphyrin binding region of the SCRPZ-2 model, with an orientation described by CF3 groups pointing down the long axis of the bundle. No sequence design was performed to generate this backbone ensemble. Next, we performed fixed backbone sequence design on each member of the backbone ensemble, allowing Trp at all core residues, to determine a probable location of Trp within the protein interior, based on the frequency of occurrence within the designed sequences. Based on this information, we constrained residue 68 to be Trp during the flexible backbone design process below.
[0196] Flexible backbone design protocol. Flexible backbone design utilized angle and distance constraints between the Zn and His to restrict the design space to geometries consistent with the DFT-optimized imidazole-Zn distance of 2.0 Å. We used an energy term (hack_aro = 1) that models quadrupolar interactions between aromatic side chains in every stage of the flexible backbone design protocol. We also employed an energy term (rg = 2) that penalizes bundles with a large radius of gyration (rg). We noticed a propensity within Rosetta to output bundles that received good packing scores (via PackStat or Rosetta Holes) but displayed helices separated by large distances (large rg). The packing algorithms could not differentiate between interior or exterior when the helix-helix interfaces were very wide, and often inappropriately gave good packing scores when the designed bundle was qualitatively poorly packed. The inclusion of the rg term, as well as employing Rosetta Backrub, ameliorated this issue.
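For orientation, the quantity that the rg term penalizes can be sketched as follows. This is a generic, unweighted radius of gyration over a set of coordinates (for example, Cα atoms); it is an assumption that this matches the atom selection and weighting used inside Rosetta, and the function name is hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points.

    Assumption: every atom has equal weight; Rosetta's rg score term may
    select atoms or weight them differently.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("no coordinates supplied")
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] in (None,) or p[2] for p in coords) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                  for x, y, z in coords) / n
    return math.sqrt(mean_sq)


# Two hypothetical points 2 Å apart have a radius of gyration of 1 Å.
print(radius_of_gyration([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]))
```

A bundle whose helices splay apart has atoms farther from the centroid and therefore a larger value, which is why penalizing it discourages loosely associated helices.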
[0197] The flexible backbone design protocol was as follows: Distance and angle constraints between His and Zn were loaded, the model was repacked without mutations, the backbone was relaxed via Rosetta Backrub, three trials of a Monte Carlo flexible backbone design sub-protocol (see below) were performed, and models with native protein-like packing (i.e., a Rosetta PackStat score > 0.58) were output. The PackStat score was calculated 3 times per trial to account for its stochastic behavior. 170 designs were output from 500 runs through the protocol (Fig. S1). We analyzed these 170 models for packing, rg, energy, and rotamer state probability within Matlab to select PS1 for expression.
[0198] Flexible backbone design sub-protocol. The flexible backbone design sub-protocol consists of 3 Monte Carlo trials of (i) fixed backbone design with soft weights (decreased vdW interactions, i.e., soft_rep_design weights within Rosetta), (ii) sidechain minimization via MinMover, (iii) fixed backbone design with Score13 weights, where the electrostatic term (fa_pair) is replaced by hack_elec (hack_elec = 0.55), and the addition of extra rotamer sampling around χ1 (ex1, level 3, i.e., sampled between 2 std of the mean chi angle value for each rotamer) and χ2 (ex2, level 3) sidechain dihedrals, (iv) backbone minimization via MinMover, (v) repetition of step iii (due to propensities of Rosetta to design a particular sequence to a particular backbone). At the end of step (v), the model is filtered for native structure-like packing via PackStat (if 1 of 3 trials of the PackStat score is > 0.58, the model passes the filter). In all energy functions for flexible backbone design, hack_aro is set to 1 and rg is set to 2. The final, designed sequence (PS1) selected for protein expression was the following 108 amino acids:
EFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:1)
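The packing filter applied at the end of the sub-protocol (a model passes if at least one of three PackStat evaluations exceeds 0.58) can be sketched as below. This is an illustration of the decision rule only, not the RosettaScript itself; the function names and the example scores are hypothetical.

```python
def passes_packstat_filter(scores, threshold=0.58):
    """Return True if any PackStat evaluation is strictly above threshold.

    `scores` holds the repeated (stochastic) PackStat evaluations of one
    model, three per trial in the protocol described in the text.
    """
    return any(score > threshold for score in scores)


def acceptance_rate(n_output, n_runs):
    """Fraction of protocol runs that produced an output model."""
    return n_output / n_runs


# Hypothetical PackStat evaluations for two models.
print(passes_packstat_filter([0.55, 0.59, 0.50]))  # one evaluation above 0.58
print(passes_packstat_filter([0.58, 0.57, 0.50]))  # none strictly above 0.58
print(acceptance_rate(170, 500))                   # counts reported in the text
```

Taking the best of several evaluations makes the filter tolerant of PackStat's run-to-run variation, at the cost of admitting some borderline models.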
[0199] Ab initio folding. Rosetta ab initio folding8 was performed on the PS1 sequence in Rosetta 3.5. Cα RMSD of the folded core was scored against residues 14-23, 32-42, 69-79, and 87-97 of the design model. Cα RMSD of the binding region was scored against residues 5-13, 43-50, 61-68, and 98-105 of the design model.
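The region-specific scoring above can be sketched as a Cα RMSD restricted to the listed residue ranges. The residue ranges are taken from the text; everything else is an assumption for illustration: the dictionary-based coordinate format, the function names, and the premise that the two structures have already been superimposed (the text does not state how superposition was performed).

```python
import math

# Residue ranges (inclusive) from the text.
FOLDED_CORE = [(14, 23), (32, 42), (69, 79), (87, 97)]
BINDING_REGION = [(5, 13), (43, 50), (61, 68), (98, 105)]


def residues_in(ranges):
    """Expand inclusive (start, end) ranges into a list of residue numbers."""
    return [i for start, end in ranges for i in range(start, end + 1)]


def ca_rmsd(model, reference, ranges):
    """Cα RMSD over the selected residues.

    `model` and `reference` map residue number -> (x, y, z) of the Cα atom.
    Assumption: coordinates are already in a common frame.
    """
    ids = residues_in(ranges)
    total = 0.0
    for i in ids:
        total += sum((a - b) ** 2 for a, b in zip(model[i], reference[i]))
    return math.sqrt(total / len(ids))


# Hypothetical coordinates: the model is the reference shifted by 1 Å in y.
reference = {i: (float(i), 0.0, 0.0) for i in range(1, 110)}
model = {i: (x, y + 1.0, z) for i, (x, y, z) in reference.items()}
print(ca_rmsd(model, reference, FOLDED_CORE))
print(ca_rmsd(model, reference, BINDING_REGION))
```

Scoring the two regions separately is what allows the folded core and the more flexible binding region to be compared independently across folding decoys.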
[0200] Porphyrin binding titration to determine KD. 2 μM of (CF3)4PZn was solubilized in a 1 mL solution of 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer by inclusion of 1% w/v octyl-β-D-glucopyranoside. 2 μL of a 102 μM stock of apo-PS1 (0.2 μM aliquots) was titrated into the 1 mL solution containing the porphyrin, and an electronic absorption spectrum was measured until > 2.5 equivalents of protein were added. Absorbance changes at 423 nm, due to His-Zn coordination-induced spectral shifts of the porphyrin, were fit to a single-site, protein-ligand binding model.
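A single-site binding model of the kind referred to above can be sketched as follows. The text does not give the exact equation used, so the quadratic (ligand-depletion) form is an assumption, as are the function names and the dissociation constant in the example, which is a placeholder and not a measured value. The stock concentration, aliquot volume, starting volume, and porphyrin concentration are taken from the text.

```python
import math


def complex_concentration(p_total, l_total, kd):
    """Concentration of the 1:1 complex from the quadratic binding equation.

    All quantities share one concentration unit (micromolar here).
    """
    s = p_total + l_total + kd
    return (s - math.sqrt(s * s - 4.0 * p_total * l_total)) / 2.0


def titration_point(n_aliquots, stock_uM=102.0, aliquot_uL=2.0,
                    start_volume_uL=1000.0, porphyrin_uM=2.0):
    """Total protein and porphyrin concentrations after n aliquots,
    accounting for dilution by the added volume."""
    volume = start_volume_uL + n_aliquots * aliquot_uL
    protein = stock_uM * n_aliquots * aliquot_uL / volume
    porphyrin = porphyrin_uM * start_volume_uL / volume
    return protein, porphyrin


HYPOTHETICAL_KD_UM = 0.05  # placeholder value for illustration only

for n in (1, 5, 10, 25):
    protein, porphyrin = titration_point(n)
    bound = complex_concentration(protein, porphyrin, HYPOTHETICAL_KD_UM)
    print(n, round(protein, 4), round(bound / porphyrin, 4))
```

In a fit, the fraction bound would be scaled to the observed absorbance change at 423 nm and the dissociation constant varied to minimize the residuals. The first aliquot gives about 0.2 μM protein, consistent with the aliquot size quoted in the text.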
[0201] Analytical ultracentrifugation (AUC). The oligomeric states of apo- and holo-PS1 were determined by analytical equilibrium sedimentation performed at 25 °C using a Beckman XL-I analytical ultracentrifuge. Ultracentrifugation was conducted at speeds of 25K, 30K, 35K, 40K and 45K r.p.m., and the radial gradient profiles were obtained by absorbance at 280 nm. A 200 μM solution of the apo- and a 100 μM solution of the holo-protein were prepared in 50 mM NaPi pH 7.5, 100 mM NaCl (apo) and 20 mM NaPi pH 7.5, 125 mM NaCl (holo). Data were globally fit to a single-species model of equilibrium sedimentation by a nonlinear least-squares method using IGOR Pro (Wavemetrics).
[0202] Size exclusion chromatography. Gel filtration profiles were obtained using a Superdex 75 5/150 column on an FPLC system (GE Healthcare AKTA). To evaluate the oligomeric state, 20 μL of 100 μM apo-PS1 or 37 μM holo-PS1 was injected onto the column and eluted with a 50 mM phosphate, 150 mM NaCl, pH 7.0 buffer mobile phase at a flow rate of 0.4 mL/min. The approximate molecular weight (MWapp) was calculated from a standard curve obtained with the GE LMW standard protein kit. From this curve, MWapp of the apo is 19.5 kD and that of holo is 17.9 kD. These 13 kD proteins elute at higher MWapp due to their large negative surface charge (q = -12). For apo-PS1, a small dimer peak elutes at MWapp of 44.1 kD, and a smaller tetramer (or pentamer) peak at 103.2 kD.
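The net charge quoted above can be reproduced approximately from the expressed sequence by counting charged side chains. The sketch below assumes a simple model in which Asp and Glu each contribute -1 and Lys and Arg each contribute +1, ignoring His and the chain termini; the function name is hypothetical. The sequence is the expressed PS1 protein after His-tag cleavage, with residue 52 taken as Asn as encoded by the cloned gene (SEQ ID NO:3).

```python
# Expressed PS1 after His-tag cleavage, written in blocks of ten residues.
PS1_EXPRESSED = (
    "SEFEKLRQTG" "DELVQAFQRL" "REIFDKGDDD" "SLEQVLEEIE" "ELIQKHRQLF"
    "DNRQEAADTE" "AAKQGDQWVQ" "LFQRFREAID" "KGDKDSLEQL" "LEELEQALQK"
    "IRELAEKKN"
)


def net_charge(sequence):
    """Formal side-chain charge: +1 per Lys/Arg, -1 per Asp/Glu.

    Assumption: His and the N- and C-termini are ignored.
    """
    acidic = sum(sequence.count(aa) for aa in "DE")
    basic = sum(sequence.count(aa) for aa in "KR")
    return basic - acidic


print(len(PS1_EXPRESSED), net_charge(PS1_EXPRESSED))
```

A pH-dependent calculation using side-chain pKa values would refine this estimate, but the simple count already shows the strongly acidic character of the surface.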
[0203] Circular dichroism (CD). CD spectra were collected on a Jasco J-810 CD
spectrometer in a 0.1 cm path length quartz cuvette, using temperature/wavelength mode.
Spectra were collected from 20 to 95 °C with an interval of 5 °C and an increase rate of 1 °C/minute, over a wavelength range from 215 to 250 nm. Apo- and holo-PS1 were prepared at 10 μM and 6.6 μM, respectively, in 50 mM NaPi pH 7.5, 100 mM NaCl buffer. Temperature melts of apo-PS1 were also performed at varying concentrations of guanidine HCl denaturant (0 M, 1 M, 2 M, 3 M, 4 M, 5 M, 5.85 M, 7 M).
[0204] Steady-state electronic absorption and emission spectroscopy. Electronic absorption spectra were collected using a Shimadzu UV-1700 UV-Vis spectrophotometer or a Cary 5000 spectrophotometer. Steady-state emission spectra were obtained on an FLS920P spectrophotometer (Edinburgh Instruments Ltd., Livingston, UK) in 1 cm quartz optical cells. The steady-state emission spectra were corrected using the correction factor generated by the manufacturer.
[0205] Pump-probe transient absorption spectroscopy. Ultrafast transient absorption spectra were obtained using standard pump-probe methods (ref. 9) with a time resolution of approximately 200 fs. Elevated temperature experiments were performed in a custom-made temperature block of anodized aluminum, the temperature of which was controlled by heating rods and monitored by a pair of thermocouples wired to a PID through a solid-state relay. Following pump-probe transient absorption experiments, electronic absorption spectra verified that the samples were robust.
[0206] Cofactor (e.g., ligand) geometry optimization. The geometry of (CF3)4PZn was optimized via density functional theory using the B3LYP functional and 6-31G* basis set implemented in Gaussian03. The starting geometry was obtained from the crystal structure of the related meso-heptafluoropropyl(porphinato)Zn(II), with the fluoropropyl groups truncated to fluoromethyl (ref. 10). Meso-heptafluoropropyl(porphinato)Zn(II) co-crystallized with an axially ligating pyridine; imidazole was computationally substituted for pyridine for the geometry optimization of (CF3)4PZn.
[0207] Visualization of protein structures and image rendering. Protein models were visualized and rendered in the PyMOL visualization program (ref. 11).
[0208] Protein expression and purification. The gene coding for the protein sequence of PS1 was ordered from GenScript and cloned into the IPTG-inducible pET-11a plasmid (cloning site NdeI-BamHI). The sequence also coded for an N-terminal 6xHis-tag followed by a TEV protease cleavage sequence, followed finally by the designed sequence. The cloned gene sequence is:
CATATGCATCACCATCACCATCACGAAAACCTGTATTTTCAGAGCGAATTCGAAAAACTGCGTCAAACCGGCGACGAACTGGTGCAGGCATTTCAACGTCTGCGCGAAATTTTCGATAAAGGTGATGACGATAGTCTGGAACAGGTTCTGGAAGAAATTGAAGAACTGATCCAGAAACATCGTCAACTGTTTGACAATCGCCAGGAAGCGGCCGATACGGAAGCAGCTAAACAGGGCGACCAATGGGTCCAGCTGTTTCAACGTTTCCGCGAAGCCATTGATAAAGGTGACAAAGATAGCCTGGAACAGCTGCTGGAAGAACTGGAACAGGCGCTGCAAAAAATCCGCGAACTGGCCGAAAAGAAAAACTAAGGATCC (SEQ ID NO:3)
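As a quick consistency check, the open reading frame of SEQ ID NO:3 can be translated with the standard genetic code. The minimal Python sketch below does this for the first 14 codons following the NdeI site (start codon, 6xHis tag and TEV recognition sequence). The helper name `translate` is illustrative and was not part of the original workflow.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First 14 codons of the ORF in SEQ ID NO:3 (the ATG lies within the NdeI site CATATG).
orf_start = "ATGCATCACCATCACCATCACGAAAACCTGTATTTTCAGAGC"
print(translate(orf_start))  # MHHHHHHENLYFQS
```

The same function can be applied to the full insert, starting at the ATG, to compare against the expressed protein sequence.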
[0209] The expressed protein sequence was ultimately:
MHHHHHHENLYFQ/SEFEKLRQTGDELVQAFQRLREIFDKGDDDSLEQVLEEIEELIQKHRQLFDNRQEAADTEAAKQGDQWVQLFQRFREAIDKGDKDSLEQLLEELEQALQKIRELAEKKN (SEQ ID NO:4)
where the "/" defines the cleavage site of TEV protease. The plasmids were transformed into E. coli BL21(DE3) cells, which were grown in LB/ampicillin media (or, for NMR samples, M9 minimal media with isotope-labeled ammonia and glucose from Cambridge Isotopes) until the OD at 600 nm reached 0.6. The cells were then induced with IPTG and allowed to grow for 4 more hours. Cells were then centrifuged and frozen. The frozen cell pellets were lysed in a French press in the Duke University Biology Department. The expressed, His-tagged PS1 protein was purified via a Ni NTA column (Invitrogen) and confirmed by gel electrophoresis. The buffer was exchanged to the Sigma-recommended TEV protease buffer (5 mM DTT, 50 mM Tris, 0.5 mM EDTA, pH 8.0), and the PS1/TEV solution (His-tagged TEV protease was ordered from Sigma) was allowed to rock for 1 day at room temperature. The resulting His-tag-free PS1 protein was collected from the flow-through of a Ni NTA column and concentrated in a stock of 50 mM NaPi, 100 mM NaCl, pH 7.5 buffer, with an approximate yield of 40 mg/L. PS2 was expressed and purified in the same manner.
[0210] Design of PS2. To explore the need for a second-shell hydrogen bond to the Trp indole of W68, we designed a second sequence, PS2. Computational evaluation of positions where a second-shell polar residue could be introduced showed that a Ser at position 94 could form the desired hydrogen bond. This residue is Leu in PS1, so introducing a small Ser at this position would lead to a local packing defect if the change were made directly in PS1. Thus, the entire core was redesigned using the original procedure, but this time requiring Ser and Trp at positions 94 and 68, respectively. The core of PS2 shares only 55 percent identity with PS1, as shown in the aligned sequences below. (The solvent-exposed amino acids are identical between PS1 and PS2, as per the design method, which only explicitly considers the protein core.)
[0211] Core residues of PS1: LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA (SEQ ID NO:5)
[0212] Core residues of PS2: LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA (SEQ ID NO:6)
[0213] PS2 was expressed with the same His-tag as PS1, and cleaved and purified using the same methods. Binding of (CF3)4PZn to PS2 was carried out using the same method as for PS1. We found that PS2 bound (CF3)4PZn in a homogeneous environment, indicated by the narrow electronic absorption bands of the porphyrin in PS2, nearly indistinguishable from those in PS1 (FIG. 10). PS2 will be structurally characterized in future studies in which we will examine the role of second- and third-shell hydrogen bonds on the photophysical properties of holo-PS proteins. The expressed, purified, His-tag cleaved sequence of PS2 was:
SEFEKLRQTGDEIIQLLQRLREAIDKGDDDSLEQILEELEEAFQKHRQLFENRQEAADTEFAKQGDQWLQLFQRIREAIDKGDKDSLEQLFEESEQGIQKIRELAEKKN (SEQ ID NO:7)
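The core identity between PS1 and PS2 can be checked by a position-wise comparison of SEQ ID NO:5 and SEQ ID NO:6. The short Python sketch below assumes the two core strings are already aligned position by position, as listed above; with the sequences as transcribed here it gives 20 identical positions out of 37 (about 54%), close to the quoted 55 percent. The function name is illustrative.

```python
PS1_CORE = "LGLVAFLIFGLVLILIHLFAAGWVFFAILLLLALILA"  # SEQ ID NO:5
PS2_CORE = "LGIILLLAIGLILLAFHLFFAGWLFIAILLFSGIILA"  # SEQ ID NO:6


def count_identical(seq_a, seq_b):
    """Count identical residues between two equal-length, pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


identical = count_identical(PS1_CORE, PS2_CORE)
percent = 100.0 * identical / len(PS1_CORE)
print(f"{identical}/{len(PS1_CORE)} identical core positions ({percent:.1f}%)")
```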
[0214] Cofactor synthesis. The cofactor [5,10,15,20-tetrakis(trifluoromethyl)porphinato]zinc(II), abbreviated as (CF3)4PZn in the main text, was synthesized as previously reported (ref. 10), and was confirmed by NMR and electronic absorption spectra. The same applies to (CF3)4PFe.
[0215] Clustering of apo-PS1 NMR models. We implemented a greedy clustering algorithm in Matlab to form clusters within the family of structures of apo-PS1 (Extended Data Fig. 7). A pairwise RMSD matrix over the apo-PS1 models was computed for residues 61-67 and 99-105. These residues, which lie on opposite helices, show the largest conformational variation within the apo-PS1 models. The clustering algorithm defines the centroid as the column of the RMSD matrix containing the largest number of RMSD values below a threshold of 1 Å. Components of this column below this threshold have their corresponding rows and columns removed from the RMSD matrix, and the clustering algorithm is then repeated on the truncated RMSD matrix. Of the 20 NMR models, two clusters were found with > 4 members each. The cluster defining the closed conformation contained 13 members, and that of the open conformation contained 5 members.
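The greedy procedure described above can be sketched in a few lines. The original implementation was in Matlab; the Python version below is an illustrative re-expression under the stated rules (centroid = model with the most neighbours below the threshold, members then removed, repeat), and the toy RMSD matrix is made up purely to show the call.

```python
def greedy_cluster(rmsd, threshold=1.0):
    """Greedy clustering on a symmetric pairwise RMSD matrix (list of lists).

    Returns a list of (centroid_index, member_indices) tuples in the order found.
    """
    remaining = list(range(len(rmsd)))
    clusters = []
    while remaining:
        def n_neighbours(j):
            # Number of still-unassigned models within the threshold of model j.
            return sum(1 for i in remaining if rmsd[i][j] < threshold)

        centroid = max(remaining, key=n_neighbours)
        members = [i for i in remaining if rmsd[i][centroid] < threshold]
        clusters.append((centroid, members))
        # Drop the rows/columns of all members before the next pass.
        remaining = [i for i in remaining if i not in members]
    return clusters


# Hypothetical 5-model example: models 0-2 are mutually close, as are models 3-4.
toy = [
    [0.0, 0.5, 0.5, 3.0, 3.0],
    [0.5, 0.0, 0.5, 3.0, 3.0],
    [0.5, 0.5, 0.0, 3.0, 3.0],
    [3.0, 3.0, 3.0, 0.0, 0.5],
    [3.0, 3.0, 3.0, 0.5, 0.0],
]
print(greedy_cluster(toy, threshold=1.0))
```

Because each model lies at zero distance from itself, every pass removes at least the centroid, so the loop always terminates; singletons simply become one-member clusters.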
[0216] Molecular dynamics simulations. The lowest-energy NMR structure of apo-PS1, which is the centroid of the closed conformation, was used as the starting conformation for the molecular dynamics simulation. The structure was solvated in a water box with 17 Å padding and neutralized by the addition of 12 Na+ counter ions. The AMBER force field 14SB was used for the parameterization of the protein. TIP3P water parameterization was used to describe the water molecules (ref. 12).
[0217] The molecular dynamics simulation was carried out using ACEMD (ref. 13). The system was minimized for 2000 steps, followed by equilibration using the NPT ensemble for 10 ns at 1 atm using a time-step of 2 fs. We also used rigid bonds and a cutoff of 9 Å, with PME for long-range electrostatics. Following the relaxation phase, the protein was allowed to move freely and was simulated under the NVT ensemble in ACEMD with a Langevin thermostat. To achieve a time-step of 4 fs, we used damping at 0.1 ps^-1 and a hydrogen mass repartitioning scheme. The simulation was carried out to 1 at 298 K.
[0218] SOCKET Server for assessment of knobs-into-holes packing. PDB files of the PS1 design model, holo-PS1 centroid, and apo-PS1 open/closed centroids were individually uploaded to and analyzed by the SOCKET server (ref. 14) for knobs-into-holes side chain packing (see Section 4). A helical residue was defined as a knob if its side chain was within 8 Å of 4 other side chains from residues on an adjacent helix (a hole). Output from the SOCKET server for each of these PDB files is displayed below showing the residues of each knob and hole. Note that the residue number of the PS1 design model is off register by 1 amino acid from the structural sequences, due to the presence of the N-terminal Ser residue from TEV cleavage of the expressed proteins.
Example 3 - enFold Proteins can bind endogenous ligands
[0219] The computational method described here is capable of producing proteins that noncovalently bind ligands in vivo. We have observed loading of endogenous heme in a PS1 variant, where 7 terminal residues near the ligand binding site were deleted to allow incorporation of a heme ligand with its bulky, charged propionate functional groups (FIG. 16). The design methodology produces proteins that possess a unique structure in the apo-form, which avoids aggregation even at the high concentrations that may occur during cellular expression. These apo-proteins remain competent to bind an endogenous ligand, for example heme (FIG. 17 and FIGS. 18A-18B). To our knowledge, these are the first de novo designed proteins that noncovalently bind heme in vivo.
Data Tables
[0220] Table S1. Best-fit Crick parameters* at various stages of PS1 design (no symmetry constraint)
*Parameters were fit using the CCCP server http://arteni.cs.dartmouth.edu/cccp/index.fit.php
[0221] Table S2. Best-fit Crick parameters* at various stages of PS1 design (D2-symmetry constraint)
*Parameters were fit using the CCCP server http://arteni.cs.dartmouth.edu/cccp/index.fit.php
[0222] Table S3. Designed residues of PS1.
16 I A 71 I L
17 A F 72 A F
20 V L 75 V F
23 I* I* 78 V A
24 M F 79 L I
32 L* L* 87 L* L*
35 L V 90 L* L*
36 L* L* 91 I L
39 A I 94 A L
40 Y E 95 Y E
42 L* L* 97 L A
43 I* I* 98 I L
49 L* L* 101 L I
50 A F 104 L* L*
51 Y D 105 F A
a Residues are numbered according to the expressed 109-residue PS1 protein. All denotes a mutated residue, and * denotes a retained residue, as shown in Fig. S1.
[0223] Table S4. Statistics of holo- and apo-PS1 NMR structures
CYANA target function [Å²]    2.67 ± 0.03    1.65 ± 0.09
RDC    89    -
Average number of distance constraint violations per CYANA conformer
0.2 - 0.5 Å    0.0    0.0
> 0.5 Å    0.0    0.0
Average number of dihedral-angle constraint violations per CYANA conformer
> 5°    0.0    0.0
Average RMSD to the mean coordinates [Å]
Regular secondary structure elements (a), backbone heavy atoms    0.34 ± 0.06    1.59 ± 0.43
Regular secondary structure elements (a), all heavy atoms    1.16 ± 0.14    2.32 ± 0.38
All backbone heavy atoms    0.78 ± 0.17    1.98 ± 0.50
All heavy atoms    1.62 ± 0.24    2.81 ± 0.47
Average RMSD to the model [Å]    1.05 ± 0.09    -
Ramachandran plot summary [%]
Most favored regions    99.4    98.9
Additionally allowed regions    0.6    1.1
Generously allowed regions    0.0    0.0
Disallowed regions    0.0    0.0
Overall backbone assignments (b)    98.1%    95.9%
Overall side chain chemical shift assignments (c)    97.0%    94.6%
(a) Residues 5-26, 29-52, 58-81, 84-106.
(b) Excluding the N-terminal NH3+.
(c) Excluding Lys NH3+, Arg NH2, OH, side chain 13CO and aromatic 13Cγ.
[0224] Table S5. H-D exchange rates and protection factors for apo- and holo-PS1, recorded at pH 6.5 and 298 K.
L 42 >0.3 1.79E+00 - N.D. -
I 43 >0.3 1.08E+00 - >0.3 -
Q 44 2.04E-02 6.34E+00 5.74 1.36E-03 8.45 2.71
K 45 >0.3 1.35E+01 - 1.45E-02 6.84
H 46 >0.3 5.91E+01 - 1.46E-02 8.30
R 47 >0.3 5.91E+01 - 1.55E-02 8.24
Q 48 >0.3 1.79E+01 - 3.50E-02 6.24
L 49 >0.3 3.91E+00 - >0.3 -
F 50 >0.3 3.33E+00 - >0.3 -
D 51 >0.3 1.35E+01 - >0.3 -
N 52 >0.3 1.96E+01 - >0.3 -
R 53 >0.3 2.35E+01 - >0.3 -
Q 54 N.D. 1.79E+01 - >0.3 -
E 55 >0.3 1.15E+01 - >0.3 -
A 56 >0.3 6.79E+00 - >0.3 -
A 57 >0.3 9.37E+00 - N.D. -
D 58 >0.3 1.18E+01 - >0.3 -
T 59 >0.3 5.39E+00 - N.D. -
E 60 >0.3 1.15E+01 - >0.3 -
A 61 N.D. 6.79E+00 - N.D. -
A 62 >0.3 9.37E+00 - >0.3 -
K 63 >0.3 8.55E+00 - >0.3 -
Q 64 >0.3 1.42E+01 - >0.3 -
G 65 >0.3 2.77E+01 - >0.3 -
D 66 N.D. 1.75E+01 - >0.3 -
Q 67 N.O. 7.28E+00 - >0.3 -
W 68 >0.3 5.78E+00 - 1.94E-02 5.70
V 69 N.D. 1.45E+00 - 5.84E-03 5.51
Q 70 3.30E-03 7.80E+00 7.77 1.67E-02 6.15 -1.62
L 71 >0.3 3.91E+00 - 3.00E-03 7.17
F 72 2.67E-02 3.33E+00 4.83 5.01E-04 8.80 3.97
Q 73 >0.3 1.24E+01 - >0.3 8.47
R 74 1.49E-03 1.79E+01 9.40 1.81E-03 9.20 -0.20
F 75 1.00E-03 8.95E+00 9.10 9.44E-03 6.85 -2.25
R 76 7.88E-03 1.29E+01 7.40 3.42E-03 8.24 0.84
E 77 N.D. 1.21E+01 - 2.96E-03 8.32
A 78 3.80E-03 6.79E+00 7.49 3.23E-03 7.65 0.16
I 79 N.D. 1.75E+00 - 1.64E-02 4.67
D 80 N.D. 6.95E+00 - >0.3 -
K 81 >0.3 5.78E+00 - >0.3 -
G 82 >0.3 2.30E+01 - >0.3 -
D 83 N.D. 1.75E+01 - >0.3 -
K 84 >0.3 5.78E+00 - >0.3 -
D 85 N.D. 1.56E+01 - >0.3 -
S 86 >0.3 1.49E+01 - >0.3 -
L 87 >0.3 4.92E+00 - >0.3 -
E 88 >0.3 4.49E+00 - >0.3 -
Q 89 >0.3 7.80E+00 - >0.3 -
L 90 8.33E-03 3.91E+00 6.15 >0.3 -
L 91 >0.3 1.52E+00 - 3.22E-04 8.46
E 92 9.96E-03 4.49E+00 6.11 >0.3 -
E 93 5.03E-03 5.27E+00 6.95 7.50E-04 8.86 1.91
L 94 >0.3 1.79E+00 - 3.28E-03 6.30
E 95 >0.3 4.49E+00 - 1.21E-03 8.22
Q 96 2.67E-02 7.80E+00 5.68 2.00E-03 8.27 2.59
A 97 3.31E-03 1.49E+01 8.41 5.77E-04 10.16 1.75
L 98 2.06E-03 2.47E+00 7.09 **** -
Q 99 >0.3 6.64E+00 - 3.23E-02 5.33
K 100 >0.3 1.35E+01 - >0.3 -
I 101 >0.3 2.30E+00 - 2.48E-03 6.83
R 102 >0.3 6.64E+00 - 1.86E-02 5.88
E 103 >0.3 1.21E+01 - N.D. -
L 104 >0.3 1.79E+00 - 4.22E-02 3.75
A 105 >0.3 5.78E+00 - >0.3 -
E 106 >0.3 7.28E+00 - >0.3 -
K 107 >0.3 6.19E+00 - >0.3 -
K 108 >0.3 1.13E+01 - >0.3 -
N 109 >0.3 3.82E+01 - >0.3 -
W 68 >0.3 1.40E+01 - 1.56E-02 2.96 *****
*: Observed pseudo-first order rate constant. **: Calculated according to "Primary structure effects on peptide group hydrogen exchange" (Bai et al., Proteins, 1993, v17, p75). ***: PF = kint/kex. ****: The kex of L98 cannot be obtained from the data recorded in the two-day period, and a comparison to the peak intensity of the same residue in the HSQC spectra recorded in 95% H2O shows that kex is very slow. *****: W 68 indole HN peak. N.O.: not observed. N.D.: not determined due to overlap.
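The protection-factor columns of Table S5 can be reproduced from the rate constants. The footnote defines PF = kint/kex; the tabulated values appear to be the natural logarithm of that ratio, which is an inference from the numbers rather than something stated in the text. The minimal Python sketch below checks this for residue Q 44, assuming the first rate column refers to the apo form and the second to the holo form; the function name is illustrative.

```python
import math


def ln_protection_factor(k_int, k_ex):
    """Natural log of the protection factor PF = k_int / k_ex."""
    return math.log(k_int / k_ex)


# Residue Q 44 from Table S5 (intrinsic rate 6.34E+00).
apo = ln_protection_factor(6.34, 2.04e-2)    # tabulated as 5.74
holo = ln_protection_factor(6.34, 1.36e-3)   # tabulated as 8.45
print(round(apo, 2), round(holo, 2), round(holo - apo, 2))
```

Under this reading, the final column of the table is the difference between the two logarithms, that is, the change in ln(PF) between the two forms.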
[0225] Table S6. SOCKET knobs into holes packing information: PS1 Design Model
LEU 93, ALA 96, LEU 97, ILE 100 (knob: 57 (TRP 67, helix 2))
LEU 89, GLU 92, LEU 93, ALA 96 (knob: 60 (LEU 70, helix 2))
LEU 86, LEU 89, LEU 90, LEU 93 (knob: 64 (PHE 74, helix 2))
[0226] Table S7. SOCKET knobs into holes packing information: Apo-PS1 open centroid
GLU 33, LEU 36, GLU 37, GLU 40 (knob: 65 (ARG 76, helix 2))
holes in helix 2:
PHE 72, PHE 75, ARG 76, ILE 79 (knob: 31 (LEU 36, helix 1)) ~
LEU 71, ARG 74, PHE 75, ALA 78 (knob: 76 (LEU 90, helix 3))
TRP 68, LEU 71, PHE 72, PHE 75 (knob: 80 (LEU 94, helix 3))
GLN 64, GLN 67, TRP 68, LEU 71 (knob: 83 (ALA 97, helix 3))
holes in helix 3 :
LEU 91, LEU 94, GLU 95, LEU 98 (knob: 15 (PHE 17, helix 0))~
LEU 94, ALA 97, LEU 98, ILE 101 (knob: 57 (TRP 68, helix 2))
LEU 90, GLU 93, LEU 94, ALA 97 (knob: 60 (LEU 71, helix 2))
LEU 87, LEU 90, LEU 91, LEU 94 (knob: 64 (PHE 75, helix 2))
Command Line and Input Files
[0228] Input files and command lines for design calculations.
[0229] Command lines and flags for generating the backbone ensemble via Rosetta backrub
Flags
-nstruct 200
-constraints:cst_fa_file my_atomic.cst
-constraints:cst_fa_weight 1
-extrachi_cutoff 0
-ex1
-ex2
-backrub:mc_kt 0.8
-backrub:ntrials 10000
-backrub:sc_prob_withinrot 0.1
-backrub:initial_pack
-backrub:mm_bend_weight 2
-backrub:pivot_residues 1-108
Command line
~/rosetta/rosetta-3.5/rosetta_source/bin/backrub.default.linuxgccrelease -database ~/rosetta/rosetta-3.5/rosetta_database/ -s holo_input_model.pdb @flags.txt -extra_res_fa PZNF.params
[0230] Command lines, RosettaScript, and flags for the flexible backbone sequence design protocol.
RosettaScript
<dock_design>
  <SCOREFXNS>
    <scorewts weights=score13>
      <Reweight scoretype = atom_pair_constraint weight = 1/>
      <Reweight scoretype = angle_constraint weight = 1/>
      <Reweight scoretype = hack_aro weight = 1/>
      <Reweight scoretype = fa_pair weight = 0/>
      <Reweight scoretype = hack_elec weight = 0.55/>
      <Reweight scoretype = rg weight = 2/>
    </scorewts>
    <scorewts_backrub weights=score13>
      <Reweight scoretype = atom_pair_constraint weight = 1/>
      <Reweight scoretype = angle_constraint weight = 1/>
      <Reweight scoretype = rg weight = 2/>
      <Reweight scoretype = hack_aro weight = 1/>
    </scorewts_backrub>
    <softwts weights=soft_rep_design>
      <Reweight scoretype = atom_pair_constraint weight = 1/>
      <Reweight scoretype = angle_constraint weight = 1/>
      <Reweight scoretype = hack_aro weight = 1/>
      <Reweight scoretype = rg weight = 2/>
    </softwts>
  </SCOREFXNS>
  <FILTERS>
    <PackStat name = pstat threshold = 0.58 repeats = 3/>
  </FILTERS>
  <TASKOPERATIONS>
    <ReadResfile name = rr filename = resfile.txt/>
    <InitializeFromCommandline name = ifcl/>
    <IncludeCurrent name = input_sc/>
    <RestrictToRepacking name = no_mutations/>
    <ExtraRotamersGeneric name = extra_rot1 ex1 = 1 ex2 = 1 ex1_sample_level = 3 ex2_sample_level = 3 extrachi_cutoff = 0/>
    <OperateOnCertainResidues name = fixpolars>
      <PreventRepackingRLT/>
      <ResidueHasProperty property = POLAR/>
    </OperateOnCertainResidues>
    <OperateOnCertainResidues name = fixcharged>
      <PreventRepackingRLT/>
      <ResidueHasProperty property = CHARGED/>
    </OperateOnCertainResidues>
  </TASKOPERATIONS>
  <MOVERS>
    <ConstraintSetMover name = atomic cst_file = my_atomic.cst/>
    <PackRotamersMover name = repack scorefxn = scorewts task_operations = ifcl,no_mutations/>
    <PackRotamersMover name = pr1 scorefxn = softwts task_operations = rr,ifcl/>
    <PackRotamersMover name = pr2 scorefxn = scorewts task_operations = rr,ifcl,extra_rot1/>
    <MinMover name = minmovsc scorefxn = softwts tolerance = 0.005 chi = 1 bb = 0/>
    <MinMover name = minmovbb scorefxn = scorewts tolerance = 0.005 chi = 0 bb = 1/>
    <Backrub name = backrub pivot_residues = 1-108 require_mm_bend = 1/>
    <Sidechain name = sidechain task_operations = ifcl,no_mutations,fixpolars,fixcharged/>
    <ParsedProtocol name = backrub_protocol mode = single_random>
      <Add mover_name = backrub apply_probability = 0.75/>
      <Add mover_name = sidechain apply_probability = 0.25/>
    </ParsedProtocol>
    <GenericMonteCarlo name = backrub_mc mover_name = backrub_protocol scorefxn_name = scorewts_backrub trials = 200 temperature = 1.2 preapply = 0/>
    <ParsedProtocol name = flexdes>
      <Add mover_name = pr1/>
      <Add mover_name = minmovsc/>
      <Add mover_name = pr2/>
      <Add mover_name = minmovbb/>
      <Add mover_name = pr2 filter_name = pstat/>
    </ParsedProtocol>
    <GenericMonteCarlo name = iterate mover_name = flexdes scorefxn_name = scorewts trials = 3 preapply = 0 temperature = 0.4/>
  </MOVERS>
  <OUTPUT scorefxn = scorewts/>
  <APPLY_TO_POSE>
  </APPLY_TO_POSE>
  <PROTOCOLS>
    <Add mover_name = atomic/>
    <Add mover_name = repack/>
    <Add mover_name = backrub_mc/>
    <Add mover_name = iterate/>
    <Add filter_name = pstat/>
  </PROTOCOLS>
</dock_design>
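The two GenericMonteCarlo movers in the script above accept or reject trial moves at a fixed temperature (1.2 for the backrub stage, 0.4 for the design iterations). Assuming a standard Metropolis acceptance rule, which is how such Monte Carlo movers are generally understood to operate, the effect of the temperature setting can be sketched as follows. The function names are illustrative and this is not Rosetta code.

```python
import math
import random


def acceptance_probability(delta_score, temperature):
    """Metropolis probability of accepting a move that changes the score by delta_score."""
    if delta_score <= 0.0:
        return 1.0  # downhill or neutral moves are always accepted
    return math.exp(-delta_score / temperature)


def metropolis_accept(delta_score, temperature, rng=random):
    """Draw a random number and decide whether to accept the move."""
    return rng.random() < acceptance_probability(delta_score, temperature)


# The same uphill move (+1 score unit) is accepted far more often at 1.2 than at 0.4.
print(acceptance_probability(1.0, 1.2), acceptance_probability(1.0, 0.4))
```

A higher temperature therefore lets the backrub stage explore backbone variation more freely, while the lower temperature makes the sequence design iterations more selective.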
[0231] Contents of constraint file (my_atomic.cst):
AtomPair NE2 45A ZN1 1X HARMONIC 2.0 0.1
Angle ZN1 1X NE2 45A ND1 45A CIRCULARHARMONIC 2.806 .2
Angle ZN1 1X NE2 45A CG 45A CIRCULARHARMONIC 2.845 .2
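The constraint lines above pair a target value with a width: a 2.0 Å His NE2 to Zn distance with a standard deviation of 0.1 Å, and two angles (in radians) with a width of 0.2. As a rough sketch, assuming the usual quadratic form ((x - x0)/sd)^2 for HARMONIC and the same form applied to the wrapped angular difference for CIRCULARHARMONIC, the penalties can be evaluated as below. These helper functions are illustrative and are not taken from Rosetta itself.

```python
import math


def harmonic(x, x0, sd):
    """Quadratic penalty on the deviation from x0, scaled by sd (assumed form)."""
    return ((x - x0) / sd) ** 2


def circular_harmonic(x, x0, sd):
    """Same penalty, with the angular difference wrapped into [-pi, pi)."""
    diff = (x - x0 + math.pi) % (2.0 * math.pi) - math.pi
    return (diff / sd) ** 2


# A Zn-N distance of 2.2 Angstrom is two standard deviations from the 2.0 Angstrom target.
print(harmonic(2.2, 2.0, 0.1))
# An angle 0.2 rad away from the 2.806 rad target is one standard deviation off.
print(circular_harmonic(3.006, 2.806, 0.2))
```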
Flags
-parser:protocol RosettaScript.xml
-nstruct 500
-out:file:fullatom
-out:pdb
-packing:multi_cool_annealer 10
-packing:linmem_ig 20
Command line input
~/rosetta/rosetta-3.5/rosetta_source/bin/rosetta_scripts.default.linuxgccrelease -database ~/rosetta/rosetta-3.5/rosetta_database/ -s ../holo_input_model.pdb -extra_res_fa PZNF.params @flags.txt
Contents of the residue file (resfile.txt):
NATRO
USE_INPUT_SC
start
2 A APOLAR NOTAA WYCMH
5 A APOLAR NOTAA WYCMH
6 A NATAA
8 A NATAA
9 A APOLAR NOTAA WYCMH
10 A NATAA
12 A APOLAR NOTAA WYCMH
13 A APOLAR NOTAA WYCMH
15 A APOLAR NOTAA WYCMH
16 A APOLAR NOTAA WYCMH
19 A APOLAR NOTAA WYCMH
22 A APOLAR NOTAA WYCMH
23 A APOLAR NOTAA WYCMH
31 A APOLAR NOTAA WYCMH
34 A APOLAR NOTAA WYCMH
35 A APOLAR NOTAA WYCMH
38 A APOLAR NOTAA WYCMH
39 A ALLAAxc NOTAA WYCMH
41 A APOLAR NOTAA WYCMH
42 A APOLAR NOTAA WYCMH
45 A NATRO
46 A NATAA
48 A APOLAR NOTAA WYCMH
49 A APOLAR NOTAA WYCMH
50 A ALLAAxc NOTAA WYCMH
60 A APOLAR NOTAA WYCMH
61 A APOLAR NOTAA WYCMH
64 A APOLAR NOTAA WYCMH
65 A NATAA
67 A PIKAA W
68 A APOLAR NOTAA WYCMH
70 A APOLAR NOTAA WYCMH
71 A APOLAR NOTAA WYCMH
74 A APOLAR NOTAA WYCMH
77 A APOLAR NOTAA WYCMH
78 A APOLAR NOTAA WYCMH
86 A APOLAR NOTAA WYCMH
89 A APOLAR NOTAA WYCMH
90 A APOLAR NOTAA WYCMH
93 A APOLAR NOTAA WYCMH
94 A ALLAAxc NOTAA WYCMH
96 A APOLAR NOTAA WYCMH
97 A APOLAR NOTAA WYCMH
100 A APOLAR NOTAA WYCMH
101 A NATAA
103 A APOLAR NOTAA WYCMH
104 A APOLAR NOTAA WYCMH
105 A NATAA
Contents of (CF3)4PZn parameters file (PZNF.params):
NAME PZF
IO_STRING PZF Z
TYPE LIGAND
AA UNK
ATOM ZN1 Zn2p X 1.01
ATOM N1 Npro X -0.65
ATOM C18 aroC X 0.42
ATOM C17 aroC X -0.37
ATOM C16 aroC X 0.46
ATOM N4 Npro X -0.66
ATOM C13 aroC X 0.51
ATOM C12 aroC X -0.44
ATOM C11 aroC X 0.46
ATOM N3 Npro X -0.65
ATOM C8 aroC X 0.43
ATOM C7 aroC X -0.39
ATOM C6 aroC X 0.47
ATOM N2 Npro X -0.67
ATOM C3 aroC X 0.53
ATOM C2 aroC X -0.44
ATOM C1 aroC X 0.47
ATOM C20 aroC X -0.27
ATOM C19 aroC X -0.27
ATOM H6 Haro X 0.19
ATOM H7 Haro X 0.19
ATOM C21 CH1 X 0.50
ATOM F1 F X -0.16
ATOM F2 F X -0.17
ATOM F10 F X -0.17
ATOM C4 aroC X -0.30
ATOM C5 aroC X -0.29
ATOM H2 Haro X 0.20
ATOM H1 Haro X 0.20
ATOM C22 CH1 X 0.52
ATOM F7 F X -0.17
ATOM F8 F X -0.17
ATOM F9 F X -0.18
ATOM C9 aroC X -0.28
ATOM C10 aroC X -0.26
ATOM H8 Haro X 0.19
ATOM H3 Haro X 0.19
ATOM C23 CH1 X 0.50
ATOM F5 F X -0.16
ATOM F6 F X -0.17
ATOM F12 F X -0.17
ATOM C14 aroC X -0.29
ATOM C15 aroC X -0.29
ATOM H5 Haro X 0.20
ATOM H4 Haro X 0.19
ATOM C24 CH1 X 0.53
ATOM F11 F X -0.19
ATOM F3 F X -0.17
ATOM F4 F X -0.17
BOND ZN1 N1
BOND ZN1 N3
BOND ZN1 N4
BOND ZN1 N2
BOND C24 F11
BOND F1 C21
BOND F2 C21
BOND F3 C24
BOND F4 C24
BOND F5 C23
BOND F6 C23
BOND F7 C22
BOND F8 C22
BOND N1 C18
BOND N1 C1
BOND N2 C6
BOND N2 C3
BOND N3 C8
BOND N3 C11
BOND N4 C16
BOND N4 C13
BOND C1 C20
BOND C1 C2
BOND C2 C3
BOND C2 C21
BOND C3 C4
BOND C4 C5
BOND C4 H1
BOND C5 H2
BOND C5 C6
BOND C6 C7
BOND C7 C22
BOND C7 C8
BOND C8 C9
BOND C9 H3
BOND C9 C10
BOND C10 H8
BOND C10 C11
BOND C11 C12
BOND C12 C13
BOND C12 C23
BOND C13 C14
BOND C14 C15
BOND C14 H4
BOND C15 H5
BOND C15 C16
BOND C16 C17
BOND C17 C24
BOND C17 C18
BOND C18 C19
BOND C19 H6
BOND C19 C20
BOND C20 H7
BOND C21 F10
BOND C22 F9
BOND C23 F12
CHI 1 C3 C2 C21 F1
CHI 2 C8 C7 C22 F7
CHI 3 C13 C12 C23 F5
CHI 4 C18 C17 C24 F11
NBR_ATOM ZN1
NBR_RADIUS 6.387332
ICOOR_INTERNAL ZN1 0.000000 0.000000 0.000000 ZN1 N1 C18
ICOOR_INTERNAL N1 0.000000 180.000000 2.064355 ZN1 N1 C18
ICOOR_INTERNAL C18 0.000001 51.447825 1.369076 N1 ZN1 C18
ICOOR_INTERNAL C17 6.625837 54.661320 1.412896 C18 N1 ZN1
ICOOR_INTERNAL C16 -8.838188 54.513694 1.411530 C17 C18 N1
ICOOR_INTERNAL N4 8.385056 56.245312 1.374326 C16 C17 C18
ICOOR_INTERNAL C13 -177.426793 72.629665 1.370374 N4 C16 C17
ICOOR_INTERNAL C12 -172.597100 55.764058 1.412488 C13 N4 C16
ICOOR_INTERNAL C11 15.118869 55.057460 1.410234 C12 C13 N4
ICOOR_INTERNAL N3 -15.071141 55.251686 1.368644 C11 C12 C13
ICOOR_INTERNAL C8 174.454044 72.748278 1.368355 N3 C11 C12
ICOOR_INTERNAL C7 175.549092 54.651816 1.412704 C8 N3 C11
ICOOR_INTERNAL C6 -8.949997 54.548075 1.411877 C7 C8 N3
ICOOR_INTERNAL N2 8.726143 56.164530 1.373619 C6 C7 C8
ICOOR_INTERNAL C3 -177.455606 72.629643 1.370275 N2 C6 C7
ICOOR_INTERNAL C2 -172.440358 55.835300 1.411183 C3 N2 C6
ICOOR_INTERNAL C1 14.746094 55.058826 1.411413 C2 C3 N2
ICOOR_INTERNAL C20 159.351269 54.572579 1.451463 C1 C2 C3
ICOOR_INTERNAL C19 -173.134442 73.124901 1.356577 C20 C1 C2
ICOOR_INTERNAL H6 178.675786 53.211637 1.077487 C19 C20 C1
ICOOR_INTERNAL H7 177.242610 53.864408 1.078340 C20 C1 C19
ICOOR_INTERNAL C21 172.161545 61.209306 1.518010 C2 C3 C1
ICOOR_INTERNAL F1 26.588574 66.722458 1.351871 C21 C2 C3
ICOOR_INTERNAL F2 118.742085 68.391753 1.355075 C21 C2 F1
ICOOR_INTERNAL F10 119.896471 67.475489 1.355382 C21 C2 F2
ICOOR_INTERNAL C4 174.106305 70.610058 1.451470 C3 N2 C2
ICOOR_INTERNAL C5 -2.613059 73.025475 1.356111 C4 C3 N2
ICOOR_INTERNAL H2 -178.723869 53.755333 1.075472 C5 C4 C3
ICOOR_INTERNAL H1 -177.048251 53.662204 1.077885 C4 C3 C5
ICOOR_INTERNAL C22 -176.097888 65.736086 1.522118 C7 C8 C6
ICOOR_INTERNAL F7 -48.708509 68.833518 1.355195 C22 C7 C8
ICOOR_INTERNAL F8 -119.427444 65.177131 1.348034 C22 C7 F7
ICOOR_INTERNAL F9 -121.529899 68.269409 1.357705 C22 C7 F8
ICOOR_INTERNAL C9 -176.602426 70.718433 1.455487 C8 N3 C7
ICOOR_INTERNAL C10 2.316401 72.970357 1.356477 C9 C8 N3
ICOOR_INTERNAL H8 -179.770544 53.066131 1.077420 C10 C9 C8
ICOOR_INTERNAL H3 178.815847 53.777613 1.077761 C9 C8 C10
ICOOR_INTERNAL C23 171.964149 61.628846 1.518169 C12 C13 C11
ICOOR_INTERNAL F5 28.592915 66.966988 1.353389 C23 C12 C13
ICOOR_INTERNAL F6 118.547014 68.227339 1.355152 C23 C12 F5
ICOOR_INTERNAL F12 120.209500 67.387458 1.354153 C23 C12 F6
ICOOR_INTERNAL C14 174.179256 70.623080 1.451196 C13 N4 C12
ICOOR_INTERNAL C15 -2.538511 73.008118 1.356972 C14 C13 N4
ICOOR_INTERNAL H5 -178.625691 53.835483 1.074947 C15 C14 C13
ICOOR_INTERNAL H4 -176.963403 53.769411 1.077427 C14 C13 C15
ICOOR_INTERNAL C24 -176.025368 65.801932 1.521861 C17 C18 C16
ICOOR_INTERNAL F11 70.140257 68.312105 1.358482 C24 C17 C18
ICOOR_INTERNAL F3 -119.046543 68.720707 1.354553 C24 C17 F11
ICOOR_INTERNAL F4 -119.508856 65.139031 1.348886 C24 C17 F3
[0234] References for Example 2.
[0235]
1. North, B., Summa, C. M., Ghirlanda, G. & DeGrado, W. F. Dn-symmetrical tertiary templates for the design of tubular proteins. J. Mol. Biol. 311, 1081-1090 (2001).
2. Ghirlanda, G. et al. De novo design of a D2-symmetrical protein that reproduces the diheme four-helix bundle in cytochrome bc1. J. Am. Chem. Soc. 126, 8141-8147 (2004).
3. Lahr, S. J. et al. Analysis and design of turns in α-helical hairpins. J. Mol. Biol. 346, 1441-1454 (2005).
4. Bender, G. M. et al. De novo design of a single-chain diphenylporphyrin metalloprotein. J. Am. Chem. Soc. 129, 10732-10740 (2007).
5. Fry, H. C. et al. Computational de novo design and characterization of a protein that selectively binds a highly hyperpolarizable abiological chromophore. J. Am. Chem. Soc. 135, 13914-13926 (2013).
6. Davis, I. W., Arendall III, W. B., Richardson, D. C. & Richardson, J. S. The backrub motion: How protein backbone shrugs when a sidechain dances. Structure 14, 265-274 (2006).
7. Friedland, G. D., Lakomek, N.-A., Griesinger, C., Meiler, J. & Kortemme, T. A correspondence between solution-state dynamics of an individual protein and the sequence and conformational diversity of its family. PLoS Comput. Biol. 5, e1000393 (2009).
8. Bradley, P., Misura, K. M. S. & Baker, D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868 (2005).
9. Polizzi, N. F. et al. Photoinduced Electron Transfer Elicits a Change in the Static Dielectric Constant of a de Novo Designed Protein. J. Am. Chem. Soc. 138, 2130-2133 (2016).
10. Goll, J. G., Moore, K. T., Ghosh, A. & Therien, M. J. Synthesis, structure, electronic spectroscopy, photophysics, electrochemistry, and x-ray photoelectron spectroscopy of highly-electron-deficient [5,10,15,20-tetrakis(perfluoroalkyl)porphinato]zinc(II) complexes and their free base derivatives. J. Am. Chem. Soc. 118, 8344-8354 (1996).
11. Schrodinger, LLC. The PyMOL Molecular Graphics System, Version 1.8. (2015).
12. Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926-935 (1983).
13. Harvey, M. J., Giupponi, G. & De Fabritiis, G. ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. J. Chem. Theory Comput. 5, 1632-1639 (2009).
14. Walshaw, J. & Woolfson, D. N. SOCKET: a program for identifying and analysing coiled-coil motifs within protein structures. J. Mol. Biol. 307, 1427-1450 (2001).
15. Hayes, D., Laue, T. & Philo, J. Program Sednterp: sedimentation interpretation program. Durham, NH: University of New Hampshire (1995).
16. Moore, K. T., Fletcher, J. T. & Therien, M. J. Syntheses, NMR and EPR Spectroscopy, Electrochemical Properties, and Structural Studies of [5,10,15,20-Tetrakis(perfluoroalkyl)porphinato]iron(II) and -iron(III) Complexes. J. Am. Chem. Soc. 121, 5196-5209 (1999).
17. Grigoryan, G. & DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 405, 1079-1100 (2011).
Claims
WHAT IS CLAIMED IS:
1. A computer-implemented method, comprising:
(a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates;
(b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and (c) optimizing
said set of ligand binding amino acid residues;
said set of ligand binding amino acid residue atomic coordinates;
said set of core amino acid residues; and
said set of core amino acid residue atomic coordinates;
wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
2. The method of claim 1, wherein step c) comprises simultaneously optimizing
said set of ligand binding amino acid residues;
said set of ligand binding amino acid residue atomic coordinates;
said set of core amino acid residues; and
said set of core amino acid residue atomic coordinates.
3. The method of claim 1, wherein the energy minimization calculation comprises a molecular mechanics function, a structural bioinformatics function, an amino acid sidechain packing function, a protein radius of gyration function, or a combination thereof.
4. The method of claim 1, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
5. The method of claim 1, wherein said set of core amino acids comprises at least six amino acid residues.
6. The method of any one of claims 1 to 5, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
7. The method of any one of claims 1 to 5, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
8. The method of any one of claims 1 to 7, wherein the energy minimization calculation comprises a penalty function.
9. The method of any one of claims 1 to 8, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
10. The method of any one of claims 1 to 8, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
11. The method of claim 10, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
12. The method of any one of claims 1 to 11, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
13. The method of any one of claims 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
14. The method of any one of claims 10 to 12, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
15. The method of any one of claims 1 to 14, wherein the optimizing comprises an iterative or heuristic algorithm.
16. The method of any one of claims 1 to 14, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
17. The method of any one of claims 1 to 14, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
18. The method of any one of claims 1 to 17, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
19. The method of any one of claims 1 to 17, wherein the ligand is a detectable agent.
20. The method of any one of claims 1 to 17, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
21. The method of any one of claims 1 to 17, wherein the ligand is a catalyst.
22. The method of any one of claims 1 to 17, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
23. The method of any one of claims 1 to 17, wherein the ligand is a molecule that exists within a living system.
24. A system, comprising:
at least one data processor; and
at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising:
(a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates;
(b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and
(c) optimizing
said set of ligand binding amino acid residues;
said set of ligand binding amino acid residue atomic coordinates;
said set of core amino acid residues; and
said set of core amino acid residue atomic coordinates;
wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
25. The system of claim 24, wherein the energy minimization calculation comprises functions from molecular mechanics, functions from structural bioinformatics, amino acid sidechain packing functions, protein radius of gyration functions, or a combination thereof.
26. The system of claim 24, wherein the core amino acids are at least 75% inaccessible to a 1.8 Å spherical probe.
27. The system of claim 24, wherein said set of core amino acids comprise at least six amino acid residues.
28. The system of any one of claims 24 to 27, wherein the optimizing comprises fixing an atomic coordinate of at least one ligand binding amino acid residue atomic coordinate; fixing an atomic coordinate of at least one ligand atomic coordinate; prohibiting introduction of an additional amino acid residue into the set of ligand binding amino acid residues; or prohibiting the deletion of an amino acid residue from the set of ligand binding amino acid residues.
29. The system of any one of claims 24 to 28, wherein the optimizing comprises fixing at least one atomic coordinate of the ligand atomic coordinates.
30. The system of any one of claims 24 to 29, wherein the energy minimization calculation comprises a penalty function.
31. The system of any one of claims 24 to 30, wherein the optimizing does not comprise fixing at least one atomic coordinate of at least one core amino acid residue atomic coordinates.
32. The system of any one of claims 24 to 31, wherein the optimizing comprises introducing an additional ligand binding amino acid residue into the set of ligand binding amino acid residues, deleting a ligand binding amino acid residue from the set of ligand binding amino acid residues, or a geometric transformation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
33. The system of claim 32, wherein the geometric transformation comprises a translation or a rotation of at least one atomic coordinate of the ligand binding amino acid residue atomic coordinates.
34. The system of any one of claims 24 to 33, wherein the optimizing comprises a geometric transformation of at least one atomic coordinate of the core amino acid residue atomic coordinates.
35. The system of any one of claims 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 6 Å displacement of any atomic coordinate.
36. The system of any one of claims 24 to 34, wherein the geometric transformation of at least one atomic coordinate comprises no greater than a 3 Å displacement of any atomic coordinate.
37. The system of any one of claims 24 to 36, wherein the optimizing comprises an iterative or heuristic algorithm.
38. The system of any one of claims 24 to 36, wherein the optimizing comprises a simplex algorithm, memetic algorithm, differential evolution algorithm, evolutionary algorithm, genetic algorithm, tabu algorithm, particle swarm algorithm, or simulated annealing algorithm.
39. The system of any one of claims 24 to 36, wherein the optimizing comprises a Monte Carlo sampling algorithm, dead-end elimination algorithm, branch and bound algorithm, or a pruning algorithm.
40. A non-transitory computer-readable storage medium including program code, which when executed by at least one data processor, causes operations comprising:
(a) identifying a set of ligand binding amino acid residues within a protein for binding to a ligand, wherein each ligand binding amino acid residue within said protein is associated with a set of ligand binding amino acid residue atomic coordinates and each atom of said ligand is associated with a set of ligand atomic coordinates;
(b) identifying a set of core amino acid residues within said protein that do not bind to said ligand, each core amino acid residue within said protein is associated with a set of core amino acid residue atomic coordinates; and
(c) optimizing
said set of ligand binding amino acid residues;
said set of ligand binding amino acid residue atomic coordinates;
said set of core amino acid residues; and
said set of core amino acid residue atomic coordinates;
wherein the optimization is performed using at least an energy minimization calculation, and wherein the optimization is performed to energetically stabilize said protein.
41. A protein sequence obtainable based on the energy minimization calculation using the method of any of claims 1 to 23, the system of any of claims 24 to 39, or the non-transitory computer-readable medium of claim 40.
42. A protein, or conservatively modified variant thereof, having the sequence SEQ ID NO: 1.
43. The protein of claim 42, wherein the protein is 90% identical to SEQ ID NO: 1.
44. The protein of claim 42, bound to a ligand.
45. The protein of claim 42, wherein the ligand is bound to the protein via a dative covalent bond.
46. The protein of claim 44, wherein the ligand is a porphyrin, porphycene, rubyrin, rosarin, hexaphyrin, sapphyrin, chlorophyll, chlorin, phthalocyanine, porphyrazine, corrole, N-confused porphyrin, bacteriochlorophyll, pheophytin, texaphyrin, or related macrocyclic-based component, that is capable of binding a metal ion.
47. The protein of claim 44, wherein the ligand is a detectable agent.
48. The protein of claim 44, wherein the ligand is a therapeutic agent, biological agent, cytotoxic agent, magnetic resonance imaging (MRI) agent, positron emission tomography (PET) agent, radiological imaging agent, diagnostic agent, theranostic, or a photodynamic therapy (PDT) agent.
49. The protein of claim 44, wherein the ligand is a catalyst.
50. The protein of claim 44, wherein the catalyst catalyzes an abiological or bio-orthogonal reaction.
51. The protein of claim 44, wherein the ligand is a molecule that exists within a living system.
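The optimization recited in the claims above can be illustrated with a small sketch. The Python below is not the applicants' implementation; it is a toy example, assuming a simulated annealing search (one of the algorithms listed in claim 16), a ligand whose coordinates are held fixed (claim 7), a quadratic clash penalty standing in for the penalty function of claim 8, and a hard 3 Å cap on the displacement of any movable atom (claim 14). The names `toy_energy` and `anneal`, the harmonic target distance, and all numeric parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_energy(movable, fixed_ligand, target=2.0, clash=1.5, weight=10.0):
    """Hypothetical score, not a real force field.

    Harmonic pull of each movable atom toward a target distance from each
    ligand atom, plus a quadratic penalty when an atom is closer than `clash`.
    """
    total = 0.0
    for atom in movable:
        for lig in fixed_ligand:
            d = distance(atom, lig)
            total += (d - target) ** 2
            if d < clash:  # penalty term for steric overlap
                total += weight * (clash - d) ** 2
    return total


def anneal(start, fixed_ligand, max_disp=3.0, steps=2000,
           step_size=0.3, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over atom positions with a displacement cap."""
    rng = random.Random(seed)
    current = [tuple(a) for a in start]
    e_current = toy_energy(current, fixed_ligand)
    best, e_best = list(current), e_current
    for k in range(steps):
        # Geometric cooling schedule from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        i = rng.randrange(len(current))
        trial_atom = tuple(c + rng.uniform(-step_size, step_size)
                           for c in current[i])
        # Reject any move that takes the atom more than max_disp from its start.
        if distance(trial_atom, start[i]) > max_disp:
            continue
        trial = current[:i] + [trial_atom] + current[i + 1:]
        e_trial = toy_energy(trial, fixed_ligand)
        delta = e_trial - e_current
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


# Usage: one fixed ligand atom at the origin and two movable atoms.
ligand = [(0.0, 0.0, 0.0)]
start = [(5.0, 0.0, 0.0), (0.0, 6.0, 0.0)]
best, e_best = anneal(start, ligand)
print(f"start energy {toy_energy(start, ligand):.2f}, best energy {e_best:.2f}")
```

In this sketch the displacement limit is enforced by rejecting moves outright, while the clash term is a soft penalty added to the score; a real protein design calculation would use molecular mechanics or structural bioinformatics terms of the kind named in claim 25 and would also vary residue identities.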
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP18837693.3A EP3659145A4 (en) | 2017-07-27 | 2018-07-27 | PROTEINS DESIGNED FOR LIGAND BINDING |
| US16/633,809 US20200234789A1 (en) | 2017-07-27 | 2018-07-27 | Designed proteins for ligand binding |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201762537774P | 2017-07-27 | 2017-07-27 | |
| US62/537,774 | 2017-07-27 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019023644A1 (en) | 2019-01-31 |
Family
ID=65039919
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2018/044195 Ceased WO2019023644A1 (en) | 2017-07-27 | 2018-07-27 | Designed proteins for ligand binding |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20200234789A1 (en) |
| EP (1) | EP3659145A4 (en) |
| WO (1) | WO2019023644A1 (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12587274B2 (en) | 2023-03-28 | 2026-03-24 | Quantum Generative Materials Llc | Satellite optimization management system based on natural language input and artificial intelligence |
| US12368503B2 (en) | 2023-12-27 | 2025-07-22 | Quantum Generative Materials Llc | Intent-based satellite transmit management based on preexisting historical location and machine learning |
| US12603701B2 (en) | 2023-12-27 | 2026-04-14 | Quantum Generative Materials Llc | Distributed satellite constellation management and control system |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150205912A1 (en) * | 2012-08-03 | 2015-07-23 | Novartis Ag | Methods to identify amino acid residues involved in macromolecular binding and uses therefor |
| US20160063177A1 (en) * | 2013-03-15 | 2016-03-03 | Alexandre Zanghellini | Automated method of computational enzyme identification and design |
2018
- 2018-07-27 EP EP18837693.3A patent/EP3659145A4/en not_active Withdrawn
- 2018-07-27 US US16/633,809 patent/US20200234789A1/en not_active Abandoned
- 2018-07-27 WO PCT/US2018/044195 patent/WO2019023644A1/en not_active Ceased
Non-Patent Citations (1)
| Title |
|---|
| See also references of EP3659145A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP3659145A4 (en) | 2022-06-15 |
| US20200234789A1 (en) | 2020-07-23 |
| EP3659145A1 (en) | 2020-06-03 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Polizzi et al. | De novo design of a hyperstable non-natural protein–ligand complex with sub-Å accuracy | |
| Goodsell et al. | The AutoDock suite at 30 | |
| Li et al. | Structural dynamics of Zika virus NS2B-NS3 protease binding to dipeptide inhibitors | |
| Yagi et al. | Three-dimensional protein fold determination from backbone amide pseudocontact shifts generated by lanthanide tags at multiple sites | |
| He et al. | Yeast frataxin solution structure, iron binding, and ferrochelatase interaction | |
| Assfalg et al. | Structural model for an alkaline form of ferricytochrome c | |
| Kelso et al. | α-turn mimetics: short peptide α-helices composed of cyclic metallopentapeptide modules | |
| Häussinger et al. | DOTA-M8: an extremely rigid, high-affinity lanthanide chelating tag for PCS NMR spectroscopy | |
| Balatri et al. | Solution structure of Sco1: a thioredoxin-like protein involved in cytochrome c oxidase assembly | |
| Cerofolini et al. | Examination of matrix metalloproteinase-1 in solution: a preference for the pre-collagenolysis state | |
| O’Brien et al. | Calmodulin fishing with a structurally disordered bait triggers CyaA catalysis | |
| Ju et al. | One protein, two enzymes revisited: a structural entropy switch interconverts the two isoforms of acireductone dioxygenase | |
| Nithianantham et al. | Structural basis of tubulin recruitment and assembly by microtubule polymerases with tumor overexpressed gene (TOG) domain arrays | |
| Aliyan et al. | Photochemical identification of molecular binding sites on the surface of amyloid-β fibrillar aggregates | |
| Shahlaei et al. | Exploring binding properties of sertraline with human serum albumin: Combination of spectroscopic and molecular modeling studies | |
| Go et al. | Structure and dynamics of de novo proteins from a designed superfamily of 4‐helix bundles | |
| Feliks et al. | Structural determinants of improved fluorescence in a family of bacteriophytochrome-based infrared fluorescent proteins: insights from continuum electrostatic calculations and molecular dynamics simulations | |
| US20200234789A1 (en) | Designed proteins for ligand binding | |
| US20230416726A1 (en) | Scaffolding protein functional sites using deep learning | |
| Whitley et al. | A Combined NMR and SAXS analysis of the partially folded cataract-associated V75D γD-crystallin | |
| Viková et al. | Rational steering of insulin binding specificity by intra-chain chemical crosslinking | |
| Dai et al. | Protein-embedded metalloporphyrin arrays templated by circularly permuted tobacco mosaic virus coat proteins | |
| Rogne et al. | Atomic-level structure characterization of an ultrafast folding mini-protein denatured state | |
| Johansson et al. | A minimal transmembrane β-barrel platform protein studied by nuclear magnetic resonance | |
| Bahramzadeh et al. | Three-Dimensional Protein Structure Determination Using Pseudocontact Shifts of Backbone Amide Protons Generated by Double-Histidine Co2+-Binding Motifs at Multiple Sites |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2018837693; Country of ref document: EP; Effective date: 20200227 |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2018837693; Country of ref document: EP |











