WO2009114591A1

WO2009114591A1 - Method and apparatus for screening drugs for predictors of quantitatively measured events

Info

Publication number: WO2009114591A1
Application number: PCT/US2009/036752
Authority: WO
Inventors: June Sherie Almenoff; Dana E. Vanderwall
Original assignee: SmithKline Beecham Corp
Current assignee: SmithKline Beecham Corp
Priority date: 2008-03-11
Filing date: 2009-03-11
Publication date: 2009-09-17
Anticipated expiration: 2010-09-11

Abstract

Methods and systems for screening drugs for predictors of quantitatively measured events are provided. At least one predictor of an event is selected and data for a plurality of drugs are retrieved. The data for each of the drugs have an associated quantitative measure of the event. One or more relationships are statistically analyzed between the selected predictor and the event to determine one or more statistical associations among the relationships, using the quantitative measure of the event for each of the plurality of drugs. The statistical associations are determined without any a priori associations between the predictor and the event. The statistical associations are presented and include presentation of a measure of statistical significance.

Description

METHOD AND APPARATUS FOR SCREENING DRUGS FOR PREDICTORS OF QUANTITATIVELY MEASURED EVENTS

FIELD OF THE INVENTION

The present invention relates to the field of drug safety and, more particularly, to methods and apparatus for screening drugs for predictors of quantitatively measured events.

BACKGROUND OF THE INVENTION

Generally there are limits to the degree in which safety profiles of therapeutic drugs may be characterized before the drugs are approved for marketing. For example, pre-marketing studies of a new drug may be too short or include study populations that are too small for any adverse events to be detected. According to the U.S. Food and Drug Administration (FDA) Safety Information and Adverse Event Reporting Program (at the website fda.gov/Medwatch/report/030195.htm), an adverse event (or adverse experience) is described as: "[a]ny untoward medical occurrence in a patient or clinical investigation subject administered a pharmaceutical product and which does not necessarily have to have a causal relationship with this treatment. An adverse event (AE) can therefore be any unfavorable and unintended sign (including an abnormal laboratory finding, for example), symptom, or disease temporally associated with the use of a medicinal product, whether or not considered related to the medicinal product." As used herein, adverse events refer to any untoward medical occurrence in a patient administered a medicinal product associated with a treatment and which does not necessarily have to have a causal relationship with this treatment. An adverse event can, therefore, be described as any unfavorable and unintended sign (for example, an abnormal laboratory finding), symptom, or disease temporally

PB_H:\NRPORTBL\RP\PCBOCCELLA\262570_1.DOC SKB-233USP PATENT

associated with the use of a medicinal product, whether or not considered related to this medicinal product.

Medical safety and medical outcomes databases contain adverse event reports on individual patients, each of which describes one or more adverse events or outcomes for that patient. Adverse events reports, which may be generated from the use of marketed drugs or biological products or also from investigational products in clinical trials, generally contain the following four elements: 1) an identifiable patient, 2) an identifiable reporter, 3) a suspect drug or biological product and 4) an adverse event or fatal outcome. More information regarding adverse event reporting is provided at the website fda.gov/medwatch/report/guide2.htm.

Adverse events for a drug may be revealed only after the drug is approved or when it is used in conjunction with other therapies. Sometimes, events that are described as adverse for a particular patient group have the potential to provide benefit or efficacy in other clinical situations. For example, if a drug causes low blood pressure for one group of patients as an adverse event, this event could represent a potential efficacy to lower high blood pressure in individuals with hypertension (high blood pressure). As used herein, toxicity refers to adverse events of medicinal products that are detrimental to a subject's health, whereas efficacy refers to situations where an untoward medical occurrence may lead to clinical benefit. Because of the limits in characterizing safety profiles of drugs prior to marketing, pharmaceutical manufacturers and regulatory agencies typically collect adverse event reports on marketed drugs, and these reports are used to form databases of adverse events. The databases of adverse event reports represent one of the largest sources of information relating specifically to the safety profile of marketed drugs. Adverse event reports are submitted by health professionals and consumers for marketed drugs and by health professionals for investigational drugs in a clinical trial setting. Pharmaceutical companies are typically under legal obligation to provide received reports to various regulatory authorities.

Typically, non-analysis-based monitoring techniques by skilled individuals have been used to process these adverse event reports. For example, monitoring techniques may include a case-by-case examination of newly generated reports, a tabulation of counts of events for specific drugs, and a detailed review of all the data fields, such as any free-text medical narratives of reports associated with a
possible safety concern. A non-analysis approach tends to be dependent on the knowledgeability and attentiveness of the individual safety reviewers. In addition, it may be very difficult for an individual reviewer to determine whether ten cases of a specific drug-event combination are disproportionately frequent enough to merit further investigation. Accordingly, it is desirable to use quantitative approaches for analyzing adverse event reports, for example, based on reporting rates and/or reporting ratios of the adverse event reports.

SUMMARY OF THE INVENTION

The present invention relates to methods and systems for screening drugs for predictors of quantitatively measured events. At least one predictor of an event is selected. Data for a plurality of drugs is retrieved, where the data for each of the drugs has an associated quantitative measure of the event. One or more relationships are statistically analyzed between the selected predictor and the event to determine one or more statistical associations among the relationships, using the quantitative measure of the event for each of the plurality of drugs. The statistical associations are determined without any a priori associations between the predictor and the event. The determined statistical associations are presented and include presentation of a measure of statistical significance.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may be understood from the following detailed description when read in connection with the accompanying drawings. In the drawings, common numerical references are used to represent like features/elements. Included in the drawings are the following figures:

Fig. 1 is a functional block diagram illustrating an example system for screening drugs for predictors of quantitatively measured events, according to an embodiment of the present invention;

Fig. 2 is a flow chart illustrating an example method for screening drugs for predictors of quantitatively measured events, according to an embodiment of the present invention;

Fig. 3 is a flow chart illustrating an example method for screening drugs for predictors of quantitative measures of adverse events, according to an embodiment of the present invention;

Figs. 4A and 4B are partial screen layouts for selecting predictors of adverse events, according to an embodiment of the present invention;

Fig. 5 is a decision tree diagram illustrating presentation of drug screening results for statistical associations between predictors and an event, according to the method shown in Figs. 2 and 3;

Fig. 6A is a scatter plot of drugs associated with tardive dyskinesia as a function of a dopamine receptor, illustrating another presentation of drug screening results, according to the method shown in Fig. 3;

Fig. 6B is a portion of a heat map illustrating statistical associations between a plurality of predictors and a plurality of events, according to another example embodiment of the present invention;

Fig. 7 is a decision tree diagram of the predictor Human Ether a Go-Go (HERG) cellular ion channel for torsades de pointes syndrome, according to the method shown in Fig. 3;

Fig. 8 is a decision tree diagram of biochemical assay predictors of tardive dyskinesia, according to the method shown in Fig. 3;

Fig. 9 is a decision tree diagram of molecular fragment predictors of Stevens-Johnson syndrome, according to the method shown in Fig. 3; and

Fig. 10 is a decision tree diagram of predefined structural feature predictors of methemoglobinaemia.

DETAILED DESCRIPTION OF THE INVENTION

Aspects of the present invention relate to methods and systems for screening drugs for predictors of quantitatively measured events. At least one predictor of an event is selected. The predictor may include pharmacological assays, biochemical assays and/or compound properties. The event may include an adverse event or an efficacy associated with a drug or drug dose. Data for a plurality of drugs is
retrieved, where the data for each of the drugs has an associated quantitative measure of the event. One or more relationships are statistically analyzed between the selected predictor and the event to determine one or more statistical associations among the relationships, using the quantitative measure of the event for each of the plurality of drugs. The statistical associations are determined without any a priori associations between the predictor and the event. The determined statistical associations are presented and include presentation of a measure of statistical significance. For example, the event may include tardive dyskinesia and the predictor may include pharmacological activity at alpha-1 receptor subtypes. One or more statistical associations may be determined and presented between data for drugs including an association with the alpha-1 receptor subtypes and tardive dyskinesia.

The present invention uses quantitative measures of events from drugs that exhibit a range of expression of a particular event. The statistical modeling process determines statistical associations, without using a priori information of predictors and associated events, and generates a statistical model. The broad information on drugs that do not exhibit the particular event may contribute to the robustness and specificity of the determined statistical model. In addition, the statistical modeling process is typically multivariable in nature and may provide an improved result as compared to testing a single hypothesis about a single predictor and response variable.

Referring now to Fig. 1, an example system 100 for screening drugs for predictors of quantitatively measured events is shown. System 100 may include modeling system 102 and one or more local user devices 104 connected to modeling system 102. In an example embodiment, local user device 104 is connected to modeling system 102 by a global network, e.g. the Internet (not shown). It is understood that local user device 104 may be coupled to modeling system 102 by any suitable means, including any wired or wireless connection. Although one local user device 104 is shown, it is understood that modeling system 102 may be coupled to a number of local user devices 104. Local user device 104 may include user interface 120 for selecting parameters for modeling statistical associations between predictors and events and display 118 for displaying the parameters and results of the statistical modeling process. Although in an example embodiment local user device 104 is a computer, it
is understood that local user device 104 may be any suitable device capable of providing selection of parameters and displaying the results of the modeling. It is understood that display 118 may include any display capable of presenting information including textual and graphical information. It is also understood that user interface 120 may be any suitable interface for selecting parameters for modeling statistical associations between predictors and events.

The parameters for selection may include event selection, a response variable (i.e. a quantitative measure) associated with an event, one or more predictors (described further below), and a statistical model. Event selection may include selection among types of events such as an adverse event or efficacy. The parameters may also include selection of one or more databases having information from which quantitative measures of events can be determined, for example from among external database(s) 122. For example, events database 110 may include measures of efficacy for a number of drugs. Accordingly, the response variable may be associated with various quantitative clinical pharmacology measures associated with the efficacy of the drugs. As another example, events database 110 may include measures of adverse events for a number of drugs. Accordingly, the response variable may be associated with quantitative measures of the adverse events (described further below). Examples of external databases 122 include, but are not limited to, post marketing safety databases, clinical trials databases, administrative medical claims databases, electronic health records databases or the like.

According to an example embodiment, the quantitative measure may include, but is not limited to, the empirical Bayes geometric mean (EBGM), the log (base 10) transform of the EBGM (log EBGM), a relative reporting (RR) ratio, a reporting odds ratio (ROR), a proportional reporting rate ratio (PRR), a multi-item gamma Poisson shrinker (MGPS) methodology, a Bayesian confidence propagation neural network (BCPNN) methodology and/or odds ratios or relative risks based on placebo-control, case-control, or controlled cohort studies. In the examples described below with respect to Figs. 7-10, the MGPS methodology was used. EBGM scores have been calculated, for example, from adverse event data collected in the Food and Drug Administration (FDA) Adverse Event Reporting System (AERS) or World Health Organization's (WHO) Vigibase adverse event reporting systems. The EBGM scores are typically obtained for individual "preferred terms." The preferred terms generally refer
to a specific level in a hierarchy of the MedDRA medical vocabulary in which the adverse events are captured. For example, some of the preferred terms for a general category of QT prolongation include a prolonged QT interval and torsades de pointes. QT refers to the interval between the Q wave and the T wave in an electrocardiogram of the heart. The QT interval is typically a measure of a total duration of electrical activity of ventricular depolarization and repolarization. Although EBGM values are described with respect to adverse events, it is understood that any suitable quantitative measure of adverse events may be used.
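As a rough illustration of two of the simpler disproportionality measures named above, the following sketch computes a proportional reporting ratio and a reporting odds ratio from a 2x2 table of report counts. The function names and the counts are hypothetical and are not taken from AERS or Vigibase. EBGM is not shown, because it additionally requires fitting the MGPS prior.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a / (a + b)] / [c / (c + d)].

    a: reports naming the drug and the event
    b: reports naming the drug without the event
    c: reports naming the event for all other drugs
    d: reports for all other drugs without the event
    """
    return (a / (a + b)) / (c / (c + d))


def reporting_odds_ratio(a, b, c, d):
    """ROR = (a / b) / (c / d) = (a * d) / (b * c)."""
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
a, b, c, d = 20, 980, 100, 98900
prr = proportional_reporting_ratio(a, b, c, d)
ror = reporting_odds_ratio(a, b, c, d)
```

A value well above 1 for either measure suggests the drug-event pair is reported more often than would be expected from the rest of the database.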

According to an example embodiment, the predictors may include pharmacological or biochemical assay results represented as continuous or categorical values and/or compound properties. The compound properties may include biochemical or pharmacological properties represented as categorical values, chemical structure- or substructure-based descriptors and other modeled physical or biochemical properties of the compounds. Biochemical assay results refer to the measurement of interactions of molecules (such as drugs) with protein or other molecular targets. The assay results data typically include assays that measure binding, inhibition or activation responses (referred to as modes) as a result of the interaction, depending on the protein target and type of assay. Some protein targets may be processed in multiple assays, and those assays may measure different modes.

Assay results are typically quantitatively determined in a standardized form as a result of a dose-response protocol, and a concentration at which 50% of a maximum response is observed. A negative log value of the concentration is typically defined as a pIC₅₀ for antagonist or binding assays and a pEC₅₀ is defined for activation or agonist assays. The negative log value of the concentration at which half maximal activity is observed may be more generally defined as pXC₅₀ values, where X = I (for antagonist or binding assays) or E (for activation or agonist assays). Other quantitative measures of activity or effectiveness in a biological assay could be used, such as percentage (%) inhibition or activation at a single test concentration, or other values derived from dose-response protocols, and are referred to herein as activity values. In an example embodiment, the activity values of the biochemical assays are used to provide a quantitative measurement enabling a statistically significant distinction in the response variables of a subset of all entities, or drugs, in an analysis,
as described further with respect to Fig. 7. Alternatively, the activity value of an assay may be used as a binary measure of statistical significance.
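The pXC₅₀ definition above can be sketched as a one-line conversion. The sketch assumes the concentration is expressed in mol/L and the logarithm is base 10, which is the usual convention but is not stated in the text; the function name is hypothetical.

```python
import math


def pxc50(concentration_molar):
    """Negative base-10 log of the half-maximal concentration.

    Assumes the concentration is given in mol/L (an assumption, not
    stated in the source text).
    """
    if concentration_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(concentration_molar)


# A half-maximal concentration of 1 micromolar gives a value near 6,
# and 10 nanomolar gives a value near 8.
micromolar_value = pxc50(1e-6)
nanomolar_value = pxc50(1e-8)
```

Higher values therefore correspond to more potent compounds, since less compound is needed to reach half of the maximum response.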

Biochemical or pharmacological properties of compounds, i.e. drugs, may also be represented by categorical values. The categorical values may be generated by assigning ranges of activity values (such as ranges of pXC₅₀ values) and labeling the compounds with the appropriate category. For example, pIC₅₀ values from 4 to 5.8 may be labeled as "weak," pIC₅₀ values greater than 5.8 and less than or equal to 6.8 may be labeled as "medium," and pIC₅₀ values greater than 6.8 may be labeled as "strong." Binary values may also be used, for example, to describe whether a compound is a substrate for a particular enzyme or whether a specific property is included in a compound. The biochemical or pharmacological properties may be determined, for example, from known biological data.
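The example ranges above can be written as a small labeling function. This is only a sketch: the function name is hypothetical, and returning `None` for values below 4 is an assumption, since the text does not say how such values are labeled.

```python
def categorize_pic50(pic50):
    """Map a pIC50 value to the example categories given in the text."""
    if pic50 < 4:
        return None  # assumption: the text defines no label below 4
    if pic50 <= 5.8:
        return "weak"
    if pic50 <= 6.8:
        return "medium"
    return "strong"


labels = [categorize_pic50(value) for value in (3.0, 4.0, 5.8, 6.8, 7.2)]
```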

Chemical structure- or substructure-based predictors may include abstractions of chemical structures of drug molecules. The chemical structure/substructure-based predictors may include numerical, categorical or binary values that may be statistically associated with a response variable. For example, the structure/substructure predictors may include molecular fragments, atom pairs and topological torsions, Daylight fingerprint algorithms provided by Daylight Chemical Information Systems, Inc. (described at the website daylight.com/dayhtml/doc/theory/theory.finger.html), pharmacophore bit strings, keys that capture an electronic character of each atom and an associated environment in a molecule, as well as user generated fragments to define chemical substructures.

The other modeled physical or biochemical properties may include models of physico-chemical and physio-chemical or biochemical properties that are built from chemical structure predictors and subsequently used as predictors.

Examples of physical or biochemical properties include solubility, blood brain barrier penetration or hydrophobicity.

Modeling system 102 may include controller 106, pre-processing filter 108, events database 110, statistical modeling processor 112, visualization processor 114 and modeling results database 116. It is contemplated that modeling system 102 may include any computer having a processor (e.g. a microprocessor or a dual-core microprocessor) for determining statistical associations between predictors and events and for identifying drugs associated with the predictors. Modeling system 102 may
include any software suitable for performing at least part of the functions of determining statistical associations between predictors and events and to identify drugs associated with the predictors.

Modeling system 102 may be coupled to one or more external databases 122 for retrieving event data, such as adverse events or clinical pharmacology events for a number of drugs, and may store the event data from external database(s) 122 in events database 110. In an example embodiment, external database(s) 122 may include any of a number of databases of spontaneous adverse event reports. For example, each major pharmaceutical company may have a proprietary database of reports focused on cases in which one of the company's products was considered to be "suspect" (which may number about 500,000 reports for the larger pharmaceutical companies). In addition, there are several combined databases maintained by government regulatory agencies and health authorities that are available in varying degrees for public use, such as AERS and the WHO Vigibase. Some of the databases maintained by the regulatory agencies may contain as many as about 3,000,000 reports. Typically, the databases contain adverse event reports that may include a demographic record (for example, age, gender, a date of the event, a seriousness of the event), one or more drug records (for example, a generic or trade name, a suspect or concomitant designation, a route of administration, a dosage), and one or more records documenting a sign, symptom or diagnosis. Individual databases may also contain narratives (from which event terms were coded), outcomes (e.g., hospitalization, death), and report source (consumer or health professional, domestic or foreign).

Controller 106 may parse one or more external databases 122 to retrieve and store data for the events in events database 110. For example, events database 110 may include quantitative information about adverse event reporting, chemical structure information, clinical pharmacology information and physicochemical properties, chemical descriptors (both measured and derived) for a number of drugs. External databases 122 typically include information on drugs that have been approved for human use and are considered to be relatively safe. Accordingly, events database 110 may use quantitative information on marketed drugs.

As described above, there are a variety of public and company specific databases which may share a common general organization, but may differ in the details of the event report. For example, field and/or table descriptions may vary, the presence or absence of specific attributes may vary, and the use of specific medical coding dictionaries may vary. In addition, drug names may be collected as free text, with variations as to how trade, generic and/or mixed names may be entered into a record. Furthermore, external databases 122 may include multiple versions of a single case. For example, adverse events databases may contain multiple reports because of regulatory requirements to submit a series of reports as additional information about a case becomes available. In addition, there may be other sources of report duplication, such as multiple reports of the same medical event by different manufacturers or reports that may arrive through different pathways (such as from consumers or from manufacturers).

It may be desirable to reduce the presence of multiple and/or duplicate reports prior to processing by statistical modeling processor 112, because the presence of multiple reports (or duplicate reports) may be a source of false positives. Accordingly, controller 106 may control the formatting of events data from external database(s) 122 such that events data from a number of different external databases 122 may be entered into events database 110 in a predetermined format. In addition, controller 106 may combine multiple versions of reports into one best representative version to present a uniform view of event data from various data sources. Controller 106 may receive parameters, such as predictors and response variables, from local user device 104 and control pre-processing filter 108 to filter the associated adverse event data in events database 110. For example, pre-processing filter 108 may perform a data aggregation, for example on protein targets that have been processed in multiple assays that measure the same mode of interaction but with multiple assay technologies. Pre-processing filter 108 may also exclude event data that includes fewer than a predetermined number of reports.
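The two filtering steps described above, collapsing multiple versions of a case and excluding sparsely reported events, can be sketched as follows. The record layout (`case_id`, `version`, `drug`, `event`) and the function names are hypothetical, and treating the highest version number as the "best representative version" is an assumption made only for this illustration.

```python
from collections import Counter


def latest_versions(reports):
    """Keep one report per case: the one with the highest version number."""
    best = {}
    for report in reports:
        current = best.get(report["case_id"])
        if current is None or report["version"] > current["version"]:
            best[report["case_id"]] = report
    return list(best.values())


def drug_event_counts(reports, min_reports):
    """Count reports per (drug, event) pair and drop pairs below the minimum."""
    counts = Counter((r["drug"], r["event"]) for r in reports)
    return {pair: n for pair, n in counts.items() if n >= min_reports}


# Hypothetical reports: case 1 was submitted twice as a follow-up.
reports = [
    {"case_id": 1, "version": 1, "drug": "drug_a", "event": "rash"},
    {"case_id": 1, "version": 2, "drug": "drug_a", "event": "rash"},
    {"case_id": 2, "version": 1, "drug": "drug_a", "event": "rash"},
    {"case_id": 3, "version": 1, "drug": "drug_b", "event": "nausea"},
]
deduplicated = latest_versions(reports)
counts = drug_event_counts(deduplicated, min_reports=2)
```

Without the first step, the follow-up report for case 1 would inflate the count for the drug_a and rash pair from two to three.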

Controller 106 may provide the pre-processed event data from pre-processing filter 108 to statistical modeling processor 112. Controller 106 may also control visualization processor 114 to modify statistically modeled data determined by statistical modeling processor 112 for display and interpretation. In addition, controller 106 may control display 118 to display the statistically modeled data from statistical modeling processor 112 or the modified modeled data from visualization processor 114.

Statistical modeling processor 112 may determine statistical associations among the selected predictors and the quantitative measures of events. In particular, statistical modeling processor 112 may statistically analyze one or more relationships between the selected predictor(s) and the event to determine the statistical associations among the analyzed relationships, by using the quantitative measure of the event for each of the drugs. In an example embodiment, statistical modeling processor 112 may determine one or more statistical associations between EBGM values for a preferred term (i.e., of an adverse event) and the selected predictors. Statistical modeling processor 112 may use any suitable statistical method for determining statistical associations. In general, statistical modeling processor 112 desirably identifies stronger, more probable associations between quantitative measures of events and a predictor. In an example embodiment, recursive partitioning or random forest modeling is used. Other statistical approaches may include: measures of correlation; other multiple tree methods that utilize boosting, bagging, or random splits; multiple linear regression; partial least squares; Fisher's exact test; the chi-squared test; regression analysis; neural networks; and other artificial intelligence methods. Other data mining approaches may include a correlation analysis approach, a principal component analysis (PCA), a multi-dimensional scaling (MDS), as well as graph based analyses. One example of an alternative data mining approach is correlation analysis. Using this technique, correlation is used to highlight relationships between the target activities of drug molecules, and then visualization is used to look for patterns in those relationships. According to an embodiment of the present invention, a Pearson's correlation is performed, with no a priori associations between the predictors and the events. A "heat map" (for example, shown in Fig. 6B) is commonly used in genomic and biomarker analysis, and enables a broad view of the relationships of many entities in terms of many columns of variables. A heat map relates to a table of data arranged in a grid of cells, where the cells are colored based on the value in that cell, transforming a table (a.k.a. grid or matrix) of numbers to a table of colors. In a simple case of one variable, a column may be sorted to identify objects with similar values, but as the number of variables grows, simple sorting may no longer be able to group entities with similar profiles across all of the variables. A typical data mining approach is to organize the table, so as to group entities with
similar profiles, and visualize patterns in the data set as a whole. Accordingly, a heat map allows for identifying patterns and groups visually.
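The correlation analysis described above can be sketched with a plain Pearson correlation computed for every predictor-event pair. The resulting table of coefficients is the kind of grid a heat map would color. All names and values below are hypothetical; each list holds one value per drug, in the same drug order.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(predictors, events):
    """Return {predictor: {event: r}} for every predictor-event pair."""
    return {
        p_name: {e_name: pearson_r(p_vals, e_vals)
                 for e_name, e_vals in events.items()}
        for p_name, p_vals in predictors.items()
    }


# Hypothetical activity values and event scores for four drugs.
predictors = {"assay_1": [5.0, 6.0, 7.0, 8.0], "assay_2": [8.0, 7.0, 6.0, 5.0]}
events = {"event_x": [0.1, 0.2, 0.3, 0.4]}
matrix = correlation_matrix(predictors, events)
```

No prior association is supplied to the calculation: every pair is scored in the same way, and patterns are looked for afterwards.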

Clustering methods are typically used to group the objects by their similarity to one another. In an example embodiment of the present invention, hierarchical (or agglomerative) clustering methods are used. One clustering approach includes clustering the table, where each row is one drug, and includes a column for each event being analyzed. Clustering may organize the rows so that drugs with similar patterns of events are grouped together. Optionally, the clustering can also be used to organize the columns, so that events that occur in similar drugs may be grouped together.

In an all-vs.-all analysis (i.e. a plurality of predictors and a plurality of events), clustering may be performed on a determined correlation matrix. By calculating a similarity between entities, as well as between columns, the rows and columns are rearranged to group similar rows and columns with one another. These operations of generating a correlation matrix and hierarchical clustering may be used to identify groups of events that are correlated with one or more groups of predictors.
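The row-ordering step can be sketched with a small agglomerative clustering routine. The sketch assumes Euclidean distance and average linkage, neither of which is specified in the text, and the function names and profiles are hypothetical. The same routine could be applied to the rows of a correlation matrix, and then to its columns, to rearrange a heat map.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_order(rows):
    """Agglomerative clustering; returns row names with similar rows adjacent.

    rows maps a row name to its profile (a list of numbers).
    """
    clusters = [[name] for name in rows]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(euclidean(rows[a], rows[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical event profiles for four drugs.
profiles = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.0, 1.0],
    "drug_c": [0.9, 0.1],
    "drug_d": [0.2, 0.8],
}
order = cluster_order(profiles)
```

In the resulting order, drug_a sits next to drug_c and drug_b next to drug_d, so drugs with similar event patterns end up in neighboring rows.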

The events and predictors may have similar patterns and may be grouped together for a number of different reasons. For example: (a) the grouped events might actually be describing what a medical expert would consider to be the same phenomenon, and so reflect some redundancy in the medical dictionary used for adverse event coding; (b) the grouped events may have somewhat distinct medical implications, but may be driven from the same underlying biology or toxicity; (c) the grouped events may reflect the most notable adverse events of a particular class of drugs, but may or may not share a common mechanism; (d) the grouped predictors may reflect that the drugs that bind to one target assay generally also bind to another target assay; and (e) the predictors may be grouped together because they may be part of a more general biological network or system, and may reflect several different specific mechanisms of causing an adverse biological response or toxicity. Accordingly, data mining by an all-vs.-all analysis may lead to hypotheses that generate additional analysis. The additional analysis may include either or both model building using recursive partitioning and random forests on one or more preferred terms, as well as additional literature research to provide more background and context to the hypothetical relationships identified.

Although hierarchical (or agglomerative) clustering methods are described, it is understood that other clustering methods may also be used, for example, partitional methods and self-organizing maps. In addition, although heat maps are described, it is understood that other visualization approaches may be used. For example, profile plots (i.e. parallel coordinate plots) may also be used to visualize the clustered profiles. Principal component analysis (PCA) or multi-dimensional scaling (MDS) may also be used to transform the multivariate data space to a simpler set of dimensions, which may enable visualization of the relationships between entities to find patterns and groups. A related analysis and visualization approach is graph-based analysis.

In this approach, the similarities between all entities are calculated based on one or more columns of variables, and where the similarity is above a particular threshold a link between the entities is created. To visualize this set of objects and links, a network is drawn where the objects are connected by their links to one another, and an algorithm is used to organize the graph to facilitate the identification of groups with many common links, or conversely that are distinct with few if any links. Different types of graph analyses include: (a) calculating a similarity between compounds based on chemical structure (using atom pairs & topological torsions, or Daylight fingerprints), and then coloring the objects (compounds) based on the value of specific events (for example using hepatotoxicity, prolonged QT, Stevens-Johnson Syndrome); (b) creating a graph based on links between any pair of predictors, any pair of events, and any event-predictor pair above a threshold of either an r-value (i.e. a correlation coefficient) or a p-value (i.e. the probability of an observed result happening by chance rather than due to an actual relationship) from a Pearson's correlation calculation; and (c) creating a network using only the links between predictors and events.
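The link-and-group idea described above can be sketched as a thresholded graph whose connected components are the groups. The function name, the node names and the similarity values are hypothetical, and using connected components as the grouping rule is a simplification of the layout algorithms the text alludes to.

```python
def linked_groups(nodes, similarity, threshold):
    """Link nodes whose similarity exceeds the threshold; return the groups.

    similarity maps a (node, node) pair to a similarity value, for example
    the absolute r-value from a Pearson's correlation.
    """
    adjacency = {node: set() for node in nodes}
    for (a, b), value in similarity.items():
        if value > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    groups, seen = [], set()
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk collects everything reachable from this node.
        stack, group = [node], []
        seen.add(node)
        while stack:
            current = stack.pop()
            group.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups


# Hypothetical predictor-event similarities.
nodes = ["assay_1", "assay_2", "event_x", "event_y"]
similarity = {
    ("assay_1", "event_x"): 0.90,
    ("assay_2", "event_x"): 0.20,
    ("assay_2", "event_y"): 0.75,
}
groups = linked_groups(nodes, similarity, threshold=0.7)
```

Because only predictor-event pairs are supplied here, the example corresponds to case (c) above, a network built only from links between predictors and events.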

It is understood that a definition of statistical significance may depend on the type of statistical modeling method that is used. In general, statistical modeling processor 112 provides statistical processing that may determine chemical, biological, or physical properties of drugs which may be related to quantitative measures of events of the drugs. The quantitative data may be selected such that it has some statistical significance in the explanation or prediction of the event.

PB_H:\NRPORTBL\RP\PCBOCCELLA\262S70_1.DOC SKB-233USP PATENT

Recursive partitioning and random forest methods typically provide a multivariate analysis of predictors and the response variable and provide a classification and prediction that may identify relationships among predictive features. See, for example, an article by R.A. Berk, entitled "An Introduction to Ensemble Methods for Data Analysis," Department of Statistics, UCLA, July 25, 2004 and the website stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

Recursive partitioning typically determines a decision tree structure in a feature space. Each node of the tree may define a subset of compounds that share similar features and have relatively homogeneous response values. Under recursive partitioning, a search may be performed to determine a best partition (i.e. a split) among the compounds. For example, a single feature may be determined that is used to define two groups of compounds having a substantially different response distribution. Accordingly, a root node (i.e. a parent node) may be split into two initial "leaf" nodes. A search on the current leaf nodes may also be performed to determine a further best split by a single feature to provide a most different response distribution. Further splits may continue to be determined until all drug molecules within every group have a substantially minimum response-by-feature association. The resulting groups of drug molecules are typically displayed as a tree structure in the feature space, for example, as shown in Fig. 5. Random forest models are typically based on recursive partitioning but build a model that may include a collection of many trees, for example, hundreds of trees. Random forests are typically better at prediction as compared to recursive partitioning but may be more difficult to interpret or visualize. For example, to generate a random forest of 500 trees, 500 subsets of available features may be randomly selected. A recursive partitioning tree may be constructed based on each feature subset. A random subset of compounds may also be reserved to evaluate the performance of the model. A prediction may be generated by processing each compound through all 500 trees and then forming an averaged result for all of the trees.
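The search for a best split, which recursive partitioning repeats at each node, can be sketched as follows. This is an illustrative sketch only: it scores candidate splits by the total within-group sum of squared deviations, which is an assumed criterion swapped in for simplicity, and the feature names and values are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, responses, min_size=1):
    """Search every feature and threshold for the split that leaves the two
    groups most homogeneous in response (lowest total within-group sum of
    squares). Returns (feature_name, threshold), or None if no split helps."""
    best, best_score = None, sum_sq(responses)
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, responses) if row[feature] < threshold]
            right = [y for row, y in zip(rows, responses) if row[feature] >= threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sum_sq(left) + sum_sq(right)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

# Hypothetical drugs: one assay potency and one physical property per drug.
rows = [
    {"target_1_pIC50": 4.8, "logP": 1.2},
    {"target_1_pIC50": 5.1, "logP": 3.4},
    {"target_1_pIC50": 5.5, "logP": 2.0},
    {"target_1_pIC50": 6.9, "logP": 2.8},
    {"target_1_pIC50": 7.4, "logP": 1.5},
]
ebgm = [1.1, 0.9, 1.3, 9.5, 11.0]  # hypothetical response values

split = best_split(rows, ebgm)
print(split)  # ('target_1_pIC50', 6.9)
```

Applying `best_split` again to the drugs on each side of the chosen threshold would grow the tree one level at a time.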
In general, models determined using a random forest method are typically more predictive, because these models are developed to work around, or to be more robust to, artifacts or weaknesses of a single tree (as in recursive partitioning). In contrast, single trees may not be as predictive as random forests, but they may provide a useful visualization in a decision tree to interpret the data. Splits in single trees are typically driven by a most significant p-value, such that another predictor which is similar may not be sampled. Because one predictor may be a surrogate for another predictor and because data sets may not be as balanced and diverse as desired, the determined most significant descriptor (by recursive partitioning) may not necessarily be the best predictor. In contrast, random forests use hundreds of trees and are typically generated by omitting a certain fraction of data and predictors for different trees in order to emphasize other predictors and data in the statistical modeling, in order to reduce the effects of surrogate predictors. The subsequent prediction of a substantially toxic or of a substantially efficacious compound by random forest is typically an average prediction of hundreds of trees.
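The idea of averaging many trees, each built from a random portion of the predictors and data, can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each "tree" has a single split, the data for each tree are drawn by sampling drugs with replacement, and all names and values are hypothetical. A practical random forest would grow deeper trees and would typically come from an established statistics package.

```python
import random

def fit_stump(rows, ys, features):
    """Fit a one-split tree: choose the (feature, threshold) with the lowest
    within-group squared error and keep the mean response on each side."""
    def sse(vals):
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            lo = [y for r, y in zip(rows, ys) if r[f] < t]
            hi = [y for r, y in zip(rows, ys) if r[f] >= t]
            if lo and hi:
                score = sse(lo) + sse(hi)
                if best is None or score < best[0]:
                    best = (score, f, t, sum(lo) / len(lo), sum(hi) / len(hi))
    if best is None:  # no usable split: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (None, None, mean, mean)
    return best[1:]

def fit_forest(rows, ys, n_trees=500, n_features=1, seed=0):
    """Build n_trees stumps, each from a random feature subset and a random
    sample (with replacement) of the drugs."""
    rng = random.Random(seed)
    names = list(rows[0])
    forest = []
    for _ in range(n_trees):
        feats = rng.sample(names, n_features)
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Average the predictions of all trees for one drug."""
    preds = [lo if f is None or row[f] < t else hi for f, t, lo, hi in forest]
    return sum(preds) / len(preds)

# Hypothetical training data (illustration only).
rows = [
    {"target_1_pIC50": 4.8, "logP": 1.2},
    {"target_1_pIC50": 5.1, "logP": 3.4},
    {"target_1_pIC50": 5.5, "logP": 2.0},
    {"target_1_pIC50": 6.9, "logP": 2.8},
    {"target_1_pIC50": 7.4, "logP": 1.5},
]
ebgm = [0.9, 1.1, 1.3, 9.5, 11.0]

forest = fit_forest(rows, ebgm)
potent = predict(forest, {"target_1_pIC50": 8.0, "logP": 2.0})
weak = predict(forest, {"target_1_pIC50": 4.0, "logP": 2.0})
print(potent > weak)
```

Because roughly half of the trees never see the potency column, a predictor that would dominate a single tree cannot mask the others, which is the behavior described above for surrogate predictors.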

Alternatively, if there are specific hypotheses about a single response and predictor variable, the hypothesis may be explored using simple regression methods or by determining ratios of the response for drugs with and without a particular descriptor. Additionally, simple graphical approaches may be used that plot the response variable as a function of predictor variables for all of the drugs in the dataset. For example, Fig. 6A shows a graph of EBGM as a function of dopamine affinity.
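The two simple approaches mentioned above, a ratio of responses for drugs with and without a descriptor and a simple regression of the response on one predictor, can be sketched as follows. The function names and all values are hypothetical and serve only to illustrate the calculations.

```python
def response_ratio(has_descriptor, responses):
    """Ratio of the mean response for drugs with a descriptor to the mean
    response for drugs without it."""
    with_d = [y for flag, y in zip(has_descriptor, responses) if flag]
    without_d = [y for flag, y in zip(has_descriptor, responses) if not flag]
    return (sum(with_d) / len(with_d)) / (sum(without_d) / len(without_d))

def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Hypothetical data: descriptor flags and response scores for five drugs.
flags = [True, True, False, False, False]
scores = [8.0, 12.0, 1.0, 2.0, 3.0]
ratio = response_ratio(flags, scores)  # mean 10.0 with vs. mean 2.0 without

# Hypothetical data: response as a function of one predictor.
slope, intercept = linear_fit([5.0, 6.0, 7.0, 8.0], [1.0, 3.0, 5.0, 7.0])
print(ratio, slope, intercept)
```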

As described above, statistical modeling of event reports may determine the occurrence of statistical associations among predictors and the event. To determine the occurrence of statistical associations, no a priori assumptions are used in the statistical modeling with respect to relationships between the predictors and the events. In addition, drugs that both are and are not associated with events, and broad information on molecules that do and do not exhibit a particular event may be included in the statistical modeling, and may help to contribute to the robustness and specificity of the models.

Referring back to Fig. 1, modeling system 102 may include visualization filter 114 that receives the modeled data from statistical modeling processor 112 and formats the modeled data in a manner suitable for display, such as in a decision tree diagram (Fig. 5), as a scatter plot (Fig. 6A) or as a heat map (Fig. 6B). For example, the modeled data may be analyzed to determine whether the identified statistical associations are biologically or chemically relevant and interpretable with respect to the model generated by statistical modeling processor 112. Accordingly, the modeled data may be processed to visually emphasize statistical associations that are statistically and scientifically relevant to the model. By using this approach, visualization processor 114 may optimize both the statistical and biological relevance of the model. Modeling system 102 may include modeling results database 116 for storing the resulting model, the resulting statistical associations determined by statistical modeling processor 112 and/or optimized statistical associations from visualization processor 114. Controller 106 may verify the generated statistical model stored in modeling results database 116. For example, the model may be processed for the same predictors and results variables using another external database 122, or previously determined modeling results may be compared with modeling results of other related medical events (for example, tardive dyskinesia and extrapyramidal symptoms).

A suitable display 118, user interface 120, external database(s) 122, events database 110, modeling results database 116, controller 106, pre-processing filter 108, statistical modeling processor 112, and visualization processor 114 will be understood from the description herein.

Fig. 2 is a flowchart illustrating an example method for screening drugs using a quantitative measure of events, according to an embodiment of the present invention. For example, drugs may be screened for efficacy using quantitative measures of events such as from a clinical pharmacology database. At step 200, predictors are selected, for example, by user interface 120 (Fig. 1). At step 202, an event is selected. For example, if events are associated with efficacy, various treatments of a particular disease may be selected. Although steps 200 and 202 are illustrated as being performed sequentially, it is contemplated that steps 200 and 202 may be performed in a reverse order or concurrently. At step 204, a statistical modeling method may be selected, such as recursive partitioning.

At optional step 206, pre-processing of the data may be applied. For example, events stored in events database 110 may be processed by pre-processing filter 108 (Fig. 1), which conditions the data to disregard sparsely populated events.

At step 208, statistical modeling is applied to determine one or more statistical associations between the predictors selected at step 200 and the event selected at step 202, for example, by statistical modeling processor 112. As described above, one or more relationships are statistically analyzed among the selected predictor(s) and the event to determine the statistical associations among the analyzed relationships, by using the quantitative measure of the event for each of the drugs. At step 210, drugs that are associated with the statistically significant predictors are identified. Accordingly, either the most toxic or the most efficacious compounds associated with an event may be identified based on the statistical associations. Although steps 208 and 210 are illustrated as being performed sequentially, it is understood that steps 208 and 210 may be performed concurrently. At step 212, statistical associations are displayed, including a measure of significance, for example, by display 118 (Fig. 1). For example, the measure of significance may be a quantitative measure, such as an activity, or a binary measure of significance, such as the inclusion or exclusion of specific chemical structures/substructures. At optional step 214, drugs identified as being associated with an event may be inspected for biological relevance with respect to the event. In this manner, identified drugs that do not exhibit biological relevance may be excluded from the screened drugs.

Fig. 3 is a flowchart illustrating an example method for screening drugs based on adverse events. The following figures illustrate diagrams useful for illustrating the example method shown in Fig. 3: Figs. 4A and 4B illustrate respective partial screen layouts 402A, 402B of parameter input screen layout 400 for selecting predictors; Fig. 5 is a decision tree diagram 500 illustrating presentation of statistical associations between predictors and adverse events; Fig. 6A is a scatter plot of drugs associated with a particular adverse event, tardive dyskinesia, as a function of a quantitative activity of a dopamine receptor, illustrating another presentation of statistical associations between a predictor (the dopamine receptor) and an adverse event (tardive dyskinesia); and Fig. 6B is a portion of a heat map illustrating statistical associations between a plurality of predictors and a plurality of events, according to another example embodiment of the present invention.

At step 300, predictors are selected, for example, by user interface 120 (Fig. 1). As described above, predictors may include biochemical assay results, biochemical or pharmacological properties, chemical structure or substructure-based descriptors and other modeled physical or biochemical properties of compounds.

Referring to Figs. 4A and 4B, an example parameter input screen layout 400, including respective partial input screen layouts 402A, 402B, is shown. Partial input screen layout 402A includes a feature quantity parameter 404 for selecting a feature in a percentage of compounds, molecular fragments section 406 for selecting molecular fragments such as molecular substructures, and chemical descriptors section 408. Parameter selection 410 represents one or more calculated or modeled physical or biochemical properties. Parameter selection 412 (Fig. 4B) represents a source of target assay data. Parameter selection 414 represents a source of data for P450 enzyme inhibition and enzyme substrates. Parameter selection 416 represents a chemical descriptor that includes a combination of structural and electronic character (i.e. an electrotopological state). The electrotopological state is associated with contributions of individual electronegativities of atoms in a molecule or fragment, the valence (or bonding) of those atoms and the electronegativities of the other atoms in the molecule or fragment. The chemical descriptor is described as a continuous numeric value.

Referring back to Fig. 3, at step 302, an adverse event is selected, for example, tardive dyskinesia, a neurological disorder. Although steps 300 and 302 are illustrated as being performed sequentially, it is contemplated that steps 300 and 302 may be performed in a reverse order or concurrently. At step 304, a statistical modeling method may be selected, such as recursive partitioning.

At optional step 306, data aggregation may be performed. For example, a common biological or pharmacological target assay may have been processed in multiple assays that measure a same mode of interaction (i.e. inhibition) but with different assay technologies. The term "target" refers to proteins, or more specifically, a gene product, or complex of gene products which are thought to have potential therapeutic benefit if their activity, function, or state can be modulated (e.g., inhibited, activated, increased or decreased) by a candidate drug molecule. Not all proteins produced in the human proteome (the protein equivalent of the genome) are thought to have potential for a therapeutically beneficial modulation. Common target assays are biochemical assays (in vitro) that measure the interactions of small molecules (like drugs) with protein targets, including receptors, enzymes, etc. These data include assays that measure binding, inhibition, or activation responses (referred to as 'modes') as a result of the interaction, depending on the target and the type of assay. The assays can use either isolated and highly purified forms of the proteins, or can use engineered cellular systems which enable the protein to be expressed and to be studied in a cellular context. This cellular context often allows for the presence of the other proteins or cellular components that the biological function of the target naturally uses, while still utilizing an engineered method of measuring a response resulting from the interaction of a small molecule specifically with the target.

The existence of multiple technologies generally results either from the evolution of the platforms over time or from operational and cost factors that are considered in the development and production support of these assays in day-to-day use in drug discovery. There is generally a partial overlap in compounds with data in multiple assays. Columns containing data measuring the same mode for the same target may be combined, generally, by determining a maximum value where multiple technologies exist. The result is a single column for a target assay and a mode that is more densely populated than the original individual columns. Although a maximum value is described, it is contemplated that an average value over all of the columns, or some other statistically derived representative value, may be used to perform the data aggregation.
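The column aggregation described above can be sketched as follows. This is a minimal illustration in which missing measurements are represented by `None`; the function name, the `how` parameter, and the values are hypothetical.

```python
def aggregate_columns(columns, how="max"):
    """Combine several columns that measure the same mode for the same target
    into one column. Missing measurements are None; a drug with no data in any
    column stays None."""
    combined = []
    for values in zip(*columns):
        present = [v for v in values if v is not None]
        if not present:
            combined.append(None)
        elif how == "max":
            combined.append(max(present))
        elif how == "mean":
            combined.append(sum(present) / len(present))
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return combined

# Hypothetical potency values for four drugs measured with two assay technologies.
technology_a = [6.2, None, 5.0, None]
technology_b = [6.5, 7.1, None, None]

by_max = aggregate_columns([technology_a, technology_b])
by_mean = aggregate_columns([technology_a, technology_b], how="mean")
print(by_max)  # [6.5, 7.1, 5.0, None]
```

The combined column has data for three of the four drugs, whereas each original column covers only two.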

At optional step 308, sparsely populated adverse event data may be filtered, for example, by pre-processing filter 108 (Fig. 1). The sparsely populated data may be filtered to optimize computational efficiency and to avoid identifying biologically implausible associations. For example, columns of predictor values may be excluded that include fewer than about 5 compounds, where a difference between a minimum pXC₅₀ value in the column and a maximum pXC₅₀ value in the column is less than about 0.5. In this example, a standard deviation of the pXC₅₀ values in a column is less than about 0.1 and a maximum pXC₅₀ value in a column is less than about 5. It is understood that the approach described above for filtering sparsely populated adverse event data is an example and that any suitable approach for filtering sparsely populated adverse event data may be used. Although steps 306 and 308 are illustrated as being performed in order, it is understood that steps 306 and 308 may be performed in a reverse order or concurrently.
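One possible reading of the example filter above can be sketched as follows. The sketch assumes that each stated threshold acts as an independent reason to exclude a column and that the sample standard deviation is intended; neither assumption is confirmed by the description, and the function name and values are hypothetical.

```python
from statistics import stdev

def keep_column(values, min_count=5, min_range=0.5, min_sd=0.1, min_max=5.0):
    """Return True if a predictor column is populated and variable enough to
    model. Missing values are None. Assumption: a column is excluded if it
    fails any one of the four thresholds."""
    present = [v for v in values if v is not None]
    if len(present) < min_count:
        return False  # too few compounds with data
    if max(present) - min(present) < min_range:
        return False  # almost no spread between weakest and strongest value
    if stdev(present) < min_sd:
        return False  # values are nearly constant
    return max(present) >= min_max  # at least one reasonably potent compound

# Hypothetical pXC50 columns (illustration only).
dense = [5.2, 6.1, 7.0, None, 6.5, 5.8]
sparse = [6.0, None, None, 7.0, None, None]
flat = [6.0, 6.1, 6.0, 6.1, 6.0, 6.1]
weak = [3.0, 4.0, 4.5, 3.5, 4.2, 4.9]

kept = [name for name, col in
        [("dense", dense), ("sparse", sparse), ("flat", flat), ("weak", weak)]
        if keep_column(col)]
print(kept)  # ['dense']
```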

At step 310, statistical modeling is applied to determine one or more statistical associations between the predictors selected at step 300 and the adverse event selected at step 302, for example, by statistical modeling processor 112 (Fig. 1). At step 312, drugs are identified that are associated with the statistically significant predictors. Although steps 310 and 312 are illustrated as being performed sequentially, it is understood that steps 310 and 312 may be performed concurrently. At optional step 314, visualization filtering may be applied, for example, by visualization filter 114 (Fig. 1). The generated statistical model may be interpreted by suitably skilled persons in biological, pharmacological, medical and chemical sciences to understand the relationships in the larger context of medicine and drug discovery, and to verify that the relationships and descriptors identified are meaningful and reasonable. The level and type of interpretation possible may depend on the statistical method that is used. For example, some methods may be evaluated via performance metrics such as predictive ability, whereas others may be evaluated, for example, graphically in more detail.

An analysis of the modeling results may include considerations of model quality and robustness, as well as an inspection of the suggested relationships and the predictors identified by the pharmacological and biological implications of the models. Suggested statistical associations may be verified such that the relationships are not unduly influenced by artifacts in the data set. In cases where chemical descriptors are used, for example, they may be verified to identify parts of the molecules which can reasonably be expected to contribute to the chemical and pharmacological properties of the molecules. In the cases where measurements from biochemical assays are used as predictors, other considerations may be used.

Unlike chemical descriptors or physical properties, assay results are not an intrinsic property of the molecules. In addition, not all molecules may be tested in all assays, such that there may be a degree of sparseness to the data. Therefore, the biological activity predictors that are suggested by a statistical model may be verified to identify meaningful predictors. The verification may determine whether the predictors have been selected simply because a preponderance of the molecules in the active subset have been tested in one of the assays, whereas on average the data for that assay is sparse. Additionally, when biological targets or assays are identified as statistically meaningful predictors for a response variable, these results may be scientifically relevant only if the parameters defined in the model would be expected to lead to a pharmacological response in vivo. For example, predictors within a model that involve low affinity biological effects (i.e., typically Ki values of > 10⁻⁵) may not effectively contribute to the development of a hypothesis about a toxicity, because the drugs are not given at such high doses. Furthermore, such predictors may not be useful for predicting the future outcomes of new molecules.

At step 316, statistical associations are displayed, including a measure of significance, for example, on display 118 (Fig. 1).

Referring to Fig. 5, a decision tree diagram 500 is shown as an example of presenting drug screening results for an adverse event, such as an adverse event selected at step 302. Parent node 502 includes all drugs 504 in a training or analysis set. Drugs 504 may be associated with different predictor variables that are illustrated in this example with different shapes and patterns. Other presentation techniques, such as listing the drug names, may also be used. The statistical modeling processing identifies drugs 504 that have a higher quantitative score (for example EBGM) for a particular adverse event and a statistical association with one or more predictors (such as a biological target activity, a chemical descriptor or substructure) that are most strongly associated with the quantitative score for this adverse event. In decision tree diagram 500, four drugs 504a-504d are identified in node 506 at split 518 for condition 514 as being associated with a specific predictor (a "triangle" feature). The remaining drugs shown in node 508 are also processed to identify other features in the remaining drugs that are associated with an adverse event. In decision tree diagram 500, three drugs 504e-504g are identified in node 510 at split 520 for condition 516 as being associated with predictors of the adverse event. The remaining drugs in node 512 may have a low association with predictors of an adverse event.

Referring to Fig. 6A, scatter plot 600 may also be used to illustrate statistical associations between predictors and an adverse event. Scatter plot 600 illustrates an association between tardive dyskinesia and the dopamine-2 receptor subtype (D2) antagonism. Drugs 602 are plotted according to an EBGM score for tardive dyskinesia (y-axis) and dopamine D2 receptor activity (x-axis), based on the statistical modeling processing as described above. Accordingly, scatter plot 600 may provide a simple graphical approach that displays an adverse event as a function of predictor variables for all of the drugs in a dataset. For example, the results may be interpreted to determine a multitude of factors that may influence the level of adverse events.

Referring to Fig. 6B, heat map 640 (a portion of which is shown in Fig. 6B) may also be used to illustrate statistical associations between a plurality of predictors and a plurality of events, according to another embodiment of the present invention. Heat map 640 includes adverse events along the x-axis, predictors, such as biological assays, on the y-axis, a grid 642 indicating any statistical associations between predictors and adverse events, and a measure of significance 644. Grid 642 presents the measure of significance between an adverse event and the predictors. In this manner, any patterns may be observed for a number of adverse events and a number of predictors. For example, heat map 640 indicates a pattern 646 between several target assays and several adverse events.

Heat map 640 may be generated using pairwise methods such as a Pearson's correlation, a chi-squared test, a Fisher's exact test, and the like, to determine statistical associations between a plurality of predictors and a plurality of events. For example, statistical modeling processor 112 (Fig. 1) may compare each column of predictor values (for example, assay data) to each column of quantitative measures of events (for example, EBGM values) for a list of drugs, in order to find statistical relationships between the predictors and the quantitative measures of events, with no a priori assumptions regarding relationships between the predictors and the events. The identified statistical associations may be provided as a list according to a pairwise relationship with a measure of significance 644 (such as a score, a p-value, an r value (i.e. a Pearson's product-moment correlation coefficient), etc.). The measure of significance 644 is used to sort, filter and interrogate the resulting identified statistical associations. The list may be converted into a matrix format, as shown in heat map 640, with events along the x-axis and predictors (such as assays) on the y-axis. Alternatively, the predictors may be provided along the x-axis and the events may be provided along the y-axis. The statistical associations are determined and populated in grid 642 with the measure of significance 644 (presented, for example, according to color) in the intersecting cells. In this manner, an all-vs.-all view of the identified statistical associations is provided and may be used to identify patterns in the statistical associations.

Referring back to Fig. 3, at optional step 318, at least one of the prediction results and the generated statistical model may be stored, for example in modeling results database 116 (Fig. 1). At optional step 320, the generated statistical model may be applied to other similar adverse events (i.e. for events in the same database, such as events database 110). At optional step 322, prediction results may be compared with different adverse events databases, such as external database 122 (Fig. 1), using the generated statistical model stored in modeling results database 116.
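The conversion of a list of pairwise associations into the matrix format used for heat map 640, together with sorting by the measure of significance, can be sketched as follows. The predictor names, event names, and scores are hypothetical, and the score is assumed to be an r value so that sorting by magnitude is meaningful.

```python
def to_matrix(associations, fill=None):
    """Pivot (predictor, event, score) triples into a grid with one row per
    predictor (y-axis) and one column per event (x-axis)."""
    predictors = sorted({p for p, _, _ in associations})
    events = sorted({e for _, e, _ in associations})
    lookup = {(p, e): s for p, e, s in associations}
    grid = [[lookup.get((p, e), fill) for e in events] for p in predictors]
    return predictors, events, grid

def strongest(associations, top=3):
    """Sort associations by the magnitude of the score, strongest first."""
    return sorted(associations, key=lambda a: abs(a[2]), reverse=True)[:top]

# Hypothetical pairwise results (predictor, event, r value).
associations = [
    ("assay_A", "event_X", 0.91),
    ("assay_A", "event_Y", 0.12),
    ("assay_B", "event_X", -0.85),
    ("assay_C", "event_Y", 0.40),
]

predictors, events, grid = to_matrix(associations)
top_two = strongest(associations, top=2)
for name, row in zip(predictors, grid):
    print(name, row)
```

Cells for which no association was computed are left as `None`; a plotting tool could then map each populated cell to a color.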

In generating a statistical model, a node purity and the predictability of a model may be adversely affected by drugs that have a predictive feature but do not exhibit a clinical toxicity for reasons related, for example, to absorption, indication, an amount of use and a route of administration. For example, a drug that has HERG binding which is used only as an occasional topical ointment may not be likely to have cardiac toxicity, and would therefore lower the statistical association of the toxicity response with the predictor.

In addition, because the relative reporting ratios for adverse events (such as EBGM or other measures of toxicity response) are derived from spontaneous reports rather than from placebo controlled data, there may be adverse event scores that are confounded by underlying conditions in the patient. For example, seizure patients may report seizures when being treated with anti-convulsant drugs, and these seizures may represent a baseline disease rather than adverse events. Accordingly, statistical modeling according to the present invention may be more robust when used to analyze toxicities that are typically drug related (such as torsades de pointes, tardive dyskinesia, etc.).

The predictor variable may show an association as opposed to a causation for a toxicity. If all drugs in the analysis have high affinity for D2, for example, the drugs may also have another property that is highly correlated with D2 in the drugs analyzed (a collinearity with a similar but distinct receptor, i.e. the dopamine-3 receptor subtype (D3)), so that it may be difficult to distinguish which of the two variables is more predictive. Additionally, D2 may be associated with another variable that has not been measured or included in the model and thus may represent a surrogate for the other predictor.

In an example embodiment, statistical modeling has been done on between about 1000 and 2000 marketed drugs. The data derived from these molecules may not be generalizable to all molecules, because it is taken from pools of compounds that are relatively safe, as they have been approved for use in humans.

The present invention is illustrated by reference to a number of examples. The examples are included to more clearly demonstrate the overall nature of the invention. These examples are exemplary, not restrictive of the invention.

In the examples shown below, recursive partitioning was used to demonstrate the modeling of predictor and response variables in a decision-tree format (such as shown in Fig. 5). The recursive partitioning processing identifies drugs that have a higher EBGM score for a particular adverse event and a statistical association with a predictor variable (such as a biological target activity, a chemical predictor or a substructure) and that are most strongly associated with the EBGM for this adverse event. In the examples shown below, the EBGM score in each node represents an average EBGM for all compounds in the respective node.

Referring back to Fig. 5, visualization processing (step 314) is further described with respect to the examples shown below. Sometimes, with recursive partitioning, as well as other statistical modeling methods, the analysis may yield a result where several different predictor variables may partition the data in a similar way, such as to yield a similar result at a given decision point (i.e., a split such as split 518 or split 520). Decision points where multiple predictors yield a similar solution may be referred to as having primary and surrogate splits. In some instances, the multiple solutions may provide similar results. The primary split is typically the split that has a highest statistical significance, but the surrogate splits may have similar statistical significance.

In displaying the results of the models shown in Figs. 7-10, certain guidelines were used as to whether the primary or surrogate split was displayed. If the predictor (for example the "triangle" predictor) shown as the primary split is biologically or chemically relevant and interpretable with respect to the model, then the primary split is displayed. If the primary split is not scientifically relevant (for example, a receptor binding that has a very low affinity or a chemical structure that is not pharmacologically meaningful), then the secondary splits are inspected in rank order of significance. Accordingly, the predictor that is most statistically significant, most scientifically relevant to the model and which is present in a majority of molecules within the node may be displayed. By using this approach, both the statistical and biological relevance of the model may be optimized.
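The display guideline above can be sketched as a simple selection over splits ranked by significance. In practice the relevance judgment is made by a skilled reviewer; here it is represented by a hypothetical callable, and the fallback to the primary split when no candidate is judged relevant is an assumption not stated in the description.

```python
def choose_displayed_split(ranked_splits, is_relevant):
    """Return the first split, in rank order of significance, whose predictor
    is judged scientifically relevant. Assumption: if none is judged relevant,
    the primary (first) split is displayed."""
    for split in ranked_splits:
        if is_relevant(split["predictor"]):
            return split
    return ranked_splits[0]

# Hypothetical candidates at one decision point, most significant first.
ranked_splits = [
    {"predictor": "low_affinity_receptor_binding", "p_value": 0.001},
    {"predictor": "brain_penetration_property", "p_value": 0.002},
    {"predictor": "unrelated_substructure", "p_value": 0.004},
]

# Hypothetical reviewer judgment, recorded as a set of accepted predictors.
accepted = {"brain_penetration_property"}

displayed = choose_displayed_split(ranked_splits, lambda name: name in accepted)
print(displayed["predictor"])  # brain_penetration_property
```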

The aforementioned guidelines were applied for the display of Fig. 8, where the physical characteristics for brain penetration involved a surrogate split, but these characteristics were highly relevant to the brain disorder being studied. However, in some models, for example, Fig. 9, it is useful to consider all components of the surrogate split.

EXAMPLES

Biomarker for Cardiac Toxicity

Fig. 7 is a decision tree diagram 700 that illustrates an example of a statistical association between the Human Ether a Go-Go (HERG) cellular ion channel and the adverse event of torsades de pointes. In decision tree diagram 700 and decision tree diagrams 800, 900, 1000 shown in the respective examples (Figs. 8-10), a false statement for the condition described at each split (for example, condition 708 at split 710) is represented by the drugs in the node to the left of the split (i.e. node 706). A true statement for condition 708 is represented by the drugs in node 704 to the right of split 710. The predictor(s) indicated by condition 708 (i.e. HERG) are determined to provide the most statistically significant partitioning of the drugs (for example, in parent node 702) in terms of the difference in the response variable (i.e. the quantitative measure EBGM of an adverse event) between the set of drugs that are part of node 704, and the remaining drugs that are part of node 706. In Figs. 7-10, the variable n represents a number of drugs in a node.

In Figs. 7-10, substantially significant relationships between predictors and an adverse event are illustrated by the average EBGM score of the drugs in a node (for example, nodes 704, 706). The significant relationships may also be emphasized by the presentation of the nodes, such as by color, line style and/or fill of the node. In Figs. 7-10, a thicker solid line for a node (such as node 804 in Fig. 8) represents a very significant relationship and a thicker dashed line for a node (such as node 816 in Fig. 8) represents a significant relationship. The line style may be determined, for example, based on the average EBGM score for nodes 804 and 816 as compared with nodes that do not have significant relationships for the adverse event (i.e. node 806) (Fig. 8). Although significant relationships are emphasized by line thickness of a node (i.e. node 704), it is understood that the illustrated line style emphasis is merely an example. It is understood that any suitable display of nodes to emphasize substantially significant relationships may be used.


Referring to Fig. 7, torsades de pointes was analyzed as the adverse event. Torsades de pointes is a medically life-threatening cardiac arrhythmia that is associated with the use of certain medications. In this model, parent node 702 includes a set of 1337 drugs that were analyzed from the FDA-AERS database to determine whether any of the pharmacology as measured by in vitro target assays or metabolizing enzyme activities were associated with torsades de pointes. The results from about 450 macro-molecular target assays (using standard functional or receptor binding assays and enzyme activity assays) and more than 150 metabolizing enzymes (for example, see the website druginteractioninfo.org) were included in the model to identify one or more predictive targets.

Parent node 702 contains 1337 marketed drugs, whose average EBGM is 1.4. Of the hundreds of potential predictors, only one target was identified as being predictive of torsades de pointes. The target identified by the model was the HERG cellular ion channel, shown in condition 708 (i.e. "HERG/ERG, pIC₅₀>6.41"). Decision tree diagram 700 includes a quantitative measure of significance for HERG, represented by the pIC₅₀ value shown in condition 708. The modeling algorithm identified 1324 drugs in node 706 that bind to HERG with a pIC₅₀ of less than 6.41 as having a lower risk of torsades de pointes. The modeling algorithm also identified 13 drugs in node 704 with a pIC₅₀ of greater than or equal to 6.41 as having a higher risk of torsades de pointes. In particular, the algorithm identified drugs with an affinity for HERG greater than or equal to 6.41 as having an approximately 12.5-fold increase in reporting risk for torsades de pointes compared to the 1337 drugs in node 702 (with an average EBGM of 16.2 for the 13 compounds in node 704). This model provides a validation of the approach illustrated in the present invention, because drugs that have HERG inhibitory activity are well known to be strongly associated with torsades de pointes.
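The node statistics quoted above can be cross-checked with simple arithmetic. The sketch below assumes, purely for illustration, that each node average is an arithmetic mean, so that the parent average is the size-weighted mean of its two child nodes; the function name `sibling_mean` is hypothetical. Because the reported averages are rounded, the resulting ratios are approximate: about 11.6 relative to parent node 702 and about 12.9 relative to the remaining drugs of node 706, both of the same order as the approximately 12.5-fold figure.

```python
def sibling_mean(parent_n, parent_mean, child_n, child_mean):
    """Mean of the remaining drugs, assuming the parent mean is the
    size-weighted arithmetic mean of its two child nodes (an assumption)."""
    rest_n = parent_n - child_n
    return (parent_n * parent_mean - child_n * child_mean) / rest_n


# Rounded values taken from the description of Fig. 7.
node_706_mean = sibling_mean(parent_n=1337, parent_mean=1.4,
                             child_n=13, child_mean=16.2)
fold_vs_parent = 16.2 / 1.4             # node 704 relative to node 702
fold_vs_sibling = 16.2 / node_706_mean  # node 704 relative to node 706

print(f"node 706 mean EBGM ~ {node_706_mean:.2f}")
print(f"fold vs parent     ~ {fold_vs_parent:.1f}")
print(f"fold vs node 706   ~ {fold_vs_sibling:.1f}")
```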

The ability to link in vitro binding of HERG to the human clinical toxicity (as represented by the EBGM for torsades de pointes) represents an important advance in predictive toxicology because (1) it directly links laboratory data to clinical outcomes and (2) it enables the skilled person to relate (both associatively and quantitatively) the degree of in vitro binding activity to a clinical outcome risk (based on the pIC₅₀ value 6.41 shown in condition 708). The modeling process and presentation provided by the present invention, thus, may allow for a more informed selection of molecules in drug development, because it provides a way to estimate a "cutpoint" for human risk in a clinical setting. For example, if a candidate drug has a pIC₅₀ of 5.5 for HERG, it may have previously been discarded because it interacted with HERG. The ability to model the clinical effects of HERG using a variety of drug and toxicity responses may modify the decision making process, for example, because the model suggests that a drug with a pIC₅₀ of 5.5 is likely to have a markedly lower human risk of torsades de pointes.

This model may provide a proof-of-concept for the present invention, as HERG is a well-established biomarker for drugs with an increased risk of torsades de pointes. Decision tree diagram 700 indicates that compounds with a pIC₅₀ for HERG of greater than or equal to 6.41 had an approximately 12.5-fold higher risk of torsades de pointes compared to compounds where the pIC₅₀ was less than 6.41. The percentage (%) variance (i.e. a measure of how much of the overall variance in the results is accounted for by the model) explained by this single predictor was about 4.2%.
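For a single split, one common way to express a percentage of variance explained is to compare the scatter of the response within the two child nodes against its scatter in the parent node. The Python sketch below uses that formulation, 100 × (1 − SSE/SST), as an assumption; the description does not state the exact formula used, and the function name and arguments are hypothetical.

```python
from statistics import mean


def percent_variance_explained(responses, in_right_node):
    """Percentage of response variance accounted for by one binary split.

    responses     -- response variable per drug (e.g. EBGM)
    in_right_node -- True where the split condition holds for that drug
    Uses 100 * (1 - SSE / SST), an assumed formulation.
    """
    overall = mean(responses)
    sst = sum((y - overall) ** 2 for y in responses)
    if sst == 0:
        return 0.0  # no variance to explain

    sse = 0.0
    for flag in (False, True):
        group = [y for y, f in zip(responses, in_right_node) if f == flag]
        if group:
            m = mean(group)
            sse += sum((y - m) ** 2 for y in group)
    return 100.0 * (1.0 - sse / sst)
```

A low percentage, such as the value reported for the HERG split, is compatible with a large difference between node means when the higher-risk node contains only a small fraction of the drugs.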

Predictors of Tardive Dyskinesia

Fig. 8 is a decision tree diagram 800 that illustrates an example of determined statistical associations between molecular markers or targets and a serious drug-induced adverse event known as tardive dyskinesia. Tardive dyskinesia is a potentially disabling and irreversible neurological syndrome that is associated with chronic use of anti-schizophrenia medications. Although the cause of tardive dyskinesia is not fully understood, it is believed to be related to chronic suppression of the dopamine brain pathway by anti-schizophrenia medications.

Parent node 802 includes 1274 drugs having an average EBGM of 1.1. According to the modeling processing, such as by statistical modeling processor 112 (Fig. 1), a substantially very significant relationship was determined for the dopamine (D2) receptor with a pIC₅₀ of greater than or equal to 7.75 (condition 820) and indicated by node 804. Another substantially significant relationship was determined for the dopamine (D3) receptor with a pIC₅₀ of greater than or equal to 6.52 (condition 824) and indicated by node 812. A further substantially significant relationship was determined for the alpha-1B adrenergic functional assay with a pKi of greater than or equal to 7.33 (condition 826) and indicated by node 816. The quantity Ki represents an experimentally determined equilibrium dissociation constant for a molecule (typically a small drug-like molecule) binding to proteins like enzymes or receptors. The term pKi represents the negative log base 10 of Ki and is a quantity generally used for structure-activity relationship analyses. The % variance explained for the recursive partitioning model was 32.2%.

As demonstrated by split 828 from node 802 to node 804, the recursive partitioning determined that there is a strong association (condition 820) between high EBGMs (high toxicity) for tardive dyskinesia and drugs that have a relatively high affinity (i.e. that measure a pIC₅₀≥7.55) for the dopamine receptor (dopamine subtype D2 receptor). Although many anti-schizophrenic drugs include some D2 receptor antagonism, the degree of affinity may vary by a factor of 100. Using this model, a strong statistical relationship is shown between drugs that have a pIC₅₀ greater than or equal to 7.55 and tardive dyskinesia. This observation is also consistent with the clinical observation in schizophrenic patients that drugs with a very potent antagonism of the D2 receptor are more strongly associated with tardive dyskinesia than the "atypical antipsychotic" drugs, which have lower affinity for D2 receptors. Node 808 contains 223 drugs which are predicted to have good penetration of the blood brain barrier. In decision tree diagram 800, other descriptors were also identified as being associated with split 834 (i.e. surrogate splits) but were discounted and are not shown. These surrogate splits at split 834 are not shown because the associated pIC₅₀ value was below a minimum expected for an in vivo pharmacological effect or because only a subset of the compounds have a measured value for that receptor due to the sparseness of the assay data.
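The pIC₅₀ and pKi values quoted in these examples lie on a logarithmic scale. Under the standard convention that a p-scale value is the negative base-10 logarithm of the corresponding molar quantity, a difference of one unit corresponds to a ten-fold difference in potency, so that a factor of 100 in affinity corresponds to two units. The Python sketch below illustrates the conversion; the function names are hypothetical.

```python
import math


def p_scale_to_molar(p_scale):
    """Convert a p-scale value (pIC50 or pKi) to a molar concentration."""
    return 10.0 ** (-p_scale)


def molar_to_p_scale(molar):
    """Convert a molar concentration (IC50 or Ki) to its p-scale value."""
    return -math.log10(molar)


def fold_difference(p_high, p_low):
    """Fold difference in potency between two p-scale values."""
    return 10.0 ** (p_high - p_low)


# A pIC50 of 5.5 corresponds to an IC50 of roughly 3.2 micromolar.
ic50_micromolar = p_scale_to_molar(5.5) * 1e6
# Two p-scale units correspond to a 100-fold difference in potency.
two_unit_gap = fold_difference(8.0, 6.0)
```

By the same conversion, the HERG candidate with a pIC₅₀ of 5.5 discussed with respect to Fig. 7 lies 0.91 units below the 6.41 cutpoint, which corresponds to roughly eight-fold weaker binding.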

Node 816 contains 13 drugs with an EBGM of 4.6 that have good brain penetration. The 13 drugs have an approximately three-fold increase in EBGM score for tardive dyskinesia, as compared with drugs in node 808. The drugs in node 816 bind to the alpha-1B receptor with an affinity of greater than or equal to 7.33 (with respect to the functional pKi). Accordingly, drugs in node 816 may be considered to be associated with tardive dyskinesia. Although this association is not as strong as that seen with dopamine receptors (conditions 820 and 824), it appears to be an independent association. This information is further corroborated by visual inspection of the EBGM values of the drugs and the receptor affinities, which suggests that for some drugs, the relationship to tardive dyskinesia is better explained by alpha receptor binding than by dopamine binding. The results shown in decision tree diagram 800, thus, suggest that alpha-1 receptor subtypes may be potential predictors of tardive dyskinesia and that alpha-1 receptor subtypes may be valuable to consider in the development of schizophrenia drugs.

The association of alpha-1 receptor subtypes with tardive dyskinesia shown in decision tree diagram 800 has not been described in the literature and may represent a novel finding in the area of developing safer drugs for schizophrenia. Rat data suggest that alpha-1 receptor subtypes and dopamine receptors are co-localized in the basal ganglia, a region of the brain associated with the control of locomotion and with tardive dyskinesia. Additional animal data suggest that the alpha-1B neurons modulate dopaminergic transmission in this region of the brain. Similar results for the major predictive variables for tardive dyskinesia were obtained by running a second model, using random forest processing, involving biological target and physical properties data, along with imputation data for sparse or missing data. The % variance explained by the second model was greater than 50%. In addition, a related adverse event known as extrapyramidal disorder gave similar findings (not shown) as the model for tardive dyskinesia, helping to confirm the validity of the novel findings shown in decision tree diagram 800. Finally, similar results were seen by running the model in two separate safety databases (AERS and WHO). It should be noted, however, that approximately 50% of the WHO database consists of data from AERS.

Predicting Adverse Events with Molecular Substructures

Fig. 9 is a decision tree diagram 900 that illustrates an example of determined statistical associations between molecular substructures from a set of molecules (associated with chemical substructures) and a severe-rash hypersensitivity syndrome known as Stevens-Johnson syndrome. In this example, a fragmentation algorithm of the software program Molecular Substructure Miner (MoSS) is used for molecular substructure data mining. The MoSS software is described at the website borgelt.net/doc/moss/moss.html and in the publication by Borgelt et al. entitled "MoSS: A Program for Molecular Substructure Mining," in Workshop Open Source Data Mining Software, 2005. It is understood, however, that any suitable software program may be used to perform molecular substructure mining. All 1458 drug structures in the WHO database which met the minimum criteria in the preprocessing steps (steps 306, 308 (Fig. 3)) were included in the statistical modeling process. Chemical fragments from about 5-20 atoms were also selected for all of the 1458 drugs. Table 1 below includes parameters used for selecting molecular fragments as predictors of Stevens-Johnson syndrome. In Table 1, the adverse event refers to a particular drug event pair that a model describes, and which may be used as the specific response variable.

With respect to the percentage of selected fragment descriptors shown in Table 1, all fragment descriptors are ranked by the statistical significance of the difference in the mean response variable between molecules with or without the fragment. The descriptors ranked as the top 20% are selected to build the model. This approach is used for computational efficiency. It is understood that this approach to selecting fragments is an example, and that other suitable approaches may be used to select fragment descriptors.
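The ranking and selection step described above may be sketched in Python as follows. The sketch is illustrative only: the magnitude of Welch's t statistic is assumed as a proxy for statistical significance (a full treatment would convert it to a p-value), and the function names, the `fraction` parameter and the data layout are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        # No spread in either group: any difference is maximally separated.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / se


def select_top_fragments(fragment_presence, responses, fraction=0.20):
    """Rank fragments by |t| between molecules with and without each
    fragment, and keep the top `fraction` of the ranked fragments.

    fragment_presence -- dict: fragment name -> list of bools per molecule
    responses         -- response variable per molecule (e.g. EBGM)
    """
    scored = []
    for name, present in fragment_presence.items():
        with_frag = [y for y, p in zip(responses, present) if p]
        without = [y for y, p in zip(responses, present) if not p]
        if len(with_frag) < 2 or len(without) < 2:
            continue  # a sample variance needs at least two values
        scored.append((abs(welch_t(with_frag, without)), name))
    scored.sort(reverse=True)
    keep = max(1, round(len(scored) * fraction))
    return [name for _, name in scored[:keep]]
```

Restricting the model to the highest-ranked descriptors reduces the number of candidate splits that the recursive partitioning has to evaluate.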

Table 1

Statistical modeling processing using recursive partitioning, with no a priori assumptions regarding molecular fragment predictors, was used to identify chemical fragments that were most strongly associated with Stevens-Johnson syndrome. The resulting processing is shown in decision tree diagram 900. At split 928, condition 920 describes the inclusion of fragments (i.e. moieties) I-VI, all of which are substructures of a more general para-amino benzenesulfonyl substructure. Moieties I-VI include:


At split 930, condition 922 describes the inclusion of fragments (i.e. moieties) VII and VIII, which are substructures of a more general benzylic amine structure. Moieties VII and VIII include:

VII VIII

Fragments I-VI are substructures derived from the surrogate split 928. Fragments VII and VIII are substructures derived from the surrogate split 930. The % variance explained for the determined statistical model was 7.2%.

Accordingly, condition 920 representing moieties I-VI indicates that node 904 is strongly associated with Stevens-Johnson syndrome and includes drugs with an average EBGM of 2.3. In addition, condition 922 representing moieties VII and VIII indicates that node 908 is strongly associated with Stevens-Johnson syndrome and includes drugs with an average EBGM of 2.0. In Fig. 9, conditions 920, 922 represent the inclusion of any one of the fragments (i.e. a logical OR operation). In generating Fig. 9, the surrogate splits are rendered sequentially in an order according to the most frequently occurring and statistically significant predictors. The first listed predictor (such as the first predictor in condition 922) is the most significant predictor statistically. The remaining predictors in conditions 920, 922 are surrogate splits, and may or may not be present in all of the molecules in a daughter node, but would otherwise give rise to the same split and node population, aside from any molecules for which the predictor is missing. The predictors shown in the conditions (such as conditions 920, 922) are often correlated, although the statistics are typically not correlated, because the predictor may not be present in all molecules.
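The logical OR evaluation of a fragment condition may be illustrated with the short Python sketch below. It follows the layout convention of Figs. 7-10, in which drugs satisfying a condition are placed in the right node and the remaining drugs in the left node. The function names, the drug names and the fragment labels in the usage example are hypothetical.

```python
def condition_met(molecule_fragments, condition_fragments):
    """True when the molecule contains any one of the listed fragments."""
    return any(f in molecule_fragments for f in condition_fragments)


def partition(molecules, condition_fragments):
    """Split molecules into (left, right) node populations.

    molecules -- dict: drug name -> set of fragment labels it contains
    Drugs for which the condition is false go left; true goes right.
    """
    left, right = [], []
    for name, fragments in molecules.items():
        if condition_met(fragments, condition_fragments):
            right.append(name)
        else:
            left.append(name)
    return left, right


# Hypothetical usage: a condition listing moieties I-VI.
example_condition = {"I", "II", "III", "IV", "V", "VI"}
example_molecules = {
    "drug_a": {"I", "IX"},
    "drug_b": {"X"},
    "drug_c": {"IV"},
}
left_node, right_node = partition(example_molecules, example_condition)
```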

At both splits 928 and 930, several fragments were identified that may represent surrogate splits (or statistically equivalent descriptors in this data set), but which identify the same general substructure at different levels of specificity. These alternate descriptors may contribute to the performance of the model in predicting the future activity of untested molecules. The identified fragments, which were found by the approach of the present invention with no a priori assumptions, have been described in case examples but have not previously been associated with the adverse event using statistical methods that correlate human patient experience involving many patients and drugs. Fig. 9 illustrates the use of system 100 as a tool for identifying chemical structures that may have an increased risk of toxicity. Such information may be helpful when trying to analyze structure-toxicity (i.e. function) relationships and the associations may be helpful in the design process for new drugs.

Predicting Adverse Events with Chemical Descriptors

Fig. 10 is a decision tree diagram 1000 that illustrates an example of determined statistical associations between specific chemical structural features (i.e., substructures within molecules) and an increased risk of a particular adverse event. In this example, a series of predictors including anilines, amines, sulfonamides, sulfones, and sulfonyl urea substructures were selected as potential explanatory variables for a toxicity known as methemoglobinemia. In particular, Table 2 illustrates selected parameters including the predictors for methemoglobinemia.


Table 2

Adverse Event: Methemoglobinemia

Predictors:
All anilines including indole, quinolines
Aromatic primary amine generic
Benzylic acid-NSAIDs
Benzylic amine
Secondary amine
Secondary aniline
Secondary aniline alkyl
Secondary aniline amides
Secondary sulfonamide
Secondary sulfonamide aniline
Secondary, diaromatic anilines
Secondary, primary anilines
Sulfonamide
Sulfone
Sulfonyl Urea
Tert aniline
Tertiary aniline alkyl
Tertiary diaromatic amine
Tertiary aniline amides
Tertiary aniline sulfonamides

% Variance Explained: 1.163


Methemoglobinemia is a severe blood disorder where cellular hemoglobin is altered so that it cannot bind to oxygen. In this model, there was a very slight average increase in methemoglobinemia for compounds containing anilines (an EBGM of 0.79 shown in node 1004 as compared to an EBGM of 0.63 in node 1006). Only a very low percentage of all aniline-containing drugs, however (about 7.5%, or 12 out of 159 compounds), had an increased score for methemoglobinemia. Although aniline itself may cause hemoglobin toxicity, it appears that its incorporation into drug molecules is not associated with a significant risk for this adverse event.

The results described above demonstrate that system 100 may be used to interrogate the clinical impact of substructures that may be of concern (based on anecdotal evidence). System 100 may objectively determine statistical associations between predictors and events that consider structural and clinical evidence, by using statistical model processing and one or more databases that include vast clinical experience over a variety of drug structures. Accordingly, system 100 may be used to exonerate a particular substructure from being excluded in drug design (previously based on anecdotal evidence). As another example, system 100 may be used to identify compounds having a particular fragment and determine which toxicities a particular substructure is most strongly associated with by using a correlation- or regression-based analysis.

Although the invention has been described in terms of apparatus and methods for screening drugs, it is contemplated that one or more components may be implemented in software on microprocessors/general purpose computers (not shown). In this embodiment, one or more of the functions of the various components may be implemented in software that controls a general purpose computer. This software may be embodied in a computer readable medium, for example, a magnetic or optical disk, or a memory card.

Although the invention is illustrated and described herein with reference to specific embodiments, the invention is not intended to be limited to the details shown. Rather, various modifications may be made in the details within the scope and range of equivalents of the claims and without departing from the invention.


Claims

What is Claimed:

1. A method for screening drugs for predictors of quantitatively measured events comprising: selecting at least one predictor of an event; retrieving data for a plurality of drugs, the data for each of the drugs having an associated quantitative measure of the event; statistically analyzing one or more relationships between the selected at least one predictor and the event, using the quantitative measure of the event for each of the plurality of drugs, to determine one or more statistical associations among the one or more relationships, without using any a priori associations between the predictor and the event; and presenting the determined one or more statistical associations including a measure of statistical significance.

2. The method according to claim 1, further comprising: using the selected at least one predictor to identify substantially toxic or efficacious compounds associated with the event based on the determined one or more statistical associations.

3. The method according to claim 1, determining the one or more statistical associations includes an inclusion of data for drugs among the plurality of drugs that exhibit a range of expression for the event.

4. The method according to claim 1, presenting the one or more statistical associations including selecting from among the determined one or more statistical associations based at least on a biological relevance.

5. A computer readable medium including a computer program encoded with instructions to perform the method of claim 1.

6. The method according to claim 1, wherein the presenting of the determined one or more statistical associations includes presenting the one or more statistical associations as a scatter plot.

7. The method according to claim 1, including, prior to determining the one or more statistical associations, filtering the retrieved data for the plurality of drugs to exclude sparsely populated quantitative measures of adverse events.

8. The method according to claim 1, including, prior to determining the one or more statistical associations, performing data aggregate processing of the retrieved data for the plurality of drugs.

9. The method according to claim 1, wherein the presenting of the determined one or more statistical associations includes presenting the one or more statistical associations as a decision tree diagram.

10. The method according to claim 9, wherein the decision tree diagram includes at least some nodes representing substantially toxic or efficacious compounds associated with the event based on the determined one or more statistical associations and the presenting of the one or more statistical associations includes presenting the nodes with at least one of different colors, shading or line styles to emphasize the statistical associations.

11. The method according to claim 1, including, prior to determining the one or more statistical associations, selecting the event from at least one of an adverse event and an efficacy.

12. The method according to claim 1, the selecting of the at least one predictor includes selecting the predictor from at least one of a pharmacological assay, a biochemical assay or a compound property selected from the group consisting of a biochemical property, a pharmacological property, a chemical structure, a chemical substructure and a modeled physical property.

13. The method according to claim 12, wherein the at least one predictor selected includes the pharmacological assay or the biochemical assay and the measure of statistical significance includes a quantitative measure of the pharmacological assay or the biochemical assay.

14. The method according to claim 12, wherein the selected at least one predictor includes the compound property and the measure of statistical significance includes a binary measure indicating the presence or absence of the compound property.

15. The method according to claim 11, wherein the event selected is the adverse event and the quantitative measure of the adverse event includes an empirical Bayes geometric mean (EBGM), a log10 EBGM, a relative reporting (RR) ratio, a reporting odds ratio (ROR), a proportional reporting rate ratio (PRR), a multi-item gamma Poisson shrinker (MGPS) processing, a Bayesian confidence propagation neural network (BCPNN) processing and/or odds ratios or relative risks based on placebo-control, case-control, or controlled cohort studies.

16. The method according to claim 1, wherein the selecting of the at least one predictor includes selecting a plurality of predictors, the event includes a plurality of events, and the retrieving of the data includes retrieving data for the plurality of drugs corresponding to the plurality of events.

17. The method according to claim 16, the step of statistically analyzing including: statistically analyzing the one or more relationships between the plurality of predictors and the plurality of events to determine the one or more statistical associations among the one or more relationships; and performing a hierarchical clustering of the plurality of predictors and the plurality of events according to the one or more statistical associations, and wherein the presenting of the determined one or more statistical associations includes presenting the clustered statistical associations as a heat map.

18. The method according to claim 16, the step of statistically analyzing including processing by at least one of a Pearson's correlation, a Chi-squared test and a Fisher's exact test.

19. The method according to claim 1, wherein the event includes tardive dyskinesia, the predictor includes alpha-1 receptor subtypes, and presenting the one or more statistical associations includes presenting the statistical association between the data for drugs including an association with the alpha-1 receptor subtypes and tardive dyskinesia.

20. The method according to claim 1, retrieving of the data for the plurality of drugs includes selecting the data for the plurality of drugs from one or more external databases.

21. The method according to claim 20, the method further including formatting the data for the plurality of drugs selected from the one or more external databases into a predetermined format.

22. The method according to claim 1, wherein statistically analyzing the one or more relationships includes performing statistical model processing of the selected at least one predictor and quantitative measure of the event.

23. The method according to claim 22, wherein the statistical model processing includes processing by at least one of recursive partitioning, correlation, multiple tree modeling, multiple linear regression, partial least squares, Fisher's exact test, Chi-squared test, artificial intelligence algorithms, neural networks or regression analysis.

24. The method according to claim 22, further comprising: generating a statistical model based on the performed statistical model processing; storing the generated statistical model; determining one or more further statistical associations using the stored statistical model and data for a different plurality of drugs; presenting the one or more further statistical associations; and comparing the presented further statistical associations with the presented statistical associations.

25. The method according to claim 22, further comprising: generating a statistical model based on the performed statistical model processing; storing the generated statistical model; selecting a different event in a same event class within the data for the plurality of drugs; determining one or more further statistical associations using the stored statistical model and the data for the plurality of drugs with respect to the different event; presenting the one or more further statistical associations; and comparing the presented further statistical associations with the presented statistical associations.

26. A system for screening drugs for predictors of quantitatively measured events comprising: a local user device comprising: a user interface configured to provide selection of at least one predictor of an event, and a display; and a modeling system comprising: a controller configured to retrieve data for a plurality of drugs from an event database, the data for each of the drugs having an associated quantitative measure of the event, and a statistical modeling processor configured to statistically analyze one or more relationships between the selected at least one predictor and the event, using the quantitative measure of the event for each of the plurality of drugs, to determine one or more statistical associations among the one or more relationships, without using any a priori associations between the predictor and the event, wherein the controller is configured to present the determined one or more statistical associations including a measure of statistical significance on the display.

27. The system according to claim 26, wherein the at least one predictor is selected from at least one of a pharmacological assay, a biochemical assay or a compound property selected from the group consisting of a biochemical property, a pharmacological property, a chemical structure, a chemical substructure and a modeled physical property.

28. The system according to claim 26, wherein the user interface is configured to allow a user to select the event from at least one of an adverse event and an efficacy.

29. The system according to claim 26, wherein the controller is configured to parse information for the drugs from one or more external databases, to format the parsed information into a predetermined format and to store the formatted information as the data for the plurality of drugs in the event database.

30. The system according to claim 26, further comprising a preprocessing filter configured to exclude sparsely populated quantitative measures of adverse events from the data for the plurality of drugs.

31. The system according to claim 26, further comprising a preprocessing filter configured to perform data aggregate processing of the data for the plurality of drugs.

32. The system according to claim 26, wherein the statistical modeling processor is configured to identify substantially toxic or efficacious compounds associated with the event based on the one or more statistical associations and the controller is configured to present the identified substantially toxic or efficacious compounds on the display.

33. The system according to claim 26, wherein the statistical model processor is configured to perform statistical model processing by at least one of recursive partitioning, correlation, multiple tree modeling, multiple linear regression, partial least squares, Fisher's exact test, Chi-squared test, artificial intelligence algorithms, neural networks or regression analysis.

34. The system according to claim 26, further comprising a modeling results database configured to store at least one of a generated statistical model and the determined one or more statistical associations.

35. The system according to claim 23, wherein the controller is configured to present the determined one or more statistical associations as a decision tree diagram.

36. The system according to claim 35, wherein the decision tree diagram includes at least one node representing substantially toxic or efficacious compounds associated with the event based on the determined one or more statistical associations and comparator indicators including the measure of statistical significance.

37. The system according to claim 36, wherein the controller is configured to display the at least one node with at least one of a different color, a different shading or a different line style to emphasize the one or more statistical associations.

38. The system according to claim 26, wherein the at least one predictor includes a plurality of predictors, the event includes a plurality of events, and the controller retrieves data for the plurality of drugs corresponding to the plurality of events.

39. The system according to claim 38, wherein: the statistical modeling processor is configured to statistically analyze the one or more relationships between the plurality of predictors and the plurality of events to determine the one or more statistical associations among the one or more relationships, and the statistical modeling processor is configured to hierarchically cluster the plurality of predictors and the plurality of events according to the one or more statistical associations, and the controller is configured to present the clustered statistical associations as a heat map.