By Yuchun Guo, Director, Computational Biology and Machine Learning, and Alla Sigova, Vice President, Biology, CAMP4 Therapeutics
Gene control is an exciting and rapidly expanding area of drug discovery centered on the 98% of the human genome that does not encode proteins, but instead regulates gene transcription. Next-generation sequencing technologies have enabled researchers to collect massive amounts of epigenomic data that shed light on mechanisms of gene regulation in human disease, opening up a whole new world of potential drug targets. The growth in epigenomics is accompanied by an ongoing need for new tools to use this wealth of data to drive drug development forward, and computational models are foundational for this purpose.
At CAMP4, we focus on the human genome’s hundreds of thousands of enhancers, which regulate expression of individual genes, and we have collected massive amounts of data on multiple epigenetic features of these enhancers in a variety of healthy and diseased human cell types. Our proprietary AI model, EPIC, quantifies and scores, or “reads,” these features to help predict relevant regulatory RNAs (regRNAs) – a class of non-coding RNAs produced from the enhancers – as drug targets to treat genetic diseases. Because meaningful target identification is one of the most important steps in the drug discovery process, high accuracy and resolution are crucial for the AI models used for this purpose.
As required for many AI models, EPIC had to be trained to read and learn from the data in one cell type, and then read the data collected in other cell types to predict the best drug targets for them. In order for any AI model to read the data accurately, the data sets must be normalized or transformed in a way that allows for consistent, “apples-to-apples” comparison between them. Without such data transformation, predictions made by the model may not be accurate, reliable, or biologically relevant.
While there are several established normalization methods, we found they did not produce consistent or reliable results when applied to our data. We needed a new data normalization method that was specifically designed to handle multiple epigenetic features at once and thus increase the accuracy of predictions across cell types.
This led our team to develop a normalization method called Joint Multi-feature normalization (JMnorm) that we described in a paper published online in Nucleic Acids Research (NAR) on Dec. 6, 2023. We believe that by enabling more effective normalization of epigenetic features across datasets, JMnorm will improve, at the foundational level, the way in which we, as a company and as an industry, discover and develop drugs.
A new normal(ization)
We built and trained EPIC on data collected in a human myelogenous leukemia cell line, then used it to identify targets in liver cells, neurons, and other cell types relevant to our discovery programs.
Many of the features we characterize are functionally correlated, meaning they tend to occur together – in CAMP4’s case, at individual enhancers – and affect one another in similar ways from one cell type to the next. These types of correlations provide us with a more complete picture of the biology of gene regulation within a cell. For example, increased DNA accessibility, measured across the genome at thousands of enhancers, co-occurs with positions of acetylated histones and highlights enhancers that are producing regRNAs. Thus, when normalizing new data sets for EPIC, it is critical to retain those relationships between features – something other normalization methods were not designed to do; instead, they normalize one epigenetic feature at a time. While this approach can minimize the technical “noise” when comparing data sets, it can also result in the distortion or loss of biologically significant relationships between features. These undesirable alterations can reduce the predictive accuracy of drug discovery models like EPIC and impact their ability to identify biologically relevant drug targets in new cell types.
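To make the one-feature-at-a-time approach concrete, here is a minimal Python sketch of per-feature quantile normalization, followed by a simple check of how much the correlation between features changes as a result. This is an illustration only and is not the JMnorm algorithm; the function names, the synthetic data, and the two-feature setup are all assumptions chosen for the example.

```python
import numpy as np

def quantile_normalize_feature(target, reference):
    """Map one feature's target values onto the reference distribution by rank."""
    ranks = np.argsort(np.argsort(target))       # 0-based rank of each target value
    quantiles = (ranks + 0.5) / len(target)      # mid-rank quantiles in (0, 1)
    return np.quantile(reference, quantiles)

def normalize_per_feature(target, reference):
    """Normalize each column (feature) independently of the others."""
    return np.column_stack([
        quantile_normalize_feature(target[:, j], reference[:, j])
        for j in range(target.shape[1])
    ])

def correlation_shift(before, after):
    """Largest absolute change in pairwise Pearson correlation between features."""
    diff = np.corrcoef(before, rowvar=False) - np.corrcoef(after, rowvar=False)
    return float(np.max(np.abs(diff)))

# Synthetic data (hypothetical): rows are enhancers, columns are two features.
rng = np.random.default_rng(0)
n = 2000

# "Reference" cell type: two features driven by a shared signal.
shared_ref = rng.normal(size=n)
reference = np.column_stack([
    shared_ref + 0.3 * rng.normal(size=n),
    shared_ref + 0.3 * rng.normal(size=n),
])

# "Target" cell type: same kind of structure, but measured on different scales.
shared_tgt = rng.normal(size=n)
target = np.column_stack([
    np.exp(shared_tgt + 0.3 * rng.normal(size=n)),        # skewed scale
    5.0 * (shared_tgt + 0.3 * rng.normal(size=n)) + 2.0,  # shifted, stretched scale
])

normalized = normalize_per_feature(target, reference)
shift = correlation_shift(target, normalized)
print(f"Largest change in cross-feature correlation: {shift:.3f}")
```

Each column is mapped onto its reference distribution without looking at any other column, so nothing in the procedure is designed to keep the relationships between features intact. Comparing the feature-by-feature correlation matrix before and after normalization is one simple way to see whether those relationships were altered.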
JMnorm solves this problem by normalizing multiple epigenetic features in a given data set simultaneously rather than one at a time, which preserves the biologically relevant relationships between features better than other methods. In addition to improving the accuracy of our target predictions, the method improves the performance of other downstream analyses, such as predictions of gene expression patterns across cell types and detection of changes in binding of transcription factors to DNA. JMnorm is also generalizable to any cell type, species, experimental condition, or computational model, meaning there is a universe of applications beyond CAMP4’s sphere of focus.
We previously presented posters on JMnorm at the Cold Spring Harbor Laboratory meeting on Network Biology in March 2023 and at Penn State’s 39th Summer Symposium in August 2023. With the paper’s publication in NAR, CAMP4 has now made the underlying method and code freely available to the research community. We expect JMnorm to enable others to build improved computational models of gene regulation and obtain better results from them.
It is important to note that JMnorm is not intended to replace other methods of data normalization but rather to work in concert with them, depending on the underlying goals and types of data. JMnorm is best suited to data types that contain functional correlations between individual features, such as those describing enhancer properties. It is not necessarily superior to other methods when normalizing data where global changes occur across conditions.
As new sequencing technologies become available and the amount of available genomic data continues to grow, it is essential for researchers and drug developers to have the best tools for interrogating and using those data. Sharing pre-competitive tools, such as data normalization methods, can help the research community effectively and efficiently translate these massive data sets into actionable insights that maximize our advances in developing truly impactful therapies for patients.