Extracting Signal from Sequence: Applications of Machine Learning to Solve Biological Puzzles

Carla M. Mann

The era of “big data” has both necessitated and facilitated the development of complex computational methods to assist in analysis. Machine learning methods present one avenue for modeling and allow the elucidation of intricate and often cryptic signals underlying biological processes. My Ph.D. research focused on identifying sequence and structural determinants of molecular recognition in protein-nucleic acid complexes. In these studies, I developed two machine learning models: 1) RPIDisorder, a random forest classification model for predicting RNA-protein interactions based on information derived from the protein sequences; and 2) MEDJED[1], a regression model that can assist in the design of CRISPR-mediated gene editing experiments by predicting the extent to which a preferred DNA repair pathway, microhomology-mediated end joining (MMEJ), will be invoked in response to a DNA break at a particular genomic target site. MEDJED is part of the Gene Sculpt Suite[1],[2], a set of tools for precision genome engineering.


[1] Mann CM, Martínez-Gálvez G, Welker JM, Wierson WA, Ata H, Almeida MP, Clark KJ, Essner JJ, McGrail M, Ekker SC, Dobbs D (2019) The Gene Sculpt Suite: A set of tools for precision genome engineering. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz405

[2] Ata H, Ekstrom TL, Martínez-Gálvez G, Mann CM, Dvornikov AV, Schaefbauer KJ, Ma AC, Dobbs D, Clark KJ, Ekker SC (2018) Robust activation of microhomology-mediated end joining for precision gene editing applications. PLoS Genet. 14(9): e1007652. https://doi.org/10.1371/journal.pgen.1007652