High-throughput optimization of DNA-aptamer secondary structure for classification and machine learn

Andrea Bertozzi, UCLA
-
CSE1 691

We consider the secondary structures of aptamers, single-stranded DNA sequences that often fold on themselves and can be designed to bind to small molecules. Given a specific aptamer sequence, there are well-established computational tools to identify the lowest-energy secondary structure. However, there is a need for a high-throughput process whereby thousands of DNA structures can be calculated in real time for use in an interactive setting, in particular when combined with aptamer selection processes in which thousands of candidate molecules are screened in the lab. We present a new method called GMfold, which draws on ideas from subgraph matching: the DNA chain is treated as a graph in which nucleotides are the nodes and adjacency along the chain defines the edges of the primary DNA structure. This allows us to cluster thousands of DNA strands using modern machine learning algorithms. We present examples using data from in vitro systematic evolution of ligands by exponential enrichment (SELEX). This work is intended to serve as a building block for future machine-learning-informed DNA-aptamer selection processes for target binding and medical therapeutics.
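
The graph representation of the primary structure described above can be illustrated with a short sketch. This is not GMfold itself, only a minimal example of the underlying data structure, assuming a plain Python representation: the function name `primary_structure_graph` and the example sequence are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple


def primary_structure_graph(sequence: str) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Build a graph for the primary structure of a DNA strand.

    Nodes are nucleotide positions labeled with their base; edges join
    positions that are adjacent along the chain. (Hypothetical helper,
    not part of GMfold.)
    """
    seq = sequence.upper()
    if any(base not in "ACGT" for base in seq):
        raise ValueError("sequence must contain only A, C, G, T")
    # One node per nucleotide, keyed by its position in the chain.
    nodes = {i: base for i, base in enumerate(seq)}
    # One edge per pair of consecutive nucleotides along the chain.
    edges = [(i, i + 1) for i in range(len(seq) - 1)]
    return nodes, edges


# Illustrative sequence only; not taken from any SELEX data set.
nodes, edges = primary_structure_graph("GATTACA")
print(len(nodes), "nodes,", len(edges), "edges")
```

A strand of n nucleotides yields n nodes and n - 1 chain edges. Any secondary-structure method would then add further edges for base pairs on top of this chain; how GMfold does so is not specified here.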