### Abstract

RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(W^{k}\G^{k}(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

Original language | English |
---|---|

Article number | 7078863 |

Pages (from-to) | 1345-1354 |

Number of pages | 10 |

Journal | IEEE/ACM Transactions on Computational Biology and Bioinformatics |

Volume | 12 |

Issue number | 6 |

DOIs | |

Publication status | Published - Nov 1 2015 |

### Fingerprint

### Keywords

- approximation algorithm
- digraph-width measure
- dynamic programming
- fixed-parameter tractability
- NP-hardness
- RNA-sequencing
- splicing graph
- transcript prediction

### ASJC Scopus subject areas

- Biotechnology
- Genetics
- Applied Mathematics

### Cite this

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*,

*12*(6), 1345-1354. [7078863]. https://doi.org/10.1109/TCBB.2015.2418753

**Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly.** / Tomescu, Alexandru I.; Gagie, Travis; Popa, Alexandru; Rizzi, Romeo; Kuosmanen, Anna; Mäkinen, Veli.

Research output: Contribution to journal › Article

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 12, no. 6, 7078863, pp. 1345-1354. https://doi.org/10.1109/TCBB.2015.2418753

}

TY - JOUR

T1 - Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly

AU - Tomescu, Alexandru I.

AU - Gagie, Travis

AU - Popa, Alexandru

AU - Rizzi, Romeo

AU - Kuosmanen, Anna

AU - Mäkinen, Veli

PY - 2015/11/1

Y1 - 2015/11/1

N2 - RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

AB - RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

KW - approximation algorithm

KW - digraph-width measure

KW - dynamic programming

KW - fixed-parameter tractability

KW - NP-hardness

KW - RNA-sequencing

KW - splicing graph

KW - transcript prediction

UR - http://www.scopus.com/inward/record.url?scp=84961736134&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961736134&partnerID=8YFLogxK

U2 - 10.1109/TCBB.2015.2418753

DO - 10.1109/TCBB.2015.2418753

M3 - Article

VL - 12

SP - 1345

EP - 1354

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

SN - 1545-5963

IS - 6

M1 - 7078863

ER -