Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly

Alexandru I. Tomescu, Travis Gagie, Alexandru Popa, Romeo Rizzi, Anna Kuosmanen, Veli Mäkinen

    Research output: Contribution to journalArticle

    3 Citations (Scopus)

    Abstract

    RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

    Original languageEnglish
    Article number7078863
    Pages (from-to)1345-1354
    Number of pages10
    JournalIEEE/ACM Transactions on Computational Biology and Bioinformatics
    Volume12
    Issue number6
    DOIs
    Publication statusPublished - Nov 1 2015

    Fingerprint

    RNA
    Arc of a curve
    Genome
    Genes
    Weights and Measures
    Path
    Exons
    Bioinformatics
    Computational Biology
    Dynamic programming
    Fully Polynomial Time Approximation Scheme
    Throughput
    Polynomials
    Approximability
    Technology
    Number of Solutions
    Graph in graph theory
    Quantification
    High Throughput
    Dynamic Programming

    Keywords

    • approximation algorithm
    • digraph-width measure
    • dynamic programming
    • fixed-parameter tractability
    • NP-hardness
    • RNA-sequencing
    • splicing graph
    • transcript prediction

    ASJC Scopus subject areas

    • Biotechnology
    • Genetics
    • Applied Mathematics

    Cite this

    Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly. / Tomescu, Alexandru I.; Gagie, Travis; Popa, Alexandru; Rizzi, Romeo; Kuosmanen, Anna; Mäkinen, Veli.

    In: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 12, No. 6, 7078863, 01.11.2015, p. 1345-1354.

    Research output: Contribution to journalArticle

    Tomescu, Alexandru I. ; Gagie, Travis ; Popa, Alexandru ; Rizzi, Romeo ; Kuosmanen, Anna ; Mäkinen, Veli. / Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2015 ; Vol. 12, No. 6. pp. 1345-1354.
    @article{8c743b54b10a43e4b512eed8e8ac4f63,
    title = "Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly",
    abstract = "RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.",
    keywords = "approximation algorithm, digraph-width measure, dynamic programming, fixed-parameter tractability, NP-hardness, RNA-sequencing, splicing graph, transcript prediction",
    author = "Tomescu, {Alexandru I.} and Travis Gagie and Alexandru Popa and Romeo Rizzi and Anna Kuosmanen and Veli M{\"a}kinen",
    year = "2015",
    month = "11",
    day = "1",
    doi = "10.1109/TCBB.2015.2418753",
    language = "English",
    volume = "12",
    pages = "1345--1354",
    journal = "IEEE/ACM Transactions on Computational Biology and Bioinformatics",
    issn = "1545-5963",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "6",

    }

    TY - JOUR

    T1 - Explaining a Weighted DAG with Few Paths for Solving Genome-Guided Multi-Assembly

    AU - Tomescu, Alexandru I.

    AU - Gagie, Travis

    AU - Popa, Alexandru

    AU - Rizzi, Romeo

    AU - Kuosmanen, Anna

    AU - Mäkinen, Veli

    PY - 2015/11/1

    Y1 - 2015/11/1

    N2 - RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

    AB - RNA-Seq technology offers new high-throughput ways for transcript identification and quantification based on short reads, and has recently attracted great interest. This is achieved by constructing a weighted DAG whose vertices stand for exons, and whose arcs stand for split alignments of the RNA-Seq reads to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the weights of the graph under various fitting functions, such as least sum of squared residuals. In (Tomescu et al. BMC Bioinformatics, 2013) we studied this genome-guided multi-assembly problem when the number of allowed solution paths was linear in the number of arcs. In this paper, we further refine this problem by asking for a bounded number k of solution paths, which is the setting of most practical interest. We formulate this problem in very broad terms, and show that for many choices of the fitting function it becomes NP-hard. Nevertheless, we identify a natural graph parameter of a DAG G, which we call arc-width and denote G, and give a dynamic programming algorithm running in time O(Wk\Gk(G + k)n), where n is the number of vertices and W is the maximum weight of G. This implies that the problem is fixed-parameter tractable (FPT) in the parameters W, G, and k. We also show that the arc-width of DAGs constructed from simulated and real RNA-Seq reads is small in practice. Finally, we study the approximability of this problem, and, in particular, give a fully polynomial-time approximation scheme (FPTAS) for the case when the fitting function penalizes the maximum ratio between the weights of the arcs and their predicted coverage.

    KW - approximation algorithm

    KW - digraph-width measure

    KW - dynamic programming

    KW - fixed-parameter tractability

    KW - NP-hardness

    KW - RNA-sequencing

    KW - splicing graph

    KW - transcript prediction

    UR - http://www.scopus.com/inward/record.url?scp=84961736134&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84961736134&partnerID=8YFLogxK

    U2 - 10.1109/TCBB.2015.2418753

    DO - 10.1109/TCBB.2015.2418753

    M3 - Article

    VL - 12

    SP - 1345

    EP - 1354

    JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

    JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

    SN - 1545-5963

    IS - 6

    M1 - 7078863

    ER -