TY - GEN
T1 - Combining structural analysis and computer vision techniques for automatic speech summarization
AU - Sert, Mustafa
AU - Baykal, Buyurman
AU - Yazici, Adnan
PY - 2008
Y1 - 2008
N2 - Similar to verse and chorus sections that appear as repetitive structures in musical audio, the key concept (or topic) of some speech recordings (e.g., presentations and lectures) may also repeat itself over time. Hence, accurate detection of these repetitions may help automatic speech summarization succeed. Based on this motivation, we consider the applicability of music structural analysis methods to speech summary generation. Our method transforms a 1-D time-domain speech signal into a 2-D image representation, namely a (dis)similarity matrix, and detects possible repetitions within the matrix using appropriate computer vision techniques. In addition, the method does not transcribe the speech signal into words, phrases, or sentences. Hence, it can be generalized as a speech-to-speech summarization method, in which summarization results are presented as speech instead of text. Furthermore, the method does not need prior knowledge of the language or grammar of the speech signal. Experiments show that our method can capture the main theme of speech signals when compared with the ideal transcription sections defined by experts, and computational analysis shows that the proposed method has good performance.
AB - Similar to verse and chorus sections that appear as repetitive structures in musical audio, the key concept (or topic) of some speech recordings (e.g., presentations and lectures) may also repeat itself over time. Hence, accurate detection of these repetitions may help automatic speech summarization succeed. Based on this motivation, we consider the applicability of music structural analysis methods to speech summary generation. Our method transforms a 1-D time-domain speech signal into a 2-D image representation, namely a (dis)similarity matrix, and detects possible repetitions within the matrix using appropriate computer vision techniques. In addition, the method does not transcribe the speech signal into words, phrases, or sentences. Hence, it can be generalized as a speech-to-speech summarization method, in which summarization results are presented as speech instead of text. Furthermore, the method does not need prior knowledge of the language or grammar of the speech signal. Experiments show that our method can capture the main theme of speech signals when compared with the ideal transcription sections defined by experts, and computational analysis shows that the proposed method has good performance.
UR - http://www.scopus.com/inward/record.url?scp=62949196953&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=62949196953&partnerID=8YFLogxK
U2 - 10.1109/ISM.2008.90
DO - 10.1109/ISM.2008.90
M3 - Conference contribution
AN - SCOPUS:62949196953
SN - 9780769534541
T3 - Proceedings - 10th IEEE International Symposium on Multimedia, ISM 2008
SP - 515
EP - 520
BT - Proceedings - 10th IEEE International Symposium on Multimedia, ISM 2008
T2 - 10th IEEE International Symposium on Multimedia, ISM 2008
Y2 - 15 December 2008 through 17 December 2008
ER -