Abstract
This paper presents a statistical model for measuring structural similarity between webpages from bilingual websites. Starting from basic assumptions, we derive the model and propose an algorithm to estimate its parameters in an unsupervised manner. The statistical approach appears to benefit the structural similarity measure: in the task of distinguishing parallel webpages from bilingual websites, our language-independent model achieves an F-score of 0.94-0.99, which is comparable to the results of language-dependent methods involving content similarity measures.
Original language | English |
---|---|
Pages (from-to) | 24-31 |
Number of pages | 8 |
Journal | International Conference Recent Advances in Natural Language Processing, RANLP |
Volume | 2015-January |
Publication status | Published - 2015 |
Event | 10th International Conference on Recent Advances in Natural Language Processing, RANLP 2015 - Hissar, Bulgaria; Duration: Sept 7 2015 → Sept 9 2015 |

ASJC Scopus subject areas
- Software
- Computer Science Applications
- Artificial Intelligence
- Electrical and Electronic Engineering