Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora


This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church's (1993) sentence alignment algorithm. In addition to character-length information, our algorithm uses the timing information encoded in the subtitle files. Timing is highly correlated between subtitle versions of the same movie, since matching subtitles should be displayed at the same time. However, absolute time values cannot be used for alignment: timing is usually specified in frame numbers rather than real time, and converting frame numbers to real-time values is not always possible. We therefore use normalized subtitle duration instead, which significantly reduces the alignment error rate.
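The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes "normalized duration" means each subtitle's share of the total display time in its file, it replaces Gale and Church's probabilistic cost with a plain absolute difference, and it only allows one-to-one matches and skips. The names `Subtitle`, `shares`, and `align`, and the values of `weight` and `skip_cost`, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # display start, in frames or seconds (units cancel out)
    end: float    # display end, same units as start
    text: str


def shares(values):
    """Scale values so they sum to 1, making two files comparable."""
    total = sum(values)
    return [v / total for v in values]


def align(src, tgt, weight=0.5, skip_cost=0.15):
    """Return (src_index, tgt_index) pairs of a minimum-cost alignment.

    weight and skip_cost are illustrative values, not taken from the paper.
    """
    # Assumption: normalize durations and character lengths per file.
    src_dur = shares([s.end - s.start for s in src])
    tgt_dur = shares([t.end - t.start for t in tgt])
    src_len = shares([len(s.text) for s in src])
    tgt_len = shares([len(t.text) for t in tgt])

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # One-to-one match: penalize duration and length mismatch.
                match = (weight * abs(src_dur[i - 1] - tgt_dur[j - 1])
                         + (1 - weight) * abs(src_len[i - 1] - tgt_len[j - 1]))
                if cost[i - 1][j - 1] + match < cost[i][j]:
                    cost[i][j] = cost[i - 1][j - 1] + match
                    back[i][j] = (i - 1, j - 1)
            if i > 0 and cost[i - 1][j] + skip_cost < cost[i][j]:
                # Source subtitle left unaligned.
                cost[i][j] = cost[i - 1][j] + skip_cost
                back[i][j] = (i - 1, j)
            if j > 0 and cost[i][j - 1] + skip_cost < cost[i][j]:
                # Target subtitle left unaligned.
                cost[i][j] = cost[i][j - 1] + skip_cost
                back[i][j] = (i, j - 1)

    # Walk back from the end, keeping only the diagonal (match) steps.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

Because every duration is divided by its file's total, two files whose timings differ only by a constant scale (for example, different frame rates) yield identical normalized durations, which is why no conversion to real time is needed in this sketch.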

Einav Itamar, Alon Itai


Church's Sentence Alignment | Education | Large-scale Bilingual Corpus | LREC 2008 | Time Values |


Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Einav Itamar, Alon Itai
