Maja Popović (Humboldt-Universität zu Berlin): (Dis)similarity Metrics for Texts (School Seminar)

Abstract:
Natural language processing (NLP) is a multidisciplinary field closely related to linguistics, machine learning and artificial intelligence. It comprises a number of subfields dealing with different kinds of analysis and/or generation of natural language texts. All these methods and approaches need some kind of evaluation, i.e. a comparison of the obtained result with a given gold standard. For tasks dealing with text generation (such as speech recognition or machine translation), two texts have to be compared. This is usually done either by counting matched words or word sequences (which produces a similarity score) or by calculating edit distance, i.e. the number of operations needed to transform the generated word sequence into the desired word sequence (which produces a “dissimilarity” score called “error rate”). The talk will give an overview of the advantages, disadvantages and challenges of these types of metrics, concentrating mainly on machine translation (MT) but also relating to some other NLP tasks.
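
As an illustration of the two families of metrics mentioned in the abstract, the following is a minimal Python sketch, not taken from the talk or from any particular evaluation tool. The function names (edit_distance, word_error_rate, ngram_precision) and the example sentences are assumptions chosen for illustration, and real metrics add details such as tokenisation, multiple references and length penalties.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_error_rate(hyp, ref):
    """Dissimilarity score: edit operations per reference word."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(hyp, ref) / len(ref)


def ngram_precision(hyp, ref, n=1):
    """Similarity score: share of hypothesis n-grams also found in the reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum((hyp_ngrams & ref_ngrams).values())  # clipped match counts
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0


# Hypothetical example sentences (assumed for illustration only).
hypothesis = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()

print(edit_distance(hypothesis, reference))         # 1 (one missing word)
print(ngram_precision(hypothesis, reference, n=2))  # 0.75 (3 of 4 bigrams match)
```

In this example the hypothesis lacks a single word, so the error rate is one operation over six reference words, while unigram precision is still perfect; this hints at why similarity and dissimilarity scores can rank the same output differently.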

Speaker bio:
Maja Popović graduated from the Faculty of Electrical Engineering, University of Belgrade, and continued her studies at RWTH Aachen, Germany, where she obtained her PhD with the thesis “Machine Translation: Statistical Approach with Additional Linguistic Knowledge”. She then continued her research at the DFKI Institute and thereafter at the Humboldt University of Berlin, focusing mainly on approaches to the evaluation of machine translation. She has developed two open-source evaluation tools: (i) Hjerson, a tool for automatic translation error classification, and (ii) chrF, an automatic metric for machine translation evaluation based on character sequence matching.