We thank the whole team of the Web Technology sector at the JRC for providing us with the valuable news data to test the tools, as well as for their technical support. We also want to thank Carlo Ferigato, who introduced us to various fuzzy matching techniques. We thank Tomaž Erjavec for helping us with the Slovene language, and Helen Salak for providing us with knowledge about Farsi.

Introduction

Many large organisations continuously monitor the media, and especially the news, to stay informed about events of interest, and to find out what the media say about certain persons, organisations, or subjects. Software tools that automatically pre-select the news articles of interest and that pre-process the chosen text collection simplify the daily repetitive task of media monitoring. Crestan & de Loupy (2004) showed that Named Entity extraction and visualisation help users to browse large document collections more quickly and efficiently. This seems plausible as, according to Gey (2000), 30% of content-bearing words in news are proper names.

In news analysis it is important to know What is the subject, Who is being talked about, Where and When things happened, and How it was reported. This paper focuses on the occurrence of proper names in news, i.e. [...] Previous work focused on answering the questions What (Pouliquen et al. [...] Due to the highly multilingual work environment in the European Commission – an organisation with twenty official languages – multilinguality of tools and the cross-lingual aspect are of prime importance.

Our analysis is applied to the output of the Europe Media Monitor system EMM (Best et al., 2002). EMM is a software toolset that monitors a daily average of 25,000 news articles in currently 30 languages, deriving from 800 different international news sources. For a subset of about 15,000 articles per day in currently eight languages, we apply unsupervised hierarchical clustering techniques to group related articles separately for each language. We then track related news clusters within the same language and across six of the languages (Pouliquen et al. [...] The JRC’s name recognition tools are applied to each of these clusters, i.e. each group of related texts is treated as one meta-text, for which person and geographical place names are extracted and keywords are identified.

After giving some background on name transliteration and referring to related work (Section Background and related work), we describe tools to identify names in text (Section Proper name recognition) and the mechanism to merge name variants, including those written in Cyrillic, Arabic, and Greek script (Section Detecting and merging name variants). This is followed by evaluation results (Section Evaluation) and by a section on learning relationships between people and how the automatically generated information on names can be used in automatic news analysis (Section Using names to explore document collections).

Background and related work

This section gives some background and points to state-of-the-art applications regarding named entity recognition (See Named entity recognition), transliteration of person names and their mapping with European name variants (See Transliteration of proper names), and the usage of graphs showing relations between persons (See Relation Maps).

Named entity recognition

Though Named Entity Recognition (NER) is a known research area (e.g. MUC-6 1995, Daille & Morin 2000), multilingual Named Entity Recognition is quite new (ACL-MLNER 2003, Poibeau 2003). Moreover, the cross-lingual aspect (detecting the same names across languages) is often limited to single language pairs or can only be trained on parallel text.

People’s names can be recognised in text (a) through a lookup procedure if a list of known names exists, (b) by analysing the local context (e.g. ‘President’ Name Surname), (c) because part of a sequence of candidate words is a known name component (e.g. ‘John’ Surname), or (d) because the sequence of surrounding parts-of-speech indicates to a tagger that a certain word group is likely to be a name. Sometimes, Machine Learning approaches are used for recognising names within their context by looking at words surrounding known names. For the European languages, it is sufficient to consider only uppercase words. Other languages, such as Arabic, do not distinguish case. At the JRC, we currently use methods (a) to (c), but do not use part-of-speech taggers, because we do not have access to such software for all languages of interest. We currently restrict the recognition to names consisting of at least two parts. Until now, the focus has been on people’s names, but we also recognise some organisation names.

The italics indicate the recognised trigger word(s).
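The recognition strategies (a) to (c) described in this section can be illustrated with a minimal Python sketch. It is not the JRC implementation: the three word lists, the function names and the single-letter method labels are hypothetical and chosen only for illustration. The sketch assumes a language that distinguishes case, considers only runs of uppercase-initial words, and keeps only names of at least two parts, as stated in the text. Method (d), part-of-speech tagging, is left out.

```python
import re

# Hypothetical toy resources, for illustration only (not the JRC's lists).
KNOWN_NAMES = {"Kofi Annan"}       # (a) list of known full names
TRIGGER_WORDS = {"President"}      # (b) local-context trigger words
KNOWN_FIRST_NAMES = {"John"}       # (c) known name components


def uppercase_runs(text):
    """Yield runs of two or more consecutive uppercase-initial words."""
    run = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token[0].isupper():
            run.append(token)
            continue
        # A lowercase word or punctuation mark ends the current run.
        if len(run) >= 2:
            yield run
        run = []
    if len(run) >= 2:
        yield run


def find_known_name(words):
    """(a) Return a known name that occurs as a contiguous slice of words."""
    for known in sorted(KNOWN_NAMES):
        parts = known.split()
        for i in range(len(words) - len(parts) + 1):
            if words[i:i + len(parts)] == parts:
                return known
    return None


def recognise_names(text):
    """Return (name, method) pairs; every name has at least two parts."""
    results = []
    for words in uppercase_runs(text):
        known = find_known_name(words)
        if known:
            results.append((known, "a"))
            continue
        for i, word in enumerate(words):
            remaining = len(words) - i
            # (b) trigger word followed by at least two name parts
            if word in TRIGGER_WORDS and remaining >= 3:
                results.append((" ".join(words[i + 1:]), "b"))
                break
            # (c) known name component followed by at least one more part
            if word in KNOWN_FIRST_NAMES and remaining >= 2:
                results.append((" ".join(words[i:]), "c"))
                break
    return results


if __name__ == "__main__":
    sample = ("Yesterday President Jacques Chirac met John Howard "
              "and Kofi Annan in Paris.")
    for name, method in recognise_names(sample):
        print(method, name)
```

In this sketch the trigger word itself is not returned as part of the name, and a sentence-initial capitalised word such as "Yesterday" is skipped because the name is taken to start at the trigger word or the known name component. A single uppercase word such as "Paris" is never returned, which reflects the two-part restriction.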