Syntax-based collocation extraction. (Q1958406)

This book is based on the author's PhD thesis, carried out at a well-known school of (computational) linguistics, the Department of Linguistics of the University of Geneva. The topic is of high interest in both theoretical and computational linguistics, since phraseological units, or multi-word expressions, cover a wide range of phenomena: compound nouns, phrasal verbs, idioms, and collocations, which are the main focus of the book and the largest subset of phraseological units. Collocations are a key factor in numerous natural language processing (NLP) applications. (1) In text production, one can mention machine translation and natural language generation tasks. (2) In text analysis, handling collocations is essential in parsing, word sense disambiguation, information retrieval, topic segmentation, speech recognition, and dictionary look-up.
The main objective of the book is to propose a collocation extraction methodology that is more sensitive to the morpho-syntactic context in which collocations occur in the source corpora. The author relies on the detailed syntactic information provided by Fips, a multilingual syntactic parser, and extracts collocation candidates using a syntactic proximity criterion instead of a linear proximity one. Collocation identification is shown to improve when association measures are applied to the syntactically homogeneous material that syntax-based methods provide. The author also investigates the acquisition of bilingual collocation resources for integration into a rule-based machine translation system. An efficient method is proposed for finding translation equivalents of collocations in parallel corpora, applying the syntax-based collocation extraction technique to both the source and the target versions of the corpus.
The book is organized in six chapters, briefly described as follows. Chapter 1 introduces the key concept of word collocation, discusses the relevance of collocations for NLP, and argues for a syntax-based approach to collocation extraction. Chapter 2 (``On collocations'') investigates the complex phenomenon of collocation and its description in the literature, and identifies the most salient defining features of collocations. Chapter 3 (``Survey of extraction methods'') presents the basics of collocation extraction methodologies relying on statistical association measures. It discusses the role of linguistic preprocessing of source corpora and extensively surveys existing extraction work. Chapter 4 (``Syntax-based extraction'') presents the syntactic parser used to perform the proposed extraction method. The syntax-based extraction is compared to the syntax-free, sliding-window approach in monolingual (French) and cross-lingual (English, French, Italian, and Spanish) evaluation experiments. A qualitative analysis of the results is provided, showing that extraction accuracy more than doubles. Chapter 5 (``Extensions'') proposes extensions of the extraction methodology in three directions: (a) a solution for the extraction of complex collocations (comprising more than three words); (b) detection of all the syntactic configurations that are appropriate to collocations in a given language; and (c) automatic acquisition of bilingual collocation correspondences. The final Chapter 6 summarizes the main contributions of the work and the research directions to be followed: portability of the proposed method, the relationship between syntactic parsing and collocation extraction, and the tools and resources involved in further NLP applications.
Six useful appendices are included: A: List of collocation dictionaries; B: List of collocation definitions; C: Association measures -- mathematical notes; D: Monolingual evaluation (Experiment 1); E: Cross-lingual evaluation (Experiment 2); F: Output comparison.
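
To make the contrast between linear and syntactic proximity concrete for the reader, here is a minimal sketch in Python. It is not the author's implementation: the three toy sentences and their relations are hand-annotated assumptions (not output of Fips), the function names are hypothetical, and pointwise mutual information merely stands in for the association measures discussed in the book.

```python
import math
from collections import Counter

# Hypothetical hand-annotated toy corpus: a list of lemmas per sentence plus
# (head_index, dependent_index, relation) triples. Not produced by Fips.
SENTENCES = [
    (["she", "take", "a", "difficult", "decision"],
     [(1, 0, "subj"), (1, 4, "obj"), (4, 3, "mod")]),
    (["the", "decision", "he", "take", "surprise", "everyone"],
     [(4, 1, "subj"), (3, 1, "obj"), (3, 2, "subj"), (4, 5, "obj")]),
    (["they", "take", "a", "walk"],
     [(1, 0, "subj"), (1, 3, "obj")]),
]


def window_pairs(lemmas, size=2):
    """Linear proximity: pair each word with the next `size` words."""
    for i, left in enumerate(lemmas):
        for right in lemmas[i + 1:i + 1 + size]:
            yield (left, right)


def syntactic_pairs(lemmas, relations, wanted=("obj",)):
    """Syntactic proximity: pair a head with its dependent for chosen relations."""
    for head, dependent, relation in relations:
        if relation in wanted:
            yield (lemmas[head], lemmas[dependent])


def pmi_scores(pairs):
    """Pointwise mutual information over a list of candidate pairs."""
    pair_counts = Counter(pairs)
    total = sum(pair_counts.values())
    left_counts, right_counts = Counter(), Counter()
    for (left, right), n in pair_counts.items():
        left_counts[left] += n
        right_counts[right] += n
    return {
        (left, right): math.log2(n * total / (left_counts[left] * right_counts[right]))
        for (left, right), n in pair_counts.items()
    }


window_counts = Counter(
    pair for lemmas, _ in SENTENCES for pair in window_pairs(lemmas)
)
syntactic_counts = Counter(
    pair for lemmas, rels in SENTENCES for pair in syntactic_pairs(lemmas, rels)
)
scores = pmi_scores(list(syntactic_counts.elements()))

if __name__ == "__main__":
    print("window:", window_counts[("take", "decision")])
    print("syntactic:", syntactic_counts[("take", "decision")])
```

In this toy corpus the verb-object pair (take, decision) is never produced by the two-word window, because the words are either too far apart or in reversed order, while the syntactic criterion retrieves it from both sentences. The association measure is then computed only over pairs of the same syntactic type, which is the sense of "syntactically homogeneous material" in the review.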

reviewed by

Neculai Curteanu

zbMATH Keywords

natural language processing

computational linguistics

collocation extraction

syntax-based collocation extraction