Proceedings 2009

Aliyev R.M. Kheidorov I.E. Yan Jinbing Belarussian State University, Minsk, Belarus

Syllable lattice-based keyword search methods may help to overcome the problem of out-of-vocabulary (OOV) words and compensate for the loss of search performance caused by recognition errors. Since lattice-based search approaches have so far lacked an effective search model, a syllable posterior probability-based search model is proposed. The model takes account of the lattice structure and syllable posterior probability. A search method based on the model is proposed. A series of experiments shows that our method is suitable for keyword search.
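As an illustration of what a syllable posterior probability on a lattice can look like, here is a minimal sketch using the standard forward-backward computation over an acyclic lattice. It is not taken from the paper: the arc format, the function name `arc_posteriors`, and the toy lattice are all assumptions made for the example, and the authors' actual search model may differ.

```python
from collections import defaultdict

def arc_posteriors(arcs, start, end):
    """Posterior probability of each arc in an acyclic lattice.

    arcs: list of (src, dst, syllable, score). Node ids are assumed to be
    numbered in topological order (src < dst); score is a non-negative weight.
    """
    alpha = defaultdict(float)  # forward mass reaching each node
    alpha[start] = 1.0
    for src, dst, _, score in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] += alpha[src] * score

    beta = defaultdict(float)   # backward mass from each node to the end node
    beta[end] = 1.0
    for src, dst, _, score in sorted(arcs, key=lambda a: a[1], reverse=True):
        beta[src] += score * beta[dst]

    total = alpha[end]          # mass of all complete paths (assumed non-zero)
    return [(syl, src, dst, alpha[src] * score * beta[dst] / total)
            for src, dst, syl, score in arcs]

# Hypothetical toy lattice: two competing syllables, then a shared one.
TOY_LATTICE = [(0, 1, "ma", 0.6), (0, 1, "na", 0.4), (1, 2, "ta", 1.0)]
posteriors = arc_posteriors(TOY_LATTICE, start=0, end=2)
```

A keyword hypothesis could then be scored by combining the posteriors of the consecutive arcs that spell it out; how exactly to combine them is a design choice the abstract does not specify.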

Apresjan V.Ju. Russian Language Institute

SEMANTIC SOURCES OF CONCESSION

The paper addresses the issue of concession as a complex derived meaning and analyzes its semantic origins. It also considers polysemy of concessive words and proposes semantic tools to distinguish among closely synonymous concessives derived from words with a non-concessive primary meaning. In particular, the following lexical items are analyzed: concessive conjunction "tol'ko" derived from a restrictive particle, and concessive conjunction/parenthetical word "pravda" derived from a factual noun. Their similarities and differences are analyzed in the light of the primary meanings of "tol'ko" and "pravda".

Baglei S.G. Antonov A.V. Meshkov V.S. Sukhanov A.V. Galaktika Corporation, Moscow, Russia

STATISTICAL DISTRIBUTIONS OF WORDS IN A COLLECTION OF RUSSIAN TEXTS

Statistical properties of texts have been widely studied in the fields of applied mathematics and linguistics. We explored statistical distribution of words in documents of a large collection of Russian texts using a probabilistic Bernoulli text generation process in our model. Unlike the traditional Bernoulli process, each document in the collection is considered as a finite text. We explored distributions of word frequencies in texts within a model representing a set of “bags-of-words”. We plan to use the obtained results to elaborate a more realistic estimated probability of word generation in arbitrary Russian text with regard to word correspondence to the text collection.
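The kind of quantity such a model yields can be sketched with the textbook binomial formulation, assuming every word position in a document is an independent draw in which a given word appears with a fixed probability `p`. The function names and the sample numbers below are illustrative assumptions, not the authors' implementation.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that a word with per-position probability p
    occurs exactly k times in a finite document of n words."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def expected_document_frequency(p: float, doc_lengths: list[int]) -> float:
    """Expected number of documents containing the word at least once,
    treating each document as a finite bag of words of known length."""
    return sum(1 - (1 - p) ** n for n in doc_lengths)

# Hypothetical collection of three documents and a word with p = 0.001.
lengths = [500, 2000, 10000]
print(expected_document_frequency(0.001, lengths))
```

Comparing such expected values with observed document frequencies is one simple way to see where a pure Bernoulli assumption departs from real text.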

Baranov A.N. Institute of Russian Language, Moscow, Russia

SEMANTIC CORRELATES OF FORMAL VARIATION IN THE FIELD OF IDIOMATICS (THE OPERATION OF SUBSTITUTION)

The issue of formal variation of idioms is discussed. The paper focuses on the operation of substitution of different components of an idiom. A classification of different types of substitution operation is elaborated. It is hypothesized that formal variations of different kinds have specific semantic and discursive functions. Linguistic description of variation in the field of idiomatics presupposes an analysis of the correlation between formal variation and meaning changes in an idiom. It is shown that substitution of the components of an idiom in most cases results in the generation of alternative semantic levels and, consequently, in linguistic play.

V.I. Belikov, Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences M.V. Akhmetova, Journal “Zhivaia Starina”

WWW STATISTICAL ESTIMATION OF THE FUNCTIONAL PROPERTIES OF LEXICAL ITEMS

The paper deals with the possibilities of using website statistics for objective estimation of functional properties of vocabulary items: their stylistic status, territorial distribution, obsolescence of an item and its replacement by a new one, etc. The functional properties of particular words and phraseological units reveal themselves in their frequencies in different text arrays (classical vs. web literature, official texts, weblogs, etc.).

Bogatyrev M.Y. Tuhtin V.V., Tula State University

CREATING CONCEPTUAL GRAPHS AS ELEMENTS OF SEMANTIC TEXT LABELING

Prospects of applying conceptual graphs as elements of semantic text labeling are considered. This kind of labeling is metadata that can be used to effectively solve some of the Text Mining problems. An algorithm for creating conceptual graphs is proposed and some results of its applications to modeling abstracts of scientific papers are presented.

Bogdanova N.V. Asinovsky A. S. Rusakova M. V. Ryko A. I. Stepanova S. B. Sherstinova T. Yu. Saint Petersburg State University, Russia

A SPEECH CORPUS AS A TOOL FOR MONITORING AND FIXATION OF VARIOUS FORMS OF NATURAL LANGUAGE

The paper concerns methodological principles and describes the technology of creation of the Corpus of Spontaneous Russian Speech and the structure of the database. Preliminary investigations based on Corpus material are briefly presented.

Bolshakov I.A. National Polytechnic Institute, Mexico

CROSSLEXICA: A LARGE ELECTRONIC DICTIONARY OF

A large Russian electronic dictionary contains a vocabulary of 185,000 entries, 1.75 million collocations, 2 million semantic links, English translations of entry titles, and their morphoparadigms. It functions dialogically (for text editing or language learning) and is also accessible from external software for parsing, word sense disambiguation, detection & correction of malapropisms, steganography, etc.

Bugakov O. V. Ukrainian Lingua-Information Fund, NAS of Ukraine, Kiev, Ukraine

CREATING A SEMANTIC DICTIONARY OF PREPOSITIONAL CONSTRUCTIONS ON THE BASIS OF THE UKRAINIAN NATIONAL LINGUISTIC CORPUS

Search capabilities of the Ukrainian national linguistic corpus and linguistic databases built on its basis are examined. The structure of the semantic dictionary of prepositional constructions built in accordance with the theory of lexicographic systems is described. Key words: preposition, main word, dependent word, semantic state, electronic semantic dictionary of prepositional constructions.

Dikonov V.G. Boguslavsky Igor M. Institute for Information Transmission Problems, Russian Academy of Sciences

UNIVERSAL DICTIONARY OF CONCEPTS

A universal dictionary of concepts, developed as a part of the ongoing effort to create a semantic intermediary language for global information exchange, is presented. The article describes basic principles and contents of the dictionary and outlines the current state of the project. The dictionary can evolve into an open and freely available language-neutral resource with many potential applications. For example, the extensible dictionary of concepts can serve as a pivot to uniformly record and link meanings of words of different languages and facilitate creation of bi- and multilingual dictionaries. Another possible use is word sense markup of corpora. The dictionary of concepts is going to be linked at the word sense level with lexicons of major world languages including Russian, English, Spanish, French, Arabic, Hindi, etc.

Dobrovol'skij D.O. Levontina I.B. Russian Language Institute, Russian Academy of Sciences

RUSSIAN NET, GERMAN NEIN, ENGLISH NO: CONTRASTIVE SEMANTIC ANALYSIS WITH PARALLEL CORPORA

‘No’ seems to be a very simple and universal idea. However, surprisingly enough, the German word nein or the English no are not always good equivalents for the Russian word net, and vice versa. Parallel corpora show that in many cases net is translated differently, even though the respective phrase with nein/no is acceptable. And we often see net in Russian translation instead of some other units. We assume that such lack of coincidence must have certain semantic reasons. They are probably rooted in semantic differences between net and nein/no. In our paper we try to reveal these reasons.

Ermakov A.E. Pleshko V.V. RCO Ltd Moscow

NATURAL LANGUAGE QUERY PROCESSING FOR SEARCH ENGINE BASED ON LINGUISTIC ANALYSIS

A new method of transforming natural language queries into search engine language queries is described, which is based on the automatic analysis of syntactic relations between words and their representation as relevant search engine language operators saving the meaning of an original query to the extent possible.

Fedorova O.V. Shavrygina A.S. Lomonosov Moscow State University, Russia

PROCESSING INITIAL-STRESS AND NON-INITIAL-STRESS WORDS IN SPOKEN-WORD RECOGNITION IN RUSSIAN

The data of an experimental investigation of spoken-word recognition in Russian are presented. Two experiments showed that word recognition and word recall are faster and better for initial-stress words than for non-initial-stress words. The results support the metrical segmentation theory.

Goldin Valentin Martianov A.O. Sdobnova Alevtina Saratov State University

THE DIGITAL RUSSIAN ASSOCIATIVE DICTIONARY OF SCHOOLCHILDREN

The paper deals with some ways of solving different issues of psycholinguistics, sociolinguistics and culturology based on the materials of the digital "Associative Dictionary of Schoolchildren of Saratov city and Saratov region".

Gornostay T. Tilde, Riga Aker A. Department of Computer Science, University of Sheffield

DEVELOPMENT AND IMPLEMENTATION OF MULTILINGUAL OBJECT

The fast growing amount of images available on the web has motivated development of automatic approaches for image description generation. Using multi-document summarization for this task has been proposed recently. This paper describes a method for developing and implementing object type toponym-referenced text corpora in the context of optimizing the multi-document summarization for generating toponym-referenced descriptions of images. Object type corpora are developed for four different languages: English, German, Italian and Latvian.

Grigorian E.L. South Federal University, Rostov-on-Don, Russia

ON THE NATURE OF SYNTACTIC POLYSEMY

The analysis of variations of actant structures reveals that most syntactic structures represent a set of semantic features which are not necessarily realized in every context. In many cases semantic distinctions are neutralized and the constructions differ only in communicative structure or style.

Grishina Elena Institute of Russian Language, RAS, Moscow, Russia

ON GESTURE–WORD CORRELATION (VOCAL GESTURE OH IN SPOKEN RUSSIAN)

The paper analyzes the usage of the vocal gesture Oh according to the data of the future Multimodal Russian Corpus (MURCO). The investigation is based on the analysis of the body and face movements that accompany this vocal gesture in the process of oral speech. As a result, three meanings were detected: 1) Oh as a deixis, 2) Oh as an interjection, and 3) Oh as a physiological exclamation.

Iagounova E.V. St.Petersburg State University

BEST RECOGNIZABLE WORDS UNDER DIFFERENT EXPERIMENTAL SETTINGS

Basic features of the sets of words that are best recognized under white-noise masking and within meaningless text fragments have been analyzed. It is observed that the sets are crucially dependent on such broad text parameters as professional text vs. fiction and dynamic vs. static text.

Iomdin B.L. Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences

EVERYDAY TERMINOLOGY. IN PURSUIT OF STANDARDS

The paper is devoted to the vocabulary describing everyday life artifacts. This vocabulary is shown to be treated very differently in dictionaries, production standards, and usage; hence, unified lexicographic definitions of words belonging to this vocabulary are hardly possible at all. A draft project of an explanatory and encyclopedic thesaurus of everyday life terminology is presented.

Iomdin L.L. Institute for Information Transmission Problems, Russian Academy of Sciences Lobanov B.M. United Institute of Informatics Problems, National Academy of Science of the Republic of Belarus

SYNTACTIC CORRELATES OF PROSODICALLY MARKED ELEMENTS OF THE SENTENCE AND THEIR ROLE IN THE TASKS OF TEXT-TO-SPEECH SYNTHESIS

The paper describes a feasibility study of using syntactic parsing of written text at an initial stage of a text-to-speech synthesis algorithm. An attempt has been made to establish correlations between the elements of an automatically created dependency tree structure of a sentence, on the one hand, and prosodically strong elements of this sentence, on the other hand. First experimental results show that the approach may be effective.

Kibrik A.A. Institute of Linguistics RAS, Khudyakova M.V. Lomonosov Moscow State University, Kodzasov S.V. Lomonosov Moscow State University

PROSODIC TRANSCRIPTION: LEVELS OF DETAIL

In the book Kibrik and Podlesskaya (eds.) 2009, a prosodically oriented system of discourse transcription for spoken Russian was proposed. In this paper a number of extensions for that system are suggested, such as the distinction between expiratory and pitch accents, a more detailed account of pitch accents, interval of tone in an accent, dynamic vowel doubling, etc.

Kibrik A.E., Lomonosov Moscow State University, Russia

TOWARDS THE PROBLEM OF LINGUISTIC VARIABILITY:

Clausal coordination is studied in 23 related Daghestanian idioms. Clausal coordination is extremely variable across this language sample: no two idioms have identical coordinate clausal constructions. At first sight, the choice of formal coding technique used by specific idioms appears random and chaotic. Such a situation creates irresolvable theoretical difficulties. Neither the traditional method of classification nor the structural calculus method is helpful. In the paper an alternative method is employed. It can be called the multifactor second-order calculus method. A calculus of coordinate constructions is implemented at the level of parameterized principles and strategies predetermining specific coordinate constructions, rather than at the level of coordinate constructions themselves.

Khurshudian V.G. Daniel M.A. Levonian D.V. Plungian V.A. Polyakov A.E. Rubakov S.V. Corpus Technologies

EASTERN ARMENIAN NATIONAL CORPUS www.eanc.net

Eastern Armenian National Corpus (EANC) is a comprehensive linguistic database of annotated texts in Eastern Armenian from the mid 19th century to the present. The EANC contains about 110 million tokens and is enhanced with a powerful search engine. EANC is available at www.eanc.net.

Klyshinsky E.S. Keldysh Institute of Applied Mathematics RAS Manushkin E.S. Moscow State Institute of Electronics and Mathematics

THE METHOD OF AUTOMATED SYNTAX SEGMENTATION RULES GENERATION

The paper proposes a method of automated generation of syntax segmentation rules. The method is based on FIRST, LAST, FIRST2 and LAST2 sets calculated for existing BNF grammars describing the rules for syntax analysis of natural language texts.
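For readers unfamiliar with FIRST and LAST sets, the following minimal sketch computes them for a context-free grammar by fixed-point iteration. It assumes a grammar without empty productions, uses a hypothetical toy grammar, and does not cover FIRST2 and LAST2 or the rule generation step described in the paper.

```python
from typing import Dict, List, Set

Grammar = Dict[str, List[List[str]]]  # nonterminal -> list of right-hand sides

def edge_sets(grammar: Grammar, last: bool = False) -> Dict[str, Set[str]]:
    """FIRST sets (or LAST sets when last=True) for an epsilon-free grammar:
    the terminals that can begin (or end) a string derived from each nonterminal."""
    result: Dict[str, Set[str]] = {nt: set() for nt in grammar}
    changed = True
    while changed:  # repeat until no set grows any further
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                symbol = rhs[-1] if last else rhs[0]
                # A nonterminal contributes its current set; a terminal, itself.
                new = result[symbol] if symbol in grammar else {symbol}
                if not new <= result[nt]:
                    result[nt] |= new
                    changed = True
    return result

# Hypothetical toy grammar, for illustration only.
TOY_GRAMMAR: Grammar = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
first_sets = edge_sets(TOY_GRAMMAR)
last_sets = edge_sets(TOY_GRAMMAR, last=True)
```

In this toy grammar a `det` can only open a segment and a `verb` can never open an `NP`, which is the kind of boundary information segmentation rules could exploit.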

Kobzareva T.Yu. Russian State University for Humanities, Russia

SYNTACTIC INCOMPATIBILITY AS A PROPERTY OF THE LINEAR ORGANIZATION OF A RUSSIAN SENTENCE

The paper considers a property of the linear organization of sentence in Russian, the so-called syntactic incompatibility, or impossibility of simultaneous appearance of some components in its fragments set by punctuation marks or coordinative conjunctions. The property can be taken into account at different stages of automatic analysis.

Kobozeva I. M. Lomonosov Moscow State University, Moscow, Russia

SEMANTICS OF THE VERB PONIMAT’: FROM PROPOSITIONAL TOWARDS INTERPERSONAL ATTITUDE

The Russian verb ponimat’ ‘understand’ in constructions with a personal direct object is studied. Six of its readings, corresponding to different intentional states (rational, emotional, interpersonal), are explicitly defined. The emergence of non-rational readings is explained on a cognitive basis.

Kodzasov S.V. Arkhipov A.V. Zakharov L.M. Krivnova O.F. Lomonosov Moscow State University, Moscow, Russia

THE DATABASE ON INTONATION OF RUSSIAN NARRATIVE TEXTS

The paper presents the results obtained at the 2nd stage of development of the DB “Intonation of the Russian informative and narrative texts”. This stage opened the 2nd triennial cycle of inquiry into Russian intonation.

Kozhunova O.S. Institute for Informatics Problems of the Russian Academy of Sciences

DETECTION OF NOMINALIZED STRUCTURES IN PARALLEL PATENT TEXTS IN RUSSIAN AND IN GERMAN

The paper analyzes nominalization in a bilingual (Russian-German) setting, drawing on the results of a comparative study of three languages (Russian, English, and German), an approach to identifying parallel texts in the patent domain, and types of transformation.

Kozerenko A.D. V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences

PARENTHESES IN RUSSIAN IDIOMS

The paper offers a semantic analysis of Russian idioms containing the word parentheses. A paradoxical fact is observed: two idioms that in Russian sound as enclose in parentheses and put outside the parentheses have the same meaning. An explanation is offered, and other Russian idioms containing the word parentheses are examined.

Komarova A.D. Russian State University for Humanities, Russia

PAUSES AFTER POSTPOSITIONS AND TOPICAL PARTICLE WA IN JAPANESE: A CORPUS STUDY

This research examines pauses before and after postpositions and the topical particle wa in Japanese. It aims to find out how frequent and probable these pauses are, how long they usually are, and whether they differ depending on the syntactic position.

Korotaev N.A. Russian State University for Humanities

A CORPUS STUDY OF PAUSATION AT SYNTACTIC BOUNDARIES: WHY PAUSES DO NOT ALWAYS APPEAR WHERE WE EXPECT THEM

The so-called ideal delivery presupposes a pause at every elementary discourse unit boundary. Natural discourse, however, provides numerous examples in which such pauses are missing. The paper reports a corpus-based study of these cases in spoken Russian. It is argued that they are characterized by a high degree of semantic integration, which correlates with syntactic and prosodic properties of the examined sequences. For instance, absence of pauses is intrinsic to most complex clauses. An analysis of a corpus of night dream stories also shows that the ratio between boundaries with and without pauses varies appreciably from one story to another.

Kotov A.A. Russian State University for the Humanities

PATTERNS OF EMOTIONAL REACTIONS IN COMMUNICATION: PROBLEMS OF CORPORA STUDIES AND APPLICATION TO COMPUTER AGENTS

We study the cognitive architecture of computer agents that simulate emotional speech behavior and change their mood over time. Based on a multimodal corpus (records of university exams), we study sequences of contrastive emotional reactions and the possibility of applying these sequences to computer agents.

Cotta Ramusino P. University of Milan, Italy

CITIZEN-INSTITUTION NON-MEDIATED DIALOGUE: THE RUSSIAN DIRECT LINE CASE

This paper analyses a specific kind of institutional discourse: the Russian Direct Line. It aims to give an account of the interactional strategies used by subordinate participants in this interaction. It investigates how “naïf” interviewers, who are not familiar with strategies regulating a neutral or “neutralistic” position, manage to avoid possible consequences of their own speech acts by using pragmatic and metapragmatic acts, basically aimed at downgrading.

Kreydlin G.E. Russian State University for Humanities

THE NONVERBAL BEHAVIOR OF PEOPLE OF DIFFERENT CULTURES IN A DIALOG I: FINNISH AND RUSSIAN GESTURE SYSTEMS

The paper presents some reflections of a so-called external observer on Finnish nonverbal semiotic culture, some nonverbal signs and models of Finnish dialog behavior. Corresponding Russian nonverbal data are given for comparison.

Kretov A.A. Voronezh State University, Russia Rafaeva A.V. Lomonosov Moscow State University, Russia

ON THE SEMANTIC CLASSIFICATION PROGRAM ProSeCa: THEORETICAL AND PRACTICAL ASPECTS

A modified version of E. Kuznetsova’s definition-based semantic identification method is proposed. Its main point is that lexical semantics is concentrated in the most common nouns. A computer program for semantic classification is described. Prospects for using and developing the program are outlined.

Krylov Sergey A. Institute of Oriental Studies of Russian Academy of Sciences, Moscow & Institute of System Analysis of Russian Academy of Sciences, Moscow

“QUASI-CORPUS” INVESTIGATION OF LEXICAL PRODUCTIVITY OF NON-TRIVIAL BASIC DIATHESES OF RUSSIAN WITH SPECIAL REGARD TO S. I. OZHEGOV’S DICTIONARY OF RUSSIAN

“Quasi-corpus” linguistics allows the investigation of both primary and secondary information sources (like grammars and dictionaries). The paper studies the statistics of grammatical data (on government patterns, transitivity, impersonality etc.) in the text of S.I.Ozhegov’s “Dictionary of Russian” (1989).

Krylova T.V., V.V.Vinogradov Russian Language Institute, Russian Academy of Sciences

THE ADJECTIVES WITH MEANING OF HIGH AND LOW TEMPERATURE AND LINGUISTIC ESTIMATION OF TEMPERATURE

In this article the adjectives холодный, прохладный, горячий, жаркий, теплый are considered. In the first part we analyze their division into groups. In the second part we consider their combinations with adverbs of degree. We advance the hypothesis that many differences in the use of temperature adjectives are caused by differences in the linguistic estimation of high and low temperature. In conclusion the same idea is illustrated with the material of verbs with the meaning of temperature.

Kudinov M.S. MSU, Grishina E.A. Institute of Russian Language, Moscow

SEMIAUTOMATIC MARKING TOOLS FOR THE RUSSIAN MULTIMEDIA CORPUS (MURCO)

The paper describes two workbenches for corpus markers: a speech act marker's workbench (Marker) and a gesture marker's workbench (GesturesMarker). These programs allow the annotator to describe, in a quick and uniform manner, Russian gesticulation and speech acts used in spoken Russian.

Kuznetsov I.P. Institute of Informatics problems, Moscow, Russia Efimov D.A. ZAO Synergetic Systems, Moscow, Russia

MEANS FOR TUNING OF THE “SEMANTIX” LINGUISTIC PROCESSOR TO SUBJECT FIELDS

The linguistic processor “Semantix” for automatic formalization of natural language texts is presented. It extracts data on user objects, their links and actions from texts. The processor uses special tools and methods for tuning to new subject fields. As an example, the process of tuning for a text corpus about monuments is considered.

Kustova G.I. Moscow State Pedagogical University

THE SEMANTIC DATABASE OF VERBAL ADJECTIVES: STRUCTURE AND TYPES OF INFORMATION

The paper discusses the issues of elaboration of an electronic semantic dictionary (database) of Russian verbal adjectives (like vkhodnoj, lechebnyj, osvetitel’nyj etc). The topics considered include: a) the correlation between the verbal adjective and the verbal situation and the possibilities of expressing verbal arguments, e.g. stiral’naja mashina (‘washing machine’, instrument), vs. stiral’nyj poroshok (‘washing powder’, means); b) the correlation between the semantic class and the functional predicate of a noun and the semantic model of combinations like «verbal adjective + noun»; c) information types in the database; d) specification of semantic marking in the dictionary of the National Corpus of Russian language.

Lande D.V. Zhigalo V.V. ElVisti Information Centre, Kiev, Ukraine

THE APPROACH TO CREATION OF MULTILINGUAL PARALLEL CORPORA OF WEB PUBLICATIONS

An algorithm for creating bilingual parallel corpora of documents from web publications is described. The algorithm uses frequency morphological dictionaries and empirical statistical properties of texts. A statistical approach to homonymy resolution is presented, which allows choosing the most frequent normal forms. The algorithm has been implemented as a software complex and integrated into the InfoStream content monitoring system. Running the algorithm, which determines basic word forms, has produced a bilingual parallel corpus of electronic texts from web publications that contains more than 450 000 pairs of documents.

Lebedev A.S. Moscow State Institute of Electronics and Mathematics

AN EDITOR OF AUGMENTED TRANSITION NETWORKS WITH A GRAPHICAL USER INTERFACE

The problem of semantic search is considered on an example of search for abstracts. An approach to the creation of a linguistic processor using augmented transition networks, inserted graphs, and arrangement of objects based on their descriptive part is proposed.

Lobanov B.M. United Institute of Informatics Problems, National Academy of Science of the Republic of Belarus

THE PROBLEM OF THE «Ё»-HOMOGRAPHS RESOLUTION IN TEXT-TO-SPEECH SYNTHESIS

The problem of adequate ambiguity resolution in text-to-speech synthesis is considered for a special case of graphic homonymy related to the letter Ё. Statistical characteristics of homographic pairs including Ё homographs and distributions among the frequent pairs of such homographs are investigated. The methods of resolution for the highly frequent homographic pair «ВСЁ» and «ВСЕ» are discussed.

Loukachevitch N.V. Dobrov B.V. Research Computer Center of M.V. Lomonosov Moscow State University NCO Center for Information Research

SUMMARIZATION OF NEWS CLUSTERS BASED ON THEMATIC REPRESENTATION

The paper describes a technology of multi-document summarization, based on news cluster topical structure, lexical cohesion modelling and thesaurus descriptions of lexical senses. Lexical knowledge helps to improve cohesion and recall of a summary and reduce repetitions.

Lashevskaya O. Kuznetsova Ju. University of Tromsø

RUSSIAN FRAMENET: TOWARDS A CORPUS-BASED DICTIONARY OF CONSTRUCTIONS

The paper presents our basic approach to creating a FrameNet-oriented resource for the Russian language, which involves extracting samples from the Russian National Corpus and adding a layer of semantic and syntactic annotation. We discuss the aims and methods of the project and give several examples of argument labeling in the dictionary and in the companion corpus.

Makhova A.A. Lyashevskaya O.N. Desyatova A.V.

NAMES OF BODY PARTS FROM THE VIEWPOINT OF TOPOLOGY

The paper describes Russian names of body parts through the notion of topological type as introduced by L. Talmy. The corpus analysis of collocations with adjectives of shape and dimension makes it possible to define a number of topological types of body parts, such as juts, rods, etc., and to identify some peculiarities of their spatial perception.

Mihkla M. Kiissel I. Nurk T. Piits L. Institute of the Estonian Language

TRANSCRIBING, STRUCTURING AND TEMPORAL ANALYSIS OF FLUENT SPEECH CORPUS FOR A UNIT SELECTION TTS SYSTEM FOR ESTONIAN

The paper reports the development of a speech corpus for Estonian text-to-speech synthesis based on unit selection. The process of transforming an orthographic Estonian text into a pronounced text, requiring the consideration of quantity, palatalization and other essential features of an Estonian pronounced text, is described. In order to optimize the unit selection algorithm and to guarantee the necessary quality of the synthetic speech the whole speech database is represented as a phonological tree. We present the evidence that the collocational strength shortens the duration of words and that contextual predictability is a significant feature to be considered in developing models of word duration.

Mitrofanova O.A. Saint Petersburg State University, Russia Zakharov V.P. Saint Petersburg State University; Institute of Linguistic Studies, Russian Academy of Science

AUTOMATIC ANALYSIS OF TERMINOLOGY IN THE RUSSIAN TEXT CORPUS ON CORPUS LINGUISTICS

The paper presents the results of semi-automatic analysis of terminology in the Russian text corpus on Corpus Linguistics. Special attention is given to extraction of one-word and multi-word terms as well as to the use of lexical-grammatical patterns in the description of term structure and contexts of use.

Mutalov R.O. Dagestan State University, Makhachkala

AN EXPERIENCE OF CREATION OF THE NATIONAL CORPUS OF DAGESTAN LANGUAGES

Problems and prospects of national corpora of six literary languages of Dagestan, created in the Dagestan State University, are considered. Special attention is given to the creation of a system of automatic markup of texts and digitization of printed texts.

Nedoluzhko A. Charles University, Prague, Czech Republic

COREFERENCE ANNOTATION IN PRAGUE DEPENDENCY TREEBANK

The paper presents the pattern for annotating coreferential relations on the PDT corpus. Three levels of annotation are discussed: annotating grammatical coreference (the antecedent is calculated according to the grammar rules of a given language); annotating textual pronominal coreference; an extended pattern for annotating nominal textual coreference and associative anaphora. The first two (grammatical coreference and pronominal coreference) have been annotated on the whole PDT corpus, whereas the nominal coreference and associative anaphora are currently in the focus of the author's research. Certain complicated cases are discussed and first results of the research are presented.

Nikolaeva Y.V. Lomonosov Moscow State University, Russia

SEGMENTATION OF ORAL NARRATIVE DISCOURSE AND ILLUSTRATIVE GESTURES: VISUAL CLUES AS SEGMENT MARKERS

The paper is devoted to the interrelations between speech-accompanying gestures and discourse structure. The main aim was to find out how different characteristics of illustrative gestures mark discourse segment boundaries.

Okatiev V.V. Erekhinskaya T.N. Skatov D.S. DICTUM Ltd., Nizhny Novgorod, Russia

MODELS AND METHODS OF PUNCTUATION USE IN RUSSIAN LANGUAGE SYNTAX PARSING

The paper describes the functional ambiguity of punctuation marks in the Russian language. A formal model of isolations and series of coordination members is presented. A mathematical problem statement for punctuation use in syntax parsing and an algorithm for this task are suggested.

Orlova S.V. Lomonosov Moscow State University, Russia

TRANSLATION OF GERMAN PARTICLE DOCH USED IN STATEMENTS INTO RUSSIAN (IN STATEMENTS): VED’, ŽE, VSE ŽE OR VSE-TAKI?

The paper is devoted to the comparative analysis of the semantics of the German particle DOCH in statements and its translation equivalents taken from German-Russian dictionaries - the Russian particles VED', ŽE, VSE ŽE and VSE-TAKI.

Ostapova I.V. Ukrainian Linguo-Information Fond, National Academy of Sciences of Ukraine

ETYMOLOGICAL DICTIONARY: LEXICOGRAPHIC STRUCTURE AND REPRESENTATION IN DIGITAL ENVIRONMENT

A technology for building an instrumental system for supporting the dictionary in digital environment was developed. The technology is based on a formal model of lexicographic system of etymological dictionaries. The main focus is given to mechanisms of language indexation.

Paducheva E.V. Institute of Scientific and Technical Information Russian Academy of Sciences

POSSESSIVES AND MANNER OF ACTION NOUNS: CORPUS BASED EXPLORATION

Possessives (i.e. possessive pronouns and adjectives) resemble the genitive, but a possessive Subject co-occurs with a genitive Object in the context of a verbal noun (мейерхольдовская постановка Ревизора), while genitive Subjects are not compatible with genitive Objects. Possessive-genitive diathesis serves as a diagnostic for NOUNS OF MANNER.

Palko M.L. Institute of Linguistics

PROSODY OF THE GERMAN VOCATIVE NPs IN CONTRAST TO THE RUSSIAN ONES

The prosody of German vocative NPs is discussed in contrast to the prosody of Russian vocatives. The analysis shows that German vocatives do not allow for the prestressed articulations that are highly characteristic of Russian vocatives used in unofficial and close contacts between the speaker and the listener, cf. MOLODOJ chelovek! with the wordform molodoj accented. Non-vocative NPs also demonstrate more restrictions on the formation of prestressed patterns, which seems to be a typological parameter of German and of most West European languages.

Partee Barbara H. University of Massachusetts, Amherst, MA, USA; Moscow State University

THE DYNAMICS OF ADJECTIVE MEANING

Meaning and context interact dynamically; how can one account for context-dependence without abandoning compositionality? We illustrate with the semantics of different kinds of adjectives. We show how compositional semantics sheds light on word meaning, and how compositional semantics, lexical semantics, and context all interact.

Pazelskaya A.G. ABBYY Software

DERIVATIONAL PATTERNS AND SYNTACTIC POSITIONS OF DEVERBAL NOMINALS (ON CORPUS DATA)

This paper is part of a general study of differences in the behaviour of Russian deverbal nominals derived via various patterns. The investigation is based on corpus data, mostly obtained from the Russian National Corpus. We study the preferences of nominals formed by the three most productive derivational patterns with respect to the syntactic position of the resulting nominal in a sentence.

Pereverzeva S. I. Russian State University for the Humanities, Russia

NONVERBAL COMMUNICATIVE ACT OF CONSOLATION: MATERIALS FOR A DICTIONARY OF NONVERBAL COMMUNICATIVE ACTS

The paper discusses some issues regarding the dictionaries of Russian speech acts and Russian nonverbal acts. I provide a preliminary draft of a dictionary entry “consolation” as an example of lexicographical description of nonverbal acts.

Petrenko M. Princess Dashkova Moscow Humanities Institute, Russia

ONTOLOGICAL SEMANTICS AND ABDUCTION: PARSING ELLIPSIS

New avenues for modeling abductive reasoning within the framework of Ontological Semantics are explored. Specifically, the rich knowledge resources and dynamic parsing module of Ontological Semantics allow elliptic input to be processed with a set of inference rules, which establish, on the one hand, dependencies between verbalized and non-verbalized case-roles across clauses and, on the other hand, dependencies between scalar attribute values and specific event classes. Examples are provided to illustrate each case.

Podlesskaya V.I. Russian State University for the Humanities Kibrik A.A. Institute of Linguistics RAS

THE ROLE OF DISCOURSE MARKERS IN LOCAL DISCOURSE STRUCTURE: A CORPUS STUDY

Based on a corpus of spoken narratives, the study shows how discourse markers can be differently integrated into local discourse structure: some can be used as a separate “minimal discourse unit”, while others are always integrated into a bigger unit with a propositional meaning. The two discourse markers most frequent in the corpus, VOT and NU, are compared and VOT is shown to be less integrated into prosodic, linear and hierarchic structure than NU.

Polyakov A.E. NTC «Informregistr» Bergelson M.B. Lomonosov Moscow State University, Russia Pilshchikov I.A. IMK of Lomonosov Moscow State University

LOMONOSOV CONCORDANCE – CONCEPT AND IMPLEMENTATION

This paper clarifies the concepts and terminology relevant to the development of a comprehensive digital concordance to the texts of Lomonosov, and discusses the practical decisions necessary for implementing this lexicographical product. The concordance is based on a corpus of the author’s texts supplied with structural, philological and grammatical markup. We describe the technology we use to build the corpus and the concordance, the principles of corpus markup, and the structure of concordance vocabulary entries, as well as its application to linguistic research.

Potapov M.V. Ryazan State Radio Engineering University

A SYNTACTICALLY INVARIANT METHOD FOR IDENTIFYING THE SEMANTICS OF INFORMATION

The report describes a practically tested method for estimating the semantic content of information streams. The method is based on statistical-linguistic primary processing of bit-level information and on pattern recognition approaches to the analysis of multivariate features.

Potemkin, S. Philological Faculty, Moscow State University, Russia

UNSUPERVISED PARSING

A statistical approach to parsing of raw text is described. The parsing algorithm builds a projective dependency tree in quadratic time after training on an unannotated corpus.

Prodan A.I. Korolkov E.A. Oparin I.V. Talanov A.O. Speech Technology Center, Russia

MULTI-TIER MARKUP OF SPEECH CORPUS FOR HYBRID RUSSIAN TTS SYSTEM “VITALVOICE”

The paper deals with the features of a system for multi-level markup of speech corpora. These corpora are used for the hybrid Russian TTS system “VitalVoice” developed at Speech Technology Center (STC). VitalVoice is basically a Unit Selection TTS system complemented with a triphone inventory. The main advantage of this approach is that it allows speech units to be retrieved from the speech corpus quickly and efficiently. The database consists of interrelated levels of markup (phrases, intonation models, words, syllables, etc.). The levels of markup, their use in the TTS system and automatic markup checking are described in detail.

Rakhilina E.V. Institute Of Russian Language, RAS Karpova O.S. Russian State University for Humanities Reznikova T.I. All-Russian Institute of Scientific and Technical Information, RAS

SEMANTIC-DERIVATIONAL MODELS OF POLYSEMOUS ADJECTIVES: METAPHOR, METONYMY AND THEIR INTERACTION

The paper reports on a project intended to provide a corpus-based description of semantic-derivational models for Russian adjectives. The research deals with high-frequency adjectives in the attributive use denoting the quality of a person or thing. We discuss basic metonymical and metaphorical patterns and analyze several non-regular shifts.

Rozina R.I. Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences

THE SO-CALLED: SEMANTIC ANALYSIS OF PARENTHETICAL METALINGUISTIC PHRASES

The paper is concerned with meaning and textual functions of a group of parenthetical phrases expressing the speaker’s attitude to the manner of speech. It is argued that their function is to ensure the transition in the text between different styles, the relation between which changes in the course of time, and that the meaning of these phrases is extended in the way that might be regular.

Romanov Aleksandr S. Mescheriakov Roman V. Tomsk State University of Control Systems and Radioelectronics

AUTHORSHIP IDENTIFICATION WITH SUPPORT VECTOR MACHINE IN CASE OF TWO POSSIBLE ALTERNATIVES

The authorship identification problem is viewed as a classification task. The importance of resolving the binary authorship classification problem for authorship identification is justified. A description and the results of an authorship identification experiment with a support vector machine in the case of two possible alternatives are given.

Ryko A.I. Stepanova S.B. Saint Petersburg State University, Russia

STRATEGIES OF DELIMITATION OF SYNTACTIC UNITS IN SPONTANEOUS SPEECH

The paper discusses methods of dividing spontaneous speech into syntactic units using the Corpus of Spoken Russian. We analyze individual strategies of experts who took part in the experiment, and examine connections between the boundaries of sentences and their final intonation.

Semenova S.Yu. INION RAS, Russia

ON ENCYCLOPAEDIC DATA IN AN APPLIED SEMANTIC DICTIONARY

Inclusion of information on ontological realities into a semantic dictionary, which is a trend in modern lexicography, corresponds to ideas of cognitive science with its focus on the wholeness of the information perception process. The paper is concerned with the encyclopaedic data within the NLP-aimed semantic dictionary that has the rigid formats for lexical data representation. Encyclopaedic functions in the RUSLAN machine semantic dictionary are considered. Some ways of loading and enhancement of the functions are discussed. A number of words and lexical classes relevant to certain types of encyclopaedic data are considered.

Sharapov R.V. Sharapova E.V. Murom Institute of Vladimir State University

AN ALGORITHM OF LINK SPAM DETECTION

Approaches to detecting spam links on the basis of the analysis of page content are considered. We focus on the detection of advertisement (paid) links. Features of paid links are analyzed. The algorithm of detecting a spam link is given.

Sharonov I.A. Russian State University for the Humanities

COMMUNICATIVES AND METHODS OF THEIR DESCRIPTION

Short dialogical utterances with fixed and vague grammatical structure are analyzed. We call these utterances “communicatives” and focus on the main principles underlying the classification of such language forms and on ways of analyzing them pragmatically and conversationally. To describe how a communicative functions in conversation, we need to clarify its semantic, formal and discursive characteristics, which include: the communicative intention or emotional state; the kind of speech act, direct or indirect, that the communicative represents; the source form of the communicative and the mode of its transposition into a communicative; the discursive boundaries with adjacent utterances; and the standard intonation patterns and other phonetic characteristics of the communicative in speech.

Shmeleva E.Y. Shmelev A.D. Institute of Russian Language, Moscow

VARIATION, CONTINUATION, AND SERIALITY OF JOKES: PROBLEMS OF DATABASE CONSTRUCTION

The paper deals with different kinds of joke variation and intertextual relations between jokes. We discuss such phenomena as realization of a joke, versions of a joke, continuation of a joke, modification of the original joke, addition to the original joke, series of jokes, joke cycle.

Sidorova E.A. Kononenko I.S. A.P.Ershov Institute of Informatics Systems, Russia

AN ONTOLOGY-BASED APPROACH TO FACT EXTRACTION

An approach is proposed to develop fact extraction technology applicable in information systems of various kinds. The approach makes use of the knowledge base including domain ontology, domain vocabulary, model for text segmentation, and fact extraction schemes that relate vocabulary items and lexical-syntactic constructions to ontology entities.

Skatov D.S. Erekhinskaya T.N. Okatiev V.V. DICTUM Ltd., Nizhny Novgorod, Russia

MODELS AND METHODS FOR THE ANALYSIS OF HIERARCHICALLY STRUCTURED TEXTS

The analysis of hierarchically structured texts (laws, standards, etc.) is discussed. An overview of developments in the domain is given. The models and methods developed for the analysis of hierarchically structured texts are described.

Sokolova E. G. Russian State University for Humanities, Moscow Kononenko I. S. Zagorulko Yu. A. A.P. Ershov Institute of Informatics Systems SB RAS, Novosibirsk

EXPERIENCE OF SYSTEMATIZING KNOWLEDGE AND INTERNET RESOURCES FOR A KNOWLEDGE PORTAL ON COMPUTATIONAL LINGUISTICS

The paper describes the experience of systematizing knowledge and Internet resources for a knowledge portal on computational linguistics. It considers the composition and structure of the portal's objects, the place of the portal among other catalogues on computational linguistics, and the development of a bilingual vocabulary of computational linguistics terms using procedures for automatic term extraction from text.

Aleksandra V. Ter-Avanesova Institute of Russian Language of Russian Academy of Sciences, Moscow Sergej A. Krylov Institute of Oriental Studies of Russian Academy of Sciences, Moscow & Institute of System Analysis of Russian Academy of Sciences, Moscow

THE USE OF LEXICO-GRAMMATICAL DATABASES IN THE RUSSIAN DIALECTAL LEXICOGRAPHY

The lexico-grammatical database (LGDB) for Russian folk dialects with two [o]-like phonemes, built with the help of the StarLing information system, has been significantly enriched. It now includes data on a Middle Russian dialect of the village of Pustosha (Shatura district, Moscow region) and an LGDB for Vologda suburban dialects, comprising about 30 thousand word-forms that represent about 4500 lexemes. The kernel dialectal corpus (KDC) contains texts with partial lexico-grammatical tagging.

Svetlana Timoshenko Leonid Cinman Institute for Information Transmission Problems, Russian Academy of Sciences

LEXICAL FUNCTIONS AND SEARCH ENGINE OPTIMIZATION (BASED ON WORDS WITH NUMERIC VALUES)

To provide more precise web search, we have developed a special option in the ETAP-3 multifunctional NLP environment. A search query consisting of two or three words is supplemented with the values of certain lexical functions to generate an incomplete sentence that lacks only the numerical information. We expect that this may help in searching for numerical data such as “The height of the Pisa tower”. The results of the experiment show that the search precision index in this domain of knowledge increases by 24% on average.

Tikhomirov I.A. Smirnov I.V. Institute for Systems Analysis of RAS, Moscow

APPLYING LINGUISTIC SEMANTICS AND MACHINE LEARNING METHODS TO SEARCH PRECISION IMPROVEMENT IN SEARCH ENGINE “EXACTUS”

The paper considers problems of using linguistic semantics and machine learning methods in the Exactus search engine. An experimental evaluation of search quality showed that these methods improve search precision and recall. Prospects of applying linguistic semantics and machine learning methods in search engines are discussed.

Liliya I. Tsirulnik United Institute of Informatics Problems, National Academy of Sciences of Belarus Svetlana G. Barbuk Minsk State Linguistic University, Belarus Boris M. Lobanov United Institute of Informatics Problems, National Academy of Sciences of Belarus

STATISTICAL ANALYSIS AND CONTEXTUAL RULES OF HOMOGRAPH DISAMBIGUATION ON TEXT-TO-SPEECH SYNTHESIS

Rules for locating the stress position in homographs, based on the results of contextual and statistical analysis of scientific and literary text corpora, are described. The implementation of the developed rules in the Russian TTS system "MultiPhone" improves how accurately the meaning of synthesized speech is understood.

Trub V.M.

ON THE PROBLEM OF VARIABILITY OF IMPERATIVE ASPECTUAL FORMS

The paper deals with the correlation between different aspectual forms of imperative verbs. We believe that one of the aims of semantic interpretation of inducements conveyed by different aspectual forms consists in the explication of semantic differences between them and the explanation of causes of irregularities reflected in the use of a form opposed to the default one.

Uryson E.V. Institute of Russian Language, Moscow

KAK BY (lit. ‘as if, like’) AND KONKRETNO (lit. ‘specifically’)

The semantics of Russian colloquial “parasitic” particles KAK BY (lit. ‘as if, like’) and KONKRETNO (lit. ‘specifically’) is described. The goal is to show that their emergence in the language is due to the lexical system of the language. KAK BY in its first meaning denotes similarity, and the words denoting similarity usually have a meaning denoting a set (a class). This is the way of “desemantization” of the conjunction KAK BY. The particle KONKRETNO develops its parasitic meaning by analogy with the word VOOBSHCHE (‘in general’); the cause is that some meanings of KONKRETNO are antonyms to some meanings of VOOBSHCHE.

Usacheva M.N. Lomonosov Moscow State University, Russia

MEANINGS OF THE PREPOSITIONS “PO” AND “K” IN RUSSIAN: ENCODING OF ADJUNCTS AND SEMANTIC ROLES

This work is devoted to the application of the spatial meaning description method (developed primarily for Dagestani languages but claimed to be typologically universal: see [Ganenkov 2002, 2005], [Mazurova 2007]) to Russian prepositions “po” and “k”.

Vasilyev V.G. Institute of Informatics Problems of the Russian Academy of Sciences

MARKUP OF TEXT FRAGMENTS DURING CLASSIFICATION

A comparative analysis of approaches to the selection of meaningful fragments of texts by using statistical methods of classification is presented. We consider new algorithms based on hidden Markov models covering the text by special hierarchical multiple fragments, as well as based on pre-segmenting the text into fragments without taking account of the information about the structure of classes.

Voskresenskiy A.L. Gulenko I.E. Khakhalin G.K.

RUSLED DICTIONARY AS TOOL FOR SEMANTIC STUDY

The use of a Russian Sign Language dictionary as an indicator of the various meanings of Russian words is described. This approach enables a more targeted analysis of context for word sense disambiguation.

Yanko T.E. Institute of Linguistics

RUSSIAN VOCATIVES: LEXICON AND CONSTRUCTIONS

According to Zwicky, semantically parallel NPs often have distinct vocative properties. Whether a given NP can be used as a call or an address is a matter of dictionary information. This paper analyzes a variety of specific vocative strategies and vocative constructions that change the vocative potential of lexical items.

Yavorskaya M.V. Azarova I.V. Saint Petersburg State University, Russia

STRUCTURING OF ATTRIBUTIVE WORD MEANINGS IN RUSSNET THESAURUS (IN RUSSIAN ADJECTIVES OF PERCEPTION)

Adjectives with perceptional meanings are described. We focus on the problem of attributive meanings structuring for computer thesaurus RussNet. 178 attributive word-meaning pairs are marked up in the random samples of corpus contexts. Attributes for different spheres of perception are compared.

Yudina M. Fedorova O.

SYNTACTIC AMBIGUITY RESOLUTION: PRIMING AND SELF-PRIMING EFFECTS

The report is devoted to the first experimental research on the influence of syntactic priming on syntactic ambiguity resolution of relative clauses in Russian. Within the frame of syntactic priming we can see two effects: the syntactic priming itself and self-priming (persistent preference of subject’s own syntactic strategy).

Zalizniak Anna A. Institute of linguistics, Russian Academy of Sciences

ON THE NOTION OF SEMANTIC SHIFT

The paper deals with the notion of “semantic shift” as a category of semantic typology and the unit of the “Catalogue of semantic shifts in the languages of the world”; it reflects some results of the work on a project, realized in the Institute of Linguistics, Russian Academy of Sciences, by a group of linguists (Anna A. Zalizniak, Maria Bulakh, Dmitriy Ganenkov, Ilya Gruntov, Timur Maisak and Maxim Russo). The problem of identification of semantic shifts in cases of syncretism (semantic generality) is discussed in more detail.

Zanegina N. N. The Institute of the Russian Language

I’VE NEVER TOLD THAT: ABOUT LITURATIVES, STRIKEOUTS OR IMAGINARY TEXTS

This paper deals with linguistic peculiarities of strikeout texts - their semantics and syntax. These texts are very often used in Internet communication.

Zakharova I.V. Gorodechnyj P.P. Chelyabinsk State University, Dept. of mathematics

AN APPROACH TO AUTOMATED ONTOLOGY BUILDING IN TEXT ANALYSIS PROBLEMS

An approach to automatically building an ontology for complex tasks of full-text document classification using UDC is discussed.

Zimmerling A.V. Moscow State University for the Humanities, MGGU

ZERO CATEGORIES IN UNIVERSAL GRAMMAR

The paper discusses the status of zero categories in general syntax. The taxon ‘pro’ is not sufficient for tagging all covert pronouns in finite clauses. Moreover, the notion of ‘discourse pro-drop languages’ is not a valid tool in syntactic typology. Discourse-linked dropping of anaphoric pronouns, coreferent deletion and the constraint on overt realization of pro-forms are different syntactic operations. More specifically, I challenge some points in Holmberg’s analysis of Finnish pro and claim that 1st and 2nd person pro-forms regularly display features different from 3rd person pronominal zeros. Finally, I discuss the status of ‘Mel’čuk’s zeros’, e.g. theta-role sensitive zero lexemes, and argue that theta-role sensitive zero pronouns with an Agentive value and theta-role neutral pro-forms may coexist in one and the same language.

Zobnin A.I. Lomonosov State University, Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences Sakharova A.V. Russian Language Institute n.a. V.V.Vinogradov, Russian Academy of Sciences

UNIVERSAL SYNTAX ANNOTATION SYSTEM OBJECTATE

The object model and the features of the Universal text annotation system ObjectATE are described. This system is used in Vinogradov Institute of Russian language of RAS for semimanual morphological and syntactical annotation of ancient manuscripts. It allows the user to define his own annotation models by describing classes, add-ins, fields and relations in the metadata layer (for example, for syntax markup).
