Proceedings 2008

A
Azarova I.V., Grebenkov A.S., Lando T.M. Saint-Petersburg State University
THE CONTEXT SCHEMA OF PREDICATE ARGUMENTS FOR AUTOMATIC EXPANSION OF A DOMAIN ONTOLOGY
The paper describes the fact-mining system Factus, a prototype oriented to a restricted domain that can be widened. The problems of representing a domain ontology and extending it on the basis of features extracted during text processing are discussed, with the aim of building a so-called “open concept frame”. Text analysis is carried out by means of special structures that include syntactic arrangements of predicate arguments; their context markers are used for pattern validation.
Apresjan V.Ju. Institute of Russian Language, Moscow
RUSSIAN AND ENGLISH EMOTIONAL CONCEPTS
The paper outlines a new method of cross-linguistic comparison of emotion concepts, in which entire emotion “clusters” rather than individual terms are juxtaposed. The method is applied to eleven emotion clusters in Russian and English. The paper considers both universal semantic tendencies and specific linguistic means in the expression of emotion, and proposes tentative explanations for the observed cross-linguistic similarities and discrepancies.
Apresjan Ju. D. Institute of Russian Language, Moscow
ON A PROJECT OF A PRODUCTION DICTIONARY OF RUSSIAN
The paper is concerned with a project aimed at creating a production dictionary of contemporary Russian. Work on the project started in 2006 at the Russian Language Institute of RAS. The main idea of the dictionary is to present a complete and unified account of all linguistically relevant properties of each lexical unit. Apart from grammatical forms and senses they include a) regular semantic modifications of the dictionary definition in verifiable contextual conditions, b) detailed government patterns and their possible modifications, c) a list of minor type sentences specific for a given lexical unit, d) its combinatorial potential (especially as handled by the theory of lexical functions), e) its lexicalized prosody. All these make an integral part of the linguistic competence of speakers and should be characterized on the basis of the latest theoretical findings of linguistic research in the respective fields.
Akhmetova М.V. Journal “Zhivaia Starina”
REGIONAL VARIANTS OF THE URBAN REALTY TERMS
The paper deals with Russian regional terms describing urban realty: names of different types of apartment houses and flats (depending on the time of building, planning, material, etc.). These words, as a rule, are rarely included in explanatory dictionaries, except for some colloquial words that are normative for the speech of Moscow and St. Petersburg citizens. The research was carried out on the materials of the Integrum database, which includes mass media publications and public documents from the whole Russian-speaking space. Statistics on mentions of these terms in regional and central public documents help to draw preliminary conclusions about their areal distribution.
B
Baranov A.N. Institute of Russian Language, Moscow
AGAINST DECOMPOSITION OF MEANING: RECOGNITION IN SEMANTICS OF IDIOMS
The report discusses the problem of representing inner form in definitions of idioms. Decomposition of meaning cannot be used for semantic description of non-discrete semantic phenomena such as metaphor and image. For the semantic representation of inner form, it is proposed to use the strategy of metaphor recognition. In the definition of an idiom, the process of recognition is supported by a semantic “trigger”, which generates the necessary chain of associations.
Belikov V.I. Institute of the Russian Language
LEXICOGRAPHY OF PROVERBS
The dictionaries of Russian proverbs are analyzed with respect to their repertory and the selection of the main variant of items.
Bogdanov A.V. Moscow State University
ORTHOGRAPHY IN THE INTERNET: THE ANALYSIS OF ONE MISSPELL
In the paper we discuss orthography on the Internet and analyse a widespread misspelling: writing the soft sign in the ending of verb forms containing the suffix -s’a (-ся), as in delaet’s’a (делаеться). Our analysis shows that the number of such misspellings allows us to speak of a kind of new standard in the written language of the Internet.
Bogdanova N.V. Brodt I.S. Kukanova V.V. Pavlova O.V. Sapunova E.M. Philippova N.S. St. Petersburg State University
THE CORPUS OF SPOKEN RUSSIAN: DESIGN PRINCIPLES AND APPROACHES TO DATA ANALYSIS
The paper reports principles for balancing the corpus of spontaneous monologues in the Russian language collected according to shared linguistic and sociolinguistic parameters. It presents samples of collected data, benefits of multilevel analysis and perspectives of further augmentation.
Borschev V.B. VINITI RAS & UMass
“JA NE BYL…MENJA NE BYLO..” OR HOW MANY DIFFERENT BYT’ (BE) IN RUSSIAN
This work introduces and analyzes the Russian example Ja ne byl v zale, kogda vyklučili svet ‘I wasn’t in the hall when they turned out the lights.’ This example refutes Ju.D. Apresjan’s claim that sentences of that kind cannot have a “synchronous” interpretation. Various meanings of the verb byt’ ‘be’ in locative and existential sentences are discussed.
Braslavski P.I. Sokolov E.A. Institute of Engineering Science UD RAS, Ekaterinburg
COMPARISON OF FIVE METHODS FOR VARIABLE LENGTH TERM EXTRACTION
The paper investigates and compares five methods for variable length term extraction and assembling. Experiments are conducted on a corpus of scientific papers on genetics and microbiology. An evaluation method combining expert and formal assessment is proposed, and the results of a comparative evaluation of the methods are presented.
Buzikashvili N.E. Institute of System Analysis, Russian Academy of Sciences
MULTITASKING SEARCH: FACT ARTIFACT, NEGLIGIBLE EXCEPTION?
The paper considers search on the Web. Questions on the users’ manners of search are formulated, with emphasis on multiple tasks execution. It is shown that multitasking is rare, usually includes only two task sessions and is formed into a temporal inclusion of an interrupting task into the interrupted one. Quantitative characteristics of search behavior in 3 classes of temporal sessions (single-task session, several tasks executed one-by-one, and multitasking session) were compared, and significant differences were revealed.
D
Dessiatova А.V. Russian State University for Humanities Lashevskaja О.N. V.V.Vinogradov Institute of the Russian Language Mahova А.А. Moscow State University
DESCRIBING SHAPE: INSTRUMENTAL CONSTRUCTION «X Y-ОМ»
The paper analyzes the semantics of Russian instrumental construction with the meaning of shape (xvost kol’com ‘ring tail’, slozhit’ gubki bantikom ‘to make Cupid’s bow’). Spatial interpretation of this construction is described in terms of topological classes (Talmy 2000, Rakhilina 2000). Possible mirroring of topological classes in both slots X and Y is investigated as well as their predictable mutual accommodation.
Dobrovol’skij D.O. Russian Academy of Sciences, Russian Language Institute Padučeva E.V. Russian Academy of Sciences, Institute of Scientific and Technical Information
DEIXIS WITHOUT SPEAKER: TOWARDS THE SEMANTICS OF THE GERMAN DEICTIC ELEMENTS HIN AND HER
Semantics of deictic words can be analysed more efficiently if we take the communicative situation of the utterance into account. Traditionally, the semantics of the German deictic elements hin and her was described as being orientated towards the speaker – towards the speaker’s place and time. However, this is true only for the canonical communicative situation, when speaker and hearer are both in the same place. In non-canonical situations (when speaker and hearer are not in the same place) and especially in contexts of hypotaxis or narrative, the speaker may be deprived of his “deictic privileges”, which are then transferred to some other persons.
Druzhkin K.Ju. Tsinman L.L. Institute for Information Transmission Problems
THE PARSER OF ETAP-3 LINGUISTIC PROCESSOR:
An attempt is made to optimize the operation of the parser in the ETAP-3 linguistic processor. The idea is to change the parsing rules in such a way that the emerging syntactic hypotheses are ranked according to the probabilities of their appearance in the resulting syntactic tree of the sentence being processed. Experimental results are given.
E
Ermakov А.Е. RCO, Moscow
AUTOMATIZATION OF AN ONTOLOGICAL ENGINEERING FOR SYSTEMS OF KNOWLEDGE MINING IN TEXT
The present report is devoted to the problems of using ontologies in text mining systems. Peculiarities of ontologies used in such systems are examined. A method for automatic ontology generation, in which terms of data domain and relations between them are initially detected by means of computer analysis of the text, is proposed.
F
G
Gelbukh A.1 Sidorov G.1 Lara-Reyes D.1, Chanona-Hernandez L.2, Chubukova M.3 1 Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Av. Juan Dios Batiz, s/n, Zacatenco, 07738, Mexico City, Mexico 2 ESIME Zacatenco, National Polytechnic Institute, Zacatenco, 07738, Mexico City, Mexico 3 Philological education department, Moscow Institute of Continuous Education, Moscow, Russia
GENETIC ALGORITHM FOR AUTOMATIC DIVISION OF WORDS INTO MORPHEMES
The paper discusses an unsupervised technique for automatic detection of the morpheme structure of words in inflectional languages, using Spanish as a case study. We use global optimization implemented as a genetic algorithm, without any heuristics or assumptions that affect the problem dimensions a priori. A description of the genetic algorithm is given, and preliminary evaluation results are presented. The input data is a list of words compiled on the basis of a dictionary or a corpus. The output data is the same list with the words separated into morphemes. Like many other automatic methods, this algorithm does not claim to produce fully correct results and requires postprocessing. Still, it allows for fast detection of tendencies in the data and for obtaining preliminary results without manual work.
Gerassimenko O. University of Tartu, Estonia
TWO MEANINGS, TWO LINGUISTIC ITEMS? RUSSIAN „AHA” IN SPONTANEOUS DIALOGUE
The Russian dialogue particle „aha” can express either agreement/confirmation or surprise/satisfaction. In Russian lexicography these meanings are mostly presented as being expressed by homonymous linguistic items, the particle and the interjection. The paper examines examples of „aha” in spontaneous institutional dialogues and discusses the possibility of finding a common component of meaning.
Grashchenkova A.E. Russian State University for Humanities
RUSSIAN ADJECTIVE PHRASE: SPLIT AP HYPOTHESIS
The paper presents a minimalist approach to the structure of the Russian Adjective Phrase (AP hereafter). The puzzling properties of predicative long form (LoF) adjectives with complements are the starting point of the paper. To explain the distribution of complement-taking adjectives, we suggest a multi-layered structure of the adjectival phrase. The internal A is a lexical head that surfaces as a short form (ShF) adjective. The external small a is a functional head responsible for case concord of attributive LoFs. The chief claim concerns the properties of lexical A heads in Russian. These ShF phrases: (i) are the locus for argument merging; (ii) project their own Spec position; (iii) do not assign structural case; (iv) allow eventive (stage-level) interpretation. At the same time, the external LoF a-shell lacks all these properties and is responsible for case concord in noun phrases. As for the constraint on complementation attested with Russian predicative LoF adjectives, we suppose that it is due to two facts: the “defective” structure of nominative predicates on the one hand and the elaborated shell structure of the adjectival phrase on the other. In such constructions the subject of the lexical AP “has not enough time” to raise to Spec, IP and activate the case feature on I, which subsequently should be transmitted to the a head through Pred. This conflict does not arise in the case of instrumentals (assigned by Pred) and ShF (no case assignment). Then, case features on LoF do not influence its complement-taking potential in attributive function and in secondary predication. We ascribe the grammaticality of attributive and secondary instrumental / nominative LoFs to the fact that such adjectival phrases are control structures and the case value does not depend on internal subject raising.
The proposed analysis is supported by several other properties of LoF and ShF: distribution of symmetric predicates, stage/individual-level interpretations, properties of derived nominals and others.
Grishina E.A. Savchuk S.O. Institute of Russian Language, Moscow
CORPUS OF ORAL RUSSIAN IN THE FRAMEWORK OF RUSSIAN NATIONAL CORPUS. CONSTRUCTION PROJECT
The paper describes the construction project “Corpus of Oral Russian”, which may be created on the basis of the Movie Sub-Corpus of the Russian National Corpus. The authors offer some solutions to the problems concerning the structure of the Corpus, the types of the annotation, the format of the issues, the types of the queries, and the variety of the tasks which may be posed and solved by the use of the Corpus.
I
Iagounova E.V. St.Petersburg State University
SET OF RECOGNIZABLE WORDS AS COMPRESSION TEXTS (WITH COMPARISON OF KEY-WORD SET)
The main characteristics of the set of recognizable words (in the perception of text presented in white noise) are described in terms of text compression and compared with a key-word set. The results of reconstructing a text from this set of words are analyzed in order to identify the main characteristics of the set. One of the main findings is the dependence of the sense structure of a text on the following text parameters: professional vs. fiction and dynamic vs. static.
Iomdin B.L. Russian Language Institute
THE IDEA OF MATCHING NAMES IN RUSSIAN
The paper deals with metalanguage lexical units that convey certain relations between the names of different objects: the Russian units одноимённый ‘of the same name, cognominal’ (and its derivatives) and так и называется ‘called exactly this way’. Such items are difficult to interpret in NLP applications. Lexicographic definitions are proposed based on a number of key senses identified by the author: ideas of coincidence, correspondence, and simplicity.
Iomdin Leonid L. Institute for Information Transmission Problems, Russian Academy of Sciences
IN THE DEPTHS OF MICROSYNTAX: A LEXICAL CLASS OF SYNTACTIC IDIOMS
A class of Russian syntactic idioms is considered from the theoretical and NLP points of view. The class, formed with the noun сила ‘force, power’, consists of a variety of lexical units with surprisingly individual peculiarities. Examples of this class include (1) a preposition в силу ≈ ‘by virtue of’, as in В силу этой теории поведение в одной точке вселенной влияет на поведение в другой точке ‘By virtue of this theory, the behaviour in one point of the universe influences the behaviour in another point’; (2) an adverb of degree от силы ‘at the most’, as in от силы десять человек ‘ten people at the most’; (3) an adverbial pattern в X-овую силу ‘using such and such part of one’s force’, as in работает в полную силу ‘he works to the full extent of his power’, работает в треть силы ‘he works using a third of his force’; (4) a predicative adverb в силах 1 ≈ ‘being able’, as in старик был не в силах быстро ходить ‘the old man was unable to walk fast’; (5) a predicative adverbial pattern в (чьих-либо) силах 2 ‘within one’s power’, as in сдержать смех было не в моих силах ‘to contain laughter was beyond my powers’. Specific descriptions of several of these idioms are given using a specially designed standard layout.
K
Kibrik A.A. Institute of linguistics, Russian Academy of sciences
SPEAKER’S PROSODIC PORTRAIT AS A TOOL OF SPOKEN DISCOURSE TRANSCRIPTION
A methodological tool is proposed that enhances the quality of discourse transcription in the course of preparing corpora of spoken language. Prosodic prototypes underlying discourse segmentation and the expression of phasal meanings can be identified with the help of prosodic portraits of individual speakers.
Kobozeva I.M. Orlova S.V. Lomonosov Moscow State University
UNICELLULAR ORGANISMS OF COMMUNICATION UNDER A MICROSCOPE: GERMAN PARTICLE JA VERSUS ITS RUSSIAN TRANSLATION EQUIVALENTS VED’ AND ŽE
In the paper the German modal particle JA in constative utterances is compared to its Russian translation equivalents VED’ and ŽE on the basis of parallel samples of modern German prose and its professional translations into Russian. The analysis reveals the following differences: 1) VED’ presupposes its proposition as a fact while JA and ŽE do not, which explains the ability of the latter two to be freely used in imperatives; 2) VED’ and ŽE specify the degree of rhetorical activity (≈ intensity of illocutionary force) as normal and high respectively, while for JA this semantic feature is irrelevant, which makes the choice of its translation equivalent dependent on such pragmatic features of the context as its relation to the speaker’s interests and the interpersonal relations among the interlocutors; 3) JA can be used in responses implying yes / no answers to direct questions, while VED’ and ŽE cannot occur in this context; 4) the use of VED’ in answers requires the dictal component of its propositional content to be different from that of the question; 5) VED’ cannot occur in correcting remarks and direct answers unless it is preceded by an initial adversative particle (NO, A, DA). Its use together with one of these particles overtly marks the response as conflicting with some of the addressee’s initial assumptions and thus as violating the maxim of consent, so in some cases it may damage the semantic equivalence of the translation with respect to the interpersonal aspects of utterance meaning.
Kodzasov S.V. Arkhipov A.V. Zakharov D.M. Krivnova O.F. Moscow State University
DATABASE "INTONATION OF RUSSIAN INFORMATIONAL TEXTS"
The development of a data base for intonation of oral mass-media texts is now in progress at the Philological Faculty of Moscow University. A highly detailed system for sentence prosody description is used. Great differences are found between the use of prosodic means in informal dialogues and in informational texts in TV-programs.
Kozhunova O.S. Institute for Informatics Problems of the Russian Academy of Sciences
CLASSIFICATION SCHEME OF THE SEMANTIC DICTIONARY OF THE MONITORING SYSTEM: TEST APPLICATION TO EVALUATION OF SCIENTIFIC WORK’ PERFORMANCE
A brief description is given of the experiment on the evaluation of scientific work performance carried out in the Russian Academy of Sciences in 2007. At its final stage the experiment revealed several problems. A method and an instrument for solving them are suggested: a classification method and a semantic dictionary with an integrated classification scheme, respectively.
Kozlova A.V. Lutikova E.A. Fedorova O.V. Lomonosov Moscow State University
‘CAUSE’ OR ‘ENABLE’: ANALYSIS OF CAUSATIVE VERBS SEMANTICS
In this paper, data from an experimental investigation of the semantics of Russian causative verbs are presented. The investigation was conducted in the framework of Force dynamics theory. We distinguish the concepts of CAUSE, ENABLE, and PREVENT depending on the correlation of three main parameters of the causative situation: 1) the tendency of the patient for a result, 2) the presence of opposition between the affector and the patient, and 3) the occurrence of a result.
Komarova A.D. Russian State University for Humanities
PAUSES ON THE DIFFERENT TYPES OF SYNTACTIC BOUNDARIES IN JAPANESE: A CORPUS STUDY
The present research is concerned with pauses at different syntactic boundaries in oral monologue speech in Japanese. It aims to find out how frequent, and therefore how probable, pauses are at the boundaries of sentences and of clauses smaller than sentences, and what their “normal” length is.
Korotaev N.A. Podlesskaya V.I. Russian State University for Humanities
PROSODY OF CLAUSE-COMBINING IN RUSSIAN: A CORPUS-BASED CASE-STUDY
The paper reports a corpus-based study of prosodic strategies employed in multiclausal structures with a postpositioned dependent clause in spoken Russian. Three main strategies are discussed: (1) the pitch direction at the primary accent in the main clause is opposite to that in the dependent clause, (2) the pitch direction at the primary accent in the main clause copies that in the dependent clause, and (3) the main clause remains non-accented. Quantitative and qualitative analysis is provided to explain the speaker’s choice between the three strategies.
Kotov A.A. Russian State University for Humanities
CONTROLLING DYNAMIC SPEECH BEHAVIOUR OF VIRTUAL COMPUTER AGENTS
We present and discuss a model to control the speech behaviour of a virtual computer agent (computer game agent, interface component or, in the future, mobile robot). The model simulates “mood dynamics”, which controls the agent’s behaviour in communication. In particular, the model uses a set of phrasal templates to construct short monologues, revealing the dynamics of the agent’s “feelings” and allowing the agent to switch between several dialogues in a communication.
Kreydlin G.E. Russian State University for Humanities
MECHANISMS OF INTERACTION BETWEEN VERBAL AND NONVERBAL UNITS IN A DIALOG II B. DEICTIC GESTURES AND SPEECH ACTS
An academic lecture regarded as a kind of dialog is a suitable experimental ground for studying general regularities and specific rules of gesture-speech interrelation and human interaction. In the first part (part II A) of the research a classification of didactic deictic gestures was compiled and some classes of these gestures were described. In this part (part II B) of the research I intend to demonstrate that deictic gestures of each type have their own non-trivial relations with the verbal and nonverbal signs in a dialog.
Krylov Sergej A. Institute of Oriental Studies, Russian Academy of Sciences, Moscow, Russia Institute of Systemic Analysis, Russian Academy of Sciences, Moscow, Russia
EVALUATING OF FREQUENCY OF SYNTACTIC MOLECULES (ON THE EVIDENCE FROM THE RUSSIAN GENERAL CORPUS)
An attempt is made to evaluate the frequency of syntactic molecules (= minimal autosemantic sentence parts, able to serve as answers to a question) on the evidence from the Russian General Corpus (created on the basis of the Uppsala Corpus) with the help of the StarLing database processing software package.
Krylova T.V. Institute of Russian Language of Russian Science Academy
БЛАГОРОДНЫЙ: LANGUAGE CONCEPTION OF CONNECTION BETWEEN INTERNAL QUALITIES AND BIRTH OF PERSON
The objects of this article are the words благородный and великодушный. First, we describe the difference in their semantics and try to establish the connection between the meaning of благородный and its internal form. Then, the polysemy of благородный is examined. Finally, we analyse the meaning of the lexemes благородный 3.1 and 3.2 (благородное лицо, благородное животное) and formulate the hypothesis that the conception of a connection between a person’s internal qualities and birth is still preserved in the modern language.
Kryuchkova O.I. Goldin V.E. Saratov State University N.G. Chernyshevkij
TEXTUAL DIALECT CORPUS AS A MODEL OF TRADITIONAL RURAL COMMUNICATION
The report deals with the principles of organization and methods of building a multimedia textual dialect corpus, representing dialect as a comprehensive whole of cultural and communicative features and modeling the communication of specific speech groups in specific social and cultural environment.
Kudashev I.S. Kudasheva I.O. University of Helsinki
USEFUL EXTENSIONS TO TRANSLATION-ORIENTED TERMINOLOGICAL DICTIONARIES
In this article, we describe some useful extensions to translation-oriented terminological dictionaries using as an example two dictionaries compiled at the University of Helsinki, Palmenia Centre for Continuing Education in Kouvola, in 2003–2007. These dictionaries are mostly descriptive but they contain some elements which are usually characteristic of normative dictionaries, such as restrictive labels, strict terminological definitions, and concept charts. Special attention is paid to translator-friendly techniques, such as explicit marking of partial and artificial equivalents and explanation of the differences between concepts in the source and target languages.
Kuznetsov I.P. Institute for informatics problems of the RAS Efimov D.A. Synergetics Systems
LINGUISTIC PROCESSOR “SEMANTIX” FOR KNOWLEDGE EXTRACTION FROM NATURAL TEXTS IN RUSSIAN AND ENGLISH
The paper considers the linguistic processor “Semantix” for automatic formalization of natural language texts in several fields: criminal reports, autobiographies, and texts about terrorism. The processor extracts from texts the user objects, their links and facts of object actions. The results are XML files, which are used for Knowledge Base organization, semantic search and analytic tasks.
Kustova G.I. Moscow State Pedagogical University
About «non-nominative» dictionaries (lexical databases)
This report deals with a project for a dictionary (lexical database) that includes «non-nominative» items used as adverbial modifiers (e.g. на ходу, под предлогом (чего), во всяком случае).
L
Lande, D.V. Brajchevskiy, S.M. Darmokhval, A.T. Morozov, A.Y., ElVisti Information Center, Ukraine, Kiev
WEB-SPACE AND MATERIALS OF NEWS AGENCIES
In this article we investigate to what extent materials available to paying subscribers are openly published on web sites. We obtained the distribution of news agencies’ messages by delay time. We also measured the specific quantity of reprints of the news agencies’ materials on web sites as well as of Internet messages included in the agencies’ news lines.
Levontina I.B. Institute of Russian Language, Moscow
The riddles of the Russian particle uzh
The Russian discourse particle uzh is very difficult to describe. It produces manifold pragmatic effects, and it is unclear how these effects are connected with the components of its meaning. The paper is devoted to some of these components and the discourse effects they cause.
Litvinenko A.O. Moscow State University
ASYNDETON AND COLON. TRANSCRIBING SPOKEN NARRATIVE
The paper is devoted to a closed class of Russian asyndetic composite sentences that require the use of colon in written language and are characterized by a special intonation in spoken language. The problems that arise while transcribing such sentences in spoken narrative are discussed.
Lobanov B.M. United Institute of Informatics Problems, National Academy of Science of Belarus
AN ALGORITHM OF TEXT SEGMENTATION ON SYNTACTIC SYNTAGMAS FOR TTS SYNTHESIS
An algorithm is suggested for segmenting text into syntactic syntagmas, based on the analysis of the stable phraseological and grammatical-semantic word combinations that make up the sentence. The point of identifying such word combinations in a sentence is that they limit the freedom of its division into syntagmas: a syntagma boundary can fall only outside word combinations, not inside them.
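The boundary constraint stated in this abstract can be illustrated with a minimal Python sketch. It is not the author's algorithm: the function name `allowed_boundaries`, the span format and the English toy sentence are assumptions made purely for illustration, and the detection of the word combinations themselves is taken as given.

```python
def allowed_boundaries(tokens, combinations):
    """Return gap indices i (the gap between tokens[i-1] and tokens[i])
    where a syntagma boundary may fall.

    `combinations` is a list of (start, end) token spans, end exclusive,
    marking word combinations that must not be split. This input format
    is hypothetical and chosen only for the sketch.
    """
    blocked = set()
    for start, end in combinations:
        # Every gap strictly inside a word combination is forbidden.
        blocked.update(range(start + 1, end))
    return [i for i in range(1, len(tokens)) if i not in blocked]


tokens = ["the", "old", "man", "walked", "very", "slowly"]
spans = [(0, 3), (4, 6)]  # "the old man", "very slowly"
gaps = allowed_boundaries(tokens, spans)
```

Here only the gaps around "walked" remain available as syntagma boundaries; which of the allowed gaps is actually chosen would depend on further rules that the abstract does not describe.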
Lobanov B.M. Tsirulnik L.I. Sizonov O.G. United Institute of Informatics Problems of the National Academy of Science of the Republic of Belarus
«INTOCLONATOR» - A COMPUTER SYSTEM FOR PROSODIC SPEECH PARAMETERS CLONING
A computer system for cloning prosodic speech parameters is described. The system makes it possible to automate the creation of the complex prosodic portraits necessary for TTS synthesis. It is intended to widen the inventory of prosodic portraits for personalized speech synthesis from texts of various genres.
Loukachevitch N. Dobrov B. Chuyko D. Research Computer Center Moscow State University NCO Center for Information Research
AUTOMATED ANALYSIS OF MULTIWORD EXPRESSIONS FOR COMPUTATIONAL DICTIONARIES
In the paper we describe the development of an automated system for the analysis of multiword expressions that facilitates the discovery of specific features of their syntactic and semantic behaviour. The analysis is based on automatic comparison of the component structure of expressions and uses the knowledge described in a thesaurus-like linguistic resource. At present we are testing the system in the process of term acquisition for the Ontology on natural sciences and technologies.
Lashevskaja O.N. Institute of Russian language, Moscow Sharoff S.A. University of Leeds, United Kingdom
FREQUENCY DICTIONARY OF THE RUSSIAN NATIONAL CORPUS: PRINCIPLES AND TECHNOLOGY
The frequency dictionary represents the base lexicon of contemporary Russian (1950–2005); it gives information about word frequency in actual use and provides frequency comparisons between different functional styles and periods of text creation. The dictionary is based on texts of the Russian National Corpus (100 million words).
M
Olga V. Mitrenina State University of St.-Petersburg
SYNTAX OF CORRELATIVE CONSTRUCTIONS IN RUSSIAN: A GENERATIVE APPROACH
Barriers between the correlative clause and the main clause in correlative constructions in Russian are described. It is also shown that correlatives do not reconstruct in Russian. The preliminary syntactic structure of Russian correlatives is suggested, that involves the position of topic and/or focus.
Mitrofanova O.A. Belik V.V. Kadina V.V. Saint-Petersburg State University
CORPUS ANALYSIS OF SELECTIONAL PREFERENCES OF FREQUENT WORDS IN RUSSIAN
The paper presents results of a corpus-based study of selectional preferences of frequent Russian lexemes. Research procedure requires analysis of co-occurrence data obtained from Russian texts. It is implied that selectional preferences of a lexical item may be defined through sorting its left/right neighbours in bigrams by MI-score values. Given an ordered set of neighbours for a lexical item, it is possible to induce its context patterns. Selectional preferences are specified with respect to morphological and semantic features of co-occurring lexical items.
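The ranking step mentioned in this abstract (sorting the neighbours of a word in bigrams by MI-score) can be sketched in Python as follows. The abstract does not say which MI variant the authors use; the sketch assumes the common form MI = log2(f(x,y)·N / (f(x)·f(y))), and the function name, the English toy corpus and the absence of a frequency threshold are likewise illustrative assumptions.

```python
import math
from collections import Counter


def right_neighbours_by_mi(tokens, target):
    """Rank the words that directly follow `target` by MI-score.

    Hypothetical helper: MI = log2(f(x,y) * N / (f(x) * f(y))),
    where N is the corpus size in tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (left, right), f_xy in bigrams.items():
        if left != target:
            continue
        scores[right] = math.log2(f_xy * n / (unigrams[left] * unigrams[right]))
    # Highest MI first: the most strongly associated neighbours lead the list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = "strong tea and strong coffee and weak tea and strong tea".split()
ranking = right_neighbours_by_mi(tokens, "strong")
```

In this toy corpus "coffee" outranks "tea" although "strong tea" is the more frequent bigram, because MI favours rare words; this is one reason practical studies usually combine MI with a minimum frequency cut-off.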
Mitrofanova O.A. Panicheva P.V. Saint-Petersburg State University; Lashevskaja O.N. Institute of Russian Language, Moscow
STATISTICAL WORD SENSE DISAMBIGUATION IN CONTEXTS FOR NAMES OF PHYSICAL OBJECTS
The paper presents experimental results on automatic word sense disambiguation. Contexts for Russian nouns denoting physical objects extracted from the National Corpus of the Russian Language serve as an empirical basis of the study. Optimal conditions for WSD are defined taking into account lexical markers of word meanings in contexts and semantic annotation of contexts.
Mikhailov M.N. Isolahti N.B. University of Tampere, Tampere, Finland
THE INTERPRETING CORPUS AS A NEW TYPE OF TEXT CORPUS
The paper discusses the principles of compiling interpreting corpora, with a corpus of court interpreting as an example. Such a corpus combines a spoken corpus with a parallel corpus. The tagging should reflect communicative, prosodic, and extralinguistic information. Interpreting corpora are a valuable source of data for multidisciplinary research.
Muravenko E. V. Russian State University for the Humanities
ON THE DICTIONARY OF CHANGES IN RUSSIAN LANGUAGE GOVERNMENT
The report argues for the need to compile a new specialized dictionary reflecting changes in Russian language government over the period from the early 19th century to the present day. The author presents a concise list of principles underlying such a dictionary and introduces a sample dictionary article for the verb skuchat’.
N
Nedoluzhko А. Hajič J. Co. Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
THE PRAGUE DEPENDENCY TREEBANK
The Prague Dependency Treebank 2.0 (PDT 2.0) contains a large amount of Czech text with complex and interlinked morphological (2 million words), syntactic (1.5 MW) and complex semantic annotation (0.8 MW); in addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 is based on the long-standing Praguian linguistic tradition, adapted for current research needs in Computational Linguistics. The corpus itself uses the latest annotation technology. Besides the large corpus of Czech, a corpus of Czech-English parallel resources (The Prague Czech-English Dependency Treebank) is being developed. English sentences from the Wall Street Journal and their translations into Czech are being annotated in the same way as in PDT 2.0. This corpus is suitable for experiments in machine translation, with a special emphasis on dependency-based (structural) translation. In the report, the basic annotation scheme is presented, with special reference to the complex semantic (tectogrammatical) level. The system of syntactic functors and the valency lexicon VALLEX are also discussed.
O
Oja, Anni Tallinn University This work is supported by the ETF grant no 6147.
Choosing language in Internet conversations between Russians and Estonians
The current study examines interlingual communication on the Estonian web portal rate.ee. First conversations between Estonians and Russians are examined to identify the factors behind the choice of language for the first conversational act (conversations are normally strings of picture comments). Most of these factors are related to the situation (who the participants are, which language is more comfortable to communicate in, what the purpose is), but some things are learned unintentionally via the community of practice: generally, environment-related unwritten rules of politeness and polite language choices with suitable vocabulary.
P
Pavlova A.V. SAP AG, Walldorf, Deutschland
THE MEANING OF PROSODIC INFORMATION IN LEXICOGRAPHIC REPRESENTATION OF POLYSEMY AND HOMONYMY
The lexical semantics of a word can determine its weak or strong accentual position in the phrase and its tendency to play the role of topic or comment. The links between the lexical meaning of a word and its potential accentuality could help to describe the different meanings of one and the same polysemous word in more detail. The interaction between a polysemous word and its accentuality makes it possible to identify additional, more specific meanings. Subjective and evaluative, negative and retrospective semantics are especially “appealing” for the phrase accent. But several factors can resist this accent “appeal”, for example a specific communicative task (pure narration, explanation of cause, imperative sentences), idiomatic phrases, enumeration, and the use of numerals. If prosodic information about accentuality is included in a dictionary, it is necessary to comment at least on the potential obstacles that can disrupt the anticipated accentual structure of the phrase. This comment could be presented, for instance, in the foreword of such a dictionary. In general, not all words in the vocabulary require this kind of prosodic information. On the other hand, some lexical meanings of polysemous words are closely connected with accentual emphasis, and this fact should not be neglected in lexicography.
Padučeva E.V. Russian Academy of Sciences, Institute of Scientific and Technical Information
REGISTER OF INTERPRETATION AS DISAMBIGUATING CONTEXT
The focus of attention in modern semantics is gradually shifting from the meaning of separate linguistic entities to meaning shifts and the contexts that motivate them. The type of communicative situation (and the REGISTERS OF INTERPRETATION it engenders, such as dialogical register, narrative, hypotaxis) is one of the most relevant parameters. Examples are given of EGOCENTRICAL grammatical categories, words and constructions that have different interpretations in different registers.
Paljko M.L. Institute of Linguistics
INTONATION OF THE GERMAN COHERENT DISCOURSE IN CONTRAST TO THE RUSSIAN ONE
It is widely recognized that the marker of text incompleteness in many languages is the rising tone. This paper argues that in German the intonational strategies for maintaining coherence can be specified, and that a variety of ways to show that a statement is not text-final can be singled out.
Partee B.H. University of Massachusetts, Amherst, MA, USA and RGGU, Moscow
Symmetry and symmetrical predicates
A goal of this paper is to analyze the differences between mathematical definitions of symmetry and a concept of symmetry that would fit best with observed linguistic generalizations. This requires a closer look at some aspects of the linguistic behavior of symmetric and non-symmetric predicates.
Pereverzeva S. I. Kreydlin G.E. Russian State University for the Humanities
CORPOREALITY AND PECULIARITIES OF SEMIOTIC BEHAVIOUR IN DIALOGUE
This paper discusses the modification of some syntactic rules that regulate the interaction of verbal and nonverbal semiotic codes in dialogue. We show that there is a regular correspondence between particular meanings in the semantic explanation of a given gesture and different components of the physical realization of this gesture.
Kedrova G.E. Potemkin S.B. Moscow State University
ALIGNMENT OF UN-ANNOTATED PARALLEL CORPORA
Aligning parallel texts, i.e. automatically setting the sentences or words in one text into correspondence with their equivalents in a translation, is a very useful preprocessing step for a range of applications, including but not limited to machine translation, cross-language information retrieval, and dictionary creation. We present a new algorithm for aligning bilingual, linguistically un-annotated parallel corpora. It enables alignment at the sentence level, using a bilingual dictionary and heuristic cues along with linguistics-based rules. The program based on the algorithm currently aligns Russian and English texts and requires no prior markup or other manual text pre-processing. Russian lemmas are retrieved from the grammar dictionary. The adaptive nature of the system allows experiments with a variety of fiction or non-fiction (e.g. scientific and legal) texts. The algorithm deals with typical alignment problems such as one-to-many sentence correspondences, the omission of a sentence, and texts with different syntactic patterns in the two languages. The first phase of performance tests seems promising, and we are going to develop a word and multiword alignment technique.
Prozorova E.V. Moscow State University
TRANSCRIPTION AS A TOOL FOR ANALYSIS OF PAUSES IN RUSSIAN SIGN LANGUAGE DISCOURSE
In this paper, we analyse pauses in Russian Sign Language discourse. In order to describe different types of pauses, we use signed discourse transcription data, which contains information on movement phases of signs and on changes in the facial expression and body posture of the signer.
Protasov S.V. Moscow Institute of Physics and Technology
INFERENCE AND ESTIMATION OF A LONG-RANGE TRIGRAM MODEL
We describe an implementation of a simple probabilistic link grammar. This probabilistic language model extends trigrams by allowing a word to be predicted not only from the two immediately preceding words, but potentially from any preceding pair of adjacent words that lie within the same sentence. In this way, the trigram model can skip over less informative words to make its predictions. The underlying "grammar" is nothing more than a list of pairs of words that can be linked together. Finally, we report some experimental results using Russian corpora.
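The idea of predicting a word from any earlier adjacent pair in the same sentence can be illustrated with a minimal sketch. It is not the author's implementation: the function names `train` and `score`, the use of plain relative frequencies, and the choice of taking the best-scoring pair in the history are all assumptions made here for illustration only; the abstract does not say how the contexts are combined or how the model is estimated.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count how often each word follows any earlier adjacent pair
    within the same sentence (hypothetical helper, illustration only)."""
    pair_word = defaultdict(Counter)  # (u, v) -> counts of later words
    for sent in sentences:
        for i in range(2, len(sent)):
            # Every adjacent pair (sent[j-1], sent[j]) with j < i
            # is allowed to predict sent[i], not just the nearest one.
            for j in range(1, i):
                pair_word[(sent[j - 1], sent[j])][sent[i]] += 1
    return pair_word


def score(pair_word, history, word):
    """Best relative frequency of `word` over all adjacent pairs
    in `history` (an assumed way of choosing the context)."""
    best = 0.0
    for j in range(1, len(history)):
        counts = pair_word.get((history[j - 1], history[j]))
        if counts:
            best = max(best, counts[word] / sum(counts.values()))
    return best


corpus = [
    ["the", "dog", "quickly", "barked"],
    ["the", "dog", "barked"],
]
model = train(corpus)
print(score(model, ["the", "dog"], "barked"))
```

In the toy corpus, the pair ("the", "dog") is counted as a context for "barked" in both sentences, even though "quickly" intervenes in the first one; this is the "skipping" behaviour the abstract describes.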
R
Rozina R.I. Russian Language Institute, Moscow
NOMINALIZATIONS IN EVERYDAY SPEECH
The paper is devoted to the comparison between nominalizations in Russian everyday speech and slang on the one hand and in modern standard Russian on the other hand. Derivational bases, means of derivation, meaning, argument frame and surface behavior of nominalizations are considered. The analyses suggest that, considering the intermediate position of nominalizations between nouns and verbs, Russian colloquial and slang nominalizations are less related to motivating verbs than nominalizations in standard Russian.
Rubashkin V. Sh. Pivovarova L. M. Saint-Petersburg State University
ONTOLOGY EDITOR AS INTEGRATED DEVELOPMENT ENVIRONMENT
The development and use of an ontology editor (ontoeditor) designed to work with the knowledge model of the InTez ontology are presented. Browsing, input, editing and other functions are discussed. The ontoeditor is compared with similar environments developed abroad.
Ryko A.I. Stepanova S.B. Laboratory of Experimental Phonetics Saint-Petersburg, Russia
MULTILEVEL LINGUISTIC ANNOTATION OF THE RUSSIAN SPEECH CORPUS
The paper considers multilevel linguistic annotation of the Russian Speech Corpus and its potential for description of spontaneous speech in comparison to standard language.
S
Savchuk S.O. Grishina E.A. Institute of Russian Language, Moscow
VARIATION IN RUSSIAN. DICTIONARY PROJECT
The paper presents a project for a new dictionary of variants in Russian, to be compiled on the basis of the Russian National Corpus. The paper gives a preliminary description of the dictionary word list and of the types of tasks and problems posed and solved.
Sidorova E.A. A.P. Ershov Institute of Informatics Systems, Russian Academy of Science
MULTIPURPOSE DICTIONARY SUBSYSTEM FOR EXTRACTION OF SUBJECT LEXICON
A technology for building subject-oriented dictionaries and solving various text analysis tasks in information systems is considered. The problem of using several dictionaries simultaneously and coordinating their contents is investigated.
Sokolova E. Russian State University for Humanities Kononenko I. Institute of Informatics Systems SB RAS Zagorulko Yu. Institute of Informatics Systems SB RAS
PROBLEMS OF DESCRIBING COMPUTATIONAL LINGUISTICS IN ONTOLOGY OF A KNOWLEDGE PORTAL
In this paper we discuss problems that emerge while developing an ontology for the scientific discipline concerned with computational language, text and speech processing, that is, Computational Linguistics (CL). The problems range from defining the name and scope of the subject domain to meeting the formal requirements imposed on the ontology specification by the knowledge portal design. The difficulties are due to the way CL deviates from “classic” sciences such as archeology: for CL the computer is not merely a means of amplifying and intellectualizing modeling tools, it is an inherent part of the science. We consider the problems and the organization of the ontology.
Stepanova S.B. Asinovsky A.S. Bogdanova N.V. Rusakova M.V. Sherstinova T.Y. Faculty of Philology and Arts, St. Petersburg State University, St. Petersburg, Russia
SPEECH CORPUS OF THE RUSSIAN EVERYDAY COMMUNICATION "ONE SPEAKER'S DAY": BASIC CONCEPTION AND CURRENT STATE
The report concerns the methodological principles elaborated for creating the speech corpus of Russian everyday communication “One Speaker’s Day”. The paper presents the main rules for data processing at the initial stages, a description of the database, and the current state of the corpus.
Strandson K. Gerassimenko O. Kasterpalu R. Koit M. Rääbis A. University of Tartu, Estonia
TOWARDS HUMAN-COMPUTER INTERACTION IN NATURAL LANGUAGE
Estonian human-human calls (directory inquiries) are analyzed with the further aim of developing a computer-human dialogue system that interacts with a user in natural language. The analysis is based on the Estonian Dialogue Corpus. Linguistic features of clients’ requests and agents’ grants are studied. A client’s initial request sets up a goal which will be achieved in collaboration with the agent. Agents give information briefly, using short sentences or phrases. Information-sharing sub-dialogues are initiated by either participant if a request or a grant needs to be adjusted. A formal grammar of information dialogue is introduced in the paper. The results of the study will be implemented in two dialogue systems under development.
Sun Shuang Kobozeva I. M. Lomonosov Moscow State University
RECOGNITION OF CASE SEMANTICS FOR RUSSIAN-CHINESE AUTOMATIC TRANSLATION: INSTRUMENTAL OF INSTRUMENT VS. INSTRUMENTAL OF COMPARISON
On the basis of Nirenburg & Raskin's «Ontological Semantics», formal rules are proposed for recognizing the semantic roles of Instrument and Similar-to (in form and in general) expressed by the instrumental case in Russian. The rules are needed for the correct translation of NP adjuncts whose head noun belongs to the class of artifacts within an automatic translation (AT) system.
Sukhova N.V. Lomonosov Moscow State University
THE DIRECTIONS OF INTERACTION BETWEEN HESITATION PAUSES AND KINETIC PHRASES
The article aims to define a potential set of different directions in which hesitation pauses and kinetic phrases can interact. The material is a spontaneous monologue stretch of English speech. Thanks to a multidisciplinary approach, seven directions are identified along which the investigation of pause-kinetic interaction can be conducted.
Sharonov I.A. Russian State University for Humanities
BORDERLINES BETWEEN EMOTIONAL INTERJECTIONS AND MODAL PARTICLES
The research aims to distinguish interjections from particles with the help of syntactic, semantic and pragmatic criteria. A word should be regarded as an interjection if it is a syntactically autonomous, spontaneous and non-addressed reaction to a linguistic or extra-linguistic stimulus.
Shemanaeva O.Yu. Russian State University for Humanities
VERBS OF GOING DOWN: SEMANTICS AND COMPATIBILITY
Russian verbs of going down are described in this paper. The parameters relevant for an adequate semantic description are shown, for example the control of the subject, the speed of movement, and the layer in which the subject is placed. Three main metaphorical extensions are discussed: BAD IS DOWN, a large amount of something, and disappearance from sight.
Shmeleva E.Y. Shmelev A.D. Institute of Russian Language, Moscow
“WE” AND “OTHERS”: THE SIMULATION OF UKRAINIAN SPEECH IN RUSSIAN JOKES
Simulating Ukrainian speech and making fun of funny-sounding Ukrainian words and names are unmistakable signs that jokes about Ukrainians are produced in the Russian linguistic environment. The paper aims at revealing links between typical joke plots, “linguistic masks” of the characters, and ethnic stereotypes.
Shmyrev N.V. SRISA RAS, Moscow
VOXFORGE.ORG FREE SPEECH CORPUS
We discuss the work on building the first free speech database for recognition systems. This report reviews free speech sources, processing techniques, and problems related to collecting a large multilingual speech database.
T
Tikhomirov I.A. Smirnov I.V. Institute for Systems Analysis of RAS, Moscow
INTEGRATION OF LINGUISTIC AND STATISTIC SEARCH METHODS IN SEARCH ENGINE “EXACTUS”
The paper considers the problems of using linguistic search methods in contemporary search engines. The features of the search engine Exactus are described. An experimental evaluation of search quality is performed. The advantages of integrating linguistic and statistical methods are shown.
Toldova S.Ju. Moscow State University Kustova G.I. Moscow State Pedagogical University Lyashevskaya O.N. VINITI RAN
SEMANTIC FILTERS FOR THE WORD SENSE DISAMBIGUATION IN RNC: VERBS
This report deals with methods of word sense disambiguation (reduction) using information about verb argument structure. Most systems based on this method require specially designed resources such as WordNet, FrameNet, etc. We explore the possibility of extracting and using the information available in standard dictionaries, including a verb-argument dictionary. We used a subcorpus of the Russian National Corpus that has unambiguous morphological annotation as training and testing data. The aim was to reduce the number of tags for verbs in the semantic annotation. The experiment has shown that the information extracted from dictionaries cannot be used as it is. However, the extracted argument structure can be used as a seed set for future training. It makes it possible to remove rare meanings and can reduce the number of semantic tags for a verb. Further corpus training and enrichment of the argument structure with general semantic properties of nouns can further improve the method.
Tsirulnik L.I. Lobanov B.M. Sizonov O.G. United Institute of Informatics Problems of the National Academy of Science of the Republic of Belarus
ALGORITHM OF THE INTONATION MARKING OF NARRATIVE SENTENCES FOR TTS SYNTHESIS
The paper presents an algorithm for segmenting narrative sentences into phrases and tagging their intonation. The algorithm takes into account positional and combinatory prosodic factors. Using the proposed algorithm in a TTS synthesis system eliminates the so-called “second degree of monotony” in synthesized speech.
U
Uryson E.V. Institute of Russian Language, Moscow
RUSSIAN CONJUNCTIONS A TO [LIT.: ‘AND/BUT THAT’] AND A NE TO [LIT.: ‘AND/BUT NOT THAT’]: WHY ARE THEY SYNONYMS IN SOME CONTEXTS?
Russian conjunctions a to [lit.: ‘and/but that’] and a ne to [lit.: ‘and/but not that’] according to their form cannot be synonyms. Yet they easily substitute for one another in some contexts. To explain this fact I analyze the element TO of these conjunctions. It derives from demonstrative/anaphoric pronoun TO(T) and in the conjunctions under discussion is not quite bleached. TO in A TO and A NE TO refers to certain fragments of a semantic structure of an utterance. The difference between the conjunctions is in the scope of TO. Compositional analysis of Russian conjunctions and particles is considered.
Olga Uryupina Institute of Linguistics, Russian Academy of Science; Ashmanov and Partners
DETECTING SENTENCE BOUNDARIES IN RUSSIAN
In this paper we propose a data-driven algorithm for detecting sentence boundaries in Russian. The algorithm relies on shallow features and does not require any deep syntactic knowledge. We evaluate our approach with three publicly available machine learners: C4.5, Ripper and SVM-light. The evaluation results suggest that our algorithm significantly outperforms rule-based approaches.
V
Vasilev V.G. Institute of Informatics Problems of the Russian Academy of Sciences
COMPLEX TECHNOLOGY OF AUTOMATIC TEXT CLASSIFICATION
The report discusses the problems that arise when building automatic text classification systems. Main elements of the integrated text classification technology are described. Particular attention is given to the construction of combined decision rules for the implementation of a hierarchical classification of texts.
W
Yorick Wilks University of Sheffield, UK
Artificial Companions as a new kind of dialogue interface to the future Internet
This paper seeks to connect the future of the Internet to a new, even though relatively underdeveloped, technology, that of computer speech and language and its embodiment, in a concept I shall call an Artificial Companion. Before moving to describe the integration that constitutes the Companion, we must first mention two technologies, not only in their own right but because, in each case, there have been misunderstandings about their achievements and goals. They are: Language and speech technology; Agents and the Semantic Web. The first of these is Berners-Lee’s [Berners-Lee et al., 2001] vision of how the Internet will change, and it is to that new Internet we intend the Companion as the human interface, on the ground that without it the Internet may get harder and not easier to use, and we shall return to the Semantic Web at the end of this paper. The second notion above is that agents will change from transitory software entities that e.g. locate a cheap camera on the internet, to more permanent social Companion entities that deal with a user through dialogue over a long period, learn his or her needs and preferences and elicit large quantities of life data through conversation.
Y
Yanko T.E. Institute of Linguistics
PROSODY IN A DICTIONARY, AND A DICTIONARY OF PROSODIC IDIOMS
Representing prosodic data in a dictionary raises two problems: to account for limitations on the communicative and prosodic application of words and constructions by their definitions or functions in discourse; to collect idiomatic illocutions and their prosodic parameters in a prosodic dictionary.
Z
Zalizniak Anna A. Institute of linguistics, Russian Academy of Sciences
A.PLATONOV’S TEXTS AS A LINGUISTIC SOURCE
Anomalous phrases in A.Platonov’s texts have so far been investigated exclusively as a source of information on the author’s poetic world. The paper demonstrates that Platonov’s linguistic anomalies can be used as a source of information about the Russian language. These anomalies reveal some subtle semantic, combinatorial and categorical properties of Russian words, which could hardly have been noticed otherwise. This information can be used in explanatory dictionaries of Russian, as well as in the semantic tagging of electronic corpora.
Zatsman I.M. Kurchavova O.A. Institute for informatics problems of the RAS
TERMS FOR SCIENTIFIC AND TECHNICAL KNOWLEDGE REPRESENTATION IN DIGITAL SPHERE
Documents of the 7th Framework Programme of the European Union, adopted for the period 2007-2013, formulate new tasks concerning the problem of knowledge representation in the digital sphere. The paper analyzes the key points of these formulations. The results of the analysis are used to define some terms proposed for describing knowledge representation processes in digital libraries.
Zimmerling A.V. Moscow State University for the Humanities, MGGU/Russian State University for the Humanities, RGGU
LOCAL AND GLOBAL RULES IN SYNTAX
The paper discusses word order and phrasal prosody in Russian. I claim that both phenomena can be described in terms of two successive sets of rules: local rules vs. global rules. Combinations of these two sets of rules are typical of multilayer language models and of the algorithmic generation of complex structural objects in formal grammars. Modern Russian applies a highly formalized rule for choosing the locus of the main phrasal accent: the hierarchy of potential accent bearers is a mirror image of the grammatical hierarchy of arguments and adjuncts. The order of communicative constituents in Russian is governed by 7-8 Linear-Accent Transformations (LA-transformations). LA-transformations are Movement rules, which both operate on constituent order and change the accent markings of communicative constituents. In the preceding Russian linguistic tradition (cf. Paducheva and Yanko) LA-transformations are defined as Context-Sensitive rules, which makes a word order calculus impossible. I discuss the possibility of reformulating LA-transformations as pairs of the type and offer an analysis compatible with Mildly Context-Sensitive Grammars, e.g. Stablerian Minimalist Grammars.
