A
Automatic verb-noun collocation extraction is an important natural language processing task. The results obtained in this area of research can be used in a variety of applications including language modeling, thesaurus building, semantic role labeling, and machine translation. Our paper describes an experiment aimed at comparing the verb-noun collocation lists extracted from a large corpus using a raw word order-based and a syntax-based approach. The hypothesis was that the latter method would result in less noisy and more exhaustive collocation sets. The experiment has shown that the collocation sets obtained using the two methods have a surprisingly low degree of correspondence. Moreover, the collocate lists extracted by means of the window-based method are often more complete than the ones obtained by means of the syntax-based algorithm, despite its ability to filter out adjacent collocates and reach the distant ones. In order to interpret these differences, we provide a qualitative analysis of some common mismatch cases.
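The window-based extraction compared in the abstract can be sketched on a toy corpus as counting co-occurrences inside a fixed window and ranking pairs by pointwise mutual information. This is a minimal illustration, not the authors' implementation; function names and the window size are illustrative.

```python
import math
from collections import Counter

def window_collocations(tokens, window=3):
    """Rank word pairs co-occurring within a fixed window by PMI."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Count every token that follows w within the window.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[(w, tokens[j])] += 1
    total = len(tokens)
    # Pointwise mutual information for each observed pair.
    return {
        (a, b): math.log2(c * total / (unigrams[a] * unigrams[b]))
        for (a, b), c in pairs.items()
    }
```

A syntax-based extractor would replace the inner loop with a walk over dependency links from a parser, which is what lets it skip adjacent non-collocates and reach distant ones.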
The activities of A. E. Kibrik reflect the movement from structural linguistics to functional linguistics, a process characteristic of modern linguistics as a whole; Kibrik formulated the main principles of the new paradigm very precisely, especially in his article “Linguistic Postulates” (1983–1992). He pointed to the narrowness of structural linguistics, which studies only the structure of language. He wrote that linguistic phenomena must be studied together with the mental activity of speakers, called for revealing linguistic processes as they really are, and emphasized the central role of semantics in language. Linguistics today develops in the direction that A. E. Kibrik determined thirty years ago.
The paper aims to illustrate the applicability of conditional random field (CRF) models to Russian texts. Introduced in 2001, the CRF method has been successfully exploited and has proved its efficiency for a variety of NLP tasks. Its main advantage over HMMs is the possibility to model dependencies and interdependencies in sequential data. Yet this approach has not been widely used for Russian. Since CRF operates with language-independent features, its initial adaptation for Russian can be minimalistic. We show how CRF models produce state-of-the-art quality for several basic NLP tasks, including named entity recognition, part-of-speech tagging and object-oriented sentiment analysis. We used the CRFsuite tool to train and evaluate our models. We used a corpus of news texts for the NER and POS-tagging tasks and a subcorpus of Russian Twitter for SA. The results of the evaluation were compared to other existing methods for each of the three tasks.
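The language-independent features that make a minimalistic adaptation possible can be illustrated with a toy extractor of the kind typically fed to CRF toolkits such as CRFsuite. The exact feature set below is an assumption for illustration, not the one used in the paper.

```python
def token_features(tokens, i):
    """Language-independent features for token i: surface shape,
    affixes, and a one-token context window."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "word.isdigit": w.isdigit(),
        "prefix3": w[:3],   # character affixes work for any language
        "suffix3": w[-3:],
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats
```

Each token's feature dictionary, paired with its gold label, is what a CRF trainer consumes; nothing in the features above is specific to Russian.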
The paper considers semantic structure of emotion causatives and their interaction with negation, namely, its narrow or wide scope. Emotion causatives are defined as a group of causatives with their specific semantic properties that distinguish them from other groups of causatives. One of those properties concerns their relation with corresponding decausatives, which, unlike causatives, do not license wide scope of negation. There are several factors that enable negation to have scope over the causative element in emotion causatives — their imperfective aspect, generic referential status of the causative NP phrase, agentivity and conativity of the causative. Non-agentive causatives never license the negation of the causative component. Agentive conative causatives license the negation of the causative component more frequently and easily than agentive non-conative causatives, prompting the assumption that in their semantic structures the causative component has different statuses (assertion in the former, presupposition in the latter). It also has different forms for conatives and non-conatives. Conativity vs. non-conativity of emotion causatives is related to the emotion type, with conative synthetic causatives being limited to basic emotions. The greatest degree of conativity and, hence, the assertive status of the causative component characterizes three emotion causatives — zlit’ ‘to make mad’, veselit’ ‘to cheer up’, and pugat’ ‘to frighten’.
The paper describes a novel method for automatic collocation error correction in NL texts written by language learners or translated from another NL with the aid of machine translators. We assume that the main cause of collocation errors is the strategy of word-by-word translation used by the authors of the texts or by machine translators, so the errors essentially depend on the source language. While processing a sentence from the text, the method considers as potential correcting variants all its paraphrases that have the same syntactic structure and are built by replacing the words of the sentence with their substitutes. Substitutes are automatically generated using word translation equivalents taken from a translation dictionary. To detect an error in the sentence, we propose a relevance degree function computed from the probability of the words’ syntactic links and applied to the sentence and its paraphrases. If the function value for the sentence is lower than for some of its paraphrases, our method signals an error, which is then corrected by an appropriate paraphrase of the sentence. The method was evaluated by correcting collocation errors in English texts written by Russian speakers. The Stanford Parser and an English text collection were used to gather statistics and compute the probabilities of English word syntactic links. Within certain limitations, the experiments gave promising results: our method detected about 80% of collocation errors (with words of various POS), and 87% of the proposed correcting paraphrases contained a proper correction.
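The relevance-degree idea can be sketched as follows, assuming a sentence is reduced to its (head, relation, dependent) links and that link probabilities have been estimated from a parsed corpus. All names and the averaging scheme here are hypothetical; the paper does not spell out its exact formula.

```python
import math

def relevance_degree(links, link_prob):
    """Average log-probability of a sentence's syntactic links."""
    if not links:
        return float("-inf")
    eps = 1e-9  # smoothing for links unseen in the corpus
    return sum(math.log(link_prob.get(l, eps)) for l in links) / len(links)

def detect_and_correct(sentence_links, paraphrase_links_list, link_prob):
    """If some paraphrase scores higher than the sentence, signal a
    collocation error and return that paraphrase; else keep the sentence."""
    base = relevance_degree(sentence_links, link_prob)
    best = max(paraphrase_links_list,
               key=lambda ls: relevance_degree(ls, link_prob))
    if relevance_degree(best, link_prob) > base:
        return best
    return sentence_links
```

The intuition matches the abstract: "do a mistake" (a word-by-word calque) has a rarer `dobj` link than "make a mistake", so the paraphrase outscores the original and is proposed as the correction.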
B
The paper considers the semantics and pragmatics of threat as a speech act. In lexical semantics, the concept of a threat is often treated as a unified (single) notion. It is shown that speech acts of threat in Russian are divided into two types: threat-penalty and threat-warning. The latter type — threat-warning — has a specific variety — threat-compulsion. Threat-penalty is a threat situation in which something bad has occurred and the speaker informs the hearer (who is responsible for it) that he will be punished. Threat-warning presupposes that nothing bad has occurred yet and the speaker shows the hearer that he should not do the bad thing. Threat-compulsion assumes that the speaker tries to force the hearer to do something under threat of penalty. Distinguishing the three kinds of threat is important for forensic linguistics. In cases involving extremism, murder, bribery, extortion and other articles of law, establishing the corpus delicti presupposes an analysis of criminal intent, which is reflected, among other things, in the kind of threat. Implicit ways of threatening are the most complicated to analyze in forensic linguistics. The analysis of an implicit threat presupposes that all parts of the semantic representation of this speech act (variables with terms and constants) should be identified in the text. The paper focuses on a case of implicit threat; the specific feature of the case analyzed is the implicit expression of the penalty.
The main research question of any corpus investigation, either while experimenting with the Internet or working with the RNC or any other corpus, should be the question of the object of investigation: do we study a particular corpus, search engine or the language “overall”? Unfortunately, researchers usually accept as self-evident the assumption of “scalability” of the results obtained with a specific corpus study to the whole body of language. The article examines the criteria to justify the possibility to scale specific data and proposes an approach to assessing the limits of discovered facts, as adopted in the framework of an ongoing project to create the General Internet Corpus of Russian (GICR). One of the basic ideas of this project is that scaling the results is a very limited operation. For the majority of linguistic and lexicographical problems, corpus analysis should be carried out within a well-defined genre and sociolinguistic parameters.
We aim at comparing some corpus-based computational resources that enable us to analyse the collocational profiles of support verb constructions (SVCs) in both languages. The resources include SketchEngine, which works for both languages, Lexit for Italian and NKRJA for Russian. The case study focuses on the Italian verb mettere followed by a prepositional phrase with the prepositions in and a, and the corresponding Russian verb stavit’/postavit’ followed by a prepositional phrase with the prepositions v and na. We discuss the options offered by the tools at the syntax-semantics interface. A closer comparison of the three tools shows that they provide different data. We have explored some aspects of the semantic tagging in Lexit and NKRJA and propose an integration of the two tools. It seems that further development of semantic tagging could be helpful in creating Italian-Russian lexicographic resources.
Methods and approaches used by the authors to solve the problem of sentiment analysis at the ROMIP-2012 seminar are described. The lexical approach is represented by the lexicon-based method, which uses emotional dictionaries manually compiled for each domain with the addition of words from the training collections. The machine learning approach is represented by two methods: the maximum entropy method and the support vector machine. Text representation for the maximum entropy method includes information about the proportion of positive and negative words and collocations, the number of question and exclamation marks, emoticons, and obscene language. For the support vector machine, binary vectors with cosine normalization are built from the texts. The test results of the described methods are compared with those of the other participants of the ROMIP seminar. The task of classifying reviews of movies, books and cameras is investigated. On the whole, the lexical approach demonstrates worse results than the machine learning methods, but in some cases excels them. It is impossible to single out the best machine learning method: on some collections the maximum entropy method is preferable, on others the support vector machine shows better results.
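In its simplest form, the lexicon-based method described above reduces to counting dictionary hits and comparing the balance. A toy sketch follows; the word sets are invented for illustration and stand in for the manually compiled domain dictionaries.

```python
def lexicon_sentiment(tokens, pos_words, neg_words):
    """Classify a text by the balance of positive vs. negative lexicon hits."""
    pos = sum(t in pos_words for t in tokens)
    neg = sum(t in neg_words for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

A real system of this kind would add the features the abstract lists (punctuation counts, emoticons, obscene language) and weight dictionary entries rather than counting them equally.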
Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skill. In the OpenCorpora project we are trying to involve native speakers with no special linguistic background in the annotation work. In this paper we describe the way we organize our processes in order to maintain high annotation quality, and report our preliminary results.
Those who seek, will they find? (search function of verbal hesitations in Russian spontaneous speech)
The article is dedicated to verbal hesitations used in Russian spontaneous speech when a speaker is trying to find a better way of expressing his idea. The search process is always accompanied by hesitation, sometimes by self-correction, and sometimes remains incomplete. Our conclusions are based on material from the Russian Speech Corpus (the balanced annotated text collection and the One Speech Day block).
The paper shows how Russian external possessor constructions are treated in the ABBYY Compreno® system. The specific tasks of our system require that sentences with external possessor constructions be considered as synonymous with those with internal possessors. Accordingly, the semantic structure is generated in such a way that the possessor, whether external or not, and the possessum form a single constituent. This is not the case with the syntactic structure because there is much evidence that the external possessor is not syntactically dependent on its possessum. The semantic and syntactic structures of external possessor constructions are not isomorphic so we have to apply a syntax-semantic interface to derive one from the other. We show that two different kinds of interface must be used. For constructions with strong lexical restrictions we use a special normalization module while leaving the syntactic description relatively simple. In contrast, constructions with fewer lexical restrictions require a more sophisticated syntactic description where movements are postulated.
While mainstream semantic parsing mostly consists in word sense disambiguation, semantic role labeling and assigning WordNet/FrameNet categories, deeper NL understanding requires much more. It includes understanding of the meaning of words, extralinguistic knowledge and is based on a more intricately elaborated representation of this meaning than that provided by standard resources. For example, the semantic model should not only know that ask for, implore and demand belong to the same REQUEST frame. It should also formally represent the very idea of an incentive speech act (e.g. ‘X tells Y that he wants him to do Z’) and even the difference between such request varieties as represented by the words listed. Our aim is to build a semantic analyzer supplied with this kind of semantic knowledge and capable of constructing semantic representations that convey this knowledge and can be used for inferences. However, before constructing a parser, one should define the target representation. The focus of this paper is to propose a semantic representation richer than usually considered. Since the depth of representation is an important decision in language modeling, the topic deserves a detailed discussion. Our paper demonstrates selected NL phenomena untreatable by state-of-the-art parsers and semantic representations proposed for them.
We note that Western European lexicography has neither a precise definition of paronymy nor dictionaries of paronyms. However, such dictionaries can help us correct malapropisms like massive evacuation or sensitive shoes. Although three comprehensive dictionaries of Russian paronyms have been published in recent decades, it remains unclear what additional features of similarity between two words of the same root and the same POS are needed to consider the words paronymous. Based on the collected statistics of affix proximity of paronyms in the largest printed dictionary of Russian paronyms, we propose a formal criterion of paronymy. Two words of the same root and the same POS are considered formally paronymous if their affix differences (separately for suffixes and prefixes) are limited to particular values. The affix difference equals the minimal number of editing operations on affixes (deletion, insertion or substitution) that transform the affix chain of one word into that of the other. Aiming to develop a computer dictionary of formal paronyms, we first compiled a computer dictionary of 23,000 Russian words divided into 2,400 same-root, same-POS groups. All words were split into morphs: prefixes, the root, suffixes, and the ending. Then affix distances between word pairs from the groups were automatically computed, and all formally paronymous pairs were selected. These pairs constitute the resulting computer dictionary of paronyms, which contains 21,800 word entries with their 190,000 paronyms and is larger than all known dictionaries of paronyms.
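The affix difference defined above is the classic edit distance applied to affix chains rather than characters. A straightforward dynamic-programming sketch, with affix chains given as lists of transliterated morphs:

```python
def affix_distance(a, b):
    """Minimal number of insertions, deletions and substitutions
    turning affix chain a into affix chain b (Wagner-Fischer DP)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all remaining affixes of a
    for j in range(n + 1):
        d[0][j] = j  # insert all affixes of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```

Per the criterion in the abstract, this distance would be computed separately over the prefix chains and the suffix chains of a same-root, same-POS pair, and the pair counts as formally paronymous when both values stay within the chosen thresholds.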
The article deals with modeling the understanding of natural language texts in special cases that differ from the trivial ‘normal’ condition ‘what is said is what is meant’ (literal understanding). This includes hints, metaphors, etc. The article focuses on irony, which seems to be a paradox: ‘what is meant’ is different from ‘what is said’. By thorough analysis of examples of irony both in literature and in common usage (including texts of the media and the Internet) we classify the cases of irony. The sense components of utterances which are to be understood in the opposite way have been identified. They are parts not only of the dictum but of the modal frame as well. The pragmatic analysis revealed the intentions of the Speaker using irony, including the cases when the object of mockery is the Speaker himself, or the Hearer. The correlation of irony vs. mockery and irony vs. quotation is investigated. The results can be used in designing models of natural language text understanding.
The paper presents the settings and the results of the ROMIP 2013 machine translation evaluation campaign for the English-to-Russian language pair. The quality of generated translations was assessed using automatic metrics and human evaluation. We also demonstrate the usefulness of a dynamic mechanism for human evaluation based on pairwise segment comparison.
The Information Extraction task, and the task of Named Entity Recognition (NER) in unstructured texts in particular, are essential for modern Mass Media systems. The paper presents a case study of a NER system for Russian. The system was built and tested on Russian news texts. The method of ambiguity resolution under discussion is based on dictionaries and heuristic rules. The dictionary-oriented approach is motivated by a set of strict initial requirements. First, the target set of Named Entities should be extracted with very high precision; second, the system should be easily adapted to a new domain by non-specialists; and third, these updates should result in the same high precision. We focus on the architecture of the dictionaries and on the properties that the dictionaries should have for each class of Named Entities in order to resolve ambiguous situations. The five classes under consideration are Person, Location, Organization, Product and Named Event. The properties and structure of the synonyms, context words, expressions and entities necessary for disambiguation are discussed.
Key words: Named Entity Recognition, Named Entity ambiguity, Named Entity disambiguation, rule-based approach.
C
A two-step approach to devising a hierarchical taxonomy of a domain is outlined. As the first step, a coarse “high-rank” taxonomy frame is built manually using the materials of the government and other representative sites. As the second step, the frame is refined topic-by-topic using the Russian Wikipedia category tree and articles filtered of “noise”. A topic-to-text similarity score, based on annotated suffix trees, is used throughout. The method consists of three main stages: 1) clearing Wikipedia data of noise, such as irrelevant articles and categories; 2) refining the taxonomy frame with the remaining relevant Wikipedia categories and articles; 3) extracting key words and phrases from Wikipedia articles. Also, a set of so-called descriptors is assigned to every leaf; these are phrases explaining aspects of the leaf topic. In contrast to many existing taxonomies, our resulting taxonomy is balanced so that all the branches are of similar depths and similar numbers of leaves. The method is illustrated by its application to a mathematics domain, “Probability theory and mathematical statistics”.
In 2012, Russian Information Retrieval Seminar (ROMIP) continued the investigation of sentiment analysis issues. Along with the last year’s tasks on sentiment classification of user reviews we proposed two new tasks on sentiment classification of news-based opinions and query-based extraction of opinionated blog posts. For all tasks new test collections were prepared. The paper describes the characteristics of the collections, track tasks, the labeling process, and evaluation metrics. We summarize the participants’ results and describe our simple approach for sentiment extraction task.
We propose a text-to-speech system based on the two most popular approaches: statistical speech synthesis (based on hidden Markov models) and concatenative speech synthesis (based on Unit Selection). TTS systems based on Unit Selection generate speech that is quite natural but highly variable in quality. On the other hand, statistical parametric systems produce speech with much more consistent quality but reduced naturalness due to their vocoding nature. Combining both approaches improves the overall naturalness of synthesized speech. To reduce variability of Unit Selection results, we calculate a statistical generalization of the speaker’s intonation. We created a methodology of voice model building in order to solve the task of speech parameterization. The model is a set of HMM models whose state parameters are clustered to provide good quality of synthesized speech even under conditions of insufficient training data. MFCC coefficients, pitch, energy and duration values are used as fundamental features. Objective and subjective experiments show that our method increases the naturalness of synthesized speech.
D
In Daghestan, the number of Russian speakers has been dramatically increasing over the last few decades. Russian has assumed the functional niche previously vacant in this extremely multilingual setting, becoming the first ever lingua franca of the region as a whole. Russian is acquired in a situation of strong interaction with local languages and shows contact properties on various linguistic levels: phonetics, morphology, syntax and lexicon. Its regional variant is also visibly developing as a self-identification device. The aim of this paper is to discuss some (socio)linguistic properties of this idiom, to attribute them either to interference or to imperfect learning, and to argue for building a corpus of Daghestanian Russian.
The paper describes the current state of development of the lexical basis of an open and free lexical-semantic resource — the Universal Dictionary of UNL Concepts (UNLDC). The resource serves as the lexicon of the artificial intermediary language UNL (Universal Networking Language). It links the elementary units of UNL — concepts — with the lexicons of natural languages and various external lexical and semantic resources, including WordNet and the SUMO ontology. The dictionary’s main goal is to support automated semantic analysis, encoding the meaning of a text as UNL semantic graphs and the subsequent generation of text in different natural languages.
The paper focuses on the structure and principles for constructing a new German-Russian phraseological dictionary based on corpus data. Fragments of this dictionary are available on the website of the German Language Institute in Mannheim: “Deutsch-russische Idiome online”, http://wvonline.ids-mannheim.de/idiome_russ/index.htm. Relevant information is also made available via the Europhras homepage at http://www.europhras.org. In section 1, I formulate certain general principles of modern bilingual phraseology. Section 2 discusses the state of the art of German-Russian phraseography and explains the need for a new German-Russian phraseological dictionary. In Section 3, key features of the new corpus-based dictionary are considered. The basic difference between the present dictionary and traditional ones is that all examples of idiom usage are taken from the text corpora DeReKo and DWDS, and in individual cases from the German-language Internet. Parallel texts from the Russian National Corpus (RNC) are also used. The use of authentic examples based on text corpora is a new approach in bilingual lexicography. Traditional dictionaries were based on a limited body of randomly selected examples, and the use of the idioms was often not even exemplified. The advantages of using corpora consist not only in more detailed and well thought-out illustrations of the expressions being described, but also in the additional possibilities that the corpus provides for compiling the idiom list and structuring entries.
The paper deals with the structure of expressive attributive word meanings implemented in the wordnet-type thesaurus for Russian (RussNet). The adjectives involved express the appraisal of objects and situations denoted by nouns, the assessment depending on the intrinsic qualities of objects or rendering the subjective attitude of the speaker. The research was based on a 21-million-word corpus of modern texts. The sentiment meaning in RussNet is structured according to three parameters: Polarity, Domain, and Objectivity. “Polarity”, the intrinsic parameter of the class, describes a positive or negative sentiment value and its measure. “Domain” represents one of the three most commonly expressed standpoints: pragmatic, moral, and aesthetic, as well as the actualization of the lexical functions Ver/AntiVer, Pos/AntiPos, and Bon/AntiBon defining semantic interaction with hierarchical groups of nominal meanings (semantic trees and subtrees of the RussNet thesaurus). “Objectivity” describes the assessment source as either personal or customary, usual or individual for the object described. The parameters listed above are organized into a rather intricate scheme, but in practical work its structure can be simplified. Yet detailed analysis can help in structuring fuzzy sentiment expressions and detecting versatile evaluative content.
E
This paper describes the fast implementation of a hybrid automated translation system for processing user-generated content. We report on engine customization for TripAdvisor, the world’s largest travel website. Due to the growing potential of the Russian travel market, TripAdvisor created the Russian version of its website and decided to translate all English reviews into Russian. PROMT, a leading provider of industrial MT solutions, was selected as the MT vendor for the English-Russian language pair. According to the client’s request we had to perform the customization within a short period. All input data represent user-generated content, so we faced several problems while building a large-scale, robust, high-quality engine. We decided to create a solution based on a hybrid machine translation system, since the hybrid approach makes possible fast and efficient customization of a translation system with little or no in-domain data. We automatically crawled a large web-based Russian text corpus of tourist reviews to build a statistical language model for our hybrid translation system. We analyzed a batch of tourist reviews in English provided by TripAdvisor, created a number of dictionaries and a translation memory, and defined translation rules for user-generated content. To handle the problem of various typos and misspellings, we added the most frequent misspelled words and phrases to the created dictionaries. We experimented on a test set of tourist reviews in English provided by TripAdvisor. We report improvements over our baseline system output both by automatic evaluation metrics and by linguistic expertise.
F
Dialogue is a fundamental part of language use. In search of systematic evidence of how dialogue mechanisms work, we turn to the referential communication task originally devised by R. Krauss and refined by H. Clark. In our experiment, two students or children were seated at tables separated by an opaque screen; in front of each were 12 cards with so-called Tangram figures. For the Director the cards were already arranged in a target sequence, and for the Matcher the same figures lay in a random sequence. The Director’s job was to get the Matcher to rearrange his or her figures to match the target ordering. They carried out the task in four trials. All conversations (36 adults’ and 8 children’s dialogues) were transcribed, including changes of speaker, back-channel responses, hesitations, and false starts. We consider a prediction proposed by H. Clark that people prefer an analogical perspective, which focuses on the resemblance of the figures to natural objects, to a literal perspective, which focuses on the literal features of the objects, i.e. their geometric parts. Our results confirm the hypothesis; we also describe some peculiarities of the children’s dialogue strategies.
The paper studies the use of fact semantic filters in application to sentiment analysis of book reviews. The tasks were to divide book reviews into 2 classes (positive, negative) or into 3 classes (positive, negative, and neutral). The main machine learning pitfalls concerning sentiment analysis were classified and analyzed.
G
We develop a graph representation and learning technique for parse structures of sentences and paragraphs of text. We introduce the parse thicket: a set of syntactic parse trees augmented by a number of arcs for inter-sentence word-word relations such as coreference and taxonomic relations. These arcs are also derived from other sources, including Rhetorical Structure Theory and Speech Act theory. We introduce the respective indexing rules that identify inter-sentence relations and join phrases connected by these relations in the search index. We propose an algorithm for computing parse thickets from parse trees. We develop a framework for automatically building and generalizing parse thickets. The proposed approach is evaluated in product search, where search queries include multiple sentences. We compare the search relevance improvement achieved by pair-wise sentence generalization and by thicket-level generalization.
The paper presents a semantic and pragmatic analysis of noun reduplication in colloquial Russian and the Internet language. We consider the repetition of a noun within the same prosodic unit, separated by the particle “takoj” (‘such’), as in “statja takaja statja” (‘paper such a paper’). Drawing on a corpus of examples gathered from Internet texts, we categorize the semantics of this reduplication pattern into six types: (1) prototype and connotation, (2) non-fitting a stereotype, (3) condescension and irony, (4) expression of emotions, (5) discourse topic and scene-setting topic, and (6) object nomination and ellipsis. Compared to the model “such X-X”, the model “X such X” more often points to a negative attitude. We also consider the syntactic structure of the given reduplication pattern.
One can look upon the Web as a large corpus that can teach us about language use, and also about the real world. In order to determine what is new or interesting we need to know what the norm for language use is. This involves creating a language model that corresponds to what is found on the web. Since the web is so big, it is impossible to download it all and count appearances of words and phrases, so one must use the technique of probing: generating things to be tested and submitting them to a search engine to find their frequency of occurrence. It has been shown that using Google to gather statistics is perilous, since Google does not provide exact counts but rather estimates the number of pages containing an expression. These counts can be very far from what is really in Google’s index. Using another search engine, such as Exalead, is one solution, but then the problem of index coverage comes into play. Google declared having seen 1 trillion unique URLs (in 2008), but estimates of the size of Google’s index are about 50 billion pages, so some hidden choice has been made of what is in the index and what is not. This means that frequency-based language models derived from search engines are only approximate. Nonetheless, it is possible to make rough, relative judgments of how often one linguistic phenomenon appears with respect to another, and probing can provide some information on the relative frequency of these phenomena. Over a long period, it is possible to generate and test a great number of possibilities. Examples of the usefulness of this technique include finding which words commonly occur with other words, which colors are often associated with nouns, the most common translations of multiword expressions, and the most likely transliterations of English terminology and names into Japanese. The Web is not a uniform corpus, far from it.
There are many different language registers even within one language: there are professionally edited, well-written articles, there are more colloquial blog posts, there are hastily written, error-filled comments, all of which generate different language models. One recent exploitation of user-generated content on the web has been the mining of opinions concerning some subject, company, or product. Affect analysis is now a thriving market and a true commercial success for natural language processing. Many other areas of text mining remain to be explored. For example, the particular language used to tag photos on social media sites (such as Panoramio or Flickr) can reveal many things about the user (especially in conjunction with GPS and time data). This language is different from that found on the general web, or on Wikipedia. We can use it to find interesting things to visit in a city, we can predict where a tourist will go, we can even guess whether a user is a woman or a man from their tagging behavior. Mining this information can lead to additional applications that exploit this new knowledge.
The study analyzes the main types of gestures that accompany Russian verbs with and without prefixes. The gestures are described from the topological point of view: any hand/head movement is placed in Cartesian coordinates, and the statistical correspondence between prefixes and topological characteristics of gestures is detected. The paper presents the gestural profiles (the sets of gestural attributes) of 16 Russian prefixes. The study makes use of the data of the Multimodal Russian Corpus (MURCO).
I
The paper discusses several types of Russian microsyntactic units — nonstandard syntactic constructions and syntactic idioms with repeated verbal elements. The primary construction discussed is of the type chitat’ ne chital (no sdelal chto-to menee sil’noe) ≈ ‘one did not really read it (but one did something less strong)’. In this construction, two copies of the same verb in different inflectional forms (one in the infinitive and the other in a finite form) are present, the latter preceded by the negative particle. Since lexical instantiation of the verbal positions is virtually free, the only restriction imposed being that the two verbs lexically coincide, the construction should be treated as lexically unbound and, hence, as a non-standard microsyntactic construction. There are two more constructions that appear to be lexically and syntactically close to the primary one: the so-called emphatic tautological infinitive construction of the type s”jest’-to on s”est ≈ ‘he will definitely eat it’ and a syntactic idiom with lexically bound repeated verbal elements of the type Ja tebja znat’ ne znaju ≈ ‘I don’t know you and want nothing to do with you’. We focus on the semantics of these three units and on ways to discriminate between them in human and automatic natural language processing tasks.
The paper continues research into words denoting everyday life objects in the Russian language. This research is conducted for developing a new encyclopedic thesaurus of Russian everyday life terminology. Working on this project brings up linguistic material which leads to discovering new trends and phenomena not covered by the existing dictionaries. We discuss derivation models which are gaining popularity: clipped forms (komp < komp’juter ‘computer’, nout < noutbuk ‘notebook computer’, vel < velosiped ‘bicycle’, mot < motocikl ‘motorbike’), competing masculine and feminine contracted nouns derived from adjectival noun phrases (mobil’nik (m.) / mobilka (f.) < mobil’nyj telefon (m.) ‘mobile phone’, zarjadnik (m.) / zarjadka (f.) < zarjadnoe ustrojstvo (n.) ‘AC charger’), hybrid compounds (plat’e-sviter ‘sweater dress’, jubka-brjuki ‘skirt pants’, shapkosharf ‘scarf hat’, vilkolozhka ‘spork, foon’). These words vary in spelling and syntactic behaviour. We describe a newly formed series of words denoting multifunctional objects: mfushka < MFU < mnogofunkcional’noe ustrojstvo ‘MFD, multifunction device’, mul’titul ‘multitool’, centr ‘unit, set’. Explaining the need to compose frequency lists of word meanings rather than just words, we offer a technique for gathering such lists and provide a sample produced from our own data. We also analyze existing dictionaries and perform various experiments to study the changes in word meanings and their comparative importance for speakers. We believe that, apart from the practical usage for our lexicographic project, our results might prove interesting for research in the evolution of the Russian lexical system.
K
The paper reports on a research project in progress which involves a dictionary of Russian lexical constructions and a corpus tagged with a FrameNet-like annotation scheme. Russian FrameBank, originally conceived as an analogue of Berkeley FrameNet, takes into account some recent approaches adopted in Construction Grammar and Russian lexical semantics, as well as certain features of the Russian lexical system and grammar.
We focus on the semantic annotation of constructions in FrameBank. First, the article describes the inventory of semantic roles used in FrameBank, which correlates with the semantic classification of verbs and other predicates. Semantic roles form a hierarchy: 88 roles are classified into six clusters (those of Agent, Patient, Experiencer, Instrument, Addressee, Circumstances), which are further subdivided into smaller groups. The hierarchical organization makes the inventory of semantic roles more flexible for use in theoretical research and computational applications (such as automatic semantic role labeling). We also show that many examples are annotated in a more appropriate way by introducing syncretic semantic roles (e. g. Instrument-Place or Result-Manner). Second, we touch upon an ongoing project on the systematization of semantic shifts in verbal lexemes (metaphor, metonymy, and rebranding, which is argued to be a special type of semantic shift, see, for example, [Rakhilina et al. 2010a]) and the corresponding changes in argument structure constructions (including changes of a morpho-syntactic pattern, omission of a participant which belongs to a known class, etc.). The labels for the shifts are provided, along with examples of their realization. Lexical constructions are defined on constant (lexicalized) slots, mainly verbs and other predicates in a particular meaning. Frames are thus seen as the signifié side of constructional clusters formed by synonymous predicates, aspectual pairs, etc. Since it is not uncommon for polysemous lexemes that the formal façade of constructions is inherited from sense to sense, we claim that frame nets cannot be properly constructed without taking into account sense relations in polysemous predicates. The final discussion deals with the relation between semantic classes of verbs, semantic roles, and lexical/semantic constraints on the classes of participants as provided by FrameBank data.
The paper deals with statistical methods for predicting positions and durations of prosodic breaks in a Russian TTS system. We use CART and Random Forest classifiers to calculate probabilities for break placement and break durations, using grammatical feature tags, punctuation, word and syllable counts and other features to train the classifier. The classifiers are trained on a large high-quality speech database consisting of read speech. The experimental results for prosodic break prediction show an improvement over the rule-based algorithm currently integrated in the VitalVoice TTS system; the Random Forest classifier shows the best results, although the large size of the model makes it more difficult to use in a commercial TTS system. To make the system more flexible and deal with the remaining break placement errors, we propose combining probabilities and rules in a working TTS system, which is the direction of our future research. We observe good results in experiments with predicting pause durations. A statistical model of break duration prediction has been implemented in the TTS system in order to make synthesized speech more natural.
Among the central issues in the theory of discourse is discourse taxonomy, that is, the elucidation of the parameters classifying discourses into types. There are several such parameters, and they are often confused. The main ones include mode, genre, and functional style. The distinction in mode concerns the medium: spoken or written. Genres are related to the typical communicative goals acknowledged by discourse communities, and are characterized by standard schemata. Functional styles are identified in connection with the various domains of human existence. There are other discourse taxonomies as well; in particular, quite important is the distinction between types of presentation, which characterize not whole discourses but their fragments, or passages. Each discourse taxonomy is reflected in grammatical, lexical and other local linguistic choices. Such choices are a resultant of all the factors stemming from discourse taxonomies. Even though discourse taxonomies are in principle independent from each other, discourse types established on the basis of different parameters may have similar properties. For example, the written mode and the official functional style have similar reflexes in the linguistic structure.
The paper deals with a part of Russian phraseology: the idioms containing the odin/edin (‘one’, ‘single’) lexical component, e.g. vse kak odin, odin-edinstvennyj, vse do edinogo, odnoj levoj, ni odna zhivaja dusha, iz odnogo testa etc. (English equivalents: ‘one and all’, ‘all alone’, ‘all down to the last one’, ‘with one hand tied behind one’s back’, ‘not a single living soul’, ‘cut from the same cloth’). We observe that, first, the meaning of the idioms containing odin/edin depends on the meaning of the word odin in the given context (e.g. in smekh odin and v odin prisest we have two different lexical meanings of odin). Second, we try to classify these idioms according to the inner form model that we see in each case. For example, vse do odnogo is based on the model labeled “exhaustion”, while the similar idiom vse kak odin is based on another model, labeled “matching”. Apart from suggesting several classes of idioms depending on their inner form model, we show that the presence of the component odin systematically brings two semantic effects to the meaning of the idioms: uniqueness, oneness, and wholeness vs. insufficiency, poorness, and lameness.
In spoken Russian discourse, complement clauses introduced by a combination of to (originally a correlative pronoun in the nominative or accusative case) and chto (complementizer) may exhibit specific features that are not possible in standard written speech. Based on the data from several spoken corpora, the present study claims that to chto is regularly used as a compound complementizer. In prosodic terms, to chto is often pronounced together with the subordinate clause, while the pronoun to usually adheres to the main predicate, a strong intonation boundary appearing between it and the chto-clause. In semantic terms, to chto-constructions may violate the condition of ‘givenness’ that presumably licenses the use of the correlative pronoun to in standard speech. In syntactic terms, to chto may be used with predicates that require a different case (genitive, instrumental) or a prepositional phrase. Also, coordination of chto-clauses and to chto-clauses is possible, and to chto-clauses appear in contexts with other correlative pronouns in the main clause (like takoj).
An utterance is generated as an expression of an internal communication stimulus. As indicated in politeness theory, contradicting tendencies may interfere with the expression of an initial stimulus; in particular, an initial face-threatening act may be modified by the strategies of negative and positive politeness. Based on observations of a multimodal emotional corpus, we argue that a certain number of expressive cues compensate for and modify an initial communication stimulus in a similar way. (a) A speaker may compensate for changes in gaze direction through gestures, showing iconic gestures when looking aside and closing gestures when looking at the addressee. We show that “looking aside” is usually combined with addressed gestures (demonstration, iconic gestures). (b) Smiles may also compensate for the definitiveness of the main utterance. We show that smiles usually appear in postposition to an utterance and reduce face threat in situations of failure or doubtful proposal; in these cases smiles do not express pleasure and are not connected to jokes.
Human body and its parts in different languages and cultures (the results of the scientific project)
The paper presents the main results of a project aimed at constructing semiotic representations of human body and corporality in different natural languages (English, Arabic (the Egyptian dialect), Lithuanian, German and Hindi) and the corresponding body languages. The lexical system of a body language consists of gestures (in a broad sense of the word), i.e. gestures proper (manual gestures, gestures of legs, etc.), postures, meaningful glances, touches and some other semiotic classes of units. The primary directions of the project are (1) to describe somatic objects and their significant combinations; (2) to describe major classes of these objects, such as the human body itself, body parts, bones, biological liquids; (3) to examine the features of these objects and their values as well as those of their names; (4) to exhibit different kinds of gestures with somatic objects, among them those expressing human relationships. We also focus on some results in the field of applied nonverbal semiotics, i.e. (a) description of Russian symptomatic gestures performed by a patient in a conversation with a doctor. These gestures may serve to characterize a patient’s disease; (b) semantic analysis of Russian phraseological units with names of somatic objects; (c) exploration of meaning and functional characteristics of the so-called Bible somatisms — linguistic expressions in the Bible texts with names of somatic objects as well as of the gestures; (d) analysis of theatrical corporeal behavior.
The paper discusses a stage of abstract noun grammaticalization — namely, transformation into adverbial expressions, cf. v ozhidanii ‘waiting’, pod okhranoj ‘under protection’, po priglasheniju ‘by invitation’, v blagodarnost’ ‘in gratitude’. Two types of such adverbials are distinguished: 1) the agent of the adverbial is not expressed (Passazhiry khodili po perronu v ozhidanii poezda ‘The passengers were strolling along the platform waiting for the train’); 2) the agent of the adverbial is necessarily expressed (Prijekhal po priglasheniju djadi ‘came by invitation of his uncle’). In contrast to adverbials, nominalizations can express all arguments.
The paper is devoted to testing rules useful for sentiment analysis of Russian. First, we describe the working principles of the POLYARNIK sentiment analysis system, which has an extensive sentiment dictionary but a minimal set of rules to combine sentiment scores of opinion words and expressions. Then we present the results achieved by this system in ROMIP-2012 evaluation where it was applied in the sentiment analysis task of news quotes. The analysis of detected problems became a basis for implementation of several new rules, which were then tested on the ROMIP-2012 data.
The article presents the Typological Database of Qualities, which aims at providing a new tool for research in lexical typology. The database contains information on the lexicalization of several semantic fields of adjectives in different languages (like ‘sharp’ — ‘blunt’, ‘empty’ — ‘full’, ‘solid’ — ‘soft’, ‘thick’ — ‘thin’, ‘smooth’ — ‘rough’, etc.). We discuss issues concerning database structure (in particular, the choice of information units that would make the meanings from different languages comparable to each other). Special attention is devoted to the representation of figurative meanings in the Database, which makes it possible to investigate the models of their derivation from the literal meanings. The developed database can be used for solving both theoretical and practical tasks. On the practical level, the Database may serve as a multilingual dictionary which accounts for fine-grained differences in meaning between individual words. On the theoretical side, the Database allows for various generalizations on cross-linguistic patterns of polysemy and semantic change.
L
The article is focused on the properties of the zero copula used as a present tense form in Russian. The principal aim is to check whether the zero copula can be used in the same contexts as non-zero verbs or whether it has particular features. I find that there are contexts where the zero copula is allowed while non-zero verbs in the present tense are prohibited; conversely, there are constructions which require a non-zero verb and prohibit the zero copula. The former contexts include mainly biclausal constructions. The reason is that the zero copula lacks morphological tense and mood markers and apparently does not contradict any syntactic restrictions. The latter contexts, where the zero copula is prohibited, belong to constructions with temporal meanings and constructions with predicatives. Finally, I draw attention to the fact that constructions with the zero copula are not simply a reduced variant of some full structures; they have particular rules of use which differ in some respects from those of non-zero verbs.
The Russian conjunctions a to and a ne to ‘≈ or else’ have repeatedly become objects of linguistic study. First of all, researchers were interested in the semantic distinctions between these conjunctions and the conditions of their interchangeability. Besides, much attention has been paid to the structure of polysemy of these items, especially a to. Yet one of the interesting meanings of the conjunction a to seems not to have received an adequate description. It is the meaning which is usually described as causal: Sxodi v bulochnuju, a to xleba net ‘Go to the baker’s, since we are out of bread’; Pojdem domoj, a to zavtra rano vstavat’ ‘Let’s go home, because we have to get up early tomorrow’; Net li u tebja soli, a to u menja konchilas’ ‘Do you have some salt? Mine has run out’. Apparently, the idea of cause alone is absolutely insufficient. The paper addresses this causal meaning of a to, contrasting it with other senses of the conjunction and with other words of causation.
I discuss typical intonation patterns in Russian reported speech constructions, based on the data from the Prosodically Annotated Corpus of Spoken Russian, which consists of 4 experimental subcorpora of Russian spoken discourse (the current version of the corpus is available at http://spokencorpora.ru/). More than 400 occurrences of reported speech of different types (direct speech, indirect speech, semi-direct speech) have been analyzed. I attempt to show that (i) intonation patterns in preceding framing clauses (falling tone in the main phrasal accent, rising tone in the main phrasal accent, and absence of a main phrasal accent) correspond to the type of the reported speech (direct, indirect and semi-direct, respectively); (ii) however, this correspondence is more a tendency than a cause-and-effect relationship; (iii) there are some typical patterns in semi-direct speech that use ‘mixed’ intonation in order to keep the ‘original’ illocutionary meanings and to integrate the reported speech into the following context as much as possible: the list pattern and the head-tail pattern.
Compounding is a common phenomenon in many languages, especially those with rich morphology. Dealing with compounds is a challenge for NLP systems, since compounds are often not included in dictionaries and other lexical resources. We present a compound splitting method combining language-independent features (a similarity measure, corpus data) and language-specific component transformation rules. Due to the use of language-independent features, the method can be applied to different languages. We report on our experiments in splitting German and Russian compound words, which give positive results compared to simple matching of compound parts in a lexicon. To the best of our knowledge, elaborated compound splitting is a rare component of NLP systems for Russian, yet our experiments show that it could be beneficial to use a specialized vocabulary.
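The splitting approach described above can be illustrated with a minimal sketch (our reconstruction, not the authors’ implementation): a compound is split where both parts are known words, optionally undoing a linking element, and candidate splits are scored by corpus frequencies. The toy frequency list and the set of linking elements are assumptions made for the example.

```python
# Lexicon-based compound splitting sketch: try every split point,
# apply simplified component transformation rules (stripping a German
# linking element such as -s-/-es- from the left part), and score
# candidates by the geometric mean of the parts' corpus frequencies.
from math import sqrt

# Toy corpus frequencies; a real system would use large corpus counts.
FREQ = {"staat": 200, "regierung": 150, "wasser": 300, "fall": 250}

# Hypothetical, simplified linking elements ("" = no transformation).
LINKING_ELEMENTS = ("", "s", "es")

def split_compound(word, freq=FREQ):
    """Return the best-scoring (left, right) split, or None."""
    best, best_score = None, 0.0
    for i in range(2, len(word) - 1):
        right = word[i:]
        if right not in freq:
            continue
        for link in LINKING_ELEMENTS:
            left = word[:i]
            if link:
                if not left.endswith(link):
                    continue
                left = left[: -len(link)]
            if left in freq:
                score = sqrt(freq[left] * freq[right])
                if score > best_score:
                    best, best_score = (left, right), score
    return best

split_compound("staatsregierung")   # ("staat", "regierung")
```

For Russian, the transformation rules would instead handle interfixes such as -o-/-e-; the language-independent scoring stays the same.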
Our research aims at automatic identification of constructions associated with particular lexical items and its subsequent use in building a catalogue of Russian lexical constructions. The study is based on data extracted from the Russian National Corpus (RNC, http://ruscorpora.ru). The main emphasis is on extensive use of morphological and lexico-semantic data drawn from the multi-level corpus annotation. Lexical constructions are regarded as the most frequent combinations of a target word and corpus tags which regularly occur within a certain left and/or right context and mark a given meaning of the target word. We focus on nominal constructions with target lexemes that refer to speech acts, emotions, and instruments. The toolkit that processes corpus samples and learns the constructions is described. We provide an analysis of the structure and content of the extracted constructions (e. g. r:ord der:num t:ord r:qual|pervyj ‘first’ + LJUBOV’ ‘love’; LJUBOV’ ‘love’ + PR|s ‘from’ + ANUM m sg gen|pervyj ‘first’ + S f inan sg gen|vzgljad ‘sight’ = love at first sight). As regards their structure, constructions may be considered as n-grams (n ranging from 2 to 5). The representation of constructions is bipartite, as they may combine either morphological and lemma tags or lexical-semantic and lemma tags. We discuss the use of the visualization module PATTERN.GRAPH, which represents the inner structure of extracted constructions.
A new electronic frequency dictionary shows the distribution of grammatical forms in the inflectional paradigm of Russian nouns, adjectives and verbs, i. e. the grammatical profile of individual lexemes and lexical groups. While the frequency hierarchy of grammatical categories (e. g. the frequency of part-of-speech classes or the average ratio of Nominative to Instrumental case forms) has long been a standard topic of research, the present project shifts the focus to the distribution of grammatical forms in particular lexical units. Of particular concern are words with certain biases in their grammatical profile, e. g. verbs used mostly in the Imperative or in the past neuter, or nouns often used in the plural. The dictionary will be a source for much future research in the areas of Russian grammar, paradigm structure, and grammatical semantics, as well as variation of grammatical forms.
The resource is based on the data of the Russian National Corpus. The article addresses some general issues such as the use of corpora in compiling frequency resources and the technology of corpus data processing. We suggest certain solutions related to the selection of data and the level of granularity of the grammatical profile. Text creation time and language registers are discussed as parameters which may shape the grammatical profile fluctuations.
We present an approach to speaker-independent recognition of large-vocabulary continuous speech characterized by code-switching between Ukrainian and Russian. The approach does not require language boundary detection or language identification. Special speech and text corpora are not needed to train acoustic and linguistic models. The approach takes into account peculiarities of the phonetic systems of the Russian and Ukrainian languages. A cross-lingual speech recognition system is developed. A previously developed acoustic model of Ukrainian speech serves for both languages. A set of HMM models representing 54 Ukrainian phonemes and several non-speech units such as breath, fillers and silence is used. A bilingual linguistic model is trained on a set of Ukrainian and Russian texts. The pronunciation lexicon combines word forms in both languages. Phonemic transcriptions of Russian word forms are generated using Ukrainian phonemes. Recognition post-processing can be applied to smooth recognized word sequences by using a dictionary containing Ukrainian and Russian words which sound alike but are written differently. The proposed approach can be applied to the recognition of bilingual speech with between-phrase and within-phrase code-switching. The developed cross-lingual speech recognition system was tested on Ukrainian, Russian, and Ukrainian-Russian speech of one bilingual speaker. Preliminary results show that the proposed approach can achieve good performance. Accuracy of mixed speech recognition is only 3–7% lower than monolingual speech recognition accuracy.
M
The paper studies the task of extracting product features from reviews. We consider this task as a classification problem and propose a number of classification features. These features are computed using different statistics returned by queries to the Yandex search engine, the Internet library and the Russian National Corpus. To justify our approach, we create and manually label a product features dataset, compute the proposed classification features and conduct classification experiments. The results produced by various classifiers applied to different subsets of the data show the feasibility of our approach. We also examine the usefulness of the proposed classification features.
The paper describes a rule-based approach to sentiment analysis. The developed algorithm aims at classifying texts into two classes: positive or negative. We distinguish two types of sentiments: abstract sentiments, which are relevant to the whole text, and sentiments referring to some particular object in the text. As opposed to many other rule-based systems, we do not regard the text as a bag of words. We strongly believe that such a classical text processing method as syntactic analysis can considerably enhance sentiment analysis performance. Accordingly, we first parse the text and then take into account only the phrases that are syntactically connected to relevant objects. We use a dictionary to determine whether such a phrase is positive or negative and assign it a weight according to the importance of the object it is connected with. Then we combine all these weights and some other factors and decide whether the whole text is positive or negative. The algorithm showed competitive results at the ROMIP 2012 track.
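The weighting step can be sketched as follows. This is a minimal illustration, not the system’s actual rules: the dictionary entries, the weights and the `classify` function are all invented for the example. Each opinion phrase syntactically linked to an object contributes its dictionary polarity multiplied by the object’s importance weight, and the sign of the total decides the class.

```python
# Toy polarity dictionary (illustrative entries, not the real lexicon).
SENTIMENT_DICT = {"excellent": 1, "reliable": 1, "slow": -1, "broken": -1}

def classify(phrases):
    """phrases: (opinion_word, object_weight) pairs, assumed to be
    already linked to relevant objects by the parser. Each pair
    contributes polarity * weight; the sign of the sum decides the
    overall class of the text."""
    score = sum(SENTIMENT_DICT.get(word, 0) * weight
                for word, weight in phrases)
    return "positive" if score >= 0 else "negative"

classify([("excellent", 2.0), ("slow", 1.0)])   # weighted sum +1.0
```

The point of the design is that only parser-linked phrases enter the sum, so sentiment aimed at irrelevant objects never influences the score.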
Errors in the original text will most probably affect the quality of machine translation. It is therefore interesting to see how different types of errors influence the translation. To do this, we selected three sets of 500 random queries in English, German and Polish. In each set we corrected different types of errors: 1) missing diacritical marks (except in English); 2) all misprints (including diacritics); 3) errors in punctuation and use of capitals; 4) all types of errors listed in 1)–3). As a result we had five sets of 500 queries for German and Polish and four sets for English. Then we translated all the sets into Russian using three free online statistical machine translation systems and compared their BLEU scores to see how they increase on the corrected sets as compared to the original ones. We also used different types of BLEU: along with the usual one, which treats punctuation signs as words, we used a simplified BLEU which disregards punctuation, and an extended BLEU which takes into consideration both punctuation and use of capitals. We show that in a fully corrected text BLEU increases by approx. 10–15% as compared to the original sets. Correcting each of the two main types of errors — misprints and punctuation/capitalization — gives an increase of 5–10% each, depending on the language and on the peculiarities of the test sets. On the other hand, correcting only diacritics has a very small impact on translation quality: close to zero in German and 0.5–1% in Polish.
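The three BLEU variants differ only in how the text is tokenized before n-gram matching, which can be sketched as follows (our reconstruction under stated assumptions; the `tokenize` function and its variant names are invented for illustration, and the actual BLEU computation is unchanged across variants):

```python
# Tokenization for three BLEU variants: "usual" keeps punctuation as
# separate tokens, "simplified" drops punctuation, "extended"
# additionally preserves capitalization (the other two lowercase).
import re

def tokenize(text, variant="usual"):
    if variant != "extended":
        text = text.lower()
    # Words (\w+) and individual punctuation marks as separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if variant == "simplified":
        tokens = [t for t in tokens if re.match(r"\w", t)]
    return tokens

tokenize("Он пришёл, уже поздно.", "usual")
# ['он', 'пришёл', ',', 'уже', 'поздно', '.']
```

Any standard BLEU implementation can then be run over the token lists produced by each variant to reproduce the three scores.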
This paper attempts to refine our understanding of the grammatical and semantic features of the Russian collective numerals using corpus data. The focus of our attention is the word dvoe, considered in comparison with other quantity words comprising the meaning ‘two’ in their semantics, i. e. the numerals dva ‘two’ and oba ‘both’, as well as the noun para ‘pair, couple’. The importance for the Russian language of the semantic category of “twoness” is shown, and a new term, gemina tantum, is introduced to designate the class of nouns that tend to be used in the plural form and normally refer to two objects forming a pair or a couple, cf. shoes, boots, eyes, parents, spouses. Semantic analysis of the words dvoe and oba in the context of human nouns shows that these words practically never interchange because, despite similar assertions, they carry different presuppositions and implications.
We analyze N. Struve’s hypothesis that the author of the text of The Romance with Cocaine (published in 1936 under the pseudonym M. Ageev) was Vladimir Nabokov. We compare the idiostyle features of this text with those of all of Nabokov’s texts, as well as with what is available in the Russian National Corpus among works published before and after the Ageev and Nabokov works. The general conclusion is that Nabokov does not appear to be the author of this text. The problem was stated by Nikita Struve, who rejected biographical arguments and required that “philological”, literary or poetic arguments be given. We consider all of these arguments.
N
The paper presents an overview of a completed project focused on the annotation of grammatical, pronominal and extended nominal coreference and bridging relations in the Prague Dependency Treebank (PDT 2.0). We give an overview of existing similar projects and their aims and compare them with our project. We describe the annotation scheme and the typology of coreferential and bridging relations and give the statistics of these types in the annotated corpus. Further, we give the final results of the inter-annotator agreement with some explanations. We also briefly present the anaphora resolution experiments trained on the coreferentially annotated corpus and our future plans.
The paper describes an attempt to construct a Named Entity classifier upon the ABBYY Compreno Syntactic and Semantic Parser that was presented at the “Dialogue” conference in 2012. The classifier employs a supervised learning technique, namely the Conditional Random Fields model, developed under heavy constraints on the available feature set: no external NE lists or non-local features are used. The system is evaluated on the NER field’s “gold standard” evaluation corpus of CoNLL-2003, achieving F-scores of 91.61% on the dev set and 87.51% on the test set. The classifier outperforms several other systems developed under the same constraints on features, but underperforms a single system that makes use of significantly richer local context. The gain of individual classifier features based on parser attributes is explored; it is demonstrated that Compreno’s semantic hierarchy and surface (syntactic) slots provide the classifier with the most valuable features used to locate and classify NEs. This reliance on parser results, however, leads to error propagation from parser to classifier, as shown in the section on error analysis. The final conclusions offer several topics for future research.
P
Linguistic entities (words, grammatical categories, syntactic constructions) are called egocentricals if their semantics presupposes the speaker as one of the participants in the situation described, cf., for example, sejčas, as in On sejčas doma [‘he’s now at home’, the speaker is the holder of the moment of speech], edva li ‘unlikely’, as in On edva li pridet [‘he’s unlikely to come’, the speaker is the subject of doubt], subjunctive mood, as in Byla by sejčas vesna! [‘if it were spring now!’, the speaker is the subject of volition]. Only canonical communicative situations can afford a sterling, i.e. full-value, speaker — with a synchronous addressee, with a field of vision common to the speaker and the addressee, etc. In non-canonical communicative situations, such as narrative or hypotaxis, when the speaker is not accessible as a performer of his/her presupposed role, and some substitute of the speaker comes into play, different egocentricals behave differently. Two types of egocentricals are discerned — shiftable (i.e. secondary) egocentricals, which can be used in all types of communicative situations, and hard (i.e. primary) egocentricals, which stick to the canonical communicative situation, thus belonging to the so-called main clause phenomena. One egocentrical is discussed in detail: the adverb odnaždy ‘once upon a time’.
ATEX is a rule-based sentiment analysis system for texts in the Russian language. It includes full morpho-syntactic analysis of Russian text and highly elaborated linguistic rules, yielding fine-grained sentiment scores. ATEX participated in a variety of sentiment analysis tracks at ROMIP 2012. The system was tuned to process news texts on politics and economics. The performance of the system is evaluated on different topics: blogs on movies, books and cameras, and news. No additional training is performed: ATEX is tested as a universal ‘ready-to-use’ system for sentiment analysis of texts on different topics and in different classification settings. The system is compared to a number of sentiment analysis algorithms, including statistical ones trained on datasets in the respective topics. Overall system performance is very high, which indicates that the system can be applied to different topics with no actual training. As expected, the results are especially good in the ‘native’ political and economic news topic and in the movie blog topic, suggesting that both share common ways of expressing sentiment. With regard to blog texts, the system demonstrated the best performance in two-class classification tasks, which is a result of the specific algorithm design paying more attention to sentiment polarity than to sentiment/neutral classes. Along these lines, areas of future work are suggested, including the incorporation of a statistical training algorithm.
When analyzing errors in search queries, it is easy to notice that most query spelling errors are trivial typos. Such errors usually do not depend on the surrounding words, and their correction can be performed automatically. In this work we tried to define a class of query spelling errors that can be corrected automatically. For the selected class we developed a classifier dividing corrections into reliable (suitable for automatic query spelling correction) and less reliable (suitable only for query spelling suggestion). As candidates for autocorrection we used query speller suggestions familiar to users of search engines through the “Did you mean...” function. For classifier training we used typical lexical and statistical features. The experiments showed high performance of the word-level features and the ability to configure the classifier for a given level of accuracy. The proposed method of trivial typo correction can significantly improve the quality of query spelling error correction.
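The reliable/less-reliable decision described above can be sketched as a tiny scoring function. This is a minimal illustration with hypothetical features, weights, and a hypothetical threshold; the actual classifier uses a richer set of lexical and statistical features.

```python
import math

def classify_correction(query_freq, suggestion_freq, edit_distance,
                        threshold=0.9):
    """Toy reliability score: a much more frequent suggestion at a small
    edit distance is treated as a reliable autocorrection; anything else
    is only shown as a 'Did you mean...' suggestion. Raising the
    threshold trades recall for accuracy."""
    # how much more frequent the suggestion is (clipped at zero)
    freq_gain = max(math.log1p(suggestion_freq) - math.log1p(query_freq), 0.0)
    # closer corrections are more trustworthy
    distance_penalty = 1.0 / (1 + edit_distance)
    score = 0.6 * freq_gain + 0.4 * distance_penalty
    return "autocorrect" if score >= threshold else "suggest"
```

Tuning `threshold` is what "configuring the classifier for a given level of accuracy" amounts to in this sketch: a higher value yields fewer but safer automatic corrections.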
The paper proposes a substantive classification of collocates (pairs of words that tend to co-occur), along with heuristics that can help to attribute a word pair to the proper type automatically. The best-studied type is frequent phrases, which includes idioms, lexicographic collocations, and syntactic selection. Pairs of this type are known to occur at a short distance and can be singled out by choosing a narrow window for collecting co-occurrence data. The next most salient type is topically related pairs. These can be identified by considering word frequencies in individual documents, as in the well-known distributional topic models. The third type is pairs that occur in repeated text fragments such as popular quotes or standard legal formulae. The characteristic feature of these is that the fragment contains several aligned words that are repeated in the same sequence. Such pairs are normally filtered out for most practical purposes, but filtering is usually applied only to exact repeats; we propose a method of capturing inexact repetition. Hypothetically, one could also expect to find a fourth type: collocate pairs linked by an intrinsic semantic relation or a long-distance syntactic relation. Such a link would guarantee co-occurrence at a relatively restricted range of distances, narrower than in the case of a purely topical connection, but not as narrow as in repeats. However, we do not find many cases of this sort in the preliminary empirical study.
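The narrow-window heuristic for the first type can be illustrated with a short sketch; the window size and any weighting scheme here are illustrative assumptions, not the paper's actual settings.

```python
from collections import Counter

def window_cooccurrences(tokens, window=2):
    """Count unordered word pairs occurring within `window` tokens of
    each other; with a narrow window, high counts tend to pick out
    frequent phrases (idioms, collocations, syntactic selection)."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # look only a few tokens ahead: a narrow collection window
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[tuple(sorted((left, tokens[j])))] += 1
    return pairs
```

Widening the window would pull in the topical pairs of the second type, which is why the classification leans on document-level frequencies for those instead.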
This paper investigates constraints on the incorporation of nominal roots into compound verbs in Russian. This type of incorporation is generally impossible. The author examines several apparent exceptions to this generalization and proposes an explanation of the constraint itself as well as of the exceptions. Special attention is paid to the relation between (non-existing) compound verbs and the compound nominals corresponding to the same nominal+verbal complex. Exceptions to the general constraint “no nominal roots within a compound verb” include deverbal adjectives that are formally equivalent to participles, verbs with reflexive and reciprocal “pronominal” components, verbs derived from compound nominals, and compound verbs that have lost their semantic interpretability as complex verbs. This interpretability is postulated to be the crucial feature correlating with the constraint on verbal compounds with a nominal component, since it indicates the presence of two independent nodes (V and NP) in the structure of the compound. If such a two-node structure becomes a verb, the inner NP node receives case from higher structural levels and cannot incorporate into a compound verb.
The paper deals with new variants of noun government in the modern Russian language. These variants are accounted for by certain semantic factors, such as the development of meaning and semantic analogy. Due to the development of meaning, the nouns avarija ‘accident’, piruèt ‘pirouette’ and kontseptsija ‘conception’ get new variants of government (avarija s ‘with’ + instrumental, piruèty s ‘with’ + instrumental / vokrug ‘around’ + genitive, kontseptsija po ‘on’ + dative). By semantic analogy, the nouns bum ‘boom’, fobija ‘phobia’ and vostrebovannost’ ‘demand’ adopt the syntactic features of their synonyms. Bum ‘boom’ accepts a PP na ‘on’ + accusative (by analogy with the words moda ‘fashion’ or spros ‘demand’). Fobija ‘phobia’ governs either pered ‘before’ + instrumental (by analogy with the noun strakh ‘fear’) or k + dative (by analogy with words belonging to the semantic group ‘attitude (positive or negative) toward smb. or smth.’, e.g. neprijazn’ ‘dislike’, uvazhenije ‘respect’). Vostrebovannost’ ‘demand’ governs v ‘in’ + prepositional case by analogy with the semantically similar word potrebnost’ ‘need’. Well-educated native speakers were asked to fill in questionnaires containing phrases with these variants; their answers are presented.
This article deals with the use of strikethrough (also known as liturative) on the Russian Web. We summarize two previous attempts to classify the instances of liturative and propose a new classification based on three binary syntactic and semantic features. This classification distinguishes six main types of lituratives (the two other theoretically possible types are not attested). The features in question are [± substitution] (whether or not the strikethrough text serves as a substitute for the normal text), [± violation of conversational maxims], and [± negative attitude towards the speaker] (whether or not the strikethrough text could possibly cause a negative attitude towards its author). All findings are illustrated with real examples extracted from Russian blogs. In the last section of the paper, we discuss technical issues of using strikethrough on the Web and its implementation on various websites (LiveJournal, Mail.ru, Yandex, Gmail, Facebook, VKontakte). We attempt to explain why the popularity of strikethrough is gradually decreasing.
The paper focuses on phenomena that fall under the broad category of what is called “loose uses” of language, or “vague reference”. These are lexical, grammatical and prosodic resources that allow the speaker to refer to objects and events for which he or she fails to retrieve the exact name, or simply finds the exact name unnecessary or inappropriate. Based on first-hand corpus data of spoken Russian, the paper investigates expressions that are used to temporarily substitute for a delayed constituent, as well as those that do not imply any later substitution but rather suggest an approximate nomination sufficient at the current moment of communication. These expressions can be used instead of their supposed exact correlate or together with it. The first option implies that an expression is used as a generic, or as a cover bleached nomination. The second option implies that the speaker does not take full responsibility for the actual nomination the expression is added to, since it is in some sense incomplete or not fully appropriate. The study of the lexical resources of vague reference in spoken Russian is complemented by an investigation of the associated syntactic and prosodic patterns.
The paper presents the key principles of building a grammar dictionary and a morphological analyzer for XVIII–XIXth century Russian texts, based on the orthographical, morphological and lexical features exemplified by the Russian National Corpus (RNC). The analyzer should involve different modules applicable to different kinds of texts, depending on their respective orthographical and grammatical phenomena. Several alternative ways of implementing orthographical and morphological rules are discussed (including pre-processing, online normalization, etc.). Evaluation data for the first analysis results are presented.
Morphological disambiguation is one of the key aims of part-of-speech tagging. The task is often considered solved, yet all existing disambiguation tools rely on large amounts of manually created data. This paper describes an attempt to disambiguate a Russian corpus without manually annotated data. The method we use was proposed about twenty years ago but has not yet been applied to synthetic languages. The main idea of our approach is to derive disambiguation rules automatically from a corpus with ambiguous annotations, using only a small amount of statistical information. This can be done in a simple way by means of unsupervised learning. The results are quite high and comparable to those of existing systems. We also tried to measure the size of the corpus necessary to produce a reasonable set of disambiguation rules and showed that it can be comparable in size to the corpora used to train statistical disambiguation models.
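The core idea, learning from the unambiguous tokens of an otherwise ambiguously annotated corpus, can be sketched as follows. This is a toy tag-bigram version under the assumption that each token carries a set of candidate tags; the actual rule format and statistics in the paper may differ.

```python
from collections import Counter

def learn_bigram_stats(sentences):
    """Collect tag-bigram counts using only unambiguous tokens, the
    sole supervision available in an ambiguously annotated corpus.
    Each sentence is a list of (word, candidate_tags) pairs."""
    counts = Counter()
    for sent in sentences:
        prev = "<S>"  # sentence-start marker
        for _, tags in sent:
            if prev is not None and len(tags) == 1:
                counts[(prev, tags[0])] += 1
            # an ambiguous token breaks the chain of known contexts
            prev = tags[0] if len(tags) == 1 else None
    return counts

def disambiguate(sentence, counts):
    """Greedy left-to-right disambiguation: choose the candidate tag
    seen most often after the previously chosen tag."""
    out, prev = [], "<S>"
    for word, tags in sentence:
        best = max(tags, key=lambda t: counts[(prev, t)])
        out.append((word, best))
        prev = best
    return out
```

Only the bigram counter needs to be stored, which is why the approach can work with fairly small corpora of ambiguous annotations.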
R
The paper examines the lexical semantics and syntax of the form postoj (the attenuated imperative of the Russian verb postojat’ ‘stand (for a while)’), describing it as one of the quasi-grammatical markers of the continuous prohibitive, such as prekrati, perestan’, xvatit, budet, ostav’, xoroš, etc. All of them mark an illocution of interrupting the ongoing situation. Postoj differs from the other markers in its attenuative semantics: the situation (definite and taking place at the moment of speech, but not explicated in the sentence) has to be interrupted only for a while. The speaker offers to use this short span of time to improve the situation with some additional means; the asyndetic clause that follows postoj explicates the speaker’s suggestion.
S
The paper analyzes the semantics and accentual structure of contrast and emphasis, two modifiers of communicative meanings, in the sentences examined. We argue that the contrastive and emphatic highlighting of one of the utterance components in the given examples is performed by speakers strategically, in order to convey occasional implicit meanings. All examples are illustrated with graphs displaying tone fluctuations, sound intensity, modulation of sound, and other prosodic features.
The paper is concerned with the regular ambiguity of the type ‘parameter — high value’ for Russian quantitative parametric nouns like glubina (‘depth’), davlenie (‘pressure’), etc. This type of ambiguity is shown to be heterogeneous. For some dimensional nouns the ambiguity is caused by a metonymic shift from the meaning of a magnitude to the meaning of a spatial area where the value of this magnitude is high. For most parametric nouns this ambiguity is revealed in combinations with verbs of surprise like udivljat’sja (‘be surprised’). The ambiguity has some analogs among non-quantitative parametric nouns, e.g. ‘parameter — Bon [the lexical function]’ for the non-quantitative parameter kachestvo (‘quality’).
The paper discusses the discursive functions of three Russian constructions: esli mozhno tak skazat’ [‘if I can say so’], esli mozhno tak vyrazit’sya [‘if I can express it this way’] and s pozvolenija skazat’ [‘if I’m allowed to call it X’]. These constructions serve as metalinguistic tools that structure the information flow. Functioning as parentheticals, they mark the speaker’s attitude towards his/her own speech actions and attract the attention of the addressee to a non-trivial form of expression. Such non-trivial forms include unexpected lexical choice, metaphoric nomination and breaking the norms of word formation. By using esli mozhno tak skazat’ or esli mozhno tak vyrazit’sya the speaker can also introduce the process of searching for the optimal way of expressing an idea. Non-trivial lexical choices or ungrammatical forms introduced by these constructions signal the speaker’s stance towards the object or the situation. Another possible goal of unusual verbal behavior is switching from the bona fide to the non-bona fide mode of communication. Along with negative evaluation, this switch can lead to an ironic interpretation of the utterance. The third construction, s pozvolenija skazat’, functions as a signal of the process of linguistic categorization. By using it the speaker shows that the object cannot belong to a particular category due to the lack of necessary properties.
The article discusses problems of the identification, analysis, classification (according to the International System of Units and, separately, according to word-formation peculiarities), and processing of quantitative expressions (QEs) with measurement units (MUs), as applied to text-to-speech synthesis by means of the linguistic processor NooJ and specially collected legal, scientific and technical text corpora for the Belarusian and Russian languages. In addition to a general description of algorithms and resources for finding QEs in Belarusian and Russian texts, the paper gives an overview of QEs with MUs with regard to how their components can be written, i.e. digital descriptors and MUs proper (five different types). It is shown that QEs with MUs can receive correct intonation marking only after they are properly generated, i.e. expanded into orthographical words.
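The final step, expanding a QE into orthographic words, can be illustrated with a toy fragment. The lexicons below are hypothetical English stand-ins for the much richer NooJ resources, and real Belarusian and Russian expansion must additionally handle case and number agreement between the numeral and the unit.

```python
import re

# Hypothetical toy lexicons; the real NooJ grammars are far richer.
NUMERALS = {"1": "one", "2": "two", "5": "five", "10": "ten"}
UNITS = {"km": "kilometers", "kg": "kilograms", "V": "volts"}

def expand_qe(text):
    """Expand '5 km' -> 'five kilometers' so a TTS front end can assign
    intonation to orthographic words rather than digits and symbols."""
    def repl(m):
        num, unit = m.group(1), m.group(2)
        if num in NUMERALS and unit in UNITS:
            return f"{NUMERALS[num]} {UNITS[unit]}"
        return m.group(0)  # leave unknown expressions untouched
    return re.sub(r"(\d+)\s*([A-Za-z]+)", repl, text)
```

The point of the expansion is visible in the output: only after it does the QE consist of words that an intonation-marking module can operate on.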
Many studies discuss how morphological ambiguity influences processing. In particular, it is well known that attraction errors in subject-verb agreement are produced more often and cause a smaller delay in comprehension if the form of the intervening noun coincides with its Nominative case form. This is the case in the German example die Stellungnahme gegen die Demonstrationen waren… ‘the position against the demonstrations (Acc.Pl=Nom.Pl) were’, as opposed to die Stellungnahme zu den Demonstrationen waren… ‘the position on the demonstrations (Dat.Pl≠Nom.Pl) were’. However, the explanation of this phenomenon is a matter of debate: how are such errors produced or missed in comprehension, and how are ambiguous forms represented so that they can influence this process? We offer a novel perspective on this problem by looking at novel data. We conducted two self-paced reading experiments exploring how Russian adjective forms ambiguous for case influence the processing of case errors on the following nouns. We compare sentences containing errors like fil’my bez izvestnyh akterah ‘movie.NOM.PL without famous.GEN.PL=PREP.PL actor.PREP.PL’ and fil’my bez izvestnyh akteram ‘movie.NOM.PL without famous.GEN.PL≠DAT.PL actor.DAT.PL’ to grammatically correct sentences. Errors of the first type are detected later, and their effect is less pronounced. The results help answer several questions that arise in connection with attraction errors in subject-verb agreement.
The paper introduces the method of Discourse Contexts to describe the semantics and use of a Russian verb pair that corresponds to the situation of discrimination. Discrimination implies comparison resulting in differentiation and singling out. A discourse context is understood as a complex entity including: (i) an abstract Immanent Situation, which underlies the use of the verbs and represents a configuration of essential elements, such as the idea of discrimination and the entities involved, including the subject of discrimination and the features of discrimination; (ii) an Entity Situation, in which the essential elements are classified according to the concrete verb; (iii) a Grammar Constituent. The analysis of material from the Russian National Corpus yields five types of discourse contexts for the verbs otlichit’ — otlichat’, which are presented and exemplified in the paper. Discourse contexts are shown to help capture different meanings and explain semantic peculiarities of otlichit’ — otlichat’.
Obtaining natural synthesized speech is the main goal of modern research in the field of speech synthesis, and it strongly depends on the prosody model used in the text-to-speech (TTS) system. This paper deals with speech synthesis evaluation with respect to the prosodic model used. Our Russian VitalVoice TTS is a unit-selection concatenative system. We describe two approaches to prosody prediction used in VitalVoice Russian TTS: a rule-based approach and a hidden Markov model (HMM) based hybrid approach. We conducted an experiment evaluating the naturalness of synthesized speech. Four variants of synthesized speech, depending on the applied approach and the speech corpus size, were tested. We also included natural speech samples in the test. Subjects had to rate the samples from 0 to 5 according to their naturalness. The experiment shows that speech synthesized using the hybrid HMM-based approach sounds more natural than the other synthetic variants. We discuss the results and directions for further investigation and improvement in the last section.
The article is dedicated to the largest digital resource in the world containing a uniform description of language grammars: the typological database “Languages of the World” (“Jazyki Mira”). We describe the contents of the database and the programs for data processing. The database has three main areas of application: it can be used for quantitative research, as a reference linguistic resource, and for educational purposes. We give examples of the database's application in scientific research in typology and areal linguistics. The examples demonstrate new opportunities for studying such questions as the stability of grammatical features, susceptibility to borrowing, and the typological and areal classification of languages. “Languages of the World” is also compared with another well-known typological database, WALS.
T
The paper argues for a theory that accounts for the hierarchical structure of the Russian verb. The theory assumes that possible derivations of verb stems are constrained by the aspectual selectional characteristics of prefixes or by their position with respect to the “secondary imperfective” morpheme. Accordingly, two groups of prefixes can be identified: selectionally restricted and positionally restricted. The paper focuses on dialectal variation that determines the class membership of individual prefixes and shows that this variation is conditioned by the same selectional and positional constraints. In this way, the dialectal variation provides further support for the proposed theory of the structure of the Russian verb.
U
The paper deals with multilingual sentiment analysis. We propose a method for projecting an opinion lexicon from a source language to a target language with the use of a parallel corpus. This makes it possible to perform sentiment classification in the target language using an opinion lexicon even when no labeled dataset is available. The advantage of our method is that it captures the context of a word and thus produces a correct translation of it. We apply our method to the English-Russian language pair and conduct sentiment classification experiments. They show that our method allows creating high-quality opinion lexicons.
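Assuming word-aligned sentence pairs are available, the projection step can be sketched as a voting procedure. Function and label names here are hypothetical, and this sketch omits the context modeling that the paper uses to pick the correct translation of ambiguous words.

```python
from collections import Counter, defaultdict

def project_lexicon(aligned_pairs, source_lexicon, min_count=2):
    """Project polarity labels through word alignments: each time a
    source lexicon word aligns to a target word, the target word
    inherits a vote for the source word's polarity.
    aligned_pairs: iterable of (src_tokens, tgt_tokens, alignment),
    where alignment is a list of (source_index, target_index)."""
    votes = defaultdict(Counter)
    for src_tokens, tgt_tokens, alignment in aligned_pairs:
        for i, j in alignment:
            word = src_tokens[i]
            if word in source_lexicon:
                votes[tgt_tokens[j]][source_lexicon[word]] += 1
    lexicon = {}
    for tgt_word, counts in votes.items():
        polarity, count = counts.most_common(1)[0]
        if count >= min_count:  # drop rare, unreliable projections
            lexicon[tgt_word] = polarity
    return lexicon
```

The `min_count` cutoff is one simple way to keep only projections supported by several corpus occurrences.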
The object of the paper is the Russian adverb VPORU ‘suiting best’. In the 19th century the meaning of this word was less rich, so it was used in more types of contexts than now. At present the adverb VPORU is freely used in three types of contexts: (a) Pidzhak emu vporu ‘The coat is the right size for him’; (b) Ej zamuzh vporu (a ona v kukly igraet) ‘She should marry (but not play with dolls)’; (c) Zdes’ tak temno — vporu na chetveren’kakh polzti ‘It is so dark here — one might as well crawl on all fours’. The adverb VPORU in (a) freely combines with negation in the 19th-century language, but not in present-day Russian. The reason is that in the 19th-century language the meaning of VPORU in (a) is ‘suiting’, not ‘suiting best’. The latter meaning consists of two predicates. It is demonstrated that negation of such a sense violates a Gricean maxim. So Gricean maxims, applied to the meaning of an anomalous word combination, can explain the reason for its anomaly. The adverb VPORU in (b) and (c) does not combine with negation. Contexts (b) are similar to (a). As for (c), the meaning of VPORU here has a rich modal frame. Within the scope of negation, the assertion of VPORU contradicts this modal frame; this source of anomaly of a word combination has been described by Ju. D. Apresjan [1978/1995].
Y
The prosodic cues for discourse incompleteness may be either identical with the prosodic means expressing the topic or independent of the marking of the communicative constituents of a sentence, the topic and the focus. The autonomous prosodic marking of discourse incompleteness becomes possible in the context of tails. A tail is a fragment of a sentence placed after the accent-bearer of the focus. (Thus in the sentence Malo ja smyslju v muzhskoj krasote ‘Little I know about men’s attractiveness’, with malo ‘little’ as the accent-bearer of the focus, the fragment ja smyslju v muzhskoj krasote is the tail.) A tail may be either deaccented, or it may be used to carry the rise of discourse incompleteness. Generating a tail is conditioned by the activation of entities within a sentence, contrast, emphasis, and verification, expressed either by lexemes or by prosody, or both. In Russian, a tail can also result from a specific word order transformation in which the focus accent-bearer is shifted to the left, in front of the finite verb. The sentence-final verb thereby becomes the tail, to be specifically used as the bearer of the discourse incompleteness pitch accent. (Thus in the sentence Ja pidzhak snjal…, literally ‘I my coat took off…’, with pidzhak ‘coat’ as the accent-bearer of the focus, the sentence-final verb snjal ‘took off’ is the tail.) Sentences with tails are therefore able to display a full set of communicative meanings, including topic, focus and discourse incompleteness, expressed by separate accent-bearers carrying the respective pitch accents.
Z
Open Information Extraction (Open IE) is the task of extracting relational tuples representing facts from text, with no prior specification of relations, no pre-specified vocabulary, and no manually tagged training corpus. Part-of-speech based systems have been shown to be competitive with parsing-based systems on this task and work faster on large-scale corpora. Nevertheless, the implementation of such a system requires language-specific information, and so far all work has been done for English. We present a relation extraction algorithm for Open IE in Spanish, based on POS-tagged input and semantic constraints, and describe its implementation in ExtrHech, an Open IE system for Spanish. We compare its performance with Open IE systems for English, including a comparison on a parallel English-Spanish dataset, and show that the performance is comparable with state-of-the-art systems while being more robust to noisy input. We give a comparative analysis of extraction errors for both languages.
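POS-pattern-based extraction of the kind described above can be illustrated with a single simplified pattern, NOUN + VERB(+ADP)* + NOUN. This is only a sketch with an English example; ExtrHech's actual patterns and semantic constraints for Spanish are considerably richer.

```python
def extract_triples(tagged):
    """Find (arg1, relation, arg2) triples in a POS-tagged sentence
    using one minimal pattern: a noun, then a verb optionally followed
    by adpositions, then a noun. tagged: list of (word, tag) pairs."""
    triples = []
    n = len(tagged)
    for i in range(n):
        if tagged[i][1] != "NOUN":
            continue
        j, rel = i + 1, []
        # absorb the verb and any trailing adpositions into the relation
        while j < n and (tagged[j][1] == "VERB"
                         or (rel and tagged[j][1] == "ADP")):
            rel.append(tagged[j][0])
            j += 1
        if rel and j < n and tagged[j][1] == "NOUN":
            triples.append((tagged[i][0], " ".join(rel), tagged[j][0]))
    return triples
```

Running only on POS tags is what makes such systems fast on large corpora: no parse tree is ever built, at the cost of missing long-distance arguments.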
The paper argues that transitive impersonals in Russian, Ukrainian and Icelandic can be accounted for in terms of Mel’čuk’s zero lexemes, reanalyzed here as pronouns in the nominative case acting as agreement controllers. An alternative analysis resorting to Burzio’s Generalization stipulates a defective vP for different classes of verbs licensing transitive impersonals but fails to make correct predictions. The distribution of impersonals in Russian and Ukrainian does not depend on the distinction between unaccusative, unergative and psych predicates. Most Russian verbs labeled ‘psych’ in previous generative research are either semantic causatives or agentive verbs with an external argument and valency grid.
Stemming from traditional “rule-based” translation, a “model-based” approach is considered as an underlying model for statistical machine translation. This paper is concerned with training this model on parallel corpora and with its application to parsing and translation.