A
The article addresses the problem of automatically measuring the interpretability of topic models. Several new intra-text approaches to estimating topic interpretability are proposed. Computational experiments are conducted on texts from “PostNauka”, a collection of popular science content.
In this paper, we explore ways to improve POS-tagging using various types of auxiliary losses and different word representations. As a baseline, we use a BiLSTM tagger, which achieves state-of-the-art results on sequence labelling tasks. We developed a new method for character-level word representation using a feedforward neural network. This representation improved both the speed and the performance of the model. We also applied a novel technique of pretraining such word representations with existing word vectors. Finally, we designed a new variant of auxiliary loss for sequence labelling tasks: an additional prediction of the neighbour labels. This loss forces the model to learn the dependencies inside a sequence of labels and accelerates training. We test these methods on English and Russian.
This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in either of the corpora. To investigate whether this difference might be due to an over-representation of a speaker who happens to be an outlier in terms of using a particular morpheme, we use DP, a measurement of evenness of the distribution of a specific linguistic feature across subcorpora of the same corpus.
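The evenness measure mentioned above can be illustrated with a minimal sketch. It assumes that DP is the "deviation of proportions" as commonly defined: half the sum of absolute differences between each subcorpus's share of the feature's occurrences and its share of the total corpus size. The function name and the toy counts are hypothetical and are not taken from the paper.

```python
def deviation_of_proportions(feature_counts, part_sizes):
    """Return DP = 0.5 * sum(|observed_i - expected_i|) over corpus parts.

    feature_counts[i]: occurrences of the feature in part i (e.g. one speaker).
    part_sizes[i]: size of part i in tokens.
    Values near 0 mean an even spread; values near 1 mean a very uneven one.
    """
    if len(feature_counts) != len(part_sizes):
        raise ValueError("need one count and one size per corpus part")
    total_feature = sum(feature_counts)
    total_size = sum(part_sizes)
    if total_feature == 0 or total_size == 0:
        raise ValueError("feature and corpus must both be non-empty")
    total_diff = 0.0
    for count, size in zip(feature_counts, part_sizes):
        observed = count / total_feature   # share of the feature in this part
        expected = size / total_size       # share of the corpus in this part
        total_diff += abs(observed - expected)
    return total_diff / 2


# Hypothetical toy data: three speakers with 1000, 2000 and 3000 tokens.
dp_even = deviation_of_proportions([10, 20, 30], [1000, 2000, 3000])
dp_skewed = deviation_of_proportions([60, 0, 0], [1000, 2000, 3000])
```

In the toy data, the first morpheme is spread in proportion to the speakers' contributions, whereas the second is used by a single speaker only, which is the kind of outlier effect the abstract describes.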
The paper deals with a curious phenomenon of quasi-synonymy that occurs
in Russian between sentences with non-negated and negated predicates
in the construction with the adverb dolgo ‘for a long time’. Consider sentences
like Chainik dolgo zakipal ‘It took the kettle a long time to boil, lit. Kettle
for a long time boiled’ vs. Chainik dolgo ne zakipal ‘It took the kettle a long
time to boil, lit. Kettle for a long time not boiled’. The paper is an attempt
to define the semantic and pragmatic mechanisms of such quasi-synonymy, as well as semantic and aspectual classes of predicates where it occurs.
It also considers subtle semantic, pragmatic and communicative differences
associated with the non-negated and negated constructions, respectively.
Such quasi-synonymy occurs primarily in cases when the predicate belongs
to the aspectual class of accomplishments and denotes a telic process
or action with a desired result (‘to boil’, ‘to cool down’, ‘to warm up’,
‘to grow up’, ‘to finish’, etc.). Those predicates include two major semantic
components, that is, a lasting process or action and an instant result. In the
imperfective aspect they allow at least two possible interpretations, namely,
of a process and that of a result. Similar interpretations of sentences with
such predicates occur due to different scope assignments of negation and
dolgo. In sentences with a non-negated predicate, dolgo has scope over the
‘process’ component in the verb; in sentences with a negated predicate, negation
has scope over the ‘result’ component of the verb while at the same
time falling into the scope of dolgo. The former type of sentences describes
long-lasting processes, whereas the latter type describes long-awaited results,
which pragmatically amount to the same thing.
The paper is a corpus study of the factors involved in disambiguating potential
scope ambiguity in written sentences with negation and the universal
quantifier all, such as I cannot visit all these universities, which, depending
on topic-focus assignment, can alternatively mean ‘I cannot visit any
of these universities’ (cannot is focus) and ‘I cannot visit some of these universities’
(all is focus). The factors at play in scope disambiguation are the
syntactic function of the constituent containing all (subject, direct complement,
adjunct); the status of the main predicate and all with respect to the
information structure of the utterance (topic vs. focus); veridical vs. nonveridical
context; sentence type (unreal conditional, rhetorical question);
and pragmatic implicatures pertaining to the situations described in the utterances.
The paper also demonstrates differences in the frequency distribution
of various scope readings, discusses their underlying causes, and formulates
typical contexts for each scope interpretation.
B
Framework for Russian plagiarism detection using sentence embedding similarity and negative sampling
In this paper, we propose a new approach to advanced plagiarism detection in Russian. It is based on a classifier that uses two types of sentence similarity measures: token set similarity and cosine similarity between sentence embeddings (based on pre-trained RusVectōrēs, unsupervised fastText, and supervised StarSpace models). The diversity of the feature space makes it possible to detect different types of plagiarism, from simple copy-and-paste cases to complex manual paraphrases. The approach can be focused on identifying a particular plagiarism type while still allowing a universal model to be trained. The method performs well on different types of plagiarism and outperforms the previous approach.
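The two kinds of similarity features named above can be sketched as follows. This is only an illustration under stated assumptions: token set similarity is taken here to be Jaccard overlap of lowercased whitespace tokens, and the sentence embeddings are assumed to be already available as plain lists of floats. The function names are hypothetical and do not come from the paper.

```python
import math


def token_set_similarity(sentence_a, sentence_b):
    """Jaccard overlap of token sets (an assumed choice of set measure)."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equally long embedding vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(sentence_a, sentence_b, embedding_a, embedding_b):
    """Feature vector for one sentence pair, to be passed to a classifier."""
    return [
        token_set_similarity(sentence_a, sentence_b),
        cosine_similarity(embedding_a, embedding_b),
    ]
```

A copy-and-paste pair would score high on both features, while a manual paraphrase may share few tokens yet keep a high embedding similarity, which is why combining the two can help separate plagiarism types.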
Generic topics of large-scale document collections can often be divided into more specific subtopics. Topic hierarchies provide a model for such topic relation structure. These models can be especially useful for exploratory search systems. Various approaches to building hierarchical topic models have been proposed so far. However, there is no agreement on a standard approach, largely due to the lack of quality metrics to compare existing models. To bridge this gap we propose automated evaluation metrics which measure the quality of topic-subtopic relations (edges) of a topic hierarchy. We compare automated evaluations with human assessment to validate the proposed metrics. Finally, we show how the proposed metrics can be used to control and to improve the quality of existing hierarchical models.
The paper describes a new version of the semantic analyzer SemETAP. Our approach is based on the assumption that the depth of understanding grows with the number of inferences we can draw from the text. The salient features of SemETAP include: 1) Intensive use of both linguistic and background knowledge. The former is incorporated in the Combinatorial Dictionary and the Grammar, and the latter is stored in the Ontology and the Repository of Individuals. 2) Words and concepts of the ontology may be supplied with explicit decompositions for inference purposes. 3) Two levels of semantic structure are distinguished. Basic semantic structure (BSemS) interprets the text in terms of ontological elements. Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences. 4) A new logical formalism, Etalog, has been developed, in which all inference rules are written. Semantic analysis with inference allows us to extract implicit information. The analyzer is tested on the task of interpreting the highlights of a football match.
A subject index, or back-of-the-book index, is a device intended to provide easy access to relevant fragments of a text document. Subject indexes usually contain particular single-word and multi-word terms from the corresponding documents. Such indexes are especially useful for reading large documents with specialized terminology, as well as educational texts in difficult scientific and technical areas. The central problem of back-of-the-book indexing is the recognition of terms to be included in the index. The paper describes a method developed for extracting and filtering terms from a given educational scientific text, with the purpose of reliable term selection in computer indexing systems. The method is primarily based on rules with lexico-syntactic patterns representing linguistic information about terms and typical contexts of their usage in Russian scientific and educational texts; simple term occurrence statistics are used as well. Experimental evaluation of the method has shown a considerable increase in the precision and recall of term extraction compared with widely used standard techniques.
This paper addresses the task of automatic genre classification for Arabic within the Functional Text Dimensions (FTD) framework, which allows texts to get a reliable genre description while maintaining an adequate number of genre labels. Our aim in this study is to build an automatic classification model that can annotate any Web text in Standard Arabic in terms of genres. To build the training corpus, we translated English and Russian annotated texts into Arabic using Google MT. To build the model, we experimented with various machine learning approaches, such as Logistic Regression, SVM and LSTM, and with different features, such as words, character n-grams and embedding vectors. To test the classification models, we collected our own corpus of Arabic Web texts and annotated it in terms of FTDs. The best performing model offers reasonable classification accuracy in spite of being based on a training corpus produced by MT.
D
The notion of event boundaries is closely connected with the category of aspect. Aspectual forms show different views of the “internal temporal constituency of a situation” (Comrie 1976:3) and, consequently, construe events in different ways. Recently scholars have started looking into the core of the aspectual distinction through multimodality, considering hand gestures. On the basis of Russian and French oral narratives produced by native speakers, we conducted a study testing our hypothesis that there is a direct correlation between the expression of boundaries in verbs and in gestures. The means of boundary expression considered on the verbal level were, for Russian, perfective (soveršennyj vid) and imperfective (nesoveršennyj vid) verbs and, for French, the passé composé and the imparfait. On the kinesthetic level we distinguished between bounded gestures (i.e., involving a pulse of movement) and unbounded gestures (i.e., smooth by nature). While for French L1 we found a direct correlation between gesture boundary schemas and aspectual forms, the results for Russian L1 did not support our hypothesis. In view of these differences between the two languages, we studied the boundedness correlation in oral narratives produced by Russians speaking French as L2 (CEFR levels B2-C1). The comparison between L1 and L2 narratives revealed a certain change of gestural patterns: the Russian speakers of French L2 used almost the same number of unbounded and bounded gestures with the perfective verb forms and more unbounded gestures with the imperfective forms, thus moving closer towards French L1 speakers’ verb-gesture patterns. The use of gestures can be accounted for by a series of noise factors related to language peculiarities, the cognitive mechanism of profiling and the challenges of speaking in L2.
The paper outlines the principles of analyzing German and Russian modal
constructions. Our first task is to clarify the set of meanings of German
modal verbs and the conditions for their implementation. The second task
is to describe the means of expressing modal meanings in Russian that are
encountered in parallel corpora as functional equivalents of constructions
with German modal verbs. As empirical data we use a representative array
of parallel German-Russian texts from the Russian National Corpus (RNC).
A supracorpora database of translation correspondences is constructed,
in which both the German constructions with modal verbs and their Russian
translation equivalents are annotated for their relevant
characteristics. This database, on the one hand, is a valuable linguistic resource
that can be used, among other things, to create a new generation
of electronic interactive German-Russian and Russian-German dictionaries.
On the other hand, the inventory of Russian construction types with (implicit)
modal meanings constructed on this database will contribute to
Construction Grammar and confirm the continuity between grammar and
lexicon.
E
The discourse marker tipa became widespread in colloquial Russian in the
1990s–2000s. However, until recently it received little attention.
In this paper we use the data from the Russian National Corpus and we aim
to accomplish the following goals: 1) to highlight the origin of the discourse
marker tipa from the noun tip ‘type’, 2) to describe the semantics of the discourse
marker tipa as well as that of the partly grammaticalized element tipa
as part of parametric constructions. We base our approach mainly on the
results achieved by Susanne Fleischman and Marina Yaguello.
F
The problem of spelling correction is crucial for search engines as misspellings
have a negative effect on their performance. It gets even harder when
search queries are related to a specific area not quite covered by standard
spell checkers, such as geographic information systems (GIS). Moreover,
standard spell-checkers are interactive, i.e. they can notice a misspelled
word and suggest candidate corrections, but picking one of them is up to the
user. This is why we decided to develop a spelling correction unit for 2GIS,
a cartographic search company. To do this, we have extracted and manually
annotated a corpus of GIS lookup queries, trained a language model,
performed various experiments to find the best feature extractor, then fitted
a logistic regression using an approach suggested in SpellRuEval, and
then used it iteratively to get a better result. We have then measured the
resulting performance by means of cross-validation, compared it against
two baseline algorithms and observed a substantial increase. We also present
an interpretation of the result achieved by calculating and discussing
the importance of specific features and analyzing the output of the model.
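The candidate-ranking step described above can be sketched as follows. This is a simplified illustration and not the system built for 2GIS: the two features (edit distance and log frequency), the weights, the bias and the toy vocabulary are all hypothetical, and in practice the weights would come from a fitted logistic regression rather than being set by hand.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(word, vocabulary, weights, bias, max_distance=2):
    """Rank in-vocabulary corrections by a logistic score, best first.

    vocabulary maps each known word to its frequency in the query corpus.
    """
    scored = []
    for candidate, frequency in vocabulary.items():
        distance = edit_distance(word, candidate)
        if distance > max_distance:
            continue  # too far from the query word to be a plausible fix
        features = [distance, math.log(frequency + 1)]
        z = bias + sum(w * f for w, f in zip(weights, features))
        probability = 1.0 / (1.0 + math.exp(-z))
        scored.append((probability, candidate))
    return [candidate for _, candidate in sorted(scored, reverse=True)]


# Hypothetical street-name vocabulary and hand-set weights, for illustration.
toy_vocabulary = {"lenina": 500, "lenta": 300, "letnyaya": 50}
ranking = rank_candidates("lenima", toy_vocabulary, weights=[-2.0, 0.5], bias=0.0)
```

Because the top candidate is chosen automatically, such a unit does not need the user to pick a correction, which is the non-interactive behaviour the abstract asks for.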
G
The problem of detecting heated arguments in text such as political debates and customer complaints is formulated as tree kernel learning of discourse structures. Affective argumentation structure is discovered in the form of discourse trees extended with edge labels for communicative actions. Extracted argumentation structures are then encoded as defeasible logic programs and are subject to dialectical analysis, to establish the validity of the main claim being communicated. We evaluate the accuracy of each step of this affect processing pipeline as well as overall performance.
The paper examines dependencies between the syntactic and prosodic
structure with particular attention to pausing and different levels
of prosodic boundary strength. The research is based on the prosodic data
markup for a spoken Russian text and the manual tagging of this text with
the relevant syntactic constituent boundaries. Two types of structures, the
finite clause and the asyndetic coordination, exhibit a strong positive correlation
with the appearance of a pause and the perceptual prosodic boundary.
We also demonstrate the presence of a substantial correlation between
the syntactic embedding depth and prosodic boundaries. The results of our
research show a significant connection between some of the initially proposed
syntactic factors and prosodic structure. We thus anticipate that
prosodic modules of TTS systems can benefit from taking certain syntactic
information into consideration.
I
The article intends to describe the formal variation of the connectors of the
Russian language on the basis of a cognitive-semantic approach. Every
discourse variant DV of a connector K, i.e. the specific form assumed
by K in a discourse section, is singled out, and registered in the supracorpora
database of connectors (SCDB), in which a system of intersecting
clusters has been developed that allows the same DV to be assigned
to different structural clusters during annotation. In the next phase,
on the basis of further semantic analysis, the DVs with a common element
are combined into a structural-semantic complex around a basic form: the
minimal linguistic unit that enables the speaker to express a certain logical-semantic
relation, and the listener to identify it. In conclusion, criteria
for describing the formal variation of the connectors are proposed, as well
as examples of the “profiles” of the basic forms. They reflect the potential
of linguistic means that the speaker has at his disposal to express one or another
logical-semantic relation or a combination of such relations.
The paper describes the Russian connective khotya (‘although’) from
a contrastive perspective. First, it focuses on the semantic description
of the connective and proposes to differentiate its four meanings, namely,
concessive propositional, concessive illocutionary, adversative propositional
and adversative illocutionary. The paper analyzes the functioning
of the connective khotya (prototypical marker of concessive relations) and
that of the connective no (‘but’, prototypical marker of adversative relations).
In so doing, it comes to the following conclusion: the adversative
meaning of khotya develops on the basis of its concessive meaning as the
connection between the situations presented in the textual fragments that
are linked by the connective becomes less logical. Conversely,
as the logical connection between situations becomes stronger, this gives
rise to a concessive interpretation in utterances with no. Further, the paper
takes a closer look at the French equivalents that khotya receives
in each of its four meanings. The concluding section attempts to define the
degree of language-specificity of khotya. To this end, several parameters are considered: (1) cases where the connective has a zero equivalent, (2)
cases of divergent translation (the connective is translated by a non-connective),
(3) number of translation patterns. To perform a contrastive analysis
and to collect statistical data, the supracorpora database of connectives
is used. The database is built upon the parallel Russian-French and French-Russian
subcorpora of the RNC.
The paper continues a series of research studies into the microsyntax
of Russian, conducted by the author over a considerable period of time.
Specifically, the focus is on the adverbial syntactic idiom to i delo ‘≈ every
now and then’, which seems very interesting and instructive as it combines
implicit semantic features and a unique set of syntactic facets that could
be revealed by both present-day and diachronic linguistic data. This syntactic
idiom is considered against the background of other microsyntactic
elements that happen to be its neighbors in the dictionary but feature a substantially
different set of linguistically relevant properties. It is shown how
phraseological units of such kind can be presented in the Microsyntactic
dictionary of Russian, under development by the author and his colleagues,
and in the corpus of texts annotated with microsyntactic phenomena.
This paper addresses the problem of readability assessment for Russian texts and investigates the impact of 24 lexical, syntactic and frequency features. The research was conducted on the Russian Readability Corpus, which contains two sub-corpora: two sets of grade 5–11 Social Studies textbooks for native speakers of Russian. The sub-corpora were collected for research purposes, annotated and marked as BOG and NIK. Ridge regression demonstrated a connection between readability and average sentence length, average number of coordinating chains, average number of sub-trees, and frequency and lexical features. The results of the study can potentially be applied in a wide variety of areas, primarily education, as well as webpage design and document management.
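To make the role of Ridge regression above more concrete, here is a minimal single-feature sketch using the closed-form solution with an unpenalized intercept. The study itself uses 24 features; the single feature (average sentence length), the toy numbers and the function name are hypothetical and serve only to show how the penalty shrinks the coefficient.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ slope * x + intercept with an L2 penalty alpha on the slope.

    Minimises sum((y - intercept - slope * x)^2) + alpha * slope^2.
    On centred data this gives slope = Sxy / (Sxx + alpha).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: average sentence length (words) vs. grade level.
sentence_lengths = [8.0, 10.0, 12.0, 14.0]
grade_levels = [5.0, 7.0, 9.0, 11.0]

unpenalised = fit_ridge_1d(sentence_lengths, grade_levels, alpha=0.0)
penalised = fit_ridge_1d(sentence_lengths, grade_levels, alpha=20.0)
```

With alpha set to zero the fit reduces to ordinary least squares; a larger alpha pulls the slope towards zero, which is what makes the method usable when many correlated features are included.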
K
This paper presents corpus-based research on quotation constructions in Russian Sign Language (RSL). Quotation constructions have been studied from different perspectives in different signed and spoken languages [Brendel, Meibauer, Steinbach 2011]; [Litvinenko et al. 2009]. Based on a corpus of spontaneous narratives recorded from RSL signers [Burkova 2015], we conducted a quantitative analysis of these constructions. We analyzed the constituents of a quotation construction, such as the indication of the source (author of the utterance), the introducing matrix predicate, and the quote. Our investigation of non-manual markers in the corpus revealed that non-manual marking of quotation is optional for RSL quotations. We distinguished direct and indirect quotations in our data based on the reference of indexical elements, the use of a subordinating conjunction, and the imperative mood. We found that in RSL non-manuals do not mark the direct/indirect type of quotation. Our data show that RSL signers tend to use direct quotation much more frequently than indirect quotation. In addition, we compared our findings with the data on quotation constructions in some other sign languages and with studies of quotation in natural discourse of spoken languages. This comparison showed that RSL quotations share core properties with quotations in spoken and signed languages [Litvinenko et al. 2009].
Although language production and comprehension are parts of one and the same linguistic capacity, they have been studied separately for a long time. A key issue in the present day research is how the two processes are related, and whether transitions from thought to language and vice versa are accomplished by a single or two separate systems. Important progress in this area has been achieved in the field of psycho- and neurolinguistics; a brief review is provided in Section 1. In this paper we explore the production—comprehension relationship on the basis of our multichannel resource “Russian Pear Chats and Stories”. In Section 2 we describe this resource, including the stimulus material, data collection setup, participants and corpus size, and technical aspects. Section 3 lays out two main theoretical notions: a model of face-to-face multichannel communication and a scheme of the production-comprehension interweaving in each interlocutor. In subsequent sections we discuss three case studies of production—comprehension relationships: relative contributions of kinetic channels to discourse understanding (Section 4), turn-taking and eye gaze (Section 5), and multichannel continuity (Section 6). The evidence of the multichannel corpus suggests a cognitive architecture that integrates language production and comprehension.
In the paper we discuss methods used to create CoSyCo, a corpus of syntactic co-occurrences, which provides information on syntactically related words in Russian. We describe a list of shallow parsing templates, which were used to collect data for CoSyCo. The paper includes an overview of the corpora collected for CoSyCo creation and an outline of how the noun ‘virus’ is used in its subcorpora as an example of the information which can be obtained from this online resource.
Word-vector representations have been extensively studied for resource-rich languages with large text datasets. However, only a few studies analyze semantic representations of low-resource languages, for which only a small corpus is available. In this study we introduce a methodology and compare techniques for learning semantic representations of low-resource languages. The proposed methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer and learning word-vector representations. In addition, we propose a simple evaluation scheme for word embeddings that can be easily adapted to any language. Using this methodology, we learn word-vector representations for the Buryat language. In order to promote further research, we make the source code and the resulting word embeddings publicly available.
Topic—focus articulation in Russian has been mainly studied against isolated
utterances. In a categorical sentence, this communicative opposition
is reflected in the linear-accentual structure [Paducheva 2015]. For a simple
declarative sentence, that would normally mean that the topic (theme)
comes first and has a rising phrasal accent, while the focus (rheme) completes
the utterance and is pronounced with a falling accent. At the same
time, these formal features do more than just differentiate between topics
and foci; they also mark the discourse-semantic category of phase [Kodzasov
2009]. In syntactically simple utterances, topics tend to correlate with
anticipated continuation, hence non-final phase; foci are usually phase-final.
As I intend to show in this paper, the non-final phase provides a variety
of contexts that challenge the topic—focus distinction. The study is based
on the “Stories about presents and skiing”—a collection of prosodically annotated
spoken narratives.
In Section 1, I concentrate on issues within a simple clause, where
non-final verbal elements often have a fuzzy communicative interpretation.
In Section 2, I analyze complex syntactic structures. The data show that
non-final clauses may demonstrate both thematic and rhematic properties
with regard to their intonation patterns, internal structure and discourse
function. Hence, one can claim that some non-final clauses are topics, while others are foci. However, a majority of non-final clauses in the analyzed
corpus may not be unambiguously attributed to either of these categories.
Section 3 provides a pilot study of complex intonation patterns. Only
phase distinction being considered, utterances with more than one accentual
phrase may follow either (i) the basic adaptation strategy (comprising
a non-final rising accent and a final falling accent), or, more often, (ii) a complicated
strategy: (a) multiple parallel adaptation, (b) consecutive adaptation,
or (c) parenthetical strategy.
Our project aims to design a syntactic parser that constructs a semantic representation in a frame format: a clause is represented as a table of valencies filled in with semantic markers. This representation is compared to a list of scripts, which is used to disambiguate and classify the semantic representation as well as to select an appropriate reaction for the companion robot F-2.
The paper discusses the most important results of the project “Hierarchy
of prosodic phrasing in spoken language: controlling factors and means
of realization”. The project was aimed at expanding the empirical base
of research on phrasal prosody, the inadequacy of which is noted in many
scientific areas: discourse theory, syntax, intonational phonology, general
phonetics, speech synthesis and recognition, etc. The introduction provides
a brief description of the study background and formulates the tasks that
had to be solved to reach the ultimate goal of the project, which was planned
for 3 years of implementation. The first section describes the characteristics
of the speech corpora created in the project for the construction of a complex
linguistic-prosodic database required for the study and modeling of prosodic
phrasing in Russian speech, which takes into account, as far as possible,
all controlling factors and means of realization. The second section is devoted
to the description of the structure and composition of the database of
discursive features of word breaks (BDF), obtained on the basis of annotated,
prosodically graduated and acoustically analyzed speech corpora. The format
and structure of the database as a computer resource are universal and
flexible, so that its feature set can be freely extended and the parametric
characteristics of the features specified in more detail. The third section
illustrates the application of the BDF to theoretical and statistical modelling
of inter-level correlations, in both directions, between syntax and linguistic
prosody and between linguistic prosody and the speech signal (acoustic speech).
The conclusion summarizes the results of the research and discusses some
promising directions for further studies on relevant topics.
The paper deals with metatext (parenthetical) constructions (MC) with mental
verbs (znat’ ‘know’, ponimat’ ‘understand’, verit’ ‘believe’ and the like)
in the 2nd person. The following questions are considered: whether there is a semantic
correlation between the proposition and the MC, and what illocutionary functions
the MC and the proposition have. It is shown that some MCs are used only in interrogative
sentences.
The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE’2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It consisted in representing the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, thus producing groups corresponding to the ambiguous word’s senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
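The averaging-and-clustering idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the two-dimensional embeddings are invented, and a small k-means routine is used here only as one example of a mainstream clustering technique, since the abstract does not say which algorithms were applied.

```python
def average_vector(tokens, embeddings):
    """Average the embeddings of the context tokens, skipping unknown words."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known tokens in context")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=10):
    """Very small k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-D embeddings and contexts of an ambiguous word ("bank").
toy_embeddings = {
    "river": [1.0, 0.0], "water": [0.9, 0.1], "shore": [0.8, 0.2],
    "money": [0.0, 1.0], "loan": [0.1, 0.9],
}
contexts = [
    ["river", "water"],
    ["money", "loan"],
    ["water", "shore"],
    ["loan", "money", "unknownword"],
]
context_vectors = [average_vector(c, toy_embeddings) for c in contexts]
sense_labels = kmeans(context_vectors, k=2)
```

Each resulting label groups the contexts that are taken to express the same induced sense; in a real setting the number of clusters and the choice of embedding model would have to be tuned per dataset.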
L
This paper presents an outline of the construction of a readability assessment system for Russian language learning. The system is designed to help educators easily obtain information about the difficulty level of reading materials. The estimation task is posed here as a regression problem on a data set of 600 texts and a range of lexico-semantic and morphological features. The choice of scale and issues of annotated text collection are also discussed. Finally, we present the results of an experiment with learners of Russian as a foreign language conducted to evaluate the quality of the predictive model.
Many words that according to the dictionaries have just one meaning are in fact understood in different ways by different speakers. In this article we deal with Russian nouns denoting everyday life objects which are subject to much variation by age, gender, and region and are poorly described by the existing dictionaries. We report the results of a multilevel survey, propose some possible metrics of word knowledge and show to what extent the words we studied are known among a certain population. We also claim that different speakers possess different sets of meanings for each word, propose ways to discover the distribution patterns for these sets and introduce the notion of disperse polysemy. We believe that our findings may be useful in lexicography (providing detailed information on current word usage in different social groups), lexical semantics (researching meaning shifts and patterns of its distribution among speakers), and language testing (more precise detection of the vocabulary sizes both in native speakers and in language learners).
The paper deals with the Russian interjections (oj, oh, aj, ogo, uh, etc.),
namely their non-canonical use in collocations with K-words (Wh-words),
mostly kak and kakoj. This type of use demonstrates a sort of syntactic recomposition
— collocations oj kak, oh kakoj, etc. function as lexical units
with the meaning of high degree, high quality or big quantity, although with
very specific semantic shades. The paper makes use of the corpus data (the
Russian National Corpus as well as the Internet data) to discover individual
properties of interjections and their historical changes. Primary interjections
are described against the background of interjections derived from
the words of different part of speech. It turns out that in non-canonical use
of primary interjections the K-word can hardly be omitted, whereas derived
interjections can also function the same way even without a K-word. Non-canonical
use of derived interjections, with and without K-words, is very
popular in contemporary Russian, especially in slang.
The paper deals with the Russian aby as a marker of “free choice” (or, rather,
not specified choice criteria) within indefinite pronouns against the background
of other markers of “free choice” such as ugodno, popalo, pridetsia.
It pays attention not only to the synchronic semantics of aby, but also to its
history and claims that the modern meaning of aby is related to its usage
as a conjunction. The paper makes use of the corpus data (the Russian National
Corpus as well as the Internet data) to follow the changes in the use
of the particle in question over the last two hundred years. It investigates
the range of K-words that can collocate with aby: the most typical are
collocations with kto, chto, kak and kakoi; however, collocations with other
K-words are also present in the corpora. In addition, it discusses the question
of negative polarity of aby and the increasing degree of its polarization.
The paper describes an experiment on an instrumental evaluation of the intonation
quality of synthesized Russian speech using the “Inton@Trainer”
computer system. The system was originally designed to train learners
in producing the basic intonation patterns of Russian speech. It is based
on comparing the melodic portraits of a reference sentence and a sentence
pronounced by the learner. Our approach to assessing the intonational
quality of speech allows synthesized speech to be held to the same
strict requirements as are applied to students studying Russian as a second
language. We describe the technology used for the instrumental evaluation
of the intonation quality of synthesized speech and the acoustic database
of reference phrases used to assess the intonation quality of synthesized
speech. The paper presents the results of testing the intonation quality
of two Russian synthetic voices. We discuss the results of the experiment
and outline the ways for improving the methods for objective evaluation
of synthesized speech prosodic quality, as well as the possibility of applying
the developed system in other linguistic tasks.
In this paper we present the RuSentRel corpus including analytical texts in the sphere of international relations. For each document we annotated sentiments from the author to mentioned named entities, and sentiments of relations between mentioned entities. In the current experiments, we considered the problem of extracting sentiment relations between entities for the whole documents as a three-class machine learning task. We experimented with conventional machine-learning methods (Naive Bayes, SVM, Random Forest).
The paper explores the distribution and interpretation of the discourse
marker po(-)xodu (PX) and addresses a possible path of its diachronic
development. We argue that the range of uses of PX attested in the corpora
supports an analysis that identifies three meanings / functions of this
item labeled eventive PX, epistemic PX and discourse-level PX throughout
this paper. We propose that the latter two are the products of re-interpretation
of the former. We argue for a presuppositional analysis of the eventive
PX whereby it requires there be a set of background events that show
a temporal overlap with the asserted event and add up to the integral whole.
We analyze the epistemic PX as resulting from inferential reinterpretation
of the relationship between background and asserted events, with the abductive
reasoning being the key ingredient of this reinterpretation. Finally,
we treat the discourse-level PX as a counterpart of the eventive PX in the domain
of speech acts. We speculate that Krifka’s (2014) recent view of speech
acts as index changers opens a way of accounting for this parallelism
in a principled way. On the diachronic side, we identify PX as the product
of diachronic development of the construction in which the argument of the
noun xod ‘move’ is expressed by an overt DP. In the course of development, this DP was first replaced by pro, which gave rise to the eventive PX, and
later on developed epistemic and discourse-level meanings / functions.
M
A new and powerful tool for drug repurposing and hypothesis
generation has recently emerged. Text mining of different domains like scientific libraries
or social media has proven to be reliable in that application. One particular
task in that area is medical concept normalization, i.e. mapping a disease
mention to a concept in a controlled vocabulary, like Unified Medical Language
System (UMLS). This task is challenging due to the differences in language
of health care professionals and social media users. To bridge this
gap, we developed end-to-end architectures based on bidirectional Long
Short-Term Memory and Gated Recurrent Units. In addition, we combined
an attention mechanism with our model. We have done an exploratory study
on hyperparameters of proposed architectures and compared them with the
effective baseline for classification based on convolutional neural networks.
A qualitative examination of the mentions in a dataset of user reviews collected
from popular online health information platforms, as well as a quantitative one,
both show improvements in the semantic representation of health-related
expressions in user reviews about drugs.
Being a matter of cognition, user interests should be apt to classification
independent of the language of users, social network and the essence of interest
itself. To prove it, we built a collection of English and Russian Twitter
and Vkontakte community pages manually classified according to the
interests of their followers. First, we created a model of Major Interests
(MaIs) with the help of expert analysis and then classified the mentioned set
of pages using machine learning algorithms (SVM, Neural Network, Naive
Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors) trying
different optimization techniques. We take three interest domains that are
typical of both English and Russian-speaking communities: football, rock
music, vegetarianism. The results of classification show a greater correlation
between Russian-Twitter and English-Twitter pages. The Logistic Regression
with Bernoulli bag-of-words model proves to be the most effective
classification algorithm.
N
In this paper, we describe the coreference annotation on a multi-lingual parallel
treebank (PAWS), a portion of Wall Street Journal translated into Czech,
Russian and Polish which continues the tradition of multilingual treebanks
with coreference annotation. The paper focuses on language-specific differences.
We analyse syntactic structures concerning anaphoric relations
in the languages under analysis, such as personal and impersonal constructions
in polypredicative constructions and pro-drop qualities.
The paper presents a contrastive analysis of pronominal adverbs in German
(dabei, darauf, damit etc.) and their equivalents in English, Czech and Russian.
The analysis is based on an empirical study of parallel news texts. Our
main focus is to show the interplay between cohesive devices expressed
through German pronominal adverbs in text and explore their equivalents
in English, Czech and Russian. As the dataset at hand contains translations,
we also focus on the influence of the translation factor in parallel texts.
P
The paper addresses the notion of “snyataya utverditel’nost’” (suspended
assertion). The author argues that the term “suspended assertion”, introduced
by U. Weinreich in 1963, covers the same range of phenomena as the
term nonveridicality (its suggested Russian equivalent is neveridicativnost’),
which has become widespread due to the works by F. Zwarts, A. Giannakidou
and many others. It is demonstrated that the notion of suspended assertion
can be applied to interpret a number of facts of the Russian language, such as nibud’-pronouns, pronouns of negative polarity, the disappearance
of a semantic argument of verbs with the direct (non-parametrical) diathesis,
the mirror symmetry of past and future, the negation with an extended
scope, nibud’-pronouns in the scope of negation, the interchangeability
of eshche ‘yet’ and uzhe ‘already’. It’s the author’s conviction that the notion
of suspended assertion will be applicable in many other contexts.
The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and virtually free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given the word “bank” and a set of contexts for this word, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster such contexts into a number of clusters not known in advance, corresponding in this case to the “company” and the “area” senses of the word “bank”. For the purpose of this evaluation campaign, we developed three new evaluation datasets based on sense inventories that have different sense granularity. The contexts in these datasets were sampled from texts of Wikipedia, the academic corpus of Russian, and an explanatory dictionary of Russian. Overall, 18 teams participated in the competition, submitting 383 models. Multiple teams managed to substantially outperform competitive state-of-the-art baselines from the previous years based on sense embeddings.
This paper deals with the phenomenon of speech act conjunction in which
the relation expressed by the conjunction holds on the level of speech act
performance rather than on the level of states of affairs. It is argued that
besides clearly speech act and clearly non-speech act uses, there is a class
of constructions of an intermediate nature. The criteria are proposed that serve to distinguish between these three types of use. In particular,
it is demonstrated that imperative sentences can only be of the “intermediate”
type, while interrogative sentences can represent the clearly speech
act use. The proposed distinction manifests itself in grammar. Namely, different
conjunctions are compatible with different types of speech act use;
the correlative item togda (‘then’) cannot be used within a clearly speech
act construction.
The current paper deals with the integration of the Japanese language in a multilingual NLP model, namely, the Compreno model. The formalism includes morphological, syntactic and semantic patterns, covering all possible semantic and syntactic dependencies a word can attach. The architecture of the model allows us to acquire nearly all semantic links of a word through its proper positioning in a thesaurus-like semantic hierarchy, where words are linked through semantic dependencies. The inheritance principle of the hierarchy simplifies the syntactic description of a newly added language as well. Unlike the traditional approach to Japanese parsing based on chunks, or bunsetsus, we suggest a Japanese parser based on constituents. Special attention is given to the tools that allow us to automatize language description process and significantly speed up the description. The work on the Japanese model is still in progress, therefore, we show the current results we have achieved, and point out problems that remain to be solved.
This paper studies the impact corpus size has on the robustness of various frequency-based measures of corpus distance (or similarity, respectively), such as Euclidean distance, Manhattan distance, Cosine distance, χ², Spearman’s ρ, and Simple-Maths Keyword distance. An experiment performed using the British National Corpus shows that Euclidean distance is least influenced by corpus size and thus is best suited for the purpose of comparing corpora.
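As a rough illustration of three of the measures named above, the following sketch computes Euclidean, Manhattan and cosine distance between two toy corpora. It assumes that each corpus is represented by relative word frequencies over a shared vocabulary; the function names, the toy corpora and this representation are illustrative assumptions, and the exact definitions used in the paper may differ.

```python
import math
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


# Toy corpora (hypothetical); real experiments would use large samples.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log the end".split()
vocab = sorted(set(corpus_a) | set(corpus_b))

p = relative_freqs(corpus_a, vocab)
q = relative_freqs(corpus_b, vocab)

for name, fn in [("euclidean", euclidean),
                 ("manhattan", manhattan),
                 ("cosine", cosine_distance)]:
    print(name, round(fn(p, q), 4))
```

Working with relative rather than absolute frequencies is one simple way to compare corpora of different sizes; whether a given measure stays stable as the samples grow is the question the experiment addresses.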
The paper focuses on Russian constructions with clauses (or VPs) combined
by means of the discourse marker A, that behaves as a conjunction
or as a particle in different contexts. Prosodically, the construction may
come up in two forms: (a) as a single illocution with the first clause pronounced
with a rising pitch that projects discourse continuation, and (b)
as two separate illocutions with the first clause pronounced with a falling
pitch that projects no continuation. Using data from the Prosodically
Annotated Corpus of Spoken Russian, prosody and grammar of (a)
and (b) were analyzed qualitatively and quantitatively. Type (b) appeared
to be as frequent as type (a) and systematically favored in pragmatically
marked contexts.
R
This paper describes a practical solution for the task of referring expressions
generation (REG) in the context of a question-answering system.
When an answer to a question is found in the knowledge base the system has
to decide how to present the answer to the user, which properties uniquely
distinguish the object found from other objects in the knowledge base.
Another task where referring expressions would be useful is the semantic
graph visualization task. Building on top of the graph-based approach
presented by Krahmer et al. (2003), this paper provides some practical improvements
to the algorithm, namely: 1) instead of depth-first graph search
we use breadth-first search, which is dramatically faster when the scene
graph is big but the description graph to be found is small; 2) we limit the
size (the number of edges) of the resulting description graph to increase
performance and avoid uselessly long descriptions. A sketch of the linguistic
realization of the referring expressions is also outlined.
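The two improvements can be illustrated with a deliberately simplified sketch. It assumes that every object in the knowledge base is described only by a set of unary properties (no relations between objects), so a "description" is just a subset of the target's properties. The function name `find_description`, the `max_size` parameter and the toy scene are hypothetical and are not taken from the paper or from the original graph-based algorithm.

```python
from collections import deque


def find_description(scene, target, max_size=3):
    """Breadth-first search for a smallest distinguishing property set.

    scene: dict mapping object id -> set of properties (simplifying assumption).
    Returns a set of properties true of `target` and of no other object,
    or None if no description with at most `max_size` properties exists.
    """
    target_props = sorted(scene[target])
    queue = deque([()])  # start from the empty description
    while queue:
        desc = queue.popleft()
        # Objects other than the target that the description still fits.
        distractors = [obj for obj, props in scene.items()
                       if obj != target and set(desc) <= props]
        if desc and not distractors:
            return set(desc)
        if len(desc) < max_size:  # size limit on the description
            start = target_props.index(desc[-1]) + 1 if desc else 0
            for prop in target_props[start:]:
                queue.append(desc + (prop,))
    return None


# Hypothetical toy scene.
scene = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "c1": {"cat", "small", "brown"},
}
print(find_description(scene, "d1"))
print(find_description(scene, "d2"))
```

Because candidates are explored in order of size, the first distinguishing description found is also a shortest one, and the size limit stops the search before it enumerates long descriptions that would be of little use to the reader.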
S
The structure of Russian everyday dialogue was studied on the basis
of 73 microdialogues of everyday speech communication from the ʽOne Day
of Speechʼ corpus (the ORD Corpus). The aim of the research was to find
out what types of speech acts commonly initiate and complete everyday
dialogues, as well as to reveal the most typical sequences of speech acts
in these dialogues. Altogether, 2230 speech acts of 30 people referring
to both professional, and household conversations have been analysed.
N-gram analysis has been used to calculate the most frequent sequences
of speech acts. The obtained results showed that dialogues are usually
started by representatives, i. e. speech acts related to the exchange of information
(38% of all cases), etiquette beginnings (greetings, vocatives) take
place in 23% of the dialogues, and in 19% of cases the conversation begins
with a regulative form. Speech acts ending dialogues show a greater variety:
representatives contribute 2% of all dialogue ends, valuative judgments
and regulatory forms cover 14% each, followed by directives (8%), commissions
(8%), etiquette forms (8%) and emotional and expressive forms (7%).
As for the most typical bigrams of speech acts, they are the following: two
consecutive representatives (22.35%), a regulatory form followed by a representative
(6.93%), a representative and a regulatory form (6%), a valuative
with a following representative (5.21%), a representative and a valuative
judgment (4.77%), as well as two combinations of a directive with a representative
(2.77% each). Besides, the article presents data on the occurrence
of the most frequent pairs of speech acts at the subtype level. Here, the most
frequent one is the sequence ʽquestionʼ+ʽanswerʼ, which covers 2.45%.
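The bigram statistics above can be reproduced in outline as follows. This is a minimal sketch, assuming that each dialogue is available as an ordered list of speech act labels and that bigrams are counted only within a dialogue; the labels and the two toy dialogues are invented for illustration and are not taken from the ORD Corpus.

```python
from collections import Counter


def bigram_shares(dialogues):
    """Share of each speech act bigram among all bigrams in the data.

    dialogues: list of dialogues, each a list of speech act labels in order.
    Bigrams never span a dialogue boundary.
    """
    counts = Counter()
    for acts in dialogues:
        counts.update(zip(acts, acts[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.most_common()}


# Hypothetical annotated dialogues.
dialogues = [
    ["greeting", "representative", "representative", "valuative"],
    ["representative", "representative", "regulative"],
]

shares = bigram_shares(dialogues)
for pair, share in shares.items():
    print(pair, f"{share:.2%}")
```

The same counting over the first and last label of each dialogue would give the distributions of dialogue openings and endings.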
Probabilistic topic modeling is a powerful tool for text analysis that reveals topics as distributions over words and then softly assigns documents to the topics. Even though the aggregated distributions can be good with basic models, a sequential topic representation of each document is often unsatisfactory. This work introduces a method that increases the quality of the topical representation of each single text using its segmental structure. Our approach is based on Additive Regularization of Topic Models (ARTM), which is a technique for imposing additional criteria on the model. The proposed method efficiently avoids the bag-of-words assumption by considering the topical connections of words that co-occur in a local segment. We assume that consecutive sentences are topically and semantically coherent, while the number of topics in each particular text fragment is low. We apply our model to the topic segmentation task and achieve better quality than the current state-of-the-art TopicTiling algorithm. In further experiments we demonstrate that the proposed technique reveals an interpretable sequential structure of documents while keeping the number of topics low, i.e. the sparsity of the model increases. Apart from topic segmentation, the constructed topical text embeddings can be used in any other application where the analysis of the document structure is desirable.
In this paper we introduce RusDraCor, an open corpus of Russian drama for digital literary and linguistic research. The corpus (rus.dracor.org) contains plays from the mid-18th century to the first third of the 20th century, provided with structural (plus some semantic) markup and metadata. Texts are encoded in the XML-based standard TEI, widely used in building corpora for the humanities. We describe the contents and annotation layers of our corpus, provide some details on its development and enrichment, and finally describe three research cases. Each case demonstrates the use of RusDraCor to answer specific questions about composition, structural features and historical evolution of Russian drama.
The paper reviews the problem of speech disfluency which over the years has
become traditional for the “Dialogue” conference (see Podlesskaya, Komarova
2010; Laurinavichyute, Fedorova 2010; Fedorova 2010; Podlesskaya 2013;
Bogdanova-Beglarian 2013; Podlesskaya 2014; Potanina et al. 2016).
In this paper, we compared speech disfluencies in two corpora
of dialogues between children of 10–12 years old (section 1) and adults
(section 2). Both corpora were collected using the referential communication
task “Tangrams” (to perform the task, participants had to agree on the
nomination of some abstract figures).
In the third section of the text, the authors provide the classifications
of speech disfluencies present in the dialogues with examples. The results
of the comparison and the methods of analysis are given in the fourth
paragraph. Finally, the last section contains the discussion of the results
and perspectives of the further work. The paper shows that speech of children
of the given age group differs from adults’ speech in terms of disfluencies
at the discourse level.
Every adult native speaker of Russian knows that kon’ is masculine and lan’ is feminine, although 3rd declension nouns present some difficulties in the first and second language acquisition. However, will the fact that these nouns are less frequent than masculine nouns ending in a consonant or feminine nouns ending in -a/ja play a role for online subject-predicate agreement processing? Or will subject-predicate agreement processing be more problematic with subjects of a certain gender? Finally, some final consonants are more characteristic for feminine gender, while the others for masculine gender. Are speakers sensitive to this? We present two experiments addressing these questions. We found that all three factors play a role, but for different tasks (online agreement processing or determining the gender of a novel word) and at different processing stages.
We offer a new neural architecture for character-level morphological tagging, combining character-level networks with the output of a neural language model on morphological tags. Our proposal reduces tagging error by up to 10% in comparison with the baseline model and achieves state-of-the-art performance on both the ru_syntagrus and MorphoRuEval datasets.
The paper deals with differential object marking in the Russian speech of Nanai-Russian bilingual speakers, namely variation such as принес рыбу ~ принес рыба (‘{he} brought fish-acc ~ fish-nom’). The puzzle is that this peculiarity can result from a number of different processes: morphosyntactic borrowing from Nanai, penetration of dialectal features into the speech of bilinguals, under-acquisition or reinterpretation of the Standard Russian system. The data of a small corpus of contact-influenced Russian speech are used to test all these hypotheses. The results are as follows. Nominative forms are used in DO-position in quite a systematic way and such uses cannot be regarded as occasional “errors”. The main factors that influence the NOM~ACC distribution are a) information structure and b) the accentual type of the noun stem. The latter fact supports the hypothesis of a systematic reinterpretation of the Standard Russian system in the situation of incomplete acquisition. No significant correlations with animacy, definiteness, verb form and word order were attested. The DOM pattern of Nanai Russian differs from those of Russian dialects and reveals some similarity to that of Nanai. However, it cannot be considered a full morphosyntactic calque.
T
This paper is a first step towards a corpus-based description of the semantics
of Russian pronouns in intensional contexts. Having justified the use
of corpus in (formal) semantic research, I delineate a particular issue within
the topic: whether a given pronoun is interpreted de se or de re in counteridentity
contexts.
A counteridentity context is a clause within the scope of a counterfactual
(clause or adverbial) that affects the identity of a real individual, e.g.
if I were you, were I you, etc. If a pronoun such as I, my or the Russian reflexive
possessive svoj is used in such a context, two options are theoretically
possible: either it picks out the speaker’s real self (de re), or it refers to the
identity assumed by the speaker in the contrary-to-fact situations introduced
by the counterfactual (de se).
Using data from the GICR corpus (approx. 20 billion tokens), I show that
for the Russian first-person singular pronoun ja and its corresponding possessive
moj, de se reference is possible but de re interpretation is more frequent.
The opposite holds for the reflexive sebja, whereas svoj is interpreted
de se with no exception. Special attention is paid to situations where more
than one referential strategy is possible. The paper concludes with a couple
of observations relevant for the future formal accounts of de se reference.
The purpose of the paper is to investigate cues signalling the relations between discourse units in Russian. Building a lexicon of discourse connectives is an indispensable subtask in many discourse parsing applications as well as an essential issue in theoretical research on text coherence. In order to develop such a resource for Russian, we have conducted a corpus-based study of discourse connectives that were manually extracted from the Russian Rhetorical Structure Treebank (Ru-RSTreebank). The Treebank includes 79 texts annotated within the RST framework [Mann, Thompson 1988]. In order to provide a deeper analysis of connectives in Russian, we focus on causal relations only, namely, the ‘Cause-Effect’ relation. Some of the connectives (primary connectives) are enumerated in grammars and dictionaries. They primarily mark intra-sentential relations. However, there is an expansive class of less grammaticalized items (secondary connectives) that have received less attention so far. Some of them are based on content words (e.g. по причине ‘for the cause’). Secondary connectives often serve as linking devices for inter-sentential relations. We suggest a scheme for annotating connectives in Russian. We specify the basic patterns that can be used for mining less grammaticalized connectives in an unannotated corpus. In addition, we compare the two classes of connectives (primary vs. secondary ones). Our research has shown that these two classes differ in their properties. There is a statistically significant difference between them with respect to the nucleus/satellite position, intra- vs. inter-sentential relations and some other properties.
U
The subject of this paper is the so-called adverbial prepositions of Russian; cf. vokrug (kostra) ‘around smth.’, daleko ot (doma) ‘far from smth.’, etc. By definition, an adverbial preposition either coincides with an adverb (cf. vokrug) or contains an adverb and a preposition (cf. daleko ot). As I have demonstrated in my previous works, an adverbial preposition and the underlying adverb have the same meaning, the only difference between them being in the mode of expression of the main semantic actant; cf. Gorel koster, vokrug (preposition) kostra stojali liudi ‘A fire was burning, people were standing around it’ vs. Gorel koster, vokrug (adverb) stojali liudi ‘A fire was burning, people were standing around’. From the modern point of view, a syntactic distinction is insufficient for interpreting such cases as different words (or different meanings of a word). So, an adverbial preposition and the underlying adverb should be interpreted as the same meaning of a given word. I argue that this word is an adverb (or a prepositional adverb). This paper deals with the syntax of these adverbs. Such adverbs have one or more semantic actants, at least one of them being expressed by a noun or a prepositional group. The problem is that in some cases it is not clear whether the prepositional group is governed by the adverb or by the verb governing this adverb (in which case the adverb and the prepositional group are co-governed by the verb). A criterion for distinguishing adverb vs. verb government of such groups is discussed. Two Russian adverbs, zadolgo ‘for a long time before smth.’ and nezadolgo ‘shortly before smth.’, are described from this point of view.
V
This paper contributes to the debate on the analysis of linguistic tautologies—structures
that state an unquestionable truth by virtue of their logical
form and therefore require a reinterpretation to be informative. While
there is a great number of studies of nominal tautologies of the form ‘X is X’,
clausal tautologies, i.e. conditionals ‘if P, P’, disjunctives ‘either P or not P’,
free relatives ‘P, what P’, etc., are given less attention. This paper investigates
one of such patterns, namely, correlative tautologies, where the subordinate
clause precedes the main clause, that could be exemplified by the
expression chto budet to (i) budet lit. ‘what will be that (EMPH) will be’. The
data taken from the Russian National Corpus and Internet as well as dictionary
definitions show that tautologies of this kind exhibit various peculiar
properties. First, some correlative tautologies can receive opposite interpretations
in different contexts, e.g. chto bylo, to bylo lit. ‘what has been that
has been’ can mean either ‘this fact cannot be denied’ [Bulygina, Shmelev
1997] or ‘the past should be forgotten for the sake of the future’ [Active Dictionary
of Russian]. Next, the particle i, which is commonly used in Russian
correlatives, cf. [Mitrenina 2010], is acceptable for some tautologies but not
licensed in others. I argue that for correlative tautologies the crucial ingredient
is salience of the situation in question as presented by the speaker that,
along with specific vs. generic readings available, results in four possible
strategies of their interpretation.
Y
One of the means of designating the coherence in the spoken discourse
is demonstrating that the current utterance of the discourse is not terminal.
Every step of narrative consisting of the chain of statements can be marked
as non-final. The prosodic cues for incompleteness applied to the speech
act of a statement have been studied in detail in linguistic literature. In this
paper, the discourse incompleteness is analyzed as composed not only
with statements but with questions, imperatives, and vocatives as well. The
results of the investigation are as follows. The wh-questions, imperatives,
and vocatives can be freely composed with the meaning of discourse continuity,
and they have specific prosodic cues for marking this combination
of meanings. Yes-no questions, by contrast, do not accept prosodic incompleteness
marking. The prosodic patterns of incompleteness and the
accent placement in questions, vocatives, and imperatives are exemplified
here by the dialogues taken from the Multimodal corpus of the Russian National
corpus, the Prosodically Annotated Corpus of Spoken Russian (spokencorpora.ru),
and the minor working collection of the Russian speech
recordings specifically set up for this investigation. The software program
Praat was used in the process of analyzing the sounding data.
Z
The paper proposes a semantic analysis of the Russian indefinite adverb
kak-nibud’ based on the data collected from the French-Russian, Italian-Russian,
and English-Russian parallel subcorpora of the Russian National
Corpus, as well as from the Data Base of the Russian Discourse Markers
and their French equivalents. The study applies the “unidirectional method”
of contrastive analysis within which the translation by a professional translator
is viewed as a quasi-lexicographic explication of a given unit revealing
implicit components of its semantics. Our analysis demonstrates that
kak-nibud’ is a highly language-specific Russian word. This is reflected in a high
percentage of null equivalents of this unit in the three languages under investigation,
whether Russian is taken as the source or the target language. The
study has also allowed us to show that the analyzed adverb can function
as a marker of non-controllability of a hypothetic event similar to the function
of the subjunctive mood in Romance languages. On the other hand, the
use of kak-nibud’ (‘anyhow’, ‘poorly’) in a purely evaluative meaning cited
by monolingual and bilingual dictionaries has shrunk in contemporary Russian
compared to the Russian of the 19th century.
This paper addresses the problem of parametric variation in Russian
grammar, with a focus on copular constructions with agreeing and non-agreeing
adjectival predicates. Based on the Russian National Corpus, I reconstruct
two dialects of Russian morphosyntax. They differ regarding the
assignment of the predicative instrumental case, raising conditions and
the distribution of agreeing vs non-agreeing predicates after быть 'be',
стать 'become' and казаться 'seem'. Russian-A only licenses predicative
instrumental on adjectives after SEEM (казалось странным, что P)
and non-agreeing predicatives after non-zero forms of BE or BECOME
(было странно, что P). Russian-B allows non-agreeing forms after SEEM
(казалось странно, что P) and forms of the predicative instrumental case
after non-zero forms of BE and BECOME (было странным, что P). I argue
that the differences between Russian-A and Russian-B must be explained
in terms of parametric settings and claim that Russian predicatives lack
forms of the predicative instrumental. The assignment of the predicative
instrumental to adjectival heads can be explained as subject control in all
dialects, but only Russian-B allows raising of sentential arguments to the
position of the matrix subject.
The article describes the developed architecture for modeling natural communicative
behavior on the F-2 robot. The important part of our work is the
study of human communicative behavior and the transfer of this behavior
to the robot. For this purpose we are developing the Russian Emotional Corpus
(REC) where video recordings of natural emotional dialogues are collected.
We explore the features of natural communication, and also develop
an architecture that takes into account these features. For example, using
the architecture presented in the article a robot can express any communicative
function, using one or more executive organs: for example, to express
an appeal with facial expressions, head movements or gestures. The
developed architecture also allows us to flexibly combine gestures with different
communicative functions. The architecture allows us to use “split”,
“join” and “single” modes to combine tags from different BML-packages,
and also to synchronize tags in a single BML-package. These features are
important for modeling of human-like behavior for the robot F-2, and are
necessary to improve the communication between a robot and a user.